Meeting the Article 72 PMM obligation requires engineering infrastructure that operates continuously across five monitoring dimensions: performance, fairness, data drift, operational health, and human oversight. This section covers the technical implementation, infrastructure architecture, and specific requirements for generative AI and composite systems.
Meeting the PMM obligation requires engineering infrastructure that operates continuously, not a manual reporting process performed periodically. The Technical SME computes the system's core performance metrics continuously on production data, comparing against thresholds established during development and documented in the AISDP. The key metrics include AUC-ROC, F1 score, precision, recall, Brier score, and calibration error, computed on production data with ground truth labels where available. Where ground truth is delayed, as in many real-world applications where outcomes may not be known for weeks or months, the Technical SME defines proxy metrics and leading indicators.
Aggregate performance metrics can mask subgroup-specific degradation. The Technical SME computes all performance metrics across protected characteristic subgroups, where data is available and lawful to process under Article 10(5). A system whose aggregate accuracy remains stable but whose accuracy for a specific subgroup has degraded is experiencing a compliance-relevant change. Temporal stability tracking identifies slow, consistent decline that does not breach the threshold on any single measurement but represents significant cumulative degradation over months.
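Subgroup disaggregation of this kind can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the use of plain accuracy and the five-point degradation gap are assumptions for the example.

```python
# Sketch: disaggregating accuracy by protected subgroup to surface
# degradation that the aggregate metric masks. Plain accuracy and the
# 0.05 degradation gap are illustrative assumptions.
from collections import defaultdict

def subgroup_accuracy(y_true, y_pred, groups):
    """Accuracy per subgroup alongside the aggregate figure."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        totals[g] += 1
        hits[g] += int(t == p)
    per_group = {g: hits[g] / totals[g] for g in totals}
    aggregate = sum(hits.values()) / sum(totals.values())
    return aggregate, per_group

def flag_masked_degradation(aggregate, per_group, max_gap=0.05):
    """Subgroups whose accuracy trails the aggregate by more than max_gap."""
    return [g for g, acc in per_group.items() if aggregate - acc > max_gap]
```

The same disaggregation applied to a time series of measurements supports the temporal stability tracking described above: a subgroup can trail the aggregate by a growing margin long before any single threshold is breached.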
Fairness monitoring computes selection rate ratios, equalised odds, predictive parity, and calibration within groups on production data at defined intervals, weekly or monthly depending on the system's volume and risk profile. Fairness metrics are computed for single protected characteristics and, where cell sizes are sufficient, for intersectional subgroups. In deployment contexts where demographic data is unavailable, compensating strategies include proxy-based estimation with documented methodology and accuracy bounds, periodic deployer surveys or sampling studies, external benchmark comparison against known population distributions, and structured feedback analysis examining complaint and appeal patterns for demographic signals.
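The simplest of the metrics above, the selection rate ratio, can be computed directly from production decisions. The sketch below assumes binary decisions; the 0.8 alert floor in the usage note mirrors the four-fifths convention and is an illustrative choice, not a regulatory threshold.

```python
# Sketch: selection rate ratio (disparate impact ratio) on production
# decisions, disaggregated by group. Binary 0/1 decisions assumed.
def selection_rates(decisions, groups):
    """Positive-decision rate per group."""
    rates = {}
    for g in set(groups):
        member = [d for d, gg in zip(decisions, groups) if gg == g]
        rates[g] = sum(member) / len(member)
    return rates

def selection_rate_ratio(rates):
    """Lowest group rate divided by highest; 1.0 means parity."""
    return min(rates.values()) / max(rates.values())
```

A PMM plan might, for example, alert when the ratio falls below a declared floor such as 0.8, with intersectional subgroups handled by passing composite group labels where cell sizes permit.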
A key difference between development and production monitoring is that production data typically lacks ground truth labels. During development, predictions are evaluated against known labels. In production, the true outcome may not be available for weeks or months: a credit decision's true outcome is not known until the borrower repays or defaults, potentially years later. NannyML addresses this gap through performance estimation without ground truth, using the Confidence-Based Performance Estimation method to estimate accuracy from the confidence score distribution without requiring labels. Estimated performance is monitored continuously, with alerts when the estimate drops below the declared AISDP threshold.
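The core idea behind confidence-based estimation can be illustrated without the library: for a well-calibrated binary classifier thresholded at 0.5, the expected accuracy of its predictions is the mean of max(p, 1 - p) over the confidence scores. This is a simplified sketch of the principle NannyML's CBPE builds on, not its implementation; the 0.5 threshold is an assumption.

```python
# Sketch of the idea behind confidence-based performance estimation:
# with calibrated probabilities, the expected accuracy of predictions
# thresholded at 0.5 can be estimated label-free as mean(max(p, 1-p)).
# Simplified illustration, not NannyML's actual implementation.
import numpy as np

def estimated_accuracy(proba):
    """Expected accuracy of 0.5-thresholded predictions, without labels."""
    proba = np.asarray(proba, dtype=float)
    return float(np.maximum(proba, 1.0 - proba).mean())

def breaches_threshold(proba, declared_threshold):
    """True when the estimate drops below the AISDP-declared floor."""
    return estimated_accuracy(proba) < declared_threshold
```

The estimate is only as good as the model's calibration, which is why calibration error appears among the core metrics computed once delayed ground truth does arrive.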
Drift detection provides a complementary signal. Feature drift indicates the model is receiving data it was not designed for. Prediction drift indicates the model's behaviour is shifting. Evidently AI computes both types of drift using PSI, Jensen-Shannon divergence, and the KS test, generating reports that can be automated on a daily or weekly cadence. The drift metrics should have defined thresholds: PSI below 0.1 is stable, between 0.1 and 0.2 warrants investigation, and above 0.2 requires immediate attention. Fairlearn's MetricFrame enables periodic fairness recomputation on production predictions enriched with protected characteristic data, with results compared against the declared thresholds and any breach triggering the alert and escalation process.
Data drift monitoring compares the distribution of incoming data against the training data distribution using statistical measures. Input drift uses Population Stability Index, Kolmogorov-Smirnov test statistics, Jensen-Shannon divergence, or Wasserstein distance. Each input feature is monitored individually, and the Technical SME computes composite drift scores to capture overall distributional shift.
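A minimal PSI computation, using the 0.1 and 0.2 action bands stated earlier, might look like the following. The quantile binning, bin count, and epsilon smoothing are illustrative implementation choices.

```python
# Sketch: Population Stability Index between the training baseline and
# production values for one feature, mapped onto the PMM action bands.
# Bin count and epsilon smoothing are illustrative assumptions.
import numpy as np

def psi(baseline, production, bins=10, eps=1e-6):
    """PSI over quantile bins derived from the baseline distribution."""
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # capture out-of-range values
    b_frac = np.histogram(baseline, edges)[0] / len(baseline) + eps
    p_frac = np.histogram(production, edges)[0] / len(production) + eps
    return float(np.sum((p_frac - b_frac) * np.log(p_frac / b_frac)))

def psi_band(value):
    """Map a PSI value onto the action bands declared in the PMM plan."""
    if value < 0.1:
        return "stable"
    return "investigate" if value <= 0.2 else "immediate attention"
```

Running `psi` per feature and summing or averaging the results yields the kind of composite drift score the Technical SME uses to capture overall distributional shift.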
Concept drift, where the underlying relationship between inputs and outputs changes, is often more consequential than input drift because it means the model's learned patterns no longer reflect reality. Where ground truth labels become available even with delay, the Technical SME monitors the relationship between input features and outcomes for changes. Feature drift monitors individual feature distributions for shifts that may not be captured by aggregate drift measures. A single feature shifting significantly while others remain stable can cause localised performance degradation that aggregate metrics miss.
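A per-feature drift scan of the kind described above can be sketched with the two-sample KS test; the significance cutoff and the dictionary layout are illustrative assumptions.

```python
# Sketch: per-feature drift scan using the two-sample KS test, flagging
# individual features whose shift a composite score could dilute.
# The significance cutoff alpha is an illustrative assumption.
from scipy.stats import ks_2samp

def feature_drift_scan(train, prod, alpha=0.01):
    """Return {feature: (statistic, p_value)} for drifted features.

    train / prod map feature names to 1-D arrays of values.
    """
    drifted = {}
    for name in train:
        stat, p = ks_2samp(train[name], prod[name])
        if p < alpha:
            drifted[name] = (float(stat), float(p))
    return drifted
```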
Feedback loop detection addresses a pernicious risk specific to AI systems. A recruitment screening model that systematically assigns lower scores to candidates from a particular demographic group results in fewer hires from that group, reducing the group's presence in training data for the next model cycle and reinforcing the original bias. Detecting this requires comparing the training data's demographics against the demographics of production decisions, controlling for the system's influence. The test examines whether the demographic makeup of the system's positive decisions differs from that of the eligible population in a way that would shift the training data. This is a custom statistical analysis, not a standard tool feature, that the Technical SME conducts quarterly and documents in the PMM report.
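One way to implement the quarterly analysis is a chi-squared goodness-of-fit test comparing the demographic counts of positive decisions against the eligible-population proportions. This is a sketch of one plausible formulation, not a prescribed method; the group labels and significance level are illustrative assumptions.

```python
# Sketch: feedback-loop check comparing the demographic makeup of the
# system's positive decisions against the eligible population, via a
# chi-squared goodness-of-fit test. Significance level is illustrative.
import numpy as np
from scipy.stats import chisquare

def feedback_loop_test(positive_counts, population_props, alpha=0.01):
    """Flag a demographic shift in positive decisions vs the population.

    positive_counts:  observed positives per group, e.g. {"A": 480, "B": 120}
    population_props: eligible-population share per group, summing to 1.
    """
    groups = sorted(positive_counts)
    observed = np.array([positive_counts[g] for g in groups], dtype=float)
    expected = observed.sum() * np.array([population_props[g] for g in groups])
    stat, p = chisquare(observed, expected)
    return {"chi2": float(stat), "p_value": float(p), "shift_detected": p < alpha}
```

A detected shift does not by itself prove a feedback loop — external changes in the applicant population must be ruled out — which is why the text frames this as an analysis the Technical SME conducts and documents rather than a fully automated alert.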
Operational monitoring tracks system availability, inference latency, error rates, throughput, and resource utilisation continuously. Operational degradation can affect Article 15 compliance even if the model itself is unchanged. A system experiencing intermittent timeouts may produce incomplete or inconsistent results that affect decision quality.
Availability is measured against a defined service level objective documented in the AISDP. For high-risk systems where unavailability could force deployers to make decisions without AI support, availability degradation is a compliance concern. The monitoring system tracks both planned and unplanned downtime, computes rolling availability percentages over defined windows, and alerts when availability trends downward or when a single outage exceeds the maximum tolerable duration.
Latency monitoring tracks mean response times and tail latencies at the 95th and 99th percentiles, because high-percentile spikes can cause timeouts in downstream systems. Where latency exceeds the timeout threshold, the downstream system may receive no output or a default fallback value, introducing silent failures. Error rates are classified by type: input validation failures, inference failures, post-processing failures, and timeout failures, each with different compliance implications and its own alert threshold. A rising input validation failure rate may indicate a data source change, while a rising inference failure rate may indicate a model or infrastructure problem.
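The tail-latency summary and per-class error thresholds described above can be sketched as follows; the threshold values and error class names are illustrative assumptions.

```python
# Sketch: tail-latency tracking (p95/p99) and per-class error-rate
# checks. Class names and thresholds are illustrative assumptions.
import numpy as np
from collections import Counter

def latency_summary(latencies_ms):
    """Mean plus the p95/p99 tail latencies named in the PMM plan."""
    arr = np.asarray(latencies_ms, dtype=float)
    return {"mean": float(arr.mean()),
            "p95": float(np.percentile(arr, 95)),
            "p99": float(np.percentile(arr, 99))}

def error_rate_alerts(error_types, total_requests, thresholds):
    """Per-class error rates that exceed their class-specific threshold.

    error_types: iterable like ["input_validation", "timeout", ...]
    thresholds:  {error_class: max tolerable rate}
    """
    counts = Counter(error_types)
    return {cls: counts[cls] / total_requests
            for cls, limit in thresholds.items()
            if counts[cls] / total_requests > limit}
```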
Resource utilisation tracking covers CPU, GPU, memory, and storage against capacity limits, with warning thresholds typically at 70 to 80 per cent of capacity and critical thresholds at 90 per cent. Capacity monitoring should provide sufficient lead time, typically weeks, for infrastructure scaling decisions. Dependency health monitoring tracks the availability, latency, and error rates of upstream services and downstream consumers, because a degradation in an upstream data source can corrupt inputs without triggering any model-level alert.
Human oversight monitoring tracks four dimensions that directly measure Article 14 compliance in practice: override rates, review times, escalation frequency, and automation bias indicators.
Override rates carry compliance significance in both directions. An override rate consistently below 2 to 5 per cent may indicate automation bias, where operators accept recommendations without meaningful scrutiny. An override rate consistently above 20 to 30 per cent may indicate the system is underperforming or operators disagree with its logic. The PMM plan defines an expected override rate range based on the system's documented accuracy and the decision context, with both upper and lower threshold breaches generating alerts. Override analysis is disaggregated per operator, per decision type, and per time period to detect trends such as declining override rates that indicate growing automation bias.
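The two-sided band check reads directly off the paragraph above. The band edges below sit inside the ranges the text gives and are illustrative; the actual band comes from the PMM plan.

```python
# Sketch: two-sided override-rate check. Low breaches suggest automation
# bias, high breaches suggest underperformance or operator distrust.
# Band edges (5% / 25%) are illustrative; the PMM plan defines the band.
def override_alert(overrides, decisions, low=0.05, high=0.25):
    """Return None when in band, else which bound was breached."""
    rate = overrides / decisions
    if rate < low:
        return "below_band_possible_automation_bias"
    if rate > high:
        return "above_band_possible_underperformance"
    return None
```

Running the same check per operator, per decision type, and per period gives the disaggregated trend view the text describes.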
Review time analysis uses the time operators spend per case as a proxy for engagement depth. Operators who consistently review cases in under five seconds for decisions requiring substantive analysis are unlikely to be performing meaningful oversight. Review time distribution is as informative as the average: a bimodal distribution where most cases are reviewed in seconds but a small proportion take minutes may indicate operators are skimming the majority and only engaging deeply with cases triggering intuitive concern.
Escalation monitoring tracks frequency over time disaggregated by reason. A decline may indicate growing operator confidence, improving system performance, or avoidance of a process perceived as burdensome. The PMM plan defines a baseline escalation rate with categorised reasons enabling trend analysis. Beyond override rates and review times, more granular automation bias indicators examine whether operators who override high-confidence recommendations do so at the same rate as low-confidence ones, and whether operators engage with explanatory features before accepting recommendations.
The monitoring infrastructure must be reliable, scalable, and independent of the AI system it monitors. A monitoring system that fails when the AI system fails provides no information at the moment it is most needed. Five layers compose the architecture.
The data collection layer captures inference inputs, outputs, and metadata asynchronously to avoid adding latency to the inference path, typically streaming events to a message queue from which the monitoring pipeline consumes. The collection layer must handle peak throughput without data loss, as dropped monitoring events create blind spots in the compliance record. The storage layer uses a time-series database optimised for aggregation, comparison, and disaggregation queries, with a tiered strategy retaining raw data at full granularity for 30 to 90 days then aggregating to summaries for long-term retention.
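The collection layer's non-blocking capture can be sketched with an in-process queue standing in for the message broker; a bounded queue with a drop counter makes data loss visible rather than silent. Class and field names are illustrative assumptions.

```python
# Sketch: asynchronous capture of inference events. A bounded in-process
# queue stands in for the message broker; dropped events are counted so
# monitoring blind spots are themselves observable. Names illustrative.
import queue

class MonitoringCollector:
    def __init__(self, maxsize=10_000):
        self.events = queue.Queue(maxsize=maxsize)
        self.dropped = 0  # dropped events are a compliance signal in themselves

    def capture(self, inputs, output, metadata):
        """Non-blocking: never adds latency to the inference path."""
        try:
            self.events.put_nowait(
                {"inputs": inputs, "output": output, "meta": metadata})
        except queue.Full:
            self.dropped += 1

    def drain(self, batch_size=500):
        """Called by the monitoring pipeline's consumer, not the model."""
        batch = []
        while len(batch) < batch_size:
            try:
                batch.append(self.events.get_nowait())
            except queue.Empty:
                break
        return batch
```

In production the queue would be a durable external broker sized for peak throughput, since an in-process buffer fails together with the AI system and violates the independence requirement.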
The computation layer runs metric calculations on a scheduled basis as defined in the PMM plan. Computations must be idempotent and deterministic to ensure reproducibility and auditability. Where metrics depend on delayed ground truth labels, the pipeline handles late-arriving data and recomputes affected metrics. The alerting layer routes through a dedicated service ensuring delivery, acknowledgement tracking, and escalation of unacknowledged alerts. Alert fatigue is managed through threshold tuning and alert suppression for known, documented conditions. The dashboard layer serves both operational audiences with real-time visibility and governance audiences with compliance-relevant summaries.
The Technical SME documents the aggregation methodology in the PMM plan, and the raw data retention period must be sufficient to support serious incident investigations under Article 73.
Generative AI systems present monitoring challenges that differ substantially from classical ML models. The output space is open-ended, ground truth is often subjective, and failure modes include hallucination, tone deviation, and off-topic generation that traditional performance metrics cannot capture.
Output quality monitoring requires a mix of approaches: rule-based checks for prohibited content, length limits, and structural requirements; embedding-based similarity measures flagging outputs semantically distant from expected patterns; and classifier-based monitoring where a separate lightweight model evaluates output quality, relevance, and safety. Hallucination detection for RAG systems compares generated claims against source documents using entailment scoring, citation verification, and consistency checking that flags contradictory answers to the same query on different occasions.
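The embedding-based approach can be sketched as a cosine-similarity check against the centroid of a reference set of accepted outputs. The 0.7 cutoff and the toy two-dimensional vectors are illustrative assumptions; real embeddings come from the system's embedding model.

```python
# Sketch: flagging outputs whose embeddings sit far from the centroid
# of a reference set of accepted outputs, via cosine similarity.
# The similarity floor is an illustrative assumption.
import numpy as np

def centroid(reference_embeddings):
    """Unit-norm centroid of the reference embedding set."""
    c = np.asarray(reference_embeddings, dtype=float).mean(axis=0)
    return c / np.linalg.norm(c)

def flag_outliers(embeddings, ref_centroid, min_similarity=0.7):
    """Indices of outputs semantically distant from expected patterns."""
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = emb @ ref_centroid
    return [int(i) for i in np.where(sims < min_similarity)[0]]
```

Flagged outputs would feed the rule-based and classifier-based checks, or be routed into the human evaluation sample, rather than being treated as violations on their own.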
For non-RAG systems, hallucination detection relies on Natural Language Inference models that assess whether the generated statement is entailed by a reference corpus. These detectors are imperfect: they miss subtle hallucinations and occasionally flag correct statements. The monitoring should therefore combine automated detection with periodic human evaluation, where a random sample of generated outputs is reviewed by domain experts who rate factual accuracy, relevance, and safety. Argilla, Label Studio, and Prodigy provide annotation platforms for structuring this human evaluation.
Prompt and response distribution monitoring tracks the topical distribution of inputs and outputs over time using topic classification. BERTopic or custom embedding-based clustering categorises each prompt and response into a topic, with the topic distribution tracked over time. A sudden shift, such as a large increase in prompts about a topic the system was not designed for, may indicate misuse, a change in the user population, or an adversarial probing campaign. Safety and alignment monitoring tracks policy violation rates for systems with content policies, behavioural boundaries, or use-case restrictions. A rising violation rate may indicate that the model's safety alignment has degraded, that users have discovered bypass techniques, or that the system is encountering input patterns it was not designed to handle. Lakera Guard scans model inputs and outputs for prompt injection attempts, PII leakage, toxic content, and other safety violations. NVIDIA NeMo Guardrails enforces conversational guardrails at the application layer. The safety monitoring should run on every output in production, with violations logged, counted, and reported in the PMM report.
Composite systems combining multiple modalities or chaining several models create monitoring challenges that single-modality metrics do not capture. Degradation in one modality can be masked by stability in another when only the aggregate output is monitored. If the vision component of a medical imaging system becomes less accurate but the text generation component continues to produce fluent summaries, the human operator may not detect the underlying accuracy loss.
PMM for composite systems must monitor each modality independently in addition to monitoring the fused output. The Technical SME computes performance metrics at the component level and at the system level, with discrepancies generating alerts. Cascading failure detection tracks intermediate representations between pipeline stages, comparing their distributions against baseline, because errors in one stage can propagate and compound through subsequent stages.
Modality-specific drift monitoring operates independently on each input modality, since image inputs may drift due to camera changes, text inputs due to formatting changes, and structured data due to upstream system changes. A composite drift score aggregating across modalities can miss modality-specific shifts diluted in the aggregate. Fusion logic monitoring tracks the mechanism combining modality-specific outputs, detecting changes in the relative contribution of each modality even when individual performance is stable.
The monitoring architecture operates at three levels: per-component monitoring tracking each model or pipeline stage independently including its input distribution, output distribution, latency, error rate, and component-specific quality metrics; aggregate monitoring tracking end-to-end system behaviour including final output quality, fairness metrics, and user-facing performance; and intermediate representation monitoring detecting problems that neither component-level nor aggregate-level monitoring catches. Cross-modal consistency checks apply to systems processing multiple input modalities, logging and monitoring the resolution of conflicts where different modalities suggest different conclusions. A persistently high inconsistency rate may indicate that one modality's model has drifted or the fusion mechanism is not functioning as designed. The PMM plan specifies thresholds at both component and system levels, and AISDP Module 10 captures the monitoring strategy for each component model, the cross-modal interaction approach, and the fusion logic monitoring configuration.
PMM can be conducted through periodic manual analysis where continuous automated monitoring is unavailable. Monthly, the data scientist extracts a sample of production predictions from the inference log, computes the declared metrics covering performance, fairness, and drift, and produces a structured report. Drift analysis compares the current month's input feature distributions against the training baseline using standard statistical tests implemented in a simple script. Fairness recomputation calculates the declared fairness metrics on the production sample disaggregated by protected subgroup.
For LLM systems, monitoring can be partially procedural through daily human evaluation sampling, where a reviewer examines 50 to 100 outputs, rating each for accuracy, relevance, safety, and PII leakage. Weekly ratings are aggregated into a quality report with trend analysis. Any output rated as harmful or containing PII is immediately escalated.
The procedural approach detects problems at the next review rather than when they occur. For systems where a week of undetected drift could affect thousands of individuals, this delay is a significant compliance risk. Automated monitoring is strongly recommended. Evidently AI, NannyML, Prometheus, Grafana, RAGAS, and Guardrails AI all have open-source editions available at zero licence cost.
The failure modes for each dependency, and the system's expected behaviour when a dependency is unavailable, are documented by the Technical SME in the PMM plan and tested periodically. Dependency monitoring should cover every external integration documented in the system architecture, with alerting thresholds calibrated to each dependency's criticality.
When a monitoring alert fires, the triage process must determine whether the root cause is operational (infrastructure, configuration, dependency) or model-related (drift, degradation, adversarial input). This distinction matters because the response path, the responsible team, and the regulatory implications differ. The PMM plan defines diagnostic procedures for common alert patterns: a simultaneous spike in latency and error rate with stable model metrics suggests an infrastructure issue, while stable infrastructure metrics with degrading accuracy suggest a model issue. Where the cause is ambiguous, both the engineering and ML teams are engaged simultaneously to avoid sequential diagnosis delays.
Operator wellbeing and workload monitoring tracks cases per shift, shift duration, break frequency, and overtime hours. Cognitive fatigue degrades oversight quality: an operator who has reviewed three hundred cases in a single shift is less likely to catch a subtle error in case three hundred and one. The PMM plan defines maximum workload parameters. Human oversight metrics, thresholds, and alert configurations feed into AISDP Module 7, with monitoring results included in quarterly PMM reports and retained as Module 10 evidence.
The PMM infrastructure produces a structured monthly report covering inference volume and availability, performance estimates using NannyML CBPE or equivalent, drift metrics per feature and prediction level, fairness metrics per subgroup with confidence intervals, human oversight metrics including override rate, review time, and calibration case performance, and any alerts triggered with their resolution status. Reports are reviewed by the AI Governance Lead and retained as AISDP Module 10 evidence.
Human evaluation sampling provides qualitative assessment that automated monitoring cannot capture. Trained evaluators review a random sample of 100 to 500 production interactions weekly, rated on a structured rubric covering accuracy, relevance, safety, and explanation quality. The evaluation results feed into the PMM report and provide the ground truth against which automated quality metrics are calibrated. The AI Governance Lead defines the human evaluation cadence in the AISDP.
For RAG systems, retrieval-specific metrics supplement the output quality measures. Retrieval relevance measures whether the documents retrieved for a given query are actually relevant, assessed through automated relevance scoring using a cross-encoder re-ranker or separate relevance classifier and periodic human evaluation. Retrieval coverage measures whether the knowledge base contains documents relevant to the queries the system receives in production; a rising proportion of queries with no relevant documents indicates a knowledge base coverage gap. Retrieval consistency measures whether the same query retrieves the same documents at different times; inconsistency may indicate embedding model drift or knowledge base instability. Retrieval fairness monitoring extends the embedding bias assessment into production, with the Technical SME periodically running paired-query retrieval bias tests to verify retrieval quality does not differ systematically across protected dimensions.
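The retrieval consistency check can be sketched as a Jaccard overlap between the document IDs retrieved for the same query at two points in time; the 0.8 consistency floor is an illustrative assumption.

```python
# Sketch: retrieval consistency via Jaccard overlap of the document IDs
# returned for the same query in two runs. The 0.8 floor is illustrative.
def jaccard(ids_a, ids_b):
    """Set overlap of two retrieved-document ID lists, in [0, 1]."""
    a, b = set(ids_a), set(ids_b)
    return len(a & b) / len(a | b) if a | b else 1.0

def inconsistent_queries(runs_t1, runs_t2, floor=0.8):
    """Queries whose retrieved-document overlap falls below the floor.

    runs_t1 / runs_t2: {query: [doc_id, ...]} from the two runs.
    """
    return {q: jaccard(runs_t1[q], runs_t2[q])
            for q in runs_t1
            if jaccard(runs_t1[q], runs_t2[q]) < floor}
```

Note that rank order is deliberately ignored here; a stricter variant could use a rank-aware measure where ordering changes also matter for downstream generation quality.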