Article 72 requires providers of high-risk AI systems to establish a post-market monitoring (PMM) system that actively and systematically collects, documents, and analyses relevant data provided by deployers or collected through other sources throughout the system's lifetime. The PMM system must be proportionate to the nature of the AI technologies and the risks of the system. This section covers the PMM plan, monitoring implementation, alerting, serious incident reporting under Article 73, and system decommissioning.
PMM is the mechanism through which the organisation detects problems not anticipated during development, identifies drift that develops gradually over time, gathers evidence needed for serious incident reports, and generates data that feeds back into the risk management system for continuous improvement. A PMM system that collects data but does not act on it is non-compliant with the spirit of Article 72.
Article 72(3) requires a documented PMM plan as part of the technical documentation under Annex IV. The plan must define a data collection strategy specifying what data is collected, from which sources, and at what frequency. It must define an analysis methodology specifying how data is analysed, what metrics are computed, and what statistical tests are applied. A threshold and trigger framework determines what constitutes normal variation versus an alert condition. Escalation procedures define who is notified, how quickly, and what actions follow. The feedback loop defines how PMM findings are integrated into the risk management system, the AISDP, and the system's development cycle.
Meeting the PMM obligation requires engineering infrastructure that operates continuously, not a manual reporting process performed periodically. The PMM plan is documented in AISDP Module 12 (Post-Market Monitoring and Change History). The outputs inform Module 6 (Risk Management), Module 7 (Human Oversight), and Module 5 (Performance and Validation) as the system evolves in production.
Performance monitoring requires computing the system's core metrics continuously on production data and comparing against thresholds established during development. Accuracy metrics including AUC-ROC, F1 score, precision, recall, and calibration error are computed on production data with ground truth labels where available. Where ground truth is delayed, the Technical SME defines proxy metrics and leading indicators. All performance metrics must be disaggregated across protected characteristic subgroups, because aggregate performance can mask subgroup-specific degradation. A system whose aggregate accuracy remains stable but whose accuracy for a specific subgroup has degraded is experiencing a compliance-relevant change.
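Subgroup disaggregation can be sketched in a few lines. This is a minimal pure-Python illustration, not a prescribed implementation: the record layout, group labels, and recall as the chosen metric are all assumptions for the example.

```python
from collections import defaultdict

def disaggregated_recall(records):
    """Recall overall and per subgroup from (group, y_true, y_pred) tuples."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0})
    for group, y_true, y_pred in records:
        # Tally each record under its own subgroup and a synthetic overall bucket.
        for g in (group, "__overall__"):
            if y_true == 1 and y_pred == 1:
                counts[g]["tp"] += 1
            elif y_true == 1 and y_pred == 0:
                counts[g]["fn"] += 1
    return {g: c["tp"] / (c["tp"] + c["fn"])
            for g, c in counts.items() if c["tp"] + c["fn"]}

# Illustrative data: aggregate recall looks healthy (0.75) while
# group B has quietly degraded to 0.60.
records = ([("A", 1, 1)] * 90 + [("A", 1, 0)] * 10
           + [("B", 1, 1)] * 60 + [("B", 1, 0)] * 40)
recall = disaggregated_recall(records)
```

The same pattern extends to precision, calibration error, or any other declared metric: compute per subgroup first, and treat the aggregate as just one more bucket.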
Fairness monitoring computes selection rate ratios, equalised odds, predictive parity, and calibration within groups on production data at defined intervals. Fairness metrics are computed for single protected characteristics and, where cell sizes are sufficient, for intersectional subgroups. Where demographic data is unavailable, compensating strategies include proxy-based estimation, periodic deployer surveys, external benchmark comparison, and structured feedback analysis examining complaint patterns for demographic signals.
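A selection rate ratio check might look like the following sketch. The 0.8 alert level is the commonly cited "four-fifths" convention, used here only as an illustrative policy choice; the text leaves the actual threshold to the PMM plan.

```python
def selection_rate_ratio(outcomes):
    """Ratio of the lowest to the highest subgroup selection rate.

    `outcomes` maps group label -> (selected, total); a ratio near 1.0
    indicates parity across groups.
    """
    rates = {g: sel / tot for g, (sel, tot) in outcomes.items() if tot}
    return min(rates.values()) / max(rates.values())

# Hypothetical production window: 40% selection for group_a vs 30% for group_b.
ratio = selection_rate_ratio({"group_a": (40, 100), "group_b": (30, 100)})
alert = ratio < 0.8   # illustrative four-fifths alert level
```

Where cell sizes permit, the same function can be applied to intersectional subgroups by using composite group labels as keys.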
Data drift monitoring compares input distributions against training data using statistical measures such as Population Stability Index, Kolmogorov-Smirnov test statistics, Jensen-Shannon divergence, or Wasserstein distance. Input drift is monitored per feature and as composite scores. Concept drift, where the relationship between inputs and outputs changes, is often more consequential because the model's learned patterns no longer reflect reality. Feature drift monitors individual distributions for shifts that aggregate measures may miss.
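The Population Stability Index is straightforward to compute over binned distributions. The sketch below assumes equal binning has already been applied; the 0.1/0.25 interpretation bands shown in the comment are a common rule of thumb, not thresholds from the text.

```python
import math

def population_stability_index(expected, actual, eps=1e-6):
    """PSI between two binned distributions given as lists of proportions.

    Rule of thumb (illustrative): < 0.1 stable, 0.1-0.25 moderate shift,
    > 0.25 significant shift. Actual thresholds belong in the PMM plan.
    """
    psi = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)   # guard against empty bins
        psi += (a - e) * math.log(a / e)
    return psi

# Identical distributions give PSI of 0; a shifted one scores positive.
baseline = [0.25, 0.25, 0.25, 0.25]
shifted = [0.10, 0.20, 0.30, 0.40]
psi = population_stability_index(baseline, shifted)
```

KS statistics, Jensen-Shannon divergence, and Wasserstein distance are drop-in alternatives for the comparison step; PSI is shown here because it is the simplest to reproduce by hand.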
Operational monitoring tracks system availability, inference latency, error rates, throughput, and resource utilisation continuously. Operational degradation can affect Article 15 compliance even if the model itself is unchanged. A system experiencing intermittent timeouts may produce incomplete results affecting decision quality.
Availability is measured against a defined service level objective documented in the AISDP. Latency monitoring tracks tail latencies at the 95th and 99th percentiles, because high-percentile spikes can cause timeouts in downstream systems. Error rates are classified by type: input validation failures, inference failures, post-processing failures, and timeout failures, each with different compliance implications. Resource utilisation tracking provides lead time for infrastructure scaling decisions, with warning thresholds at 70 to 80 per cent capacity and critical thresholds at 90 per cent.
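Tail-latency tracking reduces to a percentile computation against the documented SLO. The sketch below uses the nearest-rank convention; monitoring libraries may interpolate differently, and the 250 ms SLO is a hypothetical value standing in for whatever the AISDP documents.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile (one common convention; others interpolate)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = list(range(1, 101))   # illustrative 1..100 ms samples
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
slo_breach = p99 > 250               # hypothetical 250 ms tail-latency SLO
```

In production these percentiles would be computed per monitoring window over streamed latency samples, with the p95/p99 series fed into the same alerting tiers as model metrics.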
Dependency health monitoring tracks the availability, latency, and error rates of upstream services and downstream consumers. A degradation in an upstream data source can corrupt inputs without triggering any model-level alert. The failure modes for each dependency and the system's expected behaviour when a dependency is unavailable are documented in the PMM plan and tested periodically.
Human oversight monitoring tracks four dimensions that directly measure Article 14 compliance in practice. Override rates measure the proportion of recommendations modified by human operators, with both upper and lower threshold breaches generating alerts. Consistently low rates may indicate automation bias; consistently high rates may indicate underperformance. Review time analysis uses the time operators spend per case as a proxy for depth of engagement, with minimum thresholds defined based on decision complexity. Escalation monitoring tracks frequency and reasons over time, with declining rates potentially indicating avoidance rather than improvement. Automation bias detection examines whether operators use confidence scores and explanatory features meaningfully, computed where the human oversight interface captures sufficient interaction data.
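The two-sided override rate check described above can be sketched as follows. The 2% and 30% bounds are placeholders for values the PMM plan would define; the interpretations attached to each breach follow the text.

```python
def override_rate_alert(overrides, total, low=0.02, high=0.30):
    """Two-sided check on the human override rate.

    A rate below `low` may indicate automation bias (operators rubber-
    stamping recommendations); a rate above `high` may indicate model
    underperformance. Bounds here are illustrative.
    """
    rate = overrides / total
    if rate < low:
        return rate, "possible automation bias"
    if rate > high:
        return rate, "possible model underperformance"
    return rate, None

# Hypothetical week: only 4 overrides in 500 reviewed cases.
rate, finding = override_rate_alert(overrides=4, total=500)
```

Review time and escalation rate monitoring follow the same two-sided pattern: both unusually low and unusually high values warrant investigation.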
The monitoring infrastructure must be reliable, scalable, and independent of the AI system it monitors. A monitoring system that fails when the AI system fails provides no information at the moment it is most needed. The architecture comprises five layers.
The data collection layer captures inference inputs, outputs, and metadata from the production system asynchronously, typically streaming events to a message queue such as Kafka or AWS Kinesis. This layer must handle peak throughput without data loss. The storage layer uses a time-series database optimised for aggregation, comparison, and disaggregation queries. A tiered storage strategy retains raw data at full granularity for 30 to 90 days, then aggregates to summaries for long-term retention.
The computation layer runs metric calculations on a scheduled basis as defined in the PMM plan. Computations must be idempotent and deterministic to ensure auditability. The pipeline handles late-arriving ground truth data and recomputes affected metrics. The alerting layer routes alerts through a dedicated service with acknowledgement tracking and escalation. Alert fatigue management through threshold tuning is essential. The dashboard layer serves both operational audiences with real-time visibility and governance audiences with compliance-relevant summaries.
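Idempotent, window-keyed computation can be sketched as below: results are keyed by (metric, window), so a rerun after late-arriving ground truth overwrites the prior value rather than duplicating it. Event fields and the accuracy metric are illustrative choices.

```python
from datetime import datetime

def compute_window_metrics(events, window_start, window_end):
    """Deterministic accuracy over a half-open time window.

    Keying results by (metric, window) keeps recomputation idempotent:
    late labels trigger a rerun that overwrites the same key.
    """
    in_window = [e for e in events if window_start <= e["ts"] < window_end]
    labelled = [e for e in in_window if e["label"] is not None]
    correct = sum(1 for e in labelled if e["pred"] == e["label"])
    key = ("accuracy", window_start.isoformat(), window_end.isoformat())
    return {key: correct / len(labelled) if labelled else None}

store = {}
events = [
    {"ts": datetime(2025, 4, 1, 9), "pred": 1, "label": 1},
    {"ts": datetime(2025, 4, 1, 10), "pred": 0, "label": None},  # label pending
    {"ts": datetime(2025, 4, 1, 11), "pred": 1, "label": 0},
]
store.update(compute_window_metrics(events, datetime(2025, 4, 1), datetime(2025, 4, 2)))

# The late label arrives; recomputing overwrites the same key in place.
events[1]["label"] = 0
store.update(compute_window_metrics(events, datetime(2025, 4, 1), datetime(2025, 4, 2)))
```

Because each window's result is a pure function of its inputs, an auditor can reproduce any historical metric value from the archived events.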
For generative AI and LLM systems, monitoring differs substantially from classical ML. Output quality monitoring uses rule-based checks, embedding-based similarity, and classifier-based evaluation. Hallucination detection for RAG systems compares generated claims against source documents using entailment scoring, citation verification, and consistency checking. Prompt and response distribution monitoring tracks topical drift using classification. Safety monitoring tracks policy violation rates.
The value of PMM lies in acting on the data it produces. Three severity tiers structure the response.
Informational alerts indicate a metric shift within the established tolerance band. These are logged and reviewed at the next scheduled PMM meeting with no immediate action required. Warning alerts indicate a metric has breached its warning threshold, set at a level indicating potential drift before the compliance threshold is reached. The Technical SME reviews within five working days, initiates root cause analysis, and escalates to the AI Governance Lead if the cause is unclear or concerning.
Critical alerts indicate a metric has breached its compliance threshold, a fundamental rights concern has been identified, or multiple warning-level alerts have occurred simultaneously. Immediate investigation is initiated, the AI Governance Lead is notified within 24 hours, the break-glass procedure is considered, and the serious incident reporting process is assessed for applicability.
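The three tiers and their response parameters can be encoded directly, which keeps routing auditable. The sketch below assumes a higher-is-worse metric; the playbook entries restate the response parameters from the text, and all names are illustrative.

```python
from enum import Enum

class Severity(Enum):
    INFO = "informational"
    WARNING = "warning"
    CRITICAL = "critical"

# Response parameters as described above; structure is illustrative.
PLAYBOOK = {
    Severity.INFO: {"notify": None, "review_within": "next scheduled PMM meeting"},
    Severity.WARNING: {"notify": "Technical SME", "review_within": "5 working days"},
    Severity.CRITICAL: {"notify": "AI Governance Lead", "review_within": "24 hours"},
}

def classify(metric_value, warning_threshold, critical_threshold):
    """Map a metric breach to a severity tier (higher value = worse)."""
    if metric_value >= critical_threshold:
        return Severity.CRITICAL
    if metric_value >= warning_threshold:
        return Severity.WARNING
    return Severity.INFO

# A drift score of 0.12 against a 0.10 compliance threshold is critical.
sev = classify(metric_value=0.12, warning_threshold=0.05, critical_threshold=0.10)
```

Keeping the playbook in code (or declarative configuration) rather than in a wiki makes the escalation rules versionable alongside the PMM plan.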
The escalation path defines who is notified, through which channel, within what timeframe, and what actions are expected. Paths account for out-of-hours scenarios, key person unavailability with named alternates, and multi-jurisdiction incidents. The alerting system tracks not only acknowledgement but subsequent actions and outcomes. An alert acknowledged but producing no root cause analysis, decision, or resolution indicates an escalation framework gap.
Threshold calibration balances sensitivity against operational burden. The Technical SME derives initial thresholds from validation performance, defining warning and critical thresholds as deviations from baseline. For fairness metrics, thresholds reflect both statistical significance and practical significance. Thresholds are reviewed quarterly, with those that have never triggered potentially too loose and those that trigger weekly on benign variation too tight. The Technical SME documents each review as part of the PMM plan's continuous improvement cycle.
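A minimal calibration-and-review sketch, assuming a higher-is-better metric; the deviation sizes and the 13-week quarter are placeholders for values the Technical SME would set from validation performance.

```python
def derive_thresholds(baseline, warning_dev=0.03, critical_dev=0.07):
    """Warning/critical floors below the validation baseline.

    Deviation sizes are illustrative; the text leaves them to the
    Technical SME and the PMM plan.
    """
    return {"warning": baseline - warning_dev, "critical": baseline - critical_dev}

def review_threshold(triggers_last_quarter, weeks_observed=13):
    """Quarterly sanity check: a threshold that never fires may be too
    loose; one firing roughly weekly on benign variation, too tight."""
    if triggers_last_quarter == 0:
        return "possibly too loose"
    if triggers_last_quarter >= weeks_observed:
        return "possibly too tight"
    return "calibration plausible"

thresholds = derive_thresholds(baseline=0.91)
verdict = review_threshold(triggers_last_quarter=0)
```

For fairness metrics the same structure applies, but the deviations should encode both statistical and practical significance rather than a flat offset.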
Article 73 requires providers to report serious incidents to the market surveillance authority of the member state where the incident occurred. Article 3(49) defines a serious incident as one that directly or indirectly results in death or serious harm to health, serious and irreversible disruption to critical infrastructure management, infringement of fundamental rights obligations, or serious harm to property or the environment. Indirect causation is sufficient: an AI system providing incorrect medical analysis that leads to patient harm through subsequent physician decisions constitutes a serious incident.
Reporting timelines are tiered by severity. Widespread fundamental rights infringement or irreversible critical infrastructure disruption requires reporting within 2 days of awareness. Death or suspected causal link to death requires reporting within 10 days. All other serious incidents require reporting within 15 days. Article 73(5) permits an initial incomplete report followed by supplementary information.
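The tiered deadlines lend themselves to a small lookup, useful for embedding in incident tooling. Category keys are illustrative shorthand for the Article 73 tiers summarised above.

```python
from datetime import date, timedelta

# Article 73 timelines as summarised above (days from provider awareness).
DEADLINE_DAYS = {
    "widespread_rights_or_infrastructure": 2,
    "death": 10,
    "other_serious_incident": 15,
}

def report_due_by(awareness, category):
    """Latest date for the initial (possibly incomplete) report."""
    return awareness + timedelta(days=DEADLINE_DAYS[category])

due = report_due_by(date(2025, 3, 1), "death")
```

An incident management system would compute this deadline at triage time and alarm well before it, since the clock runs from awareness, not from completion of the investigation.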
Building a serious incident response capability requires detection infrastructure with alerts mapped to Article 3(49) criteria, a pre-defined triage process determining whether events meet the definition and classifying severity, and evidence preservation. Article 73(6) prohibits altering the system in ways that could affect evaluation of causes prior to informing authorities. The engineering team must preserve model version, configuration, input and output data, feature values, and all relevant logs.
The reporting execution involves the incident lead preparing the initial report, the Legal and Regulatory Advisor reviewing for completeness, the AI Governance Lead authorising submission, and submission to the relevant authority. Investigation must determine root cause, scope of impact, and appropriate remedy. For systems under multiple reporting regimes such as NIS2, DORA, or medical device vigilance, under Article 73(9) the AI Act reporting is limited to fundamental rights infringements, with other incidents reported through sector-specific channels.
Deployers of high-risk systems have their own PMM obligations under Article 26. They must monitor the system's operation using the provider's instructions for use, inform the provider of any serious incidents, and suspend use if they believe the system presents a risk. The provider's ability to fulfil its PMM obligations depends on information flowing from deployers back to the provider.
The Instructions for Use must include sufficient operational monitoring guidance for deployers, specifying minimum monitoring activities: reviewing outputs for consistency, tracking human oversight metrics, monitoring complaint rates, and observing behaviour for degradation. Clear criteria for when to suspend the system must be provided, including specific observable indicators such as systematically biased results, sudden output distribution changes, or error rates exceeding defined thresholds.
Structured feedback channels must be established. A dedicated reporting portal or API endpoint collects incident reports and anomaly observations in structured format. The Technical SME triages incoming feedback within defined timeframes: 24 hours for incident reports, five working days for general feedback. Individual deployer reports may appear minor in isolation, but patterns across multiple deployers can reveal systemic issues.
For systems with limited production visibility, where the deployer controls the production environment, the deployment contract should specify minimum data the deployer must provide, including inference volumes, error rates, human oversight metrics, and anomaly summaries. Technical data pipelines using telemetry agents, callback APIs, or periodic batch exports automate collection where feasible. Where full visibility cannot be achieved, compensating strategies include periodic deployer audits, synthetic monitoring with sentinel test cases, and deployer satisfaction surveys. The PMM plan documents which mechanisms are used for each deployment, the coverage they provide, and residual monitoring gaps. Governance ensures that changes to the deployed system are traceable across all deployment contexts.
Every completed feedback loop cycle, from PMM finding through decision, action, validation, and AISDP update, must be documented as a single traceable record. The loop's value depends entirely on its execution: findings must translate into system improvements, not accumulate in dashboards.
Decision authority is tiered by impact. Threshold adjustments based on operational experience can be authorised by the Technical SME. Model retraining on updated data, where the retrained model passes all validation gates, is authorised by the Technical Owner with notice to the AI Governance Lead. Architecture changes, hyperparameter shifts, or feature set changes require AI Governance Lead approval and a substantial change assessment. System suspension requires AI Governance Lead sign-off with immediate notice to the Legal and Regulatory Advisor.
PMM-triggered remediation competes with feature development and other engineering priorities. Organisations should establish a PMM action backlog separate from the general engineering backlog. Critical actions override all other work. Warning-level actions are scheduled within the next sprint. The AI Governance Lead has authority to elevate priority when engineering prioritisation is inconsistent with compliance risk.
PMM resource planning should budget 15 to 25 per cent of the system's annual development cost for ongoing monitoring and compliance maintenance. Personnel costs include dedicated analytical capacity of 0.25 to 0.5 FTE for a medium-complexity system. Infrastructure costs grow over time as monitoring data accumulates.
PMM data retention must balance the AI Act's ten-year documentation obligation with GDPR's storage limitation principle. A tiered approach retains individual-level data at full granularity for 90 days, then aggregates to statistical summaries for long-term retention. Access controls restrict PMM data containing personal information to authorised analysts, with access logged and reviewed. The feedback loop itself is monitored: time from finding to decision, time from decision to fix, proportion of findings resulting in system changes, and proportion of fixes resolving the originating finding.
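The tiered retention approach can be sketched as a pure function over timestamped records: recent records are kept raw, older ones collapse to daily statistical summaries. Field names and the aggregated statistics are illustrative.

```python
from datetime import datetime, timedelta

def apply_retention(records, now, raw_days=90):
    """Tiered retention sketch: keep (timestamp, score) records raw for
    `raw_days`, reduce older ones to a per-day count and mean."""
    cutoff = now - timedelta(days=raw_days)
    raw, buckets = [], {}
    for ts, score in records:
        if ts >= cutoff:
            raw.append((ts, score))
        else:
            n, total = buckets.get(ts.date(), (0, 0.0))
            buckets[ts.date()] = (n + 1, total + score)
    summaries = {d: {"count": n, "mean_score": total / n}
                 for d, (n, total) in buckets.items()}
    return raw, summaries

now = datetime(2025, 6, 1)
records = [
    (datetime(2025, 5, 30), 0.8),   # recent: retained at full granularity
    (datetime(2025, 1, 10), 0.4),   # old: aggregated to a daily summary
    (datetime(2025, 1, 10), 0.6),
]
raw, summaries = apply_retention(records, now)
```

The summaries satisfy the ten-year documentation obligation while the deletion of individual-level rows honours GDPR storage limitation; the aggregation statistics retained should match the metrics the PMM plan declares.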
Every AI system will eventually reach a point where it must be taken out of service. The EU AI Act defines these transitions through Articles 3, 16, 18, 20, and 79. Three pathways trigger end-of-life: planned retirement; voluntary withdrawal for compliance reasons, where irremediable non-conformities are identified; and mandated withdrawal or recall, where a market surveillance authority orders corrective action under Article 79 within 15 working days.
The end-of-life plan covers seven workstreams. Deployer transition requires notifying all known deployers of the withdrawal decision, reason, timeline, and recommended transition arrangements. For API-served systems, this includes deprecation notices, sunset headers, and access cutoff. The Technical SME coordinates technical shutdown including inference endpoint deactivation with informative error responses, model artefact archival with integrity verification, credential and access revocation for all production credentials and API keys, infrastructure decommission releasing dedicated resources, and monitoring infrastructure transition with a final data snapshot.
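API deprecation with informative errors might look like the sketch below. It advertises the cutoff via the `Sunset` HTTP header (RFC 8594) before decommission and returns a 410 with an explanatory body afterwards; the cutoff date and response shape are illustrative, and framework wiring is omitted.

```python
from datetime import datetime, timezone

SUNSET = datetime(2026, 1, 1, tzinfo=timezone.utc)   # hypothetical cutoff

def handle_inference_request(now):
    """Endpoint behaviour across the deprecation window.

    Before the cutoff: serve normally, advertising the Sunset header.
    After: an informative 410 Gone rather than a silent disappearance.
    """
    headers = {"Sunset": SUNSET.strftime("%a, %d %b %Y %H:%M:%S GMT")}
    if now >= SUNSET:
        return 410, headers, {"error": "This AI system has been decommissioned."}
    return 200, headers, {"status": "ok"}

status, headers, body = handle_inference_request(
    datetime(2025, 6, 1, tzinfo=timezone.utc))
```

Emitting the header well ahead of cutoff lets deployers' clients detect the deprecation programmatically, complementing the contractual notification.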
Data lifecycle closure reconciles the AI Act's ten-year documentation retention with GDPR's storage limitation. Training data containing personal data is deleted or anonymised at decommission, while metadata and statistical summaries are retained. Inference logs follow the PMM data retention policy. Data subjects retain GDPR rights after decommission, and the organisation must maintain response capability.
Downstream decision monitoring addresses the fact that the system's historical outputs may still affect individuals. Credit scoring, recruitment screening, and medical diagnostic decisions can have persistent effects. The AI Governance Lead assesses whether historical outputs continue to affect individuals and defines post-decommission monitoring focused on relevant fairness and accuracy dimensions.
The PMM system is the primary mechanism through which the organisation maintains its compliance posture after deployment. PMM findings feed back into the risk management system for risk register updates, the model development cycle for retraining and recalibration, the data governance framework for training data updates, the human oversight design for interface improvements, and the AISDP for documentation updates reflecting the system's evolving behaviour.
Each PMM cycle that results in a system change creates a new AISDP version, maintains the version control discipline, and triggers evaluation of whether the change constitutes a substantial modification under Article 3(23). The maturity of an organisation's PMM programme can be assessed across three tiers. At the reactive tier, the organisation responds only to incidents and complaints. At the structured tier, it follows a documented plan with defined metrics and thresholds. At the proactive tier, it uses trend analysis and predictive indicators to anticipate problems before they materialise.
Organisations should aim for the proactive tier, where PMM becomes a source of continuous improvement rather than a compliance burden. A proactive PMM programme identifies opportunities to improve performance and fairness before degradation reaches a compliance-relevant threshold, feeding insights into the development cycle as part of normal operations.
For organisations without continuous automated monitoring, monthly manual analysis of production prediction samples against declared metrics provides the minimum viable approach. Drift analysis using standard statistical tests and fairness recomputation on production samples are conducted periodically. The manual approach detects problems at the next monthly review, not when they occur, creating a significant compliance risk for systems where undetected drift could affect thousands of individuals. Where automated monitoring is not yet in place, the PMM plan must document the monitoring gap and the timeline for implementing automation. Open-source tools including Evidently AI, NannyML, and Prometheus provide the monitoring capability at minimal licensing cost.
Temporal stability tracking across all metrics identifies slow, consistent decline that may not breach thresholds on any single measurement but represents significant cumulative degradation over months.
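A simple way to detect such creep is a least-squares slope over the metric series. The sketch below is a minimal illustration; the monthly accuracy values are fabricated to show a decline that never breaches a per-measurement threshold.

```python
def trend_slope(values):
    """Least-squares slope of an evenly spaced metric series.

    A small but consistently negative slope flags slow degradation that
    single-measurement threshold checks would miss.
    """
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var

# Six monthly accuracy readings drifting down ~0.4 points/month:
# no single value breaches, say, a 0.88 floor, but the trend is clear.
series = [0.910, 0.906, 0.903, 0.899, 0.894, 0.891]
slope = trend_slope(series)
```

In a fuller implementation the slope would be paired with a significance test before alerting, to avoid flagging benign month-to-month noise.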
Operator workload indicators including cases per shift, shift duration, and break frequency are tracked because cognitive fatigue degrades oversight quality. The PMM plan defines maximum workload parameters based on decision complexity.
For composite and multi-modal systems, monitoring must operate at both component and aggregate levels. Per-component monitoring tracks each model independently, while aggregate monitoring tracks end-to-end behaviour. Intermediate representation monitoring between pipeline stages detects problems that neither level catches alone. Cross-modal consistency checks identify conflicts between modalities.
The organisation maintains a serious incident reporting register tracking every event assessed against Article 3(49) criteria, regardless of outcome, including triage determinations, investigation status, and corrective actions.
The final AISDP version includes a decommissioning record capturing the trigger and rationale, execution log, deployer notification records, shutdown record, data deletion and retention record, and registration status update confirming the EU database has been updated. The complete archive must be immutable, accessible, and cost-efficient for the ten-year retention period. Article 18 documentation retention and Article 73 serious incident reporting obligations persist after decommission.