Article 3(23) defines substantial modification as a change affecting compliance or intended purpose, but provides no numeric thresholds. Organisations must define, document, and automate their own detection framework. This page covers the four threshold categories, the decision process, cumulative change tracking, CI/CD pipeline integration, and the AISDP evidence requirements.
Article 3(23) defines substantial modification as a change not foreseen or planned in the initial conformity assessment that affects compliance with Chapter 2 requirements or modifies the intended purpose. The definition is qualitative; the regulation does not specify numeric thresholds, meaning the organisation must define its own thresholds, document the rationale, and encode them in automated gates. Getting this right is consequential: thresholds too loose allow changes that should trigger re-assessment to slip through; thresholds too tight flag routine updates unnecessarily, creating bottlenecks and compliance fatigue.
The version control system must support detection of changes that cross this threshold by tracking quantitative metrics across every measurable dimension of change.
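One way to make those metrics trackable is to declare the thresholds themselves as data, so every pipeline stage reads the same values. A minimal sketch, with illustrative field names and the example values used later on this page (none of them mandated by the regulation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChangeThresholds:
    # Illustrative values only; each threshold must be calibrated to the
    # system's risk profile and the rationale documented.
    auc_roc_delta: float = 0.03             # absolute AUC-ROC change
    selection_rate_ratio_min: float = 0.80  # four-fifths rule floor
    psi_investigate: float = 0.1            # PSI above this: investigate
    psi_significant: float = 0.2            # PSI above this: significant shift
    top_feature_rank: int = 5               # watch entrants to the top-N features

THRESHOLDS = ChangeThresholds()
```

Freezing the declaration (`frozen=True`) means a threshold change is itself a reviewable code change with a commit history.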
The Technical SME defines thresholds for each measurable dimension, calibrated to the system's specific risk profile. A performance threshold such as an AUC-ROC change exceeding ±0.03 provides a starting point for binary classification, but context matters: for a credit scoring system processing millions of applications, a 0.03 change could shift thousands of decisions; for an internal classifier, the same change may be immaterial. The threshold should answer one question: what magnitude of change would alter real-world impact enough to warrant a fresh compliance review?
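As a sketch, the performance check reduces to a magnitude comparison; `performance_breach` and its 0.03 default are illustrative names and values, not prescribed ones:

```python
def performance_breach(baseline_auc: float, candidate_auc: float,
                       threshold: float = 0.03) -> bool:
    """True when the absolute AUC-ROC change meets or exceeds the threshold."""
    return abs(candidate_auc - baseline_auc) >= threshold
```

A drop from 0.91 to 0.87 breaches a 0.03 threshold even though both values look healthy in isolation; the gate reasons about change, not absolute performance.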
Fairness thresholds are non-negotiable in a way performance thresholds are not: any subgroup metric breaching its declared threshold is a potential substantial modification, because it directly affects Article 10 and Article 9 compliance. The four-fifths rule (a selection rate ratio below 0.80) and bounds on equalised odds deviations provide thresholds that reflect both statistical significance and practical significance. For small subgroups, the Technical SME computes confidence intervals and applies thresholds to the lower bound.
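One way to apply thresholds to the lower bound is a Wilson score interval on the subgroup's selection rate before taking the four-fifths ratio; `four_fifths_breach` and its signature are a hypothetical sketch, not a standard API:

```python
import math

def wilson_lower(successes: int, n: int, z: float = 1.96) -> float:
    """Wilson score interval lower bound for a binomial proportion."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom

def four_fifths_breach(sub_selected: int, sub_total: int,
                       ref_rate: float, floor: float = 0.80) -> bool:
    """Apply the four-fifths rule to the lower confidence bound of the
    subgroup selection rate: conservative for small subgroups, where a
    point estimate alone would be noisy."""
    return (wilson_lower(sub_selected, sub_total) / ref_rate) < floor
```

Using the lower bound shrinks the ratio, so small subgroups are flagged more readily rather than less, which is the conservative direction for compliance review.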
Distribution shift thresholds use PSI as the standard metric. PSI below 0.1 is stable, between 0.1 and 0.2 warrants investigation, and above 0.2 indicates significant shift. Jensen-Shannon divergence and the Kolmogorov-Smirnov test provide complementary perspectives. Feature importance thresholds detect changes in the model's top-five feature ranking, measured by SHAP values or permutation importance. A feature correlating with a protected characteristic moving into the top five warrants investigation as a leading indicator, even if the fairness metrics themselves have not yet breached their thresholds.
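A minimal PSI implementation with the banding above (decile bins taken from the baseline are a common convention; the function names are ours):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index: bin edges come from the baseline's
    quantiles, and a small epsilon keeps empty bins from producing log(0)."""
    eps = 1e-6
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    actual = np.clip(actual, edges[0], edges[-1])  # keep out-of-range values in the end bins
    e_frac = np.histogram(expected, edges)[0] / len(expected) + eps
    a_frac = np.histogram(actual, edges)[0] / len(actual) + eps
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

def psi_band(value: float) -> str:
    """Map a PSI value onto the stable / investigate / significant bands."""
    if value < 0.1:
        return "stable"
    return "investigate" if value < 0.2 else "significant shift"
```

In the pipeline, `psi_band` output above "stable" on any monitored feature is what opens an investigation ticket; the JS divergence and KS test then confirm or discount the signal.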
These thresholds are encoded as assertions in the CI/CD pipeline. Evidently AI provides profile-comparison reports with configurable pass/fail conditions. NannyML offers performance estimation without ground truth labels.
When automated gates flag a change approaching or exceeding a threshold, the decision process follows a defined sequence. The Technical SME conducts an initial assessment documenting which metrics changed, by how much, and why. The AI Governance Lead determines whether the change constitutes a substantial modification under Article 3(23). The Legal and Regulatory Advisor provides input on borderline cases, particularly those involving intended purpose or deployment context changes.
If determined to be a substantial modification, a new conformity assessment is required before the modified system can be placed on the market. This is a significant operational consequence. Organisations should design change management to anticipate it: the AI System Assessor assesses changes against thresholds before implementation, not after. The pipeline provides early warning when a change-in-progress trends toward the threshold.
If determined not to be a substantial modification, the AI System Assessor documents the determination with supporting evidence in AISDP Module 12. Critically, cumulative drift must also be tracked: a series of individually sub-threshold changes that collectively alter behaviour significantly may constitute a cumulative substantial modification. The organisation tracks cumulative change metrics over defined windows (quarterly and semi-annually), assessing whether aggregate change exceeds the threshold even when no individual change did.
The cumulative baseline comparison is the most important check. It compares the current version against the version assessed at the last conformity assessment, not just the current production version. This catches gradual drift where each change is within thresholds but twenty changes over six months have collectively transformed the system.
Without automated detection, the AI System Assessor conducts the evaluation manually before each deployment. After each model evaluation, the Assessor retrieves the current metrics from the evaluation reports, compares them against the documented thresholds, and records the comparison in a structured Substantial Modification Assessment Form. The form captures each dimension of change, the current score, the candidate score, whether the threshold is breached, and the Assessor's determination with supporting reasoning.
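The form lends itself to a structured record, so each determination is machine-readable when filed in AISDP Module 12. This schema is a hypothetical sketch, not a mandated format:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class DimensionAssessment:
    dimension: str         # e.g. "auc_roc", "selection_rate_ratio"
    current_score: float   # production model
    candidate_score: float
    threshold: float       # declared absolute-change threshold

    @property
    def breached(self) -> bool:
        return abs(self.candidate_score - self.current_score) >= self.threshold

@dataclass
class AssessmentForm:
    system: str
    dimensions: list       # list of DimensionAssessment
    determination: str     # "substantial modification" or "acceptable change"
    reasoning: str

    def to_json(self) -> str:
        record = asdict(self)
        # asdict() does not serialise properties, so add the per-row verdict.
        for dim, row in zip(self.dimensions, record["dimensions"]):
            row["breached"] = dim.breached
        return json.dumps(record, indent=2)
```

Serialising the form at every evaluation gives the Assessor a uniform artefact to file, whether the assessment itself was manual or automated.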
The cumulative change assessment requires the Assessor to compare against the baseline version from the last conformity assessment, not merely the current production version. The Assessor maintains a running record of all changes since the last assessment and reviews the aggregate at each evaluation.
Manual detection is adequate for systems with infrequent changes (quarterly or less often). For systems undergoing continuous retraining, the volume of evaluations makes manual assessment unsustainable. Evidently AI and NannyML are both open-source tools that can automate the threshold comparison at minimal cost.
A threshold breach means an assessment is warranted, not that the change is automatically a substantial modification. The outcome is a documented determination: either a substantial modification triggering a new conformity assessment, or an acceptable change with documented rationale and evidence.
PSI below 0.1 is typically stable. Between 0.1 and 0.2 warrants investigation. Above 0.2 indicates significant distribution shift. Jensen-Shannon divergence and Kolmogorov-Smirnov tests provide complementary perspectives for confirmation.
The baseline is captured at each conformity assessment or substantial modification determination. It remains frozen until the next assessment. Quarterly comparisons against this baseline catch cumulative drift that individual-change tracking misses.
Evidently AI provides CI-integrated profile comparison with configurable thresholds. NannyML adds performance estimation without ground truth labels. Custom solutions use pytest assertions comparing evaluation results against declared thresholds and baseline metrics.
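A custom gate can be as small as one pytest test over the two metric dictionaries; the helper name and metric keys here are illustrative, and in the pipeline the dictionaries would be loaded from the stored evaluation reports:

```python
def assert_within_thresholds(baseline: dict, candidate: dict) -> None:
    """Fail (via AssertionError) on any declared threshold breach."""
    auc_delta = abs(candidate["auc_roc"] - baseline["auc_roc"])
    assert auc_delta < 0.03, f"AUC-ROC changed by {auc_delta:.3f} (threshold 0.03)"
    ratio = candidate["selection_rate_ratio"]
    assert ratio >= 0.80, f"four-fifths rule breached: selection rate ratio {ratio:.2f}"

def test_candidate_passes_gate():
    baseline = {"auc_roc": 0.91}
    candidate = {"auc_roc": 0.90, "selection_rate_ratio": 0.86}
    assert_within_thresholds(baseline, candidate)
```

pytest fails the CI job on any breach, and the assertion messages land in the build log as the first entry of the audit trail.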
Automated gates flag threshold breaches; the Technical SME assesses impact; the AI Governance Lead determines whether the change constitutes a substantial modification, with input from the Legal and Regulatory Advisor on borderline cases.
Every candidate model is compared against production and baseline on all metrics. Evidently AI or custom pytest assertions gate the pipeline. Reports are stored as audit trail artefacts.