Article 3(23) defines substantial modification as a change affecting compliance or intended purpose, but provides no numeric thresholds. Organisations must define, document, and automate their own detection framework. This page covers the four threshold categories, the decision process, cumulative change tracking, CI/CD pipeline integration, and the AISDP evidence requirements.
Article 3(23) defines substantial modification as a change not foreseen or planned in the initial conformity assessment that affects compliance with Chapter 2 requirements or modifies the intended purpose. The definition is qualitative; the regulation does not specify numeric thresholds, meaning the organisation must define its own thresholds, document the rationale, and encode them in automated gates. Getting this right is consequential: thresholds too loose allow changes that should trigger re-assessment to slip through; thresholds too tight flag routine updates unnecessarily, creating bottlenecks and compliance fatigue.
The version control system must support detection of changes that cross this threshold by tracking quantitative metrics across every measurable dimension of change.
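One way to make those metrics trackable is to declare the thresholds themselves as data, so every pipeline stage reads the same values. A minimal sketch, with illustrative field names and the example values used later on this page (none of them mandated by the regulation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChangeThresholds:
    # Illustrative values only; each threshold must be calibrated to the
    # system's risk profile and the rationale documented.
    auc_roc_delta: float = 0.03             # absolute AUC-ROC change
    selection_rate_ratio_min: float = 0.80  # four-fifths rule floor
    psi_investigate: float = 0.1            # PSI above this: investigate
    psi_significant: float = 0.2            # PSI above this: significant shift
    top_feature_rank: int = 5               # watch entrants to the top-N features

THRESHOLDS = ChangeThresholds()
```

Freezing the declaration (`frozen=True`) means a threshold change is itself a reviewable code change with a commit history.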
The Technical SME defines thresholds for each measurable dimension, calibrated to the system's specific risk profile. A performance threshold such as an AUC-ROC change exceeding ±0.03 provides a starting point for binary classification, but context matters: for a credit scoring system processing millions of applications, a 0.03 change could shift thousands of decisions; for an internal classifier, the same change may be immaterial. The threshold should answer one question: what magnitude of change would alter real-world impact enough to warrant a fresh compliance review?
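As a sketch, the performance check reduces to a magnitude comparison; `performance_breach` and its 0.03 default are illustrative names and values, not prescribed ones:

```python
def performance_breach(baseline_auc: float, candidate_auc: float,
                       threshold: float = 0.03) -> bool:
    """True when the absolute AUC-ROC change meets or exceeds the threshold."""
    return abs(candidate_auc - baseline_auc) >= threshold
```

A drop from 0.91 to 0.87 breaches a 0.03 threshold even though both values look healthy in isolation; the gate reasons about change, not absolute performance.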
Fairness thresholds are non-negotiable in a way performance thresholds are not: any subgroup metric breaching its declared threshold is a potential substantial modification, because it directly affects Article 10 and Article 9 compliance. The four-fifths rule (a selection rate ratio below 0.80) and bounds on equalised odds deviations provide thresholds that reflect both statistical significance and practical significance. For small subgroups, the Technical SME computes confidence intervals and applies thresholds to the lower bound.
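One way to apply thresholds to the lower bound is a Wilson score interval on the subgroup's selection rate before taking the four-fifths ratio; `four_fifths_breach` and its signature are a hypothetical sketch, not a standard API:

```python
import math

def wilson_lower(successes: int, n: int, z: float = 1.96) -> float:
    """Wilson score interval lower bound for a binomial proportion."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom

def four_fifths_breach(sub_selected: int, sub_total: int,
                       ref_rate: float, floor: float = 0.80) -> bool:
    """Apply the four-fifths rule to the lower confidence bound of the
    subgroup selection rate: conservative for small subgroups, where a
    point estimate alone would be noisy."""
    return (wilson_lower(sub_selected, sub_total) / ref_rate) < floor
```

Using the lower bound shrinks the ratio, so small subgroups are flagged more readily rather than less, which is the conservative direction for compliance review.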
Distribution shift thresholds use PSI as the standard metric. PSI below 0.1 is stable, between 0.1 and 0.2 warrants investigation, and above 0.2 indicates significant shift. Jensen-Shannon divergence and the Kolmogorov-Smirnov test provide complementary perspectives. Feature importance thresholds detect changes in the model's top-five feature ranking, measured by SHAP values or permutation importance. A feature correlating with a protected characteristic moving into the top five warrants investigation as a leading indicator, even if the fairness metrics themselves have not yet breached their thresholds.
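A minimal PSI implementation with the banding above (decile bins taken from the baseline are a common convention; the function names are ours):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index: bin edges come from the baseline's
    quantiles, and a small epsilon keeps empty bins from producing log(0)."""
    eps = 1e-6
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    actual = np.clip(actual, edges[0], edges[-1])  # keep out-of-range values in the end bins
    e_frac = np.histogram(expected, edges)[0] / len(expected) + eps
    a_frac = np.histogram(actual, edges)[0] / len(actual) + eps
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

def psi_band(value: float) -> str:
    """Map a PSI value onto the stable / investigate / significant bands."""
    if value < 0.1:
        return "stable"
    return "investigate" if value < 0.2 else "significant shift"
```

In the pipeline, `psi_band` output above "stable" on any monitored feature is what opens an investigation ticket; the JS divergence and KS test then confirm or discount the signal.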
These thresholds are encoded as assertions in the CI/CD pipeline. Evidently AI provides profile-comparison reports with configurable pass/fail conditions. NannyML offers performance estimation without ground truth labels.
When automated gates flag a change approaching or exceeding a threshold, the decision process follows a defined sequence. The Technical SME conducts an initial assessment documenting which metrics changed, by how much, and why. The AI Governance Lead determines whether the change constitutes a substantial modification under Article 3(23). The Legal and Regulatory Advisor provides input on borderline cases, particularly those involving intended purpose or deployment context changes.
If determined to be a substantial modification, a new conformity assessment is required before the modified system can be placed on the market. This is a significant operational consequence. Organisations should design change management to anticipate it: the AI System Assessor assesses changes against thresholds before implementation, not after. The pipeline provides early warning when a change-in-progress trends toward the threshold.
If determined not to be a substantial modification, the AI System Assessor documents the determination with supporting evidence in AISDP Module 12. Critically, cumulative drift must also be tracked: a series of individually sub-threshold changes that collectively alter behaviour significantly may constitute a cumulative substantial modification. The organisation tracks cumulative change metrics over defined windows (quarterly and semi-annually), assessing whether aggregate change exceeds the threshold even when no individual change did.
The cumulative baseline comparison is the most important check. It compares the current version against the version assessed at the last conformity assessment, not just the current production version. This catches gradual drift where each change is within thresholds but twenty changes over six months have collectively transformed the system.
Without automated detection, the AI System Assessor conducts the evaluation manually before each deployment. After each model evaluation, the Assessor retrieves the current metrics from the evaluation reports, compares them against the documented thresholds, and records the comparison in a structured Substantial Modification Assessment Form. The form captures each dimension of change, the current score, the candidate score, whether the threshold is breached, and the Assessor's determination with supporting reasoning.
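The form lends itself to a structured record, so each determination is machine-readable when filed in AISDP Module 12. This schema is a hypothetical sketch, not a mandated format:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class DimensionAssessment:
    dimension: str         # e.g. "auc_roc", "selection_rate_ratio"
    current_score: float   # production model
    candidate_score: float
    threshold: float       # declared absolute-change threshold

    @property
    def breached(self) -> bool:
        return abs(self.candidate_score - self.current_score) >= self.threshold

@dataclass
class AssessmentForm:
    system: str
    dimensions: list       # list of DimensionAssessment
    determination: str     # "substantial modification" or "acceptable change"
    reasoning: str

    def to_json(self) -> str:
        record = asdict(self)
        # asdict() does not serialise properties, so add the per-row verdict.
        for dim, row in zip(self.dimensions, record["dimensions"]):
            row["breached"] = dim.breached
        return json.dumps(record, indent=2)
```

Serialising the form at every evaluation gives the Assessor a uniform artefact to file, whether the assessment itself was manual or automated.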
The cumulative change assessment requires the Assessor to compare against the baseline version from the last conformity assessment, not merely the current production version. The Assessor maintains a running record of all changes since the last assessment and reviews the aggregate at each evaluation.
Manual detection is adequate for systems with infrequent changes (quarterly or less often). For systems undergoing continuous retraining, the volume of evaluations makes manual assessment unsustainable. Evidently AI and NannyML are both open-source tools that can automate the threshold comparison at minimal cost.
A threshold breach means an assessment is warranted, not that the change is automatically a substantial modification. The outcome is a documented determination: either a substantial modification triggering a new conformity assessment, or an acceptable change with documented rationale and evidence.
PSI below 0.1 is typically stable. Between 0.1 and 0.2 warrants investigation. Above 0.2 indicates significant distribution shift. Jensen-Shannon divergence and Kolmogorov-Smirnov tests provide complementary perspectives for confirmation.
The baseline is captured at each conformity assessment or substantial modification determination. It remains frozen until the next assessment. Quarterly comparisons against this baseline catch cumulative drift that individual-change tracking misses.
Evidently AI provides CI-integrated profile comparison with configurable thresholds. NannyML adds performance estimation without ground truth labels. Custom solutions use pytest assertions comparing evaluation results against declared thresholds and baseline metrics.
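A custom gate can be as small as one pytest test over the two metric dictionaries; the helper name and metric keys here are illustrative, and in the pipeline the dictionaries would be loaded from the stored evaluation reports:

```python
def assert_within_thresholds(baseline: dict, candidate: dict) -> None:
    """Fail (via AssertionError) on any declared threshold breach."""
    auc_delta = abs(candidate["auc_roc"] - baseline["auc_roc"])
    assert auc_delta < 0.03, f"AUC-ROC changed by {auc_delta:.3f} (threshold 0.03)"
    ratio = candidate["selection_rate_ratio"]
    assert ratio >= 0.80, f"four-fifths rule breached: selection rate ratio {ratio:.2f}"

def test_candidate_passes_gate():
    baseline = {"auc_roc": 0.91}
    candidate = {"auc_roc": 0.90, "selection_rate_ratio": 0.86}
    assert_within_thresholds(baseline, candidate)
```

pytest fails the CI job on any breach, and the assertion messages land in the build log as the first entry of the audit trail.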
Automated gates flag threshold breaches; the Technical SME assesses impact; the AI Governance Lead determines whether the change constitutes a substantial modification, with input from the Legal and Regulatory Advisor on borderline cases.
Every candidate model is compared against production and baseline on all metrics. Evidently AI or custom pytest assertions gate the pipeline. Reports are stored as audit trail artefacts.