Articles 12, 3(23), and 72 of the EU AI Act collectively require organisations to maintain rigorous version control across every artefact that constitutes a high-risk AI system. Without compliance-grade versioning, organisations cannot demonstrate traceability, assess substantial modifications, or satisfy the ten-year retention obligation.
Three EU AI Act provisions collectively mandate version control for high-risk AI systems. Article 12 requires automatic recording of events during the system's operation. Article 3(23) defines substantial modification as a change that affects compliance or intended purpose. Article 72 requires a post-market monitoring system that tracks changes. Together, these provisions demand rigorous version control across every artefact that constitutes the AI system: code, models, data, configurations, documentation, and the AI System Description Package (AISDP) itself.
Without version control, an organisation cannot demonstrate which version of the system was deployed at any given time, what changed between versions, whether a change constitutes a substantial modification, or that the system assessed during conformity assessment is the same system deployed in production. For notified bodies and competent authorities, the version control record serves as evidence that the organisation exercises deliberate control over its AI system's evolution. Its absence suggests the opposite.
Compliance-grade version control differs from development-grade version control in several important respects. Every version must be immutable once committed, with no force-pushes, no history rewriting, and no retroactive modifications. Every version must be attributable to a named individual with a verified identity and must carry a timestamp from a trusted source. Every version must be retrievable for the full ten-year retention period required under Article 18, including versions of artefacts that are no longer in active use. The engineering team must also protect the version control infrastructure against tampering, because an attacker who can modify the version history can undermine the entire compliance record.
Traditional software version control tracks one artefact type: code. A Git commit captures a complete, deterministic snapshot of the application's behaviour. Given the same commit, the same build environment, and the same inputs, the application produces the same outputs. The relationship between a version and the system's behaviour is unambiguous.
AI system version control must track six artefact types simultaneously, each with its own versioning semantics, its own change cadence, and its own compliance implications. The system's behaviour at any point in time is determined not by any single artefact but by the specific combination of all six. A change to any one artefact can alter the system's outputs, its fairness profile, its risk posture, and its compliance status.
This is the composite versioning problem, and it is the reason that Version Control and Change Management describes the version control infrastructure as the traceability backbone of the entire AISDP. The challenge is not merely technical; it is structural. Code versioning with Git is a solved problem. Data versioning requires specialised tools such as DVC or LakeFS. Model versioning relies on model registries. Prompt versioning is an emerging discipline with tooling still maturing. Coordinating all six artefact types into a single coherent version record demands deliberate architectural decisions that go well beyond installing Git.
Each artefact type has distinct characteristics that shape how it must be versioned. Code contains application logic, feature engineering, inference pipelines, post-processing, API contracts, and monitoring configuration. It changes continuously during active development and carries moderate compliance sensitivity because code changes affect behaviour but are typically reviewed through standard pull request processes. Git remains the standard versioning tool for code.
Data encompasses training datasets, validation datasets, test datasets, evaluation benchmarks, and knowledge bases for RAG systems. Data changes periodically, from weekly to monthly for training data, and continuously for knowledge bases. Its compliance sensitivity is high because data changes directly affect model behaviour, fairness, and the Article 10 compliance posture. Tools such as DVC, LakeFS, Delta Lake, or cloud-native versioning handle data versioning.
Model artefacts include trained weights, hyperparameters, training configuration, and the complete specification needed to reproduce the model. Changes are event-driven, with each training or fine-tuning run producing a new version. Compliance sensitivity is very high because model changes are the primary trigger for substantial modification assessment. Model registries from MLflow, Vertex AI, or SageMaker provide the versioning infrastructure.
Configuration covers decision thresholds, feature flags, system parameters, deployment topology, environment variables, and prompt templates for LLM systems. Configuration changes frequently because it is often the mechanism through which the system's behaviour is tuned without retraining. A threshold change can shift the system's decision boundary, affecting every individual assessed by the system, making its compliance sensitivity high.
The composite version ties all six artefact types together into a single, auditable record. It is a structured identifier that captures the specific version of each artefact type deployed at a given point in time. It answers the question: "What exactly was running in production at any given moment?" The composite version identifier should be a deterministic function of its component versions, typically a hash of the concatenated component version identifiers, or a structured manifest listing each component and its version.
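A minimal sketch of the deterministic-hash approach, where the composite identifier is a function of a sorted component manifest (the artefact names and version strings below are illustrative, not a prescribed schema):

```python
import hashlib
import json

def composite_version(components: dict[str, str]) -> str:
    """Deterministic composite version id from component versions.

    `components` maps artefact type to its version identifier.
    Sorting the keys makes the hash independent of insertion order,
    so the same combination always yields the same identifier.
    """
    manifest = json.dumps(components, sort_keys=True)
    return hashlib.sha256(manifest.encode()).hexdigest()[:16]

# Illustrative component versions for one deployed combination.
manifest = {
    "code": "a1b2c3d",            # Git commit
    "data": "dvc-9f8e7d6",        # data version id
    "model": "registry-v14",      # model registry version
    "config": "cfg-2024-06-01",   # configuration snapshot
    "prompts": "prompt-set-v7",   # prompt template set
    "knowledge_base": "kb-0412",  # knowledge base snapshot
}
cv = composite_version(manifest)
```

Because the identifier is derived from the components, changing any one component (a retrained model, an edited prompt) necessarily produces a new composite version.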
The composite version is the unit of compliance. The AISDP describes a specific composite version. The conformity assessment evaluates a specific composite version. The Declaration of Conformity attests to a specific composite version. When any component changes, the composite version changes, and the change management process must determine whether the new composite version remains within the compliance boundaries established by the last conformity assessment.
This has practical consequences for every stage of the compliance lifecycle. The conformity assessment report must reference the exact composite version evaluated. The post-market monitoring system must track composite version changes over time. When a competent authority requests evidence of compliance, the organisation must be able to reconstruct the composite version that was deployed at any specific date and demonstrate that it matched the version described in the AISDP. Conformity Assessment Processes covers how these assessments are structured and when they must be repeated.
Traditional software is deterministic: the same code produces the same outputs given the same inputs. AI systems are designed to change. Models are retrained on new data, knowledge bases are updated with new documents, prompt templates are refined based on operational feedback, and configuration thresholds are adjusted based on monitoring data. Each change is individually rational, yet their cumulative effect may move the system far from the version assessed during conformity assessment. Three patterns characterise this temporal challenge.
Gradual drift occurs when no single change crosses a threshold, but the aggregate of many small changes over months transforms the system's behaviour. The composite version after twenty changes may bear little resemblance to the composite version at conformity assessment. The cumulative baseline comparison is the primary defence against this pattern: it measures the current system against the version evaluated during conformity assessment, not merely against the immediately preceding version. Without it, gradual drift remains invisible until a competent authority compares the deployed system against the AISDP.
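A hedged sketch of what a cumulative baseline comparison might look like; the metric names, threshold values, and the choice of absolute deltas are all illustrative assumptions, not prescribed by the Act:

```python
def drift_from_baseline(baseline: dict, current: dict, thresholds: dict) -> dict:
    """Compare current metrics against the conformity-assessment baseline.

    Returns per-metric deltas and flags any breach of the declared
    thresholds -- catching drift even when each individual release
    only moved the metrics a little.
    """
    report = {}
    for metric, limit in thresholds.items():
        delta = current[metric] - baseline[metric]
        report[metric] = {"delta": round(delta, 4), "breach": abs(delta) > limit}
    return report

# Illustrative values: twenty small releases, each within the
# version-to-version limit, that together breach the baseline limits.
baseline = {"accuracy": 0.91, "demographic_parity_gap": 0.02}
current = {"accuracy": 0.87, "demographic_parity_gap": 0.05}
thresholds = {"accuracy": 0.02, "demographic_parity_gap": 0.02}
report = drift_from_baseline(baseline, current, thresholds)
```

The key design point is the choice of reference: comparing against `baseline` rather than the previous release is what makes accumulated drift visible.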
Silent updates occur when third-party components change without the organisation's knowledge or consent. A GPAI provider may apply a minor update to its model behind the same API endpoint. An embedding model provider may release a new version that the system's dependency manager installs automatically. A knowledge base data feed may include a document that changes the information available to the model. Each silent update changes the composite version, yet the organisation's version control system may not detect the change. The underlying principle is that if the organisation cannot control the change, it must at least detect it; sentinel monitoring serves as the compensating control for silent updates.
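One simple form of sentinel monitoring, sketched under the assumption of deterministic model outputs (temperature 0); `model_fn`, the probe strings, and the stand-in model functions below are all illustrative, and real deployments would also compare output-distribution and fairness metrics rather than exact strings:

```python
import hashlib

def sentinel_fingerprint(model_fn, probes: list[str]) -> str:
    """Fingerprint a third-party model by hashing its outputs on a
    fixed probe set. A changed fingerprint signals a silent update
    behind an unchanged API endpoint."""
    h = hashlib.sha256()
    for probe in probes:
        h.update(model_fn(probe).encode())
    return h.hexdigest()[:12]

# Stand-ins for two builds of a provider model behind one endpoint.
model_v1 = lambda p: f"v1-answer:{p}"
model_v2 = lambda p: f"v2-answer:{p}"
probes = ["probe A", "probe B"]
```

Running the probe set on a schedule and alerting on a fingerprint change gives the organisation detection even where it has no control.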
The determination of whether a change constitutes a substantial modification requires structured analysis against the Article 3(23) criteria. Seven borderline cases, where the answer is not immediately obvious, illustrate how organisations apply this framework in practice.
Routine retraining on updated data is not automatically a substantial modification, even when the data distribution has shifted. Consider a credit scoring system retrained quarterly on the most recent 24 months of application data, where the model architecture, hyperparameters, features, and intended purpose are unchanged. The new training data reflects a shift in the applicant population's income distribution due to economic conditions. The determination rests on whether the version-to-version performance delta, fairness delta, cumulative drift from the baseline, and output distribution metrics all stay within declared thresholds. If they do, the AI System Assessor documents the determination with the metrics, noting any approaching thresholds and recommending enhanced monitoring. The test is whether the resulting system still complies with the requirements, not whether the inputs have changed.
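The assessor's documented determination can be captured as a structured record. A minimal sketch, with illustrative field names and metric values rather than a prescribed schema:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ModificationDetermination:
    """Audit record for a substantial-modification determination.
    Field names are illustrative, not a prescribed schema."""
    change: str
    metrics: dict
    within_thresholds: bool
    determination: str = field(init=False, default="")
    assessed_on: date = field(default_factory=date.today)

    def __post_init__(self):
        # Derive the determination from the threshold outcome so the
        # record always matches the metrics it carries.
        self.determination = (
            "not a substantial modification"
            if self.within_thresholds
            else "substantial modification - re-assessment required"
        )

record = ModificationDetermination(
    change="Quarterly retraining on latest 24 months of applications",
    metrics={"accuracy_delta": -0.004, "fairness_delta": 0.006},
    within_thresholds=True,
)
```

Keeping the metrics and the conclusion in one record is what lets the determination be defended years later.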
Replacing a model component with a different architecture family is presumptively a substantial modification even when aggregate metrics remain within thresholds. When a recruitment screening system replaces a BERT-based NER model with a RoBERTa-based alternative, the qualitative nature of the change goes beyond parameter adjustment. The cascade impact matters: the new NER model extracts competencies with different confidence distributions, which changes the inputs to downstream classification and ranking stages. Even with improved overall accuracy and adequate fairness metrics, a focused re-assessment of the affected AISDP modules is required before deployment, because the change is qualitative, not merely quantitative.
Adding a new language to a multilingual RAG system typically constitutes a substantial modification. When a regulatory compliance advisory system with English, French, and German coverage adds Italian regulatory texts and enables Italian language query processing, the intended purpose remains unchanged, but the deployment context expands to include Italian-speaking users and Italian regulatory content. Lower grounding scores for Italian queries indicate higher hallucination risk, and lower retrieval precision compared to established languages suggests the embedding model has weaker coverage. These quality gaps mean that compliance with Article 15 accuracy requirements and Article 9 risk management obligations cannot be presumed from the existing assessment. A re-assessment covering the new language capability is required before the expansion is deployed.
When a GPAI provider updates a model behind the same API endpoint without changing the model identifier, the organisation did not initiate the change. The system's composite version has changed because the model component version has changed, but the change was not under the organisation's control. The organisation must assess whether the changed system still complies with the requirements established during its conformity assessment. Sentinel monitoring results, including output distribution shifts and fairness metric deltas, form the basis for this assessment. If metrics breach declared thresholds, the organisation must either roll back to the previous model version through version pinning, implement compensating controls, or conduct a re-assessment.
Prompt refinements represent the version control blind spot for LLM-based systems. Individually trivial changes can prove cumulatively transformative. When fourteen incremental prompt refinements over three months each adjust category definitions, add examples, or modify output format instructions, no single refinement changes aggregate accuracy by more than half a percentage point. Yet cumulatively, the classification distribution shifts materially and accuracy degrades by four percentage points against the benchmark dataset. The version-to-version comparison for each change showed sub-threshold shifts, but the version-to-baseline comparison reveals the significant behavioural drift. The cumulative effect constitutes a substantial modification under Article 3(23), even though no individual refinement did. Measuring current behaviour against the conformity assessment baseline, rather than only the immediately preceding version, is essential for detecting this pattern.
Version control is inherently a tool-based capability, and there is no manual alternative to Git for code versioning. Git itself is open-source and free. GitHub, GitLab, and Bitbucket all offer free tiers adequate for most compliance purposes. The minimum tooling requirement is Git with a hosted repository, either through the free tiers of GitHub or GitLab, or through a self-hosted GitLab Community Edition installation.
These tooling choices shape the access control policies and backup procedures that protect the compliance record. The version control infrastructure must enforce immutability, so that committed versions cannot be retroactively altered. It must enforce identity verification, so that every change is attributable to a named individual. It must provide trusted timestamps from a reliable source, so that the temporal sequence of changes is verifiable. The engineering team protects this infrastructure against tampering because an attacker who can modify the version history can undermine the entire compliance record.
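Hosted platforms provide these properties through protected branches, signed commits, and server-side audit logs; the tamper-evidence principle underneath is a hash chain, where each entry commits to the hash of its predecessor. A minimal sketch with illustrative fields (real deployments would use Git's own object model or a trusted timestamping service):

```python
import hashlib
import json
import time

def append_entry(log: list, author: str, change: str) -> dict:
    """Append a tamper-evident entry: each entry commits to the hash
    of its predecessor, so rewriting any earlier entry breaks the chain."""
    prev = log[-1]["hash"] if log else "genesis"
    body = {"author": author, "change": change,
            "timestamp": time.time(), "prev": prev}
    body["hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append(body)
    return body

def verify_chain(log: list) -> bool:
    """Recompute every hash and link; False means tampering."""
    prev = "genesis"
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if body["prev"] != prev:
            return False
        if hashlib.sha256(json.dumps(body, sort_keys=True).encode()
                          ).hexdigest() != entry["hash"]:
            return False
        prev = entry["hash"]
    return True
```

This is the same property that makes a Git history with forbidden force-pushes auditable: altering one recorded change invalidates everything after it.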
Organisations should evaluate their existing version control setup against these three properties: immutability, attribution, and trusted timestamps. Where gaps exist, the compliance-grade requirements described throughout this guide identify the specific controls needed to bridge from development-grade to compliance-grade version control.
No. Routine retraining is not automatically a substantial modification even when the data distribution has shifted. The test is whether the resulting system still complies with the requirements, assessed through performance, fairness, cumulative drift, and output distribution metrics against declared thresholds.
The organisation must detect and assess the change through sentinel monitoring. If metrics breach declared thresholds, the organisation must roll back through version pinning, implement compensating controls, or conduct a re-assessment.
Yes, provided the adjustment falls within the provider's documented operating parameters in the Instructions for Use. The deployer documents the configuration choice and assesses its impact, including fundamental rights implications.
Adding a new language expands the system to a population whose risk profile was not evaluated in the original conformity assessment. Lower grounding scores and retrieval quality for the new language mean compliance with accuracy and risk management requirements cannot be presumed.
Intended purpose drift occurs when a system is gradually used for purposes beyond its assessed scope. It is the most insidious form of substantial modification because it happens at the deployment level. Using a candidate screening tool for internal promotion decisions, for example, triggers provider status under Article 25(1)(c).
A composite version is a structured record capturing the specific version of every artefact type deployed at a given point in time, serving as the unit against which conformity assessments are evaluated.
A change is substantial when it affects the system's compliance with Articles 9 through 15 or changes the intended purpose, assessed through performance, fairness, and drift metrics against declared thresholds.
Three patterns: gradual drift from accumulated small changes, silent updates from third-party providers, and retroactive invalidity from post-deployment discoveries.
Git with a hosted repository through free tiers of GitHub, GitLab, or self-hosted GitLab Community Edition, enforcing immutability, identity verification, and trusted timestamps.
Prompts include system prompts, few-shot examples, prompt chains, tool definitions, and output format specifications for LLM and agentic systems. Prompt engineering is iterative, with changes made daily during development and periodically in production. Compliance sensitivity is very high because prompt changes can alter behaviour as profoundly as model retraining, yet they are often treated as informal configuration.
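Treating prompts as versioned artefacts rather than informal configuration can be as simple as a content-addressed registry with an ordered history. A minimal illustrative sketch, not a production design:

```python
import hashlib

class PromptRegistry:
    """Content-addressed store for prompt templates, so each
    refinement gets a durable version id instead of being an
    untracked config edit."""

    def __init__(self):
        self._versions: dict[str, str] = {}  # version id -> template
        self._history: list[str] = []        # ordered version ids

    def commit(self, template: str) -> str:
        # The id is derived from the content, so re-committing an
        # identical template yields the same id.
        vid = hashlib.sha256(template.encode()).hexdigest()[:10]
        self._versions[vid] = template
        if not self._history or self._history[-1] != vid:
            self._history.append(vid)
        return vid

    def get(self, vid: str) -> str:
        return self._versions[vid]

    def history(self) -> list[str]:
        return list(self._history)
```

An ordered history like this is the raw material for the version-to-baseline comparison that catches cumulative prompt drift.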
Knowledge base content, including documents, embeddings, metadata, and retrieval configuration for RAG systems, changes continuously as documents are added, updated, and removed. Knowledge base changes alter the information available to the model, which alters the system's outputs. Its compliance sensitivity is high.
Retroactive invalidity occurs when a discovery after deployment, such as biased or unlicensed training data, a known model vulnerability, or a changed regulatory interpretation, retroactively invalidates a composite version that was compliant at deployment time. The version control system must support retrospective analysis: given a specific composite version, what data was it trained on, what model components did it contain, and what configuration was active? The ten-year retention obligation means these questions may need to be answered years after the version was deployed.
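Supporting that retrospective analysis requires a deployment ledger mapping dates to composite versions and their manifests. A minimal sketch with illustrative identifiers and field names:

```python
from datetime import date

class DeploymentLedger:
    """Records which composite version went live on which date, so
    'what was running on date X?' can be answered years later."""

    def __init__(self):
        self._deployments = []  # (deployed_on, composite_id, manifest)

    def record(self, deployed_on: date, composite_id: str, manifest: dict):
        self._deployments.append((deployed_on, composite_id, manifest))
        # Keep chronological order regardless of insertion order.
        self._deployments.sort(key=lambda d: d[0])

    def version_on(self, day: date):
        """Return the (composite_id, manifest) live on `day`, or None
        if nothing had been deployed yet."""
        hit = None
        for deployed_on, cid, manifest in self._deployments:
            if deployed_on <= day:
                hit = (cid, manifest)
            else:
                break
        return hit
```

Given the composite identifier, the stored manifest answers which data, model, and configuration versions were in effect, which is exactly what a retroactive-invalidity investigation needs.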
Configuration changes within a provider's documented operating parameters are deployer-level activities, not substantial modifications. When a clinical decision support system's risk score threshold is adjusted from the default of 0.55 to 0.45 to increase sensitivity, and the provider's Instructions for Use document a permissible range of 0.40 to 0.70, the deployer has not modified the system. Article 25(1) provider-status triggers are not engaged. The deployer documents the configuration choice in its compliance record with the rationale for the threshold selection, and re-evaluates its fundamental rights impact assessment to determine whether the increased false-positive rate has implications for the individuals assessed.
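A hedged sketch of how such a configuration change might be classified and documented; the field names and classification strings are illustrative, and the legal determination always rests with the assessor, not the check:

```python
def check_threshold_change(new_value: float,
                           documented_range: tuple,
                           rationale: str) -> dict:
    """Classify a deployer configuration change against the provider's
    documented operating parameters from the Instructions for Use."""
    lo, hi = documented_range
    within = lo <= new_value <= hi
    return {
        "new_value": new_value,
        "within_documented_range": within,
        "classification": (
            "deployer-level configuration change" if within
            else "outside documented range - assess under Article 3(23)"),
        "rationale": rationale,
    }

# The clinical decision support example: 0.45 within the documented
# 0.40-0.70 range stays a deployer-level activity.
decision = check_threshold_change(
    0.45, (0.40, 0.70), "Increase sensitivity per clinical review")
```

Recording the rationale alongside the range check gives the compliance record both the decision and its justification.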
Intended purpose drift is the most insidious form of substantial modification because it occurs at the deployment level, not the engineering level. When a system assessed as a candidate screening tool under Annex III Area 4(a) is gradually adopted for evaluating internal employees for promotion eligibility, the deployer has moved beyond the provider's documented intended purpose. Internal promotion evaluation involves different data, different affected persons with existing employment protections, and a materially different risk profile. The deployer has triggered Article 25(1)(c) provider status and must either prepare a full AISDP for the promotion use case or cease using the system for that purpose. The original provider's conformity assessment does not cover the new use. Monitoring for intended purpose drift requires operational management oversight, not just technical monitoring.