Articles 12, 3(23), and 72 of the EU AI Act collectively require organisations to maintain rigorous version control across every artefact that constitutes a high-risk AI system. Without compliance-grade versioning, organisations cannot demonstrate traceability, assess substantial modifications, or satisfy the ten-year retention obligation.
Three EU AI Act provisions collectively mandate version control for high-risk AI systems. Article 12 requires automatic recording of events during the system's operation. Article 3(23) defines substantial modification as a change that affects compliance or intended purpose. Article 72 requires a post-market monitoring system that tracks changes. Together, these provisions demand rigorous version control across every artefact that constitutes the AI system: code, models, data, configurations, documentation, and the AI System Description Package (AISDP) itself.
Without version control, an organisation cannot demonstrate which version of the system was deployed at any given time, what changed between versions, whether a change constitutes a substantial modification, or that the system assessed during conformity assessment is the same system deployed in production. For notified bodies and competent authorities, the version control record serves as evidence that the organisation exercises deliberate control over its AI system's evolution. Its absence suggests the opposite.
Compliance-grade version control differs from development-grade version control in several important respects. Every version must be immutable once committed, with no force-pushes, no history rewriting, and no retroactive modifications. Every version must be attributable to a named individual with a verified identity and must carry a timestamp from a trusted source. Every version must be retrievable for the full ten-year retention period required under Article 18, including versions of artefacts that are no longer in active use. The engineering team must also protect the version control infrastructure against tampering, because an attacker who can modify the version history can undermine the entire compliance record.
Traditional software version control tracks one artefact type: code. A Git commit captures a complete, deterministic snapshot of the application's behaviour. Given the same commit, the same build environment, and the same inputs, the application produces the same outputs. The relationship between a version and the system's behaviour is unambiguous.
AI system version control must track six artefact types simultaneously, each with its own versioning semantics, its own change cadence, and its own compliance implications. The system's behaviour at any point in time is determined not by any single artefact but by the specific combination of all six. A change to any one artefact can alter the system's outputs, its fairness profile, its risk posture, and its compliance status.
This is the composite versioning problem, and it is the reason that Version Control and Change Management describes the version control infrastructure as the traceability backbone of the entire AISDP. The challenge is not merely technical; it is structural. Code versioning with Git is a solved problem. Data versioning requires specialised tools such as DVC or LakeFS. Model versioning relies on model registries. Prompt versioning is an emerging discipline with tooling still maturing. Coordinating all six artefact types into a single coherent version record demands deliberate architectural decisions that go well beyond installing Git.
Each artefact type has distinct characteristics that shape how it must be versioned. Code contains application logic, feature engineering, inference pipelines, post-processing, API contracts, and monitoring configuration. It changes continuously during active development and carries moderate compliance sensitivity because code changes affect behaviour but are typically reviewed through standard pull request processes. Git remains the standard versioning tool for code.
Data encompasses training datasets, validation datasets, test datasets, evaluation benchmarks, and knowledge bases for RAG systems. Data changes periodically, from weekly to monthly for training data, and continuously for knowledge bases. Its compliance sensitivity is high because data changes directly affect model behaviour, fairness, and the Article 10 compliance posture. Tools such as DVC, LakeFS, Delta Lake, or cloud-native versioning handle data versioning.
Model artefacts include trained weights, hyperparameters, training configuration, and the complete specification needed to reproduce the model. Changes are event-driven, with each training or fine-tuning run producing a new version. Compliance sensitivity is very high because model changes are the primary trigger for substantial modification assessment. Model registries from MLflow, Vertex AI, or SageMaker provide the versioning infrastructure.
Configuration covers decision thresholds, feature flags, system parameters, deployment topology, environment variables, and prompt templates for LLM systems. Configuration changes frequently because it is often the mechanism through which the system's behaviour is tuned without retraining. A threshold change can shift the system's decision boundary, affecting every individual assessed by the system, making its compliance sensitivity high.
The composite version ties all six artefact types together into a single, auditable record. It is a structured identifier that captures the specific version of each artefact type deployed at a given point in time. It answers the question: "What exactly was running in production at any given moment?" The composite version identifier should be a deterministic function of its component versions, typically a hash of the concatenated component version identifiers, or a structured manifest listing each component and its version.
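A minimal sketch of the deterministic-hash approach, where the composite identifier is a function of a sorted component manifest (the artefact names and version strings below are illustrative, not a prescribed schema):

```python
import hashlib
import json

def composite_version(components: dict[str, str]) -> str:
    """Deterministic composite version id from component versions.

    `components` maps artefact type to its version identifier.
    Sorting the keys makes the hash independent of insertion order,
    so the same combination always yields the same identifier.
    """
    manifest = json.dumps(components, sort_keys=True)
    return hashlib.sha256(manifest.encode()).hexdigest()[:16]

# Illustrative component versions for one deployed combination.
manifest = {
    "code": "a1b2c3d",            # Git commit
    "data": "dvc-9f8e7d6",        # data version id
    "model": "registry-v14",      # model registry version
    "config": "cfg-2024-06-01",   # configuration snapshot
    "prompts": "prompt-set-v7",   # prompt template set
    "knowledge_base": "kb-0412",  # knowledge base snapshot
}
cv = composite_version(manifest)
```

Because the identifier is derived from the components, changing any one component (a retrained model, an edited prompt) necessarily produces a new composite version.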
The composite version is the unit of compliance. The AISDP describes a specific composite version. The conformity assessment evaluates a specific composite version. The Declaration of Conformity attests to a specific composite version. When any component changes, the composite version changes, and the change management process must determine whether the new composite version remains within the compliance boundaries established by the last conformity assessment.
This has practical consequences for every stage of the compliance lifecycle. The conformity assessment report must reference the exact composite version evaluated. The post-market monitoring system must track composite version changes over time. When a competent authority requests evidence of compliance, the organisation must be able to reconstruct the composite version that was deployed at any specific date and demonstrate that it matched the version described in the AISDP. Conformity Assessment Processes covers how these assessments are structured and when they must be repeated.
Traditional software is deterministic: the same code produces the same outputs given the same inputs. AI systems are designed to change. Models are retrained on new data, knowledge bases are updated with new documents, prompt templates are refined based on operational feedback, and configuration thresholds are adjusted based on monitoring data. Each change is individually rational, yet their cumulative effect may move the system far from the version assessed during conformity assessment. Three patterns characterise this temporal challenge.
Gradual drift occurs when no single change crosses a threshold, but the aggregate of many small changes over months transforms the system's behaviour. The composite version after twenty changes may bear little resemblance to the composite version at conformity assessment. The cumulative baseline comparison is the primary defence against this pattern: it measures the current system against the version evaluated during conformity assessment, not merely against the immediately preceding version. Without it, gradual drift remains invisible until a competent authority compares the deployed system against the AISDP.
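A hedged sketch of what a cumulative baseline comparison might look like; the metric names, threshold values, and the choice of absolute deltas are all illustrative assumptions, not prescribed by the Act:

```python
def drift_from_baseline(baseline: dict, current: dict, thresholds: dict) -> dict:
    """Compare current metrics against the conformity-assessment baseline.

    Returns per-metric deltas and flags any breach of the declared
    thresholds -- catching drift even when each individual release
    only moved the metrics a little.
    """
    report = {}
    for metric, limit in thresholds.items():
        delta = current[metric] - baseline[metric]
        report[metric] = {"delta": round(delta, 4), "breach": abs(delta) > limit}
    return report

# Illustrative values: twenty small releases, each within the
# version-to-version limit, that together breach the baseline limits.
baseline = {"accuracy": 0.91, "demographic_parity_gap": 0.02}
current = {"accuracy": 0.87, "demographic_parity_gap": 0.05}
thresholds = {"accuracy": 0.02, "demographic_parity_gap": 0.02}
report = drift_from_baseline(baseline, current, thresholds)
```

The key design point is the choice of reference: comparing against `baseline` rather than the previous release is what makes accumulated drift visible.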
Silent updates occur when third-party components change without the organisation's knowledge or consent. A GPAI provider may apply a minor update to its model behind the same API endpoint. An embedding model provider may release a new version that the system's dependency manager installs automatically. A knowledge base data feed may include a document that changes the information available to the model. Each silent update changes the composite version, yet the organisation's version control system may not detect the change. The underlying principle is that if the organisation cannot control the change, it must at least detect it; sentinel monitoring serves as the compensating control for silent updates.
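One simple form of sentinel monitoring, sketched under the assumption of deterministic model outputs (temperature 0); `model_fn`, the probe strings, and the stand-in model functions below are all illustrative, and real deployments would also compare output-distribution and fairness metrics rather than exact strings:

```python
import hashlib

def sentinel_fingerprint(model_fn, probes: list[str]) -> str:
    """Fingerprint a third-party model by hashing its outputs on a
    fixed probe set. A changed fingerprint signals a silent update
    behind an unchanged API endpoint."""
    h = hashlib.sha256()
    for probe in probes:
        h.update(model_fn(probe).encode())
    return h.hexdigest()[:12]

# Stand-ins for two builds of a provider model behind one endpoint.
model_v1 = lambda p: f"v1-answer:{p}"
model_v2 = lambda p: f"v2-answer:{p}"
probes = ["probe A", "probe B"]
```

Running the probe set on a schedule and alerting on a fingerprint change gives the organisation detection even where it has no control.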
The determination of whether a change constitutes a substantial modification requires structured analysis against the Article 3(23) criteria. Seven borderline cases, where the answer is not immediately obvious, illustrate how organisations apply this framework in practice.
Routine retraining on updated data is not automatically a substantial modification, even when the data distribution has shifted. Consider a credit scoring system retrained quarterly on the most recent 24 months of application data, where the model architecture, hyperparameters, features, and intended purpose are unchanged. The new training data reflects a shift in the applicant population's income distribution due to economic conditions. The determination rests on whether the version-to-version performance delta, fairness delta, cumulative drift from the baseline, and output distribution metrics all stay within declared thresholds. If they do, the AI System Assessor documents the determination with the metrics, noting any approaching thresholds and recommending enhanced monitoring. The test is whether the resulting system still complies with the requirements, not whether the inputs have changed.
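The assessor's documented determination can be captured as a structured record. A minimal sketch, with illustrative field names and metric values rather than a prescribed schema:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ModificationDetermination:
    """Audit record for a substantial-modification determination.
    Field names are illustrative, not a prescribed schema."""
    change: str
    metrics: dict
    within_thresholds: bool
    determination: str = field(init=False, default="")
    assessed_on: date = field(default_factory=date.today)

    def __post_init__(self):
        # Derive the determination from the threshold outcome so the
        # record always matches the metrics it carries.
        self.determination = (
            "not a substantial modification"
            if self.within_thresholds
            else "substantial modification - re-assessment required"
        )

record = ModificationDetermination(
    change="Quarterly retraining on latest 24 months of applications",
    metrics={"accuracy_delta": -0.004, "fairness_delta": 0.006},
    within_thresholds=True,
)
```

Keeping the metrics and the conclusion in one record is what lets the determination be defended years later.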
Replacing a model component with a different architecture family is presumptively a substantial modification even when aggregate metrics remain within thresholds. When a recruitment screening system replaces a BERT-based NER model with a RoBERTa-based alternative, the qualitative nature of the change goes beyond parameter adjustment. The cascade impact matters: the new NER model extracts competencies with different confidence distributions, which changes the inputs to downstream classification and ranking stages. Even with improved overall accuracy and adequate fairness metrics, a focused re-assessment of the affected AISDP modules is required before deployment, because the change is qualitative, not merely quantitative.
Adding a new language to a multilingual RAG system typically constitutes a substantial modification. When a regulatory compliance advisory system with English, French, and German coverage adds Italian regulatory texts and enables Italian language query processing, the intended purpose remains unchanged, but the deployment context expands to include Italian-speaking users and Italian regulatory content. Lower grounding scores for Italian queries indicate higher hallucination risk, and lower retrieval precision compared to established languages suggests the embedding model has weaker coverage. These quality gaps mean that compliance with Article 15 accuracy requirements and Article 9 risk management obligations cannot be presumed from the existing assessment. A re-assessment covering the new language capability is required before the expansion is deployed.
When a GPAI provider updates a model behind the same API endpoint without changing the model identifier, the organisation did not initiate the change. The system's composite version has changed because the model component version has changed, but the change was not under the organisation's control. The organisation must assess whether the changed system still complies with the requirements established during its conformity assessment. Sentinel monitoring results, including output distribution shifts and fairness metric deltas, form the basis for this assessment. If metrics breach declared thresholds, the organisation must either roll back to the previous model version through version pinning, implement compensating controls, or conduct a re-assessment.
Prompt refinements represent the version control blind spot for LLM-based systems. Individually trivial changes can prove cumulatively transformative. When fourteen incremental prompt refinements over three months each adjust category definitions, add examples, or modify output format instructions, no single refinement changes aggregate accuracy by more than half a percentage point. Yet cumulatively, the classification distribution shifts materially and accuracy degrades by four percentage points against the benchmark dataset. The version-to-version comparison for each change showed sub-threshold shifts, but the version-to-baseline comparison reveals the significant behavioural drift. The cumulative effect constitutes a substantial modification under Article 3(23), even though no individual refinement did. Measuring current behaviour against the conformity assessment baseline, rather than only the immediately preceding version, is essential for detecting this pattern.
Version control is inherently a tool-based capability, and there is no manual alternative to Git for code versioning. Git itself is open-source and free. GitHub, GitLab, and Bitbucket all offer free tiers adequate for most compliance purposes. The minimum tooling requirement is Git with a hosted repository, either through the free tiers of GitHub or GitLab, or through a self-hosted GitLab Community Edition installation.
These tooling choices shape the access control policies and backup procedures that protect the compliance record. The version control infrastructure must enforce immutability, so that committed versions cannot be retroactively altered. It must enforce identity verification, so that every change is attributable to a named individual. It must provide trusted timestamps from a reliable source, so that the temporal sequence of changes is verifiable. The engineering team protects this infrastructure against tampering because an attacker who can modify the version history can undermine the entire compliance record.
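Hosted platforms provide these properties through protected branches, signed commits, and server-side audit logs; the tamper-evidence principle underneath is a hash chain, where each entry commits to the hash of its predecessor. A minimal sketch with illustrative fields (real deployments would use Git's own object model or a trusted timestamping service):

```python
import hashlib
import json
import time

def append_entry(log: list, author: str, change: str) -> dict:
    """Append a tamper-evident entry: each entry commits to the hash
    of its predecessor, so rewriting any earlier entry breaks the chain."""
    prev = log[-1]["hash"] if log else "genesis"
    body = {"author": author, "change": change,
            "timestamp": time.time(), "prev": prev}
    body["hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append(body)
    return body

def verify_chain(log: list) -> bool:
    """Recompute every hash and link; False means tampering."""
    prev = "genesis"
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if body["prev"] != prev:
            return False
        if hashlib.sha256(json.dumps(body, sort_keys=True).encode()
                          ).hexdigest() != entry["hash"]:
            return False
        prev = entry["hash"]
    return True
```

This is the same property that makes a Git history with forbidden force-pushes auditable: altering one recorded change invalidates everything after it.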
Organisations should evaluate their existing version control setup against these three properties: immutability, attribution, and trusted timestamps. Where gaps exist, the compliance-grade requirements described throughout this guide identify the specific controls needed to bridge from development-grade to compliance-grade version control.
No. Routine retraining is not automatically a substantial modification even when the data distribution has shifted. The test is whether the resulting system still complies with the requirements, assessed through performance, fairness, cumulative drift, and output distribution metrics against declared thresholds.
The organisation must detect and assess the change through sentinel monitoring. If metrics breach declared thresholds, the organisation must roll back through version pinning, implement compensating controls, or conduct a re-assessment.
Yes, provided the adjustment falls within the provider's documented operating parameters in the Instructions for Use. The deployer documents the configuration choice and assesses its impact, including fundamental rights implications.
Adding a new language expands the system to a population whose risk profile was not evaluated in the original conformity assessment. Lower grounding scores and retrieval quality for the new language mean compliance with accuracy and risk management requirements cannot be presumed.
Intended purpose drift occurs when a system is gradually used for purposes beyond its assessed scope. It is the most insidious form of substantial modification because it happens at the deployment level. Using a candidate screening tool for internal promotion decisions, for example, triggers provider status under Article 25(1)(c).
A composite version is a structured record capturing the specific version of every artefact type deployed at a given point in time, serving as the unit against which conformity assessments are evaluated.
A change is substantial when it affects the system's compliance with Articles 9 through 15 or changes the intended purpose, assessed through performance, fairness, and drift metrics against declared thresholds.
Three patterns: gradual drift from accumulated small changes, silent updates from third-party providers, and retroactive invalidity from post-deployment discoveries.
Git with a hosted repository through free tiers of GitHub, GitLab, or self-hosted GitLab Community Edition, enforcing immutability, identity verification, and trusted timestamps.
Prompts include system prompts, few-shot examples, prompt chains, tool definitions, and output format specifications for LLM and agentic systems. Prompt engineering is iterative, with changes made daily during development and periodically in production. Compliance sensitivity is very high because prompt changes can alter behaviour as profoundly as model retraining, yet they are often treated as informal configuration.
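Treating prompts as versioned artefacts rather than informal configuration can be as simple as a content-addressed registry with an ordered history. A minimal illustrative sketch, not a production design:

```python
import hashlib

class PromptRegistry:
    """Content-addressed store for prompt templates, so each
    refinement gets a durable version id instead of being an
    untracked config edit."""

    def __init__(self):
        self._versions: dict[str, str] = {}  # version id -> template
        self._history: list[str] = []        # ordered version ids

    def commit(self, template: str) -> str:
        # The id is derived from the content, so re-committing an
        # identical template yields the same id.
        vid = hashlib.sha256(template.encode()).hexdigest()[:10]
        self._versions[vid] = template
        if not self._history or self._history[-1] != vid:
            self._history.append(vid)
        return vid

    def get(self, vid: str) -> str:
        return self._versions[vid]

    def history(self) -> list[str]:
        return list(self._history)
```

An ordered history like this is the raw material for the version-to-baseline comparison that catches cumulative prompt drift.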
Knowledge base content, including documents, embeddings, metadata, and retrieval configuration for RAG systems, changes continuously as documents are added, updated, and removed. Knowledge base changes alter the information available to the model, which alters the system's outputs. Its compliance sensitivity is high.
Retroactive invalidity occurs when a discovery after deployment, such as biased or unlicensed training data, a known model vulnerability, or a changed regulatory interpretation, retroactively invalidates a composite version that was compliant at deployment time. The version control system must support retrospective analysis: given a specific composite version, what data was it trained on, what model components did it contain, and what configuration was active? The ten-year retention obligation means these questions may need to be answered years after the version was deployed.
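Supporting that retrospective analysis requires a deployment ledger mapping dates to composite versions and their manifests. A minimal sketch with illustrative identifiers and field names:

```python
from datetime import date

class DeploymentLedger:
    """Records which composite version went live on which date, so
    'what was running on date X?' can be answered years later."""

    def __init__(self):
        self._deployments = []  # (deployed_on, composite_id, manifest)

    def record(self, deployed_on: date, composite_id: str, manifest: dict):
        self._deployments.append((deployed_on, composite_id, manifest))
        # Keep chronological order regardless of insertion order.
        self._deployments.sort(key=lambda d: d[0])

    def version_on(self, day: date):
        """Return the (composite_id, manifest) live on `day`, or None
        if nothing had been deployed yet."""
        hit = None
        for deployed_on, cid, manifest in self._deployments:
            if deployed_on <= day:
                hit = (cid, manifest)
            else:
                break
        return hit
```

Given the composite identifier, the stored manifest answers which data, model, and configuration versions were in effect, which is exactly what a retroactive-invalidity investigation needs.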
Configuration changes within a provider's documented operating parameters are deployer-level activities, not substantial modifications. When a clinical decision support system's risk score threshold is adjusted from the default of 0.55 to 0.45 to increase sensitivity, and the provider's Instructions for Use document a permissible range of 0.40 to 0.70, the deployer has not modified the system. Article 25(1) provider-status triggers are not engaged. The deployer documents the configuration choice in its compliance record with the rationale for the threshold selection, and re-evaluates its fundamental rights impact assessment to determine whether the increased false-positive rate has implications for the individuals assessed.
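A hedged sketch of how such a configuration change might be classified and documented; the field names and classification strings are illustrative, and the legal determination always rests with the assessor, not the check:

```python
def check_threshold_change(new_value: float,
                           documented_range: tuple,
                           rationale: str) -> dict:
    """Classify a deployer configuration change against the provider's
    documented operating parameters from the Instructions for Use."""
    lo, hi = documented_range
    within = lo <= new_value <= hi
    return {
        "new_value": new_value,
        "within_documented_range": within,
        "classification": (
            "deployer-level configuration change" if within
            else "outside documented range - assess under Article 3(23)"),
        "rationale": rationale,
    }

# The clinical decision support example: 0.45 within the documented
# 0.40-0.70 range stays a deployer-level activity.
decision = check_threshold_change(
    0.45, (0.40, 0.70), "Increase sensitivity per clinical review")
```

Recording the rationale alongside the range check gives the compliance record both the decision and its justification.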
Intended purpose drift is the most insidious form of substantial modification because it occurs at the deployment level, not the engineering level. When a system assessed as a candidate screening tool under Annex III Area 4(a) is gradually adopted for evaluating internal employees for promotion eligibility, the deployer has moved beyond the provider's documented intended purpose. Internal promotion evaluation involves different data, different affected persons with existing employment protections, and a materially different risk profile. The deployer has triggered Article 25(1)(c) provider status and must either prepare a full AISDP for the promotion use case or cease using the system for that purpose. The original provider's conformity assessment does not cover the new use. Monitoring for intended purpose drift requires operational management oversight, not just technical monitoring.