Traceability underpins the entire AISDP. Article 12 requires automatic recording, Article 3(23) defines substantial modification, and Article 72 requires change tracking. This section covers version control governance across six artefact types: code, data, models, configuration, prompts, and knowledge bases.
Article 12 requires automatic recording of events during the system's operation. Article 3(23) defines substantial modification as a change that affects compliance or the intended purpose. Article 72 requires post-market monitoring that tracks changes. These provisions collectively demand rigorous version control across every artefact that constitutes the AI system.
Without version control, the organisation cannot demonstrate which version was deployed at any given time, what changed between versions, whether a change constitutes a substantial modification, or that the system assessed during conformity assessment is the same system deployed in production. For notified bodies and competent authorities, the version control record is evidence that the organisation exercises deliberate control over its system's evolution.
Compliance-grade version control differs from development-grade version control. Every version must be immutable once committed with no force-pushes or history rewriting. Every version must be attributable to a named individual with verified identity. Every version must carry a timestamp from a trusted source. Every version must be retrievable for the full ten-year retention period, including versions no longer in active use.
AI system version control must track six artefact types simultaneously, each with its own versioning semantics and compliance implications. The system's behaviour at any point is determined by the specific combination of all six.
Code (application logic, pipelines, API contracts) changes continuously and is versioned in Git. Data (training datasets, validation datasets, knowledge bases) changes periodically and carries high compliance sensitivity because data changes directly affect model behaviour and the Article 10 posture; DVC, LakeFS, or Delta Lake provide versioning. Model artefacts (weights, hyperparameters, training configuration) change on an event-driven basis with each training run and carry very high sensitivity as the primary trigger for substantial modification assessment; MLflow, Vertex AI, or SageMaker model registries provide versioning.
Configuration (thresholds, feature flags, environment variables, prompt templates) changes frequently and carries high sensitivity because a threshold change can shift the decision boundary affecting every individual assessed. Prompts (system prompts, few-shot examples, tool definitions) for LLM systems change frequently with very high sensitivity because prompt changes can alter behaviour as profoundly as model retraining. Knowledge bases (documents, embeddings, retrieval configuration) for RAG systems change continuously with high sensitivity as content changes alter the information available to the model.
The composite version ties these together. It captures the specific version of each artefact type deployed at a given point in time. The composite version is the unit of compliance: the AISDP describes a specific composite version, the conformity assessment evaluates it, and the Declaration of Conformity attests to it. When any component changes, the composite version changes and the change management process must determine whether the new version remains within compliance boundaries.
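A minimal sketch of how a composite version might be represented, assuming a record that pins all six artefact types and derives a deterministic identifier from them; the field names and version strings are illustrative, not a mandated schema:

```python
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass(frozen=True)
class CompositeVersion:
    """One compliance unit: the pinned version of each of the six artefact types."""
    code: str            # Git commit SHA
    data: str            # DVC / LakeFS / Delta Lake dataset version
    model: str           # model registry version
    config: str          # configuration snapshot identifier
    prompts: str         # prompt bundle version
    knowledge_base: str  # knowledge base snapshot version

    def version_id(self) -> str:
        """Deterministic ID: a change to any component yields a new composite version."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()[:12]

v1 = CompositeVersion("a1b2c3d", "data-v14", "model-7", "cfg-9f3", "prompts-v5", "kb-2025-01")
v2 = CompositeVersion("a1b2c3d", "data-v14", "model-8", "cfg-9f3", "prompts-v5", "kb-2025-01")
print(v1.version_id() != v2.version_id())  # True: a model change alone changes the composite version
```

Deriving the identifier from the components, rather than assigning it manually, means the composite version can never silently lag behind what is actually deployed.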
The three-repository pattern provides the architectural structure. A code repository in Git holds application code, pipeline definitions, configuration, tests, and infrastructure-as-code templates. A data repository using DVC, Delta Lake, or LakeFS holds dataset versions, feature store snapshots, and data quality baselines. A model repository using MLflow or equivalent holds trained model artefacts, evaluation results, and model cards. Each repository has its own versioning mechanism but they are linked through cross-references, enabling end-to-end traceability from any deployed model version to the exact code, data, and configuration that produced it.
Branch protection rules enforce governance technically. The production branch requires approved reviews before merge, all CI pipeline checks to pass including validation gates, signed commits for attribution integrity, and no force-push. CODEOWNERS files add compliance-specific review requirements: paths affecting fairness require a fairness reviewer, paths affecting security require security review, and paths affecting the AISDP require AI Governance Lead approval.
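A CODEOWNERS file implementing the review requirements above might look like the following sketch (GitHub syntax); the paths and team handles are illustrative assumptions, not prescribed names:

```
# Paths affecting fairness require a fairness reviewer
/pipelines/postprocessing/   @org/fairness-reviewers

# Security-sensitive paths require security review
/src/auth/                   @org/security-team

# Changes touching AISDP documentation require AI Governance Lead approval
/docs/aisdp/                 @org/ai-governance-lead
```

Because the platform enforces CODEOWNERS at merge time, the review matrix becomes a technical control rather than a policy that depends on reviewers remembering to be consulted.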
Signed commits using GPG or SSH cryptographically bind each commit to a verified identity, providing assurance that the change history is authentic. Secret management ensures API keys, credentials, and encryption keys never appear in the Git history, with secrets sourced from dedicated managers such as HashiCorp Vault in production.
High-risk AI systems built on microservice architectures present a distinctive challenge. Each service is independently deployable and versioned. A change to any single service can alter the system's overall behaviour in ways difficult to predict from the change in isolation.
Service dependency mapping is a compliance artefact. It shows how each service communicates with every other, the data contracts between them, the processing sequence for inference requests, and the failure modes that propagate across service boundaries. Without it, the organisation cannot assess whether a change to one service constitutes a substantial modification to the system. Both declared dependencies from the service catalogue and observed dependencies from distributed tracing should be maintained, with discrepancies investigated.
Change impact analysis traces a modification's effects through the dependency map before deployment. A modification to the data ingestion service that alters missing value handling changes feature vectors, which changes inference behaviour, which changes outputs to operators. The Technical SME examines each link in this chain.
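The chain described above can be traced mechanically from the dependency map. The sketch below assumes a simple declared map of service-to-consumer edges (service names are hypothetical) and walks it breadth-first to list every service that can observe a change's effects:

```python
from collections import deque

# Hypothetical declared dependency map: service -> downstream consumers of its output
DEPENDENCIES = {
    "data-ingestion": ["feature-engineering"],
    "feature-engineering": ["model-inference"],
    "model-inference": ["post-processing"],
    "post-processing": ["operator-ui"],
    "operator-ui": [],
}

def impacted_services(changed: str) -> list[str]:
    """Breadth-first traversal: every service reachable from the changed one."""
    seen, queue, order = {changed}, deque([changed]), []
    while queue:
        svc = queue.popleft()
        for downstream in DEPENDENCIES.get(svc, []):
            if downstream not in seen:
                seen.add(downstream)
                order.append(downstream)
                queue.append(downstream)
    return order

print(impacted_services("data-ingestion"))
# ['feature-engineering', 'model-inference', 'post-processing', 'operator-ui']
```

In practice the map would come from the service catalogue reconciled with distributed-tracing data, but the traversal gives the Technical SME the list of links to examine rather than leaving it to recollection.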
Contract testing validates that each service's outputs conform to consumer expectations. Consumer-driven contract testing using Pact has each consumer define expectations verified against the provider on every build. Statistical contract testing using Great Expectations extends this to data quality: expectations on distributions, means, and completeness that detect silent shifts the schema contract would miss. A data delivery that satisfies schema checks but has a shifted distribution is more dangerous than one that fails the schema check because it will be silently accepted. Contract tests run in CI for every service, with failures blocking deployment.
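To illustrate what a statistical contract checks beyond a schema, here is a hand-rolled sketch of the idea (not the Great Expectations API); the thresholds and baseline values are illustrative assumptions:

```python
import statistics

def check_statistical_contract(values, baseline_mean, baseline_stdev,
                               max_mean_shift=0.5, min_completeness=0.99):
    """Minimal statistical contract: a batch can pass schema validation yet
    still violate distributional expectations. Thresholds are illustrative."""
    non_null = [v for v in values if v is not None]
    completeness = len(non_null) / len(values)
    mean_shift = abs(statistics.mean(non_null) - baseline_mean) / baseline_stdev
    return {
        "completeness_ok": completeness >= min_completeness,
        "mean_ok": mean_shift <= max_mean_shift,
    }

# A schema-valid batch whose distribution has silently shifted upward
batch = [12.1, 12.4, 11.9, 12.6, 12.2, 12.5, 12.0, 12.3, 12.7, 12.1]
result = check_statistical_contract(batch, baseline_mean=10.0, baseline_stdev=1.0)
print(result)  # mean_ok is False: the delivery should block deployment
```

Every value in the batch is a well-formed float that would satisfy a schema contract; only the comparison against the distributional baseline reveals the shift.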
Data changes are often more consequential than code changes for AI systems, yet receive less rigorous version control. A change to the training dataset, a label correction, or a preprocessing modification can alter behaviour as profoundly as rewriting the algorithm. The version control system must capture additions, removals, corrections, transformations, and schema changes to datasets.
DVC tracks dataset versions alongside Git, storing large data files in remote storage while recording their content hashes in the repository. LakeFS provides Git-like branching for data lakes, enabling safe experimentation. Delta Lake provides ACID transactions and time-travel queries. Each approach links a specific dataset version to the code and model versions that consumed it.
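The content-hash mechanism underlying these tools can be sketched in a few lines; this is a simplified illustration of the pointer-file idea (DVC records an MD5-style content hash in a small file that Git versions, while the large data lives in remote storage), not DVC's actual implementation:

```python
import hashlib

def dataset_pointer(data: bytes) -> str:
    """Content hash of dataset bytes -- the small, Git-trackable pointer
    that stands in for a large data file held in remote storage."""
    return hashlib.md5(data).hexdigest()

v1 = dataset_pointer(b"id,label\n1,0\n2,1\n")
v2 = dataset_pointer(b"id,label\n1,0\n2,1\n3,1\n")  # one added row
print(v1 != v2)  # True: any data change yields a new pointer in the Git history
```

Because the pointer changes whenever a single byte of the data changes, the Git history records exactly which dataset version each code commit and model version consumed.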
Cascading data change analysis assesses how a data modification propagates through the system. New training data may shift feature distributions, which may change model behaviour, which may affect fairness metrics. The Technical SME traces each data change through its impacts on the feature engineering, model inference, and post-processing layers, documenting the assessment in the change log.
For the model registry, each entry records the model version, its architecture definition, training configuration, evaluation metrics, the data version and code commit that produced it, and the stage (experimental, staging, production, archived). Stage transitions require human approval. Only models in production stage can be loaded by the inference service. The registry provides the traceability link between any deployed model and its complete provenance chain. The model registry's contents map directly to AISDP Module 3 evidence.
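A sketch of such a registry entry with a guarded stage transition follows; the field names and stage graph are illustrative assumptions, not MLflow's schema:

```python
from dataclasses import dataclass

# Permitted stage transitions; anything else is rejected
ALLOWED_TRANSITIONS = {
    "experimental": {"staging", "archived"},
    "staging": {"production", "archived"},
    "production": {"archived"},
    "archived": set(),
}

@dataclass
class RegistryEntry:
    """Illustrative registry record linking a model to its provenance chain."""
    model_version: str
    data_version: str   # the dataset version that trained it
    code_commit: str    # the training code commit
    metrics: dict
    stage: str = "experimental"

    def transition(self, new_stage: str, approved_by: str) -> None:
        """Stage transitions require a named human approver."""
        if new_stage not in ALLOWED_TRANSITIONS[self.stage]:
            raise ValueError(f"{self.stage} -> {new_stage} is not permitted")
        if not approved_by:
            raise ValueError("stage transitions require a named human approver")
        self.stage = new_stage

entry = RegistryEntry("model-7", "data-v14", "a1b2c3d", {"auc": 0.91})
entry.transition("staging", approved_by="j.doe")
entry.transition("production", approved_by="j.doe")
print(entry.stage)  # production
```

Encoding the stage graph means a model cannot reach production without passing through staging under a named approval, and the inference service's "production only" rule has an unambiguous field to check.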
Third-party models, particularly GPAI models accessed through APIs, present a versioning challenge: the organisation depends on an external provider who may update the model without notice. The composite version changes, but the organisation's version control system may not detect it.
Sentinel testing provides the detection mechanism. A fixed set of inputs with known expected outputs is submitted to the third-party model on a scheduled basis. Changes in the outputs indicate that the model's behaviour has shifted. For embedding models in RAG systems, sentinel tests compare the embeddings produced for fixed documents against a stored baseline; shifts in the embedding space indicate a model update that may affect retrieval quality and fairness.
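An embedding sentinel check might look like the following sketch, assuming stored baseline vectors for fixed documents and an illustrative cosine-similarity threshold:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Stored baseline embeddings for fixed sentinel documents (values illustrative)
BASELINE = {"doc-001": [0.12, 0.88, 0.31], "doc-002": [0.55, 0.10, 0.74]}

def sentinel_check(current: dict, threshold: float = 0.995) -> list[str]:
    """Return sentinel documents whose embeddings drifted from baseline --
    evidence that the provider may have silently updated the model."""
    return [doc for doc, base in BASELINE.items()
            if cosine(base, current[doc]) < threshold]

# Identical embeddings: no drift flagged
print(sentinel_check({"doc-001": [0.12, 0.88, 0.31], "doc-002": [0.55, 0.10, 0.74]}))  # []
# A shifted embedding for doc-002 is flagged
print(sentinel_check({"doc-001": [0.12, 0.88, 0.31], "doc-002": [0.74, 0.10, 0.55]}))  # ['doc-002']
```

A flagged sentinel does not by itself decide the substantial modification question; it triggers the change management process for a composite version change the organisation did not initiate.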
Prompt governance applies version control discipline to LLM prompts. System prompts, few-shot examples, prompt chains, and tool definitions are stored as version-controlled text files in the code repository. Each prompt version is linked to the evaluation results that validated it. Prompt registries using emerging tools from LangSmith, Humanloop, or PromptLayer provide additional metadata tracking. Prompt changes can alter system behaviour as profoundly as model retraining, yet they are often treated as informal configuration; the version control system must treat them with the same rigour.
For knowledge bases in RAG systems, document additions, updates, and removals must be tracked. DVC or LakeFS provide document snapshots. Vector database versioning tracks embedding changes. A knowledge base change that alters the information available to the model alters outputs and may trigger the substantial modification question.
Article 3(23) defines a substantial modification as a change to the AI system after market placement that affects compliance with requirements or modifies the intended purpose. The substantial modification determination is the highest-stakes version control decision because it may trigger a new conformity assessment.
Quantitative thresholds provide the first assessment layer. The Technical SME evaluates each change against defined metrics: version-to-version performance delta, version-to-version fairness delta, output distribution shift, and critically, version-to-baseline cumulative drift comparing the current version against the version assessed at the last conformity assessment. Cumulative drift is essential because gradual changes may individually remain below thresholds but collectively transform the system.
The decision process has three tiers. Minor changes with all metrics within thresholds and no qualitative flags require documentation in the change log. Moderate changes with metrics approaching thresholds or qualitative concerns such as component replacement require focused re-assessment of affected AISDP modules. Major changes crossing quantitative thresholds, changing the intended purpose, or replacing core components require full conformity re-assessment.
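The three-tier triage above can be sketched as a simple classifier; the threshold values and flag names here are illustrative assumptions, and in practice must come from the organisation's documented change policy:

```python
def classify_change(perf_delta, fairness_delta, cumulative_drift, qualitative_flags,
                    perf_threshold=0.02, fairness_threshold=0.01, drift_threshold=0.05):
    """Three-tier substantial modification triage. Thresholds are illustrative;
    cumulative_drift is measured against the last conformity-assessed baseline."""
    crossed = (perf_delta > perf_threshold
               or fairness_delta > fairness_threshold
               or cumulative_drift > drift_threshold)
    approaching = (perf_delta > 0.8 * perf_threshold
                   or fairness_delta > 0.8 * fairness_threshold
                   or cumulative_drift > 0.8 * drift_threshold)
    if (crossed or "intended_purpose_change" in qualitative_flags
            or "core_component_replacement" in qualitative_flags):
        return "major: full conformity re-assessment"
    if approaching or qualitative_flags:
        return "moderate: focused re-assessment of affected AISDP modules"
    return "minor: document in change log"

print(classify_change(0.005, 0.002, 0.01, []))                          # minor
print(classify_change(0.005, 0.002, 0.01, ["component_replacement"]))   # moderate
print(classify_change(0.005, 0.002, 0.06, []))                          # major: cumulative drift crossed
```

Note that the third case is classified major purely on cumulative drift even though the version-to-version deltas are small: this is the gradual-drift pattern the baseline comparison exists to catch. The classifier output is a triage recommendation, not the determination itself, which remains a documented human decision.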
Seven borderline cases illustrate the framework. Retraining on updated data with the same architecture is typically not substantial if metrics remain within thresholds. Replacing a model component with a different architecture family is presumptively substantial even when aggregate metrics pass, because the change is qualitative. Changing a threshold within the provider's documented range is a deployer-level activity. Adding a new language to a multilingual RAG system is likely substantial due to new populations. GPAI provider updates behind the same API require sentinel detection. Accumulating prompt refinements may collectively constitute a substantial modification through gradual drift. Changing the intended purpose incrementally through scope expansion is the most dangerous pattern because it occurs gradually.
The version control system generates evidence that feeds into multiple AISDP modules. Module 10 (Record-Keeping) captures the complete change history across all six artefact types. Module 12 (Change History) captures each change with its composite version, the change description, the impact assessment, and the substantial modification determination. Module 3 (Architecture and Design) references the model registry and the service dependency map.
The temporal challenge manifests in three patterns that the version control system must address. Gradual drift occurs when no single change crosses a threshold but twenty small changes over six months transform behaviour; cumulative baseline comparison is the primary defence. Silent updates occur when third-party components change without notice; sentinel monitoring is the compensating control. Retroactive invalidity occurs when a post-deployment discovery invalidates a previously compliant version; the version control system must support retrospective analysis tracing any composite version to its complete provenance.
For organisations at earlier maturity levels, version control governance can be achieved procedurally through documented merge policies, a manual review matrix equivalent to CODEOWNERS, and a named repository administrator with exclusive merge permission. Branch protection and signed commits require platform features, but Git itself is open-source and free. The minimum viable tooling is Git with a hosted repository, DVC for data, and MLflow for models, all available at zero licence cost. The procedural approach depends on administrator discipline: a single merge performed without proper review creates a compliance gap that nothing detects automatically.
Is retraining on updated data a substantial modification? Not automatically. The test is whether the resulting system still complies with requirements, not whether the inputs changed. If performance and fairness metrics remain within thresholds and cumulative drift against the assessed baseline is below threshold, it is typically not substantial, though the determination must be documented.
What is the minimum viable tooling? Git for code (free), DVC for data (open-source), and MLflow for models (open-source), all available at zero licence cost. The three-repository pattern with cross-references provides compliance-grade traceability.
Contract testing detects silent breaking changes that integration testing misses. Consumer-driven contracts (Pact) verify provider outputs match consumer expectations. Statistical contracts (Great Expectations) detect distributional shifts that satisfy schema checks but alter model behaviour.
Sentinel testing submits fixed inputs on a schedule and compares outputs against baselines. For embedding models, sentinel tests compare embeddings for fixed documents. Changes indicate model updates that may affect compliance.