Data changes are often more consequential than code changes for AI systems, yet many organisations apply less rigorous version control to their datasets than to their source code. A modification to the training dataset, a correction to a label, an expansion of data collection scope, or a change to the preprocessing pipeline can alter a model's behaviour as profoundly as rewriting its algorithm. The EU AI Act requires that high-risk AI system providers maintain full traceability of all inputs that shape system behaviour, and data is the most influential input of all.
The version control system must capture six categories of data change:

- Additions to training, validation, or test datasets: new records, features, or data sources.
- Removals from datasets: excluded records, dropped features, or retired sources.
- Modifications to existing records: label corrections, value updates, and error fixes.
- Changes to data preprocessing logic: imputation methods, normalisation parameters, and outlier treatment.
- Changes to feature engineering logic: new derived features, modified transformations, and altered selection criteria.
- Changes to data quality rules: new validation checks, modified thresholds, and altered exception handling.
Each of these changes must be versioned, timestamped, attributed to a named individual, and accompanied by a rationale. The governance discipline of ensuring that every data change follows the same approval workflow as a code change represents the organisational challenge that Version Control and Change Management addresses at the pillar level.
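As a sketch, such a change record can be expressed as a small structure. The field and category names below are illustrative assumptions for this article, not terms mandated by the Act:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional

class ChangeCategory(Enum):
    """The six categories of data change described above."""
    DATASET_ADDITION = "dataset_addition"
    DATASET_REMOVAL = "dataset_removal"
    RECORD_MODIFICATION = "record_modification"
    PREPROCESSING_CHANGE = "preprocessing_change"
    FEATURE_ENGINEERING_CHANGE = "feature_engineering_change"
    QUALITY_RULE_CHANGE = "quality_rule_change"

@dataclass(frozen=True)
class DataChangeRecord:
    """One data change: versioned, timestamped, attributed, with rationale."""
    version: str               # dataset version this change produces
    category: ChangeCategory
    author: str                # a named individual, not a team alias
    rationale: str             # why the change was made
    approved_by: Optional[str] = None  # set once the approval workflow completes
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
```

Under this sketch, a change would become eligible for release only once approved_by is populated through the same approval workflow that governs code changes.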
A data change propagates through the system in a predictable sequence: the raw data changes, the computed features change, the model's learned parameters change if retrained, the model's inference behaviour changes, and the system's outputs change. The Technical SME evaluates each step in this cascade to determine whether the cumulative effect is material.
Consider a scenario where the data engineering team corrects a labelling error affecting two per cent of the training dataset. The correction is clearly beneficial from a data quality perspective. Yet the model retrained on the corrected data may produce different outputs for certain subgroups, potentially altering fairness metrics. Corrected labels may shift the model's decision boundary in ways that affect accuracy on specific edge cases that were previously classified correctly due to the error. Such a change may cause the model's output distribution to shift, which may in turn cause post-processing thresholds to admit or exclude a different proportion of cases.
The version control system must make these cascading effects visible. Automated impact analysis, where a data change triggers re-evaluation of the model's performance and fairness metrics before the change is approved, provides the technical control. The governance control is a data change approval workflow that requires Technical SME review, fairness impact assessment, and AI Governance Lead sign-off for changes exceeding defined materiality thresholds.
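A minimal sketch of that materiality check follows; the metric names and threshold values here are illustrative assumptions, since real thresholds would be set by governance:

```python
# Illustrative materiality thresholds; real values are a governance decision.
MATERIALITY_THRESHOLDS = {
    "accuracy": 0.01,                # absolute drop that triggers escalation
    "demographic_parity_gap": 0.02,  # absolute widening that triggers escalation
}

def assess_data_change(before: dict, after: dict) -> dict:
    """Compare model metrics before and after a candidate data change and
    decide whether the change exceeds any materiality threshold."""
    breaches = {}
    for metric, threshold in MATERIALITY_THRESHOLDS.items():
        delta = after[metric] - before[metric]
        # An accuracy regression is a negative delta; a fairness-gap
        # regression is a positive delta.
        regression = -delta if metric == "accuracy" else delta
        if regression > threshold:
            breaches[metric] = delta
    return {
        "material": bool(breaches),
        "breaches": breaches,
        # Material changes require Technical SME review, a fairness impact
        # assessment, and AI Governance Lead sign-off before approval.
        "requires_governance_signoff": bool(breaches),
    }
```

Wired into continuous integration, a check like this turns the approval workflow from a manual judgement into a gated, auditable step.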
DVC (Data Version Control) is the most widely adopted tool for machine learning data versioning. It works alongside Git: the dataset itself is stored in remote storage such as S3, GCS, or Azure Blob, while DVC creates a small metadata file (a .dvc file) in the Git repository that records the storage location, content hash, and version of the dataset. When a team member checks out a Git commit, DVC retrieves the exact dataset corresponding to that commit. This provides a clean link between the code version at the Git commit level and the data version at the DVC-tracked dataset level. The model registry entry then references both, completing the traceability chain required under The Model Registry.
The practical advantage of DVC is that it fits into existing Git workflows. Data scientists and engineers already know Git, and DVC adds data versioning without requiring a fundamentally different way of working. The limitation is that DVC tracks whole files: if a ten-gigabyte CSV changes by one row, DVC stores a new ten-gigabyte version. For datasets that change frequently with small increments, this can consume substantial storage.
A DVC pipeline definition makes cascading dependencies explicit by declaring inputs and outputs for each stage. A typical four-stage pipeline covers data preparation, feature engineering, training, and evaluation. The prepare_data stage takes raw data and preprocessing parameters as inputs and produces processed train, validation, and test datasets.
The feature_engineering stage depends on the processed training data, so any change to raw data cascades automatically into feature computation.
The train stage depends on the engineered features, cascading changes one step further. The evaluate stage depends on the trained model and the processed test data, meaning it cascades from both the training and data preparation stages. Any change to data/raw/ therefore cascades through all four stages automatically. DVC detects which stages need re-execution via content hashing, and the dvc repro command reproduces only the changed stages. The dvc metrics diff command shows metric changes versus the previous version, providing immediate visibility into the impact of any data modification.
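A dvc.yaml along these lines would express the four stages; the script names, file paths, and parameter names are illustrative assumptions, not a prescribed layout:

```yaml
stages:
  prepare_data:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw/
    params:
      - prepare.imputation_method
    outs:
      - data/processed/train.csv
      - data/processed/val.csv
      - data/processed/test.csv
  feature_engineering:
    cmd: python src/features.py
    deps:
      - src/features.py
      - data/processed/train.csv
    outs:
      - data/features/train_features.csv
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/features/train_features.csv
    outs:
      - models/model.pkl
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/processed/test.csv
    metrics:
      - metrics.json:
          cache: false
```

Because each stage declares its deps and outs, dvc repro re-executes only the stages whose dependency hashes have changed, and dvc metrics diff then compares metrics.json against the previous commit.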
This declarative pipeline structure ensures that no intermediate artefact can become stale without detection. The dependency graph is explicit, auditable, and reproducible, which satisfies the traceability requirements for providers.
Delta Lake addresses the incremental change problem that DVC handles less efficiently. Built on top of Apache Spark, Delta Lake provides ACID transactions on data lakes. Each transaction, whether adding rows, deleting rows, or modifying rows, is recorded as a separate versioned operation in a transaction log. Time-travel functionality allows querying the dataset as it existed at any historical version or timestamp. For organisations already running Spark-based data pipelines, Delta Lake integrates naturally and provides efficient storage of incremental changes. The limitation is its dependency on the Spark ecosystem; organisations not using Spark face a steeper adoption curve.
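The transaction-log idea behind time travel can be illustrated with a toy, pure-Python versioned table. This is a sketch of the concept only, not Delta Lake's actual implementation; in Delta Lake itself a historical read is expressed as, for example, spark.read.format("delta").option("versionAsOf", 2).load(path):

```python
class ToyDeltaTable:
    """Toy append/delete log with time travel. Each commit stores only the
    delta, and any historical version is rebuilt by replaying the log."""

    def __init__(self):
        self._log = []  # list of (operation, rows) commits

    def add_rows(self, rows):
        self._log.append(("add", list(rows)))

    def delete_rows(self, rows):
        self._log.append(("delete", list(rows)))

    def version(self):
        """Current version number: one per committed transaction."""
        return len(self._log)

    def as_of(self, version):
        """Rebuild the table as it existed after `version` commits."""
        table = []
        for op, rows in self._log[:version]:
            if op == "add":
                table.extend(rows)
            else:
                table = [r for r in table if r not in rows]
        return table

    def latest(self):
        return self.as_of(self.version())
```

The storage saving is visible in the structure: a one-row correction adds one small commit to the log rather than a full copy of the dataset.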
LakeFS provides Git-like semantics, including branches, commits, and merges, directly on object storage. It sits in front of S3-compatible storage and intercepts all operations, creating versioned snapshots. Data engineers can create a branch, experiment with transformations, and merge the results only after validation. This is particularly powerful for data quality workflows: failed transformations can be discarded without affecting the main dataset. LakeFS works with any tool that reads from S3, giving it broad compatibility across the data engineering ecosystem. Selecting the right tool depends on existing infrastructure; Tool Selection and Integration provides a framework for evaluating these options.
For all versioning tools, the retention requirement is ten years from the date the system is placed on the market. Older dataset versions must remain retrievable for the entire period. This has significant infrastructure implications: the versioning backend's storage must be durable, with replication and backup in place. Access credentials must survive personnel changes, and the AI Governance Lead must budget storage costs for a decade.
Many organisations underestimate this requirement. A dataset versioning system that runs on a team's cloud account and is forgotten when the team reorganises fails the retention test. Versioned datasets should be stored in the organisation's long-term compliance storage, such as S3 Glacier, Azure Archive, or equivalent services, with lifecycle policies that prevent accidental deletion. The retention and storage strategy should be documented as part of the broader compliance infrastructure described in Audit Trail and Compliance Logging.
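As one illustrative sketch for S3, a lifecycle rule can transition versioned snapshots to archival storage after ninety days; the bucket prefix and rule name are assumptions. Note that deletion prevention itself comes from bucket versioning together with S3 Object Lock in compliance mode, not from the lifecycle rule:

```json
{
  "Rules": [
    {
      "ID": "versioned-dataset-retention",
      "Status": "Enabled",
      "Filter": { "Prefix": "datasets/" },
      "Transitions": [
        { "Days": 90, "StorageClass": "DEEP_ARCHIVE" }
      ]
    }
  ]
}
```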
Without tools such as DVC, Delta Lake, or LakeFS, data versioning reverts to manual snapshot management. Each dataset version becomes a complete copy stored with a naming convention, for example training_data_v2.4_2026-02-15/, and accompanied by a manifest file. That manifest records the record count, column schema, content hash computed with SHA-256, and the identity of the person who created the snapshot.
The snapshot naming convention should include both a version number and a date. Each snapshot needs an accompanying manifest file in YAML or JSON format, recording the version identifier, creation date, creator identity, record count, column schema, hash of each data file, source description, and any transformations applied since the previous version. Storage must have access controls and no-delete policies in place. The model registry entry should cross-reference the dataset version identifier so that traceability is maintained.
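A manifest along these lines can be generated with the standard library alone; the function names, field names, and the assumption that snapshots are directories of CSV files are illustrative:

```python
import csv
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Stream a file through SHA-256 so large files never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(snapshot_dir: Path, version: str, creator: str) -> dict:
    """Record version, creator, and per-file hashes and record counts
    for a snapshot directory of CSV files."""
    files = {}
    for path in sorted(snapshot_dir.glob("*.csv")):
        with open(path, newline="") as f:
            records = sum(1 for _ in csv.reader(f)) - 1  # minus header row
        files[path.name] = {"sha256": sha256_file(path), "records": records}
    return {
        "version": version,
        "created": datetime.now(timezone.utc).isoformat(),
        "creator": creator,
        "files": files,
    }
```

The resulting dictionary can be serialised to JSON or YAML and stored alongside the snapshot, with the same hashes cross-referenced from the model registry entry.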
This approach sacrifices incremental storage efficiency, since every version is a full copy, consuming significant storage for large datasets. Automated hash verification on retrieval is also lost, as is integration with Git for code-data cross-referencing. For datasets above approximately ten gigabytes, the storage cost and manual management burden become substantial, and adopting dedicated tooling becomes strongly advisable.