Data changes are often more consequential than code changes for AI systems, yet many organisations apply less rigorous version control to their datasets than to their source code. A modification to the training dataset, a correction to a label, an expansion of data collection scope, or a change to the preprocessing pipeline can alter a model's behaviour as profoundly as rewriting its algorithm. The EU AI Act requires that high-risk AI system providers maintain full traceability of all inputs that shape system behaviour, and data is the most influential input of all.
The version control system must capture six categories of data change:

- Additions to training, validation, or test datasets: new records, features, or data sources.
- Removals from datasets: excluded records, dropped features, or retired sources.
- Modifications to existing records: label corrections, value updates, and error fixes.
- Changes to data preprocessing logic: imputation methods, normalisation parameters, and outlier treatment.
- Changes to feature engineering logic: new derived features, modified transformations, and altered selection criteria.
- Changes to data quality rules: new validation checks, modified thresholds, and altered exception handling.
Each of these changes must be versioned, timestamped, attributed to a named individual, and accompanied by a rationale. The governance discipline of ensuring that every data change follows the same approval workflow as a code change represents the organisational challenge that Version Control and Change Management addresses at the pillar level.
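As a sketch, such a change record can be expressed as a small structure. The field and category names below are illustrative assumptions for this article, not terms mandated by the Act:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional

class ChangeCategory(Enum):
    """The six categories of data change described above."""
    DATASET_ADDITION = "dataset_addition"
    DATASET_REMOVAL = "dataset_removal"
    RECORD_MODIFICATION = "record_modification"
    PREPROCESSING_CHANGE = "preprocessing_change"
    FEATURE_ENGINEERING_CHANGE = "feature_engineering_change"
    QUALITY_RULE_CHANGE = "quality_rule_change"

@dataclass(frozen=True)
class DataChangeRecord:
    """One data change: versioned, timestamped, attributed, with rationale."""
    version: str               # dataset version this change produces
    category: ChangeCategory
    author: str                # a named individual, not a team alias
    rationale: str             # why the change was made
    approved_by: Optional[str] = None  # set once the approval workflow completes
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
```

Under this sketch, a change would become eligible for release only once approved_by is populated through the same approval workflow that governs code changes.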
A data change propagates through the system in a predictable sequence: the raw data changes, the computed features change, the model's learned parameters change if retrained, the model's inference behaviour changes, and the system's outputs change. The Technical SME evaluates each step in this cascade to determine whether the cumulative effect is material.
Consider a scenario where the data engineering team corrects a labelling error affecting two per cent of the training dataset. The correction is clearly beneficial from a data quality perspective. Yet the model retrained on the corrected data may produce different outputs for certain subgroups, potentially altering fairness metrics. Corrected labels may shift the model's decision boundary in ways that affect accuracy on specific edge cases that were previously classified correctly due to the error. Such a change may cause the model's output distribution to shift, which may in turn cause post-processing thresholds to admit or exclude a different proportion of cases.
The version control system must make these cascading effects visible. Automated impact analysis, where a data change triggers re-evaluation of the model's performance and fairness metrics before the change is approved, provides the technical control. The governance control is a data change approval workflow that requires Technical SME review, fairness impact assessment, and AI Governance Lead sign-off for changes exceeding defined materiality thresholds.
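A minimal sketch of that materiality check follows; the metric names and threshold values here are illustrative assumptions, since real thresholds would be set by governance:

```python
# Illustrative materiality thresholds; real values are a governance decision.
MATERIALITY_THRESHOLDS = {
    "accuracy": 0.01,                # absolute drop that triggers escalation
    "demographic_parity_gap": 0.02,  # absolute widening that triggers escalation
}

def assess_data_change(before: dict, after: dict) -> dict:
    """Compare model metrics before and after a candidate data change and
    decide whether the change exceeds any materiality threshold."""
    breaches = {}
    for metric, threshold in MATERIALITY_THRESHOLDS.items():
        delta = after[metric] - before[metric]
        # An accuracy regression is a negative delta; a fairness-gap
        # regression is a positive delta.
        regression = -delta if metric == "accuracy" else delta
        if regression > threshold:
            breaches[metric] = delta
    return {
        "material": bool(breaches),
        "breaches": breaches,
        # Material changes require Technical SME review, a fairness impact
        # assessment, and AI Governance Lead sign-off before approval.
        "requires_governance_signoff": bool(breaches),
    }
```

Wired into continuous integration, a check like this turns the approval workflow from a manual judgement into a gated, auditable step.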
DVC (Data Version Control) is the most widely adopted tool for machine learning data versioning. It works alongside Git: the dataset itself is stored in remote storage such as S3, GCS, or Azure Blob, while DVC creates a small metadata file (a .dvc file) in the Git repository that records the storage location, content hash, and version of the dataset. When a team member checks out a Git commit, DVC retrieves the exact dataset corresponding to that commit. This provides a clean link between the code version at the Git commit level and the data version at the DVC-tracked dataset level. The model registry entry then references both, completing the traceability chain required under The Model Registry.
The practical advantage of DVC is that it fits into existing Git workflows. Data scientists and engineers already know Git, and DVC adds data versioning without requiring a fundamentally different way of working. The limitation is that DVC tracks whole files: if a ten-gigabyte CSV changes by one row, DVC stores a new ten-gigabyte version. For datasets that change frequently with small increments, this can consume substantial storage.
A DVC pipeline definition makes cascading dependencies explicit by declaring inputs and outputs for each stage. A typical four-stage pipeline covers data preparation, feature engineering, training, and evaluation. The prepare_data stage takes raw data and preprocessing parameters as inputs and produces processed train, validation, and test datasets.
The feature_engineering stage depends on the processed training data, so any change to raw data cascades automatically into feature computation.
The train stage depends on the engineered features, cascading changes one step further. The evaluate stage depends on the trained model and the processed test data, meaning it cascades from both the training and data preparation stages. Any change to data/raw/ therefore cascades through all four stages automatically. DVC detects which stages need re-execution via content hashing, and the dvc repro command reproduces only the changed stages. The dvc metrics diff command shows metric changes versus the previous version, providing immediate visibility into the impact of any data modification.
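A dvc.yaml along these lines would express the four stages; the script names, file paths, and parameter names are illustrative assumptions, not a prescribed layout:

```yaml
stages:
  prepare_data:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw/
    params:
      - prepare.imputation_method
    outs:
      - data/processed/train.csv
      - data/processed/val.csv
      - data/processed/test.csv
  feature_engineering:
    cmd: python src/features.py
    deps:
      - src/features.py
      - data/processed/train.csv
    outs:
      - data/features/train_features.csv
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/features/train_features.csv
    outs:
      - models/model.pkl
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/processed/test.csv
    metrics:
      - metrics.json:
          cache: false
```

Because each stage declares its deps and outs, dvc repro re-executes only the stages whose dependency hashes have changed, and dvc metrics diff then compares metrics.json against the previous commit.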
This declarative pipeline structure ensures that no intermediate artefact can become stale without detection. The dependency graph is explicit, auditable, and reproducible, which satisfies the traceability requirements for providers.
Delta Lake addresses the incremental change problem that DVC handles less efficiently. Built on top of Apache Spark, Delta Lake provides ACID transactions on data lakes. Each transaction, whether adding rows, deleting rows, or modifying rows, is recorded as a separate versioned operation in a transaction log. Time-travel functionality allows querying the dataset as it existed at any historical version or timestamp. For organisations already running Spark-based data pipelines, Delta Lake integrates naturally and provides efficient storage of incremental changes. The limitation is its dependency on the Spark ecosystem; organisations not using Spark face a steeper adoption curve.
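The transaction-log idea behind time travel can be illustrated with a toy, pure-Python versioned table. This is a sketch of the concept only, not Delta Lake's actual implementation; in Delta Lake itself a historical read is expressed as, for example, spark.read.format("delta").option("versionAsOf", 2).load(path):

```python
class ToyDeltaTable:
    """Toy append/delete log with time travel. Each commit stores only the
    delta, and any historical version is rebuilt by replaying the log."""

    def __init__(self):
        self._log = []  # list of (operation, rows) commits

    def add_rows(self, rows):
        self._log.append(("add", list(rows)))

    def delete_rows(self, rows):
        self._log.append(("delete", list(rows)))

    def version(self):
        """Current version number: one per committed transaction."""
        return len(self._log)

    def as_of(self, version):
        """Rebuild the table as it existed after `version` commits."""
        table = []
        for op, rows in self._log[:version]:
            if op == "add":
                table.extend(rows)
            else:
                table = [r for r in table if r not in rows]
        return table

    def latest(self):
        return self.as_of(self.version())
```

The storage saving is visible in the structure: a one-row correction adds one small commit to the log rather than a full copy of the dataset.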
LakeFS provides Git-like semantics, including branches, commits, and merges, directly on object storage. It sits in front of S3-compatible storage and intercepts all operations, creating versioned snapshots. Data engineers can create a branch, experiment with transformations, and merge the results only after validation. This is particularly powerful for data quality workflows: failed transformations can be discarded without affecting the main dataset. LakeFS works with any tool that reads from S3, giving it broad compatibility across the data engineering ecosystem. Selecting the right tool depends on existing infrastructure; Tool Selection and Integration provides a framework for evaluating these options.
For all versioning tools, the retention requirement is ten years from the date the system is placed on the market. Older dataset versions must remain retrievable for the entire period. This has significant infrastructure implications: the versioning backend's storage must be durable, with replication and backup in place. Access credentials must survive personnel changes, and the AI Governance Lead must budget storage costs for a decade.
Many organisations underestimate this requirement. A dataset versioning system that runs on a team's cloud account and is forgotten when the team reorganises fails the retention test. Versioned datasets should be stored in the organisation's long-term compliance storage, such as S3 Glacier, Azure Archive, or equivalent services, with lifecycle policies that prevent accidental deletion. The retention and storage strategy should be documented as part of the broader compliance infrastructure described in Audit Trail and Compliance Logging.
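As one illustrative sketch for S3, a lifecycle rule can transition versioned snapshots to archival storage after ninety days; the bucket prefix and rule name are assumptions. Note that deletion prevention itself comes from bucket versioning together with S3 Object Lock in compliance mode, not from the lifecycle rule:

```json
{
  "Rules": [
    {
      "ID": "versioned-dataset-retention",
      "Status": "Enabled",
      "Filter": { "Prefix": "datasets/" },
      "Transitions": [
        { "Days": 90, "StorageClass": "DEEP_ARCHIVE" }
      ]
    }
  ]
}
```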
Without tools such as DVC, Delta Lake, or LakeFS, data versioning reverts to manual snapshot management. Each dataset version becomes a complete copy stored with a naming convention, for example training_data_v2.4_2026-02-15/, and accompanied by a manifest file. That manifest records the record count, column schema, content hash computed with SHA-256, and the identity of the person who created the snapshot.
The snapshot naming convention should include both a version number and a date. Each snapshot needs an accompanying manifest file in YAML or JSON format, recording the version identifier, creation date, creator identity, record count, column schema, hash of each data file, source description, and any transformations applied since the previous version. Storage must have access controls and no-delete policies in place. The model registry entry should cross-reference the dataset version identifier so that traceability is maintained.
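A manifest along these lines can be generated with the standard library alone; the function names, field names, and the assumption that snapshots are directories of CSV files are illustrative:

```python
import csv
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Stream a file through SHA-256 so large files never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(snapshot_dir: Path, version: str, creator: str) -> dict:
    """Record version, creator, and per-file hashes and record counts
    for a snapshot directory of CSV files."""
    files = {}
    for path in sorted(snapshot_dir.glob("*.csv")):
        with open(path, newline="") as f:
            records = sum(1 for _ in csv.reader(f)) - 1  # minus header row
        files[path.name] = {"sha256": sha256_file(path), "records": records}
    return {
        "version": version,
        "created": datetime.now(timezone.utc).isoformat(),
        "creator": creator,
        "files": files,
    }
```

The resulting dictionary can be serialised to JSON or YAML and stored alongside the snapshot, with the same hashes cross-referenced from the model registry entry.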
This approach sacrifices incremental storage efficiency, since every version is a full copy, consuming significant storage for large datasets. Automated hash verification on retrieval is also lost, as is integration with Git for code-data cross-referencing. For datasets above approximately ten gigabytes, the storage cost and manual management burden become substantial, and adopting dedicated tooling becomes strongly advisable.