Article 10 of the EU AI Act requires organisations to demonstrate full traceability of training data. Data lineage, the ability to trace every data element from source collection through transformation to model use, is the mechanism that makes this possible.
Data lineage is the ability to trace every data element across its entire journey: from source collection, through each transformation step, to final use in the model's training, validation, or inference. This traceability is a foundational requirement for AISDP Module 4 under Article 10. Without data lineage, the organisation cannot demonstrate compliance with Article 10's data governance requirements because it cannot prove what data the model was trained on, how that data was prepared, or whether the preparation steps introduced bias.
Lineage operates at three levels of granularity, and most organisations need all three to provide complete coverage. Source-level lineage identifies where raw data originated. Transformation-level lineage records what happened to that data at each processing step. Consumption-level lineage tracks how the processed data was split and used for training, validation, and testing. The AISDP must describe the lineage mechanisms in place, the granularity of lineage tracking, and any gaps in the lineage chain. The AI System Assessor evaluates gaps for risk and documents them as non-conformities if they affect the organisation's ability to satisfy Article 10.
The data engineering team documents every data engineering step before execution. Each record includes an explicit statement of the input datasets referenced by version identifier, the intended transformation, and the rationale for that transformation: the data quality, completeness, or fairness problem it addresses. The record also captures the expected output characteristics, including schema, record count, and distribution properties, along with the validation criteria that will be applied to the output.
This pre-step documentation serves two purposes. It creates an audit trail demonstrating that data engineering was deliberate and considered, not ad hoc. It also provides a reference point for comparing the actual output against expectations, enabling detection of unexpected transformation effects. Data Quality Frameworks covers the quality metrics and expectation suites that feed into these pre-step records.
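A pre-step record of this kind might be structured as follows. This is a minimal sketch: the field names, dataset identifiers, and values are illustrative, not a mandated schema.

```python
import json
from datetime import date

# Illustrative pre-step record; all names and values here are hypothetical.
pre_step_record = {
    "step_id": "clean_missing_income",
    "recorded_on": date(2025, 1, 15).isoformat(),
    "inputs": [{"dataset": "payroll_extract", "version": "v2025.01.10"}],
    "intended_transformation": "Impute missing annual_income with occupation-group median",
    "rationale": "Addresses 4.2% missing income values that would otherwise bias "
                 "the model against groups with lower payroll-system coverage",
    "expected_output": {
        "schema": ["customer_id", "annual_income", "occupation_group"],
        "record_count": 182_340,  # unchanged: imputation adds no rows
        "distribution": {"annual_income_median_range": [31_000, 33_000]},
    },
    "validation_criteria": ["no nulls in annual_income", "record count unchanged"],
}

# Serialised as JSON, the record can be attached to the pipeline step as metadata.
print(json.dumps(pre_step_record, indent=2))
```

The explicit `rationale` field is what makes the record useful at conformity assessment time: it shows the transformation was chosen to address a named problem rather than applied ad hoc.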
After each data engineering step, the Technical SME records the actual output dataset referenced by version identifier along with the actual output characteristics: schema, record count, and distribution properties. A comparison against the pre-step expectations is required, noting any deviations and their explanation.
The record must also capture the impact on data quality metrics, including error rates, missing value rates, and distributional properties. Fairness-relevant distributions are documented separately, covering changes to protected characteristic subgroup representation or feature distributions. The identity of the person who executed the step and the date of execution are recorded to maintain individual accountability. Together with the pre-step records, these post-step captures create the structured audit trail that the AI System Assessor reviews during conformity assessment.
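The comparison of actual output against pre-step expectations can be mechanical. The following sketch (field names are illustrative) shows one way to surface deviations that the post-step record must then explain:

```python
# Hypothetical comparison of actual output characteristics against the
# pre-step expectations; field names are illustrative.
def compare_step_records(expected: dict, actual: dict) -> list[str]:
    """Return a list of deviations between expected and actual output."""
    deviations = []
    if expected["schema"] != actual["schema"]:
        deviations.append(f"schema changed: {expected['schema']} -> {actual['schema']}")
    if expected["record_count"] != actual["record_count"]:
        delta = actual["record_count"] - expected["record_count"]
        deviations.append(f"record count deviates by {delta:+d}")
    return deviations

expected = {"schema": ["customer_id", "annual_income"], "record_count": 182_340}
actual   = {"schema": ["customer_id", "annual_income"], "record_count": 182_212}

for d in compare_step_records(expected, actual):
    print("DEVIATION:", d)  # each deviation needs an explanation in the post-step record
```

Any non-empty result here is exactly the "deviations and their explanation" material the post-step record must contain.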
Pipeline-level lineage captures the macro view of data processing: which pipeline steps ran, in what order, with what inputs and outputs. This is the coarsest level of lineage, equivalent to knowing that the data went through ingestion, cleaning, feature engineering, and training split. DAG-based orchestration tools such as Apache Airflow, Prefect, or Dagster provide this automatically because the pipeline definition is itself a directed acyclic graph of steps with declared dependencies.
The practical requirement is to ensure that every pipeline execution is logged with a unique execution ID, a timestamp, the input dataset versions, the output dataset versions, and the execution status. This metadata is retained as a compliance artefact within the evidence pack. Pipeline-level lineage provides a high-level view of the data transformation workflow but may not capture the detailed transformation logic within each step, which is why finer-grained mechanisms are also needed. Data Governance sets out the broader governance framework within which pipeline lineage operates.
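The logging requirement can be illustrated without tying it to a particular orchestrator. The sketch below shows the minimum metadata each execution record should carry; in practice, tools like Airflow, Prefect, or Dagster emit equivalent metadata automatically.

```python
import uuid
from datetime import datetime, timezone

# Minimal sketch of pipeline-execution logging; step and dataset names are
# illustrative. An orchestrator would normally produce this metadata itself.
execution_log: list[dict] = []

def run_step(name, step_fn, input_versions, output_versions):
    record = {
        "execution_id": str(uuid.uuid4()),
        "step": name,
        "started_at": datetime.now(timezone.utc).isoformat(),
        "inputs": input_versions,
        "outputs": output_versions,
    }
    try:
        step_fn()
        record["status"] = "success"
    except Exception as exc:
        record["status"] = f"failed: {exc}"
    execution_log.append(record)  # retained as a compliance artefact
    return record

run_step("ingest", lambda: None, [], ["raw@v12"])
run_step("clean",  lambda: None, ["raw@v12"], ["clean@v12"])
```

The resulting log answers the coarse audit question of which steps ran, in what order, against which dataset versions, and with what outcome.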
Transformation-level lineage captures the logic within each pipeline step: what the cleaning step actually did, what the feature engineering step computed, what the imputation strategy was. This level requires that each transform is defined as version-controlled code rather than ad hoc SQL queries or Jupyter notebook cells; defining every transformation as code under version control is what makes the pipeline reproducible end to end.
dbt is the strongest tool for SQL-based transforms: each model is a SQL file in a Git repository, with tests, documentation, and automatic lineage graph output. Tools such as Great Expectations provide finer-grained lineage at the level of individual data validations. For Python-based transforms, the code itself provides lineage when version-controlled, but the team must also capture each transform's parameters, including thresholds, imputation values, and normalisation statistics, as structured metadata.
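For a Python-based transform, capturing parameters as structured metadata might look like the following sketch. The transform, its threshold, and the fitted statistics are all hypothetical; the point is that the fitted values travel with the output dataset rather than living only inside the code.

```python
# Sketch of a version-controlled Python transform that records its parameters
# (threshold, fitted normalisation statistics) as structured metadata.
# Function and field names are illustrative.
def normalise_income(values: list[float], clip_threshold: float = 250_000.0):
    clipped = [min(v, clip_threshold) for v in values]
    mean = sum(clipped) / len(clipped)
    std = (sum((v - mean) ** 2 for v in clipped) / len(clipped)) ** 0.5
    normalised = [(v - mean) / std for v in clipped]
    metadata = {  # stored alongside the output dataset version
        "transform": "normalise_income",
        "clip_threshold": clip_threshold,
        "fitted_mean": mean,
        "fitted_std": std,
    }
    return normalised, metadata

_, meta = normalise_income([20_000.0, 30_000.0, 40_000.0, 400_000.0])
print(meta)
```

Without the metadata, reproducing the transform later would require re-deriving the fitted mean and standard deviation from data that may no longer be available in the same form.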
Column-level lineage is the most granular tracking level and the most valuable for bias analysis. It tracks how each column, or feature, in the model's input dataset relates to columns in the source datasets. If the model uses a "risk_score" feature, column-level lineage reveals that "risk_score" was derived from "annual_income" sourced from the payroll system and "postcode" sourced from the address database.
This matters because "postcode" may be a proxy variable for ethnicity, and without column-level lineage the proxy relationship is invisible. Column-level lineage is therefore essential for proxy variable analysis: if a feature in the model's input is derived from a transformation that incorporates a protected characteristic, this level of tracking reveals the dependency. Bias Detection and Mitigation covers how proxy analysis uses these lineage outputs to identify and address hidden sources of discrimination in practice.
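The proxy analysis described above can be sketched as a query over a column-level lineage map. Column, system, and feature names below are hypothetical, mirroring the "risk_score" example:

```python
# Illustrative column-level lineage map: each model feature -> its source
# columns, as (system, column) pairs. All names are hypothetical.
column_lineage = {
    "risk_score": {("payroll_system", "annual_income"), ("address_db", "postcode")},
    "tenure_years": {("hr_system", "start_date")},
}

# Source columns flagged as potential proxies for protected characteristics.
proxy_candidates = {("address_db", "postcode")}  # postcode can proxy for ethnicity

def features_with_proxy_dependencies(lineage, proxies):
    """Return features whose source columns include a flagged proxy column."""
    return {feature for feature, sources in lineage.items() if sources & proxies}

print(features_with_proxy_dependencies(column_lineage, proxy_candidates))
```

Here "risk_score" is flagged because its lineage includes "postcode"; without the map, that dependency would be invisible to a fairness review that only looks at the model's input features.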
OpenLineage provides an open standard for emitting lineage events at all three levels: pipeline, transformation, and column. Marquez implements the OpenLineage standard as a lineage server, collecting events from pipeline tools and providing a queryable lineage graph. DataHub and Apache Atlas offer similar capabilities within their broader metadata platforms.
A lineage event emitted by a training pipeline step typically records the event type, timestamp, and run identifier; the job namespace and name with documentation; input datasets with their schema, data source references, and data quality metrics such as row count and null percentage; and output datasets with model metrics such as evaluation scores and fairness measures. This structured event format enables automated compliance reporting because the lineage graph can be queried programmatically to answer audit questions about data provenance and processing history.
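A lineage event of the kind just described might look like the following. This follows the general shape of an OpenLineage run event, but the facet contents, identifiers, and metric names are illustrative rather than a verbatim excerpt of the specification:

```python
# Sketch of a lineage event in the general shape of an OpenLineage run event.
# Facet contents, dataset names, and the run ID are all illustrative.
event = {
    "eventType": "COMPLETE",
    "eventTime": "2025-01-15T10:32:00Z",
    "run": {"runId": "c9d3f9a0-0000-0000-0000-000000000001"},
    "job": {"namespace": "credit_model", "name": "train_split"},
    "inputs": [{
        "namespace": "warehouse",
        "name": "features_v12",
        "facets": {"dataQualityMetrics": {"rowCount": 182_340,
                                          "nullPercentage": {"annual_income": 0.0}}},
    }],
    "outputs": [{
        "namespace": "warehouse",
        "name": "train_set_v12",
        "facets": {"modelMetrics": {"auc": 0.81, "subgroup_fpr_gap": 0.02}},
    }],
}

# The structured format lets audit questions be answered programmatically,
# e.g. "which input datasets fed this training run?"
input_names = [ds["name"] for ds in event["inputs"]]
print(input_names)
```

A lineage server such as Marquez collects these events and exposes the same kind of query over the accumulated graph rather than over a single event.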
Feature stores such as Feast, Tecton, and Hopsworks address a specific lineage gap: the connection between raw data and computed features. A feature store centralises three things: feature definitions, meaning the code that computes each feature; feature values, meaning versioned snapshots of computed features; and feature metadata, covering descriptions, owners, and freshness requirements.
Consistency between the features used in training and the features used in inference is enforced by the store, eliminating a common source of training-serving skew that can degrade performance and fairness in production. This enforcement is particularly important for Article 10 compliance because discrepancies between training and inference features can introduce biases that were not present during model evaluation.
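The enforcement mechanism can be illustrated without any particular feature store product: a single registered definition computes the feature in both the training and inference paths, so the logic cannot drift. The registry, decorator, and feature names below are hypothetical, not an API of Feast, Tecton, or Hopsworks.

```python
# Minimal sketch of training/serving consistency: one registered definition
# serves both paths. All names here are illustrative.
feature_registry: dict[str, callable] = {}

def register_feature(name):
    def wrap(fn):
        feature_registry[name] = fn
        return fn
    return wrap

@register_feature("debt_to_income")
def debt_to_income(record: dict) -> float:
    return record["total_debt"] / record["annual_income"]

def compute_features(record, names):
    # Both the training pipeline and the inference service call this,
    # so the feature logic cannot diverge between the two.
    return {n: feature_registry[n](record) for n in names}

train_row = {"total_debt": 12_000.0, "annual_income": 48_000.0}
serve_row = {"total_debt": 12_000.0, "annual_income": 48_000.0}
assert compute_features(train_row, ["debt_to_income"]) == \
       compute_features(serve_row, ["debt_to_income"])
```

A production feature store adds versioned value snapshots and metadata on top of this, but the single-definition principle is the part that eliminates training-serving skew.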
The data engineering team versions every dataset with an immutable identifier that allows the exact dataset to be retrieved at any future point. Dataset versions are linked to model versions so that the AISDP can state precisely which data was used to train each model version. Tools such as DVC (Data Version Control), Delta Lake, LakeFS, or cloud-native versioning such as S3 object versioning provide this capability.
The AISDP must reference the data versioning mechanism, the storage location, the retention policy, and the access controls governing the versioned datasets. Without immutable dataset versioning, the organisation cannot reproduce a model's training conditions or verify that a specific dataset was used, both of which are requirements for demonstrating compliance under Article 10.
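One way to make a version identifier immutable is to derive it from the dataset's content, so the same bytes always yield the same ID and any change yields a new one. Real tooling such as DVC, LakeFS, or Delta Lake does this at scale; the sketch below (with hypothetical prefix and record shape) only illustrates the principle.

```python
import hashlib
import json

# Sketch of a content-derived, immutable dataset identifier. The "ds-" prefix
# and 16-character truncation are illustrative choices, not a standard.
def dataset_version_id(records: list[dict]) -> str:
    canonical = json.dumps(records, sort_keys=True).encode("utf-8")
    return "ds-" + hashlib.sha256(canonical).hexdigest()[:16]

snapshot = [{"customer_id": 1, "annual_income": 32_000}]
version = dataset_version_id(snapshot)

# Linking the dataset version to the model version lets the AISDP state
# precisely which data trained which model.
model_record = {"model_version": "model-2025.01", "trained_on": version}
print(model_record)
```

Because the identifier is a function of content, retrieving the dataset later and recomputing the hash also verifies that the stored bytes are the ones the model was trained on.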
The pre-step and post-step record methodology wraps each data engineering step in a structured log that serves as both an engineering discipline and a compensating control for organisations without fully automated lineage tooling. Before running the step, the engineer records the input datasets by version identifier, the intended transform, the rationale, and the expected output characteristics. Afterwards, the actual output characteristics are recorded and compared against expectations.
The data engineering team structures the records as JSON or YAML, attached as metadata to the pipeline step, and stored in the evidence pack. Great Expectations integrates naturally here: an expectation suite defines the expected output characteristics, and the test result, whether pass or fail with specifics, serves as the post-step record. This creates an audit trail demonstrating that data engineering was deliberate and considered. Evidence Pack Structure details how these records fit into the broader compliance evidence framework.
Pipeline-level lineage captures which steps ran and in what order with their inputs and outputs, providing a macro view. Transformation-level lineage captures the detailed logic within each step, such as what the cleaning or feature engineering step actually computed, requiring version-controlled code rather than ad hoc queries.
OpenLineage provides the open standard for emitting lineage events. Marquez implements it as a lineage server with a queryable graph. DataHub and Apache Atlas offer similar capabilities within broader metadata platforms. DAG-based tools like Airflow, Prefect, and Dagster provide pipeline-level lineage automatically.
Column-level lineage tracks how each feature in the model's input relates to source columns. If a risk score is derived from postcode data, column-level lineage reveals this dependency, making it possible to identify that postcode may serve as a proxy for ethnicity.
Feature stores centralise feature definitions, versioned values, and metadata, enforcing consistency between training and inference features to eliminate training-serving skew.
Every dataset requires an immutable version identifier linked to model versions, with documented storage location, retention policy, and access controls.
Structured JSON or YAML logs recorded before and after each data engineering step create an audit trail when fully automated lineage tooling is not in place.