Article 10 of the EU AI Act imposes detailed data governance requirements on training, validation, and testing datasets for high-risk AI systems. These requirements cover dataset documentation, completeness assessment, fairness and bias evaluation, data lineage, special category data processing under Article 10(5), GDPR alignment, third-party data governance, and embedding model and knowledge base governance. This page translates each obligation into concrete engineering practices documented in AISDP Module 4.
Article 10 of the EU AI Act establishes data governance requirements for training, validation, and testing datasets that are among the most prescriptive in the regulation.
These requirements feed directly into AISDP Module 4, covering dataset documentation, completeness assessment, fairness and bias evaluation, data lineage, special category data processing, GDPR alignment, and third-party data governance.
Data governance under Article 10 is an engineering discipline embedded in every stage of the data lifecycle, from collection through preparation, labelling, training, validation, and ongoing monitoring. The AISDP must demonstrate that governance was designed into the data pipeline from the outset, not documented retrospectively. Training, validation, and testing datasets must satisfy governance practices covering relevance, representativeness, freedom from errors, and completeness, with particular attention to the persons or groups on whom the system is intended to operate. The Technical SME ensures appropriate statistical properties across all datasets.
The scope extends beyond conventional training data. Knowledge bases in retrieval-augmented generation (RAG) architectures, embedding models that encode semantic relationships, and third-party data sources all fall within the governance perimeter. An organisation that governs its primary training data meticulously but neglects the knowledge base feeding its RAG pipeline has a compliance gap that could surface during conformity assessment. Risk assessment establishes the risk profile that determines the depth of data governance required, while model selection decisions constrain the types of data the system will consume. The outputs from data governance feed primarily into AISDP Module 4 (Data Governance and Dataset Documentation) but are cross-referenced by Module 5 (Testing and Validation) and Module 6 (Monitoring).
Every dataset used in the system lifecycle requires structured documentation covering provenance, composition, preparation, quality, annotation, and known limitations. Generic data catalogue entries do not satisfy Article 10; documentation must be specific enough for an assessor to evaluate suitability for training the system in question.
Provenance requires specificity: "data collected from deployer ATS systems between January 2021 and December 2023 under data processing agreements" is acceptable; "data from various sources" is not. The record must state the collection methodology, whether informed consent was obtained or another legal basis under GDPR Article 6 applies, and the licensing terms for any third-party data including whether those terms permit the intended use.
Composition covers dataset size (record count, feature count, storage size), temporal coverage, and geographic and demographic distribution. Statistics must be presented both in aggregate and disaggregated by relevant subgroups, particularly where protected characteristics are represented and in what proportions relative to the deployment population.
Preparation documents every preprocessing, cleaning, transformation, augmentation, and feature engineering step. Records removed must be logged with the reason and count. Missing value handling and imputation methods must be recorded with the assumptions they encode. Quality captures the metrics applied, error rates observed, how errors were detected and corrected, and the automated quality checks enforced. Annotation records annotator qualifications, guidelines, inter-annotator agreement rates, disagreement resolution processes, and whether annotators were compensated fairly under conditions that support quality, since annotation quality directly affects label accuracy, which in turn affects model fairness. Known limitations identifies gaps, biases, underrepresented subgroups, and temporal or geographic skew.
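The documentation fields above lend themselves to a structured, version-controlled record rather than free text. A minimal Python sketch follows; the class and field names are illustrative, not a mandated Article 10 schema:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    # Illustrative minimal documentation record for one dataset version.
    name: str
    provenance: str          # specific source, period, and agreement, per Article 10
    legal_basis: str         # GDPR Article 6 basis for collection
    record_count: int
    removed_records: dict = field(default_factory=dict)   # removal reason -> count
    known_limitations: list = field(default_factory=list)

    def removal_total(self) -> int:
        # Every removed record must be accounted for with a reason.
        return sum(self.removed_records.values())

rec = DatasetRecord(
    name="recruitment_training_v3",
    provenance="deployer ATS exports, 2021-2023, under data processing agreements",
    legal_basis="Article 6(1)(f) legitimate interest",
    record_count=120_000,
    removed_records={"duplicate": 1_240, "corrupted_fields": 310},
    known_limitations=["underrepresentation of applicants aged 55+"],
)
print(rec.removal_total())  # total records logged as removed
```

Storing such records alongside the dataset in the versioning system keeps the documentation and the data in lockstep.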
Article 10(3) requires datasets to be "relevant, sufficiently representative, and to the best extent possible, free of errors and complete." Completeness has three dimensions that the AISDP must address: feature completeness, population completeness, and temporal completeness.
Feature completeness means every feature the model's intended purpose logically requires should be present and populated. Missing features force the model to rely on proxy variables, which may introduce bias. The AISDP must document which features are available, which are missing and why, and what compensating controls are in place.
Population completeness requires the dataset to represent the full range of persons and groups on whom the system will operate. If deployed across the EU/EEA, training data should reflect the demographic diversity of the deployment population. Underrepresentation of specific subgroups degrades the model's performance for those subgroups and creates fairness risk. Temporal completeness requires sufficient coverage of seasonal, cyclical, and trend variations. A model trained on one year of data may not capture multi-year patterns. Module 4 records the temporal coverage and the assessment of whether it is sufficient for the system's intended purpose.
Complete data is rarely achievable in practice. Module 4 must record the compensating controls applied when completeness gaps are identified. Synthetic data augmentation can address underrepresentation of specific subgroups, though the AISDP must document the generation algorithm, validation against real data distributions, the proportion of synthetic data in the final training set, and the trade-off between coverage and the risk that synthetic data fails to capture real-world complexity. Transfer learning from related domains can compensate for limited data in the target domain, provided the domain relevance is justified and performance degradation from domain shift is measured and documented. Stratified sampling ensures small subgroups appear in validation and test sets in sufficient numbers for meaningful performance metrics. Ensemble methods combining predictions from multiple models trained on overlapping but non-identical subsets can improve robustness to completeness gaps, with the ensemble composition and combination logic documented in the AISDP.
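The interpolation at the heart of the SMOTE-style augmentation mentioned above can be illustrated with a heavily simplified sketch: real SMOTE interpolates towards one of the k nearest minority neighbours, while this version, for brevity, picks any other minority record at random.

```python
import random

def smote_like(minority, n_new, seed=0):
    """Generate synthetic numeric records by linear interpolation between
    pairs of minority-subgroup records. Simplified sketch of the SMOTE
    idea, not the full nearest-neighbour algorithm."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        b = rng.choice(minority)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([x + lam * (y - x) for x, y in zip(a, b)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 1.0]]
synth = smote_like(minority, 5)  # five synthetic records inside the hull of the pair
```

Because every synthetic value lies between two real minority values, the proportion and provenance of such records are exactly what Module 4 must document.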
Data quality validation is the automated gate that prevents corrupted or drifting data from reaching the model. Without it, a single upstream change, such as a source system schema modification, a data provider's silent methodology change, or a pipeline bug introducing null values, can propagate through to the model and degrade performance or introduce bias.
Validation should run at three pipeline checkpoints: at ingestion (before raw data enters the system), after each transformation step (confirming the transformation produced expected output), and before training (confirming the final dataset meets all quality standards). Each checkpoint enforces a different set of expectations.
Schema validation catches structural problems: renamed columns, changed data types, and unexpected formats. Pandera provides lightweight schema validation for Pandas DataFrames using a decorator pattern. For SQL pipelines, dbt's built-in tests (unique, not_null, accepted_values, relationships) provide schema-level assertions. Statistical validation catches distributional problems: sudden feature distribution shifts, correlation structure changes, or anomalous batches. The Kolmogorov-Smirnov test and chi-squared test are the workhorses, with reference distributions captured from a validated baseline dataset and updated periodically (for example, quarterly, aligned with the risk register review). Evidently AI generates data quality reports that include distribution comparisons, correlation analysis, and drift metrics suitable for both pipeline gating and periodic review. Anomaly detection catches individual-record problems: extreme values, multi-feature statistical outliers, duplicates, or corrupted entries. Isolation forests and z-score methods are standard approaches.
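A schema-validation checkpoint of the kind Pandera or dbt tests provide can be sketched in plain Python; the schema format (column name mapped to type and nullability) and the function name are illustrative:

```python
def validate_schema(rows, expected):
    """Check each row (a dict) against an expected schema of
    {column: (type, nullable)}. Returns a list of violation strings;
    an empty list means the batch passes the ingestion gate."""
    violations = []
    for i, row in enumerate(rows):
        for col in set(expected) - set(row):
            violations.append(f"row {i}: missing column {col}")
        for col, (typ, nullable) in expected.items():
            if col not in row:
                continue
            val = row[col]
            if val is None:
                if not nullable:
                    violations.append(f"row {i}: null in non-nullable {col}")
            elif not isinstance(val, typ):
                violations.append(f"row {i}: {col} expected {typ.__name__}")
    return violations

expected = {"age": (int, False), "income": (float, True)}
batch = [{"age": 34, "income": 52000.0}, {"age": None, "income": 48000.0}]
print(validate_schema(batch, expected))  # flags the null age in the second row
```

In a real pipeline the non-empty violation list would halt the run and be retained as a Module 4 evidence artefact.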
Bias detection is the most technically demanding aspect of data governance under Article 10. Article 10(2)(f) requires providers to examine training data for "possible biases that are likely to affect the health and safety of persons, have a negative impact on fundamental rights, or lead to discrimination." The assessment operates in two phases: before training and after training.
Pre-training bias assessment examines the data before any model is trained. Distributional analysis computes the distribution of each feature across protected characteristic subgroups, using chi-squared tests for categorical features and Kolmogorov-Smirnov or Mann-Whitney tests for continuous features. The practical output is a matrix with features on one axis, protected characteristics on the other, and each cell showing the test statistic and p-value. Features with statistically significant distributional differences (p less than 0.05 after Bonferroni correction for multiple comparisons) are flagged for investigation. ydata-profiling automates this for tabular datasets, producing an HTML report with correlation matrices and distribution comparisons.
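The chi-squared half of that matrix can be computed directly. The sketch below returns the Pearson chi-squared statistic for a 2x2 feature-by-subgroup contingency table and compares it against the df=1, alpha=0.05 critical value; a Bonferroni correction across many feature/characteristic pairs would tighten that threshold.

```python
def chi2_2x2(table):
    """Pearson chi-squared statistic for a 2x2 contingency table
    [[a, b], [c, d]] of feature-value counts by subgroup."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = 0.0
    for obs, r, c2 in [(a, row1, col1), (b, row1, col2),
                       (c, row2, col1), (d, row2, col2)]:
        expected = r * c2 / n
        stat += (obs - expected) ** 2 / expected
    return stat

CRITICAL_DF1_P05 = 3.841  # chi-squared critical value, df=1, alpha=0.05
stat = chi2_2x2([[90, 10], [60, 40]])
print(stat > CRITICAL_DF1_P05)  # True: flag this feature for investigation
```

Each flagged cell then feeds the investigation workflow described above.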
Label bias analysis examines whether outcome labels themselves reflect historical discrimination. In a recruitment context, the "hired/not hired" label encodes human recruiter decisions that may carry conscious or unconscious bias. Detecting label bias requires stepping outside the data: inter-rater reliability analysis (having multiple independent labellers rate the same instances, measuring agreement via Cohen's kappa or Krippendorff's alpha) reveals the extent to which labels are subjective. Re-labelling by diverse panels provides a corrective dataset for comparison. Where relabelling is infeasible, the AISDP documents the known label bias, its potential effect on the model, and the compensating controls.
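Cohen's kappa, mentioned above for inter-rater reliability, is straightforward to compute for two raters labelling the same instances:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement between two raters, corrected
    for the agreement expected by chance from their label frequencies."""
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)

print(cohens_kappa([1, 0, 1, 0], [1, 1, 0, 0]))  # chance-level agreement: 0.0
```

Kappa near 1.0 indicates near-objective labels; kappa near 0 signals labels dominated by rater subjectivity, which the AISDP must record as a label bias risk.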
No bias mitigation technique eliminates bias without side effects. Every technique trades one form of accuracy or fairness for another, and the AISDP must document what technique was used, why it was chosen over alternatives, what trade-off it introduced, and whether that trade-off is acceptable.
Pre-processing mitigations modify the training data before the model sees it. Resampling is the simplest approach: oversampling underrepresented subgroups or undersampling overrepresented subgroups. SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic examples by interpolating between existing minority records, reducing overfitting risk compared to simple duplication. ADASYN focuses synthetic generation on boundary regions where the classifier struggles. Reweighting assigns higher training weights to underrepresented subgroups, with weight for each instance inversely proportional to its subgroup's prevalence. AI Fairness 360 provides a reweighting preprocessor that computes optimal weights automatically. The disparate impact remover (Feldman et al., 2015) modifies feature values to reduce correlations with protected characteristics while preserving predictive value. Learning fair representations (Zemel et al., 2013) learns a new feature space explicitly uninformative about protected characteristics.
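The inverse-prevalence reweighting idea can be sketched as follows. Note this is a simplified version: AI Fairness 360's reweighting preprocessor weights by group and label jointly, whereas this sketch weights by subgroup prevalence only.

```python
from collections import Counter

def subgroup_weights(groups):
    """Training weight per instance, inversely proportional to its
    subgroup's prevalence, normalised so the weights average 1.0."""
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    return [n / (k * counts[g]) for g in groups]

groups = ["A", "A", "A", "B"]
print(subgroup_weights(groups))  # minority subgroup B receives the larger weight
```

The resulting weights plug directly into any estimator that accepts per-sample weights (for example scikit-learn's `sample_weight` parameter).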
In-processing mitigations modify the training procedure itself. Fairlearn's ExponentiatedGradient solves a constrained optimisation problem, maximising accuracy subject to a fairness constraint such as demographic parity or equalised odds. It trains many candidate models with different constraint levels and returns the best balance of accuracy and fairness, integrating with scikit-learn estimators. Adversarial debiasing (Zhang et al., 2018) trains an adversary network that tries to predict protected characteristics from the model's internal representations, penalising the model for leaking demographic information. The adversary's learning rate relative to the main model critically affects the trade-off.
Data lineage, the ability to trace every data element from source collection through each transformation to final use in the model, is foundational to AISDP Module 4. Without it, the organisation cannot prove what data the model was trained on, how that data was prepared, or whether preparation steps introduced bias.
Lineage operates at three levels, and most organisations need all three. Pipeline-level lineage captures the macro view: which steps ran, in what order, with what inputs and outputs. DAG-based orchestration tools (Airflow, Prefect, Dagster) provide this automatically because the pipeline definition is itself a directed acyclic graph of steps with declared dependencies. Every pipeline execution must be logged with a unique execution ID, timestamp, input dataset versions, output dataset versions, and execution status.
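A pipeline execution record of the kind just described can be emitted as structured JSON; the field names here are illustrative, not the OpenLineage schema.

```python
import json, uuid
from datetime import datetime, timezone

def log_execution(step, input_versions, output_versions, status):
    """Emit a pipeline-level lineage record: unique execution ID,
    UTC timestamp, dataset versions in and out, and execution status."""
    record = {
        "execution_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "step": step,
        "inputs": input_versions,
        "outputs": output_versions,
        "status": status,
    }
    return json.dumps(record)

line = log_execution("clean_applicants",
                     {"raw_applicants": "v12"},
                     {"clean_applicants": "v7"},
                     "succeeded")
print(line)
```

Appending one such line per execution to an immutable log gives an assessor the macro-level audit trail Module 4 requires.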
Transformation-level lineage captures the logic within each step: what the cleaning step actually did, what the feature engineering computed, what the imputation strategy was. This requires each transform to be defined as version-controlled code, not ad hoc SQL queries or Jupyter notebook cells. dbt is the strongest tool for SQL-based transforms, with each model defined as a SQL file in a Git repository with tests, documentation, and automatic lineage graph output. For Python-based transforms, the code provides lineage when version-controlled, but parameters (thresholds, imputation values, normalisation statistics) must also be captured as structured metadata.
Column-level lineage is the finest grain and the most valuable for bias analysis. It tracks how each column in the output dataset relates to columns in source datasets. If the model uses a "risk_score" feature, column-level lineage reveals that it was derived from "annual_income" (source: payroll system) and "postcode" (source: address database), making the proxy relationship with ethnicity visible. OpenLineage provides an open standard for emitting lineage events at all three levels, with Marquez implementing it as a queryable lineage server. DataHub and Apache Atlas offer similar capabilities.
Article 10(5) permits processing of special category personal data strictly for bias monitoring and detection, subject to specific safeguards. This provision resolves a fundamental tension: meaningful bias detection is frequently impossible without access to the demographic data that data protection law restricts.
The gateway condition is the sufficiency test. Before processing real special category data, the organisation must demonstrate that bias detection "cannot reasonably be carried out" using synthetic or anonymised alternatives. This is not a formality; it requires a documented technical assessment. Synthetic data evaluation involves generating datasets replicating protected characteristic distributions (using tools like SDV, Gretel.ai, or MOSTLY AI), running the full bias detection suite, and assessing whether results reliably indicate real-world fairness behaviour. Synthetic data frequently falls short because it fails to capture the correlational structure between protected characteristics and features such as educational attainment, postcode, or employment history. The evaluation should quantify this by comparing bias metrics on synthetic data against metrics on a small, carefully governed sample of real data.
Anonymisation evaluation assesses whether anonymised data preserves the subgroup structure needed for disaggregated fairness analysis. k-anonymity, l-diversity, and t-closeness provide formal privacy models; the choice depends on dataset size and data sensitivity. The DPO Liaison documents the anonymisation technique, privacy parameters, re-identification risk assessment, and suitability conclusion.
The AI Act's data governance requirements operate alongside the GDPR as cumulative obligations. An organisation that satisfies Article 10 but violates the GDPR is non-compliant with both regulations, because the AI Act's data governance provisions presuppose GDPR compliance. Module 4 of the AISDP must address both frameworks in an integrated manner.
Lawful basis selection under GDPR Article 6 is one of the most consequential data governance decisions. Consent (Article 6(1)(a)) offers the strongest legal footing but is often impractical for large-scale datasets, since consent must be freely given, specific, informed, and unambiguous. Withdrawal rights create operational challenges: if a data subject withdraws consent, the organisation must remove their data from the training set and either retrain or demonstrate the data cannot be recovered from model parameters. Legitimate interest (Article 6(1)(f)) is more commonly relied upon, requiring a documented three-part balancing test (legitimate interest identified, processing necessity demonstrated, interests balanced against data subjects' rights) for each dataset and processing purpose. Public interest (Article 6(1)(e)) is available for public authority AI systems, while contract performance (Article 6(1)(b)) applies where the AI system processes personal data to fulfil a contract.
Data subject rights present specific technical challenges in the AI context. The right to erasure (GDPR Article 17) is the most demanding: the organisation must remove data from the training dataset and either retrain the model or demonstrate that the individual's data cannot be recovered from the model's parameters. Three approaches exist with different cost and assurance profiles. Full retraining removes the records and retrains from scratch, which is cleanest but expensive for large models. SISA training (Sharded, Isolated, Sliced, Aggregated) partitions training data into shards with separate models, so only the shard containing the data subject's records requires retraining. Approximate unlearning attempts to reverse the effect of specific records without full retraining but lacks formal guarantees and should be treated as supplementary for high-risk systems.
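The SISA idea can be illustrated with deterministic shard assignment: an erasure request maps to exactly one shard, and only that shard's model needs retraining. The hash scheme below is an illustrative choice, not the SISA paper's prescription.

```python
import hashlib

def shard_of(record_id: str, n_shards: int) -> int:
    """Deterministic shard assignment for a training record. Under SISA,
    erasing record_id only forces retraining of this one shard's model."""
    digest = hashlib.sha256(record_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_shards

def shards_to_retrain(erased_ids, n_shards):
    """Distinct shards touched by a batch of erasure requests."""
    return sorted({shard_of(rid, n_shards) for rid in erased_ids})
```

With, say, 8 shards, a batch of erasure requests concentrated in one shard costs one-eighth of a full retrain; the shard map itself becomes a Module 4 evidence artefact.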
Many high-risk AI systems rely on data the organisation does not collect, curate, or control. Article 10's governance requirements apply regardless of data source, and the organisation bears full compliance responsibility for data it has limited ability to govern directly.
Embedding models and knowledge bases in RAG architectures similarly require governance proportionate to their influence on system outputs.
Third-party data governance operates across three layers. The contractual layer establishes the baseline across five domains: provenance disclosure (collection methodology, GDPR lawful basis, populations represented, known limitations, prior processing); quality specifications (measurable standards for completeness, accuracy, timeliness, and consistency); bias and representativeness warranties (demographic composition statistics where lawful); change notification (30 to 90 days' notice before material methodology changes); and audit rights (risk-proportionate inspection of supplier governance practices, annual for high-sensitivity data, biennial for lower-risk sources).
The technical layer validates every delivery regardless of contractual promises. An automated intake validation pipeline checks schema compliance, completeness against contracted thresholds, statistical distribution against historical baselines, and anomaly detection for unusual records or batches. Great Expectations or Soda Core can define a dedicated expectation suite per supplier encoding contractual quality specifications as automated checks. Deliveries that fail are quarantined and do not enter the training pipeline. Periodic re-assessment, at least annually, evaluates whether data remains representative. The monitoring layer defends against silent changes through statistical comparison of each delivery's distributional profile against the historical baseline.
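The intake gate can be sketched as a set of named checks whose failure quarantines the whole delivery; the check names and delivery format below are illustrative, standing in for a Great Expectations or Soda Core suite:

```python
def intake_gate(delivery, checks):
    """Run each named check against a supplier delivery. Any failure
    quarantines the delivery rather than letting it reach the training
    pipeline. Returns (accepted, failures); the failure list is retained
    as an evidence artefact."""
    failures = [name for name, check in checks.items() if not check(delivery)]
    return (len(failures) == 0, failures)

# Illustrative checks encoding contractual quality specifications.
checks = {
    "min_rows": lambda d: len(d["rows"]) >= 1000,
    "no_null_ids": lambda d: all(r.get("id") is not None for r in d["rows"]),
}

ok, failures = intake_gate({"rows": [{"id": 1}, {"id": 2}]}, checks)
print(ok, failures)  # quarantined: delivery is far below the contracted row count
```

Each check corresponds to a contractual quality specification, so the quarantine log doubles as evidence of supplier non-conformance.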
Whether a knowledge base constitutes 'training, validation and testing data' within Article 10 is an open legal question, since the knowledge base conditions outputs at inference time rather than training model parameters. The recommended compliance approach is to apply Article 10 governance to the knowledge base, adapted for inference-time retrieval. The cost of discovering post-deployment that an ungoverned knowledge base introduced bias or inaccuracy into a high-risk system materially exceeds the cost of applying governance from the outset.
Stored embeddings can themselves constitute personal data. Research demonstrates that original text can be partially or fully reconstructed from embeddings through inversion attacks, so if the embedding model encodes documents containing personal data, the embeddings may qualify as personal data under GDPR Article 4(1) because they relate to an identifiable natural person. The DPO Liaison must assess re-identification feasibility considering embedding dimensionality, available inversion techniques, and whether embeddings are stored alongside identifying metadata.
Fairness metrics can pull in opposite directions: a model achieving equalised odds may fail predictive parity, and a model with good calibration within groups may violate the four-fifths rule. The organisation must decide which fairness concept takes priority for its specific system, document the rationale, and declare the chosen metric as the primary threshold for the deployment blocking gate. Fairlearn's MetricFrame reports all metrics simultaneously so the trade-offs are visible.
Dataset documentation must cover provenance, composition with demographic disaggregation, preparation steps, quality metrics, annotation processes, and known limitations. Documentation depth must be proportionate to the dataset's role in the system.
Pre-training assessment covers distributional analysis, label bias, proxy variable detection, and intersectional analysis. Post-training evaluation uses five complementary metrics: selection rate ratio, equalised odds, predictive parity, calibration within groups, and counterfactual fairness testing.
Article 10(5) permits processing of special category personal data strictly for bias monitoring when synthetic and anonymised alternatives are insufficient, subject to five layers of safeguards: isolation, pseudonymisation, encryption, access control, and confidential computing.
The AI Act and the GDPR operate as cumulative obligations. Lawful basis selection, data subject rights (especially the right to erasure), DPIA/FRIA coordination, and the data retention tension between the GDPR's storage limitation principle and Article 18's ten-year documentation retention all require integrated treatment.
Article 10 requirements apply regardless of data source. Governance operates across contractual controls (provenance, quality, bias warranties, audit rights), technical validation (automated intake pipelines), and monitoring for silent supplier changes.
The Datasheets for Datasets framework (Gebru et al., 2021) provides the most thorough structure, organising documentation into seven sections: motivation, composition, collection process, preprocessing and cleaning, uses, distribution, and maintenance. For EU AI Act compliance, the composition section must include distributional analysis across protected characteristics feeding the bias assessment. The collection process section must document the GDPR lawful basis. The preprocessing section must align with data lineage requirements. The uses section must explicitly state limitations relevant to the system's intended purpose.
Documentation depth should be proportionate to the dataset's role. Training datasets for high-risk systems warrant comprehensive datasheets; static reference datasets warrant lighter treatment. A 50-page datasheet for a simple lookup table adds cost without compliance value. The AI System Assessor must document the standard applied to each dataset category and the rationale for the proportionality decision. Dataset documentation is a living artefact updated whenever the dataset changes, with version bumps triggering corresponding documentation updates. Data catalogue tools such as OpenMetadata and DataHub support attaching structured documentation to dataset versions with change tracking. For lighter tooling, a Markdown file co-located with the dataset in the data versioning system (DVC, Delta Lake) provides version-controlled documentation that evolves alongside the data.
Great Expectations is the most comprehensive tool for this purpose. It uses a declarative approach where expectations (assertions about data) are defined as code, organised into expectation suites (collections of assertions for a specific dataset), and run as part of the pipeline. An expectation might state "the column age should have no null values," "the column income should be between 0 and 10,000,000," or "the distribution of gender should match the reference distribution with a KS test p-value above 0.01." When an expectation fails, the pipeline halts and produces a structured validation result documenting exactly which expectations failed and by how much. The expectation suites serve as executable documentation of data quality standards that an assessor reviewing Module 4 can read to understand exactly what checks were applied. Great Expectations also generates Data Docs, HTML reports summarising validation results, which serve as evidence artefacts.
For third-party data sources, the validation pipeline is particularly important. Contractual quality specifications set the expected standard, and automated validation confirms each delivery meets it. Deliveries that fail are quarantined (not ingested) and flagged for investigation. The quarantine mechanism and investigation process are documented in the AISDP.
Proxy variable detection identifies features that correlate strongly with protected characteristics. Postcode correlates with ethnicity and socioeconomic status. University name correlates with social class. Name correlates with gender and ethnicity. Detection computes correlation between each feature and each protected characteristic using Pearson for continuous-continuous pairs, point-biserial for continuous-binary pairs, and mutual information for any pair. Features with correlation above a defined threshold (typically 0.3, calibrated to the domain) are flagged. The flag does not mean automatic removal; the Technical SME justifies retention where predictive value warrants it and proxy risk can be mitigated through fairness constraints during training.
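Pearson-based proxy screening can be sketched directly; the 0.3 threshold below mirrors the illustrative figure in the text and must be calibrated to the domain, and the protected characteristic is assumed to be numerically encoded.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between a candidate feature and a numerically
    encoded protected characteristic."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

PROXY_THRESHOLD = 0.3  # illustrative; calibrate to the domain

def flag_proxies(features, protected):
    """Feature names whose |correlation| with the protected characteristic
    exceeds the threshold; flagging triggers review, not automatic removal."""
    return [name for name, vals in features.items()
            if abs(pearson(vals, protected)) > PROXY_THRESHOLD]
```

Point-biserial correlation for continuous-binary pairs reduces to the same Pearson computation with the binary variable coded 0/1, so this sketch covers that case too.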
Intersectional analysis examines subgroup combinations, since a dataset may have adequate representation of each characteristic individually but critically small cell sizes for intersectional subgroups (for example, women from ethnic minority backgrounds at 3 to 5% of the dataset). Cell sizes must be reported in the AISDP for all examined intersectional subgroups. Where cell sizes fall below minimum thresholds for meaningful analysis (commonly 30 instances for basic metrics, 100 or more for reliable fairness metrics), the AISDP states this limitation explicitly rather than reporting unreliable metrics. Fairlearn's MetricFrame supports intersectional analysis by accepting multiple sensitive features and computing metrics for every combination with confidence intervals.
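Cell-size screening for intersectional subgroups is a simple counting exercise; the thresholds below are the 30 and 100 instance figures cited above, and the status labels are illustrative:

```python
from collections import Counter

MIN_BASIC, MIN_FAIRNESS = 30, 100  # minimum cell sizes cited in the text

def intersectional_cells(rows, characteristics):
    """Count intersectional subgroup cell sizes and classify each cell
    against the minimum sizes for basic and fairness metrics."""
    cells = Counter(tuple(r[c] for c in characteristics) for r in rows)
    report = {}
    for cell, n in cells.items():
        if n >= MIN_FAIRNESS:
            status = "fairness_metrics_ok"
        elif n >= MIN_BASIC:
            status = "basic_metrics_only"
        else:
            status = "too_small_report_limitation"
        report[cell] = (n, status)
    return report
```

Cells classified as too small are reported in the AISDP as an explicit limitation rather than backed by unreliable metrics.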
Post-training bias evaluation measures whether the trained model produces fair outcomes using five complementary metrics. The selection rate ratio (four-fifths rule) flags adverse impact when the positive outcome rate for any subgroup falls below 80% of the majority group's rate. Equalised odds requires similar true positive and false positive rates across subgroups; differences mean the model makes systematically different types of mistakes for different groups. Predictive parity requires positive predictions to be equally accurate across subgroups. Calibration within groups tests whether predicted probabilities correspond to actual outcomes consistently; reliability diagrams (plotting predicted probability against observed frequency per subgroup) are the standard visual tool. Counterfactual fairness testing measures whether changing a protected characteristic while holding other features constant changes the output.
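The selection rate ratio check is the simplest of the five metrics to compute; this sketch compares each subgroup's positive-outcome rate against the highest-rate subgroup, the usual framing of the four-fifths rule:

```python
def selection_rates(preds, groups):
    """Positive-outcome rate per subgroup."""
    totals, positives = {}, {}
    for p, g in zip(preds, groups):
        totals[g] = totals.get(g, 0) + 1
        positives[g] = positives.get(g, 0) + (1 if p else 0)
    return {g: positives[g] / totals[g] for g in totals}

def four_fifths_check(preds, groups):
    """True per subgroup when its selection rate is at least 80% of the
    highest subgroup's rate; False signals adverse impact."""
    rates = selection_rates(preds, groups)
    top = max(rates.values())
    return {g: r / top >= 0.8 for g, r in rates.items()}
```

The same disaggregated loop generalises to the error-rate comparisons behind equalised odds and predictive parity; Fairlearn's MetricFrame packages exactly that pattern.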
These metrics can conflict: a model achieving equalised odds may fail predictive parity. The organisation must decide which fairness concept takes priority for its specific system and document the rationale.
Post-processing mitigations modify outputs after inference. Threshold adjustment sets different decision thresholds per subgroup to equalise selection rates or error rates. Fairlearn's ThresholdOptimizer automates finding per-subgroup thresholds that satisfy a given fairness constraint while maximising accuracy. Reject option classification routes borderline predictions (where confidence is low) to human review. Calibrated equalised odds (Pleiss et al., 2017) adjusts probability scores per subgroup to achieve both calibration and equalised odds.
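Per-subgroup threshold adjustment, once the thresholds have been chosen offline (for example via Fairlearn's ThresholdOptimizer), reduces to a lookup at inference time; the threshold values below are illustrative:

```python
def apply_group_thresholds(scores, groups, thresholds, default=0.5):
    """Convert model scores to decisions using a per-subgroup threshold,
    falling back to a default for unlisted subgroups."""
    return [score >= thresholds.get(g, default)
            for score, g in zip(scores, groups)]

decisions = apply_group_thresholds(
    scores=[0.55, 0.55], groups=["A", "B"],
    thresholds={"A": 0.60, "B": 0.50})
print(decisions)  # the same score yields different outcomes under adjusted thresholds
```

The example makes the criticism in the next paragraph concrete: identical scores receive different decisions, which is precisely why the choice needs documented justification.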
Post-processing mitigations face legitimate criticism as cosmetic adjustments that mask underlying model bias without addressing root causes. A model requiring aggressive threshold adjustment has learned something problematic, and the adjustment will not fix that learning if the input distribution shifts. For AISDP purposes, valid reasons for choosing post-processing include: root-cause mitigation would require protected characteristic data the organisation cannot lawfully obtain; root-cause mitigation would reduce accuracy below declared performance thresholds; or the bias is an artefact of historical data that cannot be corrected within the training data. The AI Governance Lead must sign off, and residual risk must be documented. Where no technique fully eliminates bias, compensating controls may include mandatory human review for decisions affecting disadvantaged subgroups, enhanced outcome monitoring, or deployment restrictions.
Feature stores (Feast, Tecton, Hopsworks) address a specific lineage gap: the connection between raw data and computed features. They centralise feature definitions, versioned feature values, and metadata, enforcing consistency between training and inference features to eliminate training-serving skew. Each data engineering step should be wrapped in a pre-step/post-step record capturing the input datasets by version identifier, the intended transform and rationale, expected output characteristics, and actual output characteristics for comparison. Data versioning through DVC, Delta Lake, LakeFS, or cloud-native versioning ensures every dataset carries an immutable identifier linked to each model version trained on it, so the AISDP can state precisely which data was used for each model version.
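The pre-step/post-step pattern can be sketched as a thin wrapper around each transform. The step names, field names, and content-hash versioning scheme here are hypothetical, chosen to show the shape of the record rather than any particular tool's API:

```python
import datetime
import hashlib
import json

def dataset_fingerprint(rows):
    """Deterministic content hash serving as an immutable dataset version identifier."""
    payload = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]

def run_step(name, rationale, rows, transform, expected):
    """Wrap one data engineering step in a pre-step/post-step lineage record."""
    record = {
        "step": name,
        "rationale": rationale,
        "input_version": dataset_fingerprint(rows),
        "expected": expected,
        "started_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    out = transform(rows)
    record["output_version"] = dataset_fingerprint(out)
    record["actual"] = {"row_count": len(out)}
    record["matches_expectation"] = record["actual"]["row_count"] == expected["row_count"]
    return out, record

rows = [{"id": 1, "age": 34}, {"id": 2, "age": None}, {"id": 3, "age": 51}]
cleaned, rec = run_step(
    "drop_missing_age",
    "Age is a mandatory feature; rows without it cannot be imputed reliably.",
    rows,
    lambda rs: [r for r in rs if r["age"] is not None],
    expected={"row_count": 2},
)
```

The record captures intent before execution and outcome after it, so a divergence between expected and actual characteristics is visible in the lineage trail rather than discovered at conformity assessment.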
When alternatives are insufficient, processing requires five layers of safeguards. Isolation stores special category data in a dedicated, physically or logically separated environment with no connectivity to the main development and production data stores. Pseudonymisation replaces direct identifiers with pseudonymous keys, with the mapping table stored separately (for example in HashiCorp Vault) under stricter access controls accessible only to the DPO Liaison and a named data governance officer. Encryption requires AES-256 at rest and TLS 1.3 in transit, with no unencrypted special category data at any pipeline point. Access control and audit restricts access to named individuals with documented business need, with immutable audit trails capturing accessor identity, timestamp, data accessed, and purpose. Confidential computing provides the highest assurance through hardware-secured enclaves (Intel SGX enclaves on Azure Confidential Computing, AWS Nitro Enclaves, Google Confidential VMs) that prevent data exfiltration even by system administrators.
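The pseudonymisation layer might look like the following sketch, using a keyed HMAC so pseudonyms are deterministic across runs. The key, field names, and record shape are illustrative; in practice the key and the re-identification mapping would live in the separate secrets store described above, never alongside the code or the pseudonymised data:

```python
import hashlib
import hmac

# Illustrative only: in production this key is retrieved from the secrets
# store (e.g. HashiCorp Vault) and never hard-coded.
PSEUDONYM_KEY = b"replace-with-vault-managed-secret"

def pseudonymise(records, id_field="patient_id"):
    """Replace direct identifiers with keyed pseudonyms; return the records
    plus the re-identification mapping, which must be stored separately."""
    mapping = {}
    out = []
    for r in records:
        pseudo = hmac.new(
            PSEUDONYM_KEY, str(r[id_field]).encode(), hashlib.sha256
        ).hexdigest()[:16]
        mapping[pseudo] = r[id_field]
        out.append({**r, id_field: pseudo})
    return out, mapping

records = [{"patient_id": "P-001", "ethnicity": "X"},
           {"patient_id": "P-002", "ethnicity": "Y"}]
pseudo_records, reident_map = pseudonymise(records)
```

Keyed hashing means the same individual receives the same pseudonym in every dataset version, preserving joins for bias analysis, while re-identification requires both the mapping table and its access-controlled key.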
A formal governance workflow requires the Technical SME to prepare a Special Category Data Processing Request specifying the bias detection purpose, specific data elements, processing methodology, and expected retention period. The DPO Liaison reviews for GDPR Article 9 compliance, and the AI Governance Lead confirms that only the minimum data necessary will be processed. Results are extracted in aggregate form only; individual records must not leave the secured environment. Following completion, special category data is deleted or anonymised, with the DPO Liaison technically verifying deletion across all storage locations, including backups, caches, and derived datasets.
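The sequential sign-off could be enforced in tooling along these lines. The class, role names, and field names mirror the workflow described above but are otherwise hypothetical:

```python
from dataclasses import dataclass, field

# Review sequence from the workflow: DPO Liaison first, then AI Governance Lead.
APPROVAL_ORDER = ["dpo_liaison", "ai_governance_lead"]

@dataclass
class SpecialCategoryDataRequest:
    purpose: str
    data_elements: list
    methodology: str
    retention_days: int
    approvals: list = field(default_factory=list)

    def approve(self, role):
        """Reject any approval arriving out of the mandated sequence."""
        expected = APPROVAL_ORDER[len(self.approvals)]
        if role != expected:
            raise PermissionError(f"Out-of-order approval: expected {expected}")
        self.approvals.append(role)

    @property
    def processing_permitted(self):
        return self.approvals == APPROVAL_ORDER

req = SpecialCategoryDataRequest(
    purpose="bias detection for credit model v3",
    data_elements=["ethnicity"],
    methodology="aggregate selection-rate comparison",
    retention_days=30,
)
req.approve("dpo_liaison")
req.approve("ai_governance_lead")
```

Encoding the sequence in code means processing cannot begin until both reviews are on record, turning a procedural control into a technical one.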
The right of access (Article 15) requires a training data provenance index mapping individual identifiers to dataset versions. The right to rectification (Article 16) may require dataset modification and model retraining depending on the correction's materiality. The right not to be subject to solely automated decision-making (Article 22) intersects with Article 14 human oversight requirements: if the human oversight documented in AISDP Module 7 constitutes meaningful involvement, processing may not be "solely" automated. Article 86 of the AI Act provides affected persons with a right to explanation of individual decision-making, complementing GDPR Recital 71.
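A training data provenance index can be as simple as an inverted mapping from individual identifier to dataset versions. This sketch assumes identifiers are known when each dataset version is registered; the class and version labels are illustrative:

```python
from collections import defaultdict

class ProvenanceIndex:
    """Maps individual identifiers to the dataset versions containing their
    data, so an Article 15 access request can be answered without rescanning
    raw training data."""

    def __init__(self):
        self._index = defaultdict(set)

    def register(self, dataset_version, individual_ids):
        for i in individual_ids:
            self._index[i].add(dataset_version)

    def access_request(self, individual_id):
        """Return every dataset version holding this individual's data."""
        return sorted(self._index.get(individual_id, set()))

idx = ProvenanceIndex()
idx.register("train-2024-03-v1", ["subj-17", "subj-42"])
idx.register("train-2024-09-v2", ["subj-42"])
```

The same index answers the Article 16 question of which dataset versions, and hence which model versions, a rectification would touch.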
DPIA and FRIA coordination is required because GDPR Article 35 mandates a Data Protection Impact Assessment while AI Act Article 27 requires a Fundamental Rights Impact Assessment. These are distinct but overlapping; findings from each should inform the other. The data retention tension between GDPR's storage limitation principle and Article 18's ten-year AISDP retention obligation is reconciled by retaining documentation (metadata, provenance records, quality metrics, distributional statistics) after the underlying personal data has been deleted or anonymised. Data architecture must be designed so that compliance-relevant information about training data can survive the deletion of the individual records it describes.
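The reconciliation pattern can be sketched as deriving a retention-safe snapshot before deletion; the feature name and choice of statistics are illustrative:

```python
import datetime
import statistics

def compliance_snapshot(dataset_version, records, feature="age"):
    """Derive retention-safe documentation (counts, distributional statistics)
    from a dataset before the underlying personal data is deleted; the
    snapshot contains no individual-level values."""
    values = [r[feature] for r in records if r.get(feature) is not None]
    return {
        "dataset_version": dataset_version,
        "row_count": len(records),
        f"{feature}_mean": statistics.mean(values),
        f"{feature}_stdev": statistics.pstdev(values),
        "snapshot_at": datetime.date.today().isoformat(),
    }

records = [{"age": 30}, {"age": 40}, {"age": 50}]
snap = compliance_snapshot("train-2024-03-v1", records)
# `snap` is retained in the AISDP for the Article 18 period;
# `records` can now be deleted under GDPR storage limitation.
```

Because the snapshot is keyed to the immutable dataset version identifier, the AISDP can still evidence distributional properties of data that no longer exists.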
Knowledge base governance applies Article 10 requirements adapted for inference-time retrieval. Documentation covers composition (document count, types, source distribution, temporal coverage), completeness against the deployment context, currency (staleness thresholds for time-sensitive domains), provenance for each document, and bias assessment. A medical decision-support RAG system whose knowledge base covers only English-language US guidelines will produce systematically different responses for EU patients where national clinical guidelines differ.
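Currency checks against staleness thresholds might be implemented as a scheduled job along these lines; the document types, threshold values, and field names are illustrative:

```python
import datetime

# Illustrative per-type staleness thresholds, in days.
STALENESS_THRESHOLDS = {"clinical_guideline": 365, "drug_label": 90}

def stale_documents(docs, today):
    """Flag knowledge base documents whose age since last review exceeds
    the threshold defined for their document type."""
    flagged = []
    for d in docs:
        limit = STALENESS_THRESHOLDS.get(d["type"])
        age_days = (today - d["last_reviewed"]).days
        if limit is not None and age_days > limit:
            flagged.append(d["id"])
    return flagged

today = datetime.date(2025, 1, 1)
docs = [
    {"id": "doc-1", "type": "clinical_guideline",
     "last_reviewed": datetime.date(2023, 6, 1)},
    {"id": "doc-2", "type": "drug_label",
     "last_reviewed": datetime.date(2024, 11, 15)},
]
flagged = stale_documents(docs, today)  # doc-1 exceeds its 365-day threshold
```

Flagged documents feed a review queue; for time-sensitive domains the thresholds themselves should be justified in the AISDP rather than chosen by convention.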
Embedding model bias is a structural concern. Research consistently demonstrates that embedding models trained on broad web corpora encode societal biases that manifest as differential retrieval quality across demographic subgroups. Assessment combines intrinsic evaluation (WEAT and sentence-level extensions) and extrinsic evaluation (paired queries differing only in demographic markers). Multilingual performance varies significantly across models; evaluation should use language-specific retrieval benchmarks (MIRACL, MTEB) and domain-specific test queries. Embeddings as personal data require GDPR assessment where original text can be reconstructed through inversion attacks. Version control coordinates the embedding model version with the knowledge base index version; any model change triggers re-indexing since mismatched vector spaces degrade retrieval quality.
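The paired-query extrinsic check can be sketched as a top-k overlap comparison between two queries differing only in a demographic marker. The toy embedder below is purely for illustration; a real assessment would plug in the production embedding model in its place:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Deterministic toy embedder (character-sum hash buckets), for illustration
# only; substitute the production embedding model in a real assessment.
def toy_embed(text, dims=16):
    v = [0.0] * dims
    for tok in text.lower().split():
        v[sum(ord(c) for c in tok) % dims] += 1.0
    return v

def paired_query_overlap(embed, corpus, query_a, query_b, k=3):
    """Retrieve top-k documents for two queries that differ only in a
    demographic marker; low overlap suggests the marker alone steers retrieval."""
    def top_k(q):
        qv = embed(q)
        ranked = sorted(corpus, key=lambda d: cosine(embed(d), qv), reverse=True)
        return set(ranked[:k])
    return len(top_k(query_a) & top_k(query_b)) / k

corpus = ["chest pain guidance", "migraine treatment options",
          "fatigue differential diagnosis", "postpartum care pathway"]
overlap = paired_query_overlap(toy_embed, corpus,
                               "symptoms in women", "symptoms in men")
```

An overlap well below 1.0 across a battery of such pairs is evidence of differential retrieval quality; the pairs should be drawn from the deployment domain rather than generic templates.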
Where a supplier refuses adequate provenance disclosure, quality specifications, or audit rights, the AI System Assessor records the gap in the risk register. If compensating controls are insufficient, the data source must be replaced. Contractual liability allocation does not diminish regulatory obligations; the provider or deployer remains responsible under the AI Act regardless of data source. The contract should specify the supplier's liability for data quality breaches, the obligation to cooperate with regulatory investigations, and indemnification arrangements for losses arising from breach of data quality, provenance, or bias warranties.