Article 10(3) requires that training, validation, and testing datasets for high-risk AI systems be sufficiently representative and complete. This guide covers the three dimensions of data completeness, practical compensating controls for gaps, and how to implement automated validation pipelines that produce compliance evidence.
Article 10(3) mandates that training, validation, and testing datasets be "relevant, sufficiently representative, and to the best extent possible, free of errors and complete." Completeness under this provision operates across three distinct dimensions: feature completeness, population completeness, and temporal completeness. Each dimension addresses a different way in which incomplete data can compromise the reliability and fairness of a high-risk AI system.
Practitioners documenting compliance in the AI System Description Protocol (AISDP) must assess all three dimensions and record their findings in Module 4. Where gaps exist, the regulation expects compensating controls to be documented alongside a justification for why complete data was not achievable. The requirement recognises that perfect completeness is rarely possible, but demands a structured, evidenced approach to managing incompleteness.
Feature completeness requires that every feature the model's intended purpose logically demands is present and populated in the dataset. When required features are missing, the model is forced to rely on proxy variables, which may introduce bias or degrade the accuracy of predictions for certain groups.
The AISDP must document which features are available in the dataset, which features are missing, the reasons for their absence, and what compensating controls have been put in place. This documentation serves a dual purpose: it provides the technical team with a clear picture of data limitations, and it gives conformity assessors evidence that the provider has considered and addressed feature gaps rather than ignoring them.
Missing features are particularly problematic when they relate to protected characteristics or variables that correlate with them. A recruitment model missing direct educational attainment data might rely on postcode as a proxy, inadvertently encoding socioeconomic bias. The documentation requirement ensures these risks are surfaced and managed explicitly.
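A feature completeness assessment of this kind can be automated. The following is a minimal sketch, assuming a tabular dataset held as a dict of columns; the required-feature list and the null-rate threshold are illustrative assumptions, not values prescribed by the regulation.

```python
# Hypothetical sketch: audit a tabular dataset for feature completeness.
# REQUIRED_FEATURES and MAX_NULL_RATE are illustrative assumptions.

REQUIRED_FEATURES = ["age_band", "education_level", "region"]
MAX_NULL_RATE = 0.05  # assumed documentation threshold

def audit_feature_completeness(dataset: dict[str, list]) -> dict:
    """Report required features that are absent or heavily null."""
    report = {"missing": [], "null_heavy": {}}
    for feature in REQUIRED_FEATURES:
        if feature not in dataset:
            report["missing"].append(feature)
            continue
        values = dataset[feature]
        null_rate = sum(v is None for v in values) / len(values)
        if null_rate > MAX_NULL_RATE:
            report["null_heavy"][feature] = round(null_rate, 3)
    return report

data = {
    "age_band": ["25-34", "35-44", None, "45-54"],
    "region": ["DE", "FR", "ES", "IT"],
}
report = audit_feature_completeness(data)
print(report)  # "education_level" absent; "age_band" 25% null
```

A report like this can be attached to Module 4 as evidence that feature gaps were surfaced rather than silently tolerated.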
Population completeness requires that the dataset represent the full range of persons and groups on whom the AI system will operate. For systems deployed across the EU and EEA, the training data should reflect the demographic diversity of the deployment population, including all relevant subgroups.
Underrepresentation of specific subgroups degrades model performance for those subgroups and creates fairness risk. A system that performs well on average but poorly for a minority population fails the representativeness requirement of Article 10(3). The assessment must consider geographic, demographic, and contextual variation across the intended deployment scope.
The provider should compare the composition of the training dataset against known population distributions in the deployment context. Where gaps are identified, the AISDP records the nature and magnitude of the underrepresentation, the likely impact on model performance, and the compensating controls applied.
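The comparison against known population distributions can be sketched as a simple share check. The subgroup labels, reference shares, and the 50% tolerance below are assumptions chosen for illustration; a real assessment would justify its tolerance in the AISDP.

```python
# Illustrative sketch: compare training-set subgroup shares against a
# reference population distribution. Labels, shares, and tolerance are
# assumptions, not regulatory values.

def representation_gaps(sample_counts: dict[str, int],
                        reference_shares: dict[str, float],
                        tolerance: float = 0.5) -> dict[str, dict]:
    """Flag subgroups whose sample share falls below tolerance * reference share."""
    total = sum(sample_counts.values())
    gaps = {}
    for group, ref_share in reference_shares.items():
        sample_share = sample_counts.get(group, 0) / total
        if sample_share < tolerance * ref_share:
            gaps[group] = {"sample_share": round(sample_share, 3),
                           "reference_share": ref_share}
    return gaps

counts = {"group_a": 800, "group_b": 150, "group_c": 50}
reference = {"group_a": 0.60, "group_b": 0.25, "group_c": 0.15}
print(representation_gaps(counts, reference))  # only group_c is flagged
```

The returned structure records both the observed and expected shares, which maps directly onto the "nature and magnitude" documentation requirement.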
Training datasets should cover a sufficient time period to capture seasonal, cyclical, and trend variations relevant to the system's intended purpose. A model trained on a single year of data may fail to capture multi-year patterns, economic cycles, or evolving behavioural trends that affect its predictions.
Module 4 records the temporal coverage of the dataset and the assessment of whether that coverage is sufficient. The sufficiency assessment must be specific to the system's intended purpose: a fraud detection model may need several years of transaction history to capture emerging fraud patterns, while a document classification model may need data spanning regulatory changes that altered document structures.
Temporal gaps are especially significant when the deployment context is subject to external shocks or regime changes. If the training period does not include analogous events to those the system will encounter in production, the model's behaviour during such events is untested.
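A temporal coverage check can test both the overall span and the largest internal gap. The minimum span and maximum gap below are assumed parameters for illustration; the sufficiency thresholds must be justified per intended purpose, as described above.

```python
# Hedged sketch: check that a dataset's timestamps cover a minimum window
# and contain no oversized internal gaps. Thresholds are assumptions.

from datetime import date, timedelta

def assess_temporal_coverage(dates: list[date],
                             min_span: timedelta,
                             max_gap: timedelta) -> dict:
    ordered = sorted(dates)
    span = ordered[-1] - ordered[0]
    largest_gap = max(
        (b - a for a, b in zip(ordered, ordered[1:])),
        default=timedelta(0),
    )
    return {
        "span_ok": span >= min_span,
        "largest_gap_days": largest_gap.days,
        "gap_ok": largest_gap <= max_gap,
    }

stamps = [date(2021, 1, 1), date(2021, 7, 1), date(2022, 1, 1), date(2023, 1, 1)]
result = assess_temporal_coverage(stamps, min_span=timedelta(days=730),
                                  max_gap=timedelta(days=200))
print(result)  # span is sufficient, but a full year with no data fails the gap check
```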
Complete data is rarely achievable in practice, and Module 4 must record the compensating controls applied when gaps are identified. Four principal techniques address different types of completeness shortfall, each with its own documentation requirements and risk considerations.
Synthetic data augmentation addresses underrepresentation of specific subgroups by generating additional training examples. The Technical SME documents the augmentation methodology, including the generation algorithm, the validation of synthetic data against real data distributions, and the proportion of synthetic data in the final training set. Over-reliance on synthetic data introduces its own risks, as synthetic examples may not capture the full complexity of real-world data, so the AISDP must assess this trade-off explicitly.
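The bookkeeping side of augmentation, generating records and tracking the synthetic proportion for the AISDP, can be sketched as follows. The jitter-based generation, field names, and target count are illustrative assumptions; real augmentation would use a validated generation algorithm.

```python
# Illustrative sketch of synthetic augmentation bookkeeping: jittered copies
# of minority-group records, with the synthetic share recorded for the AISDP.
# The jitter method and field names are assumptions, not a mandated approach.

import random

def augment_minority(records: list[dict], group: str, target_count: int,
                     seed: int = 0) -> tuple[list[dict], float]:
    rng = random.Random(seed)
    minority = [r for r in records if r["group"] == group]
    synthetic = []
    while len(minority) + len(synthetic) < target_count:
        base = rng.choice(minority)
        new = dict(base)
        new["income"] = base["income"] * (1 + rng.uniform(-0.05, 0.05))  # small jitter
        new["synthetic"] = True  # provenance flag for documentation
        synthetic.append(new)
    augmented = records + synthetic
    synthetic_share = len(synthetic) / len(augmented)
    return augmented, synthetic_share

data = ([{"group": "a", "income": 30000.0}] * 8
        + [{"group": "b", "income": 28000.0}] * 2)
augmented, share = augment_minority(data, group="b", target_count=5)
print(f"synthetic share: {share:.2f}")
```

Flagging each synthetic record explicitly makes it trivial to report the synthetic proportion and to exclude synthetic data from evaluation sets.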
Transfer learning from related domains compensates for limited training data in the target domain. The Technical SME justifies the source domain's relevance to the target domain, and the performance degradation from domain shift must be measured and documented. This approach works best when the source and target domains share fundamental data structures even if the specific distributions differ.
Stratified sampling ensures that small subgroups are represented in validation and test sets in sufficient numbers to compute meaningful performance metrics. The Technical SME documents the sampling strategy, including the stratification variables and the minimum sample sizes required for statistical validity.
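A minimal stratified sampler with an enforced per-stratum minimum might look like this; the stratification variable and minimum size are illustrative assumptions.

```python
# Sketch under assumptions: draw a validation sample that guarantees a
# minimum count per stratum, failing loudly when a stratum is too small.

import random

def stratified_sample(records: list[dict], stratum_key: str,
                      min_per_stratum: int, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    strata: dict[str, list[dict]] = {}
    for r in records:
        strata.setdefault(r[stratum_key], []).append(r)
    sample = []
    for label, members in strata.items():
        if len(members) < min_per_stratum:
            raise ValueError(f"stratum {label!r} too small for valid metrics")
        sample.extend(rng.sample(members, min_per_stratum))
    return sample

pool = ([{"region": "north"}] * 40 + [{"region": "south"}] * 30
        + [{"region": "islands"}] * 12)
sample = stratified_sample(pool, "region", min_per_stratum=10)
print(len(sample))  # 10 records from each of the three strata
```

Raising rather than silently undersampling ensures the statistical-validity constraint is enforced, not just documented.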
Ensemble methods combine predictions from multiple models trained on overlapping but non-identical data subsets, improving robustness to completeness gaps. The Technical SME documents ensemble composition, the rationale for subset selection, and the combination logic used to aggregate predictions.
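The combination logic can be as simple as a majority vote over the member models' predictions. The sketch below uses precomputed prediction lists as stand-ins for trained models; the labels and outputs are illustrative.

```python
# Minimal sketch of ensemble combination logic: majority vote over the
# predictions of models trained on overlapping data subsets. The "models"
# here are stand-in prediction lists, not trained estimators.

from collections import Counter

def majority_vote(predictions: list[list[str]]) -> list[str]:
    """Combine per-model prediction lists into one ensemble prediction list."""
    combined = []
    for votes in zip(*predictions):
        label, _count = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined

# Three hypothetical models scoring the same four inputs
model_outputs = [
    ["approve", "reject", "approve", "approve"],
    ["approve", "reject", "reject",  "approve"],
    ["reject",  "reject", "approve", "approve"],
]
print(majority_vote(model_outputs))  # ['approve', 'reject', 'approve', 'approve']
```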
Automated data quality validation serves as the gate that prevents bad data from entering the training pipeline or corrupting production inference. Without automated checks, a single upstream change, such as a source system schema modification, a data provider's silent methodology change, or a pipeline bug introducing null values, can propagate through to the model, degrading performance and potentially introducing bias.
Validation should run at three points in the data pipeline. At ingestion, checks confirm that raw data meets structural and content expectations before it enters the system. After each transformation step, checks confirm the transformation produced the expected output. Before training, a final comprehensive check confirms the assembled dataset meets all quality standards. Each checkpoint enforces a different set of expectations appropriate to that stage.
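The three checkpoints can be wired together so that a failure at any stage halts the pipeline. The check functions and record structure below are illustrative assumptions; the point is that each stage enforces its own expectations.

```python
# Hedged sketch of the three validation checkpoints: ingestion, post-transform,
# and pre-training. Each stage raises on failure so bad data never propagates.
# Field names and thresholds are illustrative.

def check(condition: bool, message: str) -> None:
    if not condition:
        raise ValueError(f"validation failed: {message}")

def ingest(raw: list[dict]) -> list[dict]:
    check(all("amount" in r for r in raw), "ingestion: missing 'amount' field")
    return raw

def transform(rows: list[dict]) -> list[dict]:
    out = [{**r, "amount_eur": r["amount"] / 100} for r in rows]
    check(all(r["amount_eur"] >= 0 for r in out), "transform: negative amount")
    return out

def pre_training_gate(rows: list[dict]) -> list[dict]:
    check(len(rows) >= 3, "pre-training: dataset below minimum size")
    return rows

raw = [{"amount": 1250}, {"amount": 400}, {"amount": 999}]
dataset = pre_training_gate(transform(ingest(raw)))
print(f"{len(dataset)} rows passed all three checkpoints")
```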
Declarative validation frameworks allow practitioners to define expectations as code, organise them into expectation suites for specific datasets, and run those suites as part of the pipeline. An expectation might assert that a column has no null values, that values fall within a specified range, or that a distribution matches a reference baseline. When an expectation fails, the pipeline halts and produces a structured validation result documenting exactly which expectations failed and by how much.
The practical value of this approach is twofold. The expectation suites serve as executable documentation of data quality standards, meaning an assessor reviewing Module 4 can read the suite and understand exactly what checks were applied. The validation results serve as evidence that the checks actually ran and passed, or failed and were addressed, for each dataset version. Declarative frameworks generate structured reports summarising validation results, which serve as compliance evidence artefacts.
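To make the idea concrete, here is a deliberately minimal homegrown expectation suite; production teams would typically reach for an established framework such as Great Expectations rather than this sketch, and the expectation names and checks are illustrative only.

```python
# Minimal homegrown sketch of a declarative expectation suite. Each
# expectation is a (name, check) pair; the suite runner returns a
# structured result usable as a compliance artefact.

def expect_no_nulls(column: str):
    return (f"{column} has no nulls",
            lambda data: all(v is not None for v in data[column]))

def expect_between(column: str, lo: float, hi: float):
    return (f"{column} in [{lo}, {hi}]",
            lambda data: all(lo <= v <= hi for v in data[column]))

def run_suite(data: dict[str, list], suite: list) -> dict:
    """Run every expectation and return a structured validation result."""
    results = {name: checker(data) for name, checker in suite}
    results["success"] = all(results.values())
    return results

suite = [expect_no_nulls("age"), expect_between("age", 18, 99)]
batch = {"age": [25, 41, 67, 102]}
print(run_suite(batch, suite))  # the out-of-range value 102 fails the suite
```

The suite definition doubles as the executable documentation described above, and the returned dict is the per-batch evidence record.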
Schema validation catches structural problems in the data: renamed columns, changed data types, and unexpected formats. Lightweight schema validation tools for DataFrames use a decorator pattern, making it straightforward to add schema checks to existing data processing code. For SQL-based pipelines, built-in assertion tests covering uniqueness, null presence, accepted values, and referential relationships provide schema-level assurance.
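The decorator pattern mentioned above can be illustrated without any external library. The sketch below is loosely modelled on DataFrame schema tools such as pandera but works on plain dicts of columns so it stays self-contained; the schema itself is an assumption.

```python
# Sketch of the decorator pattern for schema validation, written against
# plain column dicts for self-containment. The schema is illustrative.

import functools

def validate_schema(schema: dict[str, type]):
    """Decorator: check column presence and value types before the function runs."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(data: dict[str, list], *args, **kwargs):
            missing = set(schema) - set(data)
            if missing:
                raise TypeError(f"missing columns: {sorted(missing)}")
            for col, expected in schema.items():
                if not all(isinstance(v, expected) for v in data[col]):
                    raise TypeError(f"column {col!r} is not all {expected.__name__}")
            return fn(data, *args, **kwargs)
        return wrapper
    return decorator

@validate_schema({"customer_id": str, "balance": float})
def mean_balance(data: dict[str, list]) -> float:
    return sum(data["balance"]) / len(data["balance"])

good = {"customer_id": ["a1", "b2"], "balance": [100.0, 300.0]}
print(mean_balance(good))  # 200.0
```

A renamed or retyped column now fails at the function boundary with a precise error, rather than surfacing later as degraded model behaviour.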
Statistical validation catches distributional problems: a sudden shift in a feature's distribution, a change in the correlation structure between features, or an anomalous batch of records. The Kolmogorov-Smirnov test (for continuous features) and the chi-squared test (for categorical features) are standard tools. The Technical SME captures the reference distribution from a validated baseline dataset, updated periodically in alignment with the risk register review cycle. Data quality reporting tools generate distribution comparisons, correlation analyses, and drift metrics suitable for both automated pipeline gating and periodic human review.
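The two-sample Kolmogorov-Smirnov statistic is simply the maximum gap between the two empirical CDFs, which is small enough to sketch in pure Python; in practice `scipy.stats.ks_2samp` would also supply a p-value. The drift threshold below is an assumed pipeline setting, not a statistical standard.

```python
# Illustrative pure-Python two-sample KS statistic for drift gating.
# The DRIFT_THRESHOLD is an assumed pipeline parameter.

import bisect

def ks_statistic(sample_a: list[float], sample_b: list[float]) -> float:
    """Maximum absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_vals: list[float], x: float) -> float:
        return bisect.bisect_right(sorted_vals, x) / len(sorted_vals)

    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

baseline = [1.0, 2.0, 3.0, 4.0, 5.0]
shifted = [3.0, 4.0, 5.0, 6.0, 7.0]
d = ks_statistic(baseline, shifted)
DRIFT_THRESHOLD = 0.3  # assumed gating threshold
print(f"KS statistic: {d:.2f}")
print("drift detected" if d > DRIFT_THRESHOLD else "within baseline")
```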
Anomaly detection catches individual-record problems within otherwise normal batches: records with extreme values, records that are statistical outliers across multiple features simultaneously, or records that appear to be duplicates or corrupted entries. Isolation forests and z-score methods are standard approaches. Automated profiling tools monitor data over time and alert on records or batches that deviate from learned patterns, providing ongoing assurance between formal review cycles.
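A z-score check for individual records is straightforward with the standard library. The cutoff is a common convention but is treated here as an assumed pipeline parameter, and the batch values are illustrative.

```python
# Sketch: flag records whose z-score exceeds a cutoff. The cutoff value
# is an assumed pipeline parameter, not a fixed standard.

from statistics import mean, pstdev

def flag_outliers(values: list[float], z_cutoff: float = 3.0) -> list[int]:
    """Return indices of values more than z_cutoff standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > z_cutoff]

batch = [10.1, 9.8, 10.3, 9.9, 10.0, 57.2]  # one corrupted record
print(flag_outliers(batch, z_cutoff=2.0))  # [5]
```

For multi-feature outliers, an isolation forest (e.g. scikit-learn's `IsolationForest`) generalises this idea beyond a single dimension.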
The automated validation pipeline is particularly important for third-party data sources where the provider has limited control over data production processes. Contractual quality specifications establish the expected standard for each delivery, and the automated validation confirms that each delivery meets those specifications before the data is ingested.
Deliveries that fail validation are quarantined rather than ingested, and flagged for investigation by the data engineering team. This quarantine mechanism prevents potentially corrupted or non-conformant data from entering the training pipeline while the root cause is identified and resolved.
The Technical SME documents the quarantine mechanism and the investigation process in the AISDP, providing assessors with evidence that the provider has a systematic approach to managing data quality risks from external sources. This documentation should include the escalation path, the criteria for releasing quarantined data, and the process for notifying the data supplier of quality failures.
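The quarantine routing itself can be sketched as a simple gate: deliveries failing the contractual checks are diverted with a record of which specifications they violated. The delivery structure and the two contractual thresholds are illustrative assumptions.

```python
# Hedged sketch of a quarantine gate for third-party deliveries. The
# delivery fields and contractual thresholds are illustrative assumptions.

def validate_delivery(delivery: dict) -> list[str]:
    """Check a delivery against its contractual quality specifications."""
    failures = []
    if delivery.get("row_count", 0) < 100:     # assumed contractual minimum
        failures.append("row_count below contractual minimum")
    if delivery.get("null_rate", 1.0) > 0.02:  # assumed contractual maximum
        failures.append("null_rate above contractual maximum")
    return failures

def route_delivery(delivery: dict, ingested: list, quarantine: list) -> None:
    """Ingest a conformant delivery; quarantine a failing one with its failures."""
    failures = validate_delivery(delivery)
    if failures:
        quarantine.append({"delivery": delivery["id"], "failures": failures})
    else:
        ingested.append(delivery["id"])

ingested, quarantine = [], []
route_delivery({"id": "d-001", "row_count": 5000, "null_rate": 0.01},
               ingested, quarantine)
route_delivery({"id": "d-002", "row_count": 80, "null_rate": 0.10},
               ingested, quarantine)
print(ingested, quarantine)
```

The quarantine records, listing exactly which specifications each failed delivery violated, are the artefacts the investigation and supplier-notification processes would consume.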
Feature completeness (all logically required variables present), population completeness (all affected groups represented), and temporal completeness (sufficient time period coverage). Each must be assessed and documented in Module 4 of the AISDP.
Document the generation algorithm, validate synthetic data against real distributions, and record the proportion of synthetic data in the training set. The AISDP must explicitly assess the risk that synthetic data may not capture real-world complexity.
Expectation suites serve as executable documentation of quality standards, and validation results provide evidence that checks ran and passed or that failures were addressed. Declarative frameworks generate structured reports suitable as compliance artefacts.
Datasets must represent the full demographic diversity of the deployment population, with underrepresentation documented and addressed.
Synthetic data augmentation, transfer learning, stratified sampling, and ensemble methods each address different types of completeness shortfall.
Validation runs at three pipeline checkpoints using declarative expectation suites that serve as executable documentation and compliance evidence.
Schema validation catches structural problems, statistical validation catches distributional shifts, and anomaly detection catches individual-record issues.