Article 10 of the EU AI Act requires that every dataset used in the lifecycle of a high-risk AI system be documented comprehensively. This guide covers the six documentation categories, the Datasheets for Datasets framework, and proportionality principles for documentation depth.
Every dataset used across the AI system lifecycle requires structured documentation covering six categories. This applies to training, validation, testing, calibration, and fine-tuning datasets, with the Technical SME responsible for recording the information in each case.
The six documentation categories are provenance, composition, preparation, quality, annotation, and known limitations. Together, these categories establish whether a dataset is fit for purpose under Data Governance and Dataset Documentation requirements. Each category addresses a distinct dimension of data suitability that an assessor must be able to evaluate independently.
Provenance documentation records where the data originated, how it was collected, and the legal basis for its use. If informed consent was obtained, that must be recorded. If another lawful basis under GDPR Article 6 applies, that basis must be specified.
Where data was licensed from a third party, the licensing terms must be documented along with confirmation that those terms permit the intended use. Provenance documentation must be specific: "data collected from deployer ATS systems between January 2021 and December 2023 under data processing agreements" is acceptable. "Data from various sources" is not.
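The level of specificity described above can be captured in a structured record. The following is a minimal sketch; the field names and the `ProvenanceRecord` class are illustrative assumptions, not a mandated schema.

```python
from dataclasses import dataclass, asdict

# Hypothetical provenance record showing the specificity Article 10 expects;
# every field names a concrete source, window, or legal basis.
@dataclass
class ProvenanceRecord:
    source: str              # concrete origin, never "various sources"
    collection_method: str
    period_start: str        # ISO dates bounding the collection window
    period_end: str
    lawful_basis: str        # GDPR Article 6 basis, or informed consent
    licence_terms: str       # third-party licensing terms, if any

record = ProvenanceRecord(
    source="deployer ATS systems",
    collection_method="automated export under data processing agreements",
    period_start="2021-01-01",
    period_end="2023-12-31",
    lawful_basis="GDPR Art. 6(1)(f) legitimate interests",
    licence_terms="n/a (first-party data)",
)
print(asdict(record)["lawful_basis"])
```

A record like this can be serialised into the datasheet, leaving no field where "various sources" could hide.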
Composition documentation records the size of the dataset, including record count, feature count, and storage size. It must state the temporal coverage and describe the geographic and demographic distribution of the data.
Where protected characteristics are represented, the documentation must state the proportions relative to the deployment population. The Technical SME presents composition statistics both in aggregate and disaggregated by relevant subgroups, enabling assessors to evaluate representativeness.
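Disaggregated composition statistics are straightforward to compute. Below is a minimal stdlib sketch; the records and the "gender" field are toy data standing in for a real dataset with protected characteristics.

```python
from collections import Counter

# Toy records standing in for a training dataset; "gender" is a hypothetical
# protected characteristic used only to illustrate disaggregation.
records = [
    {"gender": "female", "region": "DE"},
    {"gender": "male", "region": "DE"},
    {"gender": "female", "region": "FR"},
    {"gender": "male", "region": "FR"},
    {"gender": "male", "region": "FR"},
]

def subgroup_proportions(rows, field):
    """Share of records per subgroup, for comparison against the deployment population."""
    counts = Counter(r[field] for r in rows)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

print(subgroup_proportions(records, "gender"))
```

The resulting proportions (here 0.4 female, 0.6 male) are what the assessor compares against the deployment population's distribution.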
Data preparation documentation captures every preprocessing, cleaning, transformation, augmentation, and feature engineering step applied to the dataset. Where records were removed, the documentation must state the reason and the number affected.
The handling of missing values must be described, including any imputation methods applied and the assumptions those methods encode. This preparation record creates the trail an assessor needs to understand how raw data became the dataset the model was trained on.
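One way to build that trail is to log each preparation step as it runs. The sketch below is illustrative; the `log_step` helper and the counts are hypothetical, not a prescribed mechanism.

```python
# Minimal sketch of a preparation trail: each step records what was done,
# how many records were removed, and why.
prep_log = []

def log_step(description, records_before, records_after, reason=None):
    prep_log.append({
        "step": description,
        "removed": records_before - records_after,
        "reason": reason,
    })

rows = list(range(1000))   # stand-in dataset of 1,000 raw records
deduped = rows[:970]       # pretend 30 exact duplicates were dropped
log_step("deduplication", len(rows), len(deduped),
         reason="exact duplicate rows")

complete = deduped[:940]   # pretend 30 rows lacked the outcome label
log_step("drop records missing mandatory fields", len(deduped), len(complete),
         reason="missing outcome label; imputation not appropriate for labels")

for entry in prep_log:
    print(entry["step"], "removed:", entry["removed"])
```

Appending the log to the datasheet gives the assessor exactly the raw-to-trained lineage described above.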
Quality documentation records the metrics applied, the error rates observed, and how errors were detected and corrected. If automated quality checks were used, the rules they enforced must be described.
For datasets involving human annotation, the documentation must cover annotator qualifications, annotation guidelines, inter-annotator agreement rates, and how disagreements were resolved. Fair compensation and working conditions for annotators must also be documented. Annotation quality directly affects label accuracy, which directly affects model fairness and performance.
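Inter-annotator agreement is commonly reported as Cohen's kappa, which corrects raw agreement for chance. A minimal stdlib sketch, assuming two annotators labelling the same items (the labels are toy data):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' labels on the same items."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: probability both annotators pick the same label at random.
    expected = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

ann1 = ["pass", "pass", "fail", "pass", "fail", "fail"]
ann2 = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(ann1, ann2), 3))  # → 0.333
```

Reporting kappa alongside raw agreement shows the assessor how much agreement exceeds what chance alone would produce.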
Dataset documentation must identify known gaps, biases, and limitations. This includes subgroup underrepresentation, temporal biases such as data collected during a period of unusual economic conditions, and geographic biases such as data collected predominantly from certain member states.
Disclosing limitations is not a weakness in the documentation; it is a requirement. An assessor who discovers undisclosed limitations will question the entire documentation package. Transparent limitation disclosure allows compensating controls to be evaluated on their merits.
The Datasheets for Datasets framework (Gebru et al., 2021) provides the most thorough structure for organising dataset documentation. It covers seven sections: motivation, composition, collection process, preprocessing and cleaning, uses, distribution, and maintenance.
For EU AI Act compliance, several sections require more depth than a standard datasheet provides. The composition section must include distributional analysis across protected characteristics to feed the bias assessment. The collection process section must document the GDPR lawful basis for processing. The preprocessing section must align with Data Lineage and Traceability requirements. The uses section must explicitly state limitations relevant to the system's intended purpose.
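The seven datasheet sections and the AI Act extensions described above can be kept as a simple checklist. The structure below is a sketch of one way to organise it, not a mandated format.

```python
# The seven Datasheets for Datasets sections, with illustrative EU AI Act
# extensions attached to the sections that need extra depth.
datasheet_sections = {
    "motivation": [],
    "composition": ["distributional analysis across protected characteristics"],
    "collection process": ["GDPR Article 6 lawful basis for processing"],
    "preprocessing and cleaning": ["alignment with data lineage records"],
    "uses": ["limitations relative to the system's intended purpose"],
    "distribution": [],
    "maintenance": [],
}

for section, extensions in datasheet_sections.items():
    marker = " [AI Act extension]" if extensions else ""
    print(f"{section}{marker}")
```

A checklist like this makes it easy to verify that every extended section was actually filled in before the datasheet enters the evidence package.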
Dataset documentation must reflect the data as it actually exists, not as it existed when first documented. A dataset version change, whether from new records, modified features, or updated quality rules, should trigger a corresponding documentation update.
Data catalogue tools such as OpenMetadata and DataHub support attaching structured documentation to dataset versions, with change tracking and notification when documentation falls out of date. For organisations that prefer lighter tooling, a Markdown file co-located with the dataset in a data versioning system such as DVC or Delta Lake provides version-controlled documentation that evolves alongside the data.
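The lighter-tooling option can be as simple as writing the datasheet next to the data so the same versioning system tracks both. A minimal sketch, with an illustrative directory layout and datasheet content:

```python
import tempfile
from pathlib import Path

# Co-locate a Markdown datasheet with the dataset directory so a data
# versioning system (e.g. DVC or Delta Lake) versions both together.
# The dataset name, counts, and headings here are illustrative assumptions.
dataset_dir = Path(tempfile.mkdtemp()) / "applicants_v3"
dataset_dir.mkdir(parents=True)

datasheet = """# Datasheet: applicants_v3

## Composition
Records: 128,400; features: 42; temporal coverage: 2021-01 to 2023-12.

## Known limitations
Underrepresents applicants from smaller member states.
"""
path = dataset_dir / "DATASHEET.md"
path.write_text(datasheet)
print(path.name)
```

Because the datasheet lives inside the versioned dataset directory, any dataset version change naturally prompts a matching documentation commit.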
Documentation depth should be proportionate to the dataset's role in the system. Training datasets for high-risk AI systems warrant comprehensive datasheets; static reference datasets warrant lighter treatment.
The risk of over-documenting is real: a 50-page datasheet for a simple lookup table adds cost without compliance value. The AI System Assessor must document the standard applied to each dataset category and the rationale for the proportionality decision. This proportionality assessment itself becomes part of the Conformity Assessment Process evidence package.