Article 10 of the EU AI Act requires that every dataset used in the lifecycle of a high-risk AI system be documented comprehensively. This guide covers the six documentation categories, the Datasheets for Datasets framework, and proportionality principles for documentation depth.
Every dataset used across the AI system lifecycle requires structured documentation covering six categories. This applies to training, validation, testing, calibration, and fine-tuning datasets, with the Technical SME responsible for recording the information in each case.
The six documentation categories are provenance, composition, preparation, quality, annotation, and known limitations. Together, these categories establish whether a dataset is fit for purpose under Data Governance and Dataset Documentation requirements. Each category addresses a distinct dimension of data suitability that an assessor must be able to evaluate independently.
Provenance documentation records where the data originated, how it was collected, and the legal basis for its use. If informed consent was obtained, that must be recorded. If another lawful basis under GDPR Article 6 applies, that basis must be specified.
Where data was licensed from a third party, the licensing terms must be documented along with confirmation that those terms permit the intended use. Provenance documentation must be specific: "data collected from deployer ATS systems between January 2021 and December 2023 under data processing agreements" is acceptable. "Data from various sources" is not.
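The level of specificity described above can be captured in a structured record. The following is a minimal sketch; the field names and the `ProvenanceRecord` class are illustrative assumptions, not a mandated schema.

```python
from dataclasses import dataclass, asdict

# Hypothetical provenance record showing the specificity Article 10 expects;
# every field names a concrete source, window, or legal basis.
@dataclass
class ProvenanceRecord:
    source: str              # concrete origin, never "various sources"
    collection_method: str
    period_start: str        # ISO dates bounding the collection window
    period_end: str
    lawful_basis: str        # GDPR Article 6 basis, or informed consent
    licence_terms: str       # third-party licensing terms, if any

record = ProvenanceRecord(
    source="deployer ATS systems",
    collection_method="automated export under data processing agreements",
    period_start="2021-01-01",
    period_end="2023-12-31",
    lawful_basis="GDPR Art. 6(1)(f) legitimate interests",
    licence_terms="n/a (first-party data)",
)
print(asdict(record)["lawful_basis"])
```

A record like this can be serialised into the datasheet, leaving no field where "various sources" could hide.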
Composition documentation records the size of the dataset, including record count, feature count, and storage size. It must state the temporal coverage and describe the geographic and demographic distribution of the data.
Where protected characteristics are represented, the documentation must state the proportions relative to the deployment population. The Technical SME presents composition statistics both in aggregate and disaggregated by relevant subgroups, enabling assessors to evaluate representativeness.
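Disaggregated composition statistics are straightforward to compute. Below is a minimal stdlib sketch; the records and the "gender" field are toy data standing in for a real dataset with protected characteristics.

```python
from collections import Counter

# Toy records standing in for a training dataset; "gender" is a hypothetical
# protected characteristic used only to illustrate disaggregation.
records = [
    {"gender": "female", "region": "DE"},
    {"gender": "male", "region": "DE"},
    {"gender": "female", "region": "FR"},
    {"gender": "male", "region": "FR"},
    {"gender": "male", "region": "FR"},
]

def subgroup_proportions(rows, field):
    """Share of records per subgroup, for comparison against the deployment population."""
    counts = Counter(r[field] for r in rows)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

print(subgroup_proportions(records, "gender"))
```

The resulting proportions (here 0.4 female, 0.6 male) are what the assessor compares against the deployment population's distribution.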
Data preparation documentation captures every preprocessing, cleaning, transformation, augmentation, and feature engineering step applied to the dataset. Where records were removed, the documentation must state the reason and the number affected.
The handling of missing values must be described, including any imputation methods applied and the assumptions those methods encode. This preparation record creates the trail an assessor needs to understand how raw data became the dataset the model was trained on.
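One way to build that trail is to log each preparation step as it runs. The sketch below is illustrative; the `log_step` helper and the counts are hypothetical, not a prescribed mechanism.

```python
# Minimal sketch of a preparation trail: each step records what was done,
# how many records were removed, and why.
prep_log = []

def log_step(description, records_before, records_after, reason=None):
    prep_log.append({
        "step": description,
        "removed": records_before - records_after,
        "reason": reason,
    })

rows = list(range(1000))   # stand-in dataset of 1,000 raw records
deduped = rows[:970]       # pretend 30 exact duplicates were dropped
log_step("deduplication", len(rows), len(deduped),
         reason="exact duplicate rows")

complete = deduped[:940]   # pretend 30 rows lacked the outcome label
log_step("drop records missing mandatory fields", len(deduped), len(complete),
         reason="missing outcome label; imputation not appropriate for labels")

for entry in prep_log:
    print(entry["step"], "removed:", entry["removed"])
```

Appending the log to the datasheet gives the assessor exactly the raw-to-trained lineage described above.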
Quality documentation records the metrics applied, the error rates observed, and how errors were detected and corrected. If automated quality checks were used, the rules they enforced must be described.
For datasets involving human annotation, the documentation must cover annotator qualifications, annotation guidelines, inter-annotator agreement rates, and how disagreements were resolved. Fair compensation and working conditions for annotators must also be documented. Annotation quality directly affects label accuracy, which directly affects model fairness and performance.
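Inter-annotator agreement is commonly reported as Cohen's kappa, which corrects raw agreement for chance. A minimal stdlib sketch, assuming two annotators labelling the same items (the labels are toy data):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' labels on the same items."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: probability both annotators pick the same label at random.
    expected = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

ann1 = ["pass", "pass", "fail", "pass", "fail", "fail"]
ann2 = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(ann1, ann2), 3))  # → 0.333
```

Reporting kappa alongside raw agreement shows the assessor how much agreement exceeds what chance alone would produce.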
Dataset documentation must identify known gaps, biases, and limitations. This includes subgroup underrepresentation, temporal biases such as data collected during a period of unusual economic conditions, and geographic biases such as data collected predominantly from certain member states.
Disclosing limitations is not a weakness in the documentation; it is a requirement. An assessor who discovers undisclosed limitations will question the entire documentation package. Transparent limitation disclosure allows compensating controls to be evaluated on their merits.
The Datasheets for Datasets framework (Gebru et al., 2021) provides the most thorough structure for organising dataset documentation. It covers seven sections: motivation, composition, collection process, preprocessing and cleaning, uses, distribution, and maintenance.
For EU AI Act compliance, several sections require more depth than a standard datasheet provides. The composition section must include distributional analysis across protected characteristics to feed the bias assessment. The collection process section must document the GDPR lawful basis for processing. The preprocessing section must align with Data Lineage and Traceability requirements. The uses section must explicitly state limitations relevant to the system's intended purpose.
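The seven datasheet sections and the AI Act extensions described above can be kept as a simple checklist. The structure below is a sketch of one way to organise it, not a mandated format.

```python
# The seven Datasheets for Datasets sections, with illustrative EU AI Act
# extensions attached to the sections that need extra depth.
datasheet_sections = {
    "motivation": [],
    "composition": ["distributional analysis across protected characteristics"],
    "collection process": ["GDPR Article 6 lawful basis for processing"],
    "preprocessing and cleaning": ["alignment with data lineage records"],
    "uses": ["limitations relative to the system's intended purpose"],
    "distribution": [],
    "maintenance": [],
}

for section, extensions in datasheet_sections.items():
    marker = " [AI Act extension]" if extensions else ""
    print(f"{section}{marker}")
```

A checklist like this makes it easy to verify that every extended section was actually filled in before the datasheet enters the evidence package.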
Dataset documentation must reflect the data as it actually exists, not as it existed when first documented. A dataset version change, whether from new records, modified features, or updated quality rules, should trigger a corresponding documentation update.
Data catalogue tools such as OpenMetadata and DataHub support attaching structured documentation to dataset versions, with change tracking and notification when documentation falls out of date. For organisations that prefer lighter tooling, a Markdown file co-located with the dataset in a data versioning system such as DVC or Delta Lake provides version-controlled documentation that evolves alongside the data.
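The lighter-tooling option can be as simple as writing the datasheet next to the data so the same versioning system tracks both. A minimal sketch, with an illustrative directory layout and datasheet content:

```python
import tempfile
from pathlib import Path

# Co-locate a Markdown datasheet with the dataset directory so a data
# versioning system (e.g. DVC or Delta Lake) versions both together.
# The dataset name, counts, and headings here are illustrative assumptions.
dataset_dir = Path(tempfile.mkdtemp()) / "applicants_v3"
dataset_dir.mkdir(parents=True)

datasheet = """# Datasheet: applicants_v3

## Composition
Records: 128,400; features: 42; temporal coverage: 2021-01 to 2023-12.

## Known limitations
Underrepresents applicants from smaller member states.
"""
path = dataset_dir / "DATASHEET.md"
path.write_text(datasheet)
print(path.name)
```

Because the datasheet lives inside the versioned dataset directory, any dataset version change naturally prompts a matching documentation commit.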
Documentation depth should be proportionate to the dataset's role in the system. Training datasets for high-risk AI systems warrant comprehensive datasheets; static reference datasets warrant lighter treatment.
The risk of over-documenting is real: a 50-page datasheet for a simple lookup table adds cost without compliance value. The AI System Assessor must document the standard applied to each dataset category and the rationale for the proportionality decision. This proportionality assessment itself becomes part of the Conformity Assessment Process evidence package.