Article 10 of the EU AI Act requires data governance for training, validation, and testing datasets regardless of whether the data was collected first-hand or acquired from a third party. Organisations must extend their governance frameworks to cover every external data source with binding contractual provisions and independent technical verification.
Organisations deploying high-risk AI systems frequently rely on data they did not collect, curate, or control, yet they bear full compliance responsibility for that data under Article 10. Training datasets may be licensed from commercial data brokers. Pre-trained models may have been trained on web-scale corpora assembled by a general-purpose AI (GPAI) provider. Feature enrichment services may supply demographic, firmographic, or behavioural data from external sources.
In each of these scenarios, Article 10's data governance requirements apply to the training, validation, and testing datasets regardless of whether the data was collected first-hand or acquired from a third party. The organisation cannot delegate its regulatory obligations to a supplier, even where the supplier bears contractual responsibility for data quality. This creates a governance challenge: the organisation must satisfy the same data governance standards for externally sourced data as for data it collects directly, despite having limited visibility into the data's origins, collection methodology, and processing history.
Addressing this challenge requires a structured approach spanning contractual provisions, technical validation controls, liability arrangements, documentation requirements, and ongoing monitoring for silent changes in supplier practices. Data Governance and Management covers the full data governance framework within which third-party governance operates.
The data governance framework must extend beyond the organisation's own data operations to address every third-party data source through binding contractual requirements. Each data supplier relationship should be governed by provisions addressing provenance, quality, bias, change management, and audit access.
Provenance disclosure requires the supplier to reveal the data's original collection methodology, the lawful basis under which the data was collected (consent, legitimate interest, public interest, or other), the populations and geographies represented, any known limitations or biases in coverage, and any prior processing or filtering applied. Without this provenance information, the organisation cannot assess the data's suitability for training a high-risk AI system under Article 10(3), which requires that datasets be relevant, sufficiently representative, and to the best extent possible free of errors and complete.
Data quality specifications define measurable standards against which incoming data is validated. Contracts should set completeness thresholds specifying the maximum acceptable proportion of missing values per field, accuracy guarantees specifying error rate bounds verified through the supplier's quality assurance processes, timeliness requirements specifying the maximum age of records and update frequency, and consistency specifications covering format standards, schema compliance, and referential integrity. These specifications become the baseline against which incoming data is validated. Deliveries that fail to meet the specifications are rejected or flagged for remediation by the Technical SME before the data enters the training pipeline.
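As an illustration, the contractual quality specifications described above can be encoded as a machine-readable object and enforced on each incoming delivery. This is a minimal sketch under assumptions: the field names, thresholds, and record layout are hypothetical examples, not contract terms.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class QualitySpec:
    max_missing_rate: float    # completeness: max share of missing values per field
    max_record_age: timedelta  # timeliness: maximum acceptable age of any record
    required_fields: tuple     # consistency: fields every record must carry

def check_delivery(records: list, spec: QualitySpec, now: datetime) -> list:
    """Return a list of specification breaches for one delivery (empty = pass)."""
    breaches = []
    for field in spec.required_fields:
        missing = sum(1 for r in records if r.get(field) in (None, ""))
        rate = missing / len(records)
        if rate > spec.max_missing_rate:
            breaches.append(
                f"{field}: missing rate {rate:.2%} exceeds {spec.max_missing_rate:.2%}"
            )
    oldest_allowed = now - spec.max_record_age
    stale = sum(1 for r in records if r["collected_at"] < oldest_allowed)
    if stale:
        breaches.append(f"{stale} records older than contracted maximum age")
    return breaches
```

Because the specification is data rather than hard-coded logic, the same check can be reused across suppliers with different contracted thresholds.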
Contractual audit rights enable the organisation to verify a supplier's provenance disclosures, quality assurance processes, and bias management practices through direct inspection rather than relying on self-reported claims. These rights should cover on-site or remote inspection of the supplier's data collection and processing infrastructure, review of quality assurance records, access to the supplier's own bias and representativeness assessments, and verification that the supplier's data handling practices comply with GDPR and any applicable sector-specific data protection requirements.
The frequency of audits should be proportionate to the data's risk profile: annual audits for suppliers of high-volume or high-sensitivity data, biennial assessments for lower-risk sources. The Internal Audit Assurance Lead documents and retains the audit findings as part of the AISDP evidence pack for Module 4, ensuring that the evidence trail connects the supplier's practices to the organisation's compliance obligations.
Where a supplier refuses to grant audit rights, the organisation should assess whether alternative assurance mechanisms are available. These may include independent third-party audits commissioned by the supplier, SOC 2 or ISO 27001 certifications covering the data processing operations, or regulatory compliance certificates from the supplier's supervisory authority. The absence of any assurance mechanism represents a material data governance gap that the AI System Assessor records in the risk register. Conformity Assessment Documentation addresses how audit evidence feeds into the broader conformity assessment process.
Every data delivery from a third-party source should pass through an automated intake validation pipeline before entering the training data store, regardless of the supplier's contractual warranties. Contractual warranties set the expected standard; independent verification confirms that the standard is met in practice.
The validation pipeline should verify schema compliance covering field names, data types, and value formats, along with completeness covering missing value rates per field against contracted thresholds. It should also perform range and distribution checks using statistical tests that compare each delivery's distribution against the historical baseline for the same source. Anomaly detection identifies records or batches that are statistically unusual, which may indicate collection errors, processing failures, or silent methodology changes by the supplier. The data engineering team extends the pipeline with supplier-specific expectations, encoding the contractual quality specifications as automated checks so that every delivery is measured against the agreed standards.
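The schema-compliance step described above can be sketched as a per-record check of field names and types. The schema shown is a hypothetical example, not a mandated layout.

```python
# Hypothetical example schema for a supplier delivery.
SCHEMA = {"customer_id": str, "age": int, "region": str}

def schema_violations(records):
    """Yield (record_index, message) for every record that breaks the schema."""
    for i, record in enumerate(records):
        extra = set(record) - set(SCHEMA)
        missing = set(SCHEMA) - set(record)
        if extra or missing:
            yield i, f"field mismatch: extra={sorted(extra)} missing={sorted(missing)}"
            continue  # type checks are meaningless if the fields are wrong
        for field, expected in SCHEMA.items():
            if not isinstance(record[field], expected):
                yield i, (
                    f"{field}: expected {expected.__name__}, "
                    f"got {type(record[field]).__name__}"
                )
```

In practice a dedicated validation framework would typically carry this step, but the logic reduces to the same name-and-type comparison.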
Deliveries that fail validation are quarantined: they sit in a holding area, a notification is sent to both the data engineering team and the supplier, and the data does not enter the training pipeline until the failure is resolved. The quarantine log records each failed delivery, the nature of the failure, and the resolution, and serves as a Module 4 evidence artefact demonstrating that the organisation actively enforces its data quality standards on external sources.
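The quarantine log itself can be as simple as an append-only record capturing the fields named above: the failed delivery, the nature of the failure, and the eventual resolution. A minimal sketch, with illustrative keys and paths:

```python
import json
from datetime import datetime, timezone

def log_quarantine(log_path, supplier, delivery_id, failures, resolution=None):
    """Append one quarantine event to a JSON-lines log and return the entry."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "supplier": supplier,
        "delivery_id": delivery_id,
        "failures": failures,      # e.g. the output of the intake checks
        "resolution": resolution,  # recorded once the failure is resolved
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

An append-only format preserves the evidence trail: resolved failures are logged as new entries rather than overwriting the original record.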
Beyond individual delivery validation, the organisation should periodically reassess whether the supplier's data remains suitable for the system's intended purpose. A dataset that was representative when initially licensed may become unrepresentative as the deployment population changes, as the supplier's collection methodology evolves, or as societal patterns shift. The Technical SME conducts this reassessment at least annually, or more frequently for high-sensitivity data sources. The reassessment should include updated representativeness analysis examining whether the data still reflects the deployment population, fairness impact testing assessing whether the current model's fairness profile changes when retrained on the latest supplier data, and comparison against any new data sources that have become available. This periodic review prevents the organisation from relying on stale vendor relationships when better options exist.
When third-party data causes a compliance failure, such as a fairness deficiency traceable to unrepresentative training data from an external vendor, the allocation of liability between the organisation and the supplier must be addressed contractually. The Legal and Regulatory Advisor is responsible for structuring these liability arrangements. Contracts should specify the supplier's liability for data quality breaches, including the remedy available to the organisation such as replacement data, reprocessing, or financial compensation.
The contract should also address the supplier's obligation to cooperate with regulatory investigations arising from data quality issues, and the indemnification arrangements for losses arising from the supplier's breach of data quality, provenance, or bias warranties. These provisions ensure that the organisation has contractual recourse when supplier data causes downstream compliance failures, though they serve a commercial rather than regulatory purpose.
However, contractual liability allocation does not diminish the organisation's own regulatory obligations. Under the AI Act, the provider or deployer remains responsible for compliance regardless of the data's source. Contractual remedies against the supplier serve as commercial protection, not a regulatory defence. This distinction is critical: an organisation cannot point to its supplier contract as a defence against regulatory enforcement when the data it used failed to meet Article 10 requirements.
AISDP Module 4 must document the third-party data governance framework with the same rigour as the first-party data governance framework. For each third-party data source, the AISDP must record the supplier identity and contractual reference, the data's purpose within the system (training, validation, testing, feature enrichment, or other), and the provenance information disclosed by the supplier.
The documentation must also capture the quality specifications and intake validation results, the representativeness assessment and any identified gaps, the audit rights and most recent audit findings, and the residual risk from any disclosure or quality gaps that could not be resolved. This documentation requirement ensures that the conformity assessment has a complete picture of the data's provenance chain, from original collection through to ingestion into the training pipeline.
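The per-source record the text calls for can be captured as a small structured object so that Module 4 entries stay consistent across suppliers. This is an illustrative serialisation under assumed field names, not a mandated format.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ThirdPartySourceRecord:
    """One Module 4 documentation entry for a third-party data source."""
    supplier: str
    contract_ref: str
    data_purpose: str              # training / validation / testing / enrichment
    provenance_disclosed: dict     # lawful basis, methodology, populations covered
    quality_spec: dict             # contracted thresholds
    latest_intake_results: dict    # outcome of the most recent delivery validation
    representativeness_gaps: list
    audit_rights: str
    latest_audit_findings: str
    residual_risks: list           # unresolved disclosure or quality gaps

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)
```

Serialising the record to JSON keeps it machine-checkable, so a completeness check over all entries can flag sources with empty provenance or audit fields before the conformity assessment.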
Gaps in supplier disclosure are recorded as non-conformities in the Non-Conformity Register and escalated to the vendor through the procurement function. Where the vendor refuses to disclose adequate information, the organisation must assess whether it can compensate through its own evaluation of the model's outputs, specifically testing for bias on representative data from the deployment context. A gap that cannot be compensated through independent evaluation is a compliance risk that the AI System Assessor reflects in the risk register (Module 6) and communicates to the AI Governance Lead for a residual risk acceptance decision.
The mitigation strategy for third-party data operates on three layers: contractual, technical, and ongoing monitoring. The contractual layer establishes the baseline through supplier agreements addressing the five domains described above: provenance disclosure, quality specifications, bias and representativeness warranties, change notification, and audit rights.
The technical layer validates every delivery regardless of contractual promises. The data engineering team extends the intake validation pipeline with a dedicated expectation suite for each supplier, encoding the contractual quality specifications as automated checks covering schema compliance, completeness thresholds, distributional consistency, and anomaly detection. Every delivery passes through this suite before the data enters the training pipeline; deliveries that fail are quarantined as described above, and the quarantine log serves as a Module 4 evidence artefact.
The ongoing layer monitors for silent changes. Suppliers sometimes change their data collection or processing practices without notification, even when contractually obligated to provide it. The technical defence is statistical monitoring of incoming deliveries, comparing each delivery's distributional profile against the historical baseline for that supplier. A sudden shift in the distribution of a key feature, a change in the proportion of missing values, or an unexpected change in demographic composition all signal a potential methodology change. When a silent change is detected, the supplier is notified, the delivery is quarantined pending impact assessment, and the unnotified change is treated as a breach of the contractual notification obligation. This ongoing monitoring prevents the organisation from continuing to rely on a supplier whose underlying data characteristics have shifted.
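One common way to operationalise this comparison is a Population Stability Index (PSI) over a feature's category proportions, measuring how far a delivery's composition has drifted from the supplier's historical baseline. A minimal sketch; the 0.2 alert threshold is a widely used rule of thumb, not a regulatory figure.

```python
import math

def psi(baseline: dict, delivery: dict, eps: float = 1e-6) -> float:
    """Population Stability Index between baseline and delivery category counts."""
    total_b = sum(baseline.values())
    total_d = sum(delivery.values())
    score = 0.0
    for category in set(baseline) | set(delivery):
        p = max(baseline.get(category, 0) / total_b, eps)  # baseline share
        q = max(delivery.get(category, 0) / total_d, eps)  # delivery share
        score += (q - p) * math.log(q / p)
    return score

def silent_change_suspected(baseline, delivery, threshold=0.2):
    """Flag a delivery whose composition drifts beyond the alert threshold."""
    return psi(baseline, delivery) > threshold
```

The same pattern extends to missing-value rates and binned numeric features, so one monitoring routine can cover all of the drift signals listed above.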
Bias and representativeness warranties require the supplier to provide demographic composition statistics for the dataset, to the extent that disclosure is lawful under data protection obligations, and to warrant that the data does not systematically underrepresent or overrepresent any population group in a manner that would introduce bias into a downstream model. Where the supplier cannot provide demographic composition data because the data was not collected or because disclosure would breach the supplier's own data protection obligations, the AI System Assessor records this gap. The organisation then compensates through its own representativeness testing of the data against the deployment population.
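The compensating representativeness test described above can be sketched as a comparison of group shares in the supplier dataset against the deployment population, flagging groups whose share deviates beyond a tolerance. Group labels and the 5-percentage-point tolerance are illustrative assumptions.

```python
def representativeness_gaps(dataset_counts, population_shares, tolerance=0.05):
    """Return groups whose dataset share deviates from the deployment
    population share by more than the tolerance."""
    total = sum(dataset_counts.values())
    gaps = {}
    for group, expected in population_shares.items():
        observed = dataset_counts.get(group, 0) / total
        if abs(observed - expected) > tolerance:
            gaps[group] = {"expected": expected, "observed": round(observed, 3)}
    return gaps
```

A non-empty result is exactly the kind of gap the AI System Assessor would record, alongside the supplier's refusal or inability to disclose demographic composition.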
Change notification provisions require the supplier to notify the organisation before making material changes to the data's collection methodology, scope, or processing. A supplier that silently changes its data collection practices can introduce distribution shifts that propagate through the training pipeline without awareness. The notification obligation should specify a minimum lead time, typically 30 to 90 days, and require a description of the change's expected impact on the data's composition and quality.
Where a supplier refuses to grant adequate provenance disclosure, quality specifications, or audit rights, the organisation faces a documentation gap in Module 4. The AI System Assessor records the gap in the risk register, and the organisation must assess whether compensating controls, including its own independent quality testing, representativeness analysis, and bias evaluation, are sufficient to satisfy Article 10. If they are not, the organisation replaces the data source with one that provides adequate transparency for compliance purposes.