Embedding models introduce a layer of learned representation between the knowledge base and the decision-making model, and this layer can encode biases, linguistic disparities, and privacy risks that are invisible to monitoring systems focused only on the primary model's inputs and outputs. In retrieval-augmented generation architectures, an embedding model converts both the knowledge base documents and the user query into dense vector representations; the system retrieves documents whose embeddings are closest to the query embedding, and the retrieved documents become the primary model's context for generating a response. In semantic search systems used for recruitment, legal research, or medical case matching, the embedding model's representation of "similarity" directly determines which candidates, precedents, or cases are surfaced to the decision-maker.
Despite their influence on the system's outputs, embedding models tend to escape governance scrutiny. They are neither the primary decision-making model, so model selection governance may overlook them, nor training data, so data governance processes may not capture them. Yet embedding models do appear in monitoring pipelines, where they power the output quality and prompt distribution monitoring required under post-market monitoring obligations. This governance gap creates a compliance risk that organisations must address systematically.
The knowledge base in a RAG architecture functions as the information source that directly shapes the system's outputs. The foundation model generates its response based on the documents retrieved from the knowledge base; if the knowledge base is incomplete, outdated, biased, or poorly curated, the system's outputs will reflect those deficiencies regardless of how well the model itself performs.
Whether the knowledge base constitutes "training, validation and testing data" within the meaning of Article 10 is an open legal question. In the conventional sense, the knowledge base is not used to train the model's parameters; it is used at inference time to condition the model's output. The recommended compliance approach is to apply Article 10's data governance requirements to the knowledge base, adapted to the specific characteristics of inference-time retrieval. Treating the knowledge base as ungoverned data, and discovering post-deployment that it introduced bias, inaccuracy, or incompleteness into a high-risk system's outputs, carries consequences materially worse than the cost of applying governance from the outset.
The Technical SME documents the knowledge base using the same categories required for training data, adapted as follows. Composition covers the number of documents, the document types (regulatory text, clinical guidelines, case law, product specifications, or other), the source distribution, and the temporal coverage. For knowledge bases updated on a rolling basis, the documentation specifies the update cadence, the document selection criteria, and the process for removing outdated material.
Completeness and representativeness require that the knowledge base is representative of the domain the system serves. A medical decision-support RAG system whose knowledge base covers only English-language guidelines from US institutions will produce systematically different, and potentially less appropriate, responses for patients in EU member states where national clinical guidelines differ. A legal research system whose knowledge base underrepresents case law from smaller member states will produce less reliable results for queries concerning those jurisdictions. The Technical SME assesses completeness against the system's intended deployment context and documents any known coverage gaps.
Currency is critical for knowledge bases in domains where information changes, such as medical guidelines, regulatory text, or financial data. The Technical SME defines a staleness threshold: the maximum acceptable age for documents in the knowledge base, which varies by domain. Documents that exceed the staleness threshold are flagged for review, update, or removal. The staleness monitoring process is documented in the post-market monitoring plan.
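The staleness check described above can be sketched as a simple validation step. The domain thresholds below are illustrative assumptions, not values prescribed by any regulation; the actual thresholds are set by the Technical SME per domain.

```python
from datetime import date, timedelta

# Illustrative staleness thresholds per document type (assumed values;
# the real thresholds are defined by the Technical SME for each domain).
STALENESS_THRESHOLDS = {
    "clinical_guideline": timedelta(days=365 * 2),
    "regulatory_text": timedelta(days=180),
    "financial_data": timedelta(days=30),
}

def flag_stale_documents(documents, today=None):
    """Return IDs of documents whose age exceeds their domain's
    staleness threshold, for review, update, or removal."""
    today = today or date.today()
    stale = []
    for doc in documents:
        threshold = STALENESS_THRESHOLDS.get(doc["doc_type"])
        if threshold and today - doc["published"] > threshold:
            stale.append(doc["id"])
    return stale
```

A check like this would typically run on the knowledge base update cadence, with flagged documents routed to the review queue documented in the post-market monitoring plan.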
Embedding models encode semantic associations from their training data into the geometry of the vector space, and research has consistently demonstrated that models trained on broad web corpora encode societal biases. These include associations between professions and gender, between names and ethnicity, and between geographic locations and socioeconomic status. These biases are structural features of the embedding space that directly affect retrieval behaviour.
In a high-risk AI system, embedding bias manifests as differential retrieval quality. A RAG-based recruitment system that uses biased embeddings may retrieve systematically different reference materials for candidates whose profiles contain markers associated with different demographic groups. A semantic search system for legal case matching may retrieve different precedents for cases involving individuals from different ethnic or socioeconomic backgrounds. These effects are subtle, difficult to detect through aggregate performance metrics, and fall squarely within the scope of Article 10(2)(f) on examination for possible biases and Article 10(2)(g) on identification of relevant data gaps and shortcomings.
The Technical SME assesses embedding bias through a combination of intrinsic evaluation, examining the embedding space directly for known bias patterns, and extrinsic evaluation, testing whether retrieval quality differs across demographic subgroups. Intrinsic evaluation methods include WEAT (the Word Embedding Association Test) and its sentence-level extensions, which measure the association between target concepts such as male and female names and attribute concepts such as career and family in the embedding space. Extrinsic evaluation is more directly relevant to compliance: it measures whether a fixed set of queries produces systematically different retrieval results when query characteristics vary across protected dimensions. The results are documented in Module 4 alongside the training data bias assessment.
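The WEAT effect size mentioned above can be sketched in a few lines of numpy. This is a minimal illustration over toy two-dimensional vectors; a real assessment would use the deployed embedding model's vectors for curated target and attribute word sets, plus a permutation test for significance.

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    """s(w, A, B): mean similarity of w to attribute set A minus to B."""
    return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """Cohen's-d-style WEAT effect size for target sets X, Y (e.g. male
    and female names) and attribute sets A, B (e.g. career and family)."""
    s_x = [association(x, A, B) for x in X]
    s_y = [association(y, A, B) for y in Y]
    pooled = np.std(s_x + s_y, ddof=1)
    return (np.mean(s_x) - np.mean(s_y)) / pooled
```

A large positive or negative effect size indicates that the two target sets are differentially associated with the attribute concepts in the embedding space.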
Most widely available embedding models perform best on English-language text. Models offering multilingual support vary in their performance across languages. For a high-risk AI system deployed across multiple EU member states, uneven embedding performance across languages could cause the system to retrieve more relevant information for queries in some languages than others.
A medical decision-support system that retrieves highly relevant clinical guidelines for queries in English but less relevant results for queries in Estonian or Maltese is providing a materially different quality of service to users in different member states. The Technical SME evaluates the embedding model's retrieval performance across all languages in which the system will operate.
The evaluation should use language-specific retrieval benchmarks (MIRACL for multilingual information retrieval, MTEB for multilingual text embedding evaluation) and domain-specific test queries in each language. Performance gaps exceeding a defined threshold, set by the AI Governance Lead in consultation with the Technical SME, should be documented as a known limitation in the AISDP and addressed through compensating controls. These controls may include language-specific fine-tuning of the embedding model, translation preprocessing for underperforming languages, or separate embedding models optimised for specific language families.
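The threshold comparison can be expressed as a small helper. The scores, reference language, and gap threshold below are illustrative assumptions; in practice the scores would come from a MIRACL- or MTEB-style evaluation and the threshold from the AI Governance Lead.

```python
def flag_language_gaps(scores, reference_lang="en", max_gap=0.10):
    """Return languages whose retrieval score (e.g. nDCG@10) trails the
    reference language by more than the agreed threshold, for recording
    as known limitations and for assigning compensating controls."""
    baseline = scores[reference_lang]
    return sorted(lang for lang, score in scores.items()
                  if baseline - score > max_gap)
```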
Dense vector embeddings can encode more information about the input text than is immediately apparent, and research has demonstrated that original text can be partially or fully reconstructed from embeddings through inversion attacks, particularly for models with high-dimensional output spaces. If the embedding model is used to encode documents containing personal data, such as patient records in a medical RAG system, candidate CVs in a recruitment system, or customer correspondence in a financial advisory system, the stored embeddings may themselves constitute personal data under GDPR Article 4(1). This is because they relate to an identified or identifiable natural person and can, with reasonable effort, be linked back to the original text.
The DPO Liaison assesses whether the stored embeddings constitute personal data by evaluating the feasibility of re-identification. The assessment considers the embedding model's dimensionality (higher dimensions preserve more information), the availability of inversion techniques for the specific model architecture, and whether the embeddings are stored alongside metadata such as document identifiers, timestamps, or user identifiers that could facilitate re-identification. Where the DPO Liaison determines that embeddings constitute personal data, the full GDPR compliance framework applies.
A lawful basis must be identified for storing the embeddings, and the retention policy must specify a deletion schedule. Data subject access and erasure requests must be serviceable, which may require the ability to identify and delete specific embeddings from the vector store. The data protection impact assessment must address the embedding-specific risks.
Embedding models produce vector representations that are specific to the model version, and when the embedding model is updated the new version may produce different vector representations for the same input text. Updates may occur through a provider-initiated version change for API-accessed models, or through fine-tuning or retraining for self-hosted models. If the knowledge base was indexed using embedding model version A but queries are embedded using version B, the retrieval quality degrades because the vector spaces are no longer aligned. In extreme cases, the retrieval may return entirely irrelevant documents.
This version mismatch risk requires coordination between the embedding model version and the knowledge base index version. The Technical SME maintains a version record linking each knowledge base index to the embedding model version used to generate it. Any change to the embedding model version triggers a re-indexing of the knowledge base. The Version Control and Model Lineage framework applies to embedding models in the same manner as to the primary model: version pinning for API-accessed embedding models, content hashing for downloaded models, and sentinel testing for detecting silent changes.
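The version-linkage and sentinel checks described above can be sketched as follows. The record fields and function names are hypothetical; the point is that a mismatch between the deployed model and the model recorded for the index must block serving and trigger re-indexing.

```python
import hashlib

def model_fingerprint(model_bytes: bytes) -> str:
    """Content hash for a downloaded embedding model artefact."""
    return hashlib.sha256(model_bytes).hexdigest()

def needs_reindex(index_record: dict, model_version: str,
                  model_hash: str) -> bool:
    """True when the deployed embedding model no longer matches the model
    recorded for this knowledge base index, meaning the knowledge base
    must be re-embedded before queries are served against it."""
    return (index_record["embedding_model_version"] != model_version
            or index_record["embedding_model_hash"] != model_hash)

def sentinel_changed(current_vec, reference_vec, tol=1e-6):
    """Crude sentinel test for silent provider-side changes: the same
    sentinel text should embed to (near-)identical vectors across checks."""
    if len(current_vec) != len(reference_vec):
        return True
    return any(abs(a - b) > tol for a, b in zip(current_vec, reference_vec))
```

For API-accessed models, where no artefact is available to hash, the sentinel test is the primary detection mechanism for silent changes.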
Module 4 of the AISDP records the knowledge base and embedding model governance comprehensively. The documentation includes the knowledge base composition, completeness assessment, currency policy, provenance records, and bias assessment, following the framework described above for treating the knowledge base as governed data.
For each embedding model, the documentation records its provenance, including provider, version, training data description (to the extent available), and known biases. The embedding bias assessment results, both intrinsic and extrinsic, are recorded in the AISDP alongside the training data bias assessment.
The documentation also captures multilingual performance evaluation results and any compensating controls for underperforming languages. It records whether stored embeddings constitute personal data, as assessed by the DPO Liaison, and the GDPR compliance measures applied if they do. The version control linkage between the embedding model version and the knowledge base index version is maintained as part of the documentation. An entry in the Model Selection Record for each embedding model is also required.
Retrieval bias testing is the primary extrinsic evaluation method for detecting embedding bias in practice. The Technical SME constructs a test suite of paired queries that differ only in demographic markers (names, gendered pronouns, geographic references) and measures whether the retrieval results differ systematically. For a recruitment RAG system, this might involve submitting pairs of candidate profiles that are identical except for names associated with different ethnic backgrounds, and comparing the job descriptions or reference materials retrieved for each. Statistically significant differences in retrieval results across protected dimensions indicate embedding bias. The test suite should be run at initial deployment and as part of the post-market monitoring programme, with results documented as Module 4 and Module 12 evidence.
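The paired-query test can be sketched as an overlap measurement over top-k retrieval results. This is a minimal illustration: `retrieve` stands in for the system's actual retrieval function, and a production test suite would add a significance test over many pairs rather than a single mean.

```python
def jaccard(a, b):
    """Jaccard overlap between two result lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def paired_retrieval_overlap(retrieve, query_pairs, k=10):
    """For each (query_A, query_B) pair differing only in a demographic
    marker, measure overlap between the top-k retrieved documents.
    A low mean overlap suggests the marker is influencing retrieval."""
    overlaps = [jaccard(retrieve(qa)[:k], retrieve(qb)[:k])
                for qa, qb in query_pairs]
    return sum(overlaps) / len(overlaps)
```

An overlap near 1.0 across all pairs is the expected behaviour for an unbiased system; systematically lower overlap on pairs varying a protected characteristic is evidence for the Module 4 and Module 12 record.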
A knowledge base quality pipeline validates new documents before they are added to the knowledge base. The pipeline checks document format and structural integrity, extracts and validates metadata (source, date, language, topic classification), runs currency checks against the staleness threshold, performs deduplication against existing documents using near-duplicate detection (SimHash, MinHash, or embedding-based similarity), and computes an incremental coverage assessment to track whether the knowledge base's domain coverage is expanding, contracting, or drifting. Documents that fail validation are quarantined for manual review, following the same pattern as the Third-Party Data Intake pipeline.
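The near-duplicate detection step can be sketched with the simplest of the techniques named above, a character-shingle Jaccard comparison (SimHash and MinHash are scalable approximations of the same idea). The shingle size and threshold below are illustrative assumptions.

```python
def shingles(text, k=5):
    """Character k-shingles of whitespace-normalised, lowercased text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def is_near_duplicate(new_doc, existing_docs, threshold=0.8):
    """True if the candidate document's shingle set overlaps an existing
    document's by at least the threshold; such documents are quarantined
    for manual review rather than added to the knowledge base."""
    new_sh = shingles(new_doc)
    return any(
        len(new_sh & shingles(doc)) / len(new_sh | shingles(doc)) >= threshold
        for doc in existing_docs
    )
```

Exact-shingle Jaccard is O(n) in the corpus size per candidate; at scale, MinHash signatures with locality-sensitive hashing keep the check tractable.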
Embedding inversion monitoring applies to systems where the DPO Liaison has determined that stored embeddings constitute personal data. The Technical SME implements access logging for the vector database (recording who queried what and when), anomaly detection on query patterns (bulk extraction attempts, systematic probing of the embedding space), and periodic re-assessment of the state of the art in embedding inversion techniques to ensure that the privacy risk assessment remains current.
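The query-pattern anomaly detection can be sketched as a sliding-window rate check per client. The class name, window, and limit are illustrative assumptions; real deployments would combine rate limits with richer signals such as systematic coverage of the embedding space.

```python
from collections import defaultdict, deque

class VectorStoreAccessMonitor:
    """Flag clients whose query volume within a time window exceeds a
    limit, a crude signal of bulk-extraction attempts against the
    vector database."""

    def __init__(self, window_seconds=60, max_queries=100):
        self.window = window_seconds
        self.max_queries = max_queries
        self.log = defaultdict(deque)  # client_id -> recent timestamps

    def record_query(self, client_id, timestamp):
        """Log one query; return True if the client should be flagged."""
        q = self.log[client_id]
        q.append(timestamp)
        # Drop timestamps that have fallen outside the window.
        while q and timestamp - q[0] > self.window:
            q.popleft()
        return len(q) > self.max_queries
```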
Although the legal classification is an open question, the recommended approach is to apply Article 10 requirements to the knowledge base because it directly shapes system outputs. Discovering post-deployment that an ungoverned knowledge base introduced bias or inaccuracy carries worse consequences than applying governance from the outset.
Construct paired queries that differ only in demographic markers such as names, gendered pronouns, or geographic references. Measure whether retrieval results differ systematically across protected dimensions. Statistically significant differences indicate embedding bias requiring remediation.
Retrieval quality degrades because the knowledge base vectors and query vectors are no longer in the same vector space. In extreme cases, the system may return entirely irrelevant documents. Any embedding model version change must trigger a full re-indexing of the knowledge base.
If the vector database supports efficient single-record deletion, delete the embedding directly. If not, maintain a mapping between embeddings and source documents, delete the source document, and flag the corresponding embedding for removal at the next scheduled re-indexing.
Embedding models encode societal biases from training data into vector space geometry, causing differential retrieval quality across demographic groups in recruitment, legal, and medical systems.
Embeddings may constitute personal data under GDPR Article 4(1) where original text can be reconstructed through inversion attacks, particularly for high-dimensional models encoding documents containing personal information.
Version mismatches between the embedding model used for indexing and the one used for queries cause retrieval quality degradation because the vector spaces are no longer aligned.
Key controls include retrieval bias testing with demographically paired queries, automated knowledge base quality pipelines with deduplication and currency checks, and embedding inversion monitoring for personal data protection.
Provenance requires that each document in the knowledge base is traceable to its source. The provenance record captures the document's origin (the publishing institution, the URL, the database from which it was retrieved), the date of retrieval, the version of the document if applicable, and the copyright or licensing status. For knowledge bases assembled from multiple sources, the provenance record enables the organisation to respond to copyright claims and to assess the reliability of individual documents.
Bias assessment examines whether certain perspectives, populations, or viewpoints are overrepresented or underrepresented. A recruitment knowledge base composed primarily of job descriptions from large technology companies may encode assumptions about role requirements that disadvantage candidates from different industry backgrounds. A legal knowledge base that overrepresents decisions from certain courts may introduce jurisdictional bias. The bias assessment methodology should be proportionate to the system's risk profile and documented in Module 4.
The practical challenge is that vector databases are typically optimised for similarity search, not for record-level deletion. Deleting a specific embedding from a vector index may require re-indexing the entire collection, depending on the database technology. The Technical SME assesses the vector database's deletion capabilities at architecture design time and documents the approach for servicing erasure requests in the data lifecycle documentation. Where the vector database does not support efficient single-record deletion, the compensating control is to maintain a mapping between embeddings and their source documents, so that erasure requests can be fulfilled by deleting the source document and flagging the corresponding embedding for removal at the next scheduled re-indexing.
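The compensating control described above can be sketched as a small registry. The class and method names are hypothetical; the essential behaviour is that the source document is deleted immediately while its embeddings are queued for removal at the next scheduled re-indexing.

```python
class EmbeddingErasureRegistry:
    """Compensating control for vector databases without efficient
    single-record deletion: map each source document to its embedding
    IDs, delete the source on an erasure request, and queue the
    embeddings for exclusion at the next re-indexing."""

    def __init__(self):
        self.doc_to_embeddings = {}
        self.pending_removal = set()

    def register(self, doc_id, embedding_ids):
        """Record the embeddings generated from a source document."""
        self.doc_to_embeddings[doc_id] = list(embedding_ids)

    def erase(self, doc_id):
        """Fulfil an erasure request at the document level now; return
        the embedding IDs flagged for removal at the next re-index."""
        ids = self.doc_to_embeddings.pop(doc_id, [])
        self.pending_removal.update(ids)
        return ids
```

At re-indexing time, the pipeline skips any embedding whose ID is in `pending_removal`, then clears the set, so the erasure becomes complete on the documented schedule.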
For systems using a small, manually curated knowledge base of fewer than 1,000 documents, a procedural alternative applies. The Technical SME maintains a spreadsheet-based document register listing each document in the knowledge base, its source, date of addition, date of last review, and topic classification. The DPO Liaison manually reviews the knowledge base for personal data at initial deployment and at each scheduled review. Embedding bias testing can be conducted manually by the Technical SME, submitting paired queries and comparing retrieval results in a structured spreadsheet. For the embedding model itself, the procedural alternative follows the same manual model provenance documentation: recording the model's origin, version, and content hash, and manually verifying that the deployed version matches the documented version at each scheduled review.