Embedding models introduce a layer of learned representation between the knowledge base and the decision-making model, and this layer can encode biases, linguistic disparities, and privacy risks that are invisible to monitoring systems focused only on the primary model's inputs and outputs. In retrieval-augmented generation architectures, an embedding model converts both the knowledge base documents and the user query into dense vector representations; the system retrieves documents whose embeddings are closest to the query embedding, and the retrieved documents become the primary model's context for generating a response. In semantic search systems used for recruitment, legal research, or medical case matching, the embedding model's representation of "similarity" directly determines which candidates, precedents, or cases are surfaced to the decision-maker.
Despite their influence on the system's outputs, embedding models tend to escape governance scrutiny. They are neither the primary decision-making model, so model selection governance may overlook them, nor training data, so data governance processes may not capture them. Yet embedding models do appear in monitoring pipelines, where they power the output quality and prompt distribution monitoring required under post-market monitoring obligations. This governance gap creates a compliance risk that organisations must address systematically.
The knowledge base in a RAG architecture functions as the information source that directly shapes the system's outputs. The foundation model generates its response based on the documents retrieved from the knowledge base; if the knowledge base is incomplete, outdated, biased, or poorly curated, the system's outputs will reflect those deficiencies regardless of how well the model itself performs.
Whether the knowledge base constitutes "training, validation and testing data" within the meaning of Article 10 is an open legal question. In the conventional sense, the knowledge base is not used to train the model's parameters; it is used at inference time to condition the model's output. The recommended compliance approach is to apply Article 10's data governance requirements to the knowledge base, adapted to the specific characteristics of inference-time retrieval. Treating the knowledge base as ungoverned data, and discovering post-deployment that it introduced bias, inaccuracy, or incompleteness into a high-risk system's outputs, carries consequences materially worse than the cost of applying governance from the outset.
The Technical SME documents the knowledge base using the same categories required for training data, adapted as follows. Composition covers the number of documents, the document types (regulatory text, clinical guidelines, case law, product specifications, or other), the source distribution, and the temporal coverage. For knowledge bases updated on a rolling basis, the documentation specifies the update cadence, the document selection criteria, and the process for removing outdated material.
Completeness and representativeness require that the knowledge base is representative of the domain the system serves. A medical decision-support RAG system whose knowledge base covers only English-language guidelines from US institutions will produce systematically different, and potentially less appropriate, responses for patients in EU member states where national clinical guidelines differ. A legal research system whose knowledge base underrepresents case law from smaller member states will produce less reliable results for queries concerning those jurisdictions. The Technical SME assesses completeness against the system's intended deployment context and documents any known coverage gaps.
Currency is critical for knowledge bases in domains where information changes, such as medical guidelines, regulatory text, or financial data. The Technical SME defines a staleness threshold: the maximum acceptable age for documents in the knowledge base, which varies by domain. Documents that exceed the staleness threshold are flagged for review, update, or removal. The staleness monitoring process is documented in the post-market monitoring plan.
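The staleness check described above can be sketched as a simple validation step. The domain thresholds below are illustrative assumptions, not values prescribed by any regulation; the actual thresholds are set by the Technical SME per domain.

```python
from datetime import date, timedelta

# Illustrative staleness thresholds per document type (assumed values;
# the real thresholds are defined by the Technical SME for each domain).
STALENESS_THRESHOLDS = {
    "clinical_guideline": timedelta(days=365 * 2),
    "regulatory_text": timedelta(days=180),
    "financial_data": timedelta(days=30),
}

def flag_stale_documents(documents, today=None):
    """Return IDs of documents whose age exceeds their domain's
    staleness threshold, for review, update, or removal."""
    today = today or date.today()
    stale = []
    for doc in documents:
        threshold = STALENESS_THRESHOLDS.get(doc["doc_type"])
        if threshold and today - doc["published"] > threshold:
            stale.append(doc["id"])
    return stale
```

A check like this would typically run on the knowledge base update cadence, with flagged documents routed to the review queue documented in the post-market monitoring plan.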
Embedding models encode semantic associations from their training data into the geometry of the vector space, and research has consistently demonstrated that models trained on broad web corpora encode societal biases. These include associations between professions and gender, between names and ethnicity, and between geographic locations and socioeconomic status. These biases are structural features of the embedding space that directly affect retrieval behaviour.
In a high-risk AI system, embedding bias manifests as differential retrieval quality. A RAG-based recruitment system that uses biased embeddings may retrieve systematically different reference materials for candidates whose profiles contain markers associated with different demographic groups. A semantic search system for legal case matching may retrieve different precedents for cases involving individuals from different ethnic or socioeconomic backgrounds. These effects are subtle, difficult to detect through aggregate performance metrics, and fall squarely within the scope of Article 10(2)(f) on examination for possible biases and Article 10(2)(g) on identification of relevant data gaps and shortcomings.
The Technical SME assesses embedding bias through a combination of intrinsic evaluation, examining the embedding space directly for known bias patterns, and extrinsic evaluation, testing whether retrieval quality differs across demographic subgroups. Intrinsic evaluation methods include WEAT (the Word Embedding Association Test) and its sentence-level extensions, which measure the association between target concepts such as male and female names and attribute concepts such as career and family in the embedding space. Extrinsic evaluation is more directly relevant to compliance: it measures whether a fixed set of queries produces systematically different retrieval results when query characteristics vary across protected dimensions. The results are documented in Module 4 alongside the training data bias assessment.
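The WEAT effect size mentioned above can be sketched in a few lines of numpy. This is a minimal illustration over toy two-dimensional vectors; a real assessment would use the deployed embedding model's vectors for curated target and attribute word sets, plus a permutation test for significance.

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    """s(w, A, B): mean similarity of w to attribute set A minus to B."""
    return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """Cohen's-d-style WEAT effect size for target sets X, Y (e.g. male
    and female names) and attribute sets A, B (e.g. career and family)."""
    s_x = [association(x, A, B) for x in X]
    s_y = [association(y, A, B) for y in Y]
    pooled = np.std(s_x + s_y, ddof=1)
    return (np.mean(s_x) - np.mean(s_y)) / pooled
```

A large positive or negative effect size indicates that the two target sets are differentially associated with the attribute concepts in the embedding space.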
Most widely available embedding models perform best on English-language text. Models offering multilingual support vary in their performance across languages. For a high-risk AI system deployed across multiple EU member states, uneven embedding performance across languages could cause the system to retrieve more relevant information for queries in some languages than others.
A medical decision-support system that retrieves highly relevant clinical guidelines for queries in English but less relevant results for queries in Estonian or Maltese is providing a materially different quality of service to users in different member states. The Technical SME evaluates the embedding model's retrieval performance across all languages in which the system will operate.
The evaluation should use language-specific retrieval benchmarks (MIRACL for multilingual information retrieval, MTEB for multilingual text embedding evaluation) and domain-specific test queries in each language. Performance gaps exceeding a defined threshold, set by the AI Governance Lead in consultation with the Technical SME, should be documented as a known limitation in the AISDP and addressed through compensating controls. These controls may include language-specific fine-tuning of the embedding model, translation preprocessing for underperforming languages, or separate embedding models optimised for specific language families.
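The threshold comparison can be expressed as a small helper. The scores, reference language, and gap threshold below are illustrative assumptions; in practice the scores would come from a MIRACL- or MTEB-style evaluation and the threshold from the AI Governance Lead.

```python
def flag_language_gaps(scores, reference_lang="en", max_gap=0.10):
    """Return languages whose retrieval score (e.g. nDCG@10) trails the
    reference language by more than the agreed threshold, for recording
    as known limitations and for assigning compensating controls."""
    baseline = scores[reference_lang]
    return sorted(lang for lang, score in scores.items()
                  if baseline - score > max_gap)
```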
Dense vector embeddings can encode more information about the input text than is immediately apparent, and research has demonstrated that original text can be partially or fully reconstructed from embeddings through inversion attacks, particularly for models with high-dimensional output spaces. If the embedding model is used to encode documents containing personal data, such as patient records in a medical RAG system, candidate CVs in a recruitment system, or customer correspondence in a financial advisory system, the stored embeddings may themselves constitute personal data under GDPR Article 4(1). This is because they relate to an identified or identifiable natural person and can, with reasonable effort, be linked back to the original text.
The DPO Liaison assesses whether the stored embeddings constitute personal data by evaluating the feasibility of re-identification. The assessment considers the embedding model's dimensionality (higher dimensions preserve more information), the availability of inversion techniques for the specific model architecture, and whether the embeddings are stored alongside metadata such as document identifiers, timestamps, or user identifiers that could facilitate re-identification. Where the DPO Liaison determines that embeddings constitute personal data, the full GDPR compliance framework applies.
A lawful basis must be identified for storing the embeddings, and the retention policy must specify a deletion schedule. Data subject access and erasure requests must be serviceable, which may require the ability to identify and delete specific embeddings from the vector store. The data protection impact assessment must address the embedding-specific risks.
Embedding models produce vector representations that are specific to the model version, and when the embedding model is updated the new version may produce different vector representations for the same input text. Updates may occur through a provider-initiated version change for API-accessed models, or through fine-tuning or retraining for self-hosted models. If the knowledge base was indexed using embedding model version A but queries are embedded using version B, the retrieval quality degrades because the vector spaces are no longer aligned. In extreme cases, the retrieval may return entirely irrelevant documents.
This version mismatch risk requires coordination between the embedding model version and the knowledge base index version. The Technical SME maintains a version record linking each knowledge base index to the embedding model version used to generate it. Any change to the embedding model version triggers a re-indexing of the knowledge base. The Version Control and Model Lineage framework applies to embedding models in the same manner as to the primary model: version pinning for API-accessed embedding models, content hashing for downloaded models, and sentinel testing for detecting silent changes.
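The version-linkage and sentinel checks described above can be sketched as follows. The record fields and function names are hypothetical; the point is that a mismatch between the deployed model and the model recorded for the index must block serving and trigger re-indexing.

```python
import hashlib

def model_fingerprint(model_bytes: bytes) -> str:
    """Content hash for a downloaded embedding model artefact."""
    return hashlib.sha256(model_bytes).hexdigest()

def needs_reindex(index_record: dict, model_version: str,
                  model_hash: str) -> bool:
    """True when the deployed embedding model no longer matches the model
    recorded for this knowledge base index, meaning the knowledge base
    must be re-embedded before queries are served against it."""
    return (index_record["embedding_model_version"] != model_version
            or index_record["embedding_model_hash"] != model_hash)

def sentinel_changed(current_vec, reference_vec, tol=1e-6):
    """Crude sentinel test for silent provider-side changes: the same
    sentinel text should embed to (near-)identical vectors across checks."""
    if len(current_vec) != len(reference_vec):
        return True
    return any(abs(a - b) > tol for a, b in zip(current_vec, reference_vec))
```

For API-accessed models, where no artefact is available to hash, the sentinel test is the primary detection mechanism for silent changes.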
Module 4 of the AISDP records the knowledge base and embedding model governance comprehensively. The documentation includes the knowledge base composition, completeness assessment, currency policy, provenance records, and bias assessment, following the framework described above for treating the knowledge base as governed data.
For each embedding model, the documentation records its provenance, including provider, version, training data description (to the extent available), and known biases. The embedding bias assessment results, both intrinsic and extrinsic, are recorded in the AISDP alongside the training data bias assessment.
The documentation also captures multilingual performance evaluation results and any compensating controls for underperforming languages. It records whether stored embeddings constitute personal data, as assessed by the DPO Liaison, and the GDPR compliance measures applied if they do. The version control linkage between the embedding model version and the knowledge base index version is maintained as part of the documentation. An entry in the Model Selection Record for each embedding model is also required.
Retrieval bias testing is the primary extrinsic evaluation method for detecting embedding bias in practice. The Technical SME constructs a test suite of paired queries that differ only in demographic markers (names, gendered pronouns, geographic references) and measures whether the retrieval results differ systematically. For a recruitment RAG system, this might involve submitting pairs of candidate profiles that are identical except for names associated with different ethnic backgrounds, and comparing the job descriptions or reference materials retrieved for each. Statistically significant differences in retrieval results across protected dimensions indicate embedding bias. The test suite should be run at initial deployment and as part of the post-market monitoring programme, with results documented as Module 4 and Module 12 evidence.
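The paired-query test can be sketched as an overlap measurement over top-k retrieval results. This is a minimal illustration: `retrieve` stands in for the system's actual retrieval function, and a production test suite would add a significance test over many pairs rather than a single mean.

```python
def jaccard(a, b):
    """Jaccard overlap between two result lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def paired_retrieval_overlap(retrieve, query_pairs, k=10):
    """For each (query_A, query_B) pair differing only in a demographic
    marker, measure overlap between the top-k retrieved documents.
    A low mean overlap suggests the marker is influencing retrieval."""
    overlaps = [jaccard(retrieve(qa)[:k], retrieve(qb)[:k])
                for qa, qb in query_pairs]
    return sum(overlaps) / len(overlaps)
```

An overlap near 1.0 across all pairs is the expected behaviour for an unbiased system; systematically lower overlap on pairs varying a protected characteristic is evidence for the Module 4 and Module 12 record.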
A knowledge base quality pipeline validates new documents before they are added to the knowledge base. The pipeline checks document format and structural integrity, extracts and validates metadata (source, date, language, topic classification), runs currency checks against the staleness threshold, performs deduplication against existing documents using near-duplicate detection (SimHash, MinHash, or embedding-based similarity), and computes an incremental coverage assessment to track whether the knowledge base's domain coverage is expanding, contracting, or drifting. Documents that fail validation are quarantined for manual review, following the same pattern as the Third-Party Data Intake pipeline.
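The near-duplicate detection step can be sketched with the simplest of the techniques named above, a character-shingle Jaccard comparison (SimHash and MinHash are scalable approximations of the same idea). The shingle size and threshold below are illustrative assumptions.

```python
def shingles(text, k=5):
    """Character k-shingles of whitespace-normalised, lowercased text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def is_near_duplicate(new_doc, existing_docs, threshold=0.8):
    """True if the candidate document's shingle set overlaps an existing
    document's by at least the threshold; such documents are quarantined
    for manual review rather than added to the knowledge base."""
    new_sh = shingles(new_doc)
    return any(
        len(new_sh & shingles(doc)) / len(new_sh | shingles(doc)) >= threshold
        for doc in existing_docs
    )
```

Exact-shingle Jaccard is O(n) in the corpus size per candidate; at scale, MinHash signatures with locality-sensitive hashing keep the check tractable.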
Embedding inversion monitoring applies to systems where the DPO Liaison has determined that stored embeddings constitute personal data. The Technical SME implements access logging for the vector database (recording who queried what and when), anomaly detection on query patterns (bulk extraction attempts, systematic probing of the embedding space), and periodic re-assessment of the state of the art in embedding inversion techniques to ensure that the privacy risk assessment remains current.
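The query-pattern anomaly detection can be sketched as a sliding-window rate check per client. The class name, window, and limit are illustrative assumptions; real deployments would combine rate limits with richer signals such as systematic coverage of the embedding space.

```python
from collections import defaultdict, deque

class VectorStoreAccessMonitor:
    """Flag clients whose query volume within a time window exceeds a
    limit, a crude signal of bulk-extraction attempts against the
    vector database."""

    def __init__(self, window_seconds=60, max_queries=100):
        self.window = window_seconds
        self.max_queries = max_queries
        self.log = defaultdict(deque)  # client_id -> recent timestamps

    def record_query(self, client_id, timestamp):
        """Log one query; return True if the client should be flagged."""
        q = self.log[client_id]
        q.append(timestamp)
        # Drop timestamps that have fallen outside the window.
        while q and timestamp - q[0] > self.window:
            q.popleft()
        return len(q) > self.max_queries
```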
Although the legal classification is an open question, the recommended approach is to apply Article 10 requirements to the knowledge base because it directly shapes system outputs. Discovering post-deployment that an ungoverned knowledge base introduced bias or inaccuracy carries worse consequences than applying governance from the outset.
Construct paired queries that differ only in demographic markers such as names, gendered pronouns, or geographic references. Measure whether retrieval results differ systematically across protected dimensions. Statistically significant differences indicate embedding bias requiring remediation.
Retrieval quality degrades because the knowledge base vectors and query vectors are no longer in the same vector space. In extreme cases, the system may return entirely irrelevant documents. Any embedding model version change must trigger a full re-indexing of the knowledge base.
If the vector database supports efficient single-record deletion, delete the embedding directly. If not, maintain a mapping between embeddings and source documents, delete the source document, and flag the corresponding embedding for removal at the next scheduled re-indexing.
Embedding models encode societal biases from training data into vector space geometry, causing differential retrieval quality across demographic groups in recruitment, legal, and medical systems.
Embeddings may constitute personal data under GDPR Article 4(1) where original text can be reconstructed through inversion attacks, particularly for high-dimensional models encoding documents containing personal information.
Version mismatches between the embedding model used for indexing and the one used for queries cause retrieval quality degradation because the vector spaces are no longer aligned.
Key controls include retrieval bias testing with demographically paired queries, automated knowledge base quality pipelines with deduplication and currency checks, and embedding inversion monitoring for personal data protection.
Provenance requires that each document in the knowledge base is traceable to its source. The provenance record captures the document's origin (the publishing institution, the URL, the database from which it was retrieved), the date of retrieval, the version of the document if applicable, and the copyright or licensing status. For knowledge bases assembled from multiple sources, the provenance record enables the organisation to respond to copyright claims and to assess the reliability of individual documents.
Bias assessment examines whether certain perspectives, populations, or viewpoints are overrepresented or underrepresented. A recruitment knowledge base composed primarily of job descriptions from large technology companies may encode assumptions about role requirements that disadvantage candidates from different industry backgrounds. A legal knowledge base that overrepresents decisions from certain courts may introduce jurisdictional bias. The bias assessment methodology should be proportionate to the system's risk profile and documented in Module 4.
The practical challenge is that vector databases are typically optimised for similarity search, not for record-level deletion. Deleting a specific embedding from a vector index may require re-indexing the entire collection, depending on the database technology. The Technical SME assesses the vector database's deletion capabilities at architecture design time and documents the approach for servicing erasure requests in the data lifecycle documentation. Where the vector database does not support efficient single-record deletion, the compensating control is to maintain a mapping between embeddings and their source documents, so that erasure requests can be fulfilled by deleting the source document and flagging the corresponding embedding for removal at the next scheduled re-indexing.
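The compensating control described above can be sketched as a small registry. The class and method names are hypothetical; the essential behaviour is that the source document is deleted immediately while its embeddings are queued for removal at the next scheduled re-indexing.

```python
class EmbeddingErasureRegistry:
    """Compensating control for vector databases without efficient
    single-record deletion: map each source document to its embedding
    IDs, delete the source on an erasure request, and queue the
    embeddings for exclusion at the next re-indexing."""

    def __init__(self):
        self.doc_to_embeddings = {}
        self.pending_removal = set()

    def register(self, doc_id, embedding_ids):
        """Record the embeddings generated from a source document."""
        self.doc_to_embeddings[doc_id] = list(embedding_ids)

    def erase(self, doc_id):
        """Fulfil an erasure request at the document level now; return
        the embedding IDs flagged for removal at the next re-index."""
        ids = self.doc_to_embeddings.pop(doc_id, [])
        self.pending_removal.update(ids)
        return ids
```

At re-indexing time, the pipeline skips any embedding whose ID is in `pending_removal`, then clears the set, so the erasure becomes complete on the documented schedule.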
For systems using a small, manually curated knowledge base of fewer than 1,000 documents, a procedural alternative applies. The Technical SME maintains a spreadsheet-based document register listing each document in the knowledge base, its source, date of addition, date of last review, and topic classification. The DPO Liaison manually reviews the knowledge base for personal data at initial deployment and at each scheduled review. Embedding bias testing can be conducted manually by the Technical SME, submitting paired queries and comparing retrieval results in a structured spreadsheet. For the embedding model itself, the procedural alternative follows the same manual model provenance documentation: recording the model's origin, version, and content hash, and manually verifying that the deployed version matches the documented version at each scheduled review.