A RAG system's knowledge base is a compliance-critical data asset requiring dedicated governance controls. This page covers the provenance, completeness, bias, and retrieval pipeline requirements that organisations must implement for EU AI Act compliance.
A RAG system's knowledge base is a compliance-critical data asset that must be governed with the same rigour as training data. The knowledge base shapes the model's outputs as profoundly as training data does: a GPAI model that produces accurate, unbiased results on its own may produce inaccurate or biased outputs when provided with a knowledge base containing outdated, incomplete, or skewed information. RAG Compliance Framework covers the broader compliance architecture for retrieval-augmented generation; this page focuses on the governance controls for the knowledge base and retrieval pipeline specifically.
Article 10 governs training, validation, and testing data. Retrieved documents do not fall neatly into these categories in the technical sense, as they are inference-time context rather than training inputs. Yet the regulatory intent is clear: any data asset that materially influences a high-risk system's outputs must be governed. The knowledge base, the retrieval pipeline, and the interaction between retrieved context and model behaviour each create compliance implications that base GPAI integration guidance does not fully address.
Every document in the knowledge base must have documented provenance covering the source, the date of acquisition, the licence or permission under which it was acquired, and the conditions attached to its use. Copyright exposure is significant: a RAG system that retrieves and reproduces copyrighted content in its outputs may expose the deploying organisation to infringement claims distinct from the GPAI model provider's own copyright compliance obligations under Article 53(1)(c).
The Technical SME maintains a knowledge base catalogue recording, for each document or document source: the original author or publisher; the acquisition date; the licence type (proprietary, Creative Commons, public domain, contractual licence); any usage restrictions; the expiry date if the licence is time-limited; and the last review date.
Automated copyright screening reduces the risk of inadvertent infringement for knowledge bases assembled from web-scraped or publicly available sources. The screening checks each document against known copyright databases, robots.txt restrictions, and the organisation's internal list of prohibited sources. Documents that fail the screening are quarantined pending manual review by the Legal and Regulatory Advisor.
This screening process is particularly important for organisations that assemble knowledge bases at scale, where manual review of every document is impractical. Data Governance for High-Risk AI addresses broader data governance obligations that complement these RAG-specific copyright controls.
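A screening pass of this kind can be sketched as follows. The prohibited-source list, the permitted licence set, and the two-outcome result are hypothetical stand-ins; a production system would also consult copyright databases and robots.txt, which this sketch omits.

```python
# Hypothetical screening pass: documents from prohibited sources, or
# without a recognised licence, are quarantined for manual review.
# Source domains and licence labels are illustrative assumptions.
PROHIBITED_SOURCES = {"scraped-forum.example", "paywalled-news.example"}
PERMITTED_LICENCES = {"CC-BY-4.0", "CC0-1.0", "public-domain", "contractual"}

def screen(doc: dict) -> str:
    """Return 'accepted' or 'quarantined' for one candidate document."""
    if doc.get("source_domain") in PROHIBITED_SOURCES:
        return "quarantined"
    if doc.get("licence") not in PERMITTED_LICENCES:
        return "quarantined"
    return "accepted"

docs = [
    {"id": "a", "source_domain": "gov.example", "licence": "public-domain"},
    {"id": "b", "source_domain": "scraped-forum.example", "licence": "CC0-1.0"},
    {"id": "c", "source_domain": "blog.example", "licence": None},
]
quarantine = [d["id"] for d in docs if screen(d) == "quarantined"]
print(quarantine)  # ['b', 'c']
```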
A knowledge base that is incomplete or outdated produces outputs that are incomplete or outdated, creating a compliance risk for high-risk systems. The system's outputs may not reflect the current state of the domain, leading to decisions that are wrong in ways the system's performance metrics do not capture.
The Technical SME defines, before deployment, the domains and topics the knowledge base must cover. The completeness assessment maps each required domain to the documents that address it, identifies gaps, and documents the remediation plan. This assessment is documented in AISDP Module 4. Where the knowledge base covers regulated domains such as legal, clinical, or financial guidance, completeness gaps carry heightened risk because users may rely on the system's outputs for consequential decisions.
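The gap-identification step can be sketched as a mapping from required domains to the documents that address them. The domain labels and tagging scheme here are assumptions for illustration.

```python
# Sketch of a completeness assessment: each required domain is mapped
# to the documents tagged as addressing it; domains with no documents
# are the gaps. Domain names and tags are illustrative assumptions.
required_domains = {"parental-leave", "sick-pay", "pension-contributions"}
documents = {
    "doc-1": {"parental-leave"},
    "doc-2": {"parental-leave", "sick-pay"},
}
coverage = {d: [doc for doc, tags in documents.items() if d in tags]
            for d in required_domains}
gaps = sorted(d for d, hits in coverage.items() if not hits)
print(gaps)  # ['pension-contributions']
```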
Currency monitoring identifies documents whose content may have been superseded, preventing the retrieval of outdated information. For knowledge bases drawn from regulatory, legal, or clinical sources, currency monitoring is particularly important: a RAG system that retrieves outdated regulatory guidance may produce advice that is no longer correct.
Currency monitoring should track: the age of each document; the frequency of updates in the source domain; external signals indicating a document may be outdated, such as new legislation, published corrections, or superseding publications; and user feedback indicating the system produced an outdated response. The knowledge base is versioned as a data asset. Each version captures the complete set of documents, their metadata, and the embedding vectors generated from them. Version changes are tracked in AISDP Module 10 and evaluated against the substantial modification criteria. Data Quality Monitoring provides complementary guidance on ongoing data quality assurance processes.
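The currency signals listed above can be combined into a simple review trigger. The review intervals and the supersession-signal field are assumptions, not prescribed values.

```python
# Sketch of a currency check: a document is flagged when its age exceeds
# a per-domain review interval, or when a supersession signal (new
# legislation, a correction, user feedback) has been recorded against it.
# Intervals and field names are illustrative assumptions.
from datetime import date, timedelta

REVIEW_INTERVALS = {"regulatory": timedelta(days=90),
                    "general": timedelta(days=365)}

def needs_review(doc: dict, today: date) -> bool:
    interval = REVIEW_INTERVALS.get(doc["domain"], REVIEW_INTERVALS["general"])
    stale = today - doc["last_verified"] > interval
    return stale or bool(doc.get("supersession_signals"))

doc = {"domain": "regulatory", "last_verified": date(2025, 1, 1),
       "supersession_signals": []}
print(needs_review(doc, date(2025, 6, 1)))  # older than 90 days -> True
```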
A knowledge base assembled from sources that underrepresent certain perspectives, demographics, or viewpoints will produce outputs that reflect those gaps. The fairness assessment for a RAG system must address not only the GPAI model's parametric biases but also the representational biases in the knowledge base itself.
The Technical SME conducts a representativeness analysis assessing whether the document collection adequately covers: all geographic regions relevant to the system's deployment; all demographic groups within the affected population; all perspectives relevant to the domain, including dissenting or minority viewpoints where the system's outputs may influence decisions affecting diverse groups; and all languages in which the system operates. The analysis is documented in AISDP Module 4. Where gaps are identified, the remediation approach is documented with a timeline.
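One simple starting point for such an analysis is tallying document metadata along each axis and flagging under-covered groups. The axes, labels, and minimum threshold below are assumptions; a real analysis would go beyond raw counts.

```python
# Sketch of a representativeness tally: documents are counted along each
# analysis axis (region, language, ...) and groups below a minimum
# threshold are flagged. Axis values and MIN_DOCS are assumptions.
from collections import Counter

documents = [
    {"region": "EU-West", "language": "en"},
    {"region": "EU-West", "language": "en"},
    {"region": "EU-East", "language": "pl"},
]
MIN_DOCS = 2

def underrepresented(axis: str) -> list[str]:
    counts = Counter(d[axis] for d in documents)
    return sorted(group for group, n in counts.items() if n < MIN_DOCS)

print(underrepresented("region"))    # ['EU-East']
print(underrepresented("language"))  # ['pl']
```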
The retrieval pipeline determines which documents the GPAI model sees and is a decision-making component with direct impact on the system's outputs. When a user queries a RAG system, the retrieval pipeline selects a subset of documents from the knowledge base based on semantic similarity to the query. The selected documents become the context for the model's response.
If the retrieval pipeline selects the wrong documents, selects incomplete documents, or fails to retrieve relevant documents, the model's output will be correspondingly wrong, incomplete, or missing critical information. The retrieval pipeline's behaviour must be documented in AISDP Module 3 with the same rigour as the model inference layer. The documentation covers: the embedding model used for semantic search, its version, provider, and known limitations; the similarity metric and threshold; the number of documents retrieved (top-k); the re-ranking strategy if any; the chunking strategy for splitting documents into retrievable segments; and the metadata filtering logic if retrieval is constrained by document attributes.
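The selection step those parameters govern can be sketched as follows: similarity against the query vector, a threshold, then top-k. The vectors, threshold, and k values are toy assumptions for illustration.

```python
# Minimal sketch of the retrieval selection step: cosine similarity
# against the query embedding, a similarity threshold, then top-k.
# The toy 2-D vectors and parameter values are assumptions.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def retrieve(query_vec, index, top_k=2, threshold=0.75):
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    kept = sorted((s, d) for s, d in scored if s >= threshold)[::-1]
    return [doc_id for _, doc_id in kept[:top_k]]

index = {"doc-a": (1.0, 0.0), "doc-b": (0.9, 0.4), "doc-c": (0.0, 1.0)}
print(retrieve((1.0, 0.1), index))  # ['doc-a', 'doc-b']
```

Documenting exactly these parameters (metric, threshold, top-k, and any re-ranking or metadata filtering applied before this step) is what makes the pipeline's behaviour auditable.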
The embedding model that converts queries and documents into vectors is itself an AI component that introduces bias, accuracy, and version control considerations.
The embedding model that converts queries and documents into vectors is itself an AI component that introduces bias, accuracy, and version control considerations. For RAG systems, three additional requirements apply beyond the general embedding model governance obligations.
First, embedding bias evaluation: the Technical SME evaluates whether the embedding model produces systematically different retrieval results for semantically equivalent queries phrased in different ways. A query about "maternity leave entitlements" should retrieve the same documents as a query about "parental leave for mothers." If the embedding model's semantic space encodes demographic biases, the retrieval pipeline may systematically underserve certain query formulations.
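A paraphrase-consistency check of this kind can be sketched as below. The `embed` stub is a deliberately crude stand-in for the production embedding model, and the query pair and document set are assumptions; the point is the comparison structure, not the embedding itself.

```python
# Sketch of a paraphrase-consistency check: semantically equivalent
# queries should retrieve the same top documents. The embed() stub
# (word counts over a tiny vocabulary) stands in for a real model.
def embed(text: str) -> tuple:
    vocab = ["leave", "maternity", "parental", "mothers", "entitlements"]
    return tuple(text.lower().count(w) for w in vocab)

def top_docs(query, index, k=1):
    def dot(a, b): return sum(x * y for x, y in zip(a, b))
    q = embed(query)
    return set(sorted(index, key=lambda d: dot(q, index[d]), reverse=True)[:k])

index = {"doc-leave": embed("parental leave maternity leave entitlements"),
         "doc-tax": embed("income tax thresholds")}
a = top_docs("maternity leave entitlements", index)
b = top_docs("parental leave for mothers", index)
print(a == b)  # consistent retrieval across the paraphrase pair -> True
```

Run over a large set of paraphrase pairs, the rate of mismatching result sets gives a measurable signal of query-formulation bias.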
Second, cross-lingual retrieval quality: for multilingual RAG systems, the Technical SME evaluates whether retrieval quality is consistent across languages. A system that retrieves comprehensive, relevant documents for English queries but sparse, tangential documents for Polish or Romanian queries produces systematically different output quality across language communities.
Third, embedding version alignment: the document embeddings in the vector store must be generated by the same embedding model version as the query embeddings at inference time. A mismatch between the embedding model used to index the knowledge base and the model used to embed incoming queries can silently degrade retrieval quality. The Technical SME implements version alignment checks in the retrieval pipeline and documents the alignment mechanism in AISDP Module 10.
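An inference-time alignment guard can be sketched as below; the metadata field and exception names are illustrative assumptions.

```python
# Sketch of an embedding version alignment guard: the pipeline refuses
# to serve queries when the query-side model version differs from the
# version recorded against the vector store. Names are illustrative.
class EmbeddingVersionMismatch(RuntimeError):
    pass

def check_alignment(index_metadata: dict, query_model_version: str) -> None:
    indexed = index_metadata["embedding_model_version"]
    if indexed != query_model_version:
        raise EmbeddingVersionMismatch(
            f"index built with {indexed!r}, queries embedded with "
            f"{query_model_version!r}")

check_alignment({"embedding_model_version": "2.1"}, "2.1")  # aligned: no error
try:
    check_alignment({"embedding_model_version": "2.1"}, "2.2")
except EmbeddingVersionMismatch as exc:
    print("blocked:", exc)
```

Failing loudly at query time, rather than returning silently degraded results, is what makes the mismatch detectable in operation.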
Knowledge base bias stems from underrepresentation of perspectives, demographics, or viewpoints in the document collection, whereas model bias is encoded in the GPAI model's parameters. RAG fairness assessment must address both sources of bias independently.
The Technical SME maps required domains to available documents, identifies gaps, and documents remediation plans in AISDP Module 4.
Currency monitoring tracks document age, source update frequency, external supersession signals, and user feedback indicating outdated responses.
A representativeness analysis examines whether the document collection covers all relevant geographic regions, demographic groups, perspectives, and languages.
RAG embedding models require bias evaluation, cross-lingual retrieval quality assessment, and version alignment between document indexing and query-time embeddings.