Many production AI systems rely on multiple models working together, not a single model in isolation. A recruitment screening system may use an NLP model to extract competencies, a classification model to score candidates, and a ranking model to order results. A clinical decision support system may combine medical image analysis, natural language processing for clinical notes, and risk stratification. A RAG system with reranking uses at least three models: the embedding model, the reranker, and the generator.
Each additional model introduces its own biases, failure modes, version control requirements, and documentation obligations. The AISDP must account for every model in the system, their interactions, and the emergent behaviour that arises from their combination. Annex IV requires documentation of "the elements of the AI system and of the process for its development." For a single-model system this is straightforward: one model, one training process, one set of evaluation results.
For a composite system, documentation must cover each model component and the system as a whole. The system's compliance properties emerge from the interaction of components and cannot be reduced to the sum of individual model evaluations. The outputs of multi-model governance feed into AISDP Modules 2, 3, 4, 5, 6, 9, 10, and 12.
The Technical SME maintains a system component registry that documents every AI model component in the system. This registry is a governed artefact, version-controlled and referenced in AISDP Module 2. It provides both the Technical SME and the Conformity Assessment Coordinator with a complete inventory of the system's AI components.
Each entry in the registry records the following fields:
| Field | Example (Multi-Model Recruitment System) |
|---|---|
| Component identifier | `competency-extractor-v2.1` |
| Model type | Transformer-based NER model (fine-tuned BERT) |
| Provider | Internal (trained on proprietary data) |
| Position in pipeline | Stage 1: processes raw CV text, outputs structured competency entities |
| Input specification | Raw text (UTF-8, max 50,000 characters) |
| Output specification | JSON array of competency objects with name, confidence, and source span |
| Downstream consumers | `candidate-scorer-v3.0` (Stage 2) |
| Version | 2.1.0 (model registry: `s3://models/comp-ext/2.1.0/`) |
| Training data version | `ds://training/competencies/v4.2` |
| Last evaluation date | 2026-01-15 |
| Fairness evaluation status | Passed (SRR at or above 0.85 across all subgroups) |
The registry serves as the single source of truth for what is deployed. When any component changes, the registry is updated before the change reaches production. Without this inventory, organisations cannot demonstrate to market surveillance authorities which models constitute their system or how those models interact.
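As one possible shape for a registry entry, the fields above can be captured in a typed record. This is a sketch only: the class name, field names, and example values are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ComponentEntry:
    """One entry in the system component registry (illustrative fields)."""
    component_id: str            # e.g. "competency-extractor-v2.1"
    model_type: str
    provider: str
    pipeline_stage: int
    input_spec: str
    output_spec: str
    downstream_consumers: tuple  # component_ids of direct consumers
    version: str
    training_data_version: str
    last_evaluation_date: str    # ISO 8601 date
    fairness_status: str

entry = ComponentEntry(
    component_id="competency-extractor-v2.1",
    model_type="Transformer-based NER model (fine-tuned BERT)",
    provider="Internal",
    pipeline_stage=1,
    input_spec="Raw text (UTF-8, max 50,000 characters)",
    output_spec="JSON array of competency objects",
    downstream_consumers=("candidate-scorer-v3.0",),
    version="2.1.0",
    training_data_version="ds://training/competencies/v4.2",
    last_evaluation_date="2026-01-15",
    fairness_status="Passed",
)
```

Making the record immutable (`frozen=True`) mirrors the governance rule that a registry entry changes only through a new versioned update, never by in-place mutation.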
Evaluation of composite systems must operate at two distinct levels: per-component evaluation and aggregate evaluation. Individual models may each perform within specification while the combined system produces unacceptable outputs. A competency extractor with 95 per cent precision feeding a candidate scorer with 90 per cent accuracy does not simply yield a system at the naive product of 85.5 per cent; the errors may correlate, amplify, or cancel in ways that depend on the specific data and interaction pattern.
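The non-compositionality of accuracy can be illustrated with a toy Monte Carlo sketch. The probabilities and the "hard item" structure below are invented purely for illustration; they are not measurements from any real system.

```python
import random

random.seed(0)
N = 100_000

# Independent errors: system accuracy approaches the naive product
# 0.95 * 0.90 = 0.855.
indep = sum(
    (random.random() < 0.95) and (random.random() < 0.90)
    for _ in range(N)
) / N

# Correlated errors: both components tend to fail on the same "hard"
# items, so the aggregate deviates from the naive product even though
# each component's marginal accuracy is unchanged (0.95 and 0.90).
corr = 0
for _ in range(N):
    hard = random.random() < 0.10  # 10% of items are "hard"
    extractor_ok = random.random() < (0.60 if hard else 0.989)
    scorer_ok = random.random() < (0.40 if hard else 0.956)
    corr += extractor_ok and scorer_ok
corr /= N
```

Here the correlated case comes out noticeably above the naive product because the two components' failures overlap; a different correlation structure could just as easily push the aggregate below it. Only measuring the aggregate settles the question.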
The aggregate evaluation is the authoritative measure. Per-component evaluation is diagnostic.
The Technical SME designs the evaluation architecture to test both levels across four tiers:
| Evaluation Level | What It Tests | Metric Examples | Failure Threshold |
|---|---|---|---|
| Per-component (isolation) | Each model component in isolation against its own test set | Component-specific accuracy, precision, recall, latency | Component-specific thresholds documented in the component registry |
| Per-component (integration) | Each model component within the full pipeline, using realistic upstream inputs | Same metrics, but on pipeline-processed inputs rather than clean test data | Same thresholds; failures here indicate integration issues |
| Aggregate (end-to-end) | The complete system's outputs against the ground truth | System-level accuracy, fairness metrics (SRR, equalised odds), calibration | System-level thresholds documented in AISDP Module 5 |
| Aggregate (disaggregated) | The complete system's outputs disaggregated by protected characteristic subgroups | Per-subgroup accuracy, per-subgroup SRR, intersectional analysis | Subgroup-specific thresholds documented in AISDP Module 6 |

The evaluation is documented in AISDP Module 5. Per-component results provide diagnostic information, while aggregate results determine the system's compliance status. The distinction matters: a system cannot claim compliance on the basis of per-component results alone when the aggregate output fails to meet thresholds.
A change to any component in a multi-model system can affect the outputs of every downstream component and the system's aggregate behaviour. The Technical SME maintains a cascade map: a directed graph showing which components consume the outputs of which other components. When a component changes, the cascade map identifies every downstream component that may be affected.
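A minimal sketch of the cascade-map traversal follows. The component names are hypothetical (borrowed from the RAG example), and in practice the map would be loaded from a version-controlled file rather than hard-coded.

```python
from collections import deque

# Hypothetical cascade map: component -> direct downstream consumers.
CASCADE_MAP = {
    "embedding-model": ["reranker"],
    "reranker": ["generator"],
    "generator": [],
}

def affected_components(changed: str, cascade_map: dict) -> set:
    """Return every component transitively downstream of `changed`."""
    affected = set()
    queue = deque(cascade_map.get(changed, []))
    while queue:
        node = queue.popleft()
        if node not in affected:
            affected.add(node)
            queue.extend(cascade_map.get(node, []))
    return affected
```

Calling `affected_components("embedding-model", CASCADE_MAP)` returns both the reranker and the generator: changing the first stage flags every later stage for re-evaluation, not just the immediate consumer.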
Change impact assessment. When a component is updated, retrained, reconfigured, or replaced, the Technical SME assesses the impact on all downstream components using the cascade map. The assessment determines three things: whether downstream components need re-evaluation, whether the aggregate evaluation thresholds still hold, and whether the change triggers the substantial modification criteria under Article 3.
Cascade testing. The governance pipeline implements cascade testing as a pipeline stage. When a component changes, the pipeline re-evaluates not only the changed component but every downstream component and the aggregate system. The pipeline fails if any component or the aggregate falls below its documented threshold. This is not optional good practice; without cascade testing, a component update that degrades downstream performance may go undetected until a market surveillance authority or incident report reveals it.
For example, in a RAG system, updating the embedding model changes the vectors stored and retrieved, which changes what the reranker receives, which changes what the generator produces. Testing only the embedding model in isolation would miss the downstream impact entirely. See for the full pipeline architecture.
The composite version identifier gains critical importance in multi-model systems. The system's behaviour at any point in time is determined by the specific combination of component versions deployed. The composite version must capture every component version, the configuration version, and the pipeline version.
The composite version manifest. Each deployment produces a manifest listing every component, its version, its model registry reference, and the configuration applied. The manifest is stored in the governance artefact registry and referenced in AISDP Module 10.
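One way to derive a composite version identifier is a content hash over the canonicalised manifest, so that any change to any component version, registry reference, or configuration yields a new identifier. The component names, URIs, and version strings below are illustrative.

```python
import hashlib
import json

# Hypothetical deployment manifest: every component version, its model
# registry reference, and the configuration and pipeline versions applied.
manifest = {
    "components": {
        "competency-extractor": {
            "version": "2.1.0",
            "registry_ref": "s3://models/comp-ext/2.1.0/",
            "training_data": "ds://training/competencies/v4.2",
        },
        "candidate-scorer": {
            "version": "3.0.0",
            "registry_ref": "s3://models/scorer/3.0.0/",
        },
    },
    "configuration_version": "cfg-7.3",
    "pipeline_version": "pipe-1.9",
}

# Canonical serialisation (sorted keys) makes the hash deterministic.
canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
composite_version = hashlib.sha256(canonical).hexdigest()[:12]
```

Because the identifier is derived rather than assigned, two deployments share a composite version only if every field of their manifests is identical.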
Reproducibility requirement. The Technical SME must be able to reproduce the system's behaviour at any historical point from the composite version manifest. This requires that every component version, every dataset version, and every configuration version referenced in the manifest is retrievable from the organisation's artefact stores.
The ten-year retention obligation under Article 18 applies to the complete set of artefacts referenced by the manifest, not merely the manifest itself. An organisation that retains the manifest but allows the underlying model weights, training data, or configuration files to expire cannot reproduce the system's historical behaviour and fails the retention requirement.
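The retention check can be mechanised as a periodic sweep over every reference in a manifest. In this sketch, `artefact_exists` stands in for whatever resolver the organisation's artefact stores provide (for example an object-store existence check); the manifest structure assumed here matches the illustrative one above.

```python
def missing_artefacts(manifest: dict, artefact_exists) -> list:
    """Return every manifest reference that is no longer retrievable.

    `artefact_exists` is a caller-supplied resolver so the sweep stays
    storage-agnostic. `manifest["components"]` maps each component to a
    record containing at least a `registry_ref` and optionally a
    `training_data` reference.
    """
    refs = []
    for component in manifest["components"].values():
        refs.append(component["registry_ref"])
        if "training_data" in component:
            refs.append(component["training_data"])
    return [ref for ref in refs if not artefact_exists(ref)]
```

A non-empty result means the organisation can no longer reproduce the corresponding historical deployment, which is precisely the failure mode the retention obligation prohibits.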
Fairness evaluation for composite systems is more complex than for single-model systems because bias can be introduced, amplified, or masked at each stage of the pipeline. The Technical SME must conduct fairness analysis at each stage, measuring whether the stage's outputs show disparate impact across protected characteristic subgroups.
Stage-by-stage fairness analysis. A competency extractor that is less accurate for CVs written in non-standard English introduces a bias at Stage 1 that propagates through every subsequent stage. A candidate scorer that is fair on clean competency data may amplify the extractor's bias by weighting the biased competencies. The bias compounds rather than cancels.
Interaction effects. The Technical SME evaluates whether the combination of individually fair components produces an unfair system. This requires the aggregate disaggregated evaluation described in the evaluation architecture above. The interaction effects are documented in AISDP Module 6 as a distinct risk category. Organisations cannot rely on per-component fairness reports; only the aggregate disaggregated evaluation reveals whether the composite system meets fairness thresholds. See Bias and Fairness for the full fairness evaluation framework.
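Assuming SRR denotes the ratio of the lowest subgroup selection rate to the highest, a per-stage check might look like the following sketch. The subgroup counts and the 0.85 threshold are illustrative; in practice the threshold comes from the component registry or AISDP Module 6.

```python
def selection_rate_ratio(selected_by_group: dict) -> float:
    """Lowest subgroup selection rate divided by the highest.

    `selected_by_group` maps subgroup -> (selected_count, total_count).
    """
    rates = [sel / total for sel, total in selected_by_group.values()]
    return min(rates) / max(rates)

# The same check runs on each stage's output and on the aggregate.
stage1_output = {"group_a": (480, 1000), "group_b": (430, 1000)}
srr = selection_rate_ratio(stage1_output)  # 0.43 / 0.48, roughly 0.896
assert srr >= 0.85, "Stage 1 breaches the documented fairness threshold"
```

Running the identical function at every stage and on the end-to-end output is what makes compounding visible: each stage may pass marginally while the aggregate ratio falls below threshold.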
The conformity assessment must evaluate the composite system, not merely its components. A system that demonstrates per-component compliance but cannot demonstrate aggregate compliance fails the assessment. The Conformity Assessment Coordinator structures the assessment in three phases.
Phase 1: component verification. Each component is verified against its individual requirements: training data documentation, performance metrics, and fairness evaluation. The component registry provides the inventory.
Phase 2: integration verification. Data flows between components, version alignment, and cascade testing results are verified. This phase checks that components work together as documented and that the cascade map accurately reflects the system's architecture.
Phase 3: aggregate verification. End-to-end performance, disaggregated fairness, risk management, and human oversight effectiveness are verified at the system level. The aggregate evaluation results from AISDP Module 5 are the primary evidence.
Non-conformities identified at any phase are logged in the non-conformity register with the specific component or interaction responsible. Remediation may require changes to a single component, to the interaction between components, or to the system's aggregate architecture. See Conformity Assessment for the full assessment methodology.
Automated cascade testing is essential for multi-model systems at scale. The CI/CD pipeline should implement cascade-aware testing: when a pull request modifies a model component, the pipeline automatically identifies downstream components via the cascade map and runs integration and aggregate tests. The pipeline fails if any downstream metric degrades beyond the declared tolerance.
The cascade map is maintained as a machine-readable configuration file (YAML or JSON), version-controlled alongside the pipeline definition. Tools such as DVC (Data Version Control) for data and model lineage, and MLflow for experiment tracking, provide the infrastructure for tracking component relationships.
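A machine-readable cascade map of this kind might take the following shape. The component names, versions, and field layout are illustrative, not a prescribed schema.

```yaml
# cascade_map.yaml: one possible shape, not a prescribed schema
components:
  embedding-model:
    version: 1.4.0
    downstream: [reranker]
  reranker:
    version: 2.0.1
    downstream: [generator]
  generator:
    version: 3.2.0
    downstream: []
```

Keeping the map next to the pipeline definition in version control means a pull request that changes a component's wiring also changes the map, so the two cannot silently diverge.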
Ensemble-level monitoring in the post-market monitoring programme should track not only aggregate metrics but per-component contribution metrics. These measure how much each component contributes to the aggregate output and whether that contribution is shifting over time. A component whose contribution decreases may be masking a degradation. A component whose contribution increases disproportionately may be introducing bias. Monitoring contribution metrics provides early warning of interaction effects before they manifest in aggregate performance. See Post-Market Monitoring for the broader monitoring framework.
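One simple realisation of contribution monitoring compares each component's contribution share against a recorded baseline. How the share is measured (ablation deltas, for instance) is a system-specific choice; the tolerance and figures below are illustrative.

```python
def contribution_drift(baseline: dict, current: dict,
                       tolerance: float = 0.10) -> dict:
    """Flag components whose contribution share moved by more than `tolerance`.

    Both arguments map component -> contribution share in [0, 1].
    Returns {component: signed_change} for every flagged component.
    """
    return {
        name: round(current[name] - baseline[name], 6)
        for name in baseline
        if abs(current[name] - baseline[name]) > tolerance
    }

flags = contribution_drift(
    baseline={"extractor": 0.40, "scorer": 0.35, "ranker": 0.25},
    current={"extractor": 0.55, "scorer": 0.25, "ranker": 0.20},
)
```

In this example only the extractor is flagged (its share rose by 0.15), which is the disproportionate-increase pattern the text identifies as a possible source of newly introduced bias.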
Procedural alternative for smaller systems. For systems with two to three model components, composite governance can be managed manually. The component registry is a spreadsheet. The cascade map is a documented diagram. Change impact assessment is conducted by the Technical SME through manual review. Cascade testing is executed manually by running downstream evaluation suites after each component change. For systems with five or more components, manual governance becomes unsustainable and automated cascade testing with component contribution monitoring is necessary.
**Frequently asked questions**

**Is per-component evaluation sufficient to demonstrate compliance?** No. The aggregate evaluation is the authoritative measure. Individual models may each perform within specification while the combined system produces unacceptable outputs due to correlated errors, bias amplification, or interaction effects.

**What tooling supports composite governance?** The cascade map is maintained as a machine-readable YAML or JSON file. DVC (Data Version Control) tracks data and model lineage, MLflow handles experiment tracking, and the CI/CD pipeline implements cascade-aware testing that automatically identifies downstream components when a model changes.

**Can smaller systems manage composite governance manually?** Systems with two to three components can manage composite governance manually using a spreadsheet registry, a documented cascade diagram, and manual downstream evaluation. Systems with five or more components require automated cascade testing and contribution monitoring.

**How does bias propagate through a multi-model pipeline?** Bias introduced at an early stage propagates through every subsequent stage. A competency extractor less accurate for non-standard English introduces Stage 1 bias that a downstream candidate scorer may amplify by weighting the biased competencies. Only aggregate disaggregated evaluation reveals the compounded effect.

**At what levels must a composite system be evaluated?** At four tiers: per-component in isolation, per-component in integration, aggregate end-to-end, and aggregate disaggregated by protected characteristics. The aggregate evaluation is authoritative; per-component results are diagnostic only.

**What happens when a component changes?** When a component changes, the governance pipeline re-evaluates every downstream component and the aggregate system using a directed-graph cascade map. The pipeline fails if any metric drops below documented thresholds.

**How is fairness evaluated in a composite system?** Stage-by-stage fairness analysis measures disparate impact at each pipeline stage, then aggregate disaggregated evaluation checks whether individually fair components produce an unfair system through interaction effects.

**How is the conformity assessment structured?** In three phases: component verification against individual requirements, integration verification of data flows and cascade testing, and aggregate verification of end-to-end performance and disaggregated fairness.