Many production AI systems rely on multiple models working together, not a single model in isolation. A recruitment screening system may use an NLP model to extract competencies, a classification model to score candidates, and a ranking model to order results. A clinical decision support system may combine medical image analysis, natural language processing for clinical notes, and risk stratification. A RAG system with reranking uses at least three models: the embedding model, the reranker, and the generator.
Each additional model introduces its own biases, failure modes, version control requirements, and documentation obligations. The AISDP must account for every model in the system, their interactions, and the emergent behaviour that arises from their combination. Annex IV requires documentation of "the elements of the AI system and of the process for its development." For a single-model system this is straightforward: one model, one training process, one set of evaluation results.
For a composite system, documentation must cover each model component and the system as a whole. The system's compliance properties emerge from the interaction of components and cannot be reduced to the sum of individual model evaluations. The outputs of multi-model governance feed into AISDP Modules 2, 3, 4, 5, 6, 9, 10, and 12.
The Technical SME maintains a system component registry that documents every AI model component in the system. This registry is a governed artefact, version-controlled and referenced in AISDP Module 2. It provides both the Technical SME and the Conformity Assessment Coordinator with a complete inventory of the system's AI components.
Each entry in the registry records the following fields:
| Field | Example (Multi-Model Recruitment System) |
|---|---|
| Component identifier | `competency-extractor-v2.1` |
| Model type | Transformer-based NER model (fine-tuned BERT) |
| Provider | Internal (trained on proprietary data) |
| Position in pipeline | Stage 1: processes raw CV text, outputs structured competency entities |
| Input specification | Raw text (UTF-8, max 50,000 characters) |
| Output specification | JSON array of competency objects with name, confidence, and source span |
| Downstream consumers | `candidate-scorer-v3.0` (Stage 2) |
| Version | 2.1.0 (model registry: `s3://models/comp-ext/2.1.0/`) |
| Training data version | `ds://training/competencies/v4.2` |
| Last evaluation date | 2026-01-15 |
| Fairness evaluation status | Passed (SRR at or above 0.85 across all subgroups) |
The registry serves as the single source of truth for what is deployed. When any component changes, the registry is updated before the change reaches production. Without this inventory, organisations cannot demonstrate to market surveillance authorities which models constitute their system or how those models interact.
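As one possible shape for a registry entry, the fields above can be captured in a typed record. This is a sketch only: the class name, field names, and example values are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ComponentEntry:
    """One entry in the system component registry (illustrative fields)."""
    component_id: str            # e.g. "competency-extractor-v2.1"
    model_type: str
    provider: str
    pipeline_stage: int
    input_spec: str
    output_spec: str
    downstream_consumers: tuple  # component_ids of direct consumers
    version: str
    training_data_version: str
    last_evaluation_date: str    # ISO 8601 date
    fairness_status: str

entry = ComponentEntry(
    component_id="competency-extractor-v2.1",
    model_type="Transformer-based NER model (fine-tuned BERT)",
    provider="Internal",
    pipeline_stage=1,
    input_spec="Raw text (UTF-8, max 50,000 characters)",
    output_spec="JSON array of competency objects",
    downstream_consumers=("candidate-scorer-v3.0",),
    version="2.1.0",
    training_data_version="ds://training/competencies/v4.2",
    last_evaluation_date="2026-01-15",
    fairness_status="Passed",
)
```

Making the record immutable (`frozen=True`) mirrors the governance rule that a registry entry changes only through a new versioned update, never by in-place mutation.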
Evaluation of composite systems must operate at two distinct levels: per-component evaluation and aggregate evaluation. Individual models may each perform within specification while the combined system produces unacceptable outputs. A competency extractor with 95 per cent precision feeding a candidate scorer with 90 per cent accuracy does not simply yield a system at the naive product of 85.5 per cent; the errors may correlate, amplify, or cancel in ways that depend on the specific data and interaction pattern.
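The non-compositionality of accuracy can be illustrated with a toy Monte Carlo sketch. The probabilities and the "hard item" structure below are invented purely for illustration; they are not measurements from any real system.

```python
import random

random.seed(0)
N = 100_000

# Independent errors: system accuracy approaches the naive product
# 0.95 * 0.90 = 0.855.
indep = sum(
    (random.random() < 0.95) and (random.random() < 0.90)
    for _ in range(N)
) / N

# Correlated errors: both components tend to fail on the same "hard"
# items, so the aggregate deviates from the naive product even though
# each component's marginal accuracy is unchanged (0.95 and 0.90).
corr = 0
for _ in range(N):
    hard = random.random() < 0.10  # 10% of items are "hard"
    extractor_ok = random.random() < (0.60 if hard else 0.989)
    scorer_ok = random.random() < (0.40 if hard else 0.956)
    corr += extractor_ok and scorer_ok
corr /= N
```

Here the correlated case comes out noticeably above the naive product because the two components' failures overlap; a different correlation structure could just as easily push the aggregate below it. Only measuring the aggregate settles the question.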
The aggregate evaluation is the authoritative measure. Per-component evaluation is diagnostic.
The Technical SME designs the evaluation architecture to test both levels across four tiers:
| Evaluation Level | What It Tests | Metric Examples | Failure Threshold |
|---|---|---|---|
| Per-component (isolation) | Each model component in isolation against its own test set | Component-specific accuracy, precision, recall, latency | Component-specific thresholds documented in the component registry |
| Per-component (integration) | Each model component within the full pipeline, using realistic upstream inputs | Same metrics, but on pipeline-processed inputs rather than clean test data | Same thresholds; failures here indicate integration issues |
| Aggregate (end-to-end) | The complete system's outputs against the ground truth | System-level accuracy, fairness metrics (SRR, equalised odds), calibration | System-level thresholds documented in AISDP Module 5 |
| Aggregate (disaggregated) | The complete system's outputs disaggregated by protected characteristic subgroups | Per-subgroup accuracy, per-subgroup SRR, intersectional analysis | Subgroup-specific thresholds documented in AISDP Module 6 |

The evaluation is documented in AISDP Module 5. Per-component results provide diagnostic information, while aggregate results determine the system's compliance status. The distinction matters: a system cannot claim compliance on the basis of per-component results alone when the aggregate output fails to meet thresholds.
A change to any component in a multi-model system can affect the outputs of every downstream component and the system's aggregate behaviour. The Technical SME maintains a cascade map: a directed graph showing which components consume the outputs of which other components. When a component changes, the cascade map identifies every downstream component that may be affected.
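A minimal sketch of the cascade-map traversal follows. The component names are hypothetical (borrowed from the RAG example), and in practice the map would be loaded from a version-controlled file rather than hard-coded.

```python
from collections import deque

# Hypothetical cascade map: component -> direct downstream consumers.
CASCADE_MAP = {
    "embedding-model": ["reranker"],
    "reranker": ["generator"],
    "generator": [],
}

def affected_components(changed: str, cascade_map: dict) -> set:
    """Return every component transitively downstream of `changed`."""
    affected = set()
    queue = deque(cascade_map.get(changed, []))
    while queue:
        node = queue.popleft()
        if node not in affected:
            affected.add(node)
            queue.extend(cascade_map.get(node, []))
    return affected
```

Calling `affected_components("embedding-model", CASCADE_MAP)` returns both the reranker and the generator: changing the first stage flags every later stage for re-evaluation, not just the immediate consumer.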
Change impact assessment. When a component is updated, retrained, reconfigured, or replaced, the Technical SME assesses the impact on all downstream components using the cascade map. The assessment determines three things: whether downstream components need re-evaluation, whether the aggregate evaluation thresholds still hold, and whether the change triggers the substantial modification criteria under Article 3.
Cascade testing. The governance pipeline implements cascade testing as a pipeline stage. When a component changes, the pipeline re-evaluates not only the changed component but every downstream component and the aggregate system. The pipeline fails if any component or the aggregate falls below its documented threshold. This is not optional good practice; without cascade testing, a component update that degrades downstream performance may go undetected until a market surveillance authority or incident report reveals it.
For example, in a RAG system, updating the embedding model changes the vectors stored and retrieved, which changes what the reranker receives, which changes what the generator produces. Testing only the embedding model in isolation would miss the downstream impact entirely. See for the full pipeline architecture.
The composite version identifier gains critical importance in multi-model systems. The system's behaviour at any point in time is determined by the specific combination of component versions deployed. The composite version must capture every component version, the configuration version, and the pipeline version.
The composite version manifest. Each deployment produces a manifest listing every component, its version, its model registry reference, and the configuration applied. The manifest is stored in the governance artefact registry and referenced in AISDP Module 10.
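One way to derive a composite version identifier is a content hash over the canonicalised manifest, so that any change to any component version, registry reference, or configuration yields a new identifier. The component names, URIs, and version strings below are illustrative.

```python
import hashlib
import json

# Hypothetical deployment manifest: every component version, its model
# registry reference, and the configuration and pipeline versions applied.
manifest = {
    "components": {
        "competency-extractor": {
            "version": "2.1.0",
            "registry_ref": "s3://models/comp-ext/2.1.0/",
            "training_data": "ds://training/competencies/v4.2",
        },
        "candidate-scorer": {
            "version": "3.0.0",
            "registry_ref": "s3://models/scorer/3.0.0/",
        },
    },
    "configuration_version": "cfg-7.3",
    "pipeline_version": "pipe-1.9",
}

# Canonical serialisation (sorted keys) makes the hash deterministic.
canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
composite_version = hashlib.sha256(canonical).hexdigest()[:12]
```

Because the identifier is derived rather than assigned, two deployments share a composite version only if every field of their manifests is identical.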
Reproducibility requirement. The Technical SME must be able to reproduce the system's behaviour at any historical point from the composite version manifest. This requires that every component version, every dataset version, and every configuration version referenced in the manifest is retrievable from the organisation's artefact stores.
The ten-year retention obligation under Article 18 applies to the complete set of artefacts referenced by the manifest, not merely the manifest itself. An organisation that retains the manifest but allows the underlying model weights, training data, or configuration files to expire cannot reproduce the system's historical behaviour and fails the retention requirement.
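The retention check can be mechanised as a periodic sweep over every reference in a manifest. In this sketch, `artefact_exists` stands in for whatever resolver the organisation's artefact stores provide (for example an object-store existence check); the manifest structure assumed here matches the illustrative one above.

```python
def missing_artefacts(manifest: dict, artefact_exists) -> list:
    """Return every manifest reference that is no longer retrievable.

    `artefact_exists` is a caller-supplied resolver so the sweep stays
    storage-agnostic. `manifest["components"]` maps each component to a
    record containing at least a `registry_ref` and optionally a
    `training_data` reference.
    """
    refs = []
    for component in manifest["components"].values():
        refs.append(component["registry_ref"])
        if "training_data" in component:
            refs.append(component["training_data"])
    return [ref for ref in refs if not artefact_exists(ref)]
```

A non-empty result means the organisation can no longer reproduce the corresponding historical deployment, which is precisely the failure mode the retention obligation prohibits.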
Fairness evaluation for composite systems is more complex than for single-model systems because bias can be introduced, amplified, or masked at each stage of the pipeline. The Technical SME must conduct fairness analysis at each stage, measuring whether the stage's outputs show disparate impact across protected characteristic subgroups.
Stage-by-stage fairness analysis. A competency extractor that is less accurate for CVs written in non-standard English introduces a bias at Stage 1 that propagates through every subsequent stage. A candidate scorer that is fair on clean competency data may amplify the extractor's bias by weighting the biased competencies. The bias compounds rather than cancels.
Interaction effects. The Technical SME evaluates whether the combination of individually fair components produces an unfair system. This requires the aggregate disaggregated evaluation described in the evaluation architecture above. The interaction effects are documented in AISDP Module 6 as a distinct risk category. Organisations cannot rely on per-component fairness reports; only the aggregate disaggregated evaluation reveals whether the composite system meets fairness thresholds. See Bias and Fairness for the full fairness evaluation framework.
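Assuming SRR denotes the ratio of the lowest subgroup selection rate to the highest, a per-stage check might look like the following sketch. The subgroup counts and the 0.85 threshold are illustrative; in practice the threshold comes from the component registry or AISDP Module 6.

```python
def selection_rate_ratio(selected_by_group: dict) -> float:
    """Lowest subgroup selection rate divided by the highest.

    `selected_by_group` maps subgroup -> (selected_count, total_count).
    """
    rates = [sel / total for sel, total in selected_by_group.values()]
    return min(rates) / max(rates)

# The same check runs on each stage's output and on the aggregate.
stage1_output = {"group_a": (480, 1000), "group_b": (430, 1000)}
srr = selection_rate_ratio(stage1_output)  # 0.43 / 0.48, roughly 0.896
assert srr >= 0.85, "Stage 1 breaches the documented fairness threshold"
```

Running the identical function at every stage and on the end-to-end output is what makes compounding visible: each stage may pass marginally while the aggregate ratio falls below threshold.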
The conformity assessment must evaluate the composite system, not merely its components. A system that demonstrates per-component compliance but cannot demonstrate aggregate compliance fails the assessment. The Conformity Assessment Coordinator structures the assessment in three phases.
Phase 1: component verification. Each component is verified against its individual requirements: training data documentation, performance metrics, and fairness evaluation. The component registry provides the inventory.
Phase 2: integration verification. Data flows between components, version alignment, and cascade testing results are verified. This phase checks that components work together as documented and that the cascade map accurately reflects the system's architecture.
Phase 3: aggregate verification. End-to-end performance, disaggregated fairness, risk management, and human oversight effectiveness are verified at the system level. The aggregate evaluation results from AISDP Module 5 are the primary evidence.
Non-conformities identified at any phase are logged in the non-conformity register with the specific component or interaction responsible. Remediation may require changes to a single component, to the interaction between components, or to the system's aggregate architecture. See Conformity Assessment for the full assessment methodology.
Automated cascade testing is essential for multi-model systems at scale. The CI/CD pipeline should implement cascade-aware testing: when a pull request modifies a model component, the pipeline automatically identifies downstream components via the cascade map and runs integration and aggregate tests. The pipeline fails if any downstream metric degrades beyond the declared tolerance.
The cascade map is maintained as a machine-readable configuration file (YAML or JSON), version-controlled alongside the pipeline definition. Tools such as DVC (Data Version Control) for data and model lineage, and MLflow for experiment tracking, provide the infrastructure for tracking component relationships.
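A machine-readable cascade map of this kind might take the following shape. The component names, versions, and field layout are illustrative, not a prescribed schema.

```yaml
# cascade_map.yaml: one possible shape, not a prescribed schema
components:
  embedding-model:
    version: 1.4.0
    downstream: [reranker]
  reranker:
    version: 2.0.1
    downstream: [generator]
  generator:
    version: 3.2.0
    downstream: []
```

Keeping the map next to the pipeline definition in version control means a pull request that changes a component's wiring also changes the map, so the two cannot silently diverge.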
Ensemble-level monitoring in the post-market monitoring programme should track not only aggregate metrics but per-component contribution metrics. These measure how much each component contributes to the aggregate output and whether that contribution is shifting over time. A component whose contribution decreases may be masking a degradation. A component whose contribution increases disproportionately may be introducing bias. Monitoring contribution metrics provides early warning of interaction effects before they manifest in aggregate performance. See Post-Market Monitoring for the broader monitoring framework.
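One simple realisation of contribution monitoring compares each component's contribution share against a recorded baseline. How the share is measured (ablation deltas, for instance) is a system-specific choice; the tolerance and figures below are illustrative.

```python
def contribution_drift(baseline: dict, current: dict,
                       tolerance: float = 0.10) -> dict:
    """Flag components whose contribution share moved by more than `tolerance`.

    Both arguments map component -> contribution share in [0, 1].
    Returns {component: signed_change} for every flagged component.
    """
    return {
        name: round(current[name] - baseline[name], 6)
        for name in baseline
        if abs(current[name] - baseline[name]) > tolerance
    }

flags = contribution_drift(
    baseline={"extractor": 0.40, "scorer": 0.35, "ranker": 0.25},
    current={"extractor": 0.55, "scorer": 0.25, "ranker": 0.20},
)
```

In this example only the extractor is flagged (its share rose by 0.15), which is the disproportionate-increase pattern the text identifies as a possible source of newly introduced bias.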
Procedural alternative for smaller systems. For systems with two to three model components, composite governance can be managed manually. The component registry is a spreadsheet. The cascade map is a documented diagram. Change impact assessment is conducted by the Technical SME through manual review. Cascade testing is executed manually by running downstream evaluation suites after each component change. For systems with five or more components, manual governance becomes unsustainable and automated cascade testing with component contribution monitoring is necessary.
**Frequently asked questions**

**Is per-component evaluation sufficient to demonstrate compliance?** No. The aggregate evaluation is the authoritative measure. Individual models may each perform within specification while the combined system produces unacceptable outputs due to correlated errors, bias amplification, or interaction effects.

**What tooling supports composite governance?** The cascade map is maintained as a machine-readable YAML or JSON file. DVC (Data Version Control) tracks data and model lineage, MLflow handles experiment tracking, and the CI/CD pipeline implements cascade-aware testing that automatically identifies downstream components when a model changes.

**Can smaller systems manage composite governance manually?** Systems with two to three components can manage composite governance manually using a spreadsheet registry, a documented cascade diagram, and manual downstream evaluation. Systems with five or more components require automated cascade testing and contribution monitoring.

**How does bias propagate through a multi-model pipeline?** Bias introduced at an early stage propagates through every subsequent stage. A competency extractor less accurate for non-standard English introduces Stage 1 bias that a downstream candidate scorer may amplify by weighting the biased competencies. Only aggregate disaggregated evaluation reveals the compounded effect.

**At what levels must a composite system be evaluated?** At four tiers: per-component in isolation, per-component in integration, aggregate end-to-end, and aggregate disaggregated by protected characteristics. The aggregate evaluation is authoritative; per-component results are diagnostic only.

**What happens when a component changes?** When a component changes, the governance pipeline re-evaluates every downstream component and the aggregate system using a directed-graph cascade map. The pipeline fails if any metric drops below documented thresholds.

**How is fairness evaluated in a composite system?** Stage-by-stage fairness analysis measures disparate impact at each pipeline stage, then aggregate disaggregated evaluation checks whether individually fair components produce an unfair system through interaction effects.

**How is the conformity assessment structured?** In three phases: component verification against individual requirements, integration verification of data flows and cascade testing, and aggregate verification of end-to-end performance and disaggregated fairness.