Composite AI systems require documentation of each model component and the system as a whole because compliance properties emerge from component interaction. Evaluation must operate at per-component and aggregate levels. Cascading change management, composite version control, and stage-by-stage fairness analysis address the unique challenges of multi-model architectures.
Annex IV requires documentation of the system's elements and the process for its development. For a single-model system, this is straightforward: one model, one training process, one set of evaluation results. For a composite system combining multiple models or chaining several models together, the documentation must cover each model component and the system as a whole, because the system's compliance properties emerge from the interaction of components and cannot be reduced to the sum of individual model evaluations.
A recruitment screening system may use an NLP model to extract competencies from CVs, a classification model to score candidates against role profiles, and a ranking model to order the results. A clinical decision support system may combine medical image analysis, a natural language model for clinical notes, and a risk stratification model. A RAG system with reranking uses at least three models: the embedding model, the reranker, and the generator. Each additional model introduces its own biases, failure modes, version control requirements, and documentation obligations.
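The chaining described above can be sketched as a minimal pipeline. This is an illustrative assumption, not any system's actual implementation: the function names, the toy competency set, and the scoring rule are all hypothetical stand-ins for real models.

```python
from typing import List, Tuple, Dict

# Hypothetical sketch of a recruitment screening pipeline. Each function
# stands in for a model component; names and logic are illustrative only.
def extract_competencies(cv_text: str) -> List[str]:
    # Stand-in for the NLP extraction model.
    known = {"python", "sql", "leadership"}
    return [tok for tok in cv_text.lower().split() if tok in known]

def score_candidate(competencies: List[str], role_profile: set) -> float:
    # Stand-in for the classification model scoring against a role profile.
    return len(set(competencies) & role_profile) / max(len(role_profile), 1)

def rank_candidates(scored: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    # Stand-in for the ranking model ordering the results.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def screen(cvs: Dict[str, str], role_profile: set) -> List[Tuple[str, float]]:
    # Each stage consumes the previous stage's output, so each stage's
    # biases and failure modes propagate to every downstream stage.
    scored = [(name, score_candidate(extract_competencies(text), role_profile))
              for name, text in cvs.items()]
    return rank_candidates(scored)
```

The point of the sketch is structural: because `score_candidate` only ever sees what `extract_competencies` emits, an extraction error is invisible to the scorer and to the ranker, which is why each added model brings its own failure modes into the composite.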
The Technical SME maintains a system component registry that documents every AI model component in the system. Each entry records the component identifier; the model type, such as transformer-based NER or gradient-boosted classifier; the provider, internal or third-party; the position in the pipeline, describing what the component consumes and what it outputs; the input and output specifications; the downstream consumers that depend on the component's output; the current version with its model registry reference; the training data version; the last evaluation date; and the fairness evaluation status across all measured subgroups.
The registry is a governed artefact, version-controlled alongside the system code and referenced in AISDP Module 2. It provides the Technical SME and the Conformity Assessment Coordinator with a complete inventory of the system's AI components. Adding a new component to the registry or replacing an existing component triggers the governance pipeline's change classification, as the change may constitute a substantial modification.
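One way to picture the registry is as a typed record per component plus a register operation that distinguishes additions from replacements, since both trigger change classification. The field names below are assumptions mirroring the fields listed above, not a prescribed schema.

```python
from dataclasses import dataclass

# Hypothetical registry entry; field names are illustrative assumptions.
@dataclass(frozen=True)
class ComponentEntry:
    component_id: str
    model_type: str              # e.g. "transformer-based NER"
    provider: str                # "internal" or "third-party"
    pipeline_position: str       # what it consumes and what it outputs
    input_spec: str
    output_spec: str
    downstream_consumers: tuple  # components depending on this output
    version: str
    registry_ref: str            # model registry reference
    training_data_version: str
    last_evaluation: str         # ISO date of last evaluation
    fairness_status: str         # status across measured subgroups

class ComponentRegistry:
    """Version-controlled inventory of the system's AI components."""
    def __init__(self):
        self._entries = {}

    def register(self, entry: ComponentEntry) -> str:
        # Both adding and replacing are governed changes that would
        # trigger the pipeline's change classification.
        action = "replace" if entry.component_id in self._entries else "add"
        self._entries[entry.component_id] = entry
        return action
```

The returned action label is the hook where a real implementation would raise the change-classification event rather than merely report it.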
The system's evaluation must operate at two levels: per-component evaluation testing whether each model performs within its documented parameters, and aggregate evaluation testing whether the combined system produces outputs satisfying the compliance thresholds. The aggregate evaluation is the authoritative measure; per-component evaluation is diagnostic.
The aggregation problem is that individual models may each perform within specification while the combined system produces unacceptable outputs. A competency extractor with 95 per cent precision feeding a candidate scorer with 90 per cent accuracy does not simply yield a system with a combined accuracy of approximately 86 per cent, the naive product of the two rates; the errors may correlate, amplify, or cancel in ways that depend on the specific data and interaction patterns.
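The bounds can be made concrete with a short calculation, using the accuracies from the example above and the simplifying assumption that the system is correct only when both components are correct. The numbers are illustrative, not measured results.

```python
# Illustrative arithmetic for the aggregation problem, assuming the
# system output is correct only when both components are correct.
p_a, p_b = 0.95, 0.90

# Naive estimate: treat the two error processes as independent.
independent = p_a * p_b  # ≈ 0.855, the "approximately 86 per cent" figure

# Fully overlapping errors: every case the weaker component gets wrong
# is also wrong upstream, so the system is only as bad as the weaker one.
best_case = min(p_a, p_b)  # 0.90

# Fully disjoint errors: no case fails in both components, so the
# failure rates add and the aggregate is worse than the naive product.
worst_case = 1 - ((1 - p_a) + (1 - p_b))  # 0.85
```

The spread between 0.85 and 0.90 is the range within which the true aggregate can land, which is exactly why the aggregate must be measured rather than inferred from component results.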
The evaluation architecture tests four levels. Per-component isolation testing evaluates each model against its own test set using component-specific metrics and thresholds. Per-component integration testing evaluates each model within the full pipeline using realistic upstream inputs, detecting failures that only appear when components interact. Aggregate end-to-end testing evaluates the complete system's outputs against ground truth using system-level accuracy, fairness metrics, and calibration thresholds documented in Module 5. Aggregate disaggregated testing evaluates outputs by protected characteristic subgroups using per-subgroup accuracy and fairness thresholds documented in Module 6. The evaluation results at all four levels are documented in AISDP Module 5, with per-component results providing diagnostic information and aggregate results determining the system's compliance status.
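A minimal harness for the four levels might look like the sketch below. The threshold values, metric names, and pass/fail structure are assumptions for illustration; the one property taken directly from the text is that only the two aggregate levels determine compliance.

```python
# Hypothetical sketch of the four-level evaluation architecture.
# Thresholds are illustrative assumptions, not prescribed values.
def evaluate(isolation_accs, integration_accs, aggregate_acc, subgroup_accs,
             component_threshold=0.90, aggregate_threshold=0.90,
             subgroup_threshold=0.85):
    report = {
        # Diagnostic levels: each component in isolation, then each
        # component inside the full pipeline with realistic inputs.
        "isolation_pass": all(a >= component_threshold
                              for a in isolation_accs.values()),
        "integration_pass": all(a >= component_threshold
                                for a in integration_accs.values()),
        # Authoritative levels: end-to-end outputs against ground truth,
        # then the same outputs disaggregated by protected subgroup.
        "aggregate_pass": aggregate_acc >= aggregate_threshold,
        "disaggregated_pass": all(a >= subgroup_threshold
                                  for a in subgroup_accs.values()),
    }
    # Compliance status is determined only by the aggregate levels;
    # per-component results are diagnostic.
    report["compliant"] = report["aggregate_pass"] and report["disaggregated_pass"]
    return report
```

Note how the structure encodes the authority relationship: a system can pass both diagnostic levels and still be non-compliant.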
Per-component compliance cannot substitute for aggregate compliance. A system that demonstrates per-component compliance but cannot demonstrate aggregate compliance fails the conformity assessment; the aggregate results are authoritative.
Each deployment must record every component version, its model registry reference, the configuration applied, and the pipeline version. Each deployment produces a manifest stored in the governance artefact registry.
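A deployment manifest carrying those fields could be built as below. The key names and the use of a content hash are assumptions; the hash is one common way to make the stored manifest tamper-evident, not a stated requirement.

```python
import hashlib
import json

# Hypothetical manifest builder; key names are illustrative assumptions
# mirroring the fields the deployment record must capture.
def build_manifest(pipeline_version: str, components: list) -> dict:
    manifest = {
        "pipeline_version": pipeline_version,
        "components": [
            {
                "component_id": c["component_id"],
                "version": c["version"],
                "registry_ref": c["registry_ref"],      # model registry ref
                "configuration": c["configuration"],    # config applied
            }
            for c in components
        ],
    }
    # Canonical serialisation plus a digest makes the stored artefact
    # deterministic and tamper-evident (an assumed design choice).
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["digest"] = hashlib.sha256(canonical).hexdigest()
    return manifest
```

Because the serialisation is canonical, rebuilding the manifest from the same inputs reproduces the same digest, which is what lets a later audit verify that the registry copy matches the deployment.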
A component that introduces bias at an early stage propagates that bias through every subsequent stage. A downstream component that is fair on clean data may amplify the upstream bias when processing biased inputs.
Bias can be introduced, amplified, or masked at each pipeline stage. Stage-by-stage analysis and aggregate disaggregated evaluation are needed to detect interaction effects.
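One simple form of stage-by-stage analysis is to compute a selection-rate disparity after each pipeline stage and look for the stage where it jumps. The disparity metric (max minus min selection rate across groups) and the record shape are illustrative assumptions.

```python
# Hypothetical stage-by-stage fairness probe. The metric and record
# shape are illustrative assumptions, not a prescribed method.
def selection_rates(records, group_key, selected_key):
    # Per-group selection rate: selected count over total count.
    by_group = {}
    for r in records:
        total, hits = by_group.get(r[group_key], (0, 0))
        by_group[r[group_key]] = (total + 1, hits + (1 if r[selected_key] else 0))
    return {g: hits / total for g, (total, hits) in by_group.items()}

def stagewise_disparity(stage_outputs, group_key="group", selected_key="selected"):
    # Max-minus-min selection rate per stage; a jump between consecutive
    # stages points at the component introducing or amplifying bias.
    return {
        stage: (lambda rates: max(rates.values()) - min(rates.values()))(
            selection_rates(records, group_key, selected_key))
        for stage, records in stage_outputs.items()
    }
```

A flat disparity through the extractor followed by a jump at the scorer, for example, localises the problem to the scorer, even though the scorer might look fair when tested on clean inputs in isolation.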