Composite AI systems require documentation of each model component and the system as a whole because compliance properties emerge from component interaction. Evaluation must operate at per-component and aggregate levels. Cascading change management, composite version control, and stage-by-stage fairness analysis address the unique challenges of multi-model architectures.
Annex IV requires documentation of the system's elements and the process for its development. For a single-model system, this is straightforward: one model, one training process, one set of evaluation results. For a composite system combining multiple models or chaining several models together, the documentation must cover each model component and the system as a whole, because the system's compliance properties emerge from the interaction of components and cannot be reduced to the sum of individual model evaluations.
A recruitment screening system may use an NLP model to extract competencies from CVs, a classification model to score candidates against role profiles, and a ranking model to order the results. A clinical decision support system may combine medical image analysis, a natural language model for clinical notes, and a risk stratification model. A RAG system with reranking uses at least three models: the embedding model, the reranker, and the generator. Each additional model introduces its own biases, failure modes, version control requirements, and documentation obligations.
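The chaining described above can be sketched as a minimal pipeline. This is an illustrative assumption, not any system's actual implementation: the function names, the toy competency set, and the scoring rule are all hypothetical stand-ins for real models.

```python
from typing import List, Tuple, Dict

# Hypothetical sketch of a recruitment screening pipeline. Each function
# stands in for a model component; names and logic are illustrative only.
def extract_competencies(cv_text: str) -> List[str]:
    # Stand-in for the NLP extraction model.
    known = {"python", "sql", "leadership"}
    return [tok for tok in cv_text.lower().split() if tok in known]

def score_candidate(competencies: List[str], role_profile: set) -> float:
    # Stand-in for the classification model scoring against a role profile.
    return len(set(competencies) & role_profile) / max(len(role_profile), 1)

def rank_candidates(scored: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    # Stand-in for the ranking model ordering the results.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def screen(cvs: Dict[str, str], role_profile: set) -> List[Tuple[str, float]]:
    # Each stage consumes the previous stage's output, so each stage's
    # biases and failure modes propagate to every downstream stage.
    scored = [(name, score_candidate(extract_competencies(text), role_profile))
              for name, text in cvs.items()]
    return rank_candidates(scored)
```

The point of the sketch is structural: because `score_candidate` only ever sees what `extract_competencies` emits, an extraction error is invisible to the scorer and to the ranker, which is why each added model brings its own failure modes into the composite.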
The Technical SME maintains a system component registry that documents every AI model component in the system. Each entry records the component identifier; the model type, such as transformer-based NER or gradient-boosted classifier; the provider, internal or third-party; the position in the pipeline, describing what the component consumes and what it outputs; the input and output specifications; the downstream consumers that depend on the component's output; the current version with its model registry reference; the training data version; the last evaluation date; and the fairness evaluation status across all measured subgroups.
The registry is a governed artefact, version-controlled alongside the system code and referenced in AISDP Module 2. It provides the Technical SME and the Conformity Assessment Coordinator with a complete inventory of the system's AI components. Adding a new component to the registry or replacing an existing component triggers the governance pipeline's change classification, as the change may constitute a substantial modification.
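One way to picture the registry is as a typed record per component plus a register operation that distinguishes additions from replacements, since both trigger change classification. The field names below are assumptions mirroring the fields listed above, not a prescribed schema.

```python
from dataclasses import dataclass

# Hypothetical registry entry; field names are illustrative assumptions.
@dataclass(frozen=True)
class ComponentEntry:
    component_id: str
    model_type: str              # e.g. "transformer-based NER"
    provider: str                # "internal" or "third-party"
    pipeline_position: str       # what it consumes and what it outputs
    input_spec: str
    output_spec: str
    downstream_consumers: tuple  # components depending on this output
    version: str
    registry_ref: str            # model registry reference
    training_data_version: str
    last_evaluation: str         # ISO date of last evaluation
    fairness_status: str         # status across measured subgroups

class ComponentRegistry:
    """Version-controlled inventory of the system's AI components."""
    def __init__(self):
        self._entries = {}

    def register(self, entry: ComponentEntry) -> str:
        # Both adding and replacing are governed changes that would
        # trigger the pipeline's change classification.
        action = "replace" if entry.component_id in self._entries else "add"
        self._entries[entry.component_id] = entry
        return action
```

The returned action label is the hook where a real implementation would raise the change-classification event rather than merely report it.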
The system's evaluation must operate at two levels: per-component evaluation testing whether each model performs within its documented parameters, and aggregate evaluation testing whether the combined system produces outputs satisfying the compliance thresholds. The aggregate evaluation is the authoritative measure; per-component evaluation is diagnostic.
The aggregation problem is that individual models may each perform within specification while the combined system produces unacceptable outputs. A competency extractor with 95 per cent precision feeding a candidate scorer with 90 per cent accuracy does not simply yield a system with a combined accuracy of approximately 86 per cent, the naive product of the two rates; the errors may correlate, amplify, or cancel in ways that depend on the specific data and interaction patterns.
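The bounds can be made concrete with a short calculation, using the accuracies from the example above and the simplifying assumption that the system is correct only when both components are correct. The numbers are illustrative, not measured results.

```python
# Illustrative arithmetic for the aggregation problem, assuming the
# system output is correct only when both components are correct.
p_a, p_b = 0.95, 0.90

# Naive estimate: treat the two error processes as independent.
independent = p_a * p_b  # ≈ 0.855, the "approximately 86 per cent" figure

# Fully overlapping errors: every case the weaker component gets wrong
# is also wrong upstream, so the system is only as bad as the weaker one.
best_case = min(p_a, p_b)  # 0.90

# Fully disjoint errors: no case fails in both components, so the
# failure rates add and the aggregate is worse than the naive product.
worst_case = 1 - ((1 - p_a) + (1 - p_b))  # 0.85
```

The spread between 0.85 and 0.90 is the range within which the true aggregate can land, which is exactly why the aggregate must be measured rather than inferred from component results.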
The evaluation architecture tests four levels. Per-component isolation testing evaluates each model against its own test set using component-specific metrics and thresholds. Per-component integration testing evaluates each model within the full pipeline using realistic upstream inputs, detecting failures that only appear when components interact. Aggregate end-to-end testing evaluates the complete system's outputs against ground truth using system-level accuracy, fairness metrics, and calibration thresholds documented in Module 5. Aggregate disaggregated testing evaluates outputs by protected characteristic subgroups using per-subgroup accuracy and fairness thresholds documented in Module 6. The evaluation results at all four levels are documented in AISDP Module 5, with per-component results providing diagnostic information and aggregate results determining the system's compliance status.
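A minimal harness for the four levels might look like the sketch below. The threshold values, metric names, and pass/fail structure are assumptions for illustration; the one property taken directly from the text is that only the two aggregate levels determine compliance.

```python
# Hypothetical sketch of the four-level evaluation architecture.
# Thresholds are illustrative assumptions, not prescribed values.
def evaluate(isolation_accs, integration_accs, aggregate_acc, subgroup_accs,
             component_threshold=0.90, aggregate_threshold=0.90,
             subgroup_threshold=0.85):
    report = {
        # Diagnostic levels: each component in isolation, then each
        # component inside the full pipeline with realistic inputs.
        "isolation_pass": all(a >= component_threshold
                              for a in isolation_accs.values()),
        "integration_pass": all(a >= component_threshold
                                for a in integration_accs.values()),
        # Authoritative levels: end-to-end outputs against ground truth,
        # then the same outputs disaggregated by protected subgroup.
        "aggregate_pass": aggregate_acc >= aggregate_threshold,
        "disaggregated_pass": all(a >= subgroup_threshold
                                  for a in subgroup_accs.values()),
    }
    # Compliance status is determined only by the aggregate levels;
    # per-component results are diagnostic.
    report["compliant"] = report["aggregate_pass"] and report["disaggregated_pass"]
    return report
```

Note how the structure encodes the authority relationship: a system can pass both diagnostic levels and still be non-compliant.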
Per-component compliance cannot substitute for aggregate compliance. A system that demonstrates per-component compliance but cannot demonstrate aggregate compliance fails the conformity assessment; the aggregate results are authoritative.
Each deployment must record every component version, its model registry reference, the configuration applied, and the pipeline version. Each deployment produces a manifest stored in the governance artefact registry.
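A deployment manifest carrying those fields could be built as below. The key names and the use of a content hash are assumptions; the hash is one common way to make the stored manifest tamper-evident, not a stated requirement.

```python
import hashlib
import json

# Hypothetical manifest builder; key names are illustrative assumptions
# mirroring the fields the deployment record must capture.
def build_manifest(pipeline_version: str, components: list) -> dict:
    manifest = {
        "pipeline_version": pipeline_version,
        "components": [
            {
                "component_id": c["component_id"],
                "version": c["version"],
                "registry_ref": c["registry_ref"],      # model registry ref
                "configuration": c["configuration"],    # config applied
            }
            for c in components
        ],
    }
    # Canonical serialisation plus a digest makes the stored artefact
    # deterministic and tamper-evident (an assumed design choice).
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["digest"] = hashlib.sha256(canonical).hexdigest()
    return manifest
```

Because the serialisation is canonical, rebuilding the manifest from the same inputs reproduces the same digest, which is what lets a later audit verify that the registry copy matches the deployment.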
A component that introduces bias at an early stage propagates that bias through every subsequent stage. A downstream component that is fair on clean data may amplify the upstream bias when processing biased inputs.
Bias can be introduced, amplified, or masked at each pipeline stage. Stage-by-stage analysis and aggregate disaggregated evaluation are needed to detect interaction effects.
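One simple form of stage-by-stage analysis is to compute a selection-rate disparity after each pipeline stage and look for the stage where it jumps. The disparity metric (max minus min selection rate across groups) and the record shape are illustrative assumptions.

```python
# Hypothetical stage-by-stage fairness probe. The metric and record
# shape are illustrative assumptions, not a prescribed method.
def selection_rates(records, group_key, selected_key):
    # Per-group selection rate: selected count over total count.
    by_group = {}
    for r in records:
        total, hits = by_group.get(r[group_key], (0, 0))
        by_group[r[group_key]] = (total + 1, hits + (1 if r[selected_key] else 0))
    return {g: hits / total for g, (total, hits) in by_group.items()}

def stagewise_disparity(stage_outputs, group_key="group", selected_key="selected"):
    # Max-minus-min selection rate per stage; a jump between consecutive
    # stages points at the component introducing or amplifying bias.
    return {
        stage: (lambda rates: max(rates.values()) - min(rates.values()))(
            selection_rates(records, group_key, selected_key))
        for stage, records in stage_outputs.items()
    }
```

A flat disparity through the extractor followed by a jump at the scorer, for example, localises the problem to the scorer, even though the scorer might look fair when tested on clean inputs in isolation.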