The EU AI Act requires organisations deploying high-risk AI systems to demonstrate that model selection decisions account for regulatory obligations. Article 9, Article 10, Article 12, Article 15, and Annex IV each impose requirements that directly affect which model architectures are suitable for compliant deployment.
Organisations must assess candidate models against six compliance criteria alongside traditional performance metrics: documentability, testability, auditability, bias detectability, maintainability, and determinism.
The Technical SME evaluates each criterion and scores it on a three-level scale of strong, adequate, or weak. The AI System Assessor records these scores in the model selection rationale document, creating an auditable record of why a particular model was chosen. This structured evaluation ensures that compliance considerations inform the selection decision from the outset, rather than being retrofitted after a model has already been integrated into the system. Model Selection and Due Diligence covers the broader due diligence framework within which these criteria sit.
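A minimal sketch of how such a scoring record might be structured in code. The `SelectionRationale` class, its fields, and the completeness check are illustrative assumptions, not a prescribed schema; only the six criterion names and the three-level scale come from this section.

```python
from dataclasses import dataclass, field
from typing import Dict

LEVELS = ("strong", "adequate", "weak")

CRITERIA = (
    "documentability", "testability", "auditability",
    "bias detectability", "maintainability", "determinism",
)

@dataclass
class SelectionRationale:
    """One auditable scoring record per candidate model (hypothetical schema)."""
    model_name: str
    scores: Dict[str, str] = field(default_factory=dict)

    def score(self, criterion: str, level: str) -> None:
        if criterion not in CRITERIA:
            raise ValueError(f"unknown criterion: {criterion}")
        if level not in LEVELS:
            raise ValueError(f"level must be one of {LEVELS}")
        self.scores[criterion] = level

    def is_complete(self) -> bool:
        # Every criterion must be scored before the rationale is auditable.
        return set(self.scores) == set(CRITERIA)

# Illustrative usage: score a hypothetical candidate on all six criteria.
rationale = SelectionRationale("candidate-logreg-v1")
for criterion in CRITERIA:
    rationale.score(criterion, "strong" if criterion == "documentability" else "adequate")
```

A structured record like this makes the rationale machine-checkable: an incomplete scoring (a criterion skipped) can be caught before the document is filed.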
Each criterion maps to a specific regulatory obligation under the EU AI Act. Documentability relates to Annex IV requirements for design specifications. Testability supports the validated risk controls required by Article 9. Auditability addresses the record-keeping obligations of Article 12. Bias detectability connects to the bias assessment requirements of Article 10. Maintainability underpins the ongoing resilience mandate of Article 15. Determinism supports reproducibility during conformity assessment.
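The criterion-to-provision mapping above can be captured as a simple lookup table. A sketch, with the provision descriptions paraphrased from this section:

```python
# Lookup table pairing each selection criterion with the EU AI Act
# provision it supports, as described in the text above.
CRITERION_TO_PROVISION = {
    "documentability": "Annex IV (design specifications)",
    "testability": "Article 9 (validated risk controls)",
    "auditability": "Article 12 (record-keeping)",
    "bias detectability": "Article 10 (bias assessment)",
    "maintainability": "Article 15 (ongoing resilience)",
    "determinism": "conformity assessment (reproducibility)",
}

def provision_for(criterion: str) -> str:
    """Return the regulatory provision a given criterion maps to."""
    return CRITERION_TO_PROVISION[criterion]
```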
Documentability asks whether the model's architecture, hyperparameters, and decision process can be described precisely enough to satisfy Annex IV, Section 2.
The assessment determines whether a qualified reviewer could reproduce the training process from the documentation alone. The Technical SME reviews the model architecture and determines whether its structure can be expressed in a technical specification document.
Different architectures score very differently on this criterion. A logistic regression model has strong documentability because every parameter is a named coefficient with a direct interpretation. A transformer with billions of parameters has weaker documentability: the architecture itself can be described, but the learned representations cannot be enumerated at the parameter level. Where documentation gaps exist, the assessment should identify compensating controls that would be needed, such as detailed behavioural characterisation in lieu of parameter-level documentation. The AI system description package must capture whatever level of documentation the model architecture supports.
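The contrast can be illustrated with a toy model: for a logistic regression, the entire learned state is a short list of named coefficients that fits directly in a specification document. The feature names and coefficient values below are invented for illustration.

```python
import math

# Toy logistic regression: every parameter is a named coefficient with a
# direct interpretation, so the full model can be enumerated in a spec.
# Feature names and values are illustrative, not from any real system.
coefficients = {"intercept": -1.2, "income": 0.8, "tenure_months": 0.05}

def predict_proba(features: dict) -> float:
    """Score an input: a weighted sum of named features through a sigmoid."""
    z = coefficients["intercept"] + sum(
        coefficients[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-z))

def specification_rows() -> list:
    # The entire learned state, enumerated: this is strong documentability.
    # A billion-parameter transformer offers no equivalent of this table.
    return [(name, value) for name, value in sorted(coefficients.items())]
```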
Testability determines whether the model architecture supports the testing required by Article 15, covering accuracy, robustness, and fairness evaluation.
The assessment establishes whether standard evaluation methodologies exist for the candidate architecture and whether those methodologies are sufficient for the system's risk profile. Risk Assessment Fundamentals provides the framework for determining what level of testing the risk profile demands.
Architecture type has a significant effect on testing complexity. Some architectures, such as decision trees and linear models, produce deterministic outputs that simplify testing because the same input always yields the same output. Others, including large language models and diffusion models, produce stochastic outputs that require statistical testing frameworks to evaluate meaningfully. The assessment should specify the testing methodology that would be needed for the candidate architecture and estimate the testing effort involved, so that the organisation can factor testing costs into the selection decision.
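The difference in testing approach can be sketched as follows, using a stand-in stochastic scorer rather than any real model. The `stochastic_model` function, its prompts, and the tolerance values are all assumptions for illustration: a deterministic model needs one assertion per input, while a stochastic one needs repeated sampling and a check on the aggregate.

```python
import random
import statistics

def stochastic_model(prompt: str, rng: random.Random) -> float:
    # Stand-in for a stochastic model: a base score plus sampling noise.
    base = 0.7 if "refund" in prompt else 0.3
    return base + rng.gauss(0.0, 0.05)

def statistical_accuracy_test(prompt: str, expected: float,
                              runs: int = 200, tolerance: float = 0.03) -> bool:
    """Statistical testing frame: sample the model repeatedly and check
    that the aggregate behaviour falls within tolerance of the target."""
    rng = random.Random(42)  # fixed seed so the test itself is reproducible
    samples = [stochastic_model(prompt, rng) for _ in range(runs)]
    return abs(statistics.mean(samples) - expected) < tolerance
```

The number of runs needed for a given confidence level is one concrete input to the testing-effort estimate the assessment should produce.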
Auditability assesses whether the model produces outputs that can be logged, traced, and attributed in accordance with Article 12's record-keeping requirements.
The key question is whether individual decisions can be reconstructed from the logs after the fact.
Models that require only the input and the model version for output reconstruction are strongly auditable, because a third party can reproduce any given decision with minimal context. Models where the output depends on runtime conditions present greater challenges. When session state, conversation history, or retrieval-augmented generation context influences the output, more sophisticated logging is required. The assessment must specify exactly what the system needs to log to enable full decision reconstruction, and this logging specification feeds directly into the system's technical documentation and monitoring design. Logging and Traceability addresses the detailed implementation of these logging requirements.
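One possible shape for such a log record, sketched with invented field names: it captures the input, the model version, and the runtime context (sampling settings, session state, retrieved passages) that a context-dependent model would need for reconstruction. The content hash is a simple integrity aid, not a mandated mechanism.

```python
import hashlib
import json

def make_decision_record(model_version: str, inputs: dict,
                         runtime_context: dict, output: str) -> dict:
    """Hypothetical log record: everything needed to re-run one decision.
    For context-dependent models, the runtime context must be captured
    alongside the raw input, or reconstruction is impossible."""
    record = {
        "model_version": model_version,
        "inputs": inputs,
        "runtime_context": runtime_context,
        "output": output,
    }
    # A content hash lets an auditor verify the record was not altered.
    payload = json.dumps(record, sort_keys=True).encode()
    record["record_hash"] = hashlib.sha256(payload).hexdigest()
    return record

def verify_record(record: dict) -> bool:
    """Recompute the hash over the record body and compare."""
    body = {k: v for k, v in record.items() if k != "record_hash"}
    payload = json.dumps(body, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest() == record["record_hash"]
```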
Bias detectability measures whether fairness metrics can be computed at the subgroup level and whether the model can be interrogated for proxy variable effects.
The assessment determines whether the candidate architecture supports feature attribution methods, such as SHAP, LIME, or integrated gradients, that can identify when ostensibly neutral features act as proxies for protected characteristics.
The model's output format matters for fairness analysis. Models that produce calibrated probability scores are more amenable to disaggregated fairness evaluation across all protected characteristic subgroups. Models that produce only ranked outputs or categorical labels provide less material for subgroup analysis. The Technical SME must determine whether the architecture supports fairness-aware training or post-hoc calibration as remediation options if bias is detected during evaluation or post-market monitoring.
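A minimal example of disaggregated fairness evaluation: computing per-subgroup selection rates and a demographic parity gap. The decision data and the choice of metric are illustrative; a real assessment would evaluate several fairness metrics across all relevant subgroups.

```python
from collections import defaultdict

def selection_rates(decisions):
    """Disaggregate positive-outcome rates by subgroup.
    `decisions` is a list of (subgroup, approved) pairs — invented data shape."""
    counts = defaultdict(lambda: [0, 0])  # subgroup -> [approved, total]
    for subgroup, approved in decisions:
        counts[subgroup][0] += int(approved)
        counts[subgroup][1] += 1
    return {g: approved / total for g, (approved, total) in counts.items()}

def demographic_parity_gap(decisions) -> float:
    """Largest difference in selection rate between any two subgroups."""
    rates = selection_rates(decisions)
    return max(rates.values()) - min(rates.values())

# Illustrative decisions: subgroup "a" is approved at twice the rate of "b".
sample = [("a", True), ("a", True), ("a", False),
          ("b", True), ("b", False), ("b", False)]
```

Note that this style of analysis presupposes per-decision outputs; a model emitting only ranked lists would need a rank-based fairness metric instead.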
Maintainability evaluates whether the model can be retrained, fine-tuned, or recalibrated in response to post-market monitoring findings without triggering a substantial modification assessment under Article 43.
The criterion also examines whether the model's behaviour remains stable across minor updates.
Architecture type strongly influences maintainability scoring. Some architectures, such as gradient-boosted trees and logistic regression, produce stable and predictable changes when retrained on augmented data. The impact of additional training data can be estimated in advance, making incremental maintenance straightforward. Deep neural networks, by contrast, can exhibit large behavioural shifts from small data changes, making incremental maintenance more difficult and increasing the risk that a routine update triggers a substantial modification assessment. The assessment evaluates this sensitivity and documents the expected maintenance approach in the model selection rationale.
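One simple way to quantify behavioural shift between model versions is prediction churn: the fraction of held-out inputs whose predicted label flips after retraining. This metric and its use here are an illustrative sketch, not a prescribed threshold test.

```python
def prediction_churn(old_preds, new_preds) -> float:
    """Fraction of inputs whose predicted label flips between versions.
    High churn after a routine retrain is the kind of behavioural shift
    that may warrant a substantial modification assessment (illustrative)."""
    if len(old_preds) != len(new_preds):
        raise ValueError("prediction lists must align on the same inputs")
    flips = sum(1 for a, b in zip(old_preds, new_preds) if a != b)
    return flips / len(old_preds)
```

Running this over a fixed evaluation set before and after each retrain gives the rationale document a concrete stability figure to record.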
Determinism assesses whether the model produces the same output consistently for a given input.
Deterministic models simplify compliance because reproducibility supports conformity assessment: any given decision can be verified by re-running the input through the same model version.
Stochastic models require additional controls to achieve sufficient reproducibility for compliance purposes. These controls include temperature clamping to reduce output variability, seed fixing to enable reproducibility during testing, and output logging to maintain traceability when exact reproduction is not possible. The assessment must determine whether the candidate architecture is inherently deterministic or inherently stochastic. For stochastic architectures, the Technical SME specifies the controls needed and assesses the performance cost of imposing those controls, since constraints like temperature clamping may degrade the model's effectiveness at its intended task. Repeated inference testing validates whether the chosen controls achieve acceptable reproducibility levels.
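Seed fixing and repeated-inference validation can be sketched as follows, with a stand-in sampler in place of a real model. The function names, vocabulary, and run count are assumptions for illustration.

```python
import random

def seeded_inference(prompt: str, seed: int) -> str:
    # Stand-in for a stochastic model whose sampler is seeded (seed fixing).
    # A real system would also clamp temperature and log the seed used.
    rng = random.Random(seed)
    vocab = ["approve", "refer", "deny"]
    return rng.choice(vocab)

def reproducibility_check(prompt: str, seed: int, runs: int = 50) -> bool:
    """Repeated inference under a fixed seed: every run must agree
    before the determinism control is considered effective."""
    outputs = {seeded_inference(prompt, seed) for _ in range(runs)}
    return len(outputs) == 1
```

A check like this belongs in the test suite so that a later infrastructure change that silently breaks seeding is caught before it undermines reproducibility claims.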