The testing strategy for a high-risk AI system extends well beyond the model evaluation that most ML teams focus on. Model validation gates test the model's predictions, but unit, integration, and end-to-end tests cover everything else: data pipelines, feature engineering logic, post-processing rules, logging infrastructure, the human oversight interface, deployment mechanisms, and system behaviour under stress and failure conditions. Both traditional software units and AI-specific components require dedicated test coverage to demonstrate compliance.
Each data transformation step should have unit tests that validate the transformation produces expected output for known inputs. Tests must confirm that edge cases, including null values, empty strings, extreme values, and malformed records, are handled correctly. The transformation must preserve data types and schemas, and its effect on data distributions should remain within expected bounds. CI/CD Pipelines for High-Risk AI provides the broader pipeline context in which these tests execute.
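As a minimal sketch, a pytest-style test module for one transformation might look like the following. The `scale_to_unit` function and its expected values are illustrative, not taken from a real pipeline:

```python
# Illustrative unit tests for a single data transformation step.
# scale_to_unit() is a hypothetical min-max scaling transformation.

def scale_to_unit(values):
    """Scale a list of numbers into [0, 1]; empty input returns empty output."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if lo == hi:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def test_known_input_produces_expected_output():
    assert scale_to_unit([0, 5, 10]) == [0.0, 0.5, 1.0]

def test_edge_cases_are_handled():
    assert scale_to_unit([]) == []                       # empty input
    assert scale_to_unit([7]) == [0.0]                   # single value
    assert scale_to_unit([3, 3, 3]) == [0.0, 0.0, 0.0]   # constant column

def test_output_stays_within_expected_bounds():
    out = scale_to_unit([-1e9, 0, 1e9])                  # extreme values
    assert all(0.0 <= v <= 1.0 for v in out)
```

Each test documents one contract of the transformation, so a failure points directly at the property that broke.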
Property-based testing with Hypothesis is particularly valuable for data pipelines. The developer defines properties that should hold for any valid input, such as "the output of the normalisation step should have mean approximately zero and standard deviation approximately one for any input distribution". Hypothesis then generates hundreds of random inputs to test the property, catching edge cases that hand-written tests miss: empty datasets, single-row datasets, datasets with all null values, and datasets with extreme values.
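A sketch of this pattern, assuming Hypothesis is installed; the `clip_outliers` step and its bounds are illustrative:

```python
# Property-based test: for ANY list of finite floats, an outlier-clipping
# step must keep every value in bounds, preserve length, and be idempotent.
from hypothesis import given, strategies as st

def clip_outliers(values, lo=-3.0, hi=3.0):
    """Clamp each value into [lo, hi]; length and order are preserved."""
    return [min(max(v, lo), hi) for v in values]

@given(st.lists(st.floats(allow_nan=False, allow_infinity=False)))
def test_clipped_output_is_always_in_bounds(values):
    out = clip_outliers(values)
    assert len(out) == len(values)               # no rows dropped
    assert all(-3.0 <= v <= 3.0 for v in out)    # property holds for any input
    assert clip_outliers(out) == out             # applying twice changes nothing
```

Hypothesis will generate empty lists, single-element lists, subnormals, and extreme values automatically, which is exactly the edge-case coverage the text describes.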
Each feature computation should have unit tests verifying that the feature's output matches the specification in the feature registry. Tests must confirm that computation is deterministic, meaning the same input always produces the same output. Feature tests should also verify that missing input values are handled according to the documented imputation strategy and that each feature's output range falls within expected bounds.
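A minimal sketch of such tests for one feature; the registry entry assumed here (fallback imputation of 35.0, expected range [0, 120]) is hypothetical:

```python
# Illustrative unit tests for a single feature computation against an
# assumed registry specification.

def age_feature(raw_age, fallback=35.0):
    """Compute the 'age' feature: impute missing input with the documented
    fallback, then clamp to the registry's expected range [0, 120]."""
    value = fallback if raw_age is None else float(raw_age)
    return min(max(value, 0.0), 120.0)

def test_matches_registry_specification():
    assert age_feature(42) == 42.0

def test_computation_is_deterministic():
    assert all(age_feature(42) == age_feature(42) for _ in range(100))

def test_missing_values_use_documented_imputation():
    assert age_feature(None) == 35.0

def test_output_range_stays_within_expected_bounds():
    assert age_feature(-5) == 0.0
    assert age_feature(999) == 120.0
```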
Model inference unit tests confirm that the model loads correctly from the model registry and that inference produces outputs in the expected format and range. For deterministic architectures, inference must be deterministic for a given model version and input. Tests should also confirm that the model's latency falls within the documented Service Level Agreement and that error handling produces graceful degradation rather than silent failures.
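A self-contained sketch of these checks against a stand-in model object; `load_model`, the 150 ms SLA, and the score range are all assumptions, since the real registry client and model are deployment-specific:

```python
import time

# Inference-level unit tests against a stub; in a real suite, load_model()
# would fetch the pinned model version from the registry.

class StubModel:
    version = "1.4.2"
    def predict(self, features):
        if not features:
            raise ValueError("empty feature vector")  # explicit, not silent
        return sum(features) % 1.0  # deterministic placeholder score

def load_model(version):
    """Stand-in for fetching a model from the registry."""
    return StubModel()

def test_output_format_and_range():
    score = load_model("1.4.2").predict([0.2, 0.3])
    assert isinstance(score, float) and 0.0 <= score <= 1.0

def test_inference_is_deterministic_for_fixed_version_and_input():
    model = load_model("1.4.2")
    assert model.predict([0.2, 0.3]) == model.predict([0.2, 0.3])

def test_latency_within_documented_sla():
    model = load_model("1.4.2")
    start = time.perf_counter()
    model.predict([0.2, 0.3])
    assert time.perf_counter() - start < 0.150  # 150 ms SLA (assumed)

def test_errors_fail_loudly_rather_than_silently():
    try:
        load_model("1.4.2").predict([])
        assert False, "expected a ValueError"
    except ValueError:
        pass
```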
Threshold application, score calibration, business rule application, and output formatting should each have unit tests confirming correctness, edge case handling, and consistency with documented behaviour. If a rule rejects applicants below a threshold, the test should verify the threshold value, the behaviour at exactly the threshold as a boundary case, and the logging of the rejection reason. Where a fairness calibration adjusts thresholds per subgroup, the test should verify that adjusted thresholds produce the expected selection rate ratios on a reference dataset. Bias Detection and Mitigation covers the broader fairness validation framework.
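A sketch of the threshold boundary test described above; the threshold value of 620 and the reason-code format are invented for illustration:

```python
# Boundary tests for a hypothetical rejection rule: scores strictly below
# the threshold are rejected, and the reason is captured for logging.

THRESHOLD = 620

def apply_threshold(score):
    """Return (decision, logged_reason) for an applicant score."""
    if score < THRESHOLD:
        return "reject", f"score {score} below threshold {THRESHOLD}"
    return "accept", None

def test_threshold_value_and_boundary_behaviour():
    assert apply_threshold(619)[0] == "reject"
    assert apply_threshold(620)[0] == "accept"   # exactly at the threshold
    assert apply_threshold(621)[0] == "accept"

def test_rejection_reason_is_captured_for_logging():
    decision, reason = apply_threshold(500)
    assert decision == "reject"
    assert "threshold" in reason
```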
The explanation generation component should have tests verifying that explanations are produced for every inference. Feature attributions must sum correctly for additive explanation methods such as SHAP. Explanation fidelity metrics, which measure how well the explanation approximates the model's actual behaviour, must exceed defined thresholds. Tests should also confirm that explanations are formatted correctly for the target audience.
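The additivity check can be sketched as follows. The `explain` function here is a stand-in that returns hypothetical attributions; with a real SHAP explainer, the same assertion would compare the base value plus attributions against the model output:

```python
# Additivity check for SHAP-style explanations: for additive methods,
# base value + sum of feature attributions must reconstruct the prediction.

def explain(prediction, base_value=0.3):
    """Stand-in explainer returning attributions for two features."""
    residual = prediction - base_value
    return {"income": 0.6 * residual, "tenure": 0.4 * residual}

def test_explanation_exists_for_every_inference():
    assert explain(0.8)  # a non-empty attribution map is always produced

def test_attributions_sum_to_prediction():
    prediction, base = 0.8, 0.3
    total = base + sum(explain(prediction, base).values())
    assert abs(total - prediction) < 1e-9
```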
UI tests must confirm that the mandatory review workflow cannot be bypassed and that override functionality works correctly while capturing the required rationale. Confidence indicators must be displayed accurately, and automation bias countermeasures such as delayed recommendation display must function as designed. Human Oversight Interface Design details the interface requirements these tests validate.
Automated browser testing with Selenium, Playwright, or Cypress should verify that the interface displays the required information: case data, model recommendation, confidence score, and explanation. The approval and override workflows must function correctly, minimum dwell time enforcement must work, and operator actions must be correctly logged. These tests should run on every interface change and on a weekly schedule to catch regressions.
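Browser tests need a running interface, so as a self-contained sketch, the workflow invariants can be modelled as a small state machine; a Playwright or Cypress suite would assert the same invariants against the real UI. All names and the 10-second dwell time are assumptions:

```python
# Models the mandatory review workflow and asserts the invariants a
# browser-level suite would check: no bypass, rationale-gated overrides,
# dwell-time enforcement, and action logging.

class ReviewWorkflow:
    def __init__(self, min_dwell_seconds=10):
        self.min_dwell = min_dwell_seconds
        self.dwell = 0
        self.log = []

    def view_case(self, seconds):
        self.dwell += seconds

    def decide(self, action, rationale=None):
        if self.dwell < self.min_dwell:
            raise PermissionError("minimum dwell time not met")
        if action == "override" and not rationale:
            raise ValueError("override requires a rationale")
        self.log.append((action, rationale))  # operator action is logged
        return "recorded"

def test_review_cannot_be_bypassed():
    wf = ReviewWorkflow()
    try:
        wf.decide("approve")
        assert False, "deciding without reviewing should be blocked"
    except PermissionError:
        pass

def test_override_captures_rationale_and_is_logged():
    wf = ReviewWorkflow()
    wf.view_case(seconds=12)
    assert wf.decide("override", rationale="manual income check") == "recorded"
    assert wf.log == [("override", "manual income check")]
```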
A suite of test cases must exercise the complete inference path from data ingestion through feature engineering, model inference, post-processing, explanation generation, and output delivery. These tests use curated test datasets with known expected outcomes and validate end-to-end accuracy, latency, and output format. The test dataset should include cases that exercise each branch of the post-processing logic, each explanation type, and each output format.
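A toy end-to-end sketch of this shape; every stage function, the curated case, and the latency budget are illustrative stand-ins, not a real pipeline:

```python
import time

# One curated case is pushed through the full toy pipeline and checked
# against its known expected outcome, output format, and latency budget.

def ingest(record):
    return {k: v for k, v in record.items() if v is not None}

def engineer(data):
    return [data.get("income", 0) / 100_000, data.get("tenure", 0) / 10]

def infer(features):
    return round(min(sum(features), 1.0), 3)

def post_process(score):
    return {"decision": "accept" if score >= 0.5 else "reject", "score": score}

def add_explanation(result):
    return {**result, "explanation": f"score {result['score']} vs threshold 0.5"}

GOLDEN_CASE = {"input": {"income": 80_000, "tenure": 4, "notes": None},
               "expected_decision": "accept"}

def test_complete_inference_path():
    start = time.perf_counter()
    out = add_explanation(post_process(infer(engineer(ingest(GOLDEN_CASE["input"])))))
    assert out["decision"] == GOLDEN_CASE["expected_decision"]
    assert set(out) == {"decision", "score", "explanation"}  # output format
    assert time.perf_counter() - start < 1.0                 # latency budget
```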
Contract tests validate that each service's outputs conform to its consumers' expectations and run in the CI pipeline for every service change. A golden dataset of historical inputs with known correct outputs forms the regression test suite. Every candidate release is evaluated against this dataset to detect behavioural regression, and cases should be drawn from each protected characteristic subgroup to ensure regressions do not disproportionately affect vulnerable populations. The golden dataset should be version-controlled and expanded over time as new edge cases emerge through production operation.
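The per-subgroup regression gate can be sketched as follows; the golden cases, the stand-in model, and the 0.02 tolerance are all illustrative:

```python
# Golden-dataset regression gate: a candidate release must not regress
# overall or within any protected subgroup.

GOLDEN = [  # (features, expected_label, subgroup)
    ([0.9, 0.1], 1, "group_a"), ([0.2, 0.1], 0, "group_a"),
    ([0.8, 0.3], 1, "group_b"), ([0.1, 0.2], 0, "group_b"),
]

def candidate_model(features):
    """Stand-in for the candidate release under evaluation."""
    return 1 if sum(features) >= 0.5 else 0

def accuracy(cases):
    return sum(candidate_model(f) == y for f, y, _ in cases) / len(cases)

def test_no_regression_overall_or_per_subgroup(baseline=1.0, tolerance=0.02):
    assert accuracy(GOLDEN) >= baseline - tolerance
    for group in {g for _, _, g in GOLDEN}:
        subgroup_cases = [c for c in GOLDEN if c[2] == group]
        assert accuracy(subgroup_cases) >= baseline - tolerance
```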
Load tests using tools such as Locust or k6 confirm that the system meets its declared latency and throughput thresholds under realistic production load. Chaos tests using tools such as Gremlin, Litmus, or Chaos Monkey inject failures, including pod crashes, network partitions, and dependency outages. These tests verify that the system fails gracefully with no data loss, no silent accuracy degradation, proper error handling, and correct logging of failure events. Both categories should be run before every major release and periodically in production, with chaos testing conducted in a controlled, off-peak window.
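Locust and k6 drive real HTTP traffic against a deployed service; as a self-contained sketch, the shape of the check they automate looks like this, with an in-process stand-in for the service and an assumed 250 ms p95 threshold:

```python
import concurrent.futures
import time

# Fire concurrent requests at a stand-in service and assert the p95
# latency against the declared threshold, as a load-test run would.

def serve(request_id):
    """Stand-in for one inference request; returns its latency in seconds."""
    start = time.perf_counter()
    _ = sum(i * i for i in range(1000))  # simulated inference work
    return time.perf_counter() - start

def test_latency_under_concurrent_load(requests=200, workers=8):
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = sorted(pool.map(serve, range(requests)))
    p95 = latencies[int(0.95 * len(latencies))]
    assert p95 < 0.250  # declared p95 threshold: 250 ms (assumed)
```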
Simulated failures at each layer, whether data source unavailability, model serving timeout, or post-processing misconfiguration, must verify that the system degrades gracefully, that failsafe mechanisms activate, and that the human oversight layer receives appropriate alerts.
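One such layer-level drill can be sketched as follows: a model-serving timeout is injected, and the test asserts the failsafe fallback, the oversight alert, and the failure log entry. All names and the fallback policy are assumptions:

```python
# Failure-injection drill: when model serving times out, the decision must
# fall back to manual review, alert the oversight layer, and log the event,
# rather than fail silently or return a default score.

class ServingTimeout(Exception):
    pass

def model_serve(features, fail=False):
    if fail:
        raise ServingTimeout("model serving timed out")
    return 0.7

def decide(features, inject_failure=False, alerts=None, log=None):
    alerts = alerts if alerts is not None else []
    log = log if log is not None else []
    try:
        score = model_serve(features, fail=inject_failure)
        return {"decision": "accept" if score >= 0.5 else "reject"}
    except ServingTimeout as exc:
        log.append(str(exc))                       # failure event is logged
        alerts.append("route to human review")     # oversight layer alerted
        return {"decision": "manual_review"}       # failsafe, not silent

def test_graceful_degradation_on_serving_timeout():
    alerts, log = [], []
    out = decide([0.2], inject_failure=True, alerts=alerts, log=log)
    assert out["decision"] == "manual_review"
    assert alerts == ["route to human review"]
    assert log and "timed out" in log[0]
```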
The Technical SME classifies test failures by severity. Critical failures, meaning any test that exercises a compliance-relevant property such as fairness or human oversight bypass, block the pipeline unconditionally. High-severity failures, including end-to-end accuracy regression and latency threshold breaches, block the pipeline unless the AI Governance Lead approves an exception with documented justification. Medium-severity failures, such as non-critical UI tests and documentation formatting, generate warnings and are tracked in the non-conformity register.
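This severity gate reduces to a small decision function; the labels and the exception-approval flag below are illustrative:

```python
# Pipeline gate implementing the severity policy: critical failures block
# unconditionally, high-severity failures block unless an exception is
# approved, medium-severity failures only warn.

def gate(failures, exception_approved=False):
    """Return 'block', 'warn', or 'pass' for a set of test failures."""
    severities = {f["severity"] for f in failures}
    if "critical" in severities:
        return "block"  # no exception path for compliance-relevant tests
    if "high" in severities:
        return "pass" if exception_approved else "block"
    return "warn" if "medium" in severities else "pass"

# Usage
assert gate([{"severity": "critical"}], exception_approved=True) == "block"
assert gate([{"severity": "high"}]) == "block"
assert gate([{"severity": "high"}], exception_approved=True) == "pass"
assert gate([{"severity": "medium"}]) == "warn"
assert gate([]) == "pass"
```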
The exception approval process, including the approver's identity, justification, and conditions under which the exception expires, is logged by the Technical SME as compliance evidence. The test results from all categories are retained as Module 5 evidence. The test suite itself, including test code, test data, and test configuration, should be version-controlled and referenced in the AISDP's test strategy documentation.
Automated test execution requires a test framework; there is no manual alternative to automated unit testing. The minimum tooling is pytest, which is free. Great Expectations and Hypothesis are freely available open-source tools for more specialised testing of data quality and property-based testing respectively.
For integration testing, pytest with fixtures provides the minimum viable framework. Locust and k6 are open-source options for load testing. Gremlin offers a free tier for basic chaos testing, and Litmus is fully open-source. Manual integration testing, where test scenarios are executed by hand with results visually checked, is possible for small systems but does not scale and cannot provide the repeatable, evidenced test results that compliance requires.