The testing strategy for a high-risk AI system extends well beyond the model evaluation that most ML teams focus on. Model validation gates test the model's predictions, but unit, integration, and end-to-end tests cover everything else: data pipelines, feature engineering logic, post-processing rules, logging infrastructure, the human oversight interface, deployment mechanisms, and system behaviour under stress and failure conditions. Both traditional software units and AI-specific components require dedicated test coverage to demonstrate compliance.
Each data transformation step should have unit tests that validate the transformation produces expected output for known inputs. Tests must confirm that edge cases, including null values, empty strings, extreme values, and malformed records, are handled correctly. The transformation must preserve data types and schemas, and its effect on data distributions should remain within expected bounds. CI/CD Pipelines for High-Risk AI provides the broader pipeline context in which these tests execute.
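As a minimal sketch, a pytest-style test module for one transformation might look like the following. The `scale_to_unit` function and its expected values are illustrative, not taken from a real pipeline:

```python
# Illustrative unit tests for a single data transformation step.
# scale_to_unit() is a hypothetical min-max scaling transformation.

def scale_to_unit(values):
    """Scale a list of numbers into [0, 1]; empty input returns empty output."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if lo == hi:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def test_known_input_produces_expected_output():
    assert scale_to_unit([0, 5, 10]) == [0.0, 0.5, 1.0]

def test_edge_cases_are_handled():
    assert scale_to_unit([]) == []                       # empty input
    assert scale_to_unit([7]) == [0.0]                   # single value
    assert scale_to_unit([3, 3, 3]) == [0.0, 0.0, 0.0]   # constant column

def test_output_stays_within_expected_bounds():
    out = scale_to_unit([-1e9, 0, 1e9])                  # extreme values
    assert all(0.0 <= v <= 1.0 for v in out)
```

Each test documents one contract of the transformation, so a failure points directly at the property that broke.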
Property-based testing with Hypothesis is particularly valuable for data pipelines. The developer defines properties that should hold for any valid input, such as "the output of the normalisation step should have mean approximately zero and standard deviation approximately one for any input distribution". Hypothesis then generates hundreds of random inputs to test the property, catching edge cases that hand-written tests miss: empty datasets, single-row datasets, datasets with all null values, and datasets with extreme values.
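A sketch of this pattern, assuming Hypothesis is installed; the `clip_outliers` step and its bounds are illustrative:

```python
# Property-based test: for ANY list of finite floats, an outlier-clipping
# step must keep every value in bounds, preserve length, and be idempotent.
from hypothesis import given, strategies as st

def clip_outliers(values, lo=-3.0, hi=3.0):
    """Clamp each value into [lo, hi]; length and order are preserved."""
    return [min(max(v, lo), hi) for v in values]

@given(st.lists(st.floats(allow_nan=False, allow_infinity=False)))
def test_clipped_output_is_always_in_bounds(values):
    out = clip_outliers(values)
    assert len(out) == len(values)               # no rows dropped
    assert all(-3.0 <= v <= 3.0 for v in out)    # property holds for any input
    assert clip_outliers(out) == out             # applying twice changes nothing
```

Hypothesis will generate empty lists, single-element lists, subnormals, and extreme values automatically, which is exactly the edge-case coverage the text describes.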
Each feature computation should have unit tests verifying that the feature's output matches the specification in the feature registry. Tests must confirm that computation is deterministic, meaning the same input always produces the same output. Feature tests should also verify that missing input values are handled according to the documented imputation strategy and that each feature's output range falls within expected bounds.
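A minimal sketch of such tests for one feature; the registry entry assumed here (fallback imputation of 35.0, expected range [0, 120]) is hypothetical:

```python
# Illustrative unit tests for a single feature computation against an
# assumed registry specification.

def age_feature(raw_age, fallback=35.0):
    """Compute the 'age' feature: impute missing input with the documented
    fallback, then clamp to the registry's expected range [0, 120]."""
    value = fallback if raw_age is None else float(raw_age)
    return min(max(value, 0.0), 120.0)

def test_matches_registry_specification():
    assert age_feature(42) == 42.0

def test_computation_is_deterministic():
    assert all(age_feature(42) == age_feature(42) for _ in range(100))

def test_missing_values_use_documented_imputation():
    assert age_feature(None) == 35.0

def test_output_range_stays_within_expected_bounds():
    assert age_feature(-5) == 0.0
    assert age_feature(999) == 120.0
```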
Model inference unit tests confirm that the model loads correctly from the model registry and that inference produces outputs in the expected format and range. For deterministic architectures, inference must be deterministic for a given model version and input. Tests should also confirm that the model's latency falls within the documented Service Level Agreement and that error handling produces graceful degradation rather than silent failures.
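A self-contained sketch of these checks against a stand-in model object; `load_model`, the 150 ms SLA, and the score range are all assumptions, since the real registry client and model are deployment-specific:

```python
import time

# Inference-level unit tests against a stub; in a real suite, load_model()
# would fetch the pinned model version from the registry.

class StubModel:
    version = "1.4.2"
    def predict(self, features):
        if not features:
            raise ValueError("empty feature vector")  # explicit, not silent
        return sum(features) % 1.0  # deterministic placeholder score

def load_model(version):
    """Stand-in for fetching a model from the registry."""
    return StubModel()

def test_output_format_and_range():
    score = load_model("1.4.2").predict([0.2, 0.3])
    assert isinstance(score, float) and 0.0 <= score <= 1.0

def test_inference_is_deterministic_for_fixed_version_and_input():
    model = load_model("1.4.2")
    assert model.predict([0.2, 0.3]) == model.predict([0.2, 0.3])

def test_latency_within_documented_sla():
    model = load_model("1.4.2")
    start = time.perf_counter()
    model.predict([0.2, 0.3])
    assert time.perf_counter() - start < 0.150  # 150 ms SLA (assumed)

def test_errors_fail_loudly_rather_than_silently():
    try:
        load_model("1.4.2").predict([])
        assert False, "expected a ValueError"
    except ValueError:
        pass
```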
Threshold application, score calibration, business rule application, and output formatting should each have unit tests confirming correctness, edge case handling, and consistency with documented behaviour. If a rule rejects applicants below a threshold, the test should verify the threshold value, the behaviour at exactly the threshold as a boundary case, and the logging of the rejection reason. Where a fairness calibration adjusts thresholds per subgroup, the test should verify that adjusted thresholds produce the expected selection rate ratios on a reference dataset. Bias Detection and Mitigation covers the broader fairness validation framework.
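A sketch of the threshold boundary test described above; the threshold value of 620 and the reason-code format are invented for illustration:

```python
# Boundary tests for a hypothetical rejection rule: scores strictly below
# the threshold are rejected, and the reason is captured for logging.

THRESHOLD = 620

def apply_threshold(score):
    """Return (decision, logged_reason) for an applicant score."""
    if score < THRESHOLD:
        return "reject", f"score {score} below threshold {THRESHOLD}"
    return "accept", None

def test_threshold_value_and_boundary_behaviour():
    assert apply_threshold(619)[0] == "reject"
    assert apply_threshold(620)[0] == "accept"   # exactly at the threshold
    assert apply_threshold(621)[0] == "accept"

def test_rejection_reason_is_captured_for_logging():
    decision, reason = apply_threshold(500)
    assert decision == "reject"
    assert "threshold" in reason
```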
The explanation generation component should have tests verifying that explanations are produced for every inference. Feature attributions must sum correctly for additive explanation methods such as SHAP. Explanation fidelity metrics, which measure how well the explanation approximates the model's actual behaviour, must exceed defined thresholds. Tests should also confirm that explanations are formatted correctly for the target audience.
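The additivity check can be sketched as follows. The `explain` function here is a stand-in that returns hypothetical attributions; with a real SHAP explainer, the same assertion would compare the base value plus attributions against the model output:

```python
# Additivity check for SHAP-style explanations: for additive methods,
# base value + sum of feature attributions must reconstruct the prediction.

def explain(prediction, base_value=0.3):
    """Stand-in explainer returning attributions for two features."""
    residual = prediction - base_value
    return {"income": 0.6 * residual, "tenure": 0.4 * residual}

def test_explanation_exists_for_every_inference():
    assert explain(0.8)  # a non-empty attribution map is always produced

def test_attributions_sum_to_prediction():
    prediction, base = 0.8, 0.3
    total = base + sum(explain(prediction, base).values())
    assert abs(total - prediction) < 1e-9
```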
UI tests must confirm that the mandatory review workflow cannot be bypassed and that override functionality works correctly while capturing the required rationale. Confidence indicators must be displayed accurately, and automation bias countermeasures such as delayed recommendation display must function as designed. Human Oversight Interface Design details the interface requirements these tests validate.
Automated browser testing with Selenium, Playwright, or Cypress should verify that the interface displays the required information: case data, model recommendation, confidence score, and explanation. The approval and override workflows must function correctly, minimum dwell time enforcement must work, and operator actions must be correctly logged. These tests should run on every interface change and on a weekly schedule to catch regressions.
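Browser tests need a running interface, so as a self-contained sketch, the workflow invariants can be modelled as a small state machine; a Playwright or Cypress suite would assert the same invariants against the real UI. All names and the 10-second dwell time are assumptions:

```python
# Models the mandatory review workflow and asserts the invariants a
# browser-level suite would check: no bypass, rationale-gated overrides,
# dwell-time enforcement, and action logging.

class ReviewWorkflow:
    def __init__(self, min_dwell_seconds=10):
        self.min_dwell = min_dwell_seconds
        self.dwell = 0
        self.log = []

    def view_case(self, seconds):
        self.dwell += seconds

    def decide(self, action, rationale=None):
        if self.dwell < self.min_dwell:
            raise PermissionError("minimum dwell time not met")
        if action == "override" and not rationale:
            raise ValueError("override requires a rationale")
        self.log.append((action, rationale))  # operator action is logged
        return "recorded"

def test_review_cannot_be_bypassed():
    wf = ReviewWorkflow()
    try:
        wf.decide("approve")
        assert False, "deciding without reviewing should be blocked"
    except PermissionError:
        pass

def test_override_captures_rationale_and_is_logged():
    wf = ReviewWorkflow()
    wf.view_case(seconds=12)
    assert wf.decide("override", rationale="manual income check") == "recorded"
    assert wf.log == [("override", "manual income check")]
```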
A suite of test cases must exercise the complete inference path from data ingestion through feature engineering, model inference, post-processing, explanation generation, and output delivery. These tests use curated test datasets with known expected outcomes and validate end-to-end accuracy, latency, and output format. The test dataset should include cases that exercise each branch of the post-processing logic, each explanation type, and each output format.
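A toy end-to-end sketch of this shape; every stage function, the curated case, and the latency budget are illustrative stand-ins, not a real pipeline:

```python
import time

# One curated case is pushed through the full toy pipeline and checked
# against its known expected outcome, output format, and latency budget.

def ingest(record):
    return {k: v for k, v in record.items() if v is not None}

def engineer(data):
    return [data.get("income", 0) / 100_000, data.get("tenure", 0) / 10]

def infer(features):
    return round(min(sum(features), 1.0), 3)

def post_process(score):
    return {"decision": "accept" if score >= 0.5 else "reject", "score": score}

def add_explanation(result):
    return {**result, "explanation": f"score {result['score']} vs threshold 0.5"}

GOLDEN_CASE = {"input": {"income": 80_000, "tenure": 4, "notes": None},
               "expected_decision": "accept"}

def test_complete_inference_path():
    start = time.perf_counter()
    out = add_explanation(post_process(infer(engineer(ingest(GOLDEN_CASE["input"])))))
    assert out["decision"] == GOLDEN_CASE["expected_decision"]
    assert set(out) == {"decision", "score", "explanation"}  # output format
    assert time.perf_counter() - start < 1.0                 # latency budget
```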
Contract tests validate that each service's outputs conform to its consumers' expectations and run in the CI pipeline for every service change. A golden dataset of historical inputs with known correct outputs forms the regression test suite. Every candidate release is evaluated against this dataset to detect behavioural regression, and cases should be drawn from each protected characteristic subgroup to ensure regressions do not disproportionately affect vulnerable populations. The golden dataset should be version-controlled and expanded over time as new edge cases emerge through production operation.
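The per-subgroup regression gate can be sketched as follows; the golden cases, the stand-in model, and the 0.02 tolerance are all illustrative:

```python
# Golden-dataset regression gate: a candidate release must not regress
# overall or within any protected subgroup.

GOLDEN = [  # (features, expected_label, subgroup)
    ([0.9, 0.1], 1, "group_a"), ([0.2, 0.1], 0, "group_a"),
    ([0.8, 0.3], 1, "group_b"), ([0.1, 0.2], 0, "group_b"),
]

def candidate_model(features):
    """Stand-in for the candidate release under evaluation."""
    return 1 if sum(features) >= 0.5 else 0

def accuracy(cases):
    return sum(candidate_model(f) == y for f, y, _ in cases) / len(cases)

def test_no_regression_overall_or_per_subgroup(baseline=1.0, tolerance=0.02):
    assert accuracy(GOLDEN) >= baseline - tolerance
    for group in {g for _, _, g in GOLDEN}:
        subgroup_cases = [c for c in GOLDEN if c[2] == group]
        assert accuracy(subgroup_cases) >= baseline - tolerance
```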
Load tests using tools such as Locust or k6 confirm that the system meets its declared latency and throughput thresholds under realistic production load. Chaos tests using tools such as Gremlin, Litmus, or Chaos Monkey inject failures, including pod crashes, network partitions, and dependency outages. These tests verify that the system fails gracefully with no data loss, no silent accuracy degradation, proper error handling, and correct logging of failure events. Both categories should be run before every major release and periodically in production, with chaos testing conducted in a controlled, off-peak window.
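Locust and k6 drive real HTTP traffic against a deployed service; as a self-contained sketch, the shape of the check they automate looks like this, with an in-process stand-in for the service and an assumed 250 ms p95 threshold:

```python
import concurrent.futures
import time

# Fire concurrent requests at a stand-in service and assert the p95
# latency against the declared threshold, as a load-test run would.

def serve(request_id):
    """Stand-in for one inference request; returns its latency in seconds."""
    start = time.perf_counter()
    _ = sum(i * i for i in range(1000))  # simulated inference work
    return time.perf_counter() - start

def test_latency_under_concurrent_load(requests=200, workers=8):
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = sorted(pool.map(serve, range(requests)))
    p95 = latencies[int(0.95 * len(latencies))]
    assert p95 < 0.250  # declared p95 threshold: 250 ms (assumed)
```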
Simulated failures at each layer, whether data source unavailability, model serving timeout, or post-processing misconfiguration, must verify that the system degrades gracefully, that failsafe mechanisms activate, and that the human oversight layer receives appropriate alerts.
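One such layer-level drill can be sketched as follows: a model-serving timeout is injected, and the test asserts the failsafe fallback, the oversight alert, and the failure log entry. All names and the fallback policy are assumptions:

```python
# Failure-injection drill: when model serving times out, the decision must
# fall back to manual review, alert the oversight layer, and log the event,
# rather than fail silently or return a default score.

class ServingTimeout(Exception):
    pass

def model_serve(features, fail=False):
    if fail:
        raise ServingTimeout("model serving timed out")
    return 0.7

def decide(features, inject_failure=False, alerts=None, log=None):
    alerts = alerts if alerts is not None else []
    log = log if log is not None else []
    try:
        score = model_serve(features, fail=inject_failure)
        return {"decision": "accept" if score >= 0.5 else "reject"}
    except ServingTimeout as exc:
        log.append(str(exc))                       # failure event is logged
        alerts.append("route to human review")     # oversight layer alerted
        return {"decision": "manual_review"}       # failsafe, not silent

def test_graceful_degradation_on_serving_timeout():
    alerts, log = [], []
    out = decide([0.2], inject_failure=True, alerts=alerts, log=log)
    assert out["decision"] == "manual_review"
    assert alerts == ["route to human review"]
    assert log and "timed out" in log[0]
```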
The Technical SME classifies test failures by severity. Critical failures, meaning any test that exercises a compliance-relevant property such as fairness or human oversight bypass, block the pipeline unconditionally. High-severity failures, including end-to-end accuracy regression and latency threshold breaches, block the pipeline unless the AI Governance Lead approves an exception with documented justification. Medium-severity failures, such as non-critical UI tests and documentation formatting, generate warnings and are tracked in the non-conformity register.
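This severity gate reduces to a small decision function; the labels and the exception-approval flag below are illustrative:

```python
# Pipeline gate implementing the severity policy: critical failures block
# unconditionally, high-severity failures block unless an exception is
# approved, medium-severity failures only warn.

def gate(failures, exception_approved=False):
    """Return 'block', 'warn', or 'pass' for a set of test failures."""
    severities = {f["severity"] for f in failures}
    if "critical" in severities:
        return "block"  # no exception path for compliance-relevant tests
    if "high" in severities:
        return "pass" if exception_approved else "block"
    return "warn" if "medium" in severities else "pass"

# Usage
assert gate([{"severity": "critical"}], exception_approved=True) == "block"
assert gate([{"severity": "high"}]) == "block"
assert gate([{"severity": "high"}], exception_approved=True) == "pass"
assert gate([{"severity": "medium"}]) == "warn"
assert gate([]) == "pass"
```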
The exception approval process, including the approver's identity, justification, and conditions under which the exception expires, is logged by the Technical SME as compliance evidence. The test results from all categories are retained as Module 5 evidence. The test suite itself, including test code, test data, and test configuration, should be version-controlled and referenced in the AISDP's test strategy documentation.
Automated test execution requires a test framework; there is no manual alternative to automated unit testing. The minimum tooling is pytest, which is free. Great Expectations and Hypothesis are freely available open-source tools for more specialised testing of data quality and property-based testing respectively.
For integration testing, pytest with fixtures provides the minimum viable framework. Locust and k6 are open-source options for load testing. Gremlin offers a free tier for basic chaos testing, and Litmus is fully open-source. Manual integration testing, where test scenarios are executed by hand with results visually checked, is possible for small systems but does not scale and cannot provide the repeatable, evidenced test results that compliance requires.