Four sequential model validation gates block deployment when declared compliance thresholds are breached. Performance, fairness, robustness, and drift gates are layered so that each subsequent gate only evaluates models that passed the previous one. These gates convert AISDP thresholds from aspirational commitments into hard constraints that no model can bypass.
The CI pipeline must include four automated model validation gates that block deployment if defined thresholds are breached. The gate architecture is layered and sequential. Performance runs first because a model failing basic performance is not worth evaluating for fairness. Fairness runs second because a model passing performance but failing fairness is rejected regardless of robustness. Robustness runs third, testing stability under perturbation. Drift comparison runs last, comparing the candidate against production and baseline models. If any gate fails, execution halts, the failure is logged, and no subsequent gates run.
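The sequential, halt-on-failure behaviour described above can be sketched in Python. The gate names and the `GateResult` structure are illustrative, assumed for this sketch rather than mandated by the AISDP:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GateResult:
    gate: str
    passed: bool
    details: dict

def run_gates(gates: list[tuple[str, Callable[[], GateResult]]]) -> list[GateResult]:
    """Run gates in declared order; halt on the first failure so that
    no subsequent gate evaluates a model that has already been rejected."""
    results: list[GateResult] = []
    for name, gate_fn in gates:
        result = gate_fn()
        results.append(result)
        if not result.passed:
            # Log the failure and stop: later gates never run.
            print(f"GATE FAILED: {name} -- halting pipeline")
            break
    return results
```

In use, the pipeline would register the four gates in order (performance, fairness, robustness, drift) and fail the build whenever the last result in the list is a failure.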
Each gate produces a structured report stored as a pipeline artefact. The report records the gate name, the metrics computed, the threshold for each metric, the actual value, the pass/fail determination, the evaluation dataset version, and the timestamp. This report is the primary Module 5 evidence for the candidate model. Over time, accumulating gate reports across all pipeline runs provides a longitudinal record of the model's quality trajectory.
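A minimal sketch of such a report artefact in Python; the field names follow the description above but are assumed for illustration, not a mandated schema:

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone

@dataclass
class MetricCheck:
    metric: str
    threshold: float
    actual: float
    passed: bool

@dataclass
class GateReport:
    gate: str
    checks: list                 # list[MetricCheck]
    dataset_version: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    @property
    def passed(self) -> bool:
        # The gate passes only if every metric check passes.
        return all(c.passed for c in self.checks)

    def to_json(self) -> str:
        """Serialise as the structured artefact stored by the pipeline."""
        payload = asdict(self)
        payload["passed"] = self.passed
        return json.dumps(payload, indent=2)
```

The serialised report is what gets attached to the pipeline run as evidence; accumulating one per run gives the longitudinal record described above.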
The performance gate computes the declared metrics including AUC-ROC, F1 score, precision, recall, Brier score, and calibration error on the holdout test set and compares against the thresholds declared in the AISDP. Any metric falling below its declared threshold fails the gate.
Two subtleties warrant attention. First, the holdout set must be truly held out and must not have been used during training, hyperparameter tuning, or feature selection. If the holdout set has leaked into the training process, the gate is testing on training data and the results are unreliable. Second, the Technical SME computes metrics with confidence intervals using bootstrap or cross-validation, and the threshold comparison uses the lower bound of the confidence interval rather than the point estimate. A model achieving 0.86 AUC-ROC with a 95 per cent confidence interval of 0.82 to 0.90 is compared against the threshold using 0.82, the worst-case plausible performance.
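The lower-bound comparison can be sketched with a plain bootstrap; the pairwise AUC implementation and the default resample count are illustrative choices, not prescribed by the document:

```python
import numpy as np

def auc(y_true, y_score):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked
    correctly, with ties counting half. O(n^2), fine for a sketch."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

def ci_lower_bound(y_true, y_score, metric=auc, n_boot=2000,
                   alpha=0.05, seed=0):
    """Bootstrap lower bound of the (1 - alpha) confidence interval.
    The gate compares this, not the point estimate, to the threshold."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        yt = y_true[idx]
        if yt.min() == yt.max():   # degenerate resample: one class only
            continue
        stats.append(metric(yt, y_score[idx]))
    return float(np.quantile(stats, alpha / 2))
```

The gate check then becomes `ci_lower_bound(y, scores) >= threshold`, matching the worst-case-plausible comparison in the example above.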
The fairness gate computes the agreed fairness metrics including selection rate ratios, equalised odds, predictive parity, and calibration within groups across all measured protected characteristic subgroups. Any subgroup metric breaching its threshold fails the gate. This gate is non-negotiable: it cannot be overridden without AI Governance Lead approval, and that approval is logged as a formal risk acceptance.
The most common failure mode is that the model passes fairness for all subgroups except one small subgroup where the metric is unreliable due to small sample size. Gate design must handle this either by computing small-subgroup metrics with confidence intervals and comparing using the lower bound, or by flagging subgroups below the minimum sample size as insufficient data requiring manual review rather than automatic pass or fail.
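A sketch of the second option, flagging small subgroups for manual review instead of auto-scoring them. The selection-rate-ratio check, the 0.80 ratio threshold, and the minimum sample size of 50 are illustrative assumptions here, not values the document prescribes:

```python
import numpy as np

def selection_rate_gate(preds, groups, ratio_threshold=0.80, min_n=50):
    """Per-subgroup selection-rate-ratio check. Subgroups below min_n
    are flagged 'insufficient_data' and routed to manual review
    rather than receiving an automatic pass or fail."""
    preds = np.asarray(preds)
    groups = np.asarray(groups)
    rates, verdicts = {}, {}
    for g in np.unique(groups):
        mask = groups == g
        if mask.sum() < min_n:
            verdicts[g] = "insufficient_data"
            continue
        rates[g] = preds[mask].mean()
    if rates:
        ref = max(rates.values())  # highest selection rate as reference
        for g, r in rates.items():
            ratio = r / ref if ref > 0 else 1.0
            verdicts[g] = "pass" if ratio >= ratio_threshold else "fail"
    return verdicts
```

The first option, confidence intervals on small-subgroup metrics, would reuse the bootstrap lower-bound approach described for the performance gate.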
The robustness gate tests the model's stability under input perturbation. IBM's Adversarial Robustness Toolbox provides adversarial attack methods and defence evaluations: for neural networks it supports FGSM, PGD, and DeepFool, and for tabular models it supports feature perturbation. Perturbations are calibrated to realistic input noise levels, and the gate verifies that accuracy does not degrade beyond a defined tolerance. For tabular models, perturbation at realistic noise levels, such as plus or minus 5 per cent on continuous features and random category flips at a 1 per cent rate, is a practical starting point.
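The tabular starting point can be sketched directly, without the toolbox. This is a hand-rolled illustration of the same idea, assuming features are held in a numeric array with categoricals already encoded; the tolerance value and helper names are assumptions of the sketch:

```python
import numpy as np

def perturb_tabular(X, cont_cols, cat_cols, cat_values,
                    noise=0.05, flip_rate=0.01, seed=0):
    """Apply +/- noise multiplicative jitter to continuous columns and
    random category flips at flip_rate to (numerically encoded)
    categorical columns."""
    rng = np.random.default_rng(seed)
    Xp = X.copy()
    for c in cont_cols:
        Xp[:, c] = Xp[:, c] * rng.uniform(1 - noise, 1 + noise, size=len(X))
    for c in cat_cols:
        flip = rng.random(len(X)) < flip_rate
        Xp[flip, c] = rng.choice(cat_values[c], size=flip.sum())
    return Xp

def robustness_gate(model, X, y, cont_cols, cat_cols, cat_values,
                    tolerance=0.02):
    """Pass if accuracy under perturbation degrades by at most tolerance."""
    base = (model.predict(X) == y).mean()
    Xp = perturb_tabular(X, cont_cols, cat_cols, cat_values)
    pert = (model.predict(Xp) == y).mean()
    return {"baseline_acc": base, "perturbed_acc": pert,
            "passed": (base - pert) <= tolerance}
```

In a production pipeline the same check would typically delegate the perturbation and attack methods to the Adversarial Robustness Toolbox rather than this hand-rolled version.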
The drift gate compares the candidate model's output distribution against the current production model's distribution and the baseline model's distribution. PSI, Jensen-Shannon divergence, and the KS test are the standard metrics. Evidently AI computes these automatically and can be called from the CI pipeline. A significant drift signal means the candidate behaves materially differently from the production or baseline model, warranting investigation before deployment even if absolute performance and fairness metrics are within threshold. All thresholds are defined in a version-controlled configuration file kept in the repository, not hardcoded in pipeline definitions, ensuring threshold changes are tracked and reviewed through the standard change management process.
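PSI is simple enough to show by hand. The binning strategy below (deciles of the reference distribution, widened to cover both samples) is one common convention, assumed for this sketch; in practice a tool such as Evidently would compute these metrics inside the pipeline:

```python
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference score
    distribution (expected) and a comparison one (actual)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Bin edges from the reference distribution's quantiles,
    # widened so every point in 'actual' falls inside a bin.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0] = min(edges[0], actual.min()) - eps
    edges[-1] = max(edges[-1], actual.max()) + eps
    e_frac = np.histogram(expected, edges)[0] / expected.size
    a_frac = np.histogram(actual, edges)[0] / actual.size
    # Clip to avoid log(0) on empty bins.
    e_frac = np.clip(e_frac, eps, None)
    a_frac = np.clip(a_frac, eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))
```

The gate would compare `psi(production_scores, candidate_scores)` against the threshold read from the version-controlled configuration file.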
The fairness gate can be overridden only with AI Governance Lead approval, which is logged as a formal risk acceptance. The override must include documented justification and the conditions under which the exception expires.
A PSI above 0.25 against the assessed dataset is a presumptive indicator of substantial modification. A PSI threshold of 0.20 is typical for the routine drift gate, with values above that level triggering investigation before deployment.
Thresholds are declared in the AISDP and maintained in a single version-controlled configuration file. They are set by the AI Governance Lead in consultation with the Technical SME and Legal Advisor, based on the system's risk profile and use case.
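A version-controlled threshold file might look like the fragment below. The file name, key names, and values are illustrative assumptions for this sketch; the actual thresholds are whatever the AISDP declares:

```yaml
# gate_thresholds.yml -- changed only via standard change management review
performance:
  auc_roc: 0.85        # compared against the CI lower bound, not the point estimate
  f1: 0.70
fairness:
  selection_rate_ratio: 0.80
  min_subgroup_n: 50   # below this, flag insufficient data for manual review
robustness:
  max_accuracy_drop: 0.02
drift:
  psi: 0.20            # above this, investigate before deployment
```

Keeping this file in the repository means every threshold change appears in version history and goes through the same review as code.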
The drift gate measures output distribution differences between the candidate, production, and baseline models using PSI, Jensen-Shannon divergence, and KS tests. Significant drift warrants investigation even if absolute metrics pass.
Each gate report is structured JSON recording the gate name, metrics, thresholds, actual values, pass/fail determination, dataset version, and timestamp. Reports are stored with ten-year retention as primary compliance evidence.