Four sequential model validation gates block deployment when declared compliance thresholds are breached. Performance, fairness, robustness, and drift gates are layered so that each subsequent gate only evaluates models that passed the previous one. These gates convert AISDP thresholds from aspirational commitments into hard constraints that no model can bypass.
The CI pipeline must include four automated model validation gates that block deployment if defined thresholds are breached. The gate architecture is layered and sequential. Performance runs first because a model failing basic performance is not worth evaluating for fairness. Fairness runs second because a model passing performance but failing fairness is rejected regardless of robustness. Robustness runs third, testing stability under perturbation. Drift comparison runs last, comparing the candidate against production and baseline models. If any gate fails, execution halts, the failure is logged, and no subsequent gates run.
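The sequential, halt-on-failure behaviour described above can be sketched in Python. The gate names and the `GateResult` structure are illustrative, assumed for this sketch rather than mandated by the AISDP:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GateResult:
    gate: str
    passed: bool
    details: dict

def run_gates(gates: list[tuple[str, Callable[[], GateResult]]]) -> list[GateResult]:
    """Run gates in declared order; halt on the first failure so that
    no subsequent gate evaluates a model that has already been rejected."""
    results: list[GateResult] = []
    for name, gate_fn in gates:
        result = gate_fn()
        results.append(result)
        if not result.passed:
            # Log the failure and stop: later gates never run.
            print(f"GATE FAILED: {name} -- halting pipeline")
            break
    return results
```

In use, the pipeline would register the four gates in order (performance, fairness, robustness, drift) and fail the build whenever the last result in the list is a failure.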
Each gate produces a structured report stored as a pipeline artefact. The report records the gate name, the metrics computed, the threshold for each metric, the actual value, the pass/fail determination, the evaluation dataset version, and the timestamp. This report is the primary Module 5 evidence for the candidate model. Over time, accumulating gate reports across all pipeline runs provides a longitudinal record of the model's quality trajectory.
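A minimal sketch of such a report artefact in Python; the field names follow the description above but are assumed for illustration, not a mandated schema:

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone

@dataclass
class MetricCheck:
    metric: str
    threshold: float
    actual: float
    passed: bool

@dataclass
class GateReport:
    gate: str
    checks: list                 # list[MetricCheck]
    dataset_version: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    @property
    def passed(self) -> bool:
        # The gate passes only if every metric check passes.
        return all(c.passed for c in self.checks)

    def to_json(self) -> str:
        """Serialise as the structured artefact stored by the pipeline."""
        payload = asdict(self)
        payload["passed"] = self.passed
        return json.dumps(payload, indent=2)
```

The serialised report is what gets attached to the pipeline run as evidence; accumulating one per run gives the longitudinal record described above.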
The performance gate computes the declared metrics including AUC-ROC, F1 score, precision, recall, Brier score, and calibration error on the holdout test set and compares against the thresholds declared in the AISDP. Any metric falling below its declared threshold fails the gate.
Two subtleties warrant attention. First, the holdout set must be truly held out and must not have been used during training, hyperparameter tuning, or feature selection. If the holdout set has leaked into the training process, the gate is testing on training data and the results are unreliable. Second, the Technical SME computes metrics with confidence intervals using bootstrap or cross-validation, and the threshold comparison uses the lower bound of the confidence interval rather than the point estimate. A model achieving 0.86 AUC-ROC with a 95 per cent confidence interval of 0.82 to 0.90 is compared against the threshold using 0.82, the worst-case plausible performance.
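The lower-bound comparison can be sketched with a plain bootstrap; the pairwise AUC implementation and the default resample count are illustrative choices, not prescribed by the document:

```python
import numpy as np

def auc(y_true, y_score):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked
    correctly, with ties counting half. O(n^2), fine for a sketch."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

def ci_lower_bound(y_true, y_score, metric=auc, n_boot=2000,
                   alpha=0.05, seed=0):
    """Bootstrap lower bound of the (1 - alpha) confidence interval.
    The gate compares this, not the point estimate, to the threshold."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        yt = y_true[idx]
        if yt.min() == yt.max():   # degenerate resample: one class only
            continue
        stats.append(metric(yt, y_score[idx]))
    return float(np.quantile(stats, alpha / 2))
```

The gate check then becomes `ci_lower_bound(y, scores) >= threshold`, matching the worst-case-plausible comparison in the example above.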
The fairness gate computes the agreed fairness metrics including selection rate ratios, equalised odds, predictive parity, and calibration within groups across all measured protected characteristic subgroups. Any subgroup metric breaching its threshold fails the gate. This gate is non-negotiable: it cannot be overridden without AI Governance Lead approval, and that approval is logged as a formal risk acceptance.
The most common failure mode is that the model passes fairness for all subgroups except one small subgroup where the metric is unreliable due to small sample size. Gate design must handle this either by computing small-subgroup metrics with confidence intervals and comparing using the lower bound, or by flagging subgroups below the minimum sample size as insufficient data requiring manual review rather than automatic pass or fail.
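A sketch of the second option, flagging small subgroups for manual review instead of auto-scoring them. The selection-rate-ratio check, the 0.80 ratio threshold, and the minimum sample size of 50 are illustrative assumptions here, not values the document prescribes:

```python
import numpy as np

def selection_rate_gate(preds, groups, ratio_threshold=0.80, min_n=50):
    """Per-subgroup selection-rate-ratio check. Subgroups below min_n
    are flagged 'insufficient_data' and routed to manual review
    rather than receiving an automatic pass or fail."""
    preds = np.asarray(preds)
    groups = np.asarray(groups)
    rates, verdicts = {}, {}
    for g in np.unique(groups):
        mask = groups == g
        if mask.sum() < min_n:
            verdicts[g] = "insufficient_data"
            continue
        rates[g] = preds[mask].mean()
    if rates:
        ref = max(rates.values())  # highest selection rate as reference
        for g, r in rates.items():
            ratio = r / ref if ref > 0 else 1.0
            verdicts[g] = "pass" if ratio >= ratio_threshold else "fail"
    return verdicts
```

The first option, confidence intervals on small-subgroup metrics, would reuse the bootstrap lower-bound approach described for the performance gate.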
The robustness gate tests the model's stability under input perturbation. IBM's Adversarial Robustness Toolbox provides adversarial attack methods and defence evaluations: for neural networks it supports FGSM, PGD, and DeepFool, and for tabular models it supports feature perturbation. Perturbations are calibrated to realistic input noise levels, and the gate verifies that accuracy does not degrade beyond a defined tolerance. For tabular models, perturbation at realistic noise levels, such as plus or minus 5 per cent on continuous features and random category flips at a 1 per cent rate, is a practical starting point.
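The tabular starting point can be sketched directly, without the toolbox. This is a hand-rolled illustration of the same idea, assuming features are held in a numeric array with categoricals already encoded; the tolerance value and helper names are assumptions of the sketch:

```python
import numpy as np

def perturb_tabular(X, cont_cols, cat_cols, cat_values,
                    noise=0.05, flip_rate=0.01, seed=0):
    """Apply +/- noise multiplicative jitter to continuous columns and
    random category flips at flip_rate to (numerically encoded)
    categorical columns."""
    rng = np.random.default_rng(seed)
    Xp = X.copy()
    for c in cont_cols:
        Xp[:, c] = Xp[:, c] * rng.uniform(1 - noise, 1 + noise, size=len(X))
    for c in cat_cols:
        flip = rng.random(len(X)) < flip_rate
        Xp[flip, c] = rng.choice(cat_values[c], size=flip.sum())
    return Xp

def robustness_gate(model, X, y, cont_cols, cat_cols, cat_values,
                    tolerance=0.02):
    """Pass if accuracy under perturbation degrades by at most tolerance."""
    base = (model.predict(X) == y).mean()
    Xp = perturb_tabular(X, cont_cols, cat_cols, cat_values)
    pert = (model.predict(Xp) == y).mean()
    return {"baseline_acc": base, "perturbed_acc": pert,
            "passed": (base - pert) <= tolerance}
```

In a production pipeline the same check would typically delegate the perturbation and attack methods to the Adversarial Robustness Toolbox rather than this hand-rolled version.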
The drift gate compares the candidate model's output distribution against the current production model's distribution and the baseline model's distribution. PSI, Jensen-Shannon divergence, and the KS test are the standard metrics. Evidently AI computes these automatically and can be called from the CI pipeline. A significant drift signal means the candidate behaves materially differently from the production or baseline model, warranting investigation before deployment even if absolute performance and fairness metrics are within threshold. All thresholds are defined in a version-controlled configuration file kept in the repository, not hardcoded in pipeline definitions, ensuring threshold changes are tracked and reviewed through the standard change management process.
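PSI is simple enough to show by hand. The binning strategy below (deciles of the reference distribution, widened to cover both samples) is one common convention, assumed for this sketch; in practice a tool such as Evidently would compute these metrics inside the pipeline:

```python
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference score
    distribution (expected) and a comparison one (actual)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Bin edges from the reference distribution's quantiles,
    # widened so every point in 'actual' falls inside a bin.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0] = min(edges[0], actual.min()) - eps
    edges[-1] = max(edges[-1], actual.max()) + eps
    e_frac = np.histogram(expected, edges)[0] / expected.size
    a_frac = np.histogram(actual, edges)[0] / actual.size
    # Clip to avoid log(0) on empty bins.
    e_frac = np.clip(e_frac, eps, None)
    a_frac = np.clip(a_frac, eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))
```

The gate would compare `psi(production_scores, candidate_scores)` against the threshold read from the version-controlled configuration file.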
The fairness gate can be overridden only with AI Governance Lead approval, which is logged as a formal risk acceptance. The override must include documented justification and the conditions under which the exception expires.
A PSI above 0.25 against the assessed dataset is a presumptive indicator of substantial modification. A PSI threshold of 0.20 is typical for the routine drift gate, with values above that level triggering investigation before deployment.
Thresholds are declared in the AISDP and maintained in a single version-controlled configuration file. They are set by the AI Governance Lead in consultation with the Technical SME and Legal Advisor, based on the system's risk profile and use case.
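A version-controlled threshold file might look like the fragment below. The file name, key names, and values are illustrative assumptions for this sketch; the actual thresholds are whatever the AISDP declares:

```yaml
# gate_thresholds.yml -- changed only via standard change management review
performance:
  auc_roc: 0.85        # compared against the CI lower bound, not the point estimate
  f1: 0.70
fairness:
  selection_rate_ratio: 0.80
  min_subgroup_n: 50   # below this, flag insufficient data for manual review
robustness:
  max_accuracy_drop: 0.02
drift:
  psi: 0.20            # above this, investigate before deployment
```

Keeping this file in the repository means every threshold change appears in version history and goes through the same review as code.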
The drift gate measures output distribution differences between the candidate, production, and baseline models using PSI, Jensen-Shannon divergence, and KS tests. Significant drift warrants investigation even if absolute metrics pass.
Each gate report is structured JSON recording the gate name, metrics, thresholds, actual values, pass/fail determination, dataset version, and timestamp. Reports are stored with ten-year retention as primary compliance evidence.