The CI/CD pipeline produces compliance evidence as a byproduct of development. This section covers pipeline orchestration, model validation gates, experiment tracking, automated documentation generation, compliance-gated deployment, and pipeline monitoring for high-risk AI systems.
The CI/CD pipeline is the mechanism through which compliance evidence is produced as a byproduct of development rather than left as a retrospective exercise. CI/CD for AI systems extends well beyond traditional software pipelines because it must encompass code quality gates, model validation gates, fairness and bias testing, compliance documentation generation, and deployment controls that enforce human oversight.
The structural distinction is that traditional software CI/CD operates on a single artefact type (code) with a single build process. AI system CI/CD operates on multiple artefact types (code, data, models, configurations) with multiple interconnected build processes: data preparation, feature engineering, model training, model evaluation, model registration, integration testing, and deployment. Each process has its own inputs, outputs, quality gates, and failure modes. Pipeline orchestration must handle these processes, enforce dependencies between them, and produce a coherent evidence trail demonstrating compliance at every stage.
The pipeline catches problems before they reach production rather than detecting them after the fact. A pipeline stage that passes is evidence that a specific compliance check was satisfied for a specific version at a specific time by a specific process. The outputs feed into AISDP Modules 2 (Development Process), 5 (Testing and Validation), 9 (Robustness and Cybersecurity), and 10 (Record-Keeping).
Pipeline orchestration tools such as Apache Airflow, Kubeflow Pipelines, Prefect, Dagster, and cloud-native offerings manage the dependencies, sequencing, and parallelism between stages. A compliance-grade ML pipeline defines stages as discrete, auditable units.
Data ingestion and validation receives raw data, applies schema checks and quality expectations, and quarantines records that fail. Feature engineering computes features from validated data using the versioned transformation logic from the feature store. Model training executes the training job with version-pinned code, data, and configuration. Model evaluation runs the complete evaluation suite covering performance, fairness, robustness, and calibration against defined thresholds. Model registration promotes successful models to the registry with full metadata. Integration testing validates the model within the serving infrastructure against end-to-end scenarios. Deployment executes the controlled rollout through staging, canary, and production stages.
Each stage records its inputs, outputs, execution time, pass/fail status, and any artefacts produced. Stage outputs are immutable once recorded. Failed stages block downstream stages. Pipeline definitions are version-controlled alongside the code and referenced in the AISDP as QMS artefacts. Pipeline versioning links each execution to the specific pipeline definition that produced it, enabling reconstruction of the exact process that produced any given model version.
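The execution model described above can be sketched as a minimal runner in which stage records are frozen once written and a failed stage blocks everything downstream. The names and structure here are illustrative, not the API of any real orchestrator:

```python
import time
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: a stage record is immutable once created
class StageRecord:
    stage: str
    passed: bool
    outputs: tuple
    started_at: float
    finished_at: float

def run_pipeline(stages):
    """Execute stages in order; a failed stage blocks everything downstream.

    `stages` is a list of (name, callable) pairs; each callable returns
    (passed, outputs). A real orchestrator adds retries, parallelism, and
    persistent storage, but the evidence trail has this shape.
    """
    records = []
    for name, fn in stages:
        start = time.time()
        passed, outputs = fn()
        records.append(StageRecord(name, passed, tuple(outputs), start, time.time()))
        if not passed:
            break  # downstream stages never execute after a failure
    return records
```

Because the records are append-only and each carries its own pass/fail status, the list itself is the audit trail: the absence of a downstream record proves the gate blocked it.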
Static analysis for AI systems extends beyond standard code linting. Type checking and linting apply to all application code. Data contract validation checks schema conformance and statistical expectations at every data interface. Configuration validation ensures decision thresholds, feature flags, and model parameters are within documented ranges. Security scanning detects secrets in code, vulnerable dependencies, and insecure configurations.
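Configuration validation, in particular, reduces to checking declared parameters against documented ranges. A minimal sketch, with hypothetical parameter names and ranges:

```python
# Hypothetical documented ranges; in practice these come from the QMS record
# that declares each threshold and its rationale.
DOCUMENTED_RANGES = {
    "decision_threshold": (0.3, 0.7),
    "max_batch_size": (1, 512),
}

def validate_config(config):
    """Return a list of violations: parameters missing or outside their
    documented range, each reported with the offending value and the range."""
    violations = []
    for key, (lo, hi) in DOCUMENTED_RANGES.items():
        value = config.get(key)
        if value is None or not (lo <= value <= hi):
            violations.append((key, value, (lo, hi)))
    return violations
```

A non-empty return value fails the static-analysis stage, and the violation tuples become the logged evidence of what was out of range.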
Unit testing for AI components covers feature transformation functions, post-processing logic, threshold application, explanation generation, and data validation. Each function is tested in isolation with deterministic inputs and expected outputs. Integration testing validates the complete inference pathway from data ingestion through operator interface, catching failures at component boundaries. End-to-end testing exercises complete user journeys including the human oversight workflow.
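Threshold application is a good example of the isolated, deterministic tests described above. A pytest-style sketch, with an illustrative function name:

```python
def apply_threshold(score, threshold=0.5):
    """Post-processing step: map a model score to a binary decision."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return score >= threshold

# pytest-style unit tests: deterministic inputs, expected outputs,
# and an explicit check of the boundary and the error path.
def test_apply_threshold_boundary():
    assert apply_threshold(0.5) is True
    assert apply_threshold(0.49999) is False

def test_apply_threshold_rejects_invalid_score():
    try:
        apply_threshold(1.5)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for out-of-range score")
```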
Model validation gates are the compliance-critical pipeline stage. Gate 1 (performance) evaluates accuracy, precision, recall, and calibration against declared thresholds. Gate 2 (fairness) computes selection rate ratios, equalised odds, and other fairness metrics across all measured protected characteristic subgroups, comparing against declared thresholds. Gate 3 (robustness) tests the model's resilience to adversarial perturbations, out-of-distribution inputs, and edge cases. Gate 4 (drift) compares the new model's behaviour against the production baseline to quantify the change.
Any gate failure blocks the pipeline. Gate failures are logged with the specific metric that failed, the expected threshold, and the actual value. The gate results are retained as Module 5 evidence. These compliance-specific gates sit on top of the standard engineering gates.
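The gate check itself is a comparison of measured metrics against declared thresholds, with each failure logged as metric, expected, and actual. A sketch assuming "higher is better" metrics, with the selection rate ratio from Gate 2 as the worked example (groups and values are illustrative):

```python
def selection_rate_ratio(rates):
    """Gate 2 metric: minimum subgroup selection rate divided by the maximum.

    A common declared threshold is 0.8 (the 'four-fifths' convention).
    """
    return min(rates.values()) / max(rates.values())

def evaluate_gates(metrics, thresholds):
    """Compare each metric to its declared minimum (assumes higher is better).

    Each failure is returned with the expected threshold and actual value,
    exactly the fields the gate log must retain.
    """
    failures = []
    for name, minimum in thresholds.items():
        actual = metrics[name]
        if actual < minimum:
            failures.append({"metric": name, "expected": minimum, "actual": actual})
    return failures
```

An empty failure list means the gate passed; a non-empty list blocks the pipeline and is retained as-is.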
Experiment tracking records every training run with its hyperparameters, data version, code version, evaluation metrics, and artefacts. MLflow, Weights and Biases, and Neptune provide this capability. For compliance, experiment tracking demonstrates that the selected model was chosen through a systematic evaluation process rather than ad hoc experimentation.
Reproducibility requires that any historical training run can be re-executed to produce equivalent results. This demands version-pinned dependencies, deterministic random seeds where applicable, recorded hardware specifications, and archived training data versions. The experiment record is Module 2 evidence demonstrating the development methodology.
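The experiment record can be sketched as a plain structure; trackers such as MLflow capture the same information with richer tooling. The field names here are illustrative, and the content hash is one way (an assumption, not a mandated technique) to make later tampering detectable:

```python
import hashlib
import json
import platform
import random

def experiment_record(params, data_version, code_version, seed):
    """Assemble a reproducibility record for one training run."""
    random.seed(seed)  # deterministic seeding for any stochastic steps
    record = {
        "params": params,
        "data_version": data_version,
        "code_version": code_version,
        "seed": seed,
        "python_version": platform.python_version(),
    }
    # A content hash over the sorted record lets an audit detect any
    # later modification of the stored evidence.
    record["digest"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    return record
```

Re-running with identical inputs on the same environment yields an identical digest, which is the property reproducibility audits check for.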
Automated documentation generation produces AISDP content as a pipeline byproduct. Each pipeline execution generates structured data that feeds into AISDP modules: model cards from the evaluation stage, data quality reports from the validation stage, security scan results from the security stage, and deployment records from the deployment stage. The generated documentation is version-linked to the pipeline execution that produced it.
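As one concrete case, a model card can be rendered directly from the evaluation stage's output and stamped with the execution that produced it. The section layout below follows no particular standard and is purely illustrative:

```python
def render_model_card(evaluation):
    """Render a minimal model-card fragment from evaluation-stage output.

    `evaluation` is the structured data emitted by the pipeline; the
    execution id provides the version link back to the run.
    """
    lines = [
        f"# Model card: {evaluation['model']} v{evaluation['version']}",
        f"Pipeline execution: {evaluation['execution_id']}",
        "## Evaluation metrics",
    ]
    for name, value in sorted(evaluation["metrics"].items()):
        lines.append(f"- {name}: {value:.3f}")
    return "\n".join(lines)
```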
Pipeline security addresses the risk that the pipeline itself becomes a vector for introducing non-compliant changes. Pipeline definitions are protected by the same branch protection and review requirements as application code. Secrets used by the pipeline are sourced from a secrets manager, not embedded in pipeline definitions. Pipeline execution logs are immutable and retained as audit evidence. The pipeline's own access permissions follow the principle of least privilege.
Continuous deployment for high-risk AI systems requires compliance controls that go beyond automated testing. The deployment process must enforce human approval at defined stages, produce deployment evidence, and support rollback without compliance gaps.
Deployment to staging is automated upon pipeline success. Deployment from staging to production requires explicit approval from the Technical Owner and, for changes classified as substantial modifications, from the AI Governance Lead. The approval is recorded with the approver's identity, timestamp, and the specific composite version being approved.
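The approval record reduces to a small structure capturing identity, timestamp, and the composite version. A sketch with illustrative field names; the flag for substantial modifications marks that a second approval from the AI Governance Lead is still outstanding:

```python
from datetime import datetime, timezone

def record_approval(approver, role, composite_version, substantial=False):
    """Record a production approval: who, when, and exactly what version."""
    return {
        "approver": approver,
        "role": role,
        "composite_version": composite_version,
        "approved_at": datetime.now(timezone.utc).isoformat(),
        # Substantial modifications additionally require AI Governance
        # Lead sign-off; the flag records that expectation.
        "governance_approval_required": substantial,
    }
```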
Progressive rollout using canary or shadow deployment strategies reduces the risk of widespread harm from a faulty deployment. Canary deployment routes a small percentage of production traffic to the new version while monitoring performance and fairness metrics. Shadow deployment runs the new version in parallel without affecting production decisions, comparing outputs against the current version. Both strategies produce evidence that the new version's behaviour in production matches its behaviour during evaluation.
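The canary traffic split can be made deterministic by hashing a request identifier, so the same request is always served by the same version during the rollout window. This hash-bucket approach is one common choice, not the only one:

```python
import hashlib

def route_to_canary(request_id, canary_fraction=0.05):
    """Deterministically route a fixed fraction of traffic to the canary.

    Hashing the request id into 10,000 buckets gives a stable, roughly
    uniform assignment; requests in the lowest buckets go to the canary.
    """
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return bucket < canary_fraction * 10_000
```

Stability matters for the evidence trail: because routing is a pure function of the request id, the comparison between canary and production metrics is reproducible after the fact.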
Rollback procedures must be tested and documented. A rollback reverts to the previous composite version, including model, configuration, and any dependent infrastructure changes. The rollback itself is a deployment event recorded in the pipeline log.
Infrastructure-as-code ensures that the deployment environment is version-controlled alongside the system. Terraform, Pulumi, or CloudFormation definitions capture the complete infrastructure specification. Changes to infrastructure follow the same review and approval process as code changes. The infrastructure definition is referenced in AISDP Module 3 as part of the architectural documentation.
Pipeline monitoring tracks the health and compliance of the pipeline itself, distinct from the AI system's operational monitoring. Execution duration trends identify performance degradation. Failure rates by stage identify recurring quality issues. Test coverage metrics ensure that new code and model changes are adequately tested. Evidence generation completeness verifies that every required compliance artefact was produced.
Pipeline audit logs record every execution with its trigger, stages executed, gate results, approvals, and artefacts produced. Logs are immutable and retained for the ten-year period. Regulatory log export capability ensures that a competent authority can receive a structured account of the pipeline's execution history for any specified period.
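The regulatory export described above amounts to filtering the immutable log by period and emitting it in a structured form. A sketch assuming log entries carry an ISO-8601 `executed_at` field (the field names are illustrative):

```python
import json
from datetime import datetime

def export_audit_log(entries, start, end):
    """Export pipeline executions within [start, end] as structured JSON.

    `entries` is the immutable audit log: a list of dicts, each with an
    ISO-8601 'executed_at' timestamp plus the execution's recorded details.
    """
    selected = [
        e for e in entries
        if start <= datetime.fromisoformat(e["executed_at"]) <= end
    ]
    # Sort chronologically so the export reads as an execution history.
    return json.dumps(sorted(selected, key=lambda e: e["executed_at"]), indent=2)
```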
For organisations at earlier maturity levels, the pipeline can be assembled from open-source components at minimal cost. Apache Airflow or Dagster for orchestration, pytest for testing, MLflow for experiment tracking and model registry, Great Expectations for data validation, and GitHub Actions or GitLab CI for execution are all available at zero licence cost. The manual approach requires documented checklists for each pipeline stage, with results recorded by the person executing the step. The loss is automated enforcement: a manual checklist depends on the operator's discipline, and a missed step creates a compliance gap with no automatic detection.
Pipeline documentation for the AISDP includes the pipeline architecture showing stages, dependencies, and gate definitions. It includes gate threshold definitions with rationale for each threshold value. Approval workflow documentation shows who approves at each stage. Evidence retention policies show what is stored, where, and for how long. Pipeline execution history is retained as Module 10 evidence.
Can the pipeline be built entirely from open-source tools? Yes. Apache Airflow or Dagster for orchestration, pytest for testing, MLflow for tracking and registry, Great Expectations for data validation, and GitHub Actions or GitLab CI for execution are all available at zero licence cost.
What happens when a model validation gate fails? The pipeline is blocked. The failure is logged with the specific metric, expected threshold, and actual value. The model cannot proceed to deployment until the failure is resolved. Gate results are retained as AISDP Module 5 evidence.
How is AISDP documentation generated from the pipeline? Each pipeline execution generates structured data feeding AISDP modules: model cards from evaluation, data quality reports from validation, security scan results, and deployment records. Documentation is version-linked to the pipeline execution that produced it.
How are training runs tracked for reproducibility? Every training run is recorded with hyperparameters, data version, code version, and metrics. Reproducibility requires version-pinned dependencies and deterministic execution. The experiment record is AISDP Module 2 evidence.