The reference pipeline implements twelve stages from static analysis through post-deployment verification, producing compliance evidence as a byproduct of every stage. Although the reference implementation targets GitHub Actions, Python, and Kubernetes, the stage structure and evidence pattern apply to any technology stack.
A governance-grade ML pipeline implements every stage from data validation through model evaluation through compliance-gated deployment, connecting the guide's prose guidance to an executable implementation. The reference pipeline uses GitHub Actions for a Python-based ML system deployed on Kubernetes, though organisations using different technology stacks should adapt the specific tools and commands while preserving the stage structure, the gate sequence, and the evidence generation pattern.
The pipeline is designed around a core principle: every stage deposits its evidence artefacts in a versioned S3 bucket keyed by the pipeline run identifier. The final manifest catalogues every artefact produced, enabling the Conformity Assessment Coordinator to trace any compliance claim to its source evidence through a single identifier. The pipeline definition itself is version-controlled in the repository, with changes following the same review and approval process as code changes. The pipeline's execution history constitutes part of the AISDP Module 10 evidence.
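As an illustration, the run-keyed evidence convention can be sketched in a few lines of Python. The bucket name, prefix layout, and stage names below are assumptions for the sketch, not the reference implementation:

```python
# Illustrative sketch of the evidence-keying convention: every artefact is
# stored under a prefix derived from the pipeline run identifier, so a single
# identifier resolves to all evidence for that execution.

def evidence_key(run_id: str, stage: str, artefact: str) -> str:
    """Build the object key for an artefact produced by one stage."""
    return f"evidence/{run_id}/{stage}/{artefact}"

# With boto3, a stage would then archive its report roughly like:
#   s3.put_object(Bucket="compliance-evidence",
#                 Key=evidence_key(run_id, "01-static-analysis", "ruff.json"),
#                 Body=report_bytes)
```

Keeping key construction in one helper means the manifest generator and every stage agree on the layout by construction.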
The reference pipeline implements twelve stages in a defined sequence with explicit dependencies between them. Stages 1 through 3, covering static analysis, testing, and data validation, run on every commit to the main branch when source code, configuration, data, prompts, models, or pipeline definitions change. Stage 4 (model training) runs only when explicitly triggered or when the commit message contains a retrain flag, preventing unnecessary retraining on code-only changes. Stages 5 through 7, covering model evaluation, security scanning, and documentation generation, run on every commit to ensure compliance evidence remains current. Stage 8 (change classification) determines the approval authority required for the specific change. Stage 9 (staging validation) deploys to an environment that mirrors production. Stage 10 (human approval) enforces role-based authorisation. Stage 11 executes the production deployment. Stage 12 verifies post-deployment health.
The stage dependencies enforce a strict sequence: static analysis must pass before tests run, tests must pass before data validation, data validation must pass before training can proceed, and all evaluation gates must pass before security scanning begins. A failure at any stage blocks all downstream stages, ensuring that compliance evidence is complete and consistent for any deployment that reaches production.
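The blocking behaviour can be modelled as a simple ordered list. The abbreviated stage names here are illustrative, not taken from the reference pipeline definition:

```python
# Sketch of strict stage ordering: a failure at any stage blocks every
# downstream stage, so evidence is complete for anything reaching production.

PIPELINE_ORDER = [
    "static-analysis", "tests", "data-validation", "training",
    "evaluation", "security-scan", "docs", "classification",
    "staging", "approval", "deploy", "verify",
]

def blocked_stages(failed_stage: str) -> list[str]:
    """Return every stage that must not run once `failed_stage` fails."""
    idx = PIPELINE_ORDER.index(failed_stage)
    return PIPELINE_ORDER[idx + 1:]
```

In GitHub Actions this ordering would be expressed with `needs:` between jobs; the list form above just makes the blocking rule explicit.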
Stage 1 implements the static analysis requirements from the guide's code quality section. Code quality analysis uses Ruff for linting and mypy for type checking, producing JSON reports as evidence artefacts. AI-specific static analysis uses Semgrep with custom compliance rules that detect patterns specific to AI system risks, generating SARIF output for integration with code review tools.
Dependency scanning uses pip-audit in strict mode to detect known vulnerabilities in Python dependencies, producing a JSON vulnerability report. Licence compliance scanning uses pip-licenses to verify that no dependency carries a licence incompatible with the system's distribution model, failing the stage if AGPL or GPL-3.0 licences are detected. Secret detection uses TruffleHog to scan for accidentally committed credentials, API keys, and other sensitive material, checking only verified findings to reduce false positives.
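The licence gate reduces to a denylist check. A minimal sketch, assuming pip-licenses has already produced a package-to-licence mapping (the mapping shape is an assumption, not pip-licenses' actual JSON schema):

```python
# Hypothetical licence-compliance gate: fail the stage if any dependency
# carries a denylisted licence. The denylist mirrors the AGPL/GPL-3.0
# restriction described in the text.

DENYLIST = frozenset({"AGPL-3.0", "GPL-3.0"})

def check_licences(deps: dict[str, str]) -> list[str]:
    """Return the names of dependencies whose licence is denylisted."""
    return sorted(name for name, licence in deps.items()
                  if licence in DENYLIST)
```

A non-empty return value would fail the stage and be archived as the licence evidence artefact.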
Every report produced by Stage 1 is archived to the evidence bucket under a stage-specific prefix, creating a traceable record of the code quality posture at the point of each pipeline execution.
Stage 2 runs unit tests, integration tests, and contract tests. Unit tests execute with coverage measurement, producing JUnit XML test reports and JSON coverage reports. Integration tests validate component interactions across service boundaries. Contract tests verify API schema conformance between services, catching silent breaking changes that integration tests may miss. All test reports are archived as Module 5 evidence.
Stage 3 validates the training data against defined quality expectations. A Great Expectations checkpoint runs the training data quality suite, with the pipeline halting if any expectation fails. A distribution stability check compares the current training data against the baseline distribution using Population Stability Index, with a configurable threshold that flags significant distributional shifts before they can affect model behaviour.
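The Population Stability Index compares two bucketed distributions: PSI = Σ (Aᵢ − Eᵢ) ln(Aᵢ/Eᵢ) over matched buckets, where Eᵢ and Aᵢ are the expected (baseline) and actual (current) bucket proportions. A minimal implementation:

```python
import math

def population_stability_index(expected: list[float],
                               actual: list[float]) -> float:
    """PSI across matched histogram buckets; inputs are bucket proportions."""
    eps = 1e-6  # guard against empty buckets in the logarithm
    return sum((a - e) * math.log((a + eps) / (e + eps))
               for e, a in zip(expected, actual))
```

A common convention treats PSI below 0.1 as stable and above 0.25 as a significant shift; the reference pipeline makes this threshold configurable.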
Data versioning is recorded at this stage: DVC status and dataset hashes are captured in a provenance record that links the specific data version to the pipeline execution. This provenance chain enables the organisation to trace any deployed model back to the exact training data that produced it, satisfying the Article 10 data governance requirements and the Article 12 traceability mandate.
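The provenance record itself can be as simple as a content hash linked to the run identifier. A sketch under the assumption that the dataset is a single file (DVC tracks hashes similarly for multi-file datasets):

```python
import hashlib

def provenance_record(run_id: str, dataset_path: str) -> dict:
    """Link a pipeline run to the exact bytes of the dataset it used."""
    digest = hashlib.sha256()
    with open(dataset_path, "rb") as f:
        # Hash in chunks so large datasets never load fully into memory.
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return {"run_id": run_id,
            "dataset": dataset_path,
            "sha256": digest.hexdigest()}
```

Archiving this record alongside the DVC status output gives the Article 12 traceability chain a verifiable anchor: re-hashing the data either reproduces the digest or proves it changed.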
Stage 5 implements the four model validation gates that determine whether a candidate model meets the declared compliance thresholds. Gate 1 (performance) evaluates accuracy, precision, recall, and calibration against the thresholds defined in the AISDP configuration. Gate 2 (fairness) computes selection rate ratios, equalised odds, and other fairness metrics across all measured protected characteristic subgroups, comparing against declared fairness thresholds. Gate 3 (robustness) tests the model's resilience against an adversarial test suite covering perturbations, out-of-distribution inputs, and edge cases. Gate 4 (drift) compares the candidate model's behaviour against both the current production model and the baseline model from the last conformity assessment, detecting version-to-version changes and cumulative drift.
A verification step confirms that all four gates passed before the pipeline can proceed. If any gate fails, the pipeline halts with a structured failure report identifying the specific metric, the expected threshold, and the actual value. The gate results are the primary Module 5 evidence artefacts, and the gate summary provides the quantitative basis for the change classification in Stage 8.
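The structured failure report can be sketched as follows. This simplification assumes every metric is higher-is-better with a minimum threshold; real gates would also handle lower-is-better metrics such as drift magnitude:

```python
def evaluate_gate(gate: str, metrics: dict[str, float],
                  thresholds: dict[str, float]) -> dict:
    """Compare each metric to its minimum threshold; report any breach
    with the metric name, expected threshold, and actual value."""
    failures = [{"metric": m, "expected": t, "actual": metrics[m]}
                for m, t in thresholds.items() if metrics[m] < t]
    return {"gate": gate, "passed": not failures, "failures": failures}
```

The per-gate dictionaries returned here are exactly the shape a verification step needs: it passes only if all four results have `passed` set, and any failure entry names the metric, threshold, and value for the halt report.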
Stage 6 implements security scanning across three dimensions. Container image scanning uses Trivy to detect HIGH and CRITICAL vulnerabilities in the built container image. Infrastructure-as-code scanning uses Checkov to validate Terraform, Kubernetes manifests, and cloud configurations against security benchmarks. OPA policy compliance uses Conftest to verify infrastructure definitions against the organisation's compliance policies, catching misconfigurations that could affect the system's security posture.
Stage 7 generates AISDP module updates as a pipeline byproduct. A documentation generation script processes the evidence artefacts from all preceding stages, producing structured updates for the affected AISDP modules. A documentation currency gate then verifies that all modules affected by the change have been updated to reflect the new state, flagging any module that should have been updated but was not. This gate prevents documentation drift, in which the AISDP describes a historical version rather than the version being deployed.
The documentation generation approach ensures that compliance documentation is produced from engineering artefacts rather than written retrospectively. Model cards are generated from evaluation results. Data quality reports are generated from validation outputs. Security posture is documented from scan results. Each generated document is version-linked to the pipeline execution that produced it.
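As a small illustration of generating documentation from evidence rather than writing it by hand, a model card section can be rendered directly from evaluation results. The record shape here is an assumption for the sketch:

```python
def render_model_card(evaluation: dict) -> str:
    """Render a minimal model card section from gate evaluation results.

    `evaluation` is a hypothetical record with a run identifier and a
    flat metric dictionary, as a real pipeline might archive from Stage 5.
    """
    lines = [f"# Model card for run {evaluation['run_id']}", ""]
    for metric, value in sorted(evaluation["metrics"].items()):
        lines.append(f"- {metric}: {value:.3f}")
    return "\n".join(lines)
```

Because the card is a pure function of the archived evidence, regenerating it for any historical run reproduces the documentation that shipped with that version.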
Stage 8 classifies the change by comparing the candidate version's gate results against the AISDP thresholds and the baseline from the last conformity assessment. The classification determines which approval authority is required for deployment, mapping directly to the three-tier change classification framework.
Routine changes fall within all quantitative thresholds and trigger no qualitative flags. These require approval from the Technical SME only, enforced through a GitHub Environment with a single required reviewer. Significant changes approach quantitative thresholds or trigger qualitative flags such as a model architecture change. These require approval from the AI Governance Lead, enforced through a different GitHub Environment. Substantial modifications cross quantitative thresholds or change the intended purpose. These require dual approval from both the AI Governance Lead and the Legal and Regulatory Advisor, enforced through a third GitHub Environment with two required reviewers.
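The three-tier mapping can be sketched as a small decision function. The margin representation and the 0.05 "approaching" band below are illustrative assumptions, not values from the reference configuration:

```python
def classify_change(gate_margins: dict[str, float],
                    qualitative_flags: set[str],
                    purpose_changed: bool,
                    near: float = 0.05) -> str:
    """Map gate results and flags to the three-tier classification.

    `gate_margins` holds each metric's headroom above its threshold:
    negative means the threshold was crossed.
    """
    if purpose_changed or any(m < 0 for m in gate_margins.values()):
        return "substantial"   # dual approval: Governance Lead + Legal
    if qualitative_flags or any(m < near for m in gate_margins.values()):
        return "significant"   # approval: AI Governance Lead
    return "routine"           # approval: Technical SME only
```

The returned tier selects which GitHub Environment, and hence which required reviewers, gate the deployment.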
The classification output is recorded as an evidence artefact and passed to the deployment approval stage. The human approver reviews the evidence summary including all gate results, the change classification rationale, and the affected AISDP modules before authorising deployment.
Stage 9 deploys the candidate to a staging environment that mirrors production, using Helm to install the system with the specific model version and AISDP version. A staging test suite validates end-to-end behaviour in the production-representative environment, catching issues that unit and integration tests cannot detect.
Stage 10 enforces human approval before production deployment. The approval step assembles an evidence summary for the approver, pulling gate results from the evidence bucket and formatting them for review. The approval decision, including the approver's identity, timestamp, classification, and pipeline run identifier, is recorded as an immutable evidence artefact.
Stage 11 executes the production deployment using Argo Rollouts for progressive canary delivery. Traffic is initially routed to the new version at a low percentage, with automatic promotion to full traffic after analysis confirms stable behaviour. If the canary analysis fails, the rollout is automatically aborted. The deployment event is recorded in a structured JSON document capturing the composite version, all component versions, the approver, the classification, and the pipeline run. This event is appended to an immutable deployment ledger that provides the authoritative history of every production deployment.
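The append-only ledger shape can be sketched as one JSON document per line. In practice immutability would come from bucket versioning or object locking rather than file semantics; this shows only the record format, with illustrative field names:

```python
import json

def record_deployment(ledger_path: str, event: dict) -> None:
    """Append one deployment event as a line of JSON (JSONL ledger).

    `event` would carry the composite version, component versions,
    approver, classification, and pipeline run identifier.
    """
    with open(ledger_path, "a", encoding="utf-8") as ledger:
        ledger.write(json.dumps(event, sort_keys=True) + "\n")
```

Because events are only ever appended, reading the file back yields the complete deployment history in order.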
Stage 12 verifies production health after deployment, waiting for canary stabilisation before running post-deployment checks against the production baseline. The pipeline concludes by generating a complete evidence manifest cataloguing every artefact produced across all twelve stages, stored alongside the stage-specific evidence in the versioned bucket.
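The manifest itself is a catalogue keyed by the run identifier. A minimal sketch, assuming each artefact entry records at least its stage and name:

```python
def build_manifest(run_id: str, artefacts: list[dict]) -> dict:
    """Catalogue every artefact a run produced, keyed by run identifier.

    Sorting by (stage, name) makes the manifest deterministic, so two
    builds of the same run produce byte-identical catalogues.
    """
    return {
        "run_id": run_id,
        "artefact_count": len(artefacts),
        "artefacts": sorted(artefacts,
                            key=lambda a: (a["stage"], a["name"])),
    }
```

The Conformity Assessment Coordinator then needs only the run identifier to enumerate, and fetch, every piece of evidence behind a deployment.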