Article 9 of the EU AI Act requires documented, auditable development processes for high-risk AI systems. Pipeline orchestration enforces the correct sequencing of compliance-relevant activities, from data validation through model deployment, producing structured evidence at every stage.
Pipeline orchestration enforces the correct sequencing of every compliance-relevant activity in an ML workflow, from data validation through to production deployment. The EU AI Act requires that high-risk AI systems follow documented development processes with auditable evidence at each stage. A well-designed orchestration layer makes it structurally impossible to train a model on unvalidated data, register a model that has not passed fairness gates, or deploy a model without governance approval. The pipeline definition itself functions as an executable compliance specification, encoding the sequencing rules that the AISDP documents in prose.
Without orchestration, organisations rely on manual discipline to ensure stages execute in the correct order. This approach is error-prone because steps can be skipped, run out of sequence, or executed without proper verification of predecessor outputs. The resulting evidence trail is fragmented and difficult to audit.
Pipeline orchestration tools such as Apache Airflow, Kubeflow Pipelines, Prefect, Dagster, and cloud-native offerings like AWS Step Functions, Azure ML Pipelines, and Vertex AI Pipelines provide the structural backbone that transforms compliance requirements into repeatable, verifiable workflows. These tools manage the dependencies, sequencing, and parallelism between stages, ensuring that each stage's prerequisites are satisfied before execution begins.
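The dependency and gate enforcement these tools provide can be sketched in plain Python. This is a minimal illustration with hypothetical stage names, not a substitute for a real orchestrator:

```python
# Minimal sketch of strict stage sequencing with halt-on-failure semantics.
# Stage names and checks are hypothetical; a production pipeline would
# delegate sequencing to Airflow, Dagster, Prefect, or similar.

class GateFailure(Exception):
    """Raised when a stage's quality or compliance check fails."""

def run_pipeline(stages):
    """Execute (name, check) pairs strictly in order, recording evidence.

    Each check returns True when the stage's gate passes; the first
    failure halts the pipeline so no downstream stage can run.
    """
    evidence = []
    for name, check in stages:
        passed = check()
        evidence.append({"stage": name, "passed": passed})
        if not passed:
            raise GateFailure(f"stage '{name}' failed; downstream stages blocked")
    return evidence
```

Because the loop raises on the first failed gate, a training stage can never observe unvalidated data: the data-validation stage either passed, or the run never reached training.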
A compliance-grade ML pipeline defines eight discrete, auditable stages that must execute in a strict dependency order. Each stage produces a defined output, and the pipeline halts if any stage fails its quality or compliance checks. CI/CD Pipelines for AI Systems covers how these stages integrate into the broader continuous integration strategy.
Data Preparation ingests raw data from documented sources, applies data quality checks, executes preprocessing transformations, and produces a versioned dataset. The output is a dataset version registered in the data versioning system. The stage fails if any data quality check breaches its defined threshold.
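A data quality gate of this kind can be sketched as a threshold comparison. The check names and threshold values below are illustrative; a real pipeline would source them from the AISDP:

```python
# Sketch of a data-quality gate: computed statistics are compared against
# declared thresholds, and any breach fails the stage. The check names
# ("missing_rate", "duplicate_rate") are hypothetical examples.

def check_data_quality(stats, thresholds):
    """Return the dict of breached checks; an empty dict means the gate passes.

    Both arguments map check names to numeric values; a check breaches
    when its observed value exceeds its declared threshold.
    """
    return {
        name: value
        for name, value in stats.items()
        if value > thresholds.get(name, float("inf"))
    }
```

Returning the breaches rather than a bare boolean means the stage's failure record names exactly which threshold was violated, which is what an auditor needs to see.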
Feature Engineering transforms the versioned dataset into the feature representation used for training, applying pre-step and post-step capture methodology to record lineage. The output is a versioned feature set. The stage fails if any feature falls outside its documented value range or if the feature set schema does not match the expected specification.
Model Training trains the model using the versioned feature set and documented hyperparameters. All training metadata is recorded, including duration, resource consumption, convergence metrics, and random seed. The output is a candidate model artefact registered in the model registry with "experimental" status.
Model Evaluation assesses the candidate model against a holdout test set, computing performance, fairness, robustness, and calibration metrics as documented in the AISDP. The output is a structured evaluation report. The stage fails if any metric breaches its declared threshold.
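The evaluation gate suite can be sketched as metric checks with declared thresholds and directions. The metric names and numbers below are illustrative, not drawn from any standard:

```python
# Sketch of an evaluation gate suite. Metric names, directions, and
# thresholds are hypothetical; in practice they come from the AISDP.

GATES = {
    # metric: (threshold, direction) — "min" means value must be >= threshold,
    # "max" means value must be <= threshold
    "accuracy": (0.90, "min"),
    "demographic_parity_gap": (0.05, "max"),
    "calibration_error": (0.03, "max"),
}

def evaluate_gates(metrics, gates=GATES):
    """Produce a structured report; 'passed' is False on any breach."""
    results = {}
    for name, (threshold, direction) in gates.items():
        value = metrics[name]
        ok = value >= threshold if direction == "min" else value <= threshold
        results[name] = {"value": value, "threshold": threshold, "passed": ok}
    return {"passed": all(r["passed"] for r in results.values()),
            "results": results}
```

The structured report, rather than a single pass/fail flag, is what gets attached to the model at registration, so the evidence trail records each metric against its declared threshold.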
Model Registration promotes the model to "staging" status in the model registry when evaluation passes. The evaluation report, dataset version, training code commit, and pipeline execution identifier are all attached as metadata. This rich metadata attachment ensures that any model in the registry can be traced back to the exact data, code, and evaluation that produced it.
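The metadata attachment at registration time can be sketched as follows; the registry is a plain dict standing in for a real model registry, and the field names mirror the prose above:

```python
# Sketch of registering a model at "staging" status with full provenance
# metadata attached. The dict-based registry is a stand-in for a real
# model registry backend.
import datetime

def register_model(registry, model_id, evaluation_report, dataset_version,
                   code_commit, pipeline_run_id):
    """Record a staging registry entry traceable to data, code, and run."""
    registry[model_id] = {
        "status": "staging",
        "evaluation_report": evaluation_report,
        "dataset_version": dataset_version,
        "code_commit": code_commit,
        "pipeline_run_id": pipeline_run_id,
        "registered_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    return registry[model_id]
```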
Integration Testing deploys the staged model to the integration environment and runs the end-to-end test suite. The stage fails if any integration, regression, or contract test fails. Integration Testing Strategies provides detailed guidance on structuring these test suites for compliance purposes.
Pre-Deployment Approval introduces a human approval gate where the designated approver reviews the evaluation report and integration test results before authorising promotion to production. This gate ensures that no model reaches production without explicit human oversight, as required by Article 14 for high-risk systems.

Deployment promotes the approved model to production following the staged deployment process, whether canary, shadow, or full rollout. The deployment event is recorded in the immutable deployment ledger, creating a permanent record of what was deployed, when, and by whose authority.
Every pipeline stage must satisfy three essential properties to support both operational reliability and regulatory compliance. These properties ensure that pipeline executions are deterministic, transparent, and resilient to failure.
First, each stage must be idempotent: re-running the stage with the same inputs produces the same outputs. Idempotency allows teams to retry failed stages without risk of corrupting downstream artefacts or producing inconsistent evidence records. This property is particularly important when pipelines are triggered automatically by code commits or data updates.
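One common way to obtain idempotency is to derive the artefact location from a hash of the stage's inputs, so a retry with identical inputs reuses the existing artefact rather than producing a divergent one. A minimal sketch, with hypothetical stage names and an in-memory artefact store:

```python
# Sketch of an idempotent stage: the output key is a deterministic
# function of the inputs, so re-runs with the same inputs cannot create
# conflicting artefacts. The dict store stands in for object storage.
import hashlib
import json

def stage_output_key(stage_name, inputs):
    """Derive a deterministic artefact key from stage name and inputs."""
    digest = hashlib.sha256(
        json.dumps(inputs, sort_keys=True).encode()
    ).hexdigest()[:12]
    return f"{stage_name}/{digest}"

def run_idempotent(stage_name, inputs, transform, store):
    """Skip the work if an artefact for these exact inputs already exists."""
    key = stage_output_key(stage_name, inputs)
    if key not in store:                 # retry-safe: same inputs, same key
        store[key] = transform(inputs)
    return key, store[key]
```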
Second, each stage must be observable, emitting structured logs and metrics that the pipeline monitoring system can consume. Observability enables both engineering teams and governance reviewers to understand what happened at each stage, how long it took, and whether any anomalies occurred. Structured metadata from each step feeds into the provenance query tool.
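Structured emission can be as simple as one JSON record per stage execution. The field names below are illustrative and would be aligned with whatever schema the monitoring system consumes:

```python
# Sketch of a structured, machine-readable stage event. Field names are
# hypothetical; the point is that logs are parseable records, not prose.
import json

def emit_stage_event(stage, status, started_at, ended_at, **extra):
    """Serialise one stage execution as a JSON log line."""
    record = {
        "stage": stage,
        "status": status,               # e.g. "succeeded" / "failed"
        "started_at": started_at,
        "ended_at": ended_at,
        "duration_s": round(ended_at - started_at, 3),
        **extra,
    }
    return json.dumps(record, sort_keys=True)
```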
Third, each stage must be recoverable: if a stage fails, the pipeline can resume from that stage without re-executing completed stages. Recovery avoids wasted computation and preserves the evidence trail from stages that completed successfully. Orchestration tools handle recovery natively through checkpointing and stage-level retry mechanisms. Without recoverability, a failure late in the pipeline would require re-running the entire workflow, wasting resources and potentially producing inconsistent evidence if earlier stages behave differently on re-execution.
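The checkpoint-and-resume behaviour orchestrators provide natively can be sketched as follows, with an in-memory checkpoint store standing in for the orchestrator's state backend:

```python
# Sketch of stage-level recovery: completed stages are checkpointed, and
# a resumed run re-executes only the stages that lack a checkpoint.
# Stage names and the dict checkpoint store are illustrative.

def run_with_recovery(stages, checkpoints):
    """Execute stages in order, skipping any already checkpointed.

    `stages` is an ordered list of (name, callable) pairs; `checkpoints`
    maps completed stage names to their outputs and persists across runs.
    Returns the names of stages actually executed this run.
    """
    executed = []
    for name, work in stages:
        if name in checkpoints:
            continue                     # completed in a prior run
        checkpoints[name] = work()
        executed.append(name)
    return executed
```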
The pipeline definition itself is a compliance artefact that must be version-controlled alongside the code and configuration it orchestrates. Changes to the pipeline definition, such as adding a new quality gate, modifying a threshold, or altering the stage sequence, constitute changes to the development process documented in AISDP Module 2.
Pipeline definition changes should follow the same review and approval workflow as code changes. A modification to the pipeline that removes a fairness gate or reorders evaluation before training has compliance implications that must be reviewed by the governance function. Version-controlling the pipeline definition ensures that every execution can be traced back to a specific, approved version of the workflow.
The pipeline specification file, whether a DAG definition or pipeline configuration, should be stored in the code repository and explicitly referenced in the AISDP. This creates a direct link between the documented development process and its executable implementation. When an auditor examines the AISDP, they can follow the reference to the exact pipeline version that was in effect at any point in the system's history.
Orchestration tool selection depends on the organisation's infrastructure maturity, cloud strategy, and the frequency of model retraining cycles. Each tool brings distinct strengths that map to different compliance needs.
Apache Airflow is the most widely adopted open-source option, offering a large ecosystem of operators with pre-built integrations for cloud services, databases, and ML platforms. Its DAG-based model is intuitive for sequential workflows, and Airflow stores task execution metadata in its metadata database, making pipeline history queryable. Metadata and Experiment Tracking covers how orchestration metadata integrates with broader experiment tracking.
Kubeflow Pipelines is the natural choice for organisations already running ML workloads on Kubernetes. It provides container-native pipeline steps with automatic metadata logging through Kubeflow ML Metadata, which captures artefact lineage automatically. This metadata is queryable and serves as the foundation for provenance tracking.
Dagster's asset-aware model is particularly useful for compliance purposes. Each pipeline step produces a defined asset, such as a dataset, a feature set, or a model, and Dagster tracks the lineage relationships between assets automatically. This built-in lineage tracking reduces the instrumentation burden on engineering teams.
Prefect and ZenML offer Pythonic APIs that feel natural to data scientists, reducing the gap between notebook experimentation and production pipeline code. This lower barrier to adoption can accelerate the transition from ad-hoc scripts to governed pipelines, which is particularly valuable for teams early in their compliance journey.

Cloud-native offerings such as AWS Step Functions, Azure ML Pipelines, and Vertex AI Pipelines provide managed orchestration for organisations that prefer vendor-managed infrastructure. These services reduce the operational burden of maintaining the orchestration platform itself, though they introduce cloud provider lock-in that should be evaluated against the organisation's data sovereignty requirements under the EU AI Act.
A compliance-relevant pipeline follows a specific structural pattern that enforces validation gates at each transition between stages. The pipeline begins with data validation, confirming that the input data meets documented quality standards. If validation fails, the pipeline halts and no downstream processing occurs on invalid data.
Data transformation and feature engineering follow, with lineage captured at each step. The training step produces a model artefact that is logged to the experiment tracker with all hyperparameters, random seeds, and environment specifications. Complete recording at this stage ensures that training runs are fully reproducible.
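The training-step recording described above can be sketched as a wrapper that pins the random seed and captures environment metadata. The dict tracker is a stand-in for a real experiment-tracking backend, and the field names are illustrative:

```python
# Sketch of a reproducible training run: the seed is pinned before
# training and the metadata needed to re-run it is captured alongside.
import platform
import random
import sys
import time

def tracked_training_run(train_fn, seed, tracker):
    """Seed the run, execute training, and record reproducibility metadata."""
    random.seed(seed)                    # pin randomness for reproducibility
    started = time.time()
    model = train_fn()
    tracker["run"] = {
        "seed": seed,
        "duration_s": round(time.time() - started, 3),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
    }
    return model
```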
The evaluation step runs the full validation gate suite: performance metrics, fairness metrics, robustness tests, and drift comparison against both the production model and the baseline model. If any gate fails, the pipeline halts. If all gates pass, the model is registered in the model registry at "staging" status. The deployment step then promotes the model to production, subject to governance approval.
Each pipeline step should emit structured metadata that is captured by the orchestration tool's metadata store. Airflow stores this in its metadata database, while Kubeflow ML Metadata captures artefact lineage automatically. This metadata is queryable and serves as the foundation for provenance queries that auditors and governance teams use to trace the full history of any model in production. The pipeline definition file itself should be version-controlled in the code repository and referenced in the AISDP, closing the loop between documented process and executable workflow.
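A provenance query over such a metadata store can be sketched with an in-memory SQLite database. The schema and column names here are hypothetical; real orchestrators expose equivalent queries through their own metadata APIs:

```python
# Sketch of tracing a production model back to its data, code, and
# pipeline run via a relational metadata store. The single-table schema
# is illustrative.
import sqlite3

def build_store():
    """Create an in-memory lineage store with an illustrative schema."""
    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE lineage (
        model_id TEXT, dataset_version TEXT, code_commit TEXT, run_id TEXT)""")
    return con

def provenance(con, model_id):
    """Answer the auditor's question: where did this model come from?"""
    row = con.execute(
        "SELECT dataset_version, code_commit, run_id FROM lineage "
        "WHERE model_id = ?",
        (model_id,),
    ).fetchone()
    if row is None:
        return None
    return dict(zip(("dataset_version", "code_commit", "run_id"), row))
```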
Pipeline observability means that the current state and history of pipeline executions are visible to both engineering and governance teams. Without observability, compliance evidence exists in scattered logs and manual records rather than in a coherent, queryable system.
The orchestration tool's user interface provides the engineering view of pipeline health. Airflow's web UI, Dagster's Dagit, and Prefect's dashboard each show which runs succeeded, which failed, where failures occurred, and what the execution times were. This operational view is essential for diagnosing pipeline failures and maintaining system reliability. Engineers use these dashboards to identify bottlenecks, monitor resource consumption, and trace the cause of stage failures back to specific data inputs or configuration changes.
A governance-oriented dashboard should aggregate pipeline health metrics that reveal compliance patterns at the organisational level. These metrics include the success rate of validation gates over time, the frequency of fairness gate failures, the mean time from model training to production deployment, and the number of pipeline executions blocked by governance approval. These aggregate metrics reveal systemic patterns that individual pipeline run records do not surface, such as a gradual decline in fairness gate pass rates that might indicate data distribution drift. Surfacing these trends early allows the governance function to intervene before compliance failures occur in production.
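One of these aggregates, the gate pass rate over time, can be sketched from per-run gate records. The record shape is illustrative:

```python
# Sketch of a governance dashboard aggregate: the pass rate of a named
# gate across pipeline runs. Each run record maps gate names to booleans;
# the shape is hypothetical.

def gate_pass_rate(runs, gate):
    """Fraction of runs in which the named gate passed, or None if unseen."""
    relevant = [r for r in runs if gate in r["gates"]]
    if not relevant:
        return None
    return sum(r["gates"][gate] for r in relevant) / len(relevant)
```

Computed over a rolling window, a declining fairness-gate pass rate is exactly the early drift signal the paragraph above describes.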
Organisations without orchestration tooling can execute pipeline steps manually in sequence, following a documented runbook. The runbook lists each pipeline stage, including data validation, feature engineering, training, evaluation, and registration. For each stage, it specifies the command or script to execute, the expected output, and the verification check to perform before proceeding to the next step.
The person executing the pipeline logs each step's start time, end time, outcome, and any anomalies in a structured execution log. This log is retained as Module 5 evidence and serves as the audit trail for that pipeline execution.
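Even a manual process benefits from capturing that log in a structured form rather than free text. A minimal sketch, with a field set mirroring the prose (start, end, outcome, anomalies):

```python
# Sketch of a structured manual-execution log entry for runbook-driven
# pipelines. Field names mirror the prose; the list-based log is a
# stand-in for whatever evidence store the organisation uses.
import json

def log_manual_step(log, stage, start, end, outcome, anomalies=""):
    """Append one runbook step to the structured execution log."""
    entry = {
        "stage": stage,
        "start": start,
        "end": end,
        "outcome": outcome,              # e.g. "pass" / "fail"
        "anomalies": anomalies,
    }
    log.append(entry)
    return json.dumps(entry, sort_keys=True)
```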
Manual execution carries significant risks: steps can be skipped or run out of order, and the process depends entirely on the operator's discipline and attention. There is no automated enforcement of stage dependencies, so a model could theoretically be deployed without having passed the evaluation gate. This procedural alternative is feasible only for systems retrained infrequently, typically quarterly or less often.
For systems retrained weekly or more often, orchestration tooling becomes essential to maintain both reliability and compliance. Airflow, Prefect, and Dagster all offer open-source editions that provide a low-cost entry point to automated orchestration. The transition from manual runbooks to automated pipelines should be treated as a compliance improvement, with the new pipeline definition reviewed and approved through the standard change management process.
Can the pipeline be executed without orchestration tooling? Yes, but only for systems retrained quarterly or less often. The runbook must list each stage, the command to execute, the expected output, and the verification checks, and a structured execution log must be maintained as Module 5 evidence.

Do changes to the pipeline definition require review? Yes. Pipeline definition changes such as adding quality gates, modifying thresholds, or altering stage sequences constitute changes to the development process documented in the AISDP and must follow the same review workflow as code changes.

Which orchestration tools offer the strongest built-in lineage support? Dagster's asset-aware model is particularly strong for compliance because it automatically tracks lineage between pipeline assets. Kubeflow Pipelines provides automatic metadata logging via ML Metadata, and Airflow's metadata database makes pipeline history queryable.

What should a governance dashboard show? It should aggregate validation gate success rates, fairness gate failure frequency, mean time from training to deployment, and the number of executions blocked by governance approval. These metrics reveal systemic compliance patterns.

What properties must each pipeline stage satisfy? Each stage must be idempotent, observable, and recoverable, ensuring deterministic, transparent, and failure-resilient execution.

Which orchestration tools are commonly used? Apache Airflow, Kubeflow Pipelines, Dagster, Prefect, ZenML, and cloud-native offerings such as AWS Step Functions and Azure ML Pipelines.

What is the procedural alternative to automated orchestration? Manual execution following a documented runbook with structured logging, feasible only for systems retrained quarterly or less often.

How should pipeline definitions themselves be governed? As compliance artefacts: version-controlled alongside code and subject to the same review and approval workflow as code changes.