Article 11 and Annex IV require providers of high-risk AI systems to document how models were developed and selected. Experiment tracking and reproducibility together provide the evidence that a competent authority or notified body needs to verify the claims in the AISDP.
Model development involves experimentation across different architectures, hyperparameters, feature sets, and training strategies. Experiment tracking captures this exploration systematically and connects it to the formal version control system. When a conformity assessment requires the organisation to explain why a particular model architecture and hyperparameter configuration was chosen, experiment tracking records provide the evidence base. They demonstrate that alternatives were evaluated, that the chosen configuration was selected on merit, and that the selection criteria aligned with the compliance requirements documented in the AISDP.
Without experiment tracking, the Technical SME must reconstruct this narrative from scattered logs and team memory. That approach is fragile and error-prone, making it inadequate for regulatory scrutiny.
Experiment tracking tools such as MLflow Tracking, Weights & Biases, Neptune, and Comet record the parameters, metrics, and artefacts for each training run. For compliance purposes, the tracking system must capture the full specification of each run: code version, data version, hyperparameters, random seed, and compute environment. It must also record the training metrics at each epoch or iteration, the final evaluation metrics on the holdout test set, and the resulting model artefact with its content hash.
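As a concrete illustration of what "full specification of each run" means, the sketch below captures those fields as a JSON record using only the standard library. The function name and record layout are illustrative, not the API of MLflow or any other tracking tool; a real setup would log the same fields through the tool's own client.

```python
import hashlib
import json
import platform
import sys
from pathlib import Path

def record_run(run_dir: Path, hyperparams: dict, metrics: dict,
               code_version: str, data_version: str, seed: int,
               artefact: Path) -> dict:
    """Illustrative sketch: persist the full specification of one
    training run so it can later be retrieved and re-executed."""
    record = {
        "code_version": code_version,        # e.g. a git commit SHA
        "data_version": data_version,        # e.g. a DVC data hash
        "hyperparameters": hyperparams,
        "random_seed": seed,
        "environment": {                     # minimal compute-environment snapshot
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
        "final_metrics": metrics,            # holdout evaluation results
        # Content hash ties the record to the exact model artefact produced
        "artefact_sha256": hashlib.sha256(artefact.read_bytes()).hexdigest(),
    }
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "run_record.json").write_text(json.dumps(record, indent=2))
    return record
```

The content hash is the load-bearing field: it lets an assessor confirm that the artefact referenced in the AISDP is byte-for-byte the one this run produced.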
MLflow Tracking is the most widely adopted open-source option. It logs parameters, metrics, and artefacts per run and provides a UI for run comparison and visualisation. Weights & Biases adds collaboration features such as shared dashboards and run notes, along with richer visualisation including learning curves and parameter importance plots. CI/CD Pipelines for High-Risk AI covers the broader pipeline context in which experiment tracking operates.
The compliance value becomes apparent during conformity assessment. An assessor reviewing the AISDP's Module 5 needs to understand how the deployed model was selected from among the candidate models that were trained. The experiment tracking system provides this narrative: the set of experiments that were run, the metrics comparison that led to the selection, and the specific run that produced the deployed model.
A competent authority or notified body may require the organisation to reproduce a specific training run to verify the claims in the AISDP. Reproducibility requires that the exact code version, data version, and configuration are retrievable; that the compute environment can be recreated; that the random seed is recorded and can be set to produce the same initialisation; and that the training framework's version is pinned, since framework updates can change numerical behaviour even with the same seed.
Full bitwise reproducibility is not always achievable, particularly for GPU-accelerated training where non-deterministic operations such as atomic floating-point additions are common. The AISDP must document the level of reproducibility the system achieves, the factors that may cause variation between runs, and the tolerance bounds within which reproduced results are considered consistent. Where bitwise reproducibility is not achievable, the Technical SME demonstrates and documents statistical reproducibility: results within defined confidence intervals across multiple runs.
For deterministic algorithms on CPU, reproducibility is achievable through dependency pinning, random seed recording, and data versioning. Poetry or Conda lock files pin every dependency, including transitive dependencies, to exact versions. The random seed is logged as a hyperparameter, data is versioned via DVC, and the training environment is captured as a Docker image. Given these four elements, the training run can be re-executed identically.
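The seed's role in this can be shown with a minimal stdlib sketch. The `seeded_run` function is a hypothetical stand-in for a training run; in practice the same principle applies to the framework's seeding calls (NumPy, PyTorch, etc.), alongside the pinned dependencies, versioned data, and container image.

```python
import random

def seeded_run(seed: int, n: int = 5) -> list[float]:
    """Hypothetical stand-in for a training run: given the recorded seed,
    the pseudo-random draws (weight init, data shuffling) repeat exactly."""
    rng = random.Random(seed)          # isolate the RNG so global state cannot leak in
    return [rng.random() for _ in range(n)]

# Re-executing with the logged seed reproduces the run exactly
first = seeded_run(42)
second = seeded_run(42)
assert first == second
```

Using an isolated `random.Random(seed)` instance rather than the module-level RNG matters: any library call that touches global RNG state between runs would otherwise silently break reproducibility.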
For GPU training, bitwise reproducibility is often unachievable because GPU operations use non-deterministic algorithms for performance. Parallel floating-point reductions process elements in different orders across executions. NVIDIA CUDA provides a deterministic mode (torch.use_deterministic_algorithms(True) in PyTorch) that forces deterministic operations, but at a significant performance penalty and with some operations unsupported.
The practical alternative is statistical reproducibility: run the training specification multiple times (three to five runs is standard), compute the confidence interval of the evaluation metrics across runs, and declare the model's performance as the mean plus or minus the interval. This approach acknowledges the inherent stochasticity while providing a defensible performance range. Validation and Testing Frameworks covers how confidence intervals feed into validation gates. The gate should compare the lower bound of the interval against the threshold, ensuring that the model meets the declared standard even under stochastic variation.
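The multiple-run procedure and the lower-bound gate can be sketched with the standard library alone. This is an illustrative implementation, not from any particular validation framework; it uses a normal approximation for the interval, where a t-interval would be stricter for the small run counts (three to five) typical here.

```python
import math
import statistics

def metric_interval(scores: list[float], z: float = 1.96) -> tuple[float, float, float]:
    """Mean and approximate 95% confidence interval of an evaluation
    metric across repeated runs of the same training specification."""
    mean = statistics.mean(scores)
    half = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - half, mean + half

def passes_gate(scores: list[float], threshold: float) -> bool:
    """Validation gate: the LOWER bound of the interval must clear the
    declared threshold, so the claim holds under stochastic variation."""
    _, lower, _ = metric_interval(scores)
    return lower >= threshold

# Hypothetical holdout AUC from four re-runs of one training specification
aucs = [0.912, 0.915, 0.909, 0.914]
```

Gating on the lower bound rather than the mean is the conservative choice the text describes: a model whose mean clears the threshold but whose interval dips below it has not demonstrated that the declared standard holds under run-to-run variation.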
For AISDP purposes, the reproducibility specification should document which level of reproducibility is achieved (bitwise or statistical), the mechanisms used to achieve it, and any known limitations. For GPU-trained models, the Technical SME retains multiple-run results and confidence intervals as Module 5 evidence. Technical Documentation and Evidence details how this evidence integrates into the broader documentation structure.
The specification bridges experiment tracking and reproducibility, which are two sides of the same compliance requirement: the ability to reconstruct how a model was produced. Experiment tracking records what happened during each training run; reproducibility ensures that the same run specification would produce the same or statistically equivalent result if re-executed.
Experiment tracking can be done manually through a structured experiment log spreadsheet. Columns should include experiment ID, date, hypothesis, all hyperparameters, data version, code commit, random seed, training duration, all declared evaluation metrics, notes, and outcome determination (selected for further development, rejected, or baseline). Model artefacts are stored with the experiment ID as the directory name, and the log is reviewed during model selection to document why the deployed model was chosen.
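For teams that prefer a machine-readable log over a spreadsheet, the same columns can be maintained as a CSV file with a few lines of standard-library Python. The column names and function are illustrative, mirroring the structure described above.

```python
import csv
from pathlib import Path

# Columns mirroring the manual experiment log described in the text
LOG_COLUMNS = [
    "experiment_id", "date", "hypothesis", "hyperparameters",
    "data_version", "code_commit", "random_seed", "training_duration",
    "evaluation_metrics", "notes", "outcome",
]

def append_log_entry(log_path: Path, entry: dict) -> None:
    """Append one training run to the experiment log CSV,
    writing the header row on first use."""
    new_file = not log_path.exists()
    with log_path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_COLUMNS)
        if new_file:
            writer.writeheader()
        writer.writerow(entry)
```

A plain-text log like this also diffs cleanly in version control, which keeps the selection history alongside the code it describes.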
This manual approach loses visualisation features such as learning curves and parameter importance plots, easy run comparison, and automatic metric logging. It is adequate for systems with infrequent training runs (fewer than ten per quarter). For active experimentation with dozens or hundreds of runs, tools such as MLflow, ClearML, or Neptune are needed; all have free tiers.
MLflow Tracking is the most widely adopted open-source option. Weights & Biases, Neptune, and Comet are also used. All record parameters, metrics, and artefacts per run. MLflow, ClearML, and Neptune offer free tiers suitable for initial compliance work.
Does the AI Act require bitwise reproducibility? No. The regulation requires documented reproducibility, not necessarily bitwise exactness. Where GPU non-determinism prevents identical results, statistical reproducibility with confidence intervals across multiple runs is an accepted approach, provided the AISDP documents the level achieved and the limitations.
Can experiment tracking be done without dedicated tooling? Yes, for systems with infrequent training runs (fewer than ten per quarter). A structured spreadsheet with experiment ID, hyperparameters, data version, code commit, random seed, metrics, and outcome determination is adequate. Active experimentation with many runs requires dedicated tooling.
The AISDP must document the level of reproducibility achieved, factors causing variation, and tolerance bounds. Statistical reproducibility is acceptable where bitwise reproducibility is not achievable.
CPU reproducibility uses dependency pinning, seed recording, data versioning, and containerised environments. GPU training requires statistical reproducibility through multiple runs with confidence intervals.
The specification documents which level of reproducibility is achieved, the mechanisms used, any known limitations, and for GPU models, multiple-run results with confidence intervals as Module 5 evidence.
A structured experiment log spreadsheet with columns for experiment ID, hyperparameters, data version, code commit, random seed, metrics, and outcome determination serves as a manual alternative.