Article 11 and Annex IV require providers of high-risk AI systems to document how models were developed and selected. Experiment tracking and reproducibility together provide the evidence that a competent authority or notified body needs to verify the claims in the AISDP.
Model development involves experimentation across different architectures, hyperparameters, feature sets, and training strategies. Experiment tracking captures this exploration systematically and connects it to the formal version control system. When a conformity assessment requires the organisation to explain why a particular model architecture and hyperparameter configuration was chosen, experiment tracking records provide the evidence base. They demonstrate that alternatives were evaluated, that the chosen configuration was selected on merit, and that the selection criteria aligned with the compliance requirements documented in the AISDP.
Without experiment tracking, the Technical SME must reconstruct this narrative from scattered logs and team memory. That approach is fragile and error-prone, making it inadequate for regulatory scrutiny.
Experiment tracking tools such as MLflow Tracking, Weights & Biases, Neptune, and Comet record the parameters, metrics, and artefacts for each training run. For compliance purposes, the tracking system must capture the full specification of each run: code version, data version, hyperparameters, random seed, and compute environment. It must also record the training metrics at each epoch or iteration, the final evaluation metrics on the holdout test set, and the resulting model artefact with its content hash.
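As a concrete illustration of what "full specification of each run" means, the sketch below captures those fields as a JSON record using only the standard library. The function name and record layout are illustrative, not the API of MLflow or any other tracking tool; a real setup would log the same fields through the tool's own client.

```python
import hashlib
import json
import platform
import sys
from pathlib import Path

def record_run(run_dir: Path, hyperparams: dict, metrics: dict,
               code_version: str, data_version: str, seed: int,
               artefact: Path) -> dict:
    """Illustrative sketch: persist the full specification of one
    training run so it can later be retrieved and re-executed."""
    record = {
        "code_version": code_version,        # e.g. a git commit SHA
        "data_version": data_version,        # e.g. a DVC data hash
        "hyperparameters": hyperparams,
        "random_seed": seed,
        "environment": {                     # minimal compute-environment snapshot
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
        "final_metrics": metrics,            # holdout evaluation results
        # Content hash ties the record to the exact model artefact produced
        "artefact_sha256": hashlib.sha256(artefact.read_bytes()).hexdigest(),
    }
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "run_record.json").write_text(json.dumps(record, indent=2))
    return record
```

The content hash is the load-bearing field: it lets an assessor confirm that the artefact referenced in the AISDP is byte-for-byte the one this run produced.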
MLflow Tracking is the most widely adopted open-source option. It logs parameters, metrics, and artefacts per run and provides a UI for run comparison and visualisation. Weights & Biases adds collaboration features such as shared dashboards and run notes, along with richer visualisation including learning curves and parameter importance plots. CI/CD Pipelines for High-Risk AI covers the broader pipeline context in which experiment tracking operates.
The compliance value becomes apparent during conformity assessment. An assessor reviewing the AISDP's Module 5 needs to understand how the deployed model was selected from among the candidate models that were trained. The experiment tracking system provides this narrative: the set of experiments that were run, the metrics comparison that led to the selection, and the specific run that produced the deployed model.
A competent authority or notified body may require the organisation to reproduce a specific training run to verify the claims in the AISDP. Reproducibility requires that the exact code version, data version, and configuration are retrievable; that the compute environment can be recreated; that the random seed is recorded and can be set to produce the same initialisation; and that the training framework's version is pinned, since framework updates can change numerical behaviour even with the same seed.
Full bitwise reproducibility is not always achievable, particularly for GPU-accelerated training where non-deterministic operations such as atomic floating-point additions are common. The AISDP must document the level of reproducibility the system achieves, the factors that may cause variation between runs, and the tolerance bounds within which reproduced results are considered consistent. Where bitwise reproducibility is not achievable, the Technical SME demonstrates and documents statistical reproducibility: results within defined confidence intervals across multiple runs.
For deterministic algorithms on CPU, reproducibility is achievable through dependency pinning, random seed recording, and data versioning. Poetry or Conda lock files pin every dependency, including transitive dependencies, to exact versions. The random seed is logged as a hyperparameter, data is versioned via DVC, and the training environment is captured as a Docker image. Given these four elements, the training run can be re-executed identically.
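The seed's role in this can be shown with a minimal stdlib sketch. The `seeded_run` function is a hypothetical stand-in for a training run; in practice the same principle applies to the framework's seeding calls (NumPy, PyTorch, etc.), alongside the pinned dependencies, versioned data, and container image.

```python
import random

def seeded_run(seed: int, n: int = 5) -> list[float]:
    """Hypothetical stand-in for a training run: given the recorded seed,
    the pseudo-random draws (weight init, data shuffling) repeat exactly."""
    rng = random.Random(seed)          # isolate the RNG so global state cannot leak in
    return [rng.random() for _ in range(n)]

# Re-executing with the logged seed reproduces the run exactly
first = seeded_run(42)
second = seeded_run(42)
assert first == second
```

Using an isolated `random.Random(seed)` instance rather than the module-level RNG matters: any library call that touches global RNG state between runs would otherwise silently break reproducibility.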
For GPU training, bitwise reproducibility is often unachievable because GPU operations use non-deterministic algorithms for performance. Parallel floating-point reductions process elements in different orders across executions. NVIDIA CUDA provides a deterministic mode (torch.use_deterministic_algorithms(True) in PyTorch) that forces deterministic operations, but at a significant performance penalty and with some operations unsupported.
The practical alternative is statistical reproducibility: run the training specification multiple times (three to five runs is standard), compute the confidence interval of the evaluation metrics across runs, and declare the model's performance as the mean plus or minus the interval. This approach acknowledges the inherent stochasticity while providing a defensible performance range. Validation and Testing Frameworks covers how confidence intervals feed into validation gates. The gate should compare the lower bound of the interval against the threshold, ensuring that the model meets the declared standard even under stochastic variation.
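The multiple-run procedure and the lower-bound gate can be sketched with the standard library alone. This is an illustrative implementation, not from any particular validation framework; it uses a normal approximation for the interval, where a t-interval would be stricter for the small run counts (three to five) typical here.

```python
import math
import statistics

def metric_interval(scores: list[float], z: float = 1.96) -> tuple[float, float, float]:
    """Mean and approximate 95% confidence interval of an evaluation
    metric across repeated runs of the same training specification."""
    mean = statistics.mean(scores)
    half = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - half, mean + half

def passes_gate(scores: list[float], threshold: float) -> bool:
    """Validation gate: the LOWER bound of the interval must clear the
    declared threshold, so the claim holds under stochastic variation."""
    _, lower, _ = metric_interval(scores)
    return lower >= threshold

# Hypothetical holdout AUC from four re-runs of one training specification
aucs = [0.912, 0.915, 0.909, 0.914]
```

Gating on the lower bound rather than the mean is the conservative choice the text describes: a model whose mean clears the threshold but whose interval dips below it has not demonstrated that the declared standard holds under run-to-run variation.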
For AISDP purposes, the reproducibility specification should document which level of reproducibility is achieved (bitwise or statistical), the mechanisms used to achieve it, and any known limitations. For GPU-trained models, the Technical SME retains multiple-run results and confidence intervals as Module 5 evidence. Technical Documentation and Evidence details how this evidence integrates into the broader documentation structure.
The specification bridges experiment tracking and reproducibility, which are two sides of the same compliance requirement: the ability to reconstruct how a model was produced. Experiment tracking records what happened during each training run; reproducibility ensures that the same run specification would produce the same or statistically equivalent result if re-executed.
Experiment tracking can be done manually through a structured experiment log spreadsheet. Columns should include experiment ID, date, hypothesis, all hyperparameters, data version, code commit, random seed, training duration, all declared evaluation metrics, notes, and outcome determination (selected for further development, rejected, or baseline). Model artefacts are stored with the experiment ID as the directory name, and the log is reviewed during model selection to document why the deployed model was chosen.
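For teams that prefer a machine-readable log over a spreadsheet, the same columns can be maintained as a CSV file with a few lines of standard-library Python. The column names and function are illustrative, mirroring the structure described above.

```python
import csv
from pathlib import Path

# Columns mirroring the manual experiment log described in the text
LOG_COLUMNS = [
    "experiment_id", "date", "hypothesis", "hyperparameters",
    "data_version", "code_commit", "random_seed", "training_duration",
    "evaluation_metrics", "notes", "outcome",
]

def append_log_entry(log_path: Path, entry: dict) -> None:
    """Append one training run to the experiment log CSV,
    writing the header row on first use."""
    new_file = not log_path.exists()
    with log_path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_COLUMNS)
        if new_file:
            writer.writeheader()
        writer.writerow(entry)
```

A plain-text log like this also diffs cleanly in version control, which keeps the selection history alongside the code it describes.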
This manual approach loses visualisation features such as learning curves and parameter importance plots, easy run comparison, and automatic metric logging. It is adequate for systems with infrequent training runs (fewer than ten per quarter). For active experimentation with dozens or hundreds of runs, tools such as MLflow, ClearML, or Neptune are needed; all have free tiers.
MLflow Tracking is the most widely adopted open-source option. Weights & Biases, Neptune, and Comet are also used. All record parameters, metrics, and artefacts per run. MLflow, ClearML, and Neptune offer free tiers suitable for initial compliance work.
Does the AI Act require bitwise reproducibility? No. The regulation requires documented reproducibility, not necessarily bitwise exactness. Where GPU non-determinism prevents identical results, statistical reproducibility with confidence intervals across multiple runs is an accepted approach, provided the AISDP documents the level achieved and the limitations.
Can experiment tracking be done without dedicated tooling? Yes, for systems with infrequent training runs (fewer than ten per quarter). A structured spreadsheet with experiment ID, hyperparameters, data version, code commit, random seed, metrics, and outcome determination is adequate. Active experimentation with many runs requires dedicated tooling.
The AISDP must document the level of reproducibility achieved, factors causing variation, and tolerance bounds. Statistical reproducibility is acceptable where bitwise reproducibility is not achievable.
CPU reproducibility uses dependency pinning, seed recording, data versioning, and containerised environments. GPU training requires statistical reproducibility through multiple runs with confidence intervals.
The specification documents which level of reproducibility is achieved, the mechanisms used, any known limitations, and for GPU models, multiple-run results with confidence intervals as Module 5 evidence.
A structured experiment log spreadsheet with columns for experiment ID, hyperparameters, data version, code commit, random seed, metrics, and outcome determination serves as a manual alternative.