Third-party models create a version control gap because the organisation does not control versioning, updates, or change notifications. This page covers version pinning for API-accessed models, internal storage of downloaded models, sentinel dataset testing for detecting silent changes, embedding model monitoring, and prompt governance as a compliance requirement.
Many high-risk AI systems incorporate models provided by third parties: foundation models accessed via API, pre-trained models downloaded from model hubs, or cloud-hosted inference services where the provider controls the model lifecycle. These models present a version control challenge because the organisation does not control the versioning, update cadence, or change notification for the third-party component.
For API-accessed models from providers such as OpenAI, Anthropic, Cohere, or Google, the primary control is version pinning. Most providers offer versioned model endpoints. The system configuration should reference the specific version string, and any version change should flow through the standard deployment change process with full re-evaluation against declared thresholds. A key risk is provider-initiated deprecation: the provider announces that a version will be retired and the organisation must migrate. The AISDP must document the migration process, including the trigger, the evaluation required, and the timeline.
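As a sketch, pinning might look like the following; the request shape and helper are illustrative, and the dated model name is an example of a versioned endpoint identifier rather than a recommendation:

```python
# Example: always name the pinned, dated version in the request payload.
PINNED_MODEL = "gpt-4o-2024-08-06"  # illustrative versioned endpoint string

def build_request(prompt: str) -> dict:
    """Build an API request that always names the pinned version.

    Referencing a floating alias such as "gpt-4o" (no date suffix) would let
    the provider swap the underlying model without a change-control event.
    """
    return {
        "model": PINNED_MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }

request = build_request("Summarise the claim.")
assert request["model"] == PINNED_MODEL
```

Changing `PINNED_MODEL` then becomes a reviewable configuration change that can be routed through the standard deployment process.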
Where the API does not support version pinning and the provider serves the latest model version, the organisation faces an ongoing compliance risk: the system's behaviour may change without the organisation's knowledge or consent. Compensating controls include regular behavioural testing against a sentinel dataset, output distribution monitoring to detect response pattern shifts, and contractual provisions requiring advance notification of model changes.
For downloaded pre-trained models from Hugging Face, TensorFlow Hub, or PyTorch Hub, the model is captured in the model registry at download with a cryptographic hash verifying integrity. The engineering team stores the artefact internally and subsequent loads reference the internal copy, never the external hub, preventing silent changes if the hub updates the model under the same identifier.
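A minimal sketch of the integrity check, using SHA-256 and stand-in bytes for the downloaded artefact:

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Hash the artefact bytes; in practice, stream the file in chunks."""
    return hashlib.sha256(data).hexdigest()

def verify_artefact(data: bytes, registry_hash: str) -> bool:
    """Compare the internally stored artefact against the hash recorded
    in the model registry at download time."""
    return sha256_of(data) == registry_hash

artefact = b"\x00model-weights\x00"   # stand-in for the downloaded file
recorded = sha256_of(artefact)        # captured in the registry at download
assert verify_artefact(artefact, recorded)
assert not verify_artefact(artefact + b"x", recorded)  # any change is detected
```

Running this check on every load of the internal copy means a silently updated artefact fails closed rather than entering inference unnoticed.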
For cloud inference services such as SageMaker, Azure ML, or Vertex AI endpoints, the organisation documents which model version is deployed to which endpoint, how configuration affects inference behaviour, and how updates are controlled and logged. Endpoint configuration is managed as infrastructure-as-code subject to the same governance as other infrastructure.
Sentinel testing is the compensating control that detects behavioural changes in third-party models that the organisation cannot directly monitor. A fixed set of inputs with known expected outputs is submitted to the third-party model on a scheduled basis. Changes in the outputs indicate that the model's behaviour has shifted, even if the provider has not issued a change notification.
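A minimal sentinel run can be sketched as follows; the sentinel IDs, expected outputs, and stand-in model are all hypothetical:

```python
# Fixed inputs with known expected outputs, submitted on a schedule.
SENTINELS = {
    "loan-edge-01": "declined",
    "loan-edge-02": "approved",
}

def run_sentinels(model_fn, expected: dict) -> list:
    """Return the sentinel IDs whose output no longer matches the baseline."""
    return [sid for sid, want in expected.items() if model_fn(sid) != want]

# A stand-in model whose behaviour has silently shifted on one input.
def shifted_model(sentinel_id: str) -> str:
    return "declined"

drifted = run_sentinels(shifted_model, SENTINELS)
# Any non-empty result indicates a behavioural shift and should trigger
# the organisation's re-evaluation process.
```

In practice the scheduled job would record each run's results so the first divergence, not just the current state, is auditable.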
For embedding models in RAG systems, sentinel tests compare the embeddings produced for fixed documents against a stored baseline. Shifts in the embedding space indicate a model update that may affect retrieval quality and fairness. The Technical SME defines the sentinel dataset to cover the system's key use cases and edge cases, and monitors the similarity between current and baseline outputs using cosine distance or equivalent measures.
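The baseline comparison can be sketched with plain cosine similarity; the vectors and the 0.99 threshold are illustrative and would be tuned per system:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def embedding_drifted(baseline, current, threshold=0.99):
    """Flag drift when similarity to the stored baseline embedding of a
    fixed sentinel document falls below the threshold."""
    return cosine_similarity(baseline, current) < threshold

baseline  = [0.12, 0.80, -0.33]   # stored at baseline time
unchanged = [0.12, 0.80, -0.33]   # same model, same document
shifted   = [0.40, 0.10,  0.70]   # output after a silent model update
assert not embedding_drifted(baseline, unchanged)
assert embedding_drifted(baseline, shifted)
```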
Sentinel testing cannot detect all types of change: subtle shifts in probability distributions or changes in edge-case behaviour may fall below the sentinel dataset's coverage. The organisation should treat sentinel testing as a detection mechanism, not a guarantee, and supplement it with output distribution monitoring and contractual notification requirements.
Prompt governance applies version control discipline to LLM prompts. System prompts, few-shot examples, prompt chains, and tool definitions are stored as version-controlled text files in the code repository. Each prompt version is linked to the evaluation results that validated it. Changes to prompts follow the same review and approval process as code changes, with pull requests requiring designated reviewer approval.
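One simple way to make prompt versions content-addressed is to hash the prompt text, so any edit produces a new identifier that can be reviewed and linked to evaluation results; the prompts and evaluation run ID below are hypothetical:

```python
import hashlib

def prompt_version(prompt_text: str) -> str:
    """Content-address a prompt: any edit yields a new version identifier."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]

v1 = prompt_version("You are a claims triage assistant. Respond in JSON.")
v2 = prompt_version("You are a claims triage assistant. Respond in JSON. Be concise.")
assert v1 != v2  # even a small wording change is a distinct, reviewable version

# Each prompt version is linked to the evaluation that validated it
# (the run identifier is hypothetical).
evaluation_log = {v1: "eval-run-0042"}
```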
Prompt changes can alter system behaviour as profoundly as model retraining, yet they are often treated as informal configuration. The version control system must treat them with the same rigour as model artefacts. Prompt registries built on emerging tools such as LangSmith, Humanloop, or PromptLayer provide additional metadata tracking, including a prompt's performance metrics, A/B test results, and deployment history.
For knowledge bases in RAG systems, document additions, updates, and removals must be tracked. DVC or LakeFS provide document snapshots. Vector database versioning tracks embedding changes. A knowledge base change that alters the information available to the model alters outputs and may trigger the substantial modification assessment.
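A snapshot identifier over the document set can be sketched by hashing IDs and contents in a stable order, so any addition, update, or removal yields a new identifier; the document IDs and contents below are hypothetical:

```python
import hashlib

def kb_snapshot_id(documents: dict) -> str:
    """Hash document IDs and contents in sorted order, so the identifier
    changes on any addition, update, or removal."""
    h = hashlib.sha256()
    for doc_id in sorted(documents):
        h.update(doc_id.encode("utf-8"))
        h.update(hashlib.sha256(documents[doc_id].encode("utf-8")).digest())
    return h.hexdigest()[:12]

before = {"policy-001": "Coverage applies to...",
          "policy-002": "Exclusions include..."}
after = dict(before, **{"policy-003": "Amendment effective..."})
assert kb_snapshot_id(before) != kb_snapshot_id(after)
```

Dedicated tools such as DVC or LakeFS provide the same guarantee with full history and rollback; the sketch only illustrates the principle of a content-derived snapshot identifier.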
When a provider deprecates a pinned model version, the organisation must run the full validation gate suite against the replacement version before the deprecated version is withdrawn. The AISDP documents the migration trigger, the evaluation required, and the completion timeline.
Sentinel tests should run daily or weekly, depending on the model's change risk. The tests cover critical decision paths: inputs near the decision boundary, inputs from underrepresented subgroups, and inputs exercising known model limitations.
A change in embedding behaviour alters retrieval results for every query, even though the primary model, knowledge base, and code remain unchanged. This makes embedding models a hidden dependency that requires independent sentinel monitoring.
The composite version identifier includes the prompt version alongside model and configuration versions. Module 3 records the current prompt content and change history. Module 10 records the governance process. The change log must be reviewable by competent authorities.
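The composite identifier can be sketched as a simple concatenation; the field names and version values are illustrative, not a prescribed scheme:

```python
# Sketch: a composite version identifier over all behaviour-affecting
# components, so a prompt edit produces a new system version.
def composite_version(model_v: str, prompt_v: str,
                      config_v: str, kb_v: str) -> str:
    return f"model={model_v}|prompt={prompt_v}|config={config_v}|kb={kb_v}"

current = composite_version("gpt-4o-2024-08-06", "a3f9c1", "cfg-7", "kb-19")
# A change to any single component yields a new identifier, so the change
# log reviewable by competent authorities captures prompt edits too.
assert composite_version("gpt-4o-2024-08-06", "b7e210", "cfg-7", "kb-19") != current
```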
Prompt changes can alter system outputs as materially as model weight changes. They must be version-controlled, reviewed, tested against validation gates, and included in the composite version identifier.
Retrieval stability is tested with sentinel queries with known-good retrieval results: top-k document identifiers and rank ordering are compared against the baseline at regular intervals. Rank stability is measured with Kendall's tau or nDCG, with tolerance for minor numerical variation.
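Kendall's tau over two top-k rankings can be computed directly; the document identifiers below are hypothetical:

```python
from itertools import combinations

def kendall_tau(baseline: list, current: list) -> float:
    """Kendall's tau over documents present in both rankings:
    +1.0 means identical ordering, -1.0 means fully reversed."""
    common = [d for d in baseline if d in current]
    pos = {d: current.index(d) for d in common}
    concordant = discordant = 0
    # `common` preserves baseline order, so a pair is concordant when the
    # current ranking keeps the baseline's relative order.
    for a, b in combinations(common, 2):
        if pos[a] < pos[b]:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / pairs if pairs else 1.0

baseline = ["doc-4", "doc-9", "doc-1", "doc-7"]
assert kendall_tau(baseline, baseline) == 1.0
assert kendall_tau(baseline, list(reversed(baseline))) == -1.0
```

A tau below a declared threshold on any sentinel query would be treated like any other sentinel divergence: logged, investigated, and escalated if the shift affects retrieval quality.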