Third-party models create a version control gap because the organisation does not control versioning, updates, or change notifications. This page covers version pinning for API-accessed models, internal storage of downloaded models, sentinel dataset testing for detecting silent changes, embedding model monitoring, and prompt governance as a compliance requirement.
Many high-risk AI systems incorporate models provided by third parties: foundation models accessed via API, pre-trained models downloaded from model hubs, or cloud-hosted inference services where the provider controls the model lifecycle. These models present a version control challenge because the organisation does not control the versioning, update cadence, or change notification for the third-party component.
For API-accessed models from providers such as OpenAI, Anthropic, Cohere, or Google, the primary control is version pinning. Most providers offer versioned model endpoints. The system configuration should reference the specific version string, and any version change should flow through the standard deployment change process with full re-evaluation against declared thresholds. A key risk is provider-initiated deprecation: the provider announces that a version will be retired and the organisation must migrate. The AISDP must document the migration process, including the trigger, the evaluation required, and the timeline.
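As a sketch, pinning might look like the following; the request shape and helper are illustrative, and the dated model name is an example of a versioned endpoint identifier rather than a recommendation:

```python
# Example: always name the pinned, dated version in the request payload.
PINNED_MODEL = "gpt-4o-2024-08-06"  # illustrative versioned endpoint string

def build_request(prompt: str) -> dict:
    """Build an API request that always names the pinned version.

    Referencing a floating alias such as "gpt-4o" (no date suffix) would let
    the provider swap the underlying model without a change-control event.
    """
    return {
        "model": PINNED_MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }

request = build_request("Summarise the claim.")
assert request["model"] == PINNED_MODEL
```

Changing `PINNED_MODEL` then becomes a reviewable configuration change that can be routed through the standard deployment process.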
Where the API does not support version pinning and the provider serves the latest model version, the organisation faces an ongoing compliance risk: the system's behaviour may change without the organisation's knowledge or consent. Compensating controls include regular behavioural testing against a sentinel dataset, output distribution monitoring to detect response pattern shifts, and contractual provisions requiring advance notification of model changes.
For downloaded pre-trained models from Hugging Face, TensorFlow Hub, or PyTorch Hub, the model is captured in the model registry at download with a cryptographic hash verifying integrity. The engineering team stores the artefact internally and subsequent loads reference the internal copy, never the external hub, preventing silent changes if the hub updates the model under the same identifier.
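A minimal sketch of the integrity check, using SHA-256 and stand-in bytes for the downloaded artefact:

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Hash the artefact bytes; in practice, stream the file in chunks."""
    return hashlib.sha256(data).hexdigest()

def verify_artefact(data: bytes, registry_hash: str) -> bool:
    """Compare the internally stored artefact against the hash recorded
    in the model registry at download time."""
    return sha256_of(data) == registry_hash

artefact = b"\x00model-weights\x00"   # stand-in for the downloaded file
recorded = sha256_of(artefact)        # captured in the registry at download
assert verify_artefact(artefact, recorded)
assert not verify_artefact(artefact + b"x", recorded)  # any change is detected
```

Running this check on every load of the internal copy means a silently updated artefact fails closed rather than entering inference unnoticed.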
For cloud inference services such as SageMaker, Azure ML, or Vertex AI endpoints, the organisation documents which model version is deployed to which endpoint, how configuration affects inference behaviour, and how updates are controlled and logged. Endpoint configuration is managed as infrastructure-as-code subject to the same governance as other infrastructure.
Sentinel testing is the compensating control that detects behavioural changes in third-party models that the organisation cannot directly monitor. A fixed set of inputs with known expected outputs is submitted to the third-party model on a scheduled basis. Changes in the outputs indicate that the model's behaviour has shifted, even if the provider has not issued a change notification.
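A minimal sentinel run can be sketched as follows; the sentinel IDs, expected outputs, and stand-in model are all hypothetical:

```python
# Fixed inputs with known expected outputs, submitted on a schedule.
SENTINELS = {
    "loan-edge-01": "declined",
    "loan-edge-02": "approved",
}

def run_sentinels(model_fn, expected: dict) -> list:
    """Return the sentinel IDs whose output no longer matches the baseline."""
    return [sid for sid, want in expected.items() if model_fn(sid) != want]

# A stand-in model whose behaviour has silently shifted on one input.
def shifted_model(sentinel_id: str) -> str:
    return "declined"

drifted = run_sentinels(shifted_model, SENTINELS)
# Any non-empty result indicates a behavioural shift and should trigger
# the organisation's re-evaluation process.
```

In practice the scheduled job would record each run's results so the first divergence, not just the current state, is auditable.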
For embedding models in RAG systems, sentinel tests compare the embeddings produced for fixed documents against a stored baseline. Shifts in the embedding space indicate a model update that may affect retrieval quality and fairness. The Technical SME defines the sentinel dataset to cover the system's key use cases and edge cases, and monitors the similarity between current and baseline outputs using cosine distance or equivalent measures.
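The baseline comparison can be sketched with plain cosine similarity; the vectors and the 0.99 threshold are illustrative and would be tuned per system:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def embedding_drifted(baseline, current, threshold=0.99):
    """Flag drift when similarity to the stored baseline embedding of a
    fixed sentinel document falls below the threshold."""
    return cosine_similarity(baseline, current) < threshold

baseline  = [0.12, 0.80, -0.33]   # stored at baseline time
unchanged = [0.12, 0.80, -0.33]   # same model, same document
shifted   = [0.40, 0.10,  0.70]   # output after a silent model update
assert not embedding_drifted(baseline, unchanged)
assert embedding_drifted(baseline, shifted)
```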
Sentinel testing cannot detect all types of change: subtle shifts in probability distributions or changes in edge-case behaviour may fall below the sentinel dataset's coverage. The organisation should treat sentinel testing as a detection mechanism, not a guarantee, and supplement it with output distribution monitoring and contractual notification requirements.
Prompt governance applies version control discipline to LLM prompts. System prompts, few-shot examples, prompt chains, and tool definitions are stored as version-controlled text files in the code repository. Each prompt version is linked to the evaluation results that validated it. Changes to prompts follow the same review and approval process as code changes, with pull requests requiring designated reviewer approval.
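One simple way to make prompt versions content-addressed is to hash the prompt text, so any edit produces a new identifier that can be reviewed and linked to evaluation results; the prompts and evaluation run ID below are hypothetical:

```python
import hashlib

def prompt_version(prompt_text: str) -> str:
    """Content-address a prompt: any edit yields a new version identifier."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]

v1 = prompt_version("You are a claims triage assistant. Respond in JSON.")
v2 = prompt_version("You are a claims triage assistant. Respond in JSON. Be concise.")
assert v1 != v2  # even a small wording change is a distinct, reviewable version

# Each prompt version is linked to the evaluation that validated it
# (the run identifier is hypothetical).
evaluation_log = {v1: "eval-run-0042"}
```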
Prompt changes can alter system behaviour as profoundly as model retraining, yet they are often treated as informal configuration. The version control system must treat them with the same rigour as model artefacts. Prompt registries built on emerging tools such as LangSmith, Humanloop, or PromptLayer provide additional metadata tracking, including a prompt's performance metrics, A/B test results, and deployment history.
For knowledge bases in RAG systems, document additions, updates, and removals must be tracked. DVC or LakeFS provide document snapshots. Vector database versioning tracks embedding changes. A knowledge base change that alters the information available to the model alters outputs and may trigger the substantial modification assessment.
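A snapshot identifier over the document set can be sketched by hashing IDs and contents in a stable order, so any addition, update, or removal yields a new identifier; the document IDs and contents below are hypothetical:

```python
import hashlib

def kb_snapshot_id(documents: dict) -> str:
    """Hash document IDs and contents in sorted order, so the identifier
    changes on any addition, update, or removal."""
    h = hashlib.sha256()
    for doc_id in sorted(documents):
        h.update(doc_id.encode("utf-8"))
        h.update(hashlib.sha256(documents[doc_id].encode("utf-8")).digest())
    return h.hexdigest()[:12]

before = {"policy-001": "Coverage applies to...",
          "policy-002": "Exclusions include..."}
after = dict(before, **{"policy-003": "Amendment effective..."})
assert kb_snapshot_id(before) != kb_snapshot_id(after)
```

Dedicated tools such as DVC or LakeFS provide the same guarantee with full history and rollback; the sketch only illustrates the principle of a content-derived snapshot identifier.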
When a provider deprecates a pinned model version, the organisation must run the full validation gate suite against the replacement version before the deprecated version is withdrawn. The AISDP documents the migration trigger, the evaluation required, and the completion timeline.
Sentinel tests should run daily or weekly, depending on the model's change risk. The tests cover critical decision paths: inputs near the decision boundary, inputs from underrepresented subgroups, and inputs exercising known model limitations.
A change in embedding behaviour alters retrieval results for every query, even though the primary model, knowledge base, and code remain unchanged. This makes embedding models a hidden dependency that requires independent sentinel monitoring.
The composite version identifier includes the prompt version alongside model and configuration versions. Module 3 records the current prompt content and change history. Module 10 records the governance process. The change log must be reviewable by competent authorities.
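The composite identifier can be sketched as a simple concatenation; the field names and version values are illustrative, not a prescribed scheme:

```python
# Sketch: a composite version identifier over all behaviour-affecting
# components, so a prompt edit produces a new system version.
def composite_version(model_v: str, prompt_v: str,
                      config_v: str, kb_v: str) -> str:
    return f"model={model_v}|prompt={prompt_v}|config={config_v}|kb={kb_v}"

current = composite_version("gpt-4o-2024-08-06", "a3f9c1", "cfg-7", "kb-19")
# A change to any single component yields a new identifier, so the change
# log reviewable by competent authorities captures prompt edits too.
assert composite_version("gpt-4o-2024-08-06", "b7e210", "cfg-7", "kb-19") != current
```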
Prompt changes can alter system outputs as materially as model weight changes. They must be version-controlled, reviewed, tested against validation gates, and included in the composite version identifier.
Retrieval stability is tested with sentinel queries with known-good retrieval results: top-k document identifiers and rank ordering are compared against the baseline at regular intervals. Rank stability is measured with Kendall's tau or nDCG, with tolerance for minor numerical variation.
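Kendall's tau over two top-k rankings can be computed directly; the document identifiers below are hypothetical:

```python
from itertools import combinations

def kendall_tau(baseline: list, current: list) -> float:
    """Kendall's tau over documents present in both rankings:
    +1.0 means identical ordering, -1.0 means fully reversed."""
    common = [d for d in baseline if d in current]
    pos = {d: current.index(d) for d in common}
    concordant = discordant = 0
    # `common` preserves baseline order, so a pair is concordant when the
    # current ranking keeps the baseline's relative order.
    for a, b in combinations(common, 2):
        if pos[a] < pos[b]:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / pairs if pairs else 1.0

baseline = ["doc-4", "doc-9", "doc-1", "doc-7"]
assert kendall_tau(baseline, baseline) == 1.0
assert kendall_tau(baseline, list(reversed(baseline))) == -1.0
```

A tau below a declared threshold on any sentinel query would be treated like any other sentinel divergence: logged, investigated, and escalated if the shift affects retrieval quality.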