MediAssist AI is a high-risk clinical decision support system built on retrieval-augmented generation with a fine-tuned foundation model, introducing compliance challenges that go far beyond classical ML systems. This worked example demonstrates AISDP preparation for an LLM-based system where GPAI provider information gaps, knowledge base governance, grounding verification, and the clinical safety dimension transform every compliance requirement into a patient safety control.
The system assists general practitioners by providing evidence-based diagnostic suggestions and treatment recommendations, grounded in clinical guidelines and peer-reviewed literature, serving approximately 2 million consultations per year.
Provided by Thornfield Health Technologies Ltd (UK), the system is classified under Annex III Area 1(b) as a safety component in healthcare. It uses a fine-tuned version of Claude 3.5 Sonnet as the generation model, with a knowledge base of approximately 12,000 clinical guidelines, 45,000 peer-reviewed abstracts, and 3,200 drug interaction records. Because Thornfield fine-tuned a GPAI model for a high-risk use case, Article 25(1)(b) is engaged and Thornfield bears full provider obligations under Article 16.
The contrast with the TalentLens Pro example is instructive. Where TalentLens Pro's challenges centre on data governance, fairness, and human oversight, MediAssist AI adds GPAI provider information asymmetry, knowledge base governance, grounding verification, prompt governance, fine-tuning provider boundaries, multi-model version control, and the clinical safety dimension that makes every compliance failure a potential patient safety event.
Classification confirmed MediAssist AI as high-risk under both the AI Act and the Medical Devices Regulation, creating parallel conformity assessment obligations. The system meets the AI system definition under Article 3(1): it is machine-based, operates with autonomy (generates suggestions without per-query pre-approval), and produces outputs influencing clinical decisions affecting patients.
All eight Article 5 prohibitions were cleared. Pathway A is engaged because the system is a safety component of a healthcare service, classified as Software as a Medical Device (SaMD) Class IIa under Regulation (EU) 2017/745. This triggers notified body involvement for the MDR conformity assessment, though the AI Act assessment follows the Annex VI internal procedure. Pathway B confirms Annex III Area 1(b).
The fine-tuning of Claude 3.5 Sonnet changed the model's intended purpose from general-purpose to clinical advice, engaging Article 25(1)(b). Anthropic remains the GPAI model provider under Articles 51 through 56. The structured Article 25(3) information request to Anthropic covered six categories: training data governance, model architecture, safety evaluation, versioning, data handling, and systemic risk. Every category produced partial disclosures, confirming that the GPAI information gap is real and consequential for downstream providers.
Compensating controls replace the compliance evidence that Anthropic cannot fully provide, forming the primary evidence for GPAI-dependent AISDP modules rather than optional supplementary measures. Without these controls, Modules 3, 4, and 6 would have material gaps.
Behavioural proxy testing uses a sentinel dataset of 2,400 clinical vignettes across 14 demographic subgroups and 6 clinical specialties. This compensates for the absence of training data representativeness information from Anthropic. Structured behavioural characterisation systematically evaluates hallucination rate, refusal patterns, and clinical accuracy across specialties, compensating for the limited failure mode analysis available from the GPAI provider.
A red-teaming programme tested four attack categories. Prompt injection via clinical notes was detected by the input sanitisation layer in 89% of cases, with the 11% failure rate logged as risk R-007. Knowledge base poisoning was rejected by the provenance verification step. Hallucination provocation using out-of-coverage scenarios found the system hallucinated in 23% of cases, with the grounding verification layer catching 78% before delivery to the GP.
Daily sentinel monitoring evaluates against a 500-vignette dataset, detecting GPAI model behaviour changes with a Population Stability Index threshold of 0.10. Version pinning and contractual notification (90 days for deprecation) provide additional protection. The six-category GPAI disclosure register (AISDP-MED-001-GPAI-REG-v1.0) records all received information and identified gaps alongside the compensating controls that address each gap.
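The daily PSI drift check can be sketched as follows. This is a minimal illustration, assuming per-vignette scores are compared against a stored baseline distribution; the `psi` helper and the score distributions are illustrative, not part of the MediAssist codebase.

```python
import numpy as np

def psi(expected, observed, bins=10):
    """Population Stability Index between a baseline and a current
    distribution of per-vignette scores. Bin edges are derived from
    the baseline so both distributions share the same grid."""
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range scores
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    o_frac = np.histogram(observed, bins=edges)[0] / len(observed)
    eps = 1e-6  # avoid log(0) in empty bins
    e_frac, o_frac = e_frac + eps, o_frac + eps
    return float(np.sum((o_frac - e_frac) * np.log(o_frac / e_frac)))

# Daily check against the 500-vignette sentinel baseline (synthetic data).
rng = np.random.default_rng(0)
baseline = rng.normal(0.9, 0.05, 500)   # baseline sentinel scores
today = rng.normal(0.9, 0.05, 500)      # stable behaviour
drifted = rng.normal(0.8, 0.05, 500)    # shifted model behaviour

assert psi(baseline, today) < 0.10       # below threshold: no escalation
assert psi(baseline, drifted) >= 0.10    # breach: GPAI drift escalation
```

A PSI at or above 0.10 on the sentinel set would indicate that the GPAI model's behaviour has shifted relative to the pinned version's baseline.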
Six model components form a cascade pipeline where a change to any upstream component triggers re-evaluation of all downstream components, making cascade-aware governance essential. The multi-model component registry tracks each component's type, provider, version, and AISDP module allocation.
| Component | Type | Provider | Version |
|---|---|---|---|
| Clinical NER | Fine-tuned SciBERT | Thornfield (internal) | 1.2.0 |
| Embedding model | voyage-large-2 | Voyage AI (third-party) | 2.0.3 |
| Reranker | Cohere Rerank v3 | Cohere (third-party) | 3.0.1 |
| Generator (LLM) | Fine-tuned Claude 3.5 Sonnet | Anthropic / Thornfield | ft-med-1.0.0 |
| Drug interaction checker | Rule-based (BNF sourced) | Thornfield (internal) | 4.1.0 |
| Grounding verifier | DeBERTa-v3 NLI model | Microsoft (open-source) | 1.0.0 |
Two-stage grounding verification transforms unreliable LLM output into evidence-backed clinical recommendations, serving as a compliance control rather than a product feature. Without it, the system cannot demonstrate Article 15 accuracy compliance because natural language outputs may or may not be grounded in evidence.
Stage 1 extracts citations. The system prompt requires the LLM to cite specific guideline references for every recommendation. The citation extractor parses the output and identifies cited documents. Stage 2 verifies entailment. A DeBERTa-v3 NLI model evaluates whether each cited document entails the recommendation it supports. If any recommendation lacks a citation, or if the highest entailment score falls below 0.85, the recommendation is flagged as ungrounded.
Ungrounded recommendations are never delivered to the GP. Instead, the system displays a message directing the GP to consult the relevant NICE guideline directly. The ungrounded query is logged for review by the clinical governance team, who assess whether the knowledge base has a coverage gap requiring remediation.
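A minimal sketch of the two-stage gate, assuming inline citation markers like `[NICE-NG28]` and treating the NLI model as an injectable scoring function. The citation format, `fetch_doc`, and `nli_entailment_score` are hypothetical stand-ins, not MediAssist's actual interfaces.

```python
import re

GROUNDING_THRESHOLD = 0.85  # best entailment below this => ungrounded

def extract_citations(output: str) -> dict[str, list[str]]:
    """Stage 1: map each recommendation line to the guideline IDs it
    cites, assuming inline markers such as '[NICE-NG28]'."""
    cited = {}
    for line in output.splitlines():
        if line.strip():
            cited[line.strip()] = re.findall(r"\[([A-Z]+-[A-Z0-9]+)\]", line)
    return cited

def verify_grounding(output, fetch_doc, nli_entailment_score):
    """Stage 2: every recommendation must cite at least one document,
    and its best entailment score must clear the threshold."""
    results = []
    for rec, refs in extract_citations(output).items():
        scores = [nli_entailment_score(fetch_doc(r), rec) for r in refs]
        grounded = bool(scores) and max(scores) >= GROUNDING_THRESHOLD
        results.append((rec, grounded, max(scores, default=0.0)))
    return results

# Illustrative stubs standing in for the knowledge base and NLI model.
docs = {"NICE-NG28": "Offer metformin as first-line therapy."}
score = lambda premise, hyp: 0.93 if "metformin" in hyp else 0.10
out = "Start metformin first line. [NICE-NG28]\nConsider drug X."
checked = verify_grounding(out, docs.get, score)
assert checked[0][1] is True    # cited and entailed: delivered to GP
assert checked[1][1] is False   # uncited: suppressed, GP referred to guideline
```

In the second case the sketch mirrors the suppression behaviour described above: the ungrounded line is withheld and the query would be logged for clinical governance review.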
Grounding metrics are reported as post-market monitoring measures: mean grounding score, proportion of ungrounded queries, and distribution by clinical specialty. The conformity assessment identified a non-conformity (NC-003) where the GP interface did not display the grounding score, limiting the GP's ability to assess evidence quality under Article 14(4)(a). This was remediated by adding a traffic-light grounding indicator (green at 0.90 or above, amber between 0.85 and 0.90, red below 0.85) alongside each recommendation.
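The remediated traffic-light indicator follows directly from the declared thresholds; the function name is illustrative.

```python
def grounding_indicator(score: float) -> str:
    """Map a grounding score to the traffic-light indicator shown to
    the GP (thresholds as declared in the NC-003 remediation)."""
    if score >= 0.90:
        return "green"
    if score >= 0.85:
        return "amber"
    return "red"

assert grounding_indicator(0.93) == "green"
assert grounding_indicator(0.87) == "amber"
assert grounding_indicator(0.80) == "red"
```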
Fairness evaluation covered 14 demographic subgroups and 23 intersectional subgroups, measuring diagnostic accuracy, recommendation completeness, and grounding score, with the clinical safety dimension amplifying every fairness gap into a potential patient harm event. The Fundamental Rights Impact Assessment identified seven engaged Charter rights.
Aggregate diagnostic accuracy reached 94.2%. The lowest subgroup accuracy was 89.1% for elderly patients (75+) from South Asian backgrounds, a gap of 5.1 percentage points. This was logged as risk R-003, with mitigation through targeted knowledge base augmentation (additional guidelines addressing conditions with higher prevalence in South Asian populations) and quarterly monitoring.
The FRIA identified intersectional elevated risk for elderly patients from ethnic minority backgrounds, whose clinical presentations may differ from training data patterns and whose conditions may be underrepresented in the knowledge base. Seven Charter rights were assessed: human dignity (Article 1), right to life (Article 2), right to integrity (Article 3), private life (Article 7), non-discrimination (Article 21), healthcare access (Article 35), and effective remedy (Article 47).
The prompt governance regime version-controls the system prompt as a first-class artefact. Five prompt components (system prompt, few-shot examples, output format, safety guardrails, drug interaction template) are separately versioned with designated approvers. Every prompt change triggers the governance pipeline and is assessed against substantial modification criteria, since prompt changes can alter system behaviour as profoundly as model retraining.
Eight monitoring metrics with quantified thresholds and tiered escalation maintain continuous compliance, with the drug interaction checker and grounding score operating as patient safety controls with immediate escalation paths. The post-market monitoring configuration reflects the clinical safety dimension that distinguishes this system from non-clinical AI.
Six lessons from MediAssist AI are specific to LLM-based architectures and would not arise in a classical ML system such as TalentLens Pro, reflecting the fundamentally different compliance challenges that retrieval-augmented generation introduces. Each lesson maps to a concrete compliance control implemented during the preparation process.
The GPAI information gap is real and consequential. Anthropic provided useful but incomplete information across all six request categories. The compensating controls, particularly sentinel monitoring and behavioural characterisation, are the primary compliance evidence for the GPAI model layer, not optional supplements.
Grounding verification is a compliance control, not a feature. For classical ML, outputs are deterministic scores and accuracy evaluation directly measures compliance-relevant behaviour. For LLM-based systems, outputs are natural language that may or may not be grounded in evidence. The grounding verification layer converts unreliable output into compliance-grade recommendations.
Knowledge base governance is as demanding as training data governance. Provenance, completeness, currency, and representativeness all affect outputs. The governance regime is a prerequisite for AISDP Module 4 data governance claims, not an enhancement.
Prompt governance is version control for LLM behaviour. Changes can alter behaviour as profoundly as retraining.

Multi-model systems require cascade-aware governance, because a change to any of the six components can ripple through the pipeline.

Clinical safety amplifies every compliance dimension, making every compliance failure a potential patient safety event. An organisation treating compliance as documentation rather than safety engineering will endanger patients and fail conformity assessment.
Article 25(1)(b) is engaged when fine-tuning changes the model's intended purpose (here, from general-purpose to clinical advice), alters the risk profile, and affects compliance with GPAI provider obligations. The fine-tuning organisation bears full provider obligations under Article 16, while the GPAI model provider retains obligations under Articles 51 through 56.
For classical ML systems, outputs are deterministic scores that can be directly evaluated for accuracy. For LLM-based systems, outputs are natural language that may or may not be grounded in evidence. Without grounding verification, the system cannot demonstrate Article 15 accuracy compliance because there is no mechanism to confirm that recommendations are supported by the evidence base.
Knowledge base governance should be equally rigorous. The knowledge base is a compliance-critical data asset whose provenance, completeness, currency, and representativeness directly affect system outputs. MediAssist AI maintains three distinct governance regimes for clinical guidelines (weekly updates), peer-reviewed abstracts (monthly refresh), and drug interaction records (monthly BNF alignment).
MediAssist AI implements real-time grounding suppression below 0.75, daily diagnostic accuracy checks (below 90% aggregate or 85% per subgroup triggers escalation), daily GPAI sentinel evaluation, weekly GP override rate monitoring for automation bias, and immediate break-glass escalation for any missed drug interaction.
Treat the system prompt as a version-controlled first-class artefact. MediAssist AI maintains five separately versioned prompt components with designated approvers. Every prompt change triggers the governance pipeline and is assessed against substantial modification criteria, because prompt changes can alter clinical behaviour as profoundly as model retraining.
Two stages: citation extraction requires the LLM to cite guideline references, then a DeBERTa-v3 NLI model verifies entailment with a 0.85 threshold. Ungrounded recommendations are suppressed and the GP is directed to consult guidelines directly.
A change to any upstream component triggers re-evaluation of all downstream components. MediAssist AI's six-component cascade map prevented a deployment when an embedding model update degraded paediatric query retrieval quality.
Evaluation covers 14 demographic subgroups and 23 intersectional subgroups, measuring diagnostic accuracy, recommendation completeness, and grounding score. The 5.1-percentage-point gap for elderly South Asian patients was mitigated through targeted knowledge base augmentation.
The cascade map runs: Clinical NER, Embedding model, Reranker, Generator, Drug interaction checker, Grounding verifier. This cascade governance prevented a deployment in Week 16 when an embedding model update degraded retrieval quality for paediatric queries, which would have reduced diagnostic accuracy for child patients below the declared threshold.
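The cascade rule can be sketched as a simple downstream slice over the pipeline order; the component identifiers are illustrative names for the six registered components.

```python
# The six-component pipeline in cascade order (per the cascade map).
PIPELINE = [
    "clinical_ner",
    "embedding_model",
    "reranker",
    "generator_llm",
    "drug_interaction_checker",
    "grounding_verifier",
]

def reevaluation_scope(changed: str) -> list[str]:
    """A change to any component triggers re-evaluation of that
    component and everything downstream of it in the pipeline."""
    i = PIPELINE.index(changed)
    return PIPELINE[i:]

# An embedding model update (as in the Week 16 incident) forces
# re-evaluation of retrieval, generation, and both safety checks.
assert reevaluation_scope("embedding_model") == PIPELINE[1:]
assert reevaluation_scope("grounding_verifier") == ["grounding_verifier"]
```

Under this rule the Week 16 embedding update could not ship until the five downstream evaluations, including subgroup diagnostic accuracy, passed their declared thresholds.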
Knowledge base governance applies three distinct regimes. Clinical guidelines (12,000 documents from NICE, SIGN, and BNF) are checked weekly for updates, with 94% coverage against the NICE clinical pathway taxonomy at launch. Peer-reviewed abstracts (45,000 from PubMed) are refreshed monthly, with abstracts older than five years flagged for currency review. Drug interaction records (3,200 from BNF) are updated monthly within 30 days of publication. The knowledge base is versioned as a composite artefact, with changes triggering cascade evaluation.
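For illustration, the three regimes can be collapsed into a simple age-limit check. Note this is a simplification: the real regimes are update-check cadences and publication-alignment windows rather than hard age limits, and the names and limits below are an assumed mapping.

```python
from datetime import date, timedelta

# Illustrative mapping of the three governance regimes to age limits.
CURRENCY_LIMITS = {
    "clinical_guideline": timedelta(days=7),    # weekly update check
    "abstract": timedelta(days=5 * 365),        # flag if older than 5 years
    "drug_interaction": timedelta(days=30),     # within 30 days of BNF publication
}

def needs_review(doc_type: str, last_updated: date, today: date) -> bool:
    """Flag a knowledge base record whose age exceeds its regime's limit."""
    return today - last_updated > CURRENCY_LIMITS[doc_type]

today = date(2026, 3, 15)
assert needs_review("abstract", date(2019, 1, 1), today)          # > 5 years old
assert not needs_review("drug_interaction", date(2026, 3, 1), today)  # 14 days old
```

Because the knowledge base is versioned as a composite artefact, any record flagged and refreshed by a check like this would bump the composite version and trigger cascade evaluation.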
The conformity assessment identified three non-conformities. NC-001: the Instructions for Use inadequately described limitations for rare conditions (Article 13(3)(b)(ii)). NC-002: the post-market monitoring plan lacked a quantified grounding score threshold (Article 72). NC-003: the GP interface did not display the grounding score (Article 14(4)(a)). All three were remediated within four weeks. The Declaration of Conformity was signed on 15 March 2026 after re-assessment confirmed conformity. EU database and MDR registration were completed in parallel.