MediAssist AI is a high-risk clinical decision support system built on retrieval-augmented generation with a fine-tuned foundation model, introducing compliance challenges that go far beyond classical ML systems. This worked example demonstrates AISDP preparation for an LLM-based system where GPAI provider information gaps, knowledge base governance, grounding verification, and the clinical safety dimension transform every compliance requirement into a patient safety control.
The system assists general practitioners by providing evidence-based diagnostic suggestions and treatment recommendations, grounded in clinical guidelines and peer-reviewed literature, serving approximately 2 million consultations per year.
Provided by Thornfield Health Technologies Ltd (UK), the system is classified under Annex III Area 1(b) as a safety component in healthcare. It uses a fine-tuned version of Claude 3.5 Sonnet as the generation model, with a knowledge base of approximately 12,000 clinical guidelines, 45,000 peer-reviewed abstracts, and 3,200 drug interaction records. Because Thornfield fine-tuned a GPAI model for a high-risk use case, Article 25(1)(b) is engaged and Thornfield bears full provider obligations under Article 16.
The contrast with the TalentLens Pro example is instructive. Where TalentLens Pro's challenges centre on data governance, fairness, and human oversight, MediAssist AI adds GPAI provider information asymmetry, knowledge base governance, grounding verification, prompt governance, fine-tuning provider boundaries, multi-model version control, and the clinical safety dimension that makes every compliance failure a potential patient safety event.
Classification confirmed MediAssist AI as high-risk under both the AI Act and the Medical Devices Regulation, creating parallel conformity assessment obligations. The system meets the AI system definition under Article 3(1): it is machine-based, operates with autonomy (generates suggestions without per-query pre-approval), and produces outputs influencing clinical decisions affecting patients.
All eight Article 5 prohibitions were cleared. Pathway A is engaged because the system is a safety component of a healthcare service, classified as Software as a Medical Device (SaMD) Class IIa under Regulation (EU) 2017/745. This triggers notified body involvement for the MDR conformity assessment, though the AI Act assessment follows the Annex VI internal procedure. Pathway B confirms Annex III Area 1(b).
The fine-tuning of Claude 3.5 Sonnet changed the model's intended purpose from general-purpose to clinical advice, engaging Article 25(1)(b). Anthropic remains the GPAI model provider under Articles 51 through 56. The structured Article 25(3) information request to Anthropic covered six categories: training data governance, model architecture, safety evaluation, versioning, data handling, and systemic risk. Every category produced partial disclosures, confirming that the GPAI information gap is real and consequential for downstream providers.
Compensating controls replace the compliance evidence that Anthropic cannot fully provide, forming the primary evidence for GPAI-dependent AISDP modules rather than optional supplementary measures. Without these controls, Modules 3, 4, and 6 would have material gaps.
Behavioural proxy testing uses a sentinel dataset of 2,400 clinical vignettes across 14 demographic subgroups and 6 clinical specialties. This compensates for the absence of training data representativeness information from Anthropic. Structured behavioural characterisation systematically evaluates hallucination rate, refusal patterns, and clinical accuracy across specialties, compensating for the limited failure mode analysis available from the GPAI provider.
A red-teaming programme tested four attack categories. Prompt injection via clinical notes was detected by the input sanitisation layer in 89% of cases, with the 11% failure rate logged as risk R-007. Knowledge base poisoning was rejected by the provenance verification step. Hallucination provocation using out-of-coverage scenarios found the system hallucinated in 23% of cases, with the grounding verification layer catching 78% before delivery to the GP.
Daily sentinel monitoring evaluates against a 500-vignette dataset, detecting GPAI model behaviour changes with a Population Stability Index threshold of 0.10. Version pinning and contractual notification (90 days for deprecation) provide additional protection. The six-category GPAI disclosure register (AISDP-MED-001-GPAI-REG-v1.0) records all received information and identified gaps alongside the compensating controls that address each gap.
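The daily PSI drift check can be sketched as follows. This is a minimal illustration, assuming per-vignette scores are compared against a stored baseline distribution; the `psi` helper and the score distributions are illustrative, not part of the MediAssist codebase.

```python
import numpy as np

def psi(expected, observed, bins=10):
    """Population Stability Index between a baseline and a current
    distribution of per-vignette scores. Bin edges are derived from
    the baseline so both distributions share the same grid."""
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range scores
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    o_frac = np.histogram(observed, bins=edges)[0] / len(observed)
    eps = 1e-6  # avoid log(0) in empty bins
    e_frac, o_frac = e_frac + eps, o_frac + eps
    return float(np.sum((o_frac - e_frac) * np.log(o_frac / e_frac)))

# Daily check against the 500-vignette sentinel baseline (synthetic data).
rng = np.random.default_rng(0)
baseline = rng.normal(0.9, 0.05, 500)   # baseline sentinel scores
today = rng.normal(0.9, 0.05, 500)      # stable behaviour
drifted = rng.normal(0.8, 0.05, 500)    # shifted model behaviour

assert psi(baseline, today) < 0.10       # below threshold: no escalation
assert psi(baseline, drifted) >= 0.10    # breach: GPAI drift escalation
```

A PSI at or above 0.10 on the sentinel set would indicate that the GPAI model's behaviour has shifted relative to the pinned version's baseline.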
Six model components form a cascade pipeline where a change to any upstream component triggers re-evaluation of all downstream components, making cascade-aware governance essential. The multi-model component registry tracks each component's type, provider, version, and AISDP module allocation.
| Component | Type | Provider | Version |
|---|---|---|---|
| Clinical NER | Fine-tuned SciBERT | Thornfield (internal) | 1.2.0 |
| Embedding model | voyage-large-2 | Voyage AI (third-party) | 2.0.3 |
| Reranker | Cohere Rerank v3 | Cohere (third-party) | 3.0.1 |
| Generator (LLM) | Fine-tuned Claude 3.5 Sonnet | Anthropic / Thornfield | ft-med-1.0.0 |
| Drug interaction checker | Rule-based (BNF sourced) | Thornfield (internal) | 4.1.0 |
| Grounding verifier | DeBERTa-v3 NLI model | Microsoft (open-source) | 1.0.0 |
Two-stage grounding verification transforms unreliable LLM output into evidence-backed clinical recommendations, serving as a compliance control rather than a product feature. Without it, the system cannot demonstrate Article 15 accuracy compliance because natural language outputs may or may not be grounded in evidence.
Stage 1 extracts citations. The system prompt requires the LLM to cite specific guideline references for every recommendation. The citation extractor parses the output and identifies cited documents. Stage 2 verifies entailment. A DeBERTa-v3 NLI model evaluates whether each cited document entails the recommendation it supports. If any recommendation lacks a citation, or if the highest entailment score falls below 0.85, the recommendation is flagged as ungrounded.
Ungrounded recommendations are never delivered to the GP. Instead, the system displays a message directing the GP to consult the relevant NICE guideline directly. The ungrounded query is logged for review by the clinical governance team, who assess whether the knowledge base has a coverage gap requiring remediation.
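A minimal sketch of the two-stage gate, assuming inline citation markers like `[NICE-NG28]` and treating the NLI model as an injectable scoring function. The citation format, `fetch_doc`, and `nli_entailment_score` are hypothetical stand-ins, not MediAssist's actual interfaces.

```python
import re

GROUNDING_THRESHOLD = 0.85  # best entailment below this => ungrounded

def extract_citations(output: str) -> dict[str, list[str]]:
    """Stage 1: map each recommendation line to the guideline IDs it
    cites, assuming inline markers such as '[NICE-NG28]'."""
    cited = {}
    for line in output.splitlines():
        if line.strip():
            cited[line.strip()] = re.findall(r"\[([A-Z]+-[A-Z0-9]+)\]", line)
    return cited

def verify_grounding(output, fetch_doc, nli_entailment_score):
    """Stage 2: every recommendation must cite at least one document,
    and its best entailment score must clear the threshold."""
    results = []
    for rec, refs in extract_citations(output).items():
        scores = [nli_entailment_score(fetch_doc(r), rec) for r in refs]
        grounded = bool(scores) and max(scores) >= GROUNDING_THRESHOLD
        results.append((rec, grounded, max(scores, default=0.0)))
    return results

# Illustrative stubs standing in for the knowledge base and NLI model.
docs = {"NICE-NG28": "Offer metformin as first-line therapy."}
score = lambda premise, hyp: 0.93 if "metformin" in hyp else 0.10
out = "Start metformin first line. [NICE-NG28]\nConsider drug X."
checked = verify_grounding(out, docs.get, score)
assert checked[0][1] is True    # cited and entailed: delivered to GP
assert checked[1][1] is False   # uncited: suppressed, GP referred to guideline
```

In the second case the sketch mirrors the suppression behaviour described above: the ungrounded line is withheld and the query would be logged for clinical governance review.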
Grounding metrics are reported as post-market monitoring measures: mean grounding score, proportion of ungrounded queries, and distribution by clinical specialty. The conformity assessment identified a non-conformity (NC-003) where the GP interface did not display the grounding score, limiting the GP's ability to assess evidence quality under Article 14(4)(a). This was remediated by adding a traffic-light grounding indicator (green at 0.90 or above, amber between 0.85 and 0.90, red below 0.85) alongside each recommendation.
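The remediated traffic-light indicator follows directly from the declared thresholds; the function name is illustrative.

```python
def grounding_indicator(score: float) -> str:
    """Map a grounding score to the traffic-light indicator shown to
    the GP (thresholds as declared in the NC-003 remediation)."""
    if score >= 0.90:
        return "green"
    if score >= 0.85:
        return "amber"
    return "red"

assert grounding_indicator(0.93) == "green"
assert grounding_indicator(0.87) == "amber"
assert grounding_indicator(0.80) == "red"
```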
Fairness evaluation covered 14 demographic subgroups and 23 intersectional subgroups, measuring diagnostic accuracy, recommendation completeness, and grounding score, with the clinical safety dimension amplifying every fairness gap into a potential patient harm event. The Fundamental Rights Impact Assessment identified seven engaged Charter rights.
Aggregate diagnostic accuracy reached 94.2%. The lowest subgroup accuracy was 89.1% for elderly patients (75+) from South Asian backgrounds, a gap of 5.1 percentage points. This was logged as risk R-003, with mitigation through targeted knowledge base augmentation (additional guidelines addressing conditions with higher prevalence in South Asian populations) and quarterly monitoring.
The FRIA identified intersectional elevated risk for elderly patients from ethnic minority backgrounds, whose clinical presentations may differ from training data patterns and whose conditions may be underrepresented in the knowledge base. Seven Charter rights were assessed: human dignity (Article 1), right to life (Article 2), right to integrity (Article 3), private life (Article 7), non-discrimination (Article 21), healthcare access (Article 35), and effective remedy (Article 47).
The prompt governance regime version-controls the system prompt as a first-class artefact. Five prompt components (system prompt, few-shot examples, output format, safety guardrails, drug interaction template) are separately versioned with designated approvers. Every prompt change triggers the governance pipeline and is assessed against substantial modification criteria, since prompt changes can alter system behaviour as profoundly as model retraining.
Eight monitoring metrics with quantified thresholds and tiered escalation maintain continuous compliance, with the drug interaction checker and grounding score operating as patient safety controls with immediate escalation paths. The post-market monitoring configuration reflects the clinical safety dimension that distinguishes this system from non-clinical AI.
Six lessons from MediAssist AI are specific to LLM-based architectures and would not arise in a classical ML system such as TalentLens Pro, reflecting the fundamentally different compliance challenges that retrieval-augmented generation introduces. Each lesson maps to a concrete compliance control implemented during the preparation process.
The GPAI information gap is real and consequential. Anthropic provided useful but incomplete information across all six request categories. The compensating controls, particularly sentinel monitoring and behavioural characterisation, are the primary compliance evidence for the GPAI model layer, not optional supplements.
Grounding verification is a compliance control, not a feature. For classical ML, outputs are deterministic scores and accuracy evaluation directly measures compliance-relevant behaviour. For LLM-based systems, outputs are natural language that may or may not be grounded in evidence. The grounding verification layer converts unreliable output into compliance-grade recommendations.
Knowledge base governance is as demanding as training data governance. Provenance, completeness, currency, and representativeness all affect outputs. The governance regime is a prerequisite for AISDP Module 4 data governance claims, not an enhancement.
Prompt governance is version control for LLM behaviour. Changes can alter behaviour as profoundly as retraining.

Multi-model systems require cascade-aware governance, because a change to any of the six components can ripple through the pipeline.

Clinical safety amplifies every compliance dimension, making every compliance failure a potential patient safety event. An organisation treating compliance as documentation rather than safety engineering will endanger patients and fail conformity assessment.
Article 25(1)(b) is engaged when fine-tuning changes the model's intended purpose (here, from general-purpose to clinical advice), alters the risk profile, and affects compliance with GPAI provider obligations. The fine-tuning organisation bears full provider obligations under Article 16, while the GPAI model provider retains obligations under Articles 51 through 56.
For classical ML systems, outputs are deterministic scores that can be directly evaluated for accuracy. For LLM-based systems, outputs are natural language that may or may not be grounded in evidence. Without grounding verification, the system cannot demonstrate Article 15 accuracy compliance because there is no mechanism to confirm that recommendations are supported by the evidence base.
Knowledge base governance should be equally rigorous. The knowledge base is a compliance-critical data asset whose provenance, completeness, currency, and representativeness directly affect system outputs. MediAssist AI maintains three distinct governance regimes for clinical guidelines (weekly updates), peer-reviewed abstracts (monthly refresh), and drug interaction records (monthly BNF alignment).
MediAssist AI implements real-time grounding suppression below 0.75, daily diagnostic accuracy checks (below 90% aggregate or 85% per subgroup triggers escalation), daily GPAI sentinel evaluation, weekly GP override rate monitoring for automation bias, and immediate break-glass escalation for any missed drug interaction.
Treat the system prompt as a version-controlled first-class artefact. MediAssist AI maintains five separately versioned prompt components with designated approvers. Every prompt change triggers the governance pipeline and is assessed against substantial modification criteria, because prompt changes can alter clinical behaviour as profoundly as model retraining.
Two stages: citation extraction requires the LLM to cite guideline references, then a DeBERTa-v3 NLI model verifies entailment with a 0.85 threshold. Ungrounded recommendations are suppressed and the GP is directed to consult guidelines directly.
A change to any upstream component triggers re-evaluation of all downstream components. MediAssist AI's six-component cascade map prevented a deployment when an embedding model update degraded paediatric query retrieval quality.
Evaluation covers 14 demographic subgroups and 23 intersectional subgroups, measuring diagnostic accuracy, recommendation completeness, and grounding score. The 5.1-percentage-point gap for elderly South Asian patients was mitigated through targeted knowledge base augmentation.
The cascade map runs: Clinical NER, Embedding model, Reranker, Generator, Drug interaction checker, Grounding verifier. This cascade governance prevented a deployment in Week 16 when an embedding model update degraded retrieval quality for paediatric queries, which would have reduced diagnostic accuracy for child patients below the declared threshold.
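The cascade rule can be sketched as a simple downstream slice over the pipeline order; the component identifiers are illustrative names for the six registered components.

```python
# The six-component pipeline in cascade order (per the cascade map).
PIPELINE = [
    "clinical_ner",
    "embedding_model",
    "reranker",
    "generator_llm",
    "drug_interaction_checker",
    "grounding_verifier",
]

def reevaluation_scope(changed: str) -> list[str]:
    """A change to any component triggers re-evaluation of that
    component and everything downstream of it in the pipeline."""
    i = PIPELINE.index(changed)
    return PIPELINE[i:]

# An embedding model update (as in the Week 16 incident) forces
# re-evaluation of retrieval, generation, and both safety checks.
assert reevaluation_scope("embedding_model") == PIPELINE[1:]
assert reevaluation_scope("grounding_verifier") == ["grounding_verifier"]
```

Under this rule the Week 16 embedding update could not ship until the five downstream evaluations, including subgroup diagnostic accuracy, passed their declared thresholds.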
Knowledge base governance applies three distinct regimes. Clinical guidelines (12,000 documents from NICE, SIGN, and BNF) are checked weekly for updates, with 94% coverage against the NICE clinical pathway taxonomy at launch. Peer-reviewed abstracts (45,000 from PubMed) are refreshed monthly, with abstracts older than five years flagged for currency review. Drug interaction records (3,200 from BNF) are updated monthly within 30 days of publication. The knowledge base is versioned as a composite artefact, with changes triggering cascade evaluation.
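For illustration, the three regimes can be collapsed into a simple age-limit check. Note this is a simplification: the real regimes are update-check cadences and publication-alignment windows rather than hard age limits, and the names and limits below are an assumed mapping.

```python
from datetime import date, timedelta

# Illustrative mapping of the three governance regimes to age limits.
CURRENCY_LIMITS = {
    "clinical_guideline": timedelta(days=7),    # weekly update check
    "abstract": timedelta(days=5 * 365),        # flag if older than 5 years
    "drug_interaction": timedelta(days=30),     # within 30 days of BNF publication
}

def needs_review(doc_type: str, last_updated: date, today: date) -> bool:
    """Flag a knowledge base record whose age exceeds its regime's limit."""
    return today - last_updated > CURRENCY_LIMITS[doc_type]

today = date(2026, 3, 15)
assert needs_review("abstract", date(2019, 1, 1), today)          # > 5 years old
assert not needs_review("drug_interaction", date(2026, 3, 1), today)  # 14 days old
```

Because the knowledge base is versioned as a composite artefact, any record flagged and refreshed by a check like this would bump the composite version and trigger cascade evaluation.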
The conformity assessment identified three non-conformities. NC-001: the Instructions for Use inadequately described limitations for rare conditions (Article 13(3)(b)(ii)). NC-002: the post-market monitoring plan lacked a quantified grounding score threshold (Article 72). NC-003: the GP interface did not display the grounding score (Article 14(4)(a)). All three were remediated within four weeks. The Declaration of Conformity was signed on 15 March 2026 after re-assessment confirmed conformity. EU database and MDR registration were completed in parallel.