TalentLens Pro is a high-risk AI recruitment screening system that ranks job candidates using an XGBoost ensemble model, triggering full EU AI Act compliance obligations under Annex III, Area 4(a). The system is provided by Meridian AI Solutions GmbH, a Berlin-based company, and processes approximately 2.3 million screenings per year across 18 EU Member States. It produces suitability scores from 0 to 100 with SHAP-based explanatory profiles for each candidate.
The system is classified as high-risk because it falls within Annex III, Area 4(a): AI systems intended for recruitment or selection of natural persons. The Article 6(3) exception does not apply because TalentLens Pro does not perform a narrow procedural task. It evaluates and ranks candidates using 47 features, producing a substantive suitability assessment that structures the recruiter's review. The AISDP reference is AISDP-2025-0042-v3.1, registered in the EU database as EU-AI-HR-2025-00784.
This worked example follows the complete seven-phase AISDP preparation process from discovery through post-market operations. It demonstrates how the engineering practices described throughout the Practitioners Implementation Guide translate into concrete artefacts, governance gates, and compliance evidence for a real-world system. The domain-to-module mapping table below shows how each guide section was applied.
| Domain Section | Application to TalentLens Pro | AISDP Module(s) |
|---|---|---|
| Risk Assessment | Five-method risk identification; 10-risk register; four-dimension scoring | 6, 11 |
| Model Selection | XGBoost selection over deep learning and LLM alternatives; SHAP rationale | 2, 3 |
| Data Governance | 326k-record dataset; three-source acquisition; 14-subgroup bias assessment | 4 |
| Development Architectures | Three-microservice architecture; eight-layer reference; C4 diagrams | 3, 7 |
| Version Control | Git monorepo with DVC; MLflow registry; composite versioning scheme | 10, 12 |
| CI/CD Pipelines | 11-stage pipeline with fairness, robustness, and modification gates | 2, 5, 9, 10 |
| Cybersecurity | ISO 27001; 13 AI threat categories; adversarial testing; CRA scope | 9 |
| Conformity Assessment | Annex VI internal control with voluntary TUV SUD review | Cross-cutting |
| Post-Market Monitoring | Nine monitoring activities with quantified thresholds; feedback loop | 12 |
| Operational Oversight | Six-level oversight pyramid; break-glass; anti-automation-bias features | 7 |
Classification required confirming the system meets the AI system definition under Article 3(1), clearing all Article 5 prohibitions, and positively identifying the applicable Annex III area. The AI System Assessor assembled an evidence pack of 47 documents including the product specification, existing risk assessments, deployer DPIAs, and interview transcripts.
The Article 3(1) definition was confirmed: TalentLens Pro is machine-based (XGBoost on AWS SageMaker), operates with autonomy (produces scores without per-prediction human intervention), and generates outputs influencing decisions affecting natural persons. All eight Article 5 prohibitions were cleared. The system does not engage in subliminal manipulation, exploitation of vulnerable groups, social scoring, or biometric identification.
Pathway A (Annex I safety component) did not apply. Pathway B identified Annex III, Area 4(a) as the applicable classification: AI systems intended for recruitment or selection of natural persons. The Legal and Regulatory Advisor assessed the Article 6(3) exception and determined it did not apply on three grounds. First, the system does not perform a narrow procedural task. Second, it does not merely improve a previously completed human activity. Third, it does not simply detect decision-making patterns without replacing human assessment.
The Classification Decision Record was reviewed against five dimensions (completeness, accuracy, reasoning, legal correctness, proportionality) and approved without amendment. The seven-member assessment team completed conflict of interest declarations. One assessor's prior consulting relationship with a deployer was managed by restricting access to that deployer's configuration data.
Five complementary methods produced 10 consolidated risks, each scored across four impact dimensions using a five-by-five likelihood-severity matrix. The methods were FMEA, stakeholder consultation, regulatory gap analysis, adversarial red-teaming, and horizon scanning.
The FMEA identified 34 failure modes across six components. Notable high-RPN entries included: feature engineering computing values outside the training distribution for novel job categories (RPN 36), SHAP generating misleading attributions when feature interactions dominate (RPN 36), and adversarial CV crafting producing inflated scores (RPN 30). All failure modes with RPN above 24 were escalated to the risk register.
Stakeholder consultation engaged nine participants including deployer HR directors, a labour rights advocate, an accessibility specialist, and a former job applicant. The labour rights advocate raised concerns about disadvantaging candidates with non-linear career paths. The accessibility specialist identified risks for candidates using assistive technology. Both concerns entered the risk register with assigned mitigations.
Adversarial red-teaming against the MITRE ATLAS matrix demonstrated an 8-point score inflation through adversarial CV crafting, with a 94% detection rate. Model extraction was impractical given rate limiting, and membership inference was not feasible for the XGBoost architecture. Horizon scanning reviewed OECD, Stanford HAI, and AI Incident Database sources, identifying three relevant developments including a study on gender bias amplification in resume screening.
XGBoost was selected because it achieved the best balance of accuracy, explainability, and compliance suitability among five evaluated architectures. The Model Selection Record documented the full evaluation rationale, including an Article 25 provider status analysis of why an LLM-based approach was rejected.
Five architectures were evaluated. Logistic regression achieved 0.791 AUC-ROC but the accuracy gap would have incorrectly ranked thousands of candidates monthly. Random forest achieved 0.832 but SHAP computation was too expensive for per-prediction explanations at inference time. XGBoost v1.7.6 achieved 0.847 with exact Shapley values via TreeExplainer within the latency budget, plus deterministic inference supporting Article 12 logging and Article 15 accuracy requirements.
A deep neural network achieved 0.851, a marginal improvement of 0.004 that did not justify the explainability degradation for a high-risk system requiring meaningful per-decision explanations under Article 14. A fine-tuned GPT-4 CV analyser was rejected on multiple grounds. The Legal and Regulatory Advisor determined that fine-tuning engaged Article 25(1)(b), making Meridian a provider with full obligations. Stochastic output variation created challenges for accuracy and logging requirements. Copyright and training data provenance risks were assessed as high given ongoing litigation.
The architecture maps to an eight-layer reference model implemented across three microservices. The Data Ingestion Module handles schema validation, prohibited feature blocking, and data minimisation. The Scoring Engine handles model inference, SHAP computation, and confidence thresholding. The Employer Reporting Interface handles human oversight, delayed score reveal, and calibration case injection.
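The Scoring Engine's confidence thresholding can be illustrated with a minimal sketch. The 0.60 floor, the names, and the routing labels are all assumptions for illustration; the source describes the control, not its implementation.

```python
from dataclasses import dataclass

# Hypothetical minimum confidence for surfacing a score automatically;
# the case study states the control exists but not its threshold.
CONFIDENCE_FLOOR = 0.60

@dataclass
class ScoringResult:
    suitability_score: int  # 0-100 scale, as in the case study
    confidence: float       # model confidence for this prediction
    route: str              # "automatic" or "manual_review"

def threshold_result(score: float, confidence: float) -> ScoringResult:
    """Route low-confidence predictions to manual review rather than
    surfacing a score the recruiter might over-trust."""
    route = "automatic" if confidence >= CONFIDENCE_FLOOR else "manual_review"
    return ScoringResult(round(score), confidence, route)
```

Keeping the routing decision in the Scoring Engine, rather than in the reporting interface, means the low-confidence path is enforced before a score ever reaches a recruiter.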
The Employer Reporting Interface implements five anti-automation-bias measures that go well beyond a simple override button, addressing the risk that recruiters accept AI scores without exercising independent judgement. These controls were redesigned during Phase 3 after the risk assessment for R-002 revealed that a basic score display would be insufficient.
Delayed score reveal withholds the suitability score for 30 seconds. During this period, the recruiter sees the candidate profile and must form an independent impression before the AI assessment appears. This is a technical control enforced at the interface level, not a policy recommendation.
Calibration cases are injected at a rate of 1 in 20: cases where the system's recommendation is known to be incorrect, testing whether the recruiter identifies the error independently. Review time monitoring alerts the deployer administrator when the median review time per recruiter falls below 45 seconds. Override capability allows any score to be overridden with a mandatory free-text justification.
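Two of these controls reduce to a few lines of logic. The sketch below uses the 1-in-20 injection ratio and 45-second median floor stated in the text; everything else (function names, the use of a seeded RNG) is an illustrative assumption.

```python
import random
import statistics

CALIBRATION_RATE = 20       # 1-in-20 injection ratio from the case study
REVIEW_FLOOR_SECONDS = 45   # median review-time alert threshold

def should_inject_calibration(rng: random.Random) -> bool:
    """True when the next case shown to a recruiter should be a
    known-answer calibration case (1 in 20 on average)."""
    return rng.randrange(CALIBRATION_RATE) == 0

def review_time_alert(review_times_s: list[float]) -> bool:
    """Alert the deployer administrator when a recruiter's median
    review time falls below the 45-second floor."""
    return statistics.median(review_times_s) < REVIEW_FLOOR_SECONDS
```

The median, rather than the mean, keeps the alert robust to a single long review session inflating an otherwise rushed pattern.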
The break-glass design provides two independent halt mechanisms. An in-application stop button propagates through the feature flag system within 200ms. A separate Lambda function in a different AWS account scales the inference endpoint to zero. Annual exercises verify both mechanisms. The first exercise found one deficiency: the deployer notification email lacked an expected restart timeline, which was remediated within 48 hours.
The six-level pyramid ranges from Level 1 (two SREs with emergency rollback authority) through Level 6 (external oversight, including the AESIA sandbox engagement). The aggregate override rate in Q2 2025 was 11.3%, with no deployer falling below the 2% threshold that would trigger an automation bias investigation.
The training dataset was assembled from three sources with full provenance documentation, bias assessment across 14 subgroups, and Article 10(5) safeguards for special category data processing. Each source was documented using the Gebru et al. datasheet framework.
Anonymised historical recruitment data from 14 enterprise deployers contributed 248,000 records (January 2019 to June 2024). A synthetic augmentation dataset generated using CTGAN addressed under-representation in specific subgroups, contributing 52,000 records. A validated benchmark dataset from the Technical University of Munich contributed 26,000 records. The data spanned 18 EU Member States and 12 languages.
The bias assessment used Fairlearn's MetricFrame to evaluate 14 subgroups across seven dimensions (gender, age band, ethnicity, disability status, nationality grouping, language, highest qualification origin). Post-mitigation selection rate ratios ranged from 0.89 to 0.96. Intersectional analysis produced 148 cells, with 23 falling below the 50-candidate minimum cell size. The female-over-50-non-EU-qualification intersection at 0.87 was identified for targeted improvement. FairML Consulting GmbH conducted an independent audit confirming the findings.
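The selection rate ratio metric underlying these figures is simple to state. The case study uses Fairlearn's MetricFrame; the from-scratch sketch below just makes the arithmetic explicit and is not the production code.

```python
from collections import defaultdict

def selection_rate_ratios(selected: list[bool], groups: list[str]) -> dict[str, float]:
    """Selection rate ratio per subgroup: each group's selection rate
    divided by the highest group's rate, so the best-treated group
    scores 1.0 and lower values indicate relative disadvantage."""
    counts = defaultdict(lambda: [0, 0])  # group -> [selected, total]
    for sel, grp in zip(selected, groups):
        counts[grp][0] += int(sel)
        counts[grp][1] += 1
    rates = {g: s / n for g, (s, n) in counts.items()}
    best = max(rates.values())
    return {g: r / best for g, r in rates.items()}
```

A post-mitigation range of 0.89 to 0.96 therefore means every subgroup's selection rate was within 11% of the best-treated subgroup's rate.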
Article 10(5) special category processing used pseudonymisation at collection, purpose limitation through technical controls (logically separated database with access restricted to three named individuals), and automatic deletion within 72 hours of aggregate metric computation. Data version control used DVC with S3 storage, with every training run's DVC reference recorded in MLflow to establish complete provenance from model version to raw data.
The pipeline enforces compliance as a technical gate, not a documentation exercise, by blocking deployment unless every quality, fairness, robustness, and modification threshold is met. No model reaches production without passing all 11 stages.
Stages 1 through 4 cover code quality: checkout with SBOM generation via CycloneDX, static analysis (Ruff, Bandit, pip-audit), 847 unit tests at 85% coverage, and integration tests using a 500-record fixture spanning all 14 subgroups. Stage 5 validates training data against 23 Great Expectations rules covering types, ranges, null rates, and distribution bounds.
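Stage 5's rule types can be sketched without the Great Expectations library. The three toy rules below mirror the stated categories (types, ranges, null rates); the field names and limits are illustrative assumptions, not the production rule set of 23.

```python
def validate_batch(rows: list[dict]) -> list[str]:
    """Minimal expectation-style checks: return a list of rule
    failures for a batch of training records (empty list = pass)."""
    failures = []
    # Type rule: the experience field must be numeric in every record.
    if any(not isinstance(r.get("years_experience"), (int, float)) for r in rows):
        failures.append("years_experience: wrong or missing type")
    # Range rule: numeric values must fall inside a plausible bound.
    if any(isinstance(r.get("years_experience"), (int, float))
           and not 0 <= r["years_experience"] <= 60 for r in rows):
        failures.append("years_experience: value out of [0, 60]")
    # Null-rate rule: tolerate sparse nulls, fail on systematic ones.
    null_rate = sum(r.get("highest_qualification") is None for r in rows) / len(rows)
    if null_rate > 0.05:  # illustrative 5% null ceiling
        failures.append(f"highest_qualification: null rate {null_rate:.1%} > 5%")
    return failures
```

In the real pipeline a non-empty failure list would fail the stage and block everything downstream, which is the "compliance as a technical gate" point the text makes.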
Stage 6 trains the model. Stage 7 enforces the performance gate: AUC-ROC must reach 0.82, precision 0.75, and recall 0.70. Stage 8 enforces the fairness gate: every subgroup selection rate ratio must reach 0.80 (hard floor), with 0.90 generating a warning. Stage 9 enforces the robustness gate: maximum adversarial score inflation must not exceed 10 points, and out-of-distribution detection must flag 90% of synthetic OOD inputs.
Stage 10 detects substantial modifications through two comparisons. Version-to-version comparison checks against the current production model (thresholds include AUC-ROC shift exceeding 0.03 and any subgroup SRR below 0.80). Version-to-baseline comparison checks against the conformity assessment baseline (thresholds include cumulative AUC-ROC drift exceeding 0.05). Stage 11 generates documentation automatically, updating AISDP Modules 3, 4, 5, and 9 with current values.
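The two Stage 10 comparisons reduce to a pair of checks against different reference points. The sketch below shows only the AUC-ROC and SRR checks named in the text; the full stage would cover more thresholds.

```python
def modification_flags(candidate_auc: float, production_auc: float,
                       baseline_auc: float, subgroup_srr: dict[str, float]) -> list[str]:
    """Version-to-version comparison against the current production model
    and version-to-baseline comparison against the conformity assessment
    baseline, using the thresholds stated in the text."""
    flags = []
    if abs(candidate_auc - production_auc) > 0.03:
        flags.append("v2v: AUC-ROC shift exceeds 0.03")
    if any(srr < 0.80 for srr in subgroup_srr.values()):
        flags.append("v2v: a subgroup SRR fell below 0.80")
    if abs(candidate_auc - baseline_auc) > 0.05:
        flags.append("v2b: cumulative AUC-ROC drift exceeds 0.05")
    return flags
```

This structure shows why the baseline comparison matters: a series of small version-to-version shifts can each pass the 0.03 check while cumulative drift quietly approaches 0.05.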
Human override of a failed gate requires the AI Governance Lead's written approval with a logged justification. The security posture includes ISO 27001 certification, assessment of all 13 AI threat categories, and annual penetration testing by NCC Group. The CRA scope determination treated TalentLens Pro as within scope, given the Commission's evolving interpretation of the SaaS boundary.
Conformity assessment followed the Annex VI internal control procedure, supplemented by a voluntary TUV SUD review, with a five-step review workflow that identified and remediated three non-conformities before the Declaration was signed. Four harmonised standards were applied: ISO/IEC 42001, 23894, 25012, and 27001.
The four review steps took 12 working days, with final approval as the fifth step. The Technical Review (4 days) found one minor non-conformity: an architecture diagram showed Redis connected to the wrong microservice. The Legal Review (3 days) found one major non-conformity (the candidate notification template was missing in 3 of 12 languages) and one minor (a cross-reference error). The Data Protection Review (2 days) found no issues. The Holistic Review (3 days) confirmed all 26 Annex IV completeness items.
The major non-conformity was remediated within 21 days through a certified translation agency with AI domain expertise. The Declaration of Conformity was prepared per Annex V in machine-readable (JSON-LD) and signed PDF formats, translated into all 24 official EU languages, and signed on 1 April 2025. EU database registration was confirmed as EU-AI-HR-2025-00784 before the system was available to new deployers.
The 30-minute inspection readiness drill was conducted with two internal mock inspectors unfamiliar with the AISDP. All four document requests were fulfilled within the benchmark: design specifications in 8 minutes, fairness testing evidence in 4 minutes, the risk register in 2 minutes, and a live decision reconstruction in 18 minutes. CE marking was affixed in three locations: the user interface, API response headers, and the Instructions for Use.
Nine monitoring activities with quantified thresholds and escalation procedures form the backbone of continuous compliance, supported by a six-level oversight pyramid and a feedback loop with tiered decision authority. Monitoring is not a reporting exercise but an active compliance control.
| Activity | Metric | Threshold | Frequency |
|---|---|---|---|
| Performance | AUC-ROC on labelled production data | Below 0.80 | Monthly |
| Fairness | Selection rate ratio per subgroup | Below 0.85 (warning); below 0.80 (critical) | Monthly |
| Data drift | Population Stability Index | Above 0.20 | Weekly |
| Human oversight | Override rate per deployer | Below 2% or above 40% | Monthly |
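The data drift metric in the table has a standard closed form. A minimal sketch over pre-binned proportions (the epsilon guard is a common convention, not something the source specifies):

```python
import math

def psi(expected: list[float], actual: list[float]) -> float:
    """Population Stability Index over per-bin proportions:
    PSI = sum_i (a_i - e_i) * ln(a_i / e_i).
    A weekly reading above 0.20 breaches the drift threshold."""
    eps = 1e-6  # guard against empty bins before taking the log
    return sum(
        (max(a, eps) - max(e, eps)) * math.log(max(a, eps) / max(e, eps))
        for e, a in zip(expected, actual)
    )
```

Each term is non-negative (the difference and the log always share a sign), so PSI only grows as the production distribution moves away from the training one.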
Total initial preparation cost was approximately EUR 325,000 over 22 weeks, with ongoing annual costs of approximately EUR 95,000, about 18% of annual development cost. The largest internal effort allocation was the AI System Assessor at 0.6 FTE for the full duration.
Direct costs included internal effort (EUR 210,000), the FairML bias audit (EUR 25,000), the TUV SUD voluntary review (EUR 35,000), NCC Group penetration testing (EUR 18,000), and translation services (EUR 22,000). Shared infrastructure costs across Meridian's three high-risk systems were approximately EUR 40,000 per year. The 18% ongoing cost ratio falls within the 15% to 25% range typical for high-risk AI compliance programmes.
The decommissioning plan was documented during Phase 3, covering a six-month planned retirement with technical shutdown, data lifecycle closure, and downstream decision monitoring. The timeline runs from T minus 6 months (deployer announcement) through full shutdown at T-0, with credential revocation via HashiCorp Vault's parent lease mechanism and infrastructure teardown via Terraform. Special category data is deleted within 72 hours, while aggregated monitoring data is retained for ten years. A 12-month post-decommission plan tracks aggregate hiring outcomes for the final cohort, disaggregated by subgroup, acknowledging that historical AI scores continue to affect individuals.
Five key lessons emerged. First, retrospective documentation is less credible than contemporaneous evidence. Meridian chose to start compliant documentation from v4.0.0 rather than reconstructing v3.x history, clearly marking the boundary. Second, human oversight requires operational design, not a checkbox. The initial score-and-override design was replaced with delayed reveal, calibration injection, and review time monitoring after the R-002 risk assessment.
The Article 6(3) exception is unlikely to apply to substantive screening systems. TalentLens Pro could not claim it because the system evaluates and ranks candidates using 47 features, producing a substantive suitability assessment. Systems that merely sort applications by date or check for keyword presence may qualify, but any system that generates rankings or scores for human review is performing more than a narrow procedural task.
Meridian chose not to reconstruct retroactive documentation for the pre-Act version. Instead, they treated the first substantial modification (v4.0.0 model retraining) as the starting point for compliant documentation, clearly marking what was reconstructed versus contemporaneous. This transparency was more credible than retroactive documentation claiming to be original.
TalentLens Pro detected a deployer using the system for internal transfers through Level 3 (product management) oversight. The deployer was notified, the misuse was documented, and the Instructions for Use were clarified. Technical monitoring alone cannot detect intent drift; business-level oversight is essential.
Implement both version-to-version comparison (against the current production model) and version-to-baseline comparison (against the conformity assessment baseline). TalentLens Pro's cumulative tracking detected AUC-ROC drift of 0.041 approaching the 0.05 threshold, which would have been invisible with only version-to-version checks.
The decommissioning plan comprises a phased six-month timeline from deployer announcement to full shutdown, credential revocation and infrastructure teardown, data lifecycle closure with special category deletion within 72 hours, downstream decision monitoring for 12 months tracking how historical outputs continue to affect individuals, and documentation archival with retrieval procedures.
XGBoost provides exact Shapley values via TreeExplainer within latency budgets, deterministic inference for logging and accuracy requirements, and competitive AUC-ROC without the explainability trade-offs of neural networks or the compliance complications of LLM-based approaches.
The anti-automation-bias measures comprise delayed score reveal (30 seconds), calibration case injection (1:20 ratio), review time monitoring, mandatory override justification, and aggregate override rate tracking with thresholds for investigation.
The conformity route was Annex VI internal control with a five-step review workflow (technical, legal, data protection, holistic, final approval), non-conformity tracking and remediation, an optional voluntary notified body review, and a 30-minute inspection readiness drill.
Costs came to approximately EUR 325,000 initially over 22 weeks, with EUR 95,000 in annual ongoing costs (about 18% of development cost), covering internal effort, external audits, the voluntary review, penetration testing, and translations.
| ID | Risk | Residual | Key Mitigation |
|---|---|---|---|
| R-001 | Discriminatory scoring against protected subgroups | Medium | 14-subgroup testing; SRR threshold 0.80; external audit |
| R-002 | Automation bias: recruiters accept scores without review | Medium | Delayed score reveal; calibration cases (1:20); review time monitoring |
| R-003 | Model drift degrading accuracy or fairness | Low | PSI monitoring; quarterly revalidation; cumulative baseline tracking |
| R-004 | Adversarial CV crafting inflating scores | Medium | Input validation; 94% detection rate; annual red-teaming |
| R-005 | Adverse impact on non-standard CVs | Medium | Feature audit for proxy variables; synthetic augmentation |
The Fundamental Rights Impact Assessment covered five Charter rights. Non-discrimination (Article 21) was the primary concern, addressed through Fairlearn MetricFrame analysis across 14 subgroups. Freedom to choose an occupation (Article 15) was mitigated by the prohibition on autonomous rejection, technically enforced by withholding scores until the review period completes. The reputational risk assessment rated R-001 (discriminatory scoring) as the highest exposure, with customer and market dimensions both scoring five out of five.
| Activity | Metric | Threshold | Frequency |
|---|---|---|---|
| Adversarial detection | Detection rate on test cases | Below 90% | Quarterly |
| Serious incidents | Any Article 73 event | Any occurrence | Continuous |
One complete feedback loop cycle occurred in Q2 2025. Weekly PSI monitoring detected a 0.18 reading on "years of experience" (approaching the 0.20 threshold). Investigation identified the cause: a technology sector deployer had begun processing graduate applications. The Technical SME authorised a threshold adjustment for that deployer's partition, documented in the PMM review minutes. This did not trigger a substantial modification assessment because the model itself was unchanged.
Level 3 oversight detected an intent drift signal: a deployer using TalentLens Pro for internal transfers, outside the documented intended purpose. The deployer was notified, the misuse documented, and the Instructions for Use clarified. A tabletop exercise in September 2025 rehearsed the serious incident reporting procedure, simulating a scenario where the system systematically underscored candidates with disabilities. The exercise verified the dual-reporting decision tree, evidence preservation procedures, and notification chain.
Third, cumulative change tracking prevents undetected drift. Version-to-baseline comparison detected that cumulative AUC-ROC drift had reached 0.041, approaching the 0.05 threshold, which would have been invisible without cumulative tracking. Fourth, scope creep requires business-level oversight. Technical monitoring alone cannot detect intent drift such as deployers using the system for internal transfers. Fifth, non-retaliation culture enables early detection. A recruiter's concern about university-based scoring led to a supplementary analysis confirming no adverse impact, reinforcing the value of escalation.
Meridian's compliance maturity was assessed at Level 4 (Operational) across all domains except end-of-life (Level 3, planned but not yet executed), progressing toward Level 5 (Embedded) through deeper integration of compliance evidence generation into the engineering workflow.