Risk scoring translates qualitative judgements into comparable, auditable ratings that drive treatment priority. Under the EU AI Act, risks are assessed against four impact dimensions: health and safety, fundamental rights, operational integrity, and reputational exposure. This page covers the scoring rubrics, composite calculation, calibration methodology, and the evidence standards that make risk scores defensible during conformity assessment.
Risks are scored using a likelihood-impact matrix that translates qualitative judgements into comparable, auditable ratings. For AI systems under the EU AI Act, the AI System Assessor rates impact against four dimensions: health and safety, fundamental rights, operational integrity, and reputational exposure. A five-by-five matrix is standard, with likelihood rated from rare (one) to almost certain (five) and impact from negligible (one) to catastrophic (five).
The composite risk score is the product of the likelihood rating and the highest impact rating across the four dimensions. This worst-case dimension approach ensures that a risk with low health impact but catastrophic fundamental rights impact is not diluted by averaging across dimensions. The AI System Assessor records all four impact ratings; the composite score drives the treatment priority, but the individual dimension scores inform the type of mitigation required.
Risks scoring at or above the organisation's defined threshold (typically twelve on a twenty-five-point scale) require specific, documented mitigation measures. Those scoring below the threshold are accepted, with the acceptance recorded and signed by the AI Governance Lead. For the broader risk management framework, see AI Risk Assessment and Management.
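A minimal sketch of the composite calculation and threshold check, following the worst-case dimension rule described below; the field names and the threshold constant are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass

# Example policy threshold on the twenty-five-point scale (an assumption;
# each organisation documents its own in the risk management policy).
MITIGATION_THRESHOLD = 12

@dataclass
class RiskScore:
    likelihood: int               # 1 (rare) to 5 (almost certain)
    health_safety: int            # each impact dimension rated 1 to 5
    fundamental_rights: int
    operational_integrity: int
    reputational_exposure: int

    @property
    def composite(self) -> int:
        # Worst-case dimension approach: likelihood times the highest
        # impact rating, never the average across dimensions.
        worst = max(self.health_safety, self.fundamental_rights,
                    self.operational_integrity, self.reputational_exposure)
        return self.likelihood * worst

    def requires_mitigation(self) -> bool:
        return self.composite >= MITIGATION_THRESHOLD
```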
Each dimension requires a calibrated rubric to ensure consistency across assessors and across systems in the organisation's portfolio. The likelihood scale runs from rare (failure mode not observed in comparable systems, requiring a highly improbable confluence of conditions) through unlikely (documented in comparable systems under materially different conditions), possible (observed under broadly similar conditions or in testing at low frequency), and likely (observed in testing or early production, or a known characteristic of the model architecture) to almost certain (observed repeatedly in operation, expected to recur without additional mitigation).
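The likelihood scale lends itself to being encoded as reference data so that tooling can display the anchor text beside each rating. A hypothetical encoding, with wording abbreviated from the rubric above; the structure is an assumption, not a standard:

```python
# Likelihood levels keyed by rating; each entry pairs the label with
# abbreviated anchor text for display in assessment tooling.
LIKELIHOOD_RUBRIC = {
    1: ("rare", "not observed in comparable systems; requires a highly "
                "improbable confluence of conditions"),
    2: ("unlikely", "documented in comparable systems under materially "
                    "different conditions"),
    3: ("possible", "observed under broadly similar conditions, or in "
                    "testing at low frequency"),
    4: ("likely", "observed in testing or early production, or a known "
                  "characteristic of the model architecture"),
    5: ("almost certain", "observed repeatedly in operation; expected to "
                          "recur without additional mitigation"),
}
```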
The scoring must be accompanied by explicit rationale. "Medium likelihood" alone is insufficient; the assessor must explain why medium rather than high, citing evidence such as the frequency of a failure mode observed during testing, the exposure of the affected population, or comparable incidents in similar systems.
Each impact dimension has its own five-level rubric that anchors abstract labels to concrete consequences, preventing "significant" from meaning different things in different assessments.
Health and safety ranges from negligible (no measurable health or safety consequence) through minor (temporary inconvenience or minor distress to a small number of affected persons), moderate (material adverse effect on wellbeing, reversible with intervention), and significant (serious harm to health, safety, or livelihoods, or moderate harm affecting a large population) to catastrophic (irreversible harm to life, health, or safety, or serious harm affecting a large and vulnerable population).
Fundamental rights ranges from negligible (no discernible effect on any EU Charter right) through minor (marginal effect, correctable through the system's appeal or redress mechanism), moderate (measurable infringement affecting identifiable individuals, requiring active remediation), and significant (systematic infringement affecting a class of persons, particularly those in vulnerable situations) to catastrophic (large-scale or irreversible infringement, or infringement affecting a right of particular sensitivity such as human dignity, non-discrimination, or liberty).
Operational integrity ranges from negligible (no effect on system availability or accuracy) through minor (brief service degradation recoverable without manual intervention), moderate (extended degradation requiring engineering intervention, or accuracy degradation affecting a measurable proportion of decisions), and significant (system outage affecting deployers and affected persons, or accuracy degradation severe enough to undermine the intended purpose) to catastrophic (total system failure, data loss, or compromise of the system's integrity such that outputs cannot be trusted).
Reputational exposure ranges from negligible (internal awareness only) through minor (deployer awareness, no external visibility), moderate (limited external visibility in trade press and specialist publications), and significant (mainstream media coverage, regulatory attention, potential for affected person litigation) to catastrophic (sustained public attention, political scrutiny, regulatory enforcement proceedings, and material effect on commercial relationships).
The composite score uses the worst-case dimension approach: likelihood multiplied by the highest of the four impact ratings. This design decision reflects the EU AI Act's emphasis on protecting the most affected dimension rather than averaging across all dimensions.
A risk rated likely (four) for likelihood with health and safety impact of minor (two), fundamental rights impact of significant (four), operational impact of negligible (one), and reputational impact of moderate (three) produces a composite score of sixteen (four multiplied by four). The composite score drives the treatment priority, but the AI System Assessor retains all four dimension scores because they inform which type of mitigation is needed: a rights-focused risk needs rights-focused controls, not operational improvements.
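The same worked example, sketched in code to show what the worst-case rule protects against; the averaged variant is included only for contrast and is not part of the methodology:

```python
# Values taken from the worked example above.
likelihood = 4
impacts = {"health_safety": 2, "fundamental_rights": 4,
           "operational_integrity": 1, "reputational_exposure": 3}

worst_case = likelihood * max(impacts.values())               # 4 * 4 = 16
averaged = likelihood * sum(impacts.values()) / len(impacts)  # 4 * 2.5 = 10.0

# With a threshold of twelve, worst-case scoring mandates mitigation;
# averaging would let the significant rights impact slip below the line.
```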
The threshold for mandatory mitigation must be documented in the organisation's risk management policy and applied consistently across all systems in the portfolio. For how residual risk is managed after mitigations are applied, see Residual Risk and Acceptability.
Scoring is inherently subjective, and calibration exercises should be run at least annually, presenting assessors with a set of standardised risk scenarios and comparing their scores. Systematic divergences (one assessor consistently scoring likelihood higher than another) are identified by the AI Governance Lead and addressed through shared reference cases and discussion. Where the organisation has multiple high-risk systems, cross-system calibration ensures that a "significant" rating carries the same meaning across the portfolio, enabling meaningful portfolio-level risk reporting to executive leadership.
Calibration workshops use five to ten reference scenarios drawn from published enforcement actions, the AI Incident Database, or internal near-miss events. Assessors score the scenarios independently, then compare results. Divergences are discussed and the group agrees on reference scores for each scenario. These become calibration anchors: when scoring a new risk, assessors compare it to the anchored scenarios, which grounds scoring in concrete reference points.
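A sketch of the divergence check a workshop facilitator might run; the assessor names, scenario identifiers, and scores below are invented for illustration:

```python
from statistics import mean

# Agreed reference scores for each scenario (scenario id -> likelihood rating).
reference = {"s1": 3, "s2": 4, "s3": 2, "s4": 5, "s5": 1}

# Each assessor's independent scores for the same scenarios.
assessor_scores = {
    "assessor_a": {"s1": 3, "s2": 4, "s3": 3, "s4": 5, "s5": 2},
    "assessor_b": {"s1": 2, "s2": 3, "s3": 2, "s4": 4, "s5": 1},
}

# A persistently positive mean deviation suggests the assessor scores
# likelihood systematically higher than the agreed anchors; negative, lower.
for name, scores in assessor_scores.items():
    bias = mean(scores[s] - reference[s] for s in reference)
    print(f"{name}: mean deviation from anchors {bias:+.2f}")
```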
For high-uncertainty risks, where the team genuinely cannot determine whether a failure mode is likely or unlikely, semi-quantitative Bayesian scoring offers a more defensible approach. Each assessor provides a probability distribution across the five likelihood levels (for example, ten per cent rare, thirty per cent unlikely, forty per cent possible, fifteen per cent likely, five per cent almost certain). The distributions are aggregated, and the resulting expected value and confidence interval are reported alongside the risk. This makes uncertainty visible instead of concealing it behind a point estimate.
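One way such aggregation might be implemented, using a simple linear opinion pool (averaging the distributions); the first distribution is the example from the text, while the second assessor's distribution and the choice of a ninety per cent interval are illustrative assumptions:

```python
import numpy as np

# Each row is one assessor's probability distribution over the five
# likelihood levels (rare=1 .. almost certain=5).
levels = np.array([1, 2, 3, 4, 5])
assessor_dists = np.array([
    [0.10, 0.30, 0.40, 0.15, 0.05],
    [0.05, 0.20, 0.45, 0.25, 0.05],
])

# Linear opinion pool: average the distributions across assessors.
pooled = assessor_dists.mean(axis=0)

expected = float(levels @ pooled)   # expected likelihood level
cumulative = np.cumsum(pooled)

# Central ninety per cent interval over the discrete levels.
low = levels[np.searchsorted(cumulative, 0.05)]
high = levels[np.searchsorted(cumulative, 0.95)]
print(f"expected likelihood {expected:.2f}, 90% interval [{low}, {high}]")
```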
Every score must be accompanied by a written rationale citing specific evidence. The rationale is a Module 6 evidence artefact that will be examined during conformity assessment. Assessors who record only the score without the supporting reasoning create an audit vulnerability.
Acceptable evidence includes the frequency of similar failures observed during testing, the exposure of the affected population, comparable incidents documented in the AI Incident Database, the maturity of existing mitigations, and the system's operational history. The rationale must address why the chosen level is correct rather than the adjacent levels, demonstrating that the assessor considered the alternatives.
Scoring patterns across the register should be reviewed by the AI Governance Lead to identify systematic inconsistencies before the assessment is finalised. The scoring matrix either strengthens or undermines the risk assessment's credibility. A matrix that produces consistent, defensible scores across assessors is a compliance asset. Inconsistent scoring suggests the organisation does not understand its own risk profile, and will be treated as such during conformity assessment.
Most GRC platforms (OneTrust, ServiceNow, Archer) provide structured risk scoring with configurable matrices, automated composite calculation, and audit trail retention. These platforms also support calibration workshop management by recording each assessor's independent scores alongside the agreed reference scores.
For organisations not using a GRC platform, a spreadsheet-based approach works if it captures all required fields: risk identifier, likelihood score with rationale, all four impact dimension scores with rationale, composite score, threshold comparison, and treatment decision. The calibration anchors from workshop exercises should be documented in a standing reference document accessible to all assessors.
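A hypothetical schema mirroring those required fields, useful as a checklist whether the register lives in a spreadsheet or a database; the names are illustrative, not a mandated format:

```python
from dataclasses import dataclass

@dataclass
class DimensionScore:
    rating: int      # 1 to 5
    rationale: str   # why this level rather than the adjacent ones

@dataclass
class RiskRegisterRow:
    risk_id: str
    likelihood: DimensionScore
    health_safety: DimensionScore
    fundamental_rights: DimensionScore
    operational_integrity: DimensionScore
    reputational_exposure: DimensionScore
    composite: int            # likelihood rating * highest impact rating
    above_threshold: bool     # composite compared against the policy threshold
    treatment_decision: str   # e.g. "mitigate" or "accept, signed by the AI Governance Lead"
```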
Semi-quantitative Bayesian scoring, where assessors provide probability distributions rather than point estimates, is not natively supported by most GRC platforms. Implementation typically requires a custom tool such as a Python script or a simple web form that collects distributions and computes aggregates. The additional effort is justified for high-stakes risks where the cost of a wrong point estimate is severe. For the risk identification methods that feed into this scoring process, see Risk Identification Methods.
Calibration workshops should precede each assessment cycle, not follow it. New assessors must complete a calibration exercise before conducting their first live assessment. The calibration results, including each assessor's scores, the agreed reference scores, and any adjustments to scoring guidance, are retained as compliance evidence.
All assessors who will score risks for any AI system in the portfolio should participate in annual calibration workshops. Workshops are convened by the AI Governance Lead to ensure cross-system participation. The calibration records become part of the AISDP evidence pack, demonstrating to a notified body or competent authority that the organisation's scoring is systematic and defensible rather than ad hoc. A consistent, calibrated scoring approach is one of the clearest signals of a mature risk management system.
Why does the composite use the worst-case impact dimension rather than an average?
Averaging can dilute a catastrophic fundamental rights impact with low operational impact, producing a misleadingly moderate composite score. The worst-case approach reflects the EU AI Act's emphasis on protecting the most affected dimension.
What composite score triggers mandatory mitigation?
Twelve or above on a twenty-five-point scale is typical, but the threshold must be documented in the organisation's risk management policy and applied consistently across all systems. The AI Governance Lead approves the threshold.
Do GRC platforms support this scoring model?
Yes. OneTrust, ServiceNow, and Archer provide configurable matrices, automated composite calculation, and audit trails. Spreadsheet-based approaches also work if they capture all required fields, including per-dimension rationale.
How does semi-quantitative Bayesian scoring differ from point estimates?
Instead of a single point estimate, each assessor provides a probability distribution across the five likelihood levels. The distributions are aggregated to produce an expected value and confidence interval, making disagreement and uncertainty visible.
How is scoring kept consistent and defensible?
Through annual calibration workshops using standardised reference scenarios from enforcement actions and the AI Incident Database, with Bayesian distributional scoring for high-uncertainty risks.
What evidence must accompany each score?
Written rationale citing testing frequency, population exposure, comparable incidents, mitigation maturity, and operational history, explaining why the chosen level is correct rather than the adjacent levels.
When should calibration workshops run?
Before each assessment cycle, not after. New assessors must complete calibration before their first live assessment. Cross-system calibration runs annually for organisations with multiple high-risk systems.