Risk scoring translates qualitative judgements into comparable, auditable ratings that drive treatment priority. Under the EU AI Act, risks are assessed against four impact dimensions: health and safety, fundamental rights, operational integrity, and reputational exposure. This page covers the scoring rubrics, composite calculation, calibration methodology, and the evidence standards that make risk scores defensible during conformity assessment.
Risks are scored using a likelihood-impact matrix that translates qualitative judgements into comparable, auditable ratings. For AI systems under the EU AI Act, the AI System Assessor rates impact against four dimensions: health and safety, fundamental rights, operational integrity, and reputational exposure. A five-by-five matrix is standard, with likelihood rated from rare (one) to almost certain (five) and impact from negligible (one) to catastrophic (five).
The composite risk score is the product of the likelihood rating and the highest impact rating across the four dimensions. This worst-case dimension approach ensures that a risk with low health impact but catastrophic fundamental rights impact is not diluted by averaging across dimensions. The AI System Assessor records all four impact ratings; the composite score drives the treatment priority, but the individual dimension scores inform the type of mitigation required.
Risks scoring at or above the organisation's defined threshold (typically twelve on a twenty-five-point scale) require specific, documented mitigation measures. Those scoring below the threshold are accepted, with the acceptance recorded and signed by the AI Governance Lead. For the broader risk management framework, see AI Risk Assessment and Management.
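A minimal sketch of the composite calculation and threshold check, following the worst-case dimension rule described below; the field names and the threshold constant are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass

# Example policy threshold on the twenty-five-point scale (an assumption;
# each organisation documents its own in the risk management policy).
MITIGATION_THRESHOLD = 12

@dataclass
class RiskScore:
    likelihood: int               # 1 (rare) to 5 (almost certain)
    health_safety: int            # each impact dimension rated 1 to 5
    fundamental_rights: int
    operational_integrity: int
    reputational_exposure: int

    @property
    def composite(self) -> int:
        # Worst-case dimension approach: likelihood times the highest
        # impact rating, never the average across dimensions.
        worst = max(self.health_safety, self.fundamental_rights,
                    self.operational_integrity, self.reputational_exposure)
        return self.likelihood * worst

    def requires_mitigation(self) -> bool:
        return self.composite >= MITIGATION_THRESHOLD
```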
Each dimension requires a calibrated rubric to ensure consistency across assessors and across systems in the organisation's portfolio. The likelihood scale runs from rare (failure mode not observed in comparable systems, requiring a highly improbable confluence of conditions) through unlikely (documented in comparable systems under materially different conditions), possible (observed under broadly similar conditions or in testing at low frequency), and likely (observed in testing or early production, or a known characteristic of the model architecture) to almost certain (observed repeatedly in operation, expected to recur without additional mitigation).
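The likelihood scale lends itself to being encoded as reference data so that tooling can display the anchor text beside each rating. A hypothetical encoding, with wording abbreviated from the rubric above; the structure is an assumption, not a standard:

```python
# Likelihood levels keyed by rating; each entry pairs the label with
# abbreviated anchor text for display in assessment tooling.
LIKELIHOOD_RUBRIC = {
    1: ("rare", "not observed in comparable systems; requires a highly "
                "improbable confluence of conditions"),
    2: ("unlikely", "documented in comparable systems under materially "
                    "different conditions"),
    3: ("possible", "observed under broadly similar conditions, or in "
                    "testing at low frequency"),
    4: ("likely", "observed in testing or early production, or a known "
                  "characteristic of the model architecture"),
    5: ("almost certain", "observed repeatedly in operation; expected to "
                          "recur without additional mitigation"),
}
```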
The scoring must be accompanied by explicit rationale. "Medium likelihood" alone is insufficient; the assessor must explain why medium rather than high, citing evidence such as the frequency of a failure mode observed during testing, the exposure of the affected population, or comparable incidents in similar systems.
Each impact dimension has its own five-level rubric that anchors abstract labels to concrete consequences, preventing "significant" from meaning different things in different assessments.
Health and safety ranges from negligible (no measurable health or safety consequence) through minor (temporary inconvenience or minor distress to a small number of affected persons), moderate (material adverse effect on wellbeing, reversible with intervention), and significant (serious harm to health, safety, or livelihoods, or moderate harm affecting a large population) to catastrophic (irreversible harm to life, health, or safety, or serious harm affecting a large and vulnerable population).
Fundamental rights ranges from negligible (no discernible effect on any EU Charter right) through minor (marginal effect, correctable through the system's appeal or redress mechanism), moderate (measurable infringement affecting identifiable individuals, requiring active remediation), and significant (systematic infringement affecting a class of persons, particularly those in vulnerable situations) to catastrophic (large-scale or irreversible infringement, or infringement affecting a right of particular sensitivity such as human dignity, non-discrimination, or liberty).
Operational integrity ranges from negligible (no effect on system availability or accuracy) through minor (brief service degradation recoverable without manual intervention), moderate (extended degradation requiring engineering intervention, or accuracy degradation affecting a measurable proportion of decisions), and significant (system outage affecting deployers and affected persons, or accuracy degradation severe enough to undermine the intended purpose) to catastrophic (total system failure, data loss, or compromise of the system's integrity such that outputs cannot be trusted).
Reputational exposure ranges from negligible (internal awareness only) through minor (deployer awareness, no external visibility), moderate (limited external visibility in trade press and specialist publications), and significant (mainstream media coverage, regulatory attention, potential for affected person litigation) to catastrophic (sustained public attention, political scrutiny, regulatory enforcement proceedings, and material effect on commercial relationships).
The composite score uses the worst-case dimension approach: likelihood multiplied by the highest of the four impact ratings. This design decision reflects the EU AI Act's emphasis on protecting the most affected dimension rather than averaging across all dimensions.
A risk rated likely (four) for likelihood with health and safety impact of minor (two), fundamental rights impact of significant (four), operational impact of negligible (one), and reputational impact of moderate (three) produces a composite score of sixteen (four multiplied by four). The composite score drives the treatment priority, but the AI System Assessor retains all four dimension scores because they inform which type of mitigation is needed: a rights-focused risk needs rights-focused controls, not operational improvements.
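The same worked example, sketched in code to show what the worst-case rule protects against; the averaged variant is included only for contrast and is not part of the methodology:

```python
# Values taken from the worked example above.
likelihood = 4
impacts = {"health_safety": 2, "fundamental_rights": 4,
           "operational_integrity": 1, "reputational_exposure": 3}

worst_case = likelihood * max(impacts.values())               # 4 * 4 = 16
averaged = likelihood * sum(impacts.values()) / len(impacts)  # 4 * 2.5 = 10.0

# With a threshold of twelve, worst-case scoring mandates mitigation;
# averaging would let the significant rights impact slip below the line.
```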
The threshold for mandatory mitigation must be documented in the organisation's risk management policy and applied consistently across all systems in the portfolio. For how residual risk is managed after mitigations are applied, see Residual Risk and Acceptability.
Scoring is inherently subjective, and calibration exercises should be run at least annually, presenting assessors with a set of standardised risk scenarios and comparing their scores. Systematic divergences (one assessor consistently scoring likelihood higher than another) are identified by the AI Governance Lead and addressed through shared reference cases and discussion. Where the organisation has multiple high-risk systems, cross-system calibration ensures that a "significant" rating carries the same meaning across the portfolio, enabling meaningful portfolio-level risk reporting to executive leadership.
Calibration workshops use five to ten reference scenarios drawn from published enforcement actions, the AI Incident Database, or internal near-miss events. Assessors score the scenarios independently, then compare results. Divergences are discussed and the group agrees on reference scores for each scenario. These become calibration anchors: when scoring a new risk, assessors compare it to the anchored scenarios, which grounds scoring in concrete reference points.
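A sketch of the divergence check a workshop facilitator might run; the assessor names, scenario identifiers, and scores below are invented for illustration:

```python
from statistics import mean

# Agreed reference scores for each scenario (scenario id -> likelihood rating).
reference = {"s1": 3, "s2": 4, "s3": 2, "s4": 5, "s5": 1}

# Each assessor's independent scores for the same scenarios.
assessor_scores = {
    "assessor_a": {"s1": 3, "s2": 4, "s3": 3, "s4": 5, "s5": 2},
    "assessor_b": {"s1": 2, "s2": 3, "s3": 2, "s4": 4, "s5": 1},
}

# A persistently positive mean deviation suggests the assessor scores
# likelihood systematically higher than the agreed anchors; negative, lower.
for name, scores in assessor_scores.items():
    bias = mean(scores[s] - reference[s] for s in reference)
    print(f"{name}: mean deviation from anchors {bias:+.2f}")
```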
For high-uncertainty risks, where the team genuinely cannot determine whether a failure mode is likely or unlikely, semi-quantitative Bayesian scoring offers a more defensible approach. Each assessor provides a probability distribution across the five likelihood levels (for example, ten per cent rare, thirty per cent unlikely, forty per cent possible, fifteen per cent likely, five per cent almost certain). The distributions are aggregated, and the resulting expected value and confidence interval are reported alongside the risk. This makes uncertainty visible instead of concealing it behind a point estimate.
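One way such aggregation might be implemented, using a simple linear opinion pool (averaging the distributions); the first distribution is the example from the text, while the second assessor's distribution and the choice of a ninety per cent interval are illustrative assumptions:

```python
import numpy as np

# Each row is one assessor's probability distribution over the five
# likelihood levels (rare=1 .. almost certain=5).
levels = np.array([1, 2, 3, 4, 5])
assessor_dists = np.array([
    [0.10, 0.30, 0.40, 0.15, 0.05],
    [0.05, 0.20, 0.45, 0.25, 0.05],
])

# Linear opinion pool: average the distributions across assessors.
pooled = assessor_dists.mean(axis=0)

expected = float(levels @ pooled)   # expected likelihood level
cumulative = np.cumsum(pooled)

# Central ninety per cent interval over the discrete levels.
low = levels[np.searchsorted(cumulative, 0.05)]
high = levels[np.searchsorted(cumulative, 0.95)]
print(f"expected likelihood {expected:.2f}, 90% interval [{low}, {high}]")
```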
Every score must be accompanied by a written rationale citing specific evidence. The rationale is a Module 6 evidence artefact that will be examined during conformity assessment. Assessors who record only the score without the supporting reasoning create an audit vulnerability.
Acceptable evidence includes the frequency of similar failures observed during testing, the exposure of the affected population, comparable incidents documented in the AI Incident Database, the maturity of existing mitigations, and the system's operational history. The rationale must address why the chosen level is correct rather than the adjacent levels, demonstrating that the assessor considered the alternatives.
Scoring patterns across the register should be reviewed by the AI Governance Lead to identify systematic inconsistencies before the assessment is finalised. The scoring matrix either strengthens or undermines the risk assessment's credibility. A matrix that produces consistent, defensible scores across assessors is a compliance asset. Inconsistent scoring suggests the organisation does not understand its own risk profile, and will be treated as such during conformity assessment.
Most GRC platforms (OneTrust, ServiceNow, Archer) provide structured risk scoring with configurable matrices, automated composite calculation, and audit trail retention. These platforms also support calibration workshop management by recording each assessor's independent scores alongside the agreed reference scores.
For organisations not using a GRC platform, a spreadsheet-based approach works if it captures all required fields: risk identifier, likelihood score with rationale, all four impact dimension scores with rationale, composite score, threshold comparison, and treatment decision. The calibration anchors from workshop exercises should be documented in a standing reference document accessible to all assessors.
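A hypothetical schema mirroring those required fields, useful as a checklist whether the register lives in a spreadsheet or a database; the names are illustrative, not a mandated format:

```python
from dataclasses import dataclass

@dataclass
class DimensionScore:
    rating: int      # 1 to 5
    rationale: str   # why this level rather than the adjacent ones

@dataclass
class RiskRegisterRow:
    risk_id: str
    likelihood: DimensionScore
    health_safety: DimensionScore
    fundamental_rights: DimensionScore
    operational_integrity: DimensionScore
    reputational_exposure: DimensionScore
    composite: int            # likelihood rating * highest impact rating
    above_threshold: bool     # composite compared against the policy threshold
    treatment_decision: str   # e.g. "mitigate" or "accept, signed by the AI Governance Lead"
```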
Semi-quantitative Bayesian scoring, where assessors provide probability distributions rather than point estimates, is not natively supported by most GRC platforms. Implementation typically requires a custom tool such as a Python script or a simple web form that collects distributions and computes aggregates. The additional effort is justified for high-stakes risks where the cost of a wrong point estimate is severe. For the risk identification methods that feed into this scoring process, see Risk Identification Methods.
Calibration workshops should precede each assessment cycle, not follow it. New assessors must complete a calibration exercise before conducting their first live assessment. The calibration results, including each assessor's scores, the agreed reference scores, and any adjustments to scoring guidance, are retained as compliance evidence.
All assessors who will score risks for any AI system in the portfolio should participate in annual calibration workshops. Workshops are convened by the AI Governance Lead to ensure cross-system participation. The calibration records become part of the AISDP evidence pack, demonstrating to a notified body or competent authority that the organisation's scoring is systematic and defensible rather than ad hoc. A consistent, calibrated scoring approach is one of the clearest signals of a mature risk management system.
Why does the composite use the worst-case impact dimension rather than an average?
Averaging can dilute a catastrophic fundamental rights impact with low operational impact, producing a misleadingly moderate composite score. The worst-case approach reflects the EU AI Act's emphasis on protecting the most affected dimension.
What composite score triggers mandatory mitigation?
Twelve or above on a twenty-five-point scale is typical, but the threshold must be documented in the organisation's risk management policy and applied consistently across all systems. The AI Governance Lead approves the threshold.
Do GRC platforms support this scoring model?
Yes. OneTrust, ServiceNow, and Archer provide configurable matrices, automated composite calculation, and audit trails. Spreadsheet-based approaches also work if they capture all required fields, including per-dimension rationale.
How does semi-quantitative Bayesian scoring differ from point estimates?
Instead of a single point estimate, each assessor provides a probability distribution across the five likelihood levels. The distributions are aggregated to produce an expected value and confidence interval, making disagreement and uncertainty visible.
How is scoring kept consistent and defensible?
Through annual calibration workshops using standardised reference scenarios from enforcement actions and the AI Incident Database, with Bayesian distributional scoring for high-uncertainty risks.
What evidence must accompany each score?
Written rationale citing testing frequency, population exposure, comparable incidents, mitigation maturity, and operational history, explaining why the chosen level is correct rather than the adjacent levels.
When should calibration workshops run?
Before each assessment cycle, not after. New assessors must complete calibration before their first live assessment. Cross-system calibration runs annually for organisations with multiple high-risk systems.