
AI as a Justice-System Risk Multiplier: (1) implementation mistakes, (2) questionable “fit” to the real-world population, and (3) legal/ethical fragility around data and proxies.

The model becomes a quiet policy lever, nudging outcomes across thousands of cases while leaving only a faint trace of how much it influenced each decision.

When a “Small” Formula Error Becomes a Justice-System Risk Multiplier

by ChatGPT-5.2

In February 2026, Dutch probation services (“de reclassering”) were confronted with a blunt conclusion from the Justice and Security Inspectorate: their use of key recidivism-risk algorithms was not responsible, and the defects could have negative consequences for society, suspects, and convicted people.

At the center of the story is OXREC, an algorithmic risk tool used systematically in probation advice to prosecutors and judges—and reportedly applied around 44,000 times per year.

The Inspectorate found software implementation errors dating back to OXREC’s 2018 introduction—specifically, formulas for detainees and non-detainees were swapped, and incorrect numbers were used, producing systematically distorted risk estimates (often too low), especially affecting people with drug use issues and severe mental health conditions such as psychosis.
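To see why a swapped formula distorts results systematically rather than randomly, consider a toy sketch. OXREC's actual formulas and coefficients are not public in this piece, so everything below — the coefficient values, feature names, and logistic form — is an illustrative assumption; only the structure of the bug (two populations, two formulas, one wiring error) mirrors the Inspectorate's finding.

```python
import math

# Hypothetical coefficient sets -- NOT OXREC's real formulas.
# The point is structural: two populations, two formulas, one wiring bug.
COEFS = {
    "detained":     {"intercept": -1.2, "prior_convictions": 0.35, "age": -0.04},
    "not_detained": {"intercept": -2.0, "prior_convictions": 0.20, "age": -0.02},
}

def risk_score(features, status, swapped=False):
    """Logistic risk estimate; swapped=True simulates the implementation bug."""
    lookup = status
    if swapped:  # the bug: each group silently receives the other group's formula
        lookup = "not_detained" if status == "detained" else "detained"
    c = COEFS[lookup]
    z = (c["intercept"]
         + c["prior_convictions"] * features["prior_convictions"]
         + c["age"] * features["age"])
    return 1.0 / (1.0 + math.exp(-z))

person = {"prior_convictions": 4, "age": 30}
correct = risk_score(person, "detained")            # ~0.27 under these toy coefficients
buggy = risk_score(person, "detained", swapped=True)  # ~0.14 -- systematically lower
# Every case routed through the wrong formula is shifted in the same direction,
# mirroring the "often too low" pattern the Inspectorate described.
```

Note that a single `assert buggy < correct` on a known profile would have exposed this class of bug on day one — which is the governance point, not the arithmetic.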

A separate report and media coverage highlighted that this mis-estimation had been present for years and tied it to a basic “formula mistake” that should have been caught by routine testing and governance.

The Inspectorate also raised a second, deeper problem: even if the math were correct, the tool’s design and deployment context still mattered. OXREC relied on outdated data, was developed for a different target population than the one it was being used on, and did not comply with privacy law requirements (as summarized by the Inspectorate).

On top of that, it used variables like “neighborhood score” and “income level,” which can function as proxies that risk discrimination; the Inspectorate noted prior guidance that such variables are generally prohibited unless necessity is justified and extra safeguards are put in place.

That combination—(1) implementation mistakes, (2) questionable “fit” to the real-world population, and (3) legal/ethical fragility around data and proxies—creates a justice-system failure mode that is easy to underestimate: it doesn’t require bad intent, sophisticated adversaries, or exotic AI. It only requires scale. Once a tool is embedded in routine workflows, small technical errors become systemic distortions. The model becomes a quiet policy lever, nudging outcomes across thousands of cases while leaving only a faint trace of how much it influenced each decision.

Why this matters more than “one broken tool”

In criminal justice, risk scoring isn’t a harmless admin convenience. It is upstream of liberty and downstream of social bias. Even when framed as “advisory,” a score can influence how seriously a case is treated, what conditions are recommended, how much surveillance is deemed “reasonable,” and whether rehabilitation support is offered or withheld. And because probation advice sits at a junction between people and institutions—courts, prosecutors, prisons, treatment providers—errors can propagate laterally: a flawed score may indirectly shape multiple decisions by multiple actors, each assuming the other has verified the underlying logic.

This is also why the Inspectorate’s critique should be read as a governance indictment, not just a software bug report. A swapped formula is not simply an engineering mishap; it is evidence that the system lacked adequate controls for validation, monitoring, change management, and accountability. In safety-critical domains, you don’t “discover” a multi-year formula inversion by accident—you prevent it, detect it quickly, or you have an institutional blind spot.

All possible negative consequences

Below is a full-spectrum list of plausible harms that can flow from this kind of situation—whether the error pushes estimates too low, too high, or inconsistently across subgroups.

1) Public safety and victim-impact risks

  • Under-supervision of genuinely higher-risk individuals, increasing the chance of reoffending and victimization.

  • Mis-targeted interventions: resources and attention go to the wrong people, while those who need structured support most may not receive it.

  • False reassurance: institutions and communities believe risk is “managed” when the core measurement is defective.

2) Liberty, proportionality, and fairness harms to individuals

  • Disproportionate restrictions: if some people are wrongly classified upward, they can face harsher conditions (reporting, electronic monitoring, exclusion zones, treatment mandates) than necessary.

  • Deprivation of support: if risk is wrongly classified downward, individuals may lose access to interventions (addiction support, mental-health services, structured programs) that could reduce relapse and reoffending—so “too lenient” can still be harmful.

  • Unequal impact on vulnerable groups: the reported skew affecting people with drug use issues and serious mental health conditions can amplify existing marginalization and deepen the cycle of instability.

  • Proxy discrimination: neighborhood and income variables can encode structural inequality; without strict necessity tests and safeguards, the tool can effectively “launder” socioeconomic disadvantage into risk scoring.

3) Due process and contestability harms

  • Opaque decision support: defendants and counsel may struggle to understand or challenge how an assessment was produced, especially when the system is treated as “scientific.”

  • Procedural unfairness: if a tool influences advice and outcomes, but the affected person cannot meaningfully contest it, the process can become substantively biased even if formally legal.

  • Retrospective uncertainty: when errors persist for years, it becomes difficult to reconstruct what happened case-by-case, creating a “backlog of doubt.”

4) Systemic integrity and professional practice harms

  • Automation bias: staff may overweight the tool, especially under time pressure, leading to degraded judgment quality.

  • Deskilling: over time, professionals rely on the score rather than cultivating the ability to reason through risk factors independently.

  • Accountability diffusion: responsibility spreads thinly across tool builders, implementers, and users (“we only advise,” “the model did it,” “the judge decides”), weakening incentives to fix root causes.

  • Institutional mistrust: internal confidence collapses, creating either overcorrection (rejecting all analytics) or cynicism (“nothing works; just follow the form”).

5) Legal, regulatory, and reputational harms

  • Privacy and data-protection exposure: if the processing lacks a robust legal basis, transparency, minimization, DPIAs, and governance, enforcement risk increases.

  • Administrative-law vulnerability: flawed tools can create grounds for complaint, review, or litigation—especially if their influence is material and insufficiently disclosed or controlled.

  • Procurement and vendor lock-in harms: weak contracts and limited audit rights can make it hard to prove what went wrong, when it went wrong, and who is responsible.

  • Reputational damage: public perception of “algorithmic justice” becomes synonymous with incompetence or unfairness, even when professionals act in good faith.

6) Political and societal harms

  • Legitimacy erosion: if people believe risk tools systematically fail or discriminate, confidence in rule-of-law institutions declines.

  • Polarization and backlash: the debate can swing between “ban all algorithms” and “trust the science,” leaving little room for practical governance reforms.

  • Chilling effect on innovation: responsible experimentation becomes harder because failures poison the well for future, better-governed approaches.

7) Operational disruption and hidden costs

  • Tool suspension shock: pausing OXREC forces abrupt workflow changes, retraining, and alternative instruments that may not be comparable.

  • Audit and remediation burden: institutions must investigate historical outputs, communicate with stakeholders, and implement new controls—often while continuing daily operations.

  • Data and documentation debt: gaps in logging, documentation, and traceability become painfully visible, and expensive to fix.

What regulators should do next

If regulators treat this as a one-off scandal, they’ll miss the real lesson: high-impact algorithms require safety-style regulation and continuous oversight. Below are concrete regulatory measures that match the failure modes revealed here.

1) Regulate justice risk tools as safety-critical systems

  • Require formal software assurance: version control, reproducible builds, automated testing, regression tests, and independent verification/validation.

  • Mandate incident response protocols comparable to those in aviation, healthcare, or critical infrastructure—because the stakes (liberty + public safety) justify it.
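What "formal software assurance" means in practice can be sketched with a golden-value regression test. The scoring function below is a hypothetical stand-in (in a real pipeline the test would import the production implementation); the pattern — pin known-good outputs and cross-population invariants so a silent formula swap fails CI immediately — is the technique the bullet calls for.

```python
import unittest

# Hypothetical stand-in for the production risk model under test.
def risk_score(prior_convictions, detained):
    base = 0.30 if detained else 0.15
    return min(1.0, base + 0.05 * prior_convictions)

class GoldenValueRegressionTest(unittest.TestCase):
    """Pin known-good outputs so a formula swap or coefficient error fails fast."""

    def test_populations_use_distinct_formulas(self):
        # If the detainee and non-detainee formulas were ever swapped or
        # merged, this invariant breaks immediately.
        self.assertGreater(risk_score(2, detained=True),
                           risk_score(2, detained=False))

    def test_golden_values(self):
        # Values validated once against the published model specification.
        self.assertAlmostEqual(risk_score(2, detained=True), 0.40)
        self.assertAlmostEqual(risk_score(2, detained=False), 0.25)

# Run with: python -m unittest <this_module>
```

The golden values force a deliberate, documented decision whenever model behavior changes — which is exactly the change-management control the Inspectorate found missing.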

2) Move from “approval” to “lifecycle supervision”

  • Impose continuous monitoring for model drift, calibration, and subgroup error.

  • Require periodic re-certification and public reporting of key performance and fairness metrics.
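A minimal sketch of what subgroup calibration monitoring could look like, assuming the institution logs (predicted risk, observed outcome, subgroup) triples over a review period. The data and group labels here are invented for illustration; the signal a lifecycle regulator would watch for is a persistent gap between mean predicted risk and the observed rate within any subgroup.

```python
from collections import defaultdict

def subgroup_calibration(records):
    """Compare mean predicted risk with observed outcome rate per subgroup."""
    sums = defaultdict(lambda: {"pred": 0.0, "obs": 0, "n": 0})
    for predicted, reoffended, group in records:
        s = sums[group]
        s["pred"] += predicted
        s["obs"] += int(reoffended)
        s["n"] += 1
    return {g: {"mean_predicted": s["pred"] / s["n"],
                "observed_rate": s["obs"] / s["n"]}
            for g, s in sums.items()}

# Invented monitoring sample: (predicted_risk, reoffended, subgroup)
records = [
    (0.2, False, "group_a"), (0.3, False, "group_a"), (0.4, True, "group_a"),
    (0.2, True, "group_b"), (0.3, True, "group_b"), (0.2, False, "group_b"),
]
report = subgroup_calibration(records)
# group_a: mean_predicted ~0.30 vs observed ~0.33 -- roughly calibrated.
# group_b: mean_predicted ~0.23 vs observed ~0.67 -- risk badly under-predicted,
# the kind of subgroup skew that should trigger review or suspension.
```

In production this would run on far larger samples with confidence intervals and alerting thresholds, but the reporting unit — calibration per subgroup, not just overall accuracy — is the substantive requirement.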

3) Enforce strict “fit-for-purpose” and population alignment

  • Prohibit using a tool outside the population it was developed for unless revalidated for the new population.

  • Treat outdated datasets as a trigger for mandatory review and potential suspension.

4) Hardline rules for proxy variables and discrimination risk

  • Require a necessity-and-proportionality justification for variables like neighborhood and income, plus:

    • documented rationale,

    • subgroup impact analysis,

    • mitigations (removal, transformation, fairness constraints),

    • and safeguards for vulnerable populations.

  • If the justification cannot be made, the variables should be banned in practice, not merely discouraged.
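One concrete form a subgroup impact analysis can take: score an audit sample with and without the candidate proxy variable and compare the per-group shift. All numbers and group labels below are invented; the red flag is a score shift concentrated in one group, suggesting the variable encodes group membership rather than individual behavior.

```python
def mean_score_by_group(scores, groups):
    """Mean risk score per group label."""
    totals = {}
    for s, g in zip(scores, groups):
        n, t = totals.get(g, (0, 0.0))
        totals[g] = (n + 1, t + s)
    return {g: t / n for g, (n, t) in totals.items()}

# Invented audit sample: same four cases scored by two model variants.
groups        = ["A", "A", "B", "B"]
with_proxy    = [0.50, 0.54, 0.71, 0.69]  # model including e.g. neighborhood score
without_proxy = [0.52, 0.55, 0.58, 0.56]  # same model with the proxy removed

shift = {g: mean_score_by_group(with_proxy, groups)[g]
            - mean_score_by_group(without_proxy, groups)[g]
         for g in set(groups)}
# shift["A"] ~ -0.015 (negligible); shift["B"] ~ +0.13 -- the proxy raises
# risk only for group B, which is the pattern that must be justified or banned.
```

This is deliberately simple; a full analysis would also test calibration and error rates per group. But even this minimal check makes "necessity" an empirical question with a documented answer, rather than a box to tick.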

5) Make contestability enforceable

  • Ensure affected people can access meaningful explanations: what factors mattered, the model version used, known limitations, and how to challenge or contextualize outcomes.

  • Require that algorithmic influence be traceable: what score was generated, who saw it, where it was recorded, and how it flowed into advice.
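The traceability requirement can be sketched as an append-only audit record written each time a score is generated. The field names below are illustrative, not an official schema; the substantive requirements are the exact model version (so historical errors can be traced to affected cases), the human who saw the score, and the advice document it flowed into.

```python
import json
import hashlib
import datetime

def audit_record(case_id, score, model_version, viewed_by, advice_doc_id):
    """Build a tamper-evident record of one score and its downstream path."""
    record = {
        "case_id": case_id,
        "score": score,
        "model_version": model_version,  # exact version, for later reconstruction
        "generated_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "viewed_by": viewed_by,          # who saw the score
        "recorded_in": advice_doc_id,    # where it flowed into advice
    }
    # Checksum over the canonical serialization makes later edits detectable.
    payload = json.dumps(record, sort_keys=True)
    record["checksum"] = hashlib.sha256(payload.encode()).hexdigest()
    return record

entry = audit_record("C-2026-0114", 0.37, "model-2.4.1", "advisor_317", "ADV-8842")
```

With records like this, the "backlog of doubt" after a multi-year error becomes a query ("which cases were scored by version X?") instead of an unanswerable reconstruction exercise.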

6) Force auditability and accountability through procurement

  • Mandate standard contract clauses: audit rights, defect notification duties, documentation access, logging requirements, liability allocation, and exit/transition provisions.

  • Require that public institutions remain able to reconstruct the logic and version history years later.

7) Treat “human in the loop” as a measurable control, not a slogan

  • Require evidence of effective human oversight: training, structured decision templates that prompt critical evaluation, override documentation, and periodic audits of whether staff are over-relying on scores.

8) Create a national register of high-impact justice algorithms

  • A public register (with appropriate security/privacy limits) listing purpose, legal basis, data sources, key variables, validation results, governance owner, and audit status—so oversight does not depend on leaks, journalism, or “inspection after damage.”

9) Establish mandatory algorithmic incident reporting

  • Define “material algorithmic incident” (e.g., implementation error, drift, fairness breach, legal noncompliance).

  • Require rapid notification to oversight bodies and standardized public transparency reporting.

10) Resource regulators to do real technical oversight

  • Oversight cannot be performative. Regulators need technical capacity—data science, software assurance, privacy engineering, and domain expertise—or they will be outpaced by the operational reality of algorithm deployment.

The core lesson

This episode is not primarily about whether algorithms should exist in criminal justice. It’s about whether we are willing to govern them at the level their impact demands. When a tool is used at scale in decisions touching liberty and safety, it is no longer “decision support.” It is part of the justice system’s operating machinery. And machinery needs inspection-grade controls—before the damage compounds.
