Rankings
Leaderboard
How well do AI agents identify risk gates across 6 enterprise scenarios? Ranked by F2 score, which weights recall 4x over precision — because a missed gate is more dangerous than a false alarm.
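The F2 weighting can be made concrete with a short sketch. The counts below are inferred from the Opus manual row (100% precision and 2 false negatives imply 13 caught gates out of 15), not taken from the scoring harness itself:

```python
def f2_score(tp: int, fp: int, fn: int) -> float:
    """F-beta with beta=2: recall counts 4x (beta squared) relative to precision."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    beta_sq = 4  # beta = 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)

# Opus manual row: 13 gates caught, 0 false alarms, 2 misses -> F2 of 0.89
print(round(f2_score(tp=13, fp=0, fn=2), 2))  # 0.89
```

With zero false positives, precision is perfect, yet the two misses alone pull F2 down to 0.89; the same two errors as false positives would barely move it. That asymmetry is the point of choosing beta = 2.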
| Rank | Model | Method | F2 | Recall | Precision | FN | FP | Dim Match |
|---|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | subagent | 100% | 100% | 100% | 0 | 0 | 87% |
| 2 | Claude Opus 4.6 | manual | 89% | 87% | 100% | 2 | 0 | 89% |
| 3 | Claude Sonnet 4.6 | subagent | 89% | 92% | 79% | 1 | 3 | 39% |
| 4 | Hunter Alpha (1T, stealth) | api | 74% | 73% | 79% | 4 | 3 | 43% |
| 5 | Healer Alpha (omni, stealth) | api | 62% | 60% | 75% | 6 | 3 | 49% |
| 6 | Arcee Trinity (free, baseline) | api | 57% | 53% | 80% | 7 | 2 | 48% |
| 7 | Claude Haiku 4.5 | subagent | 8% | 7% | 50% | 14 | 1 | 6% |
This snapshot adds Claude Sonnet 4.6 and Claude Haiku 4.5 to the leaderboard, tested via isolated subagent evaluations. Sonnet slots in at F2 0.89 — matching Opus manual mode on gate detection but with significantly lower dimension calibration (39% vs 89%). Haiku scores F2 0.08 after treating its non-compliant outputs as all-missed: it invented its own dimension names in 14 of 18 evaluations, refused one outright, and cheated on another by reading the source files. The Anthropic model family now spans the full range from perfect (Opus subagent, F2 1.0) to near-zero (Haiku, F2 0.08), with Sonnet occupying a credible middle ground. The sharp cliff between Opus and everything else persists — no model has closed the dimension-match gap.
Sonnet achieves 92% gate recall with only 1 false negative (missed regulatory hard gate on insurance-claims CRO). But its dimension match is 39% — worse than every API model tested (43–49%). Sonnet gets the binary "should this gate fire?" right but miscalibrates severity across dimensions. Its high differentiation (62%) likely reflects noise, not insight, mirroring the pattern seen in lower-performing API models.
Sonnet produced 3 false positives on genai-data-leakage, rating Regulatory Exposure at A across all personas when the reference is B. It over-triggers on medium-risk scenarios — the opposite of the "sleepy" bias seen in Arcee and Healer. This makes Sonnet conservative: it catches more gates but also raises more false alarms. For enterprise use, jittery is safer than sleepy, but it adds review burden.
Haiku never used the standard 7 ARA dimension names in any of its 18 evaluations. It invented dimensions like "Autonomy & Control", "Scope Creep", and "Distributional Harm". One evaluation refused outright ("I am not an ARA evaluation judge"). Another cheated by reading the source files and returning the reference fingerprint. Haiku may have reasonable risk intuitions — its recommendations were generally sensible — but it cannot follow the structured output format required by the framework. This makes it unusable as an ARA judge.
Opus-tier models score 87–89% on dimension match. Everything else — Sonnet (39%), Hunter (43%), Healer (49%), Arcee (48%) — clusters between 39–49%. Adding Sonnet to the field confirms this is not a model-family artifact: even within the Claude family, only Opus bridges the gap between gate detection and severity calibration. Dimension match may require a qualitatively different level of reasoning.
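As a rough illustration of what "dimension match" measures, here is a minimal exact-match sketch. The scoring rule (exact letter-grade agreement per dimension) is an assumption, and the dimension names and grades below are hypothetical; only "Regulatory Exposure" appears in the text, and the real ARA scorer may aggregate across personas and scenarios differently:

```python
def dimension_match(predicted: dict[str, str], reference: dict[str, str]) -> float:
    """Fraction of reference dimensions whose predicted severity letter
    matches exactly (an assumed scoring rule, not the official one)."""
    hits = sum(predicted.get(dim) == grade for dim, grade in reference.items())
    return hits / len(reference)

# Hypothetical fingerprint: one Sonnet-style over-rating (A where reference says B)
reference = {"Regulatory Exposure": "B", "Data Sensitivity": "A", "Operational Risk": "C"}
predicted = {"Regulatory Exposure": "A", "Data Sensitivity": "A", "Operational Risk": "C"}
print(round(dimension_match(predicted, reference), 2))  # 0.67
```

Under this rule a model can have perfect gate detection while scoring poorly on dimension match, which is exactly the Sonnet pattern described above.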
Opus subagent: perfect F2 (1.0). Opus manual: strong F2 (0.89) with best dimension match (89%). Sonnet: credible F2 (0.89) with poor calibration (39%). Haiku: unscorable. This gradient is useful — it shows that gate detection is achievable at the Sonnet tier, but dimensional calibration requires Opus-level reasoning. Organizations can use Sonnet for screening and Opus for final assessment.
At 39–49% dimension match, all non-Opus models remain near coin-flip on severity calibration. The addition of Sonnet does not change the fundamental conclusion: these models are valuable for learning the framework, running demos, and generating discussion — but nobody should make actual governance decisions based on their output alone.
| # | Model | F2 | Recall | Dim Match | FN |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 (subagent) | 100% | 100% | 87% | 0 |
| 2 | Claude Opus 4.6 (manual) | 89% | 87% | 89% | 2 |
| 3 | Hunter Alpha (api) | 74% | 73% | 43% | 4 |
| 4 | Healer Alpha (api) | 62% | 60% | 49% | 6 |
| 5 | Arcee Trinity (api) | 57% | 53% | 48% | 7 |
Models are ranked by F2 score, which weights recall 4× over precision — because a missed risk gate is more dangerous than a false alarm. The F2 ordering is corroborated by gate_recall and false_negative counts independently, making the ranking robust across metrics. There is a sharp performance cliff between Opus-tier models (F2 ≥ 0.89) and the rest (F2 ≤ 0.74), with no middle ground. The non-Opus models cluster near coin-flip on dimension-level accuracy (0.43–0.49), suggesting the current benchmark separates "gets it" from "guessing" without much gradation. All API models currently complete 18/18 evaluations reliably and are free to use — so the comparison is purely on classification quality.
Claude Opus 4.6 achieves perfect gate recall (1.0) when run as 18 isolated subagent evaluations, but drops to 0.87 in single-pass manual mode — missing 2 gates. The manual approach slightly outperforms on dimension-level calibration (0.89 vs 0.87), suggesting cross-scenario context helps severity assessment even as it introduces gate-detection noise.
Hunter Alpha catches 3 more gates than the next-best API model (4 FN vs 7), giving it the strongest safety-critical performance in the free tier. At 74% F2, it is the recommended default for anyone evaluating real scenarios — it misses fewer dangerous gates. If you need a model that errs on the side of catching risk, Hunter is the clear choice among currently-free options.
Arcee has the highest precision among API models (80%) and the most persona differentiation (69%), which makes results feel more interesting — the personalities disagree more. But the subagent results showed that high differentiation can be noise, not insight: Claude, at 31% differentiation, disagreed only where disagreement was warranted, and scored perfectly on gating. Arcee's real strength is as a stable, permanently free baseline — useful as a fallback when the stealth models inevitably disappear.
Healer sits between Hunter and Arcee on most metrics but leads on none: worse recall than Hunter (60% vs 73%), worse precision than Arcee (75% vs 80%), and a dimension match (49%) in the same 43–49% band as both. It is currently free like the other stealth models, but offers no differentiated advantage.
Lower-performing models show higher persona differentiation (0.60–0.69) compared to Opus (0.26–0.31). Rather than indicating genuine perspective diversity, this likely reflects output variance — the models are inconsistent across personas, not genuinely surfacing different viewpoints. Claude's lower differentiation was concentrated on borderline regulatory calls where CO/CRO disagreement is expected.
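One plausible way to operationalize differentiation is the share of dimensions on which the personas' grades are not unanimous. This definition is an assumption for illustration; the ratings below are hypothetical, borrowing the CO and CRO persona names from the text:

```python
def differentiation(persona_ratings: dict[str, dict[str, str]]) -> float:
    """Share of dimensions where the personas do not all agree.
    High values may reflect genuine perspective diversity -- or just noise."""
    dims = list(next(iter(persona_ratings.values())))
    split = sum(
        len({ratings[d] for ratings in persona_ratings.values()}) > 1
        for d in dims
    )
    return split / len(dims)

# Hypothetical: CO and CRO split only on the borderline regulatory call
ratings = {
    "CO":  {"Regulatory Exposure": "B", "Data Sensitivity": "A"},
    "CRO": {"Regulatory Exposure": "A", "Data Sensitivity": "A"},
}
print(differentiation(ratings))  # 0.5
```

Note that the metric cannot by itself distinguish principled disagreement from inconsistency; that is why the text cross-checks it against gate recall.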
At 43–49% dimension match, all non-Opus API models are near coin-flip on severity calibration across 7 risk dimensions. These models are valuable for learning the framework, running demos, and generating discussion — but nobody should make actual governance decisions based on their output alone. The benchmark is currently best used as a teaching tool.