Leaderboard
How well do AI agents identify risk gates across 18 enterprise scenarios? Ranked by A-gate recall, then precision, then calibration — because a missed risk gate is more dangerous than a false alarm.
| Rank | Model | Method | A-Gate Recall | A-Gate Precision | FN | FP | Cal | Wall Time |
|---|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | subagent | 100% | 100% | 0 | 0 | 87% | — |
| 2 | Gemini 2.5 Flash Lite | api | 100% | 94% | 0 | 1 | 60% | 1m 11s |
| 3 | Qwen3 235B | api | 100% | 88% | 0 | 2 | 66% | 10m 13s |
| 4 | Claude Sonnet 4.6 | subagent | 92% | 79% | 1 | 3 | 39% | — |
| 5 | Claude Opus 4.6 | manual | 87% | 100% | 2 | 0 | 89% | — |
| 6 | MiniMax M2.7 | api | 87% | 87% | 2 | 2 | 68% | 20m 27s |
| 7 | Grok 4.1 Fast | api | 87% | 87% | 2 | 2 | 67% | 8m 23s |
| 8 | DeepSeek v3.2 | api | 80% | 92% | 3 | 1 | 61% | 21m 17s |
| 9 | Hunter Alpha (1T, stealth) | api | 73% | 79% | 4 | 3 | 43% | 17m 14s |
| 10 | Poolside Laguna XS 2 | api | 73% | 73% | 4 | 4 | 48% | 2m 6s |
| 11 | Qwen3.6 Plus | api | 67% | 91% | 5 | 1 | 59% | 51m 22s |
| 12 | Healer Alpha (omni, stealth) | api | 60% | 75% | 6 | 3 | 49% | 6m 49s |
| 13 | GPT-5.4 Nano | api | 53% | 100% | 7 | 0 | 50% | 1m 42s |
| 14 | Arcee Trinity (free, baseline) | api | 53% | 80% | 7 | 2 | 48% | 4m 21s |
| 15 | Nvidia Nemotron 3 Nano Omni 30B | api | 36% | 71% | 9 | 2 | 34% | 2m 18s |
| 16 | Claude Haiku 4.5 | subagent | 7% | 50% | 14 | 1 | 6% | — |
- A-Gate Recall: Of the hard A gates (Reg=A, Blast=A) that should fire, how many did? 100% means no missed gates. Primary sort key.
- A-Gate Precision: Of the hard A gates that fired, how many were correct? 100% means no false alarms. Secondary sort key.
- FN (false negatives): Missed a hard A gate that should have fired. This is dangerous — a risk goes undetected.
- FP (false positives): Fired a hard A gate that shouldn't have. Conservative but wrong — adds review burden.
- Cal (calibration): Exact level match vs human reference across all 7 risk dimensions. Higher means better severity calibration. Tiebreaker.
- Wall Time: Wall-clock time to complete the full benchmark run (39 calls for API models, 18 for subagent/manual).
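For readers reproducing the scoring, here is a minimal sketch of how these columns can be computed from per-scenario gate decisions. The data layout (`should_fire`, `fired`, per-dimension level lists) is an illustrative assumption, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class GateResult:
    should_fire: bool       # human reference: a hard A gate (Reg=A or Blast=A) is present
    fired: bool             # the model's gate decision
    levels: list[str]       # model's level for each of the 7 risk dimensions
    ref_levels: list[str]   # human reference levels for the same dimensions

def score(results: list[GateResult]):
    tp = sum(r.fired and r.should_fire for r in results)
    fn = sum(r.should_fire and not r.fired for r in results)   # missed hard gates
    fp = sum(r.fired and not r.should_fire for r in results)   # false alarms
    recall = tp / (tp + fn) if (tp + fn) else 0.0              # primary sort key
    precision = tp / (tp + fp) if (tp + fp) else 0.0           # secondary sort key
    # Calibration: share of exact level matches across all dimensions (tiebreaker).
    pairs = [(m, h) for r in results for m, h in zip(r.levels, r.ref_levels)]
    calibration = sum(m == h for m, h in pairs) / len(pairs) if pairs else 0.0
    return recall, precision, fn, fp, calibration
```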
| # | Model | Method | F2 | A-Gate Recall | Calibration | FN |
|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | subagent | 100% | 100% | 87% | 0 |
| 2 | Gemini 2.5 Flash Lite | api | 99% | 100% | 60% | 0 |
| 3 | Qwen3 235B | api | 97% | 100% | 66% | 0 |
| 4 | Claude Sonnet 4.6 | subagent | 89% | 92% | 39% | 1 |
| 5 | Claude Opus 4.6 | manual | 89% | 87% | 89% | 2 |
| 6 | MiniMax M2.7 | api | 87% | 87% | 68% | 2 |
| 7 | Grok 4.1 Fast | api | 87% | 87% | 67% | 2 |
| 8 | DeepSeek v3.2 | api | 82% | 80% | 61% | 3 |
| 9 | Hunter Alpha (1T, stealth) | api | 74% | 73% | 43% | 4 |
| 10 | Qwen3.6 Plus* | api | 70% | 67% | 59% | 5 |
| 11 | Healer Alpha (omni, stealth) | api | 62% | 60% | 49% | 6 |
| 12 | GPT-5.4 Nano | api | 59% | 53% | 50% | 7 |
| 13 | Arcee Trinity (free) | api | 57% | 53% | 48% | 7 |
| 14 | Claude Haiku 4.5* | subagent | 8% | 7% | 6% | 14 |
* Qwen3.6 Plus: NEW
* Claude Haiku 4.5: Instruction-following failure — see analysis below
Qwen3.6 Plus joins at rank 10 with 70% F2, 67% recall, sleepy bias — but the older Qwen3-235B-A22B (rank 3, 100% recall) outperforms it by 33 points. Wall time is not comparable (free-tier rate limiting inflated benchmark duration). Qwen3.5 Plus params (397B/17B) are known; 3.6 Plus params are undisclosed. DeepSeek v3.2 drops to rank 8 after a failed eval. 14 models across 4 tiers.
Qwen3.6 Plus (67% recall) significantly underperforms Qwen3-235B-A22B (100% recall) — a 33-point gap. Qwen3.6 Plus (hybrid architecture: linear attention + sparse MoE, always-on CoT, 1M context, 78.8% SWE-bench) is optimized for agentic coding but underperforms on safety gate detection. Qwen3-235B-A22B (235B total / 22B activated, 94 layers, 128 experts, 262K context, Apache 2.0 license) remains the clear choice for gate detection despite being older.
The reported 51-minute wall time is not a fair benchmark — the free-tier endpoint was heavily rate-limited with exponential backoff, inflating total time. Actual per-call latency (~89s on the paid endpoint) suggests it is fast, but a clean paid-tier run is needed to confirm. What is confirmed: its 67% recall / 91% precision reflects a "sleepy" bias — it errs toward under-flagging risks. For screening where false alarms are costly, the precision is appealing, but the 33% miss rate is a liability for safety-critical gate detection.
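To see why throttling dominates the reported wall time, consider a standard retry loop with exponential backoff; the parameters below are illustrative, since the harness's actual retry policy is not documented.

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the provider's HTTP 429 error."""

def call_with_backoff(call, max_retries=6, base_delay=2.0):
    # On each 429 the wait doubles: ~2s, 4s, 8s, 16s, ...
    # Across a 39-call run this idle time can dwarf per-call model latency,
    # which is why the 51-minute figure says little about the model itself.
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 1))  # jitter
    raise RuntimeError("rate-limit retries exhausted")
```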
DeepSeek v3.2 completed only 17/18 scenarios, missing the graceful_degradation scenario entirely. This pulled its recall down to 80% from the previous 87%, which had relied on imputed data for the missing scenario. The benchmark now reflects the real-world completion rate — models that fail to produce valid outputs are scored accordingly.
The leaderboard continues to show clear performance bands. Top tier (100% recall): Opus subagent, Gemini, Qwen3 235B. Strong tier (87–92% recall): Sonnet, Opus manual, MiniMax, Grok. Moderate tier (53–80% recall): DeepSeek, Hunter, Qwen3.6 Plus, Healer, GPT-5.4 Nano, Arcee. Broken: Haiku. Qwen3.6 Plus lands in the moderate tier alongside other sleepy models.
| # | Model | Method | F2 | A-Gate Recall | Calibration | FN |
|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | subagent | 100% | 100% | 87% | 0 |
| 2 | Gemini 2.5 Flash Lite | api | 99% | 100% | 60% | 0 |
| 3 | Qwen3 235B | api | 97% | 100% | 66% | 0 |
| 4 | Claude Opus 4.6 | manual | 89% | 87% | 89% | 2 |
| 5 | Claude Sonnet 4.6 | subagent | 89% | 92% | 39% | 1 |
| 6 | Grok 4.1 Fast | api | 87% | 87% | 67% | 2 |
| 7 | MiniMax M2.7 | api | 87% | 87% | 68% | 2 |
| 8 | DeepSeek v3.2 | api | 82% | 80% | 61% | 3 |
| 9 | Hunter Alpha (Xiaomi MiMo-V2-Pro) | api | 74% | 73% | 43% | 4 |
| 10 | Healer Alpha (Xiaomi MiMo-V2-Omni) | api | 62% | 60% | 49% | 6 |
| 11 | GPT-5.4 Nano | api | 59% | 53% | 50% | 7 |
| 12 | Arcee Trinity | api | 57% | 53% | 48% | 7 |
| 13 | Claude Haiku 4.5* | subagent | 8% | 7% | 6% | 14 |
* Claude Haiku 4.5: Instruction-following failure — see analysis below
Six new API models join the leaderboard: Gemini 2.5 Flash Lite (recall 100%), Qwen3 235B (recall 100%), MiniMax M2.7 (recall 87%), Grok 4.1 Fast (recall 87%), DeepSeek v3.2 (recall 80%), and GPT-5.4 Nano (recall 53%). Gemini and Qwen3 achieve perfect hard gate recall — matching Opus subagent on the most important metric. GPT-5.4 Nano is the first OpenAI model on the leaderboard: it never false-alarms (100% precision) but misses 7 of the 15 hard gates (53% recall, sleepy bias), ranking just above Arcee Trinity. It is the second-fastest model at 102 seconds and among the cheapest at $0.03/eval, but its sleepy bias makes it unsuitable for safety-critical use. The field has grown to 13 models across 4 performance tiers.
Both Gemini 2.5 Flash Lite and Qwen3 235B achieve 100% hard gate recall — catching every safety-critical gate across all 18 evaluations. This matches Opus subagent performance on the most important metric. Gemini has 1 false positive, Qwen3 has 2. These are the first API-accessible models to demonstrate that perfect gate detection is achievable without subagent isolation or Opus-tier reasoning.
MiniMax M2.7 ties Grok 4.1 Fast on recall and precision (both 87%/87%) and slightly outperforms it on calibration (68% vs 67%). However, it required a max_tokens bump to 4096 (from the default 1024) to produce complete outputs, and its wall time of 1227 seconds (20+ minutes) is the slowest in the field — over twice as long as Qwen3 (613s) and 17x slower than Gemini (71s). For batch evaluations this may not matter, but for interactive use the latency is significant.
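For anyone re-running M2.7, the fix is a one-line request change. The sketch below assumes an OpenAI-compatible endpoint; the base URL and model slug are placeholders, not confirmed harness configuration.

```python
from openai import OpenAI

# Assumed OpenRouter-style endpoint and model slug; adjust to your provider.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-...")

scenario_prompt = "..."  # one of the benchmark scenarios

response = client.chat.completions.create(
    model="minimax/minimax-m2.7",  # hypothetical slug
    messages=[{"role": "user", "content": scenario_prompt}],
    max_tokens=4096,  # the harness default of 1024 truncated M2.7's outputs
)
print(response.choices[0].message.content)
```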
Gemini 2.5 Flash Lite completes the benchmark in 71 seconds. MiniMax M2.7 takes 1227 seconds. Between them: Healer Alpha (455s), Grok (503s), Arcee (505s), Qwen3 (613s), DeepSeek (991s), and Hunter Alpha (1185s). Speed does not correlate with accuracy — Gemini is both the fastest and the most accurate API model. Organizations choosing between models of similar accuracy can now factor in throughput.
The anonymous "stealth" models on OpenRouter have been identified as Xiaomi's MiMo-V2 series. Hunter Alpha is MiMo-V2-Pro, Xiaomi's flagship 1T-parameter MoE model (42B activated) designed as an agent brain. Healer Alpha is MiMo-V2-Omni, the multimodal variant. Hunter Alpha was previously the community's most-guessed candidate for DeepSeek V4. On ARA Eval, Hunter scores 73% recall (moderate) and Healer scores 60% (weak) — both well below the top API models despite Hunter's strong performance on coding benchmarks.
MiniMax M2.7 (68%), Grok (67%), and Qwen3 (66%) now cluster at 66–68% calibration — a meaningful improvement over the previous API cohort (43–49%) but still far from Opus manual at 89%. The gap has narrowed from ~40 points to ~21 points. Getting binary gate detection right is now solved by multiple models; severity calibration across 7 dimensions remains the frontier.
The first OpenAI model on the leaderboard scores 53% recall with 100% precision — it never raises a false alarm, but misses 7 of 15 hard gates. At 102 seconds and $0.03/eval it is the second-fastest and among the cheapest models tested. Its "sleepy" bias makes it the mirror image of Sonnet's "jittery" bias: Nano under-detects risk while Sonnet over-triggers. For screening where false alarms are costly, Nano's precision is appealing — but for safety-critical gate detection, its 53% recall is disqualifying.
The leaderboard now has clear performance bands. Top tier (100% recall): Opus subagent, Gemini, Qwen3. Strong tier (87–92% recall): Opus manual, Sonnet, Grok, MiniMax. Moderate tier (53–80% recall): DeepSeek, Hunter, Healer, GPT-5.4 Nano, Arcee. Broken: Haiku. Organizations can make informed trade-offs between gate accuracy, calibration quality, speed, and API availability.
| # | Model | Method | F2 | A-Gate Recall | Calibration | FN |
|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | subagent | 100% | 100% | 87% | 0 |
| 2 | Claude Opus 4.6 | manual | 89% | 87% | 89% | 2 |
| 3 | Hunter Alpha | api | 74% | 73% | 43% | 4 |
| 4 | Healer Alpha | api | 62% | 60% | 49% | 6 |
| 5 | Arcee Trinity | api | 57% | 53% | 48% | 7 |
Models are ranked by F2 score, which weights recall 4× over precision — because a missed risk gate is more dangerous than a false alarm. The F2 ordering is corroborated by hard_gate_recall and false_negative counts independently, making the ranking robust across metrics. There is a sharp performance cliff between Opus-tier models (F2 ≥ 89%) and the rest (F2 ≤ 74%), with no middle ground. The non-Opus models cluster near coin-flip on dimension-level accuracy (43–49%), suggesting the current benchmark separates "gets it" from "guessing" without much gradation. All API models currently complete 18/18 evaluations reliably and are free to use — so the comparison is purely on classification quality.
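For reference, F2 is the β = 2 case of the Fβ score, in which recall carries β² = 4 times the weight of precision. A minimal implementation:

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 2 weights recall 4x.
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hunter Alpha's 79% precision and 73% recall yield F2 of about 0.74,
# matching its leaderboard entry.
print(round(f_beta(0.79, 0.73), 2))  # 0.74
```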
Claude Opus 4.6 achieves perfect gate recall (100%) when run as 18 isolated subagent evaluations, but drops to 87% in single-pass manual mode — missing 2 gates. The manual approach slightly outperforms on dimension-level calibration (89% vs 87%), suggesting cross-scenario context helps severity assessment even as it introduces gate-detection noise.
Hunter Alpha misses the fewest gates of any API model (4 FN, vs 6 for Healer and 7 for Arcee), giving it the strongest safety-critical performance in the free tier. At 74% F2, it is the recommended default for anyone evaluating real scenarios — it misses fewer dangerous gates. If you need a model that errs on the side of catching risk, Hunter is the clear choice among currently-free options.
Arcee has the highest precision among API models (80%) and the most persona differentiation (69%), which makes results feel more interesting — the personalities disagree more. But the subagent results showed that high differentiation can be noise, not insight: Claude at 31% differentiation only disagreed where it should, and scored perfect on gating. Arcee's real strength is as a stable, permanently-free baseline — useful as a fallback when stealth models inevitably disappear.
Healer sits between Hunter and Arcee on most metrics but leads on none. Worse recall than Hunter (60% vs 73%), worse precision than Arcee (75% vs 80%), and similar dimension match to both (~49%). Currently free like the other stealth models, but offers no differentiated advantage.
Lower-performing models show higher persona differentiation (60–69%) compared to Opus (26–31%). Rather than indicating genuine perspective diversity, this likely reflects output variance — the models are inconsistent across personas, not genuinely surfacing different viewpoints. Claude's lower differentiation was concentrated on borderline regulatory calls where CO/CRO disagreement is expected.
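The benchmark's exact definition of persona differentiation is not shown here; a plausible reconstruction (an assumption, not the published formula) is the share of scenario-dimension calls on which the personas fail to agree, which is exactly the quantity that output variance inflates:

```python
def persona_differentiation(scenarios: list[dict]) -> float:
    # Each scenario maps a persona name (e.g. CO, CRO) to its 7 dimension levels.
    # Hypothetical metric: fraction of dimensions where the personas disagree.
    disagree = total = 0
    for persona_levels in scenarios:
        for levels in zip(*persona_levels.values()):  # one tuple per dimension
            total += 1
            disagree += len(set(levels)) > 1
    return disagree / total if total else 0.0
```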
At 43–49% dimension match, all non-Opus API models are near coin-flip on severity calibration across 7 risk dimensions. These models are valuable for learning the framework, running demos, and generating discussion — but nobody should make actual governance decisions based on their output alone. The benchmark is currently best used as a teaching tool.