Leaderboard
How well do AI agents identify risk gates across 18 enterprise scenarios? Ranked by A-gate recall, then precision, then calibration — because a missed risk gate is more dangerous than a false alarm.
| Rank | Model | Method | A-Gate Recall | A-Gate Precision | FN | FP | Cal | Wall Time |
|---|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | subagent | 100% | 100% | 0 | 0 | 87% | — |
| 2 | Gemini 2.5 Flash Lite | api | 100% | 94% | 0 | 1 | 60% | 1m 11s |
| 3 | Qwen3 235B | api | 100% | 88% | 0 | 2 | 66% | 10m 13s |
| 4 | Claude Sonnet 4.6 | subagent | 92% | 79% | 1 | 3 | 39% | — |
| 5 | Claude Opus 4.6 | manual | 87% | 100% | 2 | 0 | 89% | — |
| 6 | MiniMax M2.7 | api | 87% | 87% | 2 | 2 | 68% | 20m 27s |
| 7 | Grok 4.1 Fast | api | 87% | 87% | 2 | 2 | 67% | 8m 23s |
| 8 | DeepSeek v3.2 | api | 80% | 92% | 3 | 1 | 61% | 21m 17s |
| 9 | Hunter Alpha (1T, stealth) | api | 73% | 79% | 4 | 3 | 43% | 17m 14s |
| 10 | Poolside Laguna XS 2 | api | 73% | 73% | 4 | 4 | 48% | 2m 6s |
| 11 | Qwen3.6 Plus | api | 67% | 91% | 5 | 1 | 59% | 51m 22s |
| 12 | Healer Alpha (omni, stealth) | api | 60% | 75% | 6 | 3 | 49% | 6m 49s |
| 13 | GPT-5.4 Nano | api | 53% | 100% | 7 | 0 | 50% | 1m 42s |
| 14 | Arcee Trinity (free, baseline) | api | 53% | 80% | 7 | 2 | 48% | 4m 21s |
| 15 | Nvidia Nemotron 3 Nano Omni 30B | api | 36% | 71% | 9 | 2 | 34% | 2m 18s |
| 16 | Claude Haiku 4.5 | subagent | 7% | 50% | 14 | 1 | 6% | — |
- A-Gate Recall: Of the hard A gates (Reg=A, Blast=A) that should fire, how many did? 100% means no missed gates. Primary sort key.
- A-Gate Precision: Of the hard A gates that fired, how many were correct? 100% means no false alarms. Secondary sort key.
- FN (false negatives): Missed a hard A gate that should have fired. This is dangerous — a risk goes undetected.
- FP (false positives): Fired a hard A gate that shouldn't have. Conservative but wrong — adds review burden.
- Cal (calibration): Exact level match vs human reference across all 7 risk dimensions. Higher means better severity calibration. Tiebreaker.
- Wall Time: Wall-clock time to complete the full benchmark run (39 calls for API models, 18 for subagent/manual).
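For readers reproducing the scoring, here is a minimal sketch of how these columns can be computed from per-scenario gate decisions. The data layout (`should_fire`, `fired`, per-dimension level lists) is an illustrative assumption, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class GateResult:
    should_fire: bool       # human reference: a hard A gate (Reg=A or Blast=A) is present
    fired: bool             # the model's gate decision
    levels: list[str]       # model's level for each of the 7 risk dimensions
    ref_levels: list[str]   # human reference levels for the same dimensions

def score(results: list[GateResult]):
    tp = sum(r.fired and r.should_fire for r in results)
    fn = sum(r.should_fire and not r.fired for r in results)   # missed hard gates
    fp = sum(r.fired and not r.should_fire for r in results)   # false alarms
    recall = tp / (tp + fn) if (tp + fn) else 0.0              # primary sort key
    precision = tp / (tp + fp) if (tp + fp) else 0.0           # secondary sort key
    # Calibration: share of exact level matches across all dimensions (tiebreaker).
    pairs = [(m, h) for r in results for m, h in zip(r.levels, r.ref_levels)]
    calibration = sum(m == h for m, h in pairs) / len(pairs) if pairs else 0.0
    return recall, precision, fn, fp, calibration
```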
| # | Model | Method | F2 | A-Gate Recall | Calibration | FN |
|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | subagent | 100% | 100% | 87% | 0 |
| 2 | Gemini 2.5 Flash Lite | api | 99% | 100% | 60% | 0 |
| 3 | Qwen3 235B | api | 97% | 100% | 66% | 0 |
| 4 | Claude Sonnet 4.6 | subagent | 89% | 92% | 39% | 1 |
| 5 | Claude Opus 4.6 | manual | 89% | 87% | 89% | 2 |
| 6 | MiniMax M2.7 | api | 87% | 87% | 68% | 2 |
| 7 | Grok 4.1 Fast | api | 87% | 87% | 67% | 2 |
| 8 | DeepSeek v3.2 | api | 82% | 80% | 61% | 3 |
| 9 | Hunter Alpha (1T, stealth) | api | 74% | 73% | 43% | 4 |
| 10 | Qwen3.6 Plus* | api | 70% | 67% | 59% | 5 |
| 11 | Healer Alpha (omni, stealth) | api | 62% | 60% | 49% | 6 |
| 12 | GPT-5.4 Nano | api | 59% | 53% | 50% | 7 |
| 13 | Arcee Trinity (free) | api | 57% | 53% | 48% | 7 |
| 14 | Claude Haiku 4.5* | subagent | 8% | 7% | 6% | 14 |
* Qwen3.6 Plus: NEW
* Claude Haiku 4.5: Instruction-following failure — see analysis below
Qwen3.6 Plus joins at rank 10 with 70% F2, 67% recall, sleepy bias — but the older Qwen3-235B-A22B (rank 3, 100% recall) outperforms it by 33 points. Wall time is not comparable (free-tier rate limiting inflated benchmark duration). Qwen3.5 Plus params (397B/17B) are known; 3.6 Plus params are undisclosed. DeepSeek v3.2 drops to rank 8 after a failed eval. 14 models across 4 tiers.
Qwen3.6 Plus (67% recall) significantly underperforms Qwen3-235B-A22B (100% recall) — a 33-point gap. Qwen3.6 Plus (hybrid architecture: linear attention + sparse MoE, always-on CoT, 1M context, 78.8% SWE-bench) is optimized for agentic coding but underperforms on safety gate detection. Qwen3-235B-A22B (235B total / 22B activated, 94 layers, 128 experts, 262K context, Apache 2.0 license) remains the clear choice for gate detection despite being older.
The reported 51-minute wall time is not a fair benchmark — the free-tier endpoint was heavily rate-limited with exponential backoff, inflating total time. Actual per-call latency (~89s on the paid endpoint) suggests it is fast, but a clean paid-tier run is needed to confirm. What is confirmed: its 67% recall / 91% precision reflects a "sleepy" bias — it errs toward under-flagging risks. For screening where false alarms are costly, the precision is appealing, but the 33% miss rate is a liability for safety-critical gate detection.
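To see why throttling dominates the reported wall time, consider a standard retry loop with exponential backoff; the parameters below are illustrative, since the harness's actual retry policy is not documented.

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the provider's HTTP 429 error."""

def call_with_backoff(call, max_retries=6, base_delay=2.0):
    # On each 429 the wait doubles: ~2s, 4s, 8s, 16s, ...
    # Across a 39-call run this idle time can dwarf per-call model latency,
    # which is why the 51-minute figure says little about the model itself.
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 1))  # jitter
    raise RuntimeError("rate-limit retries exhausted")
```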
DeepSeek v3.2 completed only 17/18 scenarios, missing the graceful_degradation scenario entirely. This pulled its recall down to 80% from the previous 87%, which had relied on imputed data for the missing scenario. The benchmark now reflects the real-world completion rate — models that fail to produce valid outputs are scored accordingly.
The leaderboard continues to show clear performance bands. Top tier (100% recall): Opus subagent, Gemini, Qwen3 235B. Strong tier (87–92% recall): Sonnet, Opus manual, MiniMax, Grok. Moderate tier (53–80% recall): DeepSeek, Hunter, Qwen3.6 Plus, Healer, GPT-5.4 Nano, Arcee. Broken: Haiku. Qwen3.6 Plus lands in the moderate tier alongside other sleepy models.
| # | Model | Method | F2 | A-Gate Recall | Calibration | FN |
|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | subagent | 100% | 100% | 87% | 0 |
| 2 | Gemini 2.5 Flash Lite | api | 99% | 100% | 60% | 0 |
| 3 | Qwen3 235B | api | 97% | 100% | 66% | 0 |
| 4 | Claude Opus 4.6 | manual | 89% | 87% | 89% | 2 |
| 5 | Claude Sonnet 4.6 | subagent | 89% | 92% | 39% | 1 |
| 6 | Grok 4.1 Fast | api | 87% | 87% | 67% | 2 |
| 7 | MiniMax M2.7 | api | 87% | 87% | 68% | 2 |
| 8 | DeepSeek v3.2 | api | 82% | 80% | 61% | 3 |
| 9 | Hunter Alpha (Xiaomi MiMo-V2-Pro) | api | 74% | 73% | 43% | 4 |
| 10 | Healer Alpha (Xiaomi MiMo-V2-Omni) | api | 62% | 60% | 49% | 6 |
| 11 | GPT-5.4 Nano | api | 59% | 53% | 50% | 7 |
| 12 | Arcee Trinity | api | 57% | 53% | 48% | 7 |
| 13 | Claude Haiku 4.5* | subagent | 8% | 7% | 6% | 14 |
* Claude Haiku 4.5: Instruction-following failure — see analysis below
Six new API models join the leaderboard: Gemini 2.5 Flash Lite (recall 100%), Qwen3 235B (recall 100%), MiniMax M2.7 (recall 87%), Grok 4.1 Fast (recall 87%), DeepSeek v3.2 (recall 80%), and GPT-5.4 Nano (recall 53%). Gemini and Qwen3 achieve perfect hard gate recall — matching Opus subagent on the most important metric. GPT-5.4 Nano is the first OpenAI model on the leaderboard: it never false-alarms (100% precision) but misses 7 of the 15 hard gates (53% recall, sleepy bias), ranking just above Arcee Trinity. It is the second-fastest model at 102 seconds and among the cheapest at $0.03/eval, but its sleepy bias makes it unsuitable for safety-critical use. The field has grown to 13 models across 4 performance tiers.
Both Gemini 2.5 Flash Lite and Qwen3 235B achieve 100% hard gate recall — catching every safety-critical gate across all 18 evaluations. This matches Opus subagent performance on the most important metric. Gemini has 1 false positive, Qwen3 has 2. These are the first API-accessible models to demonstrate that perfect gate detection is achievable without subagent isolation or Opus-tier reasoning.
MiniMax M2.7 ties Grok 4.1 Fast on recall and precision (both 87%/87%) and slightly outperforms it on calibration (68% vs 67%). However, it required a max_tokens bump to 4096 (from the default 1024) to produce complete outputs, and its wall time of 1227 seconds (20+ minutes) is the slowest in the field — over twice as long as Qwen3 (613s) and 17x slower than Gemini (71s). For batch evaluations this may not matter, but for interactive use the latency is significant.
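For anyone re-running M2.7, the fix is a one-line request change. The sketch below assumes an OpenAI-compatible endpoint; the base URL and model slug are placeholders, not confirmed harness configuration.

```python
from openai import OpenAI

# Assumed OpenRouter-style endpoint and model slug; adjust to your provider.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-...")

scenario_prompt = "..."  # one of the benchmark scenarios

response = client.chat.completions.create(
    model="minimax/minimax-m2.7",  # hypothetical slug
    messages=[{"role": "user", "content": scenario_prompt}],
    max_tokens=4096,  # the harness default of 1024 truncated M2.7's outputs
)
print(response.choices[0].message.content)
```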
Gemini 2.5 Flash Lite completes the benchmark in 71 seconds. MiniMax M2.7 takes 1227 seconds. Between them: Healer Alpha (455s), Grok (503s), Arcee (505s), Qwen3 (613s), DeepSeek (991s), and Hunter Alpha (1185s). Speed does not correlate with accuracy — Gemini is both the fastest and the most accurate API model. Organizations choosing between models of similar accuracy can now factor in throughput.
The anonymous "stealth" models on OpenRouter have been identified as Xiaomi's MiMo-V2 series. Hunter Alpha is MiMo-V2-Pro, Xiaomi's flagship 1T-parameter MoE model (42B activated) designed as an agent brain. Healer Alpha is MiMo-V2-Omni, the multimodal variant. Hunter Alpha was previously the community's most-guessed candidate for DeepSeek V4. On ARA Eval, Hunter scores 73% recall (moderate) and Healer scores 60% (weak) — both well below the top API models despite Hunter's strong performance on coding benchmarks.
MiniMax M2.7 (68%), Grok (67%), and Qwen3 (66%) now cluster at 66–68% calibration — a meaningful improvement over the previous API cohort (43–49%) but still far from Opus manual at 89%. The gap has narrowed from ~40 points to ~21 points. Getting binary gate detection right is now solved by multiple models; severity calibration across 7 dimensions remains the frontier.
The first OpenAI model on the leaderboard scores 53% recall with 100% precision — it never raises a false alarm, but misses 7 of 15 hard gates. At 102 seconds and $0.03/eval it is the second-fastest and among the cheapest models tested. Its "sleepy" bias makes it the mirror image of Sonnet's "jittery" bias: Nano under-detects risk while Sonnet over-triggers. For screening where false alarms are costly, Nano's precision is appealing — but for safety-critical gate detection, its 53% recall is disqualifying.
The leaderboard now has clear performance bands. Top tier (100% recall): Opus subagent, Gemini, Qwen3. Strong tier (87–92% recall): Opus manual, Sonnet, Grok, MiniMax. Moderate tier (53–80% recall): DeepSeek, Hunter, Healer, GPT-5.4 Nano, Arcee. Broken: Haiku. Organizations can make informed trade-offs between gate accuracy, calibration quality, speed, and API availability.
| # | Model | Method | F2 | A-Gate Recall | Calibration | FN |
|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | subagent | 100% | 100% | 87% | 0 |
| 2 | Claude Opus 4.6 | manual | 89% | 87% | 89% | 2 |
| 3 | Hunter Alpha | api | 74% | 73% | 43% | 4 |
| 4 | Healer Alpha | api | 62% | 60% | 49% | 6 |
| 5 | Arcee Trinity | api | 57% | 53% | 48% | 7 |
Models are ranked by F2 score, which weights recall 4× over precision — because a missed risk gate is more dangerous than a false alarm. The F2 ordering is corroborated by hard_gate_recall and false_negative counts independently, making the ranking robust across metrics. There is a sharp performance cliff between Opus-tier models (F2 ≥ 89%) and the rest (F2 ≤ 74%), with no middle ground. The non-Opus models cluster near coin-flip on dimension-level accuracy (43–49%), suggesting the current benchmark separates "gets it" from "guessing" without much gradation. All API models currently complete 18/18 evaluations reliably and are free to use — so the comparison is purely on classification quality.
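For reference, F2 is the β = 2 case of the Fβ score, in which recall carries β² = 4 times the weight of precision. A minimal implementation:

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 2 weights recall 4x.
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hunter Alpha's 79% precision and 73% recall yield F2 of about 0.74,
# matching its leaderboard entry.
print(round(f_beta(0.79, 0.73), 2))  # 0.74
```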
Claude Opus 4.6 achieves perfect gate recall (100%) when run as 18 isolated subagent evaluations, but drops to 87% in single-pass manual mode — missing 2 gates. The manual approach slightly outperforms on dimension-level calibration (89% vs 87%), suggesting cross-scenario context helps severity assessment even as it introduces gate-detection noise.
Hunter Alpha misses the fewest gates of any API model (4 FN, vs 6 for Healer and 7 for Arcee), giving it the strongest safety-critical performance in the free tier. At 74% F2, it is the recommended default for anyone evaluating real scenarios — it misses fewer dangerous gates. If you need a model that errs on the side of catching risk, Hunter is the clear choice among currently-free options.
Arcee has the highest precision among API models (80%) and the most persona differentiation (69%), which makes results feel more interesting — the personalities disagree more. But the subagent results showed that high differentiation can be noise, not insight: Claude at 31% differentiation only disagreed where it should, and scored perfect on gating. Arcee's real strength is as a stable, permanently-free baseline — useful as a fallback when stealth models inevitably disappear.
Healer sits between Hunter and Arcee on most metrics but leads on none. Worse recall than Hunter (60% vs 73%), worse precision than Arcee (75% vs 80%), and similar dimension match to both (~49%). Currently free like the other stealth models, but offers no differentiated advantage.
Lower-performing models show higher persona differentiation (60–69%) compared to Opus (26–31%). Rather than indicating genuine perspective diversity, this likely reflects output variance — the models are inconsistent across personas, not genuinely surfacing different viewpoints. Claude's lower differentiation was concentrated on borderline regulatory calls where CO/CRO disagreement is expected.
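The benchmark's exact definition of persona differentiation is not shown here; a plausible reconstruction (an assumption, not the published formula) is the share of scenario-dimension calls on which the personas fail to agree, which is exactly the quantity that output variance inflates:

```python
def persona_differentiation(scenarios: list[dict]) -> float:
    # Each scenario maps a persona name (e.g. CO, CRO) to its 7 dimension levels.
    # Hypothetical metric: fraction of dimensions where the personas disagree.
    disagree = total = 0
    for persona_levels in scenarios:
        for levels in zip(*persona_levels.values()):  # one tuple per dimension
            total += 1
            disagree += len(set(levels)) > 1
    return disagree / total if total else 0.0
```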
At 43–49% dimension match, all non-Opus API models are near coin-flip on severity calibration across 7 risk dimensions. These models are valuable for learning the framework, running demos, and generating discussion — but nobody should make actual governance decisions based on their output alone. The benchmark is currently best used as a teaching tool.