Medical AI Superintelligence Test (MAST) Leaderboard

The MAST project seeks to curate a centralized resource of robust and realistic clinical benchmarks to measure the performance of medical AI.

See our methodology and submission instructions.

Apr 2026: CPC-Bench, Multimodal Derm
~Jul 2026 (In Development): NOHARM-Mind
~H2 2026 – 2027 (In Development): PACT: 12 high-risk clinical reasoning benchmarks
Preview: MAST is currently in preview. Exact scores on this benchmark may change as we complete final validation and tuning.

First Do NOHARM v2 overall metric across 39 models

Benchmark Comparison

Overall scores across benchmarks

R² = 0.30

Model Profiles

Compare models across benchmarks

Not all models could be run on every benchmark; axes with no result are shown at the center.

Performance Over Time

No scored data for the current selection.

Model Leaderboard

Top 10 shown

| # | Model | Reasoning | Safety | Agentic | Images | Multimodal | CPC | Diagnostic | Management | Radiology |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.5 | 76.0%±2.0 | 73.8%±3.9 | 51.4%±4.1 | 42.9%±2.8 | 49.2%±1.8 | 86.2%±3.8 | 75.1%±3.2 | 76.3%±2.4 | 57.6%±0.0 |
| 2 | GPT-5.2 | 73.2%±2.1 | 72.6%±3.9 | 27.2%±10.4 | 46.5%±2.7 | 49.3%±1.5 | 82.8%±3.9 | 72.6%±3.4 | 72.4%±2.5 | 52.4%±0.0 |
| 3 | GPT-5 | 73.2%±2.3 | 69.0%±3.8 | 32.4%±10.9 | 44.2%±2.5 | 47.6%±1.5 | 80.3%±4.7 | 73.1%±4.1 | 73.9%±2.6 | 51.6%±0.0 |
| 4 | GPT-5.4 | 72.2%±2.2 | 73.0%±3.9 | 38.8%±5.1 | 46.5%±2.5 | 47.1%±1.3 | 82.5%±4.0 | 71.9%±3.4 | 70.8%±2.8 | 47.8%±0.0 |
| 5 | Claude Opus 4.7 | 71.5%±2.5 | 71.1%±3.8 | 40.6%±5.0 | 42.7%±2.3 | 48.0%±1.5 | 76.7%±4.9 | 75.4%±3.6 | 69.2%±3.5 | 54.7%±0.0 |
| 6 | Claude Opus 4.6 | 68.5%±2.6 | 64.0%±4.4 | 44.4%±5.3 | 40.4%±2.7 | 42.7%±1.5 | 76.0%±5.0 | 75.4%±3.7 | 66.1%±3.5 | 45.2%±0.0 |
| 7 | GPT-5 mini | 67.9%±2.2 | 65.0%±3.6 | 19.0%±10.7 | 43.2%±2.4 | 46.1%±1.4 | 77.9%±4.7 | 68.2%±3.8 | 67.0%±2.7 | 49.4%±0.0 |
| 8 | Claude Sonnet 4.6 | 67.3%±2.5 | 63.9%±4.2 | 35.0%±5.5 | 39.0%±2.8 | 39.7%±1.5 | 76.2%±4.8 | 72.7%±3.5 | 65.1%±3.4 | 40.4%±0.0 |
| 9 | Kimi K2.6 (OSS) | 67.1%±2.6 | 66.4%±4.3 | -- | -- | -- | 73.9%±5.5 | 73.1%±4.0 | 63.8%±3.8 | -- |
| 10 | GPT-4.1 | 65.5%±2.7 | 58.5%±4.7 | -- | 40.8%±2.5 | 46.7%±1.6 | 71.8%±5.2 | 71.4%±4.3 | 64.5%±3.3 | 54.6%±0.0 |

Not all models could be run on every benchmark; cells marked "--" (NA) indicate no result, not a zero score.
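
For readers who copy these cells into their own analysis, the sketch below shows one way to keep missing results from being counted as zeros. It assumes each cell is a string such as "76.0%±2.0" or "--", as displayed in the leaderboard. The helper names `parse_cell` and `mean_of_available` are hypothetical, and the simple average is for illustration only; it is not how MAST computes any score.

```python
import re
from typing import Iterable, Optional, Tuple

# Matches cells such as "76.0%±2.0" (score in percent, then the ± margin).
CELL_PATTERN = re.compile(r"^(\d+(?:\.\d+)?)%±(\d+(?:\.\d+)?)$")


def parse_cell(cell: str) -> Optional[Tuple[float, float]]:
    """Return (score, margin) in percentage points, or None for an NA cell."""
    cell = cell.strip()
    if cell in ("", "--", "NA"):
        return None  # no result: deliberately not converted to 0.0
    match = CELL_PATTERN.match(cell)
    if match is None:
        raise ValueError(f"unrecognized cell: {cell!r}")
    return float(match.group(1)), float(match.group(2))


def mean_of_available(cells: Iterable[str]) -> Optional[float]:
    """Average the scores that exist; return None if every cell is NA."""
    scores = [parsed[0] for parsed in map(parse_cell, cells) if parsed is not None]
    return sum(scores) / len(scores) if scores else None


if __name__ == "__main__":
    # Three cells taken from the GPT-4.1 row; the middle one has no result.
    print(mean_of_available(["65.5%±2.7", "--", "58.5%±4.7"]))
```

Returning `None` for an NA cell, rather than 0.0, keeps a model that was not run on a benchmark from being penalized as if it had scored zero there.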