| Benchmark | Gemini 2.5 Pro | OpenAI o3 | OpenAI o4-mini | ... |
|---|---|---|---|---|
| Humanity's Last Exam | 21.6% | 20.3% | 14.3% | ... |
| GPQA (single attempt) | 86.4% | 83.3% | 81.4% | ... |
| GPQA (multiple attempts) | — | — | — | ... |
| AIME (single attempt) | 88.0% | 88.9% | 92.7% | ... |
| AIME (multiple attempts) | — | — | — | ... |
| LiveCodeBench | 69.0% | 72.0% | 75.8% | ... |
| Aider Polyglot | 82.2% | 79.6% | 72.0% | ... |
| SWE-bench (single attempt) | 59.6% | 69.1% | 68.1% | ... |
| SWE-bench (multiple attempts) | 67.2% | — | — | ... |
| SimpleQA | 54.0% | 48.6% | 19.3% | ... |
| ... | ... | ... | ... | ... |
*Figure: Model → Service; Tokens → Tasks per Second*
| Model | Intelligence | Time | Intelligence Goodput (GPT-5 High = 100) |
|---|---|---|---|
| Grok 4 Fast | 60 | 2.7d | 254.88 |
| GPT-5 Medium | 66 | 3.8d | 202.85 |
| Gemini 2.5 Flash | 54 | 3.1d | 199.27 |
| GPT-5 High | 68 | 7.9d | 100.00 |
| Gemini 2.5 Pro | 60 | 7.5d | 92.67 |
| Claude 4.5 Sonnet | 63 | 7.9d | 91.75 |
| Grok 4 | 65 | 40.2d | 18.71 |
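The Intelligence Goodput column appears to be consistent with Intelligence divided by Time, rescaled so that GPT-5 High equals 100. The source does not state this formula, so the sketch below is an inference from the numbers, not the authors' method. The `goodput` function and the `ROWS` data layout are hypothetical names introduced for illustration, and Time is kept in the table's own units.

```python
# Hypothetical reconstruction of the "Intelligence Goodput" column.
# Assumption (inferred from the table, not stated in the source):
#   goodput = 100 * (intelligence / time) / (intelligence_ref / time_ref)
# with GPT-5 High as the reference row.

REFERENCE = "GPT-5 High"

# model -> (intelligence score, time in the table's units, reported goodput)
ROWS = {
    "Grok 4 Fast": (60, 2.7, 254.88),
    "GPT-5 Medium": (66, 3.8, 202.85),
    "Gemini 2.5 Flash": (54, 3.1, 199.27),
    "GPT-5 High": (68, 7.9, 100.00),
    "Gemini 2.5 Pro": (60, 7.5, 92.67),
    "Claude 4.5 Sonnet": (63, 7.9, 91.75),
    "Grok 4": (65, 40.2, 18.71),
}


def goodput(rows, reference=REFERENCE):
    """Return intelligence per unit time for each model, normalized so reference = 100."""
    raw = {name: intel / time for name, (intel, time, _) in rows.items()}
    base = raw[reference]
    return {name: 100.0 * value / base for name, value in raw.items()}


if __name__ == "__main__":
    estimates = goodput(ROWS)
    for name, (_, _, reported) in ROWS.items():
        # Print the recomputed value next to the value reported in the table.
        print(f"{name:18s} estimated {estimates[name]:7.2f}  reported {reported:7.2f}")
```

Recomputed this way, every row lands within about 2% of the reported value; the remaining gap is plausibly due to the Time column being rounded to one decimal place. Because of that rounding, the recomputed values for GPT-5 Medium and Gemini 2.5 Flash come out in the opposite order from the table, so the reported figures are presumably computed from unrounded times.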