Reference guide • 6 min read
Popular benchmarks: MMLU (general knowledge), HumanEval (coding), SWE-bench (real coding tasks), GPQA (hard science), MATH (math), and ARC-AGI (reasoning). No single number tells the full story.
MMLU: Multiple-choice questions across 57 subjects, from history to law to medicine. Current best: ~92%. Most frontier models now score near this level, so the benchmark is saturated and less useful than it used to be.
HumanEval: 164 Python programming problems. The model writes a function and the benchmark runs test cases against it. Current best: 95%+. Also saturated.
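The "model writes a function; benchmark runs test cases" loop can be sketched in a few lines. This is a minimal illustration, not the official harness: the toy `add` problem, the `passes_tests` helper, and the sample completions are all hypothetical, and a real evaluation would run untrusted model output in a sandbox with timeouts rather than calling `exec` directly.

```python
def passes_tests(candidate_source: str, test_source: str, entry_point: str) -> bool:
    """Return True if the model-written function passes every assertion."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)            # define the model's function
        exec(test_source, namespace)                 # define check(candidate)
        namespace["check"](namespace[entry_point])   # run the test cases
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a fail.
        return False
    return True


# Hypothetical toy problem (not taken from the real benchmark).
TEST = (
    "def check(candidate):\n"
    "    assert candidate(1, 2) == 3\n"
    "    assert candidate(-1, 1) == 0\n"
)
GOOD = "def add(a, b):\n    return a + b\n"   # correct completion
BAD = "def add(a, b):\n    return a - b\n"    # buggy completion

completions = [GOOD, BAD]
results = [passes_tests(src, TEST, "add") for src in completions]
score = sum(results) / len(results)
print(results, f"{score:.0%}")  # [True, False] 50%
```

The reported score is simply the fraction of problems whose generated function passes all of its tests, which is why a benchmark of this kind saturates once models solve nearly every problem.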
SWE-bench: Real GitHub issues from popular Python projects. The model must edit the codebase to fix the bug. Very hard. Current best (2026): ~65-75%. This is the benchmark that matters most for real coding agents.
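For SWE-bench, "fixing the bug" is judged by the project's own tests: after the model's patch is applied, the tests that failed because of the issue must now pass, and the tests that already passed must not break. A minimal sketch of that rule, assuming the test results have already been collected into a dict; the `is_resolved` helper and the test names are hypothetical, and the real harness builds and runs each repository in an isolated environment.

```python
def is_resolved(
    test_status: dict[str, bool],
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> bool:
    """Decide whether a patched repository counts as resolving the issue.

    test_status maps a test name to True (passed) or False (failed) after
    the model's patch was applied. Missing tests are treated as failures.
    """
    fixed = all(test_status.get(name, False) for name in fail_to_pass)
    unbroken = all(test_status.get(name, False) for name in pass_to_pass)
    return fixed and unbroken


# Hypothetical results for one issue.
status = {"test_bug_is_fixed": True, "test_existing_a": True, "test_existing_b": False}

resolved = is_resolved(
    status,
    fail_to_pass=["test_bug_is_fixed"],
    pass_to_pass=["test_existing_a", "test_existing_b"],
)
print(resolved)  # False: the patch fixed the bug but broke an existing test
```

A patch that fixes the reported bug but breaks unrelated behavior scores zero for that issue, which is part of why scores here sit well below those on function-level benchmarks.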
GPQA: PhD-level questions in physics, chemistry, and biology. Domain experts score ~65%, while skilled non-experts reach only ~34% even with unrestricted web access. Current best (2026): ~85%+.
MATH: Competition-level math problems. Current best: ~90%+.
Hard math contest. Reasoning models (o1, o3, Claude with extended thinking) score 80-90%; non-reasoning models much lower.
ARC-AGI: Abstract visual reasoning puzzles, designed to be easy for humans and hard for AI. Current best (2026): ~85%+ on ARC-AGI-1; ARC-AGI-2 is much harder and still open.
Requires understanding images alongside text. Current best: ~80%.
Top performers vary by benchmark, but broadly: