Verified Performance Data

AI Model Benchmarks

Compare leading AI models across knowledge, reasoning, coding, math, multimodal, and instruction following. Data-driven analysis of 9 providers and their flagship models.

9
AI Providers
50+
Models Tracked
6
Benchmark Categories
2026
Data Current As Of

Understanding AI Benchmarks

Standardized tests that measure how well AI models perform across different cognitive domains. These benchmarks help compare models objectively.

Knowledge
MMLU

Massive Multitask Language Understanding. Tests broad academic knowledge across 57 subjects spanning STEM, the humanities, and the social sciences. Variants include MMMLU (multilingual) and MMLU-Pro (a harder version).

Reasoning
GPQA Diamond

Graduate-Level Google-Proof Q&A. Expert-level science questions designed to be unsearchable, testing genuine reasoning ability.

Coding
HumanEval

OpenAI’s code generation benchmark. Tests the ability to write correct Python functions from docstrings and function signatures.

Math
AIME 2024

American Invitational Mathematics Examination. Competition-level math problems requiring multi-step numerical reasoning.

Multimodal
MMMU

Massive Multi-discipline Multimodal Understanding. Tests vision-language reasoning across college-level subjects with images.

Instruction
IFEval

Instruction-Following Evaluation. Measures how precisely models follow specific formatting, length, and structural constraints.

Overall Leaderboard

Flagship models from each provider, ranked by average score across all available benchmark categories.
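The averaging above skips benchmarks a model has not been tested on (the N/A entries noted later on this page). A minimal sketch of that ranking logic, using made-up model names and scores purely for illustration:

```python
def average_score(scores):
    """Average over only the benchmarks the model was actually tested on.

    `scores` maps benchmark name -> score, with None for N/A entries.
    """
    available = [s for s in scores.values() if s is not None]
    return sum(available) / len(available) if available else 0.0

# Illustrative placeholder data, not real benchmark results.
models = {
    "Model A": {"MMLU": 88.0, "GPQA": 65.0, "HumanEval": None},   # avg 76.5
    "Model B": {"MMLU": 85.0, "GPQA": None, "HumanEval": 90.0},   # avg 87.5
}

# Rank models by average score over their available benchmarks.
leaderboard = sorted(models, key=lambda m: average_score(models[m]), reverse=True)
```

Note that averaging only over available benchmarks means a model missing a category it would have scored poorly on can rank slightly higher than a fully tested peer, which is one reason N/A entries are flagged explicitly.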

Flagship Model Comparison

Visual comparison of the top flagship models across all six benchmark dimensions.

Overall Average Score

Average performance across all six benchmark dimensions for each provider’s flagship model.

Radar Comparison: Top 4 Models

Knowledge Scores (MMLU)

Reasoning Scores (GPQA)

Coding Scores (HumanEval)

About This Data

Benchmark scores are compiled from official provider reports, technical papers, and independent evaluations. Scores may vary between evaluation runs. Some models have not been tested on all benchmarks (shown as N/A). This page is updated regularly as new models and benchmark results become available.

Benchmark scores measure specific capabilities, not overall fitness for your use case; the best model depends on your specific needs.

Learn How to Prompt These Models

Knowing benchmarks is just the start. Master the art of communicating with AI through 177+ proven prompting techniques.