Verified Performance Data

AI Model Benchmarks

Compare leading AI models across knowledge, reasoning, coding, math, multimodal, and instruction following. Data-driven analysis of 9 providers and their flagship models.

9
AI Providers
50+
Models Tracked
6
Benchmark Categories
2026
Data Current As Of

Understanding AI Benchmarks

Standardized tests that measure how well AI models perform across different cognitive domains. These benchmarks help compare models objectively.

Knowledge
MMLU

Massive Multitask Language Understanding. Tests broad academic knowledge across 57 subjects spanning STEM, the humanities, and the social sciences. Variants include MMMLU (multilingual) and MMLU-Pro (a harder version).

Reasoning
GPQA Diamond

Graduate-Level Google-Proof Q&A. Expert-level science questions designed to be unsearchable, testing genuine reasoning ability.

Coding
HumanEval

OpenAI’s code generation benchmark. Tests the ability to write correct Python functions from docstrings and function signatures.

Math
AIME 2024

American Invitational Mathematics Examination. Competition-level math problems requiring multi-step numerical reasoning.

Multimodal
MMMU

Massive Multi-discipline Multimodal Understanding. Tests vision-language reasoning across college-level subjects with images.

Instruction
IFEval

Instruction-Following Evaluation. Measures how precisely models follow specific formatting, length, and structural constraints.

Overall Leaderboard

Flagship models from each provider, ranked by average score across all available benchmark categories.
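The averaging above skips benchmarks a model has not been tested on (the N/A entries noted later on this page). A minimal sketch of that ranking logic, using made-up model names and scores purely for illustration:

```python
def average_score(scores):
    """Average over only the benchmarks the model was actually tested on.

    `scores` maps benchmark name -> score, with None for N/A entries.
    """
    available = [s for s in scores.values() if s is not None]
    return sum(available) / len(available) if available else 0.0

# Illustrative placeholder data, not real benchmark results.
models = {
    "Model A": {"MMLU": 88.0, "GPQA": 65.0, "HumanEval": None},   # avg 76.5
    "Model B": {"MMLU": 85.0, "GPQA": None, "HumanEval": 90.0},   # avg 87.5
}

# Rank models by average score over their available benchmarks.
leaderboard = sorted(models, key=lambda m: average_score(models[m]), reverse=True)
```

Note that averaging only over available benchmarks means a model missing a category it would have scored poorly on can rank slightly higher than a fully tested peer, which is one reason N/A entries are flagged explicitly.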

Flagship Model Comparison

Visual comparison of the top flagship models across all six benchmark dimensions.

Overall Average Score

Average performance across all six benchmark dimensions for each provider’s flagship model.

Radar Comparison: Top 4 Models

Knowledge Scores (MMLU)

Reasoning Scores (GPQA)

Coding Scores (HumanEval)

About This Data

Benchmark scores are compiled from official provider reports, technical papers, and independent evaluations. Scores may vary between evaluation runs. Some models have not been tested on all benchmarks (shown as N/A). This page is updated regularly as new models and benchmark results become available.

Benchmark scores measure specific capabilities, not overall fitness for your use case; the best model depends on your specific needs.

Learn How to Prompt These Models

Knowing benchmarks is just the start. Master the art of communicating with AI through 177+ proven prompting techniques.