Verified Performance Data

Anthropic

The Claude model family, from Claude 3 Opus to Claude Opus 4.6: verified benchmark performance across knowledge, reasoning, coding, and math.

Models Tracked: 10
Founded: 2021
Best Reasoning: 91.3% (GPQA Diamond)
Best Knowledge: 91.1% (MMMLU)

About Anthropic

Anthropic is an AI safety company founded in 2021 by Dario Amodei, Daniela Amodei, and other former members of OpenAI. The company builds the Claude family of AI assistants, with a mission centered on AI safety research and developing reliable, interpretable AI systems. Anthropic pioneered Constitutional AI (CAI), a training technique in which AI systems are guided by an explicit set of principles rather than solely by human feedback.

The Claude model family has evolved rapidly from Claude 3 Opus (2024) through Claude Opus 4.6 (2026), consistently ranking among the top models in reasoning, knowledge, and balanced multi-domain performance. Anthropic is known for its cautious, safety-first approach to capability advancement.

Claude Model Timeline

The evolution of Anthropic’s Claude family from 2024 to 2026. All scores from official Anthropic announcements.

February 2026

Claude Opus 4.6

The latest and most capable Claude model. Claude Opus 4.6 achieves 91.3% on GPQA Diamond and 91.1% on MMMLU, making it one of the highest-scoring models in both reasoning and knowledge. It also achieves 80.8% on SWE-bench Verified for real-world software engineering tasks. Source: Anthropic announcement.

MMMLU 91.1% · GPQA 91.3% · SWE-bench 80.8%
January 2026

Claude Opus 4.5

A major step forward in reasoning capability. Claude Opus 4.5 achieves 91.3% on GPQA Diamond (matching Opus 4.6) and 80.9% on SWE-bench Verified, the highest coding score in the Claude family at the time of its release.

GPQA 91.3% · SWE-bench 80.9%
October 2025

Claude Sonnet 4.5

The balanced workhorse of the 4.5 generation. Claude Sonnet 4.5 delivers strong performance at faster response times and lower cost than Opus. It achieves 89.1% on MMLU and 83.4% on GPQA Diamond, with 77.2% on SWE-bench Verified for real-world coding tasks.

MMLU 89.1% · GPQA 83.4% · SWE-bench 77.2%
August 2025

Claude Opus 4.1

An incremental Opus update building on the Claude 4 architecture with improved reliability and enhanced agentic capabilities.

June 2025

Claude Opus 4

A major generational leap that introduced advanced agentic capabilities—the ability to use tools, browse documents, and execute multi-step workflows autonomously. It achieves 87.4% on MMMLU, 76.9% on GPQA Diamond (with extended thinking), 72.5% on SWE-bench Verified, and 33.9% on AIME 2024. Source: Anthropic announcement.

MMMLU 87.4% · GPQA 76.9% · SWE-bench 72.5% · AIME 33.9% · MMMU 73.7%
May 2025

Claude Sonnet 4

The first model in the Claude 4 generation. Claude Sonnet 4 delivered substantial improvements in reasoning and agentic tasks while maintaining the fast response times Sonnet users expected. It achieves 85.4% on MMMLU, 72.3% on GPQA Diamond (with extended thinking), and 72.7% on SWE-bench Verified. Source: Anthropic announcement.

MMMLU 85.4% · GPQA 72.3% · SWE-bench 72.7% · AIME 33.1% · MMMU 72.6%
February 2025

Claude Sonnet 3.7

A significant update to the 3.5 architecture. Claude Sonnet 3.7 introduced extended thinking capabilities to the Sonnet tier for the first time, allowing the model to reason through complex problems before responding.

October 2024

Claude 3.5 Haiku

The speed-optimized member of the 3.5 family. Claude 3.5 Haiku was designed for high-throughput, low-latency applications where cost efficiency matters most—chatbots, classification tasks, content moderation, and real-time data extraction. Source: Anthropic model card (PDF).

June 2024

Claude 3.5 Sonnet

A breakout hit that reshaped industry expectations. Claude 3.5 Sonnet demonstrated that a mid-tier model could match or exceed competitors’ flagship offerings. At 90.4% on MMLU and 92.0% on HumanEval, it outperformed GPT-4 Turbo on several key metrics while running faster and costing less. Source: Anthropic model card (PDF).

MMLU 90.4% · GPQA 59.4% · HumanEval 92.0% · AIME 16.0%
March 2024

Claude 3 Opus

Anthropic’s first true frontier model and the release that established the Claude family as a serious competitor to GPT-4. Claude 3 Opus launched with a 200K context window—the largest in the industry at the time. It scored 88.2% on MMLU (5-shot CoT), 50.4% on GPQA Diamond, and 84.9% on HumanEval. Source: Anthropic model card (PDF).

MMLU 88.2% · GPQA 50.4% · HumanEval 84.9%
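The timeline above can be collected into a small score table and queried programmatically. A minimal Python sketch, using only the figures quoted on this page (model and benchmark names here are labels for illustration, not API identifiers):

```python
# Benchmark scores (%) as quoted in the timeline above; illustrative only.
scores = {
    "Claude Opus 4.6":   {"MMMLU": 91.1, "GPQA": 91.3, "SWE-bench": 80.8},
    "Claude Opus 4.5":   {"GPQA": 91.3, "SWE-bench": 80.9},
    "Claude Sonnet 4.5": {"MMLU": 89.1, "GPQA": 83.4, "SWE-bench": 77.2},
    "Claude Opus 4":     {"MMMLU": 87.4, "GPQA": 76.9, "SWE-bench": 72.5},
    "Claude Sonnet 4":   {"MMMLU": 85.4, "GPQA": 72.3, "SWE-bench": 72.7},
    "Claude 3.5 Sonnet": {"MMLU": 90.4, "GPQA": 59.4, "HumanEval": 92.0},
    "Claude 3 Opus":     {"MMLU": 88.2, "GPQA": 50.4, "HumanEval": 84.9},
}

def best(benchmark):
    """Return (model, score) for the top score on a benchmark.

    Not every model reports every benchmark, so absent entries are skipped.
    Ties are broken by timeline order (newest model first in the dict).
    """
    entries = [(m, b[benchmark]) for m, b in scores.items() if benchmark in b]
    return max(entries, key=lambda e: e[1])

print(best("GPQA"))       # -> ('Claude Opus 4.6', 91.3); Opus 4.5 ties at 91.3
print(best("SWE-bench"))  # -> ('Claude Opus 4.5', 80.9)
```

The results match the header stats: best reasoning 91.3 (GPQA Diamond) and best knowledge 91.1 (MMMLU), both set by Claude Opus 4.6.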

Benchmark Performance

Claude Opus 4.6 scores across verified benchmark categories.

Key Strengths

Reasoning Leadership

Claude Opus 4.6 achieves 91.3% on GPQA Diamond, one of the highest reasoning scores among all AI models. Extended thinking mode enables deep, multi-step scientific reasoning.

Real-World Coding

With 80.8% on SWE-bench Verified, Claude Opus 4.6 demonstrates strong ability to solve real-world software engineering problems from GitHub repositories.

Safety-First Design

Built with Constitutional AI principles, Claude models are designed to be helpful, harmless, and honest. Anthropic prioritizes responsible development alongside capability advances.

About This Data

All benchmark scores are sourced from Anthropic’s official announcements and model cards. MMMLU (Multilingual MMLU) is used for Claude 4+ models where standard MMLU is not separately reported. GPQA Diamond scores for Claude 4+ include extended thinking. Scores represent performance at time of release.

Explore More Providers

Compare Anthropic’s Claude models against other frontier AI systems.