Verified Performance Data

OpenAI

The GPT model family, from GPT-2 to GPT-5 and the o-series reasoning models, with benchmark performance across knowledge, reasoning, coding, math, multimodal understanding, and instruction following.

Models Tracked: 12
Founded: 2015
Best Knowledge (o1): 92.3 (MMLU)
Best Math (o3): 88.9 (AIME 2024)

About OpenAI

OpenAI is an AI research organization founded in 2015 by Sam Altman, Greg Brockman, Ilya Sutskever, and others, with early backing from Elon Musk. Originally a non-profit, OpenAI transitioned to a capped-profit model in 2019. The company is responsible for the GPT family of language models that popularized AI assistants worldwide through ChatGPT (launched November 2022).

OpenAI operates two distinct model lines: the GPT series (general-purpose) and the o-series (reasoning-specialized). The o-series models (o1, o3, o4-mini) use extended “thinking” time to solve complex math and reasoning problems, achieving dramatically higher scores on competition-level benchmarks. The o3 model leads with 88.9% on AIME 2024 and 83.3% on GPQA Diamond.
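
To make the two tracks concrete, here is a minimal sketch (not taken from OpenAI's announcements) of calling each line through the OpenAI Python SDK; treat the model names and the reasoning-effort setting as illustrative and check them against current API documentation.

```python
# Minimal sketch: an o-series reasoning call next to a GPT-series call.
# Assumes the OpenAI Python SDK (pip install openai) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# o-series: spends extra "thinking" tokens before answering; slower and costlier,
# but much stronger on competition-style math and science questions.
reasoning = client.responses.create(
    model="o3",
    reasoning={"effort": "high"},  # how much internal chain-of-thought to allow
    input="How many positive integers n < 1000 are divisible by neither 2 nor 5?",
)

# GPT series: general-purpose and fast, suited to everyday tasks.
general = client.responses.create(
    model="gpt-4o",
    input="Explain the difference between GPT-4o and o3 in two sentences.",
)

print(reasoning.output_text)
print(general.output_text)
```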

GPT & o-Series Timeline

The evolution of OpenAI’s model families from GPT-4 through GPT-5 and the o-series reasoning models. All scores from official OpenAI announcements.

August 2025

GPT-5

OpenAI’s latest general-purpose model. Advances in knowledge breadth, multimodal understanding, and instruction following.

April 2025

o3

The full o3 reasoning model. Achieves 88.9% on AIME 2024 and 83.3% on GPQA Diamond, setting new records for AI mathematical and scientific reasoning. Uses extended chain-of-thought during inference. Source: OpenAI announcement.

GPQA Diamond: 83.3 · AIME 2024: 88.9
April 2025

o4-mini

Cost-efficient reasoning model with strong performance. Achieves 81.4% on GPQA Diamond and 92.7% on AIME 2025. Source: OpenAI announcement.

GPQA Diamond: 81.4 · AIME 2025: 92.7
April 2025

GPT-4.1

An updated GPT-4 class model with improved knowledge and reasoning capabilities. Source: OpenAI announcement.

MMLU: 80.1 · GPQA Diamond: 50.3
February 2025

GPT-4.5

Incremental update bridging GPT-4o and GPT-5. Improved knowledge depth, stronger multilingual performance, and better calibration.

January 2025

o3-mini

Cost-efficient reasoning model. Delivers strong math and coding performance at a fraction of the cost of full o3, making reasoning accessible for more applications. Source: OpenAI announcement.

December 2024

o1

Full release of the o1 reasoning model. Achieves 92.3% on MMLU, 78.0% on GPQA Diamond, and 83.3% on AIME 2024—dramatically improved over o1-preview. Source: OpenAI blog.

MMLU: 92.3 · GPQA Diamond: 78.0 · AIME 2024: 83.3
September 2024

o1-preview

Introduced the "thinking" paradigm: OpenAI's first model to use extended reasoning chains during inference, achieving 73.3% on GPQA Diamond and 44.0% on AIME 2024. Source: OpenAI blog.

GPQA Diamond: 73.3 · AIME 2024: 44.0
May 2024

GPT-4o

The "omni" model. Natively multimodal with text, image, and audio understanding. Achieves 87.2% on MMLU and 49.9% on GPQA Diamond. Scores cross-referenced from the DeepSeek-R1 paper.

MMLU: 87.2 · GPQA Diamond: 49.9 · AIME 2024: 9.3
March 2023

GPT-4

The model that defined the frontier. GPT-4 was the first large language model to pass a simulated bar exam and showed human-level performance on a wide range of professional and academic benchmarks. Its launch set off the "AI race" among major tech companies.

HumanEval: 67.0
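
The timeline above quotes scores one model at a time; the short sketch below (an editorial aid, not part of OpenAI's reporting) simply gathers those same numbers into a single table so the GPT-series and o-series entries can be read side by side.

```python
# Collect the benchmark scores quoted in the timeline into one comparison table.
# All numbers come from the entries above; "n/a" means no score is quoted there.
SCORES = {
    "GPT-4":      {"HumanEval": 67.0},
    "GPT-4o":     {"MMLU": 87.2, "GPQA Diamond": 49.9, "AIME 2024": 9.3},
    "o1-preview": {"GPQA Diamond": 73.3, "AIME 2024": 44.0},
    "o1":         {"MMLU": 92.3, "GPQA Diamond": 78.0, "AIME 2024": 83.3},
    "GPT-4.1":    {"MMLU": 80.1, "GPQA Diamond": 50.3},
    "o4-mini":    {"GPQA Diamond": 81.4, "AIME 2025": 92.7},
    "o3":         {"GPQA Diamond": 83.3, "AIME 2024": 88.9},
}

BENCHMARKS = ["MMLU", "GPQA Diamond", "AIME 2024", "AIME 2025", "HumanEval"]

print(f"{'Model':<12}" + "".join(f"{b:>14}" for b in BENCHMARKS))
for model, scores in SCORES.items():
    cells = [f"{scores[b]:>14.1f}" if b in scores else f"{'n/a':>14}" for b in BENCHMARKS]
    print(f"{model:<12}" + "".join(cells))
```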

Benchmark Performance

GPT-5 scores across verified benchmark categories.

Key Strengths

Mathematical Reasoning (o-series)

The o3 model achieves 88.9% on AIME 2024. The o-series’ extended thinking approach has redefined what’s possible in competition-level math.

PhD-Level Science (o-series)

o3 achieves 83.3% on GPQA Diamond and o1 reaches 78.0%, both demonstrating expert-level scientific reasoning through chain-of-thought inference.

Broad Knowledge (o1)

With 92.3% on MMLU, the o1 model demonstrates one of the widest knowledge bases among frontier models, covering STEM, humanities, and professional domains.

Two-Track Strategy: GPT vs o-Series

OpenAI maintains two parallel model lines. The GPT series (GPT-4o, GPT-4.1, GPT-5) prioritizes balanced, fast performance for everyday tasks. The o-series (o1, o3, o4-mini) trades speed for dramatically better performance on hard reasoning problems. For competition math, o3 (88.9% on AIME 2024) far exceeds GPT-4o (9.3%), but at significantly higher cost and latency.
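
One practical consequence of the two-track lineup is routing: send hard reasoning problems to an o-series model and everything else to a GPT-series model. The sketch below is a hypothetical illustration of that tradeoff; the keyword heuristic and model choices are assumptions for this article, not an OpenAI-recommended pattern.

```python
# Hypothetical router: cheap, fast GPT-series model by default, o-series for hard problems.
from openai import OpenAI

client = OpenAI()

REASONING_MODEL = "o3"      # slow and expensive, strongest on math and science
GENERAL_MODEL = "gpt-4o"    # fast and cheap, fine for everyday tasks


def looks_hard(prompt: str) -> bool:
    """Crude difficulty heuristic (an assumption for illustration only)."""
    keywords = ("prove", "integral", "olympiad", "aime", "derive", "competition")
    return any(k in prompt.lower() for k in keywords)


def answer(prompt: str) -> str:
    model = REASONING_MODEL if looks_hard(prompt) else GENERAL_MODEL
    response = client.responses.create(model=model, input=prompt)
    return response.output_text


print(answer("Draft a friendly out-of-office reply."))           # routed to gpt-4o
print(answer("Prove that the square root of 2 is irrational."))  # routed to o3
```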

Benchmark scores are snapshots at time of release and may not reflect your specific use case.

Explore More Providers

Compare OpenAI’s GPT models against other frontier AI systems.