o4-mini
by OpenAI

Tags: Multimodal, Zero-eval
Top rankings: #2 AIME 2024, #2 MathVista, #2 BrowseComp, +2 more
About

o4-mini is a multimodal language model developed by OpenAI. It achieves strong overall performance, with an average score of 66.5% across 14 benchmarks, and excels in particular on AIME 2024 (93.4%), AIME 2025 (92.7%), and MathVista (84.3%). As a multimodal model, it can process and understand both text and image inputs. Released in 2025, it represents OpenAI's latest advancement in AI technology.
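As an illustration of the multimodal input path, here is a minimal sketch of sending a combined text-and-image request to o4-mini through the OpenAI Python SDK. The prompt, image URL, and printed output handling are illustrative placeholders, not values taken from this page.

```python
# Sketch: mixed text + image request to o4-mini via the OpenAI Python SDK.
# The prompt and image URL below are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o4-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this chart and summarize the trend."},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```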

Timeline
Announced: Apr 16, 2025
Released: Apr 16, 2025
Knowledge Cutoff: May 31, 2024

Specifications
Capabilities: Multimodal

License & Family
License: Proprietary
Benchmark Performance Overview
Performance metrics and category breakdown

Overall Performance (14 benchmarks)
Average Score: 66.5%
Best Score: 93.4%
High Performers (80%+): 5

Top Categories
math: 84.3%
vision: 81.6%
general: 64.4%
agents: 60.5%
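These aggregates follow directly from the per-benchmark scores listed in the table further down. Below is a minimal sketch of how they can be recomputed, assuming the scores are available as plain (category, percentage) records; only the 10 benchmarks shown on this page are included, so averages that involve the 4 unlisted benchmarks (the 66.5% overall figure and some category figures) will not match exactly.

```python
# Sketch: deriving the overview aggregates from per-benchmark records.
# Only the 10 scores listed in the table below are included, so the
# overall average and some category averages will differ from the
# page's 14-benchmark figures.
from collections import defaultdict

scores = {
    "AIME 2024": ("general", 93.4),
    "AIME 2025": ("general", 92.7),
    "MathVista": ("math", 84.3),
    "MMMU": ("vision", 81.6),
    "GPQA": ("general", 81.4),
    "CharXiv-R": ("general", 72.0),
    "TAU-bench Retail": ("agents", 71.8),
    "Aider-Polyglot": ("general", 68.9),
    "SWE-Bench Verified": ("general", 68.1),
    "Aider-Polyglot Edit": ("general", 58.2),
}

values = [pct for _, pct in scores.values()]
print(f"Average score: {sum(values) / len(values):.1f}%")
print(f"Best score: {max(values):.1f}%")
print(f"High performers (80%+): {sum(v >= 80 for v in values)}")

by_category = defaultdict(list)
for category, pct in scores.values():
    by_category[category].append(pct)
for category, pcts in sorted(by_category.items()):
    print(f"{category}: {sum(pcts) / len(pcts):.1f}%")
```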
Benchmark Performance
Top benchmark scores with normalized values (0-100%); see the full table below.
Ranking Across Benchmarks
Position relative to other models on each benchmark

AIME 2024
Rank #2 of 41
#1 Grok-3 Mini: 95.8%
#2 o4-mini: 93.4%
#3 Grok-3: 93.3%
#4 Gemini 2.5 Pro: 92.0%
#5 o3: 91.6%

AIME 2025
Rank #4 of 36
#1 Grok-4 Heavy: 100.0%
#2 GPT-5: 94.6%
#3 Grok-3: 93.3%
#4 o4-mini: 92.7%
#5 Grok-4: 91.7%
#6 GPT-5 mini: 91.1%
#7 Grok-3 Mini: 90.8%

MathVista
Rank #2 of 35
#1 o3: 86.8%
#2 o4-mini: 84.3%
#3 Kimi-k1.5: 74.9%
#4 Llama 4 Maverick: 73.7%
#5 GPT-4.1 mini: 73.1%

MMMU
Rank #4 of 52
#1 GPT-5: 84.2%
#2 o3: 82.9%
#3 Gemini 2.5 Pro Preview 06-05: 82.0%
#4 o4-mini: 81.6%
#5 Gemini 2.5 Flash: 79.7%
#6 Gemini 2.5 Pro: 79.6%
#7 Grok-3: 78.0%

GPQA
Rank #12 of 115
#9 Gemini 2.5 Pro: 83.0%
#10 Gemini 2.5 Flash: 82.8%
#11 GPT-5 mini: 82.3%
#12 o4-mini: 81.4%
#13 DeepSeek-R1-0528: 81.0%
#14 Claude Opus 4: 79.6%
#15 o1-pro: 79.0%
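The rank shown for each benchmark is simply the model's position when all evaluated models are sorted by score in descending order. A minimal sketch of that ordering, using only the AIME 2024 entries listed above (a real leaderboard would rank the full set of 41 evaluated models):

```python
# Sketch: deriving a leaderboard rank by sorting models on one benchmark
# by score, descending. Input data is the AIME 2024 list shown above.
aime_2024 = {
    "Grok-3 Mini": 95.8,
    "o4-mini": 93.4,
    "Grok-3": 93.3,
    "Gemini 2.5 Pro": 92.0,
    "o3": 91.6,
}

ranked = sorted(aime_2024.items(), key=lambda item: item[1], reverse=True)
for rank, (model, score) in enumerate(ranked, start=1):
    marker = "  <-- o4-mini" if model == "o4-mini" else ""
    print(f"#{rank} {model}: {score:.1f}%{marker}")
```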
All Benchmark Results for o4-mini
Complete list of benchmark scores with detailed information
Benchmark            Category  Modality    Normalized (0-1)  Score   Source
AIME 2024            general   text        0.93              93.4%   Self-reported
AIME 2025            general   text        0.93              92.7%   Self-reported
MathVista            math      text        0.84              84.3%   Self-reported
MMMU                 vision    multimodal  0.82              81.6%   Self-reported
GPQA                 general   text        0.81              81.4%   Self-reported
CharXiv-R            general   text        0.72              72.0%   Self-reported
TAU-bench Retail     agents    text        0.72              71.8%   Self-reported
Aider-Polyglot       general   text        0.69              68.9%   Self-reported
SWE-Bench Verified   general   text        0.68              68.1%   Self-reported
Aider-Polyglot Edit  general   text        0.58              58.2%   Self-reported

Showing 10 of 14 benchmarks.
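The normalized column is the percentage score expressed on a 0-1 scale and rounded to two decimal places. A small sketch of the conversion and a consistency check against a few rows from the table above:

```python
# Sketch: the normalized value is the percentage score on a 0-1 scale,
# rounded to two decimal places in the table above.
rows = [
    ("AIME 2024", 0.93, 93.4),
    ("MathVista", 0.84, 84.3),
    ("Aider-Polyglot Edit", 0.58, 58.2),
]

for name, normalized, percent in rows:
    assert normalized == round(percent / 100, 2), name
    print(f"{name}: {percent}% -> {percent / 100:.3f} (shown as {normalized:.2f})")
```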