
o4-mini
Multimodal
Zero-eval
#2 AIME 2024
#2 MathVista
#2 BrowseComp
+2 more
by OpenAI
About
o4-mini is a multimodal language model developed by OpenAI. It achieves strong performance, with an average score of 66.5% across 14 benchmarks, and performs particularly well on AIME 2024 (93.4%), AIME 2025 (92.7%), and MathVista (84.3%). As a multimodal model, it can process and understand text, images, and other input formats. Released in April 2025, it represents one of OpenAI's most recent advancements in AI technology.
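For readers who want to see how a headline number like the 66.5% average is typically formed, the sketch below computes an unweighted mean over per-benchmark scores. It is only illustrative: it assumes a simple unweighted mean and uses just the ten benchmarks listed further down this page, so the result will not reproduce 66.5% exactly (the other four benchmarks are not shown here).

```python
# Minimal sketch: unweighted mean of normalized benchmark scores (0-1 scale).
# Only the ten benchmarks listed on this page are included; the remaining four
# of the 14 are omitted, so the printed value will not match 66.5% exactly.
scores = {
    "AIME 2024": 0.934,
    "AIME 2025": 0.927,
    "MathVista": 0.843,
    "MMMU": 0.816,
    "GPQA": 0.814,
    "CharXiv-R": 0.720,
    "TAU-bench Retail": 0.718,
    "Aider-Polyglot": 0.689,
    "SWE-Bench Verified": 0.681,
    "Aider-Polyglot Edit": 0.582,
}

average = sum(scores.values()) / len(scores)
print(f"Average over {len(scores)} benchmarks: {average:.1%}")
```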
Timeline
Announced: Apr 16, 2025
Released: Apr 16, 2025
Knowledge Cutoff: May 31, 2024
Specifications
Capabilities
Multimodal
License & Family
License
Proprietary
Benchmark Performance Overview
Performance metrics and category breakdown
Overall Performance
14 benchmarks
Average Score: 66.5%
Best Score: 93.4%
High Performers (80%+): 5
Top Categories
math
84.3%
vision
81.6%
general
64.4%
agents
60.5%
Benchmark Performance
Top benchmark scores with normalized values (0-100%)
Ranking Across Benchmarks
Position relative to other models on each benchmark
AIME 2024
Rank #2 of 41
#1 Grok-3 Mini
95.8%
#2 o4-mini
93.4%
#3 Grok-3
93.3%
#4 Gemini 2.5 Pro
92.0%
#5 o3
91.6%
AIME 2025
Rank #4 of 36
#1 Grok-4 Heavy
100.0%
#2 GPT-5
94.6%
#3 Grok-3
93.3%
#4 o4-mini
92.7%
#5 Grok-4
91.7%
#6 GPT-5 mini
91.1%
#7 Grok-3 Mini
90.8%
MathVista
Rank #2 of 35
#1 o3
86.8%
#2 o4-mini
84.3%
#3 Kimi-k1.5
74.9%
#4 Llama 4 Maverick
73.7%
#5 GPT-4.1 mini
73.1%
MMMU
Rank #4 of 52
#1 GPT-5
84.2%
#2 o3
82.9%
#3 Gemini 2.5 Pro Preview 06-05
82.0%
#4 o4-mini
81.6%
#5 Gemini 2.5 Flash
79.7%
#6 Gemini 2.5 Pro
79.6%
#7 Grok-3
78.0%
GPQA
Rank #12 of 115
#9 Gemini 2.5 Pro
83.0%
#10 Gemini 2.5 Flash
82.8%
#11 GPT-5 mini
82.3%
#12 o4-mini
81.4%
#13 DeepSeek-R1-0528
81.0%
#14 Claude Opus 4
79.6%
#15 o1-pro
79.0%
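The rank lists above order models by score, so a position such as "Rank #4 of 36" follows directly from a descending sort. The sketch below assumes that simple sort-by-score convention and uses only the AIME 2025 entries shown on this page; the other 29 of the 36 ranked models are omitted.

```python
# Minimal sketch (assumption: rank = position after sorting scores descending).
# Only the AIME 2025 entries listed above are included.
aime_2025 = [
    ("Grok-4 Heavy", 100.0),
    ("GPT-5", 94.6),
    ("Grok-3", 93.3),
    ("o4-mini", 92.7),
    ("Grok-4", 91.7),
    ("GPT-5 mini", 91.1),
    ("Grok-3 Mini", 90.8),
]

ranked = sorted(aime_2025, key=lambda entry: entry[1], reverse=True)
for rank, (model, score) in enumerate(ranked, start=1):
    marker = " <- this page's model" if model == "o4-mini" else ""
    print(f"#{rank} {model}: {score:.1f}%{marker}")
```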
All Benchmark Results for o4-mini
Complete list of benchmark scores with detailed information
Benchmark | Category | Modality | Raw Score | Normalized | Source
AIME 2024 | general | text | 0.93 | 93.4% | Self-reported
AIME 2025 | general | text | 0.93 | 92.7% | Self-reported
MathVista | math | text | 0.84 | 84.3% | Self-reported
MMMU | vision | multimodal | 0.82 | 81.6% | Self-reported
GPQA | general | text | 0.81 | 81.4% | Self-reported
CharXiv-R | general | text | 0.72 | 72.0% | Self-reported
TAU-bench Retail | agents | text | 0.72 | 71.8% | Self-reported
Aider-Polyglot | general | text | 0.69 | 68.9% | Self-reported
SWE-Bench Verified | general | text | 0.68 | 68.1% | Self-reported
Aider-Polyglot Edit | general | text | 0.58 | 58.2% | Self-reported
Showing 1 to 10 of 14 benchmarks
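Because the rows above use a simple pipe-delimited layout, per-category averages like the "Top Categories" figures can in principle be recomputed from them. The sketch below is only illustrative: it parses the ten rows shown here, and since the remaining four benchmarks are not listed, the resulting category means will not match the published numbers exactly.

```python
from collections import defaultdict

# Illustrative parser for the pipe-delimited rows above (ten of the 14 benchmarks).
rows = """\
AIME 2024 | general | text | 0.93 | 93.4% | Self-reported
AIME 2025 | general | text | 0.93 | 92.7% | Self-reported
MathVista | math | text | 0.84 | 84.3% | Self-reported
MMMU | vision | multimodal | 0.82 | 81.6% | Self-reported
GPQA | general | text | 0.81 | 81.4% | Self-reported
CharXiv-R | general | text | 0.72 | 72.0% | Self-reported
TAU-bench Retail | agents | text | 0.72 | 71.8% | Self-reported
Aider-Polyglot | general | text | 0.69 | 68.9% | Self-reported
SWE-Bench Verified | general | text | 0.68 | 68.1% | Self-reported
Aider-Polyglot Edit | general | text | 0.58 | 58.2% | Self-reported"""

by_category = defaultdict(list)
for line in rows.splitlines():
    name, category, modality, raw, pct, source = [f.strip() for f in line.split("|")]
    by_category[category].append(float(pct.rstrip("%")))

for category, values in by_category.items():
    print(f"{category}: {sum(values) / len(values):.1f}% over {len(values)} benchmarks")
```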