
o3-mini
#1 on MATH, IFEval, LiveBench, and 9 more benchmarks
by OpenAI
About
o3-mini is a language model developed by OpenAI. It posts competitive results across 26 benchmarks, with particularly strong scores on COLLIE (98.7%), MATH (97.9%), and IFEval (93.9%). It supports a 300K-token context window for handling large documents and is available through 2 API providers. It was announced and released on January 30, 2025.
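Since the model is served through a standard chat-completions API, a minimal request looks like the sketch below. It assumes the official `openai` Python SDK (v1+) and an `OPENAI_API_KEY` in the environment; the second provider listed on this page is not shown.

```python
# Minimal sketch: one chat-completions request to o3-mini.
# Assumes the official `openai` Python SDK and an OPENAI_API_KEY
# environment variable; model availability may vary by provider.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o3-mini",
    messages=[
        {"role": "user", "content": "Summarize the MATH benchmark in one sentence."}
    ],
)
print(response.choices[0].message.content)
```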
Pricing Range
Input (per 1M tokens): $1.10
Output (per 1M tokens): $4.40
Providers: 2
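With identical rates across both providers, per-request cost is simple arithmetic. The sketch below is an illustrative helper (not provider code) applying the listed rates.

```python
# Back-of-the-envelope cost estimate from the listed per-1M-token rates.
# Rates are the page's listed prices; actual provider pricing may differ.
INPUT_PER_M = 1.10   # USD per 1M input tokens
OUTPUT_PER_M = 4.40  # USD per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for one request."""
    return (input_tokens / 1_000_000) * INPUT_PER_M \
        + (output_tokens / 1_000_000) * OUTPUT_PER_M

# Example: a 50K-token prompt with a 2K-token answer.
print(f"${estimate_cost(50_000, 2_000):.4f}")  # $0.0638
```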
Timeline
Announced: Jan 30, 2025
Released: Jan 30, 2025
Knowledge Cutoff: Sep 30, 2023
Specifications
License & Family
License: Proprietary
Benchmark Performance Overview
Performance metrics and category breakdown
Overall Performance
26 benchmarks
Average Score: 56.9%
Best Score: 98.7%
High Performers (80%+): 8
Performance Metrics
Max Context Window: 300.0K tokens
Avg Throughput: 115.0 tok/s
Avg Latency: 5 ms
Top Categories
code: 93.9%
roleplay: 84.6%
math: 66.4%
general: 55.2%
agents: 45.0%
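The throughput and latency figures above give a rough way to estimate end-to-end generation time, as sketched below. This is a simple latency-plus-streaming model under the page's averages; real serving times vary by provider and load.

```python
# Rough generation-time estimate from the listed averages
# (115.0 tok/s throughput, 5 ms latency). Illustrative only.
AVG_THROUGHPUT_TPS = 115.0  # tokens per second
AVG_LATENCY_S = 0.005       # 5 ms, as listed

def estimated_seconds(output_tokens: int) -> float:
    """Latency plus streaming time for a given output length."""
    return AVG_LATENCY_S + output_tokens / AVG_THROUGHPUT_TPS

print(f"{estimated_seconds(1_000):.2f} s")  # ~8.70 s for 1,000 tokens
```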
Benchmark Performance
Top benchmark scores with normalized values (0-100%)
Ranking Across Benchmarks
Position relative to other models on each benchmark
COLLIE
Rank #2 of 7
#1 GPT-5: 99.0%
#2 o3-mini: 98.7%
#3 GPT-4.5: 72.3%
#4 GPT-4.1: 65.8%
#5 GPT-4o: 61.0%
MATH
Rank #1 of 63
#1 o3-mini: 97.9%
#2 o1: 96.4%
#3 Gemini 2.0 Flash: 89.7%
#4 Gemma 3 27B: 89.0%
IFEval
Rank #1 of 37
#1 o3-mini: 93.9%
#2 Claude 3.7 Sonnet: 93.2%
#3 Nova Pro: 92.1%
#4 Llama 3.3 70B Instruct: 92.1%
MGSM
Rank #2 of 31
#1 Llama 4 Maverick: 92.3%
#2 o3-mini: 92.0%
#3 Claude 3.5 Sonnet: 91.6%
#4 Claude 3.5 Sonnet: 91.6%
#5 Llama 3.3 70B Instruct: 91.1%
AIME 2024
Rank #8 of 41
#5 o3: 91.6%
#6 DeepSeek-R1-0528: 91.4%
#7 Gemini 2.5 Flash: 88.0%
#8 o3-mini: 87.3%
#9 DeepSeek R1 Distill Llama 70B: 86.7%
#10 DeepSeek R1 Zero: 86.7%
#11 o1-pro: 86.0%
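The per-benchmark positions in these lists follow directly from sorting models by score in descending order. The sketch below reproduces the AIME 2024 window using only the entries shown on this page (an illustrative subset, not the full 41-model list); tied scores, like the two 86.7% entries, simply take successive positions, matching the display.

```python
# Derive ranking positions by sorting scores descending.
# Scores are the AIME 2024 values shown above (subset only).
aime_2024 = {
    "o3": 91.6,
    "DeepSeek-R1-0528": 91.4,
    "Gemini 2.5 Flash": 88.0,
    "o3-mini": 87.3,
    "DeepSeek R1 Distill Llama 70B": 86.7,
    "DeepSeek R1 Zero": 86.7,
    "o1-pro": 86.0,
}

ranked = sorted(aime_2024.items(), key=lambda kv: kv[1], reverse=True)
for position, (model, score) in enumerate(ranked, start=5):  # window starts at #5
    print(f"#{position} {model}: {score}%")
```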
All Benchmark Results for o3-mini
Complete list of benchmark scores with detailed information
Benchmark | Category | Modality | Raw Score | Normalized | Source
COLLIE | general | text | 0.99 | 98.7% | Self-reported
MATH | math | text | 0.98 | 97.9% | Self-reported
IFEval | code | text | 0.94 | 93.9% | Self-reported
MGSM | math | text | 0.92 | 92.0% | Self-reported
AIME 2024 | general | text | 0.87 | 87.3% | Self-reported
MMLU | general | text | 0.87 | 86.9% | Self-reported
LiveBench | roleplay | text | 0.85 | 84.6% | Self-reported
Multilingual MMLU | general | text | 0.81 | 80.7% | Self-reported
Multi-IF | general | text | 0.80 | 79.5% | Self-reported
GPQA | general | text | 0.77 | 77.2% | Self-reported
Showing 10 of 26 benchmarks.
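Because the rows are pipe-delimited, the summary figures are easy to recompute. The sketch below parses only the ten rows shown, so its average differs from the 26-benchmark overview figure of 56.9%; the 80%+ count matches the overview's 8 because the list is sorted, so every high performer falls in this top-ten slice.

```python
# Sketch: parse the pipe-delimited rows above and recompute summary stats.
# Only the 10 displayed rows are included, not all 26 benchmarks.
rows = """
COLLIE | general | text | 0.99 | 98.7% | Self-reported
MATH | math | text | 0.98 | 97.9% | Self-reported
IFEval | code | text | 0.94 | 93.9% | Self-reported
MGSM | math | text | 0.92 | 92.0% | Self-reported
AIME 2024 | general | text | 0.87 | 87.3% | Self-reported
MMLU | general | text | 0.87 | 86.9% | Self-reported
LiveBench | roleplay | text | 0.85 | 84.6% | Self-reported
Multilingual MMLU | general | text | 0.81 | 80.7% | Self-reported
Multi-IF | general | text | 0.80 | 79.5% | Self-reported
GPQA | general | text | 0.77 | 77.2% | Self-reported
""".strip().splitlines()

# Column 4 holds the normalized percentage, e.g. "98.7%".
scores = [float(line.split("|")[4].strip().rstrip("%")) for line in rows]
print(f"avg of shown rows: {sum(scores) / len(scores):.1f}%")  # 87.9%
print(f"rows at 80%+: {sum(s >= 80 for s in scores)}")         # 8
```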