
Grok-4 Heavy
Multimodal
Zero-eval
#1 AIME 2025
#1 HMMT25
#1 GPQA
+3 more
by xAI
About
Grok-4 Heavy is a multimodal language model developed by xAI. It averages 79.5% across the 6 benchmarks tracked here, with its strongest results on AIME 2025 (100.0%), HMMT25 (96.7%), and GPQA (88.4%). Five of the six reported benchmarks fall in the general category, where it likewise averages 79.5%. As a multimodal model, it can process and understand text, images, and other input formats. Released in July 2025, it represents xAI's latest advancement in AI technology.
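As a quick sanity check, the 79.5% figure appears to be the plain mean of the six self-reported scores from the results table at the bottom of this page; a minimal sketch, with the values copied from that table:

```python
# Quick sanity check of the reported 79.5% average: it matches the
# plain mean of the six self-reported scores in the results table below.
scores = {
    "AIME 2025": 100.0,
    "HMMT25": 96.7,
    "GPQA": 88.4,
    "LiveCodeBench": 79.4,
    "USAMO25": 61.9,
    "Humanity's Last Exam": 50.7,
}
average = sum(scores.values()) / len(scores)
print(f"{average:.1f}%")  # -> 79.5%
```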
Timeline
Announced: Jul 9, 2025
Released: Jul 9, 2025
Knowledge Cutoff: Dec 31, 2024
Specifications
Capabilities
Multimodal
License & Family
License: Proprietary
Benchmark Performance Overview
Performance metrics and category breakdown
Overall Performance
6 benchmarks
Average Score: 79.5%
Best Score: 100.0%
High Performers (80%+): 3
Top Categories
general: 79.5%
code: 79.4%
Benchmark Performance
Top benchmark scores with normalized values (0-100%)
Ranking Across Benchmarks
Position relative to other models on each benchmark
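The rank shown for each benchmark is presumably just the model's 1-based position when all reported scores on that benchmark are sorted in descending order; a minimal sketch (assumed, not the site's actual method) using the AIME 2025 figures listed below:

```python
# Assumed rank derivation: sort the reported scores in descending
# order and take the model's 1-based position in that ordering.
aime_2025 = {
    "Grok-4 Heavy": 100.0,
    "GPT-5": 94.6,
    "Grok-3": 93.3,
    "o4-mini": 92.7,
}
ranked = sorted(aime_2025, key=aime_2025.get, reverse=True)
print(ranked.index("Grok-4 Heavy") + 1)  # -> 1
```

Note that ties would need an explicit tie-breaking rule; the LiveCodeBench listing below, for instance, places two 79.4% scores at #2 and #3.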
AIME 2025
Rank #1 of 36
#1 Grok-4 Heavy: 100.0%
#2 GPT-5: 94.6%
#3 Grok-3: 93.3%
#4 o4-mini: 92.7%
HMMT25
Rank #1 of 3
#1 Grok-4 Heavy: 96.7%
#2 Grok-4: 90.0%
#3 Qwen3-235B-A22B-Instruct-2507: 55.4%
GPQA
Rank #1 of 115
#1 Grok-4 Heavy: 88.4%
#2 Grok-4: 87.5%
#3 Gemini 2.5 Pro Preview 06-05: 86.4%
#4 GPT-5: 85.7%
LiveCodeBench
Rank #2 of 44
#1 Grok-3 Mini: 80.4%
#2 Grok-4 Heavy: 79.4%
#3 Grok-3: 79.4%
#4 Grok-4: 79.0%
#5 DeepSeek-R1-0528: 73.3%
USAMO25
Rank #1 of 2
#1 Grok-4 Heavy: 61.9%
#2 Grok-4: 37.5%
All Benchmark Results for Grok-4 Heavy
Complete list of benchmark scores with detailed information
Benchmark | Category | Modality | Normalized | Score | Source
AIME 2025 | general | text | 1.00 | 100.0% | Self-reported
HMMT25 | general | text | 0.97 | 96.7% | Self-reported
GPQA | general | text | 0.88 | 88.4% | Self-reported
LiveCodeBench | code | text | 0.79 | 79.4% | Self-reported
USAMO25 | general | text | 0.62 | 61.9% | Self-reported
Humanity's Last Exam | general | text | 0.51 | 50.7% | Self-reported
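The Normalized column appears to be the percentage score divided by 100, and the "High Performers (80%+)" count in the overview above tallies benchmarks scoring at or above 80%; a minimal sketch reproducing both from the table, under those assumptions:

```python
# Assumptions: Normalized = score / 100 (rounded to two places);
# "High Performers (80%+)" = number of benchmarks scoring >= 80%.
scores = {
    "AIME 2025": 100.0,
    "HMMT25": 96.7,
    "GPQA": 88.4,
    "LiveCodeBench": 79.4,
    "USAMO25": 61.9,
    "Humanity's Last Exam": 50.7,
}
normalized = {name: round(pct / 100, 2) for name, pct in scores.items()}
high_performers = sum(pct >= 80.0 for pct in scores.values())
print(normalized)       # {'AIME 2025': 1.0, 'HMMT25': 0.97, ...}
print(high_performers)  # -> 3
```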
Resources