Grok-4 Heavy

Tags: Multimodal, Zero-eval
Top rankings: #1 AIME 2025, #1 HMMT25, #1 GPQA

by xAI

About

Grok-4 Heavy is a multimodal language model developed by xAI. Across 6 benchmarks it averages 79.5%, with its strongest results on AIME 2025 (100.0%), HMMT25 (96.7%), and GPQA (88.4%). Five of the six benchmarks fall in the "general" category, where it averages 79.5%; the sixth, LiveCodeBench, is the only "code" benchmark, at 79.4%. All scores are self-reported. As a multimodal model, it can process text, images, and other input formats. It was announced and released on Jul 9, 2025.

Timeline
Announced: Jul 9, 2025
Released: Jul 9, 2025
Knowledge Cutoff: Dec 31, 2024

Specifications
Capabilities: Multimodal

License & Family
License: Proprietary

Benchmark Performance Overview
Performance metrics and category breakdown

Overall Performance (6 benchmarks)
Average Score: 79.5%
Best Score: 100.0%
High Performers (80%+): 3

Top Categories
general: 79.5%
code: 79.4%
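
The overview figures can be reproduced from the six per-benchmark scores listed on this page. The following is a minimal sketch, not the site's actual method: the `SCORES` mapping and the `summarize` helper are hypothetical names introduced for illustration, and it assumes the overview is a plain unweighted mean rounded to one decimal place.

```python
from collections import defaultdict
from statistics import mean

# Scores (percent) and categories copied from the benchmark results on this page.
SCORES = {
    "AIME 2025": ("general", 100.0),
    "HMMT25": ("general", 96.7),
    "GPQA": ("general", 88.4),
    "LiveCodeBench": ("code", 79.4),
    "USAMO25": ("general", 61.9),
    "Humanity's Last Exam": ("general", 50.7),
}


def summarize(scores, threshold=80.0):
    """Compute overview statistics, assuming an unweighted mean (an assumption)."""
    values = [score for _, score in scores.values()]

    # Group scores by category so each category can be averaged separately.
    by_category = defaultdict(list)
    for category, score in scores.values():
        by_category[category].append(score)

    return {
        "count": len(values),
        "average": round(mean(values), 1),
        "best": max(values),
        # Number of benchmarks at or above the "high performer" threshold.
        "high_performers": sum(v >= threshold for v in values),
        "categories": {c: round(mean(v), 1) for c, v in by_category.items()},
    }


summary = summarize(SCORES)
print(summary)
```

Under these assumptions the sketch yields an average of 79.5, a best score of 100.0, three benchmarks at 80% or above, and category averages of 79.5 (general) and 79.4 (code), matching the figures shown in the overview.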

Benchmark Performance
Top benchmark scores with normalized values (0-100%)

Ranking Across Benchmarks
Position relative to other models on each benchmark

AIME 2025 (Rank #1 of 36)
#1 Grok-4 Heavy: 100.0%
#2 GPT-5: 94.6%
#3 Grok-3: 93.3%
#4 o4-mini: 92.7%

HMMT25 (Rank #1 of 3)
#1 Grok-4 Heavy: 96.7%
#2 Grok-4: 90.0%
#3 Qwen3-235B-A22B-Instruct-2507: 55.4%

GPQA (Rank #1 of 115)
#1 Grok-4 Heavy: 88.4%
#2 Grok-4: 87.5%
#3 Gemini 2.5 Pro Preview 06-05: 86.4%
#4 GPT-5: 85.7%

LiveCodeBench (Rank #2 of 44)
#1 Grok-3 Mini: 80.4%
#2 Grok-4 Heavy: 79.4%
#3 Grok-3: 79.4%
#4 Grok-4: 79.0%
#5 DeepSeek-R1-0528: 73.3%

USAMO25 (Rank #1 of 2)
#1 Grok-4 Heavy: 61.9%
#2 Grok-4: 37.5%

All Benchmark Results for Grok-4 Heavy
Complete list of benchmark scores with detailed information

| Benchmark | Category | Modality | Score | Normalized | Source |
|---|---|---|---|---|---|
| AIME 2025 | general | text | 1.00 | 100.0% | Self-reported |
| HMMT25 | general | text | 0.97 | 96.7% | Self-reported |
| GPQA | general | text | 0.88 | 88.4% | Self-reported |
| LiveCodeBench | code | text | 0.79 | 79.4% | Self-reported |
| USAMO25 | general | text | 0.62 | 61.9% | Self-reported |
| Humanity's Last Exam | general | text | 0.51 | 50.7% | Self-reported |