
Grok-4 Heavy
Multimodal
Zero-eval
#1 AIME 2025
#1 HMMT25
#1 GPQA
+3 more
by xAI
About
Grok-4 Heavy is a multimodal language model developed by xAI. It averages 79.5% across the 6 benchmarks tracked here, with its strongest results on AIME 2025 (100.0%), HMMT25 (96.7%), and GPQA (88.4%). Five of the six reported benchmarks fall in the general category, where it likewise averages 79.5%. As a multimodal model, it can process and understand text, images, and other input formats. Released in July 2025, it represents xAI's latest advancement in AI technology.
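As a quick sanity check, the 79.5% figure appears to be the plain mean of the six self-reported scores from the results table at the bottom of this page; a minimal sketch, with the values copied from that table:

```python
# Quick sanity check of the reported 79.5% average: it matches the
# plain mean of the six self-reported scores in the results table below.
scores = {
    "AIME 2025": 100.0,
    "HMMT25": 96.7,
    "GPQA": 88.4,
    "LiveCodeBench": 79.4,
    "USAMO25": 61.9,
    "Humanity's Last Exam": 50.7,
}
average = sum(scores.values()) / len(scores)
print(f"{average:.1f}%")  # -> 79.5%
```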
Timeline
Announced: Jul 9, 2025
Released: Jul 9, 2025
Knowledge Cutoff: Dec 31, 2024
Specifications
Capabilities
Multimodal
License & Family
License: Proprietary
Benchmark Performance Overview
Performance metrics and category breakdown
Overall Performance
6 benchmarks
Average Score: 79.5%
Best Score: 100.0%
High Performers (80%+): 3
Top Categories
general: 79.5%
code: 79.4%
Benchmark Performance
Top benchmark scores with normalized values (0-100%)
Ranking Across Benchmarks
Position relative to other models on each benchmark
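The rank shown for each benchmark is presumably just the model's 1-based position when all reported scores on that benchmark are sorted in descending order; a minimal sketch (assumed, not the site's actual method) using the AIME 2025 figures listed below:

```python
# Assumed rank derivation: sort the reported scores in descending
# order and take the model's 1-based position in that ordering.
aime_2025 = {
    "Grok-4 Heavy": 100.0,
    "GPT-5": 94.6,
    "Grok-3": 93.3,
    "o4-mini": 92.7,
}
ranked = sorted(aime_2025, key=aime_2025.get, reverse=True)
print(ranked.index("Grok-4 Heavy") + 1)  # -> 1
```

Note that ties would need an explicit tie-breaking rule; the LiveCodeBench listing below, for instance, places two 79.4% scores at #2 and #3.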
AIME 2025
Rank #1 of 36
#1 Grok-4 Heavy: 100.0%
#2 GPT-5: 94.6%
#3 Grok-3: 93.3%
#4 o4-mini: 92.7%
HMMT25
Rank #1 of 3
#1 Grok-4 Heavy: 96.7%
#2 Grok-4: 90.0%
#3 Qwen3-235B-A22B-Instruct-2507: 55.4%
GPQA
Rank #1 of 115
#1 Grok-4 Heavy: 88.4%
#2 Grok-4: 87.5%
#3 Gemini 2.5 Pro Preview 06-05: 86.4%
#4 GPT-5: 85.7%
LiveCodeBench
Rank #2 of 44
#1 Grok-3 Mini: 80.4%
#2 Grok-4 Heavy: 79.4%
#3 Grok-3: 79.4%
#4 Grok-4: 79.0%
#5 DeepSeek-R1-0528: 73.3%
USAMO25
Rank #1 of 2
#1 Grok-4 Heavy: 61.9%
#2 Grok-4: 37.5%
All Benchmark Results for Grok-4 Heavy
Complete list of benchmark scores with detailed information
Benchmark | Category | Modality | Normalized | Score | Source
AIME 2025 | general | text | 1.00 | 100.0% | Self-reported
HMMT25 | general | text | 0.97 | 96.7% | Self-reported
GPQA | general | text | 0.88 | 88.4% | Self-reported
LiveCodeBench | code | text | 0.79 | 79.4% | Self-reported
USAMO25 | general | text | 0.62 | 61.9% | Self-reported
Humanity's Last Exam | general | text | 0.51 | 50.7% | Self-reported
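The Normalized column appears to be the percentage score divided by 100, and the "High Performers (80%+)" count in the overview above tallies benchmarks scoring at or above 80%; a minimal sketch reproducing both from the table, under those assumptions:

```python
# Assumptions: Normalized = score / 100 (rounded to two places);
# "High Performers (80%+)" = number of benchmarks scoring >= 80%.
scores = {
    "AIME 2025": 100.0,
    "HMMT25": 96.7,
    "GPQA": 88.4,
    "LiveCodeBench": 79.4,
    "USAMO25": 61.9,
    "Humanity's Last Exam": 50.7,
}
normalized = {name: round(pct / 100, 2) for name, pct in scores.items()}
high_performers = sum(pct >= 80.0 for pct in scores.values())
print(normalized)       # {'AIME 2025': 1.0, 'HMMT25': 0.97, ...}
print(high_performers)  # -> 3
```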
Resources