Grok-2

Multimodal
Zero-eval

by xAI

About

Grok-2 is a multimodal language model developed by xAI. It achieves strong performance, with an average score of 76.5% across 8 benchmarks, and does particularly well on DocVQA (93.6%), HumanEval (88.4%), and MMLU (87.5%). It supports a 136K-token context window for handling large documents and is available through one API provider. As a multimodal model, it accepts both text and image inputs. It was announced and released on August 13, 2024.
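
Grok-2 is served through xAI's API, which is OpenAI-compatible, so a request can be made with the standard openai Python client pointed at xAI's endpoint. The sketch below is illustrative only: the base URL, the model identifier grok-2-latest, and the XAI_API_KEY environment variable are assumptions to verify against the provider's documentation.

```python
# Minimal sketch: querying Grok-2 through xAI's OpenAI-compatible API.
# Base URL, model name, and env var are assumptions; confirm against xAI's docs.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["XAI_API_KEY"],   # assumed env var holding an xAI API key
    base_url="https://api.x.ai/v1",      # assumed OpenAI-compatible base URL
)

response = client.chat.completions.create(
    model="grok-2-latest",               # assumed model identifier
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the key risks in this contract clause: ..."},
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```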

Pricing Range
Input (per 1M): $2.00
Output (per 1M): $10.00
Providers: 1
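
At the listed rates ($2.00 per million input tokens, $10.00 per million output tokens), per-request cost is a simple linear function of token counts. A minimal sketch of that arithmetic:

```python
# Cost estimate at the listed rates: $2.00 / 1M input tokens, $10.00 / 1M output tokens.
INPUT_PER_M = 2.00
OUTPUT_PER_M = 10.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed per-million-token rates."""
    return input_tokens / 1_000_000 * INPUT_PER_M + output_tokens / 1_000_000 * OUTPUT_PER_M

# Example: a 100K-token document plus a 2K-token summary.
print(f"${request_cost(100_000, 2_000):.4f}")  # $0.2200
```
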
Timeline
Announced: Aug 13, 2024
Released: Aug 13, 2024
Specifications
Capabilities: Multimodal
License & Family
License: Proprietary
Benchmark Performance Overview
Performance metrics and category breakdown

Overall Performance

8 benchmarks
Average Score
76.5%
Best Score
93.6%
High Performers (80%+)
3

Performance Metrics

Max Context Window
136.0K
Avg Throughput
85.0 tok/s
Avg Latency
1ms
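
Taken together, the listed averages give a back-of-the-envelope estimate of end-to-end response time for a given output length. The sketch below simply applies the two figures above; actual numbers vary by provider, prompt size, and load.

```python
# Rough response-time estimate from the listed averages:
# ~85 tokens/s generation throughput and ~1 ms initial latency.
THROUGHPUT_TOK_S = 85.0
LATENCY_S = 0.001

def estimated_seconds(output_tokens: int) -> float:
    """Approximate wall-clock time to receive output_tokens at the listed averages."""
    return LATENCY_S + output_tokens / THROUGHPUT_TOK_S

for n in (256, 1024, 4096):
    print(f"{n} tokens: ~{estimated_seconds(n):.1f} s")
# 256 tokens: ~3.0 s, 1024 tokens: ~12.0 s, 4096 tokens: ~48.2 s
```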

Top Categories

code: 88.4%
vision: 79.8%
general: 73.0%
math: 72.5%
Benchmark Performance
Top benchmark scores with normalized values (0-100%)
Ranking Across Benchmarks
Position relative to other models on each benchmark

DocVQA

Rank #9 of 26
#6 Llama 4 Scout: 94.4%
#7 Llama 4 Maverick: 94.4%
#8 Qwen2.5 VL 32B Instruct: 94.8%
#9 Grok-2: 93.6%
#10 Nova Pro: 93.5%
#11 DeepSeek VL2: 93.3%
#12 Pixtral Large: 93.3%

HumanEval

Rank #18 of 62
#15 Llama 3.3 70B Instruct: 88.4%
#16 Mistral Small 3.1 24B Instruct: 88.4%
#17 DeepSeek-V2.5: 89.0%
#18 Grok-2: 88.4%
#19 Qwen2.5-Coder 7B Instruct: 88.4%
#20 Qwen2.5 32B Instruct: 88.4%
#21 o1: 88.1%

MMLU

Rank #15 of 78
#12 GPT-4.1 mini: 87.5%
#13 Kimi K2 Base: 87.8%
#14 Qwen3 235B A22B: 87.8%
#15 Grok-2: 87.5%
#16 Kimi-k1.5: 87.4%
#17 Llama 3.1 405B Instruct: 87.3%
#18 o3-mini: 86.9%

MATH

Rank #20 of 63
#17 GPT-4o: 76.6%
#18 Nova Pro: 76.6%
#19 Llama 3.3 70B Instruct: 77.0%
#20 Grok-2: 76.1%
#21 Gemma 3 4B: 75.6%
#22 Qwen2.5 7B Instruct: 75.5%
#23 DeepSeek-V2.5: 74.7%

MMLU-Pro

Rank #13 of 60
#10 Gemini 1.5 Pro: 75.8%
#11 DeepSeek-V3: 75.9%
#12 Phi 4 Reasoning Plus: 76.0%
#13 Grok-2: 75.5%
#14 GPT-4o: 74.7%
#15 Phi 4 Reasoning: 74.3%
#16 Llama 4 Scout: 74.3%
All Benchmark Results for Grok-2
Complete list of benchmark scores with detailed information
DocVQA (vision, multimodal): 93.6% (raw 0.94), Self-reported
HumanEval (code, text): 88.4% (raw 0.88), Self-reported
MMLU (general, text): 87.5% (raw 0.88), Self-reported
MATH (math, text): 76.1% (raw 0.76), Self-reported
MMLU-Pro (general, text): 75.5% (raw 0.76), Self-reported
MathVista (math, text): 69.0% (raw 0.69), Self-reported
MMMU (vision, multimodal): 66.1% (raw 0.66), Self-reported
GPQA (general, text): 56.0% (raw 0.56), Self-reported
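
The summary figures above (average score, best score, high-performer count, and the Top Categories breakdown) follow directly from these eight normalized scores. A minimal sketch of that arithmetic, using the values exactly as listed:

```python
# Normalized scores (%) and categories for the eight benchmarks listed above.
scores = {
    "DocVQA":    (93.6, "vision"),
    "HumanEval": (88.4, "code"),
    "MMLU":      (87.5, "general"),
    "MATH":      (76.1, "math"),
    "MMLU-Pro":  (75.5, "general"),
    "MathVista": (69.0, "math"),
    "MMMU":      (66.1, "vision"),
    "GPQA":      (56.0, "general"),
}

values = [score for score, _ in scores.values()]
print(f"Average score: {sum(values) / len(values):.1f}%")         # 76.5%
print(f"Best score: {max(values):.1f}%")                          # 93.6%
print(f"High performers (80%+): {sum(v >= 80 for v in values)}")  # 3

# Per-category means, for comparison with the Top Categories breakdown.
by_category: dict[str, list[float]] = {}
for score, category in scores.values():
    by_category.setdefault(category, []).append(score)
for category, vals in by_category.items():
    print(f"{category}: {sum(vals) / len(vals):.1f}%")
```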