Grok-1.5V

Multimodal
Zero-eval
#3 RealWorldQA

by xAI

About

Grok-1.5V is a multimodal language model developed by xAI. It achieves strong performance, with an average score of 71.9% across 7 benchmarks, and scores highest on AI2D (88.3%), DocVQA (85.6%), and TextVQA (78.1%). Its strongest category is general tasks, with an average of 77.7%. As a multimodal model, it can process and understand text, images, and other input formats. Released in 2024, it represents xAI's latest advancement in AI technology.
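The headline averages above can be reproduced from the individual scores listed under "All Benchmark Results" below. This is an illustrative sketch using the category groupings as reported on this page, not the site's actual aggregation code:

```python
# Benchmark scores for Grok-1.5V as listed in "All Benchmark Results",
# paired with the category each benchmark is filed under on this page.
scores = {
    "AI2D": (88.3, "general"),
    "DocVQA": (85.6, "vision"),
    "TextVQA": (78.1, "vision"),
    "ChartQA": (76.1, "general"),
    "RealWorldQA": (68.7, "general"),
    "MMMU": (53.6, "vision"),
    "MathVista": (52.8, "math"),
}

# Overall average across all 7 benchmarks.
overall = sum(s for s, _ in scores.values()) / len(scores)
print(f"overall: {overall:.1f}%")  # 71.9%

# Per-category averages (general, vision, math).
categories = {}
for score, cat in scores.values():
    categories.setdefault(cat, []).append(score)
for cat, vals in categories.items():
    print(f"{cat}: {sum(vals) / len(vals):.1f}%")
# general: 77.7%, vision: 72.4%, math: 52.8%
```

Both the 71.9% overall figure and the category breakdown shown later on this page follow directly from a simple unweighted mean of the self-reported scores.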

Timeline
Announced: Apr 12, 2024
Released: Apr 12, 2024
Specifications
Capabilities: Multimodal
License & Family
License: Proprietary
Benchmark Performance Overview
Performance metrics and category breakdown

Overall Performance

7 benchmarks
Average Score: 71.9%
Best Score: 88.3%
High Performers (80%+): 2

Top Categories

general: 77.7%
vision: 72.4%
math: 52.8%
Benchmark Performance
Top benchmark scores with normalized values (0-100%)
Ranking Across Benchmarks
Position relative to other models on each benchmark

AI2D

Rank #8 of 17
#5 Qwen2.5 VL 72B Instruct: 88.4%
#6 Llama 3.2 11B Instruct: 91.1%
#7 Llama 3.2 90B Instruct: 92.3%
#8 Grok-1.5V: 88.3%
#9 Gemma 3 27B: 84.5%
#10 Gemma 3 12B: 84.2%
#11 Qwen2.5-Omni-7B: 83.2%

DocVQA

Rank #24 of 26
#21 Gemma 3 27B: 86.6%
#22 Gemma 3 12B: 87.1%
#23 Llama 3.2 11B Instruct: 88.4%
#24 Grok-1.5V: 85.6%
#25 Grok-1.5: 85.6%
#26 Gemma 3 4B: 75.8%

TextVQA

Rank #9 of 15
#6 Nova Lite: 80.2%
#7 DeepSeek VL2 Tiny: 80.7%
#8 Nova Pro: 81.5%
#9 Grok-1.5V: 78.1%
#10 Phi-4-multimodal-instruct: 75.6%
#11 Llama 3.2 90B Instruct: 73.5%
#12 Phi-3.5-vision-instruct: 72.0%

ChartQA

Rank #22 of 24
#19 Gemma 3 27B: 78.0%
#20 DeepSeek VL2 Tiny: 81.0%
#21 Phi-4-multimodal-instruct: 81.4%
#22 Grok-1.5V: 76.1%
#23 Gemma 3 12B: 75.7%
#24 Gemma 3 4B: 68.8%

RealWorldQA

Rank #3 of 6
#1 Qwen2.5-Omni-7B: 70.3%
#2 Qwen2-VL-72B-Instruct: 77.8%
#3 Grok-1.5V: 68.7%
#4 DeepSeek VL2: 68.4%
#5 DeepSeek VL2 Small: 65.4%
#6 DeepSeek VL2 Tiny: 64.2%
All Benchmark Results for Grok-1.5V
Complete list of benchmark scores with detailed information
Benchmark | Category | Modality | Score | Normalized | Source
AI2D | general | text | 0.88 | 88.3% | Self-reported
DocVQA | vision | multimodal | 0.86 | 85.6% | Self-reported
TextVQA | vision | multimodal | 0.78 | 78.1% | Self-reported
ChartQA | general | multimodal | 0.76 | 76.1% | Self-reported
RealWorldQA | general | text | 0.69 | 68.7% | Self-reported
MMMU | vision | multimodal | 0.54 | 53.6% | Self-reported
MathVista | math | text | 0.53 | 52.8% | Self-reported