Qwen2 72B Instruct

Zero-eval rankings: #1 CMMLU · #1 TheoremQA · #2 EvalPlus · +1 more

by Alibaba

About

Qwen2 72B Instruct is a language model developed by Alibaba. It achieves an average score of 73.6% across 17 benchmarks, with its strongest results on GSM8k (91.1%), CMMLU (90.1%), and HellaSwag (87.6%). It is especially strong on code tasks, where its benchmarks average 82.3%. Released in 2024, it represented the latest generation of Alibaba's Qwen model family at the time.

Timeline
Announced: Jul 23, 2024
Released: Jul 23, 2024

Specifications

License & Family
License: tongyi-qianwen
Benchmark Performance Overview
Performance metrics and category breakdown

Overall Performance (17 benchmarks)
Average Score: 73.6%
Best Score: 91.1%
High Performers (80%+): 9

Top Categories
code: 82.3%
reasoning: 80.5%
math: 75.4%
general: 67.9%
factuality: 54.8%
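Each category figure above is the mean of that category's per-benchmark scores. As a minimal sketch (assuming simple unweighted averaging, which the page does not state explicitly), the code figure can be reproduced from the four code-tagged rows in the results table at the bottom of this page; the other categories also draw on the seven benchmarks not listed there.

```python
# A sketch of how the category averages above fall out of the per-benchmark
# table at the bottom of this page. Values are the page's own self-reported
# scores for the four benchmarks it tags as "code" (including C-Eval).
code_scores = {
    "HumanEval": 86.0,
    "C-Eval": 83.8,
    "MBPP": 80.2,
    "EvalPlus": 79.0,
}

average = sum(code_scores.values()) / len(code_scores)
print(round(average, 2))  # 82.25, displayed as 82.3% in the overview above
```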
[Chart: Benchmark Performance; top benchmark scores with normalized values (0-100%). Full data in the results table below.]
Ranking Across Benchmarks
Position relative to other models on each benchmark

GSM8k
Rank #23 of 46

#20 Llama 3.1 Nemotron 70B Instruct: 91.4%
#21 Qwen2.5 7B Instruct: 91.6%
#22 Kimi K2 Base: 92.1%
#23 Qwen2 72B Instruct: 91.1%
#24 Qwen2.5-Coder 32B Instruct: 91.1%
#25 Gemini 1.5 Pro: 90.8%
#26 Grok-1.5: 90.0%

CMMLU
Rank #1 of 1

#1 Qwen2 72B Instruct: 90.1%

HellaSwag
Rank #6 of 24

#3 Command R+: 88.6%
#4 Claude 3 Sonnet: 89.0%
#5 Gemini 1.5 Pro: 93.3%
#6 Qwen2 72B Instruct: 87.6%
#7 Gemini 1.5 Flash: 86.5%
#8 Gemma 2 27B: 86.4%
#9 Claude 3 Haiku: 85.9%

HumanEval
Rank #28 of 62

#25 Qwen2.5 72B Instruct: 86.6%
#26 GPT-4 Turbo: 87.1%
#27 GPT-4o mini: 87.2%
#28 Qwen2 72B Instruct: 86.0%
#29 Grok-2 mini: 85.7%
#30 Nova Lite: 85.4%
#31 Gemma 3 12B: 85.4%

Winogrande
Rank #3 of 19

#1 Command R+: 85.4%
#2 GPT-4: 87.5%
#3 Qwen2 72B Instruct: 85.1%
#4 Llama 3.1 Nemotron 70B Instruct: 84.5%
#5 Gemma 2 27B: 83.7%
#6 Qwen2.5 32B Instruct: 82.0%
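A rank position in these lists is a 1-based position in a descending sort of all models' scores on that benchmark. Below is a minimal sketch using only the seven GSM8k entries shown above, not the full 46-model leaderboard, so the rank it prints is relative to this partial list.

```python
# Rank = 1-based position after sorting scores in descending order.
# Scores are the seven GSM8k entries listed above (a partial leaderboard).
gsm8k = {
    "Kimi K2 Base": 92.1,
    "Qwen2.5 7B Instruct": 91.6,
    "Llama 3.1 Nemotron 70B Instruct": 91.4,
    "Qwen2 72B Instruct": 91.1,
    "Qwen2.5-Coder 32B Instruct": 91.1,
    "Gemini 1.5 Pro": 90.8,
    "Grok-1.5": 90.0,
}

def rank_of(model: str, scores: dict[str, float]) -> int:
    """Return the model's 1-based rank; ties keep insertion order."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(model) + 1

print(rank_of("Qwen2 72B Instruct", gsm8k))  # 4 within this partial list
```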
All Benchmark Results for Qwen2 72B Instruct
Complete list of benchmark scores with detailed information

Benchmark  | Category  | Modality | Score | Normalized | Source
GSM8k      | math      | text     | 0.91  | 91.1%      | Self-reported
CMMLU      | general   | text     | 0.90  | 90.1%      | Self-reported
HellaSwag  | reasoning | text     | 0.88  | 87.6%      | Self-reported
HumanEval  | code      | text     | 0.86  | 86.0%      | Self-reported
Winogrande | reasoning | text     | 0.85  | 85.1%      | Self-reported
C-Eval     | code      | text     | 0.84  | 83.8%      | Self-reported
BBH        | general   | text     | 0.82  | 82.4%      | Self-reported
MMLU       | general   | text     | 0.82  | 82.3%      | Self-reported
MBPP       | code      | text     | 0.80  | 80.2%      | Self-reported
EvalPlus   | code      | text     | 0.79  | 79.0%      | Self-reported

Showing 1 to 10 of 17 benchmarks
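For readers who want to try the model, here is a minimal text-generation sketch. It assumes the public Hugging Face checkpoint Qwen/Qwen2-72B-Instruct and a recent transformers release; this page itself specifies neither.

```python
# Minimal generation sketch. Assumes the Hugging Face checkpoint
# "Qwen/Qwen2-72B-Instruct" (not stated on this page) and enough GPU
# memory to hold a 72B-parameter model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-72B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the checkpoint's native precision
    device_map="auto",    # shard across available GPUs (needs accelerate)
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what GSM8k measures in one sentence."},
]
# The chat template wraps the messages in Qwen2's instruct format.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```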