
Qwen2 72B Instruct
Zero-eval
#1 CMMLU
#1 TheoremQA
#2 EvalPlus
+1 more
by Alibaba
About
Qwen2 72B Instruct is a large language model developed by Alibaba. It achieves strong results, averaging 73.6% across 17 benchmarks, and performs especially well on GSM8k (91.1%), CMMLU (90.1%), and HellaSwag (87.6%). It is particularly strong on code tasks, where it averages 82.3%. Released in 2024, it represents Alibaba's latest advance in the Qwen model line.
Timeline
Announced: Jul 23, 2024
Released: Jul 23, 2024
Specifications
License & Family
License
tongyi-qianwen
Benchmark Performance Overview
Performance metrics and category breakdown
Overall Performance
17 benchmarks
Average Score
73.6%
Best Score
91.1%
High Performers (80%+): 9
Top Categories
code
82.3%
reasoning
80.5%
math
75.4%
general
67.9%
factuality
54.8%
Benchmark Performance
Top benchmark scores with normalized values (0-100%)
Ranking Across Benchmarks
Position relative to other models on each benchmark
GSM8k
Rank #23 of 46
#20 Llama 3.1 Nemotron 70B Instruct
91.4%
#21 Qwen2.5 7B Instruct
91.6%
#22 Kimi K2 Base
92.1%
#23 Qwen2 72B Instruct
91.1%
#24 Qwen2.5-Coder 32B Instruct
91.1%
#25 Gemini 1.5 Pro
90.8%
#26 Grok-1.5
90.0%
CMMLU
Rank #1 of 1
#1 Qwen2 72B Instruct
90.1%
HellaSwag
Rank #6 of 24
#3 Command R+
88.6%
#4 Claude 3 Sonnet
89.0%
#5 Gemini 1.5 Pro
93.3%
#6 Qwen2 72B Instruct
87.6%
#7 Gemini 1.5 Flash
86.5%
#8 Gemma 2 27B
86.4%
#9 Claude 3 Haiku
85.9%
HumanEval
Rank #28 of 62
#25 Qwen2.5 72B Instruct
86.6%
#26 GPT-4 Turbo
87.1%
#27 GPT-4o mini
87.2%
#28 Qwen2 72B Instruct
86.0%
#29 Grok-2 mini
85.7%
#30 Nova Lite
85.4%
#31 Gemma 3 12B
85.4%
Winogrande
Rank #3 of 19
#1 Command R+
85.4%
#2 GPT-4
87.5%
#3 Qwen2 72B Instruct
85.1%
#4 Llama 3.1 Nemotron 70B Instruct
84.5%
#5 Gemma 2 27B
83.7%
#6 Qwen2.5 32B Instruct
82.0%
All Benchmark Results for Qwen2 72B Instruct
Complete list of benchmark scores with detailed information
Benchmark | Category | Modality | Score | Normalized | Source
GSM8k | math | text | 0.91 | 91.1% | Self-reported
CMMLU | general | text | 0.90 | 90.1% | Self-reported
HellaSwag | reasoning | text | 0.88 | 87.6% | Self-reported
HumanEval | code | text | 0.86 | 86.0% | Self-reported
Winogrande | reasoning | text | 0.85 | 85.1% | Self-reported
C-Eval | code | text | 0.84 | 83.8% | Self-reported
BBH | general | text | 0.82 | 82.4% | Self-reported
MMLU | general | text | 0.82 | 82.3% | Self-reported
MBPP | code | text | 0.80 | 80.2% | Self-reported
EvalPlus | code | text | 0.79 | 79.0% | Self-reported
Showing 1 to 10 of 17 benchmarks
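The category averages in the overview appear to be unweighted means of the normalized benchmark scores in each category. As a sanity check, the 82.3% code average can be reproduced from the four code-tagged benchmarks listed above (a minimal sketch; it assumes these are all of the model's code benchmarks and that the page rounds the mean to one decimal):

```python
# Normalized scores (%) for the four benchmarks tagged "code" in the table.
# Assumption: these are all of the model's code benchmarks.
code_scores = {
    "HumanEval": 86.0,
    "C-Eval": 83.8,
    "MBPP": 80.2,
    "EvalPlus": 79.0,
}

# Unweighted mean of the category's normalized scores.
avg = sum(code_scores.values()) / len(code_scores)
print(f"code average: {avg}%")  # → 82.25%, shown as 82.3% on the page
```

The other category averages (e.g. reasoning at 80.5%) cannot be reproduced from this page alone, since only 10 of the 17 benchmarks are shown here.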