Qwen2 72B Instruct

Zero-eval rankings: #1 CMMLU · #1 TheoremQA · #2 EvalPlus · +1 more

by Alibaba

About

Qwen2 72B Instruct is a language model developed by Alibaba. It achieves an average score of 73.6% across 17 benchmarks, with its strongest results on GSM8k (91.1%), CMMLU (90.1%), and HellaSwag (87.6%). It is especially strong on code tasks, where its benchmarks average 82.3%. Released in 2024, it represented the latest generation of Alibaba's Qwen model family at the time.

Timeline
Announced: Jul 23, 2024
Released: Jul 23, 2024

Specifications

License & Family
License: tongyi-qianwen
Benchmark Performance Overview
Performance metrics and category breakdown

Overall Performance (17 benchmarks)
Average Score: 73.6%
Best Score: 91.1%
High Performers (80%+): 9

Top Categories
code: 82.3%
reasoning: 80.5%
math: 75.4%
general: 67.9%
factuality: 54.8%
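Each category figure above is the mean of that category's per-benchmark scores. As a minimal sketch (assuming simple unweighted averaging, which the page does not state explicitly), the code figure can be reproduced from the four code-tagged rows in the results table at the bottom of this page; the other categories also draw on the seven benchmarks not listed there.

```python
# A sketch of how the category averages above fall out of the per-benchmark
# table at the bottom of this page. Values are the page's own self-reported
# scores for the four benchmarks it tags as "code" (including C-Eval).
code_scores = {
    "HumanEval": 86.0,
    "C-Eval": 83.8,
    "MBPP": 80.2,
    "EvalPlus": 79.0,
}

average = sum(code_scores.values()) / len(code_scores)
print(round(average, 2))  # 82.25, displayed as 82.3% in the overview above
```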
[Chart: Benchmark Performance; top benchmark scores with normalized values (0-100%). Full data in the results table below.]
Ranking Across Benchmarks
Position relative to other models on each benchmark

GSM8k
Rank #23 of 46

#20 Llama 3.1 Nemotron 70B Instruct: 91.4%
#21 Qwen2.5 7B Instruct: 91.6%
#22 Kimi K2 Base: 92.1%
#23 Qwen2 72B Instruct: 91.1%
#24 Qwen2.5-Coder 32B Instruct: 91.1%
#25 Gemini 1.5 Pro: 90.8%
#26 Grok-1.5: 90.0%

CMMLU
Rank #1 of 1

#1 Qwen2 72B Instruct: 90.1%

HellaSwag
Rank #6 of 24

#3 Command R+: 88.6%
#4 Claude 3 Sonnet: 89.0%
#5 Gemini 1.5 Pro: 93.3%
#6 Qwen2 72B Instruct: 87.6%
#7 Gemini 1.5 Flash: 86.5%
#8 Gemma 2 27B: 86.4%
#9 Claude 3 Haiku: 85.9%

HumanEval
Rank #28 of 62

#25 Qwen2.5 72B Instruct: 86.6%
#26 GPT-4 Turbo: 87.1%
#27 GPT-4o mini: 87.2%
#28 Qwen2 72B Instruct: 86.0%
#29 Grok-2 mini: 85.7%
#30 Nova Lite: 85.4%
#31 Gemma 3 12B: 85.4%

Winogrande
Rank #3 of 19

#1 Command R+: 85.4%
#2 GPT-4: 87.5%
#3 Qwen2 72B Instruct: 85.1%
#4 Llama 3.1 Nemotron 70B Instruct: 84.5%
#5 Gemma 2 27B: 83.7%
#6 Qwen2.5 32B Instruct: 82.0%
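A rank position in these lists is a 1-based position in a descending sort of all models' scores on that benchmark. Below is a minimal sketch using only the seven GSM8k entries shown above, not the full 46-model leaderboard, so the rank it prints is relative to this partial list.

```python
# Rank = 1-based position after sorting scores in descending order.
# Scores are the seven GSM8k entries listed above (a partial leaderboard).
gsm8k = {
    "Kimi K2 Base": 92.1,
    "Qwen2.5 7B Instruct": 91.6,
    "Llama 3.1 Nemotron 70B Instruct": 91.4,
    "Qwen2 72B Instruct": 91.1,
    "Qwen2.5-Coder 32B Instruct": 91.1,
    "Gemini 1.5 Pro": 90.8,
    "Grok-1.5": 90.0,
}

def rank_of(model: str, scores: dict[str, float]) -> int:
    """Return the model's 1-based rank; ties keep insertion order."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(model) + 1

print(rank_of("Qwen2 72B Instruct", gsm8k))  # 4 within this partial list
```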
All Benchmark Results for Qwen2 72B Instruct
Complete list of benchmark scores with detailed information

Benchmark  | Category  | Modality | Score | Normalized | Source
GSM8k      | math      | text     | 0.91  | 91.1%      | Self-reported
CMMLU      | general   | text     | 0.90  | 90.1%      | Self-reported
HellaSwag  | reasoning | text     | 0.88  | 87.6%      | Self-reported
HumanEval  | code      | text     | 0.86  | 86.0%      | Self-reported
Winogrande | reasoning | text     | 0.85  | 85.1%      | Self-reported
C-Eval     | code      | text     | 0.84  | 83.8%      | Self-reported
BBH        | general   | text     | 0.82  | 82.4%      | Self-reported
MMLU       | general   | text     | 0.82  | 82.3%      | Self-reported
MBPP       | code      | text     | 0.80  | 80.2%      | Self-reported
EvalPlus   | code      | text     | 0.79  | 79.0%      | Self-reported

Showing 1 to 10 of 17 benchmarks
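For readers who want to try the model, here is a minimal text-generation sketch. It assumes the public Hugging Face checkpoint Qwen/Qwen2-72B-Instruct and a recent transformers release; this page itself specifies neither.

```python
# Minimal generation sketch. Assumes the Hugging Face checkpoint
# "Qwen/Qwen2-72B-Instruct" (not stated on this page) and enough GPU
# memory to hold a 72B-parameter model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-72B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the checkpoint's native precision
    device_map="auto",    # shard across available GPUs (needs accelerate)
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what GSM8k measures in one sentence."},
]
# The chat template wraps the messages in Qwen2's instruct format.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```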