Qwen2 7B Instruct
by Alibaba

About

Qwen2 7B Instruct is a language model developed by Alibaba. Evaluated on 14 benchmarks, it posts competitive results, with its strongest scores on MT-Bench (84.1%), GSM8k (82.3%), and HumanEval (79.9%). It is released under the Apache 2.0 license, which permits commercial use and makes it suitable for enterprise applications. Released in July 2024, it is part of Alibaba's Qwen2 model family.
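For context, a minimal inference sketch using the Hugging Face transformers library. The repo id "Qwen/Qwen2-7B-Instruct" and the generation settings are assumptions based on common practice; verify them against the official model card before use.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-7B-Instruct"  # assumed repo id; confirm on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # device_map="auto" requires accelerate
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the Apache 2.0 license in one sentence."},
]
# Render the conversation with the model's chat template, then generate.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=128)
# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))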

Timeline
Announced: Jul 23, 2024
Released: Jul 23, 2024

Specifications

License & Family
License: Apache 2.0
Benchmark Performance Overview
Performance metrics and category breakdown

Overall Performance (14 benchmarks)

Average score: 59.5%
Best score: 84.1%
High performers (80%+): 2

Top Categories

roleplay: 84.1%
math: 66.0%
code: 64.2%
general: 49.4%
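The summary figures above can be reproduced from the per-benchmark scores listed at the bottom of this page. A minimal sketch, assuming normalized scores on a 0-100 scale (only three of the 14 benchmarks are filled in here for brevity):

from collections import defaultdict

# (category, normalized score %) per benchmark; abbreviated to three
# of the page's 14 entries for illustration.
scores = {
    "MT-Bench": ("roleplay", 84.1),
    "GSM8k": ("math", 82.3),
    "HumanEval": ("code", 79.9),
}

average = sum(s for _, s in scores.values()) / len(scores)   # "Average score"
best = max(s for _, s in scores.values())                    # "Best score"
high = sum(1 for _, s in scores.values() if s >= 80)         # "High performers (80%+)"

# Per-category averages, as in "Top Categories".
by_category = defaultdict(list)
for category, score in scores.values():
    by_category[category].append(score)
category_averages = {c: sum(v) / len(v) for c, v in by_category.items()}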
Benchmark Performance
[Chart: top benchmark scores with normalized values (0-100%)]

Ranking Across Benchmarks
Position relative to other models on each benchmark

MT-Bench (Rank #6 of 11)

#3 Mistral Large 2: 86.3%
#4 Qwen2.5 7B Instruct: 87.5%
#5 DeepSeek-V2.5: 90.2%
#6 Qwen2 7B Instruct: 84.1%
#7 Mistral Small 3 24B Instruct: 83.5%
#8 Ministral 8B Instruct: 83.0%
#9 Llama 3.1 Nemotron Nano 8B V1: 81.0%

GSM8k (Rank #36 of 46)

#33 Qwen2.5-Coder 7B Instruct: 83.9%
#34 Gemini 1.5 Flash: 86.2%
#35 Phi-3.5-mini-instruct: 86.2%
#36 Qwen2 7B Instruct: 82.3%
#37 Granite 3.3 8B Instruct: 80.9%
#38 Mistral Small 3 24B Base: 80.7%
#39 Llama 3.2 3B Instruct: 77.7%

HumanEval (Rank #42 of 62)

#39 Llama 3.1 70B Instruct: 80.5%
#40 Nova Micro: 81.1%
#41 Codestral-22B: 81.1%
#42 Qwen2 7B Instruct: 79.9%
#43 Qwen2.5-Omni-7B: 78.7%
#44 Claude 3 Haiku: 75.9%
#45 Gemma 3n E4B Instructed: 75.0%

C-Eval (Rank #6 of 6)

#3 Qwen2 72B Instruct: 83.8%
#4 DeepSeek-V3: 86.5%
#5 Kimi-k1.5: 88.3%
#6 Qwen2 7B Instruct: 77.2%

AlignBench (Rank #4 of 4)

#1 Qwen2.5 7B Instruct: 73.3%
#2 DeepSeek-V2.5: 80.4%
#3 Qwen2.5 72B Instruct: 81.6%
#4 Qwen2 7B Instruct: 72.1%
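A benchmark rank is simply the model's 1-based position when all evaluated models are sorted by score. A sketch of that logic follows; note that some published positions above do not track raw score order, so treat the sorted output as illustrative.

def rank_models(scores: dict[str, float]) -> list[tuple[int, str, float]]:
    # Sort descending by score and attach 1-based positions.
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(i + 1, name, score) for i, (name, score) in enumerate(ordered)]

# Subset of the MT-Bench scores shown above.
mt_bench = {"DeepSeek-V2.5": 90.2, "Qwen2.5 7B Instruct": 87.5, "Qwen2 7B Instruct": 84.1}
for position, name, score in rank_models(mt_bench):
    print(f"#{position} {name}: {score}%")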
All Benchmark Results for Qwen2 7B Instruct
Complete list of benchmark scores with detailed information

Benchmark    Category  Modality  Raw Score  Normalized  Source
MT-Bench     roleplay  text      84.10      84.1%       Self-reported
GSM8k        math      text      0.82       82.3%       Self-reported
HumanEval    code      text      0.80       79.9%       Self-reported
C-Eval       code      text      0.77       77.2%       Self-reported
AlignBench   general   text      0.72       72.1%       Self-reported
MMLU         general   text      0.70       70.5%       Self-reported
EvalPlus     code      text      70.30      70.3%       Self-reported
MBPP         code      text      67.20      67.2%       Self-reported
MultiPL-E    general   text      59.10      59.1%       Self-reported
MATH         math      text      0.50       49.6%       Self-reported

Showing 10 of 14 benchmarks.
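The raw scores in the table arrive in mixed scales: fractions like 0.82 alongside percentages like 84.10. A plausible normalization rule consistent with the table is sketched below, treating any value at or below 1.0 as a fraction; the small mismatches (e.g. 0.82 vs 82.3%) suggest the fractional raw values shown are rounded.

def normalize(raw: float) -> float:
    # Treat values <= 1.0 as fractions of 1; larger values are already percents.
    return raw * 100 if raw <= 1.0 else raw

print(normalize(0.82))   # 82.0 -- the page shows 82.3%, so 0.82 is a rounded raw value
print(normalize(84.10))  # 84.1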