DeepSeek

DeepSeek-V2.5

Zero-eval
#1 DS-FIM-Eval
#1 Aider
#1 DS-Arena-Code
+4 more

by DeepSeek

About

DeepSeek-V2.5 is a language model developed by DeepSeek. It averages 71.1% across 15 benchmarks, with its strongest results on GSM8k (95.1%), MT-Bench (90.2%), and HumanEval (89.0%). The model is available through 3 API providers. It was released in 2024.

Pricing Range
Input (per 1M tokens): $0.14 to $2.00
Output (per 1M tokens): $0.28 to $2.00
Providers: 3
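As a rough illustration of how per-million-token prices translate into a request cost, the sketch below multiplies token counts by the listed rates. The function name `estimate_cost` and the example token counts are hypothetical. The prices are the low and high ends of the range above. Pairing the lowest input price with the lowest output price assumes a single provider offers both, which this page does not confirm, and actual provider billing may differ.

```python
def estimate_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Return the USD cost of one request given per-1M-token prices."""
    input_cost = (input_tokens / 1_000_000) * input_price_per_m
    output_cost = (output_tokens / 1_000_000) * output_price_per_m
    return input_cost + output_cost


# Hypothetical workload: 200,000 input tokens and 50,000 output tokens.
low = estimate_cost(200_000, 50_000, 0.14, 0.28)   # lowest listed rates (assumed same provider)
high = estimate_cost(200_000, 50_000, 2.00, 2.00)  # highest listed rates
print(f"${low:.3f} to ${high:.3f}")
```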
Timeline
Announced: May 8, 2024
Released: May 8, 2024
Specifications
License & Family
License: deepseek
Benchmark Performance Overview
Performance metrics and category breakdown

Overall Performance

15 benchmarks
Average Score
71.1%
Best Score
95.1%
High Performers (80%+)
6

Performance Metrics

Max Context Window
16.4K
Avg Throughput
87.7 tok/s
Avg Latency
1ms
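To give a feel for what the throughput figure means in practice, the sketch below estimates how long a response of a given length might take. It assumes a simple model in which total time is the listed average latency plus output tokens divided by average throughput. The function name `estimate_generation_seconds` and the 1,000-token example are hypothetical, and real timings vary by provider and load.

```python
def estimate_generation_seconds(output_tokens, throughput_tok_per_s=87.7, latency_s=0.001):
    """Estimate wall-clock seconds: fixed latency plus tokens / throughput.

    Defaults are the average throughput (87.7 tok/s) and latency (1 ms)
    listed on this page; the additive model itself is an assumption.
    """
    return latency_s + output_tokens / throughput_tok_per_s


# Hypothetical response length of 1,000 output tokens.
seconds = estimate_generation_seconds(1_000)
print(f"{seconds:.1f} s")
```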

Top Categories

roleplay
90.2%
math
84.9%
general
68.4%
code
66.1%
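The category figures appear consistent with an unweighted mean of the normalized benchmark scores in each category, although the page does not state how they are computed. The sketch below checks that assumption for the two categories whose benchmarks are all listed on this page (roleplay and math). The general and code categories cannot be reproduced this way because only 10 of the 15 benchmarks are shown.

```python
from statistics import mean

# Normalized scores (%) taken from the benchmark list on this page.
listed_scores = {
    "roleplay": [90.2],    # MT-Bench
    "math": [95.1, 74.7],  # GSM8k, MATH
}

# Assumption: a category score is the unweighted mean of its benchmarks.
category_means = {name: round(mean(scores), 1) for name, scores in listed_scores.items()}
print(category_means)
```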
Benchmark Performance
Top benchmark scores with normalized values (0-100%)
Ranking Across Benchmarks
Position relative to other models on each benchmark

GSM8k

Rank #10 of 46
#7 Qwen2.5 72B Instruct: 95.8%
#8 Qwen2.5 32B Instruct: 95.9%
#9 Gemma 3 27B: 95.9%
#10 DeepSeek-V2.5: 95.1%
#11 Claude 3 Opus: 95.0%
#12 Nova Pro: 94.8%
#13 Qwen2.5 14B Instruct: 94.8%

MT-Bench

Rank #3 of 11
#1 Llama-3.3 Nemotron Super 49B v1: 91.7%
#2 Qwen2.5 72B Instruct: 93.5%
#3 DeepSeek-V2.5: 90.2%
#4 Qwen2.5 7B Instruct: 87.5%
#5 Mistral Large 2: 86.3%
#6 Qwen2 7B Instruct: 84.1%

HumanEval

Rank #15 of 62
#12 Nova Pro: 89.0%
#13 Llama 3.1 405B Instruct: 89.0%
#14 Gemini Diffusion: 89.6%
#15 DeepSeek-V2.5: 89.0%
#16 Mistral Small 3.1 24B Instruct: 88.4%
#17 Llama 3.3 70B Instruct: 88.4%
#18 Grok-2: 88.4%

BBH

Rank #4 of 8
#1 Qwen2.5 32B Instruct: 84.5%
#2 Nova Pro: 86.9%
#3 Qwen3 235B A22B: 88.9%
#4 DeepSeek-V2.5: 84.3%
#5 Nova Lite: 82.4%
#6 Qwen2 72B Instruct: 82.4%
#7 Nova Micro: 79.5%

MMLU

Rank #43 of 78
#40 Nova Lite: 80.5%
#41 Mistral Small 3.2 24B Instruct: 80.5%
#42 Mistral Small 3.1 24B Instruct: 80.6%
#43 DeepSeek-V2.5: 80.4%
#44 Llama 3.1 Nemotron 70B Instruct: 80.2%
#45 GPT-4.1 nano: 80.1%
#46 Qwen2.5 14B Instruct: 79.7%
All Benchmark Results for DeepSeek-V2.5
Complete list of benchmark scores with detailed information
Benchmark       Category  Modality  Raw score  Normalized  Source
GSM8k           math      text      0.95       95.1%       Self-reported
MT-Bench        roleplay  text      90.20      90.2%       Self-reported
HumanEval       code      text      0.89       89.0%       Self-reported
BBH             general   text      0.84       84.3%       Self-reported
MMLU            general   text      0.80       80.4%       Self-reported
AlignBench      general   text      0.80       80.4%       Self-reported
DS-FIM-Eval     code      text      0.78       78.3%       Self-reported
Arena Hard      general   text      0.76       76.2%       Self-reported
MATH            math      text      0.75       74.7%       Self-reported
HumanEval-Mul   code      text      0.74       73.8%       Self-reported
Showing 1 to 10 of 15 benchmarks
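Because only 10 of the 15 benchmarks are listed, the summary figures can be checked only in part. The sketch below counts listed scores at or above 80% and, assuming the reported 71.1% average is an unweighted mean of all 15 normalized scores, back-calculates the approximate mean of the 5 benchmarks not shown. The averaging method is an assumption and the reported average is rounded, so the implied value is only approximate.

```python
# Normalized scores (%) for the 10 benchmarks listed on this page.
listed = {
    "GSM8k": 95.1, "MT-Bench": 90.2, "HumanEval": 89.0, "BBH": 84.3,
    "MMLU": 80.4, "AlignBench": 80.4, "DS-FIM-Eval": 78.3,
    "Arena Hard": 76.2, "MATH": 74.7, "HumanEval-Mul": 73.8,
}
TOTAL_BENCHMARKS = 15    # from the page summary
REPORTED_AVERAGE = 71.1  # from the page summary (rounded)

# Count of listed benchmarks scoring 80% or higher.
high_performers = sum(1 for score in listed.values() if score >= 80.0)

# Mean of the listed benchmarks only.
listed_mean = sum(listed.values()) / len(listed)

# Assumption: the reported average is an unweighted mean over all 15 scores.
unlisted_count = TOTAL_BENCHMARKS - len(listed)
implied_unlisted_mean = (
    REPORTED_AVERAGE * TOTAL_BENCHMARKS - sum(listed.values())
) / unlisted_count

print(high_performers, round(listed_mean, 2), round(implied_unlisted_mean, 2))
```

The count of listed scores at or above 80% agrees with the "High Performers (80%+)" figure in the summary, and the gap between the listed mean and the reported average suggests the unlisted benchmarks score considerably lower.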