

DeepSeek-V3.1

Zero-eval
#1 SimpleQA · #1 SWE-bench Multilingual · #1 BrowseComp-zh (+2 more)

by DeepSeek

About

DeepSeek-V3.1 is a language model developed by DeepSeek. It shows competitive results across 16 benchmarks, excelling in particular at SimpleQA (93.4%), MMLU-Redux (91.8%), and MMLU-Pro (83.7%), and it is strongest in factuality tasks, where it averages 92.6%. The model supports a 328K-token context window for handling large documents and is available through 2 API providers. Its MIT license permits commercial use, making it suitable for enterprise applications. Released in 2025, it represents DeepSeek's latest advancement in AI technology.

Pricing Range
Input (per 1M tokens): $0.27 (same across providers)
Output (per 1M tokens): $1.00 (same across providers)
Providers: 2
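
For readers estimating costs, here is a minimal sketch of how the per-1M-token rates above translate into per-request cost; the token counts in the example are hypothetical, only the rates come from the pricing table.

# Cost estimate at the listed rates: $0.27 per 1M input tokens, $1.00 per 1M output tokens.
INPUT_PRICE_PER_M = 0.27   # USD per 1M input tokens (from the pricing table)
OUTPUT_PRICE_PER_M = 1.00  # USD per 1M output tokens (from the pricing table)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single request."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Example: a 4,000-token prompt with a 1,000-token completion costs about $0.00208.
print(f"${request_cost(4_000, 1_000):.5f}")
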
Timeline
Announced: Jan 10, 2025
Released: Jan 10, 2025

Specifications
License & Family
License: MIT
Base Model: DeepSeek-V3
Benchmark Performance Overview
Performance metrics and category breakdown

Overall Performance (16 benchmarks)
Average Score: 58.4%
Best Score: 93.4%
High Performers (80%+): 3

Performance Metrics
Max Context Window: 327.7K tokens

Top Categories
factuality: 92.6%
reasoning: 74.9%
general: 57.3%
code: 56.5%
math: 41.6%
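
The category figures above appear to be simple means of the normalized benchmark scores within each category. Below is a minimal sketch of that aggregation using only the 10 benchmarks listed on this page; the site's averages also include 6 benchmarks not shown here, so only the categories whose benchmarks all appear below (factuality and reasoning) reproduce exactly.

from collections import defaultdict

# Normalized scores for the 10 benchmarks listed on this page (6 more are omitted).
scores = {
    "SimpleQA": ("factuality", 0.934),
    "MMLU-Redux": ("factuality", 0.918),
    "MMLU-Pro": ("general", 0.837),
    "GPQA-Diamond": ("reasoning", 0.749),
    "CodeForces": ("code", 0.697),
    "Aider-Polyglot": ("code", 0.684),
    "AIME 2024": ("general", 0.663),
    "SWE-Bench Verified": ("general", 0.660),
    "LiveCodeBench": ("code", 0.564),
    "SWE-bench Multilingual": ("general", 0.545),
}

by_category = defaultdict(list)
for category, score in scores.values():
    by_category[category].append(score)

for category, values in sorted(by_category.items()):
    # Matches the page for factuality (92.6%) and reasoning (74.9%).
    print(f"{category}: {sum(values) / len(values):.1%}")
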
Benchmark Performance
Top benchmark scores with normalized values (0-100%)

Ranking Across Benchmarks
Position relative to other models on each benchmark

SimpleQA (Rank #1 of 25)
#1 DeepSeek-V3.1: 93.4%
#2 DeepSeek-R1-0528: 92.3%
#3 GPT-4.5: 62.5%
#4 Qwen3-235B-A22B-Instruct-2507: 54.3%

MMLU-Redux (Rank #8 of 18)
#5 Kimi K2 Instruct: 92.7%
#6 Kimi K2-Instruct-0905: 92.7%
#7 Qwen3-Next-80B-A3B-Thinking: 92.5%
#8 DeepSeek-V3.1: 91.8%
#9 Qwen3-Next-80B-A3B-Instruct: 90.9%
#10 DeepSeek-V3: 89.1%
#11 Qwen3 235B A22B: 87.4%

MMLU-Pro (Rank #5 of 68)
#2 GLM-4.5: 84.6%
#3 Qwen3-235B-A22B-Thinking-2507: 84.4%
#4 DeepSeek-R1: 84.0%
#5 DeepSeek-V3.1: 83.7%
#6 Qwen3-235B-A22B-Instruct-2507: 83.0%
#7 Qwen3-Next-80B-A3B-Thinking: 82.7%
#8 Kimi K2 0905: 82.5%

GPQA-Diamond (Rank #2 of 2)
#1 DeepSeek-R1-0528: 81.0%
#2 DeepSeek-V3.1: 74.9%

CodeForces (Rank #3 of 5)
#1 GPT OSS 120B: 87.4%
#2 GPT OSS 20B: 83.9%
#3 DeepSeek-V3.1: 69.7%
#4 Qwen3 32B: 65.9%
#5 DeepSeek-R1-0528: 64.3%
All Benchmark Results for DeepSeek-V3.1
Complete list of benchmark scores with detailed information
SimpleQA
SimpleQA is OpenAI's factuality benchmark designed to measure language models' ability to answer short, fact-seeking questions with high correctness and low variance. This comprehensive evaluation tests factual knowledge across diverse topics, challenging even frontier models and providing crucial insights into AI systems' reliability in providing accurate, verifiable information for straightforward factual queries.
Category: factuality · Modality: text · Score: 0.93 (93.4%) · Source: Self-reported
MMLU-Redux
MMLU-Redux is a refined version of the Massive Multitask Language Understanding benchmark that addresses issues in the original dataset through improved question curation and evaluation methodology. It aims to provide more accurate and reliable assessment of language models' knowledge and reasoning capabilities across academic domains.
Category: factuality · Modality: text · Score: 0.92 (91.8%) · Source: Self-reported
MMLU-Pro
MMLU-Pro is an enhanced version of MMLU featuring more challenging reasoning-focused questions with expanded choice sets from four to ten options. It eliminates trivial questions from the original MMLU and demonstrates greater stability under varying prompts. The benchmark causes a 16-33% accuracy drop compared to standard MMLU, better revealing differences in model capabilities and requiring chain-of-thought reasoning for optimal performance.
Category: general · Modality: text · Score: 0.84 (83.7%) · Source: Self-reported
GPQA-Diamond
GPQA-Diamond is the diamond (hardest) subset of the Graduate-Level Google-Proof Q&A benchmark, testing expert-level reasoning in physics, biology, and chemistry.
Category: reasoning · Modality: text · Score: 0.75 (74.9%) · Source: Self-reported
CodeForces
CodeForces is a competition-level programming benchmark designed to evaluate Large Language Models' reasoning capabilities through challenging algorithmic problems from the CodeForces platform. This benchmark tests advanced problem-solving skills, algorithmic thinking, and code generation abilities using real competitive programming contests. CodeForces provides human-comparable evaluation of AI coding capabilities in complex, contest-quality scenarios requiring sophisticated reasoning and optimization.
Category: code · Modality: text · Score: 0.70 (69.7%) · Source: Self-reported
Aider-Polyglot
Aider-Polyglot is a comprehensive AI coding benchmark that evaluates large language models across 225 challenging Exercism programming exercises in C++, Go, Java, JavaScript, Python, and Rust. This multi-language benchmark tests models' ability to solve complex coding problems, edit existing code, and correct mistakes through a two-attempt methodology. It measures code generation accuracy, edit format compliance, and debugging capabilities.
Category: code · Modality: text · Score: 0.68 (68.4%) · Source: Self-reported
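
The two-attempt methodology described above can be sketched roughly as follows; generate_solution and run_tests are hypothetical stand-ins for the real harness, not Aider's actual API.

# Rough sketch of a two-attempt evaluation loop of the kind Aider-Polyglot describes.
# `generate_solution` and `run_tests` are hypothetical placeholders, not the real harness.

def evaluate_exercise(exercise, generate_solution, run_tests, max_attempts=2):
    feedback = None
    for attempt in range(1, max_attempts + 1):
        solution = generate_solution(exercise, feedback)  # retry sees the failing test output
        passed, output = run_tests(exercise, solution)    # assumed to return (passed, output)
        if passed:
            return {"passed": True, "attempts": attempt}
        feedback = output  # fed back to the model for the second attempt
    return {"passed": False, "attempts": max_attempts}
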
AIME 2024
The AIME 2024 benchmark evaluates AI models' mathematical reasoning using 15 problems from the 2024 American Invitational Mathematics Examination. This challenging test requires step-by-step problem-solving across algebra, geometry, and number theory, with integer answers from 000-999. Models must demonstrate olympiad-level mathematical capabilities that qualify top high school students for USAMO. The benchmark uses exact match scoring and multiple runs to assess advanced logical reasoning.
Category: general · Modality: text · Score: 0.66 (66.3%) · Source: Self-reported
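
A minimal sketch of the exact-match scoring described above, averaged over multiple runs; the answer-extraction step here is a simplifying assumption rather than the benchmark's actual parser.

import re

def extract_answer(model_output: str):
    # Simplifying assumption: treat the last 1-3 digit integer in the output as the answer.
    matches = re.findall(r"\d{1,3}", model_output)
    return matches[-1].zfill(3) if matches else None

def aime_accuracy(runs, gold):
    # runs: one list of model outputs per run, aligned with the gold answers ("000"-"999").
    per_run = []
    for outputs in runs:
        correct = sum(extract_answer(o) == g for o, g in zip(outputs, gold))
        per_run.append(correct / len(gold))
    return sum(per_run) / len(per_run)  # mean exact-match accuracy over runs
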
SWE-Bench Verified
SWE-bench-verified is a human-validated subset of the original SWE-bench featuring 500 carefully verified samples for evaluating AI models' software engineering capabilities. This rigorous benchmark tests models' ability to generate patches that resolve real GitHub issues, focusing on bug fixing, code generation, and software development tasks with improved reliability and reduced evaluation noise.
Category: general · Modality: text · Score: 0.66 (66.0%) · Source: Self-reported
LiveCodeBench
LiveCodeBench is a holistic and contamination-free code evaluation benchmark that continuously collects new programming problems from competitive coding platforms. This dynamic benchmark tests AI models' programming capabilities through fresh problems that evolve over time, preventing memorization and ensuring genuine coding skills assessment across algorithm implementation, problem-solving, and code generation tasks.
Category: code · Modality: text · Score: 0.56 (56.4%) · Source: Self-reported
SWE-bench Multilingual
SWE-bench-multilingual is a software engineering benchmark extending the original SWE-bench to cover multiple programming languages including Java, TypeScript, JavaScript, Go, Rust, C, and C++. This comprehensive evaluation tests AI models' ability to resolve real-world software issues across diverse programming ecosystems, challenging multilingual code understanding and debugging capabilities in authentic development scenarios.
Category: general · Modality: text · Score: 0.55 (54.5%) · Source: Self-reported
Showing 10 of 16 benchmarks (the remaining 6 are not listed here).