

DeepSeek-V3.1

Zero-eval
#1 SimpleQA · #1 SWE-bench Multilingual · #1 BrowseComp-zh (+2 more)

by DeepSeek

About

DeepSeek-V3.1 is a language model developed by DeepSeek. It shows competitive results across 16 benchmarks, excelling in particular at SimpleQA (93.4%), MMLU-Redux (91.8%), and MMLU-Pro (83.7%), and it is strongest in factuality tasks, where it averages 92.6%. The model supports a 328K-token context window for handling large documents and is available through 2 API providers. Its MIT license permits commercial use, making it suitable for enterprise applications. Released in 2025, it represents DeepSeek's latest advancement in AI technology.

Pricing Range
Input (per 1M tokens): $0.27 (same across providers)
Output (per 1M tokens): $1.00 (same across providers)
Providers: 2
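
For readers estimating costs, here is a minimal sketch of how the per-1M-token rates above translate into per-request cost; the token counts in the example are hypothetical, only the rates come from the pricing table.

# Cost estimate at the listed rates: $0.27 per 1M input tokens, $1.00 per 1M output tokens.
INPUT_PRICE_PER_M = 0.27   # USD per 1M input tokens (from the pricing table)
OUTPUT_PRICE_PER_M = 1.00  # USD per 1M output tokens (from the pricing table)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single request."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Example: a 4,000-token prompt with a 1,000-token completion costs about $0.00208.
print(f"${request_cost(4_000, 1_000):.5f}")
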
Timeline
Announced: Jan 10, 2025
Released: Jan 10, 2025

Specifications
License & Family
License: MIT
Base Model: DeepSeek-V3
Benchmark Performance Overview
Performance metrics and category breakdown

Overall Performance (16 benchmarks)
Average Score: 58.4%
Best Score: 93.4%
High Performers (80%+): 3

Performance Metrics
Max Context Window: 327.7K tokens

Top Categories
factuality: 92.6%
reasoning: 74.9%
general: 57.3%
code: 56.5%
math: 41.6%
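
The category figures above appear to be simple means of the normalized benchmark scores within each category. Below is a minimal sketch of that aggregation using only the 10 benchmarks listed on this page; the site's averages also include 6 benchmarks not shown here, so only the categories whose benchmarks all appear below (factuality and reasoning) reproduce exactly.

from collections import defaultdict

# Normalized scores for the 10 benchmarks listed on this page (6 more are omitted).
scores = {
    "SimpleQA": ("factuality", 0.934),
    "MMLU-Redux": ("factuality", 0.918),
    "MMLU-Pro": ("general", 0.837),
    "GPQA-Diamond": ("reasoning", 0.749),
    "CodeForces": ("code", 0.697),
    "Aider-Polyglot": ("code", 0.684),
    "AIME 2024": ("general", 0.663),
    "SWE-Bench Verified": ("general", 0.660),
    "LiveCodeBench": ("code", 0.564),
    "SWE-bench Multilingual": ("general", 0.545),
}

by_category = defaultdict(list)
for category, score in scores.values():
    by_category[category].append(score)

for category, values in sorted(by_category.items()):
    # Matches the page for factuality (92.6%) and reasoning (74.9%).
    print(f"{category}: {sum(values) / len(values):.1%}")
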
Benchmark Performance
Top benchmark scores with normalized values (0-100%)

Ranking Across Benchmarks
Position relative to other models on each benchmark

SimpleQA (Rank #1 of 25)
#1 DeepSeek-V3.1: 93.4%
#2 DeepSeek-R1-0528: 92.3%
#3 GPT-4.5: 62.5%
#4 Qwen3-235B-A22B-Instruct-2507: 54.3%

MMLU-Redux (Rank #8 of 18)
#5 Kimi K2 Instruct: 92.7%
#6 Kimi K2-Instruct-0905: 92.7%
#7 Qwen3-Next-80B-A3B-Thinking: 92.5%
#8 DeepSeek-V3.1: 91.8%
#9 Qwen3-Next-80B-A3B-Instruct: 90.9%
#10 DeepSeek-V3: 89.1%
#11 Qwen3 235B A22B: 87.4%

MMLU-Pro (Rank #5 of 68)
#2 GLM-4.5: 84.6%
#3 Qwen3-235B-A22B-Thinking-2507: 84.4%
#4 DeepSeek-R1: 84.0%
#5 DeepSeek-V3.1: 83.7%
#6 Qwen3-235B-A22B-Instruct-2507: 83.0%
#7 Qwen3-Next-80B-A3B-Thinking: 82.7%
#8 Kimi K2 0905: 82.5%

GPQA-Diamond (Rank #2 of 2)
#1 DeepSeek-R1-0528: 81.0%
#2 DeepSeek-V3.1: 74.9%

CodeForces (Rank #3 of 5)
#1 GPT OSS 120B: 87.4%
#2 GPT OSS 20B: 83.9%
#3 DeepSeek-V3.1: 69.7%
#4 Qwen3 32B: 65.9%
#5 DeepSeek-R1-0528: 64.3%
All Benchmark Results for DeepSeek-V3.1
Complete list of benchmark scores with detailed information
SimpleQA
SimpleQA is OpenAI's factuality benchmark designed to measure language models' ability to answer short, fact-seeking questions with high correctness and low variance. This comprehensive evaluation tests factual knowledge across diverse topics, challenging even frontier models and providing crucial insights into AI systems' reliability in providing accurate, verifiable information for straightforward factual queries.
Category: factuality · Modality: text · Score: 0.93 (93.4%) · Source: Self-reported
MMLU-Redux
MMLU-Redux is a refined version of the Massive Multitask Language Understanding benchmark that addresses issues in the original dataset through improved question curation and evaluation methodology. It aims to provide more accurate and reliable assessment of language models' knowledge and reasoning capabilities across academic domains.
Category: factuality · Modality: text · Score: 0.92 (91.8%) · Source: Self-reported
MMLU-Pro
MMLU-Pro is an enhanced version of MMLU featuring more challenging reasoning-focused questions with expanded choice sets from four to ten options. It eliminates trivial questions from the original MMLU and demonstrates greater stability under varying prompts. The benchmark causes a 16-33% accuracy drop compared to standard MMLU, better revealing differences in model capabilities and requiring chain-of-thought reasoning for optimal performance.
Category: general · Modality: text · Score: 0.84 (83.7%) · Source: Self-reported
GPQA-Diamond
GPQA-Diamond is the diamond (hardest) subset of the Graduate-Level Google-Proof Q&A benchmark, testing expert-level reasoning in physics, biology, and chemistry.
Category: reasoning · Modality: text · Score: 0.75 (74.9%) · Source: Self-reported
CodeForces
CodeForces is a competition-level programming benchmark designed to evaluate Large Language Models' reasoning capabilities through challenging algorithmic problems from the CodeForces platform. This benchmark tests advanced problem-solving skills, algorithmic thinking, and code generation abilities using real competitive programming contests. CodeForces provides human-comparable evaluation of AI coding capabilities in complex, contest-quality scenarios requiring sophisticated reasoning and optimization.
Category: code · Modality: text · Score: 0.70 (69.7%) · Source: Self-reported
Aider-Polyglot
Aider-Polyglot is a comprehensive AI coding benchmark that evaluates large language models across 225 challenging Exercism programming exercises in C++, Go, Java, JavaScript, Python, and Rust. This multi-language benchmark tests models' ability to solve complex coding problems, edit existing code, and correct mistakes through a two-attempt methodology. It measures code generation accuracy, edit format compliance, and debugging capabilities.
Category: code · Modality: text · Score: 0.68 (68.4%) · Source: Self-reported
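
The two-attempt methodology described above can be sketched roughly as follows; generate_solution and run_tests are hypothetical stand-ins for the real harness, not Aider's actual API.

# Rough sketch of a two-attempt evaluation loop of the kind Aider-Polyglot describes.
# `generate_solution` and `run_tests` are hypothetical placeholders, not the real harness.

def evaluate_exercise(exercise, generate_solution, run_tests, max_attempts=2):
    feedback = None
    for attempt in range(1, max_attempts + 1):
        solution = generate_solution(exercise, feedback)  # retry sees the failing test output
        passed, output = run_tests(exercise, solution)    # assumed to return (passed, output)
        if passed:
            return {"passed": True, "attempts": attempt}
        feedback = output  # fed back to the model for the second attempt
    return {"passed": False, "attempts": max_attempts}
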
AIME 2024
The AIME 2024 benchmark evaluates AI models' mathematical reasoning using 15 problems from the 2024 American Invitational Mathematics Examination. This challenging test requires step-by-step problem-solving across algebra, geometry, and number theory, with integer answers from 000-999. Models must demonstrate olympiad-level mathematical capabilities that qualify top high school students for USAMO. The benchmark uses exact match scoring and multiple runs to assess advanced logical reasoning.
Category: general · Modality: text · Score: 0.66 (66.3%) · Source: Self-reported
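
A minimal sketch of the exact-match scoring described above, averaged over multiple runs; the answer-extraction step here is a simplifying assumption rather than the benchmark's actual parser.

import re

def extract_answer(model_output: str):
    # Simplifying assumption: treat the last 1-3 digit integer in the output as the answer.
    matches = re.findall(r"\d{1,3}", model_output)
    return matches[-1].zfill(3) if matches else None

def aime_accuracy(runs, gold):
    # runs: one list of model outputs per run, aligned with the gold answers ("000"-"999").
    per_run = []
    for outputs in runs:
        correct = sum(extract_answer(o) == g for o, g in zip(outputs, gold))
        per_run.append(correct / len(gold))
    return sum(per_run) / len(per_run)  # mean exact-match accuracy over runs
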
SWE-Bench Verified
SWE-bench-verified is a human-validated subset of the original SWE-bench featuring 500 carefully verified samples for evaluating AI models' software engineering capabilities. This rigorous benchmark tests models' ability to generate patches that resolve real GitHub issues, focusing on bug fixing, code generation, and software development tasks with improved reliability and reduced evaluation noise.
Category: general · Modality: text · Score: 0.66 (66.0%) · Source: Self-reported
LiveCodeBench
LiveCodeBench is a holistic and contamination-free code evaluation benchmark that continuously collects new programming problems from competitive coding platforms. This dynamic benchmark tests AI models' programming capabilities through fresh problems that evolve over time, preventing memorization and ensuring genuine coding skills assessment across algorithm implementation, problem-solving, and code generation tasks.
Category: code · Modality: text · Score: 0.56 (56.4%) · Source: Self-reported
SWE-bench Multilingual
SWE-bench-multilingual is a software engineering benchmark extending the original SWE-bench to cover multiple programming languages including Java, TypeScript, JavaScript, Go, Rust, C, and C++. This comprehensive evaluation tests AI models' ability to resolve real-world software issues across diverse programming ecosystems, challenging multilingual code understanding and debugging capabilities in authentic development scenarios.
Category: general · Modality: text · Score: 0.55 (54.5%) · Source: Self-reported
Showing 10 of 16 benchmarks (the remaining 6 are not listed here).