
Kimi K2 Base
Ranked #1 on C-Eval, MMLU-redux-2.0, TriviaQA, and 4 more benchmarks
by Moonshot AI
About
Kimi K2 Base is a language model developed by Moonshot AI. It achieves strong overall performance, with an average score of 69.2% across 13 benchmarks, and scores highest on C-Eval (92.5%), GSM8k (92.1%), and MMLU-redux-2.0 (90.2%). Its strongest category is math, with an average of 81.2%. Released in 2025, it represents Moonshot AI's latest advancement in language modeling.
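The category averages are arithmetic means of the normalized benchmark scores. For instance, if GSM8k and MATH are the only two math benchmarks (the table at the bottom of this page suggests so, though 3 of the 13 benchmarks are not listed there), the math average works out as (92.1 + 70.2) / 2 = 81.15, which rounds to the 81.2% reported. A minimal sketch of that grouping in Python, using the 10 listed scores:

from collections import defaultdict

# Normalized scores and category labels, copied from the "All Benchmark
# Results" table at the bottom of this page. Only 10 of the 13 benchmarks
# are listed, so only the math average reproduces the page's figure; the
# general and code means here exclude the three unlisted benchmarks.
scores = [
    ("C-Eval", "code", 92.5),
    ("GSM8k", "math", 92.1),
    ("MMLU-redux-2.0", "general", 90.2),
    ("MMLU", "general", 87.8),
    ("TriviaQA", "general", 85.1),
    ("EvalPlus", "code", 80.3),
    ("CSimpleQA", "general", 77.6),
    ("MATH", "math", 70.2),
    ("MMLU-Pro", "general", 69.2),
    ("GPQA", "general", 48.1),
]

by_category = defaultdict(list)
for _, category, score in scores:
    by_category[category].append(score)

for category, vals in sorted(by_category.items()):
    print(f"{category}: {sum(vals) / len(vals):.2f}%")
# math -> (92.1 + 70.2) / 2 = 81.15, i.e. the 81.2% reported above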
Timeline
Announced: Jan 1, 2025
Released: Jan 1, 2025
Specifications
Training Tokens: 15.5T
License & Family
License
Modified MIT License
Benchmark Performance Overview
Performance metrics and category breakdown
Overall Performance
13 benchmarks
Average Score
69.2%
Best Score
92.5%
High Performers (80%+)
6
Top Categories
math
81.2%
general
67.3%
code
66.4%
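The "High Performers (80%+)" figure is the count of benchmarks scoring at least 80% after normalization. A quick check against the 10 scores listed on this page (a sketch; the three unlisted benchmarks evidently fall below 80%, since the count already reaches 6):

# Normalized scores from the benchmark table below.
scores = {
    "C-Eval": 92.5, "GSM8k": 92.1, "MMLU-redux-2.0": 90.2,
    "MMLU": 87.8, "TriviaQA": 85.1, "EvalPlus": 80.3,
    "CSimpleQA": 77.6, "MATH": 70.2, "MMLU-Pro": 69.2, "GPQA": 48.1,
}
high = [name for name, s in scores.items() if s >= 80.0]
print(len(high), sorted(high))
# -> 6, matching the "High Performers (80%+)" figure above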
Benchmark Performance
Top benchmark scores with normalized values (0-100%)
Ranking Across Benchmarks
Position relative to other models on each benchmark
C-Eval
Rank #1 of 6
#1 Kimi K2 Base: 92.5%
#2 DeepSeek-R1: 91.8%
#3 Kimi-k1.5: 88.3%
#4 DeepSeek-V3: 86.5%
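Each rank in these lists is the model's position in a descending sort of that benchmark's scores. A sketch of the computation over the four C-Eval entries just listed (only 4 of the 6 ranked models appear on this page, and the tie-breaking rule is an assumption, as the source does not state one):

ceval_scores = {
    "Kimi K2 Base": 92.5,
    "DeepSeek-R1": 91.8,
    "Kimi-k1.5": 88.3,
    "DeepSeek-V3": 86.5,
}
# Sort by score, highest first; rank is the 1-based position in that order.
ranking = sorted(ceval_scores.items(), key=lambda kv: kv[1], reverse=True)
for rank, (model, score) in enumerate(ranking, start=1):
    print(f"#{rank} {model}: {score}%")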
GSM8k
Rank #20 of 46
#17 Nova Micro: 92.3%
#18 Claude 3 Sonnet: 92.3%
#19 Mistral Large 2: 93.0%
#20 Kimi K2 Base: 92.1%
#21 Qwen2.5 7B Instruct: 91.6%
#22 Llama 3.1 Nemotron 70B Instruct: 91.4%
#23 Qwen2 72B Instruct: 91.1%
MMLU-redux-2.0
Rank #1 of 1
#1 Kimi K2 Base: 90.2%
MMLU
Rank #13 of 78
#10 Qwen3 235B A22B: 87.8%
#11 DeepSeek-V3: 88.5%
#12 GPT-4o: 88.7%
#13 Kimi K2 Base: 87.8%
#14 GPT-4.1 mini: 87.5%
#15 Grok-2: 87.5%
#16 Kimi-k1.5: 87.4%
TriviaQA
Rank #1 of 13
#1 Kimi K2 Base: 85.1%
#2 Gemma 2 27B: 83.7%
#3 Mistral Small 3.1 24B Base: 80.5%
#4 Mistral Small 3.1 24B Instruct: 80.5%
All Benchmark Results for Kimi K2 Base
Complete list of benchmark scores with detailed information
Benchmark | Category | Modality | Raw Score | Normalized | Source
C-Eval | code | text | 0.93 | 92.5% | Self-reported
GSM8k | math | text | 0.92 | 92.1% | Self-reported
MMLU-redux-2.0 | general | text | 0.90 | 90.2% | Self-reported
MMLU | general | text | 0.88 | 87.8% | Self-reported
TriviaQA | general | text | 0.85 | 85.1% | Self-reported
EvalPlus | code | text | 80.30 | 80.3% | Self-reported
CSimpleQA | general | text | 0.78 | 77.6% | Self-reported
MATH | math | text | 0.70 | 70.2% | Self-reported
MMLU-Pro | general | text | 0.69 | 69.2% | Self-reported
GPQA | general | text | 0.48 | 48.1% | Self-reported
Showing 10 of 13 benchmarks.
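Raw scores in the table mix two scales: most rows report a 0-1 fraction (0.93 for C-Eval, rounded from 0.925), while EvalPlus is already on a 0-100 scale. A sketch of parsing rows of this pipe-separated shape and reconciling the scales (the above-1-means-percent heuristic is an assumption, not something the page states):

rows = [
    "C-Eval | code | text | 0.93 | 92.5% | Self-reported",
    "EvalPlus | code | text | 80.30 | 80.3% | Self-reported",
]
for row in rows:
    name, category, modality, raw, norm, source = (f.strip() for f in row.split("|"))
    raw_value = float(raw)
    # Treat raw values above 1 as already expressed in percent.
    pct = raw_value * 100 if raw_value <= 1.0 else raw_value
    print(f"{name} ({category}, {modality}): ~{pct:.1f}% vs listed {norm} [{source}]")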