Kimi K2 Instruct
Zero-eval
#1 MATH-500
#1 GSM8k
#1 CBNSL
+23 more
by Moonshot AI
About
Kimi K2 Instruct is a language model developed by Moonshot AI. It achieves strong overall performance, averaging 66.7% across 38 benchmarks, and ranks first on MATH-500 (97.4%), GSM8k (97.3%), and CBNSL (95.6%). It supports a 144K-token context window for handling large documents and is currently available through one API provider. Released in 2025, it is Moonshot AI's latest model.
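Since the model is served through an API provider, a request typically follows the OpenAI-style chat-completions format. The sketch below only builds the request payload; the model identifier "kimi-k2-instruct" is an assumption for illustration, not confirmed by this page — check your provider's model list for the exact name.

```python
# Hedged sketch: construct an OpenAI-style chat-completions payload for
# Kimi K2 Instruct. The model identifier is an assumed placeholder.
import json

def build_chat_request(prompt: str, model: str = "kimi-k2-instruct") -> dict:
    """Build an OpenAI-compatible chat-completions payload (illustrative only)."""
    return {
        "model": model,  # assumed identifier; verify with your API provider
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
    }

payload = build_chat_request("Summarize the MATH-500 benchmark in one sentence.")
print(json.dumps(payload, indent=2))
```

The payload can then be POSTed to the provider's chat-completions endpoint with any HTTP client.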
Pricing Range
Input (per 1M tokens): $0.57
Output (per 1M tokens): $2.29
Providers: 1
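Because pricing is quoted per million tokens, the cost of a single request is straightforward to estimate. A minimal sketch, using the input and output prices listed above:

```python
# Estimate per-request cost from the listed per-1M-token prices.
INPUT_PRICE_PER_M = 0.57   # USD per 1M input tokens (from the pricing table)
OUTPUT_PRICE_PER_M = 2.29  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for one request."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Example: a 100K-token prompt with a 2K-token completion
print(f"${request_cost(100_000, 2_000):.4f}")  # → $0.0616
```

At these rates, even a prompt near the 144K context limit costs well under a dollar per request.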
Timeline
Announced: Jan 1, 2025
Released: Jan 1, 2025
Specifications
Training Tokens: 15.5T
License & Family
License
Modified MIT License
Base Model: Kimi K2 Base
Benchmark Performance Overview
Performance metrics and category breakdown
Overall Performance
38 benchmarks
Average Score
66.7%
Best Score
97.4%
High Performers (80%+): 12
Performance Metrics
Max Context Window: 144.0K
Avg Throughput: 45.0 tok/s
Avg Latency: 1ms
Top Categories
reasoning
89.0%
math
86.6%
code
79.5%
roleplay
76.4%
general
62.0%
Benchmark Performance
Top benchmark scores with normalized values (0-100%)
Ranking Across Benchmarks
Position relative to other models on each benchmark
MATH-500
Rank #1 of 22
#1 Kimi K2 Instruct
97.4%
#2 DeepSeek-R1
97.3%
#3 Llama 3.1 Nemotron Ultra 253B v1
97.0%
#4 Llama-3.3 Nemotron Super 49B v1
96.6%
GSM8k
Rank #1 of 46
#1 Kimi K2 Instruct
97.3%
#2 o1
97.1%
#3 GPT-4.5
97.0%
#4 Llama 3.1 405B Instruct
96.8%
CBNSL
Rank #1 of 1
#1 Kimi K2 Instruct
95.6%
HumanEval
Rank #3 of 62
#1 Claude 3.5 Sonnet
93.7%
#2 GPT-5
93.4%
#3 Kimi K2 Instruct
93.3%
#4 Qwen2.5-Coder 32B Instruct
92.7%
#5 o1-mini
92.4%
#6 Claude 3.5 Sonnet
92.0%
MMLU-Redux
Rank #4 of 13
#1 DeepSeek-R1-0528
93.4%
#2 Qwen3-235B-A22B-Instruct-2507
93.1%
#3 DeepSeek-R1
92.9%
#4 Kimi K2 Instruct
92.7%
#5 DeepSeek-V3
89.1%
#6 Qwen3 235B A22B
87.4%
#7 Qwen2.5 72B Instruct
86.8%
All Benchmark Results for Kimi K2 Instruct
Complete list of benchmark scores with detailed information
Benchmark | Category | Modality | Raw Score | Normalized | Source
MATH-500 | math | text | 0.97 | 97.4% | Self-reported
GSM8k | math | text | 0.97 | 97.3% | Self-reported
CBNSL | general | text | 0.96 | 95.6% | Self-reported
HumanEval | code | text | 0.93 | 93.3% | Self-reported
MMLU-Redux | general | text | 0.93 | 92.7% | Self-reported
IFEval | code | text | 0.90 | 89.8% | Self-reported
MMLU | general | text | 0.90 | 89.5% | Self-reported
AutoLogi | general | text | 0.90 | 89.5% | Self-reported
ZebraLogic | reasoning | text | 0.89 | 89.0% | Self-reported
MultiPL-E | general | text | 0.86 | 85.7% | Self-reported
Showing 1 to 10 of 38 benchmarks