Top Models by SWE-bench Verified

Aug 7, 2025    74.9%   88.0%   93.4%   -       -
Aug 5, 2025    74.5%   -       -       -       -
Sep 15, 2025   74.5%   -       -       -       -
May 22, 2025   72.7%   -       -       -       -
May 22, 2025   72.5%   -       -       -       -
Feb 24, 2025   70.3%   -       -       -       -
Apr 16, 2025   69.1%   81.3%   -       -       -
Apr 16, 2025   68.1%   68.9%   -       -       -
Jun 5, 2025    67.2%   82.2%   -       69.0%   -
Jan 10, 2025   66.0%   68.4%   -       56.4%   -
Showing 1 to 10 of 160 models
Total Models: 160 (AI models tracked)
Organizations: 17 (companies & labs)
Providers: 20 (API providers)
Benchmarks: 344 (evaluation metrics)

Coding Categories Performance

Model performance across different coding domains and specializations

Note: These rankings reflect performance on available benchmarks for each model. Rankings do not necessarily indicate absolute superiority in a category, as most models have not been evaluated on all benchmarks.
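
To make the caveat above concrete, here is a minimal sketch of how a ranking over partially evaluated models can behave. It assumes, purely for illustration, that a category score is the plain mean of whatever benchmark scores a model has; the page does not state its actual aggregation method. The function `rank_category` and the model and benchmark names are hypothetical.

```python
from statistics import mean


def rank_category(scores, top_n=5):
    """Rank models by the mean of their available benchmark scores.

    `scores` maps model name -> {benchmark name: score or None}, where
    None means the model was not evaluated on that benchmark.
    ASSUMPTION: averaging available scores is an illustrative choice,
    not the leaderboard's documented method.
    """
    rows = []
    for model, by_benchmark in scores.items():
        available = [s for s in by_benchmark.values() if s is not None]
        if not available:
            continue  # a model with no results cannot be ranked
        rows.append((model, mean(available), len(available)))
    # Highest average first; the benchmark count is kept for context.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:top_n]


# Hypothetical data: model-a was evaluated on one benchmark only.
example = {
    "model-a": {"bench-x": 90.0, "bench-y": None},
    "model-b": {"bench-x": 80.0, "bench-y": 88.0},
    "model-c": {"bench-x": None, "bench-y": None},
}

for model, avg, count in rank_category(example):
    print(f"{model}: {avg:.1f}% over {count} benchmark(s)")
```

In this sketch model-a ranks first on the strength of a single result, while model-b's score reflects two benchmarks. This is why the benchmark count shown next to each entry matters when reading the lists below.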

Python Coding

Focuses on generating, completing, and debugging Python code.

#1 Kimi K2 0905 (Moonshot AI): 95% (1 benchmark)
#2 Claude 3.5 Sonnet (Anthropic): 94% (1 benchmark)
#3 GPT-5 (OpenAI): 93% (1 benchmark)
#4 Kimi K2 Instruct (Moonshot AI): 93% (1 benchmark)
#5 Phi 4 Reasoning (Microsoft): 93% (1 benchmark)

Web Development

Evaluates coding in JavaScript/TypeScript for web frameworks like Next.js.

#1 GPT-5 (OpenAI): 88% (1 benchmark)
#2 Gemini 2.5 Pro Preview 06-05 (Google): 82% (1 benchmark)
#3 o3 (OpenAI): 81% (1 benchmark)
#4 Gemini 2.5 Pro (Google): 77% (1 benchmark)
#5 Qwen2.5 32B Instruct (Alibaba Cloud / Qwen Team): 75% (1 benchmark)

Terminal/Command Line Tasks

Tests command-line operations, scripting, and system interactions.

#1 Qwen2.5 VL 7B Instruct (Alibaba Cloud / Qwen Team): 60% (1 benchmark)
#2 Claude Opus 4.1 (Anthropic): 43% (1 benchmark)
#3 Claude Opus 4 (Anthropic): 39% (1 benchmark)
#4 Qwen2.5 VL 32B Instruct (Alibaba Cloud / Qwen Team): 38% (2 benchmarks)
#5 Qwen2.5 VL 72B Instruct (Alibaba Cloud / Qwen Team): 38% (2 benchmarks)

Agentic Coding

Assesses autonomous agents for code editing, issue resolution, and tool-using workflows.

#1 GPT-5 (OpenAI): 81% (2 benchmarks)
#2 Gemini 2.5 Pro Preview 06-05 (Google): 75% (2 benchmarks)
#3 o3 (OpenAI): 75% (2 benchmarks)
#4 GPT-5 Codex (OpenAI): 75% (1 benchmark)
#5 Gemini 2.5 Pro (Google): 70% (2 benchmarks)

Multilingual Coding

Covers code generation across multiple programming languages.

#1 GPT-5 (OpenAI): 88% (1 benchmark)
#2 Gemini 2.5 Pro Preview 06-05 (Google): 82% (1 benchmark)
#3 o3 (OpenAI): 81% (1 benchmark)
#4 Gemini 2.5 Pro (Google): 75% (2 benchmarks)
#5 Qwen2.5 32B Instruct (Alibaba Cloud / Qwen Team): 75% (1 benchmark)

Coding Contests

Simulates competitive programming problems from platforms like LeetCode or CodeForces.

#1 GPT OSS 120B (OpenAI): 87% (1 benchmark)
#2 GPT OSS 20B (OpenAI): 84% (1 benchmark)
#3 Grok-3 Mini (xAI): 80% (1 benchmark)
#4 Grok-4 Heavy (xAI): 79% (1 benchmark)
#5 Grok-3 (xAI): 79% (1 benchmark)

Repository-Level Coding

Involves understanding and modifying code in full repositories.

#1 Phi-3.5-MoE-instruct (Microsoft): 85% (1 benchmark)
#2 Phi-3.5-mini-instruct (Microsoft): 77% (1 benchmark)
#3 GPT-5 (OpenAI): 75% (1 benchmark)
#4 GPT-5 Codex (OpenAI): 75% (1 benchmark)
#5 Claude Opus 4.1 (Anthropic): 75% (1 benchmark)

Function/Tool Calling

Evaluates API usage, function invocation, and tool integration in code.

#1 Llama 3.1 405B Instruct (Meta): 72% (3 benchmarks)
#2 Qwen3 235B A22B (Alibaba Cloud / Qwen Team): 71% (1 benchmark)
#3 Qwen3 32B (Alibaba Cloud / Qwen Team): 70% (1 benchmark)
#4 Qwen3 30B A3B (Alibaba Cloud / Qwen Team): 69% (1 benchmark)
#5 Llama 3.1 70B Instruct (Meta): 68% (3 benchmarks)

Math Reasoning for Coding

Tests mathematical problem-solving, which underpins algorithmic thinking in coding.

#1 Kimi K2 Instruct (Moonshot AI): 97% (1 benchmark)
#2 GPT-4.5 (OpenAI): 97% (1 benchmark)
#3 o3-mini (OpenAI): 95% (2 benchmarks)
#4 o1 (OpenAI): 94% (3 benchmarks)
#5 Mistral Large 2 (Mistral AI): 93% (1 benchmark)