The AI Forger - Best LLMs for Building & Coding

Top Models by SWE-bench Verified


Model	Organization	Release Date	SWE-bench Verified
GPT-5	OpenAI	Aug 7, 2025	74.9%	88.0%	93.4%	-	-
Claude Opus 4.1	Anthropic	Aug 5, 2025	74.5%	-	-	-	-
GPT-5 Codex	OpenAI	Sep 15, 2025	74.5%	-	-	-	-
Claude Sonnet 4	Anthropic	May 22, 2025	72.7%	-	-	-	-
Claude Opus 4	Anthropic	May 22, 2025	72.5%	-	-	-	-
Claude 3.7 Sonnet	Anthropic	Feb 24, 2025	70.3%	-	-	-	-
o3	OpenAI	Apr 16, 2025	69.1%	81.3%	-	-	-
o4-mini	OpenAI	Apr 16, 2025	68.1%	68.9%	-	-	-
Gemini 2.5 Pro Preview 06-05	Google	Jun 5, 2025	67.2%	82.2%	-	69.0%	-
DeepSeek-V3.1	DeepSeek	Jan 10, 2025	66.0%	68.4%	-	56.4%	-

Showing 1 to 10 of 160 models

...

Total Models: 160 (AI models tracked)
Organizations (companies & labs)
Providers (API providers)
Benchmarks: 344 (evaluation metrics)

Coding Categories Performance

Model performance across different coding domains and specializations

Note: These rankings reflect performance on available benchmarks for each model. Rankings do not necessarily indicate absolute superiority in a category, as most models have not been evaluated on all benchmarks.
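
The page does not state how a category score is computed. As a rough sketch only, assuming a category score is the plain average of whichever benchmarks a model has results for, the following shows how a model evaluated on a single benchmark can outrank one evaluated on several. The model names, benchmark names, and numbers are hypothetical and not taken from the leaderboard.

```python
from statistics import mean

# Hypothetical per-benchmark scores in percent; None means "not evaluated".
# Names and values are made up for illustration, not leaderboard data.
scores = {
    "model_a": {"bench_1": 90.0, "bench_2": None, "bench_3": None},
    "model_b": {"bench_1": 80.0, "bench_2": 70.0, "bench_3": 75.0},
}


def category_score(results):
    """Average only the benchmarks the model was evaluated on (assumed method)."""
    available = [v for v in results.values() if v is not None]
    if not available:
        return None, 0
    return mean(available), len(available)


def rank(all_scores):
    """Sort models by category score, highest first, keeping the benchmark count."""
    rows = []
    for model, results in all_scores.items():
        score, count = category_score(results)
        if score is not None:
            rows.append((model, score, count))
    return sorted(rows, key=lambda row: row[1], reverse=True)


for model, score, count in rank(scores):
    label = "benchmark" if count == 1 else "benchmarks"
    print(f"{model}: {score:.0f}% ({count} {label})")
```

Under this assumption, model_a ranks first on the strength of one result, which is why the benchmark count listed beside each score matters when comparing models.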

Python Coding

Focuses on generating, completing, and debugging Python code.

Kimi K2 0905	Moonshot AI	95%	1 benchmark
Claude 3.5 Sonnet	Anthropic	94%	1 benchmark
GPT-5	OpenAI	93%	1 benchmark
Kimi K2 Instruct	Moonshot AI	93%	1 benchmark
Phi 4 Reasoning	Microsoft	93%	1 benchmark

Web Development

Evaluates coding in JavaScript/TypeScript for web frameworks like Next.js.

GPT-5	OpenAI	88%	1 benchmark
Gemini 2.5 Pro Preview 06-05	Google	82%	1 benchmark
o3	OpenAI	81%	1 benchmark
Gemini 2.5 Pro	Google	77%	1 benchmark
Qwen2.5 32B Instruct	Alibaba Cloud / Qwen Team	75%	1 benchmark

Terminal/Command Line Tasks

Tests command-line operations, scripting, and system interactions.

Qwen2.5 VL 7B Instruct	Alibaba Cloud / Qwen Team	60%	1 benchmark
Claude Opus 4.1	Anthropic	43%	1 benchmark
Claude Opus 4	Anthropic	39%	1 benchmark
Qwen2.5 VL 32B Instruct	Alibaba Cloud / Qwen Team	38%	2 benchmarks
Qwen2.5 VL 72B Instruct	Alibaba Cloud / Qwen Team	38%	2 benchmarks

Agentic Coding

Assesses autonomous agents for code editing, issue resolution, and tool-using workflows.

GPT-5	OpenAI	81%	2 benchmarks
Gemini 2.5 Pro Preview 06-05	Google	75%	2 benchmarks
o3	OpenAI	75%	2 benchmarks
GPT-5 Codex	OpenAI	75%	1 benchmark
Gemini 2.5 Pro	Google	70%	2 benchmarks

Multilingual Coding

Covers code generation across multiple programming languages.

GPT-5	OpenAI	88%	1 benchmark
Gemini 2.5 Pro Preview 06-05	Google	82%	1 benchmark
o3	OpenAI	81%	1 benchmark
Gemini 2.5 Pro	Google	75%	2 benchmarks
Qwen2.5 32B Instruct	Alibaba Cloud / Qwen Team	75%	1 benchmark

Coding Contests

Simulates competitive programming problems from platforms like LeetCode or Codeforces.

GPT OSS 120B	OpenAI	87%	1 benchmark
GPT OSS 20B	OpenAI	84%	1 benchmark
Grok-3 Mini	xAI	80%	1 benchmark
Grok-4 Heavy	xAI	79%	1 benchmark
Grok-3	xAI	79%	1 benchmark

Repository-Level Coding

Involves understanding and modifying code in full repositories.

Phi-3.5-MoE-instruct	Microsoft	85%	1 benchmark
Phi-3.5-mini-instruct	Microsoft	77%	1 benchmark
GPT-5	OpenAI	75%	1 benchmark
GPT-5 Codex	OpenAI	75%	1 benchmark
Claude Opus 4.1	Anthropic	75%	1 benchmark

Function/Tool Calling

Evaluates API usage, function invocation, and tool integration in code.

Llama 3.1 405B Instruct	Meta	72%	3 benchmarks
Qwen3 235B A22B	Alibaba Cloud / Qwen Team	71%	1 benchmark
Qwen3 32B	Alibaba Cloud / Qwen Team	70%	1 benchmark
Qwen3 30B A3B	Alibaba Cloud / Qwen Team	69%	1 benchmark
Llama 3.1 70B Instruct	Meta	68%	3 benchmarks

Math Reasoning for Coding

Tests mathematical problem-solving, which underpins algorithmic thinking in coding.

Kimi K2 Instruct	Moonshot AI	97%	1 benchmark
GPT-4.5	OpenAI	97%	1 benchmark
o3-mini	OpenAI	95%	2 benchmarks
o1	OpenAI	94%	3 benchmarks
Mistral Large 2	Mistral AI	93%	1 benchmark