Top Models By SWE-bench Verified
| Date | SWE-bench Verified | | | | |
|------|--------------------|---|---|---|---|
| Aug 7, 2025 | 74.9% | 88.0% | 93.4% | - | - |
| Aug 5, 2025 | 74.5% | - | - | - | - |
| Sep 15, 2025 | 74.5% | - | - | - | - |
| May 22, 2025 | 72.7% | - | - | - | - |
| May 22, 2025 | 72.5% | - | - | - | - |
| Feb 24, 2025 | 70.3% | - | - | - | - |
| Apr 16, 2025 | 69.1% | 81.3% | - | - | - |
| Apr 16, 2025 | 68.1% | 68.9% | - | - | - |
| Jun 5, 2025 | 67.2% | 82.2% | - | 69.0% | - |
| Jan 10, 2025 | 66.0% | 68.4% | - | 56.4% | - |
Coding Categories Performance
Model performance across different coding domains and specializations
Note: These rankings reflect performance on available benchmarks for each model. Rankings do not necessarily indicate absolute superiority in a category, as most models have not been evaluated on all benchmarks.
- Python: Focuses on generating, completing, and debugging Python code.
- JavaScript/TypeScript: Evaluates coding in JavaScript/TypeScript for web frameworks such as Next.js.
- Command line and scripting: Tests command-line operations, scripting, and system interactions.
- Agentic coding: Assesses autonomous agents on code editing, issue resolution, and tool-using workflows.
- Multilingual code generation: Covers code generation across multiple programming languages.
- Competitive programming: Simulates competitive programming problems from platforms such as LeetCode and Codeforces.
- Repository-level coding: Involves understanding and modifying code in full repositories.
- API and tool use: Evaluates API usage, function invocation, and tool integration in code.
- Mathematical reasoning: Tests mathematical problem-solving, which underpins algorithmic thinking in coding.