Terminal-bench

general

text

About

Terminal-bench benchmark

Evaluation Stats

Total Models: 5

Organizations: 2

Verified Results: 0

Self-Reported: 5

Benchmark Details

Max Score: 1

Language

en

Performance Overview

Score distribution and top performers

Score Distribution

5 models

Top Score

39.2%

Average Score

30.9%

High Performers (80%+)

0

Top Organizations

#1 Anthropic

4 models

31.1%

#2 Moonshot AI

1 model

30.0%

Leaderboard

Top 5 models ranked by performance

1

39.2%

Raw: 0.392

Self-reported

2

Claude Sonnet 4

35.5%

Raw: 0.355

Self-reported

3

Claude 3.7 Sonnet

35.2%

Raw: 0.352

Self-reported

4

Kimi K2 Instruct

30.0%

Raw: 0.300

Self-reported

5

Claude Opus 4.1

14.7%

Raw: 0.147

Self-reported
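
The summary figures in the Performance Overview (Top Score, Average Score, High Performers) appear to follow from the raw leaderboard values. The sketch below recomputes them under two assumptions: the average is a simple arithmetic mean, and percentages are rounded to one decimal place. The names `summarize` and `high_threshold` are hypothetical and chosen only for illustration. The rank 1 entry has no model name in the listing, so it is keyed by rank.

```python
from statistics import mean

# Raw scores copied from the leaderboard (fractions of Max Score 1).
# The rank 1 entry has no model name in the listing, so it is keyed by rank only.
raw_scores = {
    "rank 1 (name not shown)": 0.392,
    "Claude Sonnet 4": 0.355,
    "Claude 3.7 Sonnet": 0.352,
    "Kimi K2 Instruct": 0.300,
    "Claude Opus 4.1": 0.147,
}


def summarize(scores, high_threshold=0.80):
    """Recompute the overview statistics from raw scores.

    Assumes a simple mean and one-decimal rounding; the page's actual
    method is not documented in the listing.
    """
    values = list(scores.values())
    return {
        "total_models": len(values),
        "top_score_pct": round(max(values) * 100, 1),
        "average_score_pct": round(mean(values) * 100, 1),
        # "High Performers (80%+)" counts scores at or above the threshold.
        "high_performers": sum(v >= high_threshold for v in values),
    }


summary = summarize(raw_scores)
print(summary)
```

Under these assumptions the recomputed values match the listed Top Score (39.2%), Average Score (30.9%), and High Performers count (0).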