HumanEval

code

text

About

HumanEval benchmark

Evaluation Stats

Total Models: 62

Organizations: 12

Verified Results: 0

Self-Reported: 61

Benchmark Details

Max Score: 1

Language: en

Performance Overview

Score distribution and top performers

Score Distribution

62 models

Top Score: 93.7%

Average Score: 80.4%

High Performers (80%+): 41

Top Organizations

#1 Moonshot AI: 1 model, 93.3%

#2 DeepSeek: 1 model, 89.0%

#3 IBM: 3 models, 87.3%

#4 Alibaba: 10 models, 86.1%

#5 Amazon: 3 models, 85.2%

Leaderboard

Top 20 models ranked by performance

1. Claude 3.5 Sonnet: 93.7% (Raw: 0.937), Self-reported

2. 93.4% (Raw: 0.934), Self-reported

3. Kimi K2 Instruct: 93.3% (Raw: 0.933), Self-reported

4. Qwen2.5-Coder 32B Instruct: 92.7% (Raw: 0.927), Self-reported

5. 92.4% (Raw: 0.924), Self-reported

6. Claude 3.5 Sonnet: 92.0% (Raw: 0.92), Self-reported

7. Mistral Large 2: 92.0% (Raw: 0.92), Self-reported

8. Qwen2.5 VL 32B Instruct: 91.5% (Raw: 0.915), Self-reported

9. 90.2% (Raw: 0.902), Self-reported

10. Granite 3.3 8B Instruct, by IBM: 89.7% (Raw: 0.8973), Self-reported

11. Granite 3.3 8B Base, by IBM: 89.7% (Raw: 0.8973), Self-reported

12. Gemini Diffusion: 89.6% (Raw: 0.896), Self-reported

13. Llama 3.1 405B Instruct: 89.0% (Raw: 0.89), Self-reported

14. 89.0% (Raw: 0.89), Self-reported

15. 89.0% (Raw: 0.89), Self-reported

16. Mistral Small 3.1 24B Instruct: 88.4% (Raw: 0.8841), Self-reported

17. Llama 3.3 70B Instruct: 88.4% (Raw: 0.884), Self-reported

18. by xAI: 88.4% (Raw: 0.884), Self-reported

19. Qwen2.5-Coder 7B Instruct: 88.4% (Raw: 0.884), Self-reported

20. Qwen2.5 32B Instruct: 88.4% (Raw: 0.884), Self-reported

Showing top 20 of 62 models