MATH-500

math

text

About

MATH-500 benchmark

Evaluation Stats

Total Models: 22

Organizations: 8

Verified Results: 0

Self-Reported: 22

Benchmark Details

Max Score: 1

Language: en

Performance Overview

Score distribution and top performers

Score Distribution

22 models

Top Score: 97.4%

Average Score: 91.3%

High Performers (80%+): 20

Top Organizations

#1 Moonshot AI | 2 models | 96.8%
#2 NVIDIA | 3 models | 96.3%
#3 Anthropic | 1 model | 96.2%
#4 Microsoft | 1 model | 94.6%
#5 DeepSeek | 10 models | 92.6%

Leaderboard

Top 20 models ranked by performance

Rank | Model | Score | Raw | Source
1 | Kimi K2 Instruct | 97.4% | 0.974 | Self-reported
2 |  | 97.3% | 0.973 | Self-reported
3 | Llama 3.1 Nemotron Ultra 253B v1 | 97.0% | 0.97 | Self-reported
4 | Llama-3.3 Nemotron Super 49B v1 | 96.6% | 0.966 | Self-reported
5 | Claude 3.7 Sonnet | 96.2% | 0.962 | Self-reported
6 |  | 96.2% | 0.962 | Self-reported
7 | DeepSeek R1 Zero | 95.9% | 0.959 | Self-reported
8 | Llama 3.1 Nemotron Nano 8B V1 | 95.4% | 0.954 | Self-reported
9 | Phi 4 Mini Reasoning | 94.6% | 0.946 | Self-reported
10 | DeepSeek R1 Distill Llama 70B | 94.5% | 0.945 | Self-reported
11 | DeepSeek R1 Distill Qwen 32B | 94.3% | 0.943 | Self-reported
12 | DeepSeek-V3 0324 | 94.0% | 0.94 | Self-reported
13 | DeepSeek R1 Distill Qwen 14B | 93.9% | 0.939 | Self-reported
14 | DeepSeek R1 Distill Qwen 7B | 92.8% | 0.928 | Self-reported
15 |  | 90.6% | 0.906 | Self-reported
16 | QwQ-32B-Preview | 90.6% | 0.906 | Self-reported
17 |  | 90.2% | 0.902 | Self-reported
18 |  | 90.0% | 0.9 | Self-reported
19 | DeepSeek R1 Distill Llama 8B | 89.1% | 0.891 | Self-reported
20 | DeepSeek R1 Distill Qwen 1.5B | 83.9% | 0.839 | Self-reported

Showing top 20 of 22 models
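
The displayed percentages appear to be the raw scores scaled by the maximum score of 1. The sketch below illustrates that relationship for the 20 listed entries. The names `raw_scores`, `MAX_SCORE`, and `to_percent` are hypothetical helpers introduced here, not part of the leaderboard. Since only 20 of the 22 models are shown, the 91.3% average cannot be reproduced from this list alone.

```python
# Raw scores of the 20 listed entries, copied from the leaderboard above.
raw_scores = [
    0.974, 0.973, 0.97, 0.966, 0.962, 0.962, 0.959, 0.954, 0.946, 0.945,
    0.943, 0.94, 0.939, 0.928, 0.906, 0.906, 0.902, 0.9, 0.891, 0.839,
]

MAX_SCORE = 1  # "Max Score" from Benchmark Details


def to_percent(raw: float, max_score: float = MAX_SCORE) -> str:
    """Format a raw score as a percentage of the maximum score (one decimal)."""
    return f"{100 * raw / max_score:.1f}%"


# Percentages as they would be displayed next to each raw score.
percentages = [to_percent(r) for r in raw_scores]

# Top score among the listed entries.
top_score = to_percent(max(raw_scores))

# Count of listed entries at or above the 80% threshold.
high_performers = sum(1 for r in raw_scores if r / MAX_SCORE >= 0.80)

print(top_score, high_performers)
```

Under these assumptions, every listed entry clears the 80% threshold, which is consistent with the reported count of 20 high performers.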