MT-Bench

roleplay

text

About

MT-Bench benchmark

Evaluation Stats

Total Models: 11

Organizations: 4

Verified Results: 0

Self-Reported: 11

Benchmark Details

Max Score: 100

Language: en

Performance Overview

Score distribution and top performers

Score Distribution: 11 models

Top Score: 93.5%

Average Score: 78.8%

High Performers (80%+): 9

Top Organizations

#1 DeepSeek: 1 model, 90.2%

#2 Alibaba: 3 models, 88.4%

#3 Mistral AI: 4 models, 82.4%

#4 NVIDIA: 3 models, 60.6%

Leaderboard

Top 11 models ranked by performance

Rank  Model                             Score   Raw    Source
1     Qwen2.5 72B Instruct              93.5%   93.5   Self-reported
2     Llama-3.3 Nemotron Super 49B v1   91.7%   91.7   Self-reported
3     (name not shown)                  90.2%   90.2   Self-reported
4     Qwen2.5 7B Instruct               87.5%   87.5   Self-reported
5     Mistral Large 2                   86.3%   86.3   Self-reported
6     Qwen2 7B Instruct                 84.1%   84.1   Self-reported
7     Mistral Small 3 24B Instruct      83.5%   83.5   Self-reported
8     Ministral 8B Instruct             83.0%   83     Self-reported
9     Llama 3.1 Nemotron Nano 8B V1     81.0%   81     Self-reported
10    (name not shown)                  76.8%   76.8   Self-reported
11    Llama 3.1 Nemotron 70B Instruct   9.0%    8.99   Self-reported
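
The summary figures on this page (top score, average score, high-performer count, and per-organization scores) can be reproduced from the raw leaderboard values. The sketch below is illustrative only: `ENTRIES` and `summarize` are hypothetical names, and the organization labels for ranks 3 and 10 are assumptions, since the leaderboard shows no model name for those rows; they were chosen so that the model counts and per-organization figures match the Top Organizations list. It also assumes each organization's figure is a plain mean of its models' raw scores. The rank 11 raw value of 8.99 is taken at face value, which is what the page's own averages appear to do.

```python
from collections import defaultdict

# (rank, organization, raw score) transcribed from the leaderboard.
# Organizations for ranks 3 and 10 are ASSUMED (no model name is shown);
# the other labels follow from the model families named on the page.
ENTRIES = [
    (1, "Alibaba", 93.5),
    (2, "NVIDIA", 91.7),
    (3, "DeepSeek", 90.2),     # assumed organization
    (4, "Alibaba", 87.5),
    (5, "Mistral AI", 86.3),
    (6, "Alibaba", 84.1),
    (7, "Mistral AI", 83.5),
    (8, "Mistral AI", 83.0),
    (9, "NVIDIA", 81.0),
    (10, "Mistral AI", 76.8),  # assumed organization
    (11, "NVIDIA", 8.99),
]


def summarize(entries, high_threshold=80.0):
    """Return top score, mean score, high-performer count, and per-org means."""
    scores = [score for _, _, score in entries]

    # Group raw scores by organization.
    by_org = defaultdict(list)
    for _, org, score in entries:
        by_org[org].append(score)

    return {
        "top": max(scores),
        "average": round(sum(scores) / len(scores), 1),
        "high_performers": sum(score >= high_threshold for score in scores),
        "org_average": {
            org: round(sum(vals) / len(vals), 1) for org, vals in by_org.items()
        },
    }


if __name__ == "__main__":
    summary = summarize(ENTRIES)
    print(summary["top"], summary["average"], summary["high_performers"])
    # Print organizations from highest to lowest mean score.
    for org, avg in sorted(summary["org_average"].items(), key=lambda kv: -kv[1]):
        print(f"{org}: {avg}")
```

Under these assumptions the computed values agree with the page: an average of 78.8, nine models at 80 or above, and organization means of 90.2, 88.4, 82.4, and 60.6. The single low value at rank 11 accounts for most of the gap between the average and the top score.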