All Benchmarks

Explore all 336 benchmarks for evaluating language models across different capabilities and domains.

Total Benchmarks: 336
Verified: 0
Categories: 10
With Sub-benchmarks: 1
GPQA           GPQA benchmark           general  text        en  115  13  88.4%  54.6%
MMLU           MMLU benchmark           general  text        en  78   15  92.5%  79.1%
MATH           MATH benchmark           math     text        en  63   11  97.9%  66.7%
HumanEval      HumanEval benchmark      code     text        en  62   12  93.7%  80.4%
MMLU-Pro       MMLU-Pro benchmark       general  text        en  60   11  85.0%  63.3%
MMMU           MMMU benchmark           vision   multimodal  en  52   11  84.2%  64.1%
GSM8k          GSM8k benchmark          math     text        en  46   15  97.3%  87.8%
LiveCodeBench  LiveCodeBench benchmark  code     text        en  44   8   80.4%  44.8%
AIME 2024      AIME 2024 benchmark      general  text        en  41   10  95.8%  72.8%
IFEval         IFEval benchmark         code     text        en  37   12  93.9%  83.2%
Showing 1 to 10 of 336 benchmarks
...