All Benchmarks

Explore all 336 benchmarks for evaluating language models across different capabilities and domains.

Benchmark	Category	Properties
GPQA	general	text en	115	13	88.4%	54.6%
MMLU	general	text en	78	15	92.5%	79.1%
MATH	math	text en	63	11	97.9%	66.7%
HumanEval	code	text en	62	12	93.7%	80.4%
MMLU-Pro	general	text en	60	11	85.0%	63.3%
MMMU	vision	multimodal en	52	11	84.2%	64.1%
GSM8k	math	text en	46	15	97.3%	87.8%
LiveCodeBench	code	text en	44	8	80.4%	44.8%
AIME 2024	general	text en	41	10	95.8%	72.8%
IFEval	code	text en	37	12	93.9%	83.2%

Showing 1 to 10 of 336 benchmarks

...