Claude Opus 4.1
by Anthropic

Multimodal · Zero-eval
#1 MMMLU · #2 MMMU (validation)

About

Claude Opus 4.1 is a multimodal language model developed by Anthropic. Across the 8 benchmarks tracked here it posts competitive results, with its strongest scores on MMMLU (98.4%), AIME 2025 (80.2%), and MMMU (validation) (64.8%). As a multimodal model, it can process text, images, and other input formats. Released in 2025, it represents Anthropic's latest advancement at the time of this listing.

Timeline
Announced: Aug 5, 2025
Released: Aug 5, 2025

Specifications
Capabilities: Multimodal

License & Family
License: Proprietary
Benchmark Performance Overview
Performance metrics and category breakdown

Overall Performance (8 benchmarks)
Average Score: 48.5%
Best Score: 98.4%
High Performers (80%+): 2

Top Categories
vision: 64.8%
agents: 50.4%
general: 44.6%
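
The overview figures above can be reproduced from the eight self-reported scores listed at the bottom of this page. A minimal Python sketch, assuming simple arithmetic means for the overall average and the per-category figures (the exact aggregation rule the site uses is not documented here):

    # Self-reported scores (%) and categories, copied from the detailed results table below.
    RESULTS = {
        "MMMLU": ("general", 98.4),
        "AIME 2025": ("general", 80.2),
        "MMMU (validation)": ("vision", 64.8),
        "TAU-bench Retail": ("agents", 60.4),
        "TAU-bench Airline": ("agents", 40.3),
        "SWE-Bench Verified": ("general", 24.3),
        "Terminal-bench": ("general", 14.7),
        "GPQA": ("general", 5.3),
    }

    scores = [score for _, score in RESULTS.values()]

    # Overall Performance block: mean of the eight scores is 48.55 (shown as 48.5%),
    # the best score is 98.4%, and two scores reach 80% or more.
    average = sum(scores) / len(scores)
    best = max(scores)
    high_performers = sum(1 for s in scores if s >= 80.0)

    # Top Categories block: per-category means come out to roughly vision 64.80,
    # agents 50.35 and general 44.58, matching the 64.8 / 50.4 / 44.6 shown above
    # once rounded to one decimal place.
    by_category = {}
    for category, score in RESULTS.values():
        by_category.setdefault(category, []).append(score)
    category_means = {cat: sum(vals) / len(vals) for cat, vals in by_category.items()}

    print(f"average={average:.2f}  best={best}  high_performers={high_performers}")
    for cat, mean in sorted(category_means.items(), key=lambda kv: -kv[1]):
        print(f"{cat}: {mean:.2f}")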
Benchmark Performance
Top benchmark scores with normalized values (0-100%)

Ranking Across Benchmarks
Position relative to other models on each benchmark

MMMLU

Rank #1 of 13
#1 Claude Opus 4.1 · 98.4%
#2 Claude Opus 4 · 88.8%
#3 o1 · 87.7%
#4 GPT-4.1 · 87.3%

AIME 2025

Rank #14 of 36
#11 Qwen3 235B A22B · 81.5%
#12 Gemini 2.5 Pro · 83.0%
#13 GPT-5 nano · 85.2%
#14 Claude Opus 4.1 · 80.2%
#15 Phi 4 Reasoning Plus · 78.0%
#16 Claude Opus 4 · 75.5%
#17 Qwen3 32B · 72.9%

MMMU (validation)

Rank #2 of 2
#1 Claude Opus 4 · 76.5%
#2 Claude Opus 4.1 · 64.8%

TAU-bench Retail

Rank #10 of 15
#7 DeepSeek-R1-0528 · 63.9%
#8 GPT-4.1 · 68.0%
#9 GPT-4.5 · 68.4%
#10 Claude Opus 4.1 · 60.4%
#11 GPT-4o · 60.3%
#12 o3-mini · 57.6%
#13 GPT-4.1 mini · 55.8%

TAU-bench Airline

Rank #11 of 15
#8 GPT-4o · 42.8%
#9 Claude 3.5 Sonnet · 46.0%
#10 o4-mini · 49.2%
#11 Claude Opus 4.1 · 40.3%
#12 GPT-4.1 mini · 36.0%
#13 o3-mini · 32.4%
#14 Claude 3.5 Haiku · 22.8%
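
The rank positions in these lists compare Claude Opus 4.1's self-reported score against other models' published results on the same benchmark. A minimal Python sketch of one plausible ranking rule, sorting entries by score in descending order and illustrated with the MMMLU entries shown above; the site's actual inclusion and tie-breaking rules are an assumption here:

    # The four MMMLU entries displayed above (only the top of the 13-model list).
    MMMLU_ENTRIES = [
        ("Claude Opus 4", 88.8),
        ("o1", 87.7),
        ("GPT-4.1", 87.3),
        ("Claude Opus 4.1", 98.4),
    ]

    def rank_of(model, entries):
        """1-based position of `model` after sorting entries by score, highest first."""
        ordered = sorted(entries, key=lambda entry: entry[1], reverse=True)
        return 1 + [name for name, _ in ordered].index(model)

    print(rank_of("Claude Opus 4.1", MMMLU_ENTRIES))  # 1, consistent with "Rank #1 of 13"

Note that some lists above (e.g. AIME 2025) display neighboring entries whose ranks do not strictly follow descending score order, so the site's ranking may rely on additional criteria not shown on this page.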
All Benchmark Results for Claude Opus 4.1
Complete list of benchmark scores with detailed information

Benchmark            Category   Modality     Score   Normalized   Source
MMMLU                general    text         0.98    98.4%        Self-reported
AIME 2025            general    text         0.80    80.2%        Self-reported
MMMU (validation)    vision     multimodal   0.65    64.8%        Self-reported
TAU-bench Retail     agents     text         0.60    60.4%        Self-reported
TAU-bench Airline    agents     text         0.40    40.3%        Self-reported
SWE-Bench Verified   general    text         0.24    24.3%        Self-reported
Terminal-bench       general    text         0.15    14.7%        Self-reported
GPQA                 general    text         0.05    5.3%         Self-reported