Claude Opus 4.1
by Anthropic

Multimodal · Zero-eval
#1 MMMLU · #2 MMMU (validation)

About

Claude Opus 4.1 is a multimodal language model developed by Anthropic. Across the 8 benchmarks tracked here it posts competitive results, with its strongest scores on MMMLU (98.4%), AIME 2025 (80.2%), and MMMU (validation) (64.8%). As a multimodal model, it can process text, images, and other input formats. Released in 2025, it represents Anthropic's latest advancement at the time of this listing.

Timeline
Announced: Aug 5, 2025
Released: Aug 5, 2025

Specifications
Capabilities: Multimodal

License & Family
License: Proprietary
Benchmark Performance Overview
Performance metrics and category breakdown

Overall Performance (8 benchmarks)
Average Score: 48.5%
Best Score: 98.4%
High Performers (80%+): 2

Top Categories
vision: 64.8%
agents: 50.4%
general: 44.6%
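
The overview figures above can be reproduced from the eight self-reported scores listed at the bottom of this page. A minimal Python sketch, assuming simple arithmetic means for the overall average and the per-category figures (the exact aggregation rule the site uses is not documented here):

    # Self-reported scores (%) and categories, copied from the detailed results table below.
    RESULTS = {
        "MMMLU": ("general", 98.4),
        "AIME 2025": ("general", 80.2),
        "MMMU (validation)": ("vision", 64.8),
        "TAU-bench Retail": ("agents", 60.4),
        "TAU-bench Airline": ("agents", 40.3),
        "SWE-Bench Verified": ("general", 24.3),
        "Terminal-bench": ("general", 14.7),
        "GPQA": ("general", 5.3),
    }

    scores = [score for _, score in RESULTS.values()]

    # Overall Performance block: mean of the eight scores is 48.55 (shown as 48.5%),
    # the best score is 98.4%, and two scores reach 80% or more.
    average = sum(scores) / len(scores)
    best = max(scores)
    high_performers = sum(1 for s in scores if s >= 80.0)

    # Top Categories block: per-category means come out to roughly vision 64.80,
    # agents 50.35 and general 44.58, matching the 64.8 / 50.4 / 44.6 shown above
    # once rounded to one decimal place.
    by_category = {}
    for category, score in RESULTS.values():
        by_category.setdefault(category, []).append(score)
    category_means = {cat: sum(vals) / len(vals) for cat, vals in by_category.items()}

    print(f"average={average:.2f}  best={best}  high_performers={high_performers}")
    for cat, mean in sorted(category_means.items(), key=lambda kv: -kv[1]):
        print(f"{cat}: {mean:.2f}")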
Benchmark Performance
Top benchmark scores with normalized values (0-100%)

Ranking Across Benchmarks
Position relative to other models on each benchmark

MMMLU

Rank #1 of 13
#1 Claude Opus 4.1 · 98.4%
#2 Claude Opus 4 · 88.8%
#3 o1 · 87.7%
#4 GPT-4.1 · 87.3%

AIME 2025

Rank #14 of 36
#11 Qwen3 235B A22B · 81.5%
#12 Gemini 2.5 Pro · 83.0%
#13 GPT-5 nano · 85.2%
#14 Claude Opus 4.1 · 80.2%
#15 Phi 4 Reasoning Plus · 78.0%
#16 Claude Opus 4 · 75.5%
#17 Qwen3 32B · 72.9%

MMMU (validation)

Rank #2 of 2
#1 Claude Opus 4 · 76.5%
#2 Claude Opus 4.1 · 64.8%

TAU-bench Retail

Rank #10 of 15
#7 DeepSeek-R1-0528 · 63.9%
#8 GPT-4.1 · 68.0%
#9 GPT-4.5 · 68.4%
#10 Claude Opus 4.1 · 60.4%
#11 GPT-4o · 60.3%
#12 o3-mini · 57.6%
#13 GPT-4.1 mini · 55.8%

TAU-bench Airline

Rank #11 of 15
#8 GPT-4o · 42.8%
#9 Claude 3.5 Sonnet · 46.0%
#10 o4-mini · 49.2%
#11 Claude Opus 4.1 · 40.3%
#12 GPT-4.1 mini · 36.0%
#13 o3-mini · 32.4%
#14 Claude 3.5 Haiku · 22.8%
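
The rank positions in these lists compare Claude Opus 4.1's self-reported score against other models' published results on the same benchmark. A minimal Python sketch of one plausible ranking rule, sorting entries by score in descending order and illustrated with the MMMLU entries shown above; the site's actual inclusion and tie-breaking rules are an assumption here:

    # The four MMMLU entries displayed above (only the top of the 13-model list).
    MMMLU_ENTRIES = [
        ("Claude Opus 4", 88.8),
        ("o1", 87.7),
        ("GPT-4.1", 87.3),
        ("Claude Opus 4.1", 98.4),
    ]

    def rank_of(model, entries):
        """1-based position of `model` after sorting entries by score, highest first."""
        ordered = sorted(entries, key=lambda entry: entry[1], reverse=True)
        return 1 + [name for name, _ in ordered].index(model)

    print(rank_of("Claude Opus 4.1", MMMLU_ENTRIES))  # 1, consistent with "Rank #1 of 13"

Note that some lists above (e.g. AIME 2025) display neighboring entries whose ranks do not strictly follow descending score order, so the site's ranking may rely on additional criteria not shown on this page.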
All Benchmark Results for Claude Opus 4.1
Complete list of benchmark scores with detailed information

Benchmark            Category   Modality     Score   Normalized   Source
MMMLU                general    text         0.98    98.4%        Self-reported
AIME 2025            general    text         0.80    80.2%        Self-reported
MMMU (validation)    vision     multimodal   0.65    64.8%        Self-reported
TAU-bench Retail     agents     text         0.60    60.4%        Self-reported
TAU-bench Airline    agents     text         0.40    40.3%        Self-reported
SWE-Bench Verified   general    text         0.24    24.3%        Self-reported
Terminal-bench       general    text         0.15    14.7%        Self-reported
GPQA                 general    text         0.05    5.3%         Self-reported