
Claude Opus 4.1
Multimodal
Zero-eval
#1 MMMLU
#2 MMMU (validation)
by Anthropic
About
Claude Opus 4.1 is a multimodal language model developed by Anthropic. Across the 8 benchmarks tracked here, its strongest reported results are MMMLU (98.4%), AIME 2025 (80.2%), and MMMU (validation) (64.8%). As a multimodal model, it accepts both text and image inputs. Released in August 2025, it is the successor to Claude Opus 4.
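Since image understanding is listed as a core capability, below is a minimal sketch of sending an image plus a text prompt to the model through the Anthropic Messages API. The model identifier and the image path are assumptions; substitute the current ID from Anthropic's documentation and your own file.

```python
import base64
import anthropic

# Assumed model ID for Claude Opus 4.1; check Anthropic's docs for the current identifier.
MODEL_ID = "claude-opus-4-1-20250805"

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Load a local image and base64-encode it, as the Messages API expects.
with open("chart.png", "rb") as f:  # hypothetical example image
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model=MODEL_ID,
    max_tokens=512,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_b64,
                    },
                },
                {"type": "text", "text": "Summarize what this chart shows."},
            ],
        }
    ],
)

print(response.content[0].text)
```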
Timeline
Announced: Aug 5, 2025
Released: Aug 5, 2025
Specifications
Capabilities
Multimodal
License & Family
License
Proprietary
Benchmark Performance Overview
Performance metrics and category breakdown
Overall Performance
8 benchmarks
Average Score
48.5%
Best Score
98.4%
High Performers (80%+)
2
Top Categories
vision
64.8%
agents
50.4%
general
44.6%
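The Overall Performance figures above are straightforward arithmetic over the 8 normalized scores listed in the results table at the bottom of this page. A minimal sketch of that calculation in Python (scores and categories copied from that table; the exact rounding the page applies is an assumption):

```python
from collections import defaultdict

# (benchmark, category, normalized score in %), copied from the results table below
results = [
    ("MMMLU", "general", 98.4),
    ("AIME 2025", "general", 80.2),
    ("MMMU (validation)", "vision", 64.8),
    ("TAU-bench Retail", "agents", 60.4),
    ("TAU-bench Airline", "agents", 40.3),
    ("SWE-Bench Verified", "general", 24.3),
    ("Terminal-bench", "general", 14.7),
    ("GPQA", "general", 5.3),
]

scores = [score for _, _, score in results]
print(f"Average score: {sum(scores) / len(scores):.2f}%")         # ~48.55%, shown above as 48.5%
print(f"Best score: {max(scores):.1f}%")                          # 98.4%
print(f"High performers (80%+): {sum(s >= 80 for s in scores)}")  # 2

# Per-category averages: vision, agents, general (matches the Top Categories
# breakdown above, up to rounding of the last decimal)
by_category = defaultdict(list)
for _, category, score in results:
    by_category[category].append(score)
for category, values in sorted(by_category.items()):
    print(f"{category}: {sum(values) / len(values):.2f}%")
```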
Benchmark Performance
Top benchmark scores with normalized values (0-100%)
Ranking Across Benchmarks
Position relative to other models on each benchmark
MMMLU
Rank #1 of 13
#1 Claude Opus 4.1
98.4%
#2 Claude Opus 4
88.8%
#3 o1
87.7%
#4 GPT-4.1
87.3%
AIME 2025
Rank #14 of 36
#11 GPT-5 nano
85.2%
#12 Gemini 2.5 Pro
83.0%
#13 Qwen3 235B A22B
81.5%
#14 Claude Opus 4.1
80.2%
#15 Phi 4 Reasoning Plus
78.0%
#16 Claude Opus 4
75.5%
#17 Qwen3 32B
72.9%
MMMU (validation)
Rank #2 of 2
#1 Claude Opus 4
76.5%
#2 Claude Opus 4.1
64.8%
TAU-bench Retail
Rank #10 of 15
#7 GPT-4.5
68.4%
#8 GPT-4.1
68.0%
#9 DeepSeek-R1-0528
63.9%
#10 Claude Opus 4.1
60.4%
#11 GPT-4o
60.3%
#12 o3-mini
57.6%
#13 GPT-4.1 mini
55.8%
TAU-bench Airline
Rank #11 of 15
#8 o4-mini
49.2%
#9 Claude 3.5 Sonnet
46.0%
#10 GPT-4o
42.8%
#11 Claude Opus 4.1
40.3%
#12 GPT-4.1 mini
36.0%
#13 o3-mini
32.4%
#14 Claude 3.5 Haiku
22.8%
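Each rank above is simply the model's position when every evaluated model is sorted by its score on that benchmark, highest first. A minimal sketch of that lookup (the leaderboard dictionary is a hypothetical excerpt using the MMMLU scores shown above; only the sorting logic matters):

```python
def rank_of(model: str, leaderboard: dict[str, float]) -> int:
    """Return the 1-based rank of `model` when scores are sorted descending."""
    ordered = sorted(leaderboard, key=leaderboard.get, reverse=True)
    return ordered.index(model) + 1

# Hypothetical excerpt of the MMMLU leaderboard.
mmmlu = {
    "Claude Opus 4.1": 98.4,
    "Claude Opus 4": 88.8,
    "o1": 87.7,
    "GPT-4.1": 87.3,
}
print(rank_of("Claude Opus 4.1", mmmlu))  # 1
```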
All Benchmark Results for Claude Opus 4.1
Complete list of benchmark scores with detailed information
Benchmark | Category | Modality | Raw Score | Normalized | Source
MMMLU | general | text | 0.98 | 98.4% | Self-reported
AIME 2025 | general | text | 0.80 | 80.2% | Self-reported
MMMU (validation) | vision | multimodal | 0.65 | 64.8% | Self-reported
TAU-bench Retail | agents | text | 0.60 | 60.4% | Self-reported
TAU-bench Airline | agents | text | 0.40 | 40.3% | Self-reported
SWE-Bench Verified | general | text | 0.24 | 24.3% | Self-reported
Terminal-bench | general | text | 0.15 | 14.7% | Self-reported
GPQA | general | text | 0.05 | 5.3% | Self-reported
Resources