TAU-bench Airline
agents
text
About
TAU-bench Airline benchmark
Evaluation Stats
Total Models: 15
Organizations: 3
Verified Results: 0
Self-Reported: 15
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers
Score Distribution: 15 models
Top Score: 60.0%
Average Score: 44.3%
High Performers (80%+): 0
Top Organizations
#1 DeepSeek: 1 model, 53.5%
#2 Anthropic: 6 models, 47.9%
#3 OpenAI: 8 models, 40.5%
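As a rough consistency check, the overall average can be recovered from the per-organization figures by weighting each organization's average by its model count. The sketch below uses only the numbers listed above; the variable and function names are illustrative, and because the per-organization averages are already rounded to one decimal place, the result is approximate.

```python
# Per-organization (model count, average score in %) as listed above.
ORG_STATS = {
    "DeepSeek": (1, 53.5),
    "Anthropic": (6, 47.9),
    "OpenAI": (8, 40.5),
}


def weighted_average(stats):
    """Average of organization-level scores, weighted by number of models."""
    total_models = sum(count for count, _ in stats.values())
    weighted_sum = sum(count * avg for count, avg in stats.values())
    return weighted_sum / total_models


overall = weighted_average(ORG_STATS)
print(f"{overall:.1f}%")  # prints 44.3%
```

The result agrees with the reported Average Score of 44.3%, and the model counts sum to the 15 models in the evaluation.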
Leaderboard
Top 15 models ranked by performance
60.0% (Raw: 0.6, Self-reported)
59.6% (Raw: 0.596, Self-reported)
58.4% (Raw: 0.584, Self-reported)
53.5% (Raw: 0.535, Self-reported)
46.0% (Raw: 0.46, Self-reported)
40.3% (Raw: 0.403, Self-reported)
#12 36.0% (Raw: 0.36, Self-reported)
22.8% (Raw: 0.228, Self-reported)
#15 14.0% (Raw: 0.14, Self-reported)
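Each entry pairs a percentage with a raw score on the benchmark's 0-to-1 scale (Max Score: 1). A minimal sketch of that conversion, assuming the percentage is simply the raw score divided by the maximum score and displayed to one decimal place; the function name `to_percent` is hypothetical:

```python
MAX_SCORE = 1.0  # "Max Score" from Benchmark Details


def to_percent(raw, max_score=MAX_SCORE):
    """Format a raw score as a percentage string with one decimal place."""
    return f"{raw / max_score * 100:.1f}%"


# Raw scores for the entries listed in the leaderboard above.
raw_scores = [0.6, 0.596, 0.584, 0.535, 0.46, 0.403, 0.36, 0.228, 0.14]
for raw in raw_scores:
    print(f"{to_percent(raw)} (Raw: {raw})")
```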