TAU-bench Airline

Tags: agents, text
About

TAU-bench Airline benchmark

Evaluation Stats
Total Models: 15
Organizations: 3
Verified Results: 0
Self-Reported: 15
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers

Score Distribution

Models: 15
Top Score: 60.0%
Average Score: 44.3%
High Performers (80%+): 0

Top Organizations

#1 DeepSeek, 1 model, 53.5%
#2 Anthropic, 6 models, 47.9%
#3 OpenAI, 8 models, 40.5%
Leaderboard
Top 15 models ranked by performance
1. 60.0% (raw: 0.6), self-reported
2. 59.6% (raw: 0.596), self-reported
3. 58.4% (raw: 0.584), self-reported
4. 53.5% (raw: 0.535), self-reported
5. 50.0% (raw: 0.5), self-reported
6. 50.0% (raw: 0.5), self-reported
7. 49.4% (raw: 0.494), self-reported
8. 49.2% (raw: 0.492), self-reported
9. 46.0% (raw: 0.46), self-reported
10. 42.8% (raw: 0.428), self-reported
11. 40.3% (raw: 0.403), self-reported
12. 36.0% (raw: 0.36), self-reported
13. 32.4% (raw: 0.324), self-reported
14. 22.8% (raw: 0.228), self-reported
15. 14.0% (raw: 0.14), self-reported
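
The summary figures in the Performance Overview (top score, average score, high performers) can be cross-checked against the raw leaderboard values. The page does not say how the average is computed, so the sketch below assumes a plain arithmetic mean over the 15 raw scores and treats "80%+" as a raw score of at least 0.80. The names `raw_scores`, `summarize`, and `HIGH_PERFORMER_THRESHOLD` are illustrative and not part of any published tooling.

```python
from statistics import mean

# Raw scores copied from the 15 leaderboard rows (fractions of the max score, 1).
raw_scores = [
    0.6, 0.596, 0.584, 0.535, 0.5,
    0.5, 0.494, 0.492, 0.46, 0.428,
    0.403, 0.36, 0.324, 0.228, 0.14,
]

# Assumption: "High Performers (80%+)" means a raw score >= 0.80.
HIGH_PERFORMER_THRESHOLD = 0.80


def summarize(scores, threshold=HIGH_PERFORMER_THRESHOLD):
    """Return overview statistics as percentages rounded to one decimal."""
    return {
        "models": len(scores),
        "top_pct": round(max(scores) * 100, 1),
        # Assumption: unweighted arithmetic mean over all listed models.
        "average_pct": round(mean(scores) * 100, 1),
        "high_performers": sum(1 for s in scores if s >= threshold),
    }


summary = summarize(raw_scores)
print(summary)
```

Under these assumptions the result agrees with the reported overview: 15 models, a top score of 60.0%, an average of 44.3%, and no model at or above the 80% mark.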