GPT-4
by OpenAI

Multimodal · Zero-eval
Ranked #1 on AI2 Reasoning Challenge (ARC), Uniform Bar Exam, SAT Math, and 3 more benchmarks
About

GPT-4 is a multimodal language model developed by OpenAI. It achieves strong overall performance, with an average score of 77.7% across 12 benchmarks, and excels in particular on the AI2 Reasoning Challenge (ARC) (96.3%), HellaSwag (95.3%), and the Uniform Bar Exam (90.0%). It is strongest on reasoning tasks, averaging 93.0% in that category. The model is available through 2 API providers. As a multimodal model, it can process both text and image inputs.

Pricing Range
Input (per 1M tokens): $30.00
Output (per 1M tokens): $60.00
Providers: 2
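
As a quick illustration of how these per-token rates translate into per-request cost, here is a minimal sketch; the token counts below are hypothetical, and actual usage should be taken from the provider's metering.

```python
# Rates listed above: $30.00 per 1M input tokens, $60.00 per 1M output tokens.
INPUT_PRICE_PER_M_TOKENS = 30.00
OUTPUT_PRICE_PER_M_TOKENS = 60.00

def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request at the listed rates."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M_TOKENS + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M_TOKENS

# Hypothetical example: a 2,000-token prompt with a 500-token completion.
print(f"${request_cost_usd(2_000, 500):.4f}")  # -> $0.0900
```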
Timeline
Announced: Jun 13, 2023
Released: Jun 13, 2023
Knowledge Cutoff: Dec 31, 2022
Specifications
Capabilities: Multimodal
License & Family
License: Proprietary
Benchmark Performance Overview
Performance metrics and category breakdown

Overall Performance (12 benchmarks)

Average Score: 77.7%
Best Score: 96.3%
High Performers (80%+): 8

Performance Metrics

Max Context Window: 65.5K tokens
Avg Throughput: 102.0 tok/s
Avg Latency: 0 ms
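
For a rough sense of what the listed average throughput means in practice, the sketch below converts it into an estimated decode time; it assumes, hypothetically, a constant token rate and ignores queueing and time-to-first-token, which real deployments do not.

```python
AVG_THROUGHPUT_TOK_PER_S = 102.0  # average throughput listed above

def estimated_decode_seconds(output_tokens: int) -> float:
    """Rough wall-clock estimate for generating output_tokens at a constant rate."""
    return output_tokens / AVG_THROUGHPUT_TOK_PER_S

# Hypothetical example: a 500-token completion takes roughly 4.9 seconds.
print(f"{estimated_decode_seconds(500):.1f} s")
```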

Top Categories

reasoning: 93.0%
general: 76.2%
math: 68.5%
code: 67.0%
Benchmark Performance
Top benchmark scores with normalized values (0-100%)

Ranking Across Benchmarks
Position relative to other models on each benchmark

AI2 Reasoning Challenge (ARC)
Rank #1 of 1
#1 GPT-4: 96.3%

HellaSwag
Rank #2 of 24
#1 Claude 3 Opus: 95.4%
#2 GPT-4: 95.3%
#3 Gemini 1.5 Pro: 93.3%
#4 Claude 3 Sonnet: 89.0%
#5 Command R+: 88.6%

Uniform Bar Exam
Rank #1 of 1
#1 GPT-4: 90.0%

SAT Math
Rank #1 of 1
#1 GPT-4: 89.0%

LSAT
Rank #1 of 1
#1 GPT-4: 88.0%
All Benchmark Results for GPT-4
Benchmark scores with detailed information

Benchmark                       Category   Modality  Raw Score  Normalized  Source
AI2 Reasoning Challenge (ARC)   reasoning  text      0.96       96.3%       Self-reported
HellaSwag                       reasoning  text      0.95       95.3%       Self-reported
Uniform Bar Exam                general    text      0.90       90.0%       Self-reported
SAT Math                        math       text      0.89       89.0%       Self-reported
LSAT                            general    text      0.88       88.0%       Self-reported
Winogrande                      reasoning  text      0.88       87.5%       Self-reported
MMLU                            general    text      0.86       86.4%       Self-reported
DROP                            general    text      0.81       80.9%       Self-reported
MGSM                            math       text      0.74       74.5%       Self-reported
HumanEval                       code       text      0.67       67.0%       Self-reported
Showing 10 of 12 benchmarks (2 not listed).
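
As a cross-check on how the category averages in the summary are computed, here is a minimal sketch over the 10 rows listed above. Since 2 of the 12 benchmarks are not shown, only categories fully covered here (reasoning, and code via its single benchmark) reproduce the summary numbers exactly.

```python
from collections import defaultdict

# The 10 benchmark rows listed above: (name, category, normalized score %).
SCORES = [
    ("AI2 Reasoning Challenge (ARC)", "reasoning", 96.3),
    ("HellaSwag",                     "reasoning", 95.3),
    ("Uniform Bar Exam",              "general",   90.0),
    ("SAT Math",                      "math",      89.0),
    ("LSAT",                          "general",   88.0),
    ("Winogrande",                    "reasoning", 87.5),
    ("MMLU",                          "general",   86.4),
    ("DROP",                          "general",   80.9),
    ("MGSM",                          "math",      74.5),
    ("HumanEval",                     "code",      67.0),
]

by_category = defaultdict(list)
for _name, category, score in SCORES:
    by_category[category].append(score)

for category, scores in by_category.items():
    print(f"{category}: {sum(scores) / len(scores):.1f}%")
# reasoning prints 93.0% and code prints 67.0%, matching the summary;
# general and math differ from the summary because 2 of 12 rows are not listed.
```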