GPT-4

Name: GPT-4
Price: 30 USD per 1M input tokens
Average score: 77.7% (12 benchmarks)
Author: OpenAI

Multimodal

Zero-eval

#1 AI2 Reasoning Challenge (ARC)

#1 Uniform Bar Exam

#1 SAT Math

About

GPT-4 is a multimodal language model developed by OpenAI. It averages 77.7% across 12 benchmarks, with its highest scores on AI2 Reasoning Challenge (ARC) (96.3%), HellaSwag (95.3%), and Uniform Bar Exam (90.0%). Reasoning is its strongest category, with an average of 93.0%. The model is available through 2 API providers. As a multimodal model, it accepts both text and image input.

Pricing Range

Input (per 1M tokens): $30.00 - $30.00

Output (per 1M tokens): $60.00 - $60.00
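As a rough illustration of what these rates mean per request, the sketch below multiplies token counts by the listed prices. The helper name `estimate_cost_usd` and the example token counts are hypothetical, and the calculation assumes "per 1M" refers to tokens and that no other fees or discounts apply.

```python
# Prices copied from the listing above (USD per 1M tokens).
INPUT_USD_PER_1M = 30.00
OUTPUT_USD_PER_1M = 60.00


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Hypothetical helper: estimated cost of one request in USD."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * INPUT_USD_PER_1M + output_tokens * OUTPUT_USD_PER_1M
    return total / 1_000_000


if __name__ == "__main__":
    # Example: a 2,000-token prompt with a 500-token completion.
    print(f"${estimate_cost_usd(2_000, 500):.4f}")
```

Because output tokens cost twice as much as input tokens here, long completions tend to dominate the bill more than long prompts do.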

Providers: 2

Timeline

Announced: Jun 13, 2023

Released: Jun 13, 2023

Knowledge Cutoff: Dec 31, 2022

Specifications

Capabilities

Multimodal

License & Family

License

Proprietary

Benchmark Performance Overview

Performance metrics and category breakdown

Overall Performance

12 benchmarks

Average Score

77.7%

Best Score

96.3%

High Performers (80%+)

Performance Metrics

Max Context Window

65.5K

Avg Throughput

102.0 tok/s

Avg Latency

0ms
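To make the throughput and context figures more concrete, here is a minimal sketch that turns them into a generation-time estimate and a context-fit check. The function names are hypothetical. Reading "65.5K" as 65,536 tokens and treating the window as shared between input and output are both assumptions, and the listed 0ms latency may simply mean no latency data was recorded.

```python
# Figures copied from the listing above.
AVG_THROUGHPUT_TOK_PER_S = 102.0
AVG_LATENCY_S = 0.0  # listed as 0ms; possibly a missing measurement

# Assumption: "65.5K" is 2**16 tokens, shared between input and output.
MAX_CONTEXT_TOKENS = 65_536


def estimate_generation_seconds(output_tokens: int,
                                latency_s: float = AVG_LATENCY_S) -> float:
    """Hypothetical helper: latency plus time to stream the output tokens."""
    return latency_s + output_tokens / AVG_THROUGHPUT_TOK_PER_S


def fits_in_context(input_tokens: int, output_tokens: int) -> bool:
    """Hypothetical helper: True if the request stays within the window."""
    return input_tokens + output_tokens <= MAX_CONTEXT_TOKENS


if __name__ == "__main__":
    print(f"{estimate_generation_seconds(510):.1f} s")
    print(fits_in_context(60_000, 4_000))
```

These are averages across providers, so real requests may be noticeably faster or slower.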

Top Categories

reasoning

93.0%

general

76.2%

math

68.5%

code

67.0%

Benchmark Performance

Top benchmark scores with normalized values (0-100%)

Ranking Across Benchmarks

Position relative to other models on each benchmark

AI2 Reasoning Challenge (ARC)

Rank #1 of 1

#1 GPT-4: 96.3%

HellaSwag

Rank #2 of 24

#1 Claude 3 Opus: 95.4%

#2 GPT-4: 95.3%

#3 Gemini 1.5 Pro: 93.3%

#4 Claude 3 Sonnet: 89.0%

#5 Command R+: 88.6%

Uniform Bar Exam

Rank #1 of 1

#1 GPT-4: 90.0%

SAT Math

Rank #1 of 1

#1 GPT-4: 89.0%

LSAT

Rank #1 of 1

#1 GPT-4: 88.0%

All Benchmark Results for GPT-4

Complete list of benchmark scores with detailed information


Benchmark	Category	Modality	Score	Normalized	Source
AI2 Reasoning Challenge (ARC)	reasoning	text	0.96	96.3%	Self-reported
HellaSwag	reasoning	text	0.95	95.3%	Self-reported
Uniform Bar Exam	general	text	0.90	90.0%	Self-reported
SAT Math	math	text	0.89	89.0%	Self-reported
LSAT	general	text	0.88	88.0%	Self-reported
Winogrande	reasoning	text	0.88	87.5%	Self-reported
MMLU	general	text	0.86	86.4%	Self-reported
DROP	general	text	0.81	80.9%	Self-reported
MGSM	math	text	0.74	74.5%	Self-reported
HumanEval	code	text	0.67	67.0%	Self-reported

Showing 1 to 10 of 12 benchmarks
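The category averages reported under Top Categories can be partly checked against the ten visible rows. The sketch below groups the visible scores by category and averages them; `VISIBLE_ROWS` and `category_averages` are hypothetical names, and the data is copied from the table above. The reasoning and code averages match the page (93.0% and 67.0%), while general and math come out higher (roughly 86% and 82%) than the listed 76.2% and 68.5%, presumably because the two benchmarks not shown fall into those categories.

```python
from collections import defaultdict

# (benchmark, category, normalized score in %) for the 10 visible rows only.
VISIBLE_ROWS = [
    ("AI2 Reasoning Challenge (ARC)", "reasoning", 96.3),
    ("HellaSwag", "reasoning", 95.3),
    ("Uniform Bar Exam", "general", 90.0),
    ("SAT Math", "math", 89.0),
    ("LSAT", "general", 88.0),
    ("Winogrande", "reasoning", 87.5),
    ("MMLU", "general", 86.4),
    ("DROP", "general", 80.9),
    ("MGSM", "math", 74.5),
    ("HumanEval", "code", 67.0),
]


def category_averages(rows):
    """Return {category: mean score} over the given rows."""
    scores = defaultdict(list)
    for _name, category, score in rows:
        scores[category].append(score)
    return {cat: sum(vals) / len(vals) for cat, vals in scores.items()}


if __name__ == "__main__":
    for cat, avg in sorted(category_averages(VISIBLE_ROWS).items()):
        print(f"{cat}: {avg:.1f}%")
```

The same grouping applied to all 12 benchmarks should reproduce the page's figures, but that cannot be confirmed from the rows shown here.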
