Phi-4-multimodal-instruct

Name: Phi-4-multimodal-instruct
Price: 0.05 USD per 1M input tokens
Average score: 72.0% (15 benchmarks)
Author: Microsoft

Multimodal

Zero-eval

#1 ScienceQA Visual

#1 BLINK

#1 InterGPS


About

Phi-4-multimodal-instruct is a multimodal language model developed by Microsoft. It averages 72.0% across 15 benchmarks, with its highest scores on ScienceQA Visual (97.5%), DocVQA (93.2%), and MMBench (86.7%). By category, it scores highest on general tasks, averaging 75.8%. It supports a 256K-token context window for handling large documents and is available through 1 API provider. As a multimodal model, it accepts text, image, and audio input. It is released under the MIT license, which permits commercial use. It was released in February 2025.

Pricing Range

Input (per 1M tokens): $0.05 - $0.05

Output (per 1M tokens): $0.10 - $0.10

Providers: 1
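
As a rough illustration of what the per-1M-token prices mean for a single request, here is a minimal sketch. The function name `estimate_cost_usd` is hypothetical, and the sketch assumes billing is purely per token at the listed rates; how a provider counts image or audio input is not stated in this listing.

```python
# Prices copied from the listing (USD per 1M tokens).
INPUT_PRICE_PER_M = 0.05
OUTPUT_PRICE_PER_M = 0.10


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request (hypothetical helper, not a provider API)."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    return input_cost + output_cost


if __name__ == "__main__":
    # Example: a 200,000-token prompt with a 2,000-token reply.
    print(f"${estimate_cost_usd(200_000, 2_000):.4f}")
```

Under these assumptions, a 200,000-token prompt with a 2,000-token reply costs about $0.0102.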

Timeline

Announced: Feb 1, 2025

Released: Feb 1, 2025

Knowledge Cutoff: Jun 1, 2024

Specifications

Training Tokens: 5.0T

Capabilities

Multimodal

License & Family

License: MIT

Benchmark Performance Overview

Performance metrics and category breakdown

Overall Performance (15 benchmarks)

Average Score: 72.0%

Best Score: 97.5%

High Performers (80%+)

Performance Metrics

Max Context Window: 256.0K

Avg Throughput: 25.0 tok/s

Avg Latency: 1 ms
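
To put the throughput figure in practical terms, the sketch below estimates how long a reply of a given length might take. It assumes the listed latency is the delay before the first token and that tokens then arrive at the average throughput; the listing does not define either metric, and `estimate_response_seconds` is a hypothetical helper.

```python
# Figures copied from the listing.
AVG_THROUGHPUT_TOK_PER_S = 25.0
AVG_LATENCY_S = 0.001  # 1 ms


def estimate_response_seconds(output_tokens: int) -> float:
    """Estimate wall-clock time for a reply (assumes latency + steady streaming)."""
    return AVG_LATENCY_S + output_tokens / AVG_THROUGHPUT_TOK_PER_S


if __name__ == "__main__":
    # Example: a 500-token reply.
    print(f"{estimate_response_seconds(500):.3f} s")
```

With these numbers, generation time dominates: a 500-token reply takes about 20 seconds, and the 1 ms latency is negligible.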

Top Categories

general: 75.8%

vision: 69.7%

math: 62.4%

Benchmark Performance

Top benchmark scores with normalized values (0-100%)

Ranking Across Benchmarks

Position relative to other models on each benchmark

ScienceQA Visual

Rank #1 of 1

#1 Phi-4-multimodal-instruct: 97.5%

DocVQA

Rank #14 of 26

#11 Grok-2 mini: 93.2%
#12 Pixtral Large: 93.3%
#13 DeepSeek VL2: 93.3%
#14 Phi-4-multimodal-instruct: 93.2%
#15 GPT-4o: 92.8%
#16 Nova Lite: 92.4%
#17 DeepSeek VL2 Small: 92.3%

MMBench

Rank #2 of 7

#1 Qwen2.5 VL 72B Instruct: 88.0%
#2 Phi-4-multimodal-instruct: 86.7%
#3 Qwen2.5 VL 7B Instruct: 84.3%
#4 Phi-3.5-vision-instruct: 81.9%
#5 DeepSeek VL2 Small: 80.3%

POPE

Rank #2 of 2

#1 Phi-3.5-vision-instruct: 86.1%
#2 Phi-4-multimodal-instruct: 85.6%

OCRBench

Rank #4 of 7

#1 Qwen2.5 VL 7B Instruct: 86.4%
#2 Qwen2-VL-72B-Instruct: 87.7%
#3 Qwen2.5 VL 72B Instruct: 88.5%
#4 Phi-4-multimodal-instruct: 84.4%
#5 DeepSeek VL2 Small: 83.4%
#6 DeepSeek VL2: 81.1%
#7 DeepSeek VL2 Tiny: 80.9%

All Benchmark Results for Phi-4-multimodal-instruct

Complete list of benchmark scores with detailed information


Benchmark	Category	Modality	Score	Normalized	Source
ScienceQA Visual	vision	multimodal	0.97	97.5%	Self-reported
DocVQA	vision	multimodal	0.93	93.2%	Self-reported
MMBench	general	text	0.87	86.7%	Self-reported
POPE	general	text	0.86	85.6%	Self-reported
OCRBench	general	text	0.84	84.4%	Self-reported
AI2D	general	text	0.82	82.3%	Self-reported
ChartQA	general	multimodal	0.81	81.4%	Self-reported
TextVQA	vision	multimodal	0.76	75.6%	Self-reported
InfoVQA	vision	multimodal	0.73	72.7%	Self-reported
MathVista	math	text	0.62	62.4%	Self-reported

Showing 1 to 10 of 15 benchmarks
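
Because only 10 of the 15 benchmarks are listed, averages computed from the visible rows will not match the page's headline figures. The sketch below groups the listed scores by category and, assuming the reported 72.0% is a plain unweighted mean over all 15 benchmarks (the listing does not say how it is computed), derives the mean the five unlisted benchmarks would need. All function names are hypothetical.

```python
from collections import defaultdict

# (benchmark, category, normalized score in %) copied from the table above.
ROWS = [
    ("ScienceQA Visual", "vision", 97.5),
    ("DocVQA", "vision", 93.2),
    ("MMBench", "general", 86.7),
    ("POPE", "general", 85.6),
    ("OCRBench", "general", 84.4),
    ("AI2D", "general", 82.3),
    ("ChartQA", "general", 81.4),
    ("TextVQA", "vision", 75.6),
    ("InfoVQA", "vision", 72.7),
    ("MathVista", "math", 62.4),
]


def category_means(rows):
    """Mean normalized score per category, over the listed rows only."""
    by_category = defaultdict(list)
    for _, category, score in rows:
        by_category[category].append(score)
    return {cat: sum(s) / len(s) for cat, s in by_category.items()}


def implied_unlisted_mean(rows, overall_mean=72.0, total_benchmarks=15):
    """Mean the unlisted benchmarks must have for the overall mean to hold.

    Assumes the overall figure is an unweighted mean of all benchmarks.
    """
    listed_sum = sum(score for _, _, score in rows)
    unlisted_count = total_benchmarks - len(rows)
    return (overall_mean * total_benchmarks - listed_sum) / unlisted_count


if __name__ == "__main__":
    for category, mean in category_means(ROWS).items():
        print(f"{category}: {mean:.2f}%")
    print(f"implied mean of unlisted benchmarks: {implied_unlisted_mean(ROWS):.2f}%")
```

Over the listed rows alone, vision and general each average roughly 84 to 85%, well above the page's 69.7% and 75.8%. Under the unweighted-mean assumption, the five unlisted benchmarks would average about 51.6%, which suggests the lower scores are among the rows not shown.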
