Phi-3.5-vision-instruct
by Microsoft

Multimodal · Zero-eval
#1 ScienceQA · #1 POPE · #2 InterGPS

About

Phi-3.5-vision-instruct is a multimodal language model developed by Microsoft. Across the 9 benchmarks tracked here it averages 68.3%, with its strongest results on ScienceQA (91.3%), POPE (86.1%), and MMBench (81.9%). Its best category is general tasks, where it averages 75.9%. As a multimodal model, it accepts both text and image inputs. It is released under the MIT license, which permits commercial use, making it suitable for enterprise applications. Released in August 2024, it represents Microsoft's latest vision-language model at that time.

Timeline
Announced: Aug 23, 2024
Released: Aug 23, 2024
Specifications
Training Tokens: 500.0B
Capabilities
Multimodal
License & Family
License
MIT
Benchmark Performance Overview
Performance metrics and category breakdown

Overall Performance

9 benchmarks
Average Score
68.3%
Best Score
91.3%
High Performers (80%+)
4

Top Categories

general
75.9%
vision
57.5%
math
43.9%
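The category and overall averages quoted on this page follow directly from the nine self-reported benchmark scores listed further down. A minimal Python sketch to reproduce them (the score dictionary is transcribed from those results; the grouping logic is an assumption about how the card computes its averages):

```python
# Per-benchmark scores (in %) with their categories, transcribed from this card.
scores = {
    "ScienceQA": ("general", 91.3),
    "POPE": ("general", 86.1),
    "MMBench": ("general", 81.9),
    "ChartQA": ("general", 81.8),
    "AI2D": ("general", 78.1),
    "TextVQA": ("vision", 72.0),
    "MathVista": ("math", 43.9),
    "MMMU": ("vision", 43.0),
    "InterGPS": ("general", 36.3),
}

# Group scores by category, then average each group and the full set.
by_category = {}
for name, (category, pct) in scores.items():
    by_category.setdefault(category, []).append(pct)

category_avg = {c: round(sum(v) / len(v), 1) for c, v in by_category.items()}
overall_avg = round(sum(p for _, p in scores.values()) / len(scores), 1)

print(category_avg)  # {'general': 75.9, 'vision': 57.5, 'math': 43.9}
print(overall_avg)   # 68.3
```

The unweighted mean over all nine benchmarks gives the 68.3% headline figure, and the per-category means match the breakdown above.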
Benchmark Performance
Top benchmark scores with normalized values (0-100%)
Ranking Across Benchmarks
Position relative to other models on each benchmark

ScienceQA

Rank #1 of 1
#1 Phi-3.5-vision-instruct
91.3%

POPE

Rank #1 of 2
#1 Phi-3.5-vision-instruct
86.1%
#2 Phi-4-multimodal-instruct
85.6%

MMBench

Rank #4 of 7
#1 Qwen2.5 VL 72B Instruct
88.0%
#2 Phi-4-multimodal-instruct
86.7%
#3 Qwen2.5 VL 7B Instruct
84.3%
#4 Phi-3.5-vision-instruct
81.9%
#5 DeepSeek VL2 Small
80.3%
#6 DeepSeek VL2
79.6%
#7 DeepSeek VL2 Tiny
69.2%

ChartQA

Rank #18 of 24
#15 DeepSeek VL2 Small
84.5%
#16 Llama 3.2 11B Instruct
83.4%
#17 Pixtral-12B
81.8%
#18 Phi-3.5-vision-instruct
81.8%
#19 Phi-4-multimodal-instruct
81.4%
#20 DeepSeek VL2 Tiny
81.0%
#21 Gemma 3 27B
78.0%

AI2D

Rank #15 of 17
#12 Phi-4-multimodal-instruct
82.3%
#13 DeepSeek VL2
81.4%
#14 DeepSeek VL2 Small
80.0%
#15 Phi-3.5-vision-instruct
78.1%
#16 Gemma 3 4B
74.8%
#17 DeepSeek VL2 Tiny
71.6%
All Benchmark Results for Phi-3.5-vision-instruct
Complete list of benchmark scores with detailed information
Benchmark | Category | Modality | Raw Score | Normalized | Source
ScienceQA | general | text | 0.91 | 91.3% | Self-reported
POPE | general | text | 0.86 | 86.1% | Self-reported
MMBench | general | text | 0.82 | 81.9% | Self-reported
ChartQA | general | multimodal | 0.82 | 81.8% | Self-reported
AI2D | general | text | 0.78 | 78.1% | Self-reported
TextVQA | vision | multimodal | 0.72 | 72.0% | Self-reported
MathVista | math | text | 0.44 | 43.9% | Self-reported
MMMU | vision | multimodal | 0.43 | 43.0% | Self-reported
InterGPS | general | text | 0.36 | 36.3% | Self-reported