Microsoft

Phi-3.5-MoE-instruct

Zero-eval
#1 OpenBookQA
#1 PIQA
#1 RULER
+14 more

About

Phi-3.5-MoE-instruct is a mixture-of-experts language model developed by Microsoft. It achieves an average score of 65.6% across 31 benchmarks, with its strongest results on ARC-C (91.0%), OpenBookQA (89.6%), and GSM8k (88.7%). The model is especially strong on reasoning tasks, where it averages 85.4%. Its MIT license permits commercial use, making it suitable for enterprise applications. It was released in August 2024.

Timeline
Announced: Aug 23, 2024
Released: Aug 23, 2024
Specifications
Training Tokens: 4.9T
License & Family
License
MIT
Benchmark Performance Overview
Performance metrics and category breakdown

Overall Performance

31 benchmarks
Average Score
65.6%
Best Score
91.0%
High Performers (80%+)
11

Top Categories

reasoning
85.4%
factuality
77.5%
code
75.8%
math
69.0%
general
60.9%
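
The category averages above appear to be plain means of the normalized per-benchmark scores within each category. A minimal sketch, assuming the three reasoning benchmarks listed further down this page (ARC-C, HellaSwag, Winogrande) are the full reasoning set:

```python
# Reasoning-category scores listed on this page (normalized %).
# Assumption: these three are the only benchmarks in the "reasoning" category.
reasoning_scores = {
    "ARC-C": 91.0,
    "HellaSwag": 83.8,
    "Winogrande": 81.3,
}

average = sum(reasoning_scores.values()) / len(reasoning_scores)
print(round(average, 1))  # 85.4, matching the reported reasoning average
```

The match with the reported 85.4% suggests a simple unweighted mean; the other category averages would follow the same formula over their own benchmark sets.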
Benchmark Performance
Top benchmark scores with normalized values (0-100%)
Ranking Across Benchmarks
Position relative to other models on each benchmark

ARC-C

Rank #9 of 31
#6 Jamba 1.5 Large
93.0%
#7 Nova Lite
92.4%
#8 Mistral Small 3 24B Base
91.3%
#9 Phi-3.5-MoE-instruct
91.0%
#10 Nova Micro
90.2%
#11 Claude 3 Haiku
89.2%
#12 Jamba 1.5 Mini
85.7%

OpenBookQA

Rank #1 of 4
#1 Phi-3.5-MoE-instruct
89.6%
#2 Phi 4 Mini
79.2%
#3 Phi-3.5-mini-instruct
79.2%
#4 Mistral NeMo Instruct
60.6%

GSM8k

Rank #30 of 46
#27 Gemma 3 4B
89.2%
#28 Claude 3 Haiku
88.9%
#29 Qwen2.5-Omni-7B
88.7%
#30 Phi-3.5-MoE-instruct
88.7%
#31 Phi 4 Mini
88.6%
#32 Jamba 1.5 Large
87.0%
#33 Phi-3.5-mini-instruct
86.2%

PIQA

Rank #1 of 9
#1 Phi-3.5-MoE-instruct
88.6%
#2 Gemma 2 27B
83.2%
#3 Gemma 2 9B
81.7%
#4 Phi-3.5-mini-instruct
81.0%

RULER

Rank #1 of 2
#1 Phi-3.5-MoE-instruct
87.1%
#2 Phi-3.5-mini-instruct
84.1%
All Benchmark Results for Phi-3.5-MoE-instruct
Complete list of benchmark scores with detailed information
ARC-C (reasoning, text): 91.0%, self-reported
OpenBookQA (general, text): 89.6%, self-reported
GSM8k (math, text): 88.7%, self-reported
PIQA (general, text): 88.6%, self-reported
RULER (general, text): 87.1%, self-reported
RepoQA (general, text): 85.0%, self-reported
BoolQ (general, text): 84.6%, self-reported
HellaSwag (reasoning, text): 83.8%, self-reported
MEGA XStoryCloze (general, text): 82.8%, self-reported
Winogrande (reasoning, text): 81.3%, self-reported
Showing 1 to 10 of 31 benchmarks