
Phi-3.5-MoE-instruct
Zero-eval
#1 OpenBookQA
#1 PIQA
#1 RULER
+14 more
by Microsoft
About
Phi-3.5-MoE-instruct is a mixture-of-experts language model developed by Microsoft. It achieves strong performance, with an average score of 65.6% across 31 benchmarks, and excels particularly in ARC-C (91.0%), OpenBookQA (89.6%), and GSM8k (88.7%). The model is strongest on reasoning tasks, where it averages 85.4%. Its MIT license permits commercial use, making it suitable for enterprise applications. It was released in August 2024.
Timeline
Announced: Aug 23, 2024
Released: Aug 23, 2024
Specifications
Training Tokens: 4.9T
License & Family
License
MIT
Benchmark Performance Overview
Performance metrics and category breakdown
Overall Performance
31 benchmarks
Average Score
65.6%
Best Score
91.0%
High Performers (80%+): 11
Top Categories
reasoning: 85.4%
factuality: 77.5%
code: 75.8%
math: 69.0%
general: 60.9%
Benchmark Performance
Top benchmark scores with normalized values (0-100%)
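The normalized values appear to be the raw scores (on a 0-1 scale) mapped to percentages; a minimal sketch of that mapping (the `normalize` helper is hypothetical, not part of any benchmark suite):

```python
def normalize(raw: float) -> float:
    """Map a raw score in [0, 1] to the 0-100% scale, rounded to one
    decimal place (assumption inferred from the table values)."""
    return round(raw * 100, 1)

# Raw scores taken from the results table on this page; OpenBookQA's
# and GSM8k's unrounded values are inferred from their percentages.
raw_scores = {"ARC-C": 0.91, "OpenBookQA": 0.896, "GSM8k": 0.887}
normalized = {name: normalize(s) for name, s in raw_scores.items()}
print(normalized)  # {'ARC-C': 91.0, 'OpenBookQA': 89.6, 'GSM8k': 88.7}
```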
Ranking Across Benchmarks
Position relative to other models on each benchmark
ARC-C
Rank #9 of 31
#6 Mistral Small 3 24B Base: 91.3%
#7 Nova Lite: 92.4%
#8 Jamba 1.5 Large: 93.0%
#9 Phi-3.5-MoE-instruct: 91.0%
#10 Nova Micro: 90.2%
#11 Claude 3 Haiku: 89.2%
#12 Jamba 1.5 Mini: 85.7%
OpenBookQA
Rank #1 of 4
#1 Phi-3.5-MoE-instruct: 89.6%
#2 Phi 4 Mini: 79.2%
#3 Phi-3.5-mini-instruct: 79.2%
#4 Mistral NeMo Instruct: 60.6%
GSM8k
Rank #30 of 46
#27 Qwen2.5-Omni-7B: 88.7%
#28 Claude 3 Haiku: 88.9%
#29 Gemma 3 4B: 89.2%
#30 Phi-3.5-MoE-instruct: 88.7%
#31 Phi 4 Mini: 88.6%
#32 Jamba 1.5 Large: 87.0%
#33 Phi-3.5-mini-instruct: 86.2%
PIQA
Rank #1 of 9
#1 Phi-3.5-MoE-instruct: 88.6%
#2 Gemma 2 27B: 83.2%
#3 Gemma 2 9B: 81.7%
#4 Phi-3.5-mini-instruct: 81.0%
RULER
Rank #1 of 2
#1 Phi-3.5-MoE-instruct: 87.1%
#2 Phi-3.5-mini-instruct: 84.1%
All Benchmark Results for Phi-3.5-MoE-instruct
Complete list of benchmark scores with detailed information
| Benchmark | Category | Modality | Raw Score | Normalized | Source |
|---|---|---|---|---|---|
| ARC-C | reasoning | text | 0.91 | 91.0% | Self-reported |
| OpenBookQA | general | text | 0.90 | 89.6% | Self-reported |
| GSM8k | math | text | 0.89 | 88.7% | Self-reported |
| PIQA | general | text | 0.89 | 88.6% | Self-reported |
| RULER | general | text | 0.87 | 87.1% | Self-reported |
| RepoQA | general | text | 0.85 | 85.0% | Self-reported |
| BoolQ | general | text | 0.85 | 84.6% | Self-reported |
| HellaSwag | reasoning | text | 0.84 | 83.8% | Self-reported |
| MEGA XStoryCloze | general | text | 0.83 | 82.8% | Self-reported |
| Winogrande | reasoning | text | 0.81 | 81.3% | Self-reported |
Showing 1 to 10 of 31 benchmarks