Phi 4 Mini

Zero-eval rankings: #2 OpenBookQA, #2 Multilingual MMLU, #3 Social IQa, +1 more

by Microsoft

About

Phi 4 Mini is a language model developed by Microsoft. It achieves strong performance, with an average score of 65.4% across 17 benchmarks, and does particularly well on GSM8k (88.6%), ARC-C (83.7%), and BoolQ (81.2%). Its MIT license permits commercial use, making it suitable for enterprise applications. Released in 2025, it represents Microsoft's latest advancement in AI technology.
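The summary statistics on this page (average score, best score, count of 80%+ benchmarks) can be reproduced directly from the per-benchmark scores. A minimal sketch, using only the five top self-reported scores listed further down the page rather than the full set of 17, so the sample average differs from the page's 65.4%:

```python
# Sample of Phi 4 Mini's self-reported benchmark scores from this page
# (not the full set of 17 benchmarks).
scores = {
    "GSM8k": 88.6,
    "ARC-C": 83.7,
    "BoolQ": 81.2,
    "OpenBookQA": 79.2,
    "PIQA": 77.6,
}

average = sum(scores.values()) / len(scores)
best = max(scores.values())
high_performers = [name for name, s in scores.items() if s >= 80.0]

print(f"Average Score (sample): {average:.1f}%")
print(f"Best Score: {best:.1f}%")
print(f"High Performers (80%+): {len(high_performers)}")
```

On this sample, the 80%+ count comes out to 3 (GSM8k, ARC-C, BoolQ), matching the page's "High Performers" figure.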

Timeline
Announced: Feb 1, 2025
Released: Feb 1, 2025
Knowledge Cutoff: Jun 1, 2024

Specifications
Training Tokens: 5.0T

License & Family
License: MIT
Benchmark Performance Overview
Performance metrics and category breakdown

Overall Performance (17 benchmarks)
Average Score: 65.4%
Best Score: 88.6%
High Performers (80%+): 3

Top Categories
reasoning: 73.3%
math: 72.2%
factuality: 66.4%
general: 60.8%
Benchmark Performance
Top benchmark scores with normalized values (0-100%)
Ranking Across Benchmarks
Position relative to other models on each benchmark

GSM8k
Rank #31 of 46
#28 Phi-3.5-MoE-instruct: 88.7%
#29 Qwen2.5-Omni-7B: 88.7%
#30 Claude 3 Haiku: 88.9%
#31 Phi 4 Mini: 88.6%
#32 Jamba 1.5 Large: 87.0%
#33 Phi-3.5-mini-instruct: 86.2%
#34 Gemini 1.5 Flash: 86.2%

ARC-C
Rank #14 of 31
#11 Phi-3.5-mini-instruct: 84.6%
#12 Jamba 1.5 Mini: 85.7%
#13 Claude 3 Haiku: 89.2%
#14 Phi 4 Mini: 83.7%
#15 Llama 3.1 8B Instruct: 83.4%
#16 Llama 3.2 3B Instruct: 78.6%
#17 Ministral 8B Instruct: 71.9%

BoolQ
Rank #6 of 9
#3 Gemma 3n E4B Instructed LiteRT Preview: 81.6%
#4 Gemma 3n E4B: 81.6%
#5 Gemma 2 9B: 84.2%
#6 Phi 4 Mini: 81.2%
#7 Phi-3.5-mini-instruct: 78.0%
#8 Gemma 3n E2B: 76.4%
#9 Gemma 3n E2B Instructed LiteRT (Preview): 76.4%

OpenBookQA
Rank #2 of 4
#1 Phi-3.5-MoE-instruct: 89.6%
#2 Phi 4 Mini: 79.2%
#3 Phi-3.5-mini-instruct: 79.2%
#4 Mistral NeMo Instruct: 60.6%

PIQA
Rank #9 of 9
#6 Gemma 3n E2B: 78.9%
#7 Gemma 3n E2B Instructed LiteRT (Preview): 78.9%
#8 Gemma 3n E4B: 81.0%
#9 Phi 4 Mini: 77.6%
All Benchmark Results for Phi 4 Mini
Complete list of benchmark scores with detailed information
Benchmark | Category | Modality | Raw Score | Normalized | Source
GSM8k | math | text | 0.89 | 88.6% | Self-reported
ARC-C | reasoning | text | 0.84 | 83.7% | Self-reported
BoolQ | general | text | 0.81 | 81.2% | Self-reported
OpenBookQA | general | text | 0.79 | 79.2% | Self-reported
PIQA | general | text | 0.78 | 77.6% | Self-reported
Social IQa | general | text | 0.72 | 72.5% | Self-reported
BIG-Bench Hard | general | text | 0.70 | 70.4% | Self-reported
HellaSwag | reasoning | text | 0.69 | 69.1% | Self-reported
MMLU | general | text | 0.67 | 67.3% | Self-reported
Winogrande | reasoning | text | 0.67 | 67.0% | Self-reported
Showing 10 of 17 benchmarks.
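The results table pairs each raw 0-1 score with a normalized 0-100% display value. A minimal sketch of that conversion; the one-decimal rounding is an assumption about how this page formats values:

```python
def normalize(raw: float) -> str:
    """Convert a raw 0-1 benchmark score to the page's 0-100% display form."""
    return f"{raw * 100:.1f}%"

# e.g. Phi 4 Mini's GSM8k raw score of 0.886 displays as 88.6%
print(normalize(0.886))
print(normalize(0.792))
```

Note the raw-score column itself is rounded to two decimals (0.886 shows as 0.89), which is why it can look slightly inconsistent with the percentage column.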