
Phi 4 Mini
Zero-eval
#2 OpenBookQA
#2 Multilingual MMLU
#3 Social IQa
+1 more
by Microsoft
About
Phi 4 Mini is a language model developed by Microsoft. It achieves an average score of 65.4% across 17 benchmarks, with its strongest results on GSM8k (88.6%), ARC-C (83.7%), and BoolQ (81.2%). Its MIT license permits commercial use, making it suitable for enterprise applications. It was released in February 2025.
Timeline
Announced: Feb 1, 2025
Released: Feb 1, 2025
Knowledge Cutoff: Jun 1, 2024
Specifications
Training Tokens: 5.0T
License & Family
License
MIT
Benchmark Performance Overview
Performance metrics and category breakdown
Overall Performance
17 benchmarks
Average Score
65.4%
Best Score
88.6%
High Performers (80%+): 3
Top Categories
reasoning: 73.3%
math: 72.2%
factuality: 66.4%
general: 60.8%
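The category figures above can be reproduced by averaging each benchmark's normalized score within its category. A minimal sketch, using only the ten benchmark scores listed on this page (the page's own averages cover all 17 benchmarks, so categories with unlisted benchmarks will differ; the data tuples below are taken from the table further down):

```python
# Sketch: group normalized benchmark scores by category and average
# each group. Scores are the ten shown on this page; categories with
# benchmarks not shown here (e.g. math, general) will not match the
# page's 17-benchmark averages exactly.
from collections import defaultdict

SCORES = [
    ("GSM8k", "math", 88.6),
    ("ARC-C", "reasoning", 83.7),
    ("BoolQ", "general", 81.2),
    ("OpenBookQA", "general", 79.2),
    ("PIQA", "general", 77.6),
    ("Social IQa", "general", 72.5),
    ("BIG-Bench Hard", "general", 70.4),
    ("HellaSwag", "reasoning", 69.1),
    ("MMLU", "general", 67.3),
    ("Winogrande", "reasoning", 67.0),
]

def category_averages(scores):
    """Group normalized scores by category and average each group."""
    buckets = defaultdict(list)
    for _, category, score in scores:
        buckets[category].append(score)
    return {cat: round(sum(v) / len(v), 1) for cat, v in buckets.items()}

print(category_averages(SCORES))
```

With the three reasoning benchmarks shown (ARC-C, HellaSwag, Winogrande), this yields 73.3%, matching the page's reasoning figure.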
Benchmark Performance
Top benchmark scores with normalized values (0-100%)
Ranking Across Benchmarks
Position relative to other models on each benchmark
GSM8k
Rank #31 of 46
#28 Phi-3.5-MoE-instruct: 88.7%
#29 Qwen2.5-Omni-7B: 88.7%
#30 Claude 3 Haiku: 88.9%
#31 Phi 4 Mini: 88.6%
#32 Jamba 1.5 Large: 87.0%
#33 Phi-3.5-mini-instruct: 86.2%
#34 Gemini 1.5 Flash: 86.2%
ARC-C
Rank #14 of 31
#11 Phi-3.5-mini-instruct: 84.6%
#12 Jamba 1.5 Mini: 85.7%
#13 Claude 3 Haiku: 89.2%
#14 Phi 4 Mini: 83.7%
#15 Llama 3.1 8B Instruct: 83.4%
#16 Llama 3.2 3B Instruct: 78.6%
#17 Ministral 8B Instruct: 71.9%
BoolQ
Rank #6 of 9
#3 Gemma 3n E4B Instructed LiteRT Preview: 81.6%
#4 Gemma 3n E4B: 81.6%
#5 Gemma 2 9B: 84.2%
#6 Phi 4 Mini: 81.2%
#7 Phi-3.5-mini-instruct: 78.0%
#8 Gemma 3n E2B: 76.4%
#9 Gemma 3n E2B Instructed LiteRT (Preview): 76.4%
OpenBookQA
Rank #2 of 4
#1 Phi-3.5-MoE-instruct: 89.6%
#2 Phi 4 Mini: 79.2%
#3 Phi-3.5-mini-instruct: 79.2%
#4 Mistral NeMo Instruct: 60.6%
PIQA
Rank #9 of 9
#6 Gemma 3n E2B: 78.9%
#7 Gemma 3n E2B Instructed LiteRT (Preview): 78.9%
#8 Gemma 3n E4B: 81.0%
#9 Phi 4 Mini: 77.6%
All Benchmark Results for Phi 4 Mini
Complete list of benchmark scores with detailed information
Benchmark | Category | Modality | Raw Score | Normalized | Source
GSM8k | math | text | 0.89 | 88.6% | Self-reported
ARC-C | reasoning | text | 0.84 | 83.7% | Self-reported
BoolQ | general | text | 0.81 | 81.2% | Self-reported
OpenBookQA | general | text | 0.79 | 79.2% | Self-reported
PIQA | general | text | 0.78 | 77.6% | Self-reported
Social IQa | general | text | 0.72 | 72.5% | Self-reported
BIG-Bench Hard | general | text | 0.70 | 70.4% | Self-reported
HellaSwag | reasoning | text | 0.69 | 69.1% | Self-reported
MMLU | general | text | 0.67 | 67.3% | Self-reported
Winogrande | reasoning | text | 0.67 | 67.0% | Self-reported
Showing 1 to 10 of 17 benchmarks
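The "Raw Score" column appears to be the normalized percentage expressed as a fraction and rounded to two decimals (e.g. 88.6% becomes 0.89). A minimal round-trip check under that assumption:

```python
# Sketch: verify the apparent relationship between the Normalized and
# Raw Score columns (raw = percentage / 100, rounded to 2 decimals).
# This is an inferred relationship, not one stated on the page.
def to_raw(percent: float) -> float:
    return round(percent / 100.0, 2)

rows = [("GSM8k", 88.6, 0.89), ("ARC-C", 83.7, 0.84), ("MMLU", 67.3, 0.67)]
for name, pct, raw in rows:
    assert to_raw(pct) == raw, name
```

Every row shown in the table satisfies this relationship.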
Resources