
Phi 4
Rank #3 on PhiBench
by Microsoft
About
Phi 4 is a language model developed by Microsoft. It achieves strong performance, with an average score of 66.0% across 13 benchmarks, and does particularly well on MMLU (84.8%), HumanEval+ (82.8%), and HumanEval (82.6%). Its strongest category is math, where it averages 80.5%. The model is available through 1 API provider and is licensed for commercial use under MIT, making it suitable for enterprise applications. Released in December 2024, it represents Microsoft's latest advancement in AI technology.
Pricing Range
Input (per 1M tokens): $0.07
Output (per 1M tokens): $0.14
Providers: 1
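At these rates, request cost is linear in token counts. A minimal sketch of the arithmetic (prices from the table above; the token counts in the example are hypothetical):

```python
# Listed Phi 4 rates, converted from USD per 1M tokens to USD per token
INPUT_PRICE = 0.07 / 1_000_000   # $0.07 per 1M input tokens
OUTPUT_PRICE = 0.14 / 1_000_000  # $0.14 per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at the listed rates."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# Hypothetical example: a 10,000-token prompt with a 2,000-token completion
print(f"${request_cost(10_000, 2_000):.6f}")  # $0.000980
```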
Timeline
Announced: Dec 12, 2024
Released: Dec 12, 2024
Knowledge Cutoff: Jun 1, 2024
Specifications
Training Tokens: 9.8T
License & Family
License
MIT
Benchmark Performance Overview
Performance metrics and category breakdown
Overall Performance
13 benchmarks
Average Score
66.0%
Best Score
84.8%
High Performers (80%+)
5
Performance Metrics
Max Context Window
32.0K
Avg Throughput
33.0 tok/s
Avg Latency
0ms
Top Categories
math
80.5%
code
76.1%
general
60.2%
roleplay
47.6%
Benchmark Performance
Top benchmark scores with normalized values (0-100%)
Ranking Across Benchmarks
Position relative to other models on each benchmark
MMLU
Rank #30 of 78
#27 o1-mini
85.2%
#28 Llama 4 Maverick
85.5%
#29 GPT-4o
85.7%
#30 Phi 4
84.8%
#31 Mistral Large 2
84.0%
#32 Llama 3.1 70B Instruct
83.6%
#33 Qwen2.5 32B Instruct
83.3%
HumanEval+
Rank #5 of 8
#2 Granite 3.3 8B Base
86.1%
#3 Granite 3.3 8B Instruct
86.1%
#4 Phi 4 Reasoning Plus
92.3%
#5 Phi 4
82.8%
#6 IBM Granite 4.0 Tiny Preview
78.3%
#7 Qwen2.5 32B Instruct
52.4%
#8 Qwen2.5 14B Instruct
51.2%
HumanEval
Rank #37 of 62
#34 Qwen2.5 14B Instruct
83.5%
#35 Gemini 1.5 Pro
84.1%
#36 Mistral Small 3 24B Instruct
84.8%
#37 Phi 4
82.6%
#38 IBM Granite 4.0 Tiny Preview
82.4%
#39 Codestral-22B
81.1%
#40 Nova Micro
81.1%
MGSM
Rank #19 of 31
#16 Gemini 1.5 Flash
82.6%
#17 Claude 3 Sonnet
83.5%
#18 Qwen3 235B A22B
83.5%
#19 Phi 4
80.6%
#20 Claude 3 Haiku
75.1%
#21 GPT-4
74.5%
#22 Llama 3.2 11B Instruct
68.9%
MATH
Rank #13 of 63
#10 Qwen2.5 VL 32B Instruct
82.2%
#11 Qwen2.5 32B Instruct
83.1%
#12 Qwen2.5 72B Instruct
83.1%
#13 Phi 4
80.4%
#14 Qwen2.5 14B Instruct
80.0%
#15 Claude 3.5 Sonnet
78.3%
#16 Gemini 1.5 Flash
77.9%
All Benchmark Results for Phi 4
Complete list of benchmark scores with detailed information
Benchmark | Category | Modality | Normalized | Score | Source
MMLU | general | text | 0.85 | 84.8% | Self-reported
HumanEval+ | code | text | 0.83 | 82.8% | Self-reported
HumanEval | code | text | 0.83 | 82.6% | Self-reported
MGSM | math | text | 0.81 | 80.6% | Self-reported
MATH | math | text | 0.80 | 80.4% | Self-reported
DROP | general | text | 0.76 | 75.5% | Self-reported
Arena Hard | general | text | 0.75 | 75.4% | Self-reported
MMLU-Pro | general | text | 0.70 | 70.4% | Self-reported
IFEval | code | text | 0.63 | 63.0% | Self-reported
PhiBench | general | text | 0.56 | 56.2% | Self-reported
Showing 1 to 10 of 13 benchmarks
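As a sanity check, the category averages shown under Top Categories can be reproduced from the per-benchmark rows above. A minimal sketch (note: only 10 of the 13 benchmarks appear on this page, so the math and code averages reproduce exactly, while the general and roleplay averages also depend on benchmarks not listed here):

```python
# Per-benchmark scores copied from the table above (category, score %)
scores = {
    "MMLU": ("general", 84.8),
    "HumanEval+": ("code", 82.8),
    "HumanEval": ("code", 82.6),
    "MGSM": ("math", 80.6),
    "MATH": ("math", 80.4),
    "DROP": ("general", 75.5),
    "Arena Hard": ("general", 75.4),
    "MMLU-Pro": ("general", 70.4),
    "IFEval": ("code", 63.0),
    "PhiBench": ("general", 56.2),
}

def category_average(category: str) -> float:
    """Mean score over all listed benchmarks in the given category."""
    vals = [score for cat, score in scores.values() if cat == category]
    return round(sum(vals) / len(vals), 1)

print(category_average("math"))  # 80.5, matching the Top Categories figure
print(category_average("code"))  # 76.1, matching the Top Categories figure
```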