Phi 4
by Microsoft

Ranked #3 on PhiBench

About

Phi 4 is a language model developed by Microsoft. It achieves strong performance, with an average score of 66.0% across 13 benchmarks, and its best results come on MMLU (84.8%), HumanEval+ (82.8%), and HumanEval (82.6%). It is strongest in the math category, where it averages 80.5%. The model is available through one API provider and is MIT-licensed, permitting commercial use in enterprise applications. It was released in December 2024.

Pricing Range
Input (per 1M tokens): $0.07
Output (per 1M tokens): $0.14
Providers: 1

Timeline
Announced: Dec 12, 2024
Released: Dec 12, 2024
Knowledge Cutoff: Jun 1, 2024

Specifications
Training Tokens: 9.8T

License & Family
License: MIT
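
At these rates, per-request cost is a token-weighted sum. A minimal sketch (Python); the 10,000/2,000 token counts are made-up example values, not from this page:

```python
# Phi 4 pricing from the table above (USD per 1M tokens).
INPUT_PER_M = 0.07
OUTPUT_PER_M = 0.14

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at Phi 4's listed rates."""
    return (input_tokens / 1_000_000) * INPUT_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PER_M

# Example: 10,000 input + 2,000 output tokens
# -> 0.0007 + 0.00028 = $0.00098 (about a tenth of a cent).
print(f"${request_cost(10_000, 2_000):.5f}")  # $0.00098
```
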
Benchmark Performance Overview
Performance metrics and category breakdown

Overall Performance (13 benchmarks)
Average Score: 66.0%
Best Score: 84.8%
High Performers (80%+): 5

Performance Metrics
Max Context Window: 32.0K tokens
Avg Throughput: 33.0 tok/s
Avg Latency: 0 ms
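
At the reported throughput, generation time scales linearly with output length. A back-of-envelope sketch (Python), which ignores prompt processing and network overhead:

```python
THROUGHPUT_TOK_S = 33.0  # avg throughput reported above

def generation_time_s(output_tokens: int) -> float:
    """Rough decode time; ignores prompt processing and network overhead."""
    return output_tokens / THROUGHPUT_TOK_S

# A 1,000-token completion takes roughly 1000 / 33 ≈ 30 s.
print(f"{generation_time_s(1_000):.1f} s")  # 30.3 s
```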

Top Categories
math: 80.5%
code: 76.1%
general: 60.2%
roleplay: 47.6%
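
These category scores are consistent with unweighted means of the per-benchmark results listed at the bottom of this page. A minimal sketch (Python) checking the two categories whose benchmarks all appear on this page:

```python
from statistics import mean

# Normalized per-benchmark scores from the results table below.
math_scores = [80.6, 80.4]        # MGSM, MATH
code_scores = [82.8, 82.6, 63.0]  # HumanEval+, HumanEval, IFEval

print(f"math: {mean(math_scores):.1f}%")  # math: 80.5%
print(f"code: {mean(code_scores):.1f}%")  # code: 76.1%

# general (60.2%) and roleplay (47.6%) include benchmarks beyond the
# ten shown on this page, so they can't be reproduced from it alone.
```
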
Benchmark Performance
Top benchmark scores with normalized values (0-100%)

Ranking Across Benchmarks
Position relative to other models on each benchmark

MMLU
Rank #30 of 78
#27 o1-mini: 85.2%
#28 Llama 4 Maverick: 85.5%
#29 GPT-4o: 85.7%
#30 Phi 4: 84.8%
#31 Mistral Large 2: 84.0%
#32 Llama 3.1 70B Instruct: 83.6%
#33 Qwen2.5 32B Instruct: 83.3%

HumanEval+
Rank #5 of 8
#2 Granite 3.3 8B Base: 86.1%
#3 Granite 3.3 8B Instruct: 86.1%
#4 Phi 4 Reasoning Plus: 92.3%
#5 Phi 4: 82.8%
#6 IBM Granite 4.0 Tiny Preview: 78.3%
#7 Qwen2.5 32B Instruct: 52.4%
#8 Qwen2.5 14B Instruct: 51.2%

HumanEval
Rank #37 of 62
#34 Qwen2.5 14B Instruct: 83.5%
#35 Gemini 1.5 Pro: 84.1%
#36 Mistral Small 3 24B Instruct: 84.8%
#37 Phi 4: 82.6%
#38 IBM Granite 4.0 Tiny Preview: 82.4%
#39 Codestral-22B: 81.1%
#40 Nova Micro: 81.1%

MGSM
Rank #19 of 31
#16 Gemini 1.5 Flash: 82.6%
#17 Claude 3 Sonnet: 83.5%
#18 Qwen3 235B A22B: 83.5%
#19 Phi 4: 80.6%
#20 Claude 3 Haiku: 75.1%
#21 GPT-4: 74.5%
#22 Llama 3.2 11B Instruct: 68.9%

MATH
Rank #13 of 63
#10 Qwen2.5 VL 32B Instruct: 82.2%
#11 Qwen2.5 32B Instruct: 83.1%
#12 Qwen2.5 72B Instruct: 83.1%
#13 Phi 4: 80.4%
#14 Qwen2.5 14B Instruct: 80.0%
#15 Claude 3.5 Sonnet: 78.3%
#16 Gemini 1.5 Flash: 77.9%
All Benchmark Results for Phi 4
Complete list of benchmark scores with detailed information
Benchmark    Category  Modality  Raw Score  Normalized  Source
MMLU         general   text      0.85       84.8%       Self-reported
HumanEval+   code      text      0.83       82.8%       Self-reported
HumanEval    code      text      0.83       82.6%       Self-reported
MGSM         math      text      0.81       80.6%       Self-reported
MATH         math      text      0.80       80.4%       Self-reported
DROP         general   text      0.76       75.5%       Self-reported
Arena Hard   general   text      0.75       75.4%       Self-reported
MMLU-Pro     general   text      0.70       70.4%       Self-reported
IFEval       code      text      0.63       63.0%       Self-reported
PhiBench     general   text      0.56       56.2%       Self-reported
Showing 10 of 13 benchmarks.
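
Since the page reports both the overall average (66.0% across 13 benchmarks) and ten individual scores, the combined contribution of the three benchmarks not shown can be back-solved, assuming the overall figure is an unweighted mean. A sketch (Python):

```python
from statistics import mean

# The ten normalized scores listed above.
shown = [84.8, 82.8, 82.6, 80.6, 80.4, 75.5, 75.4, 70.4, 63.0, 56.2]

OVERALL_AVG = 66.0  # reported mean across all 13 benchmarks
hidden_sum = OVERALL_AVG * 13 - sum(shown)

print(f"shown mean:  {mean(shown):.1f}%")     # ~75.2%
print(f"hidden mean: {hidden_sum / 3:.1f}%")  # ~35.4%
```

Under that assumption, the three unlisted benchmarks average roughly 35%, which is consistent with the low roleplay category score (47.6%) reported above.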