
Phi 4 Reasoning

Top rankings: #1 HumanEval+, #2 FlenQA, #2 OmniMath (+1 more)

by Microsoft

About

Phi 4 Reasoning is a language model developed by Microsoft. It achieves strong performance, with an average score of 75.1% across 11 benchmarks, and it scores highest on FlenQA (97.7%), HumanEval+ (92.9%), and IFEval (83.4%). Its strongest category is code, with an average of 76.7%. The model is released under the MIT license, which permits commercial use and makes it suitable for enterprise applications. Released in 2025, it represents Microsoft's latest advancement in AI technology.
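
Because the model is MIT-licensed, it can be dropped into a standard inference stack. Below is a minimal sketch using the Hugging Face transformers library; the repository ID "microsoft/Phi-4-reasoning" and the example prompt are assumptions for illustration, not details taken from this page.

```python
# Minimal sketch: loading and prompting the model with Hugging Face transformers.
# The repo ID "microsoft/Phi-4-reasoning" is an assumed identifier, not confirmed here.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-reasoning"  # assumption
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "Prove that the sum of two even integers is even."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```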

Timeline
Announced: Apr 30, 2025
Released: Apr 30, 2025
Knowledge Cutoff: Mar 1, 2025
Specifications
Training Tokens: 16.0B
License & Family
License: MIT
Base Model: Phi 4
Benchmark Performance Overview
Performance metrics and category breakdown

Overall Performance (11 benchmarks)

Average Score: 75.1%
Best Score: 97.7%
High Performers (80%+): 3

Top Categories

code: 76.7%
math: 76.6%
general: 74.3%
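
As a sanity check on these summary figures, the short sketch below recomputes the average, the per-category means, and the count of 80%+ scores from the ten benchmark results listed in the table at the end of this section. One of the eleven benchmarks is not shown on this page, so the recomputed overall and code averages will not match 75.1% and 76.7% exactly; the category labels are taken from that table.

```python
# Recompute the summary statistics from the ten benchmark scores listed on this page.
# One of the eleven benchmarks is not shown, so the overall and code means will differ
# from the figures quoted above.
from collections import defaultdict

scores = {  # benchmark: (category from the results table, score in %)
    "FlenQA":     ("general", 97.7),
    "HumanEval+": ("code",    92.9),
    "IFEval":     ("code",    83.4),
    "OmniMath":   ("math",    76.6),
    "AIME 2024":  ("general", 75.3),
    "MMLU-Pro":   ("general", 74.3),
    "Arena Hard": ("general", 73.3),
    "PhiBench":   ("general", 70.6),
    "GPQA":       ("general", 65.8),
    "AIME 2025":  ("general", 62.9),
}

overall = sum(score for _, score in scores.values()) / len(scores)
print(f"Average over the {len(scores)} listed benchmarks: {overall:.1f}%")

by_category = defaultdict(list)
for category, score in scores.values():
    by_category[category].append(score)
for category, values in sorted(by_category.items()):
    print(f"{category}: {sum(values) / len(values):.1f}%")

high_performers = sum(score >= 80 for _, score in scores.values())
print(f"High performers (80%+): {high_performers}")  # -> 3, matching the overview
```

Running this reproduces the math (76.6%) and general (74.3%) category means and the count of three 80%+ scores; the overall and code figures come out higher here simply because the unlisted eleventh benchmark is excluded.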
Benchmark Performance
Top benchmark scores with normalized values (0-100%) are listed in the full results table at the end of this section.
Ranking Across Benchmarks
Position relative to other models on each benchmark

FlenQA

Rank #2 of 2
#1 Phi 4 Reasoning Plus: 97.9%
#2 Phi 4 Reasoning: 97.7%

HumanEval+

Rank #1 of 8
#1 Phi 4 Reasoning: 92.9%
#2 Phi 4 Reasoning Plus: 92.3%
#3 Granite 3.3 8B Instruct: 86.1%
#4 Granite 3.3 8B Base: 86.1%

IFEval

Rank #23 of 37
#20 QwQ-32B: 83.9%
#21 GPT-4.1 mini: 84.1%
#22 Qwen2.5 72B Instruct: 84.1%
#23 Phi 4 Reasoning: 83.4%
#24 DeepSeek-R1: 83.3%
#25 Mistral Small 3 24B Instruct: 82.9%
#26 GPT-4o: 81.0%

OmniMath

Rank #2 of 2
#1 Phi 4 Reasoning Plus: 81.9%
#2 Phi 4 Reasoning: 76.6%

AIME 2024

Rank #26 of 41
#23 Kimi-k1.5: 77.5%
#24 QwQ-32B: 79.5%
#25 DeepSeek-R1: 79.8%
#26 Phi 4 Reasoning: 75.3%
#27 o1: 74.3%
#28 Magistral Medium: 73.6%
#29 Gemini 2.0 Flash Thinking: 73.3%
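
Each "Rank #N of M" line above is the model's position on that benchmark's leaderboard. As an illustration, here is a minimal sketch, assuming rank is simply position in descending score order, applied to the four HumanEval+ entries shown above (the page lists only the top 4 of 8 models).

```python
# Minimal sketch: derive leaderboard positions by sorting scores in descending order.
# Uses only the four HumanEval+ entries shown above (the full leaderboard has 8 models).
humaneval_plus = {
    "Phi 4 Reasoning": 92.9,
    "Phi 4 Reasoning Plus": 92.3,
    "Granite 3.3 8B Instruct": 86.1,
    "Granite 3.3 8B Base": 86.1,
}

leaderboard = sorted(humaneval_plus.items(), key=lambda entry: entry[1], reverse=True)
for position, (model, score) in enumerate(leaderboard, start=1):
    print(f"#{position} {model}: {score:.1f}%")

# Position of a specific model on this benchmark.
rank = [model for model, _ in leaderboard].index("Phi 4 Reasoning") + 1
print(f"Phi 4 Reasoning rank: #{rank}")  # -> #1 (of 8 on the full leaderboard)
```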
All Benchmark Results for Phi 4 Reasoning
Complete list of benchmark scores with detailed information
Benchmark     Category   Modality   Normalized   Score    Source
FlenQA        general    text       0.98         97.7%    Self-reported
HumanEval+    code       text       0.93         92.9%    Self-reported
IFEval        code       text       0.83         83.4%    Self-reported
OmniMath      math       text       0.77         76.6%    Self-reported
AIME 2024     general    text       0.75         75.3%    Self-reported
MMLU-Pro      general    text       0.74         74.3%    Self-reported
Arena Hard    general    text       0.73         73.3%    Self-reported
PhiBench      general    text       0.71         70.6%    Self-reported
GPQA          general    text       0.66         65.8%    Self-reported
AIME 2025     general    text       0.63         62.9%    Self-reported
10 of the 11 benchmarks are shown above; one result is not listed on this page.