Phi 4 Reasoning Plus
by Microsoft

Ranked #1 on FlenQA, OmniMath, and PhiBench (plus one more benchmark).

About

Phi 4 Reasoning Plus is a language model developed by Microsoft. It achieves strong performance, with an average score of 78.9% across 11 benchmarks, and does particularly well on FlenQA (97.9%), HumanEval+ (92.3%), and IFEval (84.9%). Its MIT license permits commercial use, making it suitable for enterprise applications. Released in April 2025, it is one of Microsoft's most recent reasoning-focused models.

Timeline
Announced: Apr 30, 2025
Released: Apr 30, 2025
Knowledge Cutoff: Mar 1, 2025

Specifications
Training Tokens: 16.0B

License & Family
License: MIT
Benchmark Performance Overview
Performance metrics and category breakdown

Overall Performance (11 benchmarks)
Average Score: 78.9%
Best Score: 97.9%
High Performers (80%+): 5

Top Categories
math: 81.9%
general: 79.3%
code: 76.8%
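The summary figures above are simple aggregates over the per-benchmark results listed further down. A minimal sketch of how they can be recomputed, assuming unweighted means (and noting that only 10 of the 11 benchmark scores appear on this page, so the overall and code-category averages will not reproduce exactly):

```python
from statistics import mean

# (benchmark, category, score %) as listed in the results table below.
# Only 10 of the 11 benchmarks appear on this page, so the overall and
# "code" averages will not match the reported 78.9% and 76.8% exactly.
results = [
    ("FlenQA",     "general", 97.9),
    ("HumanEval+", "code",    92.3),
    ("IFEval",     "code",    84.9),
    ("OmniMath",   "math",    81.9),
    ("AIME 2024",  "general", 81.3),
    ("Arena Hard", "general", 79.0),
    ("AIME 2025",  "general", 78.0),
    ("MMLU-Pro",   "general", 76.0),
    ("PhiBench",   "general", 74.2),
    ("GPQA",       "general", 68.9),
]

scores = [s for _, _, s in results]
print(f"Average score (10 of 11 benchmarks): {mean(scores):.1f}%")
print(f"Best score: {max(scores):.1f}%")                           # 97.9%, as reported
print(f"High performers (80%+): {sum(s >= 80 for s in scores)}")   # 5, as reported

# Per-category unweighted means (math and general reproduce the page's 81.9% and 79.3%).
for cat in ("math", "general", "code"):
    cat_scores = [s for name, c, s in results if c == cat]
    print(f"{cat}: {mean(cat_scores):.1f}%")
```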
Benchmark Performance (chart): top benchmark scores with normalized values (0-100%).
Ranking Across Benchmarks
Position relative to other models on each benchmark

FlenQA

Rank #1 of 2
#1 Phi 4 Reasoning Plus: 97.9%
#2 Phi 4 Reasoning: 97.7%

HumanEval+

Rank #2 of 8
#1 Phi 4 Reasoning: 92.9%
#2 Phi 4 Reasoning Plus: 92.3%
#3 Granite 3.3 8B Instruct: 86.1%
#4 Granite 3.3 8B Base: 86.1%
#5 Phi 4: 82.8%

IFEval

Rank #19 of 37
#16 DeepSeek-V3: 86.1%
#17 Nova Micro: 87.2%
#18 Kimi-k1.5: 87.2%
#19 Phi 4 Reasoning Plus: 84.9%
#20 Qwen2.5 72B Instruct: 84.1%
#21 GPT-4.1 mini: 84.1%
#22 QwQ-32B: 83.9%

OmniMath

Rank #1 of 2
#1 Phi 4 Reasoning Plus: 81.9%
#2 Phi 4 Reasoning: 76.6%

AIME 2024

Rank #16 of 41
#13 Qwen3 32B: 81.4%
#14 DeepSeek R1 Distill Qwen 32B: 83.3%
#15 DeepSeek R1 Distill Qwen 7B: 83.3%
#16 Phi 4 Reasoning Plus: 81.3%
#17 Granite 3.3 8B Instruct: 81.2%
#18 Granite 3.3 8B Base: 81.2%
#19 Qwen3 30B A3B: 80.4%
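Each per-benchmark ranking above is just the models evaluated on that benchmark sorted by score, highest first. A minimal sketch using the HumanEval+ entries listed here (the other three of the eight ranked models are not shown on this page):

```python
# HumanEval+ scores as listed above; three of the eight ranked models are omitted on this page.
humaneval_plus = {
    "Phi 4 Reasoning":         92.9,
    "Phi 4 Reasoning Plus":    92.3,
    "Granite 3.3 8B Instruct": 86.1,
    "Granite 3.3 8B Base":     86.1,
    "Phi 4":                   82.8,
}

# Sort by score, highest first, and print the resulting ranks.
ranked = sorted(humaneval_plus.items(), key=lambda kv: kv[1], reverse=True)
for rank, (model, score) in enumerate(ranked, start=1):
    marker = " <-" if model == "Phi 4 Reasoning Plus" else ""
    print(f"#{rank} {model}: {score:.1f}%{marker}")
```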
All Benchmark Results for Phi 4 Reasoning Plus
Complete list of benchmark scores with detailed information
Benchmark     Category  Modality  Normalized  Score   Source
FlenQA        general   text      0.98        97.9%   Self-reported
HumanEval+    code      text      0.92        92.3%   Self-reported
IFEval        code      text      0.85        84.9%   Self-reported
OmniMath      math      text      0.82        81.9%   Self-reported
AIME 2024     general   text      0.81        81.3%   Self-reported
Arena Hard    general   text      0.79        79.0%   Self-reported
AIME 2025     general   text      0.78        78.0%   Self-reported
MMLU-Pro      general   text      0.76        76.0%   Self-reported
PhiBench      general   text      0.74        74.2%   Self-reported
GPQA          general   text      0.69        68.9%   Self-reported

Showing 10 of 11 benchmarks; one result is not listed on this page.
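The normalized values in the table appear to be the percentage scores divided by 100 and rounded to two decimals. A small sketch with an assumed record type (not an official schema from the source site) that encodes a few rows and checks that relationship:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    name: str
    category: str
    modality: str
    normalized: float   # 0-1 value from the table
    score_pct: float    # percentage from the table
    source: str

rows = [
    BenchmarkResult("FlenQA",     "general", "text", 0.98, 97.9, "Self-reported"),
    BenchmarkResult("HumanEval+", "code",    "text", 0.92, 92.3, "Self-reported"),
    BenchmarkResult("IFEval",     "code",    "text", 0.85, 84.9, "Self-reported"),
    BenchmarkResult("OmniMath",   "math",    "text", 0.82, 81.9, "Self-reported"),
    BenchmarkResult("GPQA",       "general", "text", 0.69, 68.9, "Self-reported"),
]

# Sanity check: each normalized value equals score_pct / 100 rounded to 2 decimals.
for r in rows:
    assert abs(r.normalized - round(r.score_pct / 100, 2)) < 1e-9, r.name
print("normalized values are consistent with the percentages")
```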