
Phi 4 Reasoning Plus
Zero-eval
#1 FlenQA
#1 OmniMath
#1 PhiBench
+1 more
by Microsoft
About
Phi 4 Reasoning Plus is a language model developed by Microsoft. It achieves strong overall performance, with an average score of 78.9% across 11 benchmarks, and scores highest on FlenQA (97.9%), HumanEval+ (92.3%), and IFEval (84.9%). Its MIT license permits commercial use, making it suitable for enterprise applications. Released in April 2025, it is Microsoft's most recent reasoning-focused addition to the Phi model family.
Timeline
Announced: Apr 30, 2025
Released: Apr 30, 2025
Knowledge Cutoff: Mar 1, 2025
Specifications
Training Tokens: 16.0B
License & Family
License
MIT
Benchmark Performance Overview
Performance metrics and category breakdown
Overall Performance
11 benchmarks
Average Score: 78.9%
Best Score: 97.9%
High Performers (80%+): 5
Top Categories
math: 81.9%
general: 79.3%
code: 76.8%
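For readers who want to verify these summary figures, here is a minimal Python sketch. It uses the ten self-reported scores listed in the table at the bottom of this page; the eleventh benchmark is not shown, so the mean computed here will differ from the reported 78.9% average.

```python
# Ten of the eleven self-reported benchmark scores on this page
# (the eleventh is hidden behind the table's pagination, so the
# mean below will not exactly match the reported 78.9%).
scores = {
    "FlenQA": 97.9, "HumanEval+": 92.3, "IFEval": 84.9,
    "OmniMath": 81.9, "AIME 2024": 81.3, "Arena Hard": 79.0,
    "AIME 2025": 78.0, "MMLU-Pro": 76.0, "PhiBench": 74.2, "GPQA": 68.9,
}

best = max(scores.values())                          # 97.9 -> "Best Score"
high = sum(1 for s in scores.values() if s >= 80.0)  # 5    -> "High Performers (80%+)"
mean = sum(scores.values()) / len(scores)            # 81.4 over the 10 visible scores

print(f"best={best}, high_performers={high}, visible_mean={mean:.1f}")
```

The best score (97.9%) and the high-performer count (5) both match the figures above; only the average depends on the unlisted eleventh benchmark.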
Benchmark Performance
Top benchmark scores with normalized values (0-100%)
Ranking Across Benchmarks
Position relative to other models on each benchmark
FlenQA
Rank #1 of 2
#1 Phi 4 Reasoning Plus: 97.9%
#2 Phi 4 Reasoning: 97.7%
HumanEval+
Rank #2 of 8
#1 Phi 4 Reasoning: 92.9%
#2 Phi 4 Reasoning Plus: 92.3%
#3 Granite 3.3 8B Instruct: 86.1%
#4 Granite 3.3 8B Base: 86.1%
#5 Phi 4: 82.8%
IFEval
Rank #19 of 37
#16 Nova Micro: 87.2%
#17 Kimi-k1.5: 87.2%
#18 DeepSeek-V3: 86.1%
#19 Phi 4 Reasoning Plus: 84.9%
#20 Qwen2.5 72B Instruct: 84.1%
#21 GPT-4.1 mini: 84.1%
#22 QwQ-32B: 83.9%
OmniMath
Rank #1 of 2
#1 Phi 4 Reasoning Plus: 81.9%
#2 Phi 4 Reasoning: 76.6%
AIME 2024
Rank #16 of 41
#13 DeepSeek R1 Distill Qwen 32B: 83.3%
#14 DeepSeek R1 Distill Qwen 7B: 83.3%
#15 Qwen3 32B: 81.4%
#16 Phi 4 Reasoning Plus: 81.3%
#17 Granite 3.3 8B Instruct: 81.2%
#18 Granite 3.3 8B Base: 81.2%
#19 Qwen3 30B A3B: 80.4%
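The per-benchmark ranks above are simply scores sorted in descending order. A minimal sketch of how such a rank could be derived (the helper below is hypothetical, not this site's actual code):

```python
def rank_of(model: str, scores: dict[str, float]) -> int:
    """Return the 1-based rank of `model` when all models are
    sorted by score, highest first (ties keep insertion order)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(model) + 1

# HumanEval+ scores from the list above.
humaneval_plus = {
    "Phi 4 Reasoning": 92.9,
    "Phi 4 Reasoning Plus": 92.3,
    "Granite 3.3 8B Instruct": 86.1,
    "Granite 3.3 8B Base": 86.1,
    "Phi 4": 82.8,
}
print(rank_of("Phi 4 Reasoning Plus", humaneval_plus))  # 2
```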
All Benchmark Results for Phi 4 Reasoning Plus
Complete list of benchmark scores with detailed information
Benchmark | Category | Modality | Normalized | Score | Source
FlenQA | general | text | 0.98 | 97.9% | Self-reported
HumanEval+ | code | text | 0.92 | 92.3% | Self-reported
IFEval | code | text | 0.85 | 84.9% | Self-reported
OmniMath | math | text | 0.82 | 81.9% | Self-reported
AIME 2024 | general | text | 0.81 | 81.3% | Self-reported
Arena Hard | general | text | 0.79 | 79.0% | Self-reported
AIME 2025 | general | text | 0.78 | 78.0% | Self-reported
MMLU-Pro | general | text | 0.76 | 76.0% | Self-reported
PhiBench | general | text | 0.74 | 74.2% | Self-reported
GPQA | general | text | 0.69 | 68.9% | Self-reported
Showing 10 of 11 benchmarks.
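The Normalized column is just the percentage score rescaled to 0-1 and rounded to two decimals. A minimal sketch with a hypothetical record type, mirroring one row of the table above:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    # Hypothetical record type for one row of the table above.
    benchmark: str
    category: str
    score_pct: float  # e.g. 97.9 (percent)

    @property
    def normalized(self) -> float:
        """Score on a 0-1 scale, rounded as in the Normalized column."""
        return round(self.score_pct / 100.0, 2)

row = BenchmarkResult("FlenQA", "general", 97.9)
print(row.normalized)  # 0.98
```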