
Phi 4 Reasoning
Zero-eval
#1 HumanEval+
#2 FlenQA
#2 OmniMath
by Microsoft
About
Phi 4 Reasoning is a language model developed by Microsoft. It achieves strong overall performance, averaging 75.1% across 11 benchmarks, and scores highest on FlenQA (97.7%), HumanEval+ (92.9%), and IFEval (83.4%). Code is its strongest category, with an average of 76.7%. Released in 2025 under the MIT license, it is suitable for commercial and enterprise applications.
Timeline
Announced: Apr 30, 2025
Released: Apr 30, 2025
Knowledge Cutoff: Mar 1, 2025
Specifications
Training Tokens: 16.0B
License & Family
License: MIT
Base Model: Phi 4
Benchmark Performance Overview
Performance metrics and category breakdown
Overall Performance (11 benchmarks)
Average Score: 75.1%
Best Score: 97.7%
High Performers (80%+): 3
Top Categories
code: 76.7%
math: 76.6%
general: 74.3%
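The category averages above can be reproduced from the per-benchmark scores in the table at the bottom of this page. A minimal sanity-check sketch is shown below, with category assignments taken from that table. Note that the reported code average of 76.7% is not the mean of HumanEval+ (92.9%) and IFEval (83.4%) alone, which suggests the one benchmark missing from the paginated table belongs to the code category.

```python
# Sanity-check the category averages against the per-benchmark scores
# from the "All Benchmark Results" table below.
scores = {
    "FlenQA": (97.7, "general"),
    "HumanEval+": (92.9, "code"),
    "IFEval": (83.4, "code"),
    "OmniMath": (76.6, "math"),
    "AIME 2024": (75.3, "general"),
    "MMLU-Pro": (74.3, "general"),
    "Arena Hard": (73.3, "general"),
    "PhiBench": (70.6, "general"),
    "GPQA": (65.8, "general"),
    "AIME 2025": (62.9, "general"),
}

by_cat = {}
for name, (score, cat) in scores.items():
    by_cat.setdefault(cat, []).append(score)

for cat, vals in by_cat.items():
    print(f"{cat}: {sum(vals) / len(vals):.1f}%")
# general: 74.3% and math: 76.6% match the page exactly.
# code prints 88.2%, not the reported 76.7% -- consistent with the
# unlisted 11th benchmark being a code benchmark:
# (92.9 + 83.4 + x) / 3 = 76.7  =>  x ~= 53.8%
```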
Benchmark Performance
Top benchmark scores with normalized values (0-100%)
Ranking Across Benchmarks
Position relative to other models on each benchmark
FlenQA (Rank #2 of 2)
#1 Phi 4 Reasoning Plus: 97.9%
#2 Phi 4 Reasoning: 97.7%
HumanEval+ (Rank #1 of 8)
#1 Phi 4 Reasoning: 92.9%
#2 Phi 4 Reasoning Plus: 92.3%
#3 Granite 3.3 8B Instruct: 86.1%
#4 Granite 3.3 8B Base: 86.1%
IFEval (Rank #23 of 37)
#20 GPT-4.1 mini: 84.1%
#21 Qwen2.5 72B Instruct: 84.1%
#22 QwQ-32B: 83.9%
#23 Phi 4 Reasoning: 83.4%
#24 DeepSeek-R1: 83.3%
#25 Mistral Small 3 24B Instruct: 82.9%
#26 GPT-4o: 81.0%
OmniMath (Rank #2 of 2)
#1 Phi 4 Reasoning Plus: 81.9%
#2 Phi 4 Reasoning: 76.6%
AIME 2024 (Rank #26 of 41)
#23 DeepSeek-R1: 79.8%
#24 QwQ-32B: 79.5%
#25 Kimi-k1.5: 77.5%
#26 Phi 4 Reasoning: 75.3%
#27 o1: 74.3%
#28 Magistral Medium: 73.6%
#29 Gemini 2.0 Flash Thinking: 73.3%
All Benchmark Results for Phi 4 Reasoning
Complete list of benchmark scores with detailed information
Benchmark | Category | Modality | Raw Score | Normalized | Source
FlenQA | general | text | 0.98 | 97.7% | Self-reported
HumanEval+ | code | text | 0.93 | 92.9% | Self-reported
IFEval | code | text | 0.83 | 83.4% | Self-reported
OmniMath | math | text | 0.77 | 76.6% | Self-reported
AIME 2024 | general | text | 0.75 | 75.3% | Self-reported
MMLU-Pro | general | text | 0.74 | 74.3% | Self-reported
Arena Hard | general | text | 0.73 | 73.3% | Self-reported
PhiBench | general | text | 0.71 | 70.6% | Self-reported
GPQA | general | text | 0.66 | 65.8% | Self-reported
AIME 2025 | general | text | 0.63 | 62.9% | Self-reported
Showing 10 of 11 benchmarks.
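Because the table is paginated at 10 rows, the one unlisted score can be back-solved from the reported 75.1% overall average, assuming (an assumption, not stated on the page) that the figure is an unweighted mean of all 11 normalized scores:

```python
# Back-solve the unlisted 11th benchmark score, assuming the reported
# 75.1% overall figure is an unweighted mean of all 11 normalized scores.
listed = [97.7, 92.9, 83.4, 76.6, 75.3, 74.3, 73.3, 70.6, 65.8, 62.9]
overall = 75.1  # reported average across 11 benchmarks

missing = overall * 11 - sum(listed)
print(f"implied 11th score: {missing:.1f}%")
# ~53.3%, which agrees up to rounding with the ~53.8% implied by the
# code-category average above.
```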