
Phi 4 Reasoning
Zero-eval
#1 HumanEval+
#2 FlenQA
#2 OmniMath
by Microsoft
About
Phi 4 Reasoning is a language model developed by Microsoft. It achieves strong overall performance, averaging 75.1% across 11 benchmarks, and scores highest on FlenQA (97.7%), HumanEval+ (92.9%), and IFEval (83.4%). Code is its strongest category, with an average of 76.7%. Released in 2025 under the MIT license, it is suitable for commercial and enterprise applications.
Timeline
Announced: Apr 30, 2025
Released: Apr 30, 2025
Knowledge Cutoff: Mar 1, 2025
Specifications
Training Tokens: 16.0B
License & Family
License: MIT
Base Model: Phi 4
Benchmark Performance Overview
Performance metrics and category breakdown
Overall Performance (11 benchmarks)
Average Score: 75.1%
Best Score: 97.7%
High Performers (80%+): 3
Top Categories
code: 76.7%
math: 76.6%
general: 74.3%
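The category averages above can be reproduced from the per-benchmark scores in the table at the bottom of this page. A minimal sanity-check sketch is shown below, with category assignments taken from that table. Note that the reported code average of 76.7% is not the mean of HumanEval+ (92.9%) and IFEval (83.4%) alone, which suggests the one benchmark missing from the paginated table belongs to the code category.

```python
# Sanity-check the category averages against the per-benchmark scores
# from the "All Benchmark Results" table below.
scores = {
    "FlenQA": (97.7, "general"),
    "HumanEval+": (92.9, "code"),
    "IFEval": (83.4, "code"),
    "OmniMath": (76.6, "math"),
    "AIME 2024": (75.3, "general"),
    "MMLU-Pro": (74.3, "general"),
    "Arena Hard": (73.3, "general"),
    "PhiBench": (70.6, "general"),
    "GPQA": (65.8, "general"),
    "AIME 2025": (62.9, "general"),
}

by_cat = {}
for name, (score, cat) in scores.items():
    by_cat.setdefault(cat, []).append(score)

for cat, vals in by_cat.items():
    print(f"{cat}: {sum(vals) / len(vals):.1f}%")
# general: 74.3% and math: 76.6% match the page exactly.
# code prints 88.2%, not the reported 76.7% -- consistent with the
# unlisted 11th benchmark being a code benchmark:
# (92.9 + 83.4 + x) / 3 = 76.7  =>  x ~= 53.8%
```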
Benchmark Performance
Top benchmark scores with normalized values (0-100%)
Ranking Across Benchmarks
Position relative to other models on each benchmark
FlenQA (Rank #2 of 2)
#1 Phi 4 Reasoning Plus: 97.9%
#2 Phi 4 Reasoning: 97.7%
HumanEval+ (Rank #1 of 8)
#1 Phi 4 Reasoning: 92.9%
#2 Phi 4 Reasoning Plus: 92.3%
#3 Granite 3.3 8B Instruct: 86.1%
#4 Granite 3.3 8B Base: 86.1%
IFEval (Rank #23 of 37)
#20 GPT-4.1 mini: 84.1%
#21 Qwen2.5 72B Instruct: 84.1%
#22 QwQ-32B: 83.9%
#23 Phi 4 Reasoning: 83.4%
#24 DeepSeek-R1: 83.3%
#25 Mistral Small 3 24B Instruct: 82.9%
#26 GPT-4o: 81.0%
OmniMath (Rank #2 of 2)
#1 Phi 4 Reasoning Plus: 81.9%
#2 Phi 4 Reasoning: 76.6%
AIME 2024 (Rank #26 of 41)
#23 DeepSeek-R1: 79.8%
#24 QwQ-32B: 79.5%
#25 Kimi-k1.5: 77.5%
#26 Phi 4 Reasoning: 75.3%
#27 o1: 74.3%
#28 Magistral Medium: 73.6%
#29 Gemini 2.0 Flash Thinking: 73.3%
All Benchmark Results for Phi 4 Reasoning
Complete list of benchmark scores with detailed information
Benchmark | Category | Modality | Raw Score | Normalized | Source
FlenQA | general | text | 0.98 | 97.7% | Self-reported
HumanEval+ | code | text | 0.93 | 92.9% | Self-reported
IFEval | code | text | 0.83 | 83.4% | Self-reported
OmniMath | math | text | 0.77 | 76.6% | Self-reported
AIME 2024 | general | text | 0.75 | 75.3% | Self-reported
MMLU-Pro | general | text | 0.74 | 74.3% | Self-reported
Arena Hard | general | text | 0.73 | 73.3% | Self-reported
PhiBench | general | text | 0.71 | 70.6% | Self-reported
GPQA | general | text | 0.66 | 65.8% | Self-reported
AIME 2025 | general | text | 0.63 | 62.9% | Self-reported
Showing 10 of 11 benchmarks.
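Because the table is paginated at 10 rows, the one unlisted score can be back-solved from the reported 75.1% overall average, assuming (an assumption, not stated on the page) that the figure is an unweighted mean of all 11 normalized scores:

```python
# Back-solve the unlisted 11th benchmark score, assuming the reported
# 75.1% overall figure is an unweighted mean of all 11 normalized scores.
listed = [97.7, 92.9, 83.4, 76.6, 75.3, 74.3, 73.3, 70.6, 65.8, 62.9]
overall = 75.1  # reported average across 11 benchmarks

missing = overall * 11 - sum(listed)
print(f"implied 11th score: {missing:.1f}%")
# ~53.3%, which agrees up to rounding with the ~53.8% implied by the
# code-category average above.
```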