Phi 4 Reasoning Plus
by Microsoft

Ranked #1 on FlenQA, OmniMath, and PhiBench (plus one more benchmark).

About

Phi 4 Reasoning Plus is a language model developed by Microsoft. It achieves strong performance, with an average score of 78.9% across 11 benchmarks, and does particularly well on FlenQA (97.9%), HumanEval+ (92.3%), and IFEval (84.9%). Its MIT license permits commercial use, making it suitable for enterprise applications. Released in April 2025, it is one of Microsoft's most recent reasoning-focused models.

Timeline
Announced: Apr 30, 2025
Released: Apr 30, 2025
Knowledge Cutoff: Mar 1, 2025

Specifications
Training Tokens: 16.0B

License & Family
License: MIT
Benchmark Performance Overview
Performance metrics and category breakdown

Overall Performance (11 benchmarks)
Average Score: 78.9%
Best Score: 97.9%
High Performers (80%+): 5

Top Categories
math: 81.9%
general: 79.3%
code: 76.8%
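The summary figures above are simple aggregates over the per-benchmark results listed further down. A minimal sketch of how they can be recomputed, assuming unweighted means (and noting that only 10 of the 11 benchmark scores appear on this page, so the overall and code-category averages will not reproduce exactly):

```python
from statistics import mean

# (benchmark, category, score %) as listed in the results table below.
# Only 10 of the 11 benchmarks appear on this page, so the overall and
# "code" averages will not match the reported 78.9% and 76.8% exactly.
results = [
    ("FlenQA",     "general", 97.9),
    ("HumanEval+", "code",    92.3),
    ("IFEval",     "code",    84.9),
    ("OmniMath",   "math",    81.9),
    ("AIME 2024",  "general", 81.3),
    ("Arena Hard", "general", 79.0),
    ("AIME 2025",  "general", 78.0),
    ("MMLU-Pro",   "general", 76.0),
    ("PhiBench",   "general", 74.2),
    ("GPQA",       "general", 68.9),
]

scores = [s for _, _, s in results]
print(f"Average score (10 of 11 benchmarks): {mean(scores):.1f}%")
print(f"Best score: {max(scores):.1f}%")                           # 97.9%, as reported
print(f"High performers (80%+): {sum(s >= 80 for s in scores)}")   # 5, as reported

# Per-category unweighted means (math and general reproduce the page's 81.9% and 79.3%).
for cat in ("math", "general", "code"):
    cat_scores = [s for name, c, s in results if c == cat]
    print(f"{cat}: {mean(cat_scores):.1f}%")
```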
Benchmark Performance (chart): top benchmark scores with normalized values (0-100%).
Ranking Across Benchmarks
Position relative to other models on each benchmark

FlenQA

Rank #1 of 2
#1 Phi 4 Reasoning Plus: 97.9%
#2 Phi 4 Reasoning: 97.7%

HumanEval+

Rank #2 of 8
#1 Phi 4 Reasoning: 92.9%
#2 Phi 4 Reasoning Plus: 92.3%
#3 Granite 3.3 8B Instruct: 86.1%
#4 Granite 3.3 8B Base: 86.1%
#5 Phi 4: 82.8%

IFEval

Rank #19 of 37
#16 DeepSeek-V3: 86.1%
#17 Nova Micro: 87.2%
#18 Kimi-k1.5: 87.2%
#19 Phi 4 Reasoning Plus: 84.9%
#20 Qwen2.5 72B Instruct: 84.1%
#21 GPT-4.1 mini: 84.1%
#22 QwQ-32B: 83.9%

OmniMath

Rank #1 of 2
#1 Phi 4 Reasoning Plus: 81.9%
#2 Phi 4 Reasoning: 76.6%

AIME 2024

Rank #16 of 41
#13 Qwen3 32B: 81.4%
#14 DeepSeek R1 Distill Qwen 32B: 83.3%
#15 DeepSeek R1 Distill Qwen 7B: 83.3%
#16 Phi 4 Reasoning Plus: 81.3%
#17 Granite 3.3 8B Instruct: 81.2%
#18 Granite 3.3 8B Base: 81.2%
#19 Qwen3 30B A3B: 80.4%
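Each per-benchmark ranking above is just the models evaluated on that benchmark sorted by score, highest first. A minimal sketch using the HumanEval+ entries listed here (the other three of the eight ranked models are not shown on this page):

```python
# HumanEval+ scores as listed above; three of the eight ranked models are omitted on this page.
humaneval_plus = {
    "Phi 4 Reasoning":         92.9,
    "Phi 4 Reasoning Plus":    92.3,
    "Granite 3.3 8B Instruct": 86.1,
    "Granite 3.3 8B Base":     86.1,
    "Phi 4":                   82.8,
}

# Sort by score, highest first, and print the resulting ranks.
ranked = sorted(humaneval_plus.items(), key=lambda kv: kv[1], reverse=True)
for rank, (model, score) in enumerate(ranked, start=1):
    marker = " <-" if model == "Phi 4 Reasoning Plus" else ""
    print(f"#{rank} {model}: {score:.1f}%{marker}")
```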
All Benchmark Results for Phi 4 Reasoning Plus
Complete list of benchmark scores with detailed information
Benchmark     Category  Modality  Normalized  Score   Source
FlenQA        general   text      0.98        97.9%   Self-reported
HumanEval+    code      text      0.92        92.3%   Self-reported
IFEval        code      text      0.85        84.9%   Self-reported
OmniMath      math      text      0.82        81.9%   Self-reported
AIME 2024     general   text      0.81        81.3%   Self-reported
Arena Hard    general   text      0.79        79.0%   Self-reported
AIME 2025     general   text      0.78        78.0%   Self-reported
MMLU-Pro      general   text      0.76        76.0%   Self-reported
PhiBench      general   text      0.74        74.2%   Self-reported
GPQA          general   text      0.69        68.9%   Self-reported

Showing 10 of 11 benchmarks; one result is not listed on this page.
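The normalized values in the table appear to be the percentage scores divided by 100 and rounded to two decimals. A small sketch with an assumed record type (not an official schema from the source site) that encodes a few rows and checks that relationship:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    name: str
    category: str
    modality: str
    normalized: float   # 0-1 value from the table
    score_pct: float    # percentage from the table
    source: str

rows = [
    BenchmarkResult("FlenQA",     "general", "text", 0.98, 97.9, "Self-reported"),
    BenchmarkResult("HumanEval+", "code",    "text", 0.92, 92.3, "Self-reported"),
    BenchmarkResult("IFEval",     "code",    "text", 0.85, 84.9, "Self-reported"),
    BenchmarkResult("OmniMath",   "math",    "text", 0.82, 81.9, "Self-reported"),
    BenchmarkResult("GPQA",       "general", "text", 0.69, 68.9, "Self-reported"),
]

# Sanity check: each normalized value equals score_pct / 100 rounded to 2 decimals.
for r in rows:
    assert abs(r.normalized - round(r.score_pct / 100, 2)) < 1e-9, r.name
print("normalized values are consistent with the percentages")
```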