
Qwen3-235B-A22B-Instruct-2507

Zero-eval
#1 ZebraLogic · #1 MultiPL-E · #1 Creative Writing v3 · +19 more

by Alibaba

About

Qwen3-235B-A22B-Instruct-2507 is a language model developed by Alibaba. It achieves strong performance, with an average score of 72.1% across 25 benchmarks, and it excels particularly in ZebraLogic (95.0%), MMLU-Redux (93.1%), and IFEval (88.7%). It is licensed under Apache 2.0, which permits commercial use and makes it suitable for enterprise applications. Released in 2025, it represents Alibaba's latest advancement in AI technology.

Timeline
Announced: Jul 22, 2025
Released: Jul 22, 2025
Specifications
License & Family
License: Apache 2.0
Benchmark Performance Overview
Performance metrics and category breakdown

Overall Performance (25 benchmarks)

Average Score: 72.1%
Best Score: 95.0%
High Performers (80%+): 8
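The headline statistics above can be recomputed directly from the per-benchmark normalized scores. A minimal sketch, using only the five top scores listed on this page rather than all 25 (so the averages differ from the site-wide figures):

```python
# Illustrative: recompute summary stats from per-benchmark scores.
# Only a subset of the 25 benchmarks on this page is included here.
scores = {
    "ZebraLogic": 95.0,
    "MMLU-Redux": 93.1,
    "IFEval": 88.7,
    "MultiPL-E": 87.9,
    "Creative Writing v3": 87.5,
}

average = sum(scores.values()) / len(scores)
best = max(scores.values())
high_performers = sum(1 for s in scores.values() if s >= 80.0)

print(f"Average Score: {average:.1f}%")   # average over this subset only
print(f"Best Score: {best:.1f}%")
print(f"High Performers (80%+): {high_performers}")
```

Run over the full set of 25 scores, the same computation would yield the 72.1% average, 95.0% best, and 8 high performers reported above.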

Top Categories

roleplay: 75.4%
general: 73.7%
code: 70.3%
reasoning: 68.4%
math: 50.2%
Benchmark Performance
Top benchmark scores with normalized values (0-100%)
Ranking Across Benchmarks
Position relative to other models on each benchmark

ZebraLogic

Rank #1 of 2
#1 Qwen3-235B-A22B-Instruct-2507: 95.0%
#2 Kimi K2 Instruct: 89.0%
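The per-benchmark ranks shown in these lists follow from sorting all models' scores in descending order. A minimal sketch, using the two ZebraLogic entries above:

```python
# Illustrative: derive a model's rank on one benchmark by sorting
# scores in descending order. Data: the ZebraLogic entries above.
zebralogic = {
    "Qwen3-235B-A22B-Instruct-2507": 95.0,
    "Kimi K2 Instruct": 89.0,
}

ranked = sorted(zebralogic, key=zebralogic.get, reverse=True)
rank = ranked.index("Qwen3-235B-A22B-Instruct-2507") + 1

print(f"Rank #{rank} of {len(ranked)}")  # matches "Rank #1 of 2"
```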

MMLU-Redux

Rank #2 of 13
#1 DeepSeek-R1-0528: 93.4%
#2 Qwen3-235B-A22B-Instruct-2507: 93.1%
#3 DeepSeek-R1: 92.9%
#4 Kimi K2 Instruct: 92.7%
#5 DeepSeek-V3: 89.1%

IFEval

Rank #11 of 37
#8 Nova Lite: 89.7%
#9 Llama 3.1 Nemotron Ultra 253B v1: 89.5%
#10 Gemma 3 12B: 88.9%
#11 Qwen3-235B-A22B-Instruct-2507: 88.7%
#12 Llama 3.1 405B Instruct: 88.6%
#13 GPT-4.5: 88.2%
#14 Llama 3.1 70B Instruct: 87.5%

MultiPL-E

Rank #1 of 10
#1 Qwen3-235B-A22B-Instruct-2507: 87.9%
#2 Kimi K2 Instruct: 85.7%
#3 Qwen2.5 32B Instruct: 75.4%
#4 Qwen2.5 72B Instruct: 75.1%

Creative Writing v3

Rank #1 of 1
#1 Qwen3-235B-A22B-Instruct-2507: 87.5%
All Benchmark Results for Qwen3-235B-A22B-Instruct-2507
Complete list of benchmark scores with detailed information
Benchmark | Category | Modality | Raw Score | Normalized | Source
ZebraLogic | reasoning | text | 0.95 | 95.0% | Self-reported
MMLU-Redux | general | text | 0.93 | 93.1% | Self-reported
IFEval | general | text | 0.89 | 88.7% | Self-reported
MultiPL-E | code | text | 0.88 | 87.9% | Self-reported
Creative Writing v3 | general | text | 0.88 | 87.5% | Self-reported
WritingBench | general | text | 0.85 | 85.2% | Self-reported
CSimpleQA | general | text | 0.84 | 84.3% | Self-reported
MMLU-Pro | general | text | 0.83 | 83.0% | Self-reported
Include | general | text | 0.80 | 79.5% | Self-reported
MMLU-ProX | general | text | 0.79 | 79.4% | Self-reported

Showing 1 to 10 of 25 benchmarks
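The raw scores and normalized percentages in the table are related by a simple scaling. A minimal sketch, assuming raw scores are fractions in [0, 1] (note the raw column is rounded to two decimals, so it can differ slightly from the normalized value):

```python
def normalize(raw: float) -> str:
    """Convert a raw fractional score (0-1) to a 0-100% display value."""
    return f"{raw * 100:.1f}%"

print(normalize(0.95))   # ZebraLogic raw score -> "95.0%"
print(normalize(0.879))  # MultiPL-E, using the unrounded fraction
```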