Granite 3.3 8B Instruct

Multimodal

Zero-eval

#1 AttaQ

#1 PopQA

#2 TruthfulQA

by IBM

About

Granite 3.3 8B Instruct is a language model developed by IBM. It averages 69.8% across 14 benchmarks, with its highest scores on HumanEval (89.7%), AttaQ (88.5%), and HumanEval+ (86.1%). Code is its strongest category, with an average of 78.3%. The listing tags the model as multimodal; all benchmark results shown below are for text inputs. It is licensed under Apache 2.0, which permits commercial use and makes it suitable for enterprise applications. It was announced and released on April 16, 2025.

Timeline

Announced: Apr 16, 2025

Released: Apr 16, 2025

Knowledge Cutoff: Apr 1, 2024

Specifications

Capabilities

Multimodal

License & Family

License

Apache 2.0

Benchmark Performance Overview

Performance metrics and category breakdown

Overall Performance

14 benchmarks

Average Score

69.8%

Best Score

89.7%

High Performers (80%+)

Top Categories

code

78.3%

math

75.0%

factuality

66.9%

general

63.9%

Benchmark Performance

Top benchmark scores with normalized values (0-100%)

Ranking Across Benchmarks

Position relative to other models on each benchmark

HumanEval

Rank #10 of 62

#7 Mistral Large 2: 92.0%
#8 Qwen2.5 VL 32B Instruct: 91.5%
#9 GPT-4o: 90.2%
#10 Granite 3.3 8B Instruct: 89.7%
#11 Granite 3.3 8B Base: 89.7%
#12 Gemini Diffusion: 89.6%
#13 Llama 3.1 405B Instruct: 89.0%

AttaQ

Rank #1 of 3

#1 Granite 3.3 8B Instruct: 88.5%
#2 Granite 3.3 8B Base: 88.5%
#3 IBM Granite 4.0 Tiny Preview: 86.1%

HumanEval+

Rank #3 of 8

#1 Phi 4 Reasoning: 92.9%
#2 Phi 4 Reasoning Plus: 92.3%
#3 Granite 3.3 8B Instruct: 86.1%
#4 Granite 3.3 8B Base: 86.1%
#5 Phi 4: 82.8%
#6 IBM Granite 4.0 Tiny Preview: 78.3%

AIME 2024

Rank #17 of 41

#14 DeepSeek R1 Distill Qwen 32B: 83.3%
#15 Qwen3 32B: 81.4%
#16 Phi 4 Reasoning Plus: 81.3%
#17 Granite 3.3 8B Instruct: 81.2%
#18 Granite 3.3 8B Base: 81.2%
#19 Qwen3 30B A3B: 80.4%
#20 DeepSeek R1 Distill Qwen 14B: 80.0%

GSM8k

Rank #37 of 46

#34 Gemini 1.5 Flash: 86.2%
#35 Qwen2.5-Coder 7B Instruct: 83.9%
#36 Qwen2 7B Instruct: 82.3%
#37 Granite 3.3 8B Instruct: 80.9%
#38 Mistral Small 3 24B Base: 80.7%
#39 Llama 3.2 3B Instruct: 77.7%
#40 Jamba 1.5 Mini: 75.8%
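
A rank window like the ones above can be derived by sorting scores in descending order and numbering the result. The sketch below is illustrative only: `Entry` and `rank_entries` are hypothetical names, the data is the HumanEval window from this page, and the tie rule (equal scores keep their input order) is an assumption, since the page does not say how ties are broken.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    score: float  # normalized score in percent


def rank_entries(entries: list[Entry], first_rank: int = 1) -> list[tuple[int, Entry]]:
    """Number entries from first_rank after sorting by score, highest first.

    sorted() is stable even with reverse=True, so entries with equal scores
    keep their input order (an assumed tie rule, not one stated by the page).
    """
    ordered = sorted(entries, key=lambda e: e.score, reverse=True)
    return [(first_rank + i, entry) for i, entry in enumerate(ordered)]


# HumanEval scores for the seven models listed around Granite 3.3 8B Instruct.
humaneval_window = [
    Entry("GPT-4o", 90.2),
    Entry("Qwen2.5 VL 32B Instruct", 91.5),
    Entry("Mistral Large 2", 92.0),
    Entry("Granite 3.3 8B Instruct", 89.7),
    Entry("Granite 3.3 8B Base", 89.7),
    Entry("Gemini Diffusion", 89.6),
    Entry("Llama 3.1 405B Instruct", 89.0),
]

# The window starts at rank 7 because six higher-scoring models are not listed.
ranked = rank_entries(humaneval_window, first_rank=7)
for rank, entry in ranked:
    print(f"#{rank} {entry.model}: {entry.score:.1f}%")
```

With this ordering, a higher score always receives a smaller rank number, and Granite 3.3 8B Instruct lands at #10, as reported for HumanEval.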

All Benchmark Results for Granite 3.3 8B Instruct

Complete list of benchmark scores with detailed information


Benchmark	Category	Modality	Score (0-1)	Normalized	Source
HumanEval	code	text	0.90	89.7%	Self-reported
AttaQ	general	text	0.89	88.5%	Self-reported
HumanEval+	code	text	0.86	86.1%	Self-reported
AIME 2024	general	text	0.81	81.2%	Self-reported
GSM8k	math	text	0.81	80.9%	Self-reported
IFEval	code	text	0.75	74.8%	Self-reported
BIG-Bench Hard	general	text	0.69	69.1%	Self-reported
MATH-500	math	text	0.69	69.0%	Self-reported
TruthfulQA	factuality	text	0.67	66.9%	Self-reported
MMLU	general	text	0.66	65.5%	Self-reported

Showing 1 to 10 of 14 benchmarks
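
The page does not say how the category averages are computed. As a minimal sketch, assuming each one is a plain arithmetic mean of normalized scores, the ten visible rows can be grouped by category as follows. `VISIBLE_ROWS` and `category_means` are hypothetical names introduced for illustration. Because four of the 14 benchmarks are not shown, the results need not match the category figures reported earlier on the page.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP

# (benchmark, category, normalized score in percent) for the ten visible rows.
# Scores are kept as strings so Decimal arithmetic stays exact.
VISIBLE_ROWS = [
    ("HumanEval", "code", "89.7"),
    ("AttaQ", "general", "88.5"),
    ("HumanEval+", "code", "86.1"),
    ("AIME 2024", "general", "81.2"),
    ("GSM8k", "math", "80.9"),
    ("IFEval", "code", "74.8"),
    ("BIG-Bench Hard", "general", "69.1"),
    ("MATH-500", "math", "69.0"),
    ("TruthfulQA", "factuality", "66.9"),
    ("MMLU", "general", "65.5"),
]


def category_means(rows):
    """Return the mean score per category, rounded half-up to one decimal."""
    buckets = defaultdict(list)
    for _benchmark, category, score in rows:
        buckets[category].append(Decimal(score))
    return {
        category: (sum(scores) / len(scores)).quantize(
            Decimal("0.1"), rounding=ROUND_HALF_UP
        )
        for category, scores in buckets.items()
    }


means = category_means(VISIBLE_ROWS)
for category, mean in sorted(means.items()):
    print(f"{category}: {mean}%")
```

Under these assumptions the two visible math rows average to 75.0%, which agrees with the math figure in the category breakdown. The code and general means come out higher than the reported figures, presumably because the four rows not shown score lower.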
