Granite 3.3 8B Instruct

Tags: Multimodal, Zero-eval
Rankings: #1 AttaQ, #1 PopQA, #2 TruthfulQA (+2 more)

by IBM

About

Granite 3.3 8B Instruct is a language model developed by IBM. It averages 69.8% across 14 benchmarks, with its highest scores on HumanEval (89.7%), AttaQ (88.5%), and HumanEval+ (86.1%). Code is its strongest category, with an average of 78.3%. The listing describes it as multimodal, able to process text, images, and other input formats. It is released under the Apache 2.0 license, which permits commercial use and makes it suitable for enterprise applications. It was announced and released on April 16, 2025.

Timeline
Announced: Apr 16, 2025
Released: Apr 16, 2025
Knowledge Cutoff: Apr 1, 2024

Specifications
Capabilities: Multimodal

License & Family
License: Apache 2.0

Benchmark Performance Overview
Performance metrics and category breakdown

Overall Performance (14 benchmarks)

Average Score: 69.8%
Best Score: 89.7%
High Performers (80%+): 5

Top Categories

code: 78.3%
math: 75.0%
factuality: 66.9%
general: 63.9%
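
The summary figures above can be partly cross-checked against the per-benchmark scores listed further down this page. The sketch below is a minimal illustration, not the site's actual method: the names `VISIBLE_ROWS` and `summarize` are hypothetical, and it assumes each category figure is a plain arithmetic mean. Only 10 of the 14 benchmarks are shown, so only the math and factuality means, the best score, and the 80%+ count can be expected to match (the last one assumes the four unlisted benchmarks score below 80%). The code and general means computed here will differ from the page values, presumably because some of their benchmarks are not listed.

```python
from collections import defaultdict
from statistics import mean

# (benchmark, category, score in percent) for the 10 rows shown on this page.
# The other 4 of the 14 benchmarks are not listed, so they are absent here.
VISIBLE_ROWS = [
    ("HumanEval", "code", 89.7),
    ("AttaQ", "general", 88.5),
    ("HumanEval+", "code", 86.1),
    ("AIME 2024", "general", 81.2),
    ("GSM8k", "math", 80.9),
    ("IFEval", "code", 74.8),
    ("BIG-Bench Hard", "general", 69.1),
    ("MATH-500", "math", 69.0),
    ("TruthfulQA", "factuality", 66.9),
    ("MMLU", "general", 65.5),
]


def summarize(rows, threshold=80.0):
    """Return best score, count of scores >= threshold, and per-category means."""
    by_category = defaultdict(list)
    for _, category, score in rows:
        by_category[category].append(score)
    return {
        "best": max(score for _, _, score in rows),
        "high_performers": sum(1 for _, _, score in rows if score >= threshold),
        "category_means": {cat: mean(vals) for cat, vals in by_category.items()},
    }


summary = summarize(VISIBLE_ROWS)
print("best:", summary["best"])
print("high performers:", summary["high_performers"])
for category, value in sorted(summary["category_means"].items()):
    print(f"{category}: {value:.2f}%")
```

Under these assumptions, the math mean comes from GSM8k and MATH-500 alone, and the factuality mean from TruthfulQA alone.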
Benchmark Performance
[Chart: top benchmark scores with normalized values (0-100%)]

Ranking Across Benchmarks
Position relative to other models on each benchmark

HumanEval

Rank #10 of 62
#7 Mistral Large 2: 92.0%
#8 Qwen2.5 VL 32B Instruct: 91.5%
#9 GPT-4o: 90.2%
#10 Granite 3.3 8B Instruct: 89.7%
#11 Granite 3.3 8B Base: 89.7%
#12 Gemini Diffusion: 89.6%
#13 Llama 3.1 405B Instruct: 89.0%

AttaQ

Rank #1 of 3
#1 Granite 3.3 8B Instruct: 88.5%
#2 Granite 3.3 8B Base: 88.5%
#3 IBM Granite 4.0 Tiny Preview: 86.1%

HumanEval+

Rank #3 of 8
#1 Phi 4 Reasoning: 92.9%
#2 Phi 4 Reasoning Plus: 92.3%
#3 Granite 3.3 8B Instruct: 86.1%
#4 Granite 3.3 8B Base: 86.1%
#5 Phi 4: 82.8%
#6 IBM Granite 4.0 Tiny Preview: 78.3%

AIME 2024

Rank #17 of 41
#14 DeepSeek R1 Distill Qwen 32B: 83.3%
#15 Qwen3 32B: 81.4%
#16 Phi 4 Reasoning Plus: 81.3%
#17 Granite 3.3 8B Instruct: 81.2%
#18 Granite 3.3 8B Base: 81.2%
#19 Qwen3 30B A3B: 80.4%
#20 DeepSeek R1 Distill Qwen 14B: 80.0%

GSM8k

Rank #37 of 46
#34 Gemini 1.5 Flash: 86.2%
#35 Qwen2.5-Coder 7B Instruct: 83.9%
#36 Qwen2 7B Instruct: 82.3%
#37 Granite 3.3 8B Instruct: 80.9%
#38 Mistral Small 3 24B Base: 80.7%
#39 Llama 3.2 3B Instruct: 77.7%
#40 Jamba 1.5 Mini: 75.8%

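
The rank positions in the lists above are consistent with sorting models by score, highest first, and numbering them in order. The following sketch illustrates that idea on the HumanEval window. The function name `ordinal_ranks` and its `start` parameter are hypothetical, and the tie-break (tied models keep their input order) is an assumption, since the page does not say how ties such as the two Granite 3.3 8B models are resolved.

```python
def ordinal_ranks(scores, start=1):
    """Number (model, score) pairs by descending score, starting at `start`.

    Ties keep their input order because Python's sort is stable; this is
    an assumed tie-break, not one documented by the page.
    """
    ordered = sorted(scores, key=lambda pair: pair[1], reverse=True)
    return [(start + i, model, score) for i, (model, score) in enumerate(ordered)]


# The seven HumanEval entries shown on this page. The six models ranked
# above this window are not listed, so numbering starts at 7.
humaneval_window = [
    ("GPT-4o", 90.2),
    ("Qwen2.5 VL 32B Instruct", 91.5),
    ("Mistral Large 2", 92.0),
    ("Granite 3.3 8B Instruct", 89.7),
    ("Granite 3.3 8B Base", 89.7),
    ("Gemini Diffusion", 89.6),
    ("Llama 3.1 405B Instruct", 89.0),
]

ranks = ordinal_ranks(humaneval_window, start=7)
for rank, model, score in ranks:
    print(f"#{rank} {model}: {score}%")
```

With this numbering, Granite 3.3 8B Instruct lands at #10, matching the rank the page reports.
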
All Benchmark Results for Granite 3.3 8B Instruct
Complete list of benchmark scores with detailed information

Benchmark        Category     Modality   Score (0-1)   Score (%)   Source
HumanEval        code         text       0.90          89.7%       Self-reported
AttaQ            general      text       0.89          88.5%       Self-reported
HumanEval+       code         text       0.86          86.1%       Self-reported
AIME 2024        general      text       0.81          81.2%       Self-reported
GSM8k            math         text       0.81          80.9%       Self-reported
IFEval           code         text       0.75          74.8%       Self-reported
BIG-Bench Hard   general      text       0.69          69.1%       Self-reported
MATH-500         math         text       0.69          69.0%       Self-reported
TruthfulQA       factuality   text       0.67          66.9%       Self-reported
MMLU             general      text       0.66          65.5%       Self-reported

Showing 1 to 10 of 14 benchmarks