Llama 3.1 Nemotron 70B Instruct
by NVIDIA

Zero-eval
#1 GSM8K Chat
#1 MMLU Chat
#1 Instruct HumanEval
+1 more

About

Llama 3.1 Nemotron 70B Instruct is a language model developed by NVIDIA on top of Llama 3.1 70B Instruct. It achieves an average score of 67.9% across 11 benchmarks, with its strongest results on GSM8k (91.4%), HellaSwag (85.6%), and Winogrande (84.5%). Math is its strongest category, averaging 86.7% across math benchmarks. The model was released in October 2024.

Timeline
Announced: Oct 1, 2024
Released: Oct 1, 2024
Knowledge Cutoff: Dec 1, 2023

Specifications
License & Family
License: Llama 3.1 Community License
Base Model: Llama 3.1 70B Instruct
Benchmark Performance Overview
Performance metrics and category breakdown

Overall Performance (11 benchmarks)
Average Score: 67.9%
Best Score: 91.4%
High Performers (80%+): 6

Top Categories

math: 86.7%
reasoning: 79.8%
code: 73.8%
general: 64.1%
factuality: 58.6%
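
These aggregates can be recomputed from the per-benchmark scores in the results table at the bottom of this page. Below is a minimal Python sketch; the score list is transcribed from that table. Note that only 10 of the 11 benchmarks are listed on this page, so the published 67.9% overall average cannot be reproduced from this list, but the best score, the 80%+ count, and the per-category means can.

```python
from collections import defaultdict

# (category, percent) pairs transcribed from the "All Benchmark Results"
# table below. 10 of the 11 benchmarks appear on this page, so the
# published 11-benchmark average (67.9%) is not recomputable from this list.
scores = {
    "GSM8k": ("math", 91.4),
    "HellaSwag": ("reasoning", 85.6),
    "Winogrande": ("reasoning", 84.5),
    "GSM8K Chat": ("math", 81.9),
    "MMLU Chat": ("general", 80.6),
    "MMLU": ("general", 80.2),
    "Instruct HumanEval": ("code", 73.8),
    "ARC-C": ("reasoning", 69.2),
    "TruthfulQA": ("factuality", 58.6),
    "XLSum English": ("general", 31.6),
}

values = [pct for _, pct in scores.values()]
print(f"Best score: {max(values):.1f}%")                          # 91.4%
print(f"High performers (80%+): {sum(p >= 80 for p in values)}")  # 6

# Per-category means match the Top Categories block,
# e.g. math = (91.4 + 81.9) / 2 ≈ 86.7.
by_category = defaultdict(list)
for category, pct in scores.values():
    by_category[category].append(pct)
for category, pcts in sorted(by_category.items(),
                             key=lambda kv: -sum(kv[1]) / len(kv[1])):
    print(f"{category}: {sum(pcts) / len(pcts):.1f}%")
```
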
Ranking Across Benchmarks
Position relative to other models on each benchmark

GSM8k
Rank #22 of 46
#19 Qwen2.5 7B Instruct: 91.6%
#20 Kimi K2 Base: 92.1%
#21 Nova Micro: 92.3%
#22 Llama 3.1 Nemotron 70B Instruct: 91.4%
#23 Qwen2 72B Instruct: 91.1%
#24 Qwen2.5-Coder 32B Instruct: 91.1%
#25 Gemini 1.5 Pro: 90.8%

HellaSwag
Rank #10 of 24
#7 Claude 3 Haiku: 85.9%
#8 Gemma 2 27B: 86.4%
#9 Gemini 1.5 Flash: 86.5%
#10 Llama 3.1 Nemotron 70B Instruct: 85.6%
#11 Qwen2.5 32B Instruct: 85.2%
#12 Phi-3.5-MoE-instruct: 83.8%
#13 Mistral NeMo Instruct: 83.5%

Winogrande
Rank #4 of 19
#1 Qwen2 72B Instruct: 85.1%
#2 Command R+: 85.4%
#3 GPT-4: 87.5%
#4 Llama 3.1 Nemotron 70B Instruct: 84.5%
#5 Gemma 2 27B: 83.7%
#6 Qwen2.5 32B Instruct: 82.0%
#7 Phi-3.5-MoE-instruct: 81.3%

GSM8K Chat
Rank #1 of 1 (only model evaluated on this benchmark)
#1 Llama 3.1 Nemotron 70B Instruct: 81.9%

MMLU Chat
Rank #1 of 1 (only model evaluated on this benchmark)
#1 Llama 3.1 Nemotron 70B Instruct: 80.6%
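
Each list above is a window of ranks around the model's own position on that benchmark. As an illustration only, here is a small sketch of how such a window can be sliced out of a full leaderboard; the rows are the published Winogrande entries, while the `neighbor_window` helper and its `radius` parameter are hypothetical conveniences, not a documented API of this page.

```python
# Published Winogrande entries as (rank, model, score-in-percent) rows.
leaderboard = [
    (1, "Qwen2 72B Instruct", 85.1),
    (2, "Command R+", 85.4),
    (3, "GPT-4", 87.5),
    (4, "Llama 3.1 Nemotron 70B Instruct", 84.5),
    (5, "Gemma 2 27B", 83.7),
    (6, "Qwen2.5 32B Instruct", 82.0),
    (7, "Phi-3.5-MoE-instruct", 81.3),
]

def neighbor_window(rows, model, radius=3):
    """Return the rows within `radius` ranks of `model` (hypothetical helper)."""
    idx = next(i for i, (_, name, _) in enumerate(rows) if name == model)
    return rows[max(0, idx - radius): idx + radius + 1]

for rank, name, score in neighbor_window(
        leaderboard, "Llama 3.1 Nemotron 70B Instruct"):
    print(f"#{rank} {name}: {score:.1f}%")
```
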
All Benchmark Results for Llama 3.1 Nemotron 70B Instruct
Complete list of benchmark scores with detailed information

Benchmark            Category    Modality  Normalized  Score   Source
GSM8k                math        text      0.91        91.4%   Self-reported
HellaSwag            reasoning   text      0.86        85.6%   Self-reported
Winogrande           reasoning   text      0.85        84.5%   Self-reported
GSM8K Chat           math        text      0.82        81.9%   Self-reported
MMLU Chat            general     text      0.81        80.6%   Self-reported
MMLU                 general     text      0.80        80.2%   Self-reported
Instruct HumanEval   code        text      0.74        73.8%   Self-reported
ARC-C                reasoning   text      0.69        69.2%   Self-reported
TruthfulQA           factuality  text      0.59        58.6%   Self-reported
XLSum English        general     text      0.32        31.6%   Self-reported

Showing 10 of 11 benchmarks; one additional result is not listed on this page.