
Llama 3.1 Nemotron 70B Instruct
#1 GSM8K Chat · #1 MMLU Chat · #1 Instruct HumanEval (+1 more)
by NVIDIA
About
Llama 3.1 Nemotron 70B Instruct is a language model developed by NVIDIA. It achieves strong overall performance, with an average score of 67.9% across 11 benchmarks, and scores highest on GSM8k (91.4%), HellaSwag (85.6%), and Winogrande (84.5%). Its strongest category is math, with an average of 86.7% across math benchmarks. It was released in October 2024.
Timeline
Announced: Oct 1, 2024
Released: Oct 1, 2024
Knowledge Cutoff: Dec 1, 2023
Specifications
License & Family
License: Llama 3.1 Community License
Base Model: Llama 3.1 70B Instruct
Benchmark Performance Overview
Performance metrics and category breakdown
Overall Performance
11 benchmarks
Average Score: 67.9%
Best Score: 91.4%
High Performers (80%+): 6

Top Categories
math: 86.7%
reasoning: 79.8%
code: 73.8%
general: 64.1%
factuality: 58.6%

Category scores are unweighted means of the individual benchmark results in each category; a short sketch follows.
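A minimal Python sketch of that computation, with benchmark-to-category assignments copied from the results table at the end of this page:

```python
from collections import defaultdict
from statistics import mean

# (benchmark, category, score%) rows from the results table below.
results = [
    ("GSM8k", "math", 91.4),
    ("HellaSwag", "reasoning", 85.6),
    ("Winogrande", "reasoning", 84.5),
    ("GSM8K Chat", "math", 81.9),
    ("MMLU Chat", "general", 80.6),
    ("MMLU", "general", 80.2),
    ("Instruct HumanEval", "code", 73.8),
    ("ARC-C", "reasoning", 69.2),
    ("TruthfulQA", "factuality", 58.6),
    ("XLSum English", "general", 31.6),
]

# Group scores by category, then take the unweighted mean of each group.
by_category = defaultdict(list)
for _, category, score in results:
    by_category[category].append(score)

for category, scores in by_category.items():
    print(f"{category}: {mean(scores):.1f}%")
    # Reproduces the overview: math 86.7, reasoning 79.8, code 73.8,
    # general 64.1, factuality 58.6.
```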
Benchmark Performance
[Chart omitted: top benchmark scores with normalized values (0-100%); see the full results table at the end of this page.]
Ranking Across Benchmarks
Position relative to other models on each benchmark. Entries are ordered by score, descending; a minimal rank-computation sketch follows these lists.
GSM8k
Rank #22 of 46
#19 Nova Micro: 92.3%
#20 Kimi K2 Base: 92.1%
#21 Qwen2.5 7B Instruct: 91.6%
#22 Llama 3.1 Nemotron 70B Instruct: 91.4%
#23 Qwen2 72B Instruct: 91.1%
#24 Qwen2.5-Coder 32B Instruct: 91.1%
#25 Gemini 1.5 Pro: 90.8%
HellaSwag
Rank #10 of 24
#7 Gemini 1.5 Flash: 86.5%
#8 Gemma 2 27B: 86.4%
#9 Claude 3 Haiku: 85.9%
#10 Llama 3.1 Nemotron 70B Instruct: 85.6%
#11 Qwen2.5 32B Instruct: 85.2%
#12 Phi-3.5-MoE-instruct: 83.8%
#13 Mistral NeMo Instruct: 83.5%
Winogrande
Rank #4 of 19
#1 GPT-4: 87.5%
#2 Command R+: 85.4%
#3 Qwen2 72B Instruct: 85.1%
#4 Llama 3.1 Nemotron 70B Instruct: 84.5%
#5 Gemma 2 27B: 83.7%
#6 Qwen2.5 32B Instruct: 82.0%
#7 Phi-3.5-MoE-instruct: 81.3%
GSM8K Chat
Rank #1 of 1 (only model evaluated on this benchmark)
#1 Llama 3.1 Nemotron 70B Instruct: 81.9%
MMLU Chat
Rank #1 of 1 (only model evaluated on this benchmark)
#1 Llama 3.1 Nemotron 70B Instruct: 80.6%
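For reference, a minimal sketch of how these positions can be derived, assuming ranks are assigned by sorting scores in descending order (the convention the lists above appear to follow; ties keep list order here, which is one reasonable choice). The data is the GSM8k neighborhood shown above, a window into a 46-model leaderboard:

```python
# Derive per-benchmark ranks by sorting scores descending.
# Scores are the GSM8k entries shown above (ranks #19-#25 of 46).
gsm8k_scores = {
    "Nova Micro": 92.3,
    "Kimi K2 Base": 92.1,
    "Qwen2.5 7B Instruct": 91.6,
    "Llama 3.1 Nemotron 70B Instruct": 91.4,
    "Qwen2 72B Instruct": 91.1,
    "Qwen2.5-Coder 32B Instruct": 91.1,
    "Gemini 1.5 Pro": 90.8,
}

# Sort highest-first; sorted() is stable, so the 91.1% tie keeps its order.
# Ranks start at 19 because this is a window into the full leaderboard.
ranked = sorted(gsm8k_scores.items(), key=lambda kv: kv[1], reverse=True)
for rank, (model, score) in enumerate(ranked, start=19):
    print(f"#{rank} {model}: {score}%")
```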
All Benchmark Results for Llama 3.1 Nemotron 70B Instruct
Benchmark scores with detailed information (10 of 11 shown)
Benchmark | Category | Modality | Raw Score (0-1) | Normalized | Source
GSM8k | math | text | 0.91 | 91.4% | Self-reported
HellaSwag | reasoning | text | 0.86 | 85.6% | Self-reported
Winogrande | reasoning | text | 0.85 | 84.5% | Self-reported
GSM8K Chat | math | text | 0.82 | 81.9% | Self-reported
MMLU Chat | general | text | 0.81 | 80.6% | Self-reported
MMLU | general | text | 0.80 | 80.2% | Self-reported
Instruct HumanEval | code | text | 0.74 | 73.8% | Self-reported
ARC-C | reasoning | text | 0.69 | 69.2% | Self-reported
TruthfulQA | factuality | text | 0.59 | 58.6% | Self-reported
XLSum English | general | text | 0.32 | 31.6% | Self-reported
Showing 10 of 11 benchmarks; one result is not listed on this page.
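The overview figures can be sanity-checked against this table. A short sketch using the ten normalized scores listed above; note that the headline 67.9% average covers all 11 benchmarks, one of which is not shown here, so it cannot be recomputed from this table alone:

```python
from statistics import mean

# Normalized scores for the 10 benchmarks listed in the table above
# (the 11th benchmark is not shown on this page).
scores = [91.4, 85.6, 84.5, 81.9, 80.6, 80.2, 73.8, 69.2, 58.6, 31.6]

# "High Performers (80%+)" from the overview: benchmarks scoring >= 80%.
high_performers = [s for s in scores if s >= 80.0]
print(len(high_performers))    # 6, matching the overview

# Best score from the overview.
print(f"{max(scores):.1f}%")   # 91.4%, matching the overview

# Mean of the 10 listed scores; the published 67.9% averages all 11.
print(f"{mean(scores):.1f}%")  # 73.7% over the 10 shown here
```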