meta-llama/Llama-3.3-70B-Instruct

Llama 3.3 70B dense model with NVIDIA FP8/FP4 quantized variants for Hopper and Blackwell GPUs

dense | 70B | 131,072 ctx | vLLM 0.12.0+ | text
Overview

Llama 3.3 70B Instruct is Meta's 70-billion-parameter dense language model. NVIDIA provides FP8 and FP4 quantized variants optimized for Hopper (H100/H200) and Blackwell (B200/GB200) GPUs. FP4 runs only on Blackwell and has the smallest VRAM footprint of the variants.

TPU support is provided through vLLM TPU with a recipe for Trillium.

Prerequisites

  • Hardware: 1x H100/H200 (FP8) or 1x B200 (FP4); for BF16, 2x GPUs or 4x Xeon 6/Xeon 5 NUMA nodes
  • vLLM >= 0.12.0
  • CUDA Driver >= 575 for GPUs
  • Docker with NVIDIA Container Toolkit (recommended) for GPUs

pip (Intel Xeon 6 CPUs)

For Intel and AMD x86 CPUs, follow the CPU pre-built wheels installation instructions.

Docker (Intel Xeon 6 CPUs)

docker pull vllm/vllm-openai-cpu:latest-x86_64 # For Intel Xeon 6

Docker (Cloud TPU — Trillium)

TPU support ships only as the separate vllm/vllm-tpu image; there is no pip wheel. Pull the tag specified by the upstream Trillium recipe, use it in place of latest, and run:

docker run -itd --name llama33-tpu \
  --privileged --network host --shm-size 16G \
  -v /dev/shm:/dev/shm -e HF_TOKEN=$HF_TOKEN \
  vllm/vllm-tpu:latest \
    --model meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 8 \
    --max-model-len 16384 \
    --host 0.0.0.0 --port 8000

Trillium requires a 4-chip slice minimum.
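
The --max-model-len 16384 setting in the command above caps the context well below the model's 131,072-token maximum. A rough sketch of why that matters for memory follows. It estimates the KV cache for one full-length sequence, assuming the architecture values from the model's published config (80 layers, 8 KV heads, head dimension 128) and a 2-byte BF16 cache. None of these values are stated in this recipe, and the function name is hypothetical.

```python
# Assumed architecture values for Llama 3.3 70B (taken from the model's
# published config, not from this recipe).
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128


def kv_cache_bytes(num_tokens: int, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size for one sequence of num_tokens tokens.

    Each token stores one key and one value vector per layer and KV head.
    bytes_per_value=2 assumes a BF16 cache.
    """
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_value
    return num_tokens * per_token


for ctx in (16_384, 131_072):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>7} tokens: {gib:.0f} GiB of KV cache per sequence")
```

Under these assumptions, one sequence at the full context length needs eight times the cache of one at 16,384 tokens. The total is split across devices when tensor parallelism is used.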

Intel Xeon 6 Deployment via Docker

Launch the x86 CPU vLLM Docker container for meta-llama/Llama-3.3-70B-Instruct:

docker run -itd --name llama3-70b-cpu \
  --network host \
  --shm-size 16g \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai-cpu:latest-x86_64 \
    --model meta-llama/Llama-3.3-70B-Instruct \
    --host 0.0.0.0 \
    --port 8000
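
Loading a 70B model can take a while, so it may help to wait for the server before sending requests. The sketch below polls the /health endpoint of vLLM's OpenAI-compatible server on the host and port used in the command above. The helper names, attempt count, and delay are illustrative choices, not part of the recipe.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and connection errors are OSError subclasses; treat
        # them all as "not ready yet".
        return False


def wait_until_ready(probe: Callable[[], bool],
                     attempts: int = 60,
                     delay_s: float = 10.0) -> bool:
    """Call probe() up to `attempts` times, sleeping between failures."""
    for i in range(attempts):
        if probe():
            return True
        if i < attempts - 1:
            time.sleep(delay_s)
    return False


# Example (blocks until the server answers or the attempts run out):
# ready = wait_until_ready(lambda: http_ok("http://localhost:8000/health"))
```

Passing the probe in as a callable keeps the retry loop independent of the HTTP check, so the same loop can wrap a different readiness test if needed.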

Client Usage

from openai import OpenAI

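# The API key is not checked unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    # Must match the model the server was launched with; the CPU and TPU
    # commands above serve meta-llama/Llama-3.3-70B-Instruct instead.
    model="nvidia/Llama-3.3-70B-Instruct-FP8",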
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response.choices[0].message.content)
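
The model field has to name a model the server is actually serving. A minimal sketch for discovering that name follows, assuming the server is reachable at the same base URL as above and returns the standard OpenAI-style list from /v1/models. The helper names are hypothetical.

```python
import json
import urllib.request


def served_model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_list(base_url: str = "http://localhost:8000/v1") -> dict:
    """GET {base_url}/models and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=10) as resp:
        return json.load(resp)


# Example, with a running server:
# model_name = served_model_ids(fetch_model_list())[0]
```

The resulting name can then be passed as model= in the chat completion call instead of a hard-coded string.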

Troubleshooting

FP4 variant not loading: FP4 is only supported on Blackwell (compute capability 10.0). Use FP8 on Hopper.
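
The rule above can be written as a small check on the GPU's compute capability. This is a sketch only: the function name is hypothetical, Hopper's compute capability of 9.0 comes from NVIDIA's published values rather than this recipe, and falling back to BF16 on any other GPU is this sketch's assumption.

```python
def pick_precision(major: int, minor: int = 0) -> str:
    """Map a CUDA compute capability to a variant of this model.

    10.x (Blackwell) -> FP4, 9.x (Hopper) -> FP8, anything else -> BF16.
    """
    if major == 10:
        return "FP4"
    if major == 9:
        return "FP8"
    return "BF16"


# With PyTorch installed, the capability of the current GPU is available as:
# import torch
# major, minor = torch.cuda.get_device_capability()
# print(pick_precision(major, minor))
```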

OOM with BF16 on single GPU: Use the FP8 variant (~70 GB) or FP4 variant (~40 GB) to fit on a single GPU.
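
The sizes quoted above follow from simple arithmetic on the parameter count. The sketch below counts weight bytes only, using a nominal 70 billion parameters, and ignores KV cache and activations. The function name is hypothetical.

```python
NOMINAL_PARAMS = 70e9  # nominal parameter count for a "70B" model
BITS_PER_PARAM = {"BF16": 16, "FP8": 8, "FP4": 4}


def weight_gb(precision: str, params: float = NOMINAL_PARAMS) -> float:
    """Approximate weight size in GB (10^9 bytes) at the given precision."""
    return params * BITS_PER_PARAM[precision] / 8 / 1e9


for name in BITS_PER_PARAM:
    print(f"{name}: ~{weight_gb(name):.0f} GB of weights")
```

BF16 comes out at about 140 GB, which is why it does not fit on a single GPU, and FP8 at about 70 GB. The raw FP4 figure is about 35 GB. The ~40 GB quoted above is higher, plausibly because of quantization scales and tensors kept at higher precision, though that explanation is an assumption here.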
