meta-llama/Llama-4-Scout-17B-16E-Instruct

Llama 4 Scout 17B-16E MoE model with NVIDIA FP8/FP4 variants, fits on a single GPU with quantization

MoE | 109B total / 17B active | 10,485,760-token context | vLLM 0.12.0+ | text

Overview

Llama 4 Scout is Meta's mixture-of-experts (MoE) model with 16 experts: 17B parameters are active per token out of 109B total. NVIDIA provides FP8- and FP4-quantized variants. With FP4 quantization, the model fits on a single B200 GPU, which makes it one of the most accessible MoE models to deploy.
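The single-GPU claim follows from simple arithmetic on weight size. The sketch below estimates weight-only memory from the 109B total parameter count, since all experts must be resident even though only 17B parameters are active per token. The bytes-per-parameter values are idealized assumptions: they ignore quantization scale factors, any layers left unquantized, the KV cache, and activations.

```python
# Rough weight-only memory estimate for Llama 4 Scout at different precisions.
TOTAL_PARAMS = 109e9  # total parameters, as stated in the overview

# Idealized bytes per parameter (assumption: no scale factors or mixed layers).
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "FP4": 0.5}


def weight_gb(total_params: float, bytes_per_param: float) -> float:
    """Return the approximate weight size in gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / 1e9


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name}: ~{weight_gb(TOTAL_PARAMS, nbytes):.1f} GB of weights")
```

Real deployments need headroom beyond these figures for the KV cache and activations, so treat the numbers as lower bounds.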

Prerequisites

  • Hardware: 1x B200 (FP4), 1x H100 (FP8), 4x GPUs (BF16), or 4x Xeon 6/Xeon 5 NUMA nodes (BF16)
  • vLLM >= 0.12.0
  • CUDA Driver >= 575 for GPUs
  • Docker with NVIDIA Container Toolkit (recommended) for GPUs
  • License: You must accept Meta's Llama 4 Community License to download the model weights
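The version requirement above can be checked before launching anything. This is a minimal sketch, assuming vLLM was installed as the `vllm` distribution in the current Python environment; the helper names `parse_release`, `meets_minimum`, and `check_vllm` are illustrative, not part of vLLM.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_release(v: str) -> tuple[int, ...]:
    """Extract the leading numeric release, e.g. '0.12.0rc1' -> (0, 12, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", v)
    if match is None:
        raise ValueError(f"unrecognized version string: {v!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str = "0.12.0") -> bool:
    """Compare numeric releases only; pre-release and local tags are ignored."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '0.12' compares equal to '0.12.0'.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_vllm(minimum: str = "0.12.0") -> str:
    """Return a one-line status message for the installed vLLM version."""
    try:
        installed = version("vllm")
    except PackageNotFoundError:
        return "vllm is not installed in this environment"
    if meets_minimum(installed, minimum):
        return f"vllm {installed}: OK"
    return f"vllm {installed}: too old, need >= {minimum}"


if __name__ == "__main__":
    print(check_vllm())
```

When running from the Docker image, the same check would need to run inside the container, since the host environment may not have vLLM installed at all.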

pip (Intel Xeon 6 CPUs)

For Intel and AMD x86 CPUs, follow the CPU pre-built wheels installation instructions.

Docker (Intel Xeon 6 CPUs)

docker pull vllm/vllm-openai-cpu:latest-x86_64 # For Intel Xeon 6

Intel Xeon 6 Deployment via Docker

Launch the x86 CPU vLLM Docker container for meta-llama/Llama-4-Scout-17B-16E-Instruct:

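# The model is gated: HF_TOKEN must belong to an account that has accepted the
# license, unless the weights are already present in the mounted cache.
docker run -itd --name llama4-17b-cpu \
  --network host \
  --shm-size 16g \
  -e HF_TOKEN="$HF_TOKEN" \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai-cpu:latest-x86_64 \
    --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --host 0.0.0.0 \
    --port 8000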
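Loading a model of this size takes a while, so requests sent right after `docker run` will fail until the server is up. The sketch below polls the OpenAI-compatible `/v1/models` endpoint and returns the served model IDs once the server answers. The function names, the deadline, and the polling interval are illustrative assumptions, not values taken from the recipe.

```python
import json
import time
import urllib.error
import urllib.request


def served_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Return the model IDs reported by GET {base_url}/v1/models."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


def wait_until_ready(
    base_url: str, deadline_s: float = 600.0, interval_s: float = 5.0
) -> list[str]:
    """Poll until the server answers; return model IDs, or [] on timeout."""
    stop = time.monotonic() + deadline_s
    while True:
        try:
            return served_models(base_url)
        except (urllib.error.URLError, OSError, ValueError):
            # Connection refused, reset, or a non-JSON reply: not ready yet.
            pass
        if time.monotonic() >= stop:
            return []
        time.sleep(interval_s)
```

For the container above, `wait_until_ready("http://localhost:8000")` should eventually return a list containing the name passed to `--model`, which is also the name a client has to send in its requests.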

Client Usage

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
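    # Must match the name the server was launched with (--model); use the
    # nvidia/...-FP8 ID instead when serving that variant.
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",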
    messages=[{"role": "user", "content": "Explain MoE models briefly."}],
)
print(response.choices[0].message.content)

Troubleshooting

FP4 only works on Blackwell: FP4 quantization requires compute capability 10.0 (B200/GB200). Use FP8 on Hopper.
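The rule in this note can be expressed as a small lookup on the GPU's compute capability. This is a sketch under stated assumptions: it treats any capability with major version 10 as Blackwell and major version 9 as Hopper, falls back to BF16 otherwise, and uses PyTorch only if it happens to be installed. The function names are hypothetical.

```python
from typing import Optional


def pick_precision(capability: tuple[int, int]) -> str:
    """Map a (major, minor) compute capability to a precision label."""
    major, _minor = capability
    if major == 10:  # Blackwell, e.g. B200/GB200 at 10.0
        return "FP4"
    if major == 9:  # Hopper
        return "FP8"
    return "BF16"  # assumption: fall back to unquantized weights


def detect_capability() -> Optional[tuple[int, int]]:
    """Return (major, minor) for GPU 0, or None without PyTorch or CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    cap = detect_capability()
    if cap is None:
        print("No CUDA GPU detected")
    else:
        print(f"Compute capability {cap[0]}.{cap[1]}: use {pick_precision(cap)}")
```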

TP=1 recommended for best throughput: For maximum throughput per GPU, keep tensor parallelism at 1 (TP=1). Increase TP to 2, 4, or 8 to lower latency at the cost of per-GPU throughput.
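In vLLM, the TP degree is set with the `--tensor-parallel-size` option of `vllm serve`. The helper below is a hypothetical sketch that builds such a command line; restricting the value to 1, 2, 4, or 8 mirrors the note above and is not a constraint checked against vLLM itself.

```python
import shlex

# TP degrees mentioned in the troubleshooting note (an assumption, not a vLLM limit).
VALID_TP = (1, 2, 4, 8)


def serve_command(model: str, tensor_parallel_size: int = 1) -> list[str]:
    """Build a `vllm serve` argument list for the given TP degree."""
    if tensor_parallel_size not in VALID_TP:
        raise ValueError(f"tensor_parallel_size must be one of {VALID_TP}")
    return [
        "vllm",
        "serve",
        model,
        "--tensor-parallel-size",
        str(tensor_parallel_size),
    ]


# Throughput-oriented default: one GPU per model replica.
print(shlex.join(serve_command("nvidia/Llama-4-Scout-17B-16E-Instruct-FP8", 1)))
```

Passing 2, 4, or 8 instead shards the model across that many GPUs, which is the latency-oriented setting the note describes.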