MiniMaxAI/MiniMax-M3

MiniMax M3 vision-language MoE (427B total / 26B active) for frontier coding, agent toolchains, and 1M-token reasoning via MSA sparse attention — native multimodal (image + video + computer use); BF16 checkpoint with MXFP8 and NVFP4 variants from NVIDIA. Runs on NVIDIA (Hopper/Blackwell) and on AMD CDNA4 (MI350X/MI355X) and CDNA3 (MI300X/MI325X).

Frontier coding and agent (SWE-Bench Pro 59.0, Terminal-Bench 2.1 66.0); MSA sparse attention; 1M context

MoE · 427B / 26B · 1,048,576 ctx · vLLM 0.24.0+ · text · multimodal

Overview

MiniMax-M3 is a frontier vision-language MoE model from MiniMax.

  • MSA (MiniMax Sparse Attention) — scalable sparse-attention architecture that lifts the context window to 1M tokens. MiniMax reports per-token compute at 1M context reduced to ~1/20 of the previous generation, with 9× prefill and >15× decode speedup vs dense baselines.

  • Frontier coding and agent capabilities — SWE-Bench Pro 59.0%, Terminal-Bench 2.1 66.0%, SWE-fficiency 34.8%, KernelBench Hard 28.8%, MCP Atlas 74.2%.
  • Native multimodal — image + video inputs, plus computer-use; trained multimodally from step 0.
  • Two reasoning modes: thinking (complex reasoning / agents) and non-thinking (latency-sensitive), switchable per request.

Prerequisites

  • OS: Linux
  • Python: 3.10 - 3.13
  • NVIDIA: compute capability >= 9.0 (Hopper) recommended; 8x H200 / H20 for a tight single-node BF16 fit, or multi-node TP for long-context headroom
  • AMD: MI350X/MI355X (gfx950), MI300X/MI325X (gfx942), ROCm 7.2+. BF16 needs TP=8; the MXFP8 variant runs from TP=4.
  • --block-size 128 is mandatory on every platform (MSA sparse/index cache).
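
As a rough sanity check on the hardware guidance above, weight memory can be estimated as parameter count times bytes per parameter. The sketch below is an approximation only: it ignores quantization scale tensors, activations, the KV cache, and framework overhead, and it assumes weights split evenly across tensor-parallel ranks. The 141 GB per-GPU figure for H200 is an assumption taken from NVIDIA's published spec, not from this recipe, and the helper names are illustrative.

```python
# Back-of-the-envelope weight footprint; not a substitute for measuring on real hardware.
TOTAL_PARAMS_B = 427  # total parameters in billions (from the model summary)


def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), ignoring scales and overhead."""
    return params_billion * bits_per_param / 8


def per_gpu_gb(total_gb: float, tensor_parallel_size: int) -> float:
    """Weights per GPU, assuming an even split across tensor-parallel ranks."""
    return total_gb / tensor_parallel_size


bf16 = weight_gb(TOTAL_PARAMS_B, 16)   # 16-bit weights
mxfp8 = weight_gb(TOTAL_PARAMS_B, 8)   # 8-bit weights, about half of BF16
nvfp4 = weight_gb(TOTAL_PARAMS_B, 4)   # 4-bit weights, about a quarter of BF16

H200_GB = 141  # assumption: H200 HBM capacity per GPU
headroom = H200_GB - per_gpu_gb(bf16, 8)
print(f"BF16 weights: {bf16:.0f} GB total, {per_gpu_gb(bf16, 8):.2f} GB per GPU at TP=8")
print(f"Left per H200 for KV cache and activations: {headroom:.2f} GB")
```

The small per-GPU headroom at TP=8 is consistent with the recipe calling 8x H200 a tight BF16 fit.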

Docker (NVIDIA)

MiniMax-M3 support has not yet shipped in a stable vLLM release — use the dedicated Docker image:

docker pull vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41

Docker (AMD ROCm)

MiniMax-M3 support has not yet shipped in a stable vLLM release. Use the dedicated Docker image, or a nightly build once one is released:

docker pull vllm/vllm-openai-rocm:minimax-m3
docker run --rm -it --device /dev/kfd --device /dev/dri --group-add video \
  --cap-add SYS_PTRACE --security-opt seccomp=unconfined --ipc=host \
  --shm-size=16g -p 8000:8000 \
  --entrypoint /bin/bash \
  vllm/vllm-openai-rocm:minimax-m3

Launching the Server

NVIDIA — TP8 (8x H200 / H20)

vllm serve MiniMaxAI/MiniMax-M3 \
  --tensor-parallel-size 8 \
  --block-size 128 \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

TP8 + Expert Parallel

vllm serve MiniMaxAI/MiniMax-M3 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --block-size 128 \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

DP8 + Expert Parallel

vllm serve MiniMaxAI/MiniMax-M3 \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --block-size 128 \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice
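
The three NVIDIA commands above differ only in their parallelism flags. A minimal sketch of a launcher helper that makes the shared structure explicit follows; `serve_command` and `COMMON_FLAGS` are hypothetical names introduced here, and the only CLI flags used are the ones already shown in this recipe.

```python
import shlex

# Flags shared by every serve command in this section.
COMMON_FLAGS = [
    "--block-size", "128",
    "--tool-call-parser", "minimax_m3",
    "--reasoning-parser", "minimax_m3",
    "--enable-auto-tool-choice",
]


def serve_command(model: str = "MiniMaxAI/MiniMax-M3", *, parallelism: str = "tp",
                  size: int = 8, expert_parallel: bool = False) -> list:
    """Build a `vllm serve` argv for the TP8, TP8+EP, or DP8+EP layouts."""
    if parallelism not in ("tp", "dp"):
        raise ValueError("parallelism must be 'tp' or 'dp'")
    flag = "--tensor-parallel-size" if parallelism == "tp" else "--data-parallel-size"
    argv = ["vllm", "serve", model, flag, str(size)]
    if expert_parallel:
        argv.append("--enable-expert-parallel")
    return argv + COMMON_FLAGS


# Print the DP8 + Expert Parallel command as a single shell-safe line.
print(shlex.join(serve_command(parallelism="dp", expert_parallel=True)))
```

The returned list can be passed directly to `subprocess.run` to start the server from a script.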

AMD ROCm (MI350X/MI355X (gfx950), MI300X/MI325X (gfx942))

On AMD MI300X / MI325X / MI355X, run with CUDA graphs and set the following before any of the serve commands below. It avoids the MiniMax-M3 decode breakable-cudagraph path that would otherwise force eager execution (per @hongxiayang):

export VLLM_USE_BREAKABLE_CUDAGRAPH=0

For gfx950: prefer the MXFP8 variant MiniMaxAI/MiniMax-M3-MXFP8, which runs at TP=4 with a smaller footprint. Use TP=8 for lower latency, for long context, or when serving the default BF16 model.

TP8 (Text or Vision)

vllm serve MiniMaxAI/MiniMax-M3 \
  --tensor-parallel-size 8 \
  --block-size 128 \
  --attention-backend TRITON_ATTN \
  --mm-encoder-tp-mode data \
  --mm-encoder-attn-backend ROCM_AITER_FA \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

TP8 + Expert Parallel

vllm serve MiniMaxAI/MiniMax-M3 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --block-size 128 \
  --attention-backend TRITON_ATTN \
  --mm-encoder-tp-mode data \
  --mm-encoder-attn-backend ROCM_AITER_FA \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

DP8 + Expert Parallel

vllm serve MiniMaxAI/MiniMax-M3 \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --block-size 128 \
  --attention-backend TRITON_ATTN \
  --mm-encoder-tp-mode data \
  --mm-encoder-attn-backend ROCM_AITER_FA \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

FP8 KV Cache

Add --kv-cache-dtype fp8 to any command for ~1.5× the KV pool — lossless in our testing across the full native context. Especially worth it for high concurrency or long context, where KV is the binding constraint.

Context Length & GPU Memory

The full 1M-token window (context_length: 1048576) needs a large KV cache. To save GPU memory, you can optionally cap the context with --max-model-len:

vllm serve MiniMaxAI/MiniMax-M3 \
  --tensor-parallel-size 8 \
  --block-size 128 \
  --max-model-len 131072        # 128K instead of the full 1M

AMD ROCm notes: Native context is 512K. To go past it, supply a YaRN rope_scaling on the text config (a top-level override silently misses the decoder's config) and allow the long max length. TP=8 + fp8 KV is the practical combo at 1M:

VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  vllm serve MiniMaxAI/MiniMax-M3 \
  --block-size 128 \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 8 \
  --max-model-len 1048576 \
  --attention-backend TRITON_ATTN \
  --mm-encoder-tp-mode data \
  --mm-encoder-attn-backend ROCM_AITER_FA \
  --tool-call-parser minimax_m3 \
  --enable-auto-tool-choice \
  --reasoning-parser minimax_m3 \
  --hf-overrides '{"text_config":{"rope_scaling":{"rope_type":"yarn","factor":2.0,"original_max_position_embeddings":524288}}}'

  • Set --max-model-len to the longest prompt + output you actually serve (e.g. 32768, 131072, 262144). A smaller window frees KV-pool headroom for higher concurrency and lets the model fit on fewer GPUs; if you need the full 1M window, consider scaling out with multi-node TP instead.
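
The --hf-overrides value in the ROCm command above is easy to mistype by hand, so it can help to generate it. The sketch below nests rope_scaling under text_config as the note requires. The factor of 2.0 in the command equals 1048576 / 524288; deriving the factor as target / native for other window sizes is an assumption here, not something the recipe states. `yarn_override` is a hypothetical helper name.

```python
import json
import shlex

NATIVE_CONTEXT = 524_288      # native window stated in the ROCm note
TARGET_CONTEXT = 1_048_576    # value passed to --max-model-len


def yarn_override(native: int, target: int) -> dict:
    """Build the override with rope_scaling nested under text_config."""
    return {
        "text_config": {
            "rope_scaling": {
                "rope_type": "yarn",
                "factor": target / native,  # assumption: factor is the window ratio
                "original_max_position_embeddings": native,
            }
        }
    }


override = yarn_override(NATIVE_CONTEXT, TARGET_CONTEXT)
# Compact separators reproduce the single-line form used on the command line.
override_json = json.dumps(override, separators=(",", ":"))
print("--hf-overrides", shlex.quote(override_json))
```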

Client Usage

Recommended sampling parameters (from the model card):

  • temperature = 1.0
  • top_p = 0.95
  • top_k = 40

Default system prompt:

You are a helpful assistant. Your name is MiniMax-M3 and is built by MiniMax.

Example chat request:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMaxAI/MiniMax-M3",
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "messages": [
      {"role": "system", "content": "You are a helpful assistant. Your name is MiniMax-M3 and is built by MiniMax."},
      {"role": "user", "content": "Explain MSA sparse attention in 3 bullets."}
    ]
  }'

Thinking Modes

M3 reasoning is controlled by thinking_mode, which takes one of three values:

  • enabled — the model thinks before every response, including after tool results. Use for complex reasoning and agents.
  • disabled — no thinking; the model answers directly. Use for latency-sensitive turns.
  • adaptive (default when unset) — the model decides whether to think based on the task.

Pass it per request through chat_template_kwargs. The same value also tunes the minimax_m3 reasoning parser, so reasoning (formerly reasoning_content) and content are split correctly in every mode.


# Start the MiniMax-M3 model by referring to the command above first.

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M3",
    messages=[{"role": "user", "content": "Prove there are infinitely many primes."}],
    extra_body={"chat_template_kwargs": {"thinking_mode": "enabled"}},
)
msg = response.choices[0].message
# vLLM exposes the <mm:think> block as `reasoning` (the older
# `reasoning_content` field is deprecated but still aliased).
print(getattr(msg, "reasoning", None) or getattr(msg, "reasoning_content", None))
print(msg.content)  # the final answer
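
When the mode is switched per request, a small wrapper can validate it and attach the recommended sampling parameters in one place. This is a sketch under assumptions: `chat_request` is a hypothetical helper, and top_k is sent through extra_body because it is a vLLM extension rather than a standard OpenAI parameter.

```python
from typing import Optional

VALID_THINKING_MODES = ("enabled", "disabled", "adaptive")


def chat_request(prompt: str, thinking_mode: Optional[str] = None) -> dict:
    """Return keyword arguments for client.chat.completions.create()."""
    kwargs = {
        "model": "MiniMaxAI/MiniMax-M3",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
    }
    extra_body = {"top_k": 40}  # non-standard field, forwarded as-is to the server
    if thinking_mode is not None:
        if thinking_mode not in VALID_THINKING_MODES:
            raise ValueError(f"unknown thinking_mode: {thinking_mode!r}")
        extra_body["chat_template_kwargs"] = {"thinking_mode": thinking_mode}
    # Leaving thinking_mode unset falls back to the adaptive default.
    kwargs["extra_body"] = extra_body
    return kwargs
```

With the client from the example above, a latency-sensitive call would be `client.chat.completions.create(**chat_request("Summarize this log line.", "disabled"))`.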

Benchmarking

vllm bench serve \
  --backend vllm \
  --model MiniMaxAI/MiniMax-M3 \
  --endpoint /v1/completions \
  --dataset-name random \
  --random-input-len 2048 \
  --random-output-len 1024 \
  --max-concurrency 10 \
  --num-prompts 100
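
To size the server for this benchmark, it helps to work out how many tokens the run keeps in flight. The arithmetic below assumes every request uses exactly the requested input and output lengths, which may not hold if the random dataset is configured to vary them. `output_tokens_per_second` is a hypothetical helper for turning a measured wall-clock time into aggregate throughput.

```python
NUM_PROMPTS = 100
INPUT_LEN = 2048
OUTPUT_LEN = 1024
MAX_CONCURRENCY = 10

# Longest sequence a single request reaches; --max-model-len must be at least this.
tokens_per_request = INPUT_LEN + OUTPUT_LEN
# Upper bound on tokens resident in the KV cache at any moment.
peak_inflight_tokens = MAX_CONCURRENCY * tokens_per_request
total_output_tokens = NUM_PROMPTS * OUTPUT_LEN


def output_tokens_per_second(elapsed_seconds: float) -> float:
    """Aggregate decode throughput for a run that took `elapsed_seconds`."""
    return total_output_tokens / elapsed_seconds


print(tokens_per_request, peak_inflight_tokens, total_output_tokens)
```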

Quantized Variant (MXFP8)

MiniMaxAI/MiniMax-M3-MXFP8 is an MXFP8 checkpoint quantized by NVIDIA from the original BF16 weights, at roughly half the VRAM of the BF16 release. Pass the repo id directly to vllm serve:

vllm serve MiniMaxAI/MiniMax-M3-MXFP8 \
  --tensor-parallel-size 8 \
  --block-size 128 \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

For best MXFP8 throughput, prefer Blackwell (B200/B300) for native MX tensor cores, or AMD CDNA4 (MI350X/MI355X, gfx950) for native MXFP8 matrix cores.

Quantized Variant (NVFP4, Blackwell)

nvidia/MiniMax-M3-NVFP4 is an NVFP4 checkpoint quantized by NVIDIA, at roughly 1/4 the VRAM of the BF16 release, so the 427B model fits comfortably on a single Blackwell node (B200 / B300) with KV-cache headroom. Pass the repo id directly to vllm serve.

vLLM support is in-flight. MiniMax-M3 NVFP4 needs the modelopt NVFP4 path added in vLLM PR #46380, which is not yet merged. Until it lands in a release, build vLLM from that branch (or a nightly once merged); a stock build will not recognise the NVFP4 quant config.

vllm serve nvidia/MiniMax-M3-NVFP4 \
  --tensor-parallel-size 8 \
  --block-size 128 \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

Add --enable-expert-parallel (TP+EP) or --data-parallel-size 8 --enable-expert-parallel (DP+EP) to scale across the node, exactly as for the BF16/MXFP8 commands above. For text-only serving, add --language-model-only to skip the vision encoder and free VRAM for KV cache.

NVFP4 + EAGLE3 spec decoding (MTP)

The NVFP4 target pairs with the same EAGLE3 draft head as the other variants. Append the draft config to the command:

vllm serve nvidia/MiniMax-M3-NVFP4 \
  --tensor-parallel-size 8 \
  --block-size 128 \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice \
  --speculative-config '{"method": "eagle3", "model": "Inferact/MiniMax-M3-EAGLE3", "num_speculative_tokens": 3, "attention_backend": "FLASH_ATTN"}'

Troubleshooting

  • --block-size mismatch. MSA's sparse block size is 128; the vLLM KV cache block size must match. Using the default (16) misaligns the sparse attention indexing (on AMD it crashes with No common block size for 16).
  • Parsers. --tool-call-parser and --reasoning-parser both use minimax_m3 — distinct from minimax_m2 used by earlier releases.
  • Long context KV cache. See Context Length & GPU Memory above — cap --max-model-len or scale to multi-node TP if you OOM.
  • Vision encoder. The encoder is small, so at high TP it is best run data-parallel (--mm-encoder-tp-mode data) to avoid TP communication overhead. The Encoder Parallel option sets that flag and also selects the vision-encoder attention backend (--mm-encoder-attn-backend FLASHINFER on NVIDIA, ROCM_AITER_FA on AMD) and the host-shared-memory multimodal processor cache (--mm-processor-cache-type shm). For text-only workloads, Text only (--language-model-only) skips loading the encoder and frees VRAM; it is mutually exclusive with Encoder Parallel.
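
The checks in this list can be applied to a launch command before starting the server. The following is a minimal sketch, not part of vLLM: `check_serve_args` is a hypothetical helper, it only understands the space-separated `--flag value` form used in this recipe, and treating --language-model-only together with --mm-encoder-tp-mode data as a conflict follows the recipe's description rather than any documented CLI error.

```python
def check_serve_args(argv: list) -> list:
    """Return the problems found in a `vllm serve` argv; an empty list means none."""

    def value_of(flag: str):
        # Value following `flag`, or None if the flag is absent or has no value.
        if flag in argv:
            i = argv.index(flag)
            if i + 1 < len(argv):
                return argv[i + 1]
        return None

    problems = []
    if value_of("--block-size") != "128":
        problems.append("--block-size must be 128 to match MSA's sparse block size")
    for flag in ("--tool-call-parser", "--reasoning-parser"):
        parser = value_of(flag)
        if parser is not None and parser != "minimax_m3":
            problems.append(f"{flag} should be minimax_m3, got {parser}")
    if "--language-model-only" in argv and value_of("--mm-encoder-tp-mode") == "data":
        problems.append("--language-model-only conflicts with --mm-encoder-tp-mode data")
    return problems


bad = ["vllm", "serve", "MiniMaxAI/MiniMax-M3", "--tool-call-parser", "minimax_m2"]
for problem in check_serve_args(bad):
    print(problem)
```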

References