MiniMaxAI/MiniMax-M3
MiniMax M3 vision-language MoE (427B total / 26B active) for frontier coding, agent toolchains, and 1M-token reasoning via MSA sparse attention — native multimodal (image + video + computer use); BF16 checkpoint with MXFP8 and NVFP4 variants from NVIDIA. Runs on NVIDIA (Hopper/Blackwell) and on AMD CDNA4 (MI350X/MI355X) and CDNA3 (MI300X/MI325X).
Frontier coding and agent (SWE-Bench Pro 59.0, Terminal-Bench 2.1 66.0); MSA sparse attention; 1M context
Guide
Overview
MiniMax-M3 is a frontier vision-language MoE model from MiniMax.
- MSA (MiniMax Sparse Attention) — scalable sparse-attention architecture
that lifts the context window to 1M tokens. MiniMax reports per-token
compute at 1M context reduced to ~1/20 of the previous generation, with
9× prefill and >15× decode speedup vs dense baselines.
- Frontier coding and agent capabilities — SWE-Bench Pro 59.0%, Terminal-Bench 2.1 66.0%, SWE-fficiency 34.8%, KernelBench Hard 28.8%, MCP Atlas 74.2%.
- Native multimodal — image + video inputs, plus computer-use; trained multimodally from step 0.
- Two reasoning modes: thinking (complex reasoning / agents) and non-thinking (latency-sensitive), switchable per request.
Prerequisites
- OS: Linux
- Python: 3.10 - 3.13
- NVIDIA: compute capability >= 9.0 (Hopper) recommended; 8x H200 / H20 for a tight single-node BF16 fit, or multi-node TP for long-context headroom
- AMD: MI350X/MI355X (gfx950), MI300X/MI325X (gfx942), ROCm 7.2+. BF16 needs TP=8; the MXFP8 variant runs from TP=4.
--block-size 128 is mandatory on every platform (MSA sparse/index cache).
Docker (NVIDIA)
MiniMax-M3 support has not yet shipped in a stable vLLM release — use the dedicated Docker image:
docker pull vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41
Docker (AMD ROCm)
MiniMax-M3 support has not yet shipped in a stable vLLM release. Use the dedicated Docker image, or a nightly image published after the release:
docker pull vllm/vllm-openai-rocm:minimax-m3
docker run --rm -it --device /dev/kfd --device /dev/dri --group-add video \
--cap-add SYS_PTRACE --security-opt seccomp=unconfined --ipc=host \
--shm-size=16g -p 8000:8000 \
--entrypoint /bin/bash \
vllm/vllm-openai-rocm:minimax-m3
Launching the Server
NVIDIA — TP8 (8x H200 / H20)
vllm serve MiniMaxAI/MiniMax-M3 \
--tensor-parallel-size 8 \
--block-size 128 \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice
TP8 + Expert Parallel
vllm serve MiniMaxAI/MiniMax-M3 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--block-size 128 \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice
DP8 + Expert Parallel
vllm serve MiniMaxAI/MiniMax-M3 \
--data-parallel-size 8 \
--enable-expert-parallel \
--block-size 128 \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice
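The three layouts above differ only in their parallelism flags. The following is a minimal sketch, not part of vLLM, that assembles the argument list so the shared flags live in one place. The helper serve_args and the layout names tp, tp_ep and dp_ep are hypothetical; the flags themselves are copied from the commands above.

```python
import shlex

# Flags shared by every MiniMax-M3 launch command in this guide.
COMMON_FLAGS = [
    "--block-size", "128",
    "--tool-call-parser", "minimax_m3",
    "--reasoning-parser", "minimax_m3",
    "--enable-auto-tool-choice",
]

# Hypothetical layout names mapped to the parallelism flags used above.
PARALLEL_FLAGS = {
    "tp": lambda n: ["--tensor-parallel-size", str(n)],
    "tp_ep": lambda n: ["--tensor-parallel-size", str(n), "--enable-expert-parallel"],
    "dp_ep": lambda n: ["--data-parallel-size", str(n), "--enable-expert-parallel"],
}


def serve_args(layout: str, gpus: int = 8, model: str = "MiniMaxAI/MiniMax-M3") -> list[str]:
    """Return the argv for `vllm serve` in the given layout."""
    if layout not in PARALLEL_FLAGS:
        raise ValueError(f"unknown layout {layout!r}; expected one of {sorted(PARALLEL_FLAGS)}")
    return ["vllm", "serve", model, *PARALLEL_FLAGS[layout](gpus), *COMMON_FLAGS]


# Print the DP8 + Expert Parallel command as a single shell line.
print(shlex.join(serve_args("dp_ep")))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting issues when extra flags are appended.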
AMD ROCm (MI350X/MI355X (gfx950), MI300X/MI325X (gfx942))
On AMD MI300X / MI325X / MI355X, run with CUDA graphs and set the following before any of the serve commands below. It avoids the MiniMax-M3 decode breakable-cudagraph path that would otherwise force eager execution (per @hongxiayang):
export VLLM_USE_BREAKABLE_CUDAGRAPH=0
For gfx950, prefer the MXFP8 variant MiniMaxAI/MiniMax-M3-MXFP8, which runs at TP=4 with a smaller
footprint. Use TP=8 for lower latency, for long contexts, or for the default BF16 model.
TP8 (Text or Vision)
vllm serve MiniMaxAI/MiniMax-M3 \
--tensor-parallel-size 8 \
--block-size 128 \
--attention-backend TRITON_ATTN \
--mm-encoder-tp-mode data \
--mm-encoder-attn-backend ROCM_AITER_FA \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice
TP8 + Expert Parallel
vllm serve MiniMaxAI/MiniMax-M3 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--block-size 128 \
--attention-backend TRITON_ATTN \
--mm-encoder-tp-mode data \
--mm-encoder-attn-backend ROCM_AITER_FA \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice
DP8 + Expert Parallel
vllm serve MiniMaxAI/MiniMax-M3 \
--data-parallel-size 8 \
--enable-expert-parallel \
--block-size 128 \
--attention-backend TRITON_ATTN \
--mm-encoder-tp-mode data \
--mm-encoder-attn-backend ROCM_AITER_FA \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice
FP8 KV Cache
Add --kv-cache-dtype fp8 to any command for ~1.5× the KV pool — lossless
in our testing across the full native context. Especially worth it for high
concurrency or long context, where KV is the binding constraint.
Context Length & GPU Memory
The full 1M-token window (context_length: 1048576) needs a large KV
cache. To save GPU memory, you can optionally cap the context with
--max-model-len:
vllm serve MiniMaxAI/MiniMax-M3 \
--tensor-parallel-size 8 \
--block-size 128 \
--max-model-len 131072 # 128K instead of the full 1M
AMD ROCm notes: Native context is 512K. To go past it, supply a YaRN rope_scaling on the text config (a top-level override silently misses the decoder's config) and allow the long max length. TP=8 + fp8 KV is the practical combo at 1M:
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
vllm serve MiniMaxAI/MiniMax-M3 \
--block-size 128 \
--kv-cache-dtype fp8 \
--tensor-parallel-size 8 \
--max-model-len 1048576 \
--attention-backend TRITON_ATTN \
--mm-encoder-tp-mode data \
--mm-encoder-attn-backend ROCM_AITER_FA \
--tool-call-parser minimax_m3 \
--enable-auto-tool-choice \
--reasoning-parser minimax_m3 \
--hf-overrides '{"text_config":{"rope_scaling":{"rope_type":"yarn","factor":2.0,"original_max_position_embeddings":524288}}}'
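The YaRN factor in the command above is the ratio of the target window to the native one (1048576 / 524288 = 2.0). The following is a minimal sketch that builds the same --hf-overrides JSON for an arbitrary target length. The helper yarn_overrides is hypothetical, and the native length of 524288 is taken from the AMD ROCm note above.

```python
import json
import shlex

NATIVE_CONTEXT = 524288  # native context on AMD ROCm, per the note above


def yarn_overrides(target_len: int, native_len: int = NATIVE_CONTEXT) -> str:
    """Return the JSON string for --hf-overrides that extends the context with YaRN."""
    if target_len <= native_len:
        raise ValueError("target_len is within the native context; no override needed")
    rope_scaling = {
        "rope_type": "yarn",
        "factor": target_len / native_len,
        "original_max_position_embeddings": native_len,
    }
    # Nest under text_config: a top-level rope_scaling would miss the decoder's config.
    return json.dumps({"text_config": {"rope_scaling": rope_scaling}}, separators=(",", ":"))


# shlex.quote wraps the JSON in single quotes so it survives the shell.
print("--hf-overrides", shlex.quote(yarn_overrides(1048576)))
```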
- Set --max-model-len to the longest prompt + output you actually serve (e.g. 32768, 131072, 262144). A smaller window frees KV-pool headroom for higher concurrency and lets the model fit on fewer GPUs; if you need the full 1M window, consider scaling out with multi-node TP instead.
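One way to apply this advice is to round the longest expected prompt + output up to the next window size. The sketch below assumes a hypothetical helper, pick_max_model_len, and uses the example windows from the text plus the full 1M window as candidates; any other set of candidates works the same way.

```python
# Candidate windows: the examples from the text, plus the full 1M window.
WINDOWS = [32768, 131072, 262144, 1048576]


def pick_max_model_len(longest_prompt: int, longest_output: int) -> int:
    """Return the smallest candidate window that holds prompt + output tokens."""
    needed = longest_prompt + longest_output
    for window in WINDOWS:
        if needed <= window:
            return window
    raise ValueError(f"{needed} tokens exceeds the 1M-token window")


# 100000 prompt tokens + 8192 output tokens = 108192, which fits in 131072.
print(pick_max_model_len(100_000, 8_192))
```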
Client Usage
Recommended sampling parameters (from the model card):
- temperature = 1.0
- top_p = 0.95
- top_k = 40
Default system prompt:
You are a helpful assistant. Your name is MiniMax-M3 and is built by MiniMax.
Example chat request:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MiniMaxAI/MiniMax-M3",
"temperature": 1.0,
"top_p": 0.95,
"messages": [
{"role": "system", "content": "You are a helpful assistant. Your name is MiniMax-M3 and is built by MiniMax."},
{"role": "user", "content": "Explain MSA sparse attention in 3 bullets."}
]
}'
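The curl request above sets temperature and top_p but not the recommended top_k = 40. The following is a minimal sketch of the same request in Python with top_k included. It assumes the server accepts top_k as an extra sampling field in the JSON body; build_chat_payload and post_chat are hypothetical helpers that use only the standard library.

```python
import json
import urllib.request

SYSTEM_PROMPT = "You are a helpful assistant. Your name is MiniMax-M3 and is built by MiniMax."


def build_chat_payload(user_text: str) -> dict:
    """Build a chat request with the sampling parameters recommended above."""
    return {
        "model": "MiniMaxAI/MiniMax-M3",
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 40,  # assumption: accepted by the server as an extra sampling field
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_text},
        ],
    }


def post_chat(payload: dict, base_url: str = "http://localhost:8000/v1") -> dict:
    """POST the payload to a running server and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a server running, post_chat(build_chat_payload("Explain MSA sparse attention in 3 bullets.")) returns the full response, and the reply text sits under choices[0].message.content.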
Thinking Modes
M3 reasoning is controlled by thinking_mode, which takes one of three values:
- enabled: the model thinks before every response, including after tool results. Use for complex reasoning and agents.
- disabled: no thinking; the model answers directly. Use for latency-sensitive turns.
- adaptive (default when unset): the model decides whether to think based on the task.
Pass it per request through chat_template_kwargs. The same value also tunes
the minimax_m3 reasoning parser, so reasoning_content and content are
split correctly in every mode.
# Start the MiniMax-M3 model by referring to the command above first.
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M3",
    messages=[{"role": "user", "content": "Prove there are infinitely many primes."}],
    extra_body={"chat_template_kwargs": {"thinking_mode": "enabled"}},
)
msg = response.choices[0].message
# vLLM exposes the <mm:think> block as `reasoning` (the older
# `reasoning_content` field is deprecated but still aliased).
print(getattr(msg, "reasoning", None) or getattr(msg, "reasoning_content", None))
print(msg.content) # the final answer
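Because the mode is set per request, a service can choose it from what it knows about each call. The sketch below is one possible policy, not something the model card prescribes: pick_mode and thinking_extra_body are hypothetical helpers, and the mapping simply follows the usage guidance for each value.

```python
THINKING_MODES = ("enabled", "disabled", "adaptive")


def pick_mode(agentic: bool = False, latency_sensitive: bool = False) -> str:
    """Choose a thinking_mode from two request traits (a hypothetical policy)."""
    if agentic:
        return "enabled"  # complex reasoning and agent loops
    if latency_sensitive:
        return "disabled"  # answer directly, no thinking
    return "adaptive"  # let the model decide


def thinking_extra_body(mode: str) -> dict:
    """Return the extra_body dict that selects a thinking mode for one request."""
    if mode not in THINKING_MODES:
        raise ValueError(f"thinking_mode must be one of {THINKING_MODES}, got {mode!r}")
    return {"chat_template_kwargs": {"thinking_mode": mode}}


print(thinking_extra_body(pick_mode(latency_sensitive=True)))
```

The result plugs into the client call as extra_body=thinking_extra_body(...). Validating the value locally catches typos before they reach the server.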
Benchmarking
vllm bench serve \
--backend vllm \
--model MiniMaxAI/MiniMax-M3 \
--endpoint /v1/completions \
--dataset-name random \
--random-input-len 2048 \
--random-output-len 1024 \
--max-concurrency 10 \
--num-prompts 100
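A single run at concurrency 10 gives one point on the throughput curve. The following is a minimal sketch that generates the same command for several concurrency levels, keeping the ratio of 10 prompts per concurrent slot used above. The helper bench_args and the chosen levels are hypothetical, and the sketch uses the full flag spellings --random-input-len, --random-output-len and --num-prompts.

```python
import shlex


def bench_args(concurrency: int, prompts_per_slot: int = 10) -> list[str]:
    """Return the argv for one `vllm bench serve` run at the given concurrency."""
    return [
        "vllm", "bench", "serve",
        "--backend", "vllm",
        "--model", "MiniMaxAI/MiniMax-M3",
        "--endpoint", "/v1/completions",
        "--dataset-name", "random",
        "--random-input-len", "2048",
        "--random-output-len", "1024",
        "--max-concurrency", str(concurrency),
        "--num-prompts", str(concurrency * prompts_per_slot),
    ]


# Print one command per concurrency level; run them in sequence against the server.
for concurrency in (1, 10, 50):
    print(shlex.join(bench_args(concurrency)))
```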
Quantized Variant (MXFP8)
MiniMaxAI/MiniMax-M3-MXFP8
is an MXFP8 checkpoint quantized by NVIDIA from the original BF16 weights, at
roughly half the VRAM of the BF16 release. Select the mxfp8 variant above,
or pass the repo id directly to vllm serve:
vllm serve MiniMaxAI/MiniMax-M3-MXFP8 \
--tensor-parallel-size 8 \
--block-size 128 \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice
For best MXFP8 throughput, prefer Blackwell (B200/B300) for native MX tensor cores, or AMD CDNA4 (MI350X/MI355X, gfx950) for native MXFP8 matrix cores.
Quantized Variant (NVFP4, Blackwell)
nvidia/MiniMax-M3-NVFP4 is
an NVFP4 checkpoint quantized by NVIDIA — roughly 1/4 the VRAM of the BF16
release, so the 427B model fits comfortably on a single Blackwell node (B200 /
B300) with KV-cache headroom. Select the nvfp4 variant above, or pass the
repo id directly to vllm serve.
vLLM support is in-flight. MiniMax-M3 NVFP4 needs the modelopt NVFP4 path added in vLLM PR #46380, which is not yet merged. Until it lands in a release, build vLLM from that branch (or a nightly once merged); a stock build will not recognise the NVFP4 quant config.
vllm serve nvidia/MiniMax-M3-NVFP4 \
--tensor-parallel-size 8 \
--block-size 128 \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice
Add --enable-expert-parallel (TP+EP) or --data-parallel-size 8 --enable-expert-parallel (DP+EP) to scale across the node, exactly as for the
BF16/MXFP8 commands above. For text-only serving, add --language-model-only
to skip the vision encoder and free VRAM for KV cache.
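The "roughly half" and "roughly 1/4" VRAM figures quoted for the MXFP8 and NVFP4 checkpoints follow from the bytes stored per weight. The sketch below is a back-of-the-envelope estimate only: it assumes 2, 1 and 0.5 bytes per parameter for BF16, MXFP8 and NVFP4, and ignores block scales, layers kept in higher precision, activations and the KV cache.

```python
TOTAL_PARAMS = 427e9  # total parameters, from the model description

# Assumed storage cost per weight; block scales and any layers kept in
# higher precision are ignored, so real checkpoints are somewhat larger.
BYTES_PER_PARAM = {"bf16": 2.0, "mxfp8": 1.0, "nvfp4": 0.5}


def weight_gb(fmt: str, params: float = TOTAL_PARAMS) -> float:
    """Estimate weight memory in GB (1 GB = 1e9 bytes) for a checkpoint format."""
    return params * BYTES_PER_PARAM[fmt] / 1e9


for fmt in BYTES_PER_PARAM:
    print(f"{fmt}: ~{weight_gb(fmt):.0f} GB of weights")
```

Dividing the estimate by the number of GPUs gives the per-GPU weight share under tensor parallelism; whatever remains of each GPU's memory is what is left for the KV cache.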
NVFP4 + EAGLE3 spec decoding (MTP)
The NVFP4 target pairs with the same EAGLE3 draft head as the other variants. Enable the Spec decoding feature above, or append the draft config to the command:
vllm serve nvidia/MiniMax-M3-NVFP4 \
--tensor-parallel-size 8 \
--block-size 128 \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice \
--speculative-config '{"method": "eagle3", "model": "Inferact/MiniMax-M3-EAGLE3", "num_speculative_tokens": 3, "attention_backend": "FLASH_ATTN"}'
Troubleshooting
- --block-size mismatch. MSA's sparse block size is 128; the vLLM KV cache block size must match. Using the default (16) misaligns the sparse attention indexing (on AMD it crashes with No common block size for 16).
- Parsers. --tool-call-parser and --reasoning-parser both use minimax_m3, which is distinct from the minimax_m2 parser used by earlier releases.
- Long context KV cache. See Context Length & GPU Memory above: cap --max-model-len or scale to multi-node TP if you OOM.
- Vision encoder. The encoder is small, so at high TP the Encoder Parallel option runs it data-parallel (--mm-encoder-tp-mode data) to avoid TP comm overhead; it also turns on the vision-encoder attention backend (FlashInfer on NVIDIA, --mm-encoder-attn-backend FLASHINFER; AITER FlashAttention on AMD, ROCM_AITER_FA) and the host-shared-memory multimodal processor cache (--mm-processor-cache-type shm). For text-only workloads enable Text only (--language-model-only) to skip loading the encoder and free VRAM; it is mutually exclusive with Encoder Parallel.
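Several of the items above can be caught before the server starts. The following is a minimal sketch of a preflight check over a vllm serve argument list. The helper preflight is hypothetical, it only understands the space-separated form (--flag value, not --flag=value), and it encodes just the rules stated in this section.

```python
def preflight(argv: list[str]) -> list[str]:
    """Return a list of problems found in a `vllm serve` argument list."""
    problems = []

    def value_of(flag: str):
        # Value following `flag`, or None when the flag is absent or last.
        if flag in argv and argv.index(flag) + 1 < len(argv):
            return argv[argv.index(flag) + 1]
        return None

    if value_of("--block-size") != "128":
        problems.append("--block-size must be 128 to match MSA's sparse block size")
    for flag in ("--tool-call-parser", "--reasoning-parser"):
        if value_of(flag) != "minimax_m3":
            problems.append(f"{flag} should be minimax_m3")
    if "--language-model-only" in argv and value_of("--mm-encoder-tp-mode") == "data":
        problems.append("--language-model-only and --mm-encoder-tp-mode data are mutually exclusive")
    return problems


# A command that omits --block-size and both parsers reports three problems.
for problem in preflight(["vllm", "serve", "MiniMaxAI/MiniMax-M3", "--tensor-parallel-size", "8"]):
    print(problem)
```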