moonshotai/Kimi-K2.5
Open-source native multimodal agentic MoE model with vision-language understanding, tool calling, and thinking modes
Multimodal agentic MoE model with DeepSeek-V3 backbone and MLA attention
Guide
Overview
Kimi K2.5 is an open-source, native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base. It combines vision and language understanding with agentic capabilities, offers both instant and thinking modes, and supports both conversational and agentic paradigms.
Prerequisites
- vLLM version: >= 0.15.0 (speculative decoding with Eagle3 requires >= 0.18.0)
- Hardware (BF16): 8x H200 GPUs (verified), or equivalent aggregate VRAM (~640 GB)
- Hardware (NVFP4): 4x Blackwell GPUs (e.g. GB200)
- AMD support: 8x MI300X / MI325X / MI355X with ROCm 7.2.1 and Python 3.12
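The two version thresholds above can be checked before launching the server. The following is a minimal sketch, not part of vLLM: the helper names `release_tuple`, `meets`, and `check_installed_vllm` are hypothetical, and the comparison looks only at the numeric release segment, so pre-release and local tags such as `rc1` or `+rocm` are ignored.

```python
import re
from importlib import metadata

# Thresholds copied from the prerequisites list.
MIN_VLLM = "0.15.0"
MIN_VLLM_EAGLE3 = "0.18.0"


def release_tuple(version: str) -> tuple[int, ...]:
    """Return the leading numeric release segment, e.g. '0.18.1rc1' -> (0, 18, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets(installed: str, required: str) -> bool:
    """Compare release segments only, padding the shorter one with zeros."""
    a, b = release_tuple(installed), release_tuple(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_installed_vllm() -> None:
    """Print which of the two thresholds the installed vLLM satisfies."""
    try:
        installed = metadata.version("vllm")
    except metadata.PackageNotFoundError:
        print("vllm is not installed in this environment")
        return
    print(f"vllm {installed}: base requirement met = {meets(installed, MIN_VLLM)}")
    print(f"vllm {installed}: Eagle3 requirement met = {meets(installed, MIN_VLLM_EAGLE3)}")
```

Calling `check_installed_vllm()` inside the virtual environment reports both checks. For strict PEP 440 ordering, including pre-releases, the `packaging` library is the more robust choice.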
Install vLLM
Pip (NVIDIA):
uv venv
source .venv/bin/activate
uv pip install vllm --torch-backend auto
Pip (AMD ROCm):
uv venv --python 3.12
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm
Docker (NVIDIA):
docker pull vllm/vllm-openai:latest
AMD MI300X/MI325X
On 8x MI300X or MI325X (gfx942), use the standard W4A16 MoE path with AITER
and INT4 QuickReduce.
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4
vllm serve moonshotai/Kimi-K2.5 \
--host 0.0.0.0 \
--port 8000 \
--trust-remote-code \
--tensor-parallel-size 8 \
--tool-call-parser kimi_k2 \
--enable-auto-tool-choice \
--reasoning-parser kimi_k2 \
--mm-encoder-tp-mode data
AMD MI350X/MI355X
On 8x MI350X or MI355X (gfx950), add --moe-backend flydsl to use the
optimized FlyDSL W4A16 MoE kernel. Keep LoRA disabled for this path.
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4
vllm serve moonshotai/Kimi-K2.5 \
--tensor-parallel-size 8 \
--trust-remote-code \
--mm-encoder-tp-mode data \
--moe-backend flydsl \
--compilation-config '{"pass_config": {"fuse_allreduce_rms": false}}'
Notes:
- The FlyDSL INT4 MoE path does not support expert parallelism; do not add --enable-expert-parallel.
- Keep --compilation-config '{"pass_config": {"fuse_allreduce_rms": false}}'; it is required for this FlyDSL path on MI350X / MI355X.
- vLLM has tuned MI350X/MI355X FlyDSL configs for this Kimi shape at TP=8 and TP=4.
- Keep vLLM's default block size unless you are tuning long-context throughput; --block-size 64 is safe to try.
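The difference between the two AMD launch commands can be captured in a small launcher script. This is a sketch under stated assumptions: `build_serve_command`, `as_shell`, `ROCM_ENV`, and `FLYDSL_FLAGS` are hypothetical names, only the flags common to both commands are emitted by default, and other flags (for example the tool-call and reasoning parsers) are left to the caller through `extra_args`.

```python
import shlex

MODEL = "moonshotai/Kimi-K2.5"

# Environment variables exported before both AMD launch commands.
ROCM_ENV = {
    "VLLM_ROCM_USE_AITER": "1",
    "VLLM_ROCM_QUICK_REDUCE_QUANTIZATION": "INT4",
}

# Extra flags used only on MI350X / MI355X (gfx950), per the notes.
FLYDSL_FLAGS = [
    "--moe-backend", "flydsl",
    "--compilation-config", '{"pass_config": {"fuse_allreduce_rms": false}}',
]


def build_serve_command(gfx_arch, tensor_parallel_size=8, extra_args=()):
    """Return the `vllm serve` argument list for gfx942 or gfx950."""
    if gfx_arch not in ("gfx942", "gfx950"):
        raise ValueError(f"unsupported architecture for this sketch: {gfx_arch}")
    extra = list(extra_args)
    cmd = [
        "vllm", "serve", MODEL,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--trust-remote-code",
        "--mm-encoder-tp-mode", "data",
    ]
    if gfx_arch == "gfx950":
        # The FlyDSL INT4 MoE path does not support expert parallelism.
        if "--enable-expert-parallel" in extra:
            raise ValueError("--enable-expert-parallel is not supported with FlyDSL")
        cmd += FLYDSL_FLAGS
    return cmd + extra


def as_shell(cmd):
    """Render the argument list as a single shell-quoted command line."""
    return shlex.join(cmd)
```

To launch, merge `ROCM_ENV` into a copy of `os.environ` and pass it with the argument list to `subprocess.run`; `as_shell(build_serve_command("gfx950"))` gives a copy-pasteable command line for inspection.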
Client Usage
Once the vLLM server is running, consume it via the OpenAI-compatible API:
import time

from openai import OpenAI

# Placeholder API key; base_url points at the local vLLM server.
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
    timeout=3600,
)

# One user turn: an image part followed by a text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
                },
            },
            {
                "type": "text",
                "text": "Read all the text in the image.",
            },
        ],
    }
]

# Time the full (non-streaming) request.
start = time.time()
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=messages,
    max_tokens=2048,
)
print(f"Response time: {time.time() - start:.2f}s")
print(f"Generated text: {response.choices[0].message.content}")
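The example above references the image by a remote URL. When the image is a local file, a common pattern with OpenAI-compatible servers is to embed it as a base64 data URL in the same `image_url` field. The sketch below assumes the server accepts data URLs; the helper names `image_bytes_to_data_url` and `local_image_part` are hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def image_bytes_to_data_url(data: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def local_image_part(path: str) -> dict:
    """Build an `image_url` content part from a local image file."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from: {path}")
    data = Path(path).read_bytes()
    return {
        "type": "image_url",
        "image_url": {"url": image_bytes_to_data_url(data, mime_type)},
    }
```

The returned dictionary can replace the `image_url` entry in the `content` list of the request. Base64 encoding increases the payload by roughly a third, so large images lead to noticeably larger requests.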
Troubleshooting
- OOM errors: Lower --gpu-memory-utilization or adjust TP/EP to match your GPU count.
- Vision encoder performance: Use --mm-encoder-tp-mode data to run the vision encoder in data-parallel mode. The encoder is small, so TP adds communication overhead with little gain.
- Unique multimodal inputs: Pass --mm-processor-cache-gb 0 to avoid caching overhead. For repeated inputs, --mm-processor-cache-type shm uses host shared memory for better performance at high TP settings.
- MoE kernel tuning: Use the benchmark_moe script from vLLM to tune Triton kernels for your specific hardware.
- Async scheduling: Enabled by default for better throughput. Disable if you encounter issues, and file a bug report to vLLM.