Qwen/Qwen3.5-4B

Qwen3.5 compact dense multimodal model (4B). Fits on a single 16 GB consumer GPU with the full 262K context, or on one Xeon 6 NUMA node.

Consumer-GPU-friendly Qwen3.5 dense with MTP support


dense · 4B · 262,144 ctx · vLLM 0.17.0+ · multimodal · text

Guide

Overview

Qwen3.5-4B is the compact dense entry in the Qwen3.5 family. It shares the gated delta networks architecture, vision encoder, 262K context window, and MTP (multi-token prediction) decoding of its larger siblings, and is sized for 16 GB consumer GPUs.

Prerequisites

vLLM version: >= 0.17.0
Hardware: a single GPU with at least 16 GB of memory (e.g., RTX 4080, L4, A10, T4) or one Xeon 6 NUMA node
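
As a rough sanity check on the 16 GB figure, the sketch below estimates the memory taken by the weights alone. It assumes a nominal 4 billion parameters (the exact count is not stated here) and ignores the KV cache, activations, and CUDA graph overhead, so treat the result as a lower bound rather than a sizing guarantee. The helper name `weight_memory_gib` is illustrative.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


# Assumption: "4B" is treated as exactly 4e9 parameters.
for label, nbytes in [("bf16/fp16", 2), ("fp32", 4)]:
    print(f"{label}: {weight_memory_gib(4e9, nbytes):.2f} GiB")
```

At 16-bit precision the weights take roughly half of a 16 GB card, which leaves the remainder for the KV cache and runtime overhead.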

Install vLLM

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto

pip (Intel Xeon 6 CPUs)

For Intel and AMD x86 CPUs, follow the CPU pre-built wheels installation instructions.

Launching the Server

vllm serve Qwen/Qwen3.5-4B \
  --max-model-len 262144 \
  --reasoning-parser qwen3
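
Model loading can take a while, so scripts that start the server usually need to wait before sending requests. The following is a minimal sketch of a readiness poll. It assumes the server listens on the default port 8000 and exposes the `/health` endpoint of vLLM's OpenAI-compatible server; the function names, timeout, and interval are illustrative choices.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or non-2xx status: not ready yet.
        return False


def wait_until_ready(
    url: str = "http://localhost:8000/health",  # assumed default host and port
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Poll `url` until `probe` succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_ready()` after launching `vllm serve` blocks until the server reports healthy or the timeout expires. The `probe` parameter exists only so the loop can be exercised without a running server.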

Intel Xeon 6 Deployment via Docker

Launch the x86 CPU vLLM Docker container for Qwen/Qwen3.5-4B:

docker run -itd --name qwen4b-cpu \
  --network host \
  --shm-size 16g \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai-cpu:latest-x86_64 \
    --model Qwen/Qwen3.5-4B \
    --host 0.0.0.0 \
    --port 8000

MTP Speculative Decoding

vllm serve Qwen/Qwen3.5-4B \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --reasoning-parser qwen3
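
The `--speculative-config` value is a JSON string, which is easy to misquote when the command is assembled by a script. As one possible approach, the sketch below builds the same command with `json.dumps` and `shlex.join` so the quoting is handled automatically. The helper `build_serve_command` is hypothetical, and values of `num_speculative_tokens` other than 1 are shown only as a parameter, not as a recommendation.

```python
import json
import shlex


def build_serve_command(model: str, num_speculative_tokens: int = 1) -> str:
    """Return a shell-safe `vllm serve` command with MTP speculative decoding."""
    spec = {"method": "mtp", "num_speculative_tokens": num_speculative_tokens}
    argv = [
        "vllm", "serve", model,
        "--speculative-config", json.dumps(spec),  # JSON stays one argument
        "--reasoning-parser", "qwen3",
    ]
    return shlex.join(argv)


print(build_serve_command("Qwen/Qwen3.5-4B"))
```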

Client Usage

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-4B",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
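
Because the model is multimodal, the same endpoint can also take image input. The sketch below uses the OpenAI chat format with an `image_url` content part, which vLLM's OpenAI-compatible server accepts for vision models. The helper names are illustrative, and any image URL passed in is the caller's own; none is provided by this guide.

```python
def build_image_messages(prompt: str, image_url: str) -> list:
    """Build a single-turn chat message containing one image and one text part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }
    ]


def describe_image(
    image_url: str,
    prompt: str = "Describe this image.",
    base_url: str = "http://localhost:8000/v1",
    model: str = "Qwen/Qwen3.5-4B",
) -> str:
    """Send one image plus a prompt to a running vLLM server."""
    # Imported here so the message builder works without the openai package.
    from openai import OpenAI

    client = OpenAI(api_key="EMPTY", base_url=base_url)
    resp = client.chat.completions.create(
        model=model,
        messages=build_image_messages(prompt, image_url),
        max_tokens=128,
    )
    return resp.choices[0].message.content
```

With the server running, `describe_image("<URL of an image reachable from the server>")` returns the model's text reply.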

Troubleshooting

CUDA graph / Mamba cache size error: lower the value passed to --max-cudagraph-capture-size.
Disable reasoning: add --default-chat-template-kwargs '{"enable_thinking": false}' to the vllm serve command.
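
Reasoning can also be switched off for individual requests instead of server-wide. The sketch below passes `chat_template_kwargs` through the OpenAI client's `extra_body` parameter, which vLLM forwards to the chat template. It assumes the Qwen3.5 chat template honors `enable_thinking` per request in the same way as the server flag above; `chat_kwargs` is a hypothetical helper.

```python
def chat_kwargs(prompt: str, thinking: bool, model: str = "Qwen/Qwen3.5-4B") -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
        # vLLM-specific field; not part of the standard OpenAI API.
        "extra_body": {"chat_template_kwargs": {"enable_thinking": thinking}},
    }
```

Usage with the client from the Client Usage section: `client.chat.completions.create(**chat_kwargs("Hello!", thinking=False))`.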