vllm - ✅(Solved) Fix [Performance]: Deepseek-V4 Support and Optimization on ROCm Backend [11 pull requests, 3 comments, 3 participants]

GitHub stats
vllm-project/vllm#41820 (fetched 2026-05-07 03:32:44)
Comments: 3 · Participants: 3 · Timeline: 21 · Reactions: 3
Timeline (top): mentioned ×7, subscribed ×7, commented ×3, labeled ×2

PR fix notes

PR #40871: [New Model][ROCm] Add AMD support for DeepSeek V4

Description (problem / solution / changelog)

Purpose

This PR adds support of DeepSeek V4 for AMD.

Test Plan

Test Result

docker image: docker pull rocm/vllm-dev:deepseek-v4-mi35x
machine: mi355x
environment setting:

# enter docker, do:
pip uninstall vllm
git clone https://github.com/vllm-project/vllm.git
cd vllm
git fetch origin pull/40871/head:pr_dsv4
git checkout pr_dsv4
python3 setup.py develop

Deepseek-V4-Flash

Launch command:

max_num_seqs=16
max_num_batched_tokens=1024
tensor_parallel_size=4
export VLLM_TORCH_PROFILER_DIR="/app/vllm_profile"
export HF_HOME=/data/huggingface-cache
export VLLM_ROCM_USE_AITER=1

MODEL=/home/models/DeepSeek-V4-Flash
vllm serve ${MODEL} \
    --host localhost \
    --port 8001 \
    --dtype auto \
    --tensor-parallel-size ${tensor_parallel_size} \
    --max-num-seqs ${max_num_seqs} \
    --distributed-executor-backend mp \
    --trust-remote-code \
    --profiler-config '{"profiler": "torch", "torch_profiler_dir": "./vllm_profile"}' \
    --gpu-memory-utilization 0.35 \
    --moe-backend "triton_unfused" \
    --tokenizer-mode "deepseek_v4" \
    --async-scheduling \
    --enforce-eager

Full GSM8K accuracy result:

MODEL=/home/models/DeepSeek-V4-Flash
lm_eval --model local-completions --model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=4,max_retries=10,max_gen_toks=2048,timeout=60000 --batch_size auto --tasks gsm8k --num_fewshot 8 --output_path . 2>&1 | tee -a eval.log


local-completions ({'model': '/home/models/DeepSeek-V4-Flash', 'base_url': 'http://0.0.0.0:8001/v1/completions', 'num_concurrent': 4, 'max_retries': 10, 'max_gen_toks': 2048, 'timeout': 60000}), gen_kwargs: ({}), limit: None, num_fewshot: 8, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.9439|±  |0.0063|
|     |       |strict-match    |     8|exact_match|↑  |0.9431|±  |0.0064|
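
As a sanity check on the Stderr column, the reported values match the standard error of a mean of 0/1 scores. This is a minimal sketch, assuming the full GSM8K test split of 1,319 problems (that count is not stated in the PR) and the plain binomial formula; lm_eval's own estimator may differ slightly in its denominator.

```python
import math

N_GSM8K_TEST = 1319  # assumption: size of the GSM8K test split, not stated in the PR


def binomial_stderr(accuracy: float, n: int) -> float:
    """Standard error of the mean of n pass/fail scores with the given accuracy."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


# Accuracies reported for DeepSeek-V4-Flash in the table above.
for acc in (0.9439, 0.9431):
    print(f"{acc:.4f} +/- {binomial_stderr(acc, N_GSM8K_TEST):.4f}")
```

Under that assumption the printed errors round to the same four decimals as the table, which suggests the evaluation ran on the whole test split rather than a subset.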

Deepseek-V4-Pro:

offline test recipe:

import os
os.environ["VLLM_ROCM_USE_AITER"] = "1"
os.environ["VLLM_ROCM_USE_AITER_LINEAR"] = "1"
from vllm import LLM, SamplingParams

if __name__ == "__main__":

    prompts = ["What is 2+2? Answer:", "The capital of France is "]
    sampling_params = SamplingParams(temperature=0, top_p=1, max_tokens=20)

    llm = LLM(
        model="/home/models/DeepSeek-V4-Pro",
        tensor_parallel_size=8,
        kv_cache_dtype="fp8",
        gpu_memory_utilization=0.6,
        async_scheduling=True,
        enforce_eager=True,
        disable_log_stats=False,
        tokenizer_mode="deepseek_v4",
        moe_backend="triton_unfused",
        # seed=0,
        reasoning_parser="deepseek_v4",
    )
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        token_ids = output.outputs[0].token_ids
        print(
            f"Prompt: {prompt!r}, Generated text: {generated_text!r}, "
            f"Token ids: {token_ids}"
        )

launch_server.sh

max_num_seqs=128
max_num_batched_tokens=8192
tensor_parallel_size=8
export HF_HOME=/data/huggingface-cache
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_LINEAR=1
rm -rf /root/.cache/vllm/torch_compile_cache

MODEL=/home/models/DeepSeek-V4-Pro
vllm serve ${MODEL} \
    --host localhost \
    --port 8001 \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size ${tensor_parallel_size} \
    --max-num-seqs ${max_num_seqs} \
    --distributed-executor-backend mp \
    --trust-remote-code \
    --gpu-memory-utilization 0.6 \
    --moe-backend "triton_unfused" \
    --enforce-eager \
    --tokenizer-mode "deepseek_v4" \
    --async-scheduling \
    --reasoning-parser "deepseek_v4"

Full GSM8K test result:

MODEL=/home/models/DeepSeek-V4-Pro
lm_eval --model local-completions --model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=2,max_retries=10,max_gen_toks=2048,timeout=60000 --batch_size auto --tasks gsm8k --num_fewshot 8 --output_path . 2>&1 | tee -a eval.log


local-completions ({'model': '/home/models/DeepSeek-V4-Pro', 'base_url': 'http://0.0.0.0:8001/v1/completions', 'num_concurrent': 2, 'max_retries': 10, 'max_gen_toks': 2048, 'timeout': 60000}), gen_kwargs: ({}), limit: None, num_fewshot: 8, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.9538|±  |0.0058|
|     |       |strict-match    |     8|exact_match|↑  |0.9545|±  |0.0057|

<details> <summary> Essential Elements of an Effective PR Description Checklist </summary>
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
</details>

Changed files

  • CMakeLists.txt (modified, +6/-6)
  • csrc/fused_deepseek_v4_qnorm_rope_kv_insert_kernel.cu (modified, +36/-2)
  • csrc/moe/topk_softplus_sqrt_kernels.cu (modified, +32/-21)
  • csrc/moe/torch_bindings.cpp (modified, +1/-2)
  • csrc/torch_bindings.cpp (modified, +0/-2)
  • requirements/rocm.txt (modified, +3/-0)
  • tests/kernels/moe/test_topk_softplus_sqrt.py (modified, +4/-2)
  • vllm/config/kernel.py (modified, +2/-0)
  • vllm/model_executor/kernels/linear/scaled_mm/aiter.py (modified, +15/-0)
  • vllm/model_executor/layers/activation.py (modified, +3/-1)
  • vllm/model_executor/layers/deepseek_compressor.py (modified, +3/-2)
  • vllm/model_executor/layers/deepseek_v4_attention.py (modified, +73/-19)
  • vllm/model_executor/layers/fused_moe/oracle/mxfp4.py (modified, +79/-2)
  • vllm/model_executor/layers/mhc.py (modified, +105/-2)
  • vllm/model_executor/layers/quantization/utils/fp8_utils.py (modified, +9/-0)
  • vllm/model_executor/layers/sparse_attn_indexer.py (modified, +22/-8)
  • vllm/model_executor/models/deepseek_v4.py (modified, +6/-1)
  • vllm/model_executor/models/deepseek_v4_mtp.py (modified, +6/-2)
  • vllm/platforms/rocm.py (modified, +1/-0)
  • vllm/v1/attention/backends/mla/sparse_swa.py (modified, +2/-1)
  • vllm/v1/attention/ops/deepseek_v4_ops/fused_inv_rope_fp8_quant.py (modified, +3/-1)
  • vllm/v1/attention/ops/rocm_aiter_mla_sparse.py (modified, +528/-60)

PR #41136: [ROCm] DeepSeekV4-Flash-Base model enablement on ROCm with triton & torchfallback

Description (problem / solution / changelog)

Purpose

This PR enables running the DeepSeekV4-Flash-Base model (FP8) on ROCm with Triton and torch fallbacks. The major changes are:

  • Quantization whitelist: registers deepseek_v4_fp8.
  • FP8 MoE experts (only experts_dtype=FP8 is supported for now).
  • MHC: the current implementation uses TileLang kernels. This PR adds a fallback to a naive torch implementation; TileLang or an equivalent will be enabled in later PRs.
  • FP8 blockscale einsum: falls back to torch dequantization plus torch.einsum instead of deep_gemm.
  • TopK Softplus SQRT (CUDA) function: falls back to a naive torch softplus + topk + renorm.
  • Router GEMM BF16 FP32: currently falls back to torch.linear.
  • Sparse Attention Indexer (Skip Insert): custom op rocm_sparse_attn_indexer_no_insert.
  • Flash MLA sparse fwd/decode: adds a temporary fallback, rocm_flash_mla_sparse.py, with Triton kernels.
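
To illustrate the dequantization step behind the FP8 blockscale einsum fallback, here is a minimal pure-Python sketch of block-wise dequantization. It is not the PR's code: the function name, the list-of-lists layout, and the tiny 2×2 blocks are all hypothetical, and the real path works on torch tensors with larger blocks. The only idea shown is that every element in a block is multiplied by that block's single scale before the regular einsum runs.

```python
def dequant_blockwise(q, scale, block_n, block_k):
    """Multiply each element of q by the scale of the block it falls in.

    q      : 2D list of already-decoded quantized values (rows x cols)
    scale  : 2D list with one scale per (block_n x block_k) block
    """
    rows, cols = len(q), len(q[0])
    return [
        [q[i][j] * scale[i // block_n][j // block_k] for j in range(cols)]
        for i in range(rows)
    ]


# Toy example: a 2x4 matrix split into two 2x2 blocks along the columns.
q = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
scale = [[0.5, 2.0]]  # hypothetical per-block scales
w = dequant_blockwise(q, scale, block_n=2, block_k=2)
print(w)
```

A fallback of this shape trades speed for portability: the dequantized weights are materialized in higher precision, so it is mainly useful for bring-up and correctness checks.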

Test Plan

Test Result

Server command

MODEL_DIR=/models/DSV4-Flash-Base
## clone from https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-Base

export VLLM_ENGINE_READY_TIMEOUT_S=3600
export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0

vllm serve ${MODEL_DIR} \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --max-model-len 800000 \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size 8 \
  --max-num-seqs 512 \
  --max-num-batched-tokens 512 \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --enforce-eager \
  --kernel-config '{"moe_backend":"triton"}' \
  "${EXTRA_ARGS[@]}"

Curl commands & results

curl -sS -X POST http://localhost:8000/v1/completions   -H 'Content-Type: application/json'   -d "{
    \"model\": \"$MODEL\",
    \"prompt\": \"The capital of France is\",
    \"max_tokens\": 8,
    \"temperature\": 0
  }" | python3 -m json.tool
curl -s http://0.0.0.0:8000/v1/completions   -H 'Content-Type: application/json'   -d '{"model":"/shared_inference/models_blog/DeepSeek-V4-Flash-
       "prompt":"Q: 17 * 23 = \nA:", "max_tokens":12, "temperature":0}'   | jq -r '.choices[0].text'

GSM8K Results

lm_eval --model local-completions \
    --tasks gsm8k \
    --model_args model=/models/DeepSeek-V4-Flash-FP8/,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False

Result

2026-05-06:11:36:32 INFO [loggers.evaluation_tracker:119] Saving per-task samples to eval_results/gsm8k_20260506_105215/datasets__DeepSeek-V4-Flash-Base/*.jsonl
local-completions ({'model': '/datasets/DeepSeek-V4-Flash-Base/', 'base_url': 'http://0.0.0.0:8000/v1/completions', 'num_concurrent': 64, 'max_retries': 3, 'tokenized_requests': False, 'tokenizer_backend': None, 'max_gen_toks': 1024}), gen_kwargs: ({}), limit: None, num_fewshot: 5, batch_size: auto

|Tasks|Version|     Filter     |n-shot|  Metric   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|0.9242|±  |0.0073|
|     |       |strict-match    |     5|exact_match|0.9249|±  |0.0073|

SUCCESS. Results in ./eval_results/gsm8k_20260506_105215



Changed files

  • vllm/config/compilation.py (modified, +1/-0)
  • vllm/distributed/kv_transfer/kv_connector/v1/moriio/moriio_connector.py (modified, +106/-13)
  • vllm/model_executor/layers/deepseek_compressor.py (modified, +6/-3)
  • vllm/model_executor/layers/deepseek_v4_attention.py (modified, +211/-10)
  • vllm/model_executor/layers/fused_moe/router/fused_topk_bias_router.py (modified, +111/-11)
  • vllm/model_executor/layers/mhc.py (modified, +160/-20)
  • vllm/model_executor/layers/sparse_attn_indexer.py (modified, +33/-4)
  • vllm/model_executor/layers/utils.py (modified, +10/-1)
  • vllm/model_executor/models/deepseek_v4.py (modified, +8/-4)
  • vllm/platforms/rocm.py (modified, +1/-0)
  • vllm/triton_utils/__init__.py (modified, +33/-1)
  • vllm/utils/deep_gemm.py (modified, +8/-1)
  • vllm/v1/attention/ops/deepseek_v4_ops/fused_inv_rope_fp8_quant.py (modified, +4/-2)
  • vllm/v1/attention/ops/flashmla.py (modified, +25/-0)
  • vllm/v1/attention/ops/rocm_flash_mla_sparse.py (added, +648/-0)
  • vllm/v1/attention/ops/rocm_sparse_attn_indexer.py (added, +549/-0)

PR #41451: [ROCm][Deepseekv4] DeepseekV4 Mi300 support

Description (problem / solution / changelog)

Purpose

This PR is based on PR https://github.com/vllm-project/vllm/pull/41217 and https://github.com/vllm-project/vllm/pull/40871, and will be reformatted after those two PRs are merged.
machine: mi308
test script:

max_num_seqs=16
max_num_batched_tokens=1024
tensor_parallel_size=4
export VLLM_TORCH_PROFILER_DIR="/app/vllm_profile"
export HF_HOME=/data/huggingface-cache
export VLLM_ROCM_USE_AITER=1

MODEL=/mnt/data/pretrained_model/deepseek-ai/DeepSeek-V4-Flash
vllm serve ${MODEL} \
    --host localhost \
    --dtype auto \
    --tensor-parallel-size ${tensor_parallel_size} \
    --max-num-seqs ${max_num_seqs} \
    --trust-remote-code \
    --profiler-config '{"profiler": "torch", "torch_profiler_dir": "./vllm_profile", "torch_profiler_with_stack": "False"}' \
    --gpu-memory-utilization 0.35 \
    --moe-backend "triton_unfused" \
    --tokenizer-mode "deepseek_v4" \
    --async-scheduling \
    --enforce-eager

request:

curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
  "prompt": "Write me a poem about AMD and Deepseek",
  "max_tokens": 100,
  "temperature": 0.0
}'

response:

{"id":"cmpl-b180b64df0a5a360","object":"text_completion","created":1777619440,"model":"/mnt/data/pretrained_model/deepseek-ai/DeepSeek-V4","choices":[{"index":0,"text":"\", \"role\": \"user\" }, { \"content\": \"Here is a poem about AMD and DeepSeek.\\n\\n**The Silicon and the Spark**\\n\\nIn Santa Clara's sunlit halls, where silicon dreams are spun,\\nA titan works on tiny things, beneath the desert sun.\\nThey craft the threads of logic, a digital tapestry,\\nTo weave the future's canvas, for all the world to see.\\n\\nBut far across the ocean, in","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null}],"service_tier":null,"system_fingerprint":"vllm-0.20.1rc1.dev137+gdde2fb080.d20260501-tp4-795d0827","usage":{"prompt_tokens":9,"total_tokens":109,"completion_tokens":100,"prompt_tokens_details":null},"kv_transfer_params":null}

A more thorough test will follow once the prerequisite PRs are merged.

Test Plan

Test Result



Changed files

  • CMakeLists.txt (modified, +6/-6)
  • csrc/fused_deepseek_v4_qnorm_rope_kv_insert_kernel.cu (modified, +46/-2)
  • csrc/moe/topk_softplus_sqrt_kernels.cu (modified, +32/-21)
  • csrc/moe/torch_bindings.cpp (modified, +1/-2)
  • csrc/torch_bindings.cpp (modified, +0/-2)
  • docs/design/attention_backends.md (modified, +1/-1)
  • requirements/rocm.txt (modified, +3/-0)
  • tests/kernels/moe/test_topk_softplus_sqrt.py (modified, +4/-2)
  • vllm/config/kernel.py (modified, +2/-0)
  • vllm/model_executor/kernels/linear/scaled_mm/aiter.py (modified, +15/-0)
  • vllm/model_executor/layers/activation.py (modified, +3/-1)
  • vllm/model_executor/layers/deepseek_compressor.py (modified, +51/-2)
  • vllm/model_executor/layers/deepseek_v4_attention.py (modified, +455/-20)
  • vllm/model_executor/layers/fused_moe/oracle/mxfp4.py (modified, +77/-2)
  • vllm/model_executor/layers/mhc.py (modified, +41/-0)
  • vllm/model_executor/layers/quantization/utils/fp8_utils.py (modified, +27/-3)
  • vllm/model_executor/layers/sparse_attn_indexer.py (modified, +22/-8)
  • vllm/model_executor/models/deepseek_v2.py (modified, +38/-23)
  • vllm/model_executor/models/deepseek_v4.py (modified, +6/-1)
  • vllm/model_executor/models/deepseek_v4_mtp.py (modified, +6/-2)
  • vllm/platforms/rocm.py (modified, +1/-0)
  • vllm/utils/deep_gemm.py (modified, +5/-1)
  • vllm/v1/attention/backends/mla/indexer.py (modified, +1/-1)
  • vllm/v1/attention/backends/mla/rocm_aiter_mla.py (modified, +4/-0)
  • vllm/v1/attention/backends/mla/rocm_aiter_mla_sparse.py (modified, +227/-29)
  • vllm/v1/attention/backends/mla/sparse_swa.py (modified, +2/-1)
  • vllm/v1/attention/ops/deepseek_v4_ops/cache_utils.py (modified, +8/-2)
  • vllm/v1/attention/ops/deepseek_v4_ops/fused_inv_rope_fp8_quant.py (modified, +3/-1)
  • vllm/v1/attention/ops/rocm_aiter_mla_sparse.py (modified, +149/-59)

PR #41312: [Bugfix][DeepSeek V4] Enable cross-node TP=16 FP8 serving

Description (problem / solution / changelog)

Purpose

This PR makes cross-node TP=16 serving of DeepSeek V4 (Pro / Flash) work end-to-end on FP8 checkpoints. Two independent issues block it today, and both are addressed here. They are bundled because Fix B was discovered only after Fix A alone proved insufficient; keeping them together preserves the full reproduction trail for reviewers.

Issue A — FP8 weight loading fails at TP=16

DeepSeek V4 has moe_intermediate_size = 3072 and ships with a [128, 128] FP8 block-quant scheme. At TP=16 the per-rank input dim becomes 3072 / 16 = 192, which is not divisible by block_k = 128, so model load fails with:

ValueError: Weight input_size_per_partition = 192
is not divisible by weight quantization block_k = 128.

This is the same class of bug that #34408 fixes for EXAONE4-32B-FP8 and that #36853 reports for Qwen3-Coder-Next-FP8. TP ≤ 8 is unaffected (3072 / 8 = 384, divisible by 128), which is why the problem only surfaces on 2-node deployments — exactly the topology that #36836 (RayExecutorV2, merged) was designed to enable.
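
The arithmetic behind Issue A can be checked directly. The sketch below uses the values quoted above (moe_intermediate_size = 3072, block_k = 128); the helper name is hypothetical and only mirrors the divisibility condition in the error message, not vLLM's actual validation code.

```python
MOE_INTERMEDIATE_SIZE = 3072  # from the PR description
BLOCK_K = 128                 # FP8 block-quant K dimension


def is_block_aligned(intermediate_size: int, tp_size: int, block_k: int = BLOCK_K) -> bool:
    """True when the per-rank input size is a whole number of quantization blocks."""
    assert intermediate_size % tp_size == 0, "size must shard evenly across ranks"
    return (intermediate_size // tp_size) % block_k == 0


for tp in (1, 2, 4, 8, 16):
    per_rank = MOE_INTERMEDIATE_SIZE // tp
    print(f"TP={tp:2d}  per-rank={per_rank:4d}  aligned={is_block_aligned(MOE_INTERMEDIATE_SIZE, tp)}")
```

Only TP=16 yields a per-rank size (192) that is not a multiple of 128, which matches the observation that single-node TP ≤ 8 deployments never hit the error.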

Fix A — load-time intermediate-size padding. Mirror the EXAONE4 pattern: pad moe_intermediate_size to the smallest multiple of TP × lcm(block_n, block_k) (4096 at TP=16) and zero-pad the affected gate_proj / up_proj / down_proj weights and their weight_scale_inv tensors at load time. SwiGLU preserves zero-in → zero-out, so no activation mask is needed.

Padding is only applied when:

  • quant_config is Fp8Config with weight_block_size set, AND
  • tp_size × lcm(block_n, block_k) does not already divide the original size.

So existing TP ≤ 8 deployments and non-FP8 quant configs (e.g. routed MXFP4 experts on V4 Flash) are pass-through unchanged.
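
The padded size follows from the rule stated above: round up to the smallest multiple of TP × lcm(block_n, block_k). A minimal sketch follows; the function name is hypothetical, and it deliberately omits the Fp8Config / weight_block_size check that the PR applies before padding.

```python
import math


def padded_intermediate_size(size: int, tp_size: int, block_n: int = 128, block_k: int = 128) -> int:
    """Round size up to the next multiple of tp_size * lcm(block_n, block_k)."""
    align = tp_size * math.lcm(block_n, block_k)
    if size % align == 0:
        return size  # already aligned: pass-through, no padding
    return ((size + align - 1) // align) * align


print(padded_intermediate_size(3072, tp_size=16))  # alignment unit is 16 * 128 = 2048
print(padded_intermediate_size(3072, tp_size=8))   # alignment unit is 8 * 128 = 1024
```

At TP=16 this gives the 4096 quoted in the description, while TP=8 leaves 3072 untouched because 1024 already divides it.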

A subtlety worth flagging: a naive append-only pad places all the zero blocks on the highest TP ranks, which leaves several ranks holding an entirely-zero shared-expert shard at TP=16 (24 → 32 blocks across 16 ranks ⇒ 8 ranks fully zero). We use a balanced TP-block layout that spreads the original 24 blocks evenly across the 16 ranks (most ranks get 1 real block + 1 zero block, the "extra" 8 real blocks distributed every other rank), so no rank ends up with a fully-zero expert shard. This is the change in commit 2.
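
One way to produce such a balanced layout is sketched below. This is a hypothetical reconstruction, not the PR's _balanced_tp_block_indices: it assumes each rank owns a contiguous run of padded_blocks / tp slots and uses ceiling arithmetic to decide how many real blocks each rank receives. For 24 real blocks in 32 slots over 16 ranks it reproduces the [0,1,2,4,5,6,8,9,10,...] pattern that the unit tests in the test plan assert.

```python
def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b


def balanced_block_slots(orig_blocks: int, padded_blocks: int, tp: int) -> list[int]:
    """Return the padded-layout slot index for each real block, spread across ranks."""
    assert padded_blocks % tp == 0 and padded_blocks >= orig_blocks
    per_rank = padded_blocks // tp
    slots = []
    for rank in range(tp):
        # Number of real blocks for this rank; the counts sum to orig_blocks.
        real = ceil_div((rank + 1) * orig_blocks, tp) - ceil_div(rank * orig_blocks, tp)
        start = rank * per_rank
        slots.extend(range(start, start + real))  # remaining slots stay zero
    return slots


slots = balanced_block_slots(orig_blocks=24, padded_blocks=32, tp=16)
print(slots[:9])
```

Every rank ends up with at least one real block, so no rank holds an all-zero shard; the unfilled slots are the zero padding.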

Memory cost: ~505 MiB / GPU at TP=16, ~7.9 GiB across the model. Acceptable to unlock the deployment.

Issue B — Cross-node TP=16 reasoning produces garbage output

Even with Fix A applied, reasoning_effort=max + long system prompt + temp=1.0 produces mode-collapsed multilingual token soup with leaked <|begin▁of▁sentence|> control tokens on TP=16. Critically, the same regression reproduces at temp=0: 3 trials of the same prompt yield 3 different outputs, with one or two collapsing into garbage. TP=8 single-node is byte-equal stable across trials.

Root cause: vllm/utils/multi_stream_utils.py:maybe_execute_in_parallel dispatches q_proj / kv-compressor / indexer GEMMs onto a default + auxiliary CUDA stream (used by deepseek_v4_attention.attn_gemm_parallel_execute). Stream-completion order is non-deterministic across forward passes, perturbing FP accumulation downstream at the bit level. Short generations are robust to this jitter, but during the long thinking phase produced by reasoning_effort=max, the cumulative perturbation eventually swaps top-1 ↔ top-2 once and the generation diverges into out-of-distribution tokens.
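
The sensitivity to accumulation order described above comes from floating-point addition not being associative. The snippet below is a generic illustration in plain Python doubles, not a reproduction of the GPU kernels involved: it only shows that summing the same three numbers in a different order can change the last bit of the result.

```python
def sum_in_order(values):
    """Accumulate left to right, as a fixed execution order would."""
    total = 0.0
    for v in values:
        total += v
    return total


a = sum_in_order([0.1, 0.2, 0.3])  # (0.1 + 0.2) + 0.3
b = sum_in_order([0.2, 0.3, 0.1])  # (0.2 + 0.3) + 0.1
print(a == b)
print(repr(a), repr(b))
```

A one-bit difference like this is harmless for most tokens, but when two candidate tokens have nearly equal logits it can flip the argmax, after which the generations diverge.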

Fix B — opt-in deterministic mode. Add VLLM_DETERMINISTIC_AUX_STREAM (default OFF). When set, both maybe_execute_in_parallel and execute_in_parallel force aux_stream / aux_streams = None, falling back to sequential execution.

  • Default behavior is unchanged — single-node TP ≤ 8 users see no difference.
  • Cross-node TP operators set VLLM_DETERMINISTIC_AUX_STREAM=1 to trade a small concurrent-throughput cost (~5%) for reproducible logits.
  • The flag lives in vllm/utils/multi_stream_utils.py, so every call site that goes through these helpers (not just DeepSeek V4) gets the determinism toggle for free.
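
The gating described above can be sketched in a few lines. This is an assumption-laden illustration: the helper names are hypothetical, and the exact parsing of VLLM_DETERMINISTIC_AUX_STREAM in vllm/envs.py may differ (here only the string "1" enables it). The point is that the flag works by discarding the auxiliary stream, so the existing sequential fallback runs.

```python
import os


def deterministic_aux_stream_enabled(env=None) -> bool:
    """Hypothetical check: treat the flag as on only when it is set to '1'."""
    env = os.environ if env is None else env
    return env.get("VLLM_DETERMINISTIC_AUX_STREAM", "0") == "1"


def select_aux_stream(aux_stream, env=None):
    """Return None (forcing sequential execution) when determinism is requested."""
    return None if deterministic_aux_stream_enabled(env) else aux_stream


# "aux" stands in for a real CUDA stream object.
print(select_aux_stream("aux", env={}))
print(select_aux_stream("aux", env={"VLLM_DETERMINISTIC_AUX_STREAM": "1"}))
```

Putting the decision in one helper is what lets every caller of the multi-stream utilities inherit the toggle without per-model changes.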

Negative results that informed Fix B

For reviewer context, here is what we tried before landing on Fix B:

  • Padding form alone is not enough. Both v1 (append-only) and v2 (LCM-balanced TP-block spread) padding resolve the load-time ValueError but do not resolve the reasoning_effort=max regression on cross-node TP=16.
  • Switching base image is not enough either. Building from vLLM main nightly instead of the deepseekv4-cu130 image does not fix the regression, and additionally introduces an unrelated ScalarType 44 (FP8 e8m0fnu) crash for cross-node TP=16 + UE8M0, so nightly is not a viable workaround at the moment.
  • temp=0 self-inconsistency at TP=16 (with multi-stream ON) was the conclusive signal that the regression is a numerical-determinism issue, not a sampling or padding issue.

Related work

  • #34408 — EXAONE4 padding fix (same class as Fix A, open)
  • #36853 — Qwen3-Coder-Next-FP8 TP=8 error (related class, open)
  • #36836 — RayExecutorV2 (merged, enables the 2-node topology)
  • #38164 — RayExecutorV2 + EEP (future work; out of scope here)

Out of scope

  • Expert Parallelism (--enable-expert-parallel) for DeepSeek V4 — separate workstream tracked in #38164.
  • Pipeline Parallelism — DeepSeek V4 currently does not implement SupportsPP.
  • B300 SM_120 sparse MLA — tracked in #40991; does not affect the SM_100 path used by B300 SXM6.
  • CUDA-graph compilation of the 1.6T-parameter Pro variant — torch.compile OOMs the 1.9 TiB host RAM under current main. Left for future work; this PR validates only --enforce-eager.

Test Plan

Unit tests

tests/models/test_deepseek_v4_padding.py (new, 17 tests) covers:

  • _padded_moe_intermediate_size — only pads on FP8 + misaligned TP, otherwise pass-through (verified at TP=1/8/16, MXFP4, None).
  • _pad_deepseek_v4_tensor — value preservation, fill behavior, refuses truncation.
  • _balanced_tp_block_indices — exact 24→32 layout at TP=16, asserts the expected [0,1,2,4,5,6,8,9,10,...] pattern that prevents all-zero shards.
  • Round-trip linear-equivalence: padded gate_up @ down produces the same output as the original on the unpadded region.
  • Loader-level: shared-expert and routed-expert FP8 / weight_scale_inv / E8M0-scale padding on both DeepseekV4Model and DeepSeekV4MTP.
  • Construction contract: DeepseekV4MLP raises the original 192 ValueError on TP=16 without padding, and constructs cleanly with the padded size; DeepseekV4MoE preserves expert_dtype="fp4" (routed stays at 3072) vs "fp8" (routed padded to 4096) routing.
  • DeepseekV4FP8Config dispatch correctly routes expert_dtype="fp4" to Mxfp4MoEMethod and "fp8" to Fp8MoEMethod.
pytest tests/models/test_deepseek_v4_padding.py -v

End-to-end (2× B300 SXM6, RoCEv2)

Cluster: 2 nodes × 8× B300 SXM6 (16 GPUs total), CUDA 13.0, NCCL 2.28.9, Ray 2.49.0, RoCEv2 over enp83s0f1np1.

# Per-node ray bring-up (head + worker; standard --add-host b300-01/02 + GLOO/NCCL_SOCKET_IFNAME), then on head:
docker exec -d \
  -e VLLM_USE_RAY_V2_EXECUTOR_BACKEND=1 \
  -e VLLM_DETERMINISTIC_AUX_STREAM=1 \
  -e RAY_ADDRESS=10.1.130.11:6379 \
  -e GLOO_SOCKET_IFNAME=enp83s0f1np1 \
  -e NCCL_SOCKET_IFNAME=enp83s0f1np1 \
  ray-head bash -c 'vllm serve deepseek-ai/DeepSeek-V4-Pro \
    --trust-remote-code \
    --tensor-parallel-size 16 \
    --distributed-executor-backend ray \
    --kv-cache-dtype fp8 \
    --block-size 256 \
    --enforce-eager \
    --tokenizer-mode deepseek_v4 \
    --tool-call-parser deepseek_v4 \
    --enable-auto-tool-choice \
    --reasoning-parser deepseek_v4 \
    --port 8000'

Verification:

  1. Health: curl http://<head>:8000/v1/models returns the model id.
  2. Functional: chat completion "What is 17 × 19?" returns 323.
  3. Numerical: byte-equality vs TP=8 single-node baseline at temp=0 on 6 representative prompts.
  4. Reasoning regression: original failure case reasoning_effort=max + long system + temp=1.0, 5 trials.
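
Verification steps 3 and 4 amount to comparing raw output bytes across runs. A minimal sketch of such a check is shown below; the helper names are hypothetical and the PR does not show its own comparison script. Hashing each response makes it easy to count how many distinct variants a set of trials produced.

```python
import hashlib
from collections import Counter


def output_hash(text: str) -> str:
    """Stable fingerprint of a completion's exact bytes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def hash_variants(outputs) -> Counter:
    """Count how many trials produced each distinct output."""
    return Counter(output_hash(o) for o in outputs)


def is_byte_equal(candidate: str, baseline: str) -> bool:
    return candidate.encode("utf-8") == baseline.encode("utf-8")


# Toy data standing in for three temp=0 trials of the same prompt.
stable = hash_variants(["323", "323", "323"])
unstable = hash_variants(["323", "323", "3 2 3 ???"])
print(len(stable), len(unstable))
```

At temp=0 a deterministic server should yield exactly one variant per prompt; more than one points to numerical nondeterminism rather than sampling.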

Test Result

Before this PR

ValueError: Weight input_size_per_partition = 192
is not divisible by weight quantization block_k = 128.

Model fails to load. (After local Fix A only, without Fix B: model loads, but the reasoning regression reported below is reproducible.)

After this PR (Fix A + Fix B with VLLM_DETERMINISTIC_AUX_STREAM=1)

Correctness — TP=16 vs TP=8 byte-equality at temp=0

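|Prompt|Bytes equal?|
|------|------------|
|대한민국의 수도는? (What is the capital of South Korea?)|yes|
|대한민국 헌법 제1조의 내용을 그대로 인용하세요. (Quote Article 1 of the Constitution of the Republic of Korea verbatim.)|yes|
|What is the capital of France?|yes|
|Compute 17 * 19.|yes|
|Python one-liner sum 1..100|yes|
|Bob/Alice age reasoning puzzle|yes|
|Overall|6 / 6|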

Self-consistency on TP=16 at temp=0 (3 trials × 4 prompts): all stable. Without Fix B (env var unset, default), the reasoning prompt was unstable across trials (2 hash variants per 3 trials, occasional garbage).

Reasoning regression — reasoning_effort=max + long system + temp=1.0, 5 trials

Result
Without Fix B: 3/3 garbage (mode-collapse, <|begin▁of▁sentence|> leak)
With Fix B: 5/5 clean responses (correct multilingual answers + tool calls)

Performance — single request, 256-token completion, --enforce-eager

|Setup|TTFT median|Decode tok/s|
|-----|----------:|-----------:|
|TP=8 single-node|276 ms|4.0|
|TP=16 (multi-stream ON)|310 ms|3.7|
|TP=16 + Fix A + Fix B|307 ms|3.7|

Performance — concurrent, 100 prompts, RPS=8, max_concurrency=32, --enforce-eager

|Setup|Output tok/s|Total tok/s|TTFT median|TPOT median|
|-----|-----------:|----------:|----------:|----------:|
|TP=8 single-node|78.0|389.8|4625 ms|310 ms|
|TP=16 (multi-stream ON)|64.4|321.9|14720 ms|335 ms|
|TP=16 + Fix A + Fix B|60.9|304|15922 ms|372 ms|

Fix B costs ~5% concurrent throughput vs the multi-stream baseline at TP=16. Cross-node TP=16 is communication-bound (RoCEv2 ~50 GB/s vs intra-node NVLink ~900 GB/s), so the value of this PR is enabling a single endpoint that serves 16 GPUs of capacity, not raw token-rate gain over single-node TP=8. With CUDA-graph enabled (future work — currently OOMs torch.compile host RAM), expect 5–10× decode throughput.
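
The "~5%" figure can be reproduced from the concurrent benchmark's output throughput (64.4 tok/s with multi-stream ON, 60.9 tok/s with Fix A + Fix B, 78.0 tok/s for single-node TP=8). The helper below is just a worked percentage calculation.

```python
def pct_change(new: float, baseline: float) -> float:
    """Relative change of new versus baseline, in percent."""
    return 100.0 * (new - baseline) / baseline


fix_b_cost = pct_change(60.9, 64.4)      # Fix B vs multi-stream baseline at TP=16
cross_node_gap = pct_change(64.4, 78.0)  # TP=16 baseline vs single-node TP=8
print(f"Fix B: {fix_b_cost:.1f}%  cross-node vs TP=8: {cross_node_gap:.1f}%")
```

The determinism flag accounts for a much smaller share of the slowdown than the cross-node topology itself, which is consistent with the communication-bound explanation above.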


<details> <summary>Essential Elements of an Effective PR Description Checklist</summary>
  • The purpose of the PR — see "Issue A" and "Issue B" above.
  • The test plan — unit (pytest tests/models/test_deepseek_v4_padding.py -v) and 2-node B300 e2e command provided.
  • The test results — correctness (6/6 byte-equal vs TP=8), regression (5/5 clean with Fix B vs 3/3 garbage without), and single/concurrent performance tables.
  • (Optional) Documentation — VLLM_DETERMINISTIC_AUX_STREAM is documented inline in vllm/envs.py and in the docstrings of maybe_execute_in_parallel / execute_in_parallel.
</details>

Changed files

  • tests/models/test_deepseek_v4_padding.py (added, +571/-0)
  • vllm/envs.py (modified, +8/-0)
  • vllm/model_executor/models/deepseek_v4.py (modified, +341/-1)
  • vllm/model_executor/models/deepseek_v4_mtp.py (modified, +33/-0)
  • vllm/utils/multi_stream_utils.py (modified, +15/-0)

PR #41352: [feature][WIP] Enable KV Offload for DeepSeek V4 model

Description (problem / solution / changelog)

[feature][WIP] Enable KV Offload for DeepSeek V4 model

Summary

This PR makes the v1 OffloadingConnector advertise SupportsHMA and handle the scheduler's all-KV-group request-finish callback. This is the remaining connector facade needed for grouped KV offload support when the scheduler passes tuple[list[int], ...] block IDs for multiple KV cache groups.

The implementation is backend-neutral. It does not add Ascend imports, torch_npu, DSv4-specific branches, or VLLM_ASCEND_* gates.

Existing generic grouped-KV pieces in this branch

  • SupportsHMA is already defined in vllm/distributed/kv_transfer/kv_connector/v1/base.py.
  • GPULoadStoreSpec already has typed group_sizes and block_indices fields.
  • offloading/scheduler.py already tracks RequestOffloadState per KV group.
  • The offloading scheduler already uses make_offload_key(..., group_idx) for group-aware offload keys.
  • Load/store metadata already carries grouped GPU block IDs through group_sizes and block_indices.

Changes

  • Make OffloadingConnector inherit SupportsHMA.
  • Add OffloadingConnector.request_finished_all_groups(...) and delegate to the existing scheduler finish path.
  • Widen the offloading scheduler finish type annotation so the connector can pass either the legacy single-group list or the HMA all-group tuple.
  • Add unit coverage for the connector facade so the class is recognized as HMA-capable and forwards all-group block IDs unchanged.

Validation

Intended focused tests:

pytest -q tests/v1/kv_connector/unit/offloading_connector/test_connector.py
pytest -q tests/v1/kv_connector/unit/offloading_connector/test_scheduler.py

In this local environment, pytest collection currently requires missing optional test/runtime dependencies (tblib, then gguf on direct import). Syntax-level checks were used locally until the full vLLM test environment is available.

Follow-up

The hardware backend remains out of scope for this PR. DSv4 compressed KV registration, NPU-visible host memory, and A3 launch/runtime validation belong in the paired vllm-ascend change.

Changed files

  • tests/v1/kv_connector/unit/offloading_connector/test_connector.py (added, +26/-0)
  • vllm/distributed/kv_transfer/kv_connector/v1/offloading/scheduler.py (modified, +1/-1)
  • vllm/distributed/kv_transfer/kv_connector/v1/offloading_connector.py (modified, +10/-1)

PR #41276: [WIP] [DSV4] Quantization Support

Description (problem / solution / changelog)

DeepSeek-V4-Flash-NVFP4-FP8

Model Optimizations

This model was obtained by using the following branch with LLM Compressor: https://github.com/vllm-project/llm-compressor/pull/2647

Deployment

vllm serve RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 --tensor-parallel-size 4 --port 8089 --kv_cache_dtype="fp8"

Evaluation

python tests/evals/gsm8k/gsm8k_eval.py
Results:
Accuracy: 0.910
Invalid responses: 0.000
Total latency: 173.006 s
Questions per second: 7.624
Total output tokens: 116217
Output tokens per second: 671.752

For more details on how this model was created and run in LLM Compressor, please contact Kyle Sayers on the vLLM Slack: https://communityinviter.com/apps/vllm-dev/join-vllm-developers-slack

Changed files

  • vllm/model_executor/layers/deepseek_compressor.py (modified, +1/-1)
  • vllm/model_executor/layers/deepseek_v4_attention.py (modified, +12/-12)
  • vllm/model_executor/models/deepseek_v4.py (modified, +9/-2)

PR #41601: DeepSeekv4 ROCm Optimization

Description (problem / solution / changelog)

Purpose

  • [Fix][Rocm] Handle DeepSeek-V4 UE8M0 and FP8 dtype: Decode UE8M0 scale tensors before using them in ROCm fallback paths so E8M0 scales are not multiplied directly as float8 tensors. Use the current platform FP8 dtype for DeepSeek-V4 indexer and inverse-RoPE quantization, and select the Triton FNUZ FP8 type when required by the ROCm platform.
  • [Feat][Rocm] Add DeepSeek-V4 sparse FlashMLA fallback: Route DeepSeek-V4 sparse FlashMLA prefill and decode calls through ROCm fallback implementations so ROCm can share the same FlashMLA API path as CUDA.
  • [Fix][Rocm] Preserve FP4 parameter dtype for AITER MXFP4 MoE: Wrap shuffled AITER MXFP4 weights as fresh Parameters so FP4 dtype metadata is preserved without changing non-ROCm MoE backend routing.
  • [BugFix][Attention] Fix NaN in Triton merge_attn_states when both LSEs are -inf: When both prefix and suffix have no tokens (e.g. chunked prefill with zero context length), prefix_lse and suffix_lse are both -inf. Per IEEE 754, -inf - (-inf) = NaN, which propagates through exp and division into the final output of the Triton merge_attn_states kernel.
  • [Fix][Rocm] Add generic fp8_einsum fallback for DeepGEMM: Provide a ROCm-only torch fallback for fp8_einsum when DeepGEMM is unavailable while preserving the existing CUDA and non-ROCm dispatch behavior.
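
The UE8M0 item above depends on how those scale bytes are encoded. As a minimal sketch, assuming the OCP Microscaling E8M0 layout (one unsigned exponent byte, bias 127, the all-ones byte reserved for NaN), decoding can be written in plain Python as follows. The helper name `decode_ue8m0` is hypothetical and this is not the PR's code, which works on tensors rather than Python lists.

```python
import math

E8M0_BIAS = 127  # exponent bias (assumed: OCP MX E8M0 layout)
E8M0_NAN = 0xFF  # all-ones byte is reserved for NaN in that layout


def decode_ue8m0(raw_bytes):
    """Decode UE8M0 scale bytes into Python floats (hypothetical helper)."""
    scales = []
    for b in raw_bytes:
        if not 0 <= b <= 0xFF:
            raise ValueError(f"not a byte: {b}")
        if b == E8M0_NAN:
            scales.append(math.nan)
        else:
            # No sign bit and no mantissa: the value is exactly 2**(b - bias).
            scales.append(math.ldexp(1.0, b - E8M0_BIAS))
    return scales


print(decode_ue8m0([127, 128, 126, 130]))  # → [1.0, 2.0, 0.5, 8.0]
```

The byte 127 decodes to a scale of 1.0 here, while the same bit pattern read as a float8 value means something else entirely, which is why the scales have to be decoded before they are multiplied in.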

(This PR was originally based on https://github.com/vllm-project/vllm/pull/40871. PR 40871 has since been merged, so this PR is now based on upstream commit c7aa186d67b6f051680831418e957c67f34ba7a2.)
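
The merge_attn_states NaN described in the Purpose list can be reproduced without a GPU. The sketch below is a scalar, pure-Python stand-in for the Triton kernel, assuming the usual log-sum-exp merge of two partial attention outputs. The function signature, the `guard` flag, and the choice to return zeros with an LSE of -inf are illustrative assumptions, not necessarily what the PR does.

```python
import math

NEG_INF = float("-inf")


def merge_attn_states(prefix_out, prefix_lse, suffix_out, suffix_lse, guard=True):
    """Merge two partial attention outputs using their log-sum-exp values."""
    max_lse = max(prefix_lse, suffix_lse)
    if guard and max_lse == NEG_INF:
        # Neither part attended to any token. Emitting zeros is an assumed
        # convention for this sketch.
        return [0.0] * len(prefix_out), NEG_INF
    # Without the guard, -inf - (-inf) evaluates to NaN on the next two lines.
    p_weight = math.exp(prefix_lse - max_lse)
    s_weight = math.exp(suffix_lse - max_lse)
    denom = p_weight + s_weight
    merged = [
        (p * p_weight + s * s_weight) / denom
        for p, s in zip(prefix_out, suffix_out)
    ]
    return merged, max_lse + math.log(denom)


# Both parts empty: the unguarded path yields NaN, the guarded path does not.
print(merge_attn_states([0.0, 0.0], NEG_INF, [0.0, 0.0], NEG_INF, guard=False))
print(merge_attn_states([0.0, 0.0], NEG_INF, [0.0, 0.0], NEG_INF, guard=True))
```

When only one side is -inf the unguarded arithmetic is already safe, because exp(-inf) is 0 and the other side keeps a weight of 1. Only the case where both sides are empty needs special handling.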

Test Plan

Test Result

docker image: docker pull rocm/vllm-dev:deepseek-v4-mi35x
machine: mi355x
aiter version: d2454ad18a0d7c7795162ab0f550e8a0397840bd (https://github.com/ROCm/aiter main branch)
vllm version: this PR

server command

max_num_seqs=128
max_num_batched_tokens=8192
tensor_parallel_size=8
export HF_HOME=/data/huggingface-cache
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_LINEAR=1
rm -rf /root/.cache/vllm/torch_compile_cache

MODEL=DeepSeek-V4-Pro
vllm serve ${MODEL} \
    --host localhost \
    --port 8001 \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size ${tensor_parallel_size} \
    --max-num-seqs ${max_num_seqs} \
    --distributed-executor-backend mp \
    --trust-remote-code \
    --gpu-memory-utilization 0.6 \
    --moe-backend "triton_unfused" \
    --tokenizer-mode "deepseek_v4" \
    --async-scheduling \
    --reasoning-parser "deepseek_v4" \
    --kv-cache-dtype fp8_e4m3 \
    --compilation-config '{"mode":3,"cudagraph_mode":1,"cudagraph_capture_sizes":[1,2,4,8]}'

client command

#!/usr/bin/env bash
set -e
export PYTHONPATH=/opt/aiter:/opt/aiter/aiter/jit/utils:${PYTHONPATH}
PORT=${PORT:-8001}
MODEL=${MODEL:-DeepSeek-V4-Pro}
NUM_PROMPTS=${NUM_PROMPTS:-10}
CONCURRENCY=${CONCURRENCY:-2}
INPUT_LEN=${INPUT_LEN:-10240}
OUTPUT_LEN=${OUTPUT_LEN:-512}
TS=$(date +%Y%m%d_%H%M%S)
LOG=/opt/scripts/vllm/logs/client_${TS}.log

mkdir -p /opt/scripts/vllm/logs

echo ""
echo "========== Sanity Check: Single Chat Completion =========="
curl -s "http://localhost:${PORT}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"${MODEL}\",\"messages\":[{\"role\":\"user\",\"content\":\"What is 15% of 240? Answer concisely.\"}],\"max_tokens\":64}" \
  | python3 -m json.tool \
  | tee -a "$LOG"

echo ""
echo ""
vllm bench serve \
  --base-url "http://localhost:${PORT}" \
  --model "${MODEL}" --tokenizer "${MODEL}" \
  --dataset-name random \
  --random-input-len "${INPUT_LEN}" \
  --random-output-len "${OUTPUT_LEN}" \
  --num-prompts "${NUM_PROMPTS}" --max-concurrency "${CONCURRENCY}" \
  --num-warmups 1 \
  --save-result \
  --result-filename "/opt/scripts/vllm/logs/bench_${TS}.json" \
  2>&1 | tee -a "$LOG"

echo ""
echo "[$(date)] Benchmark complete. Log: $LOG"

Result

============ Serving Benchmark Result ============
Successful requests:                     10
Failed requests:                         0
Maximum request concurrency:             2
Benchmark duration (s):                  777.06
Total input tokens:                      102400
Total generated tokens:                  5120
Request throughput (req/s):              0.01
Output token throughput (tok/s):         6.59
Peak output token throughput (tok/s):    8.00
Peak concurrent requests:                4.00
Total token throughput (tok/s):          138.37
---------------Time to First Token----------------
Mean TTFT (ms):                          5943.68
Median TTFT (ms):                        5062.24
P99 TTFT (ms):                           8489.66
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          292.40
Median TPOT (ms):                        290.75
P99 TPOT (ms):                           299.07
---------------Inter-token Latency----------------
Mean ITL (ms):                           292.40
Median ITL (ms):                         289.34
P99 ITL (ms):                            301.06
==================================================
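
The throughput lines in this report follow directly from the token counts and the benchmark duration. A short check, using only the numbers printed above (the variable names are illustrative):

```python
# Values copied from the benchmark report above.
duration_s = 777.06
total_input_tokens = 102400
total_generated_tokens = 5120
successful_requests = 10

request_throughput = successful_requests / duration_s
output_throughput = total_generated_tokens / duration_s
total_throughput = (total_input_tokens + total_generated_tokens) / duration_s

print(f"{request_throughput:.2f} req/s")   # → 0.01 req/s
print(f"{output_throughput:.2f} tok/s")    # → 6.59 tok/s
print(f"{total_throughput:.2f} tok/s")     # → 138.37 tok/s
```

Total token throughput counts the 10240-token prompts as well as the 512 generated tokens per request, so it is dominated by input tokens. The output-only figure of 6.59 tok/s is the one that reflects decode speed at a concurrency of 2.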

accuracy command

MODEL=${MODEL:-DeepSeek-V4-Pro}
lm_eval --model local-completions --model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=2,max_retries=10,max_gen_toks=2048,timeout=60000 --batch_size auto --tasks gsm8k --num_fewshot 8 --output_path .  2>&1 | tee -a eval.log

Result

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.9553|±  |0.0057|
|     |       |strict-match    |     8|exact_match|↑  |0.9560|±  |0.0056|

<details> <summary> Essential Elements of an Effective PR Description Checklist </summary>
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
</details>

Changed files

  • tests/kernels/attention/test_merge_attn_states.py (modified, +69/-0)
  • vllm/model_executor/layers/deepseek_v4_attention.py (modified, +10/-45)
  • vllm/model_executor/layers/fused_moe/oracle/mxfp4.py (modified, +30/-12)
  • vllm/model_executor/layers/quantization/utils/fp8_utils.py (modified, +9/-7)
  • vllm/model_executor/layers/quantization/utils/w8a8_utils.py (modified, +10/-2)
  • vllm/utils/deep_gemm.py (modified, +84/-0)
  • vllm/v1/attention/backends/mla/sparse_swa.py (modified, +1/-2)
  • vllm/v1/attention/ops/deepseek_v4_ops/fused_indexer_q.py (modified, +13/-4)
  • vllm/v1/attention/ops/deepseek_v4_ops/fused_inv_rope_fp8_quant.py (modified, +6/-2)
  • vllm/v1/attention/ops/flashmla.py (modified, +11/-3)
  • vllm/v1/attention/ops/rocm_aiter_mla_sparse.py (modified, +14/-16)
  • vllm/v1/attention/ops/rocm_flash_mla_sparse.py (added, +682/-0)
  • vllm/v1/attention/ops/triton_merge_attn_states.py (modified, +13/-1)

PR #41374: [DSV4] Avoid redundant dtype conversion.

Description (problem / solution / changelog)

Purpose

Test Plan

Test Result



Changed files

  • vllm/model_executor/models/deepseek_v4.py (modified, +11/-6)

PR #41263: [DSV4] Fuse norm and router for low latency scenario

Description (problem / solution / changelog)

Purpose

Test Plan

Test Result



Changed files

  • CMakeLists.txt (modified, +10/-0)
  • benchmarks/kernels/benchmark_norm_router_gemm.py (added, +183/-0)
  • csrc/moe/dsv4_norm_router_gemm.h (added, +30/-0)
  • csrc/moe/dsv4_norm_router_gemm_entry.cu (added, +130/-0)
  • csrc/moe/dsv4_norm_router_gemm_kernel.cu (added, +249/-0)
  • csrc/moe/moe_ops.h (modified, +8/-0)
  • csrc/moe/torch_bindings.cpp (modified, +6/-0)
  • vllm/_custom_ops.py (modified, +30/-0)
  • vllm/model_executor/layers/fused_moe/router/norm_gate_linear.py (added, +114/-0)
  • vllm/model_executor/models/deepseek_v4.py (modified, +44/-41)
  • vllm/model_executor/models/deepseek_v4_mtp.py (modified, +11/-1)

Motivation

This issue tracks the end-to-end enablement and optimization checklist for DeepSeek-V4 on the ROCm backend.

DeepSeek-V4 includes multiple critical blocks (mHC/HCA/CSA/MoE/MTP), and ROCm readiness depends on both model-side kernels and system-side runtime behavior.

We're launching a joint effort to optimize DeepSeek-V4 on the ROCm backend. Please feel free to take on any task; we'd love to hear more optimization ideas.

Purpose

  • Track DeepSeek-V4 functionality and performance readiness on the ROCm backend.
  • Keep module-level optimization items visible and actionable.
  • Align acceptance criteria for release and production readiness.

Recipe

General Checklist

1) Functionality / Bugfix / Feature


Performance Checklist

1) High-Level Performance/Feature

2) Kernel Fusion

Element-wise Fusion

CSA

  • Optimize the sparse MLA indexer for DeepSeek-V4; the sparse attention indexer is currently a PyTorch-native implementation.
  • Replace the torch-native sparse MLA path with a Triton kernel (https://github.com/vllm-project/vllm/pull/41136).
  • Enable CSA multi-stream execution during decode (default stream + indexer stream) to overlap the indexer and main attention paths, aligned with the DeepSeek-V4 blog design.

mHC

MoE

  • Integrate the AITER FlyDSL MoE kernel once it is ready.
