vllm - ✅(Solved) Fix [Performance]: Deepseek-V4 Support and Optimization on ROCm Backend [11 pull requests, 3 comments, 3 participants]

vllm 2026-05-06 13:14:42


GitHub stats

vllm-project/vllm#41820•Fetched 2026-05-07 03:32:44


Timeline (top)

mentioned ×7, subscribed ×7, commented ×3, labeled ×2

PR fix notes

PR #40871: [New Model][ROCm] Add AMD support for DeepSeek V4

Repository: vllm-project/vllm
Author: whx-sjtu
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/40871

Description (problem / solution / changelog)

Purpose

This PR adds DeepSeek V4 support for AMD GPUs.

Test Plan

Test Result

docker image: docker pull rocm/vllm-dev:deepseek-v4-mi35x
machine: mi355x
environment setting:

# enter docker, do:
pip uninstall vllm
git clone https://github.com/vllm-project/vllm.git
cd vllm
git fetch origin pull/40871/head:pr_dsv4
git checkout pr_dsv4
python3 setup.py develop

Deepseek-V4-Flash

Launch command:

max_num_seqs=16
max_num_batched_tokens=1024
tensor_parallel_size=4
export VLLM_TORCH_PROFILER_DIR="/app/vllm_profile"
export HF_HOME=/data/huggingface-cache
export VLLM_ROCM_USE_AITER=1

MODEL=/home/models/DeepSeek-V4-Flash
vllm serve ${MODEL} \
    --host localhost \
    --port 8001 \
    --dtype auto \
    --tensor-parallel-size ${tensor_parallel_size} \
    --max-num-seqs ${max_num_seqs} \
    --distributed-executor-backend mp \
    --trust-remote-code \
    --profiler-config '{"profiler": "torch", "torch_profiler_dir": "./vllm_profile"}' \
    --gpu-memory-utilization 0.35 \
    --moe-backend "triton_unfused" \
    --tokenizer-mode "deepseek_v4" \
    --async-scheduling \
    --enforce-eager

Full GSM8K accuracy result:

MODEL=/home/models/DeepSeek-V4-Flash
lm_eval --model local-completions --model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=4,max_retries=10,max_gen_toks=2048,timeout=60000 --batch_size auto --tasks gsm8k --num_fewshot 8 --output_path . 2>&1 | tee -a eval.log


local-completions ({'model': '/home/models/DeepSeek-V4-Flash', 'base_url': 'http://0.0.0.0:8001/v1/completions', 'num_concurrent': 4, 'max_retries': 10, 'max_gen_toks': 2048, 'timeout': 60000}), gen_kwargs: ({}), limit: None, num_fewshot: 8, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.9439|±  |0.0063|
|     |       |strict-match    |     8|exact_match|↑  |0.9431|±  |0.0064|

Deepseek-V4-Pro:

offline test recipe:

import os

# Set the ROCm AITER flags before importing vllm.
os.environ["VLLM_ROCM_USE_AITER"] = "1"
os.environ["VLLM_ROCM_USE_AITER_LINEAR"] = "1"

from vllm import LLM, SamplingParams

if __name__ == "__main__":

    prompts = ["What is 2+2? Answer:", "The capital of France is "]
    sampling_params = SamplingParams(temperature=0, top_p=1, max_tokens=20)

    llm = LLM(
        model="/home/models/DeepSeek-V4-Pro",
        tensor_parallel_size=8,
        kv_cache_dtype="fp8",
        gpu_memory_utilization=0.6,
        async_scheduling=True,
        enforce_eager=True,
        disable_log_stats=False,
        tokenizer_mode="deepseek_v4",
        moe_backend="triton_unfused",
        # seed=0,
        reasoning_parser="deepseek_v4",
    )
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        token_ids = output.outputs[0].token_ids
        print(
            f"Prompt: {prompt!r}, Generated text: {generated_text!r}, "
            f"Token ids: {token_ids}"
        )

launch_server.sh

max_num_seqs=128
max_num_batched_tokens=8192
tensor_parallel_size=8
export HF_HOME=/data/huggingface-cache
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_LINEAR=1
rm -rf /root/.cache/vllm/torch_compile_cache

MODEL=/home/models/DeepSeek-V4-Pro
vllm serve ${MODEL} \
    --host localhost \
    --port 8001 \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size ${tensor_parallel_size} \
    --max-num-seqs ${max_num_seqs} \
    --distributed-executor-backend mp \
    --trust-remote-code \
    --gpu-memory-utilization 0.6 \
    --moe-backend "triton_unfused" \
    --enforce-eager \
    --tokenizer-mode "deepseek_v4" \
    --async-scheduling \
    --reasoning-parser "deepseek_v4"

full gsm8k test result:

MODEL=/home/models/DeepSeek-V4-Pro
lm_eval --model local-completions --model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=2,max_retries=10,max_gen_toks=2048,timeout=60000 --batch_size auto --tasks gsm8k --num_fewshot 8 --output_path . 2>&1 | tee -a eval.log


local-completions ({'model': '/home/models/DeepSeek-V4-Pro', 'base_url': 'http://0.0.0.0:8001/v1/completions', 'num_concurrent': 2, 'max_retries': 10, 'max_gen_toks': 2048, 'timeout': 60000}), gen_kwargs: ({}), limit: None, num_fewshot: 8, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.9538|±  |0.0058|
|     |       |strict-match    |     8|exact_match|↑  |0.9545|±  |0.0057|

<details> <summary> Essential Elements of an Effective PR Description Checklist </summary>

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

</details>

Changed files

CMakeLists.txt (modified, +6/-6)
csrc/fused_deepseek_v4_qnorm_rope_kv_insert_kernel.cu (modified, +36/-2)
csrc/moe/topk_softplus_sqrt_kernels.cu (modified, +32/-21)
csrc/moe/torch_bindings.cpp (modified, +1/-2)
csrc/torch_bindings.cpp (modified, +0/-2)
requirements/rocm.txt (modified, +3/-0)
tests/kernels/moe/test_topk_softplus_sqrt.py (modified, +4/-2)
vllm/config/kernel.py (modified, +2/-0)
vllm/model_executor/kernels/linear/scaled_mm/aiter.py (modified, +15/-0)
vllm/model_executor/layers/activation.py (modified, +3/-1)
vllm/model_executor/layers/deepseek_compressor.py (modified, +3/-2)
vllm/model_executor/layers/deepseek_v4_attention.py (modified, +73/-19)
vllm/model_executor/layers/fused_moe/oracle/mxfp4.py (modified, +79/-2)
vllm/model_executor/layers/mhc.py (modified, +105/-2)
vllm/model_executor/layers/quantization/utils/fp8_utils.py (modified, +9/-0)
vllm/model_executor/layers/sparse_attn_indexer.py (modified, +22/-8)
vllm/model_executor/models/deepseek_v4.py (modified, +6/-1)
vllm/model_executor/models/deepseek_v4_mtp.py (modified, +6/-2)
vllm/platforms/rocm.py (modified, +1/-0)
vllm/v1/attention/backends/mla/sparse_swa.py (modified, +2/-1)
vllm/v1/attention/ops/deepseek_v4_ops/fused_inv_rope_fp8_quant.py (modified, +3/-1)
vllm/v1/attention/ops/rocm_aiter_mla_sparse.py (modified, +528/-60)

PR #41136: [ROCm] DeepSeekV4-Flash-Base model enablement on ROCm with triton & torchfallback

Repository: vllm-project/vllm
Author: lcskrishna
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/41136

Description (problem / solution / changelog)

Purpose

This PR enables running the DeepSeekV4-Flash-Base model (FP8) on ROCm with Triton and torch fallbacks. The major changes are:

Quantization whitelist: registers deepseek_v4_fp8.
FP8 MoE experts: only experts_dtype=FP8 is supported for now.
MHC: the current implementation uses TileLang kernels. This PR adds a fallback to a naive torch implementation; TileLang or an equivalent will be enabled in later PRs.
FP8 block-scale einsum: falls back to torch dequantization followed by torch.einsum instead of deep_gemm.
TopK Softplus SQRT (CUDA) function: falls back to a naive torch softplus + topk + renorm.
Router GEMM BF16 FP32: currently falls back to torch.linear.
Sparse Attention Indexer (Skip Insert): custom op rocm_sparse_attn_indexer_no_insert.
Flash MLA sparse fwd/decode: temporary fallback rocm_flash_mla_sparse.py with Triton kernels.
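
The "dequantize, then contract" idea behind the FP8 block-scale einsum fallback can be illustrated without any GPU library. The sketch below is not the PR's code: it uses plain Python lists instead of torch tensors, and the helper names `dequant_blockwise` and `matmul` are hypothetical. It assumes the usual block-quant convention that every element in a (block_n x block_k) tile is multiplied by that tile's scale.

```python
from typing import List

Matrix = List[List[float]]


def dequant_blockwise(q: Matrix, scales: Matrix, block_n: int, block_k: int) -> Matrix:
    """Multiply each element by the scale of the (block_n x block_k) tile it falls in."""
    return [
        [q[i][j] * scales[i // block_n][j // block_k] for j in range(len(q[0]))]
        for i in range(len(q))
    ]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Reference contraction: out[i][j] = sum over k of a[i][k] * b[k][j]."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Toy 2x4 "quantized" weight with 2x2 tiles, so the scale grid is 1x2.
q_weight = [[1.0, 2.0, 3.0, 4.0],
            [5.0, 6.0, 7.0, 8.0]]
scales = [[0.5, 2.0]]

w = dequant_blockwise(q_weight, scales, block_n=2, block_k=2)
x = [[1.0, 1.0]]   # 1x2 activation
y = matmul(x, w)   # 1x4 result computed on the dequantized weight
print(w)
print(y)
```

A fallback of this shape trades speed for portability: it materializes the dequantized weight instead of fusing the scaling into the GEMM.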

Test Plan

Test Result

Server command

MODEL_DIR=/models/DSV4-Flash-Base
## clone from https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-Base

export VLLM_ENGINE_READY_TIMEOUT_S=3600
export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0

vllm serve ${MODEL_DIR} \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --max-model-len 800000 \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size 8 \
  --max-num-seqs 512 \
  --max-num-batched-tokens 512 \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --enforce-eager \
  --kernel-config '{"moe_backend":"triton"}' \
  "${EXTRA_ARGS[@]}"

Curl commands & results

curl -sS -X POST http://localhost:8000/v1/completions   -H 'Content-Type: application/json'   -d "{
    \"model\": \"$MODEL\",
    \"prompt\": \"The capital of France is\",
    \"max_tokens\": 8,
    \"temperature\": 0
  }" | python3 -m json.tool

curl -s http://0.0.0.0:8000/v1/completions   -H 'Content-Type: application/json'   -d '{"model":"/shared_inference/models_blog/DeepSeek-V4-Flash-
       "prompt":"Q: 17 * 23 = \nA:", "max_tokens":12, "temperature":0}'   | jq -r '.choices[0].text'

GSM8K Results

lm_eval --model local-completions \
    --tasks gsm8k \
    --model_args model=/models/DeepSeek-V4-Flash-FP8/,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False

Result

2026-05-06:11:36:32 INFO [loggers.evaluation_tracker:119] Saving per-task samples to eval_results/gsm8k_20260506_105215/datasets__DeepSeek-V4-Flash-Base/*.jsonl
local-completions ({'model': '/datasets/DeepSeek-V4-Flash-Base/', 'base_url': 'http://0.0.0.0:8000/v1/completions', 'num_concurrent': 64, 'max_retries': 3, 'tokenized_requests': False, 'tokenizer_backend': None, 'max_gen_toks': 1024}), gen_kwargs: ({}), limit: None, num_fewshot: 5, batch_size: auto

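|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9242|±  |0.0073|
|     |       |strict-match    |     5|exact_match|↑  |0.9249|±  |0.0073|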

SUCCESS. Results in ./eval_results/gsm8k_20260506_105215


Changed files

vllm/config/compilation.py (modified, +1/-0)
vllm/distributed/kv_transfer/kv_connector/v1/moriio/moriio_connector.py (modified, +106/-13)
vllm/model_executor/layers/deepseek_compressor.py (modified, +6/-3)
vllm/model_executor/layers/deepseek_v4_attention.py (modified, +211/-10)
vllm/model_executor/layers/fused_moe/router/fused_topk_bias_router.py (modified, +111/-11)
vllm/model_executor/layers/mhc.py (modified, +160/-20)
vllm/model_executor/layers/sparse_attn_indexer.py (modified, +33/-4)
vllm/model_executor/layers/utils.py (modified, +10/-1)
vllm/model_executor/models/deepseek_v4.py (modified, +8/-4)
vllm/platforms/rocm.py (modified, +1/-0)
vllm/triton_utils/__init__.py (modified, +33/-1)
vllm/utils/deep_gemm.py (modified, +8/-1)
vllm/v1/attention/ops/deepseek_v4_ops/fused_inv_rope_fp8_quant.py (modified, +4/-2)
vllm/v1/attention/ops/flashmla.py (modified, +25/-0)
vllm/v1/attention/ops/rocm_flash_mla_sparse.py (added, +648/-0)
vllm/v1/attention/ops/rocm_sparse_attn_indexer.py (added, +549/-0)

PR #41451: [ROCm][Deepseekv4] DeepseekV4 Mi300 support

Repository: vllm-project/vllm
Author: ganyi1996ppo
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/41451

Description (problem / solution / changelog)

Purpose

This PR is based on https://github.com/vllm-project/vllm/pull/41217 and https://github.com/vllm-project/vllm/pull/40871, and will be reformatted once those two PRs are merged.

machine: mi308
test script:

max_num_seqs=16
max_num_batched_tokens=1024
tensor_parallel_size=4
export VLLM_TORCH_PROFILER_DIR="/app/vllm_profile"
export HF_HOME=/data/huggingface-cache
export VLLM_ROCM_USE_AITER=1

MODEL=/mnt/data/pretrained_model/deepseek-ai/DeepSeek-V4-Flash
vllm serve ${MODEL} \
    --host localhost \
    --dtype auto \
    --tensor-parallel-size ${tensor_parallel_size} \
    --max-num-seqs ${max_num_seqs} \
    --trust-remote-code \
    --profiler-config '{"profiler": "torch", "torch_profiler_dir": "./vllm_profile", "torch_profiler_with_stack": "False"}' \
    --gpu-memory-utilization 0.35 \
    --moe-backend "triton_unfused" \
    --tokenizer-mode "deepseek_v4" \
    --async-scheduling \
    --enforce-eager

request:

curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
  "prompt": "Write me a poem about AMD and Deepseek",
  "max_tokens": 100,
  "temperature": 0.0
}'

response:

{"id":"cmpl-b180b64df0a5a360","object":"text_completion","created":1777619440,"model":"/mnt/data/pretrained_model/deepseek-ai/DeepSeek-V4","choices":[{"index":0,"text":"\", \"role\": \"user\" }, { \"content\": \"Here is a poem about AMD and DeepSeek.\\n\\n**The Silicon and the Spark**\\n\\nIn Santa Clara's sunlit halls, where silicon dreams are spun,\\nA titan works on tiny things, beneath the desert sun.\\nThey craft the threads of logic, a digital tapestry,\\nTo weave the future's canvas, for all the world to see.\\n\\nBut far across the ocean, in","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null}],"service_tier":null,"system_fingerprint":"vllm-0.20.1rc1.dev137+gdde2fb080.d20260501-tp4-795d0827","usage":{"prompt_tokens":9,"total_tokens":109,"completion_tokens":100,"prompt_tokens_details":null},"kv_transfer_params":null}

A more thorough test will follow once the prerequisite PRs are merged.

Test Plan

Test Result


Changed files

CMakeLists.txt (modified, +6/-6)
csrc/fused_deepseek_v4_qnorm_rope_kv_insert_kernel.cu (modified, +46/-2)
csrc/moe/topk_softplus_sqrt_kernels.cu (modified, +32/-21)
csrc/moe/torch_bindings.cpp (modified, +1/-2)
csrc/torch_bindings.cpp (modified, +0/-2)
docs/design/attention_backends.md (modified, +1/-1)
requirements/rocm.txt (modified, +3/-0)
tests/kernels/moe/test_topk_softplus_sqrt.py (modified, +4/-2)
vllm/config/kernel.py (modified, +2/-0)
vllm/model_executor/kernels/linear/scaled_mm/aiter.py (modified, +15/-0)
vllm/model_executor/layers/activation.py (modified, +3/-1)
vllm/model_executor/layers/deepseek_compressor.py (modified, +51/-2)
vllm/model_executor/layers/deepseek_v4_attention.py (modified, +455/-20)
vllm/model_executor/layers/fused_moe/oracle/mxfp4.py (modified, +77/-2)
vllm/model_executor/layers/mhc.py (modified, +41/-0)
vllm/model_executor/layers/quantization/utils/fp8_utils.py (modified, +27/-3)
vllm/model_executor/layers/sparse_attn_indexer.py (modified, +22/-8)
vllm/model_executor/models/deepseek_v2.py (modified, +38/-23)
vllm/model_executor/models/deepseek_v4.py (modified, +6/-1)
vllm/model_executor/models/deepseek_v4_mtp.py (modified, +6/-2)
vllm/platforms/rocm.py (modified, +1/-0)
vllm/utils/deep_gemm.py (modified, +5/-1)
vllm/v1/attention/backends/mla/indexer.py (modified, +1/-1)
vllm/v1/attention/backends/mla/rocm_aiter_mla.py (modified, +4/-0)
vllm/v1/attention/backends/mla/rocm_aiter_mla_sparse.py (modified, +227/-29)
vllm/v1/attention/backends/mla/sparse_swa.py (modified, +2/-1)
vllm/v1/attention/ops/deepseek_v4_ops/cache_utils.py (modified, +8/-2)
vllm/v1/attention/ops/deepseek_v4_ops/fused_inv_rope_fp8_quant.py (modified, +3/-1)
vllm/v1/attention/ops/rocm_aiter_mla_sparse.py (modified, +149/-59)

PR #41312: [Bugfix][DeepSeek V4] Enable cross-node TP=16 FP8 serving

Repository: vllm-project/vllm
Author: sigridjineth
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/41312

Description (problem / solution / changelog)

Purpose

This PR makes cross-node TP=16 serving of DeepSeek V4 (Pro / Flash) work end-to-end on FP8 checkpoints. Two independent issues block it today, and both are addressed here. They are bundled because Fix B was only discovered once Fix A alone proved insufficient; keeping them together preserves the full reproduction trail for reviewers.

Issue A — FP8 weight loading fails at TP=16

DeepSeek V4 has moe_intermediate_size = 3072 and ships with a [128, 128] FP8 block-quant scheme. At TP=16 the per-rank input dim becomes 3072 / 16 = 192, which is not divisible by block_k = 128, so model load fails with:

ValueError: Weight input_size_per_partition = 192
is not divisible by weight quantization block_k = 128.

This is the same class of bug that #34408 fixes for EXAONE4-32B-FP8 and that #36853 reports for Qwen3-Coder-Next-FP8. TP ≤ 8 is unaffected (3072 / 8 = 384, divisible by 128), which is why the problem only surfaces on 2-node deployments — exactly the topology that #36836 (RayExecutorV2, merged) was designed to enable.
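
The divisibility failure above is plain integer arithmetic and can be reproduced in isolation. The following is a minimal sketch, not vLLM's loader code; the function name `check_fp8_block_partition` is hypothetical, and the error text simply mirrors the message quoted above.

```python
def check_fp8_block_partition(intermediate_size: int, tp_size: int, block_k: int = 128) -> int:
    """Return the per-rank input size, or raise if it does not align to block_k."""
    per_rank = intermediate_size // tp_size
    if per_rank % block_k != 0:
        raise ValueError(
            f"Weight input_size_per_partition = {per_rank} "
            f"is not divisible by weight quantization block_k = {block_k}."
        )
    return per_rank


# moe_intermediate_size = 3072 with [128, 128] FP8 blocks, as stated above.
outcomes = {}
for tp in (1, 8, 16):
    try:
        outcomes[tp] = check_fp8_block_partition(3072, tp)
    except ValueError as err:
        outcomes[tp] = str(err)

for tp, outcome in outcomes.items():
    print(f"TP={tp}: {outcome}")
```

Only the TP=16 entry ends up holding an error string, which matches the observation that single-node deployments are unaffected.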

Fix A — load-time intermediate-size padding. Mirror the EXAONE4 pattern: pad moe_intermediate_size to the smallest multiple of TP × lcm(block_n, block_k) (4096 at TP=16) and zero-pad the affected gate_proj / up_proj / down_proj weights and their weight_scale_inv tensors at load time. SwiGLU preserves zero-in → zero-out, so no activation mask is needed.

Padding is only applied when:

quant_config is Fp8Config with weight_block_size set, AND
tp_size × lcm(block_n, block_k) does not already divide the original size.

So existing TP ≤ 8 deployments and non-FP8 quant configs (e.g. routed MXFP4 experts on V4 Flash) are pass-through unchanged.
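
The padding rule can be checked with a few lines of arithmetic. This is a sketch under the assumptions stated in the text (alignment unit of TP × lcm(block_n, block_k), pad only when misaligned); `padded_intermediate_size` and `swiglu` are hypothetical helpers, not the PR's `_padded_moe_intermediate_size`. It needs Python 3.9+ for `math.lcm`.

```python
import math


def padded_intermediate_size(size: int, tp_size: int, block_n: int = 128, block_k: int = 128) -> int:
    """Round size up to a multiple of tp_size * lcm(block_n, block_k), if needed."""
    align = tp_size * math.lcm(block_n, block_k)
    if size % align == 0:
        return size                      # already aligned: pass through unchanged
    return -(-size // align) * align     # integer ceiling, then scale back up


def swiglu(gate: float, up: float) -> float:
    """SiLU(gate) * up for a single element."""
    return gate / (1.0 + math.exp(-gate)) * up


sizes = {tp: padded_intermediate_size(3072, tp) for tp in (1, 8, 16)}
print(sizes)

# Zero-padded gate/up columns produce zero activations, so the padded
# down_proj rows never contribute to the output.
print(swiglu(0.0, 0.0))
```

At TP=16 the alignment unit is 16 × 128 = 2048, so 3072 rounds up to 4096 and each rank holds 256 columns, i.e. two 128-wide blocks.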

A subtlety worth flagging: a naive append-only pad places all the zero blocks on the highest TP ranks, which leaves several ranks holding an entirely-zero shared-expert shard at TP=16 (24 → 32 blocks across 16 ranks ⇒ 8 ranks fully zero). We use a balanced TP-block layout that spreads the original 24 blocks evenly across the 16 ranks (most ranks get 1 real block + 1 zero block, the "extra" 8 real blocks distributed every other rank), so no rank ends up with a fully-zero expert shard. This is the change in commit 2.
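
One way to obtain a balanced layout of this kind is to give each rank a ceiling-based share of the real blocks and leave the rest of its slots as zero padding. The sketch below is an illustration only: `balanced_block_slots` is a hypothetical helper and is not claimed to match the PR's `_balanced_tp_block_indices`; it was chosen because it yields a layout starting 0, 1, 2, 4, 5, 6, 8, 9, 10 for the 24 → 32 case.

```python
from typing import List


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def balanced_block_slots(real_blocks: int, padded_blocks: int, tp_size: int) -> List[int]:
    """Return, for each real block in order, its slot index in the padded layout."""
    slots_per_rank = padded_blocks // tp_size
    slots: List[int] = []
    for rank in range(tp_size):
        # Ceiling-based split: ranks alternate between 2 and 1 real blocks here.
        start = ceil_div(rank * real_blocks, tp_size)
        end = ceil_div((rank + 1) * real_blocks, tp_size)
        base = rank * slots_per_rank
        slots.extend(range(base, base + (end - start)))
    return slots


slots = balanced_block_slots(real_blocks=24, padded_blocks=32, tp_size=16)
ranks_with_real_block = {slot // 2 for slot in slots}   # 2 slots per rank
print(slots)
print(len(ranks_with_real_block))
```

Every slot not listed is a zero block, and because each rank receives at least one real block, no rank is left with an all-zero shard.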

Memory cost: ~505 MiB / GPU at TP=16, ~7.9 GiB across the model. Acceptable to unlock the deployment.

Issue B — Cross-node TP=16 reasoning produces garbage output

Even with Fix A applied, reasoning_effort=max + long system prompt + temp=1.0 produces mode-collapsed multilingual token soup with leaked <｜begin▁of▁sentence｜> control tokens on TP=16. Critically, the same regression reproduces at temp=0: 3 trials of the same prompt yield 3 different outputs, with one or two collapsing into garbage. TP=8 single-node is byte-equal stable across trials.

Root cause: vllm/utils/multi_stream_utils.py:maybe_execute_in_parallel dispatches q_proj / kv-compressor / indexer GEMMs onto a default + auxiliary CUDA stream (used by deepseek_v4_attention.attn_gemm_parallel_execute). Stream-completion order is non-deterministic across forward passes, perturbing FP accumulation downstream at the bit level. Short generations are robust to this jitter, but during the long thinking phase produced by reasoning_effort=max, the cumulative perturbation eventually swaps top-1 ↔ top-2 once and the generation diverges into out-of-distribution tokens.
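
The bit-level sensitivity described here comes from floating-point addition not being associative: the same operands summed in a different order can round differently. A minimal, CPU-only illustration (unrelated to any vLLM code path):

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # one accumulation order
right = a + (b + c)   # the same operands, different order

print(repr(left), repr(right))
print(left == right)
```

A difference in the last bit is harmless for most tokens, but when two candidate logits are nearly tied, it can be enough to change which one ranks first, after which the generation follows a different path.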

Fix B — opt-in deterministic mode. Add VLLM_DETERMINISTIC_AUX_STREAM (default OFF). When set, both maybe_execute_in_parallel and execute_in_parallel force aux_stream / aux_streams = None, falling back to sequential execution.

Default behavior is unchanged — single-node TP ≤ 8 users see no difference.
Cross-node TP operators set VLLM_DETERMINISTIC_AUX_STREAM=1 to trade a small concurrent-throughput cost (~5%) for reproducible logits.
The flag lives in vllm/utils/multi_stream_utils.py, so every call site that goes through these helpers (not just DeepSeek V4) gets the determinism toggle for free.
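
The toggle can be pictured as "drop the auxiliary stream, then run in order". The sketch below is a stand-in, not the code in vllm/utils/multi_stream_utils.py: a `ThreadPoolExecutor` plays the role of the auxiliary CUDA stream, `run_pair` and `deterministic_mode` are hypothetical names, and treating the value "1" as "enabled" is an assumption about how the variable is parsed.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def deterministic_mode(environ=None) -> bool:
    """True when the opt-in flag is set (assumed here to mean the value '1')."""
    environ = os.environ if environ is None else environ
    return environ.get("VLLM_DETERMINISTIC_AUX_STREAM", "0") == "1"


def run_pair(fn0, fn1, aux=None, environ=None):
    """Run fn0 and fn1, overlapping them on `aux` unless determinism is forced."""
    if deterministic_mode(environ):
        aux = None                      # force the sequential fallback
    if aux is None:
        return fn0(), fn1()             # fixed order: fn0, then fn1
    future = aux.submit(fn1)            # fn1 overlaps with fn0
    return fn0(), future.result()


with ThreadPoolExecutor(max_workers=1) as pool:
    overlapped = run_pair(lambda: "q_proj", lambda: "kv", aux=pool, environ={})
    forced = run_pair(
        lambda: "q_proj", lambda: "kv", aux=pool,
        environ={"VLLM_DETERMINISTIC_AUX_STREAM": "1"},
    )

print(overlapped, forced)
```

Both calls return the same values; the difference is only whether the two pieces of work may complete in a varying order.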

Negative results that informed Fix B

For reviewer context, here is what we tried before landing on Fix B:

Padding form alone is not enough. Both v1 (append-only) and v2 (LCM-balanced TP-block spread) padding resolve the load-time ValueError but do not resolve the reasoning_effort=max regression on cross-node TP=16.
Switching base image is not enough either. Building from vLLM main nightly instead of the deepseekv4-cu130 image does not fix the regression, and additionally introduces an unrelated ScalarType 44 (FP8 e8m0fnu) crash for cross-node TP=16 + UE8M0, so nightly is not a viable workaround at the moment.
temp=0 self-inconsistency at TP=16 (with multi-stream ON) was the conclusive signal that the regression is a numerical-determinism issue, not a sampling or padding issue.

Related work

#34408 — EXAONE4 padding fix (same class as Fix A, open)
#36853 — Qwen3-Coder-Next-FP8 TP=8 error (related class, open)
#36836 — RayExecutorV2 (merged, enables the 2-node topology)
#38164 — RayExecutorV2 + EEP (future work; out of scope here)

Out of scope

Expert Parallelism (--enable-expert-parallel) for DeepSeek V4 — separate workstream tracked in #38164.
Pipeline Parallelism — DeepSeek V4 currently does not implement SupportsPP.
B300 SM_120 sparse MLA — tracked in #40991; does not affect the SM_100 path used by B300 SXM6.
CUDA-graph compilation of the 1.6T-parameter Pro variant — torch.compile OOMs the 1.9 TiB host RAM under current main. Left for future work; this PR validates only --enforce-eager.

Test Plan

Unit tests

tests/models/test_deepseek_v4_padding.py (new, 17 tests) covers:

_padded_moe_intermediate_size — only pads on FP8 + misaligned TP, otherwise pass-through (verified at TP=1/8/16, MXFP4, None).
_pad_deepseek_v4_tensor — value preservation, fill behavior, refuses truncation.
_balanced_tp_block_indices — exact 24→32 layout at TP=16, asserts the expected [0,1,2,4,5,6,8,9,10,...] pattern that prevents all-zero shards.
Round-trip linear-equivalence: padded gate_up @ down produces the same output as the original on the unpadded region.
Loader-level: shared-expert and routed-expert FP8 / weight_scale_inv / E8M0-scale padding on both DeepseekV4Model and DeepSeekV4MTP.
Construction contract: DeepseekV4MLP raises the original 192 ValueError on TP=16 without padding, and constructs cleanly with the padded size; DeepseekV4MoE preserves expert_dtype="fp4" (routed stays at 3072) vs "fp8" (routed padded to 4096) routing.
DeepseekV4FP8Config dispatch correctly routes expert_dtype="fp4" → Mxfp4MoEMethod and "fp8" → Fp8MoEMethod.

pytest tests/models/test_deepseek_v4_padding.py -v

End-to-end (2× B300 SXM6, RoCEv2)

Cluster: 2 nodes × 8× B300 SXM6 (16 GPUs total), CUDA 13.0, NCCL 2.28.9, Ray 2.49.0, RoCEv2 over enp83s0f1np1.

# Per-node ray bring-up (head + worker; standard --add-host b300-01/02 + GLOO/NCCL_SOCKET_IFNAME), then on head:
docker exec -d \
  -e VLLM_USE_RAY_V2_EXECUTOR_BACKEND=1 \
  -e VLLM_DETERMINISTIC_AUX_STREAM=1 \
  -e RAY_ADDRESS=10.1.130.11:6379 \
  -e GLOO_SOCKET_IFNAME=enp83s0f1np1 \
  -e NCCL_SOCKET_IFNAME=enp83s0f1np1 \
  ray-head bash -c 'vllm serve deepseek-ai/DeepSeek-V4-Pro \
    --trust-remote-code \
    --tensor-parallel-size 16 \
    --distributed-executor-backend ray \
    --kv-cache-dtype fp8 \
    --block-size 256 \
    --enforce-eager \
    --tokenizer-mode deepseek_v4 \
    --tool-call-parser deepseek_v4 \
    --enable-auto-tool-choice \
    --reasoning-parser deepseek_v4 \
    --port 8000'

Verification:

Health: curl http://<head>:8000/v1/models returns the model id.
Functional: chat completion "What is 17 × 19?" returns 323.
Numerical: byte-equality vs TP=8 single-node baseline at temp=0 on 6 representative prompts.
Reasoning regression: original failure case reasoning_effort=max + long system + temp=1.0, 5 trials.

Test Result

Before this PR

ValueError: Weight input_size_per_partition = 192
is not divisible by weight quantization block_k = 128.

Model fails to load. (After local Fix A only, without Fix B: model loads, but the reasoning regression reported below is reproducible.)

After this PR (Fix A + Fix B with `VLLM_DETERMINISTIC_AUX_STREAM=1`)

Correctness — TP=16 vs TP=8 byte-equality at temp=0

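|Prompt|Bytes equal?|
|------|------------|
|`대한민국의 수도는?` ("What is the capital of South Korea?")|✅|
|`대한민국 헌법 제1조의 내용을 그대로 인용하세요.` ("Quote Article 1 of the Constitution of the Republic of Korea verbatim.")|✅|
|`What is the capital of France?`|✅|
|`Compute 17 * 19.`|✅|
|Python one-liner sum 1..100|✅|
|Bob/Alice age reasoning puzzle|✅|
|Overall|6 / 6|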

Self-consistency on TP=16 at temp=0 (3 trials × 4 prompts): all stable. Without Fix B (env var unset, default), the reasoning prompt was unstable across trials (2 hash variants per 3 trials, occasional garbage).

Reasoning regression — reasoning_effort=max + long system + temp=1.0, 5 trials

|             |Result|
|-------------|------|
|Without Fix B|3/3 garbage (mode-collapse, `<｜begin▁of▁sentence｜>` leak)|
|With Fix B|5/5 clean responses (correct multilingual answers + tool calls)|

Performance — single request, 256-token completion, --enforce-eager

|Setup|TTFT median|Decode tok/s|
|-----|----------:|-----------:|
|TP=8 single-node|276 ms|4.0|
|TP=16 (multi-stream ON)|310 ms|3.7|
|TP=16 + Fix A + Fix B|307 ms|3.7|

Performance — concurrent, 100 prompts, RPS=8, max_concurrency=32, --enforce-eager

|Setup|Output tok/s|Total tok/s|TTFT median|TPOT median|
|-----|-----------:|----------:|----------:|----------:|
|TP=8 single-node|78.0|389.8|4625 ms|310 ms|
|TP=16 (multi-stream ON)|64.4|321.9|14720 ms|335 ms|
|TP=16 + Fix A + Fix B|60.9|304|15922 ms|372 ms|

Fix B costs ~5% concurrent throughput vs the multi-stream baseline at TP=16. Cross-node TP=16 is communication-bound (RoCEv2 ~50 GB/s vs intra-node NVLink ~900 GB/s), so the value of this PR is enabling a single endpoint that serves 16 GPUs of capacity, not raw token-rate gain over single-node TP=8. With CUDA-graph enabled (future work — currently OOMs torch.compile host RAM), expect 5–10× decode throughput.
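
The percentages quoted in this paragraph can be re-derived from the concurrent-throughput table and the stated link bandwidths. A small worked calculation, using only numbers reported in this PR description:

```python
# Output tok/s from the concurrent benchmark table.
tp8_single_node = 78.0
tp16_multi_stream = 64.4
tp16_fix_a_fix_b = 60.9

# Cost of Fix B relative to the TP=16 multi-stream baseline.
fix_b_cost_pct = (tp16_multi_stream - tp16_fix_a_fix_b) / tp16_multi_stream * 100

# Gap between the final TP=16 configuration and single-node TP=8.
tp16_vs_tp8_pct = (tp8_single_node - tp16_fix_a_fix_b) / tp8_single_node * 100

# Approximate interconnect bandwidths quoted above, in GB/s.
bandwidth_ratio = 900 / 50

print(f"Fix B cost: {fix_b_cost_pct:.1f}%")
print(f"TP=16 vs TP=8 gap: {tp16_vs_tp8_pct:.1f}%")
print(f"NVLink / RoCEv2 bandwidth ratio: {bandwidth_ratio:.0f}x")
```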

<details> <summary>Essential Elements of an Effective PR Description Checklist</summary>

The purpose of the PR — see "Issue A" and "Issue B" above.
The test plan — unit (pytest tests/models/test_deepseek_v4_padding.py -v) and 2-node B300 e2e command provided.
The test results — correctness (6/6 byte-equal vs TP=8), regression (5/5 clean with Fix B vs 3/3 garbage without), and single/concurrent performance tables.
(Optional) Documentation — VLLM_DETERMINISTIC_AUX_STREAM is documented inline in vllm/envs.py and in the docstrings of maybe_execute_in_parallel / execute_in_parallel.

</details>

Changed files

tests/models/test_deepseek_v4_padding.py (added, +571/-0)
vllm/envs.py (modified, +8/-0)
vllm/model_executor/models/deepseek_v4.py (modified, +341/-1)
vllm/model_executor/models/deepseek_v4_mtp.py (modified, +33/-0)
vllm/utils/multi_stream_utils.py (modified, +15/-0)

PR #41352: [feature][WIP] Enable KV Offload for DeepSeek V4 model

Repository: vllm-project/vllm
Author: foraxe
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/41352

Description (problem / solution / changelog)

[feature][WIP] Enable KV Offload for DeepSeek V4 model

Summary

This PR makes the v1 OffloadingConnector advertise SupportsHMA and handle the scheduler's all-KV-group request-finish callback. This is the remaining connector facade needed for grouped KV offload support when the scheduler passes tuple[list[int], ...] block IDs for multiple KV cache groups.

The implementation is backend-neutral. It does not add Ascend imports, torch_npu, DSv4-specific branches, or VLLM_ASCEND_* gates.

Existing generic grouped-KV pieces in this branch

SupportsHMA is already defined in vllm/distributed/kv_transfer/kv_connector/v1/base.py.
GPULoadStoreSpec already has typed group_sizes and block_indices fields.
offloading/scheduler.py already tracks RequestOffloadState per KV group.
The offloading scheduler already uses make_offload_key(..., group_idx) for group-aware offload keys.
Load/store metadata already carries grouped GPU block IDs through group_sizes and block_indices.

Changes

Make OffloadingConnector inherit SupportsHMA.
Add OffloadingConnector.request_finished_all_groups(...) and delegate to the existing scheduler finish path.
Widen the offloading scheduler finish type annotation so the connector can pass either the legacy single-group list or the HMA all-group tuple.
Add unit coverage for the connector facade so the class is recognized as HMA-capable and forwards all-group block IDs unchanged.
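
The facade described in these changes can be sketched with toy classes. Everything below is hypothetical and simplified: `ToyScheduler`, `ToyConnector`, and `as_group_tuple` are illustration-only names, and the real OffloadingConnector and scheduler signatures are not reproduced here. The point is only that one widened finish path can accept both the legacy single-group list and the all-group tuple.

```python
from typing import Dict, List, Tuple, Union

BlockIds = Union[List[int], Tuple[List[int], ...]]


def as_group_tuple(block_ids: BlockIds) -> Tuple[List[int], ...]:
    """Normalize a legacy single-group list into a one-element group tuple."""
    return block_ids if isinstance(block_ids, tuple) else (block_ids,)


class ToyScheduler:
    def __init__(self) -> None:
        self.finished: Dict[str, Tuple[List[int], ...]] = {}

    def request_finished(self, req_id: str, block_ids: BlockIds) -> None:
        # Widened annotation: accepts either shape of block IDs.
        self.finished[req_id] = as_group_tuple(block_ids)


class ToyConnector:
    def __init__(self, scheduler: ToyScheduler) -> None:
        self.scheduler = scheduler

    def request_finished(self, req_id: str, block_ids: List[int]) -> None:
        self.scheduler.request_finished(req_id, block_ids)

    def request_finished_all_groups(self, req_id: str, block_ids: Tuple[List[int], ...]) -> None:
        # Delegate to the same finish path, forwarding the tuple unchanged.
        self.scheduler.request_finished(req_id, block_ids)


scheduler = ToyScheduler()
connector = ToyConnector(scheduler)
connector.request_finished("legacy", [1, 2, 3])
connector.request_finished_all_groups("hma", ([1, 2], [7], [9, 10]))
print(scheduler.finished)
```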

Validation

Intended focused tests:

pytest -q tests/v1/kv_connector/unit/offloading_connector/test_connector.py
pytest -q tests/v1/kv_connector/unit/offloading_connector/test_scheduler.py

In this local environment, pytest collection currently requires missing optional test/runtime dependencies (tblib, then gguf on direct import). Syntax-level checks were used locally until the full vLLM test environment is available.

Follow-up

The hardware backend remains out of scope for this PR. DSv4 compressed KV registration, NPU-visible host memory, and A3 launch/runtime validation belong in the paired vllm-ascend change.

Changed files

tests/v1/kv_connector/unit/offloading_connector/test_connector.py (added, +26/-0)
vllm/distributed/kv_transfer/kv_connector/v1/offloading/scheduler.py (modified, +1/-1)
vllm/distributed/kv_transfer/kv_connector/v1/offloading_connector.py (modified, +10/-1)

PR #41276: [WIP] [DSV4] Quantization Support

Repository: vllm-project/vllm
Author: kylesayrs
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/41276

Description (problem / solution / changelog)

DeepSeek-V4-Flash-NVFP4-FP8

Model Optimizations

This model was obtained by using the following branch with LLM Compressor: https://github.com/vllm-project/llm-compressor/pull/2647

Deployment

vllm serve RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 --tensor-parallel-size 4 --port 8089 --kv_cache_dtype="fp8"

Evaluation

python tests/evals/gsm8k/gsm8k_eval.py

Results:
Accuracy: 0.910
Invalid responses: 0.000
Total latency: 173.006 s
Questions per second: 7.624
Total output tokens: 116217
Output tokens per second: 671.752

For more details on how this model was created and run in LLM Compressor, please contact Kyle Sayers on the vLLM Slack: https://communityinviter.com/apps/vllm-dev/join-vllm-developers-slack

Changed files

vllm/model_executor/layers/deepseek_compressor.py (modified, +1/-1)
vllm/model_executor/layers/deepseek_v4_attention.py (modified, +12/-12)
vllm/model_executor/models/deepseek_v4.py (modified, +9/-2)

PR #41601: DeepSeekv4 ROCm Optimization

Repository: vllm-project/vllm
Author: bobofang11235
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/41601

Description (problem / solution / changelog)

Purpose

[Fix][Rocm] Handle DeepSeek-V4 UE8M0 and FP8 dtype: Decode UE8M0 scale tensors before using them in ROCm fallback paths, so that E8M0 scales are not multiplied directly as float8 tensors. Use the current platform's FP8 dtype for DeepSeek-V4 indexer and inverse-RoPE quantization, and select the Triton FNUZ FP8 type when the ROCm platform requires it.
[Feat][Rocm] Add DeepSeek-V4 sparse FlashMLA fallback: Route DeepSeek-V4 sparse FlashMLA prefill and decode calls through ROCm fallback implementations, so that ROCm shares the same FlashMLA API path as CUDA.
[Fix][Rocm] Preserve FP4 parameter dtype for AITER MXFP4 MoE: Wrap shuffled AITER MXFP4 weights as fresh Parameters so that FP4 dtype metadata is preserved, without changing non-ROCm MoE backend routing.
[BugFix][Attention] Fix NaN in Triton merge_attn_states when both LSEs are -inf: When neither prefix nor suffix has any tokens (e.g. chunked prefill with zero context length), both prefix_lse and suffix_lse are -inf. Per IEEE 754, -inf - (-inf) = NaN, which propagates through exp and division into the final output of the Triton merge_attn_states kernel.
[Fix][Rocm] Add generic fp8_einsum fallback for DeepGEMM: Provide a ROCm-only torch fallback for fp8_einsum when DeepGEMM is unavailable, while preserving the existing CUDA and non-ROCm dispatch behavior.
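
The NaN case in the merge_attn_states item can be reproduced with plain floating-point arithmetic. The following is a minimal sketch in scalar Python, not the Triton kernel from the PR: the function names are hypothetical, and returning zero weights when both LSEs are -inf is an assumed convention chosen for illustration (the PR may handle the case differently).

```python
import math

NEG_INF = float("-inf")


def merge_weights_naive(prefix_lse: float, suffix_lse: float) -> tuple[float, float]:
    """Softmax-style merge weights with no guard (hypothetical helper)."""
    max_lse = max(prefix_lse, suffix_lse)
    # If both inputs are -inf, each subtraction is (-inf) - (-inf) = NaN,
    # and the NaN survives exp() and the division below.
    p = math.exp(prefix_lse - max_lse)
    s = math.exp(suffix_lse - max_lse)
    total = p + s
    return p / total, s / total


def merge_weights_guarded(prefix_lse: float, suffix_lse: float) -> tuple[float, float]:
    """Same computation, with an assumed guard for the all-empty case."""
    max_lse = max(prefix_lse, suffix_lse)
    if max_lse == NEG_INF:
        # Assumption: with no tokens on either side, contribute nothing.
        return 0.0, 0.0
    p = math.exp(prefix_lse - max_lse)
    s = math.exp(suffix_lse - max_lse)
    total = p + s
    return p / total, s / total


if __name__ == "__main__":
    print(merge_weights_naive(NEG_INF, NEG_INF))    # both weights are NaN
    print(merge_weights_guarded(NEG_INF, NEG_INF))  # both weights are 0.0
```

When only one side is empty, the guard is not needed: exp(-inf) is 0.0, so the non-empty side receives weight 1.0.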

(This PR was originally based on https://github.com/vllm-project/vllm/pull/40871. Since PR #40871 has already merged, it is now based on upstream commit c7aa186d67b6f051680831418e957c67f34ba7a2.)
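
To illustrate why the first Purpose item decodes UE8M0 scales before multiplying: an E8M0 byte holds only a biased exponent, so it has to be turned into a power of two rather than reinterpreted as a float8 value. The sketch below assumes the E8M0 encoding defined in the OCP Microscaling (MX) formats (bias 127, 0xFF reserved for NaN). The helper name is hypothetical, and the actual decode in the PR operates on tensors inside vLLM and may differ.

```python
import math

E8M0_BIAS = 127   # exponent bias, per the OCP MX definition of E8M0
E8M0_NAN = 0xFF   # the all-ones pattern encodes NaN


def decode_ue8m0(byte: int) -> float:
    """Decode one unsigned E8M0 scale byte to a Python float (hypothetical helper)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError(f"not a byte: {byte}")
    if byte == E8M0_NAN:
        return math.nan
    # No sign and no mantissa bits: the value is exactly 2 ** (byte - bias).
    return math.ldexp(1.0, byte - E8M0_BIAS)


if __name__ == "__main__":
    for raw in (126, 127, 128, 130):
        print(raw, decode_ue8m0(raw))
```

A block's dequantized values are then the quantized elements multiplied by this decoded power of two.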

Test Plan

Test Result

docker image: docker pull rocm/vllm-dev:deepseek-v4-mi35x
machine: mi355x
aiter version: d2454ad18a0d7c7795162ab0f550e8a0397840bd (https://github.com/ROCm/aiter main branch)
vllm version: this PR

server command

max_num_seqs=128
max_num_batched_tokens=8192
tensor_parallel_size=8
export HF_HOME=/data/huggingface-cache
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_LINEAR=1
rm -rf /root/.cache/vllm/torch_compile_cache

MODEL=DeepSeek-V4-Pro
vllm serve ${MODEL} \
    --host localhost \
    --port 8001 \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size ${tensor_parallel_size} \
    --max-num-seqs ${max_num_seqs} \
    --distributed-executor-backend mp \
    --trust-remote-code \
    --gpu-memory-utilization 0.6 \
    --moe-backend "triton_unfused" \
    --tokenizer-mode "deepseek_v4" \
    --async-scheduling \
    --reasoning-parser "deepseek_v4" \
    --kv-cache-dtype fp8_e4m3 \
    --compilation-config '{"mode":3,"cudagraph_mode":1,"cudagraph_capture_sizes":[1,2,4,8]}'

client command

#!/usr/bin/env bash
set -e
export PYTHONPATH=/opt/aiter:/opt/aiter/aiter/jit/utils:${PYTHONPATH}
PORT=${PORT:-8001}
MODEL=${MODEL:-DeepSeek-V4-Pro}
NUM_PROMPTS=${NUM_PROMPTS:-10}
CONCURRENCY=${CONCURRENCY:-2}
INPUT_LEN=${INPUT_LEN:-10240}
OUTPUT_LEN=${OUTPUT_LEN:-512}
TS=$(date +%Y%m%d_%H%M%S)
LOG=/opt/scripts/vllm/logs/client_${TS}.log

mkdir -p /opt/scripts/vllm/logs

echo ""
echo "========== Sanity Check: Single Chat Completion =========="
curl -s "http://localhost:${PORT}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"${MODEL}\",\"messages\":[{\"role\":\"user\",\"content\":\"What is 15% of 240? Answer concisely.\"}],\"max_tokens\":64}" \
  | python3 -m json.tool \
  | tee -a "$LOG"

echo ""
echo ""
vllm bench serve \
  --base-url "http://localhost:${PORT}" \
  --model "${MODEL}" --tokenizer "${MODEL}" \
  --dataset-name random \
  --random-input-len "${INPUT_LEN}" \
  --random-output-len "${OUTPUT_LEN}" \
  --num-prompts "${NUM_PROMPTS}" --max-concurrency "${CONCURRENCY}" \
  --num-warmups 1 \
  --save-result \
  --result-filename "/opt/scripts/vllm/logs/bench_${TS}.json" \
  2>&1 | tee -a "$LOG"

echo ""
echo "[$(date)] Benchmark complete. Log: $LOG"
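
The sanity-check request in the script above hand-escapes its JSON body inside a shell string. As an alternative, here is a minimal Python sketch that builds the same request with `json.dumps`, using only the standard library. The endpoint, port, model name, prompt and `max_tokens` value are taken from the script; the function names `build_chat_request` and `send` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str, port: int = 8001,
                       max_tokens: int = 64) -> urllib.request.Request:
    """Build the POST request for /v1/chat/completions (hypothetical helper)."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"http://localhost:{port}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 600.0) -> dict:
    """Send the request and parse the JSON response; needs a running server."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    req = build_chat_request("DeepSeek-V4-Pro",
                             "What is 15% of 240? Answer concisely.")
    print(json.dumps(send(req), indent=2))
```

Building the body as a dictionary avoids the nested backslash escaping and keeps the prompt text readable.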

Result

============ Serving Benchmark Result ============
Successful requests:                     10
Failed requests:                         0
Maximum request concurrency:             2
Benchmark duration (s):                  777.06
Total input tokens:                      102400
Total generated tokens:                  5120
Request throughput (req/s):              0.01
Output token throughput (tok/s):         6.59
Peak output token throughput (tok/s):    8.00
Peak concurrent requests:                4.00
Total token throughput (tok/s):          138.37
---------------Time to First Token----------------
Mean TTFT (ms):                          5943.68
Median TTFT (ms):                        5062.24
P99 TTFT (ms):                           8489.66
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          292.40
Median TPOT (ms):                        290.75
P99 TPOT (ms):                           299.07
---------------Inter-token Latency----------------
Mean ITL (ms):                           292.40
Median ITL (ms):                         289.34
P99 ITL (ms):                            301.06
==================================================
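
The aggregate figures in this report follow from the client defaults (10 prompts, 10240 input tokens and 512 output tokens each) and the reported duration. A short sketch that re-derives them, with the function name `derive_throughput` being hypothetical:

```python
def derive_throughput(num_prompts: int, input_len: int, output_len: int,
                      duration_s: float) -> dict:
    """Recompute the aggregate benchmark figures from the run parameters."""
    total_input = num_prompts * input_len
    total_output = num_prompts * output_len
    return {
        "total_input_tokens": total_input,
        "total_generated_tokens": total_output,
        "request_throughput": num_prompts / duration_s,
        "output_token_throughput": total_output / duration_s,
        "total_token_throughput": (total_input + total_output) / duration_s,
    }


if __name__ == "__main__":
    stats = derive_throughput(num_prompts=10, input_len=10240,
                              output_len=512, duration_s=777.06)
    for name, value in stats.items():
        print(f"{name}: {value:.2f}")
    # Per-request decode rate implied by the mean TPOT of 292.40 ms.
    print(f"decode tok/s per request: {1000 / 292.40:.2f}")
```

The mean TPOT of 292.40 ms corresponds to roughly 3.42 output tokens per second per request. With two concurrent requests that is about 6.8 tok/s while decoding, so the reported 6.59 tok/s is consistent once prefill time is included.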

accuracy command

MODEL=${MODEL:-DeepSeek-V4-Pro}
lm_eval --model local-completions --model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=2,max_retries=10,max_gen_toks=2048,timeout=60000 --batch_size auto --tasks gsm8k --num_fewshot 8 --output_path .  2>&1 | tee -a eval.log

Result

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.9553|±  |0.0057|
|     |       |strict-match    |     8|exact_match|↑  |0.9560|±  |0.0056|
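
As a consistency check on the Stderr column, the standard error of an exact-match accuracy can be estimated from the accuracy and the number of questions. The sketch below assumes the GSM8K test split has 1319 questions (the table does not state the sample count) and uses the sample standard deviation of 0/1 scores divided by the square root of n; the function name is hypothetical.

```python
import math

GSM8K_TEST_SIZE = 1319  # assumption: size of the GSM8K test split


def accuracy_stderr(accuracy: float, n: int) -> float:
    """Standard error of the mean of n binary (0/1) scores.

    Uses the sample variance (n - 1 denominator), which for binary scores
    reduces to sqrt(p * (1 - p) / (n - 1)).
    """
    return math.sqrt(accuracy * (1.0 - accuracy) / (n - 1))


if __name__ == "__main__":
    for name, acc in (("flexible-extract", 0.9553), ("strict-match", 0.9560)):
        print(f"{name}: {accuracy_stderr(acc, GSM8K_TEST_SIZE):.4f}")
```

Under that assumption, the computed values round to 0.0057 and 0.0056, in line with the table.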

Changed files

tests/kernels/attention/test_merge_attn_states.py (modified, +69/-0)
vllm/model_executor/layers/deepseek_v4_attention.py (modified, +10/-45)
vllm/model_executor/layers/fused_moe/oracle/mxfp4.py (modified, +30/-12)
vllm/model_executor/layers/quantization/utils/fp8_utils.py (modified, +9/-7)
vllm/model_executor/layers/quantization/utils/w8a8_utils.py (modified, +10/-2)
vllm/utils/deep_gemm.py (modified, +84/-0)
vllm/v1/attention/backends/mla/sparse_swa.py (modified, +1/-2)
vllm/v1/attention/ops/deepseek_v4_ops/fused_indexer_q.py (modified, +13/-4)
vllm/v1/attention/ops/deepseek_v4_ops/fused_inv_rope_fp8_quant.py (modified, +6/-2)
vllm/v1/attention/ops/flashmla.py (modified, +11/-3)
vllm/v1/attention/ops/rocm_aiter_mla_sparse.py (modified, +14/-16)
vllm/v1/attention/ops/rocm_flash_mla_sparse.py (added, +682/-0)
vllm/v1/attention/ops/triton_merge_attn_states.py (modified, +13/-1)

PR #41374: [DSV4] Avoid redundant dtype conversion.

Repository: vllm-project/vllm
Author: jeejeelee
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/41374

Description (problem / solution / changelog)

Purpose

Test Plan

Test Result

Changed files

vllm/model_executor/models/deepseek_v4.py (modified, +11/-6)

PR #41263: [DSV4] Fuse norm and router for low latency scenario

Repository: vllm-project/vllm
Author: jeejeelee
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/41263

Description (problem / solution / changelog)

Purpose

Test Plan

Test Result

Changed files

CMakeLists.txt (modified, +10/-0)
benchmarks/kernels/benchmark_norm_router_gemm.py (added, +183/-0)
csrc/moe/dsv4_norm_router_gemm.h (added, +30/-0)
csrc/moe/dsv4_norm_router_gemm_entry.cu (added, +130/-0)
csrc/moe/dsv4_norm_router_gemm_kernel.cu (added, +249/-0)
csrc/moe/moe_ops.h (modified, +8/-0)
csrc/moe/torch_bindings.cpp (modified, +6/-0)
vllm/_custom_ops.py (modified, +30/-0)
vllm/model_executor/layers/fused_moe/router/norm_gate_linear.py (added, +114/-0)
vllm/model_executor/models/deepseek_v4.py (modified, +44/-41)
vllm/model_executor/models/deepseek_v4_mtp.py (modified, +11/-1)
