vllm - 💡(How to fix) Fix [Bug]: DeepSeek-V4-Flash hangs after ~6 requests with cudagraph_mode=FULL_AND_PIECEWISE + chunked prefill on SM 12.x (GB10) [2 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#40969Fetched 2026-04-28 06:26:08
View on GitHub
Comments
2
Participants
2
Timeline
16
Reactions
0
Participants
Timeline (top)
mentioned ×6subscribed ×6commented ×2cross-referenced ×1

Error Message

V4-Flash with --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' (the implicit default once --enforce-eager is dropped) loads, captures graphs cleanly, and serves the first 5–6 requests correctly. On the 6th–7th request, the engine silently hangs: requests timeout with no Python exception, no NCCL timeout, no OOM. nvidia-smi shows ~100% SM utilization on both ranks but nvtop decode tok/s is zero. Recovery requires container restart.

Fix Action

Fix / Workaround

- Hardware: 2× DGX Spark (NVIDIA GB10, SM 12.1) connected via 100Gb RoCE
- vLLM: built from `4d51588` (V4-Flash main, post #40860 merge), aarch64
- Image: `vllm/vllm-openai:deepseekv4-cu130`
- Model: `deepseek-ai/DeepSeek-V4-Flash` (mxfp4 MoE)
- Backend: `--moe-backend marlin` (with #40923 SM12.x cubin patch applied locally)
- KV cache: `--kv-cache-dtype fp8_ds_mla`
- TP=2, max_model_len=32768, prefix-caching on
- PyTorch 2.10.0+cu130 (jasl's `torch_compat_v3` shim for `torch.float8_e8m0fnu`)

The model produces coherent text on the requests that do succeed (so it isn't a numerical/quant issue), and the same workload runs cleanly when restricted to cudagraph_mode: PIECEWISE only — or with --enforce-eager. Workaround for our deployment is cudagraph_mode: PIECEWISE (4.99 tok/s steady on dual GB10 TP=2).

DeepseekV4Indexer / sparse_swa.py declares _cudagraph_support = AttentionCGSupport.UNIFORM_BATCH. With chunked prefill enabled, batches mix prefill (variable length) with decode (length=1), which violates uniform-query-length. Either:

  • (a) the dispatcher fails to detect mixed batches on V4's sparse-MLA path and replays a wrong-shape FULL graph, OR
  • (b) the metadata builder produces inconsistent state between capture and replay for V4's indexer top-k buffers.

Code Example

- Hardware: 2× DGX Spark (NVIDIA GB10, SM 12.1) connected via 100Gb RoCE
- vLLM: built from `4d51588` (V4-Flash main, post #40860 merge), aarch64
- Image: `vllm/vllm-openai:deepseekv4-cu130`
- Model: `deepseek-ai/DeepSeek-V4-Flash` (mxfp4 MoE)
- Backend: `--moe-backend marlin` (with #40923 SM12.x cubin patch applied locally)
- KV cache: `--kv-cache-dtype fp8_ds_mla`
- TP=2, max_model_len=32768, prefix-caching on
- PyTorch 2.10.0+cu130 (jasl's `torch_compat_v3` shim for `torch.float8_e8m0fnu`)

---

vllm serve /models/DeepSeek-V4-Flash \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --max-model-len 32768 \
  --moe-backend marlin \
  --kv-cache-dtype fp8_ds_mla \
  --gpu-memory-utilization 0.78 \
  --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}'
RAW_BUFFERClick to expand / collapse

[Bug]: DeepSeek-V4-Flash hangs after ~6 requests with cudagraph_mode=FULL_AND_PIECEWISE + chunked prefill on SM 12.x (GB10)

Your current environment

- Hardware: 2× DGX Spark (NVIDIA GB10, SM 12.1) connected via 100Gb RoCE
- vLLM: built from `4d51588` (V4-Flash main, post #40860 merge), aarch64
- Image: `vllm/vllm-openai:deepseekv4-cu130`
- Model: `deepseek-ai/DeepSeek-V4-Flash` (mxfp4 MoE)
- Backend: `--moe-backend marlin` (with #40923 SM12.x cubin patch applied locally)
- KV cache: `--kv-cache-dtype fp8_ds_mla`
- TP=2, max_model_len=32768, prefix-caching on
- PyTorch 2.10.0+cu130 (jasl's `torch_compat_v3` shim for `torch.float8_e8m0fnu`)

🐛 Describe the bug

V4-Flash with --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' (the implicit default once --enforce-eager is dropped) loads, captures graphs cleanly, and serves the first 5–6 requests correctly. On the 6th–7th request, the engine silently hangs: requests timeout with no Python exception, no NCCL timeout, no OOM. nvidia-smi shows ~100% SM utilization on both ranks but nvtop decode tok/s is zero. Recovery requires container restart.

The model produces coherent text on the requests that do succeed (so it isn't a numerical/quant issue), and the same workload runs cleanly when restricted to cudagraph_mode: PIECEWISE only — or with --enforce-eager. Workaround for our deployment is cudagraph_mode: PIECEWISE (4.99 tok/s steady on dual GB10 TP=2).

Reproduction

  1. Build vLLM at 4d51588 for SM 12.x (apply #40923 for native Marlin SM12.x cubin)
  2. Launch:
vllm serve /models/DeepSeek-V4-Flash \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --max-model-len 32768 \
  --moe-backend marlin \
  --kv-cache-dtype fp8_ds_mla \
  --gpu-memory-utilization 0.78 \
  --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}'
  1. Send 8+ sequential /v1/chat/completions requests, each ~50–100 tokens. Around request #6–7 the worker stops emitting tokens but the API connection stays open until client timeout.

What works (control matrix)

cudagraph_modechunked-prefillresult
PIECEWISEenabled✅ 4.99 tok/s, no hang over 200+ requests
FULL_AND_PIECEWISEenabled❌ hang at req ~6–7
FULLenabled❌ engine fails to start (UNIFORM_BATCH check)
--enforce-eagerenabled✅ 4.5 tok/s, no hang
PIECEWISE + MTP n=1enabled⚠ runs but emits visible speculative-leak # tokens — separate issue, suggests spec-decode + V4 has its own divergence

Suspected area

DeepseekV4Indexer / sparse_swa.py declares _cudagraph_support = AttentionCGSupport.UNIFORM_BATCH. With chunked prefill enabled, batches mix prefill (variable length) with decode (length=1), which violates uniform-query-length. Either:

  • (a) the dispatcher fails to detect mixed batches on V4's sparse-MLA path and replays a wrong-shape FULL graph, OR
  • (b) the metadata builder produces inconsistent state between capture and replay for V4's indexer top-k buffers.

I do not yet have py-spy stack data to confirm which (we briefly attempted py-spy capture but our diagnostic harness wasn't ready). Posting this with the empirical control matrix in case the failure mode is already familiar to V4 contributors — happy to run further diagnostics on dual GB10.

cc @zyongye @jasl @tlrmchlsmth @LucasWilkinson — V4-Flash / V32 indexer area.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

The most likely fix for the DeepSeek-V4-Flash hang issue is to use cudagraph_mode: PIECEWISE instead of FULL_AND_PIECEWISE to avoid violating uniform-query-length with chunked prefill enabled.

Guidance

  • Verify that the issue is indeed caused by the cudagraph_mode setting by testing with cudagraph_mode: PIECEWISE and checking if the hang occurs.
  • Investigate the _cudagraph_support setting in DeepseekV4Indexer / sparse_swa.py to understand why AttentionCGSupport.UNIFORM_BATCH might be causing issues with mixed batches.
  • Consider collecting py-spy stack data to confirm which part of the code is causing the hang.
  • Test with --enforce-eager to see if the issue is related to the compilation config.

Example

No code snippet is provided as the issue is more related to configuration and settings.

Notes

The issue seems to be specific to the DeepSeek-V4-Flash model and the cudagraph_mode setting. The control matrix provided in the issue suggests that using cudagraph_mode: PIECEWISE resolves the issue.

Recommendation

Apply workaround: use cudagraph_mode: PIECEWISE instead of FULL_AND_PIECEWISE to avoid the hang issue. This is because the control matrix shows that PIECEWISE mode resolves the issue, and it is a safer option until the root cause is fully understood.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING