vllm - 💡(How to fix) Fix [Bug]: DeepSeek-V4-Flash hangs after ~6 requests with cudagraph_mode=FULL_AND_PIECEWISE + chunked prefill on SM 12.x (GB10) [2 comments, 2 participants]

vllm2026-04-27 06:38:14

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

vllm-project/vllm#40969•Fetched 2026-04-28 06:26:08

View on GitHub

Comments

Participants

Timeline

Reactions

Author

tonyliu312

Participants

jasl

tonyliu312

Timeline (top)

mentioned ×6subscribed ×6commented ×2cross-referenced ×1

Error Message

V4-Flash with --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' (the implicit default once --enforce-eager is dropped) loads, captures graphs cleanly, and serves the first 5–6 requests correctly. On the 6th–7th request, the engine silently hangs: requests timeout with no Python exception, no NCCL timeout, no OOM. nvidia-smi shows ~100% SM utilization on both ranks but nvtop decode tok/s is zero. Recovery requires container restart.

Fix Action

Fix / Workaround

- Hardware: 2× DGX Spark (NVIDIA GB10, SM 12.1) connected via 100Gb RoCE
- vLLM: built from `4d51588` (V4-Flash main, post #40860 merge), aarch64
- Image: `vllm/vllm-openai:deepseekv4-cu130`
- Model: `deepseek-ai/DeepSeek-V4-Flash` (mxfp4 MoE)
- Backend: `--moe-backend marlin` (with #40923 SM12.x cubin patch applied locally)
- KV cache: `--kv-cache-dtype fp8_ds_mla`
- TP=2, max_model_len=32768, prefix-caching on
- PyTorch 2.10.0+cu130 (jasl's `torch_compat_v3` shim for `torch.float8_e8m0fnu`)

The model produces coherent text on the requests that do succeed (so it isn't a numerical/quant issue), and the same workload runs cleanly when restricted to cudagraph_mode: PIECEWISE only — or with --enforce-eager. Workaround for our deployment is cudagraph_mode: PIECEWISE (4.99 tok/s steady on dual GB10 TP=2).

DeepseekV4Indexer / sparse_swa.py declares _cudagraph_support = AttentionCGSupport.UNIFORM_BATCH. With chunked prefill enabled, batches mix prefill (variable length) with decode (length=1), which violates uniform-query-length. Either:

(a) the dispatcher fails to detect mixed batches on V4's sparse-MLA path and replays a wrong-shape FULL graph, OR
(b) the metadata builder produces inconsistent state between capture and replay for V4's indexer top-k buffers.

Code Example

- Hardware: 2× DGX Spark (NVIDIA GB10, SM 12.1) connected via 100Gb RoCE
- vLLM: built from `4d51588` (V4-Flash main, post #40860 merge), aarch64
- Image: `vllm/vllm-openai:deepseekv4-cu130`
- Model: `deepseek-ai/DeepSeek-V4-Flash` (mxfp4 MoE)
- Backend: `--moe-backend marlin` (with #40923 SM12.x cubin patch applied locally)
- KV cache: `--kv-cache-dtype fp8_ds_mla`
- TP=2, max_model_len=32768, prefix-caching on
- PyTorch 2.10.0+cu130 (jasl's `torch_compat_v3` shim for `torch.float8_e8m0fnu`)

---

vllm serve /models/DeepSeek-V4-Flash \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --max-model-len 32768 \
  --moe-backend marlin \
  --kv-cache-dtype fp8_ds_mla \
  --gpu-memory-utilization 0.78 \
  --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}'

RAW_BUFFERClick to expand / collapse

[Bug]: DeepSeek-V4-Flash hangs after ~6 requests with `cudagraph_mode=FULL_AND_PIECEWISE` + chunked prefill on SM 12.x (GB10)

Your current environment

- Hardware: 2× DGX Spark (NVIDIA GB10, SM 12.1) connected via 100Gb RoCE
- vLLM: built from `4d51588` (V4-Flash main, post #40860 merge), aarch64
- Image: `vllm/vllm-openai:deepseekv4-cu130`
- Model: `deepseek-ai/DeepSeek-V4-Flash` (mxfp4 MoE)
- Backend: `--moe-backend marlin` (with #40923 SM12.x cubin patch applied locally)
- KV cache: `--kv-cache-dtype fp8_ds_mla`
- TP=2, max_model_len=32768, prefix-caching on
- PyTorch 2.10.0+cu130 (jasl's `torch_compat_v3` shim for `torch.float8_e8m0fnu`)

🐛 Describe the bug

Reproduction

Build vLLM at 4d51588 for SM 12.x (apply #40923 for native Marlin SM12.x cubin)
Launch:

vllm serve /models/DeepSeek-V4-Flash \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --max-model-len 32768 \
  --moe-backend marlin \
  --kv-cache-dtype fp8_ds_mla \
  --gpu-memory-utilization 0.78 \
  --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}'

Send 8+ sequential /v1/chat/completions requests, each ~50–100 tokens. Around request #6–7 the worker stops emitting tokens but the API connection stays open until client timeout.

What works (control matrix)

`cudagraph_mode`	chunked-prefill	result
`PIECEWISE`	enabled	✅ 4.99 tok/s, no hang over 200+ requests
`FULL_AND_PIECEWISE`	enabled	❌ hang at req ~6–7
`FULL`	enabled	❌ engine fails to start (UNIFORM_BATCH check)
`--enforce-eager`	enabled	✅ 4.5 tok/s, no hang
`PIECEWISE` + MTP n=1	enabled	⚠ runs but emits visible speculative-leak `#` tokens — separate issue, suggests spec-decode + V4 has its own divergence

Suspected area

(a) the dispatcher fails to detect mixed batches on V4's sparse-MLA path and replays a wrong-shape FULL graph, OR
(b) the metadata builder produces inconsistent state between capture and replay for V4's indexer top-k buffers.

I do not yet have py-spy stack data to confirm which (we briefly attempted py-spy capture but our diagnostic harness wasn't ready). Posting this with the empirical control matrix in case the failure mode is already familiar to V4 contributors — happy to run further diagnostics on dual GB10.

cc @zyongye @jasl @tlrmchlsmth @LucasWilkinson — V4-Flash / V32 indexer area.

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

The most likely fix for the DeepSeek-V4-Flash hang issue is to use cudagraph_mode: PIECEWISE instead of FULL_AND_PIECEWISE to avoid violating uniform-query-length with chunked prefill enabled.

Guidance

Verify that the issue is indeed caused by the cudagraph_mode setting by testing with cudagraph_mode: PIECEWISE and checking if the hang occurs.
Investigate the _cudagraph_support setting in DeepseekV4Indexer / sparse_swa.py to understand why AttentionCGSupport.UNIFORM_BATCH might be causing issues with mixed batches.
Consider collecting py-spy stack data to confirm which part of the code is causing the hang.
Test with --enforce-eager to see if the issue is related to the compilation config.

Example

No code snippet is provided as the issue is more related to configuration and settings.

Notes

The issue seems to be specific to the DeepSeek-V4-Flash model and the cudagraph_mode setting. The control matrix provided in the issue suggests that using cudagraph_mode: PIECEWISE resolves the issue.

Recommendation

Apply workaround: use cudagraph_mode: PIECEWISE instead of FULL_AND_PIECEWISE to avoid the hang issue. This is because the control matrix shows that PIECEWISE mode resolves the issue, and it is a safer option until the root cause is fully understood.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #ssr #installation #tensor shape #request timeout

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - 💡(How to fix) Fix [Bug]: DeepSeek-V4-Flash hangs after ~6 requests with cudagraph_mode=FULL_AND_PIECEWISE + chunked prefill on SM 12.x (GB10) [2 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Fix Action

Fix / Workaround

Code Example

[Bug]: DeepSeek-V4-Flash hangs after ~6 requests with `cudagraph_mode=FULL_AND_PIECEWISE` + chunked prefill on SM 12.x (GB10)

Your current environment

🐛 Describe the bug

Reproduction

What works (control matrix)

Suspected area

Before submitting a new issue...

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

TRENDING

vllm - 💡(How to fix) Fix [Bug]: DeepSeek-V4-Flash hangs after ~6 requests with cudagraph_mode=FULL_AND_PIECEWISE + chunked prefill on SM 12.x (GB10) [2 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Fix Action

Fix / Workaround

Code Example

[Bug]: DeepSeek-V4-Flash hangs after ~6 requests with cudagraph_mode=FULL_AND_PIECEWISE + chunked prefill on SM 12.x (GB10)

Your current environment

🐛 Describe the bug

Reproduction

What works (control matrix)

Suspected area

Before submitting a new issue...

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING

[Bug]: DeepSeek-V4-Flash hangs after ~6 requests with `cudagraph_mode=FULL_AND_PIECEWISE` + chunked prefill on SM 12.x (GB10)