vllm - 💡(How to fix) Fix [Bug]: TP=2 spec-decode (qwen3_next_mtp / DFlash) on Qwen3.6 hybrid GDN crashes at gpu_model_runner.py:1927 num_accepted_tokens_event.synchronize() (cudaErrorIllegalAddress) [1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#41190Fetched 2026-04-30 06:19:36
View on GitHub
Comments
0
Participants
1
Timeline
0
Reactions
0
Author
Participants

Error Message

(Worker_TP1) ERROR [multiproc_executor.py:962] WorkerProc hit an exception.
(Worker_TP1) ERROR   File ".../vllm/v1/worker/gpu_model_runner.py", line 3496, in synchronize_input_prep
(Worker_TP1) ERROR     yield
(Worker_TP1) ERROR   File ".../vllm/v1/worker/gpu_model_runner.py", line 3875, in execute_model
(Worker_TP1) ERROR     logits_indices, spec_decode_metadata = self._prepare_inputs(
(Worker_TP1) ERROR   File ".../vllm/v1/worker/gpu_model_runner.py", line 1927, in _prepare_inputs
(Worker_TP1) ERROR     self.num_accepted_tokens_event.synchronize()
(Worker_TP1) ERROR torch.AcceleratorError: CUDA error: an illegal memory access was encountered

During handling of the above exception, another exception occurred:
  File ".../vllm/v1/worker/gpu_model_runner.py", line 3498, in synchronize_input_prep
    self.prepare_inputs_event.record()
torch.AcceleratorError: CUDA error: an illegal memory access was encountered

[rank0]:[E ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered

Root Cause

Side note: in that TP=1 test, qwen3_next_mtp acceptance rate was 0% (2021 drafts, 0 accepted), but this appears to be a separate issue likely caused by the abliterated/distilled MTP head, and is unrelated to the crash discussed here.

Fix Action

Fix / Workaround

What I observed at TP=1 (NOT a confirmed workaround)

Under a different configuration — an NVFP4-quantized variant of the same model family, TP=1, qwen3_next_mtp with num_speculative_tokens=2, and explicit --gdn-prefill-backend triton — no crash occurred. However, too many variables changed simultaneously (TP, model, quantization, context length, gpu_memory_utilization) to call TP=1 a confirmed workaround.

Workaround currently in production

Code Example

vllm serve QuantTrio/Qwen3.6-35B-A3B-AWQ \
  --tensor-parallel-size 2 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.85 \
  --max-num-batched-tokens 16384 \
  --port 8080 \
  --served-model-name qwen3.6-uncensored \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --trust-remote-code \
  --attention-backend flash_attn \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

---

(Worker_TP1) ERROR [multiproc_executor.py:962] WorkerProc hit an exception.
(Worker_TP1) ERROR   File ".../vllm/v1/worker/gpu_model_runner.py", line 3496, in synchronize_input_prep
(Worker_TP1) ERROR     yield
(Worker_TP1) ERROR   File ".../vllm/v1/worker/gpu_model_runner.py", line 3875, in execute_model
(Worker_TP1) ERROR     logits_indices, spec_decode_metadata = self._prepare_inputs(
(Worker_TP1) ERROR   File ".../vllm/v1/worker/gpu_model_runner.py", line 1927, in _prepare_inputs
(Worker_TP1) ERROR     self.num_accepted_tokens_event.synchronize()
(Worker_TP1) ERROR torch.AcceleratorError: CUDA error: an illegal memory access was encountered

During handling of the above exception, another exception occurred:
  File ".../vllm/v1/worker/gpu_model_runner.py", line 3498, in synchronize_input_prep
    self.prepare_inputs_event.record()
torch.AcceleratorError: CUDA error: an illegal memory access was encountered

[rank0]:[E ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
RAW_BUFFERClick to expand / collapse

Your current environment

  • vLLM: 0.19.2rc1.dev226+g53b9640fb (built off PR #40898 head — DFlash + SWA support)
  • GPU: 2× NVIDIA RTX 6000 Ada Generation (sm_89, 48 GB each, no NVLink, P2P over PCIe)
  • Driver: 580.126.09 / CUDA 13.0
  • OS: Ubuntu, Linux 6.8.0-110-generic
  • Python 3.12
  • NCCL env: NCCL_P2P_DISABLE=1 NCCL_SHM_DISABLE=0 NCCL_IB_DISABLE=1

🐛 Describe the bug

Enabling speculative decoding (tested both qwen3_next_mtp and DFlash methods) on a hybrid-GDN Qwen3.6-35B-A3B model with TP=2 causes a hard crash on the very first chat completion request. The error is cudaErrorIllegalAddress, with the stack landing at self.num_accepted_tokens_event.synchronize() in gpu_model_runner.py:1927. Both TP0 and TP1 workers raise the same error, and the NCCL ProcessGroup watchdog thread subsequently terminates with the same illegal address exception.

Note: GDN prefill backend was auto-selected as Triton/FLA (not FlashInfer) in this build at the time of the crash, so this appears distinct from the FlashInfer-GDN line of bugs in #37729 and #37035, despite affecting the same model family.

Reproduction

Launch:

vllm serve QuantTrio/Qwen3.6-35B-A3B-AWQ \
  --tensor-parallel-size 2 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.85 \
  --max-num-batched-tokens 16384 \
  --port 8080 \
  --served-model-name qwen3.6-uncensored \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --trust-remote-code \
  --attention-backend flash_attn \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

Also tested with --speculative-config '{"method":"dflash","model":"z-lab/Qwen3.6-35B-A3B-DFlash","num_speculative_tokens":15}' — same crash at the same line.

Server starts cleanly (Application startup complete), but any chat completion request triggers the crash immediately.

Stack trace

(Worker_TP1) ERROR [multiproc_executor.py:962] WorkerProc hit an exception.
(Worker_TP1) ERROR   File ".../vllm/v1/worker/gpu_model_runner.py", line 3496, in synchronize_input_prep
(Worker_TP1) ERROR     yield
(Worker_TP1) ERROR   File ".../vllm/v1/worker/gpu_model_runner.py", line 3875, in execute_model
(Worker_TP1) ERROR     logits_indices, spec_decode_metadata = self._prepare_inputs(
(Worker_TP1) ERROR   File ".../vllm/v1/worker/gpu_model_runner.py", line 1927, in _prepare_inputs
(Worker_TP1) ERROR     self.num_accepted_tokens_event.synchronize()
(Worker_TP1) ERROR torch.AcceleratorError: CUDA error: an illegal memory access was encountered

During handling of the above exception, another exception occurred:
  File ".../vllm/v1/worker/gpu_model_runner.py", line 3498, in synchronize_input_prep
    self.prepare_inputs_event.record()
torch.AcceleratorError: CUDA error: an illegal memory access was encountered

[rank0]:[E ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered

Test matrix

spec methodTPnum_spec_tokenscudagraph_modeGDN backendresult
qwen3_next_mtp21fulltriton (auto)CRASH
qwen3_next_mtp22fulltriton (auto)CRASH
qwen3_next_mtp22piecewisetriton (auto)CRASH
qwen3_next_mtp22none (eager)triton (auto)CRASH
dflash215fulltriton (auto)CRASH (same line)

Neither cudagraph_mode, num_speculative_tokens, nor the choice of spec method affects the crash.

What I observed at TP=1 (NOT a confirmed workaround)

Under a different configuration — an NVFP4-quantized variant of the same model family, TP=1, qwen3_next_mtp with num_speculative_tokens=2, and explicit --gdn-prefill-backend triton — no crash occurred. However, too many variables changed simultaneously (TP, model, quantization, context length, gpu_memory_utilization) to call TP=1 a confirmed workaround.

Side note: in that TP=1 test, qwen3_next_mtp acceptance rate was 0% (2021 drafts, 0 accepted), but this appears to be a separate issue likely caused by the abliterated/distilled MTP head, and is unrelated to the crash discussed here.

Related issues

  • #37729 — V1 engine silent deadlock under concurrent load, root-caused to the FlashInfer GDN prefill kernel. Different surface from this bug (deadlock vs hard crash), and our crash occurred with Triton GDN, not FlashInfer GDN — but same model family.
  • #37035 — gdn_attn.py:237 OOB at num_speculative_tokens>=5 (FlashInfer GDN kernel). Different crash location and different threshold for spec-tokens.
  • #37750 — Closed PR that attempted to fix #37729 by adding a device= parameter to torch.Event(...) calls, including num_accepted_tokens_event. The fix turned out not to address the root cause and was not merged, but the same event object is the one crashing here.

Hypothesis (unverified)

num_accepted_tokens_event requires cross-rank coordination at TP>1. The fact that this event is the crash point — and that #37750 specifically called out its torch.Event(device=...) initialization as suspect — suggests a multi-rank event lifetime or ordering bug in the spec-decode preparation path. A fruitful place to look might be how spec-decode metadata is broadcast across ranks via the multiproc executor before this first synchronize() call, and whether the event's device-side state can be invalidated between record() and synchronize() under TP>1.

Workaround currently in production

Disabled --speculative-config entirely. Pure-decode runs cleanly at 174 tok/s short / 141 tok/s long on this hardware.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page.

extent analysis

TL;DR

The most likely fix for the crash caused by enabling speculative decoding on a hybrid-GDN Qwen3.6-35B-A3B model with TP=2 is to investigate and resolve the multi-rank event lifetime or ordering bug in the spec-decode preparation path.

Guidance

  1. Investigate event initialization: Review how num_accepted_tokens_event is initialized, particularly its device parameter, to ensure it is correctly set up for cross-rank coordination at TP>1.
  2. Verify spec-decode metadata broadcast: Check how spec-decode metadata is broadcast across ranks via the multiproc executor before the first synchronize() call to identify potential issues with event state invalidation.
  3. Test with different TP settings: Although TP=1 is not a confirmed workaround, testing with different TP settings may help isolate the issue and provide more insight into the problem.
  4. Review related issues: Examine the related issues (#37729, #37035, #37750) to see if any of the fixes or hypotheses can be applied to this case.

Example

No code snippet is provided as the issue requires a deeper investigation into the event initialization and spec-decode metadata broadcast.

Notes

The crash may be related to a multi-rank event lifetime or ordering bug, and resolving this issue will likely require a thorough understanding of the spec-decode preparation path and cross-rank coordination.

Recommendation

Apply a workaround by disabling --speculative-config entirely, as currently done in production, until a proper fix can be implemented. This will allow the model to run cleanly, albeit without speculative decoding.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING