vllm - 💡(How to fix) Fix [Bug]: MTP speculative decoding crash with illegal memory access on long sequences (Qwen3.6-27B-FP8, v0.19.1) [1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#40756Fetched 2026-04-24 10:36:26
View on GitHub
Comments
0
Participants
1
Timeline
1
Reactions
0
Participants
Timeline (top)
labeled ×1

Error Message

All worker processes then fail with torch.AcceleratorError: CUDA error: an illegal memory access was encountered in gpu_model_runner.py line 1706 (prev_common_req_indices_tensor = torch.tensor(...)). torch.AcceleratorError: CUDA error: an illegal memory access was encountered

Code Example

Your output of `python collect_env.py` here
RAW_BUFFERClick to expand / collapse

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Your output of `python collect_env.py` here
</details>

🐛 Describe the bug

Description When using MTP speculative decoding (num_spec_tokens=5) with the FP8‑quantized Qwen3.6‑27B model as both target and draft model, the engine crashes on long requests.

Environment

vLLM version: 0.19.1

Model: Qwen3.6-27B-FP8

TP size: 4, fp8 quantization, prefix caching + chunked prefill enabled

Speculative config: method=mtp, model=same as target, num_spec_tokens=5

Symptoms

The crash occurs after the request has accumulated ~26 k total tokens and has generated >1200 output tokens.

Just before the crash, spec metrics become abnormal: accepted tokens equal drafted tokens, acceptance rate jumps to 100%, and the scheduled draft tokens are all -1.

All worker processes then fail with torch.AcceleratorError: CUDA error: an illegal memory access was encountered in gpu_model_runner.py line 1706 (prev_common_req_indices_tensor = torch.tensor(...)).

Excerpt from logs

SpecDecoding metrics: Mean acceptance length: 6.00, ..., Avg Draft acceptance rate: 100.0% scheduled_spec_decode_tokens={...: [-1, -1, -1, -1, -1]} ... torch.AcceleratorError: CUDA error: an illegal memory access was encountered File "gpu_model_runner.py", line 1706, in _prepare_input_ids prev_common_req_indices_tensor = torch.tensor(

To reproduce

Serve Qwen3.6-27B-FP8 with MTP speculative decoding using --num-speculative-tokens 5.

Send a conversation that grows to 25k+ tokens total.

The crash typically happens after 1000+ tokens have been generated.

Expected behavior Generation should continue without invalid draft tokens or illegal memory access.

Additional notes

GPU memory usage was low (~5.4% KV cache), so it’s not an OOM issue.

The problem is reproducible; disabling speculative decoding avoids the crash.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

Disabling speculative decoding or reducing the number of speculative tokens may temporarily mitigate the crash issue with MTP speculative decoding in the Qwen3.6-27B-FP8 model.

Guidance

  • Investigate the speculative decoding configuration, specifically the num_spec_tokens parameter, to determine if reducing its value can prevent the crash without significantly impacting performance.
  • Verify that the issue is not related to GPU memory usage, as the reported usage is low (~5.4% KV cache), but consider monitoring GPU memory allocation during the crash to rule out any memory-related issues.
  • Review the gpu_model_runner.py code, particularly line 1706, to understand the context of the torch.AcceleratorError and how it relates to the speculative decoding process.
  • Test the model with a smaller input size or a different model to isolate if the issue is specific to the Qwen3.6-27B-FP8 model or the speculative decoding configuration.

Example

No specific code example can be provided without modifying the existing codebase, but reviewing the gpu_model_runner.py file and the speculative decoding configuration may offer insights into the cause of the crash.

Notes

The exact cause of the crash is unclear, but it appears to be related to the speculative decoding process. Disabling speculative decoding or reducing the number of speculative tokens may be a temporary workaround, but a more permanent solution will require further investigation into the gpu_model_runner.py code and the speculative decoding configuration.

Recommendation

Apply workaround: Reduce the num_spec_tokens value to a lower number (e.g., 1 or 2) to mitigate the crash issue while investigating the root cause. This may impact performance but can help prevent the crash and allow for further debugging.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING