vllm - ✅(Solved) Fix [Bug]: MTP with the Qwen3.5-MoE model fails after upgrading to 0.17.0: inference with MTP directly reports CudaError: an illegal memory access was encountered. [1 pull request, 6 comments, 5 participants]

GitHub stats
vllm-project/vllm#36498 (fetched 2026-04-08 00:36:32)
Comments: 6
Participants: 5
Timeline events: 12
Reactions: 0
Timeline (top): commented ×6, subscribed ×3, cross-referenced ×2, labeled ×1

Error Message

Qwen3.5-35B-A3B and Qwen3.5-122B-A10B-FP8 were tested. MTP still ran normally on version v0.16.0rc2.dev456 with an acceptance rate of 70%, although, strangely, the overall decoding time was higher than with MTP turned off. After updating to version 0.17.0, the high-concurrency test fails immediately with: CudaError: an illegal memory access was encountered.
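One way to see why a 70% acceptance rate does not guarantee a speedup is to estimate how many tokens each decoding step yields. The sketch below assumes every draft token is accepted independently with the same probability. That is a simplification and may not match how vLLM computes its reported acceptance rate. The function name is illustrative, not part of vLLM.

```python
def expected_tokens_per_step(acceptance_rate: float, num_speculative_tokens: int) -> float:
    """Estimate tokens emitted per target-model step under speculative decoding.

    Assumption (not from the issue): each draft token is accepted independently
    with probability `acceptance_rate`, and the first rejection discards all
    later draft tokens. The target model always contributes one token itself.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be within [0, 1]")
    if num_speculative_tokens < 0:
        raise ValueError("num_speculative_tokens must be non-negative")
    # 1 token from the target model, plus the probability that the first i
    # draft tokens are all accepted, for i = 1..num_speculative_tokens.
    return sum(acceptance_rate ** i for i in range(num_speculative_tokens + 1))


if __name__ == "__main__":
    # Figures from the report: 70% acceptance, 2 speculative tokens.
    print(round(expected_tokens_per_step(0.7, 2), 2))
```

With the figures from the report this gives about 2.19 tokens per step, so under these assumptions MTP only pays off when one speculative step costs less than roughly 2.19 times a plain decoding step.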

Fix Action

Fixed

PR fix notes

PR #36587: skip triton when graph capture

Description (problem / solution / changelog)

When launching vllm serve Qwen/Qwen3.5-0.8B --speculative_config '{"method": "mtp", "num_speculative_tokens":2}', the console shows an error (screenshot: https://github.com/user-attachments/assets/e5385971-ceca-4661-8802-2dd0f73d222e). After applying the code change, everything works (screenshot: https://github.com/user-attachments/assets/7f255f41-5913-4ad1-8cf8-7fbd777f98f8).

Fix: Skip _forward_core during CUDA Graph capture to avoid Triton kernel errors

Problem

When running the Qwen3Next model with speculative decoding (MTP method) in vLLM, CUDA Graph capture in FULL mode fails with:

RuntimeError: Triton Error [CUDA]: operation not permitted when stream is capturing

Root cause: During CUDA Graph capture, the _forward_core method of Qwen3NextGatedDeltaNet is invoked as a custom op. This method calls Triton JIT kernels (causal_conv1d_update and fused_sigmoid_gating_delta_rule_update), which internally trigger _init_handles() → load_binary() → cuModuleLoadData(). The CUDA driver forbids cuModuleLoadData while a stream is being captured, causing the runtime error.

Solution

Added an early return guard in _forward_core (vllm/model_executor/models/qwen3_next.py) that detects CUDA Graph capture state via torch.cuda.is_current_stream_capturing() and skips the entire method body during capture.

if torch.cuda.is_current_stream_capturing():
    return
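The guard pattern described above can be sketched in isolation. This is a minimal illustration, not the code from the PR: the class GatedDeltaNetSketch, its kernel_launches counter, and the injectable is_capturing predicate are hypothetical stand-ins so the example runs without a GPU. Only torch.cuda.is_current_stream_capturing() is taken from the PR description.

```python
from typing import Callable, List, Optional


def stream_is_capturing() -> bool:
    """Return True if the current CUDA stream is being captured into a graph."""
    try:
        import torch
    except ImportError:
        # Assumption for this sketch: without PyTorch there is nothing to capture.
        return False
    return torch.cuda.is_available() and torch.cuda.is_current_stream_capturing()


class GatedDeltaNetSketch:
    """Hypothetical stand-in for a layer whose core calls Triton JIT kernels."""

    def __init__(self, is_capturing: Callable[[], bool] = stream_is_capturing):
        self._is_capturing = is_capturing
        self.kernel_launches = 0  # counts simulated kernel launches

    def _forward_core(self, values: List[float]) -> Optional[List[float]]:
        # Early-return guard: do no work while a CUDA Graph is being captured,
        # so no lazily compiled kernel is loaded during capture.
        if self._is_capturing():
            return None
        self.kernel_launches += 1  # placeholder for the real kernel calls
        return [v * 2.0 for v in values]


if __name__ == "__main__":
    layer = GatedDeltaNetSketch(is_capturing=lambda: True)
    print(layer._forward_core([1.0, 2.0]), layer.kernel_launches)
```

Injecting the predicate keeps the guard testable on machines without CUDA. Whether skipping the body during capture is safe for the recorded graph depends on how the op is replayed, which this sketch does not model.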

Changed files

  • vllm/model_executor/models/qwen3_next.py (modified, +5/-0)

Your current environment

vLLM 0.17.0, 2 × H800, CUDA 12.9.86, running the vllm/vllm-openai:v0.17.0 Docker image

🐛 Describe the bug

Qwen3.5-35B-A3B and Qwen3.5-122B-A10B-FP8 were tested. MTP still ran normally on version v0.16.0rc2.dev456 with an acceptance rate of 70%, although, strangely, the overall decoding time was higher than with MTP turned off. After updating to version 0.17.0, the high-concurrency test fails immediately with: CudaError: an illegal memory access was encountered.

My startup script:

vllm serve /data/models/Qwen3.5-35B-A3B \
--host 0.0.0.0 \
--served-model-name default \
--port 9002 \
--language-model-only \
--max-num-seqs 128 \
--speculative_config '{"method": "mtp", "num_speculative_tokens":2}' \
--max-model-len auto

or

vllm serve /data/models/Qwen3.5-122B-A10B-FP8 \
--host 0.0.0.0 \
--served-model-name default \
--port 9002 \
--language-model-only \
--max-num-seqs 64 \
--tensor-parallel-size 2 \
--speculative_config '{"method": "mtp", "num_speculative_tokens":2}' \
--max-model-len auto
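When comparing runs with and without MTP, it can help to generate the launch command programmatically so that only the speculative settings differ between runs. The following is a minimal sketch that uses only the flags from the startup scripts above. The helper name build_serve_args and its parameters are hypothetical.

```python
import json
from typing import List, Optional


def build_serve_args(
    model: str,
    max_num_seqs: int,
    num_speculative_tokens: Optional[int] = None,
    tensor_parallel_size: Optional[int] = None,
    port: int = 9002,
) -> List[str]:
    """Build an argument list mirroring the startup scripts in this issue.

    Passing num_speculative_tokens=None leaves out --speculative_config,
    which runs the server without MTP.
    """
    args = [
        "vllm", "serve", model,
        "--host", "0.0.0.0",
        "--served-model-name", "default",
        "--port", str(port),
        "--language-model-only",
        "--max-num-seqs", str(max_num_seqs),
    ]
    if tensor_parallel_size is not None:
        args += ["--tensor-parallel-size", str(tensor_parallel_size)]
    if num_speculative_tokens is not None:
        # json.dumps avoids hand-written quoting mistakes in the JSON value.
        spec = {"method": "mtp", "num_speculative_tokens": num_speculative_tokens}
        args += ["--speculative_config", json.dumps(spec)]
    args += ["--max-model-len", "auto"]
    return args


if __name__ == "__main__":
    print(" ".join(build_serve_args("/data/models/Qwen3.5-35B-A3B", 128, 2)))
```

The returned list can be handed to subprocess.run(args, check=True) without shell quoting, since each flag and value is a separate element.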

extent analysis

Fix Plan

The error appears only when MTP speculative decoding is enabled. The fix recorded for this issue is PR #36587; the steps below are workarounds that help confirm or avoid the failure until a build containing the fix is installed. Note that an illegal memory access is an invalid or out-of-bounds access, not an out-of-memory condition, so allocating more GPU memory is unlikely to help on its own.

  • Disable MTP temporarily to confirm that the error is tied to speculative decoding.
  • Reduce the number of speculative tokens to see whether the error still reproduces.

Here are the concrete steps:

  1. Disable MTP temporarily by omitting --speculative_config:
vllm serve /data/models/Qwen3.5-35B-A3B \
--host 0.0.0.0 \
--served-model-name default \
--port 9002 \
--language-model-only \
--max-num-seqs 128 \
--max-model-len auto
  2. Restrict the server to a single GPU to rule out multi-GPU effects, if the model fits on one device (these variables only select devices; they do not increase memory):
export CUDA_VISIBLE_DEVICES=0
export CUDA_DEVICE_ORDER=PCI_BUS_ID
vllm serve /data/models/Qwen3.5-35B-A3B \
--host 0.0.0.0 \
--served-model-name default \
--port 9002 \
--language-model-only \
--max-num-seqs 128 \
--speculative_config '{"method": "mtp", "num_speculative_tokens":2}' \
--max-model-len auto
  3. Reduce the number of speculative tokens by changing the --speculative_config value:
--speculative_config '{"method": "mtp", "num_speculative_tokens":1}'

Verification

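To verify that the fix worked, start the vllm server with the updated configuration, rerun the high-concurrency test, and confirm that the server log contains neither "an illegal memory access was encountered" nor "operation not permitted when stream is capturing".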
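A log check of this kind can be automated. The sketch below scans log lines for the two error strings quoted on this page. The function name find_cuda_errors and the idea of reading the server log line by line are assumptions for illustration. vLLM does not prescribe this check.

```python
from typing import Iterable, List

# Error fragments quoted in this issue and in the PR description,
# lowercased so that matching is case-insensitive.
ERROR_MARKERS = (
    "an illegal memory access was encountered",
    "operation not permitted when stream is capturing",
)


def find_cuda_errors(log_lines: Iterable[str]) -> List[str]:
    """Return every log line that contains one of the known error fragments."""
    hits = []
    for line in log_lines:
        lowered = line.lower()
        if any(marker in lowered for marker in ERROR_MARKERS):
            hits.append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    sample = [
        "INFO: server started\n",
        "CudaError: an illegal memory access was encountered\n",
    ]
    for hit in find_cuda_errors(sample):
        print(hit)
```

To check a real run, pass an open file object for the captured server output. An empty result after the high-concurrency test suggests the error did not recur in that run.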

Extra Tips

  • Monitor GPU memory usage during the test (for example with nvidia-smi) to rule out memory pressure as a separate problem.
  • If the issue persists, try reducing the --max-num-seqs value or disabling MTP altogether.
