vllm - ✅(Solved) Fix [Bug]: There is something wrong with the use of mtp in qwen3.5-moe model: when it is changed to 0.17.0, it is wrong to directly report CudaError: an illegal memory access was encountered when reasoning with mtp. [1 pull requests, 6 comments, 5 participants]

vllm · 2026-03-09 12:45:28

GitHub stats

vllm-project/vllm#36498•Fetched 2026-04-08 00:36:32

Timeline (top): commented ×6, subscribed ×3, cross-referenced ×2, labeled ×1

Fix Action

Fixed

Fixed by PR: skip triton when graph capture (https://github.com/vllm-project/vllm/pull/36587)

PR fix notes

PR #36587: skip triton when graph capture

Repository: vllm-project/vllm
Author: flutist
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/36587

Description (problem / solution / changelog)

When launching vllm serve Qwen/Qwen3.5-0.8B --speculative_config '{"method": "mtp", "num_speculative_tokens":2}', the console shows an error (screenshot: https://github.com/user-attachments/assets/e5385971-ceca-4661-8802-2dd0f73d222e). After applying the change, everything works (screenshot: https://github.com/user-attachments/assets/7f255f41-5913-4ad1-8cf8-7fbd777f98f8).

Fix: Skip _forward_core during CUDA Graph capture to avoid Triton kernel errors

Problem

When running the Qwen3Next model with speculative decoding (MTP method) in vLLM, CUDA Graph capture in FULL mode fails with:

RuntimeError: Triton Error [CUDA]: operation not permitted when stream is capturing

Root cause

During CUDA Graph capture, the _forward_core method of Qwen3NextGatedDeltaNet is invoked as a custom op. This method calls Triton JIT kernels (causal_conv1d_update and fused_sigmoid_gating_delta_rule_update), which internally trigger _init_handles() → load_binary() → cuModuleLoadData(). The CUDA driver forbids cuModuleLoadData while a stream is being captured, causing the runtime error.

Solution

Added an early return guard in _forward_core (vllm/model_executor/models/qwen3_next.py) that detects CUDA Graph capture state via torch.cuda.is_current_stream_capturing() and skips the entire method body during capture.

if torch.cuda.is_current_stream_capturing():
    return
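The failure mode and the guard can be illustrated without a GPU. The following is only a toy sketch: FakeStream, LazyKernel, and this forward_core are hypothetical stand-ins invented for illustration, not vLLM or Triton code. The sketch assumes that the kernel loads its binary lazily on first call and that loading is rejected while capture is active, as the root-cause description states.

```python
class FakeStream:
    """Hypothetical stand-in for a CUDA stream; `capturing` mimics graph-capture state."""

    def __init__(self) -> None:
        self.capturing = False


class LazyKernel:
    """Toy JIT kernel that loads its binary on first call (assumed behaviour)."""

    def __init__(self, stream: FakeStream) -> None:
        self.stream = stream
        self.loaded = False

    def __call__(self, x):
        if not self.loaded:
            # Mimics the driver rejecting a module load during capture.
            if self.stream.capturing:
                raise RuntimeError("operation not permitted when stream is capturing")
            self.loaded = True
        return [2 * v for v in x]


def forward_core(kernel: LazyKernel, stream: FakeStream, x):
    # Early-return guard in the spirit of the PR: skip the body while capturing.
    if stream.capturing:
        return None
    return kernel(x)


stream = FakeStream()
kernel = LazyKernel(stream)

# During capture, an unguarded first call fails on the lazy load.
stream.capturing = True
try:
    kernel([1, 2])
    unguarded_error = None
except RuntimeError as exc:
    unguarded_error = str(exc)

# The guarded call returns early and does no work while capturing.
guarded_result = forward_core(kernel, stream, [1, 2])

# Outside capture, the guarded call runs the kernel normally.
stream.capturing = False
normal_result = forward_core(kernel, stream, [1, 2])
```

In this toy model the guarded call performs no work at all while capture is active, which mirrors "skips the entire method body during capture" in the PR description.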

Changed files

vllm/model_executor/models/qwen3_next.py (modified, +5/-0)


Your current environment

vllm 0.17.0, 2× H800, CUDA 12.9.86, using the vllm/vllm-openai:v0.17.0 Docker image

🐛 Describe the bug

Qwen3.5-35B-A3B and Qwen3.5-122B-A10B-FP8 were tested. Previously, on v0.16.0rc2.dev456, MTP still ran normally with an acceptance rate of 70%, although the overall decoding time was higher than with MTP turned off, which was very strange. Today, after updating to 0.17.0, a high-concurrency test fails immediately with: CudaError: an illegal memory access was encountered.

My startup scripts:

vllm serve /data/models/Qwen3.5-35B-A3B \
--host 0.0.0.0 \
--served-model-name default \
--port 9002 \
--language-model-only \
--max-num-seqs 128 \
--speculative_config '{"method": "mtp", "num_speculative_tokens":2}' \
--max-model-len auto

vllm serve /data/models/Qwen3.5-122B-A10B-FP8 \
--host 0.0.0.0 \
--served-model-name default \
--port 9002 \
--language-model-only \
--max-num-seqs 64 \
--tensor-parallel-size 2 \
--speculative_config '{"method": "mtp", "num_speculative_tokens":2}' \
--max-model-len auto
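One possible reading of the earlier observation (70% acceptance, yet slower decoding than with MTP off) can be sketched with the standard speculative-decoding estimate. This is a back-of-the-envelope model, not a measurement: it assumes the 70% figure is an independent per-position acceptance probability, and step_cost_ratio (the cost of one draft-plus-verify step relative to one plain decode step) is a hypothetical parameter whose real value would have to be measured.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha, and that one extra token is always produced by the
    target model: 1 + alpha + alpha**2 + ... + alpha**k.
    """
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speedup(alpha: float, k: int, step_cost_ratio: float) -> float:
    """Throughput relative to plain decoding; values below 1.0 mean a slowdown."""
    return expected_tokens_per_step(alpha, k) / step_cost_ratio


# Values from the report: acceptance rate 0.7, num_speculative_tokens 2.
tokens = expected_tokens_per_step(0.7, 2)

# Hypothetical cost ratios, for illustration only.
for ratio in (1.2, 2.0, 2.5):
    print(f"step_cost_ratio={ratio}: speedup={speedup(0.7, 2, ratio):.2f}")
```

Under these assumptions each step yields about 2.19 tokens, so MTP only pays off while a speculative step costs less than roughly 2.19 plain decode steps. At high concurrency the verification pass may become compute-bound, which could push the ratio past that point.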

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

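The issue appears to be related to MTP speculative decoding and surfaces as an illegal CUDA memory access. Until the fix is available, the following workarounds can be tried:

Update the vLLM configuration to use fewer speculative tokens, or disable MTP temporarily.
Pin the server to a known GPU to rule out device-selection problems. An illegal memory access is an invalid-address fault rather than an out-of-memory condition, so allocating more memory is unlikely to resolve it.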

Here are the concrete steps:

Disable MTP temporarily by omitting --speculative_config:

vllm serve /data/models/Qwen3.5-35B-A3B \
--host 0.0.0.0 \
--served-model-name default \
--port 9002 \
--language-model-only \
--max-num-seqs 128 \
--max-model-len auto

Pin the visible GPU (these variables select which device is used; they do not change how much memory is allocated):

export CUDA_VISIBLE_DEVICES=0
export CUDA_DEVICE_ORDER=PCI_BUS_ID
vllm serve /data/models/Qwen3.5-35B-A3B \
--host 0.0.0.0 \
--served-model-name default \
--port 9002 \
--language-model-only \
--max-num-seqs 128 \
--speculative_config '{"method": "mtp", "num_speculative_tokens":2}' \
--max-model-len auto

Reduce the number of speculative tokens:

# value for the speculative_config option (serialize to JSON for --speculative_config)
speculative_config = {
    "method": "mtp",
    "num_speculative_tokens": 1  # reduce the number of speculative tokens
}
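To pass a dictionary like the one above on the command line, it has to be serialized to JSON and quoted for the shell. A minimal sketch using only the standard library follows; build_serve_command is a hypothetical helper written for this example, and only the vllm serve invocation and the --speculative_config flag come from the report.

```python
import json
import shlex

speculative_config = {
    "method": "mtp",
    "num_speculative_tokens": 1,  # reduced from 2, as suggested above
}


def build_serve_command(model_path: str, spec_config: dict) -> str:
    """Return a shell-safe `vllm serve` command line (hypothetical helper)."""
    args = [
        "vllm", "serve", model_path,
        "--speculative_config", json.dumps(spec_config),
    ]
    # shlex.join quotes the JSON so the shell passes it as a single argument.
    return shlex.join(args)


command = build_serve_command("/data/models/Qwen3.5-35B-A3B", speculative_config)
print(command)
```

Generating the JSON with json.dumps avoids hand-written quoting mistakes, which are easy to make when the value is embedded in single quotes inside a shell script.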

Verification

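To verify that the fix worked, restart the vLLM server with the updated configuration, rerun the high-concurrency test that triggered the crash, and confirm that the server log no longer contains "an illegal memory access was encountered".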
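Checking the log can be automated with a small scan for the two error messages quoted on this page. This is a sketch under assumptions: the sample log lines are invented for illustration, and find_cuda_errors is a hypothetical helper, not part of vLLM.

```python
# Error text quoted in the issue and in the PR description.
ERROR_SIGNATURES = (
    "an illegal memory access was encountered",
    "operation not permitted when stream is capturing",
)


def find_cuda_errors(log_lines):
    """Return (line_number, signature) pairs for every matching log line."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        lowered = line.lower()
        for signature in ERROR_SIGNATURES:
            if signature in lowered:
                hits.append((number, signature))
    return hits


# Invented sample log, for illustration only.
sample_log = [
    "INFO server started",
    "ERROR CudaError: an illegal memory access was encountered",
    "INFO request finished",
]

hits = find_cuda_errors(sample_log)
print("clean" if not hits else f"errors found: {hits}")
```

In practice the lines would come from the server's log file, for example by opening the file and passing the file object to find_cuda_errors.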

Extra Tips

Monitor GPU memory usage during the test and adjust --gpu-memory-utilization if needed.
If the issue persists, try reducing the --max-num-seqs value or disabling MTP altogether.
