vllm - ✅(Solved) Fix [Bug]: CUDA ILM (Illegal Memory Access) crash when enabling MTP for Qwen3.5-397B-A17B under high concurrency [2 pull requests, 21 comments, 10 participants]

xiaochengyige · 2026-03-10T08:49:13Z




GitHub stats

vllm-project/vllm#36613 • Fetched 2026-04-08 00:36:00


Timeline (top)

commented ×21, subscribed ×17, mentioned ×10, cross-referenced ×4

Error Message

I am experiencing a critical crash (CUDA ILM, an illegal memory access error) when serving the Qwen3.5-397B-A17B model with Multi-Token Prediction (MTP) enabled under many concurrent requests. This matches the report in https://github.com/vllm-project/vllm/issues/34948#issuecomment-3977914704. During request processing, the server suddenly crashes with a CUDA ILM error.

Fix Action

Fix / Workaround

According to the reporter, serving the same model without `--speculative-config` (that is, with MTP disabled) runs without problems under high load; see the second command under Code Example. PR #36925, described under PR fix notes, was proposed as a possible fix but was closed without being merged.

PR fix notes

PR #36925: [Bugfix] signature match for passing `spec_step_idx` in qwen3-next and qwen3.5

Repository: vllm-project/vllm
Author: JGSweets
State: closed | merged: False
Link: https://github.com/vllm-project/vllm/pull/36925

Description (problem / solution / changelog)

Purpose

The qwen3-next and qwen3.5 MTP modules currently do not pass `spec_step_idx`. When `num_speculative_tokens > 1`, an illegal memory access eventually occurs, possibly because this argument is not passed.
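
To make the suspected failure mode concrete, here is a minimal, self-contained sketch of what "not passing `spec_step_idx`" can look like. It is not vLLM code: `ToyMTP`, `forward_dropping_idx`, and `forward_passing_idx` are hypothetical names, and selecting a layer by taking the step index modulo the layer count is an assumption made purely for illustration. The PR itself only suspected this as the cause and was closed without being merged.

```python
class ToyMTP:
    """Hypothetical stand-in for an MTP head with one layer per speculative step."""

    def __init__(self, num_mtp_layers: int) -> None:
        self.num_mtp_layers = num_mtp_layers

    def select_layer(self, spec_step_idx: int = 0) -> int:
        # Assumption for illustration: the step index picks the layer, wrapping around.
        return spec_step_idx % self.num_mtp_layers


def forward_dropping_idx(mtp: ToyMTP, **kwargs) -> int:
    # The caller supplies spec_step_idx, but it is swallowed by **kwargs and
    # never forwarded, so every speculative step ends up using layer 0.
    return mtp.select_layer()


def forward_passing_idx(mtp: ToyMTP, spec_step_idx: int = 0, **kwargs) -> int:
    # The signature names spec_step_idx explicitly and forwards it.
    return mtp.select_layer(spec_step_idx)


mtp = ToyMTP(num_mtp_layers=2)
dropped = [forward_dropping_idx(mtp, spec_step_idx=i) for i in range(2)]
passed = [forward_passing_idx(mtp, spec_step_idx=i) for i in range(2)]
print(dropped, passed)
```

The point of the sketch is only that a signature mismatch of this kind fails silently: no exception is raised, and the callee quietly behaves as if the step index were always its default.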

Possible fix for: https://github.com/vllm-project/vllm/issues/36613

Test Plan

Would like assistance in validating whether this could be a fix for said issue.

Test Result


Changed files

vllm/model_executor/models/qwen3_5_mtp.py (modified, +7/-1)
vllm/model_executor/models/qwen3_next_mtp.py (modified, +7/-1)

PR #98: SM121/GB10: Build arch split, prefix caching, Qwen3.5 NVFP4 support

Repository: eugr/spark-vllm-docker
Author: RobTand
State: open | merged: False
Link: https://github.com/eugr/spark-vllm-docker/pull/98

Description (problem / solution / changelog)

Summary

Fix SM121 build arch split for NVFP4 on DGX Spark
Enable prefix caching for Nemotron with Mamba align mode
Add Qwen3.5-122B NVFP4 recipe (FlashInfer CUTLASS MoE, MTP)
Add Mamba SSM SM121 recognition fix

Changes

Build system (`build-and-copy.sh`, `Dockerfile`)

`GPU_ARCH_LIST=12.0a` (was `12.1a`): vLLM's CMake gates FP4 kernels on `cuda_archs_loose_intersection("12.0a", ...)`. Without `12.0`, `ENABLE_NVFP4_SM120` is never set, which produces "No compiled nvfp4 quantization kernel".
Hardcode `FLASHINFER_CUDA_ARCH_LIST=12.1a`: FlashInfer JIT needs SM121 for the E2M1 software fallback (via the `fix-e2m1-sm121` mod). The two values MUST differ: vLLM needs `12.0a`, FlashInfer needs `12.1a`.
Dockerfile defaults updated to `12.0a;12.1a` for `TORCH_CUDA_ARCH_LIST`; SM121 added to `CUDA_SUPPORTED_ARCHS`.
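
The arch split described above can be expressed as a small pre-build sanity check. This is a sketch under assumptions: the variable names and values are taken from the PR description, but `check_arch_split` is a hypothetical helper and is not part of `build-and-copy.sh` or the `Dockerfile`.

```python
def check_arch_split(env: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the split looks as described."""
    problems = []
    vllm_archs = env.get("GPU_ARCH_LIST", "").split(";")
    flashinfer_archs = env.get("FLASHINFER_CUDA_ARCH_LIST", "").split(";")
    # Per the PR description, vLLM's NVFP4 kernels are only built for 12.0a.
    if "12.0a" not in vllm_archs:
        problems.append("GPU_ARCH_LIST should contain 12.0a for the vLLM NVFP4 kernels")
    # Per the PR description, FlashInfer JIT needs 12.1a for the E2M1 fallback.
    if "12.1a" not in flashinfer_archs:
        problems.append("FLASHINFER_CUDA_ARCH_LIST should contain 12.1a for FlashInfer JIT")
    return problems


good = {"GPU_ARCH_LIST": "12.0a", "FLASHINFER_CUDA_ARCH_LIST": "12.1a"}
old = {"GPU_ARCH_LIST": "12.1a", "FLASHINFER_CUDA_ARCH_LIST": "12.1a"}
print(check_arch_split(good))
print(check_arch_split(old))
```

With the previous setting (`GPU_ARCH_LIST=12.1a`), the check reports one problem, which corresponds to the missing NVFP4 kernel described in the first item.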

Nemotron recipe

Enable prefix caching with `--mamba-cache-mode align` (the default `all` mode crashes on SM121 in the `selective_state_update` Triton kernel)
Add `--max-num-batched-tokens 16384` (the Mamba block size of 8400 must be <= `max_num_batched_tokens` in align mode)

Qwen3.5 recipe (new)

Model: `scottgl/Qwen3.5-122B-A10B-MTP-NVFP4`
Same FlashInfer CUTLASS MoE path as Nemotron (`VLLM_NVFP4_GEMM_BACKEND=cutlass`, `VLLM_USE_FLASHINFER_MOE_FP4=1`)
MTP speculative decoding (n=2)
Tested: 25-40 tok/s on DGX Spark

SM121 mod (`mods/fix-e2m1-sm121`)

Add Mamba SSM fix: recognize SM121 (capability family 120) as Blackwell-class in `mamba_mixer2.py` for correct `selective_state_update` block sizes

Performance

| Model | Throughput | MTP | Prefix Cache |
|---|---|---|---|
| Nemotron-3-Super-120B NVFP4 | 15-20 tok/s | n=2 | Yes (align) |
| Qwen3.5-122B-A10B NVFP4 | 25-40 tok/s | n=2 | Yes (align) |

Upstream PR status

FlashInfer #2786 (K=64 tiles) — open, mergeable
FlashInfer #2798 (CUTLASS 4.4.2) — open, mergeable
vLLM #34822 (is_blackwell_class) — open, has conflicts
vLLM #35947 (E2M1 software) — open, low activity

🤖 Generated with Claude Code

Changed files

Dockerfile (modified, +57/-5)
bakeoff.sh (added, +46/-0)
bench_mtp.py (added, +111/-0)
build-and-copy.sh (modified, +49/-5)
e2m1_nvfp4_sm121.patch (added, +111/-0)
eval-quality.py (added, +258/-0)
fix_quantization_utils_sm121.py (added, +139/-0)
flashinfer_e2m1_sm121.patch (added, +22/-0)
flashinfer_k64_sm120.patch (added, +227/-0)
flashinfer_k64_sm120_v442.patch (added, +118/-0)
launch-cluster.sh (modified, +1/-1)
mods/fix-e2m1-sm121/patch_fp4_common.py (added, +78/-0)
mods/fix-e2m1-sm121/run.sh (added, +132/-0)
mods/fix-fla-sm121/patch_fla_sm121.py (added, +93/-0)
mods/fix-fla-sm121/run.sh (added, +24/-0)
mods/fix-mistral-guidance/pr37081-src-only.patch (added, +1261/-0)
mods/fix-mistral-guidance/run.sh (added, +62/-0)
mods/fix-mistral-reasoning/run.sh (added, +109/-0)
mods/fix-mistral-tool-role/run.sh (added, +54/-0)
mods/fix-qwen3.5-autoround/run.sh (modified, +8/-1)
mods/gpu-mem-util-gb/gpu_mem.patch (removed, +0/-255)
mods/gpu-mem-util-gb/run.sh (removed, +0/-6)
nightly-update.sh (added, +101/-0)
recipes/minimax-m2.5-reap-139b-nvfp4.yaml (added, +33/-0)
recipes/mistral-small-4-119b-nvfp4.yaml (added, +65/-0)
recipes/nemotron-3-super-nvfp4.yaml (modified, +19/-10)
recipes/qwen3.5-122b-a10b-int4-autoround.yaml (added, +46/-0)
recipes/qwen3.5-122b-a10b-nvfp4.yaml (added, +64/-0)
recipes/qwen3.5-27b-nvfp4.yaml (added, +56/-0)
recipes/qwen3.5-35b-a3b-nvfp4-baseline.yaml (added, +47/-0)
recipes/qwen3.5-35b-a3b-nvfp4-test.yaml (added, +55/-0)
recipes/qwen3.5-397b-a17b-nvfp4.yaml (added, +64/-0)
recipes/qwen3.5-397b-int4-autoround.yaml (modified, +0/-1)
run-bakeoff-overnight.sh (added, +53/-0)
test_e2m1_all_flags.sh (added, +43/-0)
test_e2m1_flags.cu (added, +72/-0)
vllm_cmake_arch_suffix.patch (added, +42/-0)
vllm_mistral_lark_grammar.patch (added, +1938/-0)
vllm_pr_37081.diff (added, +1938/-0)

Code Example

==============================
        System Info
==============================
OS                           : Ubuntu 24.04.2 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version                : Could not collect
CMake version                : version 3.31.6
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.12 | packaged by Anaconda, Inc. | (main, Oct 21 2025, 20:16:04) [GCC 11.2.0] (64-bit runtime)
Python platform              : Linux-5.15.0-92-generic-x86_64-with-glibc2.39

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : Could not collect
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : 
GPU 0: NVIDIA H200
GPU 1: NVIDIA H200
GPU 2: NVIDIA H200
GPU 3: NVIDIA H200
GPU 4: NVIDIA H200
GPU 5: NVIDIA H200
GPU 6: NVIDIA H200
GPU 7: NVIDIA H200

Nvidia driver version        : 570.124.06
cuDNN version                : Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.9.0
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      52 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             176
On-line CPU(s) list:                0-175
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) Platinum 8458P
CPU family:                         6
Model:                              143
Thread(s) per core:                 2
Core(s) per socket:                 44
Socket(s):                          2
Stepping:                           8
Frequency boost:                    enabled
CPU(s) scaling MHz:                 51%
CPU max MHz:                        2701.0000
CPU min MHz:                        800.0000
BogoMIPS:                           5400.00
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization:                     VT-x
L1d cache:                          4.1 MiB (88 instances)
L1i cache:                          2.8 MiB (88 instances)
L2 cache:                           176 MiB (88 instances)
L3 cache:                           165 MiB (2 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-43,88-131
NUMA node1 CPU(s):                  44-87,132-175
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.4
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.19.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.1
[pip3] nvidia-cutlass-dsl-libs-base==4.4.1
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pyzmq==27.1.0
[pip3] torch==2.10.0
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.10.0
[pip3] torchvision==0.25.0
[pip3] transformers==4.57.6
[pip3] triton==3.6.0
[conda] flashinfer-python                           0.6.4            pypi_0           pypi
[conda] numpy                                       2.2.6            pypi_0           pypi
[conda] nvidia-cublas-cu12                          12.8.4.1         pypi_0           pypi
[conda] nvidia-cuda-cupti-cu12                      12.8.90          pypi_0           pypi
[conda] nvidia-cuda-nvrtc-cu12                      12.8.93          pypi_0           pypi
[conda] nvidia-cuda-runtime-cu12                    12.8.90          pypi_0           pypi
[conda] nvidia-cudnn-cu12                           9.10.2.21        pypi_0           pypi
[conda] nvidia-cudnn-frontend                       1.19.0           pypi_0           pypi
[conda] nvidia-cufft-cu12                           11.3.3.83        pypi_0           pypi
[conda] nvidia-cufile-cu12                          1.13.1.3         pypi_0           pypi
[conda] nvidia-curand-cu12                          10.3.9.90        pypi_0           pypi
[conda] nvidia-cusolver-cu12                        11.7.3.90        pypi_0           pypi
[conda] nvidia-cusparse-cu12                        12.5.8.93        pypi_0           pypi
[conda] nvidia-cusparselt-cu12                      0.7.1            pypi_0           pypi
[conda] nvidia-cutlass-dsl                          4.4.1            pypi_0           pypi
[conda] nvidia-cutlass-dsl-libs-base                4.4.1            pypi_0           pypi
[conda] nvidia-ml-py                                13.590.48        pypi_0           pypi
[conda] nvidia-nccl-cu12                            2.27.5           pypi_0           pypi
[conda] nvidia-nvjitlink-cu12                       12.8.93          pypi_0           pypi
[conda] nvidia-nvshmem-cu12                         3.4.5            pypi_0           pypi
[conda] nvidia-nvtx-cu12                            12.8.90          pypi_0           pypi
[conda] pyzmq                                       27.1.0           pypi_0           pypi
[conda] torch                                       2.10.0           pypi_0           pypi
[conda] torch-c-dlpack-ext                          0.1.5            pypi_0           pypi
[conda] torchaudio                                  2.10.0           pypi_0           pypi
[conda] torchvision                                 0.25.0           pypi_0           pypi
[conda] transformers                                4.57.6           pypi_0           pypi
[conda] triton                                      3.6.0            pypi_0           pypi

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.17.0
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    CPU Affinity    NUMA Affinity       GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    NODE    SYS     SYS     NODE    PIX     SYS     NODE    SYS     NODE    0-43,88-131     0          N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    NODE    SYS     SYS     PIX     NODE    SYS     NODE    SYS     NODE    0-43,88-131     0          N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    NODE    SYS     SYS     NODE    NODE    SYS     PIX     SYS     NODE    0-43,88-131     0          N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    NODE    SYS     SYS     NODE    NODE    SYS     NODE    SYS     PIX     0-43,88-131     0          N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     PIX     NODE    SYS     SYS     NODE    SYS     NODE    SYS     44-87,132-175   1          N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     NODE    PIX     SYS     SYS     NODE    SYS     NODE    SYS     44-87,132-175   1          N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     NODE    NODE    SYS     SYS     PIX     SYS     NODE    SYS     44-87,132-175   1          N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     NODE    NODE    SYS     SYS     NODE    SYS     PIX     SYS     44-87,132-175   1          N/A
NIC0    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      SYS     SYS     NODE    NODE    SYS     NODE    SYS     NODE
NIC1    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS      X      NODE    SYS     SYS     NODE    SYS     NODE    SYS
NIC2    SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    SYS     NODE     X      SYS     SYS     NODE    SYS     NODE    SYS
NIC3    NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     NODE    SYS     SYS      X      NODE    SYS     NODE    SYS     NODE
NIC4    PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    SYS     SYS     NODE     X      SYS     NODE    SYS     NODE
NIC5    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    SYS     NODE    NODE    SYS     SYS      X      SYS     NODE    SYS
NIC6    NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     NODE    SYS     SYS     NODE    NODE    SYS      X      SYS     NODE
NIC7    SYS     SYS     SYS     SYS     NODE    NODE    NODE    PIX     SYS     NODE    NODE    SYS     SYS     NODE    SYS      X      SYS
NIC8    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE    SYS     SYS     NODE    NODE    SYS     NODE    SYS      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_bond_0
  NIC1: mlx5_bond_1
  NIC2: mlx5_bond_2
  NIC3: mlx5_bond_3
  NIC4: mlx5_bond_4
  NIC5: mlx5_bond_5
  NIC6: mlx5_bond_6
  NIC7: mlx5_bond_7
  NIC8: mlx5_bond_8

==============================
     Environment Variables
==============================
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root

---

vllm serve Qwen/Qwen3.5-397B-A17B \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 8 \
    --max-model-len 262144 \
    --reasoning-parser qwen3 \
    --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

---

# This works perfectly under high load
vllm serve Qwen/Qwen3.5-397B-A17B \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 8 \
    --max-model-len 262144 \
    --reasoning-parser qwen3
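
The issue does not say how the concurrent load was generated. As a rough sketch, a load of this kind could be produced with the Python standard library against the OpenAI-compatible `/v1/completions` endpoint that `vllm serve` exposes. The host and port match the commands above; the prompts, `max_tokens`, request count, and concurrency level are assumptions chosen for illustration, not values from the report.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8000"  # matches --port 8000 in the commands above
MODEL = "Qwen/Qwen3.5-397B-A17B"


def build_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    body = json.dumps({"model": MODEL, "prompt": prompt, "max_tokens": max_tokens})
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(prompt: str) -> int:
    """Send one request and return the HTTP status code."""
    with urllib.request.urlopen(build_request(prompt), timeout=600) as resp:
        return resp.status


def run_load(num_requests: int = 256, concurrency: int = 64) -> list[int]:
    """Fire num_requests requests with up to `concurrency` in flight at once."""
    prompts = [f"Write a short note about topic {i}." for i in range(num_requests)]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # An exception raised here (for example a connection error) would be
        # consistent with the server having gone down mid-run.
        return list(pool.map(send, prompts))


# Example call, to be run only while a server is listening:
# statuses = run_load()
```

Running `run_load()` once against the MTP-enabled server and once against the server started without `--speculative-config` gives a simple A/B comparison of the two commands.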


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

[pip3] torchvision==0.25.0
[pip3] transformers==4.57.6
[pip3] triton==3.6.0
[conda] flashinfer-python                           0.6.4            pypi_0           pypi
[conda] numpy                                       2.2.6            pypi_0           pypi
[conda] nvidia-cublas-cu12                          12.8.4.1         pypi_0           pypi
[conda] nvidia-cuda-cupti-cu12                      12.8.90          pypi_0           pypi
[conda] nvidia-cuda-nvrtc-cu12                      12.8.93          pypi_0           pypi
[conda] nvidia-cuda-runtime-cu12                    12.8.90          pypi_0           pypi
[conda] nvidia-cudnn-cu12                           9.10.2.21        pypi_0           pypi
[conda] nvidia-cudnn-frontend                       1.19.0           pypi_0           pypi
[conda] nvidia-cufft-cu12                           11.3.3.83        pypi_0           pypi
[conda] nvidia-cufile-cu12                          1.13.1.3         pypi_0           pypi
[conda] nvidia-curand-cu12                          10.3.9.90        pypi_0           pypi
[conda] nvidia-cusolver-cu12                        11.7.3.90        pypi_0           pypi
[conda] nvidia-cusparse-cu12                        12.5.8.93        pypi_0           pypi
[conda] nvidia-cusparselt-cu12                      0.7.1            pypi_0           pypi
[conda] nvidia-cutlass-dsl                          4.4.1            pypi_0           pypi
[conda] nvidia-cutlass-dsl-libs-base                4.4.1            pypi_0           pypi
[conda] nvidia-ml-py                                13.590.48        pypi_0           pypi
[conda] nvidia-nccl-cu12                            2.27.5           pypi_0           pypi
[conda] nvidia-nvjitlink-cu12                       12.8.93          pypi_0           pypi
[conda] nvidia-nvshmem-cu12                         3.4.5            pypi_0           pypi
[conda] nvidia-nvtx-cu12                            12.8.90          pypi_0           pypi
[conda] pyzmq                                       27.1.0           pypi_0           pypi
[conda] torch                                       2.10.0           pypi_0           pypi
[conda] torch-c-dlpack-ext                          0.1.5            pypi_0           pypi
[conda] torchaudio                                  2.10.0           pypi_0           pypi
[conda] torchvision                                 0.25.0           pypi_0           pypi
[conda] transformers                                4.57.6           pypi_0           pypi
[conda] triton                                      3.6.0            pypi_0           pypi

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.17.0
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    CPU Affinity    NUMA Affinity       GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    NODE    SYS     SYS     NODE    PIX     SYS     NODE    SYS     NODE    0-43,88-131     0          N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    NODE    SYS     SYS     PIX     NODE    SYS     NODE    SYS     NODE    0-43,88-131     0          N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    NODE    SYS     SYS     NODE    NODE    SYS     PIX     SYS     NODE    0-43,88-131     0          N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    NODE    SYS     SYS     NODE    NODE    SYS     NODE    SYS     PIX     0-43,88-131     0          N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     PIX     NODE    SYS     SYS     NODE    SYS     NODE    SYS     44-87,132-175   1          N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     NODE    PIX     SYS     SYS     NODE    SYS     NODE    SYS     44-87,132-175   1          N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     NODE    NODE    SYS     SYS     PIX     SYS     NODE    SYS     44-87,132-175   1          N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     NODE    NODE    SYS     SYS     NODE    SYS     PIX     SYS     44-87,132-175   1          N/A
NIC0    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      SYS     SYS     NODE    NODE    SYS     NODE    SYS     NODE
NIC1    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS      X      NODE    SYS     SYS     NODE    SYS     NODE    SYS
NIC2    SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    SYS     NODE     X      SYS     SYS     NODE    SYS     NODE    SYS
NIC3    NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     NODE    SYS     SYS      X      NODE    SYS     NODE    SYS     NODE
NIC4    PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    SYS     SYS     NODE     X      SYS     NODE    SYS     NODE
NIC5    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    SYS     NODE    NODE    SYS     SYS      X      SYS     NODE    SYS
NIC6    NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     NODE    SYS     SYS     NODE    NODE    SYS      X      SYS     NODE
NIC7    SYS     SYS     SYS     SYS     NODE    NODE    NODE    PIX     SYS     NODE    NODE    SYS     SYS     NODE    SYS      X      SYS
NIC8    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE    SYS     SYS     NODE    NODE    SYS     NODE    SYS      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_bond_0
  NIC1: mlx5_bond_1
  NIC2: mlx5_bond_2
  NIC3: mlx5_bond_3
  NIC4: mlx5_bond_4
  NIC5: mlx5_bond_5
  NIC6: mlx5_bond_6
  NIC7: mlx5_bond_7
  NIC8: mlx5_bond_8

==============================
     Environment Variables
==============================
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root

</details>

🐛 Describe the bug

Description: I am seeing a critical crash (CUDA ILM / Illegal Memory Access error) when serving the Qwen3.5-397B-A17B model with Multi-Token Prediction (MTP) enabled under a high volume of concurrent requests. This looks like the same problem reported in https://github.com/vllm-project/vllm/issues/34948#issuecomment-3977914704.

The service runs perfectly fine under the same high-concurrency workload when MTP is disabled. The crash only occurs when the --speculative-config parameter is explicitly added and the server is hit with a high volume of concurrent requests.

Steps to Reproduce:

Start the vLLM (0.17.0) server with the Qwen3.5-397B-A17B model and the following speculative decoding configuration:

vllm serve Qwen/Qwen3.5-397B-A17B \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 8 \
    --max-model-len 262144 \
    --reasoning-parser qwen3 \
    --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

Send high-concurrency requests to the server.
The server will suddenly crash with a CUDA ILM error during request processing.

Control Test (Works Fine): If I run the exact same command without the --speculative-config flag, the server handles the high concurrency perfectly without any CUDA errors or crashes:

# This works perfectly under high load
vllm serve Qwen/Qwen3.5-397B-A17B \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 8 \
    --max-model-len 262144 \
    --reasoning-parser qwen3
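
The "send high-concurrency requests" step is not spelled out in the report. The following is a minimal sketch of one way to generate such load against the server's OpenAI-compatible completions endpoint, using only the Python standard library. The endpoint URL, the placeholder prompt, and the request counts are assumptions for illustration and are not taken from the issue.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumptions (not from the issue): local endpoint, placeholder prompt.
BASE_URL = "http://localhost:8000/v1/completions"
MODEL = "Qwen/Qwen3.5-397B-A17B"


def build_payload(i: int, max_tokens: int = 256) -> dict:
    """Build one completion request body with a placeholder prompt."""
    return {
        "model": MODEL,
        "prompt": f"Request {i}: write a short story.",
        "max_tokens": max_tokens,
    }


def send(payload: dict, url: str = BASE_URL, timeout: float = 600.0) -> int:
    """POST one request and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def run_load(num_requests: int, concurrency: int, sender=send) -> list:
    """Send num_requests payloads with up to `concurrency` in flight at once."""

    def one(i: int):
        try:
            return sender(build_payload(i))
        except Exception as exc:
            # A refused or reset connection may mean the server went down.
            return exc

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(one, range(num_requests)))
```

Calling `run_load(num_requests=512, concurrency=128)` against each of the two server configurations above gives a like-for-like comparison; entries in the result that are exceptions rather than status codes mark the requests that failed.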


extent analysis

Fix Plan

To address the CUDA ILM / Illegal Memory Access error when serving the Qwen3.5-397B-A17B model with Multi-Token Prediction (MTP) enabled under high request concurrency, try the following steps:

Update CUDA and cuDNN versions: Check that CUDA and cuDNN are at the latest versions compatible with your vLLM and PyTorch builds. The reported environment uses CUDA 12.8 runtime packages and cuDNN 9.10.2.21. The issue does not point to a specific incompatibility with Qwen3.5-397B-A17B, so treat this as a general precaution.
Modify speculative decoding configuration: Adjust the --speculative-config parameter to reduce the number of speculative tokens or change the method. PR #36925 observes that the illegal memory access eventually occurs when num_speculative_tokens > 1, so lowering the value from 2 to 1 is a reasonable first workaround. For example:

vllm serve Qwen/Qwen3.5-397B-A17B \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 8 \
    --max-model-len 262144 \
    --reasoning-parser qwen3 \
    --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":1}'

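Implement error handling: Add try-except blocks around code that can raise CUDA errors so that failures are logged and reported. Note that an illegal memory access leaves the CUDA context unusable for the rest of the process, so this helps with diagnosis and a clean restart; it does not let the same process keep serving. Example:
```python
import torch

# Stand-ins so the snippet runs on its own; substitute your model and inputs.
model = torch.nn.Linear(4, 2)
input_ids = torch.randn(1, 4)

try:
    # Code that might raise a CUDA error
    output = model(input_ids)
except RuntimeError as e:
    # Log the error, then re-raise so the process can be restarted cleanly
    print(f"CUDA error: {e}")
    raise
```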

Optimize model serving: Consider using model pruning, quantization, or knowledge distillation to reduce the model's computational requirements and memory usage.
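
When comparing several values of num_speculative_tokens, hand-editing the quoted JSON in the shell command is easy to get wrong. The sketch below assembles the same `vllm serve` command from the report in Python and serializes the speculative config with `json.dumps`. The helper name `build_serve_command` is hypothetical; the flags and JSON keys are the ones shown in the commands above.

```python
import json
import shlex
from typing import List, Optional


def build_serve_command(num_speculative_tokens: Optional[int]) -> List[str]:
    """Return the vllm serve argv; pass None to leave MTP disabled."""
    cmd = [
        "vllm", "serve", "Qwen/Qwen3.5-397B-A17B",
        "--host", "0.0.0.0",
        "--port", "8000",
        "--tensor-parallel-size", "8",
        "--max-model-len", "262144",
        "--reasoning-parser", "qwen3",
    ]
    if num_speculative_tokens is not None:
        spec = {
            "method": "qwen3_next_mtp",
            "num_speculative_tokens": num_speculative_tokens,
        }
        # Compact separators match the JSON style used in the report.
        cmd += ["--speculative-config", json.dumps(spec, separators=(",", ":"))]
    return cmd


# Print a shell-safe command line for the reduced configuration.
print(shlex.join(build_serve_command(1)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting altogether.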

Verification

To verify that the fix worked:

Run the modified command with the updated speculative decoding configuration.
Send high-concurrency requests to the server.
Monitor the server's performance and check for any CUDA errors or crashes.
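
Checking for CUDA errors can be partly automated by scanning the server log after a load run. The sketch below is one possible approach: it flags lines containing "illegal memory access", the wording CUDA uses for this error, plus a broader "CUDA error" pattern. The function name and the second pattern are assumptions, and the log location depends on how the server was started.

```python
import re
from typing import Iterable, List

# The first pattern matches the CUDA message for an illegal address;
# the second is a broader catch-all and is an assumption.
CUDA_ERROR_PATTERNS = [
    re.compile(r"illegal memory access", re.IGNORECASE),
    re.compile(r"CUDA error", re.IGNORECASE),
]


def find_cuda_errors(lines: Iterable[str]) -> List[str]:
    """Return the log lines that match any known CUDA error pattern."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(p.search(line) for p in CUDA_ERROR_PATTERNS)
    ]
```

For example, if the server output was redirected to a file, `find_cuda_errors(open(path))` returns an empty list for a clean run.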

Extra Tips

Regularly update dependencies, including CUDA, cuDNN, and PyTorch, to ensure compatibility and optimal performance.
Use tools like nvidia-smi to monitor GPU usage and identify potential bottlenecks.
Consider implementing a queueing system to manage concurrent requests and prevent overloading the server.
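
The queueing suggestion can be approximated on the client or gateway side by capping the number of requests in flight. The following is a minimal sketch using `asyncio.Semaphore`; the function name `run_bounded` and the shape of the jobs are illustrative assumptions. It limits load on the server but does not address the underlying crash.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_bounded(
    jobs: List[Callable[[], Awaitable]], max_in_flight: int
) -> list:
    """Run async jobs with at most max_in_flight executing at once."""
    gate = asyncio.Semaphore(max_in_flight)

    async def guarded(job: Callable[[], Awaitable]):
        # Each job waits here until a slot is free.
        async with gate:
            return await job()

    # gather preserves the order of the input jobs in its results.
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

Each job is a zero-argument callable returning an awaitable, for example a function that issues one HTTP request; excess jobs wait on the semaphore instead of reaching the server simultaneously.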
