vllm - ✅(Solved) Fix [Bug]: Qwen3.5 crashes when using suffix-decoding [45 pull requests, 1 participants]

vllm 2026-04-08 06:08:18

GitHub stats

vllm-project/vllm#39270 • Fetched 2026-04-09 07:52:13

Author

xhdidi

Participants

xhdidi

Timeline (top)

closed ×1, labeled ×1

PR fix notes

PR #39695: Introduce De-dup/Similarity-Check in CI Workflow for PR/Issue

Repository: vllm-project/vllm
Author: panpan0000
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/39695

Description (problem / solution / changelog)

Co-Author: Trae + GPT5.3-Codex

Purpose

An example to illustrate https://github.com/vllm-project/vllm/issues/39694

Example Algorithm:

Scoring: 0.75 * text_similarity + 0.25 * file_overlap.
Threshold used for the report: 0.75.
The GitHub Actions CI cache is used to temporarily store GitHub API results for the most recent 1000 PRs, 500 issues, etc.
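
The scoring rule above can be sketched in a few lines of Python. This is only an illustration, not the code from the PR: the weights and threshold come from the description, but the PR text does not say how text_similarity or file_overlap are computed, so the difflib ratio and the Jaccard overlap used here are assumptions, and all function names are hypothetical.

```python
from difflib import SequenceMatcher

TEXT_WEIGHT = 0.75  # weight stated in the PR description
FILE_WEIGHT = 0.25  # weight stated in the PR description
THRESHOLD = 0.75    # report threshold stated in the PR description


def text_similarity(a: str, b: str) -> float:
    """ASSUMPTION: difflib ratio over lowercased text; the PR may use another metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def file_overlap(files_a: set, files_b: set) -> float:
    """ASSUMPTION: Jaccard overlap of changed-file paths; the PR may use another metric."""
    union = files_a | files_b
    if not union:
        return 0.0
    return len(files_a & files_b) / len(union)


def combined_score(text_sim: float, overlap: float) -> float:
    """Weighted sum exactly as written in the PR description."""
    return TEXT_WEIGHT * text_sim + FILE_WEIGHT * overlap


def is_high_similarity(text_sim: float, overlap: float) -> bool:
    """A pair is reported when its combined score reaches the threshold."""
    return combined_score(text_sim, overlap) >= THRESHOLD
```

As a check against the PR table below: a pair with 98% text similarity and 50% file overlap gives 0.75 * 0.98 + 0.25 * 0.50 = 0.86, which matches the 86% score listed for that row.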

Test Plan

The similarity check was tested on the 1000 most recent PRs:

High-similarity pairs (>= 0.75): 26

Test Result

PR Similarity

Repo: vllm-project/vllm
PR count: 1000
Candidate pairs: 17375
High-similarity pairs (>= 0.75): 26

Score	Text	Files	PR A	PR B
100%	100%	100%	#39553 Okakarpa shadow clone	#39577 Okakarpa shadow clone
99%	99%	100%	#37929 [Core] Use standalone autograd_cache_key for compilation dedup optimization	#39517 [Core] Use standalone autograd_cache_key for compilation dedup optimization
96%	95%	100%	#37947 [DRAFT][XPU] Upgrade torch 2.11 for xpu	#39257 [XPU] update triton version for torch 2.11 upgrade
96%	95%	100%	#37947 [DRAFT][XPU] Upgrade torch 2.11 for xpu	#39313 [XPU] upgrade to triton-xpu 3.7.0
95%	97%	88%	#38249 [Misc] Organize NixlConnector into own directory	#39354 [KVConnector][NIXL] Organize NIXL connector into its own directory
95%	93%	100%	#39410 [XPU] Disable fusion passes on XPU Platform	#39671 use spawn multiproc method on xpu
94%	92%	100%	#38856 [LMCache] vLLM Block Allocation Event	#39719 fix(lmcache): correct store for cached requests while enable prefix cache
94%	91%	100%	#39606 Pass extra_config to the constructor of LMCacheMPXXXAdapter	#39719 fix(lmcache): correct store for cached requests while enable prefix cache
94%	91%	100%	#39257 [XPU] update triton version for torch 2.11 upgrade	#39313 [XPU] upgrade to triton-xpu 3.7.0
91%	100%	67%	#39432 Gfx1250 wip	#39437 Gfx1250 wip rebase test
90%	92%	85%	#36823 [vLLM IR] 3/N fused_add_rms_norm and maybe_inplace	#38775 [vLLM IR] 4/N Compile native implementation
90%	86%	100%	#39402 [kv_offload+HMA[10/N]: Support load with multiple KV groups	#39403 [kv_offload+HMA][11/N]: Support store with multiple KV groups
86%	98%	50%	#23995 Feature/deepseek v31 lora support	#39661 [DOC] Update Gemma 4
82%	76%	100%	#39110 [Core] Disable HMA for eagle/MTP with sliding window models	#39376 [Core] Disable HMA for eagle/MTP with sliding window models
82%	76%	100%	#39401 [kv_offload+HMA][9/N]: Support lookup with multiple KV groups	#39402 [kv_offload+HMA[10/N]: Support load with multiple KV groups
82%	76%	100%	#39401 [kv_offload+HMA][9/N]: Support lookup with multiple KV groups	#39403 [kv_offload+HMA][11/N]: Support store with multiple KV groups
80%	96%	33%	#26583 add log for request trace	#39646 V0.12.0 support n sampling delay split to eliminate redundant prefill computation and memory
79%	97%	22%	#35721 [LoRA] Support dual CUDA streams-Linear Layer	#37297 [LoRA] Support FP8 LoRA E2E inference-dense model
79%	94%	32%	#39153 [Frontend][4/n] Improve pooling entrypoints	pooling.
79%	74%	91%	#38775 [vLLM IR] 4/N Compile native implementation	#39453 Port activations to IR op 1/3
79%	88%	50%	#39312 [Mergify] Update model vendor auto-label rules	#39429 [CI/Build] Update auto-rebase rule
78%	100%	13%	#39723 [SimpleCPUOffloadConnector]: Add support for `reset_cache()`	#39726 [SimpleCPUOffloadConnector]: Add support for reset_cache()
77%	98%	14%	#38780 [vLLM IR][RMSNorm] Port GemmaRMSNorm to vLLM IR Ops	#38798 [vLLM IR][RMSNorm] Port RMSNormGated to vLLM IR Ops
77%	69%	100%	#39744 [v1] Expose num_prompt_tokens in CommonAttentionMetadata	#39745 [v1] Expose num_prompt_tokens in CommonAttentionMetadata
77%	81%	62%	#23133 Split compressed_tensors_moe.py into separate wna16, int8, fp8, nvfp4	#29427 [Refactor] Split up compressed_tensors_moe.py into separate files per method
76%	82%	59%	#39267 [vllm IR] 1/N Port FP8 Quantization to vLLM IR Ops	#39481 [vllm IR] Port FP8 Quantization to vLLM IR Ops

Similar Issues:

Repo: vllm-project/vllm
Issue count: 500
Candidate pairs: 9909
High-similarity pairs (>= 0.75): 12

Match Score	Desc Similarity	Title Overlap	Issue A	Issue B
100%	100%	100%	#39270 [Bug]: Qwen3.5 crashes when using suffix-decoding	#39271 [Bug]: Qwen3.5 crashes when using suffix-decoding
100%	100%	100%	#39372 [Bug]:	#39373 [Bug]:
100%	100%	100%	#39372 [Bug]:	#39374 [Bug]:
100%	100%	100%	#39373 [Bug]:	#39374 [Bug]:
100%	100%	100%	#39433 RFC: Add logit_scale to PoolerConfig for Affine Score Calibration (Platt Scaling)	#39434 [RFC]: Add logit_scale to PoolerConfig for Affine Score Calibration (Platt Scaling)
100%	100%	100%	#39299 [Performance] DSV3.2 Indexer: Overlap indexer k+w path
81%	95%	25%	#31888 [Usage]: rollout slow	#38642 [Usage]: 模型返回值reasoning_content
80%	88%	50%	#38734 [Transformers v5] SarvamMLAForCausalLM	#38740 [Transformers v5] NemotronParseForConditionalGeneration
79%	94%	20%	#29245 [Usage]: 启动 qwen3 vl 超级超级超级慢，sglang 启动很快，可能的原因是什么？	#38642 [Usage]: 模型返回值reasoning_content
77%	92%	17%	#29245 [Usage]: 启动 qwen3 vl 超级超级超级慢，sglang 启动很快，可能的原因是什么？	#31888 [Usage]: rollout slow
77%	89%	29%	#38384 [Transformers v5] Distributed shutdown test timetout	#38740 [Transformers v5] NemotronParseForConditionalGeneration
76%	88%	31%	#31661 [Bug]: jina-reranker-m0 [image_index] IndexError: list index out of range	#32151 [Bug]: jina-reranker-m0 infer error

Changed files

.github/workflows/detect-duplicate-issues.yml (added, +64/-0)
.github/workflows/detect-duplicate-prs.yml (added, +55/-0)
.github/workflows/scripts/detect_duplicate_issues.py (added, +453/-0)
.github/workflows/scripts/detect_duplicate_prs.py (added, +317/-0)

RAW_BUFFER

Your current environment

============================== System Info

OS : Ubuntu 24.04.1 LTS (x86_64)
GCC version : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version : Could not collect
CMake version : version 4.1.0
Libc version : glibc-2.39

============================== PyTorch Info

PyTorch version : 2.10.0+cu128
Is debug build : False
CUDA used to build PyTorch : 12.8
ROCM used to build PyTorch : N/A

============================== Python Environment

Python version : 3.12.7 (main, Mar 2 2026, 18:41:32) [GCC 13.3.0] (64-bit runtime)
Python platform : Linux-5.10.134-013.5.kangaroo.al8.x86_64-x86_64-with-glibc2.39

============================== CUDA / GPU Info

Is CUDA available : True
CUDA runtime version : Could not collect
CUDA_MODULE_LOADING set to :
GPU models and configuration :
GPU 0: NVIDIA H20-3e
GPU 1: NVIDIA H20-3e
GPU 2: NVIDIA H20-3e
GPU 3: NVIDIA H20-3e
GPU 4: NVIDIA H20-3e
GPU 5: NVIDIA H20-3e
GPU 6: NVIDIA H20-3e
GPU 7: NVIDIA H20-3e

Nvidia driver version : 535.230.02
cuDNN version : Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.8.0
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True

============================== CPU Info

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Processor
CPU family: 6
Model: 207
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 2
Stepping: 2
BogoMIPS: 5600.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd avx512vbmi umip pku waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri movdir64b fsrm md_clear serialize tsxldtrk amx_bf16 avx512_fp16 amx_tile amx_int8 arch_capabilities
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 3 MiB (64 instances)
L1i cache: 2 MiB (64 instances)
L2 cache: 128 MiB (64 instances)
L3 cache: 640 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-63
NUMA node1 CPU(s): 64-127
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Vulnerable
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

============================== Versions of relevant libraries

[pip3] flashinfer-python==0.6.6
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pyzmq==27.1.0
[pip3] torch==2.10.0
[pip3] torch-c-dlpack-ext==0.1.5
[pip3] torchaudio==2.10.0
[pip3] torchvision==0.25.0
[pip3] transformers==4.57.6
[pip3] triton==3.6.0
[conda] Could not collect

============================== vLLM Info

ROCM Version : Could not collect
vLLM Version : 0.19.0
vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:

	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	NIC0	NIC1	NIC2	NIC3	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	X	NV18	NV18	NV18	NV18	NV18	NV18	NV18	PIX	PHB	SYS	SYS	0-63	0	N/A
GPU1	NV18	X	NV18	NV18	NV18	NV18	NV18	NV18	PXB	PHB	SYS	SYS	0-63	0	N/A
GPU2	NV18	NV18	X	NV18	NV18	NV18	NV18	NV18	PHB	PIX	SYS	SYS	0-63	0	N/A
GPU3	NV18	NV18	NV18	X	NV18	NV18	NV18	NV18	PHB	PXB	SYS	SYS	0-63	0	N/A
GPU4	NV18	NV18	NV18	NV18	X	NV18	NV18	NV18	SYS	SYS	PIX	PHB	64-127	1	N/A
GPU5	NV18	NV18	NV18	NV18	NV18	X	NV18	NV18	SYS	SYS	PXB	PHB	64-127	1	N/A
GPU6	NV18	NV18	NV18	NV18	NV18	NV18	X	NV18	SYS	SYS	PHB	PIX	64-127	1	N/A
GPU7	NV18	NV18	NV18	NV18	NV18	NV18	NV18	X	SYS	SYS	PHB	PXB	64-127	1	N/A
NIC0	PIX	PXB	PHB	PHB	SYS	SYS	SYS	SYS	X	PHB	SYS	SYS
NIC1	PHB	PHB	PIX	PXB	SYS	SYS	SYS	SYS	PHB	X	SYS	SYS
NIC2	SYS	SYS	SYS	SYS	PIX	PXB	PHB	PHB	SYS	SYS	X	PHB
NIC3	SYS	SYS	SYS	SYS	PHB	PHB	PIX	PXB	SYS	SYS	PHB	X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3

============================== Environment Variables

PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root

🐛 Describe the bug

I used the following command to start the Qwen3.5 service. The service frequently crashes while handling requests.
<code>vllm serve Qwen3.5-35B-A3B --tensor-parallel-size 4 --max-model-len 262144 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder --gpu-memory-utilization 0.9 --speculative-config '{"method":"suffix","num_speculative_tokens":16}' --enable-prefix-caching</code>
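
Because the report ties the crash to suffix decoding, one way to narrow it down is to launch the same server twice, once with and once without `--speculative-config`, and compare stability. The sketch below only assembles the two command lines; it uses no flags beyond those in the command above, and the helper names (`build_command`, `as_shell`) are hypothetical. It does not start vLLM itself.

```python
import json
import shlex

# Flags copied from the reported command, minus --speculative-config.
BASE_ARGS = [
    "vllm", "serve", "Qwen3.5-35B-A3B",
    "--tensor-parallel-size", "4",
    "--max-model-len", "262144",
    "--reasoning-parser", "qwen3",
    "--enable-auto-tool-choice",
    "--tool-call-parser", "qwen3_coder",
    "--gpu-memory-utilization", "0.9",
    "--enable-prefix-caching",
]

# Speculative configuration from the report.
SPEC_CONFIG = {"method": "suffix", "num_speculative_tokens": 16}


def build_command(spec_config=None):
    """Return the argv list, adding --speculative-config only when a config is given."""
    args = list(BASE_ARGS)
    if spec_config is not None:
        # json.dumps guarantees the value is valid JSON before it reaches the CLI.
        args += ["--speculative-config", json.dumps(spec_config, separators=(",", ":"))]
    return args


def as_shell(args):
    """Quote the argv list so it can be pasted into a shell."""
    return shlex.join(args)


if __name__ == "__main__":
    print("with suffix decoding:   ", as_shell(build_command(SPEC_CONFIG)))
    print("without suffix decoding:", as_shell(build_command()))
```

If the variant without `--speculative-config` stays up under the same request load, that points toward the suffix-decoding path rather than memory pressure alone.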

Before submitting a new issue...

#39272

extent analysis

TL;DR

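The service started with the vllm serve command above crashes frequently while handling requests when suffix speculative decoding is enabled. The report contains no error log, so the cause is unconfirmed; high GPU memory utilization or an incorrect speculative configuration are possible factors.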

Guidance

Monitor GPU memory during serving (for example with nvidia-smi) to see whether usage approaches capacity before a crash; --gpu-memory-utilization 0.9 allows vLLM to use up to 90% of each GPU's memory.
Check that the speculative configuration ({"method":"suffix","num_speculative_tokens":16}) is set up correctly for the Qwen3.5 model, as an incorrect configuration may lead to crashes.
Consider reducing --max-model-len to lower memory requirements and see whether the service becomes more stable. Note that lowering --tensor-parallel-size shards the model across fewer GPUs and therefore raises per-GPU memory use rather than reducing it.
Ensure that the TORCHINDUCTOR_CACHE_DIR environment variable (set to /tmp/torchinductor_root in this report) points to a directory with sufficient free space, as /tmp may fill up quickly.

Example

No specific code example is provided, as the issue seems to be related to command-line parameters and environment variables.
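
That said, the cache-directory check from the Guidance can be sketched in Python. This is a rough illustration only: the fallback path is the value from this report rather than a documented default, the 5 GiB threshold is an arbitrary assumption, and the function names are hypothetical.

```python
import os
import shutil
from pathlib import Path

# Value taken from the environment variables in this report (not a documented default).
REPORTED_CACHE_DIR = "/tmp/torchinductor_root"


def resolve_cache_dir(env=None) -> Path:
    """Read TORCHINDUCTOR_CACHE_DIR, falling back to the path from the report."""
    env = os.environ if env is None else env
    return Path(env.get("TORCHINDUCTOR_CACHE_DIR", REPORTED_CACHE_DIR))


def nearest_existing(path: Path) -> Path:
    """Walk up until an existing directory is found (the cache dir may not exist yet)."""
    while not path.exists():
        path = path.parent
    return path


def free_gib(path: Path) -> float:
    """Free space, in GiB, on the filesystem that would hold `path`."""
    return shutil.disk_usage(nearest_existing(path)).free / 2**30


def has_headroom(path: Path, min_free_gib: float = 5.0) -> bool:
    """ASSUMPTION: 5 GiB is an arbitrary illustrative threshold, not a vLLM requirement."""
    return free_gib(path) >= min_free_gib


if __name__ == "__main__":
    cache_dir = resolve_cache_dir()
    print(f"{cache_dir}: {free_gib(cache_dir):.1f} GiB free")
```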

Notes

The report does not include error logs or crash messages, which would be needed to diagnose the issue. The vllm serve command and its parameters may also require specific versions of libraries or dependencies to function correctly.

Recommendation

As a workaround, lower --gpu-memory-utilization or adjust --speculative-config and check whether the service becomes more stable. If the issue persists, provide additional information such as error logs or crash messages when seeking further assistance.

#api #environment variable #cache issue #memory leak #API versioning
