vllm - ✅(Solved) Fix [Bug]: Gibberish output and collapsing generation throughput with Qwen3.5-35B-A3B-FP8 and speculative decoding enabled [1 pull requests, 1 participants]

GitHub stats
vllm-project/vllm#36872 (fetched 2026-04-08 00:43:51)
Comments: 0 · Participants: 1 · Timeline events: 7 · Reactions: 2
Timeline (top): subscribed ×3, cross-referenced ×2, labeled ×1, referenced ×1

When serving Qwen/Qwen3.5-35B-A3B-FP8 with MTP speculative decoding enabled, the model progressively produces incoherent tokens (random Unicode characters, repeated symbols, gibberish). The model is used in a RAG engine with text and images as inputs.

Error Message

During multi-turn chat completions with tool calling, the output quality degrades across consecutive requests, and the speculative decoding metrics reflect this degradation.

Request 1 (output is coherent):

SpecDecoding metrics: Mean acceptance length: 2.23, Avg Draft acceptance rate: 61.3%
Per-position acceptance rate: 0.798, 0.427

Root Cause

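Proposed root cause, per PR #36910 (still marked WIP): spec_step_idx is not forwarded from the MTP wrapper classes (Qwen3NextMTP, Qwen3_5MTP) to the inner MultiTokenPredictor, and the eagle proposer/speculator loops never pass it. For MTP models with num_mtp_layers > 1, every draft token beyond the first therefore uses layer 0 instead of the layer selected by spec_step_idx % num_mtp_layers. The reported symptom is that Qwen/Qwen3.5-35B-A3B-FP8, served with MTP speculative decoding in a RAG engine with text and image inputs, progressively produces incoherent tokens (random Unicode characters, repeated symbols, gibberish).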

Fix Action

Fix / Workaround

Workaround: per the issue's additional context, disabling speculative decoding (dropping --speculative-config from the launch command) appears to resolve the problem, although the reporter marks this as still to be confirmed.

Proposed fix: PR #36910, which forwards spec_step_idx through the MTP wrappers and the eagle proposer/speculator.

PR fix notes

PR #36910: [WIP] [BugFix] Forward spec_step_idx in MTP wrappers and eagle proposer/speculator

Description (problem / solution / changelog)

Purpose

Fix spec_step_idx not being forwarded from MTP wrapper classes (Qwen3NextMTP, Qwen3_5MTP) to the inner MultiTokenPredictor, and not being passed by the eagle proposer/speculator loops.

MTP models with num_mtp_layers > 1 use spec_step_idx % num_mtp_layers to select which decoder layer to use for each speculative step. Two issues caused the wrong layer to always be selected:

  1. Wrapper classes swallow spec_step_idx: Qwen3NextMTP.forward() and Qwen3_5MTP.forward() accept **kwargs but never forward spec_step_idx to the inner model's forward(), which expects it.

  2. Eagle proposer/speculator never pass spec_step_idx: The proposer loop and speculator loop never include spec_step_idx in model_kwargs or compute_logits calls, so every draft token beyond the first silently uses layer 0 instead of the correct layer.

This follows the existing correct pattern used by DeepSeekMTPModel, ExaoneMoeMTP, and other MTP implementations that explicitly accept and forward spec_step_idx.
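The wrapper issue described above can be illustrated with a minimal, self-contained sketch. The classes below are hypothetical stand-ins and not vLLM code: InnerPredictor mimics the spec_step_idx % num_mtp_layers selection, BuggyWrapper accepts **kwargs but drops spec_step_idx, and FixedWrapper forwards it explicitly.

```python
class InnerPredictor:
    """Hypothetical stand-in for the inner multi-token predictor."""

    def __init__(self, num_mtp_layers: int) -> None:
        self.num_mtp_layers = num_mtp_layers

    def forward(self, hidden: str, spec_step_idx: int = 0) -> int:
        # Return the index of the decoder layer chosen for this speculative step.
        return spec_step_idx % self.num_mtp_layers


class BuggyWrapper:
    """Accepts **kwargs but never forwards spec_step_idx to the inner model."""

    def __init__(self, inner: InnerPredictor) -> None:
        self.inner = inner

    def forward(self, hidden: str, **kwargs) -> int:
        # spec_step_idx is swallowed here, so the inner default (0) always applies.
        return self.inner.forward(hidden)


class FixedWrapper:
    """Accepts spec_step_idx explicitly and passes it through."""

    def __init__(self, inner: InnerPredictor) -> None:
        self.inner = inner

    def forward(self, hidden: str, spec_step_idx: int = 0, **kwargs) -> int:
        return self.inner.forward(hidden, spec_step_idx=spec_step_idx)


def layers_used(wrapper, num_steps: int) -> list:
    # Simulate a proposer loop that does pass spec_step_idx for each draft step.
    return [wrapper.forward("h", spec_step_idx=i) for i in range(num_steps)]


inner = InnerPredictor(num_mtp_layers=2)
buggy_layers = layers_used(BuggyWrapper(inner), 3)
fixed_layers = layers_used(FixedWrapper(inner), 3)
print("buggy:", buggy_layers)
print("fixed:", fixed_layers)
```

In this toy setup the buggy wrapper selects layer 0 for every step, while the fixed wrapper cycles through the layers. The same pattern applies if the calling loop omits spec_step_idx, which is the second issue the PR describes.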

Fixes #36872

Test Plan

Test Result

Changed files

  • vllm/model_executor/models/qwen3_5_mtp.py (modified, +7/-1)
  • vllm/model_executor/models/qwen3_next_mtp.py (modified, +7/-1)
  • vllm/v1/spec_decode/eagle.py (modified, +9/-3)
  • vllm/v1/worker/gpu/spec_decode/eagle/speculator.py (modified, +5/-2)


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
==============================
        System Info
==============================
OS                           : Ubuntu 24.04.4 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version                : Could not collect
CMake version                : Could not collect
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cu126
Is debug build               : False
CUDA used to build PyTorch   : 12.6
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, Mar  3 2026, 12:15:18) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-6.8.0-101-generic-x86_64-with-glibc2.39

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : Could not collect
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : GPU 0: NVIDIA L40S
Nvidia driver version        : 580.126.20
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           52 bits physical, 57 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  15
On-line CPU(s) list:                     0-14
Vendor ID:                               AuthenticAMD
Model name:                              AMD EPYC 9124 16-Core Processor
CPU family:                              25
Model:                                   17
Thread(s) per core:                      1
Core(s) per socket:                      1
Socket(s):                               15
Stepping:                                1
BogoMIPS:                                5999.99
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr wbnoinvd arat npt lbrv nrip_save tsc_scale vmcb_clean pausefilter pfthreshold v_vmsave_vmload vgif avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor fsrm arch_capabilities
Virtualization:                          AMD-V
Hypervisor vendor:                       KVM
Virtualization type:                     full
L1d cache:                               960 KiB (15 instances)
L1i cache:                               960 KiB (15 instances)
L2 cache:                                7.5 MiB (15 instances)
L3 cache:                                240 MiB (15 instances)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-14
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Vulnerable: Clear CPU buffers attempted, no microcode
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.4
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-cufile-cu12==1.11.1.6
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.1
[pip3] nvidia-cutlass-dsl-libs-base==4.4.1
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] pyzmq==27.1.0
[pip3] torch==2.10.0+cu126
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.10.0+cu129
[pip3] torchvision==0.25.0+cu129
[pip3] transformers==4.57.6
[pip3] triton==3.6.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.17.1
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  	GPU0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	0-14	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_ubuntu
</details>

Description

When serving Qwen/Qwen3.5-35B-A3B-FP8 with MTP speculative decoding enabled, the model progressively produces incoherent tokens (random Unicode characters, repeated symbols, gibberish). The model is used in a RAG engine with text and images as inputs.

Launch command

vllm serve Qwen/Qwen3.5-35B-A3B-FP8 \
  --port x \
  --gpu-memory-utilization 0.95 \
  --max-model-len 62144 \
  --tensor-parallel-size 1 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

Observed behavior

During multi-turn chat completions with tool calling, the output quality degrades across consecutive requests, and the speculative decoding metrics reflect this degradation.

Request 1 (output is coherent):

SpecDecoding metrics: Mean acceptance length: 2.23, Avg Draft acceptance rate: 61.3%
Per-position acceptance rate: 0.798, 0.427

Request 2 (output starts to degrade):

SpecDecoding metrics: Mean acceptance length: 1.02, Avg Draft acceptance rate: 0.9%
Per-position acceptance rate: 0.018, 0.000

Request 3 (output is completely incoherent):

SpecDecoding metrics: Mean acceptance length: 1.00, Avg Draft acceptance rate: 0.0%
Per-position acceptance rate: 0.000, 0.000

The generated output contains random Unicode characters (e.g. ŧŧŧŧŧŧ), mixed-language fragments, and other garbage tokens.
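The three metric snapshots above appear internally consistent, which suggests the logging is accurate and the draft tokens really are being rejected. The sketch below checks this under an assumed relationship that is not stated in the issue: mean acceptance length is 1 plus the sum of the per-position acceptance rates, and the average draft acceptance rate is the mean of those rates. The function name and tolerances are illustrative.

```python
def summarize(per_position_rates):
    """Derive (mean acceptance length, draft acceptance rate in %) from per-position rates.

    Assumption: one token is always emitted per step, plus one more for each
    accepted draft position.
    """
    total = sum(per_position_rates)
    mean_acceptance_length = 1.0 + total
    draft_acceptance_pct = 100.0 * total / len(per_position_rates)
    return mean_acceptance_length, draft_acceptance_pct


# (per-position rates, reported mean acceptance length, reported acceptance rate %)
reported = {
    "request 1": ([0.798, 0.427], 2.23, 61.3),
    "request 2": ([0.018, 0.000], 1.02, 0.9),
    "request 3": ([0.000, 0.000], 1.00, 0.0),
}

consistent = {}
for name, (rates, length, pct) in reported.items():
    derived_length, derived_pct = summarize(rates)
    # The logged values are rounded, so compare with a small tolerance.
    consistent[name] = (
        abs(derived_length - length) < 0.01 and abs(derived_pct - pct) < 0.1
    )
    print(name, "consistent with assumed formula:", consistent[name])
```

If the assumption holds, a mean acceptance length of 1.00 means speculative decoding contributes no accepted tokens while still paying for the draft steps, which fits the throughput collapse named in the issue title.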

Expected behavior

The draft acceptance rate should remain stable, or at least degrade gracefully without producing corrupted output.

Additional context

  • The AWQ-4bit variant (cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit) with the same speculative config also exhibited repetition loops during generation.
  • In preliminary tests, disabling speculative decoding appears to resolve the issue; this still needs to be confirmed.

Extended analysis

Fix Plan

To address the incoherent tokens generated by Qwen/Qwen3.5-35B-A3B-FP8 with MTP speculative decoding enabled, try the following steps:

  1. Disable Speculative Decoding: As a temporary workaround, disable speculative decoding to confirm whether it resolves the issue.
    • Remove the --speculative-config flag from the launch command.
    • Example:

vllm serve Qwen/Qwen3.5-35B-A3B-FP8 \
  --port x \
  --gpu-memory-utilization 0.95 \
  --max-model-len 62144 \
  --tensor-parallel-size 1 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

  2. Adjust Speculative Decoding Configuration: If disabling speculative decoding is not feasible, try adjusting its configuration.
    • Reduce num_speculative_tokens (for example, from 2 to 1) so that fewer draft tokens are proposed per step. The issue does not report whether this helps.
    • Example:

vllm serve Qwen/Qwen3.5-35B-A3B-FP8 \
  --port x \
  --gpu-memory-utilization 0.95 \
  --max-model-len 62144 \
  --tensor-parallel-size 1 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":1}'

  3. Model Updates or Alternatives: Consider updating the model or trying an alternative model. Note that the AWQ-4bit variant was also reported to show repetition loops with the same speculative config, so switching quantization alone may not help.
  4. Environment and Dependency Updates: Ensure that vLLM and its dependencies, including PyTorch and CUDA, are up to date, and check whether PR #36910 is included in the vLLM build you run.
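Because --speculative-config takes a JSON string, quoting mistakes are easy to make when switching between the variants above. A minimal sketch that builds the argument list programmatically follows; build_serve_args is a hypothetical helper, and only the flags shown in the issue's launch command are used (the remaining flags are omitted for brevity).

```python
import json
import shlex


def build_serve_args(model, num_speculative_tokens=None, method="qwen3_next_mtp"):
    """Return a `vllm serve` argument list, with or without speculative decoding."""
    args = ["vllm", "serve", model]
    if num_speculative_tokens is not None:
        config = {"method": method, "num_speculative_tokens": num_speculative_tokens}
        # Compact separators reproduce the JSON form used in the issue.
        args += ["--speculative-config", json.dumps(config, separators=(",", ":"))]
    return args


MODEL = "Qwen/Qwen3.5-35B-A3B-FP8"
no_spec = build_serve_args(MODEL)                              # step 1: disabled
one_token = build_serve_args(MODEL, num_speculative_tokens=1)  # step 2: reduced

# shlex.join quotes the JSON so the line can be pasted into a shell.
print(shlex.join(no_spec))
print(shlex.join(one_token))
```

The resulting lists can be passed to subprocess.run directly, which avoids shell quoting altogether.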

Verification

To verify the fix, monitor the model's output quality and speculative decoding metrics after applying the changes. Check for:

  • Coherent output without random Unicode characters or gibberish.
  • Stable draft acceptance rates.
  • Improved per-position acceptance rates.
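One way to automate the acceptance-rate check is to scan the server log for the metric lines. The sketch below assumes the log lines look exactly like the excerpts quoted in the issue; the regular expression, the function names, and the 5% floor are illustrative choices, not vLLM defaults.

```python
import re

# Matches the "SpecDecoding metrics" line format quoted in the issue.
METRICS_RE = re.compile(
    r"Mean acceptance length: ([\d.]+), Avg Draft acceptance rate: ([\d.]+)%"
)


def parse_acceptance_rates(log_text):
    """Return the draft acceptance rates (in percent) in order of appearance."""
    return [float(m.group(2)) for m in METRICS_RE.finditer(log_text)]


def has_collapsed(rates_pct, floor_pct=5.0):
    """Flag a run whose acceptance rate ever drops below the chosen floor."""
    return any(rate < floor_pct for rate in rates_pct)


SAMPLE_LOG = """\
SpecDecoding metrics: Mean acceptance length: 2.23, Avg Draft acceptance rate: 61.3%
Per-position acceptance rate: 0.798, 0.427
SpecDecoding metrics: Mean acceptance length: 1.02, Avg Draft acceptance rate: 0.9%
Per-position acceptance rate: 0.018, 0.000
"""

rates = parse_acceptance_rates(SAMPLE_LOG)
print("acceptance rates (%):", rates)
print("collapsed:", has_collapsed(rates))
```

Run against the log of a patched or reconfigured server, a result of False across many consecutive requests would indicate that the acceptance rate is staying stable.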

Extra Tips

  • Regularly review and update models and dependencies to incorporate fixes and improvements.
  • Consider implementing additional logging or monitoring to detect and respond to output quality degradation.
  • Explore alternative speculative decoding methods or configurations that may better suit the specific use case.
