vllm - 💡(How to fix) Fix [Bug]: TurboQuant KV + spec-decode + chunked-prefill crashes CUDA graph capture at query_start_loc.tolist() in continuation-prefill path (Qwen3-Next hybrid dense) [3 comments, 2 participants]

Source: vllm-project/vllm#40807 (fetched 2026-04-25 06:03:55)
Error Message

File "vllm/v1/worker/gpu_worker.py", line 381, in determine_available_memory
  cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
File "vllm/v1/worker/gpu_model_runner.py", line 5945, in profile_cudagraph_memory
  self._warmup_and_capture(...)
File "vllm/v1/worker/gpu_model_runner.py", line 5503, in _dummy_run
  outputs = self.model(...)
File "vllm/model_executor/models/qwen3_next.py", line 500, in forward
  ...
File "vllm/v1/attention/backends/turboquant_attn.py", line 570, in _prefill_attention
  qsl = query_start_loc.tolist()
RuntimeError: Cannot copy between CPU and CUDA tensors during CUDA graph capture
unless the CPU tensor is pinned. Please use tensor.pin_memory() or allocate
the tensor with pin_memory=True.

Root Cause

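The continuation-prefill branch of _prefill_attention in vllm/v1/attention/backends/turboquant_attn.py calls query_start_loc.tolist() (line 570) and attn_metadata.seq_lens.tolist() (line 571). Each call forces a device-to-host copy, which is not allowed during CUDA graph capture. With speculative decoding plus chunked prefill, the _dummy_run warmup batch has max_query_len != max_seq_len, so it skips the fast path and lands in this branch.

Impact: TurboQuant offers ~5× KV capacity on this model, which is currently unreachable in any performant configuration because of this sync.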

Fix / Workaround

Proposed fix direction

  1. Pin-memory the source tensors: query_start_loc = query_start_loc.pin_memory() before .tolist(), so the CPU destination is page-locked and the copy is legal during capture.
  2. Precompute at metadata-builder time — populate qsl_cpu: list[int] and seq_lens_cpu: list[int] on the host at TurboQuantMetadataBuilder.build(), before the attention op enters any graph-captured region. Mirrors the _tq_cu_q / _tq_cu_k scratch-tensor reuse pattern Sandermage's Genesis Patch 23 uses for flash_attn_varlen_func calls in _continuation_prefill.

For reference, our working runtime patch wraps both .tolist() sites with torch.cuda.is_current_stream_capturing() guards that fall back to the fast path during capture (safe because unified_attention_with_output is in V1's splitting_ops list — attention outputs during capture are only consulted for memory profiling, not graph content). Source at https://github.com/noonghunna/qwen36-27b-single-3090/blob/main/patches/patch_tolist_cudagraph.py.


Your current environment

Collecting environment information...

==============================
System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Clang version                : Could not collect
CMake version                : Could not collect
Libc version                 : glibc-2.35

==============================
PyTorch Info
==============================
PyTorch version              : 2.11.0+cu129
Is debug build               : False
CUDA used to build PyTorch   : 12.9
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
Python Environment
==============================
Python version               : 3.12.13 (main, Mar 4 2026, 09:23:07) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-6.8.0-110-generic-x86_64-with-glibc2.35

==============================
CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.9.86
CUDA_MODULE_LOADING set to   :
GPU models and configuration :
  GPU 0: NVIDIA GeForce RTX 3090
  GPU 1: NVIDIA GeForce RTX 3090
Nvidia driver version        : 595.58.03
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
CPU Info
==============================
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Address sizes:       48 bits physical, 48 bits virtual
Byte Order:          Little Endian
CPU(s):              16
On-line CPU(s) list: 0-15
Vendor ID:           AuthenticAMD
Model name:          AMD EPYC 7543 32-Core Processor
CPU family:          25
Model:               1
Thread(s) per core:  1
Core(s) per socket:  16
Socket(s):           1
Stepping:            1
BogoMIPS:            5599.65
Virtualization:      AMD-V
L1d cache:           1 MiB (16 instances)
L1i cache:           1 MiB (16 instances)
L2 cache:            8 MiB (16 instances)
L3 cache:            256 MiB (16 instances)
NUMA node(s):        1
NUMA node0 CPU(s):   0-15

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.7
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.9.1.4
[pip3] nvidia-cuda-cupti-cu12==12.9.79
[pip3] nvidia-cuda-nvrtc-cu12==12.9.86
[pip3] nvidia-cuda-runtime-cu12==12.9.79
[pip3] nvidia-cudnn-cu12==9.17.1.4
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.4.1.4
[pip3] nvidia-cufile-cu12==1.14.1.1
[pip3] nvidia-curand-cu12==10.3.10.19
[pip3] nvidia-cusolver-cu12==11.7.5.82
[pip3] nvidia-cusparse-cu12==12.5.10.65
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu12==2.28.9
[pip3] nvidia-nvjitlink-cu12==12.9.86
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.9.79
[pip3] pyzmq==27.1.0
[pip3] torch==2.11.0+cu129
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.11.0+cu129
[pip3] torchvision==0.26.0+cu129
[pip3] transformers==5.5.4
[pip3] triton==3.6.0
[conda] Could not collect

==============================
vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.19.2rc1.dev21+g893611813 (git sha: 893611813)
vLLM Build Flags:
  CUDA Archs: 7.0 7.5 8.0 8.9 9.0 10.0 12.0; ROCm: Disabled; XPU: Disabled
GPU Topology:
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0    X       PHB     0-15            0               N/A
GPU1    PHB     X       0-15            0               N/A

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

============================== Environment Variables

CUDA_VERSION=12.9.1
LD_LIBRARY_PATH=/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64
NVIDIA_DRIVER_CAPABILITIES=compute,utility
VLLM_ENABLE_CUDA_COMPATIBILITY=0
TORCH_CUDA_ARCH_LIST=7.0 7.5 8.0 8.9 9.0 10.0 12.0
VLLM_USAGE_SOURCE=production-docker-image
NVIDIA_CTK_LIBCUDA_DIR=/usr/lib/x86_64-linux-gnu
PYTORCH_NVML_BASED_CUDA_CHECK=1

🐛 Describe the bug

With TurboQuant KV (--kv-cache-dtype turboquant_k8v4, turboquant_4bit_nc, or turboquant_3bit_nc) combined with --speculative-config method=mtp and --enable-chunked-prefill, engine initialization crashes during CUDA graph capture warmup:

File "vllm/v1/worker/gpu_worker.py", line 381, in determine_available_memory
  cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
File "vllm/v1/worker/gpu_model_runner.py", line 5945, in profile_cudagraph_memory
  self._warmup_and_capture(...)
File "vllm/v1/worker/gpu_model_runner.py", line 5503, in _dummy_run
  outputs = self.model(...)
File "vllm/model_executor/models/qwen3_next.py", line 500, in forward
  ...
File "vllm/v1/attention/backends/turboquant_attn.py", line 570, in _prefill_attention
  qsl = query_start_loc.tolist()
RuntimeError: Cannot copy between CPU and CUDA tensors during CUDA graph capture
unless the CPU tensor is pinned. Please use tensor.pin_memory() or allocate
the tensor with pin_memory=True.

The same .tolist() pattern exists at three sites (all in non-fast-path branches):

  • Line 465: prefill_max_seq = max(attn_metadata.seq_lens[num_decodes:].tolist()) (in forward() mixed-batch branch)
  • Line 570: qsl = query_start_loc.tolist() (in _prefill_attention continuation branch) ← this one
  • Line 571: seq_lens_list = attn_metadata.seq_lens.tolist() (same branch, same issue)
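One way to audit a backend file for further sync sites of this kind is a small AST scan. The sketch below is only an illustration: find_tolist_calls and the SAMPLE source are hypothetical and stand in for the real turboquant_attn.py, whose contents are not reproduced here.

```python
import ast


def find_tolist_calls(source: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_source_line) for every `<expr>.tolist()` call."""
    lines = source.splitlines()
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "tolist"
        ):
            hits.append((node.lineno, lines[node.lineno - 1].strip()))
    return sorted(hits)


# Hypothetical stand-in for the continuation branch described above.
SAMPLE = '''\
def _prefill_attention(query_start_loc, attn_metadata):
    qsl = query_start_loc.tolist()
    seq_lens_list = attn_metadata.seq_lens.tolist()
    return qsl, seq_lens_list
'''

for lineno, text in find_tolist_calls(SAMPLE):
    print(f"{lineno}: {text}")
```

Pointing the same function at the text of a real backend file would list candidate host-sync sites. Whether each hit is reachable during capture still has to be checked by hand.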

vLLM version tested: 0.19.2rc1.dev21+g893611813 (image vllm/vllm-openai:nightly built 2026-04-20). Bug still present in main HEAD — commit fe9c3d6 (PR #40092 merged 2026-04-23) did not touch the affected code path. Reproduces on 2× RTX 3090 (Ampere sm_86, PCIe).

Why PR #40092 didn't fix it

PR #40092 (merged 2026-04-23 as fe9c3d6) updated tests/evals/gsm8k/configs/Qwen3-4B-TQ-*.yaml to drop --enforce-eager, implying TurboQuant is now cudagraph-safe in CI. But those test configs don't enable --enable-chunked-prefill or --speculative-config, so warmup only exercises the fast path at turboquant_attn.py:551 (Python-int equality check on max_query_len == max_seq_len, no .tolist()). With spec-dec + chunked prefill, _dummy_run constructs a batch where those values differ, falling into the continuation branch and hitting the .tolist() sync.

PR #40092 unblocked the eager-ok test matrix but left the spec-dec + chunked-prefill combination blocked.
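The branch selection described above can be illustrated with plain integers. This is a minimal sketch that assumes the fast path is gated only by the max_query_len == max_seq_len equality the issue mentions; takes_fast_path and the batch shapes are hypothetical and are not taken from the vLLM source.

```python
def takes_fast_path(query_lens: list[int], seq_lens: list[int]) -> bool:
    """Mirror the Python-int equality check: max_query_len == max_seq_len."""
    return max(query_lens) == max(seq_lens)


# Plain prefill: the whole prompt is processed in this step.
print(takes_fast_path(query_lens=[512], seq_lens=[512]))    # True

# Chunked prefill, second chunk: 4128 tokens cached, 4128 new tokens.
print(takes_fast_path(query_lens=[4128], seq_lens=[8256]))  # False

# Illustrative spec-decode step: a few query tokens on a long context.
print(takes_fast_path(query_lens=[4], seq_lens=[1004]))     # False
```

Any batch with cached context makes the two maxima differ, which is why a warmup batch built for spec-dec plus chunked prefill reaches the continuation branch while the CI configs do not.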

Minimal reproduction

Using Lorbus/Qwen3.6-27B-int4-AutoRound (Qwen3-Next hybrid dense with auto_round:auto_gptq INT4 packing and bundled BF16 MTP head):

docker run --gpus all -p 8000:8000 \
  -v /path/to/hf-cache:/root/.cache/huggingface \
  vllm/vllm-openai:nightly \
    --model Lorbus/Qwen3.6-27B-int4-AutoRound \
    --quantization auto_round \
    --dtype float16 \
    --tensor-parallel-size 1 \
    --max-model-len 125000 \
    --max-num-batched-tokens 4128 \
    --max-num-seqs 1 \
    --gpu-memory-utilization 0.92 \
    --kv-cache-dtype turboquant_k8v4 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --disable-custom-all-reduce \
    --trust-remote-code \
    --speculative-config '{"method":"mtp","num_speculative_tokens":3}'

The hybrid-model gate in arg_utils.py:1652 (NotImplementedError: TurboQuant KV cache is not supported for hybrid (attention + Mamba) models) needs to be bypassed first — see separate tracking in #40124. After bypassing, this .tolist() crash is what follows.

Tested workarounds

Config: Result

  • Default (cudagraph_mode=PIECEWISE after spec-dec auto-downgrade): crashes as above.
  • --performance-mode interactivity --async-scheduling --attention-config.flash_attn_version 2: same crash; these flags don't affect the continuation path.
  • --compilation-config.cudagraph_mode none: boots and serves, but −55% short-prompt TPS (30/40 TPS narr/code vs 66/92 with cudagraphs). At 227K ctx, 9 TPS warm decode, slower than llama.cpp mainline at 262K. Cudagraphs-off eats all the TurboQuant KV savings on this hardware.
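As a quick check of the −55% figure, the per-workload drops can be recomputed from the four TPS numbers quoted above. The helper name pct_change is just for this sketch.

```python
def pct_change(new: float, old: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


narr = pct_change(30, 66)  # narrative prompts, cudagraphs off vs on
code = pct_change(40, 92)  # code prompts, cudagraphs off vs on
mean = (narr + code) / 2

print(f"narr: {narr:.1f}%  code: {code:.1f}%  mean: {mean:.1f}%")
```

Both workloads lose a little over half their throughput, consistent with the rounded −55% in the table.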

KV cache size delta

Same Qwen3.6-27B + MTP n=3 on 2× RTX 3090 TP=2 (using a hybrid-gate-bypassed config):

  • turboquant_k8v4: GPU KV cache size = 874,368 tokens (262K fits trivially with 3.3× headroom)
  • fp8_e5m2: 171,200 tokens (effective ~131K)

TurboQuant offers ~5× KV capacity on this model — currently unreachable in any performant configuration because of this sync.
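The ratios behind these claims follow directly from the two reported cache sizes. The sketch assumes "262K" means 2**18 = 262,144 tokens, which the issue does not state explicitly.

```python
tq_tokens = 874_368   # turboquant_k8v4, as reported
fp8_tokens = 171_200  # fp8_e5m2, as reported
ctx_262k = 262_144    # assumption: "262K" == 2**18 tokens

capacity_ratio = tq_tokens / fp8_tokens
headroom = tq_tokens / ctx_262k

print(f"capacity ratio vs fp8_e5m2: {capacity_ratio:.2f}x")
print(f"headroom at 262K context:  {headroom:.2f}x")
```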

Proposed fix direction

The .tolist() calls in the continuation branch need to be made graph-capture-safe. Two directions:

  1. Pin-memory the source tensors: query_start_loc = query_start_loc.pin_memory() before .tolist(), so the CPU destination is page-locked and the copy is legal during capture.
  2. Precompute at metadata-builder time — populate qsl_cpu: list[int] and seq_lens_cpu: list[int] on the host at TurboQuantMetadataBuilder.build(), before the attention op enters any graph-captured region. Mirrors the _tq_cu_q / _tq_cu_k scratch-tensor reuse pattern Sandermage's Genesis Patch 23 uses for flash_attn_varlen_func calls in _continuation_prefill.

Option 2 is strictly more robust (no sync at all, just host-side int access) but touches the metadata class. Option 1 is ~2 LoC. Happy to submit either as a PR — guidance on which direction maintainers prefer would help.
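A minimal, torch-free sketch of what Option 2 could look like: the host-side lists are computed once when the metadata is built, and the attention code only reads Python ints afterwards. PrefillMetadataSketch, build_metadata and iter_request_slices are hypothetical names for illustration, not vLLM classes or the real TurboQuantMetadataBuilder API.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass
class PrefillMetadataSketch:  # hypothetical stand-in, not a vLLM class
    qsl_cpu: list[int]        # cumulative query offsets, plain host ints
    seq_lens_cpu: list[int]   # per-request total lengths, plain host ints


def build_metadata(query_lens: list[int], context_lens: list[int]) -> PrefillMetadataSketch:
    """Compute host-side lists up front, before any graph-captured region."""
    qsl = [0, *accumulate(query_lens)]
    seq_lens = [q + c for q, c in zip(query_lens, context_lens)]
    return PrefillMetadataSketch(qsl_cpu=qsl, seq_lens_cpu=seq_lens)


def iter_request_slices(meta: PrefillMetadataSketch):
    """Yield (query_start, query_end, seq_len) per request with no device sync."""
    for i, seq_len in enumerate(meta.seq_lens_cpu):
        yield meta.qsl_cpu[i], meta.qsl_cpu[i + 1], seq_len


meta = build_metadata(query_lens=[4, 128], context_lens=[1000, 4128])
for start, end, seq_len in iter_request_slices(meta):
    print(start, end, seq_len)
```

The point of the design is that the continuation branch never touches a device tensor to learn its loop bounds, so nothing in it can trigger a copy during capture.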

For reference, our working runtime patch wraps both .tolist() sites with torch.cuda.is_current_stream_capturing() guards that fall back to the fast path during capture (safe because unified_attention_with_output is in V1's splitting_ops list — attention outputs during capture are only consulted for memory profiling, not graph content). Source at https://github.com/noonghunna/qwen36-27b-single-3090/blob/main/patches/patch_tolist_cudagraph.py.
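The guard idea can be sketched without a GPU by injecting the capture predicate. In a real patch the predicate would presumably be torch.cuda.is_current_stream_capturing. The function select_prefill_branch and its string return values are hypothetical and do not reproduce the linked patch.

```python
from typing import Callable


def select_prefill_branch(
    max_query_len: int,
    max_seq_len: int,
    is_capturing: Callable[[], bool],
) -> str:
    """Pick a branch without ever calling .tolist() while a graph is being captured."""
    if max_query_len == max_seq_len:
        return "fast"          # no cached context: no host-side lists needed
    if is_capturing():
        return "fast"          # capture/warmup: skip the syncing branch
    return "continuation"      # normal execution: host-side lists are allowed


print(select_prefill_branch(4, 1004, is_capturing=lambda: True))   # fast
print(select_prefill_branch(4, 1004, is_capturing=lambda: False))  # continuation
print(select_prefill_branch(512, 512, is_capturing=lambda: False)) # fast
```

This only holds under the condition the issue states, namely that attention outputs produced during capture are used for memory profiling and not baked into the graph.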

Relation to #40069

Concrete failure data for two of the unchecked items in the TurboQuant follow-ups tracking issue:

  • Feature compatibility → Speculative decoding / Eagle — spec-dec + TurboQuant triggers this on any model
  • Backend coverage → Hybrid attention models (Qwen3.5, mamba+attention) — reproduces on Qwen3.6-27B (dense hybrid), would also affect Qwen3.6-35B-A3B (MoE hybrid) once the arg_utils gate is lifted via #39931 or similar

Why this matters downstream

Lorbus/Qwen3.6-27B-int4-AutoRound's model card recommends --kv-cache-dtype tq-t4nc (= turboquant_4bit_nc) for 262K context. Any user on Ampere following that recipe with speculative decoding enabled — which is the card's other recommendation — hits this bug today. Fixing the sync unblocks the recipe on Ampere hardware.

With the runtime-patch workaround applied, we measured on 1× RTX 3090: 85 TPS sustained / 106 peak / 125K context / vision enabled — full vLLM server at datacenter-comparable throughput on a 24 GB consumer card. Full recipe reproducible at https://github.com/noonghunna/qwen36-27b-single-3090 — happy to share bench data if helpful.

Happy to test any patch.

extent analysis

TL;DR

The most likely fix for the CUDA graph capture warmup crash is to make the .tolist() calls graph-capture-safe by either pinning the memory of the source tensors or precomputing the values at metadata-builder time.

Guidance

  • Identify the .tolist() calls in the continuation branch of the turboquant_attn.py file and consider pinning the memory of the source tensors using pin_memory() to make the copy legal during capture.
  • Alternatively, precompute the values at metadata-builder time by populating qsl_cpu and seq_lens_cpu on the host at TurboQuantMetadataBuilder.build(), before the attention op enters any graph-captured region.
  • Verify the fix by running the minimal reproduction script with the modified code and checking for the absence of the RuntimeError.
  • Consider testing both proposed fix directions and measuring their performance impact to determine the most suitable solution.

Example

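# Option 1: pin-memory sketch, as proposed in the issue (untested fragment).
# Caveat: Tensor.pin_memory() applies to CPU tensors; if query_start_loc
# lives on the GPU, this call is expected to raise instead of pinning.
query_start_loc = query_start_loc.pin_memory()
qsl = query_start_loc.tolist()

# Option 2: precompute host-side lists when the metadata is built,
# before the attention op enters any graph-captured region.
class TurboQuantMetadataBuilder:
    # Hypothetical signature for illustration; the real build() takes other arguments.
    def build(self, query_start_loc, seq_lens):
        # ...
        self.qsl_cpu = query_start_loc.tolist()
        self.seq_lens_cpu = seq_lens.tolist()
        # ...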

Notes

The proposed fixes aim to address the RuntimeError caused by copying between CPU and CUDA tensors during CUDA graph capture. The choice between pinning memory and precomputing values depends on performance considerations and the specific requirements of the project.

Recommendation

Apply the precomputing fix direction, as it is strictly more robust and avoids any synchronization issues during capture. This approach may require modifying the metadata class, but it provides a more reliable solution.
