vllm - 💡(How to fix) Fix [Bug]: Latest nightly build with TurboQuant KV cache crashes on large chunked continuation prefill after workspace lock (testing PR #39931, which implements TQ on hybrid attention models, e.g. Qwen3.5-9B) [5 comments, 4 participants]

vllm2026-05-05 11:24:15


vllm-project/vllm#41726•Fetched 2026-05-06 06:15:11


Your current environment

Environment

vLLM: 0.20.2rc1.dev35+g4845aee6b
Python: 3.12.13
Torch: 2.11.0+cu130
CUDA driver/runtime: NVIDIA driver 595.71.05, CUDA 13
GPU: NVIDIA GeForce RTX 5080
FlashInfer: 0.6.8.post1
Transformers: 5.7.0

🐛 Describe the bug

Summary

Testing the latest commit of PR #39931.

Running with a TurboQuant --kv-cache-dtype (here turboquant_4bit_nc) can crash at runtime when chunked prefill resumes a long prompt. The failure happens in TurboQuant attention’s large continuation-prefill path after the global workspace has already been locked.

This appears TurboQuant-specific.

Model

Model: NVFP4 quantization of Qwen3.5-9B

Relevant config:

architectures: Qwen3_5ForConditionalGeneration
model_type: qwen3_5
text_config.hidden_size: 4096
text_config.num_attention_heads: 16
text_config.num_key_value_heads: 4
text_config.head_dim: 256
text_config.num_hidden_layers: 32
text_config.full_attention_interval: 4
quantization_config.quant_method: modelopt
quantization_config.quant_algo: NVFP4

Startup logs detect the TurboQuant hybrid layers:

TQ hybrid: full-attention layers [3, 7, 11, 15, 19, 23, 27, 31]
Using TURBOQUANT attention backend out of potential backends: ['TURBOQUANT'].

Command

vllm serve /home/yk/AI/Models/models/qwen3.5-9b-nvfp4-ptq \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype turboquant_4bit_nc \
  --max-model-len 102400 \
  --enable-chunked-prefill \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --language-model-only \
  --max-num-seqs 1 \
  --max-num-batched-tokens 4096 \
  --enable-prefix-caching \
  --default-chat-template-kwargs '{"enable_thinking": true}'

What happened

The server started successfully and the first few OpenAI-compatible chat completion requests succeeded. The failure occurred later, from Open WebUI, during a conversation with tools/web-search enabled, after tool schemas and/or tool results made the prompt larger.

The failing request appears to have crossed a chunked-prefill boundary:

SchedulerOutput(
  scheduled_new_reqs=[],
  scheduled_cached_reqs=CachedRequestData(
    req_ids=['chatcmpl-8cab1ba7625a3c7c-9669f94e'],
    ...
    num_computed_tokens=[4096],
    ...
  ),
  num_scheduled_tokens={chatcmpl-8cab1ba7625a3c7c-9669f94e: 3497},
  total_num_scheduled_tokens=3497,
  ...
)

So vLLM had already computed the first 4096 prompt tokens and then resumed the same request with another 3497 prompt tokens.
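The split described above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not vLLM's scheduler: `chunk_schedule` is a hypothetical helper, and it assumes a single sequence (as with --max-num-seqs 1), no prefix-cache hit, and a prompt of exactly 4096 + 3497 tokens as inferred from the log.

```python
def chunk_schedule(prompt_len: int, max_num_batched_tokens: int) -> list[tuple[int, int]]:
    """Return (num_computed_tokens, num_scheduled_tokens) for each prefill step.

    Hypothetical helper for illustration only; the real scheduler also
    accounts for other running requests and prefix-cache hits.
    """
    steps = []
    computed = 0
    while computed < prompt_len:
        # Each step may process at most the per-step token budget.
        scheduled = min(max_num_batched_tokens, prompt_len - computed)
        steps.append((computed, scheduled))
        computed += scheduled
    return steps


# Prompt length inferred from the log: 4096 already computed + 3497 scheduled.
for computed, scheduled in chunk_schedule(4096 + 3497, 4096):
    print(f"num_computed_tokens={computed} num_scheduled_tokens={scheduled}")
```

The second step is the one that crashed: it is the first step of the request that has both a non-empty cache and a large number of new query tokens.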

Error

AssertionError: Workspace is locked but allocation from 'turboquant_attn.py:720:_continuation_prefill' requires 16.00 MB, current size is 1.02 MB. Workspace growth is not allowed after locking.

Full relevant stack:

EngineCore encountered a fatal error.

File ".../vllm/v1/worker/gpu_model_runner.py", line 4089, in execute_model
    model_output = self._model_forward(...)

File ".../vllm/v1/worker/gpu_model_runner.py", line 3562, in _model_forward
    return self.model(...)

File ".../vllm/compilation/cuda_graph.py", line 254, in __call__
    return self.runnable(*args, **kwargs)

File ".../vllm/model_executor/models/qwen3_5.py", line 695, in forward
    hidden_states = self.language_model.model(...)

File ".../vllm/compilation/decorators.py", line 520, in __call__
    return self.aot_compiled_fn(self, *args, **kwargs)

File ".../vllm/model_executor/models/qwen3_next.py", line 495, in forward

File ".../vllm/compilation/caching.py", line 215, in __call__
    return self.optimized_call(*args, **kwargs)

File ".../vllm/model_executor/layers/attention/attention.py", line 723, in unified_attention_with_output
    self.impl.forward(...)

File ".../vllm/v1/attention/backends/turboquant_attn.py", line 438, in forward
    attn_out = self._prefill_attention(...)

File ".../vllm/v1/attention/backends/turboquant_attn.py", line 669, in _prefill_attention
    out = self._continuation_prefill(...)

File ".../vllm/v1/attention/backends/turboquant_attn.py", line 720, in _continuation_prefill
    k_buf, v_buf = current_workspace_manager().get_simultaneous(...)

File ".../vllm/v1/worker/workspace.py", line 110, in get_simultaneous
    current_workspace = self._ensure_workspace_size(total_bytes)

File ".../vllm/v1/worker/workspace.py", line 157, in _ensure_workspace_size
    raise AssertionError(
AssertionError: Workspace is locked but allocation from 'turboquant_attn.py:720:_continuation_prefill' requires 16.00 MB, current size is 1.02 MB. Workspace growth is not allowed after locking.

Suspected root cause

TurboQuant attention has a special continuation-prefill path for chunked-prefill resumes.

In turboquant_attn.py, small continuation chunks use the decode kernel:

_CONTINUATION_DECODE_THRESHOLD = 128

if q_len <= _CONTINUATION_DECODE_THRESHOLD:
    out = triton_turboquant_decode_attention(...)
else:
    out = self._continuation_prefill(...)

The failing request had:

cached_len = 4096
q_len = 3497

Since q_len > 128, TurboQuant entered _continuation_prefill().
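The routing decision can be restated as a small pure function. This is only an illustration of the branch quoted above: `select_prefill_path` is a hypothetical name, and the `cached_len == 0` branch is an assumption about how a first chunk is handled, which the issue does not show.

```python
_CONTINUATION_DECODE_THRESHOLD = 128  # value quoted in the issue


def select_prefill_path(q_len: int, cached_len: int) -> str:
    """Name the path a prefill step would take (illustrative only)."""
    if cached_len == 0:
        # Assumption: with nothing cached there is no continuation to handle.
        return "first-chunk prefill"
    if q_len <= _CONTINUATION_DECODE_THRESHOLD:
        # Small continuation chunks reuse the decode kernel.
        return "triton_turboquant_decode_attention"
    # Large continuation chunks dequantize the cached K/V into workspace.
    return "_continuation_prefill"


print(select_prefill_path(q_len=3497, cached_len=4096))
```

Only the last branch touches the shared workspace, which would explain why short requests and small continuations never hit the assertion.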

That path dequantizes the previously cached TurboQuant K/V into the shared workspace:

alloc_len = math.ceil(cached_len / block_size) * block_size
buf_shape = (1, Hk, alloc_len, D)

k_buf, v_buf = current_workspace_manager().get_simultaneous(
    (buf_shape, torch.float16),
    (buf_shape, torch.float16),
)

For this model/settings:

cached_len = 4096
block_size = 2048
Hk = 4
D = 256
dtype = fp16
buffers = K + V

Required workspace:

2 * 4096 * 4 * 256 * 2 bytes = 16 MB

But vLLM had locked the global workspace after warmup/CUDA graph capture with only ~1.02 MB reserved.
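The 16 MB figure can be checked directly from the buffer shape. The sketch below uses plain Python rather than torch; `continuation_workspace_bytes` is a hypothetical helper, and the 1.02 MB value is taken from the error message rather than computed.

```python
import math


def continuation_workspace_bytes(cached_len: int, block_size: int,
                                 num_kv_heads: int, head_dim: int,
                                 dtype_bytes: int = 2) -> int:
    """Bytes needed for the K and V dequantization buffers (fp16 by default)."""
    # The cached length is rounded up to a whole number of blocks.
    alloc_len = math.ceil(cached_len / block_size) * block_size
    buf_shape = (1, num_kv_heads, alloc_len, head_dim)
    per_buffer = math.prod(buf_shape) * dtype_bytes
    return 2 * per_buffer  # one buffer for K, one for V


MiB = 2**20
required = continuation_workspace_bytes(cached_len=4096, block_size=2048,
                                        num_kv_heads=4, head_dim=256)
locked_size = 1.02 * MiB  # size reported in the assertion message

print(f"required: {required / MiB:.2f} MiB")
print(f"shortfall: {(required - locked_size) / MiB:.2f} MiB")
```

The result matches the "requires 16.00 MB" in the assertion, which supports the reading that the request came from these two buffers alone.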

The workspace lock is intentional in vLLM:

gpu_model_runner.py: Lock workspace to prevent resizing during execution, Max workspace sizes should have been captured during warmup/profiling.

The issue appears to be that TurboQuant’s startup warmup/profiling does not exercise or reserve the large continuation-prefill workspace shape, so the first real long chunked-prefill continuation tries to grow the workspace after it has been locked.
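The failure mode is easy to reproduce with a toy model of a grow-on-demand workspace that can be locked. `ToyWorkspace` below is a made-up stand-in, not vLLM's workspace manager; it only mirrors the behavior described in the error message.

```python
class ToyWorkspace:
    """Toy lockable workspace: grows on demand until lock() is called."""

    def __init__(self) -> None:
        self.size_bytes = 0
        self.locked = False

    def ensure(self, nbytes: int) -> None:
        if nbytes <= self.size_bytes:
            return  # already large enough, nothing to do
        if self.locked:
            raise AssertionError(
                f"Workspace is locked but allocation requires "
                f"{nbytes / 2**20:.2f} MB, current size is "
                f"{self.size_bytes / 2**20:.2f} MB."
            )
        self.size_bytes = nbytes

    def lock(self) -> None:
        self.locked = True


MiB = 2**20

# Warmup never exercises the large path, so only ~1.02 MB is reserved.
ws = ToyWorkspace()
ws.ensure(int(1.02 * MiB))
ws.lock()
try:
    ws.ensure(16 * MiB)  # first real continuation prefill
except AssertionError as exc:
    print("crash:", exc)

# If warmup had requested the large shape, the later call would be a no-op.
ws_fixed = ToyWorkspace()
ws_fixed.ensure(16 * MiB)
ws_fixed.lock()
ws_fixed.ensure(16 * MiB)
print("ok after reserving before lock")
```

The point of the toy is that the lock itself is working as designed; what is missing is a warmup step that asks for the largest shape before the lock is taken.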

Why this edge case was triggered

This did not happen on short initial requests. It happened after Open WebUI with tools/web-search produced a larger prompt/conversation context. Once the prompt exceeded --max-num-batched-tokens 4096, vLLM resumed prefill for the same request:

num_computed_tokens=[4096]
num_scheduled_tokens=3497

That combination caused TurboQuant to enter the large continuation-prefill path and request a 16 MB workspace.

Prefix caching does not seem to be the cause. The scheduled_cached_reqs entry here appears to refer to an internally resumed chunked-prefill request with already-computed KV, not prefix-cache reuse.

Expected behavior

TurboQuant should either:

reserve/profile enough workspace before lock_workspace(), or
avoid using a workspace path that can require larger allocations after lock

Suggested root fix

TurboQuant should reserve/profile worst-case continuation-prefill workspace before workspace locking.

Possible approaches:

During warmup/profiling, exercise a dummy TurboQuant continuation prefill where:
- cached_len > 0
- q_len > _CONTINUATION_DECODE_THRESHOLD
- cached_len is at least one chunk, e.g. max_num_batched_tokens
Or expose TurboQuant continuation-prefill workspace requirements to the global workspace profiler.
At minimum, reserve enough for:

2 * ceil(max_cached_len / block_size) * block_size * num_kv_heads * head_dim * sizeof(fp16)

For this model and one previous chunk:

2 * 4096 * 4 * 256 * 2 = 16 MB

A more conservative reservation would scale with the maximum possible chunked continuation cached length, but that could become large for very long context, so the policy likely needs to be tied to scheduler/chunking limits.
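To get a feel for how large a conservative reservation could become, the formula can be evaluated for several cached lengths. This is a rough sketch using this model's parameters as defaults; the cached lengths are illustrative, with 102400 taken from --max-model-len as an upper bound, and `continuation_workspace_mib` is a hypothetical helper.

```python
import math


def continuation_workspace_mib(cached_len: int, block_size: int = 2048,
                               num_kv_heads: int = 4, head_dim: int = 256,
                               dtype_bytes: int = 2) -> float:
    """MiB needed for the K + V buffers at a given cached length."""
    alloc_len = math.ceil(cached_len / block_size) * block_size
    return 2 * alloc_len * num_kv_heads * head_dim * dtype_bytes / 2**20


# One previous chunk, two chunks, eight chunks, and the full context window.
for cached_len in (4096, 8192, 32768, 102400):
    mib = continuation_workspace_mib(cached_len)
    print(f"{cached_len:>7} cached tokens -> {mib:6.1f} MiB")
```

For this configuration the buffers cost 4 KiB per cached token, so reserving for the full 102400-token window would take about 400 MiB, compared with 16 MiB for a single previous chunk.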

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

The most likely fix is to reserve/profile enough workspace before lock_workspace() for TurboQuant's worst-case continuation-prefill scenario.

Guidance

Identify the maximum possible chunked continuation cached length and calculate the required workspace size using the formula: 2 * ceil(max_cached_len / block_size) * block_size * num_kv_heads * head_dim * sizeof(fp16)
Modify the warmup/profiling process to exercise a dummy TurboQuant continuation prefill with cached_len > 0 and q_len > _CONTINUATION_DECODE_THRESHOLD
Consider exposing TurboQuant continuation-prefill workspace requirements to the global workspace profiler
Reserve enough workspace to accommodate the calculated size, e.g., 16 MB for the given model and settings

Example

No code snippet is provided as the issue is more related to the configuration and profiling of the TurboQuant model.

Notes

The fix may require adjustments to the warmup/profiling process, and the reserved workspace size may need to be scaled based on the maximum possible chunked continuation cached length.

Recommendation

Apply a workaround by reserving enough workspace before lock_workspace() to accommodate the worst-case continuation-prefill scenario, as calculating the exact required size may be complex and dependent on various factors.
