vllm [Bug]: TurboQuant workspace locked at 3.06 MB; continuation_prefill requires 12 MB on any prompt >4096 tokens (Qwen3.6-27B NVFP4 hybrid, Blackwell SM120)

Your current environment

<html><head></head><body>
<h2>Current environment</h2>
<pre><code class="language-text">Cannot run collect_env.py on host: vLLM runs inside a Podman container.
Container image: docker.io/vllm/vllm-openai:nightly (pulled 2026-05-21)
vLLM version: 0.21.1rc1.dev169+ga6682d1d2
GPU: NVIDIA RTX Pro 6000 Blackwell (SM120, 96GB GDDR7)
CUDA: 13.0 (cu130 nightly)
OS: Bazzite Linux (Fedora immutable, based on Fedora 41)
Container runtime: Podman rootless
Model: sakamakismile/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP
Architecture: Qwen3-Next hybrid (GDN linear attention + full attention layers)
Quantization: ModelOpt NVFP4 (--quantization modelopt)
</code></pre>
<h2>🐛 Describe the bug</h2>
<p>Any request whose prompt exceeds ~4096 tokens crashes the vLLM engine with a workspace locking assertion error in <code>_continuation_prefill</code>. Short requests (&lt;4096 tokens) complete successfully. The engine must be restarted after each crash.</p>
<p><strong>Critically: changing <code>--max-num-batched-tokens</code> does not affect the workspace allocation.</strong> Tested with 4096, 8192, and 32768: the workspace is always locked at 3.06 MB regardless of this value.</p>
<h2>Launch command</h2>
<pre><code class="language-bash">podman run --name vllm-qwen36 \
  --gpus all --network=host \
  -v /path/to/models:/models:Z \
  --ipc=host -d \
  -e VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
  docker.io/vllm/vllm-openai:nightly \
  /models/qwen3.6-27b-nvfp4 \
  --quantization modelopt \
  --max-model-len 262144 \
  --served-model-name qwen3.6 \
  --max-num-seqs 6 \
  --max-num-batched-tokens 4096 \
  --kv-cache-dtype turboquant_4bit_nc \
  --gpu-memory-utilization 0.55 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --trust-remote-code
</code></pre>
<h2>Startup log (confirms correct configuration)</h2>
<pre><code>TQ hybrid: full-attention layers [3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63]
Using TURBOQUANT attention backend out of potential backends: ['TURBOQUANT']
Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM
Setting attention block size to 3072 tokens to ensure that attention page size is &gt;= mamba page size
GPU KV cache size: 1,852,104 tokens
enable_prefix_caching=True, enable_chunked_prefill=True
Application startup complete
</code></pre>
<h2>Error</h2>
<pre><code>AssertionError: Workspace is locked but allocation from 'turboquant_attn.py:747:_continuation_prefill' requires 12.00 MB, current size is 3.06 MB. Workspace growth is not allowed after locking.
</code></pre>
<pre><code>  File "vllm/v1/engine/core.py", line 1152, in run_engine_core
    engine_core.run_busy_loop()
  File "vllm/v1/engine/core.py", line 1232, in _process_engine_step
    outputs, model_executed = self.step_fn()
  File "vllm/v1/attention/backends/turboquant_attn.py", line 444, in forward
    attn_out = self._prefill_attention(
  File "vllm/v1/attention/backends/turboquant_attn.py", line 696, in _prefill_attention
    out = self._continuation_prefill(
  File "vllm/v1/attention/backends/turboquant_attn.py", line 747, in _continuation_prefill
    k_buf, v_buf = current_workspace_manager().get_simultaneous(
  File "vllm/v1/worker/workspace.py", line 110, in get_simultaneous
    current_workspace = self._ensure_workspace_size(total_bytes)
  File "vllm/v1/worker/workspace.py", line 157, in _ensure_workspace_size
    raise AssertionError(
AssertionError: Workspace is locked but allocation from 'turboquant_attn.py:747:_continuation_prefill' requires 12.00 MB, current size is 3.06 MB. Workspace growth is not allowed after locking.
</code></pre>
<h2>Reproduction steps</h2>
<ol>
<li>Start vLLM with any TurboQuant preset (<code>turboquant_4bit_nc</code>, <code>turboquant_3bit_nc</code>, <code>turboquant_k8v4</code>) on a Qwen3.6-27B NVFP4 hybrid model on Blackwell SM120.</li>
<li>Send any request with a prompt &gt;4096 tokens (e.g. a system prompt plus conversation history totalling ~18k tokens, as used in agentic frameworks).</li>
<li>The engine crashes immediately with the workspace assertion error above.</li>
<li>Requests &lt;4096 tokens complete successfully at ~60 tok/s.</li>
</ol>
<h2>Root cause analysis</h2>
<p>The workspace is pre-allocated during the warmup profiling run, based on the dummy batch size, and is then locked. With <code>--max-num-batched-tokens 4096</code> the warmup only exercises short sequences and allocates 3.06 MB.</p>
<p>At inference time, any prompt &gt;4096 tokens triggers chunked prefill. The continuation chunks are 3072 tokens (the attention block size set at startup). Processing a 3072-token continuation chunk requests a 12 MB workspace buffer in <code>_continuation_prefill</code>, which the locked 3.06 MB workspace cannot grow to satisfy.</p>
<p><strong>The workspace pre-allocation is completely decoupled from <code>--max-num-batched-tokens</code></strong>. Setting this flag to 4096, 8192, or 32768 produces the same workspace allocation (3.06 MB) and the same crash. This suggests the workspace is sized from the shapes observed in the warmup dummy run rather than from the configured batch token limit.</p>
<h2>What works / what doesn't</h2>
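<table>
<thead>
<tr><th>Scenario</th><th>Result</th></tr>
</thead>
<tbody>
<tr><td>Requests &lt;4096 tokens</td><td>✅ ~60 tok/s, stable</td></tr>
<tr><td>Requests &gt;4096 tokens</td><td>❌ Engine crash, restart required</td></tr>
<tr><td><code>--max-num-batched-tokens 4096</code></td><td>❌ Same crash</td></tr>
<tr><td><code>--max-num-batched-tokens 8192</code></td><td>❌ Same crash</td></tr>
<tr><td><code>--max-num-batched-tokens 32768</code></td><td>❌ Same crash</td></tr>
<tr><td><code>turboquant_4bit_nc</code></td><td>❌ Crashes</td></tr>
<tr><td><code>turboquant_3bit_nc</code></td><td>❌ Crashes</td></tr>
<tr><td><code>fp8</code> (baseline)</td><td>✅ Works correctly</td></tr>
</tbody>
</table>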
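The locking behaviour described in the root cause analysis can be illustrated with a toy model. This is a minimal sketch, not vLLM's `workspace.py`: the class `ToyWorkspace`, its `ensure` and `lock` methods, and the `run` helper are hypothetical names invented for illustration. It only shows why a buffer sized from a warmup run and then locked rejects a later, larger request, and why reserving the larger size before locking avoids the failure.

```python
MiB = 1024 * 1024


class ToyWorkspace:
    """Hypothetical grow-until-locked scratch buffer (not vLLM code)."""

    def __init__(self) -> None:
        self.size_bytes = 0
        self.locked = False

    def ensure(self, nbytes: int, caller: str) -> int:
        """Grow to at least nbytes; fail if growth is needed after locking."""
        if nbytes > self.size_bytes:
            if self.locked:
                raise AssertionError(
                    f"Workspace is locked but allocation from '{caller}' "
                    f"requires {nbytes / MiB:.2f} MB, "
                    f"current size is {self.size_bytes / MiB:.2f} MB."
                )
            self.size_bytes = nbytes
        return self.size_bytes

    def lock(self) -> None:
        self.locked = True


def run(reserve_worst_case: bool) -> str:
    ws = ToyWorkspace()
    # Size observed during the warmup dummy run (value taken from the report).
    ws.ensure(int(3.06 * MiB), "warmup")
    if reserve_worst_case:
        # Reserve the size the failing request asked for, before locking.
        ws.ensure(12 * MiB, "reserve")
    ws.lock()
    try:
        ws.ensure(12 * MiB, "_continuation_prefill")
        return "ok"
    except AssertionError as exc:
        return f"crash: {exc}"


if __name__ == "__main__":
    print(run(reserve_worst_case=False))
    print(run(reserve_worst_case=True))
```

In this model the only thing that matters is the order of operations: any size that can be requested at inference time has to be reserved before `lock()` is called.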
<h2>Proposed fix</h2>
<p><code>workspace.py</code> should pre-allocate workspace based on worst-case continuation prefill size before locking, rather than the observed profiling run size. The worst-case size should be derived from:</p>
<pre><code>max_workspace = 2 × attention_block_size × num_kv_heads × head_dim × bytes_per_element
</code></pre>
<p>For this model: <code>2 × 3072 × 32 × 128 × 2 = 50,331,648 bytes (~48 MB)</code></p>
<p>Alternatively, the workspace lock should be removed or made resizable for the continuation prefill path.</p>
<h2>Related issues</h2>
<ul>
<li>#40807: different code path (spec-decode + chunked prefill <code>.tolist()</code> sync crash)</li>
<li>#40420: different failure mode (CUDA OOM at 185k+ tokens, not workspace locking at 4k)</li>
<li>#40124: Ampere SM86 patches, different hardware</li>
</ul>
<p><strong>This is a distinct bug</strong>: workspace undersized for any prompt &gt;4096 tokens on Blackwell SM120 with hybrid models, regardless of <code>--max-num-batched-tokens</code> configuration.</p>
</body></html>
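The arithmetic in the proposed fix can be checked directly. The sketch below evaluates the formula with the values quoted in the report and compares the result with the 12 MB request in the error message. The helper name `kv_workspace_bytes` is hypothetical, and two assumptions are made that the report does not confirm: that "MB" in the log means MiB (1024 × 1024 bytes), and that the failing request covered exactly 3072 tokens at 2 bytes per element.

```python
MiB = 1024 * 1024


def kv_workspace_bytes(tokens: int, num_kv_heads: int, head_dim: int,
                       bytes_per_element: int) -> int:
    """Evaluate the formula from the proposed fix (hypothetical helper)."""
    # The factor 2 accounts for one K buffer plus one V buffer.
    return 2 * tokens * num_kv_heads * head_dim * bytes_per_element


# Values quoted in the report for this model.
worst_case = kv_workspace_bytes(3072, 32, 128, 2)

# Size the failing call requested, assuming "12.00 MB" means 12 MiB.
requested = 12 * MiB

# Under the same formula, the num_kv_heads * head_dim product that a
# 3072-token, 2-byte-per-element request of that size would correspond to.
implied_heads_times_dim = requested // (2 * 3072 * 2)

if __name__ == "__main__":
    print(worst_case, worst_case / MiB)
    print(implied_heads_times_dim, 32 * 128)
    print(worst_case / requested)
```

Under these assumptions the formula reproduces the quoted 50,331,648 bytes, which is four times the 12 MB the failing call asked for. The report does not say which inputs account for the difference, so the 48 MB figure is best read as an upper estimate to verify against the actual head count and buffer dtype.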
