vllm - ✅(Solved) Fix [Bug]: TurboQuant _continuation_prefill workspace allocation fails at long context — v0.20.0 regression [1 pull requests, 1 comments, 1 participants]

GitHub stats
vllm-project/vllm#41565 (fetched 2026-05-04 04:58:47)
Comments: 1
Participants: 1
Timeline: 5
Reactions: 0
Timeline (top): cross-referenced ×3, commented ×1, unsubscribed ×1

On vLLM 0.20.0 with TurboQuant enabled, any prefill request whose cached KV grows beyond ~6-8K tokens triggers an assertion failure in _continuation_prefill's workspace allocation. The workspace is sized to 0.26 MB at startup but _continuation_prefill needs ~2 MB at long context, and the workspace is locked against growth after engine init.

The same workloads run cleanly on the v0.18-era TQ fork (vibhavagarwal5/PR #38479 base), so this is a v0.20-specific regression introduced when the workspace manager (PR #40941, merged Apr 27) was wired into the TQ continuation prefill path.

Error Message

AssertionError: Workspace is locked but allocation from 'turboquant_attn.py:757:_continuation_prefill' requires 2.00 MB, current size is 0.26 MB. Workspace growth is not allowed after locking.

Root Cause

Root cause analysis (best-effort)

Fix Action

Fix / Workaround

This is the same conceptual failure mode jhsmith409 reported on PR #39931 for hybrid GDN models (chunk_fwd_o OOM at o = torch.empty_like(v)), where their proposed mitigation was a 1.25 GiB safety-margin overlay in gpu_worker.py. Different op, same upstream cause.

PR fix notes

PR #40798: [TurboQuant] Share decode scratch workspace across layers

Description (problem / solution / changelog)

Summary

TurboQuant currently registers three decode scratch buffers on every attention layer. These buffers are temporary decode workspace, but because they are registered per layer they scale with the number of TurboQuant layers and with max_num_seqs.

This PR moves TurboQuant decode scratch allocation to the v1 workspace manager so the scratch tensors are shared across layers. It also reserves the maximum TurboQuant decode workspace before CUDA graph capture locks the workspace, preventing locked-workspace growth at runtime.

This PR is scoped to TurboQuant decode scratch memory usage. It does not claim to fix TurboQuant speculative-decoding correctness issues.

Motivation

On large models and large H100/H200 server defaults, the per-layer scratch buffers can consume tens of GiB of non-KV memory and significantly reduce KV cache capacity.

Measured on H200 with Llama-3.1-70B, TP=2, kv_cache_dtype=turboquant_3bit_nc, max_model_len=65536, gpu_memory_utilization=0.90:

| version | model loading memory | available KV memory | GPU KV cache size | max concurrency @ 65,536 |
| --- | --- | --- | --- | --- |
| before | 105.23 GiB | 14.61 GiB | 400,128 tokens | 6.11x |
| after | 65.74 GiB | 53.97 GiB | 1,478,384 tokens | 22.56x |
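The concurrency column appears consistent with dividing the KV cache capacity in tokens by max_model_len. A minimal sketch of that arithmetic follows; the function name is hypothetical and the formula is inferred from the reported numbers, not taken from vLLM.

```python
# Assumption: "max concurrency" = KV cache capacity (tokens) / max_model_len.
MAX_MODEL_LEN = 65_536


def max_concurrency(kv_cache_tokens: int, max_model_len: int = MAX_MODEL_LEN) -> float:
    """Number of full-length sequences that fit in the KV cache."""
    return kv_cache_tokens / max_model_len


before = max_concurrency(400_128)    # KV cache size reported before the PR
after = max_concurrency(1_478_384)   # KV cache size reported after the PR
print(f"before: {before:.2f}x, after: {after:.2f}x, gain: {after / before:.2f}x")
```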

The old per-layer buffer cost for this setup is about 39.5 GiB:

B = 1024
Hq = 32
S = 32
D = 128
TQ layers = 76
mid_o ~= B * Hq * S * (D + 1) * 4 bytes * 76 ~= 39.5 GiB
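As a check on the estimate above, the sketch below evaluates the mid_o formula with the listed values. The function name is hypothetical, and S is used exactly as given. The formula alone comes to about 38.3 GiB, slightly under the quoted 39.5 GiB, while the measured drop in model loading memory (105.23 GiB to 65.74 GiB) is 39.49 GiB. The remaining difference presumably comes from the other two scratch buffers, but that attribution is an assumption.

```python
GIB = 1024 ** 3


def mid_o_bytes(batch: int, q_heads: int, s: int, head_dim: int,
                layers: int, elem_bytes: int = 4) -> int:
    """Total bytes of the per-layer mid_o buffers, following the formula in the PR text."""
    per_layer = batch * q_heads * s * (head_dim + 1) * elem_bytes
    return per_layer * layers


total = mid_o_bytes(batch=1024, q_heads=32, s=32, head_dim=128, layers=76)
measured_drop_gib = 105.23 - 65.74  # model loading memory, before minus after

print(f"mid_o formula: {total / GIB:.2f} GiB")
print(f"measured drop: {measured_drop_gib:.2f} GiB")
```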

Changes

  • Keep TurboQuant centroids as per-layer state.
  • Stop registering _tq_mid_o_buf, _tq_output_buf, and _tq_lse_buf on every attention layer.
  • Allocate TurboQuant decode scratch through WorkspaceManager.get_simultaneous().
  • Reserve a max-size TurboQuant decode workspace before lock_workspace() in CUDA graph capture.
  • Reserve across all attention groups so hybrid model layouts do not miss TurboQuant groups outside the first group.
  • Add behavior tests for the workspace allocation and reservation paths.

Duplicate-work note

I checked the open TurboQuant scratch/workspace work before publishing this PR. This PR is intentionally scoped to the v1 workspace-manager reservation path and the measured H200 Llama-3.1-70B TP=2 memory-capacity regression. It is not a general cleanup and does not claim to solve unrelated TurboQuant speculative-decoding correctness issues.

Testing

  • python -m py_compile tests/quantization/test_turboquant.py vllm/v1/worker/gpu_model_runner.py vllm/model_executor/layers/attention/attention.py vllm/v1/attention/ops/triton_turboquant_decode.py vllm/v1/attention/backends/turboquant_attn.py
  • git diff --check
  • H200 container: python3 -m pytest tests/quantization/test_turboquant.py -k TurboQuantDecodeWorkspace -q
    • Result: 4 passed, 117 deselected, 17 warnings in 2.03s
  • H200 startup/capacity validation:
    • model: /mnt/afs/models/Llama/Llama-3.1-70B
    • GPUs: H200 2x, TP=2
    • --kv-cache-dtype turboquant_3bit_nc
    • --max-model-len 65536
    • --gpu-memory-utilization 0.90
    • result: model loading memory reduced from 105.23 GiB to 65.74 GiB; GPU KV cache size increased from 400,128 to 1,478,384 tokens
  • H200 serving benchmark, /v1/completions, random dataset, 32 prompts, input length 4096, output length 128, request_rate=inf, max_concurrency=8, ignore_eos=true:
    • before: 0.3186 req/s, 40.78 output tok/s, mean TTFT 16809 ms, mean TPOT 65.16 ms
    • after: 0.3343 req/s, 42.80 output tok/s, mean TTFT 15629 ms, mean TPOT 65.14 ms
  • Sampled MMLU-Pro regression, leaderboard_mmlu_pro, limit=20, 5-shot, local-completions, num_concurrent=4:
    • before: acc=0.60 ± 0.1124
    • after: acc=0.60 ± 0.1124
    • same predicted option on 20/20 samples and same correctness on 20/20 samples

AI Assistance

AI assistance was used to prepare this patch and PR text. Guipeng Zhang is the human submitter responsible for reviewing and defending the change.

Changed files

  • tests/quantization/test_turboquant.py (modified, +133/-0)
  • vllm/model_executor/layers/attention/attention.py (modified, +4/-24)
  • vllm/v1/attention/backends/turboquant_attn.py (modified, +0/-11)
  • vllm/v1/attention/ops/triton_turboquant_decode.py (modified, +15/-8)
  • vllm/v1/worker/gpu_model_runner.py (modified, +31/-1)


[Bug]: TurboQuant _continuation_prefill workspace allocation fails at long context — v0.20.0 regression

Summary

On vLLM 0.20.0 with TurboQuant enabled, any prefill request whose cached KV grows beyond ~6-8K tokens triggers an assertion failure in _continuation_prefill's workspace allocation. The workspace is sized to 0.26 MB at startup but _continuation_prefill needs ~2 MB at long context, and the workspace is locked against growth after engine init.

The same workloads run cleanly on the v0.18-era TQ fork (vibhavagarwal5/PR #38479 base), so this is a v0.20-specific regression introduced when the workspace manager (PR #40941, merged Apr 27) was wired into the TQ continuation prefill path.

Reproduction

Setup: 8× RTX A4000 SM86, CUDA 13.0, driver 580.76.05, vLLM 0.20.0 from the GA tag.

vllm serve /path/to/Nemotron-3-Super-120B-AWQ-4bit \
  --tensor-parallel-size 8 --gpu-memory-utilization 0.92 \
  --max-num-seqs 4 --max-model-len 131072 \
  --trust-remote-code --enable-expert-parallel \
  --mamba-ssm-cache-dtype float16 \
  --kv-cache-dtype turboquant_3bit_nc

Send a long prompt:

import requests
prompt = 'In computational science, ' * 2000  # ~8K tokens after tokenization
r = requests.post("http://localhost:8001/v1/chat/completions", json={
    "model": "/path/to/Nemotron-3-Super-120B-AWQ-4bit",
    "messages": [{"role": "user", "content": prompt + " Reply OK."}],
    "max_tokens": 50,
})

Expected: a normal completion response.

Actual:

AssertionError: Workspace is locked but allocation from
'turboquant_attn.py:757:_continuation_prefill' requires 2.00 MB,
current size is 0.26 MB. Workspace growth is not allowed after locking.

The engine dies; subsequent requests return HTTP 500 with "EngineCore encountered an issue".

Threshold sweep on the same hardware

| Prompt repetitions × 'In computational science, ' | Total prompt tokens | v0.20.0 + TurboQuant | PR #38479 fork |
| --- | --- | --- | --- |
| 2000 | 8,020 | CRASH (workspace) | works |
| 3000 | 12,020 | CRASH (workspace) | works |
| 5000 | 20,020 | CRASH (workspace) | works |

Same model, same hardware, same TQ preset (tq-t3nc / turboquant_3bit_nc). The only difference is the vLLM revision.
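The sweep can be reproduced with a small payload builder, sketched below. The helper names are hypothetical, and the token estimate is only a linear fit to the three reported rows, so it will differ for other tokenizers. The request itself is left as a comment because it needs a running server.

```python
PHRASE = "In computational science, "
MODEL = "/path/to/Nemotron-3-Super-120B-AWQ-4bit"  # placeholder path from the report


def build_payload(repetitions: int, model: str = MODEL) -> dict:
    """Build the chat-completions request body used in the reproduction."""
    prompt = PHRASE * repetitions
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt + " Reply OK."}],
        "max_tokens": 50,
    }


def estimated_prompt_tokens(repetitions: int) -> int:
    """Linear fit to the reported rows (8,020 / 12,020 / 20,020); tokenizer-specific."""
    return 4 * repetitions + 20


for reps in (2000, 3000, 5000):
    payload = build_payload(reps)
    print(reps, estimated_prompt_tokens(reps), len(payload["messages"][0]["content"]))
    # To send against a live server (endpoint taken from the reproduction):
    # requests.post("http://localhost:8001/v1/chat/completions", json=payload)
```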

Independent of TheTom's sparse-V PR

We hit the same crash with the sparse-V PR (#41422) applied AND with stock v0.20.0. The bug is in v0.20's _continuation_prefill workspace usage, not in any open PR.

Root cause analysis (best-effort)

vllm/v1/attention/backends/turboquant_attn.py:748-760:

# Allocate slightly over to align to block_size for the grid.
# Reuse cached buffers to avoid per-call allocation (~16MB at 8K).
alloc_len = math.ceil(cached_len / block_size) * block_size
buf_shape = (1, Hk, alloc_len, D)
# Use WorkspaceManager for dequant buffers.
# Shared across all layers — saves 60× memory at long context.
# Required for CUDA Graph capture (per-layer growth incompatible with CG).
k_buf, v_buf = current_workspace_manager().get_simultaneous(
    (buf_shape, torch.float16),
    (buf_shape, torch.float16),
)

The buffer shape scales with cached_len. The workspace manager (PR #40941) locks the workspace at engine init based on the high-water mark observed during the dummy run. The dummy run uses short chunked-prefill, so the locked-in workspace is sized for short cached lengths only. At inference time, the first request that exceeds the dummy run's cached_len triggers the lock-violation assertion.
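To make the scaling concrete, the sketch below computes the combined size of k_buf and v_buf from the shape in the snippet above. The per-rank values (one KV head, head dimension 128, block size 16) are illustrative assumptions, not read from the model config, and any padding the workspace manager adds is ignored.

```python
import math

FP16_BYTES = 2


def continuation_prefill_bytes(num_kv_heads: int, cached_len: int,
                               head_dim: int, block_size: int) -> int:
    """Bytes needed for k_buf plus v_buf, each of shape (1, Hk, alloc_len, D) in float16."""
    alloc_len = math.ceil(cached_len / block_size) * block_size
    one_buffer = 1 * num_kv_heads * alloc_len * head_dim * FP16_BYTES
    return 2 * one_buffer


# Hypothetical per-rank configuration, chosen only for illustration.
for cached_len in (512, 4096, 131072):
    nbytes = continuation_prefill_bytes(num_kv_heads=1, cached_len=cached_len,
                                        head_dim=128, block_size=16)
    print(f"cached_len={cached_len:>6}: {nbytes / 2**20:.2f} MiB")
```

Under these assumed values the requirement grows linearly from 0.25 MiB at 512 cached tokens to 2 MiB at 4096 and 64 MiB at 131072. That is the same order of magnitude as the figures in the assertion message, though the match may be coincidental.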

Two interacting issues:

  1. Profile undercount: the dummy run (is_profile=True, force_eager=True) doesn't exercise long-context continuation prefill. Whatever schedule the profile runs, it produces a cached_len smaller than what real workloads encounter when prefill chunking kicks in past max_num_batched_tokens.

  2. Lock-vs-growth: the workspace manager is correct to lock for CUDA Graph compatibility, but a TQ-aware sizing heuristic is needed so the locked size accommodates the worst case the user actually configured (max_model_len).
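The interaction of the two issues can be illustrated with a toy model. ToyWorkspace below is a hypothetical stand-in, not vLLM's WorkspaceManager: it grows to the largest request seen before locking and refuses growth afterwards. The buffer sizes assume one KV head and head dimension 128.

```python
class ToyWorkspace:
    """Illustrative lock-after-profiling workspace (not vLLM's implementation)."""

    def __init__(self) -> None:
        self.size_bytes = 0
        self.locked = False

    def get(self, nbytes: int, caller: str) -> int:
        if nbytes > self.size_bytes:
            if self.locked:
                raise AssertionError(
                    f"Workspace is locked but allocation from {caller!r} requires "
                    f"{nbytes / 2**20:.2f} MB, current size is "
                    f"{self.size_bytes / 2**20:.2f} MB."
                )
            self.size_bytes = nbytes  # grow to the new high-water mark
        return nbytes

    def lock(self) -> None:
        self.locked = True


def kv_dequant_bytes(cached_len: int, num_kv_heads: int = 1, head_dim: int = 128) -> int:
    """Two float16 buffers of shape (1, Hk, cached_len, D); values are assumptions."""
    return 2 * num_kv_heads * cached_len * head_dim * 2


ws = ToyWorkspace()
ws.get(kv_dequant_bytes(512), "profile_run")  # short dummy run sets the high-water mark
ws.lock()                                     # locked, as before CUDA graph capture

try:
    ws.get(kv_dequant_bytes(4096), "_continuation_prefill")  # real long-context request
    crashed = False
except AssertionError as exc:
    crashed = True
    print(exc)
```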

This is the same conceptual failure mode jhsmith409 reported on PR #39931 for hybrid GDN models (chunk_fwd_o OOM at o = torch.empty_like(v)), where their proposed mitigation was a 1.25 GiB safety-margin overlay in gpu_worker.py. Different op, same upstream cause.

Suggested fixes

In rough order of invasiveness:

  1. Profile-time long-context dummy run: have the engine init explicitly run a continuation prefill at max_model_len or a representative cached length so the workspace high-water mark covers production workloads. Adds ~5-10 seconds to startup. Symmetrical to the existing short-prefill profile pass.

  2. TQ-aware static workspace sizing: compute the worst-case _continuation_prefill buffer from (num_kv_heads, max_model_len, head_dim) at engine init and reserve it ahead of locking, regardless of what the dummy run touches.

  3. Allow late-growth with replay-aware lock: relax the workspace lock to permit growth that re-records affected CUDA Graphs. More complex, broader implications.

  4. Per-call fallback: when the workspace can't satisfy the request, fall back to direct torch.empty allocation with a one-shot warning. Pessimizes long-context decode but keeps the engine alive instead of asserting.
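Options 2 and 4 can be sketched together in a self-contained toy. LockedPool and worst_case_bytes are hypothetical names, the pool only reports which path a request would take, and in a real patch the fallback branch would be the direct torch.empty allocation described above. The configuration values are assumptions.

```python
import math
import warnings


def worst_case_bytes(num_kv_heads: int, max_model_len: int, head_dim: int,
                     block_size: int, elem_bytes: int = 2) -> int:
    """Option 2: size k_buf plus v_buf for the longest cached length the user configured."""
    alloc_len = math.ceil(max_model_len / block_size) * block_size
    return 2 * num_kv_heads * alloc_len * head_dim * elem_bytes


class LockedPool:
    """Hypothetical fixed-size pool standing in for a locked workspace."""

    def __init__(self, reserved_bytes: int) -> None:
        self.reserved_bytes = reserved_bytes
        self._warned = False

    def allocate(self, nbytes: int) -> str:
        if nbytes <= self.reserved_bytes:
            return "workspace"
        # Option 4: warn once, then allocate per call instead of asserting.
        if not self._warned:
            warnings.warn("workspace too small; falling back to per-call allocation")
            self._warned = True
        return "fallback"


# Assumed per-rank values: 1 KV head, head_dim 128, block_size 16.
request = worst_case_bytes(1, 20_020, 128, 16)

reserved_up_front = LockedPool(worst_case_bytes(1, 131_072, 128, 16))
print(reserved_up_front.allocate(request))   # served from the reserved workspace

sized_by_short_profile = LockedPool(worst_case_bytes(1, 512, 128, 16))
print(sized_by_short_profile.allocate(request))  # falls back, engine stays alive
```

Reserving for max_model_len trades a fixed amount of memory (64 MiB under these assumed values) for never hitting the lock. The fallback keeps memory low but pays an allocation on every oversized call.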

We applied #1 locally as a stopgap (overrode the dummy run to include a worst-case continuation prefill) and confirmed it resolves the crash on Super-120B. Happy to share the diff.

Environment

  • vLLM: 0.20.0 (vllm==0.20.0+precompiled, also reproduced on rebuilt-from-source 0.20.1.dev with the same line numbers)
  • PyTorch: 2.11.0+cu130
  • Driver/CUDA: 580.76.05 / 13.0
  • Hardware: 8× RTX A4000 (SM86)
  • Models that reproduce: Nemotron-3-Super-120B-AWQ-4bit, Nemotron-Cascade-2-30B-A3B (BF16). Likely any TQ-enabled model whose _continuation_prefill is invoked with cached_len larger than the engine's profile run.
  • Models that do NOT reproduce: same models on PR #38479 fork (pre-workspace-manager).

Related

  • PR #40941 (merged Apr 27): introduced the WorkspaceManager and wired it into TQ continuation prefill. Likely where the regression originated.
  • PR #41422 (TheTom sparse-V): independent change; its long-context test plan is blocked by this bug on Ampere/Ada.
  • PRs #39931, #41185, #41123 (hybrid model TQ work): unrelated, but they share the same family of "dummy run undercounts the real workload" failure mode.

— MidasMining, 8× RTX A4000 SM86 / vLLM 0.20.0

extent analysis

TL;DR

The most likely fix for the workspace allocation failure in _continuation_prefill is to implement a profile-time long-context dummy run to ensure the workspace high-water mark covers production workloads.

Guidance

  • Verify that the issue is indeed caused by the workspace manager locking the workspace at engine init based on a short chunked-prefill dummy run, which undercounts the real workload.
  • Consider implementing a TQ-aware static workspace sizing heuristic to compute the worst-case _continuation_prefill buffer size at engine init and reserve it ahead of locking.
  • As a temporary workaround, apply the per-call fallback approach, which falls back to direct torch.empty allocation with a one-shot warning when the workspace can't satisfy the request.
  • Review the suggested fixes in the issue body, including profile-time long-context dummy run, TQ-aware static workspace sizing, and allowing late-growth with replay-aware lock.

Example

No code snippet is provided as the issue is more related to the configuration and setup of the workspace manager.

Notes

The issue appears to be specific to vLLM 0.20.0 with TurboQuant enabled. The suggested fixes may have different implications depending on the use case and hardware configuration.

Recommendation

Apply the profile-time long-context dummy run fix: it is listed first in the issue's rough order of invasiveness, and the reporter confirmed that it resolves the crash on Super-120B. Per the issue, it adds roughly 5-10 seconds to engine startup and ensures that the workspace high-water mark covers production workloads.
