vllm - ✅(Solved) Fix [Bug]: TurboQuant _continuation_prefill workspace allocation fails at long context — v0.20.0 regression [1 pull requests, 1 comments, 1 participants]

MidasMining · 2026-05-03T17:22:27Z

vllm-project/vllm#41565•Fetched 2026-05-04 04:58:47

On vLLM 0.20.0 with TurboQuant enabled, any prefill request whose cached KV grows beyond ~6-8K tokens triggers an assertion failure in _continuation_prefill's workspace allocation. The workspace is sized to 0.26 MB at startup but _continuation_prefill needs ~2 MB at long context, and the workspace is locked against growth after engine init.

The same workloads run cleanly on the v0.18-era TQ fork (vibhavagarwal5/PR #38479 base), so this is a v0.20-specific regression introduced when the workspace manager (PR #40941, merged Apr 27) was wired into the TQ continuation prefill path.

Error Message

AssertionError: Workspace is locked but allocation from 'turboquant_attn.py:757:_continuation_prefill' requires 2.00 MB, current size is 0.26 MB. Workspace growth is not allowed after locking.

Fix / Workaround

This is the same conceptual failure mode jhsmith409 reported on PR #39931 for hybrid GDN models (chunk_fwd_o OOM at o = torch.empty_like(v)), where their proposed mitigation was a 1.25 GiB safety-margin overlay in gpu_worker.py. Different op, same upstream cause.

PR fix notes

PR #40798: [TurboQuant] Share decode scratch workspace across layers

Repository: vllm-project/vllm
Author: Bot1822
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40798

Description (problem / solution / changelog)

Summary

TurboQuant currently registers three decode scratch buffers on every attention layer. These buffers are temporary decode workspace, but because they are registered per layer they scale with the number of TurboQuant layers and with max_num_seqs.

This PR moves TurboQuant decode scratch allocation to the v1 workspace manager so the scratch tensors are shared across layers. It also reserves the maximum TurboQuant decode workspace before CUDA graph capture locks the workspace, preventing locked-workspace growth at runtime.

This PR is scoped to TurboQuant decode scratch memory usage. It does not claim to fix TurboQuant speculative-decoding correctness issues.

Motivation

On large models and large H100/H200 server defaults, the per-layer scratch buffers can consume tens of GiB of non-KV memory and significantly reduce KV cache capacity.

Measured on H200 with Llama-3.1-70B, TP=2, kv_cache_dtype=turboquant_3bit_nc, max_model_len=65536, gpu_memory_utilization=0.90:

| version | model loading memory | available KV memory | GPU KV cache size | max concurrency @ 65,536 |
| --- | ---: | ---: | ---: | ---: |
| before | 105.23 GiB | 14.61 GiB | 400,128 tokens | 6.11x |
| after | 65.74 GiB | 53.97 GiB | 1,478,384 tokens | 22.56x |

The old per-layer buffer cost for this setup is about 39.5 GiB:

B = 1024
Hq = 32
S = 32
D = 128
TQ layers = 76
mid_o ~= B * Hq * S * (D + 1) * 4 bytes * 76 ~= 39.5 GiB
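
The estimate above can be checked with a few lines of arithmetic. This is only a sketch: it takes the sizes quoted in the PR description as given, and the shapes used for the output and LSE buffers (`B x Hq x D` and `B x Hq`, both float32) are assumptions that are not confirmed by the PR text.

```python
GIB = 1024 ** 3

# Values quoted in the PR description.
B, Hq, S, D = 1024, 32, 32, 128
TQ_LAYERS = 76
FP32_BYTES = 4


def gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / GIB


# mid_o term exactly as written in the PR description.
mid_o_bytes = B * Hq * S * (D + 1) * FP32_BYTES * TQ_LAYERS

# Assumed shapes for the two smaller buffers (not taken from the patch).
output_bytes = B * Hq * D * FP32_BYTES * TQ_LAYERS
lse_bytes = B * Hq * FP32_BYTES * TQ_LAYERS

total_bytes = mid_o_bytes + output_bytes + lse_bytes
print(f"mid_o: {gib(mid_o_bytes):.2f} GiB")
print(f"total: {gib(total_bytes):.2f} GiB")
```

Under these assumptions `mid_o` alone comes to about 38.3 GiB and the three buffers together to about 39.5 GiB, which is in line with the quoted figure and with the drop in model loading memory from 105.23 GiB to 65.74 GiB.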

Changes

Keep TurboQuant centroids as per-layer state.
Stop registering _tq_mid_o_buf, _tq_output_buf, and _tq_lse_buf on every attention layer.
Allocate TurboQuant decode scratch through WorkspaceManager.get_simultaneous().
Reserve a max-size TurboQuant decode workspace before lock_workspace() in CUDA graph capture.
Reserve across all attention groups so hybrid model layouts do not miss TurboQuant groups outside the first group.
Add behavior tests for the workspace allocation and reservation paths.

Duplicate-work note

I checked the open TurboQuant scratch/workspace work before publishing this PR. This PR is intentionally scoped to the v1 workspace-manager reservation path and the measured H200 Llama-3.1-70B TP=2 memory-capacity regression. It is not a general cleanup and does not claim to solve unrelated TurboQuant speculative-decoding correctness issues.

Testing

python -m py_compile tests/quantization/test_turboquant.py vllm/v1/worker/gpu_model_runner.py vllm/model_executor/layers/attention/attention.py vllm/v1/attention/ops/triton_turboquant_decode.py vllm/v1/attention/backends/turboquant_attn.py
git diff --check
H200 container: python3 -m pytest tests/quantization/test_turboquant.py -k TurboQuantDecodeWorkspace -q
- Result: 4 passed, 117 deselected, 17 warnings in 2.03s
H200 startup/capacity validation:
- model: /mnt/afs/models/Llama/Llama-3.1-70B
- GPUs: H200 2x, TP=2
- --kv-cache-dtype turboquant_3bit_nc
- --max-model-len 65536
- --gpu-memory-utilization 0.90
- result: model loading memory reduced from 105.23 GiB to 65.74 GiB; GPU KV cache size increased from 400,128 to 1,478,384 tokens
H200 serving benchmark, /v1/completions, random dataset, 32 prompts, input length 4096, output length 128, request_rate=inf, max_concurrency=8, ignore_eos=true:
- before: 0.3186 req/s, 40.78 output tok/s, mean TTFT 16809 ms, mean TPOT 65.16 ms
- after: 0.3343 req/s, 42.80 output tok/s, mean TTFT 15629 ms, mean TPOT 65.14 ms
Sampled MMLU-Pro regression, leaderboard_mmlu_pro, limit=20, 5-shot, local-completions, num_concurrent=4:
- before: acc=0.60 ± 0.1124
- after: acc=0.60 ± 0.1124
- same predicted option on 20/20 samples and same correctness on 20/20 samples

AI Assistance

AI assistance was used to prepare this patch and PR text. Guipeng Zhang is the human submitter responsible for reviewing and defending the change.

Changed files

tests/quantization/test_turboquant.py (modified, +133/-0)
vllm/model_executor/layers/attention/attention.py (modified, +4/-24)
vllm/v1/attention/backends/turboquant_attn.py (modified, +0/-11)
vllm/v1/attention/ops/triton_turboquant_decode.py (modified, +15/-8)
vllm/v1/worker/gpu_model_runner.py (modified, +31/-1)

[Bug]: TurboQuant `_continuation_prefill` workspace allocation fails at long context — v0.20.0 regression

Reproduction

Setup: 8× RTX A4000 SM86, CUDA 13.0, driver 580.76.05, vLLM 0.20.0 from the GA tag.

vllm serve /path/to/Nemotron-3-Super-120B-AWQ-4bit \
  --tensor-parallel-size 8 --gpu-memory-utilization 0.92 \
  --max-num-seqs 4 --max-model-len 131072 \
  --trust-remote-code --enable-expert-parallel \
  --mamba-ssm-cache-dtype float16 \
  --kv-cache-dtype turboquant_3bit_nc

Send a long prompt:

import requests
prompt = 'In computational science, ' * 2000  # ~8K tokens after tokenization
r = requests.post("http://localhost:8001/v1/chat/completions", json={
    "model": "/path/to/Nemotron-3-Super-120B-AWQ-4bit",
    "messages": [{"role": "user", "content": prompt + " Reply OK."}],
    "max_tokens": 50,
})

Expected: response.

Actual:

AssertionError: Workspace is locked but allocation from
'turboquant_attn.py:757:_continuation_prefill' requires 2.00 MB,
current size is 0.26 MB. Workspace growth is not allowed after locking.

Engine dies; subsequent requests return HTTP 500 EngineCore encountered an issue.

Threshold sweep on the same hardware

| Prompt repetitions × `'In computational science, '` | Total prompt tokens | v0.20.0 + TurboQuant | PR #38479 fork |
| ---: | ---: | --- | --- |
| 2000 | 8,020 | CRASH (workspace) | works |
| 3000 | 12,020 | CRASH (workspace) | works |
| 5000 | 20,020 | CRASH (workspace) | works |

Same model, same hardware, same TQ preset (tq-t3nc ↔ turboquant_3bit_nc). Only difference: vLLM revision.

Independent of TheTom's sparse-V PR

We hit the same crash with the sparse-V PR (#41422) applied AND with stock v0.20.0. The bug is in v0.20's _continuation_prefill workspace usage, not in any open PR.

Root cause analysis (best-effort)

vllm/v1/attention/backends/turboquant_attn.py:748-760:

# Allocate slightly over to align to block_size for the grid.
# Reuse cached buffers to avoid per-call allocation (~16MB at 8K).
alloc_len = math.ceil(cached_len / block_size) * block_size
buf_shape = (1, Hk, alloc_len, D)
# Use WorkspaceManager for dequant buffers.
# Shared across all layers — saves 60× memory at long context.
# Required for CUDA Graph capture (per-layer growth incompatible with CG).
k_buf, v_buf = current_workspace_manager().get_simultaneous(
    (buf_shape, torch.float16),
    (buf_shape, torch.float16),
)

The buffer shape scales with cached_len. The workspace manager (PR #40941) locks the workspace at engine init based on the high-water mark observed during the dummy run. The dummy run uses short chunked-prefill, so the locked-in workspace is sized for short cached lengths only. At inference time, the first request that exceeds the dummy run's cached_len triggers the lock-violation assertion.
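
To make the scaling concrete, here is a small sketch of how the requested size grows with `cached_len`, following the `buf_shape = (1, Hk, alloc_len, D)` expression in the snippet above. The helper name and the per-rank values (`HK = 1`, `D = 128`, `BLOCK = 16`) are illustrative assumptions; the issue does not state the model's actual per-rank head count, head dimension, or block size.

```python
import math


def continuation_prefill_bytes(
    cached_len: int,
    num_kv_heads: int,
    head_dim: int,
    block_size: int,
    dtype_bytes: int = 2,  # float16
) -> int:
    """Bytes for the K and V dequant buffers of shape (1, Hk, alloc_len, D)."""
    # Round the cached length up to a multiple of block_size, as in the source snippet.
    alloc_len = math.ceil(cached_len / block_size) * block_size
    per_buffer = 1 * num_kv_heads * alloc_len * head_dim * dtype_bytes
    return 2 * per_buffer  # k_buf and v_buf are requested together


# Hypothetical per-rank values, chosen only for illustration.
HK, D, BLOCK = 1, 128, 16
for cached_len in (512, 4096, 8020, 131072):
    mib = continuation_prefill_bytes(cached_len, HK, D, BLOCK) / 2**20
    print(f"cached_len={cached_len:>6}: {mib:.2f} MiB")
```

The request grows linearly with `cached_len`, so any workspace sized from a short profile run is eventually exceeded once the cached prefix gets long enough.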

Two interacting issues:

1. Profile undercount: the dummy run (is_profile=True, force_eager=True) doesn't exercise long-context continuation prefill. Whatever schedule the profile runs, it produces a cached_len smaller than what real workloads encounter when prefill chunking kicks in past max_num_batched_tokens.
2. Lock-vs-growth: the workspace manager is correct to lock for CUDA Graph compatibility, but a TQ-aware sizing heuristic is needed so the locked size accommodates the worst case the user actually configured (max_model_len).
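
The interaction of the two issues can be reproduced with a toy model. `ToyWorkspace` below is a hypothetical stand-in written for this illustration; it is not vLLM's `WorkspaceManager` and only mimics the "grow until locked, then refuse" behavior described above.

```python
class ToyWorkspace:
    """Toy stand-in for a lockable workspace; NOT vLLM's WorkspaceManager."""

    def __init__(self) -> None:
        self.size_bytes = 0
        self.locked = False

    def reserve(self, num_bytes: int) -> None:
        """Grow to at least num_bytes; growth is refused once locked."""
        if num_bytes > self.size_bytes:
            if self.locked:
                raise AssertionError(
                    f"workspace is locked: need {num_bytes} bytes, "
                    f"have {self.size_bytes}"
                )
            self.size_bytes = num_bytes

    def lock(self) -> None:
        self.locked = True


def run(profile_requests, live_requests, pre_reserve=0):
    """Simulate engine init (profile + lock) followed by live traffic."""
    ws = ToyWorkspace()
    ws.reserve(pre_reserve)       # optional worst-case reservation before locking
    for n in profile_requests:    # the profile run sets the high-water mark
        ws.reserve(n)
    ws.lock()
    for n in live_requests:       # real requests arrive after the lock
        ws.reserve(n)
    return ws.size_bytes


MIB = 2**20
try:
    # Short profile run, then a long-context request: growth is refused.
    run(profile_requests=[MIB // 4], live_requests=[2 * MIB])
except AssertionError as exc:
    print("failed:", exc)

# Reserving the worst case before the lock lets the same request succeed.
print(run(profile_requests=[MIB // 4], live_requests=[2 * MIB], pre_reserve=2 * MIB))
```

In this toy model the lock itself is never relaxed; the only change in the second call is that the worst-case size is reserved before locking.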

Suggested fixes

In rough order of invasiveness:

1. Profile-time long-context dummy run: have the engine init explicitly run a continuation prefill at max_model_len or a representative cached length so the workspace high-water mark covers production workloads. Adds ~5-10 seconds to startup. Symmetrical to the existing short-prefill profile pass.
2. TQ-aware static workspace sizing: compute the worst-case _continuation_prefill buffer from (num_kv_heads, max_model_len, head_dim) at engine init and reserve it ahead of locking, regardless of what the dummy run touches.
3. Allow late-growth with replay-aware lock: relax the workspace lock to permit growth that re-records affected CUDA Graphs. More complex, broader implications.
4. Per-call fallback: when the workspace can't satisfy the request, fall back to direct torch.empty allocation with a one-shot warning. Pessimizes long-context decode but keeps the engine alive instead of asserting.

We applied #1 locally as a stopgap (overrode the dummy run to include a worst-case continuation prefill) and confirmed it resolves the crash on Super-120B. Happy to share the diff.

Environment

vLLM: 0.20.0 (vllm==0.20.0+precompiled, also reproduced on rebuilt-from-source 0.20.1.dev with the same line numbers)
PyTorch: 2.11.0+cu130
Driver/CUDA: 580.76.05 / 13.0
Hardware: 8× RTX A4000 (SM86)
Models that reproduce: Nemotron-3-Super-120B-AWQ-4bit, Nemotron-Cascade-2-30B-A3B (BF16). Likely any TQ-enabled model whose _continuation_prefill is invoked with cached_len larger than the engine's profile run.
Models that do NOT reproduce: same models on PR #38479 fork (pre-workspace-manager).

Related

PR #40941 (merged Apr 27): introduced the WorkspaceManager and wired it into TQ continuation prefill. Likely where the regression originated.
PR #41422 (TheTom sparse-V): independent change; its long-context test plan is blocked by this bug on Ampere/Ada.
PRs #39931, #41185, #41123 (hybrid model TQ work): unrelated, but they share the same family of "dummy run undercounts the real workload" failure mode.

— MidasMining, 8× RTX A4000 SM86 / vLLM 0.20.0

extent analysis

TL;DR

The most likely fix for the workspace allocation failure in _continuation_prefill is to implement a profile-time long-context dummy run to ensure the workspace high-water mark covers production workloads.

Guidance

Verify that the issue is indeed caused by the workspace manager locking the workspace at engine init based on a short chunked-prefill dummy run, which undercounts the real workload.
Consider implementing a TQ-aware static workspace sizing heuristic to compute the worst-case _continuation_prefill buffer size at engine init and reserve it ahead of locking.
As a temporary workaround, apply the per-call fallback approach, which falls back to direct torch.empty allocation with a one-shot warning when the workspace can't satisfy the request.
Review the suggested fixes in the issue body, including profile-time long-context dummy run, TQ-aware static workspace sizing, and allowing late-growth with replay-aware lock.

Example

No code snippet is provided as the issue is more related to the configuration and setup of the workspace manager.

Notes

The issue appears to be specific to vLLM 0.20.0 with TurboQuant enabled. The suggested fixes trade off differently depending on the workload and hardware configuration.

Recommendation

Apply the profile-time long-context dummy run fix: it is listed as the least invasive option and the reporter confirmed that it resolves the crash on Super-120B. It adds roughly 5-10 seconds to engine startup and ensures that the workspace high-water mark covers production workloads.
