vllm - ✅(Solved) Fix [Bug]: TurboQuant `_continuation_prefill` OOMs and kills engine at long-context prefill (~185K actual tokens) [1 pull requests]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

On the open hybrid-TurboQuant stack (umbrella PR #39931), a single chat-completion request whose tokenized prompt exceeds ~185K tokens reliably kills the entire vLLM engine (CUDA OOM in the prefill path), not just the individual request. The server has to be restarted. This is well below the --max-model-len=262144 the model advertises and KV cache reports as provisioned.

Error Message

File ".../vllm/v1/attention/backends/turboquant_attn.py", line 619, in _prefill_attention out = self._continuation_prefill(...) File ".../vllm/v1/attention/backends/turboquant_attn.py", line 729, in _continuation_prefill v_full = torch.cat([v_cached_trim.to(qdtype), val_chunk], dim=0) torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 184.00 MiB. GPU 0 has a total capacity of 31.36 GiB of which 169.75 MiB is free.

Root Cause

_continuation_prefill has two redundant peak-memory materializations at long context:

  1. FP32 intermediate for the inverse Hadamard rotation of MSE keys. k_cached[0, :, :cached_len, :].reshape(-1, D).float() @ Pi allocates a cached_len*Hk*D*4 B FP32 tensor, then the .to(torch.float16) cast allocates another FP16 copy. This FP32 widening serves no purpose — the keys were reconstructed from 3-4 bit MSE indices, so FP16 roundoff on the rotation is orders of magnitude below the quantization noise already in the cache.
  2. Redundant .contiguous() on transposed K/V views before torch.cat. torch.cat always produces a contiguous output regardless of input contiguity, so the upstream .contiguous() calls doubled the transient footprint for no reason.

Together these spikes exceed the remaining activation headroom on a 32 GiB card once cached_len approaches ~190K.

Fix Action

Fix

PR against JartX/vllm#feature-hybrid-turboquant (cross-fork): surgical two-change fix in _continuation_prefill — rotate in FP16, drop redundant .contiguous(). Verified to push the ceiling from ~196K actual to at least 208K actual tokens (260K target), covering the full max-model-len. NIAH retrieval accuracy unchanged. Will link here once opened.

PR fix notes

PR #11: fix(turboquant_attn): eliminate peak-memory spike in _continuation_prefill

Description (problem / solution / changelog)

Fixes vllm-project/vllm#40420 — long-prefill OOM in _continuation_prefill that crashes the entire engine (not just the request) at ~185K actual tokens on a 32 GiB card, well below the configured --max-model-len=262144.

Root cause

_continuation_prefill has two redundant peak-memory materializations at long context:

  1. FP32 intermediate for the inverse Hadamard rotation. k_cached[..].reshape(-1, D).float() @ Pi allocates a cached_len*Hk*D*4 B FP32 tensor, then .to(torch.float16) allocates a FP16 copy. The FP32 widening serves no accuracy purpose — keys were reconstructed from 3-4 bit MSE indices, so FP16 roundoff on the rotation is orders of magnitude below the quantization noise already in the cache.
  2. Redundant .contiguous() on transposed K/V views before torch.cat. torch.cat produces a contiguous output regardless of input contiguity, so the upstream .contiguous() doubled the transient footprint for no reason.

Changes

Two small, local changes to _continuation_prefill. No API changes, no new persistent state, no change to kernel interfaces. 18 insertions / 12 deletions, single file.

Verification

RTX 5090 (32 GiB) + cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit + --kv-cache-dtype=turboquant_4bit_nc --max-model-len=262144 --gpu-memory-utilization=0.87:

target tokensactual prompt tokensbeforeafter
230,000184,079PASSPASS
245,000~196,000CRASHPASS
255,000204,081CRASHPASS
260,000208,082CRASHPASS

New ceiling covers the full advertised max-model-len. NIAH retrieval accuracy (45/45 across 8k/32k/128k/160k/192k) is unchanged on the fixed path.

Scope

Does not touch the Triton dequant kernel, the decode path, or the cache layout — purely the PyTorch glue between dequant and flash_attn_varlen_func. Should be safe to merge into feature/hybrid_turboquant independently of the other outstanding PRs in the stack.

Changed files

  • vllm/v1/attention/backends/turboquant_attn.py (modified, +18/-12)

Code Example

File ".../vllm/v1/attention/backends/turboquant_attn.py", line 619, in _prefill_attention
    out = self._continuation_prefill(...)
File ".../vllm/v1/attention/backends/turboquant_attn.py", line 729, in _continuation_prefill
    v_full = torch.cat([v_cached_trim.to(qdtype), val_chunk], dim=0)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 184.00 MiB.
  GPU 0 has a total capacity of 31.36 GiB of which 169.75 MiB is free.
RAW_BUFFERClick to expand / collapse

Summary

On the open hybrid-TurboQuant stack (umbrella PR #39931), a single chat-completion request whose tokenized prompt exceeds ~185K tokens reliably kills the entire vLLM engine (CUDA OOM in the prefill path), not just the individual request. The server has to be restarted. This is well below the --max-model-len=262144 the model advertises and KV cache reports as provisioned.

Environment

  • GPU: RTX 5090 (sm_120, 32 GiB)
  • Model: cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit
  • --kv-cache-dtype=turboquant_4bit_nc --max-model-len=262144 --max-num-seqs=16 --gpu-memory-utilization=0.87
  • GPU KV cache size reported at startup: 548,864 tokens · 2.00x concurrency at 256K
  • Base image: vllm/vllm-openai:cu130-nightly + PR #39931 + the 6 open follow-up PRs I've been tracking (LCM page-size #40128, GDN dual-stream #39748, FA3/4 passthrough #40092, FLA TMA gate #37700, hybrid kv-token capacity #40384, triton decode OOB clamp #40074).

Reproducer

  1. Start the stack with --kv-cache-dtype=turboquant_4bit_nc --max-model-len=262144 --gpu-memory-utilization=0.87.
  2. Send one chat completion whose tokenized prompt is between ~185K and 262K tokens (e.g. a long NIAH-style haystack).

OOM-probe boundary:

target tokens (haystack)actual prompt tokensresult
200,000160,083PASS
215,000172,080PASS
230,000184,079PASS
245,000~196,000CRASH (engine dies)

Short-context (up to ~154K actual), stress_32k, and 5-way concurrent story generation all pass clean on the same config — this is purely a long-prefill path issue.

Stack (reproduced twice, identical)

File ".../vllm/v1/attention/backends/turboquant_attn.py", line 619, in _prefill_attention
    out = self._continuation_prefill(...)
File ".../vllm/v1/attention/backends/turboquant_attn.py", line 729, in _continuation_prefill
    v_full = torch.cat([v_cached_trim.to(qdtype), val_chunk], dim=0)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 184.00 MiB.
  GPU 0 has a total capacity of 31.36 GiB of which 169.75 MiB is free.

Root cause

_continuation_prefill has two redundant peak-memory materializations at long context:

  1. FP32 intermediate for the inverse Hadamard rotation of MSE keys. k_cached[0, :, :cached_len, :].reshape(-1, D).float() @ Pi allocates a cached_len*Hk*D*4 B FP32 tensor, then the .to(torch.float16) cast allocates another FP16 copy. This FP32 widening serves no purpose — the keys were reconstructed from 3-4 bit MSE indices, so FP16 roundoff on the rotation is orders of magnitude below the quantization noise already in the cache.
  2. Redundant .contiguous() on transposed K/V views before torch.cat. torch.cat always produces a contiguous output regardless of input contiguity, so the upstream .contiguous() calls doubled the transient footprint for no reason.

Together these spikes exceed the remaining activation headroom on a 32 GiB card once cached_len approaches ~190K.

Severity

Engine-wide, not request-scoped. The engine dies with EngineDeadError / EngineCore encountered an issue, the API surface returns 500 for subsequent calls, and the container needs a restart. Any public-facing deployment built on this stack is one long-prompt-per-session away from an outage.

Effective usable max-model-len on 32 GiB is ~230K target (~185K actual) — not the 256K the config advertises.

Fix

PR against JartX/vllm#feature-hybrid-turboquant (cross-fork): surgical two-change fix in _continuation_prefill — rotate in FP16, drop redundant .contiguous(). Verified to push the ceiling from ~196K actual to at least 208K actual tokens (260K target), covering the full max-model-len. NIAH retrieval accuracy unchanged. Will link here once opened.

extent analysis

TL;DR

The most likely fix is to apply a two-change fix in _continuation_prefill to rotate in FP16 and drop redundant .contiguous() calls to prevent CUDA out-of-memory errors.

Guidance

  • Identify the _continuation_prefill function in turboquant_attn.py and locate the lines causing the memory allocation issues.
  • Modify the code to perform the inverse Hadamard rotation of MSE keys in FP16 instead of FP32 to reduce memory usage.
  • Remove the redundant .contiguous() calls on transposed K/V views before torch.cat to further reduce memory allocation.
  • Verify the changes by testing the updated code with long-prefill paths and checking for CUDA out-of-memory errors.

Example

# Before
v_full = torch.cat([v_cached_trim.to(qdtype), val_chunk], dim=0)
# ...
k_cached[0, :, :cached_len, :].reshape(-1, D).float() @ Pi

# After
v_full = torch.cat([v_cached_trim.to(qdtype), val_chunk], dim=0)
# ...
k_cached[0, :, :cached_len, :].reshape(-1, D).to(torch.float16) @ Pi

Notes

The provided fix is specific to the _continuation_prefill function and may not address other potential memory issues in the codebase. Additionally, the effectiveness of the fix may depend on the specific use case and input data.

Recommendation

Apply the workaround by modifying the _continuation_prefill function to rotate in FP16 and drop redundant .contiguous() calls, as this fix has been verified to push the ceiling from ~196K actual to at least 208K actual tokens.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

vllm - ✅(Solved) Fix [Bug]: TurboQuant `_continuation_prefill` OOMs and kills engine at long-context prefill (~185K actual tokens) [1 pull requests]