vllm - ✅(Solved) Fix [Bug]: TurboQuant `_continuation_prefill` OOMs and kills engine at long-context prefill (~185K actual tokens) [1 pull requests]

vllm2026-04-21 01:30:59

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

On the open hybrid-TurboQuant stack (umbrella PR #39931), a single chat-completion request whose tokenized prompt exceeds ~185K tokens reliably kills the entire vLLM engine (CUDA OOM in the prefill path), not just the individual request. The server has to be restarted. This is well below the --max-model-len=262144 the model advertises and KV cache reports as provisioned.

Error Message

File ".../vllm/v1/attention/backends/turboquant_attn.py", line 619, in _prefill_attention out = self._continuation_prefill(...) File ".../vllm/v1/attention/backends/turboquant_attn.py", line 729, in _continuation_prefill v_full = torch.cat([v_cached_trim.to(qdtype), val_chunk], dim=0) torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 184.00 MiB. GPU 0 has a total capacity of 31.36 GiB of which 169.75 MiB is free.

Root Cause

_continuation_prefill has two redundant peak-memory materializations at long context:

FP32 intermediate for the inverse Hadamard rotation of MSE keys. k_cached[0, :, :cached_len, :].reshape(-1, D).float() @ Pi allocates a cached_len*Hk*D*4 B FP32 tensor, then the .to(torch.float16) cast allocates another FP16 copy. This FP32 widening serves no purpose — the keys were reconstructed from 3-4 bit MSE indices, so FP16 roundoff on the rotation is orders of magnitude below the quantization noise already in the cache.
Redundant .contiguous() on transposed K/V views before torch.cat. torch.cat always produces a contiguous output regardless of input contiguity, so the upstream .contiguous() calls doubled the transient footprint for no reason.

Together these spikes exceed the remaining activation headroom on a 32 GiB card once cached_len approaches ~190K.

Fix Action

Fix

PR against JartX/vllm#feature-hybrid-turboquant (cross-fork): surgical two-change fix in _continuation_prefill — rotate in FP16, drop redundant .contiguous(). Verified to push the ceiling from ~196K actual to at least 208K actual tokens (260K target), covering the full max-model-len. NIAH retrieval accuracy unchanged. Will link here once opened.

PR fix notes

PR #11: fix(turboquant_attn): eliminate peak-memory spike in _continuation_prefill

Repository: JartX/vllm
Author: jhsmith409
State: open | merged: False
Link: https://github.com/JartX/vllm/pull/11

Description (problem / solution / changelog)

Fixes vllm-project/vllm#40420 — long-prefill OOM in _continuation_prefill that crashes the entire engine (not just the request) at ~185K actual tokens on a 32 GiB card, well below the configured --max-model-len=262144.

Root cause

_continuation_prefill has two redundant peak-memory materializations at long context:

FP32 intermediate for the inverse Hadamard rotation. k_cached[..].reshape(-1, D).float() @ Pi allocates a cached_len*Hk*D*4 B FP32 tensor, then .to(torch.float16) allocates a FP16 copy. The FP32 widening serves no accuracy purpose — keys were reconstructed from 3-4 bit MSE indices, so FP16 roundoff on the rotation is orders of magnitude below the quantization noise already in the cache.
Redundant .contiguous() on transposed K/V views before torch.cat. torch.cat produces a contiguous output regardless of input contiguity, so the upstream .contiguous() doubled the transient footprint for no reason.

Changes

Two small, local changes to _continuation_prefill. No API changes, no new persistent state, no change to kernel interfaces. 18 insertions / 12 deletions, single file.

Verification

RTX 5090 (32 GiB) + cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit + --kv-cache-dtype=turboquant_4bit_nc --max-model-len=262144 --gpu-memory-utilization=0.87:

target tokens	actual prompt tokens	before	after
230,000	184,079	PASS	PASS
245,000	~196,000	CRASH	PASS
255,000	204,081	CRASH	PASS
260,000	208,082	CRASH	PASS

New ceiling covers the full advertised max-model-len. NIAH retrieval accuracy (45/45 across 8k/32k/128k/160k/192k) is unchanged on the fixed path.

Scope

Does not touch the Triton dequant kernel, the decode path, or the cache layout — purely the PyTorch glue between dequant and flash_attn_varlen_func. Should be safe to merge into feature/hybrid_turboquant independently of the other outstanding PRs in the stack.

Changed files

vllm/v1/attention/backends/turboquant_attn.py (modified, +18/-12)

Code Example

File ".../vllm/v1/attention/backends/turboquant_attn.py", line 619, in _prefill_attention
    out = self._continuation_prefill(...)
File ".../vllm/v1/attention/backends/turboquant_attn.py", line 729, in _continuation_prefill
    v_full = torch.cat([v_cached_trim.to(qdtype), val_chunk], dim=0)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 184.00 MiB.
  GPU 0 has a total capacity of 31.36 GiB of which 169.75 MiB is free.

RAW_BUFFERClick to expand / collapse

Summary

Environment

GPU: RTX 5090 (sm_120, 32 GiB)
Model: cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit
--kv-cache-dtype=turboquant_4bit_nc --max-model-len=262144 --max-num-seqs=16 --gpu-memory-utilization=0.87
GPU KV cache size reported at startup: 548,864 tokens · 2.00x concurrency at 256K
Base image: vllm/vllm-openai:cu130-nightly + PR #39931 + the 6 open follow-up PRs I've been tracking (LCM page-size #40128, GDN dual-stream #39748, FA3/4 passthrough #40092, FLA TMA gate #37700, hybrid kv-token capacity #40384, triton decode OOB clamp #40074).

Reproducer

Start the stack with --kv-cache-dtype=turboquant_4bit_nc --max-model-len=262144 --gpu-memory-utilization=0.87.
Send one chat completion whose tokenized prompt is between ~185K and 262K tokens (e.g. a long NIAH-style haystack).

OOM-probe boundary:

target tokens (haystack)	actual prompt tokens	result
200,000	160,083	PASS
215,000	172,080	PASS
230,000	184,079	PASS
245,000	~196,000	CRASH (engine dies)

Short-context (up to ~154K actual), stress_32k, and 5-way concurrent story generation all pass clean on the same config — this is purely a long-prefill path issue.

Stack (reproduced twice, identical)

File ".../vllm/v1/attention/backends/turboquant_attn.py", line 619, in _prefill_attention
    out = self._continuation_prefill(...)
File ".../vllm/v1/attention/backends/turboquant_attn.py", line 729, in _continuation_prefill
    v_full = torch.cat([v_cached_trim.to(qdtype), val_chunk], dim=0)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 184.00 MiB.
  GPU 0 has a total capacity of 31.36 GiB of which 169.75 MiB is free.

Root cause

_continuation_prefill has two redundant peak-memory materializations at long context:

FP32 intermediate for the inverse Hadamard rotation of MSE keys. k_cached[0, :, :cached_len, :].reshape(-1, D).float() @ Pi allocates a cached_len*Hk*D*4 B FP32 tensor, then the .to(torch.float16) cast allocates another FP16 copy. This FP32 widening serves no purpose — the keys were reconstructed from 3-4 bit MSE indices, so FP16 roundoff on the rotation is orders of magnitude below the quantization noise already in the cache.
Redundant .contiguous() on transposed K/V views before torch.cat. torch.cat always produces a contiguous output regardless of input contiguity, so the upstream .contiguous() calls doubled the transient footprint for no reason.

Together these spikes exceed the remaining activation headroom on a 32 GiB card once cached_len approaches ~190K.

Severity

Engine-wide, not request-scoped. The engine dies with EngineDeadError / EngineCore encountered an issue, the API surface returns 500 for subsequent calls, and the container needs a restart. Any public-facing deployment built on this stack is one long-prompt-per-session away from an outage.

Effective usable max-model-len on 32 GiB is ~230K target (~185K actual) — not the 256K the config advertises.

Fix

extent analysis

TL;DR

The most likely fix is to apply a two-change fix in _continuation_prefill to rotate in FP16 and drop redundant .contiguous() calls to prevent CUDA out-of-memory errors.

Guidance

Identify the _continuation_prefill function in turboquant_attn.py and locate the lines causing the memory allocation issues.
Modify the code to perform the inverse Hadamard rotation of MSE keys in FP16 instead of FP32 to reduce memory usage.
Remove the redundant .contiguous() calls on transposed K/V views before torch.cat to further reduce memory allocation.
Verify the changes by testing the updated code with long-prefill paths and checking for CUDA out-of-memory errors.

Example

# Before
v_full = torch.cat([v_cached_trim.to(qdtype), val_chunk], dim=0)
# ...
k_cached[0, :, :cached_len, :].reshape(-1, D).float() @ Pi

# After
v_full = torch.cat([v_cached_trim.to(qdtype), val_chunk], dim=0)
# ...
k_cached[0, :, :cached_len, :].reshape(-1, D).to(torch.float16) @ Pi

Notes

The provided fix is specific to the _continuation_prefill function and may not address other potential memory issues in the codebase. Additionally, the effectiveness of the fix may depend on the specific use case and input data.

Recommendation

Apply the workaround by modifying the _continuation_prefill function to rotate in FP16 and drop redundant .contiguous() calls, as this fix has been verified to push the ceiling from ~196K actual to at least 208K actual tokens.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #model loading #dependency error #configuration error #environment variable

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

vllm - ✅(Solved) Fix [Bug]: TurboQuant `_continuation_prefill` OOMs and kills engine at long-context prefill (~185K actual tokens) [1 pull requests]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fix

PR fix notes

PR #11: fix(turboquant_attn): eliminate peak-memory spike in _continuation_prefill

Description (problem / solution / changelog)

Root cause

Changes

Verification

Scope

Changed files

Code Example

Summary

Environment

Reproducer

Stack (reproduced twice, identical)

Root cause

Severity

Fix

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING