vllm - 💡(How to fix) Fix [Bug]: Qwen3-VL deepstack ValueError "Requested more deepstack tokens than available in buffer" with chunked prefill + prefix caching [2 comments, 2 participants]

vllm-project/vllm#41485 (fetched 2026-05-02 05:27:52)

Your current environment

<details> <summary>Environment (excerpts)</summary>
  • vLLM: v0.20.0 (latest at time of report, Docker vllm/vllm-openai:v0.20.0)
  • Model: Qwen/Qwen3-VL-30B-A3B-Instruct-FP8
  • GPU: 1× NVIDIA RTX 6000 Ada (48 GB)
  • Driver / CUDA: as shipped in the v0.20.0 image
  • OS: Linux 6.17 (Ubuntu 24.04 host)

Launch flags (relevant):

```
--model Qwen/Qwen3-VL-30B-A3B-Instruct-FP8
--quantization fp8
--kv-cache-dtype fp8_e4m3
--max-model-len 8192
--gpu-memory-utilization 0.50
--enable-prefix-caching   # default
--enable-chunked-prefill  # default
--max-num-seqs 40
--seed 0
```

</details>

🐛 Describe the bug

`Qwen3VLForConditionalGeneration._get_deepstack_input_embeds` raises `ValueError`, EngineCore catches it as fatal, and the engine dies (V1 `EngineDeadError`):

```
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_vl.py", line ~1716, in _get_deepstack_input_embeds
    raise ValueError(
ValueError: Requested more deepstack tokens than available in buffer: num_tokens=16 > self.deepstack_input_embeds_num_tokens=9
```

The `num_tokens` requested is consistently 16 but the buffer's `deepstack_input_embeds_num_tokens` varies (we have observed 9, 11, 14). EngineCore logs `EngineCore encountered a fatal error` → systemd has restarted the service 39 times in 3 days.
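
To check whether the requested count really is constant while the buffer size varies, the error lines can be tallied by their two numbers. The following is a minimal sketch, assuming the log lines contain the exact `ValueError` text quoted above; the helper name `summarize_deepstack_errors` and the sample lines are illustrative, not taken from the real logs.

```python
import re
from collections import Counter

# Matches the ValueError text quoted in this report and captures both counts.
DEEPSTACK_ERR = re.compile(
    r"Requested more deepstack tokens than available in buffer: "
    r"num_tokens=(\d+) > self\.deepstack_input_embeds_num_tokens=(\d+)"
)


def summarize_deepstack_errors(lines):
    """Count (requested, available) pairs found in an iterable of log lines."""
    pairs = Counter()
    for line in lines:
        match = DEEPSTACK_ERR.search(line)
        if match:
            pairs[(int(match.group(1)), int(match.group(2)))] += 1
    return pairs


# Illustrative lines built from the values quoted above (not real log output).
prefix = "ValueError: Requested more deepstack tokens than available in buffer: "
sample = [
    prefix + "num_tokens=16 > self.deepstack_input_embeds_num_tokens=9",
    prefix + "num_tokens=16 > self.deepstack_input_embeds_num_tokens=14",
    "INFO: unrelated line",
]
summary = summarize_deepstack_errors(sample)
requested_values = {requested for requested, _ in summary}
```

If `requested_values` contains a single value across a real log, that would support the observation that only the buffer side varies.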

Reproduction signal

  • Workload: OpenAI Chat Completions, single-image multimodal requests with prompts of 1500–2000 tokens. Production safety-monitoring pipeline that issues two-pass calls (primary + a longer "rescue" prompt on the same image).
  • First crash always happens after the engine has been serving normally for many minutes — it isn't a startup issue.
  • Different `mm_hash` and different image content each crash — this rules out a poison input being retried.
  • `num_tokens=16` is constant across all crashes (Qwen3-VL deepstack always seems to request 16-token writes); the buffer size is whatever `_set_deepstack_input_embeds` wrote on the previous chunk.
  • Prefix-cache hit count at crash time is high relative to prompt length (e.g. 1648/1657 hits → residual 9 tokens, which equals the `_num_tokens=9` in the error).

The pattern strongly suggests a chunked-prefill / prefix-cache misalignment: when prefix caching leaves a small residual chunk that lands inside the image's vision-token region, the buffer is sized to the chunk remainder while the language model still asks for the full 16-token deepstack contribution.
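
The arithmetic behind that observation can be written out explicitly. This is a small worked sketch using only the numbers quoted in the report; `residual_tokens` is a hypothetical helper, not a vLLM function.

```python
def residual_tokens(prompt_len: int, cached_tokens: int) -> int:
    """Tokens still to be computed after a prefix-cache hit."""
    if not 0 <= cached_tokens <= prompt_len:
        raise ValueError("cached_tokens must be between 0 and prompt_len")
    return prompt_len - cached_tokens


# Values from the report: 1648 of 1657 prompt tokens were prefix-cache hits.
residual = residual_tokens(prompt_len=1657, cached_tokens=1648)

# num_tokens from the error message; the failing condition is requested > available.
requested = 16
would_raise = requested > residual
```

Here `residual` matches the `_num_tokens=9` in the error, and `would_raise` is true because 16 exceeds 9.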

Code pointer

`vllm/model_executor/models/qwen3_vl.py`, in `Qwen3VLForConditionalGeneration.forward`:

```python
if inputs_embeds is not None and get_pp_group().is_first_rank:
    deepstack_input_embeds = self._get_deepstack_input_embeds(
        inputs_embeds.size(0)  # called with TOTAL step token count
    )
...
self._clear_deepstack_input_embeds(inputs_embeds.size(0))
```

`_get_deepstack_input_embeds` is called with `inputs_embeds.size(0)` (total step tokens), while `_set_deepstack_input_embeds` was called earlier with the vision-token count for the chunk that ran the vision encoder. Across chunks of a single multimodal request these two counts are not the same: text-only chunks of a chunked prefill, and decode steps that happen to fall into the same image's deepstack window, both seem to trip the check (a `ValueError`, not an assert).
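
The suspected mismatch can be reproduced in isolation with a toy model. The sketch below is not vLLM code: `ToyDeepstackBuffer` is a hypothetical stand-in that only mimics the behavior described in this report, namely a single scalar written by a set call and compared against the count passed to a get call.

```python
class ToyDeepstackBuffer:
    """Toy stand-in for a buffer whose size is tracked as one scalar."""

    def __init__(self) -> None:
        # Plays the role of deepstack_input_embeds_num_tokens.
        self.num_tokens = 0

    def set(self, num_tokens: int) -> None:
        """Record how many tokens the last vision-encoder chunk wrote."""
        self.num_tokens = num_tokens

    def get(self, num_tokens: int) -> list:
        """Return placeholder embeddings, failing if too many are requested."""
        if num_tokens > self.num_tokens:
            raise ValueError(
                "Requested more deepstack tokens than available in buffer: "
                f"num_tokens={num_tokens} > "
                f"self.deepstack_input_embeds_num_tokens={self.num_tokens}"
            )
        return list(range(num_tokens))


buffer = ToyDeepstackBuffer()
buffer.set(9)  # a residual chunk wrote 9 vision tokens

try:
    buffer.get(16)  # the step then asks for its total token count
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

With a write of 9 followed by a read of 16, the toy raises the same message as the traceback, which is consistent with the two call sites using different counts.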

Frequency

```
$ journalctl -u syncai-vllm-vl --since '2026-04-29' \
    | grep 'Requested more deepstack tokens' | wc -l
39
```

The first occurrence in our logs is 2026-04-29 18:52 (vLLM 0.20.0 was released 2026-04-27, two days earlier). The bug has happened roughly every 2–4 hours since under organic production load.

Workaround we're testing

Setting `--enable-chunked-prefill False` on the vLLM launch — the assumption is that without chunked prefill, the deepstack `_set` always sees the full vision-token count and the `_get` call's `inputs_embeds.size(0)` aligns with what was written. Will report back.

A `--enable-prefix-caching False` workaround should also avoid the bad code path but with a worse latency hit.

Suggested investigation

The buffer + index scheme in `_get_deepstack_input_embeds` / `_set_deepstack_input_embeds` / `_clear_deepstack_input_embeds` assumes a single-shot call per request — it tracks `deepstack_input_embeds_num_tokens` as a scalar instance attribute. Under chunked prefill of a multimodal request, the SAME `Qwen3VLForConditionalGeneration` instance handles multiple chunks of the same request (and concurrent requests), so this scalar is racy.

Either:

  1. The `_get` call site needs to compute the vision-token slice for this chunk only (not pass `inputs_embeds.size(0)`), or
  2. The buffer needs a per-request key (req_id), or
  3. Chunked prefill needs to be disabled for vision-encoder-bearing steps of multimodal requests.
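
Option 2 can be illustrated with a small sketch, assuming a request identifier is available at both the write and the read site (the report does not confirm that it is). `PerRequestDeepstackBuffer` and its methods are hypothetical names, and plain Python lists stand in for embedding tensors.

```python
class PerRequestDeepstackBuffer:
    """Hypothetical buffer keyed by request ID instead of one shared scalar."""

    def __init__(self) -> None:
        self._by_request = {}

    def set(self, req_id: str, embeds: list) -> None:
        """Store the embeddings written for one request."""
        self._by_request[req_id] = list(embeds)

    def get(self, req_id: str, num_tokens: int) -> list:
        """Return the first num_tokens embeddings stored for this request."""
        available = self._by_request.get(req_id, [])
        if num_tokens > len(available):
            raise ValueError(
                f"{req_id}: requested {num_tokens} tokens, "
                f"only {len(available)} available"
            )
        return available[:num_tokens]

    def clear(self, req_id: str) -> None:
        """Drop the entry for a request once it has been consumed."""
        self._by_request.pop(req_id, None)

    def pending(self) -> list:
        """Request IDs that still have stored embeddings."""
        return sorted(self._by_request)


buf = PerRequestDeepstackBuffer()
buf.set("req-a", [0.0] * 16)  # request A wrote 16 tokens
buf.set("req-b", [0.0] * 9)   # request B's later write does not overwrite A
got_a = buf.get("req-a", 16)
buf.clear("req-a")
remaining = buf.pending()
```

In this sketch a later write from another request cannot shrink what an earlier request reads, which is the property the scalar attribute lacks.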

Before submitting a new issue

  • I have searched for existing issues — closest matches are unrelated (#38257 OOM, #29778 image_embeds out-of-bounds, #31404 MM cache AssertionError).
  • Latest version (v0.20.0).

extent analysis

TL;DR

The most likely fix is to modify the `_get_deepstack_input_embeds` call to compute the vision-token slice for the current chunk only, rather than passing the total step token count.

Guidance

  • Investigate modifying the `_get_deepstack_input_embeds` method to calculate the vision-token slice for the current chunk, rather than relying on the `inputs_embeds.size(0)` value.
  • Consider adding a per-request key (`req_id`) to the buffer to handle concurrent requests and chunks.
  • Evaluate the impact of disabling chunked prefill for vision-encoder-bearing steps of multimodal requests.
  • Verify the workaround of setting `--enable-chunked-prefill False` and report back on its effectiveness.

Example

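```python
# Modified _get_deepstack_input_embeds call.
# vision_token_count is a hypothetical variable: the number of vision tokens
# in the current chunk, which the caller would have to compute.
deepstack_input_embeds = self._get_deepstack_input_embeds(
    vision_token_count
)
```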

Notes

The issue is likely caused by a mismatch between the total step token count and the vision-token count for each chunk. The suggested investigation points to either a per-request buffer key or a change to how `_get_deepstack_input_embeds` is called.

Recommendation

Apply the `--enable-chunked-prefill False` workaround first, then investigate the changes listed under Guidance for a permanent fix. The workaround may cost latency, and only a targeted change addresses the root cause.
