vllm - ✅(Solved) Fix [Bug]: `token_capacity_kv_cache_groups` (#40384) should also exclude `SlidingWindowSpec` / `ChunkedLocalAttentionSpec` [2 pull requests]

vllm 2026-04-21 00:10:23

PR #40384 introduces token_capacity_kv_cache_groups() in vllm/v1/core/kv_cache_utils.py that filters out MambaSpec groups when mamba_cache_mode != 'all', fixing the per-token KV capacity divisor for hybrid Mamba+attention models.

There are two more KVCacheSpec subtypes with the same property — bounded memory regardless of sequence length — but they're still counted in the divisor:

SlidingWindowSpec — max_memory_usage_bytes bounded by min(sliding_window + max_num_batched_tokens, max_model_len) (vllm/v1/kv_cache_interface.py:341-353)
ChunkedLocalAttentionSpec — bounded by attention_chunk_size + max_num_batched_tokens (same file, lines 360-379)

Both inherit from AttentionSpec, so the current isinstance(g.kv_cache_spec, AttentionSpec) check in token_capacity_kv_cache_groups() includes them in the per-token divisor — even though they don't scale with full sequence length.
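To make the "bounded" claim concrete, here is a minimal sketch of the two token bounds quoted above. The function names are hypothetical, the sample values (a 4k window, 8192 batched tokens, a 160k context) are illustrative assumptions, and block-size rounding is ignored, so this is not the code in kv_cache_interface.py.

```python
def sliding_window_token_bound(sliding_window: int,
                               max_num_batched_tokens: int,
                               max_model_len: int) -> int:
    """Upper bound on tokens a sliding-window group must hold (hypothetical helper)."""
    return min(sliding_window + max_num_batched_tokens, max_model_len)


def chunked_local_token_bound(attention_chunk_size: int,
                              max_num_batched_tokens: int) -> int:
    """Upper bound on tokens a chunked-local group must hold (hypothetical helper)."""
    return attention_chunk_size + max_num_batched_tokens


# Illustrative values only: the bound stays put once max_model_len exceeds it.
short_ctx = sliding_window_token_bound(4096, 8192, 16_000)
long_ctx = sliding_window_token_bound(4096, 8192, 160_000)
print(short_ctx, long_ctx)
```

Both calls return the same bound because the window term, not the context length, is the limiting factor. That is why counting such a group in a per-token divisor understates full-attention capacity.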

Root Cause

The isinstance(g.kv_cache_spec, AttentionSpec) check is too broad. Tightening it to FullAttentionSpec, as in approach (a), works because SlidingWindowSpec and ChunkedLocalAttentionSpec inherit from AttentionSpec directly, not from FullAttentionSpec. TQFullAttentionSpec, MLAAttentionSpec, and SinkFullAttentionSpec all inherit from FullAttentionSpec, so they are correctly kept.
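The inheritance argument can be checked with a small stand-in hierarchy. The classes below are empty placeholders that mirror only the parent/child relationships described above. They are not vLLM's real definitions.

```python
# Placeholder classes: only the inheritance relationships matter here.
class KVCacheSpec: ...
class AttentionSpec(KVCacheSpec): ...
class FullAttentionSpec(AttentionSpec): ...
class MLAAttentionSpec(FullAttentionSpec): ...
class SlidingWindowSpec(AttentionSpec): ...
class ChunkedLocalAttentionSpec(AttentionSpec): ...

specs = [
    FullAttentionSpec(),
    MLAAttentionSpec(),
    SlidingWindowSpec(),
    ChunkedLocalAttentionSpec(),
]

# Broad check: matches every attention spec, bounded or not.
broad = [type(s).__name__ for s in specs if isinstance(s, AttentionSpec)]

# Narrow check: matches only specs that derive from FullAttentionSpec.
narrow = [type(s).__name__ for s in specs if isinstance(s, FullAttentionSpec)]

print(broad)
print(narrow)
```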

Fix Action

Fixed

Fixed by PR: [Bugfix] Exclude O(1) Mamba groups from hybrid KV cache token capacity (https://github.com/vllm-project/vllm/pull/40384)
Fixed by PR: [Bugfix] out-of-bounds error for routed experts capture (https://github.com/vllm-project/vllm/pull/37118)

PR fix notes

PR #40384: [Bugfix] Exclude O(1) Mamba groups from hybrid KV cache token capacity

Repository: vllm-project/vllm
Author: jhsmith409
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40384

Description (problem / solution / changelog)

Summary

On hybrid attention + Mamba models (Qwen3-Next, Qwen3.5/3.6 MoE hybrids, RecurrentGemma, Jamba, Zamba2, Nemotron-H, …), the reported GPU KV cache token capacity and the scheduler's max_num_kv_tokens are deflated by the number of Mamba groups, which in the default mamba_cache_mode='none' (and 'align') pre-reserve a fixed number of blocks and do not scale with sequence length.

Both _report_kv_cache_config() (vllm/v1/core/kv_cache_utils.py) and Scheduler.__init__ (vllm/v1/core/sched/scheduler.py) currently compute per-token capacity as:

num_tokens = num_blocks // len(kv_cache_config.kv_cache_groups) * min_block_size

For a typical hybrid with one attention group and N Mamba groups, that's off by a factor of (1 + N) / 1 — 2× understatement for the common case, 4× for Nemotron-H-style 1 attn + 3 mamba groups. The max_num_kv_tokens number is what sizes the routed_experts buffer for MoE and what the scheduler believes is its budget; getting this wrong shows up as (a) misleading boot-time logs and (b) over-conservative scheduling of concurrent requests on the very models (hybrid MoE) where extra concurrency is the whole point.
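The understatement factor follows directly from the formula above. A minimal worked example, using made-up values for num_blocks and min_block_size (they are assumptions, not numbers from the PR):

```python
def naive_token_capacity(num_blocks: int, num_groups: int, min_block_size: int) -> int:
    """Capacity formula quoted above, dividing by the given number of groups."""
    return num_blocks // num_groups * min_block_size


NUM_BLOCKS = 8_000      # illustrative assumption
MIN_BLOCK_SIZE = 16     # illustrative assumption

# Only the attention group scales with sequence length, so it is the right divisor.
attention_only = naive_token_capacity(NUM_BLOCKS, 1, MIN_BLOCK_SIZE)

# Counting Mamba groups in the divisor deflates the result.
one_attn_one_mamba = naive_token_capacity(NUM_BLOCKS, 2, MIN_BLOCK_SIZE)
one_attn_three_mamba = naive_token_capacity(NUM_BLOCKS, 4, MIN_BLOCK_SIZE)

print(attention_only // one_attn_one_mamba)    # understatement factor, 1 attn + 1 mamba
print(attention_only // one_attn_three_mamba)  # understatement factor, 1 attn + 3 mamba
```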

Fix

Factor the filter into a tiny helper token_capacity_kv_cache_groups(vllm_config, kv_cache_config) in kv_cache_utils.py that returns only the groups that scale with sequence length (attention always, Mamba only when mamba_cache_mode == 'all').
Use that helper in both _report_kv_cache_config and Scheduler.__init__.
Fall back to all groups if the filter would produce an empty list (preserves dense-model and Mamba-only paths).
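The three steps above can be sketched as follows. This is a simplified illustration, not the PR's code: the spec classes and the Group container are stand-ins, and the function takes mamba_cache_mode directly instead of the real (vllm_config, kv_cache_config) arguments.

```python
from dataclasses import dataclass


# Stand-in types (assumptions); the real ones live in vLLM.
class KVCacheSpec: ...
class AttentionSpec(KVCacheSpec): ...
class MambaSpec(KVCacheSpec): ...


@dataclass
class Group:
    kv_cache_spec: KVCacheSpec


def token_capacity_groups_sketch(groups: list[Group], mamba_cache_mode: str) -> list[Group]:
    """Return the groups that scale with sequence length, or all groups if none do."""
    mamba_scales = mamba_cache_mode == "all"
    scaling = [
        g
        for g in groups
        if isinstance(g.kv_cache_spec, AttentionSpec)
        or (isinstance(g.kv_cache_spec, MambaSpec) and mamba_scales)
    ]
    # Fallback keeps Mamba-only configurations from yielding an empty divisor.
    return scaling or list(groups)


hybrid = [Group(AttentionSpec()), Group(MambaSpec())]
mamba_only = [Group(MambaSpec())]

print(len(token_capacity_groups_sketch(hybrid, "none")))
print(len(token_capacity_groups_sketch(hybrid, "all")))
print(len(token_capacity_groups_sketch(mamba_only, "none")))
```

The fallback matters because the caller divides by the length of the returned list. An empty list would raise ZeroDivisionError.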

The helper is exported (no leading underscore) because scheduler.py imports it; if the maintainers would rather keep it scheduler-local or inline it, happy to rewrite.

Why this is not duplicating an existing PR

Checked on 2026-04-20:

gh pr list --repo vllm-project/vllm --state open --search "max_num_kv_tokens" → none.
gh pr list --repo vllm-project/vllm --state open --search "mamba kv_cache_groups token" → none.
Referenced in https://github.com/vllm-project/vllm/issues/40124 (patch 9) as still not upstream.

Test plan + results

python -m pytest tests/v1/core/test_kv_cache_utils.py -v

No existing test exercises the filter directly; I'll follow up with a small unit test in a separate commit once PR feedback lands (or now, if reviewers prefer). Syntax check (python -m py_compile) is clean; ruff check/ruff format were not available in my local sandbox but the edits follow the surrounding style.

End-to-end verification on our runtime stack (cu130-nightly + TurboQuant hybrid overlay + RedHatAI/Qwen3.6-35B-A3B-NVFP4, turboquant_k8v4, max_model_len=8192, max_num_seqs=8, --gpu-memory-utilization=0.85, torch.compile + cudagraph):

Before: INFO kv_cache_utils.py:1363] GPU KV cache size: 143,936 tokens
After: I'll reply in a follow-up comment with the log delta from a re-run on this branch; expect a clean 2× jump on 1 attn + 1 mamba group.

AI-assist disclosure (per AGENTS.md)

Change was drafted with help from Claude (Anthropic); human submitter reviewed every line end-to-end and understands the hybrid KV cache group semantics. Original bug identification and filter design credit to @Sandermage — ref his issue #40124 tracking table (patch 9) and the ai-jz/vllm#1 approach he references. Co-authored-by: trailers included.

Changed files

tests/v1/core/test_kv_cache_utils.py (modified, +80/-0)
vllm/v1/core/kv_cache_utils.py (modified, +30/-6)
vllm/v1/core/sched/scheduler.py (modified, +5/-7)

PR #37118: [Bugfix] out-of-bounds error for routed experts capture

Repository: vllm-project/vllm
Author: HollowMan6
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/37118

Description (problem / solution / changelog)

Purpose

This PR fixes an out-of-bounds error in routed expert capture when enable_return_routed_experts=True is used with hybrid KV cache groups.

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 927, in worker_busy_loop
    output = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 756, in sample_tokens
    return self.model_runner.sample_tokens(grammar_output)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4060, in sample_tokens
    capturer.save_captured_experts(indices=self.slot_mapping)  # noqa
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/routed_experts_capturer.py", line 217, in save_captured_experts
    self._host_buffer_view[indices, :, :] = data
    ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
IndexError: index 699312 is out of bounds for axis 0 with size 699312

The routed-experts side buffer was sized with:

(num_blocks // num_groups) * min_block_size

on both the worker and scheduler sides.

That formula is only a coarse aggregate/token-capacity estimate for hybrid KV cache layouts. However, routed expert capture/readback indexes the buffer with the selected attention group's actual slot_mapping, whose address space is based on that attention KV group directly.

As a result, in hybrid/padded KV-cache configurations, the routed-experts buffer can be smaller than the valid range of slot_mapping, which leads to out-of-bounds writes/reads.

Use the routed-experts attention group's full KV address space to size the buffer consistently on both sides:

kv_cache_config.num_blocks * attn_group.kv_cache_spec.block_size

Routed expert capture is indexed by the attention group's slot_mapping, so the auxiliary buffer must match the full addressable range of that mapping. Sizing it from the specific attention group keeps the writer and reader aligned:

fixes crashes when enable_return_routed_experts=True
no behavior change when routed expert return is disabled
no change to model weights, routing logic, or sampling semantics
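A small numeric sketch shows why the old sizing can be too small. It assumes slot indices are laid out as block_id * block_size + offset. That layout and all the numbers are illustrative assumptions, not taken from the PR.

```python
def old_buffer_size(num_blocks: int, num_groups: int, min_block_size: int) -> int:
    """Previous sizing: coarse per-group token-capacity estimate."""
    return (num_blocks // num_groups) * min_block_size


def new_buffer_size(num_blocks: int, attn_block_size: int) -> int:
    """Proposed sizing: full address space of the attention group."""
    return num_blocks * attn_block_size


def slot_index(block_id: int, block_size: int, offset: int) -> int:
    """Assumed slot layout: blocks are contiguous runs of block_size slots."""
    return block_id * block_size + offset


NUM_BLOCKS, NUM_GROUPS, BLOCK_SIZE = 1_000, 2, 16  # illustrative values

old_size = old_buffer_size(NUM_BLOCKS, NUM_GROUPS, BLOCK_SIZE)
new_size = new_buffer_size(NUM_BLOCKS, BLOCK_SIZE)

# The last slot of the last block is a valid slot_mapping entry.
highest_slot = slot_index(NUM_BLOCKS - 1, BLOCK_SIZE, BLOCK_SIZE - 1)

print(highest_slot < old_size)  # would the old buffer hold it?
print(highest_slot < new_size)  # would the new buffer hold it?
```

Under these assumptions, any block id at or above num_blocks // num_groups indexes past the old buffer. That is the shape of the IndexError in the traceback above.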

Test Plan

End to end tests

Test Result

The error no longer occurs.

Changed files

vllm/v1/core/sched/scheduler.py (modified, +5/-9)
vllm/v1/worker/gpu_model_runner.py (modified, +5/-9)

Code Example

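# Approach (a): positive list, keep only FullAttentionSpec groups
# (plus Mamba groups when they scale with sequence length)
groups = [
    g
    for g in kv_cache_config.kv_cache_groups
    if isinstance(g.kv_cache_spec, FullAttentionSpec)
    or (isinstance(g.kv_cache_spec, MambaSpec) and mamba_scales)
]

---

# Approach (b): negative list, drop every known bounded spec
# (Mamba groups are re-admitted when they scale with sequence length)
groups = [
    g
    for g in kv_cache_config.kv_cache_groups
    if not isinstance(g.kv_cache_spec, (
        SlidingWindowSpec,
        ChunkedLocalAttentionSpec,
        MambaSpec,
        EncoderOnlyAttentionSpec,
        CrossAttentionSpec,
    )) or (isinstance(g.kv_cache_spec, MambaSpec) and mamba_scales)
]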

RAW_BUFFER

Hi! Following up on #40384 — wanted to flag a related bug class in the same helper that I noticed while auditing #40384's reach for our hybrid-Mamba deployment.

(Small disclaimer: I'm from Ukraine and my English is still a work in progress, so I'm using AI to help with translation. Hope it reads okay!)

Description

There are two more KVCacheSpec subtypes with the same property — bounded memory regardless of sequence length — but they're still counted in the divisor:

SlidingWindowSpec — max_memory_usage_bytes bounded by min(sliding_window + max_num_batched_tokens, max_model_len) (vllm/v1/kv_cache_interface.py:341-353)
ChunkedLocalAttentionSpec — bounded by attention_chunk_size + max_num_batched_tokens (same file, lines 360-379)

Impact

For a hybrid model with 1 sliding-window (e.g. 4k) + 1 full-attention group running at max_model_len=160k:

Current: num_blocks // 2 * min_block_size → underreports capacity by ~50%
After fix: num_blocks // 1 * min_block_size → correct full-attention capacity

Same regression direction as the Mamba bug — just less severe in the sense that bounded memory still scales with window/chunk size, not full sequence length.

Affected models in tree today

Anything with sliding-window layers in a hybrid config:

Gemma 3 family (mixed local/global attention)
Phi-3 / Phi-4 with sliding-window-only mode
Mistral variants with sliding_window enabled in config
(potentially) Olmo / OLMoE hybrid attention variants

Plus future models that add ChunkedLocalAttentionSpec for local-attention layers.

Suggested fix

Two equally valid approaches:

(a) Tighten the positive list to FullAttentionSpec only:

groups = [
    g
    for g in kv_cache_config.kv_cache_groups
    if isinstance(g.kv_cache_spec, FullAttentionSpec)
    or (isinstance(g.kv_cache_spec, MambaSpec) and mamba_scales)
]

(b) Negative list excluding all known bounded specs:

groups = [
    g
    for g in kv_cache_config.kv_cache_groups
    if not isinstance(g.kv_cache_spec, (
        SlidingWindowSpec,
        ChunkedLocalAttentionSpec,
        MambaSpec,
        EncoderOnlyAttentionSpec,
        CrossAttentionSpec,
    )) or (isinstance(g.kv_cache_spec, MambaSpec) and mamba_scales)
]

This is more defensive — also excludes EncoderOnlyAttentionSpec (returns 0 bytes per line 416) and CrossAttentionSpec (Whisper-family, scales with max_encoder_len not per-token). Neither hits any in-tree hybrid model today, but it's a foot-gun for future spec additions.

I'd lean toward (b) for maintainability, but (a) is a one-identifier change (from AttentionSpec to FullAttentionSpec).
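One way to compare the two approaches is to see how each treats a spec type that neither list knows about. The sketch below uses placeholder classes. FutureBoundedSpec is hypothetical, and the assumption that the encoder-only and cross-attention specs derive from AttentionSpec is for illustration only.

```python
# Placeholder hierarchy (assumptions, not vLLM's real definitions).
class KVCacheSpec: ...
class AttentionSpec(KVCacheSpec): ...
class FullAttentionSpec(AttentionSpec): ...
class SlidingWindowSpec(AttentionSpec): ...
class ChunkedLocalAttentionSpec(AttentionSpec): ...
class EncoderOnlyAttentionSpec(AttentionSpec): ...
class CrossAttentionSpec(AttentionSpec): ...
class MambaSpec(KVCacheSpec): ...
class FutureBoundedSpec(AttentionSpec): ...  # hypothetical spec added later

BOUNDED = (
    SlidingWindowSpec,
    ChunkedLocalAttentionSpec,
    MambaSpec,
    EncoderOnlyAttentionSpec,
    CrossAttentionSpec,
)


def keep_positive(spec: KVCacheSpec, mamba_scales: bool) -> bool:
    """Approach (a): keep only FullAttentionSpec (and scaling Mamba)."""
    return isinstance(spec, FullAttentionSpec) or (
        isinstance(spec, MambaSpec) and mamba_scales
    )


def keep_negative(spec: KVCacheSpec, mamba_scales: bool) -> bool:
    """Approach (b): keep everything not on the bounded list (and scaling Mamba)."""
    return not isinstance(spec, BOUNDED) or (
        isinstance(spec, MambaSpec) and mamba_scales
    )


for spec in (FullAttentionSpec(), SlidingWindowSpec(), MambaSpec(), FutureBoundedSpec()):
    print(type(spec).__name__, keep_positive(spec, False), keep_negative(spec, False))
```

With these stand-ins the two filters agree on every known spec and differ only on the unknown one: the positive list drops it by default and the negative list counts it by default. Which default is safer depends on whether future specs are more likely to be bounded or to scale per token.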

Reproduction

I haven't run a hybrid sliding+full attention model on my A5000 setup (we run Qwen3.6-A3B which is Mamba+full, already covered by #40384). Found this by code-reading after my own bug report in #40124 was solved by #40384. Happy to write a small unit test on top of tests/v1/core/test_kv_cache_utils.py mirroring the new test_token_capacity_groups_* cases — let me know if a maintainer wants to take this on or if I should send a PR.

Why filing now

To avoid losing track of this audit finding. The two-line fix is small enough that it would be a shame to leave it as a latent regression for any sliding-window hybrid model deployment.

extent analysis

TL;DR

The most likely fix is to update the token_capacity_kv_cache_groups() function to exclude SlidingWindowSpec and ChunkedLocalAttentionSpec from the per-token divisor calculation.

Guidance

Identify the affected models by checking for the presence of sliding-window layers in hybrid configurations, such as Gemma 3, Phi-3, Phi-4, and Mistral variants.
Update the token_capacity_kv_cache_groups() function using one of the two suggested approaches: tightening the positive list to FullAttentionSpec only or using a negative list to exclude known bounded specs.
Verify the fix by running a hybrid sliding+full attention model and checking the per-token divisor calculation.
Consider adding a unit test to tests/v1/core/test_kv_cache_utils.py to ensure the fix is maintained.

Example

# Approach (a): Tighten the positive list to FullAttentionSpec only
groups = [
    g
    for g in kv_cache_config.kv_cache_groups
    if isinstance(g.kv_cache_spec, FullAttentionSpec)
    or (isinstance(g.kv_cache_spec, MambaSpec) and mamba_scales)
]

# Approach (b): Negative list excluding all known bounded specs
groups = [
    g
    for g in kv_cache_config.kv_cache_groups
    if not isinstance(g.kv_cache_spec, (
        SlidingWindowSpec,
        ChunkedLocalAttentionSpec,
        MambaSpec,
        EncoderOnlyAttentionSpec,
        CrossAttentionSpec,
    )) or (isinstance(g.kv_cache_spec, MambaSpec) and mamba_scales)
]

Notes

The fix should be applied to avoid underreporting capacity by ~50% in hybrid models with sliding-window layers. The suggested approaches have different maintainability implications, with approach (b) being more defensive but also more complex.

Recommendation

Apply approach (b) to exclude all known bounded specs, as it is the more defensive option and avoids potential future regressions.
