vllm - ✅(Solved) Fix [Bug]: `token_capacity_kv_cache_groups` (#40384) should also exclude `SlidingWindowSpec` / `ChunkedLocalAttentionSpec` [2 pull requests]


PR #40384 introduces token_capacity_kv_cache_groups() in vllm/v1/core/kv_cache_utils.py that filters out MambaSpec groups when mamba_cache_mode != 'all', fixing the per-token KV capacity divisor for hybrid Mamba+attention models.

There are two more KVCacheSpec subtypes with the same property — bounded memory regardless of sequence length — but they're still counted in the divisor:

  1. SlidingWindowSpec: max_memory_usage_bytes is bounded by min(sliding_window + max_num_batched_tokens, max_model_len) (vllm/v1/kv_cache_interface.py:341-353)
  2. ChunkedLocalAttentionSpec: bounded by attention_chunk_size + max_num_batched_tokens (same file, lines 360-379)
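To make the "bounded regardless of sequence length" property concrete, here is a minimal sketch that evaluates the two bounds quoted above in units of tokens. The function names and the numeric values are illustrative assumptions; the real vLLM methods work in bytes and pages, and this is not their implementation.

```python
def sliding_window_token_bound(sliding_window: int,
                               max_num_batched_tokens: int,
                               max_model_len: int) -> int:
    """Token bound for a sliding-window group, per the expression quoted above."""
    return min(sliding_window + max_num_batched_tokens, max_model_len)


def chunked_local_token_bound(attention_chunk_size: int,
                              max_num_batched_tokens: int) -> int:
    """Token bound for a chunked-local group, per the expression quoted above."""
    return attention_chunk_size + max_num_batched_tokens


# Assumed example values: 4k window, 8k batched tokens.
for max_model_len in (16_384, 65_536, 163_840):
    bound = sliding_window_token_bound(4_096, 8_192, max_model_len)
    # The bound stays the same while max_model_len grows tenfold.
    print(f"max_model_len={max_model_len:>7}  bound={bound}")
```

Once max_model_len exceeds sliding_window + max_num_batched_tokens, the bound no longer depends on it, which is why such a group does not behave like a per-token consumer of the block pool.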

Both inherit from AttentionSpec, so the current isinstance(g.kv_cache_spec, AttentionSpec) check in token_capacity_kv_cache_groups() includes them in the per-token divisor — even though they don't scale with full sequence length.

Root Cause

SlidingWindowSpec and ChunkedLocalAttentionSpec inherit from AttentionSpec directly, not from FullAttentionSpec, so the isinstance check against AttentionSpec matches them, while a check against FullAttentionSpec would not. TQFullAttentionSpec / MLAAttentionSpec / SinkFullAttentionSpec all do inherit from FullAttentionSpec, so a FullAttentionSpec check would still keep them.
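The inheritance argument can be checked with stand-in classes. The classes below are empty placeholders that only mirror the parent/child relationships described in this issue; they are not imported from vLLM.

```python
# Stand-in hierarchy (assumption: mirrors the relationships described in the issue).
class AttentionSpec: ...
class FullAttentionSpec(AttentionSpec): ...
class MLAAttentionSpec(FullAttentionSpec): ...
class SlidingWindowSpec(AttentionSpec): ...
class ChunkedLocalAttentionSpec(AttentionSpec): ...

specs = [
    FullAttentionSpec(),
    MLAAttentionSpec(),
    SlidingWindowSpec(),
    ChunkedLocalAttentionSpec(),
]

# A check against the base class matches every attention spec.
matches_attention = [type(s).__name__ for s in specs if isinstance(s, AttentionSpec)]
# A check against FullAttentionSpec matches only full attention and its subclasses.
matches_full = [type(s).__name__ for s in specs if isinstance(s, FullAttentionSpec)]

print("AttentionSpec check:    ", matches_attention)
print("FullAttentionSpec check:", matches_full)
```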

Fix Action

Fixed

PR fix notes

PR #40384: [Bugfix] Exclude O(1) Mamba groups from hybrid KV cache token capacity

Description (problem / solution / changelog)

Summary

On hybrid attention + Mamba models (Qwen3-Next, Qwen3.5/3.6 MoE hybrids, RecurrentGemma, Jamba, Zamba2, Nemotron-H, …), the reported GPU KV cache token capacity and the scheduler's max_num_kv_tokens are deflated by the number of Mamba groups, which in the default mamba_cache_mode='none' (and 'align') pre-reserve a fixed number of blocks and do not scale with sequence length.

Both _report_kv_cache_config() (vllm/v1/core/kv_cache_utils.py) and Scheduler.__init__ (vllm/v1/core/sched/scheduler.py) currently compute per-token capacity as:

num_tokens = num_blocks // len(kv_cache_config.kv_cache_groups) * min_block_size

For a typical hybrid with one attention group and N Mamba groups, that's off by a factor of (1 + N) / 1 — 2× understatement for the common case, 4× for Nemotron-H-style 1 attn + 3 mamba groups. The max_num_kv_tokens number is what sizes the routed_experts buffer for MoE and what the scheduler believes is its budget; getting this wrong shows up as (a) misleading boot-time logs and (b) over-conservative scheduling of concurrent requests on the very models (hybrid MoE) where extra concurrency is the whole point.
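As a rough illustration of the factor described above, the sketch below applies the quoted formula to one attention group plus N Mamba groups. The block count and block size are made-up values chosen only to keep the arithmetic readable.

```python
def reported_tokens(num_blocks: int, num_groups: int, min_block_size: int) -> int:
    """Per-token capacity as computed by the formula quoted above."""
    return num_blocks // num_groups * min_block_size


num_blocks = 12_000   # assumed pool size
min_block_size = 16   # assumed block size in tokens

attn_only = reported_tokens(num_blocks, 1, min_block_size)
for n_mamba in (1, 3):
    all_groups = reported_tokens(num_blocks, 1 + n_mamba, min_block_size)
    ratio = attn_only / all_groups
    print(f"1 attn + {n_mamba} mamba: reported={all_groups}, "
          f"attention-only={attn_only}, understated by {ratio:.0f}x")
```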

Fix

  • Factor the filter into a tiny helper token_capacity_kv_cache_groups(vllm_config, kv_cache_config) in kv_cache_utils.py that returns only the groups that scale with sequence length (attention always, Mamba only when mamba_cache_mode == 'all').
  • Use that helper in both _report_kv_cache_config and Scheduler.__init__.
  • Fall back to all groups if the filter would produce an empty list (preserves dense-model and Mamba-only paths).
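The three bullets above can be sketched as follows. This is a simplified stand-in, not the code from the PR: the function name token_capacity_groups, the Group dataclass, and the plain mamba_cache_mode string argument are assumptions made so the example runs without vLLM (the real helper takes vllm_config and kv_cache_config).

```python
from dataclasses import dataclass


# Placeholder spec types (assumption: not the vLLM classes).
class AttentionSpec: ...
class MambaSpec: ...


@dataclass
class Group:
    kv_cache_spec: object


def token_capacity_groups(groups: list[Group], mamba_cache_mode: str) -> list[Group]:
    """Return the groups that scale with sequence length, or all groups if none do."""
    mamba_scales = mamba_cache_mode == "all"
    scaling = [
        g
        for g in groups
        if isinstance(g.kv_cache_spec, AttentionSpec)
        or (isinstance(g.kv_cache_spec, MambaSpec) and mamba_scales)
    ]
    # Fallback keeps Mamba-only configurations from dividing by zero.
    return scaling or list(groups)


hybrid = [Group(AttentionSpec())] + [Group(MambaSpec()) for _ in range(3)]
mamba_only = [Group(MambaSpec()), Group(MambaSpec())]

print(len(token_capacity_groups(hybrid, "none")))      # attention group only
print(len(token_capacity_groups(hybrid, "all")))       # every group scales
print(len(token_capacity_groups(mamba_only, "none")))  # fallback to all groups
```

The length of the returned list is what would replace len(kv_cache_config.kv_cache_groups) as the divisor.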

The helper is exported (no leading underscore) because scheduler.py imports it; if the maintainers would rather keep it scheduler-local or inline it, happy to rewrite.

Why this is not duplicating an existing PR

Checked on 2026-04-20:

  • gh pr list --repo vllm-project/vllm --state open --search "max_num_kv_tokens" → none.
  • gh pr list --repo vllm-project/vllm --state open --search "mamba kv_cache_groups token" → none.
  • Referenced in https://github.com/vllm-project/vllm/issues/40124 (patch 9) as still not upstream.

Test plan + results

python -m pytest tests/v1/core/test_kv_cache_utils.py -v

No existing test exercises the filter directly; I'll follow up with a small unit test in a separate commit once PR feedback lands (or now, if reviewers prefer). Syntax check (python -m py_compile) is clean; ruff check/ruff format were not available in my local sandbox but the edits follow the surrounding style.

End-to-end verification on our runtime stack (cu130-nightly + TurboQuant hybrid overlay + RedHatAI/Qwen3.6-35B-A3B-NVFP4, turboquant_k8v4, max_model_len=8192, max_num_seqs=8, --gpu-memory-utilization=0.85, torch.compile + cudagraph):

  • Before: INFO kv_cache_utils.py:1363] GPU KV cache size: 143,936 tokens
  • After: I'll reply in a follow-up comment with the log delta from a re-run on this branch; expect a clean 2× jump on 1 attn + 1 mamba group.

AI-assist disclosure (per AGENTS.md)

Change was drafted with help from Claude (Anthropic); human submitter reviewed every line end-to-end and understands the hybrid KV cache group semantics. Original bug identification and filter design credit to @Sandermage — ref his issue #40124 tracking table (patch 9) and the ai-jz/vllm#1 approach he references. Co-authored-by: trailers included.

Changed files

  • tests/v1/core/test_kv_cache_utils.py (modified, +80/-0)
  • vllm/v1/core/kv_cache_utils.py (modified, +30/-6)
  • vllm/v1/core/sched/scheduler.py (modified, +5/-7)

PR #37118: [Bugfix] out-of-bounds error for routed experts capture

Description (problem / solution / changelog)

Purpose

This PR fixes an out-of-bounds error in routed expert capture when enable_return_routed_experts=True is used with hybrid KV cache groups.

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 927, in worker_busy_loop
    output = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 756, in sample_tokens
    return self.model_runner.sample_tokens(grammar_output)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4060, in sample_tokens
    capturer.save_captured_experts(indices=self.slot_mapping)  # noqa
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/routed_experts_capturer.py", line 217, in save_captured_experts
    self._host_buffer_view[indices, :, :] = data
    ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
IndexError: index 699312 is out of bounds for axis 0 with size 699312

The routed-experts side buffer was sized with:

(num_blocks // num_groups) * min_block_size

on both the worker and scheduler sides.

That formula is only a coarse aggregate/token-capacity estimate for hybrid KV cache layouts. However, routed expert capture/readback indexes the buffer with the selected attention group's actual slot_mapping, whose address space is based on that attention KV group directly.

As a result, in hybrid/padded KV-cache configurations, the routed-experts buffer can be smaller than the valid range of slot_mapping, which leads to out-of-bounds writes/reads.

Use the routed-experts attention group's full KV address space to size the buffer consistently on both sides:

kv_cache_config.num_blocks * attn_group.kv_cache_spec.block_size

Routed expert capture is indexed by the attention group's slot_mapping, so the auxiliary buffer must match the full addressable range of that mapping. Sizing it from the specific attention group keeps the writer and reader aligned.

  • fixes crashes when enable_return_routed_experts=True
  • no behavior change when routed expert return is disabled
  • no change to model weights, routing logic, or sampling semantics
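The mismatch between the two sizing formulas can be reproduced with plain integers. The sketch assumes a slot index is computed as block_id * block_size + offset and that block IDs range over the whole shared pool; both are assumptions for illustration, and the numbers are invented.

```python
def old_buffer_size(num_blocks: int, num_groups: int, min_block_size: int) -> int:
    """Sizing used before the fix: a coarse per-token capacity estimate."""
    return (num_blocks // num_groups) * min_block_size


def new_buffer_size(num_blocks: int, attn_block_size: int) -> int:
    """Sizing after the fix: the attention group's full slot address space."""
    return num_blocks * attn_block_size


def try_write(buffer_size: int, slot: int) -> bool:
    """Return True if writing at `slot` succeeds in a buffer of `buffer_size` rows."""
    buffer = [0] * buffer_size
    try:
        buffer[slot] = 1
        return True
    except IndexError:
        return False


num_blocks, num_groups, block_size = 1_000, 2, 16  # assumed values

old_size = old_buffer_size(num_blocks, num_groups, block_size)
new_size = new_buffer_size(num_blocks, block_size)
# Assumption: highest slot = last block of the shared pool, last offset in it.
max_slot = (num_blocks - 1) * block_size + (block_size - 1)

print(f"old size={old_size}, new size={new_size}, max slot={max_slot}")
print("write fits old buffer:", try_write(old_size, max_slot))
print("write fits new buffer:", try_write(new_size, max_slot))
```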

Test Plan

End-to-end tests.

Test Result

The error no longer occurs.



Changed files

  • vllm/v1/core/sched/scheduler.py (modified, +5/-9)
  • vllm/v1/worker/gpu_model_runner.py (modified, +5/-9)


Hi! Following up on #40384 — wanted to flag a related bug class in the same helper that I noticed while auditing #40384's reach for our hybrid-Mamba deployment.

(Small disclaimer: I'm from Ukraine and my English is still a work in progress, so I'm using AI to help with translation. Hope it reads okay!)

Description

PR #40384 introduces token_capacity_kv_cache_groups() in vllm/v1/core/kv_cache_utils.py that filters out MambaSpec groups when mamba_cache_mode != 'all', fixing the per-token KV capacity divisor for hybrid Mamba+attention models.

There are two more KVCacheSpec subtypes with the same property — bounded memory regardless of sequence length — but they're still counted in the divisor:

  1. SlidingWindowSpec: max_memory_usage_bytes is bounded by min(sliding_window + max_num_batched_tokens, max_model_len) (vllm/v1/kv_cache_interface.py:341-353)
  2. ChunkedLocalAttentionSpec: bounded by attention_chunk_size + max_num_batched_tokens (same file, lines 360-379)

Both inherit from AttentionSpec, so the current isinstance(g.kv_cache_spec, AttentionSpec) check in token_capacity_kv_cache_groups() includes them in the per-token divisor — even though they don't scale with full sequence length.

Impact

For a hybrid model with 1 sliding-window (e.g. 4k) + 1 full-attention group running at max_model_len=160k:

  • Current: num_blocks // 2 * min_block_size → underreports capacity by ~50%
  • After fix: num_blocks // 1 * min_block_size → correct full-attention capacity

Same regression direction as the Mamba bug — just less severe in the sense that bounded memory still scales with window/chunk size, not full sequence length.

Affected models in tree today

Anything with sliding-window layers in a hybrid config:

  • Gemma 3 family (mixed local/global attention)
  • Phi-3 / Phi-4 with sliding-window-only mode
  • Mistral variants with sliding_window enabled in config
  • (potentially) Olmo / OLMoE hybrid attention variants

Plus future models that add ChunkedLocalAttentionSpec for local-attention layers.

Suggested fix

Two equally valid approaches:

(a) Tighten the positive list to FullAttentionSpec only:

groups = [
    g
    for g in kv_cache_config.kv_cache_groups
    if isinstance(g.kv_cache_spec, FullAttentionSpec)
    or (isinstance(g.kv_cache_spec, MambaSpec) and mamba_scales)
]

This works because SlidingWindowSpec and ChunkedLocalAttentionSpec inherit from AttentionSpec directly, not FullAttentionSpec. TQFullAttentionSpec / MLAAttentionSpec / SinkFullAttentionSpec all do inherit from FullAttentionSpec, so they're correctly kept.

(b) Negative list excluding all known bounded specs:

groups = [
    g
    for g in kv_cache_config.kv_cache_groups
    if not isinstance(g.kv_cache_spec, (
        SlidingWindowSpec,
        ChunkedLocalAttentionSpec,
        MambaSpec,
        EncoderOnlyAttentionSpec,
        CrossAttentionSpec,
    )) or (isinstance(g.kv_cache_spec, MambaSpec) and mamba_scales)
]

This is more defensive — also excludes EncoderOnlyAttentionSpec (returns 0 bytes per line 416) and CrossAttentionSpec (Whisper-family, scales with max_encoder_len not per-token). Neither hits any in-tree hybrid model today, but it's a foot-gun for future spec additions.
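One difference worth checking before choosing is how each approach treats a spec type that neither list mentions. The sketch below uses empty stand-in classes (the hierarchy, including the parents of EncoderOnlyAttentionSpec and CrossAttentionSpec, is an assumption) plus a hypothetical FutureBoundedSpec that does not exist in vLLM.

```python
# Stand-in hierarchy; only the relationships matter here.
class AttentionSpec: ...
class FullAttentionSpec(AttentionSpec): ...
class SlidingWindowSpec(AttentionSpec): ...
class ChunkedLocalAttentionSpec(AttentionSpec): ...
class EncoderOnlyAttentionSpec(AttentionSpec): ...
class CrossAttentionSpec(AttentionSpec): ...
class MambaSpec: ...
class FutureBoundedSpec(AttentionSpec): ...  # hypothetical future spec type

BOUNDED = (
    SlidingWindowSpec,
    ChunkedLocalAttentionSpec,
    MambaSpec,
    EncoderOnlyAttentionSpec,
    CrossAttentionSpec,
)


def keep_positive(spec: object, mamba_scales: bool) -> bool:
    """Approach (a): keep only FullAttentionSpec (and scaling Mamba)."""
    return isinstance(spec, FullAttentionSpec) or (
        isinstance(spec, MambaSpec) and mamba_scales
    )


def keep_negative(spec: object, mamba_scales: bool) -> bool:
    """Approach (b): keep everything not on the known-bounded list."""
    return not isinstance(spec, BOUNDED) or (
        isinstance(spec, MambaSpec) and mamba_scales
    )


for cls in (FullAttentionSpec, SlidingWindowSpec, MambaSpec, FutureBoundedSpec):
    spec = cls()
    print(f"{cls.__name__:<22} (a)={keep_positive(spec, False)!s:<5} "
          f"(b)={keep_negative(spec, False)}")
```

Under these assumptions the two approaches agree on every spec named in the issue and differ only on the unlisted type: (a) drops it from the divisor by default, while (b) keeps it until someone adds it to the exclusion list.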

I'd lean toward (b) for maintainability, but (a) is a one-identifier change (AttentionSpec to FullAttentionSpec).

Reproduction

I haven't run a hybrid sliding+full attention model on my A5000 setup (we run Qwen3.6-A3B which is Mamba+full, already covered by #40384). Found this by code-reading after my own bug report in #40124 was solved by #40384. Happy to write a small unit test on top of tests/v1/core/test_kv_cache_utils.py mirroring the new test_token_capacity_groups_* cases — let me know if a maintainer wants to take this on or if I should send a PR.

Why filing now

To avoid losing this audit conclusion. The two-line fix is small enough that it'd be a shame to leave it as a latent regression for any sliding-window hybrid model deployment.

extent analysis

TL;DR

The most likely fix is to update the token_capacity_kv_cache_groups() function to exclude SlidingWindowSpec and ChunkedLocalAttentionSpec from the per-token divisor calculation.

Guidance

  • Identify the affected models by checking for the presence of sliding-window layers in hybrid configurations, such as Gemma 3, Phi-3, Phi-4, and Mistral variants.
  • Update the token_capacity_kv_cache_groups() function using one of the two suggested approaches: tightening the positive list to FullAttentionSpec only or using a negative list to exclude known bounded specs.
  • Verify the fix by running a hybrid sliding+full attention model and checking the per-token divisor calculation.
  • Consider adding a unit test to tests/v1/core/test_kv_cache_utils.py to ensure the fix is maintained.

Example

# Approach (a): Tighten the positive list to FullAttentionSpec only
groups = [
    g
    for g in kv_cache_config.kv_cache_groups
    if isinstance(g.kv_cache_spec, FullAttentionSpec)
    or (isinstance(g.kv_cache_spec, MambaSpec) and mamba_scales)
]

# Approach (b): Negative list excluding all known bounded specs
groups = [
    g
    for g in kv_cache_config.kv_cache_groups
    if not isinstance(g.kv_cache_spec, (
        SlidingWindowSpec,
        ChunkedLocalAttentionSpec,
        MambaSpec,
        EncoderOnlyAttentionSpec,
        CrossAttentionSpec,
    )) or (isinstance(g.kv_cache_spec, MambaSpec) and mamba_scales)
]

Notes

The fix should be applied to avoid underreporting capacity by ~50% in hybrid models with sliding-window layers. The suggested approaches have different maintainability implications, with approach (b) being more defensive but also more complex.

Recommendation

Apply approach (b) to exclude all known bounded specs, as it is the more defensive of the two suggested options.
