vllm - ✅(Solved) Fix [Bug]: Gemma 4 31B INT4 on 2×24GB GPUs (TP=2): GPU KV cache size is 25,200 tokens at max_model_len=131072, gpu_memory_utilization=0.96, BF16 KV [1 pull requests, 4 comments, 4 participants]

vllm · 2026-04-07 01:17:09


GitHub stats

vllm-project/vllm#39133•Fetched 2026-04-08 03:01:49

Timeline: commented ×4, subscribed ×3, cross-referenced ×1, mentioned ×1

Error Message

triton.compiler.errors.CompilationError: at 1:0: def triton_red_fused__to_copy_add_cat_clamp_index_select_mul_reciprocal_rms_norm_split_split_with_sizes_sub_unsqueeze_view_1(...): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Root Cause

This error, raised when running with --kv-cache-dtype fp8_e4m3, appears to be a Triton / SM 8.6 hardware limitation for Inductor-fused kernels and not a vLLM-side check; I am not reporting it as a vLLM bug, but noting it because it means fp8_e5m2 is the only fp8 variant that could potentially be used on this hardware.

PR fix notes

PR #39866: [Scheduler] Cap SWA admission budget at sliding_window + chunk_size

Repository: vllm-project/vllm
Author: jhaotingc
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/39866

Description (problem / solution / changelog)

For hybrid SWA+full-attention models (e.g., Gemma 4), the can_fit_full_sequence admission gate passes full_num_tokens to get_num_blocks_to_allocate for all layer groups, including sliding-window groups. Since total_computed_tokens is 0 for new requests, get_num_skipped_tokens returns 0, causing SWA groups to budget ceil(full_num_tokens / block_size) blocks instead of the window-sized amount they actually need.

This over-budget throttles concurrent request admission. On Gemma 4 31B with 50 SWA layers (window=1024) and max_num_batched_tokens=8192, each SWA group budgets 1001 blocks instead of 576, causing 4 concurrent 65K-context sessions to be serialized through the gate.

Fix: In KVCacheCoordinator.get_num_blocks_to_allocate, cap effective_num_tokens for SlidingWindowManager groups at sliding_window + max_num_batched_tokens. The window term covers the steady-state maximum, and the chunk term covers the blocks needed during a single prefill chunk before remove_skipped_blocks frees out-of-window (OOW) blocks. This matches TensorRT-LLM's getNeededBlocksOneStep.

Plumbing: max_num_batched_tokens flows from SchedulerConfig through KVCacheManager and get_kv_cache_coordinator to all coordinator subclasses.

Purpose

can_fit_full_sequence over-budgets SWA (sliding window attention) layers by passing full_num_tokens (the entire request length) to get_num_blocks_to_allocate, even though SWA layers never hold more than sliding_window blocks at steady state. Out-of-window blocks are freed between chunked prefill chunks by remove_skipped_blocks(), so the admission gate only needs to budget for the window plus one chunk buffer.
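The bound described above can be illustrated with a toy simulation. This is a minimal sketch, not vLLM code: `peak_resident_swa_tokens` is a hypothetical helper that counts tokens rather than blocks, and it assumes out-of-window tokens are released once after every prefill chunk, which simplifies what remove_skipped_blocks() actually does.

```python
def peak_resident_swa_tokens(prompt_len: int, sliding_window: int, chunk_size: int) -> int:
    """Peak number of tokens one SWA layer group holds during chunked prefill.

    Toy model (assumption): each chunk is allocated in full, then everything
    outside the sliding window is freed before the next chunk starts.
    """
    resident = 0
    peak = 0
    remaining = prompt_len
    while remaining > 0:
        chunk = min(chunk_size, remaining)
        resident += chunk                          # allocate space for this chunk
        peak = max(peak, resident)
        resident = min(resident, sliding_window)   # free out-of-window tokens
        remaining -= chunk
    return peak


if __name__ == "__main__":
    window, chunk = 1024, 8192
    for prompt_len in (4096, 16001, 65536):
        peak = peak_resident_swa_tokens(prompt_len, window, chunk)
        # In this model the peak never exceeds sliding_window + chunk_size.
        print(prompt_len, peak, peak <= window + chunk)
```

In this simplified model the peak depends on the window and the chunk size, not on the total request length, which is the property the admission gate can rely on.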

This causes the scheduler to serialize requests on hybrid SWA+full-attention models (e.g., Gemma 4) when it should be running them concurrently. For example, with Gemma 4 31B (50 SWA layers, sliding_window=1024, max_num_batched_tokens=8192), each SWA group was budgeted at 1001 blocks instead of 576, so about 42% of the reserved blocks were unnecessary, which throttles admission.

The fix caps effective_num_tokens for SlidingWindowManager groups at sliding_window + max_num_batched_tokens in KVCacheCoordinator.get_num_blocks_to_allocate(), matching TensorRT-LLM's approach in getNeededBlocksOneStep. The max_num_batched_tokens term accounts for blocks needed during a single prefill chunk before OOW release runs.
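The arithmetic behind the 1001 and 576 figures can be reproduced with a short sketch. `swa_blocks_to_budget` is a hypothetical stand-in for the capped calculation, not the code in KVCacheCoordinator.get_num_blocks_to_allocate(), and `block_size=16` is an assumption chosen because it is consistent with the numbers reported in this PR.

```python
import math


def swa_blocks_to_budget(
    num_tokens: int,
    sliding_window: int,
    max_num_batched_tokens: int,
    block_size: int = 16,  # assumption: matches the figures quoted in this PR
    cap: bool = True,
) -> int:
    """Blocks an SWA group budgets at admission time (illustrative only)."""
    effective_num_tokens = num_tokens
    if cap:
        # Budget for the window plus one prefill chunk, never the full request.
        effective_num_tokens = min(num_tokens, sliding_window + max_num_batched_tokens)
    return math.ceil(effective_num_tokens / block_size)


if __name__ == "__main__":
    before = swa_blocks_to_budget(16001, 1024, 8192, cap=False)
    after = swa_blocks_to_budget(16001, 1024, 8192, cap=True)
    print(before, after)  # 1001 576
```

Requests shorter than sliding_window + max_num_batched_tokens are unaffected by the cap, because `min` leaves their token count unchanged.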

This change borrows its logic from TRT-LLM: https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp#L2775

Test Plan

- Verified with debug instrumentation that remove_skipped_blocks() frees OOW blocks between chunked prefill chunks
- Verified admission budget drops from 1001 to 576 blocks per SWA group
- SCBench Code.RepoQA correctness test (20 examples, 5 turns, ~65K token contexts) with and without the fix: identical quality scores
- End-to-end multi-session serving test confirming no crashes or errors
- More testing and accuracy benchmarking are still needed

Test Result

Admission Budget (per SWA group, Gemma 4 31B, sliding_window=1024, max_num_batched_tokens=8192)

| | Before | After |
|---|---|---|
| `num_tokens` passed to SWA group | 16,001 | 9,216 |
| `required_blocks` per SWA group | 1,001 | 576 |
| Total across 5 SWA groups | 5,005 | 2,880 |

SCBench Code.RepoQA — Gemma 4 31B NVFP4, 1x B200, TP1

(I chose this dataset because its requests share long prefixes, which exercises the prefix cache.)

Server command:

vllm serve /path/to/Gemma-4-31B-IT-NVFP4 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 8192 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.9 \
  --tensor-parallel-size 1 \
  --trust-remote-code \
  --async-scheduling \
  --enable-prefix-caching \
  --kv-cache-dtype auto \
  --compilation-config '{"compile_sizes":[1,2,4,8,16,32,64],"cudagraph_capture_sizes":[1,2,4,8,16,32,64]}'

Results:

| Metric | Before (no cap) | After (with cap) |
|---|---|---|
| Overall Pass@1 (threshold=0.8) | 73.0% | 73.0% |
| Turn 0 Pass@1@0.8 | 70.0% | 70.0% |
| Turn 1 Pass@1@0.8 | 85.0% | 85.0% |
| Turn 2 Pass@1@0.8 | 80.0% | 80.0% |
| Turn 3 Pass@1@0.8 | 60.0% | 60.0% |
| Turn 4 Pass@1@0.8 | 70.0% | 70.0% |
| Total wall time | 2,295.7s | 286.7s |
| Prefix cache hit rate | 41.6% | 40.5% |
| Avg prompt throughput | ~8K tok/s | ~16-25K tok/s |

The speedup comes from the scheduler admitting all 4 concurrent sessions simultaneously instead of serializing them through the over-budgeted admission gate. With the fix, multi-turn long-context workloads on hybrid SWA models achieve near-perfect session parallelism.
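For reference, the size of the wall-time improvement can be derived directly from the two totals in the table. The snippet below is only arithmetic on the reported numbers; the variable names are illustrative.

```python
# Total wall time reported in the results table, in seconds.
before_s = 2295.7
after_s = 286.7

speedup = before_s / after_s          # how many times faster the run finished
reduction = 1 - after_s / before_s    # fraction of wall time removed

print(f"speedup: {speedup:.2f}x, wall time reduced by {reduction:.1%}")
```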


Changed files

vllm/v1/core/kv_cache_coordinator.py (modified, +24/-1)
vllm/v1/core/kv_cache_manager.py (modified, +3/-0)
vllm/v1/core/sched/scheduler.py (modified, +1/-0)


Your current environment

vLLM: 0.1.dev1+gd56e95223 (custom container built from this commit of main)
Hardware: 2× NVIDIA RTX 3090 (SM 8.6, 24 GB each), tensor parallel across both
Model: cyankiwi/gemma-4-31B-it-AWQ-4bit
- Quantization (from the model's config.json → quantization_config): quant_method: compressed-tensors, format: pack-quantized, num_bits: 4, group_size: 32, observer: mse, strategy: group, symmetric: true, type: int.
- Architecture (from the model's config.json → text_config):
  - num_hidden_layers: 60
  - num_attention_heads: 32
  - num_key_value_heads: 16
  - head_dim: 256
  - global_head_dim: 512
  - sliding_window: 1024
  - layer_types: 60 entries, of which 50 are sliding_attention and 10 are full_attention (verified by counting the array)

`vllm serve` arguments

/models/cyankiwi/gemma-4-31B-it-AWQ-4bit
  --tensor-parallel-size 2
  --max-num-seqs 1
  --gpu-memory-utilization 0.96
  --max-model-len 131072
  --reasoning-parser gemma4
  --enable-auto-tool-choice
  --tool-call-parser gemma4
  --enable-prefix-caching
  --limit-mm-per-prompt '{"image": 0, "audio": 0}'
  --async-scheduling

kv_cache_dtype is left at its default value of auto, which resolves to BF16 in this configuration.

Observed log output (verbatim)

INFO [config.py:99]  Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
INFO [cuda.py:302]   Using AttentionBackendEnum.TRITON_ATTN backend.
INFO [gpu_model_runner.py:4820]  Model loading took 10.46 GiB memory and 2.480848 seconds
INFO [gpu_worker.py:436]  Available KV cache memory: 11.54 GiB
INFO [kv_cache_utils.py:1319] GPU KV cache size: 25,200 tokens
INFO [kv_cache_utils.py:1324] Maximum concurrency for 131,072 tokens per request: 1.87x

After loading the INT4 weights, each TP rank reports 11.54 GiB available for KV cache, and the resulting total GPU KV cache size is 25,200 tokens at max_model_len=131072, gpu_memory_utilization=0.96, and BF16 KV.

The model declares sliding_window: 1024 and 50 of its 60 layers are marked sliding_attention in layer_types.
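One way to frame the question is a back-of-the-envelope estimate from the config values listed above. This is a rough sketch under stated assumptions, not a description of how vLLM sizes its cache: it assumes the KV heads are split evenly across the two TP ranks, that full-attention layers use the same number of KV heads with global_head_dim, that BF16 takes 2 bytes per element, and that a single sequence is being served. `estimate_tokens` and its parameters are illustrative names.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes needed to store K and V for one token in one layer."""
    return 2 * num_kv_heads * head_dim * dtype_bytes


def estimate_tokens(
    kv_mem_bytes: int,
    tp_size: int = 2,
    num_kv_heads: int = 16,
    swa_layers: int = 50,
    full_layers: int = 10,
    head_dim: int = 256,
    global_head_dim: int = 512,   # assumption: same KV head count on full layers
    sliding_window: int = 1024,
    dtype_bytes: int = 2,         # BF16
) -> tuple[int, int]:
    """Return (naive, window_aware) token capacity for one TP rank."""
    heads_per_rank = num_kv_heads // tp_size
    swa = kv_bytes_per_token(heads_per_rank, head_dim, dtype_bytes)
    full = kv_bytes_per_token(heads_per_rank, global_head_dim, dtype_bytes)

    # Naive: every layer keeps K/V for every token in the context.
    naive = kv_mem_bytes // (swa_layers * swa + full_layers * full)

    # Window-aware: SWA layers keep only `sliding_window` tokens for one
    # sequence; the remaining memory goes to the full-attention layers.
    swa_fixed = swa_layers * swa * sliding_window
    window_aware = (kv_mem_bytes - swa_fixed) // (full_layers * full)
    return naive, window_aware


if __name__ == "__main__":
    naive, window_aware = estimate_tokens(int(11.54 * GIB))
    print(f"naive: {naive:,} tokens, window-aware: {window_aware:,} tokens")
```

Under these assumptions the reported 25,200 tokens lies between the two estimates, which is why it is unclear from the log alone how much the sliding window contributes to the sizing.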

Questions for maintainers

1. Is vLLM's Gemma 4 implementation expected to exploit the sliding_attention layer type and the 1024-token sliding_window when sizing the KV cache? If yes, is the 25,200-token result consistent with that exploitation on this hardware/config, or would a larger value be expected?
2. Is there a config flag I should be setting to enable sliding-window KV sharing for Gemma 4 that I am missing? (For reference, llama.cpp on the same model exposes --swa-full to toggle related behavior, and its default sizing for this architecture produces a substantially larger effective context on comparable memory.)

Separately, when I tried to set --kv-cache-dtype fp8_e5m2 on this INT4 checkpoint, the server raised ValueError("fp8_e5m2 kv-cache is not supported with fp8 checkpoints.") from vllm/model_executor/layers/attention/attention.py around lines 162–170. The relevant code on the running container is:

quant_method = (
    quant_config.get_quant_method(layer, prefix=prefix) if quant_config else None
)

# See [Note: Register q/k/v/prob scales in state dict]
if should_load_quant_weights(quant_method):
    assert isinstance(quant_method, BaseKVCacheMethod)
    # TODO (mgoin): kv cache dtype should be specified in the FP8
    # checkpoint config and become the "auto" behavior
    if layer.kv_cache_dtype == "fp8_e5m2":
        raise ValueError("fp8_e5m2 kv-cache is not supported with fp8 checkpoints.")

with should_load_quant_weights defined as:

def should_load_quant_weights(quant_method: QuantizeMethodBase | None) -> bool:
    """Returns whether the quantization method should load quantized weights."""
    return quant_method is not None and not isinstance(
        quant_method, UnquantizedLinearMethod
    )

The gate that fires before the fp8_e5m2 error is should_load_quant_weights(quant_method), which returns True for any quant_method other than UnquantizedLinearMethod. For this compressed-tensors INT4 checkpoint that gate is True, even though the checkpoint is not fp8. The error message references "fp8 checkpoints" but the gate does not specifically check for an fp8 checkpoint. Is this intended for all quantized checkpoints, or is the error message describing a narrower condition than the actual check?

For completeness: --kv-cache-dtype fp8_e4m3 on the same configuration fails during Inductor codegen with
```
triton.compiler.errors.CompilationError: at 1:0:
def triton_red_fused__to_copy_add_cat_clamp_index_select_mul_reciprocal_rms_norm_split_split_with_sizes_sub_unsqueeze_view_1(...):
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
```
This appears to be a Triton / SM 8.6 hardware limitation for Inductor-fused kernels and not a vLLM-side check; I am not reporting it as a vLLM bug, but noting it because it means fp8_e5m2 is the only fp8 variant that could potentially be used on this hardware.
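A quick pre-flight check can flag this before launching the server. The sketch below assumes that Triton accepts fp8e4nv only on compute capability 8.9 and newer; that threshold is an assumption not stated in this report, and `triton_fp8e4nv_supported` is a hypothetical helper.

```python
def triton_fp8e4nv_supported(capability: tuple[int, int]) -> bool:
    """Whether fp8e4nv (used for fp8_e4m3) is expected to compile in Triton.

    Assumption: support starts at compute capability 8.9.
    """
    return capability >= (8, 9)


if __name__ == "__main__":
    # RTX 3090 is SM 8.6, as listed in the environment section.
    print(triton_fp8e4nv_supported((8, 6)))
```

On a real machine the capability tuple can be read with `torch.cuda.get_device_capability()`.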

What this report is asking for

Guidance on whether the observed 25,200-token KV cache size for Gemma 4 31B at this configuration is expected on vLLM's current main, and clarification on the fp8_e5m2 gate above (intended scope, and/or message wording).

I am not asserting that vLLM is incorrect — I am reporting the exact observations and asking whether they match the current design.

extent analysis

TL;DR

The observed 25,200-token KV cache size for Gemma 4 31B might not be exploiting the sliding_attention layer type and sliding_window as expected, and the error message for fp8_e5m2 kv-cache dtype seems to be misleading.

Guidance

- Verify KV cache sizing: Check the vLLM documentation or source code to see if the sliding_attention layer type and sliding_window are expected to influence KV cache sizing. If so, investigate why the observed 25,200-token size might be smaller than expected.
- Investigate config flags: Look for config flags or options that might enable sliding-window KV sharing for Gemma 4, similar to the --swa-full flag in llama.cpp.
- Clarify fp8_e5m2 error message: The error message for fp8_e5m2 kv-cache dtype seems to reference "fp8 checkpoints" specifically, but the underlying check appears to apply to all quantized checkpoints. Seek clarification on the intended scope of this check and whether the error message should be updated for accuracy.
- Consider alternative kv-cache dtypes: Given the limitations of fp8_e4m3 on SM 8.6 hardware, explore other kv-cache dtype options that might be supported and suitable for the Gemma 4 31B model.

Example

No specific code snippet is provided, as the issue is more related to configuration and design expectations.

Notes

The issue highlights the complexity of configuring and optimizing large language models like Gemma 4 31B. The observed KV cache size and error messages may indicate a need for further clarification or adjustments in the vLLM configuration or documentation.

Recommendation

Apply workaround: Investigate and adjust the configuration to potentially enable sliding-window KV sharing for Gemma 4, and seek clarification on the fp8_e5m2 error message to ensure accurate understanding and usage of kv-cache dtypes.
