vllm - ✅(Solved) Fix [Bug]: Gemma 4 31B INT4 on 2×24GB GPUs (TP=2): GPU KV cache size is 25,200 tokens at max_model_len=131072, gpu_memory_utilization=0.96, BF16 KV [1 pull requests, 4 comments, 4 participants]

GitHub stats
vllm-project/vllm#39133 (fetched 2026-04-08 03:01:49)
Comments: 4
Participants: 4
Timeline: 9
Reactions: 1
Timeline (top): commented ×4, subscribed ×3, cross-referenced ×1, mentioned ×1

Error Message

triton.compiler.errors.CompilationError: at 1:0: def triton_red_fused__to_copy_add_cat_clamp_index_select_mul_reciprocal_rms_norm_split_split_with_sizes_sub_unsqueeze_view_1(...): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Root Cause

This appears to be a Triton / SM 8.6 hardware limitation for Inductor-fused kernels and not a vLLM-side check; I am not reporting it as a vLLM bug, but noting it because it means fp8_e5m2 is the only fp8 variant that could potentially be used on this hardware.
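The hardware constraint described above can be sketched as a small pre-flight check. This is a minimal illustration, not vLLM or Triton code: the helper names are hypothetical, and the capability thresholds are assumptions chosen to agree with the error message (which lists only fp8e4b15 and fp8e5 for SM 8.6). The mapping from vLLM's `fp8_e4m3` / `fp8_e5m2` names to Triton's `fp8e4nv` / `fp8e5` is inferred from the errors reported in this issue.

```python
# Hypothetical helpers; thresholds are assumptions consistent with the
# error message quoted above, not values read from Triton's source.
KV_DTYPE_TO_TRITON = {
    "fp8_e4m3": "fp8e4nv",  # inferred: fp8_e4m3 fails with "fp8e4nv not supported"
    "fp8_e5m2": "fp8e5",
}


def triton_fp8_dtypes(capability: int) -> set[str]:
    """Return the fp8 dtypes assumed to compile for a capability such as 86."""
    supported = {"fp8e5"}
    if capability < 90:      # assumption: legacy fp8e4b15 only below SM 9.0
        supported.add("fp8e4b15")
    if capability >= 89:     # assumption: fp8e4nv needs SM 8.9 or newer
        supported.add("fp8e4nv")
    return supported


def kv_cache_dtype_compiles(kv_cache_dtype: str, capability: int) -> bool:
    """True if the fp8 KV-cache dtype is expected to pass Triton codegen."""
    return KV_DTYPE_TO_TRITON[kv_cache_dtype] in triton_fp8_dtypes(capability)


rtx_3090 = 86  # SM 8.6, as reported in the issue
print(sorted(triton_fp8_dtypes(rtx_3090)))
print(kv_cache_dtype_compiles("fp8_e4m3", rtx_3090))
print(kv_cache_dtype_compiles("fp8_e5m2", rtx_3090))
```

On a live system the capability could be read with `torch.cuda.get_device_capability()`, which returns a `(major, minor)` tuple; converting it to the integer form used here is left to the caller.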

PR fix notes

PR #39866: [Scheduler] Cap SWA admission budget at sliding_window + chunk_size

Description (problem / solution / changelog)

For hybrid SWA+full-attention models (e.g., Gemma 4), the can_fit_full_sequence admission gate passes full_num_tokens to get_num_blocks_to_allocate for all layer groups, including sliding window groups. Since total_computed_tokens is 0 for new requests, get_num_skipped_tokens returns 0, causing SWA groups to budget ceil(full_num_tokens / block_size) blocks instead of the window-sized amount they actually need.

This over-budget throttles concurrent request admission. On Gemma 4 31B with 50 SWA layers (window=1024) and max_num_batched_tokens=8192, each SWA group budgets 1001 blocks instead of 576, causing 4 concurrent 65K-context sessions to be serialized through the gate.

Fix: In KVCacheCoordinator.get_num_blocks_to_allocate, cap effective_num_tokens for SlidingWindowManager groups at sliding_window + max_num_batched_tokens. The window term is the steady-state max blocks, and the chunk term accounts for blocks needed during a single prefill chunk before remove_skipped_blocks frees OOW blocks. This matches TensorRT-LLM's getNeededBlocksOneStep.

Plumbing: max_num_batched_tokens flows from SchedulerConfig through KVCacheManager and get_kv_cache_coordinator to all coordinator subclasses.

Purpose

can_fit_full_sequence over-budgets SWA (sliding window attention) layers by passing full_num_tokens (the entire request length) to get_num_blocks_to_allocate, even though SWA layers never hold more than sliding_window blocks at steady state. Out-of-window blocks are freed between chunked prefill chunks by remove_skipped_blocks(), so the admission gate only needs to budget for the window plus one chunk buffer.

This causes the scheduler to serialize requests on hybrid SWA+full-attention models (e.g., Gemma 4) when it should be running them concurrently. For example, with Gemma 4 31B (50 SWA layers, sliding_window=1024, max_num_batched_tokens=8192), each SWA group was budgeted at 1001 blocks instead of 576, so about 42% of each group's reserved blocks were unnecessary (roughly 74% more than required), which throttles admission.

The fix caps effective_num_tokens for SlidingWindowManager groups at sliding_window + max_num_batched_tokens in KVCacheCoordinator.get_num_blocks_to_allocate(), matching TensorRT-LLM's approach in getNeededBlocksOneStep. The max_num_batched_tokens term accounts for blocks needed during a single prefill chunk before OOW release runs.
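The cap can be illustrated with a few lines of arithmetic. This is a minimal sketch, not the code from the PR: the function name `swa_blocks_to_budget` is hypothetical, and the block size of 16 is an assumption inferred from the reported figures (16,001 tokens mapping to 1,001 blocks).

```python
BLOCK_SIZE = 16  # assumption: inferred from 16,001 tokens -> 1,001 blocks


def swa_blocks_to_budget(num_tokens: int, sliding_window: int,
                         max_num_batched_tokens: int,
                         block_size: int = BLOCK_SIZE,
                         cap: bool = True) -> int:
    """Blocks one SWA group budgets at admission (illustrative only)."""
    if cap:
        # Steady-state window plus one prefill chunk, as described above.
        num_tokens = min(num_tokens, sliding_window + max_num_batched_tokens)
    return -(-num_tokens // block_size)  # integer ceiling division


before = swa_blocks_to_budget(16_001, 1024, 8192, cap=False)
after = swa_blocks_to_budget(16_001, 1024, 8192, cap=True)

reduction = (before - after) / before   # share of the old budget that is freed
over_reserved = (before - after) / after  # old budget relative to what is needed

print(before, after)
print(f"{reduction:.1%} of the old budget freed, {over_reserved:.1%} over-reserved")
```

Requests shorter than `sliding_window + max_num_batched_tokens` are unaffected by the cap, since `min` leaves their token count unchanged.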

This change borrows its logic from TensorRT-LLM: https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp#L2775

Related issue: https://github.com/vllm-project/vllm/issues/39133

Test Plan

  • Verified with debug instrumentation that remove_skipped_blocks() frees OOW blocks between chunked prefill chunks
  • Verified admission budget drops from 1001 to 576 blocks per SWA group
  • SCBench Code.RepoQA correctness test (20 examples, 5 turns, ~65K token contexts) with and without the fix — identical quality scores
  • End-to-end multi-session serving test confirming no crashes or errors
  • More testing and accuracy benchmarking are still needed

Test Result

Admission Budget (per SWA group, Gemma 4 31B, sliding_window=1024, max_num_batched_tokens=8192)

| | Before | After |
|---|---|---|
| num_tokens passed to SWA group | 16,001 | 9,216 |
| required_blocks per SWA group | 1,001 | 576 |
| Total across 5 SWA groups | 5,005 | 2,880 |

SCBench Code.RepoQA — Gemma 4 31B NVFP4, 1x B200, TP1

(I chose this dataset because its requests share long prefixes, which exercises the prefix cache.)

Server command:

vllm serve /path/to/Gemma-4-31B-IT-NVFP4 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 8192 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.9 \
  --tensor-parallel-size 1 \
  --trust-remote-code \
  --async-scheduling \
  --enable-prefix-caching \
  --kv-cache-dtype auto \
  --compilation-config '{"compile_sizes":[1,2,4,8,16,32,64],"cudagraph_capture_sizes":[1,2,4,8,16,32,64]}'

Results:

| Metric | Before (no cap) | After (with cap) |
|---|---|---|
| Overall Pass@1 (threshold=0.8) | 73.0% | 73.0% |
| Turn 0 Pass@1 | 70.0% | 70.0% |
| Turn 1 Pass@1 | 85.0% | 85.0% |
| Turn 2 Pass@1 | 80.0% | 80.0% |
| Turn 3 Pass@1 | 60.0% | 60.0% |
| Turn 4 Pass@1 | 70.0% | 70.0% |
| Total wall time | 2,295.7s | 286.7s |
| Prefix cache hit rate | 41.6% | 40.5% |
| Avg prompt throughput | ~8K tok/s | ~16-25K tok/s |

The speedup comes from the scheduler admitting all 4 concurrent sessions simultaneously instead of serializing them through the over-budgeted admission gate. With the fix, multi-turn long-context workloads on hybrid SWA models achieve near-perfect session parallelism.
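To put a number on the speedup, the ratio can be computed directly from the reported wall times. This is simple arithmetic on the figures in the results table; the variable names are only illustrative.

```python
wall_before_s = 2295.7  # total wall time without the cap
wall_after_s = 286.7    # total wall time with the cap

speedup = wall_before_s / wall_after_s
print(f"end-to-end speedup: {speedup:.1f}x")
```

The pass rates are identical in both runs, so the comparison covers the same work at the same measured quality.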



Changed files

  • vllm/v1/core/kv_cache_coordinator.py (modified, +24/-1)
  • vllm/v1/core/kv_cache_manager.py (modified, +3/-0)
  • vllm/v1/core/sched/scheduler.py (modified, +1/-0)


Your current environment

  • vLLM: 0.1.dev1+gd56e95223 (custom container built from this commit of main)
  • Hardware: 2× NVIDIA RTX 3090 (SM 8.6, 24 GB each), tensor parallel across both
  • Model: cyankiwi/gemma-4-31B-it-AWQ-4bit
    • Quantization (from the model's config.json quantization_config): quant_method: compressed-tensors, format: pack-quantized, num_bits: 4, group_size: 32, observer: mse, strategy: group, symmetric: true, type: int.
    • Architecture (from the model's config.json text_config):
      • num_hidden_layers: 60
      • num_attention_heads: 32
      • num_key_value_heads: 16
      • head_dim: 256
      • global_head_dim: 512
      • sliding_window: 1024
      • layer_types: 60 entries, of which 50 are sliding_attention and 10 are full_attention (verified by counting the array)

vllm serve arguments

/models/cyankiwi/gemma-4-31B-it-AWQ-4bit
  --tensor-parallel-size 2
  --max-num-seqs 1
  --gpu-memory-utilization 0.96
  --max-model-len 131072
  --reasoning-parser gemma4
  --enable-auto-tool-choice
  --tool-call-parser gemma4
  --enable-prefix-caching
  --limit-mm-per-prompt '{"image": 0, "audio": 0}'
  --async-scheduling

kv_cache_dtype is left at its default value of auto, which resolves to BF16 in this configuration.

Observed log output (verbatim)

INFO [config.py:99]  Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
INFO [cuda.py:302]   Using AttentionBackendEnum.TRITON_ATTN backend.
INFO [gpu_model_runner.py:4820]  Model loading took 10.46 GiB memory and 2.480848 seconds
INFO [gpu_worker.py:436]  Available KV cache memory: 11.54 GiB
INFO [kv_cache_utils.py:1319] GPU KV cache size: 25,200 tokens
INFO [kv_cache_utils.py:1324] Maximum concurrency for 131,072 tokens per request: 1.87x

After loading the INT4 weights, each TP rank reports 11.54 GiB available for KV cache, and the resulting total GPU KV cache size is 25,200 tokens at max_model_len=131072, gpu_memory_utilization=0.96, and BF16 KV.

The model declares sliding_window: 1024 and 50 of its 60 layers are marked sliding_attention in layer_types.
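As a rough cross-check of the logged figure, the following back-of-the-envelope sketch estimates how many tokens fit if every one of the 60 layers stored BF16 keys and values for every token. It is not a description of how vLLM sizes the cache. The assumptions are: the 16 KV heads are split evenly across the two TP ranks, head_dim=256 is applied to all layers (ignoring global_head_dim=512), and the block size is 16 tokens.

```python
GIB = 1024 ** 3

available_bytes = 11.54 * GIB  # "Available KV cache memory" per TP rank (from the log)
num_layers = 60                # num_hidden_layers
kv_heads_per_rank = 16 // 2    # assumption: num_key_value_heads split across TP=2
head_dim = 256                 # assumption: used for every layer, ignoring global_head_dim
bytes_per_elem = 2             # BF16
block_size = 16                # assumption: 16 tokens per KV block

# Keys and values (factor 2) for one token across all layers on one rank.
bytes_per_token = 2 * kv_heads_per_rank * head_dim * bytes_per_elem * num_layers

# Whole blocks that fit in the available memory.
num_blocks = int(available_bytes // (bytes_per_token * block_size))
estimated_tokens = num_blocks * block_size

print(bytes_per_token, num_blocks, estimated_tokens)
```

Under these assumptions the estimate comes out at 25,200 tokens, the same as the logged value. That would be consistent with all layers being budgeted for the full token count, but it is only a hypothesis: the 11.54 GiB input is a rounded log value, and a slightly different input would shift the result by a block or two.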

Questions for maintainers

  1. Is vLLM's Gemma 4 implementation expected to exploit the sliding_attention layer type and the 1024-token sliding_window when sizing the KV cache? If yes, is the 25,200-token result consistent with that exploitation on this hardware/config, or would a larger value be expected?

  2. Is there a config flag I should be setting to enable sliding-window KV sharing for Gemma 4 that I am missing? (For reference, llama.cpp on the same model exposes --swa-full to toggle related behavior, and its default sizing for this architecture produces a substantially larger effective context on comparable memory.)

  3. Separately, when I tried to set --kv-cache-dtype fp8_e5m2 on this INT4 checkpoint, the server raised ValueError("fp8_e5m2 kv-cache is not supported with fp8 checkpoints.") from vllm/model_executor/layers/attention/attention.py around lines 162–170. The relevant code on the running container is:

    quant_method = (
        quant_config.get_quant_method(layer, prefix=prefix) if quant_config else None
    )
    
    # See [Note: Register q/k/v/prob scales in state dict]
    if should_load_quant_weights(quant_method):
        assert isinstance(quant_method, BaseKVCacheMethod)
        # TODO (mgoin): kv cache dtype should be specified in the FP8
        # checkpoint config and become the "auto" behavior
        if layer.kv_cache_dtype == "fp8_e5m2":
            raise ValueError("fp8_e5m2 kv-cache is not supported with fp8 checkpoints.")

    with should_load_quant_weights defined as:

    def should_load_quant_weights(quant_method: QuantizeMethodBase | None) -> bool:
        """Returns whether the quantization method should load quantized weights."""
        return quant_method is not None and not isinstance(
            quant_method, UnquantizedLinearMethod
        )

    The gate that fires before the fp8_e5m2 error is should_load_quant_weights(quant_method), which returns True for any quant_method other than UnquantizedLinearMethod. For this compressed-tensors INT4 checkpoint that gate is True, even though the checkpoint is not fp8. The error message references "fp8 checkpoints" but the gate does not specifically check for an fp8 checkpoint. Is this intended for all quantized checkpoints, or is the error message describing a narrower condition than the actual check?

  4. For completeness: --kv-cache-dtype fp8_e4m3 on the same configuration fails during Inductor codegen with

    triton.compiler.errors.CompilationError: at 1:0:
    def triton_red_fused__to_copy_add_cat_clamp_index_select_mul_reciprocal_rms_norm_split_split_with_sizes_sub_unsqueeze_view_1(...):
    ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

    This appears to be a Triton / SM 8.6 hardware limitation for Inductor-fused kernels and not a vLLM-side check; I am not reporting it as a vLLM bug, but noting it because it means fp8_e5m2 is the only fp8 variant that could potentially be used on this hardware.

What this report is asking for

Guidance on whether the observed 25,200-token KV cache size for Gemma 4 31B at this configuration is expected on vLLM's current main, and clarification on the fp8_e5m2 gate above (intended scope, and/or message wording).

I am not asserting that vLLM is incorrect — I am reporting the exact observations and asking whether they match the current design.

extent analysis

TL;DR

The observed 25,200-token KV cache size for Gemma 4 31B might not be exploiting the sliding_attention layer type and sliding_window as expected, and the error message for fp8_e5m2 kv-cache dtype seems to be misleading.

Guidance

  1. Verify KV cache sizing: Check the vLLM documentation or source code to see if the sliding_attention layer type and sliding_window are expected to influence KV cache sizing. If so, investigate why the observed 25,200-token size might be smaller than expected.
  2. Investigate config flags: Look for config flags or options that might enable sliding-window KV sharing for Gemma 4, similar to the --swa-full flag in llama.cpp.
  3. Clarify fp8_e5m2 error message: The error message for fp8_e5m2 kv-cache dtype seems to reference "fp8 checkpoints" specifically, but the underlying check appears to apply to all quantized checkpoints. Seek clarification on the intended scope of this check and whether the error message should be updated for accuracy.
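  4. Consider alternative kv-cache dtypes: fp8_e4m3 fails to compile on SM 8.6, and the Triton error lists only fp8e4b15 and fp8e5 as supported there, so fp8_e5m2 is the remaining fp8 candidate. It is currently rejected by the gate described in item 3, which leaves the default (auto, which resolves to BF16 in this configuration) as the setting that works in this report.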

Example

No specific code snippet is provided, as the issue is more related to configuration and design expectations.

Notes

The report raises two separate points: whether the 25,200-token KV cache size reflects the model's sliding-window layers, and whether the fp8_e5m2 error message matches the condition that actually triggers it. Both may call for clarification or adjustments in the vLLM configuration or documentation.

Recommendation

Apply workaround: Investigate and adjust the configuration to potentially enable sliding-window KV sharing for Gemma 4, and seek clarification on the fp8_e5m2 error message to ensure accurate understanding and usage of kv-cache dtypes.
