vllm - ✅(Solved) Fix [Bug] --calculate-kv-scales produces corrupted FP8 KV cache on hybrid GDN+Attention models (Qwen3.5) [1 pull requests, 3 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#37554Fetched 2026-04-08 01:02:09
View on GitHub
Comments
3
Participants
2
Timeline
9
Reactions
0
Author
Participants
Timeline (top)
commented ×3cross-referenced ×2subscribed ×2closed ×1

Error Message

  1. Warn on hybrid + calculate-kv-scales: At minimum, emit a warning that scale calibration may be unreliable for hybrid models with recurrent layers.

Root Cause

calc_kv_scales() at vllm/model_executor/layers/attention/attention.py:481:

def calc_kv_scales(self, query, key, value):
    self._k_scale.copy_(torch.abs(key).max() / self.k_range)  # k_range = 448 (fp8 max)
    self._v_scale.copy_(torch.abs(value).max() / self.v_range)
    # ... scales frozen after first call

In Qwen3.5's hybrid architecture (64 layers alternating attention ↔ GDN):

  1. Dummy profile pass starts with random/zero inputs
  2. Early attention layers compute reasonable scales
  3. GDN layers process through uninitialized recurrent state (conv_state, ssm_state) → garbage activations
  4. Later attention layers compute scales from garbage → wildly wrong scales
  5. During real inference, these wrong scales produce catastrophic FP8 quantization errors
  6. Errors cascade through GDN recurrent state, amplifying with each layer

This does not affect pure attention models (e.g., Llama) because all layers produce reasonable activations from dummy data.

Fix Action

Fixed

PR fix notes

PR #37565: [Bugfix] Disable --calculate-kv-scales for hybrid GDN/Mamba+Attention…

Description (problem / solution / changelog)

Description

Fixes #37554.

When using --calculate-kv-scales alongside --kv-cache-dtype fp8 on hybrid models like Qwen3.5, the calculated KV cache scales are silently corrupted. During the calibration dummy forward pass, recurrent layers (GDN, Mamba, SSM) process without proper state initialization. This yields garbage activations that are subsequently consumed by downstream attention layers, causing completely wrong FP8 KV scales to be frozen. As a result, inference suffers from severe output corruption (e.g., hallucinated inputs, topic loops, and gibberish).

Since --calculate-kv-scales is currently deprecated (to be removed in v0.19), the safest and most effective solution is to seamlessly disable it for all hybrid models and log a warning advising the user to rely on the default scale of 1.0, which behaves correctly for these models.

This PR adds a defensive guard in HybridAttentionMambaModelConfig.verify_and_update_config(), covering all registered is_hybrid models out-of-the-box.

Changed files

  • vllm/model_executor/models/config.py (modified, +17/-1)

Code Example

# BROKEN: --calculate-kv-scales poisons scales via uninitialized GDN state
vllm serve Qwen/Qwen3.5-35B-A3B \
  --kv-cache-dtype fp8 \
  --calculate-kv-scales \
  --attention-backend FLASHINFER \
  --enable-prefix-caching \
  --default-chat-template-kwargs '{"enable_thinking": false}'

# Query: "What is 7 * 8 + 3 * 4?"
# Expected: 68
# Actual: "7 × 7 = 56" (hallucinated the input number)

---

# WORKING: remove --calculate-kv-scales, everything else identical
vllm serve Qwen/Qwen3.5-35B-A3B \
  --kv-cache-dtype fp8 \
  --attention-backend FLASHINFER \
  --enable-prefix-caching \
  --default-chat-template-kwargs '{"enable_thinking": false}'

# Query: "What is 7 * 8 + 3 * 4?"
# Result: 68
---

def calc_kv_scales(self, query, key, value):
    self._k_scale.copy_(torch.abs(key).max() / self.k_range)  # k_range = 448 (fp8 max)
    self._v_scale.copy_(torch.abs(value).max() / self.v_range)
    # ... scales frozen after first call
RAW_BUFFERClick to expand / collapse

My current environment

  • vLLM version: v0.17.1 (also confirmed on v0.16.1rc1 and nightly v0.17.1rc1.dev88)
  • GPU: NVIDIA RTX PRO 6000 Blackwell (SM120, 96 GB VRAM)
  • Model: Qwen3.5-35B-A3B (FP8 block-quantized via llm-compressor, Qwen3_5MoeForConditionalGeneration)

Bug description

--calculate-kv-scales causes catastrophic output corruption when used with FP8 KV cache on hybrid GDN+Attention models. The corruption is silent — no errors or warnings — and manifests as hallucinated inputs, topic fixation loops, and gibberish on even the simplest queries.

Root cause: calc_kv_scales() in vllm/model_executor/layers/attention/attention.py computes per-layer FP8 quantization scales from a single dummy forward pass during model profiling. In hybrid GDN+Attention models like Qwen3.5, the GDN (Gated Delta Network) layers have uninitialized recurrent state during this dummy pass. Their outputs are garbage, which feeds into downstream attention layers, producing wildly incorrect per-layer scales. These bad scales are then frozen and used for all subsequent inference, causing catastrophic FP8 quantization errors.

Fix: Removing --calculate-kv-scales (scales default to 1.0) resolves the issue completely. Default scales work correctly because typical BF16 KV activations are well within FP8 E4M3's ±448 representable range.

Minimal reproducer

# BROKEN: --calculate-kv-scales poisons scales via uninitialized GDN state
vllm serve Qwen/Qwen3.5-35B-A3B \
  --kv-cache-dtype fp8 \
  --calculate-kv-scales \
  --attention-backend FLASHINFER \
  --enable-prefix-caching \
  --default-chat-template-kwargs '{"enable_thinking": false}'

# Query: "What is 7 * 8 + 3 * 4?"
# Expected: 68
# Actual: "7 × 7 = 56" (hallucinated the input number)
# WORKING: remove --calculate-kv-scales, everything else identical
vllm serve Qwen/Qwen3.5-35B-A3B \
  --kv-cache-dtype fp8 \
  --attention-backend FLASHINFER \
  --enable-prefix-caching \
  --default-chat-template-kwargs '{"enable_thinking": false}'

# Query: "What is 7 * 8 + 3 * 4?"
# Result: 68 ✓

Isolation evidence

ConfigurationOutput
Qwen3.5 + FLASHINFER + BF16 KVCorrect
Qwen3.5 + FLASHINFER + FP8 KV + --calculate-kv-scalesCorrupted — hallucinated inputs, topic loops, gibberish
Qwen3.5 + FLASHINFER + FP8 KV (no --calculate-kv-scales)Correct
Llama 3.3 70B + FLASHINFER + FP8 KV + --calculate-kv-scalesCorrect (pure attention, no GDN layers)

The Llama 3.3 test confirms the bug is specific to hybrid GDN+Attention architectures where recurrent state is uninitialized during the profile pass — not a general FP8 or FlashInfer issue.

Detailed symptoms

With --calculate-kv-scales enabled:

  • Arithmetic: "7 × 8 + 3 × 4" → model answered "7 × 7 = 56" (hallucinated input). "37 × 43" → answered "39 × 26".
  • Multi-turn context: Told "My name is Alexander" → responded "I don't actually know your name".
  • First-request corruption: The very first request after server start is already broken — this is not cross-request state leakage.

Without --calculate-kv-scales (scales = 1.0):

  • All arithmetic, factual recall, multi-turn context, and code generation tests produce correct output identical to BF16 KV cache.

Root cause analysis

calc_kv_scales() at vllm/model_executor/layers/attention/attention.py:481:

def calc_kv_scales(self, query, key, value):
    self._k_scale.copy_(torch.abs(key).max() / self.k_range)  # k_range = 448 (fp8 max)
    self._v_scale.copy_(torch.abs(value).max() / self.v_range)
    # ... scales frozen after first call

In Qwen3.5's hybrid architecture (64 layers alternating attention ↔ GDN):

  1. Dummy profile pass starts with random/zero inputs
  2. Early attention layers compute reasonable scales
  3. GDN layers process through uninitialized recurrent state (conv_state, ssm_state) → garbage activations
  4. Later attention layers compute scales from garbage → wildly wrong scales
  5. During real inference, these wrong scales produce catastrophic FP8 quantization errors
  6. Errors cascade through GDN recurrent state, amplifying with each layer

This does not affect pure attention models (e.g., Llama) because all layers produce reasonable activations from dummy data.

Suggested fixes

  1. Skip recurrent layers during calibration: Zero or clamp GDN/Mamba outputs during the scale calibration pass so they don't poison downstream attention layer scales.
  2. Multi-pass calibration: Run the calibration multiple times so recurrent state warms up before freezing scales.
  3. Warn on hybrid + calculate-kv-scales: At minimum, emit a warning that scale calibration may be unreliable for hybrid models with recurrent layers.
  4. Document the limitation: Note that --calculate-kv-scales should not be used with hybrid models and that default scales (1.0) work correctly.

Affected models

Any hybrid model with recurrent layers (GDN, Mamba, SSM) interleaved with attention layers:

  • Qwen3.5-35B-A3B (Qwen3_5MoeForConditionalGeneration)
  • Likely also: Qwen3-Next, Jamba, and other hybrid architectures

extent analysis

Fix Plan

To resolve the issue, we can implement the following fixes:

  • Skip recurrent layers during calibration: Modify the calc_kv_scales function to zero or clamp GDN/Mamba outputs during the scale calibration pass.
  • Multi-pass calibration: Run the calibration multiple times to warm up the recurrent state before freezing scales.
  • Warn on hybrid + calculate-kv-scales: Emit a warning when using --calculate-kv-scales with hybrid models.

Code Changes

Here's an example of how to modify the calc_kv_scales function to skip recurrent layers during calibration:

def calc_kv_scales(self, query, key, value):
    # Zero or clamp GDN/Mamba outputs during scale calibration
    if self.is_hybrid_model:
        key = torch.zeros_like(key)
        value = torch.zeros_like(value)
    
    self._k_scale.copy_(torch.abs(key).max() / self.k_range)
    self._v_scale.copy_(torch.abs(value).max() / self.v_range)

Alternatively, you can implement multi-pass calibration:

def calc_kv_scales(self, query, key, value, num_passes=3):
    for _ in range(num_passes):
        # Run the calibration pass
        self._k_scale.copy_(torch.abs(key).max() / self.k_range)
        self._v_scale.copy_(torch.abs(value).max() / self.v_range)
        
        # Update the recurrent state
        key, value = self.update_recurrent_state(key, value)

Verification

To verify that the fix worked, run the model with the modified calc_kv_scales function and check that the output is correct for various inputs.

Extra Tips

  • When using hybrid models with recurrent layers, it's recommended to use the default scales (1.0) instead of calculating them with --calculate-kv-scales.
  • Document the limitation of --calculate-kv-scales with hybrid models to avoid similar issues in the future.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

vllm - ✅(Solved) Fix [Bug] --calculate-kv-scales produces corrupted FP8 KV cache on hybrid GDN+Attention models (Qwen3.5) [1 pull requests, 3 comments, 2 participants]