vllm - ✅(Solved) Fix [Bug] --calculate-kv-scales produces corrupted FP8 KV cache on hybrid GDN+Attention models (Qwen3.5) [1 pull requests, 3 comments, 2 participants]

vllm2026-03-19 11:00:12

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

vllm-project/vllm#37554•Fetched 2026-04-08 01:02:09

View on GitHub

Comments

Participants

Timeline

Reactions

Author

daudo

Participants

daudo

Young-Leo

Timeline (top)

commented ×3cross-referenced ×2subscribed ×2closed ×1

Error Message

Warn on hybrid + calculate-kv-scales: At minimum, emit a warning that scale calibration may be unreliable for hybrid models with recurrent layers.

Root Cause

calc_kv_scales() at vllm/model_executor/layers/attention/attention.py:481:

def calc_kv_scales(self, query, key, value):
    self._k_scale.copy_(torch.abs(key).max() / self.k_range)  # k_range = 448 (fp8 max)
    self._v_scale.copy_(torch.abs(value).max() / self.v_range)
    # ... scales frozen after first call

In Qwen3.5's hybrid architecture (64 layers alternating attention ↔ GDN):

Dummy profile pass starts with random/zero inputs
Early attention layers compute reasonable scales
GDN layers process through uninitialized recurrent state (conv_state, ssm_state) → garbage activations
Later attention layers compute scales from garbage → wildly wrong scales
During real inference, these wrong scales produce catastrophic FP8 quantization errors
Errors cascade through GDN recurrent state, amplifying with each layer

This does not affect pure attention models (e.g., Llama) because all layers produce reasonable activations from dummy data.

Fix Action

Fixed

Fixed by PR: [Bugfix] Disable --calculate-kv-scales for hybrid GDN/Mamba+Attention… (https://github.com/vllm-project/vllm/pull/37565)

PR fix notes

PR #37565: [Bugfix] Disable --calculate-kv-scales for hybrid GDN/Mamba+Attention…

Repository: vllm-project/vllm
Author: Young-Leo
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/37565

Description (problem / solution / changelog)

Description

Fixes #37554.

When using --calculate-kv-scales alongside --kv-cache-dtype fp8 on hybrid models like Qwen3.5, the calculated KV cache scales are silently corrupted. During the calibration dummy forward pass, recurrent layers (GDN, Mamba, SSM) process without proper state initialization. This yields garbage activations that are subsequently consumed by downstream attention layers, causing completely wrong FP8 KV scales to be frozen. As a result, inference suffers from severe output corruption (e.g., hallucinated inputs, topic loops, and gibberish).

Since --calculate-kv-scales is currently deprecated (to be removed in v0.19), the safest and most effective solution is to seamlessly disable it for all hybrid models and log a warning advising the user to rely on the default scale of 1.0, which behaves correctly for these models.

This PR adds a defensive guard in HybridAttentionMambaModelConfig.verify_and_update_config(), covering all registered is_hybrid models out-of-the-box.

Changed files

vllm/model_executor/models/config.py (modified, +17/-1)

Code Example

# BROKEN: --calculate-kv-scales poisons scales via uninitialized GDN state
vllm serve Qwen/Qwen3.5-35B-A3B \
  --kv-cache-dtype fp8 \
  --calculate-kv-scales \
  --attention-backend FLASHINFER \
  --enable-prefix-caching \
  --default-chat-template-kwargs '{"enable_thinking": false}'

# Query: "What is 7 * 8 + 3 * 4?"
# Expected: 68
# Actual: "7 × 7 = 56" (hallucinated the input number)

---

# WORKING: remove --calculate-kv-scales, everything else identical
vllm serve Qwen/Qwen3.5-35B-A3B \
  --kv-cache-dtype fp8 \
  --attention-backend FLASHINFER \
  --enable-prefix-caching \
  --default-chat-template-kwargs '{"enable_thinking": false}'

# Query: "What is 7 * 8 + 3 * 4?"
# Result: 68 ✓

---

def calc_kv_scales(self, query, key, value):
    self._k_scale.copy_(torch.abs(key).max() / self.k_range)  # k_range = 448 (fp8 max)
    self._v_scale.copy_(torch.abs(value).max() / self.v_range)
    # ... scales frozen after first call

RAW_BUFFERClick to expand / collapse

My current environment

vLLM version: v0.17.1 (also confirmed on v0.16.1rc1 and nightly v0.17.1rc1.dev88)
GPU: NVIDIA RTX PRO 6000 Blackwell (SM120, 96 GB VRAM)
Model: Qwen3.5-35B-A3B (FP8 block-quantized via llm-compressor, Qwen3_5MoeForConditionalGeneration)

Bug description

--calculate-kv-scales causes catastrophic output corruption when used with FP8 KV cache on hybrid GDN+Attention models. The corruption is silent — no errors or warnings — and manifests as hallucinated inputs, topic fixation loops, and gibberish on even the simplest queries.

Root cause: calc_kv_scales() in vllm/model_executor/layers/attention/attention.py computes per-layer FP8 quantization scales from a single dummy forward pass during model profiling. In hybrid GDN+Attention models like Qwen3.5, the GDN (Gated Delta Network) layers have uninitialized recurrent state during this dummy pass. Their outputs are garbage, which feeds into downstream attention layers, producing wildly incorrect per-layer scales. These bad scales are then frozen and used for all subsequent inference, causing catastrophic FP8 quantization errors.

Fix: Removing --calculate-kv-scales (scales default to 1.0) resolves the issue completely. Default scales work correctly because typical BF16 KV activations are well within FP8 E4M3's ±448 representable range.

Minimal reproducer

# BROKEN: --calculate-kv-scales poisons scales via uninitialized GDN state
vllm serve Qwen/Qwen3.5-35B-A3B \
  --kv-cache-dtype fp8 \
  --calculate-kv-scales \
  --attention-backend FLASHINFER \
  --enable-prefix-caching \
  --default-chat-template-kwargs '{"enable_thinking": false}'

# Query: "What is 7 * 8 + 3 * 4?"
# Expected: 68
# Actual: "7 × 7 = 56" (hallucinated the input number)

# WORKING: remove --calculate-kv-scales, everything else identical
vllm serve Qwen/Qwen3.5-35B-A3B \
  --kv-cache-dtype fp8 \
  --attention-backend FLASHINFER \
  --enable-prefix-caching \
  --default-chat-template-kwargs '{"enable_thinking": false}'

# Query: "What is 7 * 8 + 3 * 4?"
# Result: 68 ✓

Isolation evidence

Configuration	Output
Qwen3.5 + FLASHINFER + BF16 KV	Correct
Qwen3.5 + FLASHINFER + FP8 KV + `--calculate-kv-scales`	Corrupted — hallucinated inputs, topic loops, gibberish
Qwen3.5 + FLASHINFER + FP8 KV (no `--calculate-kv-scales`)	Correct
Llama 3.3 70B + FLASHINFER + FP8 KV + `--calculate-kv-scales`	Correct (pure attention, no GDN layers)

The Llama 3.3 test confirms the bug is specific to hybrid GDN+Attention architectures where recurrent state is uninitialized during the profile pass — not a general FP8 or FlashInfer issue.

Detailed symptoms

With --calculate-kv-scales enabled:

Arithmetic: "7 × 8 + 3 × 4" → model answered "7 × 7 = 56" (hallucinated input). "37 × 43" → answered "39 × 26".
Multi-turn context: Told "My name is Alexander" → responded "I don't actually know your name".
First-request corruption: The very first request after server start is already broken — this is not cross-request state leakage.

Without --calculate-kv-scales (scales = 1.0):

All arithmetic, factual recall, multi-turn context, and code generation tests produce correct output identical to BF16 KV cache.

Root cause analysis

calc_kv_scales() at vllm/model_executor/layers/attention/attention.py:481:

def calc_kv_scales(self, query, key, value):
    self._k_scale.copy_(torch.abs(key).max() / self.k_range)  # k_range = 448 (fp8 max)
    self._v_scale.copy_(torch.abs(value).max() / self.v_range)
    # ... scales frozen after first call

In Qwen3.5's hybrid architecture (64 layers alternating attention ↔ GDN):

Dummy profile pass starts with random/zero inputs
Early attention layers compute reasonable scales
GDN layers process through uninitialized recurrent state (conv_state, ssm_state) → garbage activations
Later attention layers compute scales from garbage → wildly wrong scales
During real inference, these wrong scales produce catastrophic FP8 quantization errors
Errors cascade through GDN recurrent state, amplifying with each layer

This does not affect pure attention models (e.g., Llama) because all layers produce reasonable activations from dummy data.

Suggested fixes

Skip recurrent layers during calibration: Zero or clamp GDN/Mamba outputs during the scale calibration pass so they don't poison downstream attention layer scales.
Multi-pass calibration: Run the calibration multiple times so recurrent state warms up before freezing scales.
Warn on hybrid + calculate-kv-scales: At minimum, emit a warning that scale calibration may be unreliable for hybrid models with recurrent layers.
Document the limitation: Note that --calculate-kv-scales should not be used with hybrid models and that default scales (1.0) work correctly.

Affected models

Any hybrid model with recurrent layers (GDN, Mamba, SSM) interleaved with attention layers:

Qwen3.5-35B-A3B (Qwen3_5MoeForConditionalGeneration)
Likely also: Qwen3-Next, Jamba, and other hybrid architectures

extent analysis

Fix Plan

To resolve the issue, we can implement the following fixes:

Skip recurrent layers during calibration: Modify the calc_kv_scales function to zero or clamp GDN/Mamba outputs during the scale calibration pass.
Multi-pass calibration: Run the calibration multiple times to warm up the recurrent state before freezing scales.
Warn on hybrid + calculate-kv-scales: Emit a warning when using --calculate-kv-scales with hybrid models.

Code Changes

Here's an example of how to modify the calc_kv_scales function to skip recurrent layers during calibration:

def calc_kv_scales(self, query, key, value):
    # Zero or clamp GDN/Mamba outputs during scale calibration
    if self.is_hybrid_model:
        key = torch.zeros_like(key)
        value = torch.zeros_like(value)
    
    self._k_scale.copy_(torch.abs(key).max() / self.k_range)
    self._v_scale.copy_(torch.abs(value).max() / self.v_range)

Alternatively, you can implement multi-pass calibration:

def calc_kv_scales(self, query, key, value, num_passes=3):
    for _ in range(num_passes):
        # Run the calibration pass
        self._k_scale.copy_(torch.abs(key).max() / self.k_range)
        self._v_scale.copy_(torch.abs(value).max() / self.v_range)
        
        # Update the recurrent state
        key, value = self.update_recurrent_state(key, value)

Verification

To verify that the fix worked, run the model with the modified calc_kv_scales function and check that the output is correct for various inputs.

Extra Tips

When using hybrid models with recurrent layers, it's recommended to use the default scales (1.0) instead of calculating them with --calculate-kv-scales.
Document the limitation of --calculate-kv-scales with hybrid models to avoid similar issues in the future.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #ssr #installation #tensor shape #autograd error #device allocation #model download #tokenizer error

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - ✅(Solved) Fix [Bug] --calculate-kv-scales produces corrupted FP8 KV cache on hybrid GDN+Attention models (Qwen3.5) [1 pull requests, 3 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fixed

PR fix notes

PR #37565: [Bugfix] Disable --calculate-kv-scales for hybrid GDN/Mamba+Attention…

Description (problem / solution / changelog)

Description

Changed files

Code Example

My current environment

Bug description

Minimal reproducer

Isolation evidence

Detailed symptoms

Root cause analysis

Suggested fixes

Affected models

extent analysis

Fix Plan

Code Changes

Verification

Extra Tips

Still need to ship something?

TRENDING

vllm - ✅(Solved) Fix [Bug] --calculate-kv-scales produces corrupted FP8 KV cache on hybrid GDN+Attention models (Qwen3.5) [1 pull requests, 3 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fixed

PR fix notes

PR #37565: [Bugfix] Disable --calculate-kv-scales for hybrid GDN/Mamba+Attention…

Description (problem / solution / changelog)

Description

Changed files

Code Example

My current environment

Bug description

Minimal reproducer

Isolation evidence

Detailed symptoms

Root cause analysis

Suggested fixes

Affected models

extent analysis

Fix Plan

Code Changes

Verification

Extra Tips

Still need to ship something?

RELATED_DISCOVERY

TRENDING