vllm: [Bug]: --kv-cache-dtype fp8 produces garbage output on MLA models (GLM-4.7-Flash) at multi-turn

GitHub stats
vllm-project/vllm#38652 (fetched 2026-04-08 01:58:45)
Comments: 0
Participants: 1
Timeline: 3
Reactions: 1
Timeline (top): subscribed ×2, assigned ×1

Your current environment

  • vLLM 0.18.1
  • H100 80GB SXM5, CUDA 12.8
  • Ubuntu 24.04
  • Model: zai-org/GLM-4.7-Flash (MLA architecture, glm4_moe_lite)
  • Backend selected: FLASHMLA

Description

FP8 KV cache produces corrupted output on MLA models when the conversation exceeds a single turn. Single-turn responses are coherent, multi-turn with system prompts degrades to garbage.

Reproduction

vllm serve zai-org/GLM-4.7-Flash --kv-cache-dtype fp8 --trust-remote-code --max-model-len 8192

Single turn (works fine):

{"messages": [{"role": "user", "content": "Say hello in Finnish"}]}
→ coherent response about "Hei" / "Terve"

Multi-turn with system prompt (broken):

{"messages": [
  {"role": "system", "content": "You are a helpful sales assistant for a SaaS product."},
  {"role": "user", "content": "What pricing plans do you offer?"}
]}
→ "...a S componentes_obs Worce!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"

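The two requests above can be replayed with a small script. This is a minimal sketch, not part of the original report: it assumes the server is reachable at vLLM's default OpenAI-compatible address (http://localhost:8000), and the helper names, `max_tokens`, and `temperature` values are illustrative choices.

```python
import json
import urllib.request

# Assumption: the `vllm serve` command above is running on the default host/port.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "zai-org/GLM-4.7-Flash"

# The two message lists from the reproduction, copied verbatim.
SINGLE_TURN = [{"role": "user", "content": "Say hello in Finnish"}]
WITH_SYSTEM_PROMPT = [
    {"role": "system", "content": "You are a helpful sales assistant for a SaaS product."},
    {"role": "user", "content": "What pricing plans do you offer?"},
]


def build_payload(messages, max_tokens=128):
    """Build a chat-completions request body (max_tokens/temperature are assumptions)."""
    return {
        "model": MODEL,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # keep runs comparable between KV cache settings
    }


def ask(messages, url=BASE_URL, timeout=120):
    """POST one request to the running server and return the reply text."""
    body = json.dumps(build_payload(messages)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# Usage (requires the server to be running):
#   print(ask(SINGLE_TURN))
#   print(ask(WITH_SYSTEM_PROMPT))
```

Running the same script against a server started with and without --kv-cache-dtype fp8 gives a side-by-side comparison of the two prompts.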
Benchmark data

Tested across 20 multi-turn conversation scenarios, each scored 1-5 by a Llama-3.3-70B judge:

Model                 KV Cache           Avg Score
GLM-4.7-Flash BF16    FP16 (baseline)    4.61
GLM-4.7-Flash BF16    FP8                1.07
Qwen3-235B AWQ        FP16 (baseline)    4.74
Qwen3-235B AWQ        FP8                4.71

In these tests FP8 works fine on a standard FlashAttention model (Qwen3-235B). Only the MLA model (GLM-4.7-Flash) is affected.
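To put the gap in numbers, the drop from baseline can be computed directly from the reported averages. This is plain arithmetic on the four scores above; the helper name is illustrative.

```python
# Average judge scores copied from the benchmark data above.
scores = {
    "GLM-4.7-Flash BF16": {"baseline": 4.61, "fp8": 1.07},
    "Qwen3-235B AWQ": {"baseline": 4.74, "fp8": 4.71},
}


def score_drop(baseline, fp8):
    """Return (absolute drop, drop as a percentage of the baseline score)."""
    drop = baseline - fp8
    return drop, 100.0 * drop / baseline


for model, s in scores.items():
    drop, pct = score_drop(s["baseline"], s["fp8"])
    print(f"{model}: -{drop:.2f} points ({pct:.1f}% of baseline)")
```

GLM-4.7-Flash loses 3.54 points (about 76.8% of its baseline), while Qwen3-235B loses 0.03 points (about 0.6%). Because the judge scale starts at 1, an average of 1.07 is close to the lowest score possible.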

Analysis

The vLLM log shows: "Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor."

The FLASHMLA backend appears to apply FP8 quantization to MLA's latent vectors (kv_c_normed) without calibrated per-tensor scaling. As the KV cache grows with conversation turns, quantization error compounds.

For comparison, TurboQuant+ (PolarQuant) compression on the same model scores 4.63/5 across the same scenarios, because it normalizes each vector independently before quantization, preventing error accumulation.
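The effect of the scaling factor can be illustrated with a toy simulation. The sketch below is not vLLM's or FLASHMLA's code: it rounds Python floats to the FP8 E4M3 grid (3 mantissa bits, maximum 448) and compares no scaling, a shared per-tensor scale, and a per-vector scale. The input vectors are made-up numbers, and the sketch measures only single-step rounding error, not accumulation across turns.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 (fn) format


def round_to_e4m3(x):
    """Round a float to the nearest simulated FP8 E4M3 value, saturating at the max."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)   # saturate instead of overflowing
    _, exp = math.frexp(a)      # a = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)        # -6 is the smallest normal exponent; below it, subnormals
    step = 2.0 ** (e - 3)       # 3 mantissa bits give 8 steps per power of two
    return sign * min(round(a / step) * step, E4M3_MAX)


def quant_dequant(values, scale):
    """Quantize values / scale to simulated FP8, then multiply back by scale."""
    return [round_to_e4m3(v / scale) * scale for v in values]


def mean_abs_error(values, scale):
    """Mean absolute round-trip error for the given scale."""
    restored = quant_dequant(values, scale)
    return sum(abs(v - r) for v, r in zip(values, restored)) / len(values)


def amax_scale(values):
    """Scale that maps the largest magnitude onto the FP8 maximum."""
    return max(abs(v) for v in values) / E4M3_MAX


# Hypothetical latent vectors with very different magnitudes (illustrative only).
small = [0.003, -0.0012, 0.0007, 0.0041]
large = [310.0, -95.0, 620.0, 12.5]

for name, vec in [("small", small), ("large", large)]:
    print(name,
          "| unscaled:", mean_abs_error(vec, 1.0),
          "| per-vector scale:", mean_abs_error(vec, amax_scale(vec)))

# One scale shared by both vectors, chosen from the combined maximum.
per_tensor = amax_scale(small + large)
print("small | shared per-tensor scale:", mean_abs_error(small, per_tensor))
```

In this toy setup the unscaled path either flushes small values toward zero or saturates large ones at 448, and a single shared scale still leaves the small vector poorly resolved. Scaling each vector independently keeps the error low for both, which matches the reasoning given above for the per-vector normalization approach.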

Before submitting a new issue...

  • I have searched existing issues for "fp8 MLA" and "fp8 kv cache MLA" and found no duplicates.

extent analysis

TL;DR

  • Applying a proper scaling factor or using a different quantization method, such as TurboQuant+, may resolve the corrupted output issue with FP8 KV cache on MLA models.

Guidance

  • Investigate the implementation of FP8 quantization in the FLASHMLA backend to determine if a calibrated per-tensor scaling factor can be applied to prevent quantization error accumulation.
  • Compare the performance of different quantization methods, such as TurboQuant+ (PolarQuant) compression, to see if they can achieve better results without compromising accuracy.
  • Consider using a different data type, such as FP16, for the KV cache if the performance benefits of FP8 are not critical.
  • Review the vLLM log messages to ensure that the FP8 data type is being used correctly and that any potential issues are being addressed.
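One way to act on the FP16 suggestion above is an A/B launch that changes only the KV cache dtype. The sketch below builds both command lines from the flags used in the reproduction. The helper name is hypothetical, and "auto" is used on the assumption that it stores the KV cache in the model's own dtype, so check it against the installed vLLM version.

```python
import shlex


def build_serve_command(kv_cache_dtype, model="zai-org/GLM-4.7-Flash", max_model_len=8192):
    """Return the argv list for a `vllm serve` launch with the given KV cache dtype."""
    return [
        "vllm", "serve", model,
        "--kv-cache-dtype", kv_cache_dtype,
        "--trust-remote-code",
        "--max-model-len", str(max_model_len),
    ]


# "fp8" reproduces the issue; "auto" is the assumed unquantized fallback.
for dtype in ("fp8", "auto"):
    print(shlex.join(build_serve_command(dtype)))
```

If output is coherent with "auto" and corrupted with "fp8" on the same prompts, the KV cache dtype is isolated as the trigger.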

Example

  • No code snippet is provided as the issue does not contain sufficient information to create a specific example.

Notes

  • The issue appears to be specific to MLA models and may not affect other models, such as standard FlashAttention models.
  • The use of FP8 quantization may require additional calibration or scaling to achieve accurate results.

Recommendation

  • Apply workaround: Use a different KV cache compression method, such as TurboQuant+ (PolarQuant), which the reporter measured at 4.63/5 on the same scenarios (FP16 baseline: 4.61).
