[Bug]: --kv-cache-dtype fp8 produces garbage output on MLA models (GLM-4.7-Flash) at multi-turn

vllm, 2026-03-31 17:50:26


GitHub stats

vllm-project/vllm#38652•Fetched 2026-04-08 01:58:45


Author: varjoranta
Participants: varjoranta
Assignees: MatthewBonanni


FP8 KV cache produces corrupted output on MLA models when the conversation exceeds a single turn. Single-turn responses are coherent; multi-turn conversations with a system prompt degrade to garbage.

Error Message

The vLLM log shows: "Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor."

Root Cause

The FLASHMLA backend appears to apply FP8 quantization to MLA's latent vectors (kv_c_normed) without calibrated per-tensor scaling. As the KV cache grows with conversation turns, the quantization error appears to compound. For comparison, TurboQuant+ (PolarQuant) compression on the same model scores 4.63/5 across the same scenarios, which the reporter attributes to normalizing each vector independently before quantization so that error does not accumulate.
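To illustrate why the scaling factor matters, here is a toy NumPy sketch. It is not the FLASHMLA kernel: `fake_fp8_e4m3` is a hypothetical software emulation of E4M3 rounding, the latent magnitudes are made up, and the scale of 1.0 for the uncalibrated case is an assumption. It compares quantizing with one fixed scale against rescaling each vector into the FP8 range first, in the spirit of the per-vector normalization mentioned above.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format


def fake_fp8_e4m3(x):
    """Round values to the nearest FP8 E4M3 number (software emulation)."""
    x = np.clip(np.asarray(x, dtype=np.float64), -E4M3_MAX, E4M3_MAX)
    _, exp = np.frexp(x)            # x = m * 2**exp with 0.5 <= |m| < 1
    exp = np.maximum(exp, -5)       # below 2**-6 the format is subnormal
    step = np.ldexp(1.0, exp - 4)   # spacing for 3 stored mantissa bits
    return np.round(x / step) * step


def quantize_fixed_scale(x, scale=1.0):
    """One scale for the whole tensor (1.0 = uncalibrated, an assumption)."""
    return fake_fp8_e4m3(x / scale) * scale


def quantize_per_vector(x):
    """Rescale each row so its largest entry maps to the FP8 maximum."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / E4M3_MAX
    scale = np.where(scale == 0.0, 1.0, scale)  # avoid dividing by zero
    return fake_fp8_e4m3(x / scale) * scale


def relative_error(x, x_hat):
    """Frobenius-norm error relative to the original tensor."""
    return float(np.linalg.norm(x - x_hat) / np.linalg.norm(x))


rng = np.random.default_rng(0)
latents = rng.standard_normal((64, 512)) * 1e-3  # made-up small magnitudes

err_fixed = relative_error(latents, quantize_fixed_scale(latents))
err_vector = relative_error(latents, quantize_per_vector(latents))
print(f"fixed scale 1.0: {err_fixed:.3f}   per-vector: {err_vector:.3f}")
```

With these made-up inputs the fixed-scale path loses much of the signal because the values sit below FP8's resolution, while the per-vector path keeps the error small. Real kv_c_normed magnitudes may differ, so this shows only the mechanism, not the bug itself.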


Your current environment

vLLM 0.18.1
H100 80GB SXM5, CUDA 12.8
Ubuntu 24.04
Model: zai-org/GLM-4.7-Flash (MLA architecture, glm4_moe_lite)
Backend selected: FLASHMLA


Reproduction

vllm serve zai-org/GLM-4.7-Flash --kv-cache-dtype fp8 --trust-remote-code --max-model-len 8192

Single turn (works fine):

{"messages": [{"role": "user", "content": "Say hello in Finnish"}]}
→ coherent response about "Hei" / "Terve"

Multi-turn with system prompt (broken):

{"messages": [
  {"role": "system", "content": "You are a helpful sales assistant for a SaaS product."},
  {"role": "user", "content": "What pricing plans do you offer?"}
]}
→ "...a S componentes_obs Worce!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
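The two requests above can be scripted against the server's OpenAI-compatible endpoint. The following is a minimal sketch using only the standard library. It assumes the server listens on the default `http://localhost:8000`; `looks_degenerate` is a hypothetical heuristic added here for illustration, not part of vLLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1/chat/completions"  # assumed default
MODEL = "zai-org/GLM-4.7-Flash"

SINGLE_TURN = [{"role": "user", "content": "Say hello in Finnish"}]
WITH_SYSTEM_PROMPT = [
    {"role": "system",
     "content": "You are a helpful sales assistant for a SaaS product."},
    {"role": "user", "content": "What pricing plans do you offer?"},
]


def build_payload(messages, max_tokens=128):
    """Request body for the OpenAI-compatible chat completions endpoint."""
    return {"model": MODEL, "messages": messages, "max_tokens": max_tokens}


def chat(messages, url=BASE_URL):
    """POST one chat request and return the assistant's reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(messages)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"] or ""


def looks_degenerate(text, run_length=20):
    """Hypothetical heuristic: flag a long run of one repeated character."""
    run = 1
    for previous, current in zip(text, text[1:]):
        run = run + 1 if current == previous else 1
        if run >= run_length:
            return True
    return False


def main():
    """Send both requests; needs the vLLM server from the command above."""
    for name, messages in [("single turn", SINGLE_TURN),
                           ("with system prompt", WITH_SYSTEM_PROMPT)]:
        reply = chat(messages)
        print(f"{name}: degenerate={looks_degenerate(reply)}\n{reply}\n")
```

Call `main()` once the server is running. The repeated-character check matches the trailing run of exclamation marks in the broken output, but it would miss other kinds of corruption.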

Benchmark data

Tested across 20 multi-turn conversation scenarios, scored 1-5 by a Llama-3.3-70B judge:

Model                KV Cache          Avg Score
GLM-4.7-Flash BF16   FP16 (baseline)   4.61
GLM-4.7-Flash BF16   FP8               1.07
Qwen3-235B AWQ       FP16 (baseline)   4.74
Qwen3-235B AWQ       FP8               4.71

FP8 works fine on the standard (non-MLA) attention model tested, Qwen3-235B. Of the two models tested, only the MLA model (GLM-4.7-Flash) is affected.
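To put the scores above on a common footing, this short sketch computes the absolute and relative drop from the FP16 baseline to FP8 for each model. The numbers are copied from the table; the helper name `score_drop` is only illustrative.

```python
# Average judge scores copied from the benchmark table.
scores = {
    "GLM-4.7-Flash BF16": {"FP16": 4.61, "FP8": 1.07},
    "Qwen3-235B AWQ": {"FP16": 4.74, "FP8": 4.71},
}


def score_drop(baseline, quantized):
    """Return (absolute drop, drop as a fraction of the baseline)."""
    drop = baseline - quantized
    return drop, drop / baseline


drops = {m: score_drop(kv["FP16"], kv["FP8"]) for m, kv in scores.items()}
for model, (absolute, relative) in drops.items():
    print(f"{model}: -{absolute:.2f} points ({relative:.1%} of baseline)")
```

The MLA model loses 3.54 points, roughly three quarters of its baseline score, while the Qwen3 model loses 0.03 points, under 1%.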


Before submitting a new issue...

I have searched existing issues for "fp8 MLA" and "fp8 kv cache MLA" and found no duplicates.

extent analysis

TL;DR

Applying a proper scaling factor or using a different quantization method, such as TurboQuant+, may resolve the corrupted output issue with FP8 KV cache on MLA models.

Guidance

Investigate the implementation of FP8 quantization in the FLASHMLA backend to determine whether a calibrated per-tensor scaling factor can be applied to prevent quantization error from accumulating.
Compare the output quality of other quantization methods, such as TurboQuant+ (PolarQuant) compression, on the same scenarios.
Consider keeping the KV cache in 16-bit precision (for example the default --kv-cache-dtype auto, which uses the model's dtype) if the memory savings of FP8 are not critical.
Check the vLLM startup log for the FP8 KV cache warning about a missing scaling factor, and confirm which attention backend was selected (FLASHMLA in this report).
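For an A/B comparison along the lines of the guidance above, the launch command from the report can be generated with only the KV cache dtype changed. This is a small sketch; `serve_command` is a hypothetical helper, and it uses only the flags shown in the report plus `auto`, the default value of `--kv-cache-dtype`.

```python
import shlex

MODEL = "zai-org/GLM-4.7-Flash"


def serve_command(kv_cache_dtype):
    """Build the `vllm serve` argv from the report, varying only the KV dtype."""
    return [
        "vllm", "serve", MODEL,
        "--kv-cache-dtype", kv_cache_dtype,
        "--trust-remote-code",
        "--max-model-len", "8192",
    ]


# "fp8" is the failing configuration from the report; "auto" keeps the
# KV cache in the model's own dtype and serves as the 16-bit baseline.
for dtype in ("fp8", "auto"):
    print(shlex.join(serve_command(dtype)))
```

Running the same prompts against each configuration isolates the KV cache dtype as the only variable.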


Notes

The issue appears to be specific to MLA models and may not affect other models, such as standard FlashAttention models.
The use of FP8 quantization may require additional calibration or scaling to achieve accurate results.

Recommendation

Apply workaround: use a different KV cache compression method such as TurboQuant+ (PolarQuant), which the reporter measured at 4.63/5 on the same scenarios (against 4.61 for the FP16 baseline), or keep the KV cache in 16-bit precision until FP8 scaling on MLA models is fixed.
