vllm - ✅(Solved) Fix [Bug]: Accuracy Issue with FlashMLA Sparse on DeepSeek V3.2 [1 pull requests, 5 comments, 2 participants]

vllm · 2026-03-09 17:16:44


GitHub stats

vllm-project/vllm#36524 • Fetched 2026-04-08 00:36:26


Author

wzhao18

Participants

LopezCastroRoberto

wzhao18

Timeline (top)

commented ×5, subscribed ×4, mentioned ×3, cross-referenced ×1

Fix Action

Fixed

Fixed by PR: [Bugfix] Fix FlashMLA sparse accuracy with topk_length and zero-init padding (https://github.com/vllm-project/vllm/pull/36616)

PR fix notes

PR #36616: [Bugfix] Fix FlashMLA sparse accuracy with topk_length and zero-init padding

Repository: vllm-project/vllm
Author: AjAnubolu
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/36616

Description (problem / solution / changelog)

Pass topk_length to flash_mla_sparse_fwd for precise attention masking and use new_zeros instead of new_empty for BF16 head padding.

Closes #36524
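To make the zero-init part of the description concrete, here is a minimal PyTorch sketch of head padding. The helper name `pad_heads`, the tensor layout (tokens, heads, dim), and the sizes are assumptions for illustration; this is not the code from the PR. `new_empty` returns uninitialized memory, so any padded heads that are not overwritten hold arbitrary values, whereas `new_zeros` guarantees they are zero.

```python
import torch


def pad_heads(q: torch.Tensor, padded_heads: int, zero_init: bool = True) -> torch.Tensor:
    """Pad the head dimension of q (tokens, heads, dim) up to padded_heads.

    Hypothetical helper for illustration only; not taken from vLLM.
    """
    tokens, heads, dim = q.shape
    if heads >= padded_heads:
        return q
    # new_zeros fills the buffer with zeros; new_empty leaves it uninitialized.
    alloc = q.new_zeros if zero_init else q.new_empty
    padded = alloc((tokens, padded_heads, dim))
    padded[:, :heads] = q  # copy the real heads; the rest is padding
    return padded


q = torch.randn(4, 16, 8, dtype=torch.bfloat16)
padded = pad_heads(q, padded_heads=64)
print(padded.shape, padded[:, 16:].abs().sum().item())
```

With `zero_init=False` the padded region is whatever happened to be in memory, which is why results can differ from run to run if a kernel reads those heads.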

Changed files

vllm/v1/attention/backends/mla/flashmla_sparse.py (modified, +22/-7)


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

Your output of `python collect_env.py` here

</details>

🐛 Describe the bug

I ran evals for DeepSeek V3.2 comparing FlashMLA and FlashInfer with both bf16 and fp8 KV caches, and found that FlashMLA has noticeably worse accuracy than FlashInfer, particularly at fp8. Note that FlashMLA fp8 uses the fp8_ds_mla format whereas FlashInfer uses the standard fp8 format, which could explain the lower fp8 accuracy. However, even at bf16, FlashMLA is no better than FlashInfer on any metric. This may indicate a bug in the FlashMLA integration.

KV Cache	Backend	Benchmark	pass@1 (avg-32)	majority@32	pass@32
bf16	FlashMLA	AIME25	88.33%	93.33%	93.33%
bf16	FlashMLA	GPQA-diamond	82.89%	85.86%	96.46%
bf16	FlashInfer	AIME25	90.83%	93.33%	100.00%
bf16	FlashInfer	GPQA-diamond	83.14%	87.63%	97.47%
fp8_ds_mla	FlashMLA	AIME25	81.98%	88.33%	90.00%
fp8_ds_mla	FlashMLA	GPQA-diamond	78.55%	81.82%	94.95%
fp8	FlashInfer	AIME25	89.17%	93.33%	100.00%
fp8	FlashInfer	GPQA-diamond	83.55%	86.36%	96.46%
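To quantify the gap, the pass@1 differences can be computed directly from the table. The snippet below only restates the reported numbers; the dictionary layout and the `gaps` helper are illustrative choices, and the "fp8" key groups the fp8_ds_mla (FlashMLA) and standard fp8 (FlashInfer) rows together.

```python
# pass@1 (avg-32) values copied from the table above, in percent.
# The "fp8" rows compare fp8_ds_mla (FlashMLA) against standard fp8 (FlashInfer).
pass_at_1 = {
    ("bf16", "AIME25"): {"FlashMLA": 88.33, "FlashInfer": 90.83},
    ("bf16", "GPQA-diamond"): {"FlashMLA": 82.89, "FlashInfer": 83.14},
    ("fp8", "AIME25"): {"FlashMLA": 81.98, "FlashInfer": 89.17},
    ("fp8", "GPQA-diamond"): {"FlashMLA": 78.55, "FlashInfer": 83.55},
}


def gaps(scores):
    """Return FlashInfer minus FlashMLA, in percentage points, per row."""
    return {
        key: round(row["FlashInfer"] - row["FlashMLA"], 2)
        for key, row in scores.items()
    }


for (cache, bench), gap in gaps(pass_at_1).items():
    print(f"{cache:5s} {bench:13s} FlashInfer - FlashMLA = {gap:.2f} points")
```

The bf16 gaps are 2.50 and 0.25 points, while the fp8 gaps are 7.19 and 5.00 points, which matches the observation that the difference is most pronounced at fp8.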

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

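The accuracy gap shows up for FlashMLA at both bf16 and fp8, so the fix targets the FlashMLA sparse integration itself. According to the description of PR #36616, the change is confined to vllm/v1/attention/backends/mla/flashmla_sparse.py.

Here are the steps to take:

Pass topk_length to flash_mla_sparse_fwd so that attention masking is precise.
Use new_zeros instead of new_empty when padding heads on the BF16 path, so that the padding is zero-initialized.
Separately, review the fp8_ds_mla format used by FlashMLA against the standard fp8 format used by FlashInfer. The issue suggests this difference may account for part of the fp8 gap, and the PR description does not address it.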

Example code snippet for the weight-dtype side of the configuration (a generic PyTorch sketch, not the change made in the PR):

# Import necessary libraries
import torch

# NOTE: casting model weights does not select the attention backend or the
# KV-cache format (fp8 vs. fp8_ds_mla); the serving engine chooses those.
def update_model_config(model, backend, precision):
    if backend == "FlashMLA" and precision == "bf16":
        # Cast the weights to bfloat16
        return model.bfloat16()
    # "fp8" here names a KV-cache format, not a weight dtype. Calling
    # model.half() would produce float16, so the weights are left unchanged.
    return model

# Example usage
model = torch.nn.Linear(4, 4)  # Stand-in for a real model
model = update_model_config(model, "FlashMLA", "bf16")
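
The masking half of the PR description can be illustrated with a toy model of sparse attention for a single query. This sketch assumes that topk_length is the number of valid leading entries in a fixed-width row of selected KV indices, and that unused slots are filled with index 0; both are assumptions made for illustration, and `selected_weights` is a hypothetical helper, not a FlashMLA or vLLM function.

```python
import math


def selected_weights(scores, indices, topk_length=None):
    """Softmax over the KV positions listed in indices.

    scores:      one logit per KV position for a single query
    indices:     fixed-width row of selected positions (may contain padding)
    topk_length: number of valid leading entries; None treats every slot as valid
    """
    valid = indices if topk_length is None else indices[:topk_length]
    logits = [scores[i] for i in valid]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(valid, exps)]


scores = [2.0, 1.0, 0.5, 3.0]
indices = [3, 1, 0, 0]  # two real selections followed by two padding slots
unmasked = selected_weights(scores, indices)
masked = selected_weights(scores, indices, topk_length=2)
print(unmasked)
print(masked)
```

Without the length, the padding slots are attended as if position 0 had been selected twice, which dilutes the weight on the positions that were actually chosen. Passing the length restricts the softmax to the real selections.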

Verification

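To verify that the fix worked, re-run the AIME25 and GPQA-diamond evaluations for FlashMLA with both the bf16 and fp8_ds_mla KV caches, then compare pass@1 (avg-32), majority@32, and pass@32 against the FlashInfer results in the table from the issue.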
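
When comparing runs, it helps to compute the three metrics the same way each time. The issue does not define them, so the sketch below assumes the usual meanings: pass@1 (avg-k) is the mean per-sample accuracy, majority@k scores the most common answer per problem, and pass@k counts a problem as solved if any sample is correct. The `summarize` helper and the sample data are hypothetical.

```python
from collections import Counter


def summarize(samples, references):
    """Compute pass@1 (avg-k), majority@k and pass@k as fractions.

    samples:    one list of k sampled answers per problem
    references: the correct answer for each problem
    """
    n = len(samples)
    pass_1 = majority = pass_k = 0.0
    for answers, ref in zip(samples, references):
        correct = [a == ref for a in answers]
        pass_1 += sum(correct) / len(answers)          # mean accuracy over samples
        top_answer = Counter(answers).most_common(1)[0][0]
        majority += top_answer == ref                  # majority vote is correct
        pass_k += any(correct)                         # at least one sample is correct
    return {"pass@1": pass_1 / n, "majority@k": majority / n, "pass@k": pass_k / n}


samples = [["A", "A", "B"], ["C", "D", "D"]]
references = ["A", "C"]
metrics = summarize(samples, references)
print(metrics)
```

Running the same summary over the outputs of each backend and KV cache format gives numbers that can be placed side by side with the table in the issue.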

Extra Tips

Compare like with like: the fp8 rows in the issue use different KV cache formats (fp8_ds_mla for FlashMLA, standard fp8 for FlashInfer), so a remaining fp8 gap does not by itself point to an integration bug.
Re-check both bf16 and fp8 after any change, since the issue reports a gap at both precisions.
Consider adding logging or debugging statements around the FlashMLA sparse path to help identify any issues that arise while testing the change.
