vllm - ✅(Solved) Fix [Bug]: Accuracy Issue with FlashMLA Sparse on DeepSeek V3.2 [1 pull requests, 5 comments, 2 participants]

GitHub stats
vllm-project/vllm#36524 · Fetched 2026-04-08 00:36:26
Comments: 5
Participants: 2
Timeline: 14
Reactions: 0
Timeline (top): commented ×5, subscribed ×4, mentioned ×3, cross-referenced ×1

Fix Action

Fixed

PR fix notes

PR #36616: [Bugfix] Fix FlashMLA sparse accuracy with topk_length and zero-init padding

Description (problem / solution / changelog)

Pass topk_length to flash_mla_sparse_fwd for precise attention masking and use new_zeros instead of new_empty for BF16 head padding.
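
To make the masking part concrete, here is a minimal, framework-free sketch of why a kernel needs to know how many of the top-k slots are valid. It is not the FlashMLA kernel or the vLLM code: `masked_softmax` and the toy scores are hypothetical, and the assumption is that slots beyond the valid length are padding that must receive zero attention weight.

```python
import math


def masked_softmax(scores, valid_length):
    """Softmax over the first `valid_length` scores; padded slots get weight 0.

    Hypothetical helper for illustration only.
    """
    if valid_length <= 0:
        return [0.0] * len(scores)
    valid = scores[:valid_length]
    peak = max(valid)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in valid]
    total = sum(exps)
    padding = [0.0] * (len(scores) - valid_length)
    return [e / total for e in exps] + padding


# Two real entries followed by two padded slots holding arbitrary values.
scores = [1.0, 1.0, 5.0, 5.0]

unmasked = masked_softmax(scores, len(scores))  # padding treated as real
masked = masked_softmax(scores, 2)              # padding excluded

print("unmasked:", [round(p, 4) for p in unmasked])
print("masked:  ", [round(p, 4) for p in masked])
```

Without the length, the padded slots absorb most of the attention weight in this toy case; with it, the weight is split over the two real entries only.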

Closes #36524

Changed files

  • vllm/v1/attention/backends/mla/flashmla_sparse.py (modified, +22/-7)

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Your output of `python collect_env.py` here
</details>

🐛 Describe the bug

I ran some evals for DeepSeek V3.2 comparing FlashMLA and FlashInfer on both bf16 and fp8, and found that FlashMLA has noticeably worse accuracy than FlashInfer, particularly at fp8. Note that FlashMLA fp8 uses fp8_ds_mla, whereas FlashInfer uses the standard fp8 format, which could be the reason for its lower accuracy. However, even for bf16, FlashMLA is no better than FlashInfer across the board. This may indicate a bug in the FlashMLA integration.

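| KV Cache   | Backend    | Benchmark    | pass@1 (avg-32) | majority@32 | pass@32 |
|------------|------------|--------------|-----------------|-------------|---------|
| bf16       | FlashMLA   | AIME25       | 88.33%          | 93.33%      | 93.33%  |
| bf16       | FlashMLA   | GPQA-diamond | 82.89%          | 85.86%      | 96.46%  |
| bf16       | FlashInfer | AIME25       | 90.83%          | 93.33%      | 100.00% |
| bf16       | FlashInfer | GPQA-diamond | 83.14%          | 87.63%      | 97.47%  |
| fp8_ds_mla | FlashMLA   | AIME25       | 81.98%          | 88.33%      | 90.00%  |
| fp8_ds_mla | FlashMLA   | GPQA-diamond | 78.55%          | 81.82%      | 94.95%  |
| fp8        | FlashInfer | AIME25       | 89.17%          | 93.33%      | 100.00% |
| fp8        | FlashInfer | GPQA-diamond | 83.55%          | 86.36%      | 96.46%  |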
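
The size of the gap is easier to read as a per-row difference. The sketch below computes FlashMLA minus FlashInfer for pass@1 (avg-32), using only the numbers reported in the table above. The dictionary layout and the `backend_gaps` helper are illustrative choices, and the "fp8" rows compare FlashMLA's fp8_ds_mla against FlashInfer's standard fp8.

```python
# pass@1 (avg-32) percentages copied from the table in the issue.
pass_at_1 = {
    ("bf16", "AIME25"): {"FlashMLA": 88.33, "FlashInfer": 90.83},
    ("bf16", "GPQA-diamond"): {"FlashMLA": 82.89, "FlashInfer": 83.14},
    # FlashMLA uses fp8_ds_mla here; FlashInfer uses the standard fp8 format.
    ("fp8", "AIME25"): {"FlashMLA": 81.98, "FlashInfer": 89.17},
    ("fp8", "GPQA-diamond"): {"FlashMLA": 78.55, "FlashInfer": 83.55},
}


def backend_gaps(results):
    """Return FlashMLA minus FlashInfer, in percentage points, per row."""
    return {
        key: round(row["FlashMLA"] - row["FlashInfer"], 2)
        for key, row in results.items()
    }


gaps = backend_gaps(pass_at_1)
for (precision, benchmark), gap in gaps.items():
    print(f"{precision:>4}  {benchmark:<13} {gap:+.2f} pts")
```

Every row is negative, and the fp8 rows show gaps several points larger than the bf16 rows, which matches the description in the issue.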

extent analysis

Fix Plan

To address the accuracy gap, the fix targets the FlashMLA sparse integration in vllm/v1/attention/backends/mla/flashmla_sparse.py rather than the KV-cache format. The fp8_ds_mla format may still account for part of the fp8 gap, but it would not explain why bf16 also fails to match FlashInfer.

Here are the steps to take, following PR #36616:

  • Pass topk_length to flash_mla_sparse_fwd so that attention is masked precisely.
  • Allocate the padded BF16 head buffer with new_zeros instead of new_empty, so that the padding holds zeros rather than uninitialized memory.
  • Treat the fp8_ds_mla versus standard fp8 difference as a separate question; the PR description does not mention changing the cache format.

Example code snippet illustrating the zero-initialized padding (a minimal sketch; pad_heads and the shapes used are hypothetical, not the code from the PR):

# Import necessary libraries
import torch

# Hypothetical helper: pad the head dimension of a query tensor.
# The name, signature, and shapes are assumptions for illustration.
def pad_heads(q, padded_heads):
    num_tokens, num_heads, head_dim = q.shape
    if num_heads >= padded_heads:
        return q
    # new_zeros (not new_empty): the extra heads must hold zeros,
    # because new_empty returns uninitialized memory
    q_padded = q.new_zeros((num_tokens, padded_heads, head_dim))
    q_padded[:, :num_heads] = q
    return q_padded

# Example usage with toy shapes
q = torch.randn(4, 16, 32, dtype=torch.bfloat16)
q_padded = pad_heads(q, 64)
assert q_padded.shape == (4, 64, 32)
assert torch.all(q_padded[:, 16:] == 0)

Verification

To verify that the fix worked, re-run the AIME25 and GPQA-diamond evaluations for FlashMLA at both bf16 and fp8_ds_mla with the updated integration, then compare pass@1 (avg-32), majority@32, and pass@32 against the FlashInfer results.

Extra Tips

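  • Keep the evaluation setup identical across backends (same benchmarks and sampling budget), so that any difference reflects the attention backend and not the eval configuration.
  • When comparing fp8 results, remember that FlashMLA uses fp8_ds_mla while FlashInfer uses the standard fp8 format, so part of any remaining gap may come from the format and not from a bug.
  • Consider adding logging or debugging statements to help identify any issues that arise while applying the update.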
