vllm - 💡(How to fix) Fix [Bug]: NVFP4 + MLA error during processing [3 comments, 2 participants]

GitHub stats
vllm-project/vllm#38439 · Fetched 2026-04-08 01:45:40
Comments: 3 · Participants: 2 · Timeline events: 9 · Reactions: 0
Timeline (top): commented ×3, renamed ×2, closed ×1, labeled ×1


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

N/A

</details>

🐛 Describe the bug

I believe that #33972 has broken models with MLA, as MLA attention processing assumes that the fp4 marlin linear can be run with weight_global_scale.dtype=torch.float32 and eye bfloat16.

Replication:

from vllm import LLM
llm = LLM("inference-optimization/DeepSeek-V3-debug-empty-NVFP4A16")
(EngineCore pid=3426083)   File "/home/kylesayrs/vllm/vllm/model_executor/layers/attention/mla_attention.py", line 714, in process_weights_after_loading
(EngineCore pid=3426083)     kv_b_proj_weight = get_and_maybe_dequant_weights(
(EngineCore pid=3426083)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3426083)   File "/home/kylesayrs/vllm/vllm/model_executor/layers/quantization/utils/quant_utils.py", line 390, in get_and_maybe_dequant_weights
(EngineCore pid=3426083)     dequant_weights = layer.quant_method.apply(layer, eye, bias=None).to(out_dtype)
(EngineCore pid=3426083)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3426083)   File "/home/kylesayrs/vllm/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 921, in apply
(EngineCore pid=3426083)     return scheme.apply_weights(layer, x, bias=bias)
(EngineCore pid=3426083)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3426083)   File "/home/kylesayrs/vllm/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a16_nvfp4.py", line 100, in apply_weights
(EngineCore pid=3426083)     return apply_fp4_marlin_linear(
(EngineCore pid=3426083)            ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3426083)   File "/home/kylesayrs/vllm/vllm/model_executor/layers/quantization/utils/marlin_utils_fp4.py", line 180, in apply_fp4_marlin_linear
(EngineCore pid=3426083)     output = ops.marlin_gemm(
(EngineCore pid=3426083)              ^^^^^^^^^^^^^^^^
(EngineCore pid=3426083)   File "/home/kylesayrs/vllm/vllm/_custom_ops.py", line 1301, in marlin_gemm
(EngineCore pid=3426083)     return torch.ops._C.marlin_gemm(
(EngineCore pid=3426083)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3426083)   File "/home/kylesayrs/vllm/env/lib/python3.12/site-packages/torch/_ops.py", line 1209, in __call__
(EngineCore pid=3426083)     return self._op(*args, **kwargs)
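
The traceback passes through `layer.quant_method.apply(layer, eye, bias=None)`, which recovers dense weights by pushing an identity matrix through the quantized linear layer. The sketch below illustrates that trick only. It is dependency-free and does not use vLLM: `ToyQuantLinear` and `dequant_via_identity` are hypothetical names, and the single-scale integer quantizer is a stand-in, not NVFP4.

```python
class ToyQuantLinear:
    """Hypothetical quantized layer: integer codes plus one shared scale."""

    def __init__(self, codes, scale):
        self.codes = codes  # out_features x in_features, small integers
        self.scale = scale  # W = codes * scale

    def apply(self, x):
        # Computes y = x @ W^T without ever materializing W.
        return [
            [sum(xi * c * self.scale for xi, c in zip(row, w_row))
             for w_row in self.codes]
            for row in x
        ]


def dequant_via_identity(layer, in_features):
    # Feeding the identity gives eye @ W^T == W^T, so the output is the
    # transposed dense weight matrix.
    eye = [[1.0 if i == j else 0.0 for j in range(in_features)]
           for i in range(in_features)]
    w_t = layer.apply(eye)                   # in_features x out_features
    return [list(col) for col in zip(*w_t)]  # back to out x in


layer = ToyQuantLinear(codes=[[1, -2, 3], [0, 4, -1]], scale=0.5)
dequantized = dequant_via_identity(layer, in_features=3)
```

Because the identity matrix goes through the same kernel as ordinary activations, it has to satisfy whatever dtype constraints that kernel places on its inputs, which is where the reported failure arises.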


extent analysis

Fix Plan

To fix the issue, we need to ensure that the weight_global_scale.dtype is compatible with the bfloat16 data type used in the MLA attention processing.

Here are the proposed steps. The snippets are sketches: their signatures are simplified and may not match the current vLLM source. Note that the traceback ends inside ops.marlin_gemm, so the dtypes have to agree before that call; a cast applied to the returned tensor cannot prevent the error.

  • Modify the get_and_maybe_dequant_weights function in quant_utils.py so that the result is cast to the requested dtype exactly once, after the quantized matmul:

```python
def get_and_maybe_dequant_weights(layer, eye, bias=None, out_dtype=torch.float32):
    # ... existing code ...
    # `eye` stays in the activation dtype (bfloat16 in the report);
    # only the dequantized result is converted to out_dtype.
    dequant_weights = layer.quant_method.apply(layer, eye, bias=None)
    return dequant_weights.to(out_dtype)
```

  • Update the apply_fp4_marlin_linear function in marlin_utils_fp4.py to reconcile the dtypes before the kernel call:

```python
def apply_fp4_marlin_linear(x, weight, weight_global_scale, bias=None):
    # ... existing code ...
    # Assumption: the kernel needs the global scale in the activation dtype.
    if weight_global_scale.dtype != x.dtype:
        weight_global_scale = weight_global_scale.to(x.dtype)
    output = ops.marlin_gemm(...)  # existing arguments, now with the cast scale
    return output
```

Verification

To verify that the fix worked, rerun the reproduction from the issue against the modified code:

from vllm import LLM
llm = LLM("inference-optimization/DeepSeek-V3-debug-empty-NVFP4A16")

The failure occurs in process_weights_after_loading, so if the issue is resolved, the LLM constructor should complete without raising the traceback shown above.

Extra Tips

  • Make sure to test the modified code thoroughly to ensure that it works correctly for all scenarios.
  • Consider adding additional checks and handling for other data types to ensure compatibility and prevent similar issues in the future.
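
One way to act on the second tip is to check dtypes explicitly before the kernel call, so that a mismatch produces a readable error rather than a failure deep inside a custom op. The helper below is a minimal sketch under that assumption. `assert_matching_dtypes` is a hypothetical name, not a vLLM function, and the stand-in objects only carry a `.dtype` attribute so the example runs without torch.

```python
from types import SimpleNamespace


def assert_matching_dtypes(**named_tensors):
    """Raise a descriptive TypeError if the given tensors disagree on dtype."""
    dtypes = {name: t.dtype for name, t in named_tensors.items()}
    if len(set(dtypes.values())) > 1:
        details = ", ".join(f"{name}={dtype}" for name, dtype in dtypes.items())
        raise TypeError(f"dtype mismatch before kernel call: {details}")
    # Return the common dtype (None when no tensors were passed).
    return next(iter(dtypes.values()), None)


# Stand-ins with a .dtype attribute; real tensors expose the same attribute.
eye = SimpleNamespace(dtype="bfloat16")
global_scale = SimpleNamespace(dtype="float32")

try:
    assert_matching_dtypes(eye=eye, weight_global_scale=global_scale)
    mismatch_message = None
except TypeError as exc:
    mismatch_message = str(exc)
```

With the combination described in the issue (a bfloat16 identity input and a float32 global scale), the helper raises and names both tensors, which makes the root cause visible at the call site.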
