vllm - 💡(How to fix) Fix [Bug]: NVFP4 + MLA error during processing [3 comments, 2 participants]

vllm · 2026-03-28 21:47:19


GitHub stats

vllm-project/vllm#38439 • Fetched 2026-04-08 01:45:40

Author: kylesayrs

Participants: baonudesifeizhai, kylesayrs

Timeline (top): commented ×3, renamed ×2, closed ×1, labeled ×1


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

N/A

</details>

🐛 Describe the bug

I believe that #33972 has broken models with MLA, as MLA attention processing assumes that the fp4 marlin linear can be run with `weight_global_scale.dtype=torch.float32` and a bfloat16 `eye` input.

Replication:

from vllm import LLM
llm = LLM("inference-optimization/DeepSeek-V3-debug-empty-NVFP4A16")

(EngineCore pid=3426083)   File "/home/kylesayrs/vllm/vllm/model_executor/layers/attention/mla_attention.py", line 714, in process_weights_after_loading
(EngineCore pid=3426083)     kv_b_proj_weight = get_and_maybe_dequant_weights(
(EngineCore pid=3426083)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3426083)   File "/home/kylesayrs/vllm/vllm/model_executor/layers/quantization/utils/quant_utils.py", line 390, in get_and_maybe_dequant_weights
(EngineCore pid=3426083)     dequant_weights = layer.quant_method.apply(layer, eye, bias=None).to(out_dtype)
(EngineCore pid=3426083)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3426083)   File "/home/kylesayrs/vllm/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 921, in apply
(EngineCore pid=3426083)     return scheme.apply_weights(layer, x, bias=bias)
(EngineCore pid=3426083)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3426083)   File "/home/kylesayrs/vllm/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a16_nvfp4.py", line 100, in apply_weights
(EngineCore pid=3426083)     return apply_fp4_marlin_linear(
(EngineCore pid=3426083)            ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3426083)   File "/home/kylesayrs/vllm/vllm/model_executor/layers/quantization/utils/marlin_utils_fp4.py", line 180, in apply_fp4_marlin_linear
(EngineCore pid=3426083)     output = ops.marlin_gemm(
(EngineCore pid=3426083)              ^^^^^^^^^^^^^^^^
(EngineCore pid=3426083)   File "/home/kylesayrs/vllm/vllm/_custom_ops.py", line 1301, in marlin_gemm
(EngineCore pid=3426083)     return torch.ops._C.marlin_gemm(
(EngineCore pid=3426083)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3426083)   File "/home/kylesayrs/vllm/env/lib/python3.12/site-packages/torch/_ops.py", line 1209, in __call__
(EngineCore pid=3426083)     return self._op(*args, **kwargs)
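
The traceback shows `get_and_maybe_dequant_weights` calling `layer.quant_method.apply(layer, eye, bias=None)`, that is, pushing an identity matrix through the layer's forward pass. The idea behind that trick can be sketched in plain Python: under the usual `y = x @ W.T` convention, feeding the identity returns the transposed weight, so the dense weight can be recovered without knowing how it is stored. The helper names below (`linear_forward`, `identity`, `recover_weight`) are hypothetical and are not vLLM functions.

```python
def linear_forward(x, weight):
    """Compute y = x @ weight.T for row-major lists of lists."""
    return [
        [sum(xi * wi for xi, wi in zip(x_row, w_row)) for w_row in weight]
        for x_row in x
    ]


def identity(n):
    """Build an n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def recover_weight(forward, in_features):
    """Recover a layer's dense weight by pushing the identity through it.

    `forward` is any callable implementing y = x @ W.T. The output for the
    identity input is W.T, so it is transposed back to W before returning.
    """
    out = forward(identity(in_features))
    return [list(col) for col in zip(*out)]


# A layer with out_features=2 and in_features=3.
weight = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
recovered = recover_weight(lambda x: linear_forward(x, weight), in_features=3)
```

Because the identity goes through the same kernel as real activations, its dtype has to be one the kernel accepts alongside the layer's scales, which is where the reported failure arises.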

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To fix the issue, we need to ensure that the weight_global_scale.dtype is compatible with the bfloat16 data type used in the MLA attention processing.

Here are the proposed steps to fix the issue. The signatures shown are simplified for illustration and may not match the actual vLLM signatures:

*   Modify the `get_and_maybe_dequant_weights` function in `quant_utils.py` to handle the data type conversion:
    ```python
    def get_and_maybe_dequant_weights(layer, eye, bias=None, out_dtype=torch.float32):
        # ... existing code ...
        dequant_weights = layer.quant_method.apply(layer, eye, bias=None).to(out_dtype)
        # Add a check to ensure the data type is compatible
        if out_dtype == torch.float32 and layer.weight.dtype == torch.bfloat16:
            dequant_weights = dequant_weights.to(torch.bfloat16)
        return dequant_weights
    ```

*   Update the `apply_fp4_marlin_linear` function in `marlin_utils_fp4.py` to handle the data type conversion:
    ```python
    def apply_fp4_marlin_linear(x, weight, bias=None, out_dtype=torch.float32):
        # ... existing code ...
        output = ops.marlin_gemm(x, weight, bias=bias)
        # Add a check to ensure the data type is compatible
        if out_dtype == torch.float32 and x.dtype == torch.bfloat16:
            output = output.to(torch.bfloat16)
        return output
    ```

Note that the traceback ends inside `torch.ops._C.marlin_gemm`, so both casts above would run only after the failing call has returned. For a change to take effect, the dtypes would have to be aligned on the inputs before `ops.marlin_gemm` is invoked.

Verification

To verify that the fix worked, you can test the LLM model with the modified code:

from vllm import LLM
llm = LLM("inference-optimization/DeepSeek-V3-debug-empty-NVFP4A16")

If the issue is resolved, the model should run without errors.
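
For repeated checks, the verification step can be wrapped in a small helper that reports success or the exception that was raised. This is a minimal sketch: `smoke_test` and `load_fn` are hypothetical names introduced here, and the helper only assumes that a failed model load surfaces as a Python exception in the calling process.

```python
def smoke_test(load_fn):
    """Run a zero-argument loader and report (ok, detail).

    `load_fn` is any callable that constructs the model; the helper
    does not depend on vLLM itself.
    """
    try:
        load_fn()
    except Exception as exc:  # report any load-time failure
        return False, f"{type(exc).__name__}: {exc}"
    return True, "loaded without errors"


# Intended usage (requires vLLM and a suitable GPU, so it is left commented out):
# from vllm import LLM
# ok, detail = smoke_test(
#     lambda: LLM("inference-optimization/DeepSeek-V3-debug-empty-NVFP4A16")
# )
# print(ok, detail)
```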

Extra Tips

Make sure to test the modified code thoroughly to ensure that it works correctly for all scenarios.
Consider adding additional checks and handling for other data types to ensure compatibility and prevent similar issues in the future.
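
One way to picture such a check is to align dtypes before the kernel call rather than after it. The sketch below uses stand-ins only: `FakeTensor`, `fake_kernel`, and `call_with_aligned_dtypes` are hypothetical and do not model the real Marlin kernel or any vLLM API. Which operand should be cast, and whether the cast is numerically acceptable, is a design decision this sketch does not settle.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FakeTensor:
    """Hypothetical stand-in for a tensor: tracks only a name and a dtype string."""
    name: str
    dtype: str

    def to(self, dtype):
        """Return a copy with a different dtype, mimicking a cast."""
        return replace(self, dtype=dtype)


def fake_kernel(x, scale):
    """Hypothetical kernel that rejects operands with different dtypes."""
    if x.dtype != scale.dtype:
        raise TypeError(
            f"dtype mismatch: {x.name}={x.dtype}, {scale.name}={scale.dtype}"
        )
    return FakeTensor("output", x.dtype)


def call_with_aligned_dtypes(x, scale):
    """Cast the scale to the activation dtype before calling the kernel."""
    if scale.dtype != x.dtype:
        scale = scale.to(x.dtype)
    return fake_kernel(x, scale)


eye = FakeTensor("eye", "bfloat16")
scale = FakeTensor("weight_global_scale", "float32")
result = call_with_aligned_dtypes(eye, scale)
```

Calling `fake_kernel(eye, scale)` directly raises a `TypeError` naming both operands, which is the kind of explicit message an added check could provide.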
