vllm - 💡(How to fix) Fix MLA: kv_b_proj.weight.dtype AttributeError on quantized ColumnParallelLinear in chunked prefill

vllm 2026-05-28 16:06:10


When serving a quantized (AWQ/GPTQ/compressed-tensors) model that uses MLA (Multi-head Latent Attention), vLLM crashes with an AttributeError during chunked prefill: after quantization, kv_b_proj is a ColumnParallelLinear with no .weight attribute.

Error Message

AttributeError: 'ColumnParallelLinear' object has no attribute 'weight'

at vllm/model_executor/layers/attention/mla_attention.py:2094 in _compute_prefill_context:

kv_c_normed = kv_c_normed.to(self.kv_b_proj.weight.dtype)

Root Cause

Lines 2084-2087 already correctly handle quantized layers:

_kv_b_proj_w_dtype = (
    self.kv_b_proj.weight.dtype
    if hasattr(self.kv_b_proj, "weight")
    else self.kv_b_proj.params_dtype
)

But line 2094 ignores _kv_b_proj_w_dtype and directly accesses self.kv_b_proj.weight.dtype without the hasattr guard.
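The difference between the two access patterns can be reproduced without vLLM or a GPU. The following is a minimal sketch, not vLLM code: the two layer objects are hypothetical stand-ins built from SimpleNamespace, and dtypes are plain strings so that torch is not required. It assumes only what the report states, namely that the quantized layer exposes params_dtype but no weight.

```python
from types import SimpleNamespace

# Hypothetical stand-ins, NOT vLLM classes: they mimic only the attributes
# discussed in this report. Dtypes are plain strings to avoid importing torch.
unquantized = SimpleNamespace(
    weight=SimpleNamespace(dtype="bfloat16"),
    params_dtype="bfloat16",
)
quantized = SimpleNamespace(params_dtype="float16")  # no .weight attribute


def unguarded_dtype(layer):
    # Same access pattern as the failing line: assumes .weight always exists.
    return layer.weight.dtype


def guarded_dtype(layer):
    # Same shape as the guarded expression quoted above.
    return layer.weight.dtype if hasattr(layer, "weight") else layer.params_dtype


try:
    unguarded_dtype(quantized)
    unguarded_failed = False
except AttributeError:
    # This branch is taken: the stand-in has no .weight, as in the report.
    unguarded_failed = True

print("unguarded access failed:", unguarded_failed)
print("guarded dtype (quantized):", guarded_dtype(quantized))
print("guarded dtype (unquantized):", guarded_dtype(unquantized))
```

The guarded form returns a dtype for both stand-ins, while the unguarded form raises only for the one without weight, which matches the observation that the crash appears only with quantized checkpoints.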

Fix

On line 2094, replace:

kv_c_normed = kv_c_normed.to(self.kv_b_proj.weight.dtype)

with:

kv_c_normed = kv_c_normed.to(_kv_b_proj_w_dtype)

The variable _kv_b_proj_w_dtype is already computed with the correct hasattr guard immediately above.
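One way to keep a guarded and an unguarded access from drifting apart again is to route every call site through a single helper. The sketch below is an illustration only: resolve_param_dtype is a hypothetical name that does not exist in vLLM, and the layers are again SimpleNamespace stand-ins with string dtypes.

```python
from types import SimpleNamespace


def resolve_param_dtype(layer):
    """Return the dtype activations should be cast to before calling `layer`.

    Hypothetical helper (not part of vLLM). Prefers layer.weight.dtype and
    falls back to layer.params_dtype when the layer has no usable weight.
    """
    weight = getattr(layer, "weight", None)
    if weight is not None:
        return weight.dtype
    return layer.params_dtype


# Stand-in layers for illustration; real vLLM layers are not constructed here.
layers = {
    "unquantized": SimpleNamespace(
        weight=SimpleNamespace(dtype="bfloat16"), params_dtype="bfloat16"
    ),
    "quantized": SimpleNamespace(params_dtype="float16"),
}

resolved = {name: resolve_param_dtype(layer) for name, layer in layers.items()}
print(resolved)
```

Note that getattr(layer, "weight", None) also falls back when weight exists but is None, which is slightly broader than the hasattr check in the original expression. Whether that case can occur in vLLM is not stated in the report.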


Impact

The vLLM EngineCore crashes with a fatal error and requires a full restart; under load, the service enters a crash loop. All in-flight requests fail with HTTP 500.

Environment

vLLM version: v0.21.1rc1.dev384 (nightly, also present in current main)
Model: GLM-4.7-Flash-AWQ-4bit (quantized with compressed-tensors)
GPU: NVIDIA RTX 3090 (CUDA 12.9)
Quantization: compressed-tensors (AWQ group_size=32, num_bits=4)
CUDA graphs: enabled
MLA: enabled (model uses Multi-head Latent Attention)

Stack Trace

File "vllm/model_executor/layers/attention/mla_attention.py", line 2094, in _compute_prefill_context
    kv_c_normed = kv_c_normed.to(self.kv_b_proj.weight.dtype)
                                  ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'ColumnParallelLinear' object has no attribute 'weight'
