vllm: [Bug]: AttributeError in mla_attention.py L2094 (_compute_prefill_context) on long prefill with AWQ model (regression after PR #34695)

vllm · 2026-05-21 00:35:18

Error Message

AttributeError: 'ColumnParallelLinear' object has no attribute 'weight'

Fix Action

Fix / Workaround

I applied the one-line patch (replacing self.kv_b_proj.weight.dtype with _kv_b_proj_w_dtype in the .to() call) locally and re-ran the exact same reproducer. The crash is gone, and the previously failing 27000-token prompt now completes successfully with a prefill time of ~0.5 s. Prefix caching also works correctly afterwards (80% hit rate observed on subsequent identical requests).

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

vLLM version: 0.21.1rc1.dev110+g129019f33.d20260519
Model: cyankiwi/GLM-4.7-Flash-AWQ-4bit (AWQ 4-bit, compressed-tensors)
Quantization: compressed-tensors
GPU: 2x NVIDIA DGX Spark (GB10)
Deployment: dual-node, tensor-parallel-size=2, no-Ray (multiproc)
OS: Ubuntu 24.04 ARM64
Python: 3.12

</details>

🐛 Describe the bug

In vllm/model_executor/layers/attention/mla_attention.py, function _compute_prefill_context (around line 2094), the code defines a safe local variable _kv_b_proj_w_dtype with a hasattr(self.kv_b_proj, "weight") fallback, but then a few lines below still uses self.kv_b_proj.weight.dtype directly in the .to() call. This crashes AWQ/GPTQ quantized models on long-prompt prefill with:

AttributeError: 'ColumnParallelLinear' object has no attribute 'weight'

The crash kills the EngineCore (not just the request), requiring a server restart.

This appears to be a regression introduced after PR #34695 was merged — likely by a subsequent commit that added NVFP4 support and rewrote the surrounding if condition but forgot to use the safe variable inside the .to() call.

The problematic code (current state in the running container)

vllm/model_executor/layers/attention/mla_attention.py lines ~2080–2095:

# For quantized layers (AWQ/GPTQ) that lack a .weight attribute,
# use params_dtype which is the expected input dtype.
_kv_b_proj_w_dtype = (
    self.kv_b_proj.weight.dtype
    if hasattr(self.kv_b_proj, "weight")
    else self.kv_b_proj.params_dtype
)
# For NVFP4, weights are packed uint8 — keep input in model dtype
# since the NVFP4 linear layer quantizes internally.
if (
    use_fp8_prefill or _kv_b_proj_w_dtype != current_platform.fp8_dtype()
) and _kv_b_proj_w_dtype != torch.uint8:
    kv_c_normed = kv_c_normed.to(self.kv_b_proj.weight.dtype)   # ← line 2094, BUG

The _kv_b_proj_w_dtype variable is defined precisely to be used here, but the .to() call still goes to self.kv_b_proj.weight.dtype directly, bypassing the hasattr guard.
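The failure mode can be illustrated without vLLM. The sketch below uses hypothetical stand-in classes (`FakeDenseLinear`, `FakeQuantizedLinear`) and string dtypes instead of the real `ColumnParallelLinear` and `torch.dtype`. It only assumes, as described above, that a quantized layer exposes `params_dtype` but no `weight`.

```python
from types import SimpleNamespace


class FakeDenseLinear:
    """Hypothetical stand-in for an unquantized layer: has .weight."""

    def __init__(self, dtype):
        self.weight = SimpleNamespace(dtype=dtype)
        self.params_dtype = dtype


class FakeQuantizedLinear:
    """Hypothetical stand-in for an AWQ/GPTQ layer: no .weight attribute."""

    def __init__(self, dtype):
        self.params_dtype = dtype


def guarded_dtype(layer):
    # Mirrors the hasattr fallback that defines _kv_b_proj_w_dtype.
    return layer.weight.dtype if hasattr(layer, "weight") else layer.params_dtype


def unguarded_dtype(layer):
    # Mirrors the direct attribute access on the crashing line.
    return layer.weight.dtype


dense = FakeDenseLinear("bfloat16")
quant = FakeQuantizedLinear("bfloat16")

# Both layer kinds resolve through the guarded path.
print(guarded_dtype(dense), guarded_dtype(quant))

# The unguarded path fails only for the layer without .weight.
try:
    unguarded_dtype(quant)
except AttributeError as exc:
    print("unguarded access failed:", exc)
```

The guarded helper works for both layer kinds, while the unguarded access raises for the stand-in quantized layer, which is the same shape of error as in the traceback.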

Symptom: only triggered by long-prompt prefill

Short prompts (smoke test with tool calling, ~163 tokens) work fine.
Engine startup, warmup, and model loading succeed.
API endpoints respond normally.
But the first request with a prompt long enough to enter _compute_prefill_context (~27000 tokens in my repro) crashes the worker, taking down EngineCore.

Full traceback

(Worker_TP0 pid=142) ERROR 05-20 23:24:00 [multiproc_executor.py:962] Traceback (most recent call last):
(Worker_TP0 pid=142) ERROR 05-20 23:24:00 [multiproc_executor.py:962]   File ".../vllm/v1/executor/multiproc_executor.py", line 957, in worker_busy_loop
(Worker_TP0 pid=142) ERROR 05-20 23:24:00 [multiproc_executor.py:962]   File ".../vllm/v1/worker/gpu_worker.py", line 843, in execute_model
(Worker_TP0 pid=142) ERROR 05-20 23:24:00 [multiproc_executor.py:962]   File ".../vllm/v1/worker/gpu_model_runner.py", line 4203, in execute_model
(Worker_TP0 pid=142) ERROR 05-20 23:24:00 [multiproc_executor.py:962]   File ".../vllm/v1/worker/gpu_model_runner.py", line 3680, in _model_forward
(Worker_TP0 pid=142) ERROR 05-20 23:24:00 [multiproc_executor.py:962]   File ".../vllm/model_executor/models/glm4_moe_lite.py", line 606, in forward
(Worker_TP0 pid=142) ERROR 05-20 23:24:00 [multiproc_executor.py:962]   File ".../vllm/model_executor/layers/attention/mla_attention.py", line 1062, in unified_mla_attention_with_output
(Worker_TP0 pid=142) ERROR 05-20 23:24:00 [multiproc_executor.py:962]     layer.forward_impl(
(Worker_TP0 pid=142) ERROR 05-20 23:24:00 [multiproc_executor.py:962]   File ".../vllm/model_executor/layers/attention/mla_attention.py", line 691, in forward_impl
(Worker_TP0 pid=142) ERROR 05-20 23:24:00 [multiproc_executor.py:962]     self.impl.forward_mha(...)
(Worker_TP0 pid=142) ERROR 05-20 23:24:00 [multiproc_executor.py:962]   File ".../vllm/model_executor/layers/attention/mla_attention.py", line 2300, in forward_mha
(Worker_TP0 pid=142) ERROR 05-20 23:24:00 [multiproc_executor.py:962]     context_output, context_lse = self._compute_prefill_context(...)
(Worker_TP0 pid=142) ERROR 05-20 23:24:00 [multiproc_executor.py:962]   File ".../vllm/model_executor/layers/attention/mla_attention.py", line 2094, in _compute_prefill_context
(Worker_TP0 pid=142) ERROR 05-20 23:24:00 [multiproc_executor.py:962]     kv_c_normed = kv_c_normed.to(self.kv_b_proj.weight.dtype)
(Worker_TP0 pid=142) ERROR 05-20 23:24:00 [multiproc_executor.py:962] AttributeError: 'ColumnParallelLinear' object has no attribute 'weight'

Steps to reproduce

1. Serve cyankiwi/GLM-4.7-Flash-AWQ-4bit on a dual GB10 cluster with the vLLM version above:

   vllm serve cyankiwi/GLM-4.7-Flash-AWQ-4bit \
     --trust-remote-code \
     --tool-call-parser glm47 --reasoning-parser glm45 \
     --enable-auto-tool-choice \
     --served-model-name glm-4.7-flash \
     --max-model-len 202752 \
     --tensor-parallel-size 2 \
     --host 0.0.0.0 --port 8000 \
     --gpu-memory-utilization 0.7

2. Confirm that a short request works (e.g. a normal tool-calling chat with a ~163-token prompt).

3. Send a request with a long prompt (~27000 tokens):

   filler = "The quick brown fox jumps over the lazy dog. " * 2700
   payload = {
       "model": "glm-4.7-flash",
       "messages": [{"role": "user", "content": filler + "\n\nSummarize."}],
       "max_tokens": 50,
   }

4. EngineCore crashes with the traceback above. All subsequent requests return 500.
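The report does not show how the payload was sent. One possible way, sketched with only the Python standard library, is below. The `BASE_URL` value, the helper names `build_payload` and `send`, and the timeout are assumptions for illustration; the route is vLLM's OpenAI-compatible chat completions endpoint on the port used in the serve command.

```python
import json
import urllib.error
import urllib.request

# Assumption: the server from the steps above is reachable at this address.
BASE_URL = "http://localhost:8000"


def build_payload(repeats=2700):
    """Build the long-prompt request body used in the report."""
    filler = "The quick brown fox jumps over the lazy dog. " * repeats
    return {
        "model": "glm-4.7-flash",
        "messages": [{"role": "user", "content": filler + "\n\nSummarize."}],
        "max_tokens": 50,
    }


def send(payload, timeout=600):
    """POST the payload and return (status_code, body_text)."""
    req = urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # A 500 here would match the post-crash behaviour described above.
        return err.code, err.read().decode("utf-8", errors="replace")
```

Calling `send(build_payload())` against the running server should trigger the crash on an affected build. Calling it with a small `repeats` value first gives a quick check that the short-prompt path still works.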

Suggested fix (one-line, no semantic change)

Use the already-defined safe variable _kv_b_proj_w_dtype in the .to() call:

- kv_c_normed = kv_c_normed.to(self.kv_b_proj.weight.dtype)
+ kv_c_normed = kv_c_normed.to(_kv_b_proj_w_dtype)

This restores the intent of the surrounding code (which already defines _kv_b_proj_w_dtype with a hasattr fallback exactly for this purpose).
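To make the condition around the patched line easier to reason about, here is a small sketch of the same decision written as pure functions. It uses strings in place of `torch.dtype` values, and the names `resolve_dtype`, `should_cast_input`, `target_dtype`, `FP8`, and `UINT8` are hypothetical. The boolean logic is copied from the snippet quoted in this report.

```python
from types import SimpleNamespace

FP8 = "fp8"      # stand-in for current_platform.fp8_dtype()
UINT8 = "uint8"  # stand-in for torch.uint8 (packed NVFP4 weights)


def resolve_dtype(layer):
    """Return the weight dtype, falling back to params_dtype if .weight is absent."""
    return layer.weight.dtype if hasattr(layer, "weight") else layer.params_dtype


def should_cast_input(w_dtype, use_fp8_prefill):
    """Same boolean condition as the `if` guarding the .to() call."""
    return (use_fp8_prefill or w_dtype != FP8) and w_dtype != UINT8


def target_dtype(layer, use_fp8_prefill):
    """Dtype to cast kv_c_normed to, or None when no cast should happen."""
    w_dtype = resolve_dtype(layer)
    # The fix: reuse the resolved dtype instead of touching layer.weight again.
    return w_dtype if should_cast_input(w_dtype, use_fp8_prefill) else None


# Stand-in layers: quantized (no .weight), FP8 weights, packed NVFP4 weights.
quant = SimpleNamespace(params_dtype="bfloat16")
fp8_layer = SimpleNamespace(weight=SimpleNamespace(dtype=FP8), params_dtype="bfloat16")
nvfp4 = SimpleNamespace(weight=SimpleNamespace(dtype=UINT8), params_dtype="bfloat16")

print(target_dtype(quant, False))      # bfloat16
print(target_dtype(fp8_layer, False))  # None
print(target_dtype(fp8_layer, True))   # fp8
print(target_dtype(nvfp4, True))       # None
```

With the resolved dtype reused for the cast, the stand-in quantized layer takes the cast branch and gets its `params_dtype`, instead of failing on a missing attribute.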

Verification

Before submitting a new issue...

Make sure I already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
