vllm - ✅(Solved) Fix [Bug]: Regression in 0.19.1 - Gemma 4 26B MoE fails to load packed experts (KeyError: down_proj_packed). Worked in dev6. [1 pull requests, 4 comments, 2 participants]

vllm · 2026-04-22 07:21:07

GitHub stats

vllm-project/vllm#40591 • Fetched 2026-04-23 07:24:08

Author

ghazal-bh

Participants

ghazal-bh

lucianommartins

Timeline (top)

commented ×4, mentioned ×3, subscribed ×3, labeled ×1

Error Message

Error Logs
(EngineCore pid=187) ERROR 04-21 14:28:35 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py", line 1388, in load_weights
(EngineCore pid=187) ERROR 04-21 14:28:35 [core.py:1108]     param = params_dict[name]
(EngineCore pid=187) ERROR 04-21 14:28:35 [core.py:1108]             ~~~~~~~~~~~^^^^^^
(EngineCore pid=187) ERROR 04-21 14:28:35 [core.py:1108] KeyError: 'layers.0.moe.experts.0.down_proj_packed'

(APIServer pid=1) pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
(APIServer pid=1)   Value error, Quantization method specified in the model config (compressed-tensors) does not match the quantization method specified in the quantization argument (awq).

Root Cause

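The issue reports this as a regression: the same model and configuration loaded on development build v0.19.1.dev6+g6d4a8e6d2, which suggests that the logic for unpacking Gemma 4 MoE weights was broken or reverted between dev6 and the stable 0.19.1 release. Switching the flag to --quantization awq is not a viable workaround: the API server then fails to start with a Pydantic ValidationError, because the model's config.json specifies compressed-tensors.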

Fix Action

Fix / Workaround

PR fix notes

PR #40708: [BugFix] Fix Gemma4 'layers.0.moe.experts.0.down_proj_packed' KeyError issue

Repository: vllm-project/vllm
Author: SoluMilken
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40708

Description (problem / solution / changelog)

Purpose

Fixes #40247 and #40591.

Test Plan & Result

My device

GPU: 2 x RTX 4000 Ada (20 GB).

Model 1: google/gemma-4-E4B-it

reason: verifies that the standard unquantized Gemma 4 loading path still works
testing script:

from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-4-E4B-it",
    tensor_parallel_size=2,
    enforce_eager=True,
)

messages = [{"role": "user", "content": "What is 2 + 2?"}]
outputs = llm.chat(messages, SamplingParams(temperature=0.7, max_tokens=128))
print(outputs[0].outputs[0].text)

testing result: <img width="1889" height="937" alt="image" src="https://github.com/user-attachments/assets/d8b05dc2-2f41-4b9b-a173-f42d54a3ab5b" />

Model 2: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit

reason: verifies the AWQ-style dot-suffix naming path such as .qweight and .scales
testing script:

from vllm import LLM, SamplingParams

llm = LLM(
    model="cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit",
    tensor_parallel_size=2,
    enforce_eager=True,
)

messages = [{"role": "user", "content": "What is 2 + 2?"}]
outputs = llm.chat(messages, SamplingParams(temperature=0.7, max_tokens=128))
print(outputs[0].outputs[0].text)

testing result: <img width="1888" height="944" alt="image" src="https://github.com/user-attachments/assets/fccb196e-2f54-4e24-865b-ae64745107b5" />

Model 3: 2imi9/gemma-4-E4B-it-NVFP4A16

reason: verifies the compressed-tensors underscore-suffix naming path such as weight_packed and weight_scale
testing script:

from vllm import LLM, SamplingParams

llm = LLM(
    model="2imi9/gemma-4-E4B-it-NVFP4A16",
    tensor_parallel_size=2,
    enforce_eager=True,
)

messages = [{"role": "user", "content": "What is 2 + 2?"}]
outputs = llm.chat(messages, SamplingParams(temperature=0.7, max_tokens=128))
print(outputs[0].outputs[0].text)

testing result: <img width="1892" height="953" alt="image" src="https://github.com/user-attachments/assets/31b8a390-32e5-4185-8187-6f3719e3fc41" />

Changed files

vllm/model_executor/models/gemma4.py (modified, +19/-9)
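
The test plan above distinguishes two checkpoint naming conventions: AWQ-style names such as .qweight and .scales, and compressed-tensors names such as weight_packed and weight_scale. The sketch below only illustrates the distinction. It is not the code from this PR: `split_quant_suffix` and the full tensor names are hypothetical, and only the four suffixes come from the PR description.

```python
# Hypothetical helper for illustration; NOT vLLM's loader code.
# Only the four suffix strings below are taken from the PR description.
AWQ_STYLE = ("qweight", "scales")
COMPRESSED_TENSORS_STYLE = ("weight_packed", "weight_scale")


def split_quant_suffix(name):
    """Split 'module.path.<suffix>' into (module path, suffix).

    Returns (name, None) when the name carries no known quantization suffix.
    """
    for suffix in AWQ_STYLE + COMPRESSED_TENSORS_STYLE:
        if name.endswith("." + suffix):
            return name[: -len(suffix) - 1], suffix
    return name, None


# Assumed example names; real checkpoints may be laid out differently.
awq_name = "layers.0.moe.experts.0.down_proj.qweight"
ct_name = "layers.0.moe.experts.0.down_proj.weight_packed"
plain_name = "layers.0.input_layernorm.weight"

print(split_quant_suffix(awq_name))
print(split_quant_suffix(ct_name))
print(split_quant_suffix(plain_name))
```

Keeping the module path and the suffix separate means a loader can remap the module path without touching the suffix. That is one plausible way to avoid producing a fused key like down_proj_packed, though the page does not show how the PR actually does it.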

Code Example

OS: Linux (Docker)

vLLM version: v0.19.1 (Docker image: vllm/vllm-openai:v0.19.1)

Hardware: Single GPU (CUDA_VISIBLE_DEVICES=0)

Model: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

OS: Linux (Docker)

vLLM version: v0.19.1 (Docker image: vllm/vllm-openai:v0.19.1)

Hardware: Single GPU (CUDA_VISIBLE_DEVICES=0)

Model: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit

</details>


### 🐛 Describe the bug

When attempting to serve **cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit** using the stable vllm-openai:v0.19.1 Docker image with --quantization compressed-tensors, the engine core crashes during model loading with a KeyError: 'layers.0.moe.experts.0.down_proj_packed'.

Note on regression: This exact configuration and model worked perfectly on a recent development build (v0.19.1.dev6+g6d4a8e6d2), indicating that the logic to unpack Gemma 4 MoE weights was broken or reverted between dev6 and the stable 0.19.1 release.

Additionally, if I try to work around this by changing the flag to --quantization awq, the API server fails to start entirely with a Pydantic ValidationError, because the model's config.json specifies compressed-tensors.

Error Logs

Log 1 (Using --quantization compressed-tensors):
(EngineCore pid=187) ERROR 04-21 14:28:35 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py", line 1388, in load_weights
(EngineCore pid=187) ERROR 04-21 14:28:35 [core.py:1108]     param = params_dict[name]
(EngineCore pid=187) ERROR 04-21 14:28:35 [core.py:1108]             ~~~~~~~~~~~^^^^^^
(EngineCore pid=187) ERROR 04-21 14:28:35 [core.py:1108] KeyError: 'layers.0.moe.experts.0.down_proj_packed'
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}


Log 2 (Using --quantization awq to try and bypass):
(APIServer pid=1) pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
(APIServer pid=1)   Value error, Quantization method specified in the model config (compressed-tensors) does not match the quantization method specified in the `quantization` argument (awq).
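
The KeyError in Log 1 comes from a plain dictionary lookup (`params_dict[name]`) with a name that was never registered. When debugging a failure like this locally, it can help to print the registered names closest to the missing one. The following is a minimal sketch: `lookup_param` is a hypothetical helper, and the registered names in `params` are assumed for illustration, not taken from vLLM.

```python
import difflib


def lookup_param(params_dict, name):
    """Return params_dict[name]; on a miss, raise a KeyError listing similar keys."""
    try:
        return params_dict[name]
    except KeyError:
        close = difflib.get_close_matches(name, list(params_dict), n=3, cutoff=0.6)
        raise KeyError(f"{name!r} not registered; closest names: {close}") from None


# ASSUMPTION: these registered names are made up for the example.
params = {
    "layers.0.moe.experts.0.down_proj.weight_packed": object(),
    "layers.0.moe.experts.0.down_proj.weight_scale": object(),
}

try:
    lookup_param(params, "layers.0.moe.experts.0.down_proj_packed")
except KeyError as err:
    message = str(err)
    print(message)
```

A near-miss in the suggestion list shows whether the problem is a name-mapping bug or a parameter that is missing from the model.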


Expected Behavior
The model should load successfully and allocate the KV cache, exactly as it did on the image vllm/vllm-openai:gemma4.

extent analysis

TL;DR

The most likely fix is to update the model loading logic to correctly handle the Gemma 4 MoE weights, potentially by reverting the changes made between the development build v0.19.1.dev6+g6d4a8e6d2 and the stable 0.19.1 release.

Guidance

Verify that the model config.json specifies the correct quantization method, which is currently set to compressed-tensors.
Check the differences in the model loading logic between the development build v0.19.1.dev6+g6d4a8e6d2 and the stable 0.19.1 release to identify the potential cause of the regression.
Consider using the development build v0.19.1.dev6+g6d4a8e6d2 as a temporary workaround until the issue is resolved in the stable release.
Review the error logs to ensure that the KeyError is the root cause of the issue and not a symptom of a larger problem.
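
The first check in the guidance above can be done before starting the server. This is a minimal sketch, assuming the model's config.json follows the common Hugging Face layout in which the method is stored under `quantization_config.quant_method`. `check_quantization` is a hypothetical helper and does not reproduce vLLM's own validation.

```python
import json


def check_quantization(config, requested=None):
    """Return (method declared in config, whether `requested` is compatible).

    `requested` stands for the value passed to --quantization; None means
    the flag was omitted.
    """
    # ASSUMPTION: Hugging Face-style key layout in config.json.
    method = (config.get("quantization_config") or {}).get("quant_method")
    ok = requested is None or method is None or requested == method
    return method, ok


# Stand-in for the parsed config.json of the model in this issue.
config = json.loads('{"quantization_config": {"quant_method": "compressed-tensors"}}')

print(check_quantization(config, "awq"))
print(check_quantization(config, "compressed-tensors"))
```

With the config from this issue, requesting awq is flagged as a mismatch. That matches the ValidationError in Log 2.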

Notes

The issue appears to be a regression introduced between the development build and the stable release. PR #40708 (open and unmerged when this page was fetched) targets this KeyError with a change to vllm/model_executor/models/gemma4.py. This page does not identify the commit that introduced the regression.

Recommendation

Apply workaround: Use the development build v0.19.1.dev6+g6d4a8e6d2 as a temporary solution until the issue is resolved in the stable release, as it has been confirmed to work with the exact configuration and model.

#api #tensor shape #autograd error #model save/load #model loading
