vllm - ✅ (Solved) Fix [Bug]: Regression in 0.19.1 - Gemma 4 26B MoE fails to load packed experts (KeyError: down_proj_packed). Worked in dev6. [1 pull request, 4 comments, 2 participants]

GitHub stats
vllm-project/vllm#40591 · Fetched 2026-04-23 07:24:08
Comments: 4 · Participants: 2 · Timeline events: 11 · Reactions: 0
Timeline (top): commented ×4, mentioned ×3, subscribed ×3, labeled ×1

Error Message

Error Logs
(EngineCore pid=187) ERROR 04-21 14:28:35 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py", line 1388, in load_weights
(EngineCore pid=187) ERROR 04-21 14:28:35 [core.py:1108]     param = params_dict[name]
(EngineCore pid=187) ERROR 04-21 14:28:35 [core.py:1108]             ~~~~~~~~~~~^^^^^^
(EngineCore pid=187) ERROR 04-21 14:28:35 [core.py:1108] KeyError: 'layers.0.moe.experts.0.down_proj_packed'

(APIServer pid=1) pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
(APIServer pid=1)   Value error, Quantization method specified in the model config (compressed-tensors) does not match the quantization method specified in the quantization argument (awq).

Root Cause

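The traceback shows load_weights in vllm/model_executor/models/gemma4.py (line 1388) looking up the name layers.0.moe.experts.0.down_proj_packed in params_dict and raising a KeyError because no parameter is registered under that name. The reporter states that the same model and flags loaded on v0.19.1.dev6+g6d4a8e6d2, which points to a regression in how Gemma 4 MoE packed expert weights are handled between dev6 and the stable 0.19.1 release. The second error is separate: passing --quantization awq fails with a Pydantic ValidationError because the model's config.json specifies compressed-tensors.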

Fix / Workaround

PR #40708 (below) targets this KeyError. Switching the flag to --quantization awq is not a workaround: the API server fails to start with a Pydantic ValidationError, because the model's config.json specifies compressed-tensors.

PR fix notes

PR #40708: [BugFix] Fix Gemma4 'layers.0.moe.experts.0.down_proj_packed' KeyError issue

Description (problem / solution / changelog)

Purpose

Fixes #40247 and #40591.

Test Plan & Result

My device

  • GPU: 2 x RTX 4000 Ada (20 GB).

Model 1: google/gemma-4-E4B-it

  • reason: verifies that the standard unquantized Gemma 4 loading path still works
  • testing script:
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-4-E4B-it",
    tensor_parallel_size=2,
    enforce_eager=True,
)

messages = [{"role": "user", "content": "What is 2 + 2?"}]
outputs = llm.chat(messages, SamplingParams(temperature=0.7, max_tokens=128))
print(outputs[0].outputs[0].text)
  • testing result: screenshot at https://github.com/user-attachments/assets/d8b05dc2-2f41-4b9b-a173-f42d54a3ab5b

Model 2: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit

  • reason: verifies the AWQ-style dot-suffix naming path such as .qweight and .scales
  • testing script:
from vllm import LLM, SamplingParams

llm = LLM(
    model="cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit",
    tensor_parallel_size=2,
    enforce_eager=True,
)

messages = [{"role": "user", "content": "What is 2 + 2?"}]
outputs = llm.chat(messages, SamplingParams(temperature=0.7, max_tokens=128))
print(outputs[0].outputs[0].text)
  • testing result: screenshot at https://github.com/user-attachments/assets/fccb196e-2f54-4e24-865b-ae64745107b5

Model 3: 2imi9/gemma-4-E4B-it-NVFP4A16

  • reason: verifies the compressed-tensors underscore-suffix naming path such as weight_packed and weight_scale
  • testing script:
from vllm import LLM, SamplingParams

llm = LLM(
    model="2imi9/gemma-4-E4B-it-NVFP4A16",
    tensor_parallel_size=2,
    enforce_eager=True,
)

messages = [{"role": "user", "content": "What is 2 + 2?"}]
outputs = llm.chat(messages, SamplingParams(temperature=0.7, max_tokens=128))
print(outputs[0].outputs[0].text)
  • testing result: screenshot at https://github.com/user-attachments/assets/31b8a390-32e5-4185-8187-6f3719e3fc41
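
The three test models exercise different tensor naming conventions. As a rough illustration of the distinction the test plan draws, the sketch below sorts tensor names by suffix style. The function `classify_suffix_style` and the example names are hypothetical and are not taken from vLLM or from any of these checkpoints; only the suffixes `.qweight`, `.scales`, `weight_packed`, and `weight_scale` come from the PR description.

```python
# Suffixes quoted in the PR description; grouping them this way is an assumption.
DOT_SUFFIXES = (".qweight", ".scales")                    # AWQ-style dot suffix
UNDERSCORE_SUFFIXES = ("weight_packed", "weight_scale")   # compressed-tensors style


def classify_suffix_style(name: str) -> str:
    """Return 'underscore', 'dot', or 'plain' for a tensor name."""
    if name.endswith(UNDERSCORE_SUFFIXES):
        return "underscore"
    if name.endswith(DOT_SUFFIXES):
        return "dot"
    return "plain"


# Hypothetical tensor names, used only to exercise the function.
example_names = [
    "layers.0.mlp.down_proj.weight",
    "layers.0.mlp.down_proj.qweight",
    "layers.0.mlp.down_proj.weight_packed",
]
for tensor_name in example_names:
    print(tensor_name, "->", classify_suffix_style(tensor_name))
```

Running a check like this over the tensor names of a checkpoint can show which of the three paths a given model would take, which may help when deciding which test model reproduces a loading failure.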


Changed files

  • vllm/model_executor/models/gemma4.py (modified, +19/-9)


### Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

```text
OS: Linux (Docker)
vLLM version: v0.19.1 (Docker image: vllm/vllm-openai:v0.19.1)
Hardware: Single GPU (CUDA_VISIBLE_DEVICES=0)
Model: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit
```

</details>


### 🐛 Describe the bug

When attempting to serve **cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit** using the stable vllm-openai:v0.19.1 Docker image with --quantization compressed-tensors, the engine core crashes during model loading with a KeyError: 'layers.0.moe.experts.0.down_proj_packed'.

Note on regression: This exact configuration and model worked perfectly on a recent development build (v0.19.1.dev6+g6d4a8e6d2), indicating that the logic to unpack Gemma 4 MoE weights was broken or reverted between dev6 and the stable 0.19.1 release.

Additionally, when I switch the flag to --quantization awq as an attempted workaround, the API server fails to start entirely with a Pydantic ValidationError, because the model's config.json specifies compressed-tensors.

Error Logs

Log 1 (Using --quantization compressed-tensors):
(EngineCore pid=187) ERROR 04-21 14:28:35 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py", line 1388, in load_weights
(EngineCore pid=187) ERROR 04-21 14:28:35 [core.py:1108]     param = params_dict[name]
(EngineCore pid=187) ERROR 04-21 14:28:35 [core.py:1108]             ~~~~~~~~~~~^^^^^^
(EngineCore pid=187) ERROR 04-21 14:28:35 [core.py:1108] KeyError: 'layers.0.moe.experts.0.down_proj_packed'
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
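
The bare `KeyError` in Log 1 names the missing key but not the names that were actually registered. The sketch below shows one way to make such a lookup more diagnosable by listing near-miss names. It is not vLLM code: `lookup_with_hint` is a hypothetical helper, and the registered name in the example is invented for illustration, since the issue does not say which names the model registered.

```python
import difflib


def lookup_with_hint(params_dict: dict, name: str):
    """Look up name; on a miss, raise KeyError listing the closest registered names."""
    try:
        return params_dict[name]
    except KeyError:
        close = difflib.get_close_matches(name, list(params_dict), n=3, cutoff=0.6)
        raise KeyError(
            f"{name!r} not found; closest registered names: {close}"
        ) from None


# Hypothetical registry: the key below is an assumption, not taken from the issue.
registered = {"layers.0.moe.experts.0.down_proj.weight_packed": object()}

try:
    lookup_with_hint(registered, "layers.0.moe.experts.0.down_proj_packed")
except KeyError as err:
    print(err)
```

A message of this kind would indicate whether the failure is a naming mismatch between the checkpoint and the model, or a parameter that was never created.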


Log 2 (Using --quantization awq to try and bypass):
(APIServer pid=1) pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
(APIServer pid=1)   Value error, Quantization method specified in the model config (compressed-tensors) does not match the quantization method specified in the `quantization` argument (awq).


Expected Behavior
The model should load successfully and allocate the KV cache, exactly as it did on the vllm/vllm-openai:gemma4 image.


extent analysis

TL;DR

The most likely fix is to update the model loading logic to correctly handle the Gemma 4 MoE weights, potentially by reverting the changes made between the development build v0.19.1.dev6+g6d4a8e6d2 and the stable 0.19.1 release.

Guidance

  • Verify that the model config.json specifies the correct quantization method, which is currently set to compressed-tensors.
  • Check the differences in the model loading logic between the development build v0.19.1.dev6+g6d4a8e6d2 and the stable 0.19.1 release to identify the potential cause of the regression.
  • Consider using the development build v0.19.1.dev6+g6d4a8e6d2 as a temporary workaround until the issue is resolved in the stable release.
  • Review the error logs to ensure that the KeyError is the root cause of the issue and not a symptom of a larger problem.
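
The first guidance item can be checked before starting the server. The sketch below reads the quantization method from a locally downloaded model directory and compares it with the intended `--quantization` value. It assumes `quantization_config.quant_method` sits at the top level of config.json, as is common for Hugging Face checkpoints; the helper names are hypothetical, and the plain string comparison is a simplification of whatever validation vLLM performs.

```python
import json
from pathlib import Path
from typing import Optional


def read_quant_method(model_dir: str) -> Optional[str]:
    """Return quantization_config.quant_method from config.json, or None if absent."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    quant_cfg = config.get("quantization_config") or {}
    return quant_cfg.get("quant_method")


def flag_matches_config(model_dir: str, quantization_flag: str) -> bool:
    """True if the CLI value equals the method declared in config.json."""
    declared = read_quant_method(model_dir)
    # No declared method means there is nothing to conflict with.
    return declared is None or declared == quantization_flag
```

For the model in this issue, the error message in Log 2 indicates that the declared method is compressed-tensors, so a check like `flag_matches_config(model_dir, "awq")` would be expected to return False.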

Notes

The issue appears to be a regression introduced between the development build and the stable release, so updating the model loading logic may resolve it. Without a diff of the changes between the two builds, a more specific fix cannot be identified from the issue alone.

Recommendation

Apply workaround: Use the development build v0.19.1.dev6+g6d4a8e6d2 as a temporary solution until the issue is resolved in the stable release; the reporter states that it works with the exact configuration and model.
