vllm - ✅(Solved) Fix [Bug]: Gemma 4 MoE (26B-A4B-it) crashes at startup — AssertionError: top_k is None in MoEMixin.recursive_replace [1 pull requests, 1 participants]

GitHub stats
vllm-project/vllm#39066 · Fetched 2026-04-08 02:52:39
Comments: 0 · Participants: 1 · Timeline: 1 (cross-referenced ×1) · Reactions: 0

Error Message

Traceback (most recent call last):
  File ".../vllm/model_executor/models/transformers/moe.py", line 198, in recursive_replace
    assert top_k is not None
AssertionError

Root Cause

MoEMixin.recursive_replace resolves the top-k value via:

top_k = getattr_iter(text_config, ["num_experts_per_tok", "top_k"], None)
assert top_k is not None

Gemma4TextConfig stores this value under the attribute name top_k_experts — not num_experts_per_tok or top_k. The lookup returns None and the assert fires.

# transformers/models/gemma4/configuration_gemma4.py
@dataclass
class Gemma4TextConfig:
    num_experts: int | None = None
    top_k_experts: int | None = None   # <-- this is the attribute used

Fix Action

Fix

Add "top_k_experts" to the getattr_iter lookup list in moe.py:197:

# Before
top_k = getattr_iter(text_config, ["num_experts_per_tok", "top_k"], None)

# After
top_k = getattr_iter(
    text_config, ["num_experts_per_tok", "top_k", "top_k_experts"], None
)

This is addressed by the linked PR, #39067.

PR fix notes

PR #39067: [Transformers/Bugfix] Fix Gemma4 MoE top_k lookup + duplicate kv_seqlens in op schema

Description (problem / solution / changelog)

Summary

This PR contains two independent bugfixes on the same branch.


Fix 1 — Gemma4 MoE crashes at startup with `AssertionError: top_k is None`

Fixes #39066

`google/gemma-4-26B-A4B-it` crashes at startup because `Gemma4TextConfig` stores the top-k expert count under `top_k_experts`, while `MoEMixin.recursive_replace` only checked `num_experts_per_tok` and `top_k`.

Change (`vllm/model_executor/models/transformers/moe.py`):

```python
# Before
top_k = getattr_iter(text_config, ["num_experts_per_tok", "top_k"], None)

# After
top_k = getattr_iter(
    text_config, ["num_experts_per_tok", "top_k", "top_k_experts"], None
)
```

Reproduction:

```bash
vllm serve google/gemma-4-26B-A4B-it
# Crashes immediately: AssertionError at moe.py:198 (assert top_k is not None)
```


Fix 2 — Duplicate `q_seqlens` parameter in `convert_vertical_slash_indexes` op schema

Fixes #39068

The PyTorch op schema strings for `convert_vertical_slash_indexes` and `convert_vertical_slash_indexes_mergehead` in `csrc/torch_bindings.cpp` listed `Tensor q_seqlens` twice. The second argument must be `kv_seqlens`, consistent with `csrc/ops.h` and `vllm/_custom_ops.py`.

Change (`csrc/torch_bindings.cpp`, lines 83 and 94):

```cpp
// Before (both ops):
" Tensor q_seqlens, Tensor q_seqlens, "

// After (both ops):
" Tensor q_seqlens, Tensor kv_seqlens, "
```
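A repeated argument name in a schema string is easy to miss by eye. As a rough illustration, the sketch below scans a schema string for duplicated argument names. The `duplicate_arg_names` helper and the `example_op` schemas are hypothetical, written for this example; they are abbreviated to the two arguments named above and are not the full schemas from `csrc/torch_bindings.cpp`.

```python
from collections import Counter


def duplicate_arg_names(schema: str) -> list[str]:
    """Return argument names that occur more than once in an op schema string."""
    # Argument list: text between the first '(' and the last ')' before '->'.
    start = schema.index("(") + 1
    end = schema.rindex(")", 0, schema.index("->"))
    names = []
    for arg in schema[start:end].split(","):
        arg = arg.split("=")[0].strip()  # drop any default value
        if arg and arg != "*":           # skip the keyword-only marker
            names.append(arg.split()[-1])  # last token is the argument name
    counts = Counter(names)
    return [name for name, count in counts.items() if count > 1]


# Hypothetical, abbreviated schemas for illustration only.
broken = "example_op(Tensor q_seqlens, Tensor q_seqlens) -> ()"
fixed = "example_op(Tensor q_seqlens, Tensor kv_seqlens) -> ()"

print(duplicate_arg_names(broken))
print(duplicate_arg_names(fixed))
```

A check along these lines could run over every schema string in a lint step, so a copy-paste slip is reported before the extension is built.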


What this PR does NOT address

  • #38999 — `--data-parallel-size > 1` crash in CUDA communicator
  • #39000 — MXFP4 quantization crash during weight loading

Why not a duplicate

Searched open issues and PRs for all affected symbols before opening. No existing PR covers either fix.

Test plan

  • Fix 1: Load `google/gemma-4-26B-A4B-it` — should pass `MoEMixin.recursive_replace` without `AssertionError`.
  • Fix 2: Any test dispatching through the vertical-slash sparse attention path (non-ROCm).

AI assistance disclosure

Developed with AI assistance (Claude). All changed lines reviewed by submitter.

Changed files

  • .pre-commit-config.yaml (modified, +1/-0)
  • .shellcheckrc (modified, +1/-1)
  • CMakeLists.txt (modified, +48/-2)
  • csrc/torch_bindings.cpp (modified, +16/-3)
  • tests/evals/gsm8k/gsm8k_eval.py (modified, +25/-1)
  • tools/pre_commit/update-dockerfile-graph.sh (modified, +6/-2)
  • vllm/_custom_ops.py (modified, +25/-3)
  • vllm/compilation/backends.py (modified, +1/-1)
  • vllm/compilation/passes/fusion/act_quant_fusion.py (modified, +4/-1)
  • vllm/compilation/passes/fusion/matcher_utils.py (modified, +5/-3)
  • vllm/compilation/piecewise_backend.py (modified, +3/-3)
  • vllm/config/compilation.py (modified, +1/-1)
  • vllm/entrypoints/openai/chat_completion/protocol.py (modified, +1/-1)
  • vllm/entrypoints/openai/chat_completion/serving.py (modified, +10/-1)
  • vllm/entrypoints/openai/completion/protocol.py (modified, +1/-1)
  • vllm/entrypoints/openai/responses/serving.py (modified, +1/-1)
  • vllm/entrypoints/serve/render/serving.py (modified, +30/-4)
  • vllm/model_executor/layers/layernorm.py (modified, +1/-1)
  • vllm/model_executor/layers/quantization/mxfp4.py (modified, +132/-0)
  • vllm/model_executor/layers/quantization/utils/w8a8_utils.py (modified, +8/-2)
  • vllm/model_executor/models/transformers/base.py (modified, +33/-8)
  • vllm/model_executor/models/transformers/moe.py (modified, +3/-1)
  • vllm/model_executor/models/transformers/utils.py (modified, +11/-0)
  • vllm/model_executor/models/utils.py (modified, +5/-0)
  • vllm/tool_parsers/mistral_tool_parser.py (modified, +1/-1)
  • vllm/transformers_utils/model_arch_config_convertor.py (modified, +1/-1)
  • vllm/v1/engine/core.py (modified, +6/-0)
  • vllm/v1/worker/utils.py (modified, +28/-8)


Describe the bug

Loading google/gemma-4-26B-A4B-it (or any Gemma 4 MoE variant) with the Transformers modeling backend raises an AssertionError immediately during model initialization, before any inference occurs.

File ".../vllm/model_executor/models/transformers/moe.py", line 198, in recursive_replace
    assert top_k is not None
AssertionError

Root Cause

MoEMixin.recursive_replace resolves the top-k value via:

top_k = getattr_iter(text_config, ["num_experts_per_tok", "top_k"], None)
assert top_k is not None

Gemma4TextConfig stores this value under the attribute name top_k_experts — not num_experts_per_tok or top_k. The lookup returns None and the assert fires.

# transformers/models/gemma4/configuration_gemma4.py
@dataclass
class Gemma4TextConfig:
    num_experts: int | None = None
    top_k_experts: int | None = None   # <-- this is the attribute used

Reproduction

# Requires access to google/gemma-4-26B-A4B-it (gated)
vllm serve google/gemma-4-26B-A4B-it

Full traceback:

Traceback (most recent call last):
  File ".../vllm/model_executor/models/transformers/moe.py", line 198, in recursive_replace
    assert top_k is not None
AssertionError

The crash occurs in MoEMixin.recursive_replace, during TransformersFusedMoE initialization, and is triggered by TransformersMultiModalMoEForCausalLM.

Expected behavior

vLLM should correctly resolve top_k_experts from Gemma4TextConfig and proceed with model loading.

Fix

Add "top_k_experts" to the getattr_iter lookup list in moe.py:197:

# Before
top_k = getattr_iter(text_config, ["num_experts_per_tok", "top_k"], None)

# After
top_k = getattr_iter(
    text_config, ["num_experts_per_tok", "top_k", "top_k_experts"], None
)

This is addressed by the linked PR.

Environment

  • vLLM version: main (after 2026-04)
  • Model: google/gemma-4-26B-A4B-it
  • Backend: Transformers modeling backend (TransformersMultiModalMoEForCausalLM)
  • Python: 3.12
  • Transformers version: 4.52+ (Gemma 4 support)

Notes

  • This is not a duplicate of #38999 (which is about --data-parallel-size > 1 in CUDA communicator) or #39000 (MXFP4 quantization crash).
  • The fix is a one-line addition to the attribute lookup list and is safe for all other MoE models (the new entry is only reached if the first two attributes are absent).
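The ordering argument in the last note can be checked with a few toy configs. This is a sketch under assumptions: `getattr_iter` is re-implemented here as a stand-in (first existing attribute wins), and the `SimpleNamespace` objects are hypothetical configs with illustrative values, not real model configs.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def getattr_iter(obj: Any, names: Iterable[str], default: Any) -> Any:
    """Stand-in helper (assumed behavior): the first existing attribute wins."""
    for name in names:
        if hasattr(obj, name):
            return getattr(obj, name)
    return default


# Name list after the fix; "top_k_experts" is appended last.
NAMES = ["num_experts_per_tok", "top_k", "top_k_experts"]


def resolve_top_k(cfg: Any) -> Any:
    return getattr_iter(cfg, NAMES, None)


# Hypothetical configs; attribute values are illustrative.
older_style = SimpleNamespace(num_experts_per_tok=2)
has_both = SimpleNamespace(num_experts_per_tok=2, top_k_experts=8)
gemma4_style = SimpleNamespace(top_k_experts=8)
no_moe_fields = SimpleNamespace(hidden_size=1024)

for cfg in (older_style, has_both, gemma4_style, no_moe_fields):
    print(resolve_top_k(cfg))
```

Because the new name sits at the end of the list, a config that already defines one of the earlier names resolves to the same value as before.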

extent analysis

TL;DR

To fix the AssertionError when loading Gemma 4 MoE variants with the Transformers modeling backend, add "top_k_experts" to the getattr_iter lookup list in moe.py.

Guidance

  • The crash comes from an attribute-name mismatch: MoEMixin.recursive_replace looks for num_experts_per_tok or top_k, while Gemma4TextConfig exposes top_k_experts.
  • Fix it by adding "top_k_experts" to the list of names passed to getattr_iter in moe.py.
  • Verify the fix by attempting to load the Gemma 4 MoE model again and checking for the absence of the AssertionError.

Example

# Updated getattr_iter call
top_k = getattr_iter(
    text_config, ["num_experts_per_tok", "top_k", "top_k_experts"], None
)

Notes

  • The new entry is consulted only when num_experts_per_tok and top_k are both absent, so MoE models that already define one of those attributes resolve top_k as before.
  • The change is a one-line addition to the attribute lookup list.

Recommendation

Apply the fix by adding "top_k_experts" to the getattr_iter lookup list; it is a targeted change that resolves the attribute-name mismatch.
