vllm - ✅(Solved) Fix [Bug]: Gemma 4 MoE (26B-A4B-it) crashes at startup — AssertionError: top_k is None in MoEMixin.recursive_replace [1 pull requests, 1 participants]

Q: Expected behavior

vLLM should correctly resolve `top_k_experts` from `Gemma4TextConfig` and proceed with model loading.

ohsono · 2026-04-06T08:31:37Z



GitHub stats

vllm-project/vllm#39066•Fetched 2026-04-08 02:52:39


Error Message

Traceback (most recent call last):
  File ".../vllm/model_executor/models/transformers/moe.py", line 198, in recursive_replace
    assert top_k is not None
AssertionError

Root Cause

MoEMixin.recursive_replace resolves the top-k value via:

top_k = getattr_iter(text_config, ["num_experts_per_tok", "top_k"], None)
assert top_k is not None

Gemma4TextConfig stores this value under the attribute name top_k_experts — not num_experts_per_tok or top_k. The lookup returns None and the assert fires.

# transformers/models/gemma4/configuration_gemma4.py
@dataclass
class Gemma4TextConfig:
    num_experts: int | None = None
    top_k_experts: int | None = None   # <-- this is the attribute used
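
The failed lookup can be reproduced without downloading the model. The sketch below is a minimal stand-in, not vLLM's code: `getattr_iter` is a simplified reimplementation that is assumed to return the first attribute that exists, and `FakeGemma4TextConfig` is a hypothetical class that only mimics the two fields shown above, with made-up values.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Optional


def getattr_iter(obj: object, names: Iterable[str], default: Any) -> Any:
    """Simplified stand-in: return the first attribute in `names` found on `obj`."""
    for name in names:
        if hasattr(obj, name):
            return getattr(obj, name)
    return default


@dataclass
class FakeGemma4TextConfig:
    # Hypothetical values; only the attribute names matter for this sketch.
    num_experts: Optional[int] = 8
    top_k_experts: Optional[int] = 2


config = FakeGemma4TextConfig()

# Lookup as it was before the fix: neither name exists on the config.
old_top_k = getattr_iter(config, ["num_experts_per_tok", "top_k"], None)

# Lookup with the extra fallback name added by the fix.
new_top_k = getattr_iter(
    config, ["num_experts_per_tok", "top_k", "top_k_experts"], None
)

print("before:", old_top_k)
print("after:", new_top_k)
```

With the old list the lookup falls through to the default, which is the condition that trips `assert top_k is not None`.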

Fix Action

Fix

Add "top_k_experts" to the getattr_iter lookup list in moe.py:197:

# Before
top_k = getattr_iter(text_config, ["num_experts_per_tok", "top_k"], None)

# After
top_k = getattr_iter(
    text_config, ["num_experts_per_tok", "top_k", "top_k_experts"], None
)

This is addressed by the linked PR.

PR fix notes

PR #39067: [Transformers/Bugfix] Fix Gemma4 MoE top_k lookup + duplicate kv_seqlens in op schema

Repository: vllm-project/vllm
Author: ohsono
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/39067

Description (problem / solution / changelog)

Summary

This PR contains two independent bugfixes on the same branch.

Fix 1 — Gemma4 MoE crashes at startup with `AssertionError: top_k is None`

Fixes #39066

`google/gemma-4-26B-A4B-it` crashes at startup because `Gemma4TextConfig` stores the top-k expert count under `top_k_experts`, while `MoEMixin.recursive_replace` only checked `num_experts_per_tok` and `top_k`.

Change (`vllm/model_executor/models/transformers/moe.py`):

```python
# Before
top_k = getattr_iter(text_config, ["num_experts_per_tok", "top_k"], None)

# After
top_k = getattr_iter(
    text_config, ["num_experts_per_tok", "top_k", "top_k_experts"], None
)
```

Reproduction:

```bash
vllm serve google/gemma-4-26B-A4B-it
# Crashes immediately: AssertionError at moe.py:198 (assert top_k is not None)
```

Fix 2 — Duplicate `q_seqlens` parameter in `convert_vertical_slash_indexes` op schema

Fixes #39068

The PyTorch op schema strings for `convert_vertical_slash_indexes` and `convert_vertical_slash_indexes_mergehead` in `csrc/torch_bindings.cpp` listed `Tensor q_seqlens` twice. The second argument must be `kv_seqlens`, consistent with `csrc/ops.h` and `vllm/_custom_ops.py`.

Change (`csrc/torch_bindings.cpp`, lines 83 and 94):

```cpp
// Before (both ops):
" Tensor q_seqlens, Tensor q_seqlens, "

// After (both ops):
" Tensor q_seqlens, Tensor kv_seqlens, "
```
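
A duplicated argument name like the one fixed above is easy to miss by eye in a long schema string. As a rough illustration, the sketch below scans a schema-like string for repeated argument names. The helper `duplicate_arg_names` and the abbreviated schema strings are hypothetical: the real schemas in `csrc/torch_bindings.cpp` carry more arguments, and this is not how PyTorch itself parses them.

```python
from collections import Counter


def duplicate_arg_names(schema: str) -> list[str]:
    """Return argument names that appear more than once in a schema-like string."""
    # Take the text between the first "(" and the first ")" that follows it.
    start = schema.index("(") + 1
    end = schema.index(")", start)
    names = []
    for arg in schema[start:end].split(","):
        tokens = arg.split()
        if tokens:
            # The name is the last token ("Tensor q_seqlens" -> "q_seqlens");
            # drop any default value written as "name=value".
            names.append(tokens[-1].split("=")[0])
    counts = Counter(names)
    return [name for name, n in counts.items() if n > 1]


# Abbreviated, hypothetical schemas: only the two arguments named in the PR.
before = "convert_vertical_slash_indexes(Tensor q_seqlens, Tensor q_seqlens) -> ()"
after = "convert_vertical_slash_indexes(Tensor q_seqlens, Tensor kv_seqlens) -> ()"

print(duplicate_arg_names(before))
print(duplicate_arg_names(after))
```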

What this PR does NOT address

#38999 — `--data-parallel-size > 1` crash in CUDA communicator
#39000 — MXFP4 quantization crash during weight loading

Why not a duplicate

Searched open issues and PRs for all affected symbols before opening. No existing PR covers either fix.

Test plan

Fix 1: Load `google/gemma-4-26B-A4B-it` — should pass `MoEMixin.recursive_replace` without `AssertionError`.
Fix 2: Any test dispatching through the vertical-slash sparse attention path (non-ROCm).

AI assistance disclosure

Developed with AI assistance (Claude). All changed lines reviewed by submitter.

Changed files

.pre-commit-config.yaml (modified, +1/-0)
.shellcheckrc (modified, +1/-1)
CMakeLists.txt (modified, +48/-2)
csrc/torch_bindings.cpp (modified, +16/-3)
tests/evals/gsm8k/gsm8k_eval.py (modified, +25/-1)
tools/pre_commit/update-dockerfile-graph.sh (modified, +6/-2)
vllm/_custom_ops.py (modified, +25/-3)
vllm/compilation/backends.py (modified, +1/-1)
vllm/compilation/passes/fusion/act_quant_fusion.py (modified, +4/-1)
vllm/compilation/passes/fusion/matcher_utils.py (modified, +5/-3)
vllm/compilation/piecewise_backend.py (modified, +3/-3)
vllm/config/compilation.py (modified, +1/-1)
vllm/entrypoints/openai/chat_completion/protocol.py (modified, +1/-1)
vllm/entrypoints/openai/chat_completion/serving.py (modified, +10/-1)
vllm/entrypoints/openai/completion/protocol.py (modified, +1/-1)
vllm/entrypoints/openai/responses/serving.py (modified, +1/-1)
vllm/entrypoints/serve/render/serving.py (modified, +30/-4)
vllm/model_executor/layers/layernorm.py (modified, +1/-1)
vllm/model_executor/layers/quantization/mxfp4.py (modified, +132/-0)
vllm/model_executor/layers/quantization/utils/w8a8_utils.py (modified, +8/-2)
vllm/model_executor/models/transformers/base.py (modified, +33/-8)
vllm/model_executor/models/transformers/moe.py (modified, +3/-1)
vllm/model_executor/models/transformers/utils.py (modified, +11/-0)
vllm/model_executor/models/utils.py (modified, +5/-0)
vllm/tool_parsers/mistral_tool_parser.py (modified, +1/-1)
vllm/transformers_utils/model_arch_config_convertor.py (modified, +1/-1)
vllm/v1/engine/core.py (modified, +6/-0)
vllm/v1/worker/utils.py (modified, +28/-8)



Describe the bug

Loading google/gemma-4-26B-A4B-it (or any Gemma 4 MoE variant) with the Transformers modeling backend raises an AssertionError immediately during model initialization, before any inference occurs.

File ".../vllm/model_executor/models/transformers/moe.py", line 198, in recursive_replace
    assert top_k is not None
AssertionError

Root Cause

MoEMixin.recursive_replace resolves the top-k value via:

top_k = getattr_iter(text_config, ["num_experts_per_tok", "top_k"], None)
assert top_k is not None

Gemma4TextConfig stores this value under the attribute name top_k_experts — not num_experts_per_tok or top_k. The lookup returns None and the assert fires.

# transformers/models/gemma4/configuration_gemma4.py
@dataclass
class Gemma4TextConfig:
    num_experts: int | None = None
    top_k_experts: int | None = None   # <-- this is the attribute used

Reproduction

# Requires access to google/gemma-4-26B-A4B-it (gated)
vllm serve google/gemma-4-26B-A4B-it

Full traceback:

Traceback (most recent call last):
  File ".../vllm/model_executor/models/transformers/moe.py", line 198, in recursive_replace
    assert top_k is not None
AssertionError

The crash occurs during MoEMixin.recursive_replace → TransformersFusedMoE initialization, triggered by TransformersMultiModalMoEForCausalLM.

Expected behavior

vLLM should correctly resolve top_k_experts from Gemma4TextConfig and proceed with model loading.

Fix

Add "top_k_experts" to the getattr_iter lookup list in moe.py:197:

# Before
top_k = getattr_iter(text_config, ["num_experts_per_tok", "top_k"], None)

# After
top_k = getattr_iter(
    text_config, ["num_experts_per_tok", "top_k", "top_k_experts"], None
)

This is addressed by the linked PR.

Environment

vLLM version: main (after 2026-04)
Model: google/gemma-4-26B-A4B-it
Backend: Transformers modeling backend (TransformersMultiModalMoEForCausalLM)
Python: 3.12
Transformers version: 4.52+ (Gemma 4 support)

Notes

This is not a duplicate of #38999 (which is about --data-parallel-size > 1 in CUDA communicator) or #39000 (MXFP4 quantization crash).
The fix is a one-line addition to the attribute lookup list and is safe for all other MoE models (the new entry is only reached if the first two attributes are absent).
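
The ordering argument in the note above can be checked with a small sketch. `getattr_iter` is a simplified stand-in (assumed to return the first name that exists on the object), and the configs are hypothetical `SimpleNamespace` objects with made-up values, not real model configs.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def getattr_iter(obj: object, names: Iterable[str], default: Any) -> Any:
    """Simplified stand-in: return the first attribute in `names` found on `obj`."""
    for name in names:
        if hasattr(obj, name):
            return getattr(obj, name)
    return default


# Lookup order after the fix; the new name is tried last.
NAMES = ["num_experts_per_tok", "top_k", "top_k_experts"]

# Hypothetical configs covering each naming convention.
configs = {
    "num_experts_per_tok only": SimpleNamespace(num_experts_per_tok=2),
    "top_k only": SimpleNamespace(top_k=4),
    "top_k_experts only": SimpleNamespace(top_k_experts=8),
    "old and new names": SimpleNamespace(num_experts_per_tok=2, top_k_experts=8),
}

resolved = {label: getattr_iter(cfg, NAMES, None) for label, cfg in configs.items()}
for label, value in resolved.items():
    print(f"{label}: {value}")
```

Because the new name sits at the end of the list, a config that already defines one of the earlier names resolves to the same value as before.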

extent analysis

TL;DR

To fix the AssertionError when loading Gemma 4 MoE variants with the Transformers modeling backend, add "top_k_experts" to the getattr_iter lookup list in moe.py.

Guidance

The crash comes from an attribute-name mismatch: MoEMixin.recursive_replace looks for num_experts_per_tok or top_k, while Gemma4TextConfig exposes top_k_experts.
Update the getattr_iter call in vllm/model_executor/models/transformers/moe.py so that its lookup list also includes "top_k_experts".
Verify the fix by loading the Gemma 4 MoE model again and confirming that startup gets past MoEMixin.recursive_replace without the AssertionError.
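
Before restarting the server, it can help to confirm which attribute name a checkpoint's config actually uses. The sketch below works on a plain dictionary such as one parsed from a `config.json`. The helper `find_top_k_key`, the nested `text_config` layout, and the sample values are assumptions for illustration, not the published Gemma 4 config. It is a diagnostic only and does not mirror vLLM's lookup exactly.

```python
import json
from typing import Optional

# Names tried by the lookup after the fix, in order.
CANDIDATES = ("num_experts_per_tok", "top_k", "top_k_experts")


def find_top_k_key(config: dict) -> Optional[tuple[str, int]]:
    """Return (key, value) for the first candidate that is set, or None."""
    # Assumption: text settings may be nested; otherwise use the top level.
    text_config = config.get("text_config", config)
    for key in CANDIDATES:
        if text_config.get(key) is not None:
            return key, text_config[key]
    return None


# Hypothetical content standing in for a real config.json.
sample = json.loads('{"text_config": {"num_experts": 8, "top_k_experts": 2}}')

print(find_top_k_key(sample))
print(find_top_k_key({"hidden_size": 16}))
```

A result of `None` for a MoE checkpoint would suggest the config uses yet another name, which the lookup list would not cover.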

Example

# Updated getattr_iter call
top_k = getattr_iter(
    text_config, ["num_experts_per_tok", "top_k", "top_k_experts"], None
)

Notes

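The change adds only a fallback name: a config that defines num_experts_per_tok or top_k resolves exactly as before, so other MoE models are unaffected.
In practice it changes behavior only for configs, such as Gemma4TextConfig, that expose top_k_experts and neither of the other two names.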

Recommendation

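Add "top_k_experts" to the getattr_iter lookup list. This is a targeted fix for the attribute-name mismatch rather than a workaround. PR #39067 carries the change but was still open and unmerged when this page was fetched, so a local patch may be needed in the meantime.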
