vllm - [Bug]: Duplicate parameter name in convert_vertical_slash_indexes op schema (kv_seqlens registered as q_seqlens)

vllm · 2026-04-06 08:41:21

vllm-project/vllm#39068•Fetched 2026-04-08 02:52:37

Author

ohsono

Participants

ohsono

Error Message

RuntimeError: ... duplicate argument name 'q_seqlens' in schema

Root Cause

csrc/torch_bindings.cpp lines 83 and 94:

// Line 83 (convert_vertical_slash_indexes):
"   Tensor q_seqlens, Tensor q_seqlens, "

// Line 94 (convert_vertical_slash_indexes_mergehead):
"   Tensor q_seqlens, Tensor q_seqlens, "

Fix Action

Fix

Replace the second q_seqlens with kv_seqlens in both op definitions. This matches:

csrc/ops.h:69 — torch::Tensor kv_seqlens
csrc/ops.h:81 — torch::Tensor kv_seqlens
vllm/_custom_ops.py:275 — kv_seqlens: torch.Tensor
vllm/_custom_ops.py:330 — kv_seqlens: torch.Tensor

A fix is in the linked PR.
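The duplicate can be spotted mechanically before it reaches a build. The following is a minimal sketch, not part of vLLM or PyTorch: `schema_arg_names` and `duplicate_arg_names` are hypothetical helpers, the parser is deliberately simplified (it assumes no commas inside default values), and `demo_op` is a shortened, made-up schema built around the fragment quoted above.

```python
from collections import Counter


def schema_arg_names(schema: str) -> list[str]:
    """Return argument names from a schema string like 'op(Tensor a, int b) -> ()'."""
    # Take the text between the first '(' and the last ')' before '->'.
    arrow = schema.index("->")
    args = schema[schema.index("(") + 1 : schema.rindex(")", 0, arrow)]
    names = []
    for arg in args.split(","):
        arg = arg.strip()
        if arg and arg != "*":
            # The name is the last token, after dropping any '= default' part.
            names.append(arg.split("=")[0].split()[-1])
    return names


def duplicate_arg_names(schema: str) -> list[str]:
    """Return every argument name that appears more than once."""
    counts = Counter(schema_arg_names(schema))
    return [name for name, n in counts.items() if n > 1]


# Hypothetical, shortened schemas; the real ones take more arguments.
buggy = "demo_op(Tensor q_seqlens, Tensor q_seqlens, int block_size) -> ()"
fixed = "demo_op(Tensor q_seqlens, Tensor kv_seqlens, int block_size) -> ()"

print(duplicate_arg_names(buggy))  # ['q_seqlens']
print(duplicate_arg_names(fixed))  # []
```
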

PR fix notes

PR #39067: [Transformers/Bugfix] Fix Gemma4 MoE top_k lookup + duplicate kv_seqlens in op schema

Repository: vllm-project/vllm
Author: ohsono
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/39067

Description (problem / solution / changelog)

Summary

This PR contains two independent bugfixes on the same branch.

Fix 1 — Gemma4 MoE crashes at startup with `AssertionError: top_k is None`

Fixes #39066

`google/gemma-4-26B-A4B-it` crashes at startup because `Gemma4TextConfig` stores the top-k expert count under `top_k_experts`, while `MoEMixin.recursive_replace` only checked `num_experts_per_tok` and `top_k`.

Change (`vllm/model_executor/models/transformers/moe.py`):

```python
# Before
top_k = getattr_iter(text_config, ["num_experts_per_tok", "top_k"], None)

# After
top_k = getattr_iter(
    text_config, ["num_experts_per_tok", "top_k", "top_k_experts"], None
)
```

Reproduction:

```bash
vllm serve google/gemma-4-26B-A4B-it
# Crashes immediately: AssertionError at moe.py:198 (assert top_k is not None)
```
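To see why adding one name to the list is enough, here is a small sketch of a first-match attribute lookup. `first_attr` is a hypothetical stand-in written for illustration and is not vLLM's `getattr_iter`; the config object and the value `8` are also made up.

```python
from types import SimpleNamespace


def first_attr(obj, names, default=None):
    """Return the first attribute in `names` that exists on `obj`, else `default`.

    Hypothetical helper for illustration only.
    """
    for name in names:
        if hasattr(obj, name):
            return getattr(obj, name)
    return default


# Hypothetical config that exposes the expert count only as `top_k_experts`.
text_config = SimpleNamespace(top_k_experts=8)

before = first_attr(text_config, ["num_experts_per_tok", "top_k"])
after = first_attr(text_config, ["num_experts_per_tok", "top_k", "top_k_experts"])

print(before)  # None: neither of the old names exists on the config
print(after)   # 8
```

With the shorter name list the lookup falls through to the default of `None`, which is the condition the `assert top_k is not None` in the reproduction trips on.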

Fix 2 — Duplicate `q_seqlens` parameter in `convert_vertical_slash_indexes` op schema

Fixes #39068

The PyTorch op schema strings for `convert_vertical_slash_indexes` and `convert_vertical_slash_indexes_mergehead` in `csrc/torch_bindings.cpp` listed `Tensor q_seqlens` twice. The second argument must be `kv_seqlens`, consistent with `csrc/ops.h` and `vllm/_custom_ops.py`.

Change (`csrc/torch_bindings.cpp`, lines 83 and 94):

```cpp
// Before (both ops):
"   Tensor q_seqlens, Tensor q_seqlens, "

// After (both ops):
"   Tensor q_seqlens, Tensor kv_seqlens, "
```

What this PR does NOT address

#38999 — `--data-parallel-size > 1` crash in CUDA communicator
#39000 — MXFP4 quantization crash during weight loading

Why not a duplicate

Searched open issues and PRs for all affected symbols before opening. No existing PR covers either fix.

Test plan

Fix 1: Load `google/gemma-4-26B-A4B-it` — should pass `MoEMixin.recursive_replace` without `AssertionError`.
Fix 2: Any test dispatching through the vertical-slash sparse attention path (non-ROCm).

AI assistance disclosure

Developed with AI assistance (Claude). All changed lines reviewed by submitter.

Changed files

.pre-commit-config.yaml (modified, +1/-0)
.shellcheckrc (modified, +1/-1)
CMakeLists.txt (modified, +48/-2)
csrc/torch_bindings.cpp (modified, +16/-3)
tests/evals/gsm8k/gsm8k_eval.py (modified, +25/-1)
tools/pre_commit/update-dockerfile-graph.sh (modified, +6/-2)
vllm/_custom_ops.py (modified, +25/-3)
vllm/compilation/backends.py (modified, +1/-1)
vllm/compilation/passes/fusion/act_quant_fusion.py (modified, +4/-1)
vllm/compilation/passes/fusion/matcher_utils.py (modified, +5/-3)
vllm/compilation/piecewise_backend.py (modified, +3/-3)
vllm/config/compilation.py (modified, +1/-1)
vllm/entrypoints/openai/chat_completion/protocol.py (modified, +1/-1)
vllm/entrypoints/openai/chat_completion/serving.py (modified, +10/-1)
vllm/entrypoints/openai/completion/protocol.py (modified, +1/-1)
vllm/entrypoints/openai/responses/serving.py (modified, +1/-1)
vllm/entrypoints/serve/render/serving.py (modified, +30/-4)
vllm/model_executor/layers/layernorm.py (modified, +1/-1)
vllm/model_executor/layers/quantization/mxfp4.py (modified, +132/-0)
vllm/model_executor/layers/quantization/utils/w8a8_utils.py (modified, +8/-2)
vllm/model_executor/models/transformers/base.py (modified, +33/-8)
vllm/model_executor/models/transformers/moe.py (modified, +3/-1)
vllm/model_executor/models/transformers/utils.py (modified, +11/-0)
vllm/model_executor/models/utils.py (modified, +5/-0)
vllm/tool_parsers/mistral_tool_parser.py (modified, +1/-1)
vllm/transformers_utils/model_arch_config_convertor.py (modified, +1/-1)
vllm/v1/engine/core.py (modified, +6/-0)
vllm/v1/worker/utils.py (modified, +28/-8)

Describe the bug

The PyTorch op schema strings registered in csrc/torch_bindings.cpp for convert_vertical_slash_indexes and convert_vertical_slash_indexes_mergehead have a duplicate parameter name: Tensor q_seqlens appears twice. The second occurrence should be Tensor kv_seqlens.

// BUGGY (both ops affected):
"   Tensor q_seqlens, Tensor q_seqlens, "   // <-- kv_seqlens written as q_seqlens

// CORRECT:
"   Tensor q_seqlens, Tensor kv_seqlens, "

The C++ function signatures in csrc/ops.h and the Python wrappers in vllm/_custom_ops.py are both correct — the mismatch is only in the schema string passed to ops.def(...).
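A mismatch of this kind, where only the schema string drifts from the Python wrapper, can be caught by comparing names. The sketch below assumes a simplified schema parser and uses a hypothetical `demo_wrapper` and `demo_op` rather than the real functions in `vllm/_custom_ops.py`; none of these helpers exist in vLLM.

```python
import inspect


def schema_arg_names(schema: str) -> list[str]:
    """Argument names from 'op(Type a, Type b) -> ()' (simplified parser)."""
    arrow = schema.index("->")
    args = schema[schema.index("(") + 1 : schema.rindex(")", 0, arrow)]
    return [
        a.split("=")[0].split()[-1]
        for a in args.split(",")
        if a.strip() not in ("", "*")
    ]


def missing_from_schema(wrapper, schema: str) -> list[str]:
    """Wrapper parameter names with no same-named argument in the schema."""
    declared = set(schema_arg_names(schema))
    return [p for p in inspect.signature(wrapper).parameters if p not in declared]


# Hypothetical wrapper standing in for a Python-side op wrapper.
def demo_wrapper(q_seqlens, kv_seqlens, block_size):
    ...


buggy = "demo_op(Tensor q_seqlens, Tensor q_seqlens, int block_size) -> ()"
fixed = "demo_op(Tensor q_seqlens, Tensor kv_seqlens, int block_size) -> ()"

print(missing_from_schema(demo_wrapper, buggy))  # ['kv_seqlens']
print(missing_from_schema(demo_wrapper, fixed))  # []
```

The check is one-directional on purpose: a real schema may declare arguments that the wrapper does not expose, so only wrapper names absent from the schema are reported.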

Impact

PyTorch op schema registration with duplicate argument names causes a runtime error when the kernel is dispatched:

RuntimeError: ... duplicate argument name 'q_seqlens' in schema

This affects the sparse/blocksparse vertical-slash attention path (non-ROCm only, guarded by #ifndef USE_ROCM).

Reproduction

import torch
import vllm._custom_ops as ops

q_seqlens  = torch.tensor([16], dtype=torch.int32, device="cuda")
kv_seqlens = torch.tensor([16], dtype=torch.int32, device="cuda")
# Any call through the vertical_slash attention kernel path will hit this

Or simply load vLLM and trigger the sparse attention path via a blocksparse model.

Environment

vLLM version: main
Platform: CUDA (non-ROCm)
File: csrc/torch_bindings.cpp

extent analysis

TL;DR

Replace the second q_seqlens with kv_seqlens in the PyTorch op schema strings for convert_vertical_slash_indexes and convert_vertical_slash_indexes_mergehead in csrc/torch_bindings.cpp.

Guidance

Identify the lines in csrc/torch_bindings.cpp where the duplicate parameter name q_seqlens appears (lines 83 and 94) and replace the second occurrence with kv_seqlens.
Verify that the corrected schema strings match the parameter names in csrc/ops.h and vllm/_custom_ops.py.
Test the corrected code by running the reproduction script or loading vLLM and triggering the sparse attention path via a blocksparse model.
Check for any remaining runtime errors related to duplicate argument names.
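The first and last steps above can be approximated with a source scan. This is a sketch under stated assumptions: it treats every `ops.def(` call as a run of adjacent C++ string literals without escaped quotes, and the `sample` text is a made-up excerpt in the style of `csrc/torch_bindings.cpp`, not the real file. `find_duplicate_schema_args` is a hypothetical helper.

```python
import re
from collections import Counter

# ops.def( followed by one or more adjacent C++ string literals.
OPS_DEF = re.compile(r'ops\.def\(\s*((?:"[^"]*"\s*)+)')


def find_duplicate_schema_args(cpp_source: str):
    """Yield (line, op_name, duplicate_names) for schemas that repeat a name."""
    for match in OPS_DEF.finditer(cpp_source):
        # Adjacent C++ string literals concatenate into one schema string.
        schema = "".join(re.findall(r'"([^"]*)"', match.group(1)))
        if "(" not in schema or "->" not in schema:
            continue  # not a full schema string
        op_name = schema[: schema.index("(")].strip()
        arrow = schema.index("->")
        args = schema[schema.index("(") + 1 : schema.rindex(")", 0, arrow)]
        names = [
            a.split("=")[0].split()[-1]
            for a in args.split(",")
            if a.strip() not in ("", "*")
        ]
        dups = [n for n, c in Counter(names).items() if c > 1]
        if dups:
            line = cpp_source.count("\n", 0, match.start()) + 1
            yield line, op_name, dups


# Hypothetical excerpt; op names and argument lists are shortened.
sample = '''
ops.def(
    "demo_op_a("
    "   Tensor q_seqlens, Tensor q_seqlens, "
    "   int block_size) -> ()");
ops.def(
    "demo_op_b("
    "   Tensor q_seqlens, Tensor kv_seqlens, "
    "   int block_size) -> ()");
'''

for line, op, dups in find_duplicate_schema_args(sample):
    print(f"line {line}: {op} repeats {dups}")
```

To run it against a checkout, read the file with `pathlib.Path("csrc/torch_bindings.cpp").read_text()` and pass the text in; an empty result means no schema in the file repeats an argument name.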

Example

// CORRECTED:
"   Tensor q_seqlens, Tensor kv_seqlens, "

Notes

The affected op definitions are compiled only in non-ROCm (CUDA) builds, because they sit behind #ifndef USE_ROCM, so ROCm builds are not affected.

Recommendation

Apply the fix by replacing the second q_seqlens with kv_seqlens in both op schema strings; this is a correction at the root cause of the reported runtime error, not a workaround.
