vllm - ✅(Solved) Fix [Bug]: Duplicate parameter name in convert_vertical_slash_indexes op schema — kv_seqlens registered as q_seqlens [1 pull requests, 1 participants]

GitHub stats
vllm-project/vllm#39068 (fetched 2026-04-08 02:52:37)
Comments: 0
Participants: 1
Timeline: 1 (cross-referenced ×1)
Reactions: 0

Error Message

RuntimeError: ... duplicate argument name 'q_seqlens' in schema

Root Cause

csrc/torch_bindings.cpp lines 83 and 94:

// Line 83 (convert_vertical_slash_indexes):
"   Tensor q_seqlens, Tensor q_seqlens, "

// Line 94 (convert_vertical_slash_indexes_mergehead):
"   Tensor q_seqlens, Tensor q_seqlens, "

Fix Action

Fix

Replace the second q_seqlens with kv_seqlens in both op definitions. This matches:

  • csrc/ops.h:69: torch::Tensor kv_seqlens
  • csrc/ops.h:81: torch::Tensor kv_seqlens
  • vllm/_custom_ops.py:275: kv_seqlens: torch.Tensor
  • vllm/_custom_ops.py:330: kv_seqlens: torch.Tensor

A fix is in the linked PR.

PR fix notes

PR #39067: [Transformers/Bugfix] Fix Gemma4 MoE top_k lookup + duplicate kv_seqlens in op schema

Description (problem / solution / changelog)

Summary

This PR contains two independent bugfixes on the same branch.


Fix 1 — Gemma4 MoE crashes at startup with `AssertionError: top_k is None`

Fixes #39066

`google/gemma-4-26B-A4B-it` crashes at startup because `Gemma4TextConfig` stores the top-k expert count under `top_k_experts`, while `MoEMixin.recursive_replace` only checked `num_experts_per_tok` and `top_k`.

Change (`vllm/model_executor/models/transformers/moe.py`):

```python
# Before
top_k = getattr_iter(text_config, ["num_experts_per_tok", "top_k"], None)

# After
top_k = getattr_iter(
    text_config, ["num_experts_per_tok", "top_k", "top_k_experts"], None
)
```

Reproduction:

```bash
vllm serve google/gemma-4-26B-A4B-it
# Crashes immediately: AssertionError at moe.py:198 (assert top_k is not None)
```
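
To illustrate why adding `top_k_experts` to the lookup list resolves the assertion, here is a minimal sketch. `first_attr` is a hypothetical stand-in for `getattr_iter`, written under the assumption that the first attribute found wins; the config object and the value 8 are made up for illustration and are not taken from the real `Gemma4TextConfig`.

```python
from types import SimpleNamespace


def first_attr(obj, names, default=None):
    """Return the value of the first attribute in `names` that `obj` has.

    Hypothetical stand-in for vLLM's getattr_iter; the real helper may differ.
    """
    for name in names:
        if hasattr(obj, name):
            return getattr(obj, name)
    return default


# Illustrative config that only exposes `top_k_experts` (value is invented).
gemma_like = SimpleNamespace(top_k_experts=8)

old_names = ["num_experts_per_tok", "top_k"]
new_names = ["num_experts_per_tok", "top_k", "top_k_experts"]

# With the old list nothing matches, so the default (None) comes back and an
# `assert top_k is not None` would fail.
top_k_before = first_attr(gemma_like, old_names)

# With the extended list the value is found.
top_k_after = first_attr(gemma_like, new_names)
```

Appending the new name at the end of the list keeps the existing lookup order intact, so configs that already define `num_experts_per_tok` or `top_k` resolve exactly as before.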


Fix 2 — Duplicate `q_seqlens` parameter in `convert_vertical_slash_indexes` op schema

Fixes #39068

The PyTorch op schema strings for `convert_vertical_slash_indexes` and `convert_vertical_slash_indexes_mergehead` in `csrc/torch_bindings.cpp` listed `Tensor q_seqlens` twice. The second argument must be `kv_seqlens`, consistent with `csrc/ops.h` and `vllm/_custom_ops.py`.

Change (`csrc/torch_bindings.cpp`, lines 83 and 94):

```cpp
// Before (both ops):
"   Tensor q_seqlens, Tensor q_seqlens, "

// After (both ops):
"   Tensor q_seqlens, Tensor kv_seqlens, "
```


What this PR does NOT address

  • #38999 — `--data-parallel-size > 1` crash in CUDA communicator
  • #39000 — MXFP4 quantization crash during weight loading

Why not a duplicate

Searched open issues and PRs for all affected symbols before opening. No existing PR covers either fix.

Test plan

  • Fix 1: Load `google/gemma-4-26B-A4B-it` — should pass `MoEMixin.recursive_replace` without `AssertionError`.
  • Fix 2: Any test dispatching through the vertical-slash sparse attention path (non-ROCm).

AI assistance disclosure

Developed with AI assistance (Claude). All changed lines reviewed by submitter.

Changed files

  • .pre-commit-config.yaml (modified, +1/-0)
  • .shellcheckrc (modified, +1/-1)
  • CMakeLists.txt (modified, +48/-2)
  • csrc/torch_bindings.cpp (modified, +16/-3)
  • tests/evals/gsm8k/gsm8k_eval.py (modified, +25/-1)
  • tools/pre_commit/update-dockerfile-graph.sh (modified, +6/-2)
  • vllm/_custom_ops.py (modified, +25/-3)
  • vllm/compilation/backends.py (modified, +1/-1)
  • vllm/compilation/passes/fusion/act_quant_fusion.py (modified, +4/-1)
  • vllm/compilation/passes/fusion/matcher_utils.py (modified, +5/-3)
  • vllm/compilation/piecewise_backend.py (modified, +3/-3)
  • vllm/config/compilation.py (modified, +1/-1)
  • vllm/entrypoints/openai/chat_completion/protocol.py (modified, +1/-1)
  • vllm/entrypoints/openai/chat_completion/serving.py (modified, +10/-1)
  • vllm/entrypoints/openai/completion/protocol.py (modified, +1/-1)
  • vllm/entrypoints/openai/responses/serving.py (modified, +1/-1)
  • vllm/entrypoints/serve/render/serving.py (modified, +30/-4)
  • vllm/model_executor/layers/layernorm.py (modified, +1/-1)
  • vllm/model_executor/layers/quantization/mxfp4.py (modified, +132/-0)
  • vllm/model_executor/layers/quantization/utils/w8a8_utils.py (modified, +8/-2)
  • vllm/model_executor/models/transformers/base.py (modified, +33/-8)
  • vllm/model_executor/models/transformers/moe.py (modified, +3/-1)
  • vllm/model_executor/models/transformers/utils.py (modified, +11/-0)
  • vllm/model_executor/models/utils.py (modified, +5/-0)
  • vllm/tool_parsers/mistral_tool_parser.py (modified, +1/-1)
  • vllm/transformers_utils/model_arch_config_convertor.py (modified, +1/-1)
  • vllm/v1/engine/core.py (modified, +6/-0)
  • vllm/v1/worker/utils.py (modified, +28/-8)


Describe the bug

The PyTorch op schema strings registered in csrc/torch_bindings.cpp for convert_vertical_slash_indexes and convert_vertical_slash_indexes_mergehead have a duplicate parameter name: Tensor q_seqlens appears twice. The second occurrence should be Tensor kv_seqlens.

// BUGGY (both ops affected):
"   Tensor q_seqlens, Tensor q_seqlens, "   // <-- kv_seqlens written as q_seqlens

// CORRECT:
"   Tensor q_seqlens, Tensor kv_seqlens, "

The C++ function signatures in csrc/ops.h and the Python wrappers in vllm/_custom_ops.py are both correct — the mismatch is only in the schema string passed to ops.def(...).
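
Because the mistake lives only in a string literal, it can be caught with a plain text check before anything is compiled. The sketch below is a deliberately simplified parser, not PyTorch's own schema parser, and the schema strings are abbreviated placeholders: only `q_seqlens` and `kv_seqlens` come from this issue, while `example_op`, `out`, and `block_size` are hypothetical.

```python
from collections import Counter


def schema_arg_names(schema: str) -> list[str]:
    """Extract argument names from a simplified op schema string."""
    # Drop the return part, then take the text inside the argument parentheses.
    head = schema.split("->")[0]
    args_text = head[head.index("(") + 1 : head.rindex(")")]
    names = []
    for piece in args_text.split(","):
        piece = piece.split("=")[0].strip()  # ignore default values
        if not piece or piece == "*":        # skip empties and keyword-only marker
            continue
        names.append(piece.split()[-1])      # the name is the last token
    return names


def duplicate_args(schema: str) -> list[str]:
    """Return argument names that occur more than once in the schema."""
    counts = Counter(schema_arg_names(schema))
    return [name for name, n in counts.items() if n > 1]


# Abbreviated, hypothetical schemas; the real ones have more arguments.
buggy = "example_op(Tensor! out, Tensor q_seqlens, Tensor q_seqlens, int block_size) -> ()"
fixed = "example_op(Tensor! out, Tensor q_seqlens, Tensor kv_seqlens, int block_size) -> ()"

buggy_dupes = duplicate_args(buggy)
fixed_dupes = duplicate_args(fixed)
```

A check like this could run as a lint step over the schema strings in the bindings file, though wiring it into the build is outside the scope of this issue.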

Impact

PyTorch op schema registration with duplicate argument names causes a runtime error when the kernel is dispatched:

RuntimeError: ... duplicate argument name 'q_seqlens' in schema

This affects the sparse/blocksparse vertical-slash attention path (non-ROCm only, guarded by #ifndef USE_ROCM).

Reproduction

import torch
import vllm._custom_ops as ops

q_seqlens  = torch.tensor([16], dtype=torch.int32, device="cuda")
kv_seqlens = torch.tensor([16], dtype=torch.int32, device="cuda")
# Any call through the vertical_slash attention kernel path will hit this

Or simply load vLLM and trigger the sparse attention path via a blocksparse model.

Root cause

csrc/torch_bindings.cpp lines 83 and 94:

// Line 83 (convert_vertical_slash_indexes):
"   Tensor q_seqlens, Tensor q_seqlens, "

// Line 94 (convert_vertical_slash_indexes_mergehead):
"   Tensor q_seqlens, Tensor q_seqlens, "

Fix

Replace the second q_seqlens with kv_seqlens in both op definitions. This matches:

  • csrc/ops.h:69: torch::Tensor kv_seqlens
  • csrc/ops.h:81: torch::Tensor kv_seqlens
  • vllm/_custom_ops.py:275: kv_seqlens: torch.Tensor
  • vllm/_custom_ops.py:330: kv_seqlens: torch.Tensor

A fix is in the linked PR.

Environment

  • vLLM version: main
  • Platform: CUDA (non-ROCm)
  • File: csrc/torch_bindings.cpp

extent analysis

TL;DR

Replace the duplicate q_seqlens parameter with kv_seqlens in the PyTorch op schema strings for convert_vertical_slash_indexes and convert_vertical_slash_indexes_mergehead in csrc/torch_bindings.cpp.

Guidance

  • Identify the lines in csrc/torch_bindings.cpp where the duplicate parameter name q_seqlens appears (lines 83 and 94) and replace the second occurrence with kv_seqlens.
  • Verify that the corrected schema strings match the parameter names in csrc/ops.h and vllm/_custom_ops.py.
  • Test the corrected code by running the reproduction script or loading vLLM and triggering the sparse attention path via a blocksparse model.
  • Check for any remaining runtime errors related to duplicate argument names.
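
The verification step in the guidance above can be sketched as a positional comparison between the schema's argument names and those of the C++ declaration. This is an illustrative helper under simplifying assumptions (flat parameter lists, name as the last token); the declaration and schema strings are abbreviated placeholders rather than the actual contents of `csrc/ops.h` or `csrc/torch_bindings.cpp`.

```python
def params_between_parens(text: str) -> str:
    """Return the parameter list text of a schema or C++ declaration."""
    head = text.split("->")[0]  # schemas have a return part; declarations do not
    return head[head.index("(") + 1 : head.rindex(")")]


def last_token_names(params_text: str) -> list[str]:
    """Take the last token of each comma-separated parameter as its name."""
    names = []
    for piece in params_text.split(","):
        piece = piece.split("=")[0].strip()
        if piece and piece != "*":
            # Strip reference/pointer markers that may cling to the name.
            names.append(piece.split()[-1].lstrip("&*"))
    return names


def name_mismatches(schema: str, cpp_decl: str) -> list[tuple[int, str, str]]:
    """List (position, schema_name, cpp_name) for every differing position."""
    schema_names = last_token_names(params_between_parens(schema))
    cpp_names = last_token_names(params_between_parens(cpp_decl))
    return [
        (i, s, c)
        for i, (s, c) in enumerate(zip(schema_names, cpp_names))
        if s != c
    ]


# Hypothetical, abbreviated declaration and schemas for illustration only.
cpp_decl = (
    "void example_op(torch::Tensor& out, torch::Tensor q_seqlens, "
    "torch::Tensor kv_seqlens, int64_t block_size);"
)
buggy = "example_op(Tensor! out, Tensor q_seqlens, Tensor q_seqlens, int block_size) -> ()"
fixed = "example_op(Tensor! out, Tensor q_seqlens, Tensor kv_seqlens, int block_size) -> ()"

buggy_report = name_mismatches(buggy, cpp_decl)
fixed_report = name_mismatches(fixed, cpp_decl)
```

Comparing by position rather than by set membership matters here: the buggy schema contains no unknown names, only a known name in the wrong slot.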

Example

// CORRECTED:
"   Tensor q_seqlens, Tensor kv_seqlens, "

Notes

This fix only applies to the non-ROCm platform (CUDA) and may not be relevant to other environments.

Recommendation

Apply the fix by replacing the duplicate q_seqlens parameter with kv_seqlens in both PyTorch op schema strings. This is a correction of the root cause rather than a workaround, since the schema then agrees with csrc/ops.h and vllm/_custom_ops.py.
