vllm - ✅(Solved) Fix [Refactor] Merge `select_gpt_oss_mxfp4_moe_backend` and `select_mxfp4_moe_backend` [2 pull requests, 2 comments, 3 participants]

GitHub stats
vllm-project/vllm#41291 (fetched 2026-04-30 06:19:00)
Comments: 2
Participants: 3
Timeline: 21
Reactions: 0
Timeline (top): mentioned ×8, subscribed ×8, commented ×2, cross-referenced ×2

The current MXFP4 MoE backend selection logic has two separate functions:

  • select_gpt_oss_mxfp4_moe_backend() - uses _get_priority_backends_for_gpt_oss() (BF16 backends)
  • select_mxfp4_moe_backend() - uses _get_priority_backends() (MXFP8/DeepGEMM backends)

These two functions share significant code duplication and could potentially be merged into a unified selector.

Context

From PR #40860 review discussion: https://github.com/vllm-project/vllm/pull/40860#discussion_r3148773753

The functions differ primarily in:

  1. The priority backend list they iterate over
  2. The activation key handling (BF16 vs MXFP8/FP8)

Proposed Solution

Merge both into a single select_mxfp4_moe_backend() function that accepts an activation_key or quantization_config parameter and maps it to the proper backend.
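A minimal sketch of what such a unified selector could look like, assuming the activation key alone decides which priority list to walk. The ActivationKey enum, the backend names, and the is_supported callback are hypothetical placeholders, not vLLM types; the real lists come from _get_priority_backends_for_gpt_oss() and _get_priority_backends().

```python
from enum import Enum


class ActivationKey(Enum):
    """Hypothetical stand-in for the activation key; not a vLLM type."""
    BF16 = "bf16"
    MXFP8 = "mxfp8"


# Placeholder priority lists. In vLLM these would be produced by
# _get_priority_backends_for_gpt_oss() and _get_priority_backends().
PRIORITY_BY_ACTIVATION = {
    ActivationKey.BF16: ["bf16_backend_a", "bf16_backend_b"],
    ActivationKey.MXFP8: ["mxfp8_backend_a", "mxfp8_backend_b"],
}


def select_backend(activation_key, is_supported):
    """Return the first backend listed for activation_key that is_supported accepts."""
    for backend in PRIORITY_BY_ACTIVATION[activation_key]:
        if is_supported(backend):
            return backend
    raise NotImplementedError(
        f"no MXFP4 MoE backend supports {activation_key.value} activations"
    )
```

The sketch covers only the two differences the issue names. It does not model the LoRA, env-var, or XPU branches that PR #41317 later identifies as further divergence.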

Related Files

  • vllm/model_executor/layers/fused_moe/oracle/mxfp4.py
  • vllm/model_executor/layers/quantization/mxfp4.py
  • vllm/model_executor/layers/quantization/quark/quark_moe.py

cc @mgoin @robertgshaw2-redhat @zyongye @ivanium

PR fix notes

PR #39136: [ROCm][Quantization][2/N] Refactor quark_moe w4a8 w/ oracle

Description (problem / solution / changelog)

  1. Remove QuarkOCP_MX_MoEMethod_OSS and add the aiter w4a8 backend.
  2. Add unit tests for ROCm w4a16 and w4a8 fused MoE.
  3. Validated locally with:
pytest -s -v tests/evals/gpt_oss/test_gpqa_correctness.py \
    --config-list-file=tests/evals/gpt_oss/configs/models-gfx950.txt

pytest -s -v tests/evals/gsm8k/test_gsm8k_correctness.py \
    --config-list-file=tests/evals/gsm8k/configs/models-qwen35-mi355.txt

Changed files

  • tests/kernels/moe/test_ocp_mx_moe.py (modified, +388/-8)
  • vllm/model_executor/layers/fused_moe/experts/gpt_oss_triton_kernels_moe.py (modified, +174/-26)
  • vllm/model_executor/layers/fused_moe/oracle/mxfp4.py (modified, +126/-13)
  • vllm/model_executor/layers/quantization/mxfp4.py (modified, +3/-3)
  • vllm/model_executor/layers/quantization/quark/quark_moe.py (modified, +66/-200)

PR #41317: [Refactor] Extract shared helpers from MXFP4 MoE backend selectors

Description (problem / solution / changelog)

Summary

Addresses #41291 by extracting the duplicated machinery shared between select_gpt_oss_mxfp4_moe_backend and select_mxfp4_moe_backend into private module-level helpers in vllm/model_executor/layers/fused_moe/oracle/mxfp4.py.

What changes

Both selectors carried four nested closures each (_make_log_backend, _make_log_unsupported, _return_or_raise) plus near-identical copies of the priority-iteration loop and the explicit-runner-backend branch. This PR lifts those into six private module-level helpers:

  • _activation_format_for_config(config) — pulls the BatchedExperts vs Standard decision out of both functions.
  • _make_log_backend(backend) / _make_log_unsupported(backend, reason) — the two log-message templates, deduplicated.
  • _try_kernel_classes(backend, config, ..., *, log_failures=False) — probes every kernel class registered for backend. The log_failures switch preserves the difference between the priority-loop path (which logs each unsupported kernel via debug_once) and the explicit-backend path (which only surfaces failures by raising).
  • _return_or_raise(backend, config, ...) — resolves a single backend or raises ValueError, used by every direct-pick site (env vars, runner-backend, XPU fallback).
  • _select_explicit_runner_backend(config, activation_format) — honors config.moe_backend when set, returns None to signal "fall through to priority list".
  • _select_first_supported_backend(config, priority_backends, activation_format) — walks a priority list and returns the first backend whose kernel matches.
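The division of labor between the probing helper, the priority walk, and the direct-pick path can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: the registry argument, the check callables, and the simplified signatures are hypothetical, and plain logging stands in for vLLM's debug_once.

```python
import logging

logger = logging.getLogger("mxfp4_selector_sketch")


def try_kernel_classes(backend, config, registry, *, log_failures=False):
    """Return the first kernel name registered for backend that accepts config.

    registry is a hypothetical dict: backend -> list of (kernel_name, check),
    where check(config) returns None when supported or a reason string.
    """
    for kernel_name, check in registry.get(backend, []):
        reason = check(config)
        if reason is None:
            return kernel_name
        if log_failures:
            # Only the priority-loop path logs each rejected kernel.
            logger.debug("%s/%s unsupported: %s", backend, kernel_name, reason)
    return None


def select_first_supported_backend(config, priority_backends, registry):
    """Walk the priority list; return (backend, kernel) for the first match."""
    for backend in priority_backends:
        kernel = try_kernel_classes(backend, config, registry, log_failures=True)
        if kernel is not None:
            return backend, kernel
    return None


def return_or_raise(backend, config, registry):
    """Resolve one explicitly chosen backend, or raise ValueError."""
    kernel = try_kernel_classes(backend, config, registry)
    if kernel is None:
        raise ValueError(f"backend {backend!r} does not support this config")
    return backend, kernel
```

Keeping log_failures as a keyword-only switch lets both paths share one probing loop while the explicit path stays silent until it raises.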

select_gpt_oss_mxfp4_moe_backend and select_mxfp4_moe_backend are now thin orchestration that compose those helpers. Both keep their public signatures and observable behavior, so the three callers in vllm/model_executor/layers/quantization/mxfp4.py and vllm/model_executor/layers/quantization/quark/quark_moe.py need no edit.

Behavioral preservation

The refactor preserves observable behavior 1:1, with two intentional no-op cleanups bundled in:

  1. Removed the redundant logger.info_once(_make_log_backend(backend)) immediately before _return_or_raise(Mxfp4MoeBackend.XPU, ...) in the GPT-OSS XPU branch — _return_or_raise already emits the same info_once message on success, so the original would have logged twice and info_once deduplicated it.
  2. Removed the explicit scope="local" kwargs in select_mxfp4_moe_backend's log calls. "local" is the default for info_once / debug_once per vllm/logger.py, so the behavior is identical and the calls match the rest of the file.
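To see why dropping the extra call is a no-op, consider a log-once helper that suppresses repeated identical messages. The version below is a stand-in built on functools.lru_cache, not vLLM's info_once, whose implementation is not shown here.

```python
import functools
import logging

logger = logging.getLogger("log_once_sketch")


@functools.lru_cache(maxsize=None)
def info_once(message: str) -> None:
    """Log message at INFO level on the first call only.

    Later calls with the same string hit the cache and never reach the logger.
    """
    logger.info(message)
```

With a helper like this, calling info_once with the same text twice in a row emits one record, so removing the first of two identical calls leaves the log output unchanged.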

The GPT-OSS path's behavior matrix is fully preserved:

  • LoRA branch (raises on non-CUDA, picks TRITON_UNFUSED or MARLIN).
  • All four env-var override branches (VLLM_USE_FLASHINFER_MOE_MXFP4_BF16, VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8, VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS, VLLM_MXFP4_USE_MARLIN) — kept in select_gpt_oss_mxfp4_moe_backend, since they don't apply to the generic MXFP4 path.
  • The BATCHED_MARLIN coercion when the format is BatchedExperts.
  • The XPU fallback after the priority loop.
  • The CUDA/ROCm NotImplementedError and the final (NONE, None) return.
  • The per-kernel-class debug_once for unsupported configs in the priority loop (preserved via log_failures=True).

Why this shape and not the issue's literal ask

The issue proposes "a single select_mxfp4_moe_backend() function … with activation_key parameter or quantization_config parameter". Reading both functions, the divergence is bigger than that:

| Aspect | select_gpt_oss_mxfp4_moe_backend | select_mxfp4_moe_backend |
|---|---|---|
| LoRA branch | yes | no |
| Env-var overrides | yes (4 branches) | no |
| Priority list | BF16-focused (_get_priority_backends_for_gpt_oss) | MXFP8/DeepGEMM (_get_priority_backends) |
| XPU fallback | yes | no |
| Final fallback | returns (NONE, None) | raises NotImplementedError |

A single function with a mode='gpt_oss' flag (or several boolean kwargs) would be harder to read than two named entry points and would lock in a shape that the larger class-hierarchy refactor in PR #37776 will rework anyway. This PR achieves the spirit of the dedup ask — eliminate the four copies of the helper closures and the duplicated priority loop — while keeping both entry points distinct.
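The "two named entry points over shared helpers" shape can be illustrated with a reduced sketch. It models only the priority walk, the XPU fallback, and the differing final fallbacks from the comparison above. The LoRA branch, the env-var overrides, and the CUDA/ROCm NotImplementedError of the GPT-OSS path are deliberately omitted, and every name and signature here is hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

Result = Tuple[str, Optional[str]]  # (backend name, kernel class name)


def _first_supported(
    priority: Sequence[str], is_supported: Callable[[str], bool]
) -> Optional[Result]:
    """Shared helper: first backend in priority order that is supported."""
    for backend in priority:
        if is_supported(backend):
            return backend, f"{backend}_kernel"  # placeholder kernel name
    return None


def select_gpt_oss_sketch(priority, is_supported, *, on_xpu=False) -> Result:
    """GPT-OSS-style entry point: XPU fallback, then a (NONE, None) result."""
    found = _first_supported(priority, is_supported)
    if found is not None:
        return found
    if on_xpu:
        return "XPU", "XPU_kernel"
    return "NONE", None


def select_generic_sketch(priority, is_supported) -> Result:
    """Generic entry point: no fallbacks, raises when nothing matches."""
    found = _first_supported(priority, is_supported)
    if found is None:
        raise NotImplementedError("no supported MXFP4 MoE backend")
    return found
```

Each entry point reads top to bottom without mode flags, which is the readability argument made above.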

Why it isn't a duplicate

Searched for existing PRs:

gh pr list --repo vllm-project/vllm --state all --search "41291 in:body"        → no PR
gh pr list --repo vllm-project/vllm --state open --search "select_mxfp4_moe_backend"  → no overlap
gh pr list --repo vllm-project/vllm --state open --search "merge mxfp4 backend"       → no overlap

The closest open PR is #37776 ([MoE] Unify MoE oracles with class structure, needs-rebase), which restructures all MoE oracles into a class hierarchy. That PR predates the GPT-OSS split that #39604 introduced and #40860 reshaped, so it does not touch select_gpt_oss_mxfp4_moe_backend at all. The two PRs don't conflict; whichever lands first, the other rebases cleanly.

Tracking issue: #41291.

Tests run

This is a Python-only refactor with no observable behavior change, so I ran the lint/typecheck commands instead of building the full vLLM wheel:

  • ruff check vllm/model_executor/layers/fused_moe/oracle/mxfp4.py — clean.
  • ruff format --check vllm/model_executor/layers/fused_moe/oracle/mxfp4.py — already formatted.
  • mypy --python-version 3.10 --follow-imports=skip --ignore-missing-imports vllm/model_executor/layers/fused_moe/oracle/mxfp4.py → Success: no issues found in 1 source file.
  • typos vllm/model_executor/layers/fused_moe/oracle/mxfp4.py — clean.
  • python -c "import ast; ast.parse(open(...).read())" — OK.
  • grep -rn "select_gpt_oss_mxfp4_moe_backend\|select_mxfp4_moe_backend" across the repo — three call sites in quantization/mxfp4.py and quantization/quark/quark_moe.py, none need editing because the public signatures are unchanged.
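The call-site check in the last item could also be done structurally rather than textually. The following is a small sketch using Python's ast module, offered as an alternative to the grep that was actually run; find_call_sites is a hypothetical helper, not part of the PR.

```python
import ast

TARGETS = {"select_gpt_oss_mxfp4_moe_backend", "select_mxfp4_moe_backend"}


def find_call_sites(source: str):
    """Return sorted (line, function name) pairs for calls to the selectors."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            # Handle both bare names (f(...)) and attribute access (mod.f(...)).
            if isinstance(func, ast.Name):
                name = func.id
            else:
                name = getattr(func, "attr", None)
            if name in TARGETS:
                hits.append((node.lineno, name))
    return sorted(hits)
```

Unlike grep, this ignores matches in comments, strings, and import lines, so it counts only actual calls.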

I did not run the full pytest suite or build the wheel — the PR makes no behavior changes and the helpers are private (no new public API surface). Happy to run any specific kernel/MoE test the reviewer thinks is relevant.

AI assistance disclosure

Claude (Anthropic) assisted with: reading both selector functions and the issue thread, comparing against the in-flight PR #37776, designing the helper extraction shape, applying the edit, and running the lint/mypy/typos commands. I (Demian Havdun) reviewed every changed line, checked that the helpers preserve the original behavior path-by-path (especially the per-kernel-class debug logging and the XPU branch double-log dedup), and signed off as committer per DCO. Co-author trailer added per AGENTS.md.

Changed files

  • vllm/model_executor/layers/fused_moe/oracle/mxfp4.py (modified, +141/-126)
extent analysis

TL;DR

Merge the two separate backend selection functions into a unified select_mxfp4_moe_backend() function that accepts an activation_key or quantization_config parameter.

Guidance

  • Identify the shared code between select_gpt_oss_mxfp4_moe_backend() and select_mxfp4_moe_backend() to determine what can be unified.
  • Determine how to handle the differences in priority backend lists and activation key handling between BF16 and MXFP8/DeepGEMM backends.
  • Update the select_mxfp4_moe_backend() function to accept the new parameter and map it to the proper backend.
  • Review the related files (mxfp4.py, quark_moe.py) to ensure the changes are properly integrated.

Example

No code example is provided as the issue does not contain sufficient technical details.

Notes

The proposed solution aims to reduce code duplication, but the implementation details are not specified. The actual merge process may require additional considerations, such as handling edge cases or ensuring backwards compatibility.

Recommendation

The issue proposes merging the two functions into a unified select_mxfp4_moe_backend() function to reduce code duplication and improve maintainability. PR #41317 takes a narrower route: it keeps both entry points and extracts the duplicated logic into shared private helpers.
