vllm - 💡(How to fix) Fix routed_experts all-zero with FlashInfer (TRTLLM/CUTLASS) MoE — wire FlashInfer routing_replay_out into the capturer

--enable-return-routed-experts returns all-zero routing when the MoE layer runs on a fused FlashInfer kernel (TRTLLM / CUTLASS / CuteDSL / b12x). These kernels are internal routers (top-k is computed inside the kernel), so the Python capture hook is never reached. On Blackwell (SM100) moe_backend='auto' selects FlashInfer TRTLLM, so capture is effectively broken out of the box there.

The good news: the FlashInfer side is already merged — the TRTLLM MoE ops expose a routing_replay_out output. vLLM just doesn't pass/consume it yet. This issue tracks wiring it through.

Root Cause

  • Capture hook: FusedMoE.router.set_capture_fn(...) → BaseRouter.select_experts() calls capture_fn(topk_ids) (vllm/model_executor/layers/fused_moe/router/base_router.py:289), delivered to RoutedExpertsCapturer.capture(layer_id, topk_ids) (bound in gpu_model_runner.py:_bind_routed_experts_capturer / :7261).
  • For FlashInfer MoE, MoERunner.is_internal_router is True (runner/moe_runner.py:305-306, gate is not None). The runner takes the monolithic branch (runner/moe_runner.py:514 apply_monolithic(...)), which never calls select_experts() → capture_fn never fires → capturer device buffer stays zero-initialized.
  • The experts kernel call (experts/trtllm_bf16_moe.py apply() → flashinfer.fused_moe.trtllm_bf16_moe(...)) does not pass routing_replay_out.
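
The failure mode above can be illustrated with a small, self-contained toy model. This is only a sketch: ToyRouter, ToyRunner and run are hypothetical stand-ins written for illustration, not vLLM classes, and plain Python lists replace device tensors. It shows that a hook placed inside select_experts() is skipped whenever the runner takes the monolithic branch, so the capture buffer keeps its initial zeros.

```python
from typing import Callable, List, Optional

TopK = List[List[int]]  # [num_tokens][top_k] expert ids


class ToyRouter:
    """Hypothetical stand-in for a router that owns the capture hook."""

    def __init__(self) -> None:
        self.capture_fn: Optional[Callable[[TopK], None]] = None

    def set_capture_fn(self, fn: Callable[[TopK], None]) -> None:
        self.capture_fn = fn

    def select_experts(self, scores: List[List[float]], top_k: int) -> TopK:
        # Python-side top-k: the only place where the hook is invoked.
        topk_ids = [
            sorted(range(len(row)), key=lambda e: -row[e])[:top_k]
            for row in scores
        ]
        if self.capture_fn is not None:
            self.capture_fn(topk_ids)
        return topk_ids


class ToyRunner:
    """Hypothetical stand-in for a runner with two execution branches."""

    def __init__(self, router: ToyRouter, is_internal_router: bool) -> None:
        self.router = router
        self.is_internal_router = is_internal_router

    def forward(self, scores: List[List[float]], top_k: int) -> TopK:
        if self.is_internal_router:
            # Monolithic branch: routing happens "inside the kernel",
            # select_experts() is never called, so the hook never fires.
            return [
                sorted(range(len(row)), key=lambda e: -row[e])[:top_k]
                for row in scores
            ]
        # Decomposed branch: routing goes through the router and its hook.
        return self.router.select_experts(scores, top_k)


def run(is_internal_router: bool) -> TopK:
    scores = [[0.1, 0.7, 0.2, 0.9], [0.8, 0.05, 0.6, 0.1]]
    buffer: TopK = [[0, 0] for _ in scores]  # zero-initialized capture buffer

    def capture(topk_ids: TopK) -> None:
        for i, ids in enumerate(topk_ids):
            buffer[i] = list(ids)

    router = ToyRouter()
    router.set_capture_fn(capture)
    ToyRunner(router, is_internal_router).forward(scores, top_k=2)
    return buffer
```

In this toy, run(False) fills the buffer with the selected expert ids, while run(True) leaves it all-zero even though the routing itself was computed, which matches the symptom described in this issue.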

Proposed work

FlashInfer's trtllm_{bf16,fp8_*,fp4_*}_moe_op already accept routing_replay_out: Optional[torch.Tensor] (mutated; see flashinfer/fused_moe/core.py, mutates_args=("routing_replay_out",)). Wire it through vLLM:

  1. In the monolithic experts apply() (experts/trtllm_bf16_moe.py and the fp8/fp4 variants), allocate a routing_replay_out tensor ([num_tokens, top_k], int32) and pass it to the FlashInfer call.
  2. Deliver the captured ids to the capturer for internal-router layers — e.g. have MoERunner (which holds self.router + layer_id) invoke self.router.capture_fn(routing_replay_out) after apply_monolithic when a capture_fn is set, mirroring the decomposed path.
  3. Handle the DP / SP / EP token layouts already documented in RoutedExpertsCapturer.capture() (naive-dispatch slice, modular-kernel per-DP, SP all-gather), and confirm correctness under CUDA graphs (the capturer buffer is read post-step by GPUModelRunner).
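
Steps 1 and 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual vLLM change: fake_fused_moe is a hypothetical stand-in for the FlashInfer op (the real op takes torch tensors and many more arguments), apply_monolithic_with_capture is an illustrative name, and nested Python lists replace the [num_tokens, top_k] int32 tensor. Step 3 (DP / SP / EP layouts and CUDA graphs) is not modelled here.

```python
from typing import Callable, List, Optional

TopK = List[List[int]]  # [num_tokens][top_k] expert ids


def fake_fused_moe(
    scores: List[List[float]],
    top_k: int,
    routing_replay_out: Optional[TopK] = None,
) -> List[float]:
    """Hypothetical stand-in for a fused kernel that routes internally.

    If a routing_replay_out buffer is supplied, it is mutated in place
    with the expert ids chosen for each token.
    """
    outputs = []
    for t, row in enumerate(scores):
        ids = sorted(range(len(row)), key=lambda e: -row[e])[:top_k]
        if routing_replay_out is not None:
            routing_replay_out[t][:] = ids
        # Placeholder for the expert computation.
        outputs.append(sum(row[e] for e in ids))
    return outputs


def apply_monolithic_with_capture(
    scores: List[List[float]],
    top_k: int,
    capture_fn: Optional[Callable[[TopK], None]] = None,
) -> List[float]:
    """Allocate the replay buffer only when capture is enabled, pass it to
    the kernel, then hand the filled buffer to the capture hook."""
    replay: Optional[TopK] = None
    if capture_fn is not None:
        replay = [[0] * top_k for _ in scores]  # shape [num_tokens, top_k]
    out = fake_fused_moe(scores, top_k, routing_replay_out=replay)
    if capture_fn is not None and replay is not None:
        capture_fn(replay)  # mirrors what the decomposed path does
    return out


def demo() -> TopK:
    captured: List[TopK] = []
    scores = [[0.1, 0.7, 0.2, 0.9], [0.8, 0.05, 0.6, 0.1]]
    apply_monolithic_with_capture(scores, top_k=2, capture_fn=captured.append)
    return captured[0]
```

Allocating the buffer only when a capture_fn is set keeps the default path unchanged, so layers that do not capture routing pay no extra cost in this sketch.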


Reproduction (GB200 / Blackwell, vllm ~0.22 nightly, flashinfer 0.6.11.post2)

vllm serve Qwen/Qwen3-30B-A3B --enable-return-routed-experts -tp 2
curl .../v1/completions -d '{"model":"...","prompt":"The capital of France is","max_tokens":16}'
# decode choices[0].routed_experts (base64 npy)
  • auto → FlashInfer TRTLLM: routed_experts shape (T, 48, 8), min=0 max=0, nonzero 0/7680 — every token "routes to expert 0".
  • --moe-backend triton: shape (T, 48, 8), min=0 max=127, 128 distinct experts, exactly 8 distinct per (token, layer) — correct top-8.

Generated text is identical/correct in both cases → the model routes fine internally; only the capture is empty.
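
A small helper can make the comparison above repeatable. The following is a sketch, assuming the routed_experts field is a base64-encoded .npy payload as noted in the reproduction; decode_routed_experts and summarize_routing are illustrative names, not part of vLLM. The decoder needs NumPy, while the summary works on plain nested lists.

```python
import base64
import io


def decode_routed_experts(b64_npy: str):
    """Decode a base64-encoded .npy payload into a NumPy array."""
    import numpy as np  # lazy import: only the decoder needs NumPy

    return np.load(io.BytesIO(base64.b64decode(b64_npy)))


def summarize_routing(routed) -> dict:
    """Summarize a nested list shaped [tokens][layers][top_k]."""
    flat = [e for tok in routed for layer in tok for e in layer]
    return {
        "min": min(flat),
        "max": max(flat),
        "nonzero": sum(1 for e in flat if e != 0),
        "total": len(flat),
        "distinct_experts": len(set(flat)),
        # A valid top-k selection never repeats an expert within one
        # (token, layer) slot; an all-zero capture fails this check.
        "all_slots_distinct": all(
            len(set(layer)) == len(layer) for tok in routed for layer in tok
        ),
    }
```

Given a parsed JSON response resp, something like summarize_routing(decode_routed_experts(resp["choices"][0]["routed_experts"]).tolist()) would report nonzero 0 and all_slots_distinct False for the broken capture, and distinct experts in every slot for the Triton backend.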

Relation to existing work

  • Parent design: RFC #39701 ("Replace routing replay with CUDA-graph-compatible device cache"). This issue is the concrete, narrower "wire the already-merged FlashInfer routing_replay_out" sub-task with a Blackwell repro; it can be an interim before/within the #39701 redesign.
  • Interim guardrail (prevents silent all-zero): PR #44115 (falls back to Triton / errors when capture + FlashInfer backend).
  • Capture transport + entrypoint: PR #39568, PR #38939.

Note

This issue was drafted with AI assistance and is intended as an actionable starting point for an implementer.
