vllm - ✅(Solved) Fix [Bug]: RoutedExpertsCapturer.capture() assertion failure with DP>1 when supports_internal_mk=True [1 pull request, 3 comments, 2 participants]

GitHub stats
vllm-project/vllm#37857 (fetched 2026-04-08 01:17:30)
Comments: 3 · Participants: 2 · Timeline events: 4 (commented ×3, cross-referenced ×1) · Reactions: 0

Error Message

ray.exceptions.RayTaskError(AssertionError):
  ray::RayWorkerWrapper.execute_method()
    File "vllm/model_executor/layers/fused_moe/routed_experts_capturer.py", line 181, in capture
      assert cumsum[-1] == topk_ids.shape[0]
    AssertionError

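Root Cause

capture() assumes topk_ids contains all DP ranks' tokens concatenated (the naive dispatch path) and uses cumsum(num_tokens_across_dp_cpu) to slice this rank's portion. But DefaultMoERunner.forward() has two DP dispatch paths, selected by this flag:

# default_moe_runner.py:638-640
do_naive_dispatch_combine = (
    self.moe_config.dp_size > 1 and not self.quant_method.supports_internal_mk
)

When supports_internal_mk=True, the DP dispatch happens inside quant_method.apply(), so select_experts(), where the capture fires, only sees this DP rank's local tokens and the assertion fails.

Fix / Workaround

PR #37879 updates capture() to handle both dispatch paths. As a workaround, a vLLM general plugin (vllm.general_plugins entry point) can monkey-patch capture() with the same logic.
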
PR fix notes

PR #37879: fix(moe): fix RoutedExpertsCapturer assertion failure with DP>1 and MK path

Description

This PR fixes an AssertionError in RoutedExpertsCapturer.capture() that occurs during CUDA graph capture when --data-parallel-size > 1 and the quantization method uses the Modular Kernel (MK) path (supports_internal_mk=True).

Root Cause: In DefaultMoERunner.forward(), there are two DP dispatch paths:

  1. Naive dispatch (supports_internal_mk=False): All DP ranks' tokens are concatenated before routing. The capturer expects topk_ids.shape[0] to equal the total tokens across all DP ranks.
  2. Modular-kernel path (supports_internal_mk=True): The DP combine happens inside quant_method.apply. Therefore, select_experts() (where the capture fires) only sees this DP rank's local tokens. The original assert cumsum[-1] == topk_ids.shape[0] fails here.
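
To make the mismatch concrete, here is a minimal torch-free sketch of the original slicing logic, with plain Python lists standing in for tensors. The function name old_slice_bounds and the token counts (3 and 5) are illustrative assumptions, not vLLM code.

```python
from itertools import accumulate


def old_slice_bounds(num_tokens_across_dp, dp_rank, topk_rows):
    """Mimic the pre-fix logic, which assumes topk_ids holds every rank's tokens."""
    cumsum = list(accumulate(num_tokens_across_dp))
    # Original assertion: only holds on the naive dispatch path.
    assert cumsum[-1] == topk_rows
    end_loc = cumsum[dp_rank]
    start_loc = end_loc - num_tokens_across_dp[dp_rank]
    return start_loc, end_loc


# Hypothetical batch: DP rank 0 has 3 tokens, DP rank 1 has 5.
tokens = [3, 5]

# Naive dispatch: topk_ids has 3 + 5 = 8 rows, so rank 1 owns rows 3..8.
naive_bounds = old_slice_bounds(tokens, dp_rank=1, topk_rows=8)

# MK path: rank 1 only sees its own 5 rows, so the assertion trips.
try:
    old_slice_bounds(tokens, dp_rank=1, topk_rows=5)
    mk_failed = False
except AssertionError:
    mk_failed = True
```

On the MK path the row count (5) can never equal the cross-rank total (8) unless the other ranks are empty, which is why the failure shows up as soon as DP>1 is combined with supports_internal_mk=True.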

Changes:

  • Updated the logic in RoutedExpertsCapturer.capture() to check topk_ids.shape[0] dynamically.
  • Handled the Naive dispatch path (n == total) by slicing the current rank's portion using cumsum.
  • Handled the MK path (n == token_num_per_dp) by directly capturing the local tokens.
  • Added a fallback logger.warning to prevent hard crashes and skip capture if the shape is unexpected.
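
The changes listed above can be sketched without torch as follows. This is only an illustration of the branching: select_local_rows, the logger name, and the list-of-lists stand-in for topk_ids are assumptions, and the actual PR code may be structured differently.

```python
import logging
from itertools import accumulate

logger = logging.getLogger("routed_experts_sketch")  # hypothetical logger name


def select_local_rows(topk_ids, num_tokens_across_dp, dp_rank):
    """Return this rank's rows of topk_ids, or None if the shape is unexpected."""
    local = num_tokens_across_dp[dp_rank]
    total = sum(num_tokens_across_dp)
    n = len(topk_ids)

    if n == total:
        # Naive dispatch: all ranks concatenated, so slice out our portion.
        start = list(accumulate(num_tokens_across_dp))[dp_rank] - local
    elif n == local:
        # MK path: the rows are already this rank's local tokens.
        start = 0
    else:
        # Fallback: warn and skip the capture instead of crashing.
        logger.warning(
            "unexpected topk_ids rows: got %d, expected %d or %d", n, total, local
        )
        return None
    return topk_ids[start:start + local]


# Eight fake rows of expert ids; rank 1 owns 5 of them.
rows = [[r, r + 1] for r in range(8)]
naive = select_local_rows(rows, [3, 5], dp_rank=1)      # rows 3..7
mk = select_local_rows(rows[:5], [3, 5], dp_rank=1)     # all 5 local rows
bad = select_local_rows(rows[:4], [3, 5], dp_rank=1)    # warns, returns None
```

Checking the total first keeps the naive path's behavior unchanged; when the other ranks hold zero tokens the two branches compute the same start offset, so the ordering is harmless.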

Related Issue

Fixes #37857

Changed files

  • tests/model_executor/test_routed_experts_capture.py (modified, +82/-0)
  • vllm/model_executor/layers/fused_moe/routed_experts_capturer.py (modified, +21/-5)


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
vLLM Version: v0.18.x (built from source, commit 9d28bf7e)
Python: 3.12.13
PyTorch: 2.10.0a0+gitb0eb5f7
CUDA: 12.9
GPU: NVIDIA H20 x 8 per node (2 nodes, 16 GPUs total)
OS: Linux 5.15.0-124-generic
</details>

🐛 Describe the bug

RoutedExpertsCapturer.capture() crashes with AssertionError during CUDA graph capture when --data-parallel-size > 1 and the quant method uses the Modular Kernel (MK) path (supports_internal_mk=True).

Launch args:

--tensor-parallel-size 8 \
--data-parallel-size 2 \
--enable-expert-parallel \
--enable-return-routed-experts

Model: any MoE model (e.g. DeepSeek-V3-like architecture with 128 routed experts, top_k=4).

Error (on Ray worker during CUDA graph warmup):

ray.exceptions.RayTaskError(AssertionError):
  ray::RayWorkerWrapper.execute_method()
    File "vllm/model_executor/layers/fused_moe/routed_experts_capturer.py", line 181, in capture
      assert cumsum[-1] == topk_ids.shape[0]
    AssertionError

Root cause:

capture() assumes topk_ids contains all DP ranks' tokens concatenated (the naive dispatch path). It uses cumsum(num_tokens_across_dp_cpu) to slice this rank's portion:

# routed_experts_capturer.py:178-183
cumsum = torch.cumsum(ctx.dp_metadata.num_tokens_across_dp_cpu, dim=0)
assert cumsum[-1] == topk_ids.shape[0]  # ← fails
end_loc = cumsum[self.dp_rank]
start_loc = end_loc - token_num_per_dp

But in DefaultMoERunner.forward(), there are two DP dispatch paths:

# default_moe_runner.py:638-640
do_naive_dispatch_combine = (
    self.moe_config.dp_size > 1 and not self.quant_method.supports_internal_mk
)

When supports_internal_mk=True, dispatch_router_logits() is not called before select_experts(). The DP dispatch happens inside quant_method.apply() instead. So select_experts() — where capture_fn fires — only sees this DP rank's local tokens.

| Path | topk_ids.shape[0] | cumsum[-1] | Result |
| Naive dispatch (supports_internal_mk=False) | total across all DP ranks | same | assert passes |
| MK internal (supports_internal_mk=True) | local DP rank only | total | ❌ assert |

Suggested fix in capture():

token_num_per_dp = int(ctx.dp_metadata.num_tokens_across_dp_cpu[self.dp_rank])
total = int(torch.sum(ctx.dp_metadata.num_tokens_across_dp_cpu))

if topk_ids.shape[0] == total:
    # Naive dispatch path: all DP tokens concatenated
    cumsum = torch.cumsum(ctx.dp_metadata.num_tokens_across_dp_cpu, dim=0)
    start_loc = int(cumsum[self.dp_rank]) - token_num_per_dp
elif topk_ids.shape[0] == token_num_per_dp:
    # MK path: only this DP rank's tokens
    start_loc = 0
else:
    return

self._device_buffer[:token_num_per_dp, layer_id, :] = topk_ids[start_loc:start_loc + token_num_per_dp, :]

Workaround: vLLM general plugin (vllm.general_plugins entry point) that monkey-patches capture() with the above logic.
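
A rough sketch of what such a plugin could look like is below. CapturerStub is a stand-in class, not the vLLM class, and its capture() signature, the package name my_plugin, and the entry-point name dp_capture_patch are all hypothetical; a real plugin would import RoutedExpertsCapturer inside register() and match the method's actual parameters.

```python
import functools

# In a real plugin package, register() would be exposed through the
# entry-point group, for example in pyproject.toml (hypothetical names):
#   [project.entry-points."vllm.general_plugins"]
#   dp_capture_patch = "my_plugin:register"


class CapturerStub:
    """Stand-in for the class being patched (NOT the vLLM class)."""

    def __init__(self, dp_rank):
        self.dp_rank = dp_rank

    def capture(self, n_rows, num_tokens_across_dp):
        # Pre-fix assumption: topk_ids always holds every rank's tokens.
        assert sum(num_tokens_across_dp) == n_rows
        return sum(num_tokens_across_dp[: self.dp_rank])  # start_loc


def register(target_cls=CapturerStub):
    """Swap in a tolerant capture(); safe to call more than once."""
    if getattr(target_cls.capture, "_dp_patched", False):
        return  # already patched in this process
    original = target_cls.capture

    @functools.wraps(original)
    def capture(self, n_rows, num_tokens_across_dp):
        if n_rows == sum(num_tokens_across_dp):
            return original(self, n_rows, num_tokens_across_dp)  # naive path
        if n_rows == num_tokens_across_dp[self.dp_rank]:
            return 0  # MK path: rows are already local
        return None  # unexpected shape: skip capture

    capture._dp_patched = True
    target_cls.capture = capture


register()
register()  # second call is a no-op thanks to the marker attribute
capturer = CapturerStub(dp_rank=1)
naive_start = capturer.capture(8, [3, 5])
mk_start = capturer.capture(5, [3, 5])
skipped = capturer.capture(4, [3, 5])
```

The marker attribute guards against double patching, which matters if the plugin function ends up being invoked more than once in the same process.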


extent analysis

Fix Plan

To fix the AssertionError in RoutedExpertsCapturer.capture(), we need to modify the capture() method to handle both the naive dispatch path and the Modular Kernel (MK) path.

Step-by-Step Solution

  1. Update the capture() method: Modify the capture() method in routed_experts_capturer.py to check if topk_ids.shape[0] matches the total number of tokens across all DP ranks or the number of tokens in the current DP rank.
  2. Calculate the start location: Based on the dispatch path, calculate the start location for slicing topk_ids.
  3. Assign the sliced topk_ids to the device buffer: Assign the sliced topk_ids to the corresponding location in the device buffer.

Example Code

import torch

def capture(self, ctx, topk_ids, layer_id):
    # Illustrative signature; the real method's parameters may differ.
    num_tokens = ctx.dp_metadata.num_tokens_across_dp_cpu
    token_num_per_dp = int(num_tokens[self.dp_rank])
    total = int(torch.sum(num_tokens))

    if topk_ids.shape[0] == total:
        # Naive dispatch path: all DP tokens concatenated, so slice out this rank's rows.
        cumsum = torch.cumsum(num_tokens, dim=0)
        start_loc = int(cumsum[self.dp_rank]) - token_num_per_dp
    elif topk_ids.shape[0] == token_num_per_dp:
        # MK path: topk_ids already holds only this DP rank's tokens.
        start_loc = 0
    else:
        # Unexpected shape: skip capture instead of crashing (the PR logs a warning here).
        return

    self._device_buffer[:token_num_per_dp, layer_id, :] = topk_ids[start_loc:start_loc + token_num_per_dp, :]

Verification

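To verify the fix, relaunch with the same arguments that triggered the failure (--data-parallel-size 2 with --enable-return-routed-experts on a quant method where supports_internal_mk=True) and confirm that CUDA graph warmup completes without the AssertionError. The PR also adds tests in tests/model_executor/test_routed_experts_capture.py.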

Extra Tips

  • Make sure to test the model with different dispatch paths (naive and MK) to ensure that the fix works correctly in both cases.
  • Consider adding additional logging or debugging statements to help diagnose any future issues with the capture() method.
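
As one way to follow both tips, the sketch below tests a simplified start-offset helper on the naive path, the MK path, and an unexpected shape, and checks that the last case logs a warning. start_loc_for and the logger name are hypothetical stand-ins; the tests added by the PR are not reproduced here.

```python
import logging
import unittest

logger = logging.getLogger("capture_sketch")  # hypothetical logger name


def start_loc_for(n_rows, num_tokens_across_dp, dp_rank):
    """Simplified stand-in for the slicing decision inside capture()."""
    if n_rows == sum(num_tokens_across_dp):
        return sum(num_tokens_across_dp[:dp_rank])  # naive dispatch path
    if n_rows == num_tokens_across_dp[dp_rank]:
        return 0  # MK path
    logger.warning("unexpected topk_ids rows: %d", n_rows)
    return None


class CapturePathsTest(unittest.TestCase):
    def test_naive_path(self):
        self.assertEqual(start_loc_for(8, [3, 5], 1), 3)

    def test_mk_path(self):
        self.assertEqual(start_loc_for(5, [3, 5], 1), 0)

    def test_unexpected_shape_warns(self):
        # assertLogs fails the test if no WARNING is emitted on this logger.
        with self.assertLogs("capture_sketch", level="WARNING"):
            self.assertIsNone(start_loc_for(4, [3, 5], 1))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(CapturePathsTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```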
