vllm - ✅(Solved) Fix [Bug]: RoutedExpertsCapturer.capture() assertion failure with DP>1 when supports_internal_mk=True [1 pull request, 3 comments, 2 participants]

vllm 2026-03-23 06:44:03

GitHub stats

vllm-project/vllm#37857•Fetched 2026-04-08 01:17:30

Author

junjzhang

Participants

junjzhang

Young-Leo

Timeline (top)

commented ×3, cross-referenced ×1

Error Message

ray.exceptions.RayTaskError(AssertionError):
  ray::RayWorkerWrapper.execute_method()
    File "vllm/model_executor/layers/fused_moe/routed_experts_capturer.py", line 181, in capture
      assert cumsum[-1] == topk_ids.shape[0]
    AssertionError

Root Cause

capture() assumes topk_ids contains the tokens of all DP ranks concatenated (the naive dispatch path), and uses cumsum(num_tokens_across_dp_cpu) to slice out this rank's portion.

But in DefaultMoERunner.forward(), there are two DP dispatch paths:

# default_moe_runner.py:638-640
do_naive_dispatch_combine = (
    self.moe_config.dp_size > 1 and not self.quant_method.supports_internal_mk
)

PR fix notes

PR #37879: fix(moe): fix RoutedExpertsCapturer assertion failure with DP>1 and MK path

Repository: vllm-project/vllm
Author: Young-Leo
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/37879

Description

This PR fixes an AssertionError in RoutedExpertsCapturer.capture() that occurs during CUDA graph capture when --data-parallel-size > 1 and the quantization method uses the Modular Kernel (MK) path (supports_internal_mk=True).

Root Cause: In DefaultMoERunner.forward(), there are two DP dispatch paths:

Naive dispatch (supports_internal_mk=False): All DP ranks' tokens are concatenated before routing. The capturer expects topk_ids.shape[0] to equal the total tokens across all DP ranks.
Modular-kernel path (supports_internal_mk=True): The DP combine happens inside quant_method.apply. Therefore, select_experts() (where the capture fires) only sees this DP rank's local tokens. The original assert cumsum[-1] == topk_ids.shape[0] fails here.

Changes:

Updated RoutedExpertsCapturer.capture() to branch on n = topk_ids.shape[0] at runtime instead of asserting that it equals the DP-wide total.
Handled the naive dispatch path (n == total) by slicing the current rank's portion using cumsum.
Handled the MK path (n == token_num_per_dp) by capturing the local tokens directly.
Added a fallback logger.warning to prevent hard crashes and skip capture if the shape is unexpected.
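
The branching described in the changes above can be sketched without vLLM or PyTorch. This is a minimal illustration, not the merged code: the function name resolve_local_start, the logger name, and the use of plain integer lists in place of num_tokens_across_dp_cpu are all assumptions made for the example.

```python
import logging
from itertools import accumulate
from typing import Optional, Sequence

# Hypothetical logger name, used only in this sketch.
logger = logging.getLogger("routed_experts_capture_sketch")


def resolve_local_start(
    num_tokens_across_dp: Sequence[int], dp_rank: int, n: int
) -> Optional[int]:
    """Return the first row of this rank's tokens in a tensor with n rows,
    or None when n matches neither dispatch path."""
    token_num_per_dp = num_tokens_across_dp[dp_rank]
    total = sum(num_tokens_across_dp)

    if n == total:
        # Naive dispatch: rows of all DP ranks are concatenated in rank order,
        # so this rank starts where the previous ranks end.
        cumsum = list(accumulate(num_tokens_across_dp))
        return cumsum[dp_rank] - token_num_per_dp
    if n == token_num_per_dp:
        # MK path: the tensor already holds only this rank's rows.
        return 0

    # Unexpected shape: warn and let the caller skip capture.
    logger.warning(
        "Skipping capture: got %d rows, expected %d (all ranks) or %d (local)",
        n, total, token_num_per_dp,
    )
    return None


print(resolve_local_start([5, 3], dp_rank=1, n=8))  # naive path
print(resolve_local_start([5, 3], dp_rank=1, n=3))  # MK path
```

When every other rank has zero tokens, total equals token_num_per_dp and both conditions hold; the naive branch then also yields 0, so the order of the checks does not change the result.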

Related Issue

Fixes #37857

Changed files

tests/model_executor/test_routed_experts_capture.py (modified, +82/-0)
vllm/model_executor/layers/fused_moe/routed_experts_capturer.py (modified, +21/-5)


Your current environment

The output of python collect_env.py:

vLLM Version: v0.18.x (built from source, commit 9d28bf7e)
Python: 3.12.13
PyTorch: 2.10.0a0+gitb0eb5f7
CUDA: 12.9
GPU: NVIDIA H20 x 8 per node (2 nodes, 16 GPUs total)
OS: Linux 5.15.0-124-generic

🐛 Describe the bug

RoutedExpertsCapturer.capture() crashes with AssertionError during CUDA graph capture when --data-parallel-size > 1 and the quant method uses the Modular Kernel (MK) path (supports_internal_mk=True).

Launch args:

--tensor-parallel-size 8 \
--data-parallel-size 2 \
--enable-expert-parallel \
--enable-return-routed-experts

Model: any MoE model (e.g. DeepSeek-V3-like architecture with 128 routed experts, top_k=4).

Error (on Ray worker during CUDA graph warmup):

ray.exceptions.RayTaskError(AssertionError):
  ray::RayWorkerWrapper.execute_method()
    File "vllm/model_executor/layers/fused_moe/routed_experts_capturer.py", line 181, in capture
      assert cumsum[-1] == topk_ids.shape[0]
    AssertionError

Root cause:

capture() assumes topk_ids contains all DP ranks' tokens concatenated (the naive dispatch path). It uses cumsum(num_tokens_across_dp_cpu) to slice this rank's portion:

# routed_experts_capturer.py:178-183
cumsum = torch.cumsum(ctx.dp_metadata.num_tokens_across_dp_cpu, dim=0)
assert cumsum[-1] == topk_ids.shape[0]  # ← fails
end_loc = cumsum[self.dp_rank]
start_loc = end_loc - token_num_per_dp

But in DefaultMoERunner.forward(), there are two DP dispatch paths:

# default_moe_runner.py:638-640
do_naive_dispatch_combine = (
    self.moe_config.dp_size > 1 and not self.quant_method.supports_internal_mk
)

When supports_internal_mk=True, dispatch_router_logits() is not called before select_experts(). The DP dispatch happens inside quant_method.apply() instead. So select_experts() — where capture_fn fires — only sees this DP rank's local tokens.

| Path | `topk_ids.shape[0]` | `cumsum[-1]` | Result |
| --- | --- | --- | --- |
| Naive dispatch (`supports_internal_mk=False`) | total across all DP ranks | same | ✅ |
| MK internal (`supports_internal_mk=True`) | local DP rank only | total | ❌ assert |
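
To make the two rows of the table concrete, here is a small worked example in plain Python. The token counts [5, 3] and dp_rank = 1 are made-up values chosen for illustration, and integer lists stand in for the tensors in the original snippet.

```python
from itertools import accumulate

# Hypothetical per-rank token counts for one 2-way DP step (illustrative values).
num_tokens_across_dp = [5, 3]
dp_rank = 1

cumsum = list(accumulate(num_tokens_across_dp))  # running totals per rank
total = cumsum[-1]
local = num_tokens_across_dp[dp_rank]

# Number of rows capture() receives on each dispatch path.
rows_naive = total  # naive dispatch: every rank's tokens concatenated
rows_mk = local     # MK path: only this rank's tokens

# The original assertion, evaluated for each path.
naive_assert_holds = cumsum[-1] == rows_naive
mk_assert_holds = cumsum[-1] == rows_mk

# The slice the original code would take for this rank.
end_loc = cumsum[dp_rank]
start_loc = end_loc - local
slice_fits_naive = end_loc <= rows_naive
slice_fits_mk = end_loc <= rows_mk

print(naive_assert_holds, mk_assert_holds, (start_loc, end_loc))
```

With these numbers the assertion compares 8 against 8 on the naive path and 8 against 3 on the MK path. The slice for rank 1 covers rows 5 to 8, which does not exist in a 3-row tensor, so on the MK path the rank's rows have to be read from offset 0 instead.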

Suggested fix in capture():

token_num_per_dp = int(ctx.dp_metadata.num_tokens_across_dp_cpu[self.dp_rank])
total = int(torch.sum(ctx.dp_metadata.num_tokens_across_dp_cpu))

if topk_ids.shape[0] == total:
    # Naive dispatch path: all DP tokens concatenated
    cumsum = torch.cumsum(ctx.dp_metadata.num_tokens_across_dp_cpu, dim=0)
    start_loc = int(cumsum[self.dp_rank]) - token_num_per_dp
elif topk_ids.shape[0] == token_num_per_dp:
    # MK path: only this DP rank's tokens
    start_loc = 0
else:
    return

self._device_buffer[:token_num_per_dp, layer_id, :] = topk_ids[start_loc:start_loc + token_num_per_dp, :]

Workaround: use a vLLM general plugin (registered under the vllm.general_plugins entry point) that monkey-patches capture() with the above logic.
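
The shape of such a plugin might look like the sketch below. It is deliberately not wired to vLLM: StandInCapturer is a hypothetical stand-in for RoutedExpertsCapturer, its two-argument capture() is invented for the example, and the package and entry-point names in the comment are placeholders. Only the entry-point group name vllm.general_plugins comes from the report.

```python
# Packaging side (hypothetical names), e.g. in setup.py:
#   entry_points={"vllm.general_plugins": ["capture_patch = my_pkg.patch:register"]}


class StandInCapturer:
    """Hypothetical stand-in for RoutedExpertsCapturer (not the vLLM class)."""

    def capture(self, n_rows, total_rows):
        # Mirrors the failing check: assumes all DP ranks' rows are present.
        assert total_rows == n_rows
        return "captured"


def patched_capture(self, n_rows, total_rows):
    """Replacement that skips instead of asserting.

    A real patch would carry the two-branch slicing logic from the
    suggested fix; this body only shows where that logic would live.
    """
    if n_rows != total_rows:
        return None
    return "captured"


def register():
    """Callable that the entry point would reference.

    Written to be safe to call repeatedly, on the assumption that a
    plugin function may be invoked more than once in a process.
    """
    if StandInCapturer.capture is not patched_capture:
        StandInCapturer.capture = patched_capture
```

In a real plugin, register() would import the capturer class from vllm/model_executor/layers/fused_moe/routed_experts_capturer.py and assign the replacement there. The patch needs to run in the worker processes as well, since the traceback shows the assertion firing on a Ray worker.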

extent analysis

Fix Plan

To fix the AssertionError in RoutedExpertsCapturer.capture(), we need to modify the capture() method to handle both the naive dispatch path and the Modular Kernel (MK) path.

Step-by-Step Solution

1. Update the capture() method: Modify the capture() method in routed_experts_capturer.py to check whether topk_ids.shape[0] matches the total number of tokens across all DP ranks or the number of tokens in the current DP rank.
2. Calculate the start location: Based on the dispatch path, calculate the start location for slicing topk_ids.
3. Assign the sliced topk_ids to the device buffer: Assign the sliced topk_ids to the corresponding location in the device buffer.

Example Code

import torch


def capture(self, ctx, topk_ids, layer_id):
    # The argument list is illustrative; match the capture() signature
    # of the installed vLLM version.
    # Tokens owned by this DP rank, and the total across all DP ranks.
    token_num_per_dp = int(ctx.dp_metadata.num_tokens_across_dp_cpu[self.dp_rank])
    total = int(torch.sum(ctx.dp_metadata.num_tokens_across_dp_cpu))

    if topk_ids.shape[0] == total:
        # Naive dispatch path: all DP tokens concatenated
        cumsum = torch.cumsum(ctx.dp_metadata.num_tokens_across_dp_cpu, dim=0)
        start_loc = int(cumsum[self.dp_rank]) - token_num_per_dp
    elif topk_ids.shape[0] == token_num_per_dp:
        # MK path: only this DP rank's tokens
        start_loc = 0
    else:
        # Unexpected shape: skip capture instead of raising.
        return

    # Copy this rank's rows into the slot for this layer.
    self._device_buffer[:token_num_per_dp, layer_id, :] = topk_ids[start_loc:start_loc + token_num_per_dp, :]

Verification

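To verify the fix, relaunch with the launch args from the report (--data-parallel-size 2 and --enable-return-routed-experts, with a quant method that takes the MK path) and confirm that CUDA graph warmup completes without the AssertionError. PR #37879 also adds tests in tests/model_executor/test_routed_experts_capture.py.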

Extra Tips

Make sure to test the model with different dispatch paths (naive and MK) to ensure that the fix works correctly in both cases.
Consider adding additional logging or debugging statements to help diagnose any future issues with the capture() method.
