vllm - [Bug]: `flashinfer_cutedsl` incompatible with all cross-node EP backends on GB200 NVL72

GitHub stats
vllm-project/vllm#37931 (fetched 2026-04-08 01:22:37)
Comments: 0 · Participants: 1 · Timeline: 1 · Reactions: 0
Timeline (top): labeled ×1

flashinfer_cutedsl cannot be used with any A2A backend that supports cross-node EP on GB200 NVL72:

Path 1: deepep_low_latency + flashinfer_cutedsl

DeepEP buffer init fails cross-node: RuntimeError: Failed: CUDA error /tmp/ep_kernels_workspace/DeepEP/csrc/deep_ep.cpp:226 'invalid resource handle'

Root cause: Buffer.runtime.sync() exchanges IPC handles via shared memory, which does not work across physical nodes. VLLM_DEEPEP_LOW_LATENCY_USE_MNNVL=1 only affects data transfer, not the init path.
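
Because handles exchanged through shared memory are only meaningful between processes on the same host, this failure can be anticipated before DeepEP is initialized: it applies whenever the EP group's ranks report more than one hostname. A minimal sketch of such a preflight check follows. It assumes the per-rank hostnames have already been gathered into a list indexed by rank, and the helper names `group_ranks_by_host` and `spans_multiple_nodes` are hypothetical, not part of vLLM or DeepEP.

```python
from collections import defaultdict


def group_ranks_by_host(hostnames):
    """Map hostname -> list of ranks, given hostnames indexed by rank."""
    by_host = defaultdict(list)
    for rank, host in enumerate(hostnames):
        by_host[host].append(rank)
    return dict(by_host)


def spans_multiple_nodes(hostnames):
    """Return True if the EP group covers more than one physical host."""
    return len(group_ranks_by_host(hostnames)) > 1


# EP8 with 4 GPUs per node, as in the reported environment: two hosts.
# The hostnames "node0" and "node1" are made up for illustration.
ep8 = ["node0"] * 4 + ["node1"] * 4
print(group_ranks_by_host(ep8))
print(spans_multiple_nodes(ep8))
```

A check like this could be used to fail fast with a readable message instead of the CUDA 'invalid resource handle' error.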

Path 2: flashinfer_nvlink_one_sided + flashinfer_cutedsl

Rejected at startup: ValueError: NvFp4 MoE backend 'FLASHINFER_CUTEDSL' does not support the deployment configuration since kernel does not support ('standard',) activation format.

Root cause: select_nvfp4_moe_backend() determines the activation format from config.moe_parallel_config.use_deepep_ll_kernels. This flag is True only for DeepEP, so any other A2A backend forces the standard format, which cutedsl does not support.
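
The gating described above can be modeled in a few lines. This is a simplified illustration, not the vLLM source: the `SUPPORTED_FORMATS` table, the `'batched_experts'` label, and both function names are assumptions inferred from the issue text. It shows how a single DeepEP-specific flag ends up rejecting cutedsl for every other A2A backend.

```python
# Assumed format support, inferred from the issue: cutedsl rejects the
# standard format, while trtllm works when the standard format is forced.
SUPPORTED_FORMATS = {
    "FLASHINFER_CUTEDSL": {"batched_experts"},
    "FLASHINFER_TRTLLM": {"standard"},
}


def activation_format(use_deepep_ll_kernels):
    """The format depends only on the DeepEP low-latency flag."""
    return "batched_experts" if use_deepep_ll_kernels else "standard"


def check_backend(backend, use_deepep_ll_kernels):
    """Return the activation format, or raise if the backend rejects it."""
    fmt = activation_format(use_deepep_ll_kernels)
    if fmt not in SUPPORTED_FORMATS[backend]:
        raise ValueError(
            f"NvFp4 MoE backend {backend!r} does not support the deployment "
            f"configuration since kernel does not support ({fmt!r},) "
            "activation format."
        )
    return fmt


# flashinfer_nvlink_one_sided never sets the DeepEP flag, so cutedsl is rejected.
try:
    check_backend("FLASHINFER_CUTEDSL", use_deepep_ll_kernels=False)
except ValueError as exc:
    print(exc)
```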

Your current environment

  • vLLM version: 0.17.2rc1
  • GPU: GB200 NVL72 (4 GPUs per node, MNNVL enabled)
  • Model: DeepSeek-R1-0528-FP4 (NVFP4)
  • Deployment: P/D disaggregation, cross-node EP8/EP32

🐛 Describe the bug

Impact

The only working cross-node EP combination is flashinfer_nvlink_one_sided + flashinfer_trtllm, which loses the decode latency advantage of cutedsl. This affects all GB200 NVL72 deployments using vLLM with large MoE models.
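
The three combinations mentioned in this report can be summarized as a small lookup table. The sketch below only restates the outcomes described here for cross-node EP on GB200 NVL72. The `REPORTED_OUTCOMES` name and the status strings are illustrative, and combinations not mentioned in the report are deliberately left out.

```python
# (all2all backend, MoE backend) -> outcome as described in this report.
REPORTED_OUTCOMES = {
    ("deepep_low_latency", "flashinfer_cutedsl"): "fails at DeepEP buffer init",
    ("flashinfer_nvlink_one_sided", "flashinfer_cutedsl"): "rejected at startup",
    ("flashinfer_nvlink_one_sided", "flashinfer_trtllm"): "works",
}


def working_combinations(outcomes):
    """Return the (all2all, moe) pairs reported as working."""
    return [pair for pair, status in outcomes.items() if status == "works"]


print(working_combinations(REPORTED_OUTCOMES))
```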

Suggested fix

  1. Implement BatchedExperts output in FlashInferNVLinkOneSidedPrepareAndFinalize and allow it to set use_deepep_ll_kernels = True (or introduce a separate use_batched_experts flag decoupled from DeepEP)

  2. Fix DeepEP cross-node buffer init to use MNNVL instead of CUDA IPC for handle exchange

How to reproduce

# Path 1 - DeepEP IPC failure (multi-node)
vllm serve <model> --enable-expert-parallel \
  --all2all-backend deepep_low_latency \
  --moe-backend flashinfer_cutedsl

# Path 2 - activation format mismatch (multi-node)
vllm serve <model> --enable-expert-parallel \
  --all2all-backend flashinfer_nvlink_one_sided \
  --moe-backend flashinfer_cutedsl

extent analysis

Fix Plan

To resolve the issue, we need to address two main problems:

  1. Implement BatchedExperts output in FlashInferNVLinkOneSidedPrepareAndFinalize:

    • Introduce a new flag, use_batched_experts, decoupled from DeepEP, so that FlashInferNVLinkOneSidedPrepareAndFinalize can request the batched-experts activation format without setting use_deepep_ll_kernels = True.
    • Modify select_nvfp4_moe_backend() to consider this new flag when determining the activation format.
  2. Fix DeepEP cross-node buffer init to use MNNVL instead of CUDA IPC for handle exchange:

    • Update Buffer.runtime.sync() to exchange handles over MNNVL rather than CUDA IPC, so that buffer init works across physical nodes.

Example Code Changes

import os

# Hypothetical sketch: class and function names follow the issue text,
# not the actual vLLM or DeepEP sources.


class FlashInferNVLinkOneSidedPrepareAndFinalize:
    def __init__(self, use_batched_experts=False):
        # Proposed flag: request batched-experts output independently of DeepEP
        self.use_batched_experts = use_batched_experts


def select_activation_format(use_deepep_ll_kernels, use_batched_experts=False):
    # Stand-in for the format decision inside select_nvfp4_moe_backend():
    # consider the new flag as well as the DeepEP-specific one
    if use_deepep_ll_kernels or use_batched_experts:
        return 'batched_experts'
    return 'standard'


class Buffer:
    def sync(self):
        # Read the MNNVL switch from the environment at call time
        if os.environ.get('VLLM_DEEPEP_LOW_LATENCY_USE_MNNVL', '0') == '1':
            # Use MNNVL for handle exchange
            self._sync_with_mnnvl()
        else:
            # Fall back to CUDA IPC (single node only)
            self._sync_with_cuda_ipc()

    def _sync_with_mnnvl(self):
        # Placeholder for the MNNVL-based handle exchange
        raise NotImplementedError

    def _sync_with_cuda_ipc(self):
        # Placeholder for the CUDA IPC-based handle exchange
        raise NotImplementedError

Verification

To verify the fix, run the following commands:

# Path 1 - DeepEP IPC failure (multi-node)
vllm serve <model> --enable-expert-parallel \
  --all2all-backend deepep_low_latency \
  --moe-backend flashinfer_cutedsl

# Path 2 - activation format mismatch (multi-node)
vllm serve <model> --enable-expert-parallel \
  --all2all-backend flashinfer_nvlink_one_sided \
  --moe-backend flashinfer_cutedsl

Both commands should now run without errors, and the flashinfer_cutedsl backend should work correctly with cross-node EP on GB200 NVL72.
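
When checking server logs after a change, it can help to look for the two error signatures quoted in this report. The following is a minimal sketch. The `SIGNATURES` table and `classify_failure` helper are hypothetical names, and the patterns are taken from the error messages above rather than from any vLLM API.

```python
import re

# Patterns derived from the two error messages quoted in the report.
SIGNATURES = {
    "path1_deepep_ipc_init": re.compile(
        r"deep_ep\.cpp:\d+ 'invalid resource handle'"
    ),
    "path2_activation_format": re.compile(
        r"does not support \('standard',\) activation format"
    ),
}


def classify_failure(log_text):
    """Return the names of all known failure signatures found in log_text."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(log_text)]


sample = (
    "RuntimeError: Failed: CUDA error "
    "/tmp/ep_kernels_workspace/DeepEP/csrc/deep_ep.cpp:226 "
    "'invalid resource handle'"
)
print(classify_failure(sample))
```

An empty list means neither signature was found. That alone does not prove the server started correctly, so it should be paired with a real inference request.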

Extra Tips

  • Ensure that MNNVL is properly configured and enabled on your system.
  • Test the changes thoroughly to avoid any regressions.
  • Consider adding additional logging or
