vllm - [Bug]: `flashinfer_cutedsl` incompatible with all cross-node EP backends on GB200 NVL72

vllm · 2026-03-23 21:40:20


GitHub stats

vllm-project/vllm#37931•Fetched 2026-04-08 01:22:37


Author

qiching

Participants

qiching


flashinfer_cutedsl cannot be used with any A2A backend that supports cross-node EP on GB200 NVL72:

Path 1: deepep_low_latency + flashinfer_cutedsl

DeepEP buffer init fails cross-node: RuntimeError: Failed: CUDA error /tmp/ep_kernels_workspace/DeepEP/csrc/deep_ep.cpp:226 'invalid resource handle'

Root cause: Buffer.runtime.sync() exchanges IPC handles via shared memory, which does not work across physical nodes. Setting VLLM_DEEPEP_LOW_LATENCY_USE_MNNVL=1 only affects data transfer, not the init path.

Path 2: flashinfer_nvlink_one_sided + flashinfer_cutedsl

Rejected at startup: ValueError: NvFp4 MoE backend 'FLASHINFER_CUTEDSL' does not support the deployment configuration since kernel does not support ('standard',) activation format.

Root cause: select_nvfp4_moe_backend() determines the activation format via config.moe_parallel_config.use_deepep_ll_kernels. This flag is only True for DeepEP, so any other A2A backend forces the standard format, which cutedsl does not support.
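
The Path 2 rejection can be modelled in a few lines. The sketch below is a simplified stand-in, not vLLM's actual `select_nvfp4_moe_backend()`: the names `SUPPORTED_FORMATS`, `activation_format_for`, and `check_backend` are hypothetical, and the `batched` label is an assumed name for the non-standard format. It only illustrates why a flag tied to DeepEP pushes every other A2A backend onto the standard format.

```python
# Simplified model of the gating described above; hypothetical names, not vLLM code.

# Activation formats each NVFP4 MoE backend accepts (labels assumed for illustration).
SUPPORTED_FORMATS = {
    "flashinfer_cutedsl": {"batched"},
    "flashinfer_trtllm": {"standard"},
}


def activation_format_for(all2all_backend: str) -> str:
    """Mimic a check where only DeepEP low-latency yields the batched format."""
    use_deepep_ll_kernels = all2all_backend == "deepep_low_latency"
    return "batched" if use_deepep_ll_kernels else "standard"


def check_backend(moe_backend: str, all2all_backend: str) -> str:
    """Return the chosen format, or raise if the MoE backend cannot consume it."""
    fmt = activation_format_for(all2all_backend)
    if fmt not in SUPPORTED_FORMATS[moe_backend]:
        raise ValueError(
            f"{moe_backend} does not support the {fmt!r} activation format "
            f"selected for {all2all_backend}"
        )
    return fmt
```

In this model, `check_backend("flashinfer_cutedsl", "flashinfer_nvlink_one_sided")` raises `ValueError`, mirroring the startup rejection. The DeepEP pairing passes the format check here, which matches the report: that combination fails later, during buffer init, rather than at backend selection.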

Error Message

RuntimeError: Failed: CUDA error /tmp/ep_kernels_workspace/DeepEP/csrc/deep_ep.cpp:226 'invalid resource handle'

Root Cause


Your current environment

Current environment

vLLM version: 0.17.2rc1
GPU: GB200 NVL72 (4 GPUs per node, MNNVL enabled)
Model: DeepSeek-R1-0528-FP4 (NVFP4)
Deployment: P/D disaggregation, cross-node EP8/EP32

🐛 Describe the bug

Description

flashinfer_cutedsl cannot be used with any A2A backend that supports cross-node EP on GB200 NVL72:

Path 1: deepep_low_latency + flashinfer_cutedsl

DeepEP buffer init fails cross-node: RuntimeError: Failed: CUDA error /tmp/ep_kernels_workspace/DeepEP/csrc/deep_ep.cpp:226 'invalid resource handle'

Path 2: flashinfer_nvlink_one_sided + flashinfer_cutedsl

Rejected at startup: ValueError: NvFp4 MoE backend 'FLASHINFER_CUTEDSL' does not support the deployment configuration since kernel does not support ('standard',) activation format.

Impact

The only working cross-node EP combination is flashinfer_nvlink_one_sided + flashinfer_trtllm, which loses the decode latency advantage of cutedsl. This affects all GB200 NVL72 deployments using vLLM with large MoE models.
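
For quick reference, the three combinations reported in this issue can be written down as a small lookup table. This is only a restatement of the report in Python; the dictionary name and helper are illustrative and do not exist in vLLM.

```python
# Combinations reported in this issue for cross-node EP on GB200 NVL72.
# Keys are (all2all backend, MoE backend); values are (works, observed outcome).
REPORTED_COMBINATIONS = {
    ("deepep_low_latency", "flashinfer_cutedsl"): (
        False,
        "DeepEP buffer init fails: CUDA error 'invalid resource handle'",
    ),
    ("flashinfer_nvlink_one_sided", "flashinfer_cutedsl"): (
        False,
        "rejected at startup: 'standard' activation format unsupported",
    ),
    ("flashinfer_nvlink_one_sided", "flashinfer_trtllm"): (
        True,
        "works, but without the cutedsl decode latency advantage",
    ),
}


def working_combinations():
    """Return the (all2all, moe) pairs reported as working."""
    return [pair for pair, (ok, _) in REPORTED_COMBINATIONS.items() if ok]
```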

Suggested fix

- Implement BatchedExperts output in FlashInferNVLinkOneSidedPrepareAndFinalize and allow it to set use_deepep_ll_kernels = True (or introduce a separate use_batched_experts flag decoupled from DeepEP).
- Fix DeepEP cross-node buffer init to use MNNVL instead of CUDA IPC for handle exchange.
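
The DeepEP item hinges on knowing whether the EP group actually spans physical nodes, since the reported init failure occurs only in that case. A minimal sketch of such a check, assuming each rank's hostname has already been gathered by the launcher; `spans_multiple_nodes` and `pick_handle_exchange` are hypothetical helpers, not DeepEP or vLLM APIs, and the returned strings are just labels:

```python
from typing import Sequence


def spans_multiple_nodes(rank_hostnames: Sequence[str]) -> bool:
    """True when the ranks in the group live on more than one host."""
    return len(set(rank_hostnames)) > 1


def pick_handle_exchange(rank_hostnames: Sequence[str], mnnvl_enabled: bool) -> str:
    """Choose a handle-exchange strategy label for buffer init (illustrative only)."""
    if not spans_multiple_nodes(rank_hostnames):
        # All ranks share one host, so CUDA IPC handles can be exchanged locally.
        return "cuda_ipc"
    if mnnvl_enabled:
        # Ranks sit on different hosts inside one NVLink domain.
        return "mnnvl"
    raise RuntimeError("EP group spans nodes but MNNVL is not enabled")
```

With the environment from this report (4 GPUs per node, EP8), the hostname list would contain two distinct hosts, so the sketch would select the `mnnvl` label instead of attempting a CUDA IPC exchange.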

How to reproduce

# Path 1 - DeepEP IPC failure (multi-node)
vllm serve <model> --enable-expert-parallel \
  --all2all-backend deepep_low_latency \
  --moe-backend flashinfer_cutedsl

# Path 2 - activation format mismatch (multi-node)
vllm serve <model> --enable-expert-parallel \
  --all2all-backend flashinfer_nvlink_one_sided \
  --moe-backend flashinfer_cutedsl


extent analysis

Fix Plan

To resolve the issue, we need to address two main problems:

Implement BatchedExperts output in FlashInferNVLinkOneSidedPrepareAndFinalize:
- Either allow it to set use_deepep_ll_kernels = True, or introduce a separate use_batched_experts flag that is decoupled from DeepEP.
- Modify select_nvfp4_moe_backend() to consider that flag when determining the activation format.
Fix DeepEP cross-node buffer init to use MNNVL instead of CUDA IPC for handle exchange:
- Update the handle exchange in Buffer.runtime.sync() so that it no longer relies on IPC handles passed through shared memory, which does not work across physical nodes.

Example Code Changes

# Illustrative sketch only: simplified stand-ins, not the real vLLM or DeepEP classes.
import os


class FlashInferNVLinkOneSidedPrepareAndFinalize:
    def __init__(self, use_batched_experts=False):
        # Proposed flag: request the batched-experts format without DeepEP.
        self.use_batched_experts = use_batched_experts


def select_activation_format(prepare_finalize, use_deepep_ll_kernels=False):
    # Hypothetical stand-in for the format check inside select_nvfp4_moe_backend().
    # Consider the new flag in addition to the DeepEP-specific one.
    if use_deepep_ll_kernels or prepare_finalize.use_batched_experts:
        return "batched_experts"
    return "standard"


class Buffer:
    def sync(self):
        # Choose the handle-exchange path from the environment variable.
        if os.environ.get("VLLM_DEEPEP_LOW_LATENCY_USE_MNNVL", "0") == "1":
            self._sync_with_mnnvl()
        else:
            # Fall back to CUDA IPC (single-node only).
            self._sync_with_cuda_ipc()

    def _sync_with_mnnvl(self):
        # Placeholder for an MNNVL-based handle exchange.
        pass

    def _sync_with_cuda_ipc(self):
        # Placeholder for the CUDA IPC-based handle exchange.
        pass

Verification

To verify the fix, run the following commands:

# Path 1 - DeepEP IPC failure (multi-node)
vllm serve <model> --enable-expert-parallel \
  --all2all-backend deepep_low_latency \
  --moe-backend flashinfer_cutedsl

# Path 2 - activation format mismatch (multi-node)
vllm serve <model> --enable-expert-parallel \
  --all2all-backend flashinfer_nvlink_one_sided \
  --moe-backend flashinfer_cutedsl

Both commands should now run without errors, and the flashinfer_cutedsl backend should work correctly with cross-node EP on GB200 NVL72.
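
When checking server logs after a run, the two failure paths can be told apart by the error strings quoted in this issue. The helper below is a small sketch for that purpose; `classify_failure` and its return labels are hypothetical, and the match strings are taken from the error messages above.

```python
def classify_failure(log_text: str) -> str:
    """Map startup log text to the failure path described in this issue."""
    if "invalid resource handle" in log_text:
        # Path 1: DeepEP buffer init failed across nodes.
        return "path1_deepep_ipc"
    if "does not support the deployment configuration" in log_text:
        # Path 2: the MoE backend was rejected for its activation format.
        return "path2_activation_format"
    return "unknown"
```

A result of `unknown` only means that neither known error string appeared in the text; it does not by itself confirm that the server started correctly.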

Extra Tips

Ensure that MNNVL is properly configured and enabled on your system.
Test the changes thoroughly to avoid any regressions.
Consider adding additional logging or
