vllm - [Bug]: FlashInfer CUTLASS MoE backend causes CUDA illegal memory access on H100 during CUDA graph capture (Qwen3-Next-80B BF16)

Source: vllm-project/vllm#39288 (fetched 2026-04-09 07:52:06)

Error Message

(Worker_TP0) INFO [unquantized.py:283] Using FlashInfer CUTLASS Unquantized MoE backend out of potential backends: ['FlashInfer TRTLLM', 'FlashInfer CUTLASS', 'TRITON', 'BATCHED_TRITON'].
kernel_config=KernelConfig(...moe_backend='auto')
(Worker_TP*) INFO [custom_all_reduce.py:215] Registering 0 cuda graph addresses
...
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  97%|█████████▋| 143/147
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:455 'an illegal memory access was encountered'
(EngineCore) ERROR Worker proc VllmWorker-7 died unexpectedly, shutting down executor.
RuntimeError: Engine core initialization failed.

Key Observations (root cause not yet confirmed)

  1. Both runs register 0 cuda graph addresses — the crash is NOT caused by address registration overflow (unlike the B300 variant where workers registered 0 addresses due to a broken registration pass).
  2. Crash is deterministic — occurs at exactly step 143/147 (batch size ~2000 tokens) in every run, 100% failure rate across 40+ consecutive CI runs.
  3. Crash is specific to FlashInfer CUTLASS backend — the TRITON backend with identical configuration completes all 147 CUDA graph captures.
  4. H100-specific — the v0.19.1rc1.dev29-35 fix resolved the same crash on GB200NVL and B300 but H100 remains affected through v0.19.1rc1.dev45.
  5. Performance is equivalent — TRITON and FlashInfer CUTLASS show no measurable throughput difference for this model (<2% variance).

Fix / Workaround

Workaround (force TRITON MoE backend):

python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-Next-80B-A3B-Instruct \
  --tensor-parallel-size 8 \
  --dtype bfloat16 \
  --trust-remote-code \
  --max-model-len 33020 \
  --host 0.0.0.0 \
  --moe-backend triton \
  --kv-cache-dtype auto


Your current environment

  • GPU: NVIDIA H100 80GB HBM3 (8x per node, NVLink connected)
  • Platform: DGX H100 (Eos cluster)
  • vLLM version: v0.18.2rc1.dev54+g73f48ce55
  • CUDA: 12.x
  • PyTorch: 2.9.0+cu129
  • Model: Qwen/Qwen3-Next-80B-A3B-Instruct
  • Precision: BF16 (torch.bfloat16)
  • Tensor parallel: TP=4 and TP=8 (both crash)

🐛 Describe the bug

The FlashInfer CUTLASS Unquantized MoE backend triggers a deterministic CUDA illegal memory access during PIECEWISE CUDA graph capture for Qwen3-Next-80B-A3B-Instruct (BF16) on NVIDIA H100 GPUs. The error is reported at custom_all_reduce.cuh:455, at step 143 of 147 of CUDA graph capture.

This regression was introduced by #36286 ("Migrate Unquantized to Full Oracle Flow"), which changed the default unquantized MoE backend priority to prefer FlashInfer CUTLASS over TRITON. The same model serves correctly with --moe-backend triton.

Related Issues

  • #30579 — Same model/crash on B200, closed by stale bot without a fix
  • #37758 — FlashInfer CUTLASS/TRTLLM failures for Qwen3.5 BF16 DP/EP (separate configurations, same underlying kernel issue)

Regression Range

  • Last working: vLLM v0.18.1rc1.dev227 (TRITON selected by default)
  • First failing: vLLM v0.18.1rc1.dev266 (FlashInfer CUTLASS selected by default after #36286)
  • Still failing: vLLM v0.19.1rc1.dev45

Note: The fix shipped in v0.19.1rc1.dev29-35 resolved the same crash on GB200NVL and DGX B300, but H100 remains fully broken.
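
The regression range above can be expressed as a small check. The following is a minimal, stdlib-only sketch, assuming dev version strings follow the vX.Y.ZrcN.devM shape quoted in this issue; parse_dev_version and in_reported_failing_range are hypothetical helper names, not part of vLLM. It only reports whether a build lies between the first and the latest builds that the issue lists as failing on H100, and says nothing about later builds.

```python
import re

# Matches the dev-build strings quoted in the issue, for example
# "v0.18.2rc1.dev54+g73f48ce55". Other version formats are rejected.
_VERSION_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)rc(\d+)\.dev(\d+)(?:\+.*)?$")

# Bounds taken from the "Regression Range" section of the issue.
FIRST_FAILING = "v0.18.1rc1.dev266"
LAST_CONFIRMED_FAILING = "v0.19.1rc1.dev45"


def parse_dev_version(version: str) -> tuple[int, ...]:
    """Return (major, minor, patch, rc, dev) for a vX.Y.ZrcN.devM string."""
    match = _VERSION_RE.match(version.strip())
    if match is None:
        raise ValueError(f"unrecognized version format: {version!r}")
    return tuple(int(part) for part in match.groups())


def in_reported_failing_range(version: str) -> bool:
    """True if the build lies within the range reported as failing on H100."""
    parsed = parse_dev_version(version)
    low = parse_dev_version(FIRST_FAILING)
    high = parse_dev_version(LAST_CONFIRMED_FAILING)
    return low <= parsed <= high


if __name__ == "__main__":
    # The build used for the A/B experiment in the issue.
    print(in_reported_failing_range("v0.18.2rc1.dev54+g73f48ce55"))
```

Comparing the parsed tuples works here because every version in the issue has the same rc/dev shape; a general-purpose comparison would need a full PEP 440 parser.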

Reproduction

Crashes (default FlashInfer CUTLASS MoE backend):

python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-Next-80B-A3B-Instruct \
  --tensor-parallel-size 8 \
  --dtype bfloat16 \
  --trust-remote-code \
  --max-model-len 33020 \
  --host 0.0.0.0 \
  --kv-cache-dtype auto

Workaround (force TRITON MoE backend):

python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-Next-80B-A3B-Instruct \
  --tensor-parallel-size 8 \
  --dtype bfloat16 \
  --trust-remote-code \
  --max-model-len 33020 \
  --host 0.0.0.0 \
  --moe-backend triton \
  --kv-cache-dtype auto
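
The two commands above differ only in the --moe-backend flag. For scripted A/B runs, a launcher can build both variants from one place. This is a minimal sketch that reuses only the flags shown in the issue; build_server_argv is a hypothetical helper name, and the sketch prints the command instead of starting the server.

```python
import shlex


def build_server_argv(moe_backend=None, tensor_parallel_size=8):
    """Build the server command from the issue as an argv list.

    moe_backend=None reproduces the crashing default (backend left on auto);
    moe_backend="triton" reproduces the workaround.
    """
    argv = [
        "python3", "-m", "vllm.entrypoints.openai.api_server",
        "--model", "Qwen/Qwen3-Next-80B-A3B-Instruct",
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--dtype", "bfloat16",
        "--trust-remote-code",
        "--max-model-len", "33020",
        "--host", "0.0.0.0",
    ]
    if moe_backend is not None:
        # Only add the flag when a backend is forced explicitly.
        argv += ["--moe-backend", moe_backend]
    argv += ["--kv-cache-dtype", "auto"]
    return argv


if __name__ == "__main__":
    # Print the workaround command as a single shell-quoted line.
    print(shlex.join(build_server_argv(moe_backend="triton")))
```

The returned list can be passed to subprocess.run as-is; keeping it as a list avoids shell quoting problems.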

Experimental Evidence

We ran a controlled A/B experiment on the same hardware (DGX H100), the same vLLM version (v0.18.2rc1.dev54), the same model, and an identical configuration; the only difference is the MoE backend.

Run A — Default MoE backend (FlashInfer CUTLASS) → CRASH

(Worker_TP0) INFO [unquantized.py:283] Using FlashInfer CUTLASS Unquantized MoE backend out of potential backends: ['FlashInfer TRTLLM', 'FlashInfer CUTLASS', 'TRITON', 'BATCHED_TRITON'].
kernel_config=KernelConfig(...moe_backend='auto')
(Worker_TP*) INFO [custom_all_reduce.py:215] Registering 0 cuda graph addresses
...
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  97%|█████████▋| 143/147
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:455 'an illegal memory access was encountered'
(EngineCore) ERROR Worker proc VllmWorker-7 died unexpectedly, shutting down executor.
RuntimeError: Engine core initialization failed.

Run B — TRITON MoE backend → SUCCESS

(Worker_TP0) INFO [unquantized.py:208] Using TRITON Unquantized MoE backend out of potential backends: ['FlashInfer TRTLLM', 'FlashInfer CUTLASS', 'TRITON', 'BATCHED_TRITON'].
kernel_config=KernelConfig(...moe_backend='triton')
(Worker_TP*) INFO [custom_all_reduce.py:215] Registering 0 cuda graph addresses
...
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 147/147 [00:21<00:00, 6.98it/s]

Server starts successfully, completes all 147 CUDA graph captures, and serves inference requests normally.
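
When triaging many runs, the three signals that distinguish Run A from Run B (selected backend, capture progress, CUDA error site) can be pulled out of the startup log automatically. The sketch below is a minimal parser, assuming the log line formats are exactly those quoted in this issue; summarize_startup_log is a hypothetical helper, and the patterns may need adjusting for other vLLM versions.

```python
import re

# Patterns derived from the log lines quoted in the issue.
BACKEND_RE = re.compile(r"Using (.+?) Unquantized MoE backend")
PROGRESS_RE = re.compile(r"Capturing CUDA graphs.*?(\d+)/(\d+)")
ERROR_RE = re.compile(r"Cuda error (\S+):(\d+) '([^']+)'")


def summarize_startup_log(log_text):
    """Extract backend, last capture progress, and first-seen CUDA error site."""
    summary = {"backend": None, "captured": None, "total": None, "error": None}
    for line in log_text.splitlines():
        if (m := BACKEND_RE.search(line)) and summary["backend"] is None:
            summary["backend"] = m.group(1)
        if m := PROGRESS_RE.search(line):
            # Later progress lines overwrite earlier ones.
            summary["captured"] = int(m.group(1))
            summary["total"] = int(m.group(2))
        if (m := ERROR_RE.search(line)) and summary["error"] is None:
            summary["error"] = {
                "file": m.group(1),
                "line": int(m.group(2)),
                "message": m.group(3),
            }
    return summary


def capture_completed(summary):
    """True only if all graphs were captured and no CUDA error was logged."""
    return (
        summary["error"] is None
        and summary["total"] is not None
        and summary["captured"] == summary["total"]
    )
```

Applied to the Run A log, this yields backend "FlashInfer CUTLASS", progress 143 of 147, and an error at custom_all_reduce.cuh line 455; for Run B, capture_completed returns True.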

Key Observations

  1. Both runs register 0 cuda graph addresses — the crash is NOT caused by address registration overflow (unlike the B300 variant where workers registered 0 addresses due to a broken registration pass).
  2. Crash is deterministic — occurs at exactly step 143/147 (batch size ~2000 tokens) in every run, 100% failure rate across 40+ consecutive CI runs.
  3. Crash is specific to FlashInfer CUTLASS backend — the TRITON backend with identical configuration completes all 147 CUDA graph captures.
  4. H100-specific — the v0.19.1rc1.dev29-35 fix resolved the same crash on GB200NVL and B300 but H100 remains affected through v0.19.1rc1.dev45.
  5. Performance is equivalent — TRITON and FlashInfer CUTLASS show no measurable throughput difference for this model (<2% variance).
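
Observation 5 relies on a throughput comparison that the issue does not spell out. One way to make such a claim checkable is a symmetric relative difference against a 2% tolerance, sketched below. The formula and the helper names are assumptions, since the issue does not say how the variance was computed, and the throughput figures in the usage example are placeholders, not measurements from the issue.

```python
def relative_difference(a, b):
    """Symmetric relative difference: |a - b| divided by the mean of a and b."""
    mean = (a + b) / 2
    if mean == 0:
        raise ValueError("cannot compare two zero throughputs")
    return abs(a - b) / mean


def within_variance(a, b, tolerance=0.02):
    """True if the two throughputs differ by less than the given fraction."""
    return relative_difference(a, b) < tolerance


if __name__ == "__main__":
    # Placeholder tokens/s values for illustration only.
    triton_tps = 1000.0
    cutlass_tps = 1015.0
    print(within_variance(triton_tps, cutlass_tps))
```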

Root Cause Hypothesis

The finalizeMoeRoutingKernel from the FlashInfer CUTLASS (TRT-LLM) code path is suspected of performing an out-of-bounds memory access during PIECEWISE CUDA graph capture at large batch sizes on H100. The crash site (custom_all_reduce.cuh:455) is the symptom: CUDA reports errors asynchronously, so the illegal access likely originates in the MoE kernel and corrupts memory that the custom all-reduce operation touches later.

Before submitting a new issue...

  • I have searched for similar issues and found #30579 (closed stale) and #37758 (open, different config)
  • I have verified the issue still exists on the latest vLLM version (v0.19.1rc1.dev45)

extent analysis

TL;DR

The issue reports no root-cause fix for H100. The available workaround for the deterministic CUDA illegal memory access is to force the TRITON MoE backend (--moe-backend triton) instead of the default FlashInfer CUTLASS backend.

Guidance

  • The issue is specific to the FlashInfer CUTLASS MoE backend and can be worked around by forcing the use of the TRITON MoE backend.
  • The crash occurs at a specific step (143/147) during PIECEWISE CUDA graph capture, suggesting a memory access issue related to the MoE kernel.
  • To verify the issue, run the provided reproduction command and observe the crash.
  • To mitigate the issue, use the provided workaround command that forces the TRITON MoE backend.

Example

python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-Next-80B-A3B-Instruct \
  --tensor-parallel-size 8 \
  --dtype bfloat16 \
  --trust-remote-code \
  --max-model-len 33020 \
  --host 0.0.0.0 \
  --moe-backend triton \
  --kv-cache-dtype auto

Notes

The root cause of the issue is likely related to an out-of-bounds memory access in the FlashInfer CUTLASS MoE kernel, but the exact fix is not provided in the issue. The workaround using the TRITON MoE backend is a temporary solution until the underlying issue is addressed.

Recommendation

Apply the workaround by using the TRITON MoE backend: the issue reports that it completes all 147 CUDA graph captures with less than 2% throughput variance relative to FlashInfer CUTLASS. Treat it as a stopgap until the root cause is identified and fixed.
