vllm - 💡(How to fix) Fix CUDA illegal memory access in FP8 MoE moe_permute with cutlass backend at batch size 8192

vllm · 2026-05-22 07:09:48

Error Message

File "/opt/vllm-source/vllm/model_executor/layers/fused_moe/moe_permute_unpermute.py", line 90, in moe_permute
    a1q_scale = a1q_scale[permuted_idx.clamp(max=n_token * topk - 1) // topk]
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: an illegal memory access was encountered

Full call chain:

core.py:1102 → EngineCore fatal error
  → gpu_worker.py:728 execute_model
    → gpu_model_runner.py:3639 execute_model
      → qwen3_moe.py:783 forward
        → cutlass_moe.py:357 apply (run_cutlass_moe_fp8)
          → moe_permute_unpermute.py:90 moe_permute
            → CUDA illegal memory access

Root Cause

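The crash happens at total_num_scheduled_tokens=8192, which matches the configured compile range split point (compile_ranges_split_points: [8192]). It is not a CUDA graph capture size: the config lists cudagraph_capture_sizes: [1, 2, 4, 8, ..., 512] with max_cudagraph_capture_size: 512, so a batch of 8192 tokens is well above the largest captured graph.

The report's hypothesis, which has not been confirmed, is that permuted_idx in moe_permute contains out-of-bounds indices when the CUTLASS MoE backend processes FP8-quantized activations at this batch size, so that the scale lookup a1q_scale[permuted_idx.clamp(max=n_token * topk - 1) // topk] reads past a1q_scale's first dimension. After the clamp, the largest possible row index is n_token - 1, so the lookup can overrun only if a1q_scale has fewer than n_token rows or if permuted_idx holds negative values, which the upper clamp does not touch. Because CUDA reports errors asynchronously, the faulting kernel may also be one launched before line 90; running with CUDA_LAUNCH_BLOCKING=1 would localize it.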
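The index arithmetic in the failing line can be checked in isolation. The sketch below is a pure-Python model of `permuted_idx.clamp(max=n_token * topk - 1) // topk`, not vLLM code: `scale_row_indices` and `rows_out_of_range` are hypothetical helper names, and the sample indices are made up. It illustrates that the clamp caps the row index at `n_token - 1`, so an overrun would need a scale tensor with fewer than `n_token` rows.

```python
def scale_row_indices(permuted_idx, n_token, topk):
    """Model of `permuted_idx.clamp(max=n_token * topk - 1) // topk` on a list of ints."""
    upper = n_token * topk - 1
    return [min(i, upper) // topk for i in permuted_idx]


def rows_out_of_range(rows, num_scale_rows):
    """Return the row indices that fall outside [0, num_scale_rows)."""
    return [r for r in rows if not 0 <= r < num_scale_rows]


n_token, topk = 4, 2
# Made-up slot indices; 99 stands in for an index past the valid range.
permuted_idx = [0, 3, 7, 99]

rows = scale_row_indices(permuted_idx, n_token, topk)
print(rows)  # [0, 1, 3, 3]
print(rows_out_of_range(rows, num_scale_rows=n_token))      # []
print(rows_out_of_range(rows, num_scale_rows=n_token - 1))  # [3, 3]
```

If this model matches the real code path, comparing `a1q_scale.shape[0]` against `n_token` at the crashing batch size would be a reasonable first diagnostic.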

Fix Action

Workaround

Setting --max-num-batched-tokens to a value other than 8192 (e.g., 4096 or 16384) avoids the crash.
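As a minimal sketch of applying the workaround, the snippet below assembles a `vllm serve` command line with an explicit `--max-num-batched-tokens` value. `build_serve_command` and `CRASHING_BUDGET` are hypothetical names introduced for illustration, and the report's deployment runs inside the llm-d container, so how arguments actually reach vLLM there may differ.

```python
import shlex

CRASHING_BUDGET = 8192  # token budget at which the report observed the crash


def build_serve_command(model, max_num_batched_tokens):
    """Build a `vllm serve` command line with an explicit token budget."""
    if max_num_batched_tokens == CRASHING_BUDGET:
        raise ValueError("8192 is the budget reported to crash; pick another value")
    return [
        "vllm", "serve", model,
        "--max-num-batched-tokens", str(max_num_batched_tokens),
    ]


cmd = build_serve_command("RedHatAI/Qwen3-30B-A3B-FP8-dynamic", 4096)
print(shlex.join(cmd))
# vllm serve RedHatAI/Qwen3-30B-A3B-FP8-dynamic --max-num-batched-tokens 4096
```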

Bug Description

vLLM crashes with CUDA error: an illegal memory access was encountered during FP8 MoE inference when the scheduled batch hits exactly 8192 tokens. The crash occurs in the CUTLASS MoE permutation kernel.

Environment

vLLM version: 0.17.1 (via ghcr.io/llm-d/llm-d-cuda:v0.6.0)
GPU: NVIDIA H200 (140GB)
Model: RedHatAI/Qwen3-30B-A3B-FP8-dynamic (128 experts, 8 active per token, compressed-tensors FP8)
TP=1, single GPU
max_model_len=2100, enable_prefix_caching=True

Steps to Reproduce

1. Deploy Qwen3-30B-A3B-FP8-dynamic with vLLM v0.17.1.
2. Send ~100 concurrent requests with prompt_tokens=1000, max_tokens=1.
3. vLLM schedules 9 requests in one batch (8×1001 + 1×184 = 8192 total tokens).
4. The engine crashes on the first model execution step.

Scheduler State at Crash

From the dump:

total_num_scheduled_tokens = 8192
num_running_reqs = 17
num_waiting_reqs = 82
step_counter = 0  (first step after startup)
kv_cache_usage = 0.016

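Nine new requests were scheduled: 8 with 1001 tokens each and 1 with 184 tokens, which together fill the 8192-token budget exactly (8 × 1001 = 8008, plus 184). The last request was only partially scheduled and therefore received a partial block allocation: block_ids=([146, 147, 148]), i.e. 3 blocks versus 16 for each of the others.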
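The batch composition above follows from simple budget arithmetic. The sketch below is a simplified greedy model, not vLLM's scheduler: it assumes requests are admitted in order and that the last one is truncated to whatever budget remains, which is one reading of why the ninth request has 184 tokens. `fill_token_budget` is a hypothetical helper name.

```python
def fill_token_budget(prompt_lens, budget):
    """Greedily schedule prompts in order, truncating the last one to fit the budget."""
    scheduled = []
    remaining = budget
    for n in prompt_lens:
        if remaining == 0:
            break
        take = min(n, remaining)
        scheduled.append(take)
        remaining -= take
    return scheduled


# 100 waiting requests of 1001 tokens each (the per-request count in the report).
batch = fill_token_budget([1001] * 100, budget=8192)
print(batch)       # [1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 184]
print(sum(batch))  # 8192
```

Under this model, any workload of equal-length prompts that is large enough will fill the budget exactly, which is consistent with the crash appearing on the very first step.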

Expected Behavior

vLLM should handle FP8 MoE models at any batch size without CUDA memory access errors.
