vllm - Fix CUDA illegal memory access in FP8 MoE moe_permute with cutlass backend at batch size 8192

Error Message

File "/opt/vllm-source/vllm/model_executor/layers/fused_moe/moe_permute_unpermute.py", line 90, in moe_permute
    a1q_scale = a1q_scale[permuted_idx.clamp(max=n_token * topk - 1) // topk]
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: an illegal memory access was encountered

Full call chain:

core.py:1102 → EngineCore fatal error
  → gpu_worker.py:728 execute_model
    → gpu_model_runner.py:3639 execute_model
      → qwen3_moe.py:783 forward
        → cutlass_moe.py:357 apply (run_cutlass_moe_fp8)
          → moe_permute_unpermute.py:90 moe_permute
            → CUDA illegal memory access

Root Cause

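The crash happens at total_num_scheduled_tokens=8192, which matches the compile_ranges_split_points: [8192] entry in the config. That value is well above the CUDA graph capture range (cudagraph_capture_sizes: [1, 2, 4, 8, ..., 512] with max_cudagraph_capture_size: 512), so the batch sits on a compile-range boundary rather than on a CUDA graph capture size.

The permuted_idx tensor in moe_permute appears to contain out-of-bounds indices when the CUTLASS MoE backend runs an FP8-quantized model at this specific batch size. The suspected fault is the FP8 scale lookup: permuted_idx.clamp(max=n_token * topk - 1) // topk appears to produce row indices outside a1q_scale's first dimension. The clamp only bounds the upper end, capping the result at n_token - 1, so an out-of-range row would require a1q_scale to have fewer than n_token rows or permuted_idx to contain large negative values.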
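
To make the index arithmetic concrete, here is a minimal pure-Python model of the expression from the traceback. It is only a sketch, not vLLM code: the helper names scale_row_indices and out_of_bounds, the toy sizes, and the sample permuted_idx values are all made up for illustration.

```python
def scale_row_indices(permuted_idx, n_token, topk):
    """Pure-Python model of: permuted_idx.clamp(max=n_token * topk - 1) // topk"""
    upper = n_token * topk - 1
    # clamp(max=...) only limits the upper end; negative values pass through.
    return [min(i, upper) // topk for i in permuted_idx]


def out_of_bounds(rows, num_scale_rows):
    """Return the row indices a tensor with num_scale_rows rows cannot serve."""
    return [r for r in rows if r >= num_scale_rows or r < -num_scale_rows]


# Toy sizes (hypothetical), chosen so the arithmetic is easy to follow.
n_token, topk = 4, 2
# Hypothetical input: one oversized entry (99) and one large negative entry (-20).
permuted_idx = [0, 3, 7, 99, -20]

rows = scale_row_indices(permuted_idx, n_token, topk)
print("row indices:", rows)
print("bad rows if a1q_scale has n_token rows:", out_of_bounds(rows, n_token))
print("bad rows if a1q_scale has n_token - 1 rows:", out_of_bounds(rows, n_token - 1))
```

In this model the oversized entry is made safe by the clamp, while the large negative entry and the undersized scale tensor both yield invalid rows. Note also that CUDA reports kernel faults asynchronously, so the Python line in the traceback is not necessarily the operation that faulted.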

Fix Action

Workaround

Setting --max-num-batched-tokens to a value other than 8192 (e.g., 4096 or 16384) avoids the crash.
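
As a sketch of how the workaround might be applied, the snippet below assembles a vllm serve command line with a token budget other than 8192. The original deployment used the llm-d container image, so the exact launch command is an assumption; build_serve_command and CRASHING_BUDGET are hypothetical names, and the other flags simply mirror the settings listed in the report.

```python
import shlex

MODEL = "RedHatAI/Qwen3-30B-A3B-FP8-dynamic"
CRASHING_BUDGET = 8192  # token budget reported to trigger the crash


def build_serve_command(max_num_batched_tokens):
    """Build a `vllm serve` argument list that avoids the crashing budget."""
    if max_num_batched_tokens == CRASHING_BUDGET:
        raise ValueError("8192 is the token budget reported to trigger the crash")
    return [
        "vllm", "serve", MODEL,
        "--max-model-len", "2100",          # from the reported environment
        "--enable-prefix-caching",          # from the reported environment
        "--max-num-batched-tokens", str(max_num_batched_tokens),
    ]


command = build_serve_command(4096)
# Print the command instead of running it, so it can be reviewed first.
print(shlex.join(command))
```

The guard is only a reminder while the underlying bug is open; it should be dropped once a fixed vLLM release is in use.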

Bug Description

vLLM crashes with CUDA error: an illegal memory access was encountered during FP8 MoE inference when the scheduled batch hits exactly 8192 tokens. The crash occurs in the CUTLASS MoE permutation kernel.

Environment

  • vLLM version: 0.17.1 (via ghcr.io/llm-d/llm-d-cuda:v0.6.0)
  • GPU: NVIDIA H200 (140GB)
  • Model: RedHatAI/Qwen3-30B-A3B-FP8-dynamic (128 experts, top-8 routing, compressed-tensors FP8)
  • TP=1, single GPU
  • max_model_len=2100, enable_prefix_caching=True

Steps to Reproduce

  1. Deploy Qwen3-30B-A3B-FP8-dynamic with vLLM v0.17.1
  2. Send ~100 concurrent requests with prompt_tokens=1000, max_tokens=1
  3. vLLM schedules 9 requests in one batch (8×1001 + 1×184 = 8192 total tokens)
  4. Engine crashes on the first model execution step
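
The batch composition in step 3 follows from filling a fixed token budget with 1001-token prompts. The sketch below is a deliberately simplified model of that arithmetic, not vLLM's scheduler; fill_token_budget is a hypothetical helper, and it assumes requests are admitted in order with the last one truncated to whatever budget remains.

```python
def fill_token_budget(prompt_len, num_waiting, budget):
    """Greedily admit prompts until the token budget is exhausted."""
    scheduled = []
    remaining = budget
    for _ in range(num_waiting):
        if remaining == 0:
            break
        # The last admitted request only gets what is left of the budget.
        take = min(prompt_len, remaining)
        scheduled.append(take)
        remaining -= take
    return scheduled


batch = fill_token_budget(prompt_len=1001, num_waiting=100, budget=8192)
print(len(batch), "requests:", batch)
print("total tokens:", sum(batch))
```

Under these assumptions the budget is consumed exactly, which is why the first execution step lands on a batch of precisely 8192 tokens.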

Scheduler State at Crash

From the dump:

total_num_scheduled_tokens = 8192
num_running_reqs = 17
num_waiting_reqs = 82
step_counter = 0  (first step after startup)
kv_cache_usage = 0.016

Nine new requests were scheduled: eight with 1001 tokens each and one with 184 tokens. The last request received only 3 blocks (block_ids=([146, 147, 148])) versus 16 for the others, which is consistent with only part of its prompt being scheduled in this step.
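
The block counts can be sanity-checked with a short calculation. The report does not state the KV cache block size; 64 tokens per block is an assumption, inferred here only because it reproduces both observed counts. blocks_needed is a hypothetical helper.

```python
import math

# Assumption: not stated in the report, inferred from the observed block counts.
ASSUMED_BLOCK_SIZE = 64


def blocks_needed(num_tokens, block_size=ASSUMED_BLOCK_SIZE):
    """Number of KV cache blocks required to hold num_tokens tokens."""
    return math.ceil(num_tokens / block_size)


full_request_blocks = blocks_needed(1001)
partial_request_blocks = blocks_needed(184)
print("blocks for a 1001-token request:", full_request_blocks)
print("blocks for the 184-token request:", partial_request_blocks)
```

If that assumption holds, the 3-block allocation is simply what a 184-token chunk requires, so it may not be an anomaly in its own right.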

Expected Behavior

vLLM should handle FP8 MoE models at any batch size without CUDA memory access errors.
