vllm - [Bug]: vllm-openai:v0.19.1-cu130 / v0.16.0-cu130 docker images bundle pre-fix flashinfer (≤ 0.6.6), causing CUDA IMA on MoE FP8 decode



Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
PyTorch version: 2.10.0+cu130
Is debug build: False
CUDA used to build PyTorch: 13.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04 (inside vllm/vllm-openai:v0.19.1-cu130)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: N/A
Libc version: glibc-2.35

Python version: 3.12 (64-bit)
Is CUDA available: True
CUDA runtime version: 13.0
CUDA_MODULE_LOADING set to: LAZY
GPU 0..7: NVIDIA H200 (8x)
Nvidia driver version: 590.48.01           # supports CUDA 13.1
cuDNN version: shipped in image
HIP runtime version: N/A
MIOpen runtime version: N/A

Relevant versions inside the v0.19.1-cu130 image:
  vllm                  0.19.1+cu130
  flashinfer-python     0.6.6          ← pre-fix
  flashinfer-cubin      0.6.6
  flashinfer-jit-cache  0.6.6+cu130
  torch                 2.10.0+cu130

Relevant versions inside the v0.16.0-cu130 image (same crash):
  vllm                  0.16.0+cu130
  flashinfer-python     0.6.3          ← pre-fix
  flashinfer-cubin      0.6.3
  flashinfer-jit-cache  0.6.3+cu130
</details>

🐛 Describe the bug

TL;DR

vllm/vllm-openai:v0.19.1-cu130 and vllm/vllm-openai:v0.16.0-cu130 both ship a flashinfer wheel that predates the bounds-check fix flashinfer-ai/flashinfer#2762 (merged 2026-04-24, first released in flashinfer 0.6.10, 2026-05-04).

As a result, the FLASHINFER_CUTLASS FP8 MoE backend, which vLLM picks by default for FP8 MoE models, hits a CUDA illegal memory access mid-decode. This is the exact crash reported in #35706 (closed). The code fix has landed upstream in flashinfer main, but the images still bundle the pre-fix wheel.

Two requests:

  1. Could the next image rebuild bump the bundled flashinfer-python / flashinfer-cubin / flashinfer-jit-cache to >= 0.6.10?
  2. Until then, would it be possible to mention this in the image README or release notes (or have vLLM emit a startup warning if it detects bundled flashinfer < 0.6.10 + FP8 MoE backend)?

Repro

Affected models (anything routed to FLASHINFER_CUTLASS FP8 MoE): MiniMaxAI/MiniMax-M2.5, Qwen/Qwen3.5-coder-FP8 (see #35706 comments), etc.

docker run --runtime nvidia --gpus all --ipc=host \
  -v /path/to/hf-cache:/mnt/models/.cache/huggingface:ro \
  -e HF_HOME=/mnt/models/.cache/huggingface \
  -e LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/local/cuda/targets/x86_64-linux/lib \
  -p 18000:8000 \
  vllm/vllm-openai:v0.19.1-cu130 \
  MiniMaxAI/MiniMax-M2.5 \
  --tensor-parallel-size 1 --data-parallel-size 4 \
  --max-model-len 8192 --max-num-batched-tokens 8192 --max-num-seqs 256 \
  --block-size 256 --gpu-memory-utilization 0.8 \
  --enable-chunked-prefill --enable-expert-parallel --trust-remote-code \
  --served-model-name minimax-m2.5 --api-server-count 4

Then send sustained traffic of random prompts with 1024 input tokens and 1024 output tokens at a concurrency of 8 or more (e.g. vllm bench serve --random-input-len 1024 --random-output-len 1024 --max-concurrency 8 --num-prompts 80 ...). The engine reaches steady decode, then crashes after roughly 1500–1700 generated tokens.

The server log shows the backend pick that turns out to be the buggy one:

[fp8.py:338] Using FLASHINFER_CUTLASS Fp8 MoE backend out of potential backends:
  ['AITER', 'FLASHINFER_TRTLLM', 'FLASHINFER_CUTLASS', 'DEEPGEMM', 'BATCHED_DEEPGEMM',
   'TRITON', 'BATCHED_TRITON', 'MARLIN', 'XPU']

Crash:

(EngineCore_DP0) ERROR [core.py:1008] RuntimeError: CUDA driver error: an illegal memory access was encountered
[rank0]:[W CUDAGuardImpl.h:122] CUDA warning: an illegal memory access was encountered (function destroyEvent)
terminate called after throwing an instance of 'c10::AcceleratorError'
  what():  CUDA error: an illegal memory access was encountered

Traceback (synchronize point):
  File "vllm/v1/worker/gpu_model_runner.py", line 3485, in synchronize_input_prep
    self.prepare_inputs_event.synchronize()
torch.AcceleratorError: CUDA error: an illegal memory access was encountered

(CUDA errors are reported asynchronously, so the synchronize call is only where the error surfaces; the actual fault is in a kernel that ran earlier in the same step.)

Root cause

Already diagnosed by @kjiang249 in #35706 and fixed upstream in flashinfer:

The crash is caused by a bounds-check removal in flashinfer's bundled TRT-LLM finalizeMoeRoutingKernel, introduced in flashinfer v0.5.3 (commit 20435b40). When vLLM uses CUDAGraph with batch padding, the kernel accesses out-of-bounds memory for padding tokens, causing a Warp MMU Fault.

The fix is the 5-line bounds restore in flashinfer#2762:

// csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh, finalizeMoeRoutingKernel:
int64_t const expanded_rows = num_rows * experts_per_token;
if (expanded_permuted_row < 0 || expanded_permuted_row >= expanded_rows) {
  continue;
}
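
To make the failure mode concrete, here is a toy CPU model of the guard in Python. It is not the CUDA kernel: the function name `finalize_rows`, the flat-list layout, and the idea that padding slots carry out-of-range indices are illustrative assumptions. Only the guard condition mirrors the snippet above.

```python
def finalize_rows(expanded_output, permuted_index, num_rows, experts_per_token):
    """Toy model: sum each row's expert outputs, skipping out-of-range indices.

    expanded_output: one value per (row, expert) slot.
    permuted_index:  for each slot, the index into expanded_output to read.
    """
    expanded_rows = num_rows * experts_per_token
    result = []
    for row in range(num_rows):
        acc = 0.0
        for k in range(experts_per_token):
            expanded_permuted_row = permuted_index[row * experts_per_token + k]
            # Same condition as the restored guard: skip anything outside
            # [0, expanded_rows) instead of reading it.
            if expanded_permuted_row < 0 or expanded_permuted_row >= expanded_rows:
                continue
            acc += expanded_output[expanded_permuted_row]
        result.append(acc)
    return result


# Two real rows plus one padding row whose indices are garbage (assumed).
outputs = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0]
indices = [3, 0, 1, 2, 10**6, -1]
print(finalize_rows(outputs, indices, num_rows=3, experts_per_token=2))
```

Without the `continue`, the index `10**6` would raise an `IndexError` in this Python model. On the GPU there is no such check, so the equivalent read lands in unmapped memory and shows up as the illegal memory access.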

Confirmation that the bundled wheels lack the fix:

$ docker run --rm --entrypoint grep vllm/vllm-openai:v0.19.1-cu130 \
    -c "expanded_permuted_row >= expanded_rows" \
    /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh
0     # no match → fix not present

$ docker run --rm --entrypoint grep vllm/vllm-openai:v0.16.0-cu130 \
    -c "expanded_permuted_row >= expanded_rows" \
    /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh
0     # no match

After an in-container pip upgrade to flashinfer-python==0.6.11.post3 (latest stable), the same grep returns 1 and the crash disappears.
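
The same check can be scripted from inside the container. The sketch below is a rough Python equivalent of `grep -c` for the guard string; the helper name `count_guard_lines` is hypothetical, and the path you pass in is whatever your image uses (the one in the grep commands above is from the two images tested here).

```python
from pathlib import Path

# Substring searched for by the grep commands in this issue.
GUARD_SNIPPET = "expanded_permuted_row >= expanded_rows"


def count_guard_lines(header_path):
    """Return the number of lines containing the guard, like `grep -c`."""
    text = Path(header_path).read_text(encoding="utf-8", errors="replace")
    return sum(1 for line in text.splitlines() if GUARD_SNIPPET in line)
```

A result of 0 means the bundled source predates the fix; 1 or more means the guard is present. Note that this only inspects the shipped `.cuh` source, as the grep does, not any precompiled binaries.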

Workarounds we verified

1. VLLM_USE_FLASHINFER_MOE_FP8=0 (recommended in #35706)

Forces vLLM to pick a different MoE backend (DEEPGEMM / TRITON). No crash, but on MiniMax-M2.5 we measured a substantial perf hit at concurrency 1 with 1k input / 1k output tokens:

| Metric | FLASHINFER_CUTLASS | DEEPGEMM (env=0) |
|---|---|---|
| output throughput (tok/s) | 89.98 | 55.65 (–38%) |
| mean TPOT (ms) | 10.96 | 17.73 (+62%) |
| mean ITL (ms) | 10.94 | 17.73 |
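
For reference, the percentages follow directly from the raw numbers. The short Python sketch below recomputes them from the measurements above; the ITL delta is not stated in the original figures and is derived here the same way.

```python
def percent_change(baseline, measured):
    """Relative change of `measured` versus `baseline`, in percent."""
    return (measured - baseline) / baseline * 100.0


# Baseline = FLASHINFER_CUTLASS, measured = DEEPGEMM (env=0).
throughput_delta = percent_change(89.98, 55.65)  # output tok/s
tpot_delta = percent_change(10.96, 17.73)        # mean TPOT, ms
itl_delta = percent_change(10.94, 17.73)         # mean ITL, ms

for name, delta in [("throughput", throughput_delta),
                    ("TPOT", tpot_delta),
                    ("ITL", itl_delta)]:
    print(f"{name}: {delta:+.1f}%")
```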

So the env workaround works but costs significant throughput on the very models people are using FP8 MoE on in the first place.

2. In-container pip install --upgrade at entrypoint

This is what we ended up doing internally: wrapping vllm serve with a small entrypoint script that runs:

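#!/usr/bin/env bash
# Abort if an upgrade fails instead of serving with the pre-fix wheels.
set -euo pipefail
# Upgrade the Python package and the cubin package from the default index.
pip install --no-cache-dir --upgrade \
    'flashinfer-python>=0.6.10' 'flashinfer-cubin>=0.6.10'
# The jit-cache wheel is CUDA-specific, so it comes from the cu130 index.
pip install --no-cache-dir --upgrade \
    --index-url https://flashinfer.ai/whl/cu130 --extra-index-url https://pypi.org/simple \
    'flashinfer-jit-cache>=0.6.10'
# Hand off to the server, forwarding all container arguments.
exec vllm serve "$@"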

This bumps all three flashinfer packages to 0.6.11.post3 (the grep check then returns 1, so the fix is present), and the crash is gone with no measurable perf regression compared with the pre-bug behavior. pip warns that vllm pins an exact flashinfer-python==0.6.X version; in our testing this warning was harmless (the ABI appears compatible: both v0.16.0-cu130 and v0.19.1-cu130 booted and served normally after the upgrade).

The upgrade adds roughly 30–60 s to container startup for the pip install and the jit-cache wheel download.

Suggested fix

  • Bump the bundled flashinfer in the next image rebuild to a version >= 0.6.10 (ideally the same release on both vllm-openai:vX.Y.Z and vllm-openai:vX.Y.Z-cu130 lines).
  • Or, at minimum, emit a startup warning when vLLM detects bundled flashinfer-python < 0.6.10 AND FP8 MoE backend selected AND CUDAGraph enabled so users hit a clear log line instead of an async IMA after 2 minutes of decode.
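
A minimal sketch of the warning condition from the second bullet, in Python. This is not vLLM code: the function names, the way the backend name and CUDAGraph flag are passed in, and the simplified version parsing are all assumptions made for illustration. Only the 0.6.10 threshold and the FLASHINFER_CUTLASS backend name come from this issue.

```python
import re

# First flashinfer release that contains the bounds-check fix (per this issue).
MIN_FIXED = (0, 6, 10)


def release_tuple(version):
    """Parse the leading numeric release, e.g. '0.6.6+cu130' -> (0, 6, 6)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def should_warn(flashinfer_version, moe_backend, cudagraph_enabled):
    """True when all three conditions from the suggested warning hold."""
    return (
        release_tuple(flashinfer_version) < MIN_FIXED
        and moe_backend == "FLASHINFER_CUTLASS"
        and cudagraph_enabled
    )


print(should_warn("0.6.6", "FLASHINFER_CUTLASS", True))
print(should_warn("0.6.11.post3", "FLASHINFER_CUTLASS", True))
```

The parser deliberately ignores suffixes such as `.post3` and `+cu130`; a real implementation would more likely rely on a proper version-parsing library than on this regex.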

Happy to send a PR for either if it's helpful.

