vllm - [Bug]: vllm-openai:v0.19.1-cu130 / v0.16.0-cu130 docker images bundle pre-fix flashinfer (≤ 0.6.6), causing CUDA IMA on MoE FP8 decode [1 pull request]

StepCodex · 2026-05-18T00:17:24Z




Fix / Workaround

vllm/vllm-openai:v0.19.1-cu130 and vllm/vllm-openai:v0.16.0-cu130 both ship a flashinfer wheel that predates the bounds-check fix flashinfer-ai/flashinfer#2762 (merged 2026-04-24, first released in flashinfer 0.6.10, 2026-05-04).

After an in-container pip upgrade to flashinfer-python==0.6.11.post3 (latest stable), a grep for the restored bounds check in the bundled kernel source returns 1 and the crash disappears.


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

PyTorch version: 2.10.0+cu130
Is debug build: False
CUDA used to build PyTorch: 13.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04 (inside vllm/vllm-openai:v0.19.1-cu130)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: N/A
Libc version: glibc-2.35

Python version: 3.12 (64-bit)
Is CUDA available: True
CUDA runtime version: 13.0
CUDA_MODULE_LOADING set to: LAZY
GPU 0..7: NVIDIA H200 (8x)
Nvidia driver version: 590.48.01           # supports CUDA 13.1
cuDNN version: shipped in image
HIP runtime version: N/A
MIOpen runtime version: N/A

Relevant versions inside the v0.19.1-cu130 image:
  vllm                  0.19.1+cu130
  flashinfer-python     0.6.6          ← pre-fix
  flashinfer-cubin      0.6.6
  flashinfer-jit-cache  0.6.6+cu130
  torch                 2.10.0+cu130

Relevant versions inside the v0.16.0-cu130 image (same crash):
  vllm                  0.16.0+cu130
  flashinfer-python     0.6.3          ← pre-fix
  flashinfer-cubin      0.6.3
  flashinfer-jit-cache  0.6.3+cu130

</details>

🐛 Describe the bug

TL;DR

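`vllm/vllm-openai:v0.19.1-cu130` and `vllm/vllm-openai:v0.16.0-cu130` both ship a flashinfer wheel that predates the bounds-check fix [flashinfer-ai/flashinfer#2762](https://github.com/flashinfer-ai/flashinfer/pull/2762) (merged 2026-04-24, first released in flashinfer 0.6.10, 2026-05-04).

As a result, the FLASHINFER_CUTLASS FP8 MoE backend, which vLLM picks by default for FP8 MoE models, hits a CUDA illegal memory access mid-decode. This is the exact crash reported in [#35706](https://github.com/vllm-project/vllm/issues/35706) (closed). The code fix is in upstream flashinfer main, but the images still bundle the pre-fix wheel.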

Two requests:

1. Could the next image rebuild bump the bundled flashinfer-python / flashinfer-cubin / flashinfer-jit-cache to >= 0.6.10?
2. Until then, would it be possible to mention this in the image README or release notes (or have vLLM emit a startup warning if it detects bundled flashinfer < 0.6.10 + FP8 MoE backend)?

Repro

Affected models (anything routed to FLASHINFER_CUTLASS FP8 MoE): MiniMaxAI/MiniMax-M2.5, Qwen/Qwen3.5-coder-FP8 (see #35706 comments), etc.

docker run --runtime nvidia --gpus all --ipc=host \
  -v /path/to/hf-cache:/mnt/models/.cache/huggingface:ro \
  -e HF_HOME=/mnt/models/.cache/huggingface \
  -e LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/local/cuda/targets/x86_64-linux/lib \
  -p 18000:8000 \
  vllm/vllm-openai:v0.19.1-cu130 \
  MiniMaxAI/MiniMax-M2.5 \
  --tensor-parallel-size 1 --data-parallel-size 4 \
  --max-model-len 8192 --max-num-batched-tokens 8192 --max-num-seqs 256 \
  --block-size 256 --gpu-memory-utilization 0.8 \
  --enable-chunked-prefill --enable-expert-parallel --trust-remote-code \
  --served-model-name minimax-m2.5 --api-server-count 4

Then send sustained traffic with 1024/1024 random prompts at concurrency=8+ (e.g. vllm bench serve --random-input-len 1024 --random-output-len 1024 --max-concurrency 8 --num-prompts 80 ...). The engine reaches steady decode, then crashes after roughly 1500–1700 generated tokens.

The server log shows the backend pick that turns out to be the buggy one:

[fp8.py:338] Using FLASHINFER_CUTLASS Fp8 MoE backend out of potential backends:
  ['AITER', 'FLASHINFER_TRTLLM', 'FLASHINFER_CUTLASS', 'DEEPGEMM', 'BATCHED_DEEPGEMM',
   'TRITON', 'BATCHED_TRITON', 'MARLIN', 'XPU']

Crash:

(EngineCore_DP0) ERROR [core.py:1008] RuntimeError: CUDA driver error: an illegal memory access was encountered
[rank0]:[W CUDAGuardImpl.h:122] CUDA warning: an illegal memory access was encountered (function destroyEvent)
terminate called after throwing an instance of 'c10::AcceleratorError'
  what():  CUDA error: an illegal memory access was encountered

Traceback (synchronize point):
  File "vllm/v1/worker/gpu_model_runner.py", line 3485, in synchronize_input_prep
    self.prepare_inputs_event.synchronize()
torch.AcceleratorError: CUDA error: an illegal memory access was encountered

(The async-error nature means the synchronize line is the messenger; the actual fault is in a kernel that ran earlier in the same step.)

Root cause

Already diagnosed by @kjiang249 in #35706 and fixed upstream in flashinfer:

> The crash is caused by a bounds-check removal in flashinfer's bundled TRT-LLM finalizeMoeRoutingKernel, introduced in flashinfer v0.5.3 (commit 20435b40). When vLLM uses CUDAGraph with batch padding, the kernel accesses out-of-bounds memory for padding tokens, causing a Warp MMU Fault.

The fix is the 5-line bounds restore in flashinfer#2762:

// csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh, finalizeMoeRoutingKernel:
int64_t const expanded_rows = num_rows * experts_per_token;
if (expanded_permuted_row < 0 || expanded_permuted_row >= expanded_rows) {
  continue;
}
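To make the effect of that guard concrete, here is a toy Python model of the reduction. It is not flashinfer code: the function name `finalize_rows`, the flat `row_map` layout, and the sample values are all assumptions chosen for illustration. It only shows how a row index belonging to a padding slot falls outside `num_rows * experts_per_token` and why skipping it avoids the bad read.

```python
def finalize_rows(expert_out, row_map, num_rows, experts_per_token, guard=True):
    """Toy model: sum each token's expert outputs, looked up through row_map.

    row_map may cover more tokens than num_rows (a padded batch); the extra
    entries hold indices that do not point into expert_out.
    """
    expanded_rows = num_rows * experts_per_token
    result = []
    for token in range(len(row_map) // experts_per_token):
        acc = 0.0
        for k in range(experts_per_token):
            src = row_map[token * experts_per_token + k]
            if guard and (src < 0 or src >= expanded_rows):
                continue  # same condition as the restored check: skip invalid rows
            # Without the guard, an index >= len(expert_out) raises IndexError
            # here; the GPU analogue is an out-of-bounds memory access.
            acc += expert_out[src]
        result.append(acc)
    return result


if __name__ == "__main__":
    outputs = [1.0, 2.0, 3.0, 4.0]   # 2 real tokens x 2 experts
    padded_map = [0, 1, 2, 3, 7, 9]  # third token is padding (hypothetical values)
    print(finalize_rows(outputs, padded_map, num_rows=2, experts_per_token=2))
```

Calling the same function with `guard=False` on this input raises `IndexError`, which is the toy counterpart of the fault described above.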

Confirmation that the bundled wheels lack the fix:

$ docker run --rm --entrypoint grep vllm/vllm-openai:v0.19.1-cu130 \
    -c "expanded_permuted_row >= expanded_rows" \
    /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh
0     # no match → fix not present

$ docker run --rm --entrypoint grep vllm/vllm-openai:v0.16.0-cu130 \
    -c "expanded_permuted_row >= expanded_rows" \
    /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh
0     # no match

After in-container pip upgrade to flashinfer-python==0.6.11.post3 (latest stable), the same grep returns 1 — and the crash disappears.
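The same check can be scripted from Python when grep is not convenient, for example inside a health check. This is a minimal sketch: the helper name `has_bounds_fix` is hypothetical, and the marker string is simply the one used in the grep above.

```python
from pathlib import Path

# Marker string copied from the grep in this report.
FIX_MARKER = "expanded_permuted_row >= expanded_rows"


def has_bounds_fix(kernel_source: Path) -> bool:
    """Return True if the given kernel source file contains the bounds check.

    Raises FileNotFoundError if the file is missing, so a wrong path is not
    mistaken for "fix not present".
    """
    text = kernel_source.read_text(encoding="utf-8", errors="replace")
    return FIX_MARKER in text
```

Inside the container, the path to pass would be the `cutlass_fused_moe_kernels.cuh` file shown in the grep commands.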

Workarounds we verified

1. `VLLM_USE_FLASHINFER_MOE_FP8=0` (recommended in #35706)

Forces vLLM to pick a different MoE backend (DEEPGEMM / TRITON). No crash, but on MiniMax-M2.5 we measured a substantial perf hit at concurrency 1 with 1k input / 1k output tokens:

| Metric | FLASHINFER_CUTLASS | DEEPGEMM (env=0) |
|---|---|---|
| Output throughput (tok/s) | 89.98 | 55.65 (-38%) |
| Mean TPOT (ms) | 10.96 | 17.73 (+62%) |
| Mean ITL (ms) | 10.94 | 17.73 |

So the env workaround works but costs significant throughput on the very models people are using FP8 MoE on in the first place.
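For reference, the percentages quoted for this workaround follow directly from the measured values. The short calculation below uses only the numbers reported above; the helper name `pct_change` is just for illustration.

```python
def pct_change(baseline: float, value: float) -> float:
    """Relative change of value versus baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# Baseline is FLASHINFER_CUTLASS; the comparison is DEEPGEMM (env=0).
throughput_delta = pct_change(89.98, 55.65)  # output throughput, tok/s
tpot_delta = pct_change(10.96, 17.73)        # mean TPOT, ms

print(f"throughput: {throughput_delta:+.0f}%  TPOT: {tpot_delta:+.0f}%")
```

This reproduces the -38% throughput and +62% TPOT figures.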

2. In-container `pip install --upgrade` at entrypoint

This is what we ended up doing internally: wrapping `vllm serve` with a small entrypoint script that runs:

pip install --no-cache-dir --upgrade \
    'flashinfer-python>=0.6.10' 'flashinfer-cubin>=0.6.10'
pip install --no-cache-dir --upgrade \
    --index-url https://flashinfer.ai/whl/cu130 --extra-index-url https://pypi.org/simple \
    'flashinfer-jit-cache>=0.6.10'
exec vllm serve "$@"

This bumps all three flashinfer packages to 0.6.11.post3 (verified grep returns 1 = fix present), and the crash is gone with no measurable perf regression vs the pre-bug behavior. The pip warning about vllm pinning flashinfer-python==0.6.X exact-version is harmless (the ABI is compatible; verified by booting and serving on both v0.16.0-cu130 and v0.19.1-cu130).

Adds ~30–60s to container startup for the pip install + jit-cache wheel download.

Suggested fix

1. Bump the bundled flashinfer in the next image rebuild to a version >= 0.6.10 (ideally the same release on both vllm-openai:vX.Y.Z and vllm-openai:vX.Y.Z-cu130 lines).
2. Or, at minimum, emit a startup warning when vLLM detects bundled flashinfer-python < 0.6.10 AND FP8 MoE backend selected AND CUDAGraph enabled, so users hit a clear log line instead of an async IMA after 2 minutes of decode.

Happy to send a PR for either if it's helpful.
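The condition for the suggested startup warning could look roughly like the sketch below. It is not vLLM code: `parse_release` and `should_warn` are hypothetical names, the check is narrowed to the `FLASHINFER_CUTLASS` backend name seen in the log above (an assumption), and how vLLM would expose the selected backend and the CUDAGraph setting is left open.

```python
import re

# First flashinfer release that contains the fix, per this report.
FIXED_IN = (0, 6, 10)


def parse_release(version: str) -> tuple:
    """Extract (major, minor, patch) from strings like '0.6.11.post3' or '0.6.6+cu130'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def should_warn(flashinfer_version: str, moe_backend: str, cudagraph_enabled: bool) -> bool:
    """True when all three conditions named in the suggestion hold."""
    return (
        parse_release(flashinfer_version) < FIXED_IN
        and moe_backend == "FLASHINFER_CUTLASS"
        and cudagraph_enabled
    )
```

The installed version string could be obtained with `importlib.metadata.version("flashinfer-python")` from the standard library.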
