vllm - 💡(How to fix) Fix [Bug] FlashInfer + MTP speculative decoding crashes on SM121 (DGX Spark) with GQA=16 model [2 comments, 3 participants]

vllm-project/vllm#37754 (fetched 2026-04-08 01:13:03)

FlashInfer attention backend + MTP speculative decoding (num_speculative_tokens=2) crashes with "illegal memory access" on NVIDIA GB10 (SM121 / DGX Spark) when serving Nemotron-3-Super-120B-A12B-NVFP4 (GQA ratio = 16). Triton attention backend works correctly.

Error Message

RuntimeError: Check failed: (status == cudaSuccess) is false:
BatchPrefillWithPagedKVCache failed with error an illegal memory access was encountered

File: batch_prefill.cu, line 330
Kernel: dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_head_dim_qk_128
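
The kernel name in the error encodes the data types involved, which helps confirm that the crash is on the FP8 KV-cache path. A minimal sketch for spotting this error in a server log and splitting the kernel name into fields follows. The field layout is inferred from the single kernel name quoted above, not from a FlashInfer naming specification, and both function names are hypothetical helpers.

```python
import re

# Field layout inferred from the kernel name quoted in the error report;
# this is an assumption, not an official FlashInfer naming scheme.
KERNEL_RE = re.compile(
    r"dtype_q_(?P<q>[a-z0-9]+)_dtype_kv_(?P<kv>[a-z0-9]+)"
    r"_dtype_o_(?P<o>[a-z0-9]+)_head_dim_qk_(?P<head_dim_qk>\d+)"
)


def parse_kernel_name(name):
    """Split a kernel name into its dtype and head-dim fields, or return None."""
    match = KERNEL_RE.search(name)
    if match is None:
        return None
    fields = match.groupdict()
    fields["head_dim_qk"] = int(fields["head_dim_qk"])
    return fields


def is_flashinfer_prefill_crash(log_text):
    """Return True if the log contains the error signature from this report."""
    return (
        "BatchPrefillWithPagedKVCache failed" in log_text
        and "illegal memory access" in log_text
    )
```

For the kernel name above, the parsed fields are bf16 for the query and output, e4m3 for the KV cache (consistent with --kv-cache-dtype fp8), and a QK head dimension of 128.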

Root Cause

The root cause is not confirmed. It appears to be a FlashInfer kernel incompatibility with SM121 at high GQA ratios (tracked upstream as flashinfer-ai/flashinfer#2849), with the vLLM MTP serving path acting as the trigger: MTP=1 works with FlashInfer, MTP=2+ crashes, and the Triton attention backend is unaffected.

Fix Action

Workaround

Use --attention-backend triton_attn instead of flashinfer. Achieves 22.4 tok/s with MTP=2 on Nemotron-3-Super-120B-A12B-NVFP4 on DGX Spark.

Code Example

# Works (Triton attention):
vllm serve /models/nemotron-3-super \
  --attention-backend triton_attn \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
  --kv-cache-dtype fp8 \
  --no-enable-chunked-prefill

# Crashes (FlashInfer attention):
vllm serve /models/nemotron-3-super \
  --attention-backend flashinfer \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
  --kv-cache-dtype fp8 \
  --no-enable-chunked-prefill
# → First request returns 500, BatchPrefillWithPagedKVCache illegal memory access


Key observations

  • LLM().generate() with FlashInfer + MTP=2 works (single synchronous request)
  • vllm serve with FlashInfer + MTP=2 crashes on the first request
  • Triton attention backend works perfectly at 22.4 tok/s with MTP=2
  • MTP=1 with FlashInfer works; MTP=2+ crashes
  • The crash is in FlashInfer's BatchPrefillWithPagedKVCacheRun kernel

Environment

  • GPU: NVIDIA GB10 (compute capability 12.1 / SM121, DGX Spark)
  • vLLM: 0.17.1 (built from source)
  • FlashInfer: 0.6.6 (built from source, FLASHINFER_CUDA_ARCH_LIST=12.1a)
  • CUDA: 13.1.1
  • PyTorch: 2.11.0a0 (NGC 26.02)
  • Model: NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 (32 QO heads, 2 KV heads, GQA=16)
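
The GQA ratio quoted for the model is the number of query/output heads divided by the number of KV heads (32 / 2 = 16). The sketch below computes it and checks whether a configuration matches the conditions reported in this issue. The thresholds (GQA >= 16, two or more speculative tokens, compute capability 12.1) are taken from the report, and the function names are hypothetical. It is a rough heuristic for the vllm serve path, not a statement of which configurations FlashInfer supports.

```python
def gqa_ratio(num_qo_heads, num_kv_heads):
    """Number of query/output heads that share each KV head."""
    if num_kv_heads <= 0 or num_qo_heads % num_kv_heads != 0:
        raise ValueError("num_qo_heads must be a positive multiple of num_kv_heads")
    return num_qo_heads // num_kv_heads


def matches_reported_crash(capability, num_qo_heads, num_kv_heads,
                           backend, num_speculative_tokens):
    """True if the configuration matches the crash conditions in this report.

    The conditions mirror the issue text only: SM121, GQA >= 16,
    FlashInfer backend, and two or more MTP speculative tokens.
    """
    return (
        tuple(capability) == (12, 1)
        and gqa_ratio(num_qo_heads, num_kv_heads) >= 16
        and backend == "flashinfer"
        and num_speculative_tokens >= 2
    )
```

On a machine with PyTorch and a CUDA device, the capability tuple can be read with torch.cuda.get_device_capability(), which returns (12, 1) on the GB10 described above.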

Related

  • Filed upstream FlashInfer bug: flashinfer-ai/flashinfer#2849 (standalone BatchDecode GQA>=16 crash on SM121 with 10-line reproducer)
  • SM121 is the DGX Spark variant of Blackwell — uses mma.sync not tcgen05, 99KB smem vs 228KB on SM100
  • The underlying issue appears to be FlashInfer kernel incompatibility with SM121 at high GQA ratios, but the vLLM MTP serving path is the trigger

Extended analysis

Fix Plan

The issue does not identify a confirmed fix. flashinfer-ai/flashinfer#2849 is the upstream bug report, not a patch, so the steps below apply only once FlashInfer ships a change that addresses the crash on NVIDIA GB10 (SM121 / DGX Spark) with Nemotron-3-Super-120B-A12B-NVFP4 and num_speculative_tokens=2. Until then, use the Triton workaround.

  • Update FlashInfer to a release that fixes the SM121 kernel incompatibility at high GQA ratios, if one is available.
  • Otherwise, build FlashInfer from a source ref that contains the fix, if one is linked from flashinfer-ai/flashinfer#2849.
  • Rebuild FlashInfer with FLASHINFER_CUDA_ARCH_LIST=12.1a.
  • Re-run the vllm serve command with --attention-backend flashinfer.

Example code snippet to update FlashInfer:

# Clone the FlashInfer repository (with submodules) and enter it
git clone --recursive https://github.com/flashinfer-ai/flashinfer.git
cd flashinfer

# Check out a ref that contains the fix, once one exists.
# <ref-with-fix> is a placeholder: the issue does not name a fix branch or PR.
git checkout <ref-with-fix>

# Build and install from source for SM121.
# The arch list matches the reporter's environment; exact build steps
# may differ between FlashInfer versions.
export FLASHINFER_CUDA_ARCH_LIST=12.1a
python -m pip install -v .

# Retest with the FlashInfer attention backend
vllm serve /models/nemotron-3-super \
  --attention-backend flashinfer \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
  --kv-cache-dtype fp8 \
  --no-enable-chunked-prefill

Verification

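To verify the fix, start vllm serve with the FlashInfer attention backend and send a request. In the report the crash surfaced on the first request as an HTTP 500, so a successful first response with no "illegal memory access" error in the server log indicates that the fix worked.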
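
A first-request check can be scripted with the Python standard library. The sketch below assumes the server listens on vLLM's default port 8000, exposes the OpenAI-compatible /v1/completions endpoint, and serves the model under the path passed to vllm serve. Adjust these if your deployment differs. The function names are hypothetical.

```python
import json
import urllib.error
import urllib.request


def build_completion_request(base_url, model, prompt="Hello", max_tokens=16):
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_request_status(base_url, model, timeout=120):
    """Send one request and return the HTTP status code.

    An HTTP error status (such as the 500 described in the report) is
    returned as a number rather than raised. Connection failures still raise.
    """
    request = build_completion_request(base_url, model)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
```

Calling first_request_status("http://localhost:8000", "/models/nemotron-3-super") against a running server should return 200 once the crash is resolved. A 500 on this first call matches the failure described in the report.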

Extra Tips

  • Check that the installed CUDA version is supported by the FlashInfer version you build. The reporter used CUDA 13.1.1.
  • The GQA ratio is fixed by the model architecture (32 QO heads, 2 KV heads) and cannot be lowered at serve time. If the issue persists, use the Triton attention backend, or reduce num_speculative_tokens to 1, which the report says works with FlashInfer.
  • Watch the upstream FlashInfer bug report (flashinfer-ai/flashinfer#2849) for a fix or further updates.
