vllm - 💡(How to fix) Fix [Bug] FlashInfer + MTP speculative decoding crashes on SM121 (DGX Spark) with GQA=16 model [2 comments, 3 participants]

vllm 2026-03-21 17:44:13

Source: vllm-project/vllm#37754 (fetched 2026-04-08 01:13:03)

FlashInfer attention backend + MTP speculative decoding (num_speculative_tokens=2) crashes with "illegal memory access" on NVIDIA GB10 (SM121 / DGX Spark) when serving Nemotron-3-Super-120B-A12B-NVFP4 (GQA ratio = 16). Triton attention backend works correctly.

Error Message

RuntimeError: Check failed: (status == cudaSuccess) is false:
BatchPrefillWithPagedKVCache failed with error an illegal memory access was encountered

File: batch_prefill.cu, line 330
Kernel: dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_head_dim_qk_128

Workaround

Use --attention-backend triton_attn instead of flashinfer. The issue author reports 22.4 tok/s with MTP=2 (num_speculative_tokens=2) on Nemotron-3-Super-120B-A12B-NVFP4 on DGX Spark with this backend.
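
The workaround can be wired into a launch script so that the backend is chosen from the GPU's compute capability. The following is a minimal sketch, not part of the original issue: pick_backend is a hypothetical helper, and the rule "12.1 means Triton, anything else means FlashInfer" is an assumption that covers only the configuration reported here. The only backend values used are the two that appear in the issue.

```shell
# Hypothetical helper: map a compute capability string to an attention backend.
# Assumption: only SM121 (reported as "12.1") needs the Triton workaround.
pick_backend() {
  # $1: compute capability string such as "12.1"
  case "$1" in
    12.1) echo triton_attn ;;
    *)    echo flashinfer ;;
  esac
}

# Example wiring (commented out because it needs a GPU and a model):
# cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1)
# vllm serve /models/nemotron-3-super \
#   --attention-backend "$(pick_backend "$cap")" \
#   --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
#   --kv-cache-dtype fp8 \
#   --no-enable-chunked-prefill
```

The compute_cap query field requires a reasonably recent NVIDIA driver; on older drivers the capability would have to be supplied by hand.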

Code Example

# Works (Triton attention):
vllm serve /models/nemotron-3-super \
  --attention-backend triton_attn \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
  --kv-cache-dtype fp8 \
  --no-enable-chunked-prefill

# Crashes (FlashInfer attention):
vllm serve /models/nemotron-3-super \
  --attention-backend flashinfer \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
  --kv-cache-dtype fp8 \
  --no-enable-chunked-prefill
# → First request returns 500, BatchPrefillWithPagedKVCache illegal memory access

Key observations

LLM().generate() with FlashInfer + MTP=2 works (single synchronous request)
vllm serve with FlashInfer + MTP=2 crashes on the first request
Triton attention backend works perfectly at 22.4 tok/s with MTP=2
MTP=1 with FlashInfer works; MTP=2+ crashes
The crash is in FlashInfer's BatchPrefillWithPagedKVCacheRun kernel
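
Because the crash appears only once a request reaches the server, a reproduction needs at least one HTTP request after startup. The sketch below sends a single completion request to vLLM's OpenAI-compatible endpoint and prints the HTTP status code. The helper names, the prompt, the token count, and the assumption that the server listens on the default port 8000 are illustrative and not taken from the issue.

```shell
# Build a minimal completion request body.
# Assumption: arguments are simple strings that need no JSON escaping.
build_payload() {
  # $1: model name as passed to vllm serve, $2: prompt, $3: max tokens
  printf '{"model":"%s","prompt":"%s","max_tokens":%d}' "$1" "$2" "$3"
}

# POST one request to the completions endpoint and print only the status code.
probe_server() {
  # $1: base URL, e.g. http://localhost:8000
  curl -s -o /dev/null -w '%{http_code}' \
    -H 'Content-Type: application/json' \
    -d "$(build_payload /models/nemotron-3-super 'Hello' 16)" \
    "$1/v1/completions"
}
```

Running probe_server http://localhost:8000 against the FlashInfer configuration should surface the 500 response described in the issue, while the Triton configuration is expected to return 200.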

Environment

GPU: NVIDIA GB10 (compute capability 12.1 / SM121, DGX Spark)
vLLM: 0.17.1 (built from source)
FlashInfer: 0.6.6 (built from source, FLASHINFER_CUDA_ARCH_LIST=12.1a)
CUDA: 13.1.1
PyTorch: 2.11.0a0 (NGC 26.02)
Model: NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 (32 QO heads, 2 KV heads, GQA=16)
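
The GQA figure follows directly from the head counts listed for the model: it is the number of query (QO) heads that share each KV head. A small sketch of that arithmetic, with gqa_ratio as a hypothetical helper name:

```shell
# GQA ratio = number of QO heads divided by number of KV heads.
gqa_ratio() {
  # $1: number of QO heads, $2: number of KV heads
  if [ "$2" -le 0 ] || [ $(( $1 % $2 )) -ne 0 ]; then
    echo "QO heads must be a positive multiple of KV heads" >&2
    return 1
  fi
  echo $(( $1 / $2 ))
}

gqa_ratio 32 2   # head counts from the issue; prints 16
```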

Related

Upstream FlashInfer bug report: flashinfer-ai/flashinfer#2849 (standalone BatchDecode crash with GQA>=16 on SM121, with a 10-line reproducer).
SM121 is the DGX Spark variant of Blackwell: it uses mma.sync rather than tcgen05 and has 99 KB of shared memory versus 228 KB on SM100.
The underlying issue appears to be a FlashInfer kernel incompatibility with SM121 at high GQA ratios; the vLLM MTP serving path is what triggers it.

Extended analysis

Fix Plan

The issue does not report a confirmed fix. The crash is tracked upstream as flashinfer-ai/flashinfer#2849, which is a bug report rather than a patch, so the Triton workaround remains the only configuration reported to work with num_speculative_tokens=2 on NVIDIA GB10 (SM121 / DGX Spark). Once a fix is available upstream, the expected steps are:

Update FlashInfer to a release or commit that resolves flashinfer-ai/flashinfer#2849.
Rebuild FlashInfer from source with FLASHINFER_CUDA_ARCH_LIST=12.1a, as in the original environment.
Restart vllm serve with --attention-backend flashinfer and the same speculative config.

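Example commands to rebuild FlashInfer from source. The ref name is a placeholder because the issue does not name a branch or commit that contains a fix, and the install step assumes FlashInfer's standard pip-based source install:

# Clone the FlashInfer repository together with its submodules
git clone --recursive https://github.com/flashinfer-ai/flashinfer.git
cd flashinfer

# Check out the commit or tag that contains the fix
# (<fix-ref> is a placeholder; substitute the real ref once one exists)
git checkout <fix-ref>
git submodule update --init --recursive

# Build and install for SM121, using the arch list from the original environment
export FLASHINFER_CUDA_ARCH_LIST=12.1a
python -m pip install -v .

# Re-run the configuration that previously crashed
vllm serve /models/nemotron-3-super \
  --attention-backend flashinfer \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
  --kv-cache-dtype fp8 \
  --no-enable-chunked-prefill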

Verification

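To verify a fix, start vllm serve with the FlashInfer attention backend and send at least one completion request, since in the original report the failure appeared on the first request (HTTP 500). The fix holds if the request succeeds and the server log contains no "illegal memory access" error.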
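
The log check can be scripted. The following sketch assumes the server output has been captured to a file (for example by redirecting stdout and stderr); has_flashinfer_crash is a hypothetical helper that looks for the two error strings quoted in the issue.

```shell
# Return success (exit 0) if the log contains the crash signature from the issue.
has_flashinfer_crash() {
  # $1: path to a captured vllm serve log
  grep -q -e 'BatchPrefillWithPagedKVCache failed' \
          -e 'illegal memory access' "$1"
}

# Example use after sending a test request:
# if has_flashinfer_crash vllm.log; then echo "still crashing"; else echo "clean"; fi
```

A clean log after a single request is only weak evidence, so repeating the check after several requests is advisable.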

Extra Tips

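Make sure the CUDA toolkit version is one that the updated FlashInfer build supports; the original environment used CUDA 13.1.1.
If the issue persists, fall back to the Triton attention backend, or reduce num_speculative_tokens to 1, which the issue reports as working with FlashInfer. The GQA ratio is fixed by the model architecture (32 QO heads, 2 KV heads) and cannot be changed at serve time.
Watch the upstream bug report (flashinfer-ai/flashinfer#2849) for a fix or further diagnosis.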
