vllm - ✅(Solved) Fix Mamba-2 Triton kernels crash with illegal instruction on SM121 (DGX Spark) without CUDA_LAUNCH_BLOCKING=1 [1 pull requests, 2 comments, 3 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#37431Fetched 2026-04-08 00:58:43
View on GitHub
Comments
2
Participants
3
Timeline
6
Reactions
1
Author
Timeline (top)
subscribed ×3commented ×2cross-referenced ×1

NemotronH models that use Mamba-2 layers crash with CUDA error: an illegal instruction was encountered during inference on SM121 GPUs. The crash originates in vLLM's Triton-based Mamba ops (not the causal-conv1d / mamba-ssm packages, which are not used by vLLM).

Setting CUDA_LAUNCH_BLOCKING=1 makes the model fully stable but degrades throughput from ~14 tok/s (expected) to ~8.8 tok/s (~37% penalty) and limits GPU utilization to ~60%.

Error Message

NemotronH models that use Mamba-2 layers crash with CUDA error: an illegal instruction was encountered during inference on SM121 GPUs. The crash originates in vLLM's Triton-based Mamba ops (not the causal-conv1d / mamba-ssm packages, which are not used by vLLM). The error surfaces at mamba_mixer2.py:127 in forward_native: But with CUDA_LAUNCH_BLOCKING=1 we confirmed this is an async error from a preceding Triton kernel — the actual crash is in one of: → self.norm(ssm_output, gate) ← async error surfaces here

Root Cause

  1. Individual Triton SSM kernels work in isolation_chunk_cumsum_fwd passes with correct tensor shapes
  2. Basic Triton JIT kernels work on SM121 — simple @triton.jit pow/mean kernels run fine
  3. The crash only happens in async modeCUDA_LAUNCH_BLOCKING=1 makes everything stable
  4. Targeted torch.cuda.synchronize() after conv_ssm_forward does NOT fix it — the issue is at the CUDA driver dispatch level, not Python-level synchronization
  5. CUDA_DEVICE_MAX_CONNECTIONS=1 causes deadlock at engine startup (profiling phase)
  6. CUDA graphs are incompatible--enforce-eager is required alongside CUDA_LAUNCH_BLOCKING=1 because the sync patch inside conv_ssm_forward breaks graph capture
  7. The native causal-conv1d 1.6.1 package compiles and works for SM121 — but vLLM doesn't use it (uses its own Triton reimplementation instead)
  8. llama.cpp / Ollama achieve ~14 tok/s on the same hardware with their own CUDA kernels — the issue is specific to vLLM's Triton Mamba ops

Fix Action

Workaround

env:
  CUDA_LAUNCH_BLOCKING: "1"
command: |
  vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
    --enforce-eager \
    --mamba-ssm-cache-dtype float32 \
    --attention-backend TRITON_ATTN \
    --kv-cache-dtype fp8 \
    ...

Result: stable at 8.8 tok/s (vs ~14 tok/s expected).

PR fix notes

PR #98: SM121/GB10: Build arch split, prefix caching, Qwen3.5 NVFP4 support

Description (problem / solution / changelog)

Summary

  • Fix SM121 build arch split for NVFP4 on DGX Spark
  • Enable prefix caching for Nemotron with Mamba align mode
  • Add Qwen3.5-122B NVFP4 recipe (FlashInfer CUTLASS MoE, MTP)
  • Add Mamba SSM SM121 recognition fix

Changes

Build system (build-and-copy.sh, Dockerfile)

  • GPU_ARCH_LIST=12.0a (was 12.1a) — vLLM's cmake gates FP4 kernels on cuda_archs_loose_intersection("12.0a", ...). Without 12.0, ENABLE_NVFP4_SM120 is never set → "No compiled nvfp4 quantization kernel"
  • Hardcode FLASHINFER_CUDA_ARCH_LIST=12.1a — FlashInfer JIT needs SM121 for E2M1 software fallback (via fix-e2m1-sm121 mod). These MUST differ: vLLM needs 12.0a, FlashInfer needs 12.1a
  • Dockerfile defaults updated to 12.0a;12.1a for TORCH_CUDA_ARCH_LIST, SM121 added to CUDA_SUPPORTED_ARCHS

Nemotron recipe

  • Enable prefix caching with --mamba-cache-mode align (default all mode crashes on SM121 in selective_state_update Triton kernel)
  • Add --max-num-batched-tokens 16384 (Mamba block size 8400 must be <= max_num_batched_tokens in align mode)

Qwen3.5 recipe (new)

  • Model: scottgl/Qwen3.5-122B-A10B-MTP-NVFP4
  • Same FlashInfer CUTLASS MoE path as Nemotron (VLLM_NVFP4_GEMM_BACKEND=cutlass, VLLM_USE_FLASHINFER_MOE_FP4=1)
  • MTP speculative decoding (n=2)
  • Tested: 25-40 tok/s on DGX Spark

SM121 mod (mods/fix-e2m1-sm121)

  • Add Mamba SSM fix: recognize SM121 (capability family 120) as Blackwell-class in mamba_mixer2.py for correct selective_state_update block sizes

Performance

ModelThroughputMTPPrefix Cache
Nemotron-3-Super-120B NVFP415-20 tok/sn=2Yes (align)
Qwen3.5-122B-A10B NVFP425-40 tok/sn=2Yes (align)

Upstream PR status

  • FlashInfer #2786 (K=64 tiles) — open, mergeable
  • FlashInfer #2798 (CUTLASS 4.4.2) — open, mergeable
  • vLLM #34822 (is_blackwell_class) — open, has conflicts
  • vLLM #35947 (E2M1 software) — open, low activity

🤖 Generated with Claude Code

Changed files

  • Dockerfile (modified, +57/-5)
  • bakeoff.sh (added, +46/-0)
  • bench_mtp.py (added, +111/-0)
  • build-and-copy.sh (modified, +49/-5)
  • e2m1_nvfp4_sm121.patch (added, +111/-0)
  • eval-quality.py (added, +258/-0)
  • fix_quantization_utils_sm121.py (added, +139/-0)
  • flashinfer_e2m1_sm121.patch (added, +22/-0)
  • flashinfer_k64_sm120.patch (added, +227/-0)
  • flashinfer_k64_sm120_v442.patch (added, +118/-0)
  • launch-cluster.sh (modified, +1/-1)
  • mods/fix-e2m1-sm121/patch_fp4_common.py (added, +78/-0)
  • mods/fix-e2m1-sm121/run.sh (added, +132/-0)
  • mods/fix-fla-sm121/patch_fla_sm121.py (added, +93/-0)
  • mods/fix-fla-sm121/run.sh (added, +24/-0)
  • mods/fix-mistral-guidance/pr37081-src-only.patch (added, +1261/-0)
  • mods/fix-mistral-guidance/run.sh (added, +62/-0)
  • mods/fix-mistral-reasoning/run.sh (added, +109/-0)
  • mods/fix-mistral-tool-role/run.sh (added, +54/-0)
  • mods/fix-qwen3.5-autoround/run.sh (modified, +8/-1)
  • mods/gpu-mem-util-gb/gpu_mem.patch (removed, +0/-255)
  • mods/gpu-mem-util-gb/run.sh (removed, +0/-6)
  • nightly-update.sh (added, +101/-0)
  • recipes/minimax-m2.5-reap-139b-nvfp4.yaml (added, +33/-0)
  • recipes/mistral-small-4-119b-nvfp4.yaml (added, +65/-0)
  • recipes/nemotron-3-super-nvfp4.yaml (modified, +19/-10)
  • recipes/qwen3.5-122b-a10b-int4-autoround.yaml (added, +46/-0)
  • recipes/qwen3.5-122b-a10b-nvfp4.yaml (added, +64/-0)
  • recipes/qwen3.5-27b-nvfp4.yaml (added, +56/-0)
  • recipes/qwen3.5-35b-a3b-nvfp4-baseline.yaml (added, +47/-0)
  • recipes/qwen3.5-35b-a3b-nvfp4-test.yaml (added, +55/-0)
  • recipes/qwen3.5-397b-a17b-nvfp4.yaml (added, +64/-0)
  • recipes/qwen3.5-397b-int4-autoround.yaml (modified, +0/-1)
  • run-bakeoff-overnight.sh (added, +53/-0)
  • test_e2m1_all_flags.sh (added, +43/-0)
  • test_e2m1_flags.cu (added, +72/-0)
  • vllm_cmake_arch_suffix.patch (added, +42/-0)
  • vllm_mistral_lark_grammar.patch (added, +1938/-0)
  • vllm_pr_37081.diff (added, +1938/-0)

Code Example

x_grouped = x_grouped * torch.rsqrt(variance + self.variance_epsilon)

---

NemotronHForCausalLM.forward
NemotronHDecoderLayer.forward (Mamba layer)
MambaMixer2.forward
      → torch.ops.vllm.mamba_mixer2
conv_ssm_forward  (Triton kernels: causal_conv1d_fn, mamba_chunk_scan_combined_varlen, selective_state_update)
      → self.norm(ssm_output, gate)async error surfaces here

---

env:
  CUDA_LAUNCH_BLOCKING: "1"
command: |
  vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
    --enforce-eager \
    --mamba-ssm-cache-dtype float32 \
    --attention-backend TRITON_ATTN \
    --kv-cache-dtype fp8 \
    ...
RAW_BUFFERClick to expand / collapse

Bug: Mamba-2 Triton ops produce cudaErrorIllegalInstruction in async mode on SM121

Environment

  • GPU: NVIDIA GB10 (SM121) — DGX Spark
  • Driver: 580.126.09
  • CUDA: 13.0 (forward compat 13.1)
  • vLLM: 0.17.2rc1.dev7+g9c7cab5eb (eugr/spark-vllm-docker prebuilt wheels, March 17 2026)
  • PyTorch: 2.10.0a0+a36e1d3 (NGC 26.01)
  • Triton: 3.6.0
  • Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 (NemotronHForCausalLM, hybrid Mamba-2 + Transformer + MoE)
  • Platform: aarch64 (Grace CPU)

Description

NemotronH models that use Mamba-2 layers crash with CUDA error: an illegal instruction was encountered during inference on SM121 GPUs. The crash originates in vLLM's Triton-based Mamba ops (not the causal-conv1d / mamba-ssm packages, which are not used by vLLM).

Setting CUDA_LAUNCH_BLOCKING=1 makes the model fully stable but degrades throughput from ~14 tok/s (expected) to ~8.8 tok/s (~37% penalty) and limits GPU utilization to ~60%.

Crash location

The error surfaces at mamba_mixer2.py:127 in forward_native:

x_grouped = x_grouped * torch.rsqrt(variance + self.variance_epsilon)

But with CUDA_LAUNCH_BLOCKING=1 we confirmed this is an async error from a preceding Triton kernel — the actual crash is in one of:

  • vllm/model_executor/layers/mamba/ops/causal_conv1d.py (_causal_conv1d_fwd_kernel)
  • vllm/model_executor/layers/mamba/ops/ssd_chunk_scan.py (_chunk_scan_fwd)
  • vllm/model_executor/layers/mamba/ops/ssd_chunk_state.py (_chunk_cumsum_fwd_kernel / _chunk_state_fwd_kernel)
  • vllm/model_executor/layers/mamba/ops/ssd_state_passing.py

Call chain

NemotronHForCausalLM.forward
  → NemotronHDecoderLayer.forward (Mamba layer)
    → MambaMixer2.forward
      → torch.ops.vllm.mamba_mixer2
        → conv_ssm_forward  (Triton kernels: causal_conv1d_fn, mamba_chunk_scan_combined_varlen, selective_state_update)
      → self.norm(ssm_output, gate)  ← async error surfaces here

Key findings

  1. Individual Triton SSM kernels work in isolation_chunk_cumsum_fwd passes with correct tensor shapes
  2. Basic Triton JIT kernels work on SM121 — simple @triton.jit pow/mean kernels run fine
  3. The crash only happens in async modeCUDA_LAUNCH_BLOCKING=1 makes everything stable
  4. Targeted torch.cuda.synchronize() after conv_ssm_forward does NOT fix it — the issue is at the CUDA driver dispatch level, not Python-level synchronization
  5. CUDA_DEVICE_MAX_CONNECTIONS=1 causes deadlock at engine startup (profiling phase)
  6. CUDA graphs are incompatible--enforce-eager is required alongside CUDA_LAUNCH_BLOCKING=1 because the sync patch inside conv_ssm_forward breaks graph capture
  7. The native causal-conv1d 1.6.1 package compiles and works for SM121 — but vLLM doesn't use it (uses its own Triton reimplementation instead)
  8. llama.cpp / Ollama achieve ~14 tok/s on the same hardware with their own CUDA kernels — the issue is specific to vLLM's Triton Mamba ops

Workaround

env:
  CUDA_LAUNCH_BLOCKING: "1"
command: |
  vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
    --enforce-eager \
    --mamba-ssm-cache-dtype float32 \
    --attention-backend TRITON_ATTN \
    --kv-cache-dtype fp8 \
    ...

Result: stable at 8.8 tok/s (vs ~14 tok/s expected).

Suggested fix directions

  1. Preferred: Investigate the Triton code generation for SM121 async execution — there may be a missing memory barrier or wrong PTX target in the JIT output for the Mamba kernels
  2. Alternative: Allow vLLM to optionally use the native causal-conv1d / mamba-ssm CUDA kernels (which work on SM121) instead of the Triton fallback, similar to how flash-attn can be swapped for different backends
  3. Minimal: Add per-layer torch.cuda.synchronize() only around Mamba ops when running on SM12x, instead of requiring global CUDA_LAUNCH_BLOCKING=1

Related issues

  • #36821 — No sm_121 support on aarch64
  • #35519 — NVFP4 illegal instruction on SM121 (fixed by PR #35947 for E2M1, but Mamba ops remain broken)
  • #31128 — Add support of Blackwell SM121
  • #37030 — Marlin kernel wrong token on SM121

Performance impact

ConfigDecode tok/sGPU util
CUDA_LAUNCH_BLOCKING=1 + --enforce-eager8.8~60%
Ollama (same model, same hardware)~14~90%+
Expected (no workaround)~14-17~90%+

extent analysis

Fix Plan

To address the cudaErrorIllegalInstruction issue in async mode on SM121, we will investigate the Triton code generation for SM121 async execution. The goal is to identify and fix any missing memory barriers or incorrect PTX targets in the JIT output for the Mamba kernels.

Step-by-Step Solution

  1. Update Triton kernels: Modify the Triton kernels (_causal_conv1d_fwd_kernel, _chunk_scan_fwd, _chunk_cumsum_fwd_kernel, _chunk_state_fwd_kernel) to include memory barriers for SM121 async execution.
  2. Verify PTX target: Ensure the PTX target is correctly set for SM121 in the Triton JIT output.
  3. Test with async mode: Test the updated kernels in async mode to verify the fix.

Example Code Changes

# In vllm/model_executor/layers/mamba/ops/causal_conv1d.py
@triton.jit
def _causal_conv1d_fwd_kernel(
    # ... existing code ...
    ):
    # Add memory barrier for SM121 async execution
    if triton.runtime.get_sm_version() >= 121:
        triton.memory_barrier()
    # ... existing code ...

Verification

To verify the fix, run the model in async mode on SM121 and check for the absence of cudaErrorIllegalInstruction errors. Additionally, monitor the performance impact of the fix by comparing the decode tokens per second and GPU utilization with the expected values.

Extra Tips

  • When updating the Triton kernels, ensure that the changes are compatible with other GPU architectures to avoid introducing new issues.
  • Consider adding logging or debugging statements to help identify any future issues related to async execution on SM121.
  • If the fix introduces any performance regressions, consider exploring alternative solutions, such as using native causal-conv1d / mamba-ssm CUDA kernels or adding per-layer torch.cuda.synchronize() calls.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING