vllm - ✅(Solved) Fix Mamba-2 Triton kernels crash with illegal instruction on SM121 (DGX Spark) without CUDA_LAUNCH_BLOCKING=1 [1 pull requests, 2 comments, 3 participants]

vllm2026-03-18 13:09:38

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

vllm-project/vllm#37431•Fetched 2026-04-08 00:58:43

View on GitHub

Comments

Participants

Timeline

Reactions

Author

Participants

Timeline (top)

subscribed ×3commented ×2cross-referenced ×1

Setting CUDA_LAUNCH_BLOCKING=1 makes the model fully stable but degrades throughput from ~14 tok/s (expected) to ~8.8 tok/s (~37% penalty) and limits GPU utilization to ~60%.

Error Message

NemotronH models that use Mamba-2 layers crash with CUDA error: an illegal instruction was encountered during inference on SM121 GPUs. The crash originates in vLLM's Triton-based Mamba ops (not the causal-conv1d / mamba-ssm packages, which are not used by vLLM). The error surfaces at mamba_mixer2.py:127 in forward_native: But with CUDA_LAUNCH_BLOCKING=1 we confirmed this is an async error from a preceding Triton kernel — the actual crash is in one of: → self.norm(ssm_output, gate) ← async error surfaces here

Root Cause

Individual Triton SSM kernels work in isolation — _chunk_cumsum_fwd passes with correct tensor shapes
Basic Triton JIT kernels work on SM121 — simple @triton.jit pow/mean kernels run fine
The crash only happens in async mode — CUDA_LAUNCH_BLOCKING=1 makes everything stable
Targeted torch.cuda.synchronize() after conv_ssm_forward does NOT fix it — the issue is at the CUDA driver dispatch level, not Python-level synchronization
CUDA_DEVICE_MAX_CONNECTIONS=1 causes deadlock at engine startup (profiling phase)
CUDA graphs are incompatible — --enforce-eager is required alongside CUDA_LAUNCH_BLOCKING=1 because the sync patch inside conv_ssm_forward breaks graph capture
The native causal-conv1d 1.6.1 package compiles and works for SM121 — but vLLM doesn't use it (uses its own Triton reimplementation instead)
llama.cpp / Ollama achieve ~14 tok/s on the same hardware with their own CUDA kernels — the issue is specific to vLLM's Triton Mamba ops

Fix Action

Workaround

env:
  CUDA_LAUNCH_BLOCKING: "1"
command: |
  vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
    --enforce-eager \
    --mamba-ssm-cache-dtype float32 \
    --attention-backend TRITON_ATTN \
    --kv-cache-dtype fp8 \
    ...

Result: stable at 8.8 tok/s (vs ~14 tok/s expected).

PR fix notes

PR #98: SM121/GB10: Build arch split, prefix caching, Qwen3.5 NVFP4 support

Repository: eugr/spark-vllm-docker
Author: RobTand
State: open | merged: False
Link: https://github.com/eugr/spark-vllm-docker/pull/98

Description (problem / solution / changelog)

Summary

Fix SM121 build arch split for NVFP4 on DGX Spark
Enable prefix caching for Nemotron with Mamba align mode
Add Qwen3.5-122B NVFP4 recipe (FlashInfer CUTLASS MoE, MTP)
Add Mamba SSM SM121 recognition fix

Changes

Build system (`build-and-copy.sh`, `Dockerfile`)

GPU_ARCH_LIST=12.0a (was 12.1a) — vLLM's cmake gates FP4 kernels on cuda_archs_loose_intersection("12.0a", ...). Without 12.0, ENABLE_NVFP4_SM120 is never set → "No compiled nvfp4 quantization kernel"
Hardcode FLASHINFER_CUDA_ARCH_LIST=12.1a — FlashInfer JIT needs SM121 for E2M1 software fallback (via fix-e2m1-sm121 mod). These MUST differ: vLLM needs 12.0a, FlashInfer needs 12.1a
Dockerfile defaults updated to 12.0a;12.1a for TORCH_CUDA_ARCH_LIST, SM121 added to CUDA_SUPPORTED_ARCHS

Nemotron recipe

Enable prefix caching with --mamba-cache-mode align (default all mode crashes on SM121 in selective_state_update Triton kernel)
Add --max-num-batched-tokens 16384 (Mamba block size 8400 must be <= max_num_batched_tokens in align mode)

Qwen3.5 recipe (new)

Model: scottgl/Qwen3.5-122B-A10B-MTP-NVFP4
Same FlashInfer CUTLASS MoE path as Nemotron (VLLM_NVFP4_GEMM_BACKEND=cutlass, VLLM_USE_FLASHINFER_MOE_FP4=1)
MTP speculative decoding (n=2)
Tested: 25-40 tok/s on DGX Spark

SM121 mod (`mods/fix-e2m1-sm121`)

Add Mamba SSM fix: recognize SM121 (capability family 120) as Blackwell-class in mamba_mixer2.py for correct selective_state_update block sizes

Performance

Model	Throughput	MTP	Prefix Cache
Nemotron-3-Super-120B NVFP4	15-20 tok/s	n=2	Yes (align)
Qwen3.5-122B-A10B NVFP4	25-40 tok/s	n=2	Yes (align)

Upstream PR status

FlashInfer #2786 (K=64 tiles) — open, mergeable
FlashInfer #2798 (CUTLASS 4.4.2) — open, mergeable
vLLM #34822 (is_blackwell_class) — open, has conflicts
vLLM #35947 (E2M1 software) — open, low activity

🤖 Generated with Claude Code

Changed files

Dockerfile (modified, +57/-5)
bakeoff.sh (added, +46/-0)
bench_mtp.py (added, +111/-0)
build-and-copy.sh (modified, +49/-5)
e2m1_nvfp4_sm121.patch (added, +111/-0)
eval-quality.py (added, +258/-0)
fix_quantization_utils_sm121.py (added, +139/-0)
flashinfer_e2m1_sm121.patch (added, +22/-0)
flashinfer_k64_sm120.patch (added, +227/-0)
flashinfer_k64_sm120_v442.patch (added, +118/-0)
launch-cluster.sh (modified, +1/-1)
mods/fix-e2m1-sm121/patch_fp4_common.py (added, +78/-0)
mods/fix-e2m1-sm121/run.sh (added, +132/-0)
mods/fix-fla-sm121/patch_fla_sm121.py (added, +93/-0)
mods/fix-fla-sm121/run.sh (added, +24/-0)
mods/fix-mistral-guidance/pr37081-src-only.patch (added, +1261/-0)
mods/fix-mistral-guidance/run.sh (added, +62/-0)
mods/fix-mistral-reasoning/run.sh (added, +109/-0)
mods/fix-mistral-tool-role/run.sh (added, +54/-0)
mods/fix-qwen3.5-autoround/run.sh (modified, +8/-1)
mods/gpu-mem-util-gb/gpu_mem.patch (removed, +0/-255)
mods/gpu-mem-util-gb/run.sh (removed, +0/-6)
nightly-update.sh (added, +101/-0)
recipes/minimax-m2.5-reap-139b-nvfp4.yaml (added, +33/-0)
recipes/mistral-small-4-119b-nvfp4.yaml (added, +65/-0)
recipes/nemotron-3-super-nvfp4.yaml (modified, +19/-10)
recipes/qwen3.5-122b-a10b-int4-autoround.yaml (added, +46/-0)
recipes/qwen3.5-122b-a10b-nvfp4.yaml (added, +64/-0)
recipes/qwen3.5-27b-nvfp4.yaml (added, +56/-0)
recipes/qwen3.5-35b-a3b-nvfp4-baseline.yaml (added, +47/-0)
recipes/qwen3.5-35b-a3b-nvfp4-test.yaml (added, +55/-0)
recipes/qwen3.5-397b-a17b-nvfp4.yaml (added, +64/-0)
recipes/qwen3.5-397b-int4-autoround.yaml (modified, +0/-1)
run-bakeoff-overnight.sh (added, +53/-0)
test_e2m1_all_flags.sh (added, +43/-0)
test_e2m1_flags.cu (added, +72/-0)
vllm_cmake_arch_suffix.patch (added, +42/-0)
vllm_mistral_lark_grammar.patch (added, +1938/-0)
vllm_pr_37081.diff (added, +1938/-0)

Code Example

x_grouped = x_grouped * torch.rsqrt(variance + self.variance_epsilon)

---

NemotronHForCausalLM.forward
  → NemotronHDecoderLayer.forward (Mamba layer)
    → MambaMixer2.forward
      → torch.ops.vllm.mamba_mixer2
        → conv_ssm_forward  (Triton kernels: causal_conv1d_fn, mamba_chunk_scan_combined_varlen, selective_state_update)
      → self.norm(ssm_output, gate)  ← async error surfaces here

---

env:
  CUDA_LAUNCH_BLOCKING: "1"
command: |
  vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
    --enforce-eager \
    --mamba-ssm-cache-dtype float32 \
    --attention-backend TRITON_ATTN \
    --kv-cache-dtype fp8 \
    ...

RAW_BUFFERClick to expand / collapse

Bug: Mamba-2 Triton ops produce `cudaErrorIllegalInstruction` in async mode on SM121

Environment

GPU: NVIDIA GB10 (SM121) — DGX Spark
Driver: 580.126.09
CUDA: 13.0 (forward compat 13.1)
vLLM: 0.17.2rc1.dev7+g9c7cab5eb (eugr/spark-vllm-docker prebuilt wheels, March 17 2026)
PyTorch: 2.10.0a0+a36e1d3 (NGC 26.01)
Triton: 3.6.0
Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 (NemotronHForCausalLM, hybrid Mamba-2 + Transformer + MoE)
Platform: aarch64 (Grace CPU)

Description

Setting CUDA_LAUNCH_BLOCKING=1 makes the model fully stable but degrades throughput from ~14 tok/s (expected) to ~8.8 tok/s (~37% penalty) and limits GPU utilization to ~60%.

Crash location

The error surfaces at mamba_mixer2.py:127 in forward_native:

x_grouped = x_grouped * torch.rsqrt(variance + self.variance_epsilon)

But with CUDA_LAUNCH_BLOCKING=1 we confirmed this is an async error from a preceding Triton kernel — the actual crash is in one of:

vllm/model_executor/layers/mamba/ops/causal_conv1d.py (_causal_conv1d_fwd_kernel)
vllm/model_executor/layers/mamba/ops/ssd_chunk_scan.py (_chunk_scan_fwd)
vllm/model_executor/layers/mamba/ops/ssd_chunk_state.py (_chunk_cumsum_fwd_kernel / _chunk_state_fwd_kernel)
vllm/model_executor/layers/mamba/ops/ssd_state_passing.py

Call chain

NemotronHForCausalLM.forward
  → NemotronHDecoderLayer.forward (Mamba layer)
    → MambaMixer2.forward
      → torch.ops.vllm.mamba_mixer2
        → conv_ssm_forward  (Triton kernels: causal_conv1d_fn, mamba_chunk_scan_combined_varlen, selective_state_update)
      → self.norm(ssm_output, gate)  ← async error surfaces here

Key findings

Individual Triton SSM kernels work in isolation — _chunk_cumsum_fwd passes with correct tensor shapes
Basic Triton JIT kernels work on SM121 — simple @triton.jit pow/mean kernels run fine
The crash only happens in async mode — CUDA_LAUNCH_BLOCKING=1 makes everything stable
Targeted torch.cuda.synchronize() after conv_ssm_forward does NOT fix it — the issue is at the CUDA driver dispatch level, not Python-level synchronization
CUDA_DEVICE_MAX_CONNECTIONS=1 causes deadlock at engine startup (profiling phase)
CUDA graphs are incompatible — --enforce-eager is required alongside CUDA_LAUNCH_BLOCKING=1 because the sync patch inside conv_ssm_forward breaks graph capture
The native causal-conv1d 1.6.1 package compiles and works for SM121 — but vLLM doesn't use it (uses its own Triton reimplementation instead)
llama.cpp / Ollama achieve ~14 tok/s on the same hardware with their own CUDA kernels — the issue is specific to vLLM's Triton Mamba ops

Workaround

env:
  CUDA_LAUNCH_BLOCKING: "1"
command: |
  vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
    --enforce-eager \
    --mamba-ssm-cache-dtype float32 \
    --attention-backend TRITON_ATTN \
    --kv-cache-dtype fp8 \
    ...

Result: stable at 8.8 tok/s (vs ~14 tok/s expected).

Suggested fix directions

Preferred: Investigate the Triton code generation for SM121 async execution — there may be a missing memory barrier or wrong PTX target in the JIT output for the Mamba kernels
Alternative: Allow vLLM to optionally use the native causal-conv1d / mamba-ssm CUDA kernels (which work on SM121) instead of the Triton fallback, similar to how flash-attn can be swapped for different backends
Minimal: Add per-layer torch.cuda.synchronize() only around Mamba ops when running on SM12x, instead of requiring global CUDA_LAUNCH_BLOCKING=1

Related issues

#36821 — No sm_121 support on aarch64
#35519 — NVFP4 illegal instruction on SM121 (fixed by PR #35947 for E2M1, but Mamba ops remain broken)
#31128 — Add support of Blackwell SM121
#37030 — Marlin kernel wrong token on SM121

Performance impact

Config	Decode tok/s	GPU util
`CUDA_LAUNCH_BLOCKING=1` + `--enforce-eager`	8.8	~60%
Ollama (same model, same hardware)	~14	~90%+
Expected (no workaround)	~14-17	~90%+

extent analysis

Fix Plan

To address the cudaErrorIllegalInstruction issue in async mode on SM121, we will investigate the Triton code generation for SM121 async execution. The goal is to identify and fix any missing memory barriers or incorrect PTX targets in the JIT output for the Mamba kernels.

Step-by-Step Solution

Update Triton kernels: Modify the Triton kernels (_causal_conv1d_fwd_kernel, _chunk_scan_fwd, _chunk_cumsum_fwd_kernel, _chunk_state_fwd_kernel) to include memory barriers for SM121 async execution.
Verify PTX target: Ensure the PTX target is correctly set for SM121 in the Triton JIT output.
Test with async mode: Test the updated kernels in async mode to verify the fix.

Example Code Changes

# In vllm/model_executor/layers/mamba/ops/causal_conv1d.py
@triton.jit
def _causal_conv1d_fwd_kernel(
    # ... existing code ...
    ):
    # Add memory barrier for SM121 async execution
    if triton.runtime.get_sm_version() >= 121:
        triton.memory_barrier()
    # ... existing code ...

Verification

To verify the fix, run the model in async mode on SM121 and check for the absence of cudaErrorIllegalInstruction errors. Additionally, monitor the performance impact of the fix by comparing the decode tokens per second and GPU utilization with the expected values.

Extra Tips

When updating the Triton kernels, ensure that the changes are compatible with other GPU architectures to avoid introducing new issues.
Consider adding logging or debugging statements to help identify any future issues related to async execution on SM121.
If the fix introduces any performance regressions, consider exploring alternative solutions, such as using native causal-conv1d / mamba-ssm CUDA kernels or adding per-layer torch.cuda.synchronize() calls.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #ssr #installation #tensor shape #autograd error #cache error #pipeline error #runtime error #dependency conflict

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

vllm - ✅(Solved) Fix Mamba-2 Triton kernels crash with illegal instruction on SM121 (DGX Spark) without CUDA_LAUNCH_BLOCKING=1 [1 pull requests, 2 comments, 3 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Workaround

PR fix notes

PR #98: SM121/GB10: Build arch split, prefix caching, Qwen3.5 NVFP4 support

Description (problem / solution / changelog)

Summary

Changes

Build system (build-and-copy.sh, Dockerfile)

Nemotron recipe

Qwen3.5 recipe (new)

SM121 mod (mods/fix-e2m1-sm121)

Performance

Upstream PR status

Changed files

Code Example

Bug: Mamba-2 Triton ops produce cudaErrorIllegalInstruction in async mode on SM121

Environment

Description

Crash location

Call chain

Key findings

Workaround

Suggested fix directions

Related issues

Performance impact

extent analysis

Fix Plan

Step-by-Step Solution

Example Code Changes

Verification

Extra Tips

Still need to ship something?

RELATED_DISCOVERY

TRENDING

Build system (`build-and-copy.sh`, `Dockerfile`)

SM121 mod (`mods/fix-e2m1-sm121`)

Bug: Mamba-2 Triton ops produce `cudaErrorIllegalInstruction` in async mode on SM121