vllm - 💡(How to fix) Fix [Bug]: Triton MXFP4 MoE kernel uses .tile::scatter4 PTX (Hopper/SM10 only) — fails on SM 12.1 (GB10/DGX Spark); Marlin fallback hits #37030 [1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#41477Fetched 2026-05-02 05:27:55
View on GitHub
Comments
0
Participants
1
Timeline
0
Reactions
0

Error Message

triton.runtime.errors.PTXASError: Internal Triton PTX codegen error ptxas line 3727; error : Feature '.tile::scatter4' not supported on .target 'sm_121a' ... (16 more identical errors at different line numbers) ... ptxas fatal : Ptx assembly aborted due to errors

Fix Action

Fix / Workaround

backendoutcome
marlin (default)Runs but emits broken first Harmony token → content: null, reasoning: null (#37030)
triton (after patching capability gate)ptxas error: Feature '.tile::scatter4' not supported on .target 'sm_121a'
flashinfer_cutlassQuant scheme mismatch (u8 / GroupShape(1,32) not supported)
flashinfer_trtllm"kernel does not support current device cuda"
flashinfer_cutedslEngine init failure
deep_gemm"kernel does not support current device cuda"
emulationWorks, but ≤5 tok/s

After patching both ranges to < (13, 0) so SM 12.1 passes the gate, the kernel reaches JIT and fails:

Happy to test patches and collect logs from GB10 hardware.

Code Example

docker run --rm -it --runtime nvidia --ipc host --shm-size 16g \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_MXFP4_USE_MARLIN=0 \
  vllm/vllm-openai:nightly-aarch64 \
  --model openai/gpt-oss-120b \
  --quantization gpt_oss_mxfp4 \
  --moe-backend triton \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --reasoning-parser openai_gptoss

---

triton.runtime.errors.PTXASError: Internal Triton PTX codegen error
ptxas line 3727; error : Feature '.tile::scatter4' not supported on .target 'sm_121a'
... (16 more identical errors at different line numbers) ...
ptxas fatal : Ptx assembly aborted due to errors
RAW_BUFFERClick to expand / collapse

Your current environment

  • GPU: NVIDIA GB10 (DGX Spark) — SM 12.1 / sm_121a, 128 GB unified memory
  • Architecture: Grace Blackwell (consumer/edge variant)
  • Driver 580.142, CUDA 13.0, Ubuntu 24.04 ARM64
  • vLLM image: vllm/vllm-openai:nightly-aarch64 (v0.20.1rc1.dev91+ga749a33d8, 2026-04-30); same behavior on v0.20.0

🐛 Describe the bug

Serving openai/gpt-oss-120b with native MXFP4 on GB10 / DGX Spark (SM 12.1) has no working --moe-backend:

backendoutcome
marlin (default)Runs but emits broken first Harmony token → content: null, reasoning: null (#37030)
triton (after patching capability gate)ptxas error: Feature '.tile::scatter4' not supported on .target 'sm_121a'
flashinfer_cutlassQuant scheme mismatch (u8 / GroupShape(1,32) not supported)
flashinfer_trtllm"kernel does not support current device cuda"
flashinfer_cutedslEngine init failure
deep_gemm"kernel does not support current device cuda"
emulationWorks, but ≤5 tok/s

Both OAITritonExperts and UnfusedOAITritonExperts call matmul_ogs from triton_kernels, which is JIT-compiled with .tile::scatter4 PTX. That instruction is a TMA scatter feature (Hopper SM 9.x / Blackwell datacenter SM 10.x) — it is not part of the SM 12.1 (GB10/consumer Blackwell) ISA.

So the gpt-oss-120b MXFP4 kernel families currently in vLLM all assume datacenter-class TMA, which GB10 does not have. The only path that does not hit TMA is Marlin, which has the separate first-token correctness bug from #37030.

🔁 Reproduction

docker run --rm -it --runtime nvidia --ipc host --shm-size 16g \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_MXFP4_USE_MARLIN=0 \
  vllm/vllm-openai:nightly-aarch64 \
  --model openai/gpt-oss-120b \
  --quantization gpt_oss_mxfp4 \
  --moe-backend triton \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --reasoning-parser openai_gptoss

triton is gated by (9, 0) <= cap < (11, 0) in:

  • vllm/model_executor/layers/fused_moe/oracle/mxfp4.py:255
  • vllm/model_executor/layers/fused_moe/experts/gpt_oss_triton_kernels_moe.py:660

After patching both ranges to < (13, 0) so SM 12.1 passes the gate, the kernel reaches JIT and fails:

triton.runtime.errors.PTXASError: Internal Triton PTX codegen error
ptxas line 3727; error : Feature '.tile::scatter4' not supported on .target 'sm_121a'
... (16 more identical errors at different line numbers) ...
ptxas fatal : Ptx assembly aborted due to errors

Default Marlin path: engine starts, every /v1/chat/completions returns content: null (= #37030).

Expected behavior

gpt_oss_mxfp4 + GB10/DGX Spark should produce a usable response — the hardware advertises native MXFP4 tensor cores, DGX Spark is marketed as an AI workstation, and openai/gpt-oss-120b MXFP4 is the canonical Blackwell deployment.

Suggested directions

  1. Triton kernel path without tile::scatter4 — SM 12.x branch in matmul_ogs (or vLLM-local override) using regular tl.store scatter. Slower but functional.
  2. BF16 dequantize-on-load fallback — when gpt_oss_mxfp4 runs on a device with neither working Marlin nor TMA Triton, dequantize MXFP4 → BF16 at load and use the standard MoE kernel. Costs ~2× weight memory.
  3. Fix Marlin SM 12.1 first-token (#37030) — restores the de facto fallback.

Happy to test patches and collect logs from GB10 hardware.

Related

  • #37030 (Marlin null content on SM 12.1 — the fallback we currently land on)
  • #31607 (try/except HarmonyError; turns crash into empty response, doesn't fix kernel output)
  • #31740 (SM121/GB10 platform support, needs-rebase, FP8 focus, no MXFP4 MoE kernel)
  • #41028, #40923, #34822 (device-range extensions; helpful but don't fix TMA / Marlin)

Before submitting

  • Searched existing issues; #37030 is Marlin-only, this is a different layer (Triton PTX feature gap).

extent analysis

TL;DR

The most likely fix involves modifying the Triton kernel path to avoid using the tile::scatter4 feature not supported on SM 12.1.

Guidance

  • Identify and modify the matmul_ogs function in triton_kernels to use a compatible scatter method instead of .tile::scatter4.
  • Consider implementing a fallback to dequantize MXFP4 to BF16 at load and use the standard MoE kernel when gpt_oss_mxfp4 runs on a device without working Marlin or TMA Triton.
  • Fixing the Marlin SM 12.1 first-token issue (#37030) could also provide a functional workaround.

Example

No code snippet is provided due to the complexity of the issue and the need for specific modifications to the triton_kernels code.

Notes

The provided information suggests that the issue is specific to the SM 12.1 architecture and the gpt_oss_mxfp4 model. Any modifications should be thoroughly tested to ensure compatibility and functionality.

Recommendation

Apply a workaround by modifying the Triton kernel path to avoid using the tile::scatter4 feature, as this is the most direct approach to resolving the issue. This recommendation is based on the information provided and the analysis of the error messages.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

FAQ

Expected behavior

gpt_oss_mxfp4 + GB10/DGX Spark should produce a usable response — the hardware advertises native MXFP4 tensor cores, DGX Spark is marketed as an AI workstation, and openai/gpt-oss-120b MXFP4 is the canonical Blackwell deployment.

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING