vllm - 💡(How to fix) Fix [Bug]: Triton MXFP4 MoE kernel uses .tile::scatter4 PTX (Hopper/SM10 only) — fails on SM 12.1 (GB10/DGX Spark); Marlin fallback hits #37030 [1 participants]


vllm · 2026-05-01 19:20:58

GitHub stats

vllm-project/vllm#41477 • Fetched 2026-05-02 05:27:55

Author

vbalko-claimate

Participants

vbalko-claimate

Error Message

triton.runtime.errors.PTXASError: Internal Triton PTX codegen error
ptxas line 3727; error : Feature '.tile::scatter4' not supported on .target 'sm_121a'
... (16 more identical errors at different line numbers) ...
ptxas fatal : Ptx assembly aborted due to errors



Your current environment

GPU: NVIDIA GB10 (DGX Spark) — SM 12.1 / sm_121a, 128 GB unified memory
Architecture: Grace Blackwell (consumer/edge variant)
Driver 580.142, CUDA 13.0, Ubuntu 24.04 ARM64
vLLM image: vllm/vllm-openai:nightly-aarch64 (v0.20.1rc1.dev91+ga749a33d8, 2026-04-30); same behavior on v0.20.0

🐛 Describe the bug

Serving openai/gpt-oss-120b with native MXFP4 on GB10 / DGX Spark (SM 12.1) has no working --moe-backend:

| Backend | Outcome |
| --- | --- |
| `marlin` (default) | Runs but emits broken first Harmony token → `content: null`, `reasoning: null` (#37030) |
| `triton` (after patching capability gate) | `ptxas error: Feature '.tile::scatter4' not supported on .target 'sm_121a'` |
| `flashinfer_cutlass` | Quant scheme mismatch (`u8 / GroupShape(1,32)` not supported) |
| `flashinfer_trtllm` | "kernel does not support current device cuda" |
| `flashinfer_cutedsl` | Engine init failure |
| `deep_gemm` | "kernel does not support current device cuda" |
| `emulation` | Works, but ≤5 tok/s |

Both OAITritonExperts and UnfusedOAITritonExperts call matmul_ogs from triton_kernels, which is JIT-compiled with .tile::scatter4 PTX. That instruction is a TMA scatter feature (Hopper SM 9.x / Blackwell datacenter SM 10.x) — it is not part of the SM 12.1 (GB10/consumer Blackwell) ISA.

So the gpt-oss-120b MXFP4 kernel families currently in vLLM all assume datacenter-class TMA, which GB10 does not have. The only path that does not hit TMA is Marlin, which has the separate first-token correctness bug from #37030.

🔁 Reproduction

docker run --rm -it --runtime nvidia --ipc host --shm-size 16g \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_MXFP4_USE_MARLIN=0 \
  vllm/vllm-openai:nightly-aarch64 \
  --model openai/gpt-oss-120b \
  --quantization gpt_oss_mxfp4 \
  --moe-backend triton \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --reasoning-parser openai_gptoss

triton is gated by (9, 0) <= cap < (11, 0) in:

vllm/model_executor/layers/fused_moe/oracle/mxfp4.py:255
vllm/model_executor/layers/fused_moe/experts/gpt_oss_triton_kernels_moe.py:660

After patching both ranges to < (13, 0) so SM 12.1 passes the gate, the kernel reaches JIT and fails:

triton.runtime.errors.PTXASError: Internal Triton PTX codegen error
ptxas line 3727; error : Feature '.tile::scatter4' not supported on .target 'sm_121a'
... (16 more identical errors at different line numbers) ...
ptxas fatal : Ptx assembly aborted due to errors
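
The gate and the patched bound can be illustrated with plain tuple comparison. This is a minimal sketch, not the vLLM source: the function names `triton_gate_upstream` and `triton_gate_patched` are hypothetical, and only the two ranges are taken from the report above.

```python
# Hypothetical helpers that mirror the two capability ranges quoted in the issue.
# Compute capabilities are compared as (major, minor) tuples.
Capability = tuple[int, int]


def triton_gate_upstream(cap: Capability) -> bool:
    """Range reported for the current gate: (9, 0) <= cap < (11, 0)."""
    return (9, 0) <= cap < (11, 0)


def triton_gate_patched(cap: Capability) -> bool:
    """Range after the local patch described in the issue: upper bound (13, 0)."""
    return (9, 0) <= cap < (13, 0)


if __name__ == "__main__":
    devices = {"SM 9.0": (9, 0), "SM 10.0": (10, 0), "SM 12.1 (GB10)": (12, 1)}
    for name, cap in devices.items():
        print(f"{name}: upstream={triton_gate_upstream(cap)} "
              f"patched={triton_gate_patched(cap)}")
```

Widening the range only changes which devices the Python-side check admits. It says nothing about whether the PTX that Triton emits will assemble for the target, which is why the patched run still ends in the ptxas error above.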

Default Marlin path: engine starts, every /v1/chat/completions returns content: null (= #37030).
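
When comparing backends, it can help to flag the Marlin symptom automatically rather than by eye. The sketch below assumes the usual OpenAI-style chat completion JSON shape (`choices[0].message`); the helper name `is_null_completion` is hypothetical, the sample payloads are hand-written for illustration rather than captured logs, and checking `reasoning_content` alongside `reasoning` is an assumption to cover differing field names.

```python
from typing import Any


def is_null_completion(response: dict[str, Any]) -> bool:
    """Return True when the first choice has neither content nor reasoning text."""
    choices = response.get("choices") or []
    if not choices:
        return True
    message = choices[0].get("message") or {}
    content = message.get("content")
    # The issue reports the field as `reasoning`; `reasoning_content` is
    # checked too as an assumption about alternative field naming.
    reasoning = message.get("reasoning") or message.get("reasoning_content")
    return not content and not reasoning


if __name__ == "__main__":
    # Hand-written payloads, not real server output.
    broken = {"choices": [{"message": {"content": None, "reasoning": None}}]}
    healthy = {"choices": [{"message": {"content": "Hello", "reasoning": None}}]}
    print(is_null_completion(broken), is_null_completion(healthy))
```

A check like this could be run against each candidate backend after a patch, so that a change which merely turns a crash into an empty response is not mistaken for a fix.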

Expected behavior

gpt_oss_mxfp4 + GB10/DGX Spark should produce a usable response — the hardware advertises native MXFP4 tensor cores, DGX Spark is marketed as an AI workstation, and openai/gpt-oss-120b MXFP4 is the canonical Blackwell deployment.

Suggested directions

- Triton kernel path without tile::scatter4: an SM 12.x branch in matmul_ogs (or a vLLM-local override) that scatters with regular tl.store. Slower but functional.
- BF16 dequantize-on-load fallback: when gpt_oss_mxfp4 runs on a device with neither a working Marlin path nor TMA-capable Triton, dequantize MXFP4 → BF16 at load and use the standard MoE kernel. Costs roughly 3.8× the memory for the affected weights (16 bits versus about 4.25 bits per weight including block scales), so it may not fit gpt-oss-120b in 128 GB.
- Fix Marlin SM 12.1 first-token (#37030): restores the de facto fallback.
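
To make the dequantize-on-load direction concrete, here is a minimal pure-Python sketch of decoding one MXFP4 block, assuming the OCP microscaling layout: 32 E2M1 values that share one E8M0 scale. The nibble order (low nibble first) and the function names are assumptions for illustration; this is not the vLLM or triton_kernels loader, and a real fallback would write BF16 tensors rather than Python floats.

```python
# Magnitudes representable by the 3 low bits of an E2M1 (FP4) code.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 32  # elements that share one E8M0 scale


def decode_fp4(code: int) -> float:
    """Decode a 4-bit E2M1 code: bit 3 is the sign, bits 0-2 index the magnitude."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_VALUES[code & 0x7]


def dequantize_block(packed: bytes, scale_e8m0: int) -> list[float]:
    """Expand 16 packed bytes (two FP4 codes each) into 32 scaled floats."""
    if len(packed) != BLOCK // 2:
        raise ValueError("expected 16 bytes per 32-element block")
    if scale_e8m0 == 0xFF:
        raise ValueError("E8M0 scale 0xFF encodes NaN")
    scale = 2.0 ** (scale_e8m0 - 127)  # E8M0 is a biased power-of-two exponent
    out: list[float] = []
    for byte in packed:
        # Assumption: the low nibble holds the earlier element.
        out.append(decode_fp4(byte & 0x0F) * scale)
        out.append(decode_fp4(byte >> 4) * scale)
    return out


def bits_per_weight() -> float:
    """Storage per weight: 4 bits of data plus an 8-bit scale amortized over 32."""
    return (BLOCK * 4 + 8) / BLOCK


if __name__ == "__main__":
    block = dequantize_block(bytes([0x21] * 16), scale_e8m0=128)
    print(block[:2], bits_per_weight(), 16 / bits_per_weight())
```

Counting bits this way gives 16 / 4.25 ≈ 3.8 for the weights that are actually stored as MXFP4, which is the memory cost to weigh against having a functional path.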

Happy to test patches and collect logs from GB10 hardware.

Related

- #37030 (Marlin null content on SM 12.1; the fallback we currently land on)
- #31607 (try/except HarmonyError; turns crash into empty response, doesn't fix kernel output)
- #31740 (SM121/GB10 platform support, needs-rebase, FP8 focus, no MXFP4 MoE kernel)
- #41028, #40923, #34822 (device-range extensions; helpful but don't fix TMA / Marlin)

Before submitting

Searched existing issues; #37030 is Marlin-only, this is a different layer (Triton PTX feature gap).

extent analysis

TL;DR

The most likely fix is to change the Triton kernel path so that it no longer uses tile::scatter4, which ptxas rejects for SM 12.1 (sm_121a).

Guidance

Identify and modify the matmul_ogs function in triton_kernels to use a compatible scatter method instead of .tile::scatter4.
Consider implementing a fallback to dequantize MXFP4 to BF16 at load and use the standard MoE kernel when gpt_oss_mxfp4 runs on a device without working Marlin or TMA Triton.
Fixing the Marlin SM 12.1 first-token issue (#37030) could also provide a functional workaround.

Example

No code snippet is provided due to the complexity of the issue and the need for specific modifications to the triton_kernels code.

Notes

The report indicates that the issue is specific to SM 12.1 hardware running the gpt_oss_mxfp4 quantization scheme. Any change should be tested on GB10 to confirm that the kernel both compiles and produces correct output.

Recommendation

Modify the Triton kernel path so that it does not emit tile::scatter4 on SM 12.1. Of the three suggested directions, this one addresses the reported PTX assembly failure most directly.
