[Bug]: gpt-oss-120b MXFP4 MoE init OOM-killed on unified-memory ARM (DGX Spark / Jetson Thor)

Your current environment

Environment collected from python -m vllm.collect_env inside the upstream-vllm validation containers (one Thor, one Spark). Both runs use the same public.ecr.aws/q9t5s3a7/vllm-release-repo aarch64 nightly.

Both hosts (validation container)

  • OS: Ubuntu 22.04.5 LTS (aarch64)
  • Python: 3.12.13
  • CUDA runtime (in container): 13.0.88
  • PyTorch: 2.11.0+cu130
  • vLLM: 0.21.1rc1.dev201+g1fe330398 (aarch64 nightly from public.ecr.aws/q9t5s3a7/vllm-release-repo)

Host A — Jetson AGX Thor

  • GPU: NVIDIA Thor (SM 11.0), reported as iGPU on tegra kernel
  • Driver: 595.73
  • CPU: 14 ARM cores
  • Kernel: Linux 6.8.12-1021-tegra
  • Unified memory architecture (no discrete VRAM)

Host B — DGX Spark GB10

  • GPU: NVIDIA GB10 (SM 12.1)
  • Driver: 580.78
  • CPU: 20 ARM cores
  • Kernel: Linux 6.11.0-1013-nvidia
  • Unified memory architecture (no discrete VRAM)

Describe the bug

Loading openai/gpt-oss-120b in MXFP4 with vllm serve on either of the two unified-memory ARM systems above is host-OOM-killed during MoE quantization-method initialization. The kernel sends SIGKILL (exit 137) to the vllm serve process before model weights finish materializing — this is a Linux OOM-kill, not a CUDA OOM.

The failure is 100% reproducible on both hosts with the upstream aarch64 nightly. The same workload at upstream commit 132765e3560659ff63ebd236203672e991b70e08 (the v0.20.1 release tag, 2026-05-04) succeeds on the same Thor host with the same command — the model loads, KV-cache profiling runs, and Application startup complete is reached. The same workload at e47c98ef7a38792996e452ef53914e21e41928e9 (2026-05-06, 2 days and ~30 first-parent commits later) fails with SIGKILL.
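
When triaging a run like this from a wrapper script, it helps to tell a signal death apart from an ordinary non-zero exit. The following is a minimal sketch, not part of vLLM; the function name `describe_exit` is hypothetical. It assumes the two usual conventions: Python's `subprocess` reports death by signal as a negative return code, and shells report it as 128 plus the signal number.

```python
import signal


def describe_exit(returncode: int) -> str:
    """Explain an exit status in terms of signals (heuristic).

    subprocess reports a signal death as -N (e.g. -9 for SIGKILL);
    a shell reports it as 128 + N (e.g. 137 for SIGKILL).
    A program can also exit with a status above 128 on its own, so the
    128 + N interpretation is a convention, not a guarantee.
    """
    if returncode < 0:
        signum = -returncode
    elif returncode > 128:
        signum = returncode - 128
    else:
        return f"exited normally with status {returncode}"
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # Not a signal number known on this platform.
        name = f"signal {signum}"
    return f"killed by {name}"


if __name__ == "__main__":
    print(describe_exit(137))
    print(describe_exit(-9))
```

A SIGKILL alone does not prove an OOM-kill; the kernel log still has to confirm it.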

Expected behavior

On an aarch64 host with sufficient combined system+device memory (the two hosts above each carry well over the steady-state footprint of the 120B MXFP4 model):

  1. vllm serve openai/gpt-oss-120b --tensor-parallel-size 1 --max-model-len 8192 should bring up the engine without being SIGKILL'd by the kernel.
  2. After Using MoEPrepareAndFinalizeNoDPEPModular is logged, MARLIN MXFP4 weight conversion and MoE-kernel initialization should complete, and weight loading should progress past the MoE layers.
  3. The KV-cache profiling pass should run.
  4. The server should reach Application startup complete and accept OpenAI-API requests on the configured port.

This is exactly what is observed at 132765e3560 (v0.20.1, 2026-05-04) on the same Thor host with the same command.
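
For scripted PASS/FAIL checks, step 4 can be approximated by polling the server until it answers. This is a sketch under assumptions: it relies on the `/health` route of vLLM's OpenAI-compatible server and on the default port 8000, neither of which is stated in this report, and `wait_ready` is a hypothetical helper name.

```python
import time
import urllib.error
import urllib.request


def wait_ready(base_url: str, timeout_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Poll `<base_url>/health` until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=interval_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not listening yet (or already killed); retry until the deadline.
            pass
        time.sleep(interval_s)
    return False


if __name__ == "__main__":
    # Assumed default address; adjust to the configured host and port.
    print("PASS" if wait_ready("http://127.0.0.1:8000") else "FAIL")
```

A `False` result only says the server never became ready; it does not distinguish an OOM-kill from any other startup failure.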

Actual behavior

At e47c98ef7a38 (2026-05-06) and on every later nightly tested (May 6 / 11 / 16 / 20 — all aarch64 ECR Public builds), and at HEAD-of-main (0.21.1rc1.dev201+g1fe330398):

  1. vllm serve starts up normally and prints its usual init logs.

  2. Weight files are read and the MoE quantization method is selected; the last log line emitted by the engine is consistently:

    (EngineCore_DP0 pid=…) INFO … mxfp4.py:… Using MoEPrepareAndFinalizeNoDPEPModular

  3. Immediately after that line, the vllm serve process exits with status 137 (SIGKILL). There is no Python traceback, no CUDA OOM error, and no shutdown banner.

  4. The Linux OOM-killer reports killing the vllm serve process in dmesg / kernel logs (this is a host-RAM OOM — the GPU side never gets to allocate).

  5. Server startup never reaches Application startup complete. No request is ever served. Re-running the same command reproduces the same kill point on every attempt.
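
The kernel-log check in step 4 can be automated. The sketch below assumes the message shape used by recent Linux kernels, "Out of memory: Killed process PID (COMM) ... anon-rss:NkB ..."; the exact wording varies between kernel versions and is not quoted in this report, so the regular expression is an assumption. `OomKill` and `find_oom_kills` are hypothetical names.

```python
import re
from typing import List, NamedTuple, Optional


class OomKill(NamedTuple):
    pid: int
    comm: str                    # kernel task name (may be truncated)
    anon_rss_kb: Optional[int]   # None if the line carries no anon-rss field


# Assumed message shape; also matches "Memory cgroup out of memory: Killed process ...".
_OOM_RE = re.compile(
    r"[Oo]ut of memory: Killed process (?P<pid>\d+) \((?P<comm>[^)]*)\)"
    r"(?:.*?anon-rss:(?P<rss>\d+)kB)?"
)


def find_oom_kills(kernel_log: str) -> List[OomKill]:
    """Return one record per OOM-kill line found in the given kernel log text."""
    kills = []
    for line in kernel_log.splitlines():
        m = _OOM_RE.search(line)
        if m:
            rss = int(m["rss"]) if m["rss"] else None
            kills.append(OomKill(int(m["pid"]), m["comm"], rss))
    return kills
```

Feed it the text of `dmesg` or `journalctl -k` captured right after the failed run, then compare the reported PID with the `pid=` value in the last EngineCore log line.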

Reproduction

Minimal command (single GPU, both hosts, batch-size 1):

vllm serve openai/gpt-oss-120b \
    --tensor-parallel-size 1 \
    --max-model-len 8192

Fails on aarch64 nightlies from 2026-05-06 onward; passes at v0.20.1. Reproduces on both unified-memory hosts described above with the upstream aarch64 ECR Public images (no NVIDIA-internal patches).

Bisect (narrowed window)

Using aarch64 images from public.ecr.aws/q9t5s3a7/vllm-release-repo, the regression has been narrowed to:

Date (UTC)   Upstream SHA                               Tag/release           Result on Thor
2026-04-28   7fd05e05aeb3664ca19346771dc559d93423acd4   (pre-v0.20.1)         PASS
2026-05-04   132765e3560659ff63ebd236203672e991b70e08   v0.20.1               PASS
2026-05-06   e47c98ef7a38792996e452ef53914e21e41928e9   (post-v0.20.1)        FAIL (first known bad)
2026-05-11   (later nightly)                            (mid-window)          FAIL
2026-05-16   (later nightly)                            (mid-window)          FAIL
2026-05-20   (later nightly)                            (close to v0.21.0)    FAIL

The regression lies in the ~30 first-parent commits between 132765e3560 and e47c98ef7a38 (≈2 days of upstream history). Builds up to and including 2026-05-04 pass; builds from 2026-05-06 onward fail on these hosts.
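
The window can be derived mechanically from the results. This sketch encodes the reported dates and outcomes (SHAs abbreviated to 12 characters, `None` where the report gives no SHA) and returns the last passing and first failing build. `regression_window` is a hypothetical helper, not an existing tool.

```python
from typing import List, Optional, Tuple

Row = Tuple[str, Optional[str], bool]  # (ISO date, short SHA or None, passed?)

RESULTS: List[Row] = [
    ("2026-04-28", "7fd05e05aeb3", True),
    ("2026-05-04", "132765e35606", True),
    ("2026-05-06", "e47c98ef7a38", False),
    ("2026-05-11", None, False),
    ("2026-05-16", None, False),
    ("2026-05-20", None, False),
]


def regression_window(results: List[Row]) -> Tuple[Optional[Row], Optional[Row]]:
    """Return (last good, first bad) for date-ordered results.

    Raises ValueError if a PASS follows a FAIL, because a plain bisect
    assumes one good-to-bad transition.
    """
    last_good: Optional[Row] = None
    first_bad: Optional[Row] = None
    for row in sorted(results, key=lambda r: r[0]):  # ISO dates sort lexicographically
        if row[2]:
            if first_bad is not None:
                raise ValueError("PASS after FAIL: results are not monotonic")
            last_good = row
        elif first_bad is None:
            first_bad = row
    return last_good, first_bad


if __name__ == "__main__":
    good, bad = regression_window(RESULTS)
    print("last good:", good)
    print("first bad:", bad)
```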

Observed log signature

(EngineCore_DP0 pid=…) INFO … mxfp4.py:… Using MoEPrepareAndFinalizeNoDPEPModular
<process exits with SIGKILL / status 137 — no Python traceback>
<kernel OOM-killer reports vllm serve process killed>

Code-path investigation

Under vllm/model_executor/layers/quantization/:

  • utils/marlin_utils_fp4.py is byte-identical between v0.20.1 and v0.21.0 (751 lines, 0 diff).
  • MarlinExpertsBase.__init__ is byte-identical.
  • The MARLIN branch of convert_gpt_oss_weight_to_mxfp4_moe_kernel_format and the MARLIN branch of make_mxfp4_moe_kernel are byte-identical.
  • The mxfp4.py line shifts (1270 → 1498 → 1515 → 1683 across releases) are explained by new sibling backends inserted before the MARLIN branch — HUMMING, AITER_MXFP4_MXFP4, AITER_MXFP4_FP8, CPU — not by changes to MARLIN itself.

In other words, the regression does not appear to be in the MARLIN code path that ultimately runs on these hosts; something earlier in the import / class-init / preload sequence (or in a global state-setup change introduced in the window) is causing a transient host-memory blow-up before the MARLIN path is exercised.
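
The whole-file comparison above can be repeated for any path with a sketch like the following, run from a local vLLM checkout that has both tags. It relies only on `git show REF:PATH`; the helper names are hypothetical. It covers whole files only, so the per-method and per-branch comparisons listed above still need a manual diff of the relevant regions.

```python
import hashlib
import subprocess


def blob_at(repo: str, ref: str, path: str) -> bytes:
    """Return the bytes of `path` as stored at `ref`, via `git show REF:PATH`."""
    result = subprocess.run(
        ["git", "-C", repo, "show", f"{ref}:{path}"],
        check=True,
        capture_output=True,
    )
    return result.stdout


def digest(data: bytes) -> str:
    """SHA-256 hex digest, convenient for logging next to the verdict."""
    return hashlib.sha256(data).hexdigest()


def file_unchanged(repo: str, ref_a: str, ref_b: str, path: str) -> bool:
    """True if `path` has identical bytes at both refs."""
    return digest(blob_at(repo, ref_a, path)) == digest(blob_at(repo, ref_b, path))


if __name__ == "__main__":
    path = "vllm/model_executor/layers/quantization/utils/marlin_utils_fp4.py"
    # "." is assumed to be the root of a vLLM checkout.
    print(file_unchanged(".", "v0.20.1", "v0.21.0", path))
```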

Strongest suspects in the narrowed window

From git log v0.20.1..e47c98ef7a38 --first-parent --oneline, the commits most likely to interact with HMA / unified-memory hosts and with model-loading peak host memory are:

  • 2fa1f8ec00cf [kv_offload+HMA][13/N]: Enable HMA support
  • efdc95674db5 [KVConnector] MultiConnector SupportsHMA
  • 941fb5083552 [kv_offload+HMA][12/N]
  • f03d82efdd88 (model-loader max_split_size_mb fix)
  • 08834cc3ceb8 (humming MXFP4 backend insertion before MARLIN branch)
  • 2ce95a761b9a (cumem memory-pool changes)

These are the most plausible sources of an init-time host-memory blow-up on a unified-memory aarch64 host.
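
To compare candidate commits by peak host memory rather than only by PASS/FAIL, one option is to sample the `VmHWM` (peak resident set) and `VmRSS` fields of `/proc/<pid>/status` for the engine process while it initializes. This is a sketch with hypothetical helper names; whether device-side allocations on unified-memory hosts are reflected in these counters is not established in this report.

```python
import re
from typing import Optional


def parse_vm_kb(status_text: str, field: str) -> Optional[int]:
    """Extract a kB-valued field such as 'VmHWM' or 'VmRSS' from /proc/<pid>/status text."""
    m = re.search(rf"^{re.escape(field)}:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(m.group(1)) if m else None


def read_peak_rss_kb(pid: int) -> Optional[int]:
    """Peak resident set size of a live process in kB, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/status") as f:
            return parse_vm_kb(f.read(), "VmHWM")
    except OSError:
        # The process has already exited (or been killed), so /proc/<pid> is gone.
        return None
```

Because the `/proc` entry disappears once the process is killed, the value has to be sampled periodically during startup and the last reading kept. The PID to watch is the one printed in the `EngineCore_DP0 pid=…` log prefix.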

Suggested next step

A targeted second-stage bisect across those ~30 first-parent commits between 132765e3560 and e47c98ef7a38 should pinpoint the offending commit. We have aarch64 images at the two boundary SHAs and can produce additional bisect points if upstream pushes nightly aarch64 builds for intermediate SHAs in this window.
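
The second-stage bisect is a binary search over the ordered commit list. A minimal sketch, assuming a single good-to-bad transition: `is_bad` is a hypothetical callback that builds (or pulls) an aarch64 image for a SHA, runs the reproduction command, and reports whether it was OOM-killed. With about 30 commits this needs at most 5 probes.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Find the first failing commit.

    `commits` is ordered oldest to newest, starts just after the last known
    good commit, and ends with a commit already known to be bad.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is bad, all before lo are good
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid        # the culprit is at mid or earlier
        else:
            lo = mid + 1    # the culprit is after mid
    return commits[lo]


if __name__ == "__main__":
    # Placeholder commit names; the real list would come from
    # `git log v0.20.1..e47c98ef7a38 --first-parent`, oldest first.
    demo = [f"c{i}" for i in range(30)]
    print(first_bad(demo, lambda c: int(c[1:]) >= 17))
```

`git bisect` with `--first-parent` performs the same search inside a checkout; the explicit version is useful when each probe is a container image rather than a local build.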

Workaround statuses

  • VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 does not help on SM 11.0 (Thor) or SM 12.1 (GB10 Spark): the FlashInfer kernel-supported-device gate rejects these compute capabilities, so the runtime falls back to the same MARLIN path and is OOM-killed at the same line.
  • Downgrading the upstream image to v0.20.1 (132765e3560) works around the regression but is not a long-term fix.

Other backend statuses on these hosts

  • The same model and command on amd64 H100 / B200 (discrete-VRAM) hosts using the same upstream nightly is fine; the failure is specific to unified-memory aarch64 hosts.
  • We have not yet observed the same failure with any non-MXFP4 quantization on these hosts, but coverage is limited.

Before submitting a new issue…

  • I searched existing issues for related reports and found nothing specific to MXFP4 MoE init host-OOM on unified-memory aarch64 hosts in the v0.20.1..v0.21.0 window.
