vllm - 💡(How to fix) Fix [Bug]: Gemma-4 MoE Initialization Hang and Segfault



Your current environment

  • vLLM Version: 0.20.x (or current latest)
  • Model: google/gemma-4-26B-A4B-it
  • Hardware: 2x NVIDIA GeForce RTX 3090 (24GB)
  • Topology: PCIe Gen 4 (No NVLink)
  • OS: Linux (Ubuntu 22.04/24.04)
  • CUDA Version: 13.0
  • Python Version: 3.11+

🐛 Describe the bug


The google/gemma-4-26B-A4B-it (MoE) model fails to initialize in a Tensor Parallel (TP=2) configuration on Ampere hardware (RTX 3090). The failure occurs in three distinct stages:

  1. Configuration Error: A ValueError is raised because a single multimodal item can require up to 2,496 tokens (max_tokens_per_mm_item), which exceeds the default max_num_batched_tokens of 2,048.
  2. Runtime Initialization Hang: After the token budget is raised, the engine hangs during the "Wait for engine startup" phase. Logs show a shm_broadcast.py timeout: No available shared memory broadcast block found in 60 seconds.
  3. Segfault: Attempting to terminate the hung process leads to a Segfault reported in libc_sigaction.c, likely due to corrupted shared memory segments or leaked semaphore objects.

Steps to Reproduce

Attempt to serve the model on a dual-GPU system without NVLink using the following command:

vllm serve google/gemma-4-26B-A4B-it \
    --tensor-parallel-size 2 \
    --quantization fp8 \
    --max-model-len 8192 \
    --trust-remote-code

Error Logs / Traceback

Multimodal Budget Error:

ValueError: Chunked MM input disabled but max_tokens_per_mm_item (2496) is larger than max_num_batched_tokens (2048). Please increase max_num_batched_tokens.
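The error message itself asks for a larger max_num_batched_tokens. A minimal sketch of the adjusted launch command follows, using vLLM's --max-num-batched-tokens option. The value 4096 is an assumption, not a figure from the report: any budget of at least 2496 should pass this check, and a larger budget may raise peak memory use on 24GB cards. The helper budget_ok and the serve_cmd array are illustrative names only.

```shell
#!/usr/bin/env bash
MM_ITEM_TOKENS=2496        # max_tokens_per_mm_item, taken from the ValueError
MAX_BATCHED_TOKENS=4096    # assumed value; not taken from the report

# Succeeds when the batch budget ($2) can hold one full multimodal item ($1).
budget_ok() {
    [ "$2" -ge "$1" ]
}

# Same command as in the report, plus the larger token budget.
serve_cmd=(
    vllm serve google/gemma-4-26B-A4B-it
    --tensor-parallel-size 2
    --quantization fp8
    --max-model-len 8192
    --trust-remote-code
    --max-num-batched-tokens "$MAX_BATCHED_TOKENS"
)

if budget_ok "$MM_ITEM_TOKENS" "$MAX_BATCHED_TOKENS"; then
    # Print the command for review; launch it with: "${serve_cmd[@]}"
    echo "${serve_cmd[*]}"
else
    echo "max_num_batched_tokens must be at least $MM_ITEM_TOKENS" >&2
fi
```

The script only prints the command so it can be reviewed first; replacing the echo with "${serve_cmd[@]}" starts the server.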

Initialization Hang and Segfault:

(EngineCore pid=xxxx) INFO [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds.
(APIServer pid=xxxx) RuntimeError: Engine core initialization failed.
!!!!!!! Segfault encountered !!!!!!!
File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x000076536ea4251f
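The report notes that semaphore and shared memory objects are left in /dev/shm/ after the crash. A small sketch for inspecting them before the next launch is shown here. The name patterns are assumptions based on Python's multiprocessing defaults (psm_* for shared memory segments, sem.mp-* for named semaphores); the objects left behind by vLLM may be named differently, so check the listing before deleting anything. The function name list_stale_shm is hypothetical.

```shell
#!/usr/bin/env bash
# List leftover shared-memory objects owned by the current user.
# Assumed name patterns (Python multiprocessing defaults, not confirmed for vLLM):
#   psm_*     shared memory segments
#   sem.mp-*  named semaphores
list_stale_shm() {
    local dir="${1:-/dev/shm}"
    # Nothing to list if the directory does not exist.
    [ -d "$dir" ] || return 0
    find "$dir" -maxdepth 1 -type f -user "$(id -u)" \
        \( -name 'psm_*' -o -name 'sem.mp-*' \) -print
}

# Review the listing first.
list_stale_shm

# Delete only when no vLLM or other Python process is still running:
# list_stale_shm | xargs -r rm -f --
```

Deletion is left commented out on purpose, because removing an object that a live process still uses can break that process.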

Additional Context

  • The issue persists even when forcing VLLM_USE_V1=0.
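  • A temporary workaround was found: pass --disable-custom-all-reduce to vllm serve and set the environment variable NCCL_P2P_DISABLE=1.
  • Because this workaround turns off the custom all-reduce kernel and NCCL peer-to-peer transfers, the hang appears tied to GPU-to-GPU communication on PCIe-only topologies (PXB). The newer V1 SHM broadcast mechanism may be unstable there when handling the complex routing requirements of the Gemma-4 MoE architecture on Ampere cards.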
  • Leaked semaphore and shared memory objects are consistently left in /dev/shm/ after the crash.
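The workaround described in the list above could be applied as follows. This is a sketch only: NCCL_P2P_DISABLE=1 and --disable-custom-all-reduce come from the report, while --max-num-batched-tokens 4096 is an assumed value added so the first-stage ValueError does not recur. The serve_cmd array is an illustrative name.

```shell
#!/usr/bin/env bash
# Disable NCCL peer-to-peer transfers between the two GPUs (from the report).
export NCCL_P2P_DISABLE=1

# Original command plus the workaround flag.
# --max-num-batched-tokens 4096 is an assumed value, not taken from the report.
serve_cmd=(
    vllm serve google/gemma-4-26B-A4B-it
    --tensor-parallel-size 2
    --quantization fp8
    --max-model-len 8192
    --trust-remote-code
    --max-num-batched-tokens 4096
    --disable-custom-all-reduce
)

# Print the full invocation for review; launch it with: "${serve_cmd[@]}"
echo "NCCL_P2P_DISABLE=$NCCL_P2P_DISABLE ${serve_cmd[*]}"
```

Both settings route inter-GPU traffic away from direct peer-to-peer paths, so some loss of tensor-parallel throughput is to be expected while the workaround is in place.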
