vllm - How to fix [Bug]: Gemma-4 MoE Initialization Hang and Segfault

StepCodex · 2026-05-12T07:51:04Z


Error Message

ValueError: Chunked MM input disabled but max_tokens_per_mm_item (2496) is larger than max_num_batched_tokens (2048). Please increase max_num_batched_tokens.

Root Cause

1. Configuration Error: A ValueError is raised because the vision encoder produces 2,496 tokens, which exceeds the default max_num_batched_tokens of 2,048.
2. Runtime Initialization Hang: After resolving the token budget, the engine hangs during the "Wait for engine startup" phase. Logs indicate a shm_broadcast.py timeout: "No available shared memory broadcast block found in 60 seconds."
3. Segfault: Attempting to terminate the hanging process leads to a Segfault in libc_sigaction.c, likely due to corrupted shared memory segments or leaked semaphore objects.
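The first stage is plain arithmetic: with chunked multimodal input disabled, a single multimodal item (2,496 tokens here) has to fit inside one batch. A minimal sketch of that comparison follows. `budget_fits` is a hypothetical helper written for illustration, not part of vLLM, and 4096 is only an example value above the threshold.

```bash
# Hypothetical helper (not part of vLLM): succeeds when one multimodal item
# fits inside the per-step token budget.
#   $1 = max_tokens_per_mm_item, $2 = max_num_batched_tokens
budget_fits() {
    [ "$1" -le "$2" ]
}

# Values taken from the error message in this issue.
budget_fits 2496 2048 || echo "2048 is too small: need at least 2496"

# 4096 is an illustrative choice; any value >= 2496 passes this check.
budget_fits 2496 4096 && echo "4096 is large enough"
```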

Fix Action

Fix / Workaround

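The configuration error is resolved by raising max_num_batched_tokens to at least 2,496, as the error message requests.
The initialization hang persists even when VLLM_USE_V1=0 is forced.
A temporary workaround for the hang is to pass --disable-custom-all-reduce and set NCCL_P2P_DISABLE=1.
The reporter's hypothesis is that the newer V1 shared-memory (SHM) broadcast mechanism may be unstable on PCIe-only topologies (PXB) when handling the routing requirements of the Gemma-4 MoE architecture on Ampere cards. This is not confirmed, and the hang also reproduces with VLLM_USE_V1=0.
Leaked semaphore and shared memory objects are consistently left in /dev/shm/ after the crash.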
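Put together, the reported workaround might look like the launch wrapper below. This is a sketch under stated assumptions rather than a verified fix: `launch_gemma4`, `DRY_RUN`, and `MAX_BATCHED_TOKENS` are hypothetical names introduced here, and 4096 is an illustrative budget (the error only requires a value of at least 2496). The vLLM flags and the NCCL variable are the ones named in the issue and its error message.

```bash
# Assumption: 4096 is an illustrative budget; the error only requires >= 2496.
MAX_BATCHED_TOKENS="${MAX_BATCHED_TOKENS:-4096}"

# Prints the command by default; set DRY_RUN=0 to actually start the server.
launch_gemma4() {
    # Build the argument list in the function's positional parameters.
    set -- vllm serve google/gemma-4-26B-A4B-it \
        --tensor-parallel-size 2 \
        --quantization fp8 \
        --max-model-len 8192 \
        --max-num-batched-tokens "$MAX_BATCHED_TOKENS" \
        --disable-custom-all-reduce \
        --trust-remote-code
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "NCCL_P2P_DISABLE=1 $*"
    else
        # Disable NCCL peer-to-peer transfers for this process only.
        NCCL_P2P_DISABLE=1 "$@"
    fi
}

launch_gemma4
```

Running it as-is only prints the command for review; `DRY_RUN=0 launch_gemma4` starts the server with the workaround applied.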

Code Example

vllm serve google/gemma-4-26B-A4B-it \
    --tensor-parallel-size 2 \
    --quantization fp8 \
    --max-model-len 8192 \
    --trust-remote-code

---

ValueError: Chunked MM input disabled but max_tokens_per_mm_item (2496) is larger than max_num_batched_tokens (2048). Please increase max_num_batched_tokens.

---

(EngineCore pid=xxxx) INFO [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds.
(APIServer pid=xxxx) RuntimeError: Engine core initialization failed.
!!!!!!! Segfault encountered !!!!!!!
File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x000076536ea4251f

Your current environment

Environment:

vLLM Version: 0.20.x (or current latest)

Model: google/gemma-4-26B-A4B-it

Hardware: 2x NVIDIA GeForce RTX 3090 (24GB)

Topology: PCIe Gen 4 (No NVLink)

OS: Linux (Ubuntu 22.04/24.04)

CUDA Version: 13.0

Python Version: 3.11+
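To confirm the "No NVLink" topology on a given machine, the connectivity matrix from `nvidia-smi topo -m` can be inspected. The sketch below assumes the usual matrix layout, where NVLink pairs show cells such as NV1 or NV4 and PCIe-only pairs show labels such as PXB. `link_type` is a hypothetical helper written for illustration.

```bash
# Hypothetical helper: reads `nvidia-smi topo -m` output on stdin and reports
# whether any GPU pair is connected through NVLink (matrix cells like NV1, NV4).
link_type() {
    if grep -Eq '(^|[[:space:]])NV[0-9]+([[:space:]]|$)'; then
        echo "nvlink"
    else
        echo "pcie-only"
    fi
}

# Only query the driver when nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi topo -m | link_type
fi
```

On the setup described in this issue, the helper would be expected to print `pcie-only`.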

🐛 Describe the bug

The google/gemma-4-26B-A4B-it (MoE) model fails to initialize in a Tensor Parallel (TP=2) configuration on Ampere hardware (RTX 3090). The failure occurs in three distinct stages:

1. Configuration Error: A ValueError is raised because the vision encoder produces 2,496 tokens, which exceeds the default max_num_batched_tokens of 2,048.
2. Runtime Initialization Hang: After resolving the token budget, the engine hangs during the "Wait for engine startup" phase. Logs indicate a shm_broadcast.py timeout: "No available shared memory broadcast block found in 60 seconds."
3. Segfault: Attempting to terminate the hanging process leads to a Segfault in libc_sigaction.c, likely due to corrupted shared memory segments or leaked semaphore objects.

Steps to Reproduce

Attempt to serve the model on a dual-GPU system without NVLink using the following command:

vllm serve google/gemma-4-26B-A4B-it \
    --tensor-parallel-size 2 \
    --quantization fp8 \
    --max-model-len 8192 \
    --trust-remote-code

Error Logs / Traceback

Multimodal Budget Error:

ValueError: Chunked MM input disabled but max_tokens_per_mm_item (2496) is larger than max_num_batched_tokens (2048). Please increase max_num_batched_tokens.

Initialization Hang:

(EngineCore pid=xxxx) INFO [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds.
(APIServer pid=xxxx) RuntimeError: Engine core initialization failed.
!!!!!!! Segfault encountered !!!!!!!
File "./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c", line 0, in 0x000076536ea4251f
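Because the stages produce distinct messages, a saved log can be checked for each one. The sketch below greps for the literal strings quoted in the logs above. `failure_stages` and the stage labels it prints are hypothetical names chosen for illustration.

```bash
# Hypothetical helper: prints which of the reported stages appear in a saved
# vLLM log file. The patterns are the literal messages quoted in this issue.
failure_stages() {
    log_file="$1"
    grep -q "larger than max_num_batched_tokens" "$log_file" \
        && echo "config-error"
    grep -q "No available shared memory broadcast block found" "$log_file" \
        && echo "init-hang"
    grep -q "Segfault encountered" "$log_file" \
        && echo "segfault"
    # A log with no matches is not an error for this helper.
    return 0
}
```

A typical use would be to capture the server output with `2>&1 | tee vllm.log` and then run `failure_stages vllm.log`.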

Additional Context

The initialization hang persists even when VLLM_USE_V1=0 is forced.
A temporary workaround is to pass --disable-custom-all-reduce and set NCCL_P2P_DISABLE=1.
This suggests that the newer V1 shared-memory (SHM) broadcast mechanism may be unstable on PCIe-only topologies (PXB) when handling the routing requirements of the Gemma-4 MoE architecture on Ampere cards. This is a hypothesis, not a confirmed root cause.
Leaked semaphore and shared memory objects are consistently left in /dev/shm/ after the crash.
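The leftover objects in /dev/shm/ can be listed before the next launch attempt. The sketch below is built on an assumption the issue does not confirm: that the leaked objects carry Python's default multiprocessing names (`psm_*` for shared memory, `sem.mp-*` for semaphores). `list_shm_leftovers` is a hypothetical helper, and the patterns should be adjusted if the actual names differ.

```bash
# Hypothetical helper: lists objects in a shared-memory directory that match
# Python's default multiprocessing names and belong to the current user.
# Assumption: vLLM's leftovers use those default names (psm_* for shared
# memory, sem.mp-* for semaphores); adjust the patterns if yours differ.
list_shm_leftovers() {
    shm_dir="${1:-/dev/shm}"
    find "$shm_dir" -maxdepth 1 -type f -user "$(id -un)" \
        \( -name 'psm_*' -o -name 'sem.mp-*' \) -print
}

# Inspect first; delete only once no vLLM or other Python process is running:
#   list_shm_leftovers | xargs -r rm -f --
if [ -d /dev/shm ]; then
    list_shm_leftovers
fi
```

Listing is read-only. The commented removal line is left disabled on purpose, since deleting objects that a live process still uses would break that process.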
