vllm - 💡(How to fix) Fix [Bug] vLLM >= 0.18.0 NCCL segfault (cuMemCreate) with TP>1 on RTX 4090 (SM 89) [1 participant]

vllm · 2026-04-04 05:16:05

GitHub stats

vllm-project/vllm#38967 • Fetched 2026-04-08 02:44:43

Author

zhouliang5266

Participants

zhouliang5266

Error Message

!!!!!!! Segfault encountered !!!!!!!
  File "<unknown>", line 0, in cuMemCreate
  File "misc/cudawrap.cc", line 92, in ncclCuMemHostEnable()
  File "misc/cudawrap.cc", line 202, in ncclCuMemHostEnable()
  File "misc/cudawrap.cc", line 280, in initOnceFunc
  File "misc/cudawrap.cc", line 298, in ncclCudaLibraryInit()

Fix Action

Workaround

Roll back to v0.10.0 (V0 engine) for TP>1 on RTX 4090. TP=1 works fine on v0.18.0.

Code Example

docker run --gpus all --shm-size 32g --network host \
    vllm/vllm-openai:v0.18.0 \
    --model Qwen/Qwen3.5-35B-A3B \
    --tensor-parallel-size 2 \
    --trust-remote-code --max-model-len 4096

Your current environment

vLLM version: 0.18.0
GPU: 2× RTX 4090 48GB (SM 89)
NCCL: 2.27.5 (bundled in vllm/vllm-openai:v0.18.0)
CUDA: 12.x
OS: openEuler 22.03 SP4
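
The environment above gives "CUDA: 12.x" and no driver version, yet the crash is inside the CUDA driver call cuMemCreate. The following is a minimal sketch for collecting the exact versions, assuming `nvidia-smi` on the host is recent enough to support the `compute_cap` query field and that `python3` is on the PATH inside the image. The `run` helper and the `DRY_RUN` variable are hypothetical names used only for illustration; by default the commands are printed rather than executed.

```shell
#!/bin/sh
# Sketch only. `run` and DRY_RUN are illustrative helpers, not vLLM tooling.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"      # preview the command without executing it
    else
        "$@"
    fi
}

# Host side: GPU model, compute capability, and driver version.
run nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv

# Container side: CUDA and NCCL versions as seen by PyTorch in the image.
# Assumption: python3 exists in the image and can replace the entrypoint.
run docker run --rm --gpus all --entrypoint python3 \
    vllm/vllm-openai:v0.18.0 \
    -c 'import torch; print(torch.version.cuda, torch.cuda.nccl.version())'
```

Set `DRY_RUN=0` to actually run the two commands; the output can then be pasted into the environment section of the report.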

Problem

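Starting from v0.18.0, any TP>1 deployment on RTX 4090 segfaults inside cuMemCreate while NCCL initializes its CUDA library wrapper (see the ncclCudaLibraryInit frames in the trace under Error). The same hardware worked fine with v0.10.0 (V0 engine).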

Reproduction

docker run --gpus all --shm-size 32g --network host \
    vllm/vllm-openai:v0.18.0 \
    --model Qwen/Qwen3.5-35B-A3B \
    --tensor-parallel-size 2 \
    --trust-remote-code --max-model-len 4096

Error

!!!!!!! Segfault encountered !!!!!!!
  File "<unknown>", line 0, in cuMemCreate
  File "misc/cudawrap.cc", line 92, in ncclCuMemHostEnable()
  File "misc/cudawrap.cc", line 202, in ncclCuMemHostEnable()
  File "misc/cudawrap.cc", line 280, in initOnceFunc
  File "misc/cudawrap.cc", line 298, in ncclCudaLibraryInit()

Workarounds tried (all failed)

NCCL_CUMEM=0 + NCCL_P2P_DISABLE=1 → same segfault
VLLM_USE_V1=0 → "Unknown vLLM environment variable" (V0 engine removed in 0.18.0)
--distributed-executor-backend mp → same segfault
--shm-size 32g + --network host → same segfault
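
The report does not show how these variables were set. With the Docker image they have to be passed into the container with `-e`, and NCCL's documented switches for its cuMem code paths are `NCCL_CUMEM_ENABLE` and `NCCL_CUMEM_HOST_ENABLE`; `NCCL_CUMEM` does not appear to be a documented name. The sketch below is an untested assumption, not a confirmed fix: it reuses the reproduction command, adds those two variables, and sets `NCCL_DEBUG=INFO` so that the log shows NCCL's version and initialization steps before the crash. The `run` helper and `DRY_RUN` are hypothetical names for illustration.

```shell
#!/bin/sh
# Sketch only: whether these variables avoid the segfault is an assumption.
# `run` and DRY_RUN are illustrative helpers, not part of vLLM or NCCL.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"      # preview the command without starting a container
    else
        "$@"
    fi
}

# Same command as the reproduction, plus NCCL variables passed via -e.
run docker run --gpus all --shm-size 32g --network host \
    -e NCCL_DEBUG=INFO \
    -e NCCL_CUMEM_ENABLE=0 \
    -e NCCL_CUMEM_HOST_ENABLE=0 \
    vllm/vllm-openai:v0.18.0 \
    --model Qwen/Qwen3.5-35B-A3B \
    --tensor-parallel-size 2 \
    --trust-remote-code --max-model-len 4096
```

Set `DRY_RUN=0` to launch the container. If the segfault persists, the `NCCL_DEBUG=INFO` output is still useful to attach to the issue.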

Workaround

Roll back to v0.10.0 (V0 engine) for TP>1 on RTX 4090. TP=1 works fine on v0.18.0.
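
A minimal sketch of the rollback, assuming the image tag `vllm/vllm-openai:v0.10.0` and that the chosen model is supported by that release. The reporter's working v0.10.0 setup ran Qwen3-235B-A22B-AWQ across two nodes, not the model from the reproduction, so the model ID is left as a placeholder. Passing `VLLM_USE_V1=0` to select the V0 engine is also an assumption about how the reporter ran it. `VLLM_TAG`, `MODEL`, `run`, and `DRY_RUN` are hypothetical names for illustration.

```shell
#!/bin/sh
# Sketch only. `run`, DRY_RUN, VLLM_TAG and MODEL are illustrative names.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"      # preview the command without starting a container
    else
        "$@"
    fi
}

VLLM_TAG="${VLLM_TAG:-v0.10.0}"            # release reported to work with TP>1
MODEL="${MODEL:-your-org/your-model}"      # placeholder: must be supported by v0.10.0

# VLLM_USE_V1=0 is assumed to select the V0 engine on this release.
run docker run --gpus all --shm-size 32g --network host \
    -e VLLM_USE_V1=0 \
    "vllm/vllm-openai:${VLLM_TAG}" \
    --model "$MODEL" \
    --tensor-parallel-size 2 \
    --max-model-len 4096
```

Pinning the tag in a variable makes it easy to switch back to a newer release once a fix is available.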

Note

v0.10.0 with the V0 engine and TP=4 (2 nodes × 2 GPUs) works on the same hardware: Qwen3-235B-A22B-AWQ peaks at 497 tok/s with RDMA and CUDA Graphs. This suggests the issue is specific to the V1 engine on consumer GPUs (SM 89), although RTX 4090 is the only GPU tested in this report.

extent analysis

TL;DR

Rolling back to v0.10.0 (V0 engine) is the only reported workaround for the NCCL segfault on RTX 4090 with TP>1.

Guidance

The reporter attributes the issue to the V1 engine on consumer GPUs (SM 89). Because v0.18.0 no longer has the V0 engine, using V0 means running an older release such as v0.10.0.
TP=1 is reported to work on v0.18.0, which points at the multi-GPU NCCL initialization path rather than at model loading. Confirm this on your own setup by running with --tensor-parallel-size 1 and a model that fits on one GPU.
If rolling back to v0.10.0 is not feasible, test a different GPU model or a different NCCL version to narrow down which component triggers the crash.
The environment-variable workarounds in the report (NCCL_CUMEM=0, NCCL_P2P_DISABLE=1) did not prevent the segfault, and disabling P2P can reduce multi-GPU performance, so remove them once they are shown not to help.

Example

The reporter's only code is the reproduction command; the problem lies in the combination of software versions and hardware rather than in user code.

Notes

The failure is reported only for the combination of v0.18.0, RTX 4090, and TP>1. The report has a single participant and identifies no root cause.

Recommendation

Roll back to v0.10.0 (V0 engine) for TP>1 on RTX 4090. It is the only configuration the reporter confirmed working on this hardware (TP=4 across 2 nodes × 2 GPUs). Stay on v0.18.0 only for TP=1 workloads.

#model loading #dependency error #configuration error #environment variable #network issue
