vllm - 💡(How to fix) Fix [Bug] vLLM >= 0.18.0 NCCL segfault (cuMemCreate) with TP>1 on RTX 4090 (SM 89) [1 participants]

GitHub stats
vllm-project/vllm#38967 (fetched 2026-04-08 02:44:43)
Comments: 0 · Participants: 1 · Timeline: 0 · Reactions: 0

Error Message

!!!!!!! Segfault encountered !!!!!!!
  File "<unknown>", line 0, in cuMemCreate
  File "misc/cudawrap.cc", line 92, in ncclCuMemHostEnable()
  File "misc/cudawrap.cc", line 202, in ncclCuMemHostEnable()
  File "misc/cudawrap.cc", line 280, in initOnceFunc
  File "misc/cudawrap.cc", line 298, in ncclCudaLibraryInit()

Fix Action

Workaround

Roll back to v0.10.0 (V0 engine) for TP>1 on RTX 4090. TP=1 works fine on v0.18.0.


Your current environment

  • vLLM version: 0.18.0
  • GPU: 2× RTX 4090 48GB (SM 89)
  • NCCL: 2.27.5 (bundled in vllm/vllm-openai:v0.18.0)
  • CUDA: 12.x
  • OS: openEuler 22.03 SP4

Problem

Starting from v0.18.0, any TP>1 deployment on RTX 4090 results in an NCCL segfault in cuMemCreate. The same setup worked fine on v0.10.0 (V0 engine).

Reproduction

docker run --gpus all --shm-size 32g --network host \
    vllm/vllm-openai:v0.18.0 \
    --model Qwen/Qwen3.5-35B-A3B \
    --tensor-parallel-size 2 \
    --trust-remote-code --max-model-len 4096

Error

!!!!!!! Segfault encountered !!!!!!!
  File "<unknown>", line 0, in cuMemCreate
  File "misc/cudawrap.cc", line 92, in ncclCuMemHostEnable()
  File "misc/cudawrap.cc", line 202, in ncclCuMemHostEnable()
  File "misc/cudawrap.cc", line 280, in initOnceFunc
  File "misc/cudawrap.cc", line 298, in ncclCudaLibraryInit()
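
Before trying workarounds, it can help to confirm that a crash matches this exact signature. A minimal sketch, assuming the server output has been saved to a log file; the function name `is_nccl_cumem_segfault` and the path `vllm.log` are hypothetical:

```shell
# Return 0 if the log contains the segfault signature from this report:
# the crash banner plus frames in cuMemCreate and ncclCuMemHostEnable.
is_nccl_cumem_segfault() {
    log_file="$1"
    grep -q 'Segfault encountered' "$log_file" &&
    grep -q 'in cuMemCreate' "$log_file" &&
    grep -q 'ncclCuMemHostEnable' "$log_file"
}

# Example usage (vllm.log is a hypothetical path):
#   docker logs <container> > vllm.log 2>&1
#   is_nccl_cumem_segfault vllm.log && echo "matches the signature in this issue"
```

Requiring all three patterns keeps unrelated segfaults from being counted as the same bug.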

Workarounds tried (all failed)

  • NCCL_CUMEM=0 + NCCL_P2P_DISABLE=1 → same segfault
  • VLLM_USE_V1=0 → "Unknown vLLM environment variable" (the V0 engine is not available in 0.18.0)
  • --distributed-executor-backend mp → same segfault
  • --shm-size 32g + --network host → same segfault
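
The report does not show how these variables were set. With Docker, variables exported on the host do not reach the container unless they are passed with `-e`, so one thing worth ruling out is that they never took effect. The sketch below reruns the reproduction with NCCL's `NCCL_DEBUG=INFO` logging and one explicitly passed variable. `NCCL_CUMEM_HOST_ENABLE=0` is an untested guess suggested by the `ncclCuMemHostEnable()` frames in the trace, not a confirmed fix. The function name `run_with_nccl_env` and the `DRY_RUN` switch are hypothetical.

```shell
# Sketch only: rerun the reproduction with NCCL logging enabled and the
# variables passed into the container with -e (assumed; not shown in the report).
# NCCL_CUMEM_HOST_ENABLE=0 is an untested guess based on the stack trace.
# DRY_RUN defaults to 1, which prints the command instead of launching it.
run_with_nccl_env() {
    set -- docker run --gpus all --shm-size 32g --network host \
        -e NCCL_DEBUG=INFO \
        -e NCCL_CUMEM_HOST_ENABLE=0 \
        vllm/vllm-openai:v0.18.0 \
        --model Qwen/Qwen3.5-35B-A3B \
        --tensor-parallel-size 2 \
        --trust-remote-code --max-model-len 4096
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

run_with_nccl_env
```

Run with `DRY_RUN=0` to launch the container. The NCCL log lines printed before the crash show which settings NCCL actually picked up.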

Workaround

Roll back to v0.10.0 (V0 engine) for TP>1 on RTX 4090. TP=1 works fine on v0.18.0.

Note

v0.10.0 with V0 engine + TP=4 (2 nodes × 2 GPUs) works perfectly on the same hardware — Qwen3-235B-A22B-AWQ running at 497 tok/s peak with RDMA + CUDA Graphs. The issue is specific to V1 engine on consumer GPUs (SM 89).

extent analysis

TL;DR

Rolling back to v0.10.0 (V0 engine) is the only workaround reported to avoid the NCCL segfault on RTX 4090 with TP>1.

Guidance

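  • The issue seems to be specific to the V1 engine on consumer GPUs (SM 89), so using the V0 engine may avoid the problem. On v0.18.0 this is not an option, because VLLM_USE_V1=0 is rejected as an unknown environment variable.
  • Confirm that the failure is tied to multi-GPU operation by testing with TP=1, which is reported to work fine on v0.18.0. Because TP=1 also runs on the V1 engine, a successful run points at the multi-GPU NCCL initialization path rather than at the V1 engine alone.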
  • If rolling back to v0.10.0 is not feasible, consider testing with a different GPU model or a different version of NCCL to see if the issue is specific to the current configuration.
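  • Be cautious when using workarounds that involve environment variables: the report lists NCCL_CUMEM=0 and NCCL_P2P_DISABLE=1 as having no effect, and such settings may have unintended consequences. NCCL's documentation lists NCCL_CUMEM_ENABLE and NCCL_CUMEM_HOST_ENABLE rather than NCCL_CUMEM, so the setting as written in the report may not have been recognized.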
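
The TP=1 comparison suggested above is only meaningful if a single flag changes between runs. A minimal sketch that reuses the reproduction command from the report and varies only `--tensor-parallel-size`; the function name `repro_with_tp` and the `DRY_RUN` switch are hypothetical:

```shell
# Sketch: print or run the reproduction with a given tensor-parallel size.
# Only --tensor-parallel-size changes between invocations.
# DRY_RUN defaults to 1, which prints the command instead of launching a server.
repro_with_tp() {
    tp="$1"
    set -- docker run --gpus all --shm-size 32g --network host \
        vllm/vllm-openai:v0.18.0 \
        --model Qwen/Qwen3.5-35B-A3B \
        --tensor-parallel-size "$tp" \
        --trust-remote-code --max-model-len 4096
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

repro_with_tp 1   # reported to work on v0.18.0
repro_with_tp 2   # reported to segfault during NCCL initialization
```

With `DRY_RUN=0`, each call starts a server and blocks until it exits or crashes, so the two runs are best started one at a time.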

Example

The report includes no code fix, because the failure depends on the software version and hardware configuration rather than on user code.

Notes

The issue appears to be specific to the combination of v0.18.0, RTX 4090, and TP>1, and rolling back to v0.10.0 is the only reported workaround that works.

Recommendation

Apply the workaround of rolling back to v0.10.0 (V0 engine), as it is the only reported workaround that avoids the NCCL segfault on RTX 4090 with TP>1. According to the report, v0.10.0 with the V0 engine runs TP=4 (2 nodes × 2 GPUs) on the same hardware.
