vLLM [Bug]: custom_all_reduce IPC handle fails with expandable_segments when DP>1 AND TP>1

Summary

With PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, starting vLLM with both DP>1 and TP>1 crashes during worker init at custom_all_reduce.cuh:455 (cudaIpcGetMemHandle returns 'invalid argument'). Either dimension alone works: DP=2 with TP=1 and DP=1 with TP=2 both start.

Environment

  • vllm 0.20.2
  • 4× H200 (1 node)

Repro

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
vllm serve Qwen/Qwen3-0.6B --data-parallel-size 2 --tensor-parallel-size 2

Output:

Failed: Cuda error custom_all_reduce.cuh:455 'invalid argument'
EngineCore failed to start: Worker proc VllmWorker-* died unexpectedly

Matrix (Qwen3-0.6B, expandable_segments enabled)

DP  TP  result
2   1   ✓
1   2   ✓
2   2   ✗ (crash as above)

Independent of --enable-sleep-mode (reproduced with and without it).
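
The matrix can be encoded as a small preflight check, which may be handy in a launcher script until a fix lands. This is only a sketch: the function names are hypothetical and not part of vLLM, it assumes PYTORCH_CUDA_ALLOC_CONF is a comma-separated list of option:value pairs, and it generalizes the three tested rows to "DP>1 and TP>1" as the issue title does.

```python
def expandable_segments_enabled(alloc_conf: str) -> bool:
    """Return True if the allocator config string turns expandable_segments on."""
    enabled = False
    for item in alloc_conf.split(","):
        key, _, value = item.partition(":")
        if key.strip() == "expandable_segments":
            # Assumption: if the option appears more than once, the last one wins.
            enabled = value.strip() == "True"
    return enabled


def hits_reported_crash(
    dp: int,
    tp: int,
    alloc_conf: str,
    custom_all_reduce_disabled: bool = False,
) -> bool:
    """Mirror the matrix in this issue: only DP>1 together with TP>1 crashed."""
    return (
        dp > 1
        and tp > 1
        and expandable_segments_enabled(alloc_conf)
        and not custom_all_reduce_disabled
    )
```

A launcher could call hits_reported_crash(2, 2, os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")) and print a warning before starting the server.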

Root cause (best guess)

expandable_segments reserves virtual address ranges via cuMemAddressReserve and maps physical memory into them lazily with cuMemMap. The base pointer returned by cuPointerGetAttribute(..., rangeStartAddrAttr, ...) in get_graph_buffer_ipc_meta is therefore the head of such a VA range, which is not a valid source for cudaIpcGetMemHandle: legacy CUDA IPC handles are documented for memory allocated with cudaMalloc, whereas memory created with cuMemCreate is shared through the VMM export path (cuMemExportToShareableHandle) instead. DP=1 or TP=1 does not trigger the failure, presumably because cross-process IPC is not needed on that path.

Workaround

Drop expandable_segments or pass --disable-custom-all-reduce.
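
For anyone launching vLLM from a Python wrapper, the two workarounds could be applied roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the allocator config is assumed to be a comma-separated list of option:value pairs, and only the --disable-custom-all-reduce flag and the PYTORCH_CUDA_ALLOC_CONF variable come from this issue.

```python
def without_expandable_segments(alloc_conf: str) -> str:
    """Drop every expandable_segments entry and keep the other options as-is."""
    kept = [
        item
        for item in alloc_conf.split(",")
        if item.strip() and item.partition(":")[0].strip() != "expandable_segments"
    ]
    return ",".join(kept)


def workaround_env(env: dict[str, str]) -> dict[str, str]:
    """Workaround 1: return a copy of env with expandable_segments removed."""
    new_env = dict(env)
    conf = without_expandable_segments(new_env.get("PYTORCH_CUDA_ALLOC_CONF", ""))
    if conf:
        new_env["PYTORCH_CUDA_ALLOC_CONF"] = conf
    else:
        # Nothing left to configure, so unset the variable entirely.
        new_env.pop("PYTORCH_CUDA_ALLOC_CONF", None)
    return new_env


def workaround_argv(argv: list[str]) -> list[str]:
    """Workaround 2: append --disable-custom-all-reduce if it is not present."""
    flag = "--disable-custom-all-reduce"
    return list(argv) if flag in argv else [*argv, flag]
```

Either change alone should be enough according to the report; the resulting argv and env can be handed to subprocess.run(argv, env=env).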

Suggested fix

PR #40812 already temporarily disables expandable_segments around the cumem sleep-mode pool. The custom_all_reduce IPC graph buffer registration needs the same treatment: either temporarily switch off expandable segments while the graph buffers are being allocated and registered, or detect a cumem-backed pointer and skip custom all-reduce IPC registration, falling back to the regular all-reduce path.
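
The first option could look roughly like the context manager below. It is a sketch, not the code from PR #40812: the function name is hypothetical, and the allocator setter is injected so the block runs without a GPU. In a real process the setter would presumably be PyTorch's private torch.cuda.memory._set_allocator_settings, which is an assumption here and, being a private API, may change. Whether this actually makes the graph buffers acceptable to cudaIpcGetMemHandle is unverified; it can only help if the buffers are allocated inside the guarded region.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def expandable_segments_disabled(
    set_allocator_settings: Callable[[str], None],
    currently_enabled: bool,
) -> Iterator[None]:
    """Temporarily switch expandable_segments off, restoring it on exit.

    set_allocator_settings is any callable that accepts an allocator
    config string (assumed: torch.cuda.memory._set_allocator_settings).
    """
    if not currently_enabled:
        # Nothing to toggle; run the body unchanged.
        yield
        return
    set_allocator_settings("expandable_segments:False")
    try:
        yield
    finally:
        # Restore even if allocation or registration raised.
        set_allocator_settings("expandable_segments:True")
```

The allocation and IPC registration of the graph buffers would then go inside a with expandable_segments_disabled(setter, enabled): block, so that only those buffers bypass expandable segments.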

@youkaichao could you take a look?
