vllm - ✅(Solved) Fix OffloadingConnector segfaults on decode node in P/D disaggregated mode with MultiConnector + NixlConnector [1 pull requests, 4 comments, 2 participants]

vllm • 2026-04-10 14:25:56

GitHub stats

vllm-project/vllm#39500•Fetched 2026-04-11 06:13:15

Author

S1ro1

Participants

robertgshaw2-redhat

S1ro1

Timeline (top)

commented ×4 • subscribed ×3 • mentioned ×2 • cross-referenced ×1

Error Message

(Worker_DP1_EP1) ERROR [multiproc_executor.py:949]
  File "vllm/v1/worker/worker_base.py", line 332, in execute_model
    return self.worker.execute_model(scheduler_output)
  File "vllm/v1/worker/gpu_worker.py", line 803, in execute_model
    output = self.model_runner.execute_model(
  File "vllm/v1/worker/gpu_model_runner.py", line 3858, in execute_model
    logits_indices, spec_decode_metadata = self._prepare_inputs(
  File "vllm/v1/worker/gpu_model_runner.py", line 1958, in _prepare_inputs
    self.num_accepted_tokens.gpu.fill_(1)
torch.AcceleratorError: CUDA error: invalid argument

[] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f997dc08010)

PR fix notes

PR #2256: feat: VLLM QoL - new A2A backends, KV cache offloading, DBO, cleanup for P/D

Repository: PrimeIntellect-ai/prime-rl
Author: S1ro1
State: closed | merged: True
Link: https://github.com/PrimeIntellect-ai/prime-rl/pull/2256

Description (problem / solution / changelog)

Summary

Fix CPUOffloadingSpec.num_blocks miscalculation (segfault root cause): page_size_bytes was multiplied by len(kv_cache_tensors), but for UniformTypeKVCacheSpecs the page size already aggregates all layers. This made num_blocks ~156× too small for GLM-5, so pinned CPU tensors were undersized and swap_blocks (cuMemcpyDtoHAsync_v2) segfaulted on out-of-bounds pinned memory. Applied as monkey-patch via the vLLM general plugin. (upstream: vllm-project/vllm#38395, vllm-project/vllm#39500)
OffloadingConnector on decode nodes: when kv_cache_offload is configured, both prefill and decode now get MultiConnector(NixlConnector + OffloadingConnector). Previously decode used NixlConnector only.
Router --intra-node-data-parallel-size: was hardcoded to 1 (RL template) or missing (multi-replica inference). Now set to {{ dp_per_node }} (gpus_per_node // tp) so the router correctly distributes across DP ranks.
Template alignment: both inference.sbatch.j2 and multi_node_rl.sbatch.j2 now use identical OffloadingConnector configs and router DP settings.
Misc: ulimit -l for pinned allocations, router bump 0.1.14 → 0.1.18, removed unused block_size pass-through.
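
The num_blocks arithmetic described in the first item can be illustrated with a minimal sketch. The function and parameter names are hypothetical and do not mirror vLLM's actual CPUOffloadingSpec code. The 1 MB page size and the count of 156 KV cache tensors are assumed values, chosen only so that the ratio matches the ~156× figure quoted above.

```python
def num_cpu_blocks(cpu_bytes: int, page_size_bytes: int, num_kv_tensors: int,
                   page_size_is_aggregated: bool) -> int:
    """Return how many KV blocks fit into a CPU offload budget (illustrative only)."""
    if page_size_is_aggregated:
        # The page size already covers every layer, so it is the per-block cost.
        bytes_per_block = page_size_bytes
    else:
        # The page size covers a single tensor, so scale by the tensor count.
        bytes_per_block = page_size_bytes * num_kv_tensors
    return cpu_bytes // bytes_per_block


CPU_BYTES = 128_000_000_000  # cpu_bytes_to_use from the issue's configuration
PAGE_SIZE = 1_000_000        # assumed aggregated page size, illustrative only
NUM_TENSORS = 156            # assumed tensor count, illustrative only

# Bug pattern: an aggregated page size is treated as per-tensor and multiplied again.
buggy = num_cpu_blocks(CPU_BYTES, PAGE_SIZE, NUM_TENSORS, page_size_is_aggregated=False)
# Fix pattern: the aggregated page size is used as-is.
fixed = num_cpu_blocks(CPU_BYTES, PAGE_SIZE, NUM_TENSORS, page_size_is_aggregated=True)

print(buggy, fixed, round(fixed / buggy))
```

With these assumed inputs the buggy path yields far fewer blocks than the fixed path, which is the kind of undersizing the PR summary describes for the pinned CPU tensors.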

[!NOTE] Medium Risk: Touches SLURM launch templates and vLLM monkey-patching for KV offloading. Misconfiguration could break multi-node inference or change memory behavior, though the patch is a targeted fix for a known upstream segfault.

Overview: Adds a vLLM monkey patch that fixes the CPUOffloadingSpec.num_blocks calculation, preventing pinned-memory out-of-bounds accesses and segfaults when using the OffloadingConnector.

Updates inference/RL SLURM script generation and templates to compute and pass dp_per_node (gpus_per_node // tp) into vllm-router --intra-node-data-parallel-size, increases router startup timeout, and aligns KV offload behavior so both prefill and decode nodes use MultiConnector(NixlConnector + OffloadingConnector) (with ulimit -l unlimited).
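
The dp_per_node computation can be sketched as follows. This is an assumption-laden illustration, not the code from the PR: the helper name router_dp_args is hypothetical, and the divisibility check is an added safeguard that the PR description does not mention. Only the formula gpus_per_node // tp and the flag name --intra-node-data-parallel-size come from the text above.

```python
def router_dp_args(gpus_per_node: int, tp: int) -> list[str]:
    """Build the router argument for intra-node data parallelism (hypothetical helper)."""
    if tp <= 0 or gpus_per_node % tp != 0:
        # Added safeguard (assumption): require an even split of GPUs across TP groups.
        raise ValueError(f"gpus_per_node={gpus_per_node} is not divisible by tp={tp}")
    dp_per_node = gpus_per_node // tp
    return ["--intra-node-data-parallel-size", str(dp_per_node)]


# 8 GPUs per node with tp=1 gives 8 DP ranks per node.
print(router_dp_args(8, 1))
# 8 GPUs per node with tp=2 gives 4 DP ranks per node.
print(router_dp_args(8, 2))
```

A value hardcoded to 1 would tell the router that each node hosts a single DP rank, regardless of how many ranks actually run there.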

Refreshes inference config options by pruning/renaming supported All2AllBackend values and simplifying KVCacheOffloadConfig to per-worker cpu_bytes (removing the block_size passthrough).

Changed files

src/prime_rl/configs/inference.py (modified, +8/-9)
src/prime_rl/entrypoints/inference.py (modified, +2/-3)
src/prime_rl/entrypoints/rl.py (modified, +2/-1)
src/prime_rl/inference/patches.py (modified, +44/-0)
src/prime_rl/templates/inference.sbatch.j2 (modified, +12/-4)
src/prime_rl/templates/multi_node_rl.sbatch.j2 (modified, +11/-4)
uv.lock (modified, +0/-124)

Disclaimer

I will keep iterating on this issue, as it's not very reproducible right now due to the sheer size of the setup and non-determinism. Opening it now to see if anyone has a clue off the top of their head.

Title

OffloadingConnector segfaults on decode node in P/D disaggregated mode with MultiConnector + NixlConnector

Bug

Environment

vLLM version: v0.19.0
Model: zai-org/GLM-5-FP8 (MoE, FP8, expert parallel)
Hardware: 8x H200 per node, 4 nodes (2 prefill, 2 decode)
Setup: P/D disaggregated with MultiConnector wrapping NixlConnector + OffloadingConnector

Configuration

{
  "kv_connector": "MultiConnector",
  "kv_role": "kv_both",
  "kv_connector_extra_config": {
    "connectors": [
      {"kv_connector": "NixlConnector", "kv_role": "kv_both", "kv_connector_extra_config": {"num_threads": 1}},
      {"kv_connector": "OffloadingConnector", "kv_role": "kv_both", "kv_connector_extra_config": {"cpu_bytes_to_use": 128000000000}}
    ]
  }
}
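
For reference, the same nested structure can be generated programmatically, which avoids hand-editing deeply nested JSON. This is a minimal sketch: the helper name build_kv_transfer_config is hypothetical, and passing the resulting string to vLLM (for example through a --kv-transfer-config style option) is an assumption about how the configuration is consumed in this setup.

```python
import json


def build_kv_transfer_config(cpu_bytes_to_use: int, nixl_threads: int = 1) -> dict:
    """Assemble the MultiConnector config shown above (hypothetical helper)."""

    def connector(name: str, extra: dict) -> dict:
        # Every connector in the issue uses the kv_both role.
        return {
            "kv_connector": name,
            "kv_role": "kv_both",
            "kv_connector_extra_config": extra,
        }

    return connector("MultiConnector", {
        "connectors": [
            connector("NixlConnector", {"num_threads": nixl_threads}),
            connector("OffloadingConnector", {"cpu_bytes_to_use": cpu_bytes_to_use}),
        ]
    })


config = build_kv_transfer_config(128_000_000_000)
config_json = json.dumps(config)
print(config_json)
```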

Additional flags:

data_parallel_size=16, data_parallel_size_local=8, data_parallel_hybrid_lb=True
enable_expert_parallel=True, all2all_backend=deepep_low_latency
compilation_config: {"cudagraph_mode": "FULL_DECODE_ONLY"}
enable_prefix_caching=True, enable_chunked_prefill=True

Crash

The decode node segfaults during model execution shortly after startup when processing P/D requests. The crash occurs in _prepare_inputs with a CUDA invalid argument error, followed by signal 11 (SIGSEGV).

Traceback

(Worker_DP1_EP1) ERROR [multiproc_executor.py:949]
  File "vllm/v1/worker/worker_base.py", line 332, in execute_model
    return self.worker.execute_model(scheduler_output)
  File "vllm/v1/worker/gpu_worker.py", line 803, in execute_model
    output = self.model_runner.execute_model(
  File "vllm/v1/worker/gpu_model_runner.py", line 3858, in execute_model
    logits_indices, spec_decode_metadata = self._prepare_inputs(
  File "vllm/v1/worker/gpu_model_runner.py", line 1958, in _prepare_inputs
    self.num_accepted_tokens.gpu.fill_(1)
torch.AcceleratorError: CUDA error: invalid argument

[] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f997dc08010)

The scheduler dump at crash time shows the OffloadingConnectorMetadata had active reqs_to_store with GPULoadStoreSpec and CPULoadStoreSpec entries, suggesting the crash is related to the offloading connector's GPU<>CPU block transfer during decode execution.
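
One way to narrow down a suspected out-of-bounds block transfer is to validate block indices against the allocated buffer size before any copy is issued. The sketch below is a debugging aid under stated assumptions: check_block_ids is a hypothetical helper, not part of vLLM, and the block counts are made-up values.

```python
def check_block_ids(block_ids: list[int], num_blocks: int, label: str) -> None:
    """Raise IndexError if any block id falls outside [0, num_blocks)."""
    bad = [b for b in block_ids if not 0 <= b < num_blocks]
    if bad:
        raise IndexError(f"{label}: block ids {bad} outside [0, {num_blocks})")


# Assumed sizes for illustration: a CPU buffer allocated with too few blocks.
cpu_num_blocks = 820
cpu_block_ids = [3, 17, 1000]  # 1000 does not exist in an 820-block buffer

try:
    check_block_ids(cpu_block_ids, cpu_num_blocks, "CPU store target")
except IndexError as err:
    # A Python-level error is raised here before any raw memory copy would run.
    print(err)
```

A guard like this turns a silent out-of-bounds copy into an explicit error that names the offending block ids.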

extent analysis

TL;DR

The segfault on the decode node appears tied to the OffloadingConnector and its interaction with MultiConnector and NixlConnector, specifically the GPU<>CPU block transfer. The linked PR #2256 attributes it to undersized pinned CPU tensors caused by a CPUOffloadingSpec.num_blocks miscalculation.

Guidance

Review the kv_connector_extra_config to ensure that the cpu_bytes_to_use parameter for the OffloadingConnector is properly set and does not exceed the available CPU memory.
Verify that the num_threads parameter for the NixlConnector is correctly configured and does not cause any thread-related issues.
Check the CUDA version and ensure it is compatible with the vLLM version (v0.19.0) and the model (zai-org/GLM-5-FP8) being used.
Investigate the compilation_config and the cudagraph_mode setting (FULL_DECODE_ONLY) to see if it has any impact on the GPU<>CPU block transfer process.
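
The first check above can be sketched as a small pre-flight script. It is illustrative only: offload_budget_report is a hypothetical helper, treating cpu_bytes_to_use as a per-worker budget is an assumption, and comparing it against the locked-memory limit reflects the PR's note about raising ulimit -l for pinned allocations rather than documented vLLM behavior.

```python
import os
from typing import Optional


def offload_budget_report(cpu_bytes_to_use: int, workers_per_node: int,
                          total_host_bytes: int,
                          memlock_limit: Optional[int]) -> dict:
    """Compare the requested offload budget with host limits (hypothetical helper).

    memlock_limit is the per-process locked-memory limit in bytes, or None if unlimited.
    """
    requested = cpu_bytes_to_use * workers_per_node  # assumes a per-worker budget
    return {
        "requested_bytes": requested,
        "fits_in_host_memory": requested <= total_host_bytes,
        "within_memlock_limit": memlock_limit is None or cpu_bytes_to_use <= memlock_limit,
    }


if __name__ == "__main__":
    import resource  # Unix only

    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    limit = None if soft == resource.RLIM_INFINITY else soft
    total = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    print(offload_budget_report(128_000_000_000, 8, total, limit))
```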

Example

No specific code snippet can be provided without further information, but reviewing the configuration files and the model execution code may help identify the root cause of the issue.

Notes

The issue seems to be related to the specific configuration and model being used, so any solution may need to be tailored to this particular setup.
Further debugging and logging may be necessary to determine the exact cause of the segfault.

Recommendation

Apply workaround: review and adjust the connector configurations, and check CUDA version compatibility. The problem appears to lie in the interaction between the connectors and the GPU<>CPU block transfer process.
