vllm - ✅(Solved) Fix OffloadingConnector segfaults on decode node in P/D disaggregated mode with MultiConnector + NixlConnector [1 pull request, 4 comments, 2 participants]

GitHub stats
vllm-project/vllm#39500, fetched 2026-04-11 06:13:15
Comments: 4
Participants: 2
Timeline: 10
Reactions: 1
Timeline (top): commented ×4, subscribed ×3, mentioned ×2, cross-referenced ×1

Error Message

(Worker_DP1_EP1) ERROR [multiproc_executor.py:949]
  File "vllm/v1/worker/worker_base.py", line 332, in execute_model
    return self.worker.execute_model(scheduler_output)
  File "vllm/v1/worker/gpu_worker.py", line 803, in execute_model
    output = self.model_runner.execute_model(
  File "vllm/v1/worker/gpu_model_runner.py", line 3858, in execute_model
    logits_indices, spec_decode_metadata = self._prepare_inputs(
  File "vllm/v1/worker/gpu_model_runner.py", line 1958, in _prepare_inputs
    self.num_accepted_tokens.gpu.fill_(1)
torch.AcceleratorError: CUDA error: invalid argument

[] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f997dc08010)

PR fix notes

PR #2256: feat: VLLM QoL - new A2A backends, KV cache offloading, DBO, cleanup for P/D

Description (problem / solution / changelog)

Summary

  • Fix CPUOffloadingSpec.num_blocks miscalculation (segfault root cause): page_size_bytes was multiplied by len(kv_cache_tensors), but for UniformTypeKVCacheSpecs the page size already aggregates all layers. This made num_blocks ~156× too small for GLM-5, so pinned CPU tensors were undersized and swap_blocks (cuMemcpyDtoHAsync_v2) segfaulted on out-of-bounds pinned memory. Applied as monkey-patch via the vLLM general plugin. (upstream: vllm-project/vllm#38395, vllm-project/vllm#39500)

  • OffloadingConnector on decode nodes: when kv_cache_offload is configured, both prefill and decode now get MultiConnector(NixlConnector + OffloadingConnector). Previously decode used NixlConnector only.

  • Router --intra-node-data-parallel-size: was hardcoded to 1 (RL template) or missing (multi-replica inference). Now set to {{ dp_per_node }} (gpus_per_node // tp) so the router correctly distributes across DP ranks.

  • Template alignment: both inference.sbatch.j2 and multi_node_rl.sbatch.j2 now use identical OffloadingConnector configs and router DP settings.

  • Misc: ulimit -l for pinned allocations, router bump 0.1.14 → 0.1.18, removed unused block_size pass-through.
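
The arithmetic behind the first bullet can be sketched as follows. This is a minimal illustration, not the vLLM implementation: the function names, the 4 MB aggregated page size, and the tensor count of 156 are hypothetical values chosen so that the ratio lands near the ~156× reported above. Only the 128 GB budget comes from the issue's configuration.

```python
def num_blocks_buggy(cpu_bytes: int, page_size_bytes: int, num_kv_tensors: int) -> int:
    # Bug pattern described in the PR notes: the page size already covers
    # all layers, yet it is multiplied by the number of KV cache tensors again.
    return cpu_bytes // (page_size_bytes * num_kv_tensors)


def num_blocks_fixed(cpu_bytes: int, page_size_bytes: int) -> int:
    # Fixed pattern: the aggregated page size is used as-is.
    return cpu_bytes // page_size_bytes


CPU_BYTES = 128_000_000_000   # cpu_bytes_to_use from the issue's configuration
PAGE_SIZE_BYTES = 4_000_000   # hypothetical aggregated page size
NUM_KV_TENSORS = 156          # hypothetical stand-in for len(kv_cache_tensors)

buggy = num_blocks_buggy(CPU_BYTES, PAGE_SIZE_BYTES, NUM_KV_TENSORS)
fixed = num_blocks_fixed(CPU_BYTES, PAGE_SIZE_BYTES)
print(f"buggy={buggy} fixed={fixed} ratio={fixed / buggy:.1f}")
```

With these assumed inputs the buggy formula yields a block count two orders of magnitude below the fixed one, which is consistent with the undersized pinned CPU tensors described in the PR notes.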

🤖 Generated with Claude Code

<!-- CURSOR_SUMMARY -->

[!NOTE] Medium Risk: Touches SLURM launch templates and vLLM monkey-patching for KV offloading; misconfiguration could break multi-node inference or change memory behavior, though the patch is a targeted fix for a known upstream segfault.

Overview: Adds a vLLM monkey patch that fixes the CPUOffloadingSpec.num_blocks calculation, preventing pinned-memory out-of-bounds accesses and segfaults when using the OffloadingConnector.

Updates inference/RL SLURM script generation and templates to compute and pass dp_per_node (gpus_per_node // tp) into vllm-router --intra-node-data-parallel-size, increases router startup timeout, and aligns KV offload behavior so both prefill and decode nodes use MultiConnector(NixlConnector + OffloadingConnector) (with ulimit -l unlimited).
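
The dp_per_node computation described above can be sketched in Python. The helper names and the divisibility check are assumptions added for illustration; only the formula gpus_per_node // tp and the --intra-node-data-parallel-size flag come from the PR notes. The example uses tp=1, which is an assumption that matches the issue's data_parallel_size_local=8 on 8 GPUs per node.

```python
def dp_per_node(gpus_per_node: int, tp: int) -> int:
    # Number of data-parallel ranks hosted on one node.
    # The divisibility guard is an illustrative assumption, not from the PR.
    if tp <= 0 or gpus_per_node % tp != 0:
        raise ValueError("gpus_per_node must be a positive multiple of tp")
    return gpus_per_node // tp


def router_args(gpus_per_node: int, tp: int) -> list[str]:
    # Builds only the router flag named in the PR notes.
    return ["--intra-node-data-parallel-size", str(dp_per_node(gpus_per_node, tp))]


print(router_args(gpus_per_node=8, tp=1))
```

Deriving the value instead of hardcoding 1 keeps the router's view of DP ranks in step with the launch template when the tensor-parallel size changes.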

Refreshes inference config options by pruning/renaming supported All2AllBackend values and simplifying KVCacheOffloadConfig to per-worker cpu_bytes (removing the block_size passthrough).


<!-- /CURSOR_SUMMARY -->

Changed files

  • src/prime_rl/configs/inference.py (modified, +8/-9)
  • src/prime_rl/entrypoints/inference.py (modified, +2/-3)
  • src/prime_rl/entrypoints/rl.py (modified, +2/-1)
  • src/prime_rl/inference/patches.py (modified, +44/-0)
  • src/prime_rl/templates/inference.sbatch.j2 (modified, +12/-4)
  • src/prime_rl/templates/multi_node_rl.sbatch.j2 (modified, +11/-4)
  • uv.lock (modified, +0/-124)


Disclaimer

  • I will keep iterating on this issue. It is hard to reproduce right now because of the sheer size of the setup and its non-determinism, so I am opening it early to see if anyone has a clue off the top of their head.

Title

OffloadingConnector segfaults on decode node in P/D disaggregated mode with MultiConnector + NixlConnector

Bug

Environment

  • vLLM version: v0.19.0
  • Model: zai-org/GLM-5-FP8 (MoE, FP8, expert parallel)
  • Hardware: 8x H200 per node, 4 nodes (2 prefill, 2 decode)
  • Setup: P/D disaggregated with MultiConnector wrapping NixlConnector + OffloadingConnector

Configuration

{
  "kv_connector": "MultiConnector",
  "kv_role": "kv_both",
  "kv_connector_extra_config": {
    "connectors": [
      {"kv_connector": "NixlConnector", "kv_role": "kv_both", "kv_connector_extra_config": {"num_threads": 1}},
      {"kv_connector": "OffloadingConnector", "kv_role": "kv_both", "kv_connector_extra_config": {"cpu_bytes_to_use": 128000000000}}
    ]
  }
}
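
For scripted launches, the same configuration can be generated rather than hand-written. The sketch below is a hypothetical helper, not part of vLLM: it rebuilds the dictionary shown above and serializes it to a JSON string, assuming that string is then handed to vLLM's --kv-transfer-config option.

```python
import json


def build_kv_transfer_config(cpu_bytes_to_use: int, nixl_threads: int = 1) -> dict:
    # Mirrors the issue's configuration: a MultiConnector wrapping
    # NixlConnector (P/D transfer) and OffloadingConnector (CPU offload).
    return {
        "kv_connector": "MultiConnector",
        "kv_role": "kv_both",
        "kv_connector_extra_config": {
            "connectors": [
                {
                    "kv_connector": "NixlConnector",
                    "kv_role": "kv_both",
                    "kv_connector_extra_config": {"num_threads": nixl_threads},
                },
                {
                    "kv_connector": "OffloadingConnector",
                    "kv_role": "kv_both",
                    "kv_connector_extra_config": {"cpu_bytes_to_use": cpu_bytes_to_use},
                },
            ]
        },
    }


config = build_kv_transfer_config(cpu_bytes_to_use=128_000_000_000)
kv_transfer_json = json.dumps(config)
print(kv_transfer_json)
```

Generating the JSON from one function makes it easier to keep prefill and decode nodes on identical connector settings.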

Additional flags:

  • data_parallel_size=16, data_parallel_size_local=8, data_parallel_hybrid_lb=True
  • enable_expert_parallel=True, all2all_backend=deepep_low_latency
  • compilation_config: {"cudagraph_mode": "FULL_DECODE_ONLY"}
  • enable_prefix_caching=True, enable_chunked_prefill=True

Crash

The decode node segfaults during model execution shortly after startup when processing P/D requests. The crash occurs in _prepare_inputs with a CUDA invalid argument error, followed by signal 11 (SIGSEGV).

Traceback

(Worker_DP1_EP1) ERROR [multiproc_executor.py:949]
  File "vllm/v1/worker/worker_base.py", line 332, in execute_model
    return self.worker.execute_model(scheduler_output)
  File "vllm/v1/worker/gpu_worker.py", line 803, in execute_model
    output = self.model_runner.execute_model(
  File "vllm/v1/worker/gpu_model_runner.py", line 3858, in execute_model
    logits_indices, spec_decode_metadata = self._prepare_inputs(
  File "vllm/v1/worker/gpu_model_runner.py", line 1958, in _prepare_inputs
    self.num_accepted_tokens.gpu.fill_(1)
torch.AcceleratorError: CUDA error: invalid argument

[] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f997dc08010)

The scheduler dump at crash time shows the OffloadingConnectorMetadata had active reqs_to_store with GPULoadStoreSpec and CPULoadStoreSpec entries, suggesting the crash is related to the offloading connector's GPU<>CPU block transfer during decode execution.

extent analysis

TL;DR

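  • Per the linked PR notes, the root cause was a miscalculation of CPUOffloadingSpec.num_blocks: page_size_bytes was multiplied by len(kv_cache_tensors) even though, for UniformTypeKVCacheSpecs, the page size already aggregates all layers. The pinned CPU tensors were therefore undersized, and swap_blocks segfaulted on out-of-bounds pinned memory during the GPU<>CPU block transfer on the decode node.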

Guidance

  • Review the kv_connector_extra_config to ensure that the cpu_bytes_to_use parameter for the OffloadingConnector is properly set and does not exceed the available CPU memory.
  • Verify that the num_threads parameter for the NixlConnector is correctly configured and does not cause any thread-related issues.
  • Check the CUDA version and ensure it is compatible with the vLLM version (v0.19.0) and the model (zai-org/GLM-5-FP8) being used.
  • Investigate the compilation_config and the cudagraph_mode setting (FULL_DECODE_ONLY) to see if it has any impact on the GPU<>CPU block transfer process.
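
The first guidance item can be turned into a quick sanity check. This is a rough sketch under stated assumptions: it treats cpu_bytes_to_use as a per-worker budget, assumes one worker per GPU, and uses a hypothetical 80% headroom and hypothetical host RAM sizes. It only checks the memory budget and does not address the num_blocks miscalculation described in the PR notes.

```python
def offload_budget_fits(
    cpu_bytes_per_worker: int,
    workers_per_node: int,
    host_ram_bytes: int,
    headroom_percent: int = 80,
) -> bool:
    # Total offload budget requested across all workers on one node.
    total = cpu_bytes_per_worker * workers_per_node
    # Leave part of host RAM for everything else (hypothetical 80% cap).
    usable = host_ram_bytes * headroom_percent // 100
    return total <= usable


TIB = 2**40
requested_total = 128_000_000_000 * 8  # issue config, assuming 8 workers per node
print(requested_total)
print(offload_budget_fits(128_000_000_000, 8, host_ram_bytes=2 * TIB))
print(offload_budget_fits(128_000_000_000, 8, host_ram_bytes=1 * TIB))
```

Under these assumptions a node with 2 TiB of RAM passes the check and a node with 1 TiB does not, so the budget should be compared against the real host memory before launch.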

Example

  • No specific code snippet can be provided without further information, but reviewing the configuration files and the model execution code may help identify the root cause of the issue.

Notes

  • The issue seems to be related to the specific configuration and model being used, so any solution may need to be tailored to this particular setup.
  • Further debugging and logging may be necessary to determine the exact cause of the segfault.

Recommendation

  • Apply workaround: Correct the CPUOffloadingSpec.num_blocks calculation as described in the PR notes, where it was applied as a monkey patch via the vLLM general plugin. If the crash persists, review the connector configuration and CUDA compatibility, since the failure occurs during the GPU<>CPU block transfer.
