vllm - ✅(Solved) Fix [Feature]: Support sparse in-place weight updates in weight transfer API [1 pull requests, 1 participants]

GitHub stats
vllm-project/vllm#39451, fetched 2026-04-10 03:40:35
Comments: 0
Participants: 1
Timeline: 1
Reactions: 1
Timeline (top): labeled ×1

Fix Action

Fix / Workaround

There's no way to say "here are the 0.3% of elements that changed — apply them in-place." This forces us to keep a full CPU bf16 snapshot on the vLLM side to reconstruct dense tensors from sparse patches before calling load_weights.

PR fix notes

PR #40096: [Frontend][Core] Add sparse NCCL weight transfer support for in-place updates

Description (problem / solution / changelog)

Purpose

Implements an MVP sparse NCCL weight transfer path for online RL weight sync. Instead of resending full dense tensors, the trainer can send (indices, values) patches that are applied in-place to existing runtime GPU parameters.

This addresses #39451.

Current scope:

  • NCCL backend only
  • sparse updates use kernel-format/runtime parameter names
  • not composable with packed=True
  • not composable with is_checkpoint_format=True
  • restricted to TP=1, PP=1
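
The scope restrictions above map naturally onto up-front request validation. The following is a minimal sketch only: `SparseUpdateRequest` and `validate_sparse_request` are hypothetical names invented for illustration, not the classes or functions added by the PR, and the int32 index requirement is inferred from the test name `sparse_update_rejects_non_int32_indices`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SparseUpdateRequest:
    """Hypothetical stand-in for sparse update metadata (not vLLM's real class)."""
    names: List[str]
    nnz: Optional[List[int]] = None   # number of changed elements per parameter
    index_dtype: str = "int32"
    packed: bool = False
    is_checkpoint_format: bool = False


def validate_sparse_request(req, backend="nccl", tp_size=1, pp_size=1):
    """Raise ValueError when a request falls outside the stated MVP scope."""
    if backend != "nccl":
        raise ValueError("sparse updates are limited to the NCCL backend")
    if tp_size != 1 or pp_size != 1:
        raise ValueError("sparse updates require TP=1 and PP=1")
    if req.packed:
        raise ValueError("sparse updates cannot be combined with packed=True")
    if req.is_checkpoint_format:
        raise ValueError("sparse updates need runtime parameter names, "
                         "not checkpoint-format names")
    if req.index_dtype != "int32":
        raise ValueError("sparse indices must be int32")
    if req.nnz is None or len(req.nnz) != len(req.names):
        raise ValueError("sparse updates need one nnz entry per parameter")


# A request inside the MVP scope passes silently.
validate_sparse_request(SparseUpdateRequest(names=["w"], nnz=[3]))
```

Rejecting unsupported combinations before any collective call is issued keeps the trainer and the server from entering a broadcast with mismatched expectations.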

Why this is not duplicating an existing PR:

  • I checked issue #39451 and open PRs for the same area; no open PR was already implementing this sparse NCCL MVP.

Test Plan

.venv/bin/python -m pytest tests/distributed/test_weight_transfer.py -v -k 'valid_sparse_update_info or sparse_update_requires_nnz_list or sparse_update_rejects_checkpoint_format or sparse_update_rejects_packed or sparse_update_rejects_non_int32_indices or dense_update_rejects_sparse_metadata or nccl_receive_sparse_weights_without_init_raises or nccl_sparse_weight_transfer_between_processes or sparse_update_kind_rejected'

.venv/bin/python -m pytest tests/entrypoints/weight_transfer/test_weight_transfer_llm.py -v -k 'test_update_weights_passes_sparse_metadata'

.venv/bin/python -m pytest tests/v1/worker/test_gpu_model_runner.py -v -k 'apply_sparse_weight_patches'

.venv/bin/python -m pytest tests/v1/worker/test_gpu_worker_weight_transfer.py -v -k 'sparse_dispatches or sparse_rejects_tp_or_pp'

Test Result

  • tests/distributed/test_weight_transfer.py

    • 9 passed, 27 deselected in 13.59s
    • includes test_nccl_sparse_weight_transfer_between_processes on a 2-GPU pod
  • tests/entrypoints/weight_transfer/test_weight_transfer_llm.py

    • 1 passed, 5 deselected in 28.24s
  • tests/v1/worker/test_gpu_model_runner.py

    • 3 passed, 25 deselected in 2.40s
  • tests/v1/worker/test_gpu_worker_weight_transfer.py

    • 2 passed in 2.17s

Additional Validation

Outside this PR branch, I also ran temporary repro/debug harnesses on a 2-GPU pod to validate dense-vs-sparse equivalence for the same deterministic patch on Qwen/Qwen3-1.7B.

Observed results:

  • trainer patch digests matched
  • full server-side parameter digest maps matched
  • controlled max_tokens=1, greedy outputs matched between dense and sparse updates
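
The digest comparison described above can be illustrated with a toy check. This is a sketch under assumptions: the PR's actual harness is not part of the branch and is not shown, so `digest_params` and `apply_sparse` are hypothetical helpers, and plain Python lists packed as float32 stand in for GPU tensors.

```python
import hashlib
import struct


def digest_params(params):
    """Map each parameter name to a SHA-256 digest of its float32 bytes."""
    digests = {}
    for name, values in params.items():
        raw = struct.pack(f"<{len(values)}f", *values)
        digests[name] = hashlib.sha256(raw).hexdigest()
    return digests


def apply_sparse(values, indices, patch):
    """Return a copy of `values` with patch[i] written at position indices[i]."""
    updated = list(values)
    for idx, val in zip(indices, patch):
        updated[idx] = val
    return updated


base = {"w": [0.0, 1.0, 2.0, 3.0], "b": [0.5, 0.5]}

# Dense path: the trainer resends every element of every parameter.
dense_result = {"w": [0.0, 1.5, 2.0, -3.0], "b": [0.5, 0.5]}

# Sparse path: the same change expressed as (indices, values) on "w" only.
sparse_result = {
    "w": apply_sparse(base["w"], [1, 3], [1.5, -3.0]),
    "b": list(base["b"]),
}

digests_match = digest_params(dense_result) == digest_params(sparse_result)
```

Comparing whole digest maps rather than only the patched parameters also catches the case where a sparse update accidentally touches a parameter it should have left alone.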

Performance validation:

  • for a patch affecting ~0.3% of model elements on Qwen/Qwen3-1.7B, the sparse payload was ~30.97 MB versus ~3.44 GB for the dense full-model resend path
  • in one pod validation run, trainer-side send time decreased from ~175 ms for dense resend to ~4 ms for sparse patch transfer
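
The reported payload sizes are consistent with simple per-element accounting. The sketch below assumes 2 bytes per bf16 value and 4 bytes per int32 index, and back-derives the element count from the 3.44 GB dense figure (it is not stated in the PR); per-parameter metadata is ignored, which may account for the small gap to the reported ~30.97 MB.

```python
BYTES_BF16 = 2    # one bf16 value
BYTES_INT32 = 4   # one int32 flat index


def dense_payload_bytes(numel):
    """Dense resend: every element is shipped as a bf16 value."""
    return numel * BYTES_BF16


def sparse_payload_bytes(nnz):
    """Sparse patch: each changed element ships an index plus a value."""
    return nnz * (BYTES_INT32 + BYTES_BF16)


numel = 1_720_000_000      # assumed: 3.44 GB / 2 bytes per element
nnz = numel * 3 // 1000    # ~0.3% of elements changed

dense_gb = dense_payload_bytes(numel) / 1e9    # 3.44
sparse_mb = sparse_payload_bytes(nnz) / 1e6    # 30.96
```

Under these assumptions each changed element costs 6 bytes against 2 bytes for a dense element, so the sparse format only wins while fewer than one third of the elements change.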

This validation is supplementary and is not part of the submitted branch.

AI assistance was used to help develop and validate this change.



Changed files

  • tests/distributed/test_weight_transfer.py (modified, +247/-0)
  • tests/entrypoints/weight_transfer/test_weight_transfer_llm.py (modified, +67/-0)
  • tests/v1/worker/test_gpu_model_runner.py (modified, +69/-0)
  • tests/v1/worker/test_gpu_worker_weight_transfer.py (added, +99/-0)
  • vllm/distributed/weight_transfer/base.py (modified, +16/-1)
  • vllm/distributed/weight_transfer/ipc_engine.py (modified, +4/-0)
  • vllm/distributed/weight_transfer/nccl_engine.py (modified, +95/-0)
  • vllm/v1/worker/gpu_model_runner.py (modified, +39/-0)
  • vllm/v1/worker/gpu_worker.py (modified, +21/-2)


🚀 The feature, motivation and pitch

In online RL, the trainer periodically syncs updated weights to a vLLM inference server. After a single optimizer step, typically >99% of bf16 elements are unchanged. We'd like to transfer and apply only the delta.

Problem

receive_weights operates on full dense tensors. For each parameter, it allocates the full shape and broadcasts the entire tensor:

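# nccl_engine.py: receive_weights (excerpt)
for name, dtype_name, shape in zip(update_info.names, update_info.dtype_names, update_info.shapes):
    # dtype is not defined in the excerpt; presumably it is resolved from dtype_name.
    dtype = getattr(torch, dtype_name)
    # Allocate a full-size buffer and receive the entire tensor from rank 0.
    weight = torch.empty(shape, dtype=dtype, device="cuda")
    self.model_update_group.broadcast(weight, src=0)
    load_weights([(name, weight)])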

There's no way to say "here are the 0.3% of elements that changed — apply them in-place." This forces us to keep a full CPU bf16 snapshot on the vLLM side to reconstruct dense tensors from sparse patches before calling load_weights.

Possible API

A sparse variant that broadcasts only indices + values, then scatters directly into the existing GPU parameter:

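# Sparse path: broadcast only the delta
indices = torch.empty(nnz, dtype=torch.int32, device="cuda")
values = torch.empty(nnz, dtype=dtype, device="cuda")
self.model_update_group.broadcast(indices, src=0)
self.model_update_group.broadcast(values, src=0)
param = model.get_parameter(name)
# view(-1) never copies (it raises on a non-contiguous parameter), so the write
# lands in the live weights; flatten() may silently return a copy instead.
with torch.no_grad():
    param.view(-1)[indices.long()] = values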

This would eliminate the CPU snapshot and reduce the data transferred/copied from O(numel) to O(nnz) per parameter.
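
To make the round trip concrete, here is a minimal CPU-only sketch of both ends of such a patch, with the NCCL broadcast left out. `make_sparse_patch` and `apply_sparse_patch` are hypothetical helper names, not part of vLLM, and the sketch assumes contiguous parameters with fewer than 2^31 elements so that flat int32 indices suffice.

```python
import torch


def make_sparse_patch(old, new):
    """Trainer side: flat int32 indices and new values of changed elements."""
    changed = (old != new).reshape(-1)
    indices = torch.nonzero(changed).squeeze(1).to(torch.int32)
    values = new.reshape(-1)[indices.long()]
    return indices, values


@torch.no_grad()
def apply_sparse_patch(param, indices, values):
    """Receiver side: write `values` into `param` in place at flat `indices`."""
    if not param.is_contiguous():
        raise ValueError("in-place flat indexing needs a contiguous parameter")
    # view(-1) shares storage with param, so index_copy_ updates the live weights.
    param.view(-1).index_copy_(0, indices.long(), values.to(param.dtype))


old = torch.zeros(2, 3)
new = old.clone()
new[0, 1] = 1.5
new[1, 2] = -2.0

runtime_param = torch.nn.Parameter(old.clone(), requires_grad=False)
indices, values = make_sparse_patch(old, new)
apply_sparse_patch(runtime_param, indices, values)
```

Sending absolute values rather than differences keeps the update idempotent: applying the same patch twice leaves the parameter in the same state.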

Alternatives

Keep the current behavior. It works, but it is suboptimal.

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

Implement a sparse variant of the receive_weights function to broadcast only the changed indices and values, reducing data transfer and eliminating the need for a full CPU snapshot.

Guidance

  • Identify the parameters that are being updated and determine the indices and values of the changed elements.
  • Modify the receive_weights function to broadcast only the indices and values of the changed elements, rather than the entire tensor.
  • Scatter the received values into the existing GPU parameter at the given flat indices, for example by indexed assignment on a flattened view, so that no dense tensor is rebuilt.
  • Test the sparse variant to ensure it correctly updates the model parameters and reduces data transfer.

Example

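# Sparse path: broadcast only the delta
indices = torch.empty(nnz, dtype=torch.int32, device="cuda")
values = torch.empty(nnz, dtype=dtype, device="cuda")
self.model_update_group.broadcast(indices, src=0)
self.model_update_group.broadcast(values, src=0)
param = model.get_parameter(name)
# view(-1) never copies (it raises on a non-contiguous parameter), so the write
# lands in the live weights; flatten() may silently return a copy instead.
with torch.no_grad():
    param.view(-1)[indices.long()] = values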

Notes

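The proposed solution assumes that receive_weights can be extended with a sparse path that writes directly into the existing runtime parameters instead of passing dense tensors to load_weights. The implementation needs careful handling of data types and indexing: indices are broadcast as int32 and converted to int64 for indexing, values must match the parameter dtype, and the indices refer to the flattened runtime parameter.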

Recommendation

Apply workaround: Implement the sparse variant of the receive_weights function to reduce data transfer and improve efficiency. This approach eliminates the need for a full CPU snapshot and reduces the amount of data transferred from O(numel) to O(nnz) per parameter.
