vllm - ✅(Solved) Fix [Feature]: Support sparse in-place weight updates in weight transfer API [1 pull requests, 1 participants]

vllm 2026-04-09 20:57:55

GitHub stats

vllm-project/vllm#39451•Fetched 2026-04-10 03:40:35

Author

qgallouedec

Participants

qgallouedec

Fix Action

Fix / Workaround

There's no way to say "here are the 0.3% of elements that changed — apply them in-place." This forces us to keep a full CPU bf16 snapshot on the vLLM side to reconstruct dense tensors from sparse patches before calling load_weights.
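The snapshot workaround described above can be sketched roughly as follows. This is an illustration only: `CpuSnapshot` and its methods are hypothetical names, not part of vLLM, and the sketch assumes patches arrive as flat int32 indices plus values.

```python
from typing import Dict

import torch


class CpuSnapshot:
    """Hypothetical helper: keeps a dense CPU copy of every parameter."""

    def __init__(self, named_params: Dict[str, torch.Tensor]) -> None:
        # torch.empty(...).copy_() always yields a fresh contiguous tensor,
        # so view(-1) below is guaranteed to alias the snapshot's storage.
        self._dense = {
            name: torch.empty(p.shape, dtype=p.dtype, device="cpu").copy_(p.detach())
            for name, p in named_params.items()
        }

    def rebuild(self, name: str, indices: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
        """Apply a sparse patch to the snapshot and return the full dense tensor."""
        dense = self._dense[name]
        dense.view(-1)[indices.long()] = values.to(device="cpu", dtype=dense.dtype)
        return dense
```

The rebuilt tensor would then go through the existing dense path, for example `load_weights([(name, snapshot.rebuild(name, indices, values))])`. The cost is a second full copy of the model in host memory, which is what the feature request aims to remove.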

PR fix notes

PR #40096: [Frontend][Core] Add sparse NCCL weight transfer support for in-place updates

Repository: vllm-project/vllm
Author: bedeks
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40096

Description (problem / solution / changelog)

Purpose

Implements an MVP sparse NCCL weight transfer path for online RL weight sync. Instead of resending full dense tensors, the trainer can send (indices, values) patches that are applied in-place to existing runtime GPU parameters.

This addresses #39451.

Current scope:

NCCL backend only
sparse updates use kernel-format/runtime parameter names
not composable with packed=True
not composable with is_checkpoint_format=True
restricted to TP=1, PP=1
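The scope restrictions above amount to a set of up-front checks on each sparse update request. The sketch below shows one way such checks could look. `SparseUpdateRequest`, `validate_sparse_request`, and the `nnz_list` and `index_dtype` fields are hypothetical names inferred from the test names in the test plan; they are not taken from the PR's code. Only `packed` and `is_checkpoint_format` appear in the PR description.

```python
from dataclasses import dataclass
from math import prod
from typing import List, Optional, Tuple


@dataclass
class SparseUpdateRequest:
    """Hypothetical request shape; not the PR's actual class."""
    names: List[str]
    shapes: List[Tuple[int, ...]]
    nnz_list: Optional[List[int]] = None  # changed elements per parameter
    index_dtype: str = "int32"
    packed: bool = False
    is_checkpoint_format: bool = False


def validate_sparse_request(req: SparseUpdateRequest, tp_size: int = 1, pp_size: int = 1) -> None:
    """Raise ValueError if the request falls outside the MVP scope."""
    if req.packed:
        raise ValueError("sparse updates are not composable with packed=True")
    if req.is_checkpoint_format:
        raise ValueError("sparse updates are not composable with is_checkpoint_format=True")
    if tp_size != 1 or pp_size != 1:
        raise ValueError("sparse updates are restricted to TP=1, PP=1")
    if req.index_dtype != "int32":
        raise ValueError("sparse indices must be int32")
    if req.nnz_list is None:
        raise ValueError("sparse updates require an nnz entry per parameter")
    if not (len(req.names) == len(req.shapes) == len(req.nnz_list)):
        raise ValueError("names, shapes and nnz_list must have the same length")
    for name, shape, nnz in zip(req.names, req.shapes, req.nnz_list):
        # A patch cannot touch more elements than the parameter holds.
        if not 0 <= nnz <= prod(shape):
            raise ValueError(f"nnz={nnz} is out of range for {name} with shape {shape}")
```

Rejecting unsupported combinations before any collective call is issued keeps sender and receiver from entering mismatched broadcasts.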

Why this is not duplicating an existing PR:

I checked issue #39451 and open PRs for the same area; no open PR was already implementing this sparse NCCL MVP.

Test Plan

.venv/bin/python -m pytest tests/distributed/test_weight_transfer.py -v -k 'valid_sparse_update_info or sparse_update_requires_nnz_list or sparse_update_rejects_checkpoint_format or sparse_update_rejects_packed or sparse_update_rejects_non_int32_indices or dense_update_rejects_sparse_metadata or nccl_receive_sparse_weights_without_init_raises or nccl_sparse_weight_transfer_between_processes or sparse_update_kind_rejected'

.venv/bin/python -m pytest tests/entrypoints/weight_transfer/test_weight_transfer_llm.py -v -k 'test_update_weights_passes_sparse_metadata'

.venv/bin/python -m pytest tests/v1/worker/test_gpu_model_runner.py -v -k 'apply_sparse_weight_patches'

.venv/bin/python -m pytest tests/v1/worker/test_gpu_worker_weight_transfer.py -v -k 'sparse_dispatches or sparse_rejects_tp_or_pp'

Test Result

tests/distributed/test_weight_transfer.py
- 9 passed, 27 deselected in 13.59s
- includes test_nccl_sparse_weight_transfer_between_processes on a 2-GPU pod
tests/entrypoints/weight_transfer/test_weight_transfer_llm.py
- 1 passed, 5 deselected in 28.24s
tests/v1/worker/test_gpu_model_runner.py
- 3 passed, 25 deselected in 2.40s
tests/v1/worker/test_gpu_worker_weight_transfer.py
- 2 passed in 2.17s

Additional Validation

Outside this PR branch, I also ran temporary repro/debug harnesses on a 2-GPU pod to validate dense-vs-sparse equivalence for the same deterministic patch on Qwen/Qwen3-1.7B.

Observed results:

trainer patch digests matched
full server-side parameter digest maps matched
controlled max_tokens=1, greedy outputs matched between dense and sparse updates
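The PR description does not show how the digests were computed. One plausible way to compare parameters across the dense and sparse paths is a per-parameter hash over dtype, shape, and raw bytes, sketched below. `tensor_digest` and `digest_map` are hypothetical helper names, and SHA-256 is an arbitrary choice for illustration.

```python
import hashlib
from typing import Dict

import torch


def tensor_digest(t: torch.Tensor) -> str:
    """Hash dtype, shape and raw bytes so any bitwise difference is detected."""
    flat = t.detach().cpu().contiguous().reshape(-1)
    raw = flat.view(torch.uint8)  # reinterpret the storage as bytes
    h = hashlib.sha256()
    h.update(str(t.dtype).encode())
    h.update(str(tuple(t.shape)).encode())
    # tolist() is slow on large tensors; it keeps this sketch free of NumPy.
    h.update(bytes(raw.tolist()))
    return h.hexdigest()


def digest_map(named_params: Dict[str, torch.Tensor]) -> Dict[str, str]:
    """Map each parameter name to its digest."""
    return {name: tensor_digest(p) for name, p in named_params.items()}
```

Two servers, one updated densely and one sparsely, would then be considered equivalent when their digest maps compare equal.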

Performance validation:

for a patch affecting ~0.3% of model elements on Qwen/Qwen3-1.7B, the sparse payload was ~30.97 MB versus ~3.44 GB for the dense full-model resend path
in one pod validation run, trainer-side send time decreased from ~175 ms for dense resend to ~4 ms for sparse patch transfer
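The reported payload sizes are roughly what simple arithmetic predicts. The sketch below assumes about 1.72e9 bf16 elements (3.44 GB divided by 2 bytes per element), exactly 0.3% of them changed, and one int32 index per changed element. These figures are assumptions chosen to match the reported numbers; the PR description does not state them.

```python
def dense_payload_bytes(numel: int, value_bytes: int = 2) -> int:
    """Bytes sent when every element is retransmitted (bf16 = 2 bytes)."""
    return numel * value_bytes


def sparse_payload_bytes(nnz: int, value_bytes: int = 2, index_bytes: int = 4) -> int:
    """Bytes sent for (index, value) pairs: int32 index plus bf16 value."""
    return nnz * (index_bytes + value_bytes)


numel = 1_720_000_000        # assumed element count (3.44 GB / 2 bytes)
nnz = numel * 3 // 1000      # assumed 0.3% of elements changed

dense_bytes = dense_payload_bytes(numel)
sparse_bytes = sparse_payload_bytes(nnz)
print(f"dense:  {dense_bytes / 1e9:.2f} GB")
print(f"sparse: {sparse_bytes / 1e6:.2f} MB")
```

With 2-byte values and 4-byte indices, each changed element costs 6 bytes, so the sparse encoding is smaller than the dense one only while fewer than one third of the elements change.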

This validation is supplementary and is not part of the submitted branch.

AI assistance was used to help develop and validate this change.

Changed files

tests/distributed/test_weight_transfer.py (modified, +247/-0)
tests/entrypoints/weight_transfer/test_weight_transfer_llm.py (modified, +67/-0)
tests/v1/worker/test_gpu_model_runner.py (modified, +69/-0)
tests/v1/worker/test_gpu_worker_weight_transfer.py (added, +99/-0)
vllm/distributed/weight_transfer/base.py (modified, +16/-1)
vllm/distributed/weight_transfer/ipc_engine.py (modified, +4/-0)
vllm/distributed/weight_transfer/nccl_engine.py (modified, +95/-0)
vllm/v1/worker/gpu_model_runner.py (modified, +39/-0)
vllm/v1/worker/gpu_worker.py (modified, +21/-2)

🚀 The feature, motivation and pitch

In online RL, the trainer periodically syncs updated weights to a vLLM inference server. After a single optimizer step, typically >99% of bf16 elements are unchanged. We'd like to transfer and apply only the delta.

Problem

receive_weights operates on full dense tensors. For each parameter, it allocates the full shape and broadcasts the entire tensor:

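# nccl_engine.py, receive_weights (simplified excerpt)
for name, dtype_name, shape in zip(update_info.names, update_info.dtype_names, update_info.shapes):
    dtype = getattr(torch, dtype_name)  # resolve the dtype name, e.g. "bfloat16"
    weight = torch.empty(shape, dtype=dtype, device="cuda")  # full-shape buffer
    self.model_update_group.broadcast(weight, src=0)  # entire tensor is sent
    load_weights([(name, weight)])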

Possible API

A sparse variant that broadcasts only indices + values, then scatters directly into the existing GPU parameter:

# Sparse path: broadcast only the delta
indices = torch.empty(nnz, dtype=torch.int32, device="cuda")
values = torch.empty(nnz, dtype=dtype, device="cuda")
self.model_update_group.broadcast(indices, src=0)
self.model_update_group.broadcast(values, src=0)
param = model.get_parameter(name)
# view(-1) always aliases the parameter's storage; flatten() copies when the
# tensor is non-contiguous, which would silently drop the update.
with torch.no_grad():
    param.view(-1)[indices.long()] = values

This would eliminate the CPU snapshot and reduce the data transferred/copied from O(numel) to O(nnz) per parameter.
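A minimal, CPU-runnable sketch of the in-place scatter step follows, leaving out the NCCL broadcasts. `apply_sparse_patch` is a hypothetical name, and the checks are assumptions about what a careful implementation might enforce, not the behaviour of any existing vLLM function.

```python
import torch


def apply_sparse_patch(param: torch.Tensor, indices: torch.Tensor, values: torch.Tensor) -> None:
    """Write values into param at the given flat indices, in place."""
    if not param.is_contiguous():
        # A flattened copy would receive the write instead of the parameter.
        raise ValueError("sparse patch requires a contiguous parameter")
    if indices.dtype != torch.int32:
        raise TypeError("indices must be int32")
    if indices.dim() != 1 or indices.shape != values.shape:
        raise ValueError("indices and values must be 1-D and the same length")
    if indices.numel() and not (0 <= int(indices.min()) and int(indices.max()) < param.numel()):
        raise IndexError("flat index out of range for this parameter")
    with torch.no_grad():
        # view(-1) never copies, so the assignment lands in param's storage.
        param.view(-1)[indices.long()] = values.to(param.dtype)


weight = torch.zeros(2, 3, dtype=torch.bfloat16)
apply_sparse_patch(
    weight,
    torch.tensor([1, 5], dtype=torch.int32),
    torch.tensor([1.5, -2.0], dtype=torch.bfloat16),
)
print(weight)
```

Indices address the parameter in row-major flat order, so the sender and receiver must agree on the parameter's runtime layout. This is consistent with the PR restricting sparse updates to kernel-format parameter names.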

Alternatives

Keep the current dense path. It works, but it is suboptimal.

extent analysis

TL;DR

Implement a sparse variant of the receive_weights function to broadcast only the changed indices and values, reducing data transfer and eliminating the need for a full CPU snapshot.

Guidance

Identify the parameters that are being updated and determine the indices and values of the changed elements.
Modify the receive_weights function to broadcast only the indices and values of the changed elements, rather than the entire tensor.
Scatter the received values into the existing GPU parameter at the specified flat indices, for example with indexed assignment on a flattened view of the parameter.
Test the sparse variant to ensure it correctly updates the model parameters and reduces data transfer.
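The first and last steps above can be sketched on the trainer side as follows. `compute_sparse_delta` is a hypothetical helper, and comparing the old and new tensors elementwise is only one way to find changed elements; a trainer may already track them.

```python
from typing import Tuple

import torch


def compute_sparse_delta(old: torch.Tensor, new: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
    """Return (flat int32 indices, new values) for every element that changed."""
    if old.shape != new.shape or old.dtype != new.dtype:
        raise ValueError("old and new must have the same shape and dtype")
    if old.numel() > 2**31 - 1:
        raise ValueError("parameter too large for int32 flat indices")
    old_flat = old.reshape(-1)
    new_flat = new.reshape(-1)
    # Value comparison: NaN always counts as changed, and 0.0 vs -0.0 does not.
    changed = torch.nonzero(old_flat != new_flat).squeeze(1)
    return changed.to(torch.int32), new_flat[changed].clone()


old = torch.zeros(2, 3, dtype=torch.bfloat16)
new = old.clone()
new[0, 1] = 1.5
new[1, 2] = -2.0
indices, values = compute_sparse_delta(old, new)

# Round trip: patching a copy of the old weights should reproduce the new ones.
patched = old.clone()
patched.view(-1)[indices.long()] = values
print(torch.equal(patched, new))
```

A round-trip check like the last three lines is a cheap way to test that a sparse update reproduces the dense result before comparing transfer sizes.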

Notes

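The proposed solution assumes that receive_weights can be extended with a sparse path that reuses model_update_group for the broadcasts and writes into the existing parameter directly instead of calling load_weights. The implementation also needs careful handling of data types and indexing: indices are sent as int32 and cast to int64 before indexing, they address the flattened parameter, and the values must match the parameter's dtype.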

Recommendation

Implement the sparse variant of receive_weights. It removes the need for a full CPU snapshot and reduces the data transferred from O(numel) to O(nnz) per parameter. PR #40096 proposes an NCCL-only MVP of this path, restricted to TP=1 and PP=1; at the time this page was fetched the PR was open and not merged.
