vllm - 💡(How to fix) Fix [RFC]: Move trainer-side weight transfer logic out of `vllm`

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Code Example

class MyWeightTransferEngine(WeightTransferEngine):
    init_info_cls = MyInitInfo
    update_info_cls = MyUpdateInfo

    # vLLM-side (receive)
    def init_transfer_engine(self, init_info: MyInitInfo): ...

    def receive_weights(
        self,
        update_info: MyUpdateInfo,
        load_weights: Callable[[list[tuple[str, Tensor]]], None],
    ): ...

    # Trainer-side (send) — called from the RL framework
    @staticmethod
    def trainer_send_weights(
        iterator: Iterator[tuple[str, Tensor]],
        trainer_args: dict[str, Any] | Any,
    ): ...

---

# examples/rl/weight_transfer/nccl_sender.py
class NCCLBroadcastSender:
    def __init__(
        self,
        master_address: str,
        master_port: int,
        world_size: int,
        packed: bool = False,
        packed_buffer_size_bytes: int = DEFAULT_PACKED_BUFFER_SIZE_BYTES,
        packed_num_buffers: int = DEFAULT_PACKED_NUM_BUFFERS,
    ):
        # Trainer is rank 0 in the stateless process group shared with vLLM workers.
        device = torch.accelerator.current_device_index()
        pg = StatelessProcessGroup.create(
            host=master_address, port=master_port, rank=0, world_size=world_size,
        )
        self.group = PyNcclCommunicator(pg, device=device)
        self.packed = packed
        self.packed_buffer_size_bytes = packed_buffer_size_bytes
        self.packed_num_buffers = packed_num_buffers

    def send_weights(
        self,
        iterator: Iterator[tuple[str, torch.Tensor]],
    ):
        if self.packed:
            packed_nccl_broadcast_producer(
                iterator=iterator, group=self.group, src=0,
                post_iter_func=lambda item: item[1],
                buffer_size_bytes=self.packed_buffer_size_bytes,
                num_buffers=self.packed_num_buffers,
            )
        else:
            for _, tensor in iterator:
                self.group.broadcast(tensor, src=0, stream=torch.cuda.current_stream())
RAW_BUFFERClick to expand / collapse

Motivation.

The new vLLM weight transfer APIs were introduced in https://github.com/vllm-project/vllm/pull/31943 (vllm/distributed/weight_transfer/). They give RL frameworks a standard way to sync updated weights from a trainer process into running vLLM workers, with pluggable backends (NCCL, CUDA IPC, etc.).

Today, a backend is implemented by subclassing WeightTransferEngine and providing the below methods:

class MyWeightTransferEngine(WeightTransferEngine):
    init_info_cls = MyInitInfo
    update_info_cls = MyUpdateInfo

    # vLLM-side (receive)
    def init_transfer_engine(self, init_info: MyInitInfo): ...

    def receive_weights(
        self,
        update_info: MyUpdateInfo,
        load_weights: Callable[[list[tuple[str, Tensor]]], None],
    ): ...

    # Trainer-side (send) — called from the RL framework
    @staticmethod
    def trainer_send_weights(
        iterator: Iterator[tuple[str, Tensor]],
        trainer_args: dict[str, Any] | Any,
    ): ...

init_transfer_engine and receive_weights are vLLM-side logic for setting up transport and receiving weights into the worker. trainer_send_weights (and on NCCLWeightTransferEngine also trainer_init) is a helper meant to be called from the trainer process to send weights to all inference workers.

There are two issues with bundling the trainer-side helper inside vLLM's engine class:

  1. The abstraction conflates send and receive. WeightTransferEngine should be a vLLM-side abstraction concerned only with receive logic. Putting send logic on the same class makes the API harder to reason about, and individual subclasses have already grown extra static methods that exist only for the trainer side (e.g. NCCLWeightTransferEngine.trainer_init).

  2. trainer_send_weights is too prescriptive. Its signature is tied to a (name, tensor) iterator for example. RL frameworks have diverse needs around how a state dict is materialized and chunked. A single static method on a vLLM class is the wrong place to standardize this functionality.

More broadly: even though trainer_send_weights is not required (frameworks can implement their own send path today), adding it as part of the public engine API can imply that this is the way to send weights. This is not what we want.

Proposed Change.

Move the trainer-side helpers fully out of the vllm package.

Concretely:

  • Remove WeightTransferEngine.trainer_send_weights from the base class and from all subclasses (NCCLWeightTransferEngine, IPCWeightTransferEngine).
  • Remove NCCLWeightTransferEngine.trainer_init (and any similar trainer-only helpers on other backends).
  • Keep on WeightTransferEngine only the receive-side contract: __init__, parse_init_info, parse_update_info, init_transfer_engine, receive_weights, shutdown.
  • Keep the backend-specific request/info dataclasses (*InitInfo, *UpdateInfo) where they are. They define the wire format between trainer and vLLM and remain part of the vLLM public API.
  • Provide one reference implementation per backend under examples/rl/ (or a sibling examples/rl/weight_transfer/) that shows the trainer-side counterpart. The existing example scripts (rlhf_nccl.py, rlhf_ipc.py, rlhf_http_nccl.py, rlhf_http_ipc.py, etc.) already call the helpers and will be updated to use the new in-example abstractions.

We still want this reference implementation in the vLLM repo and well tested. For example, NCCLWeightTransferEngine uses packed_nccl_broadcast_consumer helper function and it's better for trainer rank-0 sender to use the corresponding packed_nccl_broadcast_producer method to be consistent.

Migration

This is a breaking change for users who import trainer_send_weights / trainer_init directly.

Given the APIs are new (introduced in https://github.com/vllm-project/vllm/pull/31943 and not yet broadly adopted), I think they can be removed outright rather than going through a proper deprecation.

Example for NCCL

The reference trainer-side class for NCCL becomes:

# examples/rl/weight_transfer/nccl_sender.py
class NCCLBroadcastSender:
    def __init__(
        self,
        master_address: str,
        master_port: int,
        world_size: int,
        packed: bool = False,
        packed_buffer_size_bytes: int = DEFAULT_PACKED_BUFFER_SIZE_BYTES,
        packed_num_buffers: int = DEFAULT_PACKED_NUM_BUFFERS,
    ):
        # Trainer is rank 0 in the stateless process group shared with vLLM workers.
        device = torch.accelerator.current_device_index()
        pg = StatelessProcessGroup.create(
            host=master_address, port=master_port, rank=0, world_size=world_size,
        )
        self.group = PyNcclCommunicator(pg, device=device)
        self.packed = packed
        self.packed_buffer_size_bytes = packed_buffer_size_bytes
        self.packed_num_buffers = packed_num_buffers

    def send_weights(
        self,
        iterator: Iterator[tuple[str, torch.Tensor]],
    ):
        if self.packed:
            packed_nccl_broadcast_producer(
                iterator=iterator, group=self.group, src=0,
                post_iter_func=lambda item: item[1],
                buffer_size_bytes=self.packed_buffer_size_bytes,
                num_buffers=self.packed_num_buffers,
            )
        else:
            for _, tensor in iterator:
                self.group.broadcast(tensor, src=0, stream=torch.cuda.current_stream())

The example imports StatelessProcessGroup, PyNcclCommunicator, and the packed broadcast producer from vllm.

Feedback Period.

1 week

CC List.

@hao-aaron @aoshen02 @robertgshaw2-redhat @kouroshHakha

Any Other Things.

Alternative Designs

The other alternative design is to create a separate WeightTransferSender abstraction in vLLM that will only encapsulate the send logic from the trainer - that way we decouple send and receive. However, this will still be trying to standardize trainer side logic as a vLLM API and will be counter-productive.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING