vllm - [RFC]: Expert Weight Backup for Elastic EP [1 comment, 2 participants]

1. Motivation

EPLB (vllm/distributed/eplb/) rebalances MoE expert placement every N forward passes via GPU-to-GPU NCCL P2P. It assumes every source rank is alive and reachable. When a rank dies, the new physical-to-logical map can name an expert whose only live replica was on the dead rank, so there is no second source for those weights.

This RFC adds a per-node sidecar that mirrors expert weights into host CPU memory, registers them with NIXL, and serves RDMA reads to surviving ranks mid-rebalance.

2. Goal

Survive rank loss during EPLB rebalance.
Recover dead-source experts from remote DRAM.
Dormant on the happy path: NCCL P2P remains the primary transport.

3. Proposed Change

Each client opens ZMQ + NIXL handshakes with every node's manager (N managers, one per node). During a rebalance the client may RDMA-read from any subset of them.
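The failure case described in the Motivation can be sketched as a small check over the current placement. This is an illustration only: the function name, the flat list layout of the physical-to-logical map, and the slot-to-rank list are assumptions made for the example and are not taken from the EPLB code.

```python
from collections import defaultdict


def find_orphaned_experts(
    physical_to_logical: list[int],
    slot_to_rank: list[int],
    dead_ranks: set[int],
) -> set[int]:
    """Return logical expert ids whose every physical replica is on a dead rank."""
    live_replicas: dict[int, int] = defaultdict(int)
    for slot, logical_id in enumerate(physical_to_logical):
        # Touch the key so experts with zero live replicas still appear.
        live_replicas[logical_id] += 0
        if slot_to_rank[slot] not in dead_ranks:
            live_replicas[logical_id] += 1
    return {lid for lid, count in live_replicas.items() if count == 0}


# Four ranks with two physical slots each (hypothetical placement).
# Experts 0 and 1 are replicated on ranks 0 and 3; experts 4 and 5 live only on rank 2.
old_map = [0, 1, 2, 3, 4, 5, 0, 1]
ranks = [0, 0, 1, 1, 2, 2, 3, 3]
orphans = find_orphaned_experts(old_map, ranks, dead_ranks={2})
```

Under this sketch, only the experts in `orphans` would need the backup path; every other expert still has a live GPU source and can keep using NCCL P2P.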

3.1 CLI / Config

--enable-expert-weight-backup       # default False; requires --enable-elastic-ep + --enable-eplb
--expert-weight-backup-port-base    # ZMQ control-plane base, default 35000

ParallelConfig gains enable_expert_weight_backup, expert_weight_backup_ib_device, expert_weight_backup_port_base, expert_weight_backup_pin_memory.
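The flag dependency noted in the comments above (backup requires both elastic EP and EPLB) could be enforced at config construction time. The sketch below is a stand-in, not the real ParallelConfig: the class name `BackupConfigSketch` and the port-range check are assumptions, and the `ib_device` and `pin_memory` fields are left out because the RFC does not give their types or defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupConfigSketch:
    """Hypothetical subset of the config fields named in the RFC."""

    enable_elastic_ep: bool = False
    enable_eplb: bool = False
    enable_expert_weight_backup: bool = False
    expert_weight_backup_port_base: int = 35000

    def __post_init__(self) -> None:
        # Backup is only meaningful when both prerequisite features are on.
        if self.enable_expert_weight_backup and not (
            self.enable_elastic_ep and self.enable_eplb
        ):
            raise ValueError(
                "enable_expert_weight_backup requires enable_elastic_ep and enable_eplb"
            )
        # Assumed sanity check: the base must be a valid TCP port number.
        if not 1 <= self.expert_weight_backup_port_base <= 65535:
            raise ValueError("expert_weight_backup_port_base must be in 1..65535")


config = BackupConfigSketch(
    enable_elastic_ep=True,
    enable_eplb=True,
    enable_expert_weight_backup=True,
)
```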

3.2 Wire descriptor (manager → client, sent once over ZMQ PUB)

@dataclass(frozen=True)
class ExpertSlotLocation:
    addr: int                 # absolute address in the manager's registered region
    nbytes: int
    shape: tuple[int, ...]
    dtype: torch.dtype

@dataclass(frozen=True)
class ExpertBackupDescriptor:
    owner_node_rank: int
    backup_region_base: int                   # base addr of registered region
    weight_pointer_map: dict[str, ExpertSlotLocation]   # keyed by parameter name
    nixl_agent_metadata: bytes              # from get_agent_metadata()
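The Test Plan calls for a descriptor round-trip test, which might look roughly like the sketch below. Several things here are assumptions: the RFC does not say how the descriptor is serialized, so JSON with hex-encoded metadata is used purely for illustration; `dtype` is a plain string standing in for `torch.dtype` so the example runs without PyTorch; the `*Sketch` class names, `to_wire`, `from_wire`, and the parameter name in the sample are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SlotLocationSketch:
    addr: int
    nbytes: int
    shape: tuple[int, ...]
    dtype: str  # stand-in for torch.dtype (assumption)


@dataclass(frozen=True)
class BackupDescriptorSketch:
    owner_node_rank: int
    backup_region_base: int
    weight_pointer_map: dict[str, SlotLocationSketch]
    nixl_agent_metadata: bytes


def to_wire(desc: BackupDescriptorSketch) -> bytes:
    """Encode a descriptor as UTF-8 JSON (illustrative wire format only)."""
    payload = asdict(desc)
    payload["nixl_agent_metadata"] = desc.nixl_agent_metadata.hex()
    return json.dumps(payload).encode("utf-8")


def from_wire(raw: bytes) -> BackupDescriptorSketch:
    """Decode the JSON produced by to_wire back into dataclasses."""
    payload = json.loads(raw)
    slots = {
        name: SlotLocationSketch(
            addr=slot["addr"],
            nbytes=slot["nbytes"],
            shape=tuple(slot["shape"]),  # JSON turns tuples into lists
            dtype=slot["dtype"],
        )
        for name, slot in payload["weight_pointer_map"].items()
    }
    return BackupDescriptorSketch(
        owner_node_rank=payload["owner_node_rank"],
        backup_region_base=payload["backup_region_base"],
        weight_pointer_map=slots,
        nixl_agent_metadata=bytes.fromhex(payload["nixl_agent_metadata"]),
    )


sample = BackupDescriptorSketch(
    owner_node_rank=1,
    backup_region_base=0x7F00_0000_0000,
    weight_pointer_map={
        "layers.0.experts.3.w1": SlotLocationSketch(
            addr=0x7F00_0000_0000, nbytes=4096, shape=(32, 64), dtype="bfloat16"
        ),
    },
    nixl_agent_metadata=b"\x00\x01opaque",
)
restored = from_wire(to_wire(sample))
```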

3.3 `ExpertBackupManager`: sidecar process, one per node

Owns a single contiguous host-memory region containing this node's slice of the checkpoint's expert weights, registered with NIXL. After publishing its descriptor the manager is a passive RDMA target for the engine's lifetime.
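One way to picture the single contiguous region is as a layout pass that assigns every expert tensor an absolute address before anything is copied. The sketch below only computes addresses; it does not allocate memory or call NIXL. The function name `plan_region`, the 64-byte alignment, and the parameter names are assumptions chosen for the example, since the RFC does not specify a layout policy.

```python
def plan_region(
    tensor_nbytes: dict[str, int],
    base_addr: int,
    alignment: int = 64,
) -> tuple[dict[str, tuple[int, int]], int]:
    """Assign each named tensor an aligned absolute address in one region.

    Returns (name -> (addr, nbytes), total bytes spanned from base_addr).
    """
    slots: dict[str, tuple[int, int]] = {}
    cursor = base_addr
    for name, nbytes in tensor_nbytes.items():
        # Round the cursor up to the next multiple of the alignment.
        cursor = (cursor + alignment - 1) // alignment * alignment
        slots[name] = (cursor, nbytes)
        cursor += nbytes
    return slots, cursor - base_addr


slots, total = plan_region(
    {"layers.0.experts.0.w1": 100, "layers.0.experts.0.w2": 30},
    base_addr=4096,
)
```

With these inputs the first tensor starts at the base address and the second starts at the next 64-byte boundary after it; the resulting `(addr, nbytes)` pairs are the kind of values that would populate `weight_pointer_map`.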

3.4 `ExpertBackupClient`: in-worker, one per EP rank

class ExpertBackupClient:
    def is_ready(self) -> bool: ...
    def fetch_logical_experts(
        self,
        layer_id: int,
        logical_expert_ids: Sequence[int],
        weight_name_filter: Callable[[str], bool] | None = None,
    ) -> None:
        """Issue batched NIXL READs from each remote manager into local GPU
        param tensors. Synchronous; called during a stop-the-world rebalance,
        not on the inference critical path."""
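The docstring mentions batched reads "from each remote manager", which implies grouping the requested weights by the node that holds them. The sketch below shows only that grouping step, with no NIXL calls. The function name `plan_batched_reads`, the simplified pointer-map shape (node rank to name to `(addr, nbytes)`), the parameter names, and the choice to raise `KeyError` for weights no manager advertises are all assumptions.

```python
from typing import Callable, Optional, Sequence


def plan_batched_reads(
    pointer_maps: dict[int, dict[str, tuple[int, int]]],
    wanted_names: Sequence[str],
    weight_name_filter: Optional[Callable[[str], bool]] = None,
) -> dict[int, list[tuple[str, int, int]]]:
    """Group wanted weights into one read batch per owning node."""
    batches: dict[int, list[tuple[str, int, int]]] = {}
    missing: list[str] = []
    for name in wanted_names:
        if weight_name_filter is not None and not weight_name_filter(name):
            continue  # caller asked to skip this weight
        for node_rank, pointer_map in pointer_maps.items():
            if name in pointer_map:
                addr, nbytes = pointer_map[name]
                batches.setdefault(node_rank, []).append((name, addr, nbytes))
                break
        else:
            missing.append(name)
    if missing:
        raise KeyError(f"no manager advertises: {missing}")
    return batches


maps = {
    0: {"layers.0.experts.4.w1": (1000, 64), "layers.0.experts.4.w2": (1064, 64)},
    1: {"layers.0.experts.5.w1": (2000, 64)},
}
batches = plan_batched_reads(
    maps,
    ["layers.0.experts.4.w1", "layers.0.experts.4.w2", "layers.0.experts.5.w1"],
    weight_name_filter=lambda name: name.endswith("w1"),
)
```

Each entry in `batches` would correspond to one batched transfer against a single manager, which keeps the number of outstanding requests proportional to the number of nodes rather than the number of tensors.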

4. Test Plan

Unit (CPU, NIXL stubbed): buffer layout, descriptor round-trip.
Multi-process (real NIXL, 2 GPUs): single + batched fetches, byte-identical to CPU reference.
Failure injection: kill rank 2/4; orphans recovered from DRAM mirror; outputs match all-4-rank reference.

Feedback Period.

1 week.

extent analysis

TL;DR

To survive rank loss during EPLB rebalance, implement a per-node sidecar that mirrors expert weights into host CPU memory and serves RDMA reads to surviving ranks.

Guidance

Enable expert weight backup by passing --enable-expert-weight-backup together with its prerequisites, --enable-elastic-ep and --enable-eplb, and set --expert-weight-backup-port-base if the default ZMQ base port (35000) is not suitable.
Implement the ExpertBackupManager sidecar process to manage the mirrored expert weights and register them with NIXL.
Use the ExpertBackupClient to fetch logical experts from remote managers during rebalance, ensuring that the client can recover dead-source experts from remote DRAM.
Test the implementation using the proposed test plan, including unit tests, multi-process tests, and failure injection tests.

Example

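import argparse

# Register the two flags named in section 3.1.
parser = argparse.ArgumentParser()
parser.add_argument('--enable-expert-weight-backup', action='store_true')
parser.add_argument('--expert-weight-backup-port-base', type=int, default=35000)
args = parser.parse_args(['--enable-expert-weight-backup'])

# The RFC does not specify constructor arguments for ExpertBackupManager or
# ExpertBackupClient, so the calls below are illustrative only.
# manager = ExpertBackupManager(...)   # sidecar process, one per node
# client = ExpertBackupClient(...)     # in-worker, one per EP rank

# Fetch logical experts during a rebalance, once the client is ready:
# if client.is_ready():
#     client.fetch_logical_experts(layer_id, logical_expert_ids)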

Notes

The implementation requires careful consideration of the wire descriptor format and the interaction between the ExpertBackupManager and ExpertBackupClient. Additionally, the test plan should ensure that the implementation works correctly in various scenarios, including failure injection.

Recommendation

Adopt the proposal by implementing the per-node sidecar and enabling expert weight backup. This is a new feature under RFC review rather than a workaround, and it is intended to let EPLB rebalances survive rank loss.
