vllm - [RFC]: Expert Weight Backup for Elastic EP [1 comment, 2 participants]

vllm-project/vllm#41204, fetched 2026-04-30 06:19:34

1. Motivation

EPLB (vllm/distributed/eplb/) rebalances MoE expert placement every N forward passes via GPU-to-GPU NCCL P2P. It assumes every source rank is alive and reachable. When a rank dies, the new physical-to-logical map can name an expert whose only live replica was on the dead rank, so there is no second source for those weights.

This RFC adds a per-node sidecar that mirrors expert weights into host CPU memory, registers them with NIXL, and serves RDMA reads to surviving ranks mid-rebalance.
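
The failure mode described above can be sketched as a small check: given a placement and a set of dead ranks, find the logical experts that have no surviving replica. The function name, the flat list layout, and the `slots_per_rank` parameter are assumptions made for illustration; they are not EPLB's actual data structures.

```python
from collections import defaultdict

def find_orphaned_experts(physical_to_logical, slots_per_rank, dead_ranks):
    """Return logical expert ids whose every physical replica sits on a dead rank.

    physical_to_logical[i] is the logical expert held in physical slot i.
    Slot i is assumed to live on rank i // slots_per_rank (illustrative layout).
    """
    live_replicas = defaultdict(int)
    all_experts = set()
    for slot, logical_id in enumerate(physical_to_logical):
        all_experts.add(logical_id)
        if slot // slots_per_rank not in dead_ranks:
            live_replicas[logical_id] += 1
    return sorted(e for e in all_experts if live_replicas[e] == 0)

# 4 ranks, 2 slots each; experts 3 and 5 live only on rank 2.
placement = [0, 1, 2, 0, 5, 3, 4, 1]
orphans = find_orphaned_experts(placement, slots_per_rank=2, dead_ranks={2})
```

In this toy placement, experts 3 and 5 are the ones that would have to be recovered from the host-memory mirror rather than over NCCL P2P.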

2. Goal

  • Survive rank loss during EPLB rebalance.
  • Recover dead-source experts from remote DRAM.
  • Dormant on the happy path: NCCL P2P remains the primary transport.

3. Proposed Change

Each client opens ZMQ + NIXL handshakes with every node's manager (N managers, one per node). During a rebalance the client may RDMA-read from any subset of them.

<img width="567" height="295" alt="Image" src="https://github.com/user-attachments/assets/30f528dc-2ce9-41b6-9864-a3b7ad270e72" />

3.1 CLI / Config

--enable-expert-weight-backup       # default False; requires --enable-elastic-ep + --enable-eplb
--expert-weight-backup-port-base    # ZMQ control-plane base, default 35000

ParallelConfig gains enable_expert_weight_backup, expert_weight_backup_ib_device, expert_weight_backup_port_base, expert_weight_backup_pin_memory.
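
A minimal sketch of the flag dependency stated above, written as a standalone dataclass rather than the real ParallelConfig. The class name, the `enable_elastic_ep` and `enable_eplb` field names, and the per-node port offset in `manager_port` are assumptions; the RFC only gives the flag names and the default base of 35000.

```python
from dataclasses import dataclass

@dataclass
class BackupConfigSketch:
    # Field names mirror the RFC flags; everything else here is illustrative.
    enable_elastic_ep: bool = False
    enable_eplb: bool = False
    enable_expert_weight_backup: bool = False
    expert_weight_backup_port_base: int = 35000

    def __post_init__(self):
        # The backup feature is only meaningful on top of elastic EP and EPLB.
        if self.enable_expert_weight_backup and not (
            self.enable_elastic_ep and self.enable_eplb
        ):
            raise ValueError(
                "--enable-expert-weight-backup requires "
                "--enable-elastic-ep and --enable-eplb"
            )
        if not 1 <= self.expert_weight_backup_port_base <= 65535:
            raise ValueError("expert_weight_backup_port_base must be a valid port")

    def manager_port(self, node_rank: int) -> int:
        # Assumption: one control-plane port per node, offset from the base.
        return self.expert_weight_backup_port_base + node_rank

cfg = BackupConfigSketch(
    enable_elastic_ep=True, enable_eplb=True, enable_expert_weight_backup=True
)
```

Validating at construction time means a misconfigured launch fails before any sidecar is started.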

3.2 Wire descriptor (manager → client, sent once over ZMQ PUB)

from dataclasses import dataclass

import torch


@dataclass(frozen=True)
class ExpertSlotLocation:
    addr: int                 # absolute address in the manager's registered region
    nbytes: int               # size of the slot in bytes
    shape: tuple[int, ...]
    dtype: torch.dtype

@dataclass(frozen=True)
class ExpertBackupDescriptor:
    owner_node_rank: int
    backup_region_base: int                             # base addr of registered region
    weight_pointer_map: dict[str, ExpertSlotLocation]   # keyed by parameter name
    nixl_agent_metadata: bytes                          # from get_agent_metadata()
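
The test plan mentions a descriptor round-trip; one way to picture it is sketched below. The RFC does not specify the wire encoding, so JSON is an assumption here, as are the shortened class names, the string stand-in for `torch.dtype` (used so the sketch runs without torch), and the example parameter name.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class SlotLocation:
    addr: int
    nbytes: int
    shape: tuple
    dtype: str  # dtype name such as "bfloat16", standing in for torch.dtype

@dataclass(frozen=True)
class BackupDescriptor:
    owner_node_rank: int
    backup_region_base: int
    weight_pointer_map: dict
    nixl_agent_metadata: bytes

def encode(desc: BackupDescriptor) -> bytes:
    payload = asdict(desc)  # recurses into the nested SlotLocation dataclasses
    payload["nixl_agent_metadata"] = desc.nixl_agent_metadata.hex()
    return json.dumps(payload).encode("utf-8")

def decode(raw: bytes) -> BackupDescriptor:
    payload = json.loads(raw.decode("utf-8"))
    slots = {
        name: SlotLocation(s["addr"], s["nbytes"], tuple(s["shape"]), s["dtype"])
        for name, s in payload["weight_pointer_map"].items()
    }
    return BackupDescriptor(
        payload["owner_node_rank"],
        payload["backup_region_base"],
        slots,
        bytes.fromhex(payload["nixl_agent_metadata"]),
    )

# Hypothetical parameter name and addresses, for illustration only.
desc = BackupDescriptor(
    owner_node_rank=0,
    backup_region_base=0x7F0000000000,
    weight_pointer_map={
        "layers.0.experts.w13": SlotLocation(0x7F0000000000, 4096, (2, 1024), "bfloat16")
    },
    nixl_agent_metadata=b"\x01\x02",
)
restored = decode(encode(desc))
```

Because the descriptor is sent once, a round-trip check of this kind is a cheap way to catch fields that do not survive serialization, such as tuples turning into lists.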

3.3 ExpertBackupManager: sidecar process, one per node

Owns a single contiguous host-memory region containing this node's slice of the checkpoint's expert weights, registered with NIXL. After publishing its descriptor the manager is a passive RDMA target for the engine's lifetime.
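
The single contiguous region implies some layout step that turns per-parameter sizes into absolute addresses. The sketch below shows one possible approach; the function name, the 256-byte alignment, and the assumption that `region_base` is itself aligned are illustrative and not taken from the RFC.

```python
def plan_region_layout(param_sizes, region_base, alignment=256):
    """Assign each parameter an absolute address inside one contiguous region.

    param_sizes: mapping of parameter name -> size in bytes.
    Returns (locations, total_bytes); locations maps name -> (addr, nbytes).
    """
    locations = {}
    offset = 0
    for name, nbytes in param_sizes.items():
        # Round the running offset up to the next multiple of `alignment`.
        offset = (offset + alignment - 1) // alignment * alignment
        locations[name] = (region_base + offset, nbytes)
        offset += nbytes
    return locations, offset

# Two hypothetical expert parameters packed after a base address of 0x10000.
layout, total = plan_region_layout({"w13": 1000, "w2": 500}, region_base=0x10000)
```

The resulting `(addr, nbytes)` pairs correspond to the `addr` and `nbytes` fields of ExpertSlotLocation, and `total` is the size of the region to allocate and register.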

3.4 ExpertBackupClient: in-worker, one per EP rank

from collections.abc import Callable, Sequence


class ExpertBackupClient:
    def is_ready(self) -> bool: ...
    def fetch_logical_experts(
        self,
        layer_id: int,
        logical_expert_ids: Sequence[int],
        weight_name_filter: Callable[[str], bool] | None = None,
    ) -> None:
        """Issue batched NIXL READs from each remote manager into local GPU
        param tensors. Synchronous; called during a stop-the-world rebalance,
        not on the inference critical path."""
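
Before any reads are issued, the requested experts have to be grouped by the manager that holds them so that each manager receives one batch. The sketch below shows only that grouping step and makes no NIXL calls; the function name, the `expert_owner` index, and the weight names are hypothetical, since the RFC does not say how the client maps experts to managers.

```python
from collections import defaultdict

def group_reads_by_manager(layer_id, logical_expert_ids, expert_owner,
                           weight_names, weight_name_filter=None):
    """Build one batch of (layer_id, expert_id, weight_name) reads per manager.

    expert_owner maps (layer_id, expert_id) -> owner node rank (hypothetical index).
    """
    batches = defaultdict(list)
    for expert_id in logical_expert_ids:
        node = expert_owner[(layer_id, expert_id)]
        for name in weight_names:
            # Skip weights the caller filtered out, mirroring weight_name_filter.
            if weight_name_filter is None or weight_name_filter(name):
                batches[node].append((layer_id, expert_id, name))
    return dict(batches)

owner = {(0, 3): 1, (0, 7): 0, (0, 9): 1}
batches = group_reads_by_manager(
    layer_id=0,
    logical_expert_ids=[3, 7, 9],
    expert_owner=owner,
    weight_names=["w13_weight", "w2_weight"],
    weight_name_filter=lambda n: n.startswith("w13"),
)
```

Each value in `batches` would then be translated into remote addresses via that manager's `weight_pointer_map` and submitted as a single batched read.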

4. Test Plan

  • Unit (CPU, NIXL stubbed): buffer layout, descriptor round-trip.
  • Multi-process (real NIXL, 2 GPUs): single + batched fetches, byte-identical to CPU reference.
  • Failure injection: kill rank 2/4; orphans recovered from DRAM mirror; outputs match all-4-rank reference.

Feedback Period.

1 week.


extent analysis

TL;DR

To survive rank loss during EPLB rebalance, implement a per-node sidecar that mirrors expert weights into host CPU memory and serves RDMA reads to surviving ranks.

Guidance

  • Enable expert weight backup by passing --enable-expert-weight-backup, which also requires --enable-elastic-ep and --enable-eplb, and override --expert-weight-backup-port-base if the default of 35000 does not suit your deployment.
  • Implement the ExpertBackupManager sidecar process to manage the mirrored expert weights and register them with NIXL.
  • Use the ExpertBackupClient to fetch logical experts from remote managers during rebalance, ensuring that the client can recover dead-source experts from remote DRAM.
  • Test the implementation using the proposed test plan, including unit tests, multi-process tests, and failure injection tests.

Example

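import argparse

# Flags proposed in the RFC; names and defaults come from section 3.1.
parser = argparse.ArgumentParser()
parser.add_argument('--enable-expert-weight-backup', action='store_true')
parser.add_argument('--expert-weight-backup-port-base', type=int, default=35000)
args = parser.parse_args()

# ExpertBackupManager and ExpertBackupClient are proposed in the RFC and are not
# importable yet; the RFC does not specify constructor arguments, so the calls
# below are illustrative only.
if args.enable_expert_weight_backup:
    manager = ExpertBackupManager()   # sidecar process, one per node
    client = ExpertBackupClient()     # in-worker, one per EP rank

    # Fetch logical experts during a rebalance once the client is ready.
    # The layer and expert ids are placeholder values.
    if client.is_ready():
        client.fetch_logical_experts(layer_id=0, logical_expert_ids=[3, 7])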

Notes

Two details of the design deserve attention. The wire descriptor is sent only once over ZMQ PUB, and each ExpertSlotLocation carries an absolute address in the manager's registered region, so the client and the ExpertBackupManager must agree on the region layout. The failure-injection test should also confirm that outputs after recovery match the all-4-rank reference.

Recommendation

Adopt the proposed design: with the per-node sidecar in place and expert weight backup enabled, surviving ranks have a second source for expert weights when a rank is lost during EPLB rebalance. Note that this is an RFC under discussion, not a shipped feature.
