vllm - ✅(Solved) Fix [RFC]: Async Failure Notification for Fault Tolerant EP Kernels [1 pull requests, 2 comments, 3 participants]

Official PRs (…)
ON THIS PAGE

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#39760Fetched 2026-04-16 06:36:55
View on GitHub
Comments
2
Participants
3
Timeline
15
Reactions
0
Author
Timeline (top)
mentioned ×6subscribed ×6commented ×2renamed ×1

Error Message

"Raising a formal exception requires reading the failure mask back to the CPU. | Exception call stack | Natural unwind to execute_model() | No active call stack |

Exception Flow

EPRankFailureError. The exception unwinds the call stack naturally: | Exception needed | Yes — to unwind the forward pass call stack | engine core. No exception is raised — the background thread has no forward pass call | Exception needed | No — background thread communicates via ZMQ | | Exception needed | Yes — unwinds call stack | No — ZMQ sideways | 3. New exception type: EPRankFailureError(RuntimeError) A new exception class EPRankFailureError(RuntimeError) in

Root Cause

The CPU resets dirty_flag_raw[0] = 0 after reading the full mask_buffer_ptr, before reconnect_peer completes. This is safe because connect_ranks clears the GPU-side mask_buffer_ptr entries for reconnected ranks — the next pass will see a clean mask for the recovered slot. A CPU-side reset is sufficient; no kernel change is needed for reset.

Fix Action

Fix / Workaround

One store is added to the EP dispatch/combine kernels alongside the existing atomicExch:

// In dispatch / combine kernel, when timeout occurs:
atomicExch(mask_buffer_ptr + src_rank, 1);    // existing: per-rank mask in VRAM
st_release_sys_global(dirty_flag_pinned, 1);  // new: single bit written to CPU RAM
Option A — inside All2AllManagerBase.dispatch() [recommended]:

PR fix notes

PR #38862: [EP] Fault tolerance: automatic elastic scale-down on DP engine death

Description (problem / solution / changelog)

What problem does this solve?

When running large MoE models (like DeepSeek) with Expert Parallelism across multiple GPUs, if any single GPU worker crashes, the entire serving cluster goes down. This means one bad GPU can take out your whole deployment.

This PR adds automatic fault tolerance: when a GPU worker dies, the system detects it, redistributes the work across surviving GPUs, and resumes serving — all without operator intervention or downtime beyond the recovery window (~25 seconds).

Important: This PR has no dependency on FT EP kernels (e.g., DeepEP low-latency with fault masking, NIXL EP) or FT Torch collectives. It works with the default allgather_reducescatter NCCL backend using standard NCCL operations.

How it works (plain English)

Background context: In Expert Parallelism (EP), a large MoE model's experts are split across multiple GPUs. Each GPU holds a subset of experts. With EPLB (Expert-Parallel Load Balancing), some experts are intentionally duplicated across GPUs for redundancy.

When a GPU worker dies, here's what happens:

  1. Detection — A monitor thread watches all GPU worker processes. When one dies, it identifies which worker and deduplicates notifications (Ray may report the same death multiple times).

  2. Feasibility check — Before attempting recovery, verify that the remaining GPUs have enough expert slots to cover all the model's logical experts. If not, shut down gracefully.

  3. Cleanup — Abort the old communication groups (NCCL) that included the dead worker — otherwise surviving workers would hang waiting for the dead one. Fail any in-flight requests that were assigned to the dead worker.

  4. Reconfiguration — Create new communication groups with only the surviving workers. This uses a two-staged barrier to handle the timing issue where some workers may be mid-computation when the fault is detected.

  5. Expert recovery — This is the core logic:

    • Remove the dead worker's columns from the expert assignment table (strip_dead_columns)
    • Find which experts lost all their copies (no surviving replica)
    • Find which experts still have extra copies (redundant replicas)
    • Reassign the lost experts into those redundant slots (reassign_missing_experts)
    • For experts that still have a copy somewhere: transfer weights between GPUs via direct GPU-to-GPU communication (NCCL P2P)
    • For experts where every copy was on the dead GPU (no surviving replica anywhere): reload those expert weights from the original model checkpoint on disk
    • Note: The post-fault expert reassignment is not load-balanced. It only ensures every logical expert has at least 1 physical replica by stealing from the most-redundant slots. Future EPLB rebalancing cycles will optimize the placement for load.
  6. Resume — Re-capture CUDA graphs with the new configuration and resume normal inference.

Example: what it looks like in practice

Case 1: All experts have a surviving replica (common with high redundancy)

When enough redundant experts are configured, the dead worker's experts likely still have copies on other GPUs. Recovery only needs GPU-to-GPU weight transfer — no disk I/O.

# GPU worker 1 crashes
[Fault Tolerance] Engine (dp_rank=1) died. Initiating scale-down from 4 to 3 engines.
[Fault Tolerance] All 64 logical experts have at least one replica — no reassignment needed.
[Fault Tolerance] Scale-down complete. Now running with 3 engines.

POST /v1/completions → 200 OK   (still serving!)

Case 2: Some experts have no surviving replica (requires disk reload)

With lower redundancy or after multiple failures, some experts may only have existed on the dead GPU. These must be reloaded from the model checkpoint on disk.

# GPU worker 2 crashes (was the only holder of experts 42, 57)
[Fault Tolerance] Engine (dp_rank=2) died. Initiating scale-down from 3 to 2 engines.
[Fault Tolerance] 2 missing experts detected — reassigning to redundant slots.
[Fault Tolerance] Reloading 2 unreachable experts from disk checkpoint.
[Fault Tolerance] Scale-down complete. Now running with 2 engines.

POST /v1/completions → 200 OK   (still serving!)

Why fault-triggered scale-down needs custom logic

This PR reuses the existing elastic scale-down path. The only differences vs normal (graceful) scale-down are handling a dead GPU that (1) cannot join NCCL collectives and (2) cannot participate in the pre-scale-down EPLB weight reshuffle.

In normal (graceful) elastic EP scale-down, the removing engine is alive and cooperates. In fault-triggered scale-down, it's already dead. Every distributed operation requires all group members to participate — when one is dead, it hangs forever. This PR replaces each cooperative step with a non-cooperative alternative:

StepNormalFault-triggeredWhy
NCCL teardowndestroy() (orderly, all ranks)abort() (unilateral, no waiting)destroy() hangs waiting for dead rank
Gloo barriertorch.distributed.barrier() after TCP store syncSkipped — TCP store barrier onlyGloo requires all ranks
Barrier countold_dp_group.size()_alive_group_size()Dead rank won't write arrival key
Barrier cleanupRank 0 deletes keys_first_alive_rank() deletes keysRank 0 may be dead
EPLB weight transferNCCL P2P between all ranks (pre-scale-down reshuffle)Skipped; post-hoc reassignment via reassign_missing_experts() + disk reload fallback. Not load-balanced — only ensures at least 1 replica per expert.Dead rank can't send/recv
SHUTDOWN_COMPLETERemoving engine sends after reshuffle_synthesize_shutdown_complete_for_dead_ranks() injects itDead engine can't notify
Standby group ranksContiguous [0..N-1]Non-contiguous world ranks (e.g. [0,2,3]), compacted to contiguous rank_in_group for NCCL; ZMQ keeps original world rankSurviving ranks keep identity for socket routing
Expert map_expert_map stays valid (rank identity unchanged, highest ranks removed)Must call update_expert_map() — middle-rank removal changes ep_rank for compacted ranks, making the old map invalidRank compaction changes ep_rank for some surviving ranks
Client routingTrim core_engines after completion_dead_engine_identities blocks routing immediately; abort callback sends FinishReason.ABORT to in-flight requestsRequests on dead engine would hang
ReconfigureSent to all enginesSkip dead engineWould timeout

Recovery timing breakdown

All steps run sequentially on each worker. Workers run in parallel across GPUs via collective_rpc.

From real E2E runs (4 to 3 scale-down, DeepSeek-V2-Lite, 4xA100):

StepTime%
Abort NCCL groups2.0s8%
Create standby groups1.8s7%
MoE reconfig + expert reassign0.25s1%
CUDA graph recapture (51 captures)21.7s83%
Total~26s

The bottleneck is CUDA graph recapture: 51 captures at vLLM's default batch sizes [1..512]. Large batches (256-512) take ~1.1s each due to MoE all2all traffic; small batches (1-248) take ~0.2s each.

Expert reassignment time depends on whether disk reload is needed:

  • No disk reload (redundant replicas cover lost experts): 0.04s
  • Disk reload (tail-rank death, total replica loss): 3.81s

Usage

python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V2-Lite \
    --data-parallel-size 4 \
    --data-parallel-backend ray \
    --enable-expert-parallel \
    --enable-elastic-ep \
    --enable-eplb \
    --eplb-config '{"num_redundant_experts": 64}' \
    --enable-ep-fault-tolerance

Requirements:

  • --data-parallel-backend ray (multiprocessing backend not yet supported)
  • --tensor-parallel-size 1 (TP>1 not yet supported)
  • --enable-elastic-ep must be on (fault tolerance builds on elastic EP infrastructure)
  • --enable-eplb with num_redundant_experts > 0 (redundancy enables expert recovery without always hitting disk)
  • No dependency on FT EP kernels or FT Torch collectives

Files changed

FileWhat it does
vllm/config/parallel.pyNew --enable-ep-fault-tolerance flag and validation
vllm/v1/engine/core_client.pyDetect worker death, orchestrate scale-down, clean up in-flight requests
vllm/v1/engine/utils.pyWorker process health monitoring
vllm/distributed/elastic_ep/elastic_state.pyState machine for coordinating reconfiguration across workers
vllm/distributed/elastic_ep/elastic_execute.pyExpert reassignment, weight transfer, and disk reload logic
vllm/distributed/elastic_ep/standby_state.pyCreate new communication groups with only surviving workers
tests/v1/engine/test_ep_fault_tolerance.py~1300 lines of unit tests

Test plan

Unit tests (tests/v1/engine/test_ep_fault_tolerance.py)

  • Parse DP rank from process name
  • Config flag validation
  • Fault detection and deduplication
  • In-flight request cleanup on worker death
  • Abort callback registration
  • Two-staged barrier with timeout handling
  • Rank compaction after scale-down
  • Expert reassignment (missing to redundant slots)
  • Expert assignment table compaction
  • Surviving rank computation
  • Worker liveness monitoring

E2E fault injection (Kubernetes, 4xA100)

Model: deepseek-ai/DeepSeek-V2-Lite, DP=4, TP=1, 64 redundant experts

Run 1: kill 1, 1, 1 (no disk reload)

StageActionTotalReassignResult
0Startup 4 workersInference works
1Kill rank 1 (4 to 3)26.3s0.04sInference works
2Kill rank 1 (3 to 2)24.3s0.03sInference works
3Kill rank 1 (2 to 1)ep_size=1 not supported

Run 2: kill 1, 2, 1 (disk reload triggered at stage 2)

StageActionTotalReassignResult
0Startup 4 workersInference works
1Kill rank 1 (4 to 3)26.7s0.04sInference works
2Kill rank 2 (3 to 2)27.4s3.81s (disk)Inference works
3Kill rank 1 (2 to 1)ep_size=1 not supported

Run 2 stage 2: tail-rank death causes total replica loss for some experts, triggering disk reload (3.81s).

Known limitations

  • Requires tensor_parallel_size=1 (TP>1 support is future work)
  • Requires Ray backend (data_parallel_backend="ray")
  • Scale-up (adding workers back after recovery) not yet implemented
  • Post-fault expert reassignment is not load-balanced (only ensures at least 1 replica per expert; future EPLB cycles will optimize)
  • 2 to 1 scale-down not supported (FusedMoE requires ep_size >= 2)

AI assistance was used in developing this PR. Verified no existing open PR addresses EP fault tolerance.

Changed files

  • tests/v1/engine/test_ep_fault_tolerance.py (added, +1283/-0)
  • vllm/config/parallel.py (modified, +32/-2)
  • vllm/distributed/device_communicators/all2all.py (modified, +9/-0)
  • vllm/distributed/device_communicators/base_device_communicator.py (modified, +18/-0)
  • vllm/distributed/device_communicators/cuda_communicator.py (modified, +23/-0)
  • vllm/distributed/device_communicators/flashinfer_all_reduce.py (modified, +9/-0)
  • vllm/distributed/device_communicators/pynccl.py (modified, +13/-0)
  • vllm/distributed/device_communicators/pynccl_wrapper.py (modified, +5/-0)
  • vllm/distributed/elastic_ep/elastic_execute.py (modified, +368/-7)
  • vllm/distributed/elastic_ep/elastic_state.py (modified, +114/-25)
  • vllm/distributed/elastic_ep/standby_state.py (modified, +51/-8)
  • vllm/distributed/parallel_state.py (modified, +73/-0)
  • vllm/distributed/stateless_coordinator.py (modified, +17/-0)
  • vllm/distributed/utils.py (modified, +12/-0)
  • vllm/engine/arg_utils.py (modified, +6/-0)
  • vllm/model_executor/layers/fused_moe/layer.py (modified, +25/-1)
  • vllm/v1/engine/__init__.py (modified, +1/-0)
  • vllm/v1/engine/async_llm.py (modified, +6/-0)
  • vllm/v1/engine/coordinator.py (modified, +18/-2)
  • vllm/v1/engine/core.py (modified, +4/-1)
  • vllm/v1/engine/core_client.py (modified, +409/-23)
  • vllm/v1/engine/utils.py (modified, +155/-21)

Code Example

query_mask_buffer(...)     # enqueues a CUDA copy kernel
active_ranks.copy_(...)    # still a GPU tensor
torch.equal(active_ranks, ...)  # blocks — drains the CUDA stream, crosses PCIe

---

// In dispatch / combine kernel, when timeout occurs:
atomicExch(mask_buffer_ptr + src_rank, 1);    // existing: per-rank mask in VRAM
st_release_sys_global(dirty_flag_pinned, 1);  // new: single bit written to CPU RAM

---

Option A — inside All2AllManagerBase.dispatch() [recommended]:

  MoE layer N:    dispatch → experts → combine  ← failure detected
  MoE layer N+1:  dispatch ← check here → raise EPRankFailureError
                             stops here; remaining layers never run

Option B — after model.forward() returns:

  MoE layer N:    dispatch → experts → combine  ← failure detected
  MoE layer N+1:  dispatch → experts → combine  (runs anyway, wasted)
  ...
  model.forward() returns → check dirty_flag → handle

---

import ctypes
dirty_flag_raw = ctypes.cast(dirty_flag.data_ptr(), ctypes.POINTER(ctypes.c_int32))
# dirty_flag_raw[0] — direct C memory access, no Python boxing

---

All2AllManagerBase.dispatch()     ← raises EPRankFailureError
  GroupCoordinator.dispatch()
    DefaultMoERunner._maybe_dispatch()
      model.forward()
        model_runner.execute_model()
          worker.execute_model()  ← catches; self.elastic_ep_executor is here

---

# All2AllManagerBase.dispatch() — called once per MoE layer
def dispatch(self, hidden_states, topk_weights, topk_ids, ...):
    if dirty_flag_raw[0] != 0:                     # cheap read from CPU RAM
        torch.cuda.current_stream().synchronize()  # sync once, only on failure
        buffer.query_mask_buffer(mask_cpu)         # stream is drained; read full mask
        failed_ranks = mask_cpu.nonzero().tolist()
        dirty_flag_raw[0] = 0
        raise EPRankFailureError(failed_ranks)
    # ... normal dispatch logic

# worker.execute_model() — the catch boundary
def execute_model(self, scheduler_output):
    try:
        output = self.model_runner.execute_model(scheduler_output, ...)
    except EPRankFailureError as e:
        send_zmq_failure_notification(e.failed_ranks)   # new ZMQ message type

---

CUDA stream:  [dispatch kernel][combine kernel][host callback][next kernels ...]
Main thread:  submits work ─────────────────────────────────────────→ prepares next batch
Background:                        failure detected    sends ZMQ message → done

---

mask_cpu = torch.zeros(num_ranks, dtype=torch.int32).pin_memory()

# After combine kernel, enqueue in order:
buffer.query_mask_buffer(mask_cpu)            # (1) CUDA kernel: VRAM → pinned CPU RAM
cudaLaunchHostFunc(stream, on_combine_done)   # (2) fires after (1) completes

---

Engine Core Client (core_client.py)   ←——— ZMQ ———→   GPU Worker Processes
  asyncio event loop                                     execute_model()
  abort_requests_async()                                 dispatch / combine kernels
  _scale_down_elastic_ep()                               mask_buffer_ptr

---

# Worker process — CUDA background thread
def on_combine_done():
    failed_ranks = [i for i in range(num_ranks) if mask_cpu[i] != 0]
    if failed_ranks:
        zmq_socket.send_pyobj(("GPU_MASK_FAILURE", failed_ranks))  # non-blocking

# Engine core client — asyncio event loop, next iteration
async def _handle_gpu_mask_failure(failed_ranks: list[int]):
    await self._abort_requests(inflight_for(failed_ranks))
    await self._scale_down_elastic_ep(
        cur_data_parallel_size=len(self.core_engines),
        new_data_parallel_size=len(self.core_engines) - len(failed_ranks),
        dead_dp_ranks=failed_ranks,
    )
RAW_BUFFERClick to expand / collapse

[RFC]: Async Failure Notification for Fault Tolerant EP Kernels

<!-- Paste this as a GitHub issue body on https://github.com/vllm-project/vllm/issues/new -->

Motivation.

Fault-tolerant EP kernels (e.g. NIXL-EP) detect rank failures on the GPU — via a per-rank integer mask written atomically on timeout — without any CPU involvement. The CPU only learns about a failure if it explicitly reads that mask back from VRAM, which forces a GPU→CPU sync.

The problem is not detection speed. The problem is when and how often that sync is paid.

When SchedulerConfig.async_scheduling = True, vLLM pipelines CPU scheduling with GPU execution: the CPU schedules pass N+1 while the GPU runs pass N. A blocking GPU→CPU sync collapses that overlap and eliminates the throughput benefit of async scheduling — even during healthy operation when no rank has failed.

The natural "check after every forward pass" approach pays this cost on every pass:

query_mask_buffer(...)     # enqueues a CUDA copy kernel
active_ranks.copy_(...)    # still a GPU tensor
torch.equal(active_ranks, ...)  # blocks — drains the CUDA stream, crosses PCIe

As the NIXL team noted:

"Raising a formal exception requires reading the failure mask back to the CPU. This synchronization point may introduce the very latency we are trying to avoid by using async runs."

This RFC proposes wiring GPU-side mask detection into vLLM in a way that pays the sync cost only when a failure actually occurs, not on every pass.


Proposed Change.

Two designs are proposed.


Design 1 — Pinned Dirty Flag

The GPU kernel writes a single integer — dirty_flag — in pinned (page-locked) CPU memory whenever a rank fails. Pinned memory has a fixed physical address, so the GPU can write to it directly over PCIe without any CPU involvement or staging. The CPU reads it for free since it is already in its own RAM.

One store is added to the EP dispatch/combine kernels alongside the existing atomicExch:

// In dispatch / combine kernel, when timeout occurs:
atomicExch(mask_buffer_ptr + src_rank, 1);    // existing: per-rank mask in VRAM
st_release_sys_global(dirty_flag_pinned, 1);  // new: single bit written to CPU RAM
Where to Check the Dirty Flag

The check can be placed at two points. The placement determines detection lag and how much compute is wasted on failure:

Option A — inside All2AllManagerBase.dispatch() [recommended]:

  MoE layer N:    dispatch → experts → combine  ← failure detected
  MoE layer N+1:  dispatch ← check here → raise EPRankFailureError
                             stops here; remaining layers never run

Option B — after model.forward() returns:

  MoE layer N:    dispatch → experts → combine  ← failure detected
  MoE layer N+1:  dispatch → experts → combine  (runs anyway, wasted)
  ...
  model.forward() returns → check dirty_flag → handle
Option A: check in dispatch()Option B: check at pass boundary
Detection lag≤ one MoE layerRemainder of current pass
Wasted compute on failureOne MoE layerAll remaining MoE layers
Exception call stackNatural unwind to execute_model()No active call stack

Option A is strictly better on all three dimensions.

Hot-Path Cost

The check runs once per MoE layer. Accessing the flag via ctypes avoids Python tensor boxing overhead and keeps the per-check cost minimal — a direct read from CPU RAM:

import ctypes
dirty_flag_raw = ctypes.cast(dirty_flag.data_ptr(), ctypes.POINTER(ctypes.c_int32))
# dirty_flag_raw[0] — direct C memory access, no Python boxing

ctypes is significantly cheaper than PyTorch tensor indexing (dirty_flag[0]), which incurs Python object overhead on every call. Since failures are rare, the branch predictor predicts "not taken" almost perfectly. The misprediction cost is paid only on the one check that actually fires.

Exception Flow

When dirty_flag_raw[0] != 0, the handler does a one-time stream sync to drain the GPU, reads the full mask to identify failed ranks, resets the flag, and raises EPRankFailureError. The exception unwinds the call stack naturally:

All2AllManagerBase.dispatch()     ← raises EPRankFailureError
  GroupCoordinator.dispatch()
    DefaultMoERunner._maybe_dispatch()
      model.forward()
        model_runner.execute_model()
          worker.execute_model()  ← catches; self.elastic_ep_executor is here
# All2AllManagerBase.dispatch() — called once per MoE layer
def dispatch(self, hidden_states, topk_weights, topk_ids, ...):
    if dirty_flag_raw[0] != 0:                     # cheap read from CPU RAM
        torch.cuda.current_stream().synchronize()  # sync once, only on failure
        buffer.query_mask_buffer(mask_cpu)         # stream is drained; read full mask
        failed_ranks = mask_cpu.nonzero().tolist()
        dirty_flag_raw[0] = 0
        raise EPRankFailureError(failed_ranks)
    # ... normal dispatch logic

# worker.execute_model() — the catch boundary
def execute_model(self, scheduler_output):
    try:
        output = self.model_runner.execute_model(scheduler_output, ...)
    except EPRankFailureError as e:
        send_zmq_failure_notification(e.failed_ranks)   # new ZMQ message type

The mask readback is deferred to the moment a failure actually happens, paid only once, not on every dispatch call.

Properties:

Hot-path cost (healthy)Negligible — one integer read from CPU RAM (ctypes)
Detection lag≤ one MoE layer
Mask readback syncstream.synchronize() once, only on failure
Pinned memory requiredOne int32 (dirty_flag)
Exception neededYes — to unwind the forward pass call stack
C++ requiredNo
GPU kernel changeYes — one additional store on the timeout path

Design 2 — cudaLaunchHostFunc + ZMQ Notification

CUDA's cudaLaunchHostFunc places a CPU function onto a CUDA stream. It executes on a CUDA-managed background thread after the preceding GPU work completes, entirely independent of the main thread:

CUDA stream:  [dispatch kernel] → [combine kernel] → [host callback] → [next kernels ...]
Main thread:  submits work ─────────────────────────────────────────→ prepares next batch
Background:                        failure detected    sends ZMQ message → done

The main thread submits GPU work and immediately moves to the next batch. The callback fires in the background, checks the mask, and sends a ZMQ failure notification to the engine core. No exception is raised — the background thread has no forward pass call stack to unwind.

Pinned Memory Requirement

Design 2 requires a full pinned mask_cpu array (not just a single integer). The background CUDA thread cannot call CUDA APIs or touch VRAM, so the mask must already be in CPU RAM when the callback fires. This is done by queuing query_mask_buffer into the stream before the host callback:

mask_cpu = torch.zeros(num_ranks, dtype=torch.int32).pin_memory()

# After combine kernel, enqueue in order:
buffer.query_mask_buffer(mask_cpu)            # (1) CUDA kernel: VRAM → pinned CPU RAM
cudaLaunchHostFunc(stream, on_combine_done)   # (2) fires after (1) completes

PyTorch does not expose cudaLaunchHostFunc in Python, so a small C++ extension is required.

Notification Path

vLLM separates GPU workers from the engine core across processes:

Engine Core Client (core_client.py)   ←——— ZMQ ———→   GPU Worker Processes
  asyncio event loop                                     execute_model()
  abort_requests_async()                                 dispatch / combine kernels
  _scale_down_elastic_ep()                               mask_buffer_ptr

The background CUDA thread cannot reach the engine core's asyncio loop directly. It sends a ZMQ message, which the engine core receives on its next event loop iteration:

# Worker process — CUDA background thread
def on_combine_done():
    failed_ranks = [i for i in range(num_ranks) if mask_cpu[i] != 0]
    if failed_ranks:
        zmq_socket.send_pyobj(("GPU_MASK_FAILURE", failed_ranks))  # non-blocking

# Engine core client — asyncio event loop, next iteration
async def _handle_gpu_mask_failure(failed_ranks: list[int]):
    await self._abort_requests(inflight_for(failed_ranks))
    await self._scale_down_elastic_ep(
        cur_data_parallel_size=len(self.core_engines),
        new_data_parallel_size=len(self.core_engines) - len(failed_ranks),
        dead_dp_ranks=failed_ranks,
    )

asyncio is cooperative, not preemptive. The background thread signals and returns immediately; the engine picks up the message at the next iteration boundary after the forward pass completes. Interrupting mid-pass is intentionally avoided — the GPU kernels already masked the failed rank and continued, so the current pass output is valid.

Properties:

Hot-path cost (healthy)Zero — background thread only
Detection lagSub-pass (fires when combine kernel finishes)
Mask readback syncNone — GPU wrote to pinned CPU RAM in-stream
Pinned memory requiredFull mask_cpu array (num_ranks × int32)
Exception neededNo — background thread communicates via ZMQ
C++ requiredYes (cudaLaunchHostFunc binding)
GPU kernel changeNo — mask is already written

Design Comparison

Design 1: Dirty FlagDesign 2: cudaLaunchHostFunc
Hot-path overhead (healthy)Negligible — one integer read from CPU RAMZero
Detection lag≤ one MoE layerSub-pass
Pinned memory requiredOne int32Full mask_cpu array
Mask readback syncstream.synchronize() once on failureNone
Exception neededYes — unwinds call stackNo — ZMQ sideways
C++ requiredNoYes
EffortLow — pure Python + small kernel changeMedium — C++ extension needed

The per-dispatch overhead of Design 1 is negligible in production. Design 1 is the pragmatic first step; Design 2 is the preferred end state for zero hot-path cost.


Recovery Path (Reusing PR #38862)

Once the engine core receives a failure notification, the same three recovery actions from PR #38862 follow:

StepActionEntry point
1. Client notificationSend FinishReason.ABORT to in-flight requestsabort_requests_async() in core_client.py
2. DP routing updateStop routing new requests to the failed engineself.core_engines trim in _scale_down_elastic_ep()
3. Expert redistributionReassign experts; reload from disk if neededelastic_execute.pyreassign_missing_experts()

No new recovery logic is needed. Both designs plug into the existing machinery as an earlier trigger — GPU mask detection fires within one forward pass, while PR #38862's process monitor fires only after the OS confirms a process crash (potentially seconds later).


Any Other Things.

Design Decisions

1. API boundary for dirty_flag_ptr

The dirty flag address must be passed to the kernel buffer at initialization. This should be an internal implementation detail of each All2AllManagerBase subclass — not part of a public kernel buffer API — so different subclasses can adopt the flag independently without requiring an API contract change.

2. C++ binding location for Design 2

The cudaLaunchHostFunc binding should live in the kernel buffer library (e.g. as a set_failure_callback API on nixl_ep.Buffer) rather than in the vLLM repo. This avoids duplicating the binding in each downstream consumer.

3. New exception type: EPRankFailureError(RuntimeError)

A new exception class EPRankFailureError(RuntimeError) in vllm/distributed/elastic_ep/ should be created, following the pattern of _BarrierTimeoutError(RuntimeError) in elastic_state.py:73. It should carry failed_ranks: list[int] to identify which EP slots failed.

4. Dirty flag reset timing

The CPU resets dirty_flag_raw[0] = 0 after reading the full mask_buffer_ptr, before reconnect_peer completes. This is safe because connect_ranks clears the GPU-side mask_buffer_ptr entries for reconnected ranks — the next pass will see a clean mask for the recovered slot. A CPU-side reset is sufficient; no kernel change is needed for reset.

5. Multiple simultaneous failures

A single dirty_flag integer coalesces multiple failures in one pass into a single detection event. This is acceptable because the one-time stream sync that follows reads the full mask_buffer_ptr to enumerate all failed rank indices. The recovery path in PR #38862 already handles a list of failed ranks.

CC List.

@itayalroy @fangyuchu @tylermsmith @sagemoore

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

Implementing a dirty flag or using cudaLaunchHostFunc can help reduce the sync cost associated with GPU-side mask detection in fault-tolerant EP kernels.

Guidance

  • To reduce the sync cost, consider implementing a dirty flag that allows the CPU to check for failures without blocking the GPU stream.
  • Use ctypes to access the dirty flag from CPU RAM, minimizing Python object overhead.
  • Place the dirty flag check at the beginning of All2AllManagerBase.dispatch() to minimize detection lag and wasted compute.
  • For a more efficient solution, consider using cudaLaunchHostFunc to execute a CPU function on a CUDA stream, but this requires a C++ extension.

Example

import ctypes
dirty_flag_raw = ctypes.cast(dirty_flag.data_ptr(), ctypes.POINTER(ctypes.c_int32))
if dirty_flag_raw[0] != 0:
    # Handle failure
    pass

Notes

The choice between the dirty flag and cudaLaunchHostFunc approaches depends on the specific requirements and constraints of the project. The dirty flag approach is simpler and more straightforward, while cudaLaunchHostFunc provides a more efficient solution but requires a C++ extension.

Recommendation

Apply the dirty flag workaround, as it is a more straightforward and easier-to-implement solution that can provide a significant reduction in sync cost.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING