vllm - ✅(Solved) Fix [RFC]: Async Failure Notification for Fault Tolerant EP Kernels [1 pull requests, 2 comments, 3 participants]

vllm2026-04-14 03:00:53

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

vllm-project/vllm#39760•Fetched 2026-04-16 06:36:55

View on GitHub

Comments

Participants

Timeline

Reactions

Author

Participants

Timeline (top)

mentioned ×6subscribed ×6commented ×2renamed ×1

Error Message

"Raising a formal exception requires reading the failure mask back to the CPU. | Exception call stack | Natural unwind to execute_model() | No active call stack |

Exception Flow

EPRankFailureError. The exception unwinds the call stack naturally: | Exception needed | Yes — to unwind the forward pass call stack | engine core. No exception is raised — the background thread has no forward pass call | Exception needed | No — background thread communicates via ZMQ | | Exception needed | Yes — unwinds call stack | No — ZMQ sideways | 3. New exception type: EPRankFailureError(RuntimeError) A new exception class EPRankFailureError(RuntimeError) in

Root Cause

The CPU resets dirty_flag_raw[0] = 0 after reading the full mask_buffer_ptr, before reconnect_peer completes. This is safe because connect_ranks clears the GPU-side mask_buffer_ptr entries for reconnected ranks — the next pass will see a clean mask for the recovered slot. A CPU-side reset is sufficient; no kernel change is needed for reset.

Fix Action

Fix / Workaround

One store is added to the EP dispatch/combine kernels alongside the existing atomicExch:

// In dispatch / combine kernel, when timeout occurs:
atomicExch(mask_buffer_ptr + src_rank, 1);    // existing: per-rank mask in VRAM
st_release_sys_global(dirty_flag_pinned, 1);  // new: single bit written to CPU RAM

Option A — inside All2AllManagerBase.dispatch() [recommended]:

PR fix notes

PR #38862: [EP] Fault tolerance: automatic elastic scale-down on DP engine death

Repository: vllm-project/vllm
Author: tzulingk
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/38862

Description (problem / solution / changelog)

What problem does this solve?

When running large MoE models (like DeepSeek) with Expert Parallelism across multiple GPUs, if any single GPU worker crashes, the entire serving cluster goes down. This means one bad GPU can take out your whole deployment.

This PR adds automatic fault tolerance: when a GPU worker dies, the system detects it, redistributes the work across surviving GPUs, and resumes serving — all without operator intervention or downtime beyond the recovery window (~25 seconds).

Important: This PR has no dependency on FT EP kernels (e.g., DeepEP low-latency with fault masking, NIXL EP) or FT Torch collectives. It works with the default allgather_reducescatter NCCL backend using standard NCCL operations.

How it works (plain English)

Background context: In Expert Parallelism (EP), a large MoE model's experts are split across multiple GPUs. Each GPU holds a subset of experts. With EPLB (Expert-Parallel Load Balancing), some experts are intentionally duplicated across GPUs for redundancy.

When a GPU worker dies, here's what happens:

Detection — A monitor thread watches all GPU worker processes. When one dies, it identifies which worker and deduplicates notifications (Ray may report the same death multiple times).
Feasibility check — Before attempting recovery, verify that the remaining GPUs have enough expert slots to cover all the model's logical experts. If not, shut down gracefully.
Cleanup — Abort the old communication groups (NCCL) that included the dead worker — otherwise surviving workers would hang waiting for the dead one. Fail any in-flight requests that were assigned to the dead worker.
Reconfiguration — Create new communication groups with only the surviving workers. This uses a two-staged barrier to handle the timing issue where some workers may be mid-computation when the fault is detected.
Expert recovery — This is the core logic:
- Remove the dead worker's columns from the expert assignment table (strip_dead_columns)
- Find which experts lost all their copies (no surviving replica)
- Find which experts still have extra copies (redundant replicas)
- Reassign the lost experts into those redundant slots (reassign_missing_experts)
- For experts that still have a copy somewhere: transfer weights between GPUs via direct GPU-to-GPU communication (NCCL P2P)
- For experts where every copy was on the dead GPU (no surviving replica anywhere): reload those expert weights from the original model checkpoint on disk
- Note: The post-fault expert reassignment is not load-balanced. It only ensures every logical expert has at least 1 physical replica by stealing from the most-redundant slots. Future EPLB rebalancing cycles will optimize the placement for load.
Resume — Re-capture CUDA graphs with the new configuration and resume normal inference.

Example: what it looks like in practice

Case 1: All experts have a surviving replica (common with high redundancy)

When enough redundant experts are configured, the dead worker's experts likely still have copies on other GPUs. Recovery only needs GPU-to-GPU weight transfer — no disk I/O.

# GPU worker 1 crashes
[Fault Tolerance] Engine (dp_rank=1) died. Initiating scale-down from 4 to 3 engines.
[Fault Tolerance] All 64 logical experts have at least one replica — no reassignment needed.
[Fault Tolerance] Scale-down complete. Now running with 3 engines.

POST /v1/completions → 200 OK   (still serving!)

Case 2: Some experts have no surviving replica (requires disk reload)

With lower redundancy or after multiple failures, some experts may only have existed on the dead GPU. These must be reloaded from the model checkpoint on disk.

# GPU worker 2 crashes (was the only holder of experts 42, 57)
[Fault Tolerance] Engine (dp_rank=2) died. Initiating scale-down from 3 to 2 engines.
[Fault Tolerance] 2 missing experts detected — reassigning to redundant slots.
[Fault Tolerance] Reloading 2 unreachable experts from disk checkpoint.
[Fault Tolerance] Scale-down complete. Now running with 2 engines.

POST /v1/completions → 200 OK   (still serving!)

Why fault-triggered scale-down needs custom logic

This PR reuses the existing elastic scale-down path. The only differences vs normal (graceful) scale-down are handling a dead GPU that (1) cannot join NCCL collectives and (2) cannot participate in the pre-scale-down EPLB weight reshuffle.

In normal (graceful) elastic EP scale-down, the removing engine is alive and cooperates. In fault-triggered scale-down, it's already dead. Every distributed operation requires all group members to participate — when one is dead, it hangs forever. This PR replaces each cooperative step with a non-cooperative alternative:

Step	Normal	Fault-triggered	Why
NCCL teardown	`destroy()` (orderly, all ranks)	`abort()` (unilateral, no waiting)	`destroy()` hangs waiting for dead rank
Gloo barrier	`torch.distributed.barrier()` after TCP store sync	Skipped — TCP store barrier only	Gloo requires all ranks
Barrier count	`old_dp_group.size()`	`_alive_group_size()`	Dead rank won't write arrival key
Barrier cleanup	Rank 0 deletes keys	`_first_alive_rank()` deletes keys	Rank 0 may be dead
EPLB weight transfer	NCCL P2P between all ranks (pre-scale-down reshuffle)	Skipped; post-hoc reassignment via `reassign_missing_experts()` + disk reload fallback. Not load-balanced — only ensures at least 1 replica per expert.	Dead rank can't send/recv
SHUTDOWN_COMPLETE	Removing engine sends after reshuffle	`_synthesize_shutdown_complete_for_dead_ranks()` injects it	Dead engine can't notify
Standby group ranks	Contiguous `[0..N-1]`	Non-contiguous world ranks (e.g. `[0,2,3]`), compacted to contiguous `rank_in_group` for NCCL; ZMQ keeps original world rank	Surviving ranks keep identity for socket routing
Expert map	`_expert_map` stays valid (rank identity unchanged, highest ranks removed)	Must call `update_expert_map()` — middle-rank removal changes `ep_rank` for compacted ranks, making the old map invalid	Rank compaction changes `ep_rank` for some surviving ranks
Client routing	Trim `core_engines` after completion	`_dead_engine_identities` blocks routing immediately; abort callback sends `FinishReason.ABORT` to in-flight requests	Requests on dead engine would hang
Reconfigure	Sent to all engines	Skip dead engine	Would timeout

Recovery timing breakdown

All steps run sequentially on each worker. Workers run in parallel across GPUs via collective_rpc.

From real E2E runs (4 to 3 scale-down, DeepSeek-V2-Lite, 4xA100):

Step	Time	%
Abort NCCL groups	2.0s	8%
Create standby groups	1.8s	7%
MoE reconfig + expert reassign	0.25s	1%
CUDA graph recapture (51 captures)	21.7s	83%
Total	~26s

The bottleneck is CUDA graph recapture: 51 captures at vLLM's default batch sizes [1..512]. Large batches (256-512) take ~1.1s each due to MoE all2all traffic; small batches (1-248) take ~0.2s each.

Expert reassignment time depends on whether disk reload is needed:

No disk reload (redundant replicas cover lost experts): 0.04s
Disk reload (tail-rank death, total replica loss): 3.81s

Usage

python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V2-Lite \
    --data-parallel-size 4 \
    --data-parallel-backend ray \
    --enable-expert-parallel \
    --enable-elastic-ep \
    --enable-eplb \
    --eplb-config '{"num_redundant_experts": 64}' \
    --enable-ep-fault-tolerance

Requirements:

--data-parallel-backend ray (multiprocessing backend not yet supported)
--tensor-parallel-size 1 (TP>1 not yet supported)
--enable-elastic-ep must be on (fault tolerance builds on elastic EP infrastructure)
--enable-eplb with num_redundant_experts > 0 (redundancy enables expert recovery without always hitting disk)
No dependency on FT EP kernels or FT Torch collectives

Files changed

File	What it does
`vllm/config/parallel.py`	New `--enable-ep-fault-tolerance` flag and validation
`vllm/v1/engine/core_client.py`	Detect worker death, orchestrate scale-down, clean up in-flight requests
`vllm/v1/engine/utils.py`	Worker process health monitoring
`vllm/distributed/elastic_ep/elastic_state.py`	State machine for coordinating reconfiguration across workers
`vllm/distributed/elastic_ep/elastic_execute.py`	Expert reassignment, weight transfer, and disk reload logic
`vllm/distributed/elastic_ep/standby_state.py`	Create new communication groups with only surviving workers
`tests/v1/engine/test_ep_fault_tolerance.py`	~1300 lines of unit tests

Test plan

Unit tests (`tests/v1/engine/test_ep_fault_tolerance.py`)

E2E fault injection (Kubernetes, 4xA100)

Model: deepseek-ai/DeepSeek-V2-Lite, DP=4, TP=1, 64 redundant experts

Run 1: kill 1, 1, 1 (no disk reload)

Stage	Action	Total	Reassign	Result
0	Startup 4 workers	—	—	Inference works
1	Kill rank 1 (4 to 3)	26.3s	0.04s	Inference works
2	Kill rank 1 (3 to 2)	24.3s	0.03s	Inference works
3	Kill rank 1 (2 to 1)	—	—	ep_size=1 not supported

Run 2: kill 1, 2, 1 (disk reload triggered at stage 2)

Stage	Action	Total	Reassign	Result
0	Startup 4 workers	—	—	Inference works
1	Kill rank 1 (4 to 3)	26.7s	0.04s	Inference works
2	Kill rank 2 (3 to 2)	27.4s	3.81s (disk)	Inference works
3	Kill rank 1 (2 to 1)	—	—	ep_size=1 not supported

Run 2 stage 2: tail-rank death causes total replica loss for some experts, triggering disk reload (3.81s).

Known limitations

Requires tensor_parallel_size=1 (TP>1 support is future work)
Requires Ray backend (data_parallel_backend="ray")
Scale-up (adding workers back after recovery) not yet implemented
Post-fault expert reassignment is not load-balanced (only ensures at least 1 replica per expert; future EPLB cycles will optimize)
2 to 1 scale-down not supported (FusedMoE requires ep_size >= 2)

AI assistance was used in developing this PR. Verified no existing open PR addresses EP fault tolerance.

Changed files

tests/v1/engine/test_ep_fault_tolerance.py (added, +1283/-0)
vllm/config/parallel.py (modified, +32/-2)
vllm/distributed/device_communicators/all2all.py (modified, +9/-0)
vllm/distributed/device_communicators/base_device_communicator.py (modified, +18/-0)
vllm/distributed/device_communicators/cuda_communicator.py (modified, +23/-0)
vllm/distributed/device_communicators/flashinfer_all_reduce.py (modified, +9/-0)
vllm/distributed/device_communicators/pynccl.py (modified, +13/-0)
vllm/distributed/device_communicators/pynccl_wrapper.py (modified, +5/-0)
vllm/distributed/elastic_ep/elastic_execute.py (modified, +368/-7)
vllm/distributed/elastic_ep/elastic_state.py (modified, +114/-25)
vllm/distributed/elastic_ep/standby_state.py (modified, +51/-8)
vllm/distributed/parallel_state.py (modified, +73/-0)
vllm/distributed/stateless_coordinator.py (modified, +17/-0)
vllm/distributed/utils.py (modified, +12/-0)
vllm/engine/arg_utils.py (modified, +6/-0)
vllm/model_executor/layers/fused_moe/layer.py (modified, +25/-1)
vllm/v1/engine/__init__.py (modified, +1/-0)
vllm/v1/engine/async_llm.py (modified, +6/-0)
vllm/v1/engine/coordinator.py (modified, +18/-2)
vllm/v1/engine/core.py (modified, +4/-1)
vllm/v1/engine/core_client.py (modified, +409/-23)
vllm/v1/engine/utils.py (modified, +155/-21)

Code Example

query_mask_buffer(...)     # enqueues a CUDA copy kernel
active_ranks.copy_(...)    # still a GPU tensor
torch.equal(active_ranks, ...)  # blocks — drains the CUDA stream, crosses PCIe

---

// In dispatch / combine kernel, when timeout occurs:
atomicExch(mask_buffer_ptr + src_rank, 1);    // existing: per-rank mask in VRAM
st_release_sys_global(dirty_flag_pinned, 1);  // new: single bit written to CPU RAM

---

Option A — inside All2AllManagerBase.dispatch() [recommended]:

  MoE layer N:    dispatch → experts → combine  ← failure detected
  MoE layer N+1:  dispatch ← check here → raise EPRankFailureError
                             stops here; remaining layers never run

Option B — after model.forward() returns:

  MoE layer N:    dispatch → experts → combine  ← failure detected
  MoE layer N+1:  dispatch → experts → combine  (runs anyway, wasted)
  ...
  model.forward() returns → check dirty_flag → handle

---

import ctypes
dirty_flag_raw = ctypes.cast(dirty_flag.data_ptr(), ctypes.POINTER(ctypes.c_int32))
# dirty_flag_raw[0] — direct C memory access, no Python boxing

---

All2AllManagerBase.dispatch()     ← raises EPRankFailureError
  GroupCoordinator.dispatch()
    DefaultMoERunner._maybe_dispatch()
      model.forward()
        model_runner.execute_model()
          worker.execute_model()  ← catches; self.elastic_ep_executor is here

---

# All2AllManagerBase.dispatch() — called once per MoE layer
def dispatch(self, hidden_states, topk_weights, topk_ids, ...):
    if dirty_flag_raw[0] != 0:                     # cheap read from CPU RAM
        torch.cuda.current_stream().synchronize()  # sync once, only on failure
        buffer.query_mask_buffer(mask_cpu)         # stream is drained; read full mask
        failed_ranks = mask_cpu.nonzero().tolist()
        dirty_flag_raw[0] = 0
        raise EPRankFailureError(failed_ranks)
    # ... normal dispatch logic

# worker.execute_model() — the catch boundary
def execute_model(self, scheduler_output):
    try:
        output = self.model_runner.execute_model(scheduler_output, ...)
    except EPRankFailureError as e:
        send_zmq_failure_notification(e.failed_ranks)   # new ZMQ message type

---

CUDA stream:  [dispatch kernel] → [combine kernel] → [host callback] → [next kernels ...]
Main thread:  submits work ─────────────────────────────────────────→ prepares next batch
Background:                        failure detected    sends ZMQ message → done

---

mask_cpu = torch.zeros(num_ranks, dtype=torch.int32).pin_memory()

# After combine kernel, enqueue in order:
buffer.query_mask_buffer(mask_cpu)            # (1) CUDA kernel: VRAM → pinned CPU RAM
cudaLaunchHostFunc(stream, on_combine_done)   # (2) fires after (1) completes

---

Engine Core Client (core_client.py)   ←——— ZMQ ———→   GPU Worker Processes
  asyncio event loop                                     execute_model()
  abort_requests_async()                                 dispatch / combine kernels
  _scale_down_elastic_ep()                               mask_buffer_ptr

---

# Worker process — CUDA background thread
def on_combine_done():
    failed_ranks = [i for i in range(num_ranks) if mask_cpu[i] != 0]
    if failed_ranks:
        zmq_socket.send_pyobj(("GPU_MASK_FAILURE", failed_ranks))  # non-blocking

# Engine core client — asyncio event loop, next iteration
async def _handle_gpu_mask_failure(failed_ranks: list[int]):
    await self._abort_requests(inflight_for(failed_ranks))
    await self._scale_down_elastic_ep(
        cur_data_parallel_size=len(self.core_engines),
        new_data_parallel_size=len(self.core_engines) - len(failed_ranks),
        dead_dp_ranks=failed_ranks,
    )

RAW_BUFFERClick to expand / collapse

[RFC]: Async Failure Notification for Fault Tolerant EP Kernels

Motivation.

Fault-tolerant EP kernels (e.g. NIXL-EP) detect rank failures on the GPU — via a per-rank integer mask written atomically on timeout — without any CPU involvement. The CPU only learns about a failure if it explicitly reads that mask back from VRAM, which forces a GPU→CPU sync.

The problem is not detection speed. The problem is when and how often that sync is paid.

When SchedulerConfig.async_scheduling = True, vLLM pipelines CPU scheduling with GPU execution: the CPU schedules pass N+1 while the GPU runs pass N. A blocking GPU→CPU sync collapses that overlap and eliminates the throughput benefit of async scheduling — even during healthy operation when no rank has failed.

The natural "check after every forward pass" approach pays this cost on every pass:

query_mask_buffer(...)     # enqueues a CUDA copy kernel
active_ranks.copy_(...)    # still a GPU tensor
torch.equal(active_ranks, ...)  # blocks — drains the CUDA stream, crosses PCIe

As the NIXL team noted:

"Raising a formal exception requires reading the failure mask back to the CPU. This synchronization point may introduce the very latency we are trying to avoid by using async runs."

This RFC proposes wiring GPU-side mask detection into vLLM in a way that pays the sync cost only when a failure actually occurs, not on every pass.

Proposed Change.

Two designs are proposed.

Design 1 — Pinned Dirty Flag

The GPU kernel writes a single integer — dirty_flag — in pinned (page-locked) CPU memory whenever a rank fails. Pinned memory has a fixed physical address, so the GPU can write to it directly over PCIe without any CPU involvement or staging. The CPU reads it for free since it is already in its own RAM.

One store is added to the EP dispatch/combine kernels alongside the existing atomicExch:

// In dispatch / combine kernel, when timeout occurs:
atomicExch(mask_buffer_ptr + src_rank, 1);    // existing: per-rank mask in VRAM
st_release_sys_global(dirty_flag_pinned, 1);  // new: single bit written to CPU RAM

Where to Check the Dirty Flag

The check can be placed at two points. The placement determines detection lag and how much compute is wasted on failure:

Option A — inside All2AllManagerBase.dispatch() [recommended]:

  MoE layer N:    dispatch → experts → combine  ← failure detected
  MoE layer N+1:  dispatch ← check here → raise EPRankFailureError
                             stops here; remaining layers never run

Option B — after model.forward() returns:

  MoE layer N:    dispatch → experts → combine  ← failure detected
  MoE layer N+1:  dispatch → experts → combine  (runs anyway, wasted)
  ...
  model.forward() returns → check dirty_flag → handle

	Option A: check in `dispatch()`	Option B: check at pass boundary
Detection lag	≤ one MoE layer	Remainder of current pass
Wasted compute on failure	One MoE layer	All remaining MoE layers
Exception call stack	Natural unwind to `execute_model()`	No active call stack

Option A is strictly better on all three dimensions.

Hot-Path Cost

The check runs once per MoE layer. Accessing the flag via ctypes avoids Python tensor boxing overhead and keeps the per-check cost minimal — a direct read from CPU RAM:

import ctypes
dirty_flag_raw = ctypes.cast(dirty_flag.data_ptr(), ctypes.POINTER(ctypes.c_int32))
# dirty_flag_raw[0] — direct C memory access, no Python boxing

ctypes is significantly cheaper than PyTorch tensor indexing (dirty_flag[0]), which incurs Python object overhead on every call. Since failures are rare, the branch predictor predicts "not taken" almost perfectly. The misprediction cost is paid only on the one check that actually fires.

Exception Flow

When dirty_flag_raw[0] != 0, the handler does a one-time stream sync to drain the GPU, reads the full mask to identify failed ranks, resets the flag, and raises EPRankFailureError. The exception unwinds the call stack naturally:

All2AllManagerBase.dispatch()     ← raises EPRankFailureError
  GroupCoordinator.dispatch()
    DefaultMoERunner._maybe_dispatch()
      model.forward()
        model_runner.execute_model()
          worker.execute_model()  ← catches; self.elastic_ep_executor is here

# All2AllManagerBase.dispatch() — called once per MoE layer
def dispatch(self, hidden_states, topk_weights, topk_ids, ...):
    if dirty_flag_raw[0] != 0:                     # cheap read from CPU RAM
        torch.cuda.current_stream().synchronize()  # sync once, only on failure
        buffer.query_mask_buffer(mask_cpu)         # stream is drained; read full mask
        failed_ranks = mask_cpu.nonzero().tolist()
        dirty_flag_raw[0] = 0
        raise EPRankFailureError(failed_ranks)
    # ... normal dispatch logic

# worker.execute_model() — the catch boundary
def execute_model(self, scheduler_output):
    try:
        output = self.model_runner.execute_model(scheduler_output, ...)
    except EPRankFailureError as e:
        send_zmq_failure_notification(e.failed_ranks)   # new ZMQ message type

The mask readback is deferred to the moment a failure actually happens, paid only once, not on every dispatch call.

Properties:


Hot-path cost (healthy)	Negligible — one integer read from CPU RAM (`ctypes`)
Detection lag	≤ one MoE layer
Mask readback sync	`stream.synchronize()` once, only on failure
Pinned memory required	One `int32` (`dirty_flag`)
Exception needed	Yes — to unwind the forward pass call stack
C++ required	No
GPU kernel change	Yes — one additional store on the timeout path

Design 2 — `cudaLaunchHostFunc` + ZMQ Notification

CUDA's cudaLaunchHostFunc places a CPU function onto a CUDA stream. It executes on a CUDA-managed background thread after the preceding GPU work completes, entirely independent of the main thread:

CUDA stream:  [dispatch kernel] → [combine kernel] → [host callback] → [next kernels ...]
Main thread:  submits work ─────────────────────────────────────────→ prepares next batch
Background:                        failure detected    sends ZMQ message → done

The main thread submits GPU work and immediately moves to the next batch. The callback fires in the background, checks the mask, and sends a ZMQ failure notification to the engine core. No exception is raised — the background thread has no forward pass call stack to unwind.

Pinned Memory Requirement

Design 2 requires a full pinned mask_cpu array (not just a single integer). The background CUDA thread cannot call CUDA APIs or touch VRAM, so the mask must already be in CPU RAM when the callback fires. This is done by queuing query_mask_buffer into the stream before the host callback:

mask_cpu = torch.zeros(num_ranks, dtype=torch.int32).pin_memory()

# After combine kernel, enqueue in order:
buffer.query_mask_buffer(mask_cpu)            # (1) CUDA kernel: VRAM → pinned CPU RAM
cudaLaunchHostFunc(stream, on_combine_done)   # (2) fires after (1) completes

PyTorch does not expose cudaLaunchHostFunc in Python, so a small C++ extension is required.

Notification Path

vLLM separates GPU workers from the engine core across processes:

Engine Core Client (core_client.py)   ←——— ZMQ ———→   GPU Worker Processes
  asyncio event loop                                     execute_model()
  abort_requests_async()                                 dispatch / combine kernels
  _scale_down_elastic_ep()                               mask_buffer_ptr

The background CUDA thread cannot reach the engine core's asyncio loop directly. It sends a ZMQ message, which the engine core receives on its next event loop iteration:

# Worker process — CUDA background thread
def on_combine_done():
    failed_ranks = [i for i in range(num_ranks) if mask_cpu[i] != 0]
    if failed_ranks:
        zmq_socket.send_pyobj(("GPU_MASK_FAILURE", failed_ranks))  # non-blocking

# Engine core client — asyncio event loop, next iteration
async def _handle_gpu_mask_failure(failed_ranks: list[int]):
    await self._abort_requests(inflight_for(failed_ranks))
    await self._scale_down_elastic_ep(
        cur_data_parallel_size=len(self.core_engines),
        new_data_parallel_size=len(self.core_engines) - len(failed_ranks),
        dead_dp_ranks=failed_ranks,
    )

asyncio is cooperative, not preemptive. The background thread signals and returns immediately; the engine picks up the message at the next iteration boundary after the forward pass completes. Interrupting mid-pass is intentionally avoided — the GPU kernels already masked the failed rank and continued, so the current pass output is valid.

Properties:


Hot-path cost (healthy)	Zero — background thread only
Detection lag	Sub-pass (fires when combine kernel finishes)
Mask readback sync	None — GPU wrote to pinned CPU RAM in-stream
Pinned memory required	Full `mask_cpu` array (`num_ranks × int32`)
Exception needed	No — background thread communicates via ZMQ
C++ required	Yes (`cudaLaunchHostFunc` binding)
GPU kernel change	No — mask is already written

Design Comparison

	Design 1: Dirty Flag	Design 2: `cudaLaunchHostFunc`
Hot-path overhead (healthy)	Negligible — one integer read from CPU RAM	Zero
Detection lag	≤ one MoE layer	Sub-pass
Pinned memory required	One `int32`	Full `mask_cpu` array
Mask readback sync	`stream.synchronize()` once on failure	None
Exception needed	Yes — unwinds call stack	No — ZMQ sideways
C++ required	No	Yes
Effort	Low — pure Python + small kernel change	Medium — C++ extension needed

The per-dispatch overhead of Design 1 is negligible in production. Design 1 is the pragmatic first step; Design 2 is the preferred end state for zero hot-path cost.

Recovery Path (Reusing PR #38862)

Once the engine core receives a failure notification, the same three recovery actions from PR #38862 follow:

Step	Action	Entry point
1. Client notification	Send `FinishReason.ABORT` to in-flight requests	`abort_requests_async()` in `core_client.py`
2. DP routing update	Stop routing new requests to the failed engine	`self.core_engines` trim in `_scale_down_elastic_ep()`
3. Expert redistribution	Reassign experts; reload from disk if needed	`elastic_execute.py` → `reassign_missing_experts()`

No new recovery logic is needed. Both designs plug into the existing machinery as an earlier trigger — GPU mask detection fires within one forward pass, while PR #38862's process monitor fires only after the OS confirms a process crash (potentially seconds later).

Any Other Things.

Design Decisions

1. API boundary for dirty_flag_ptr

The dirty flag address must be passed to the kernel buffer at initialization. This should be an internal implementation detail of each All2AllManagerBase subclass — not part of a public kernel buffer API — so different subclasses can adopt the flag independently without requiring an API contract change.

2. C++ binding location for Design 2

The cudaLaunchHostFunc binding should live in the kernel buffer library (e.g. as a set_failure_callback API on nixl_ep.Buffer) rather than in the vLLM repo. This avoids duplicating the binding in each downstream consumer.

3. New exception type: EPRankFailureError(RuntimeError)

A new exception class EPRankFailureError(RuntimeError) in vllm/distributed/elastic_ep/ should be created, following the pattern of _BarrierTimeoutError(RuntimeError) in elastic_state.py:73. It should carry failed_ranks: list[int] to identify which EP slots failed.

4. Dirty flag reset timing

5. Multiple simultaneous failures

A single dirty_flag integer coalesces multiple failures in one pass into a single detection event. This is acceptable because the one-time stream sync that follows reads the full mask_buffer_ptr to enumerate all failed rank indices. The recovery path in PR #38862 already handles a list of failed ranks.

CC List.

@itayalroy @fangyuchu @tylermsmith @sagemoore

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

Implementing a dirty flag or using cudaLaunchHostFunc can help reduce the sync cost associated with GPU-side mask detection in fault-tolerant EP kernels.

Guidance

To reduce the sync cost, consider implementing a dirty flag that allows the CPU to check for failures without blocking the GPU stream.
Use ctypes to access the dirty flag from CPU RAM, minimizing Python object overhead.
Place the dirty flag check at the beginning of All2AllManagerBase.dispatch() to minimize detection lag and wasted compute.
For a more efficient solution, consider using cudaLaunchHostFunc to execute a CPU function on a CUDA stream, but this requires a C++ extension.

Example

import ctypes
dirty_flag_raw = ctypes.cast(dirty_flag.data_ptr(), ctypes.POINTER(ctypes.c_int32))
if dirty_flag_raw[0] != 0:
    # Handle failure
    pass

Notes

The choice between the dirty flag and cudaLaunchHostFunc approaches depends on the specific requirements and constraints of the project. The dirty flag approach is simpler and more straightforward, while cudaLaunchHostFunc provides a more efficient solution but requires a C++ extension.

Recommendation

Apply the dirty flag workaround, as it is a more straightforward and easier-to-implement solution that can provide a significant reduction in sync cost.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #pipeline error #runtime error #dependency conflict #environment setup

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

vllm - ✅(Solved) Fix [RFC]: Async Failure Notification for Fault Tolerant EP Kernels [1 pull requests, 2 comments, 3 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Exception Flow

Root Cause

Fix Action

Fix / Workaround

PR fix notes

PR #38862: [EP] Fault tolerance: automatic elastic scale-down on DP engine death

Description (problem / solution / changelog)

What problem does this solve?

How it works (plain English)

Example: what it looks like in practice

Case 1: All experts have a surviving replica (common with high redundancy)

Case 2: Some experts have no surviving replica (requires disk reload)

Why fault-triggered scale-down needs custom logic

Recovery timing breakdown

Usage

Files changed

Test plan

Unit tests (tests/v1/engine/test_ep_fault_tolerance.py)

E2E fault injection (Kubernetes, 4xA100)

Known limitations

Changed files

Code Example

[RFC]: Async Failure Notification for Fault Tolerant EP Kernels

Motivation.

Proposed Change.

Design 1 — Pinned Dirty Flag

Where to Check the Dirty Flag

Hot-Path Cost

Exception Flow

Design 2 — cudaLaunchHostFunc + ZMQ Notification

Pinned Memory Requirement

Notification Path

Design Comparison

Recovery Path (Reusing PR #38862)

Any Other Things.

Design Decisions

CC List.

Before submitting a new issue...

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING

Unit tests (`tests/v1/engine/test_ep_fault_tolerance.py`)

Design 2 — `cudaLaunchHostFunc` + ZMQ Notification