vllm - 💡(How to fix) Fix [RFC] Redesign enable_return_routed_experts to avoid blocking EngineCore event loop [1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#38079Fetched 2026-04-08 01:26:44
View on GitHub
Comments
0
Participants
1
Timeline
0
Reactions
0
Author
Participants

The current enable_return_routed_experts implementation uses shared memory + file locks for cross-process data transfer between Worker and Scheduler. When enabled, it introduces per-step GPU→CPU synchronization on Workers and per-completed-request blocking reads on the Scheduler's single-threaded event loop. In DP+EP setups, this causes NCCL collective timeouts. We propose a per-request opt-in model with data flowing through the existing ModelRunnerOutput pipeline, eliminating shared memory and file locks entirely.

Root Cause

We traced the issue through a series of diagnostic experiments, systematically disabling components to isolate the cause:

ExperimentWorker saveScheduler readResult
Baseline (flag=False)offoffOK — sustained throughput
Flag=True, all defaultevery stepevery completed reqHANG at ~20s
save=no-op, read activeoffevery completed reqHANG at ~20s
save active, read=Noneevery stepoffHANG at ~60s
save=off (routed_experts_initialized=False), read=NoneoffoffOK — sustained throughput

Both the Worker-side write path and the Scheduler-side read path independently cause hangs:

Worker-side (every forward step when flag is on):

  1. slot_mapping_attn[:num_tokens].cpu().numpy() — forces CUDA stream synchronization before the forward pass
  2. save_captured_experts()_device_buffer.cpu().numpy() (GPU→CPU copy ~2MB) + fcntl.flock(LOCK_EX) + shared memory write

These per-step GPU→CPU syncs delay Workers. Since Workers across DP ranks are coupled by EP all-to-all (NCCL), one slow Worker causes cross-node collective timeouts.

Scheduler-side (every completed request):

  1. _get_routed_experts() — numpy slot_mapping computation + fcntl.flock(LOCK_EX) + shared memory read (~2MB per request)
  2. For 64 concurrent 4K-token requests completing in one iteration: ~128MB of numpy copies in the single-threaded event loop

This stalls the EngineCore event loop, delaying the next collective_rpc to Workers, again causing EP NCCL timeouts.

Code Example

Worker Process (per GPU):
  MoE forward → capture() writes to GPU device_buffer
  After forward → save_captured_experts():
    device_buffer.cpu().numpy()GPUCPU sync (blocks CUDA stream)
    fcntl.flock(LOCK_EX)                 ← blocking file lock
    shm[slot_indices] = data             ← write to shared memory

EngineCore Process (Scheduler):
  On EVERY completed request → _get_routed_experts():
    reconstruct slot_mapping from KV block_ids (numpy)
    fcntl.flock(LOCK_EX)                 ← blocking file lock
    data = shm[slot_mapping].copy()      ← read from shared memory
  Attach to EngineCoreOutputIPCAPIServer

---

POST /v1/completions
{
    "prompt": [...],
    "max_tokens": 1,
    "vllm_xargs": {"return_routed_experts": true}
}

---

Worker Process:
  1. capture_fn() writes topk_ids to device_buffer during MoE forward (always, in CUDA graph)
  2. IF batch contains opted-in request:
     a. Extract routing deltas from device_buffer for this step's tokens
     b. Attach to ModelRunnerOutput.routed_expert_deltas: dict[str, np.ndarray]
        (req_id → ndarray of shape (num_new_tokens, num_layers, topk))
     GPUCPU copy uses a separate CUDA stream (non_blocking=True) — does not sync the main stream

EngineCore (Scheduler):
  3. update_from_output(): for opted-in requests, append delta to request-local accumulator
     (one list.append per step — O(1), no numpy, no shm, no lock)
  4. On request completion: np.concatenate(accumulated_deltas)EngineCoreOutput.routed_experts

APIServer:
  5. Existing output_processor forwards routed_experts to CompletionOutput
  6. Serve layer encodes to base64, injects into JSON response (unchanged)
RAW_BUFFERClick to expand / collapse

[RFC] Redesign enable_return_routed_experts to avoid blocking EngineCore event loop

Summary

The current enable_return_routed_experts implementation uses shared memory + file locks for cross-process data transfer between Worker and Scheduler. When enabled, it introduces per-step GPU→CPU synchronization on Workers and per-completed-request blocking reads on the Scheduler's single-threaded event loop. In DP+EP setups, this causes NCCL collective timeouts. We propose a per-request opt-in model with data flowing through the existing ModelRunnerOutput pipeline, eliminating shared memory and file locks entirely.

Motivation

We are using enable_return_routed_experts in a reinforcement learning pipeline to capture MoE routing decisions during inference, which are then replayed during training. Our setup: gpt-oss-120B, TP=8, DP=2, EP=16 across 2 nodes.

With the flag enabled, vLLM hangs within 20–60 seconds of sustained inference. The hang manifests as throughput dropping to 0 with requests stuck "Running", eventually triggering NCCL ALLREDUCE timeouts in the expert parallel group. With the flag disabled, the same workload runs indefinitely.

Root cause analysis

We traced the issue through a series of diagnostic experiments, systematically disabling components to isolate the cause:

ExperimentWorker saveScheduler readResult
Baseline (flag=False)offoffOK — sustained throughput
Flag=True, all defaultevery stepevery completed reqHANG at ~20s
save=no-op, read activeoffevery completed reqHANG at ~20s
save active, read=Noneevery stepoffHANG at ~60s
save=off (routed_experts_initialized=False), read=NoneoffoffOK — sustained throughput

Both the Worker-side write path and the Scheduler-side read path independently cause hangs:

Worker-side (every forward step when flag is on):

  1. slot_mapping_attn[:num_tokens].cpu().numpy() — forces CUDA stream synchronization before the forward pass
  2. save_captured_experts()_device_buffer.cpu().numpy() (GPU→CPU copy ~2MB) + fcntl.flock(LOCK_EX) + shared memory write

These per-step GPU→CPU syncs delay Workers. Since Workers across DP ranks are coupled by EP all-to-all (NCCL), one slow Worker causes cross-node collective timeouts.

Scheduler-side (every completed request):

  1. _get_routed_experts() — numpy slot_mapping computation + fcntl.flock(LOCK_EX) + shared memory read (~2MB per request)
  2. For 64 concurrent 4K-token requests completing in one iteration: ~128MB of numpy copies in the single-threaded event loop

This stalls the EngineCore event loop, delaying the next collective_rpc to Workers, again causing EP NCCL timeouts.

Current architecture

Worker Process (per GPU):
  MoE forward → capture() writes to GPU device_buffer
  After forward → save_captured_experts():
    device_buffer.cpu().numpy()          ← GPU→CPU sync (blocks CUDA stream)
    fcntl.flock(LOCK_EX)                 ← blocking file lock
    shm[slot_indices] = data             ← write to shared memory

EngineCore Process (Scheduler):
  On EVERY completed request → _get_routed_experts():
    reconstruct slot_mapping from KV block_ids (numpy)
    fcntl.flock(LOCK_EX)                 ← blocking file lock
    data = shm[slot_mapping].copy()      ← read from shared memory
  Attach to EngineCoreOutput → IPC → APIServer

Problems:

  1. Always-on overhead: Both paths run unconditionally for all requests, even when no client needs routing data
  2. GPU→CPU sync on Worker hot path: slot_mapping.cpu() and device_buffer.cpu() force CUDA synchronization every step
  3. Blocking I/O in event loop: fcntl.flock + numpy memcpy in the Scheduler's single-threaded loop
  4. Shared memory as IPC: Introduces a separate data channel (shm + flock) parallel to the existing ModelRunnerOutput → EngineCore IPC pipeline

Proposed design: per-request opt-in with ModelRunnerOutput handoff

The root problem is architectural: routed experts data flows through a dedicated shared-memory channel with file locks, separate from the existing Worker → EngineCore IPC pipeline. This introduces GPU→CPU synchronization, blocking I/O, and cross-process locking on every forward step — regardless of whether any client needs the data.

The fix is to treat routed experts like any other Worker-produced side data (pooler_output, kv_connector_output): flow it through ModelRunnerOutput, gated on per-request opt-in.

API surface

Clients opt in per-request via vllm_xargs:

POST /v1/completions
{
    "prompt": [...],
    "max_tokens": 1,
    "vllm_xargs": {"return_routed_experts": true}
}

The server-wide --enable-return-routed-experts flag remains, but only controls whether the RoutedExpertsCapturer is initialized (device buffer allocated, capture_fn bound to routers). It no longer implies always-on capture/save/read on every step.

Data flow

Worker Process:
  1. capture_fn() writes topk_ids to device_buffer during MoE forward (always, in CUDA graph)
  2. IF batch contains opted-in request:
     a. Extract routing deltas from device_buffer for this step's tokens
     b. Attach to ModelRunnerOutput.routed_expert_deltas: dict[str, np.ndarray]
        (req_id → ndarray of shape (num_new_tokens, num_layers, topk))
     GPU→CPU copy uses a separate CUDA stream (non_blocking=True) — does not sync the main stream

EngineCore (Scheduler):
  3. update_from_output(): for opted-in requests, append delta to request-local accumulator
     (one list.append per step — O(1), no numpy, no shm, no lock)
  4. On request completion: np.concatenate(accumulated_deltas) → EngineCoreOutput.routed_experts

APIServer:
  5. Existing output_processor forwards routed_experts to CompletionOutput
  6. Serve layer encodes to base64, injects into JSON response (unchanged)

Why this works

No shared memory. Data stays in-process until serialized through the existing ModelRunnerOutput IPC channel. No fcntl.flock, no /dev/shm buffers, no cross-process synchronization.

No GPU→CPU sync on the critical path. capture() runs inside the CUDA graph (zero overhead). The GPU→CPU copy for opted-in steps uses a separate stream with non_blocking=True — the main compute stream is never blocked. The copy result is consumed when building ModelRunnerOutput, which already has an async output path (AsyncModelRunnerOutput.get_output()).

Per-request, not per-server. The Worker tracks opted-in request IDs locally (from scheduled_new_reqs.sampling_params.extra_args). In typical workloads, 99.9% of forward steps have zero opted-in requests → zero additional work. Only the 1–2 steps containing an opted-in request incur the copy.

Consistent with existing patterns. ModelRunnerOutput already carries pooler_output (per-request tensors), kv_connector_output (connector metadata), and logprobs (per-token data). Routed expert deltas are the same class of data — Worker-produced, per-request, needed at completion time.

Data volume analysis

Per-token routing data: num_moe_layers × topk × sizeof(int32) bytes in ModelRunnerOutput (int32 at capture; can be compressed to uint8/uint16 at the serving layer if expert IDs fit).

For a concrete example (gpt-oss-120B: 36 MoE layers, topk=4, 128 experts):

ScenarioPer-tokenAdded to ModelRunnerOutputvs. existing payload (~10-30KB)
No opted-in request (99.9% of steps)00%
1 opted-in request, decode (1 token)576 B576 B~3%
1 opted-in request, prefill (2708 tokens)576 B~1.5 MB (one-time)large, but comparable to prompt_logprobs for same length

The transport uses vLLM's existing ShmRingBuffer (shared-memory ring buffer, lock-free single-producer/single-consumer). For same-node Workers (the common case), this is zero-copy with spin notification — no serialization, no network, no file lock.

Scope

This design covers tokens actually computed in the current request's forward passes. Prefix-cached tokens (computed by a prior request) are not captured. This is consistent with prompt_logprobs, which only returns logprobs for tokens the engine computes.

Concrete changes

  1. ModelRunnerOutput: Add routed_expert_deltas: dict[str, np.ndarray] | None = None
  2. GPUModelRunner: Track opted-in req IDs from scheduled_new_reqs. On opted-in steps, extract per-request deltas from device_buffer and populate routed_expert_deltas. Use non_blocking=True for the GPU→CPU copy.
  3. Scheduler: On update_from_output, append deltas to per-request accumulator. On completion, concatenate and set EngineCoreOutput.routed_experts.
  4. Scheduler._get_routed_experts: Remove. No longer needed — data arrives via ModelRunnerOutput, not shared memory.
  5. RoutedExpertsReader / RoutedExpertsCapturer.save_captured_experts: Remove. Shared memory read/write paths are no longer used.
  6. RoutedExpertsCapturer: Retain capture() and device_buffer (used during forward). Remove init_buffer's shared memory allocation and save_captured_experts. Add a method to extract per-request deltas given request→token index mapping.
  7. --enable-return-routed-experts: Retains its role as the server-wide flag to initialize the capturer. Does not imply always-on save/read.

Environment

extent analysis

Fix Plan

To address the issue, we will implement a per-request opt-in model, where routed experts data flows through the existing ModelRunnerOutput pipeline, eliminating shared memory and file locks. Here are the concrete steps:

  • Modify ModelRunnerOutput: Add a new field routed_expert_deltas to store the routed experts data for each request.
  • Update GPUModelRunner: Track opted-in request IDs and extract per-request deltas from device_buffer on opted-in steps. Use non_blocking=True for the GPU→CPU copy.
  • Update Scheduler: Append deltas to per-request accumulator on update_from_output and concatenate on completion to set EngineCoreOutput.routed_experts.
  • Remove shared memory read/write paths: Remove Scheduler._get_routed_experts, RoutedExpertsReader, and RoutedExpertsCapturer.save_captured_experts.
  • Update RoutedExpertsCapturer: Retain capture() and device_buffer, but remove shared memory allocation and add a method to extract per-request deltas.

Example code changes:

# ModelRunnerOutput.py
class ModelRunnerOutput:
    def __init__(self):
        self.routed_expert_deltas = None  # Add new field

# GPUModelRunner.py
class GPUModelRunner:
    def __init__(self):
        self.opted_in_req_ids = []  # Track opted-in request IDs

    def run(self):
        # ...
        if self.opted_in_req_ids:
            # Extract per-request deltas from device_buffer
            deltas = self.extract_deltas()
            self.model_runner_output.routed_expert_deltas = deltas
        # ...

    def extract_deltas(self):
        # Implement delta extraction using non_blocking=True for GPU→CPU copy
        pass

# Scheduler.py
class Scheduler:
    def update_from_output(self, output):
        # ...
        if output.routed_expert_deltas:
            # Append deltas to per-request accumulator
            self.request_accumulator.append(output.routed_expert_deltas)
        # ...

    def on_completion(self):
        # ...
        # Concatenate accumulated deltas and set EngineCoreOutput.routed_experts
        self.engine_core_output.routed_experts = np.concatenate(self.request_accumulator)
        # ...

Verification

To verify the fix, test the following scenarios:

  • Opted-in requests: Verify that routed experts data is correctly captured and returned for opted-in requests.
  • Non-opted-in requests: Verify that no routed experts data is captured or returned for non-opted-in requests.
  • Performance: Verify that the fix does not introduce significant performance overhead.

Extra Tips

To prevent similar issues in the future:

  • Avoid shared memory and file locks: Instead, use existing IPC pipelines or design new ones that avoid

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING