vllm - ✅ (Solved) Fix [RFC] Replace routing replay with CUDA-graph-compatible device cache approach [1 pull request, 1 participant]

GitHub stats
vllm-project/vllm#39701 · Fetched 2026-04-15 06:20:51
Comments: 0 · Participants: 1 · Timeline: 7 · Reactions: 2
Timeline (top): labeled ×3, subscribed ×2, added_to_project_v2 ×1, issue_type_added ×1

PR fix notes

PR #39917: [Core] Replace routing replay with device cache and async D2H pipeline

Description (problem / solution / changelog)

Summary

Replace upstream vLLM routing replay with a device-cache approach that works correctly with CUDA graphs, multi-node TP, and data parallelism. This PR focuses on the core architecture change — monolithic kernel support and prefix caching are in a follow-up PR.

RFC: #39701

What this PR does

Replaces the SharedMemory-based routing replay with:

  • Pre-allocated (L, N, K) int16 device buffer with per-layer views
  • Async D2H pipeline via CUDA events + pinned memory
  • Per-request host cache (no shared memory, no file locks)
  • Data flows through ModelRunnerOutput → Ray DAG → scheduler (enables multi-node)

What this PR removes

  • RoutedExpertsReader (shared memory reader)
  • multiprocessing.SharedMemory usage
  • fcntl file-based locking
  • capture() callback mechanism in router
  • KV cache slot_mapping retrieval for routing data
  • int32 dtype (replaced with int16)
  • (N, L, K) buffer layout (replaced with (L, N, K))

Changes

  • Rewrite routed_experts_capturer.py: device cache + async D2H pipeline
  • Add moe_layer_id auto-increment to FusedMoE for buffer binding
  • bind_routing_capture_to_model(): persistent tensor attribute + cudagraph_mark_tensor_static
  • Capture routing in non-monolithic (Triton) path via topk_ids.to(int16) copy
  • Route data through ModelRunnerOutput instead of shared memory
  • Wire routed_experts to OpenAI API response
  • Unit tests for device cache and host cache

What is NOT in this PR (follow-up)

  • Monolithic kernel path (FP8/MXFP8 via FlashInfer routing_replay_out) — depends on flashinfer-ai/flashinfer#3024
  • Prefix caching -1 sentinel
  • _monolithic_writes_routing_replay flag

Validation

Tested on GB200 GPUs with a 120B MoE model (BF16 Triton path, non-monolithic):

  • Single-node TP=4: PASS, 7,767 tok/s
  • Prefix caching: PASS, 7,136 tok/s
  • DP=2 (2 nodes, TP=4): PASS, 10,170 tok/s

Performance: 2.0% throughput overhead on random data. Accuracy: GSM8K pass@1 = 95.77% (identical to baseline).

API Compatibility

Fully preserved — same CLI flag (--enable-return-routed-experts), same output field (routed_experts), same shape [seq_len, num_moe_layers, top_k].

Test Plan

  • BF16 Triton path functional tests pass
  • Multi-node DP functional tests pass
  • Performance degradation < 5%
  • Accuracy unchanged
  • Unit tests pass (CI)

Changed files

  • docs/features/routed_experts_replay.md (added, +285/-0)
  • tests/model_executor/test_routed_experts_capture.py (modified, +83/-56)
  • vllm/entrypoints/openai/chat_completion/protocol.py (modified, +4/-0)
  • vllm/entrypoints/openai/chat_completion/serving.py (modified, +15/-0)
  • vllm/entrypoints/openai/completion/protocol.py (modified, +4/-0)
  • vllm/entrypoints/openai/completion/serving.py (modified, +11/-0)
  • vllm/model_executor/layers/fused_moe/layer.py (modified, +7/-0)
  • vllm/model_executor/layers/fused_moe/routed_experts_capturer.py (modified, +706/-289)
  • vllm/model_executor/layers/fused_moe/runner/moe_runner_base.py (modified, +8/-0)
  • vllm/outputs.py (modified, +4/-0)
  • vllm/v1/core/sched/scheduler.py (modified, +7/-70)
  • vllm/v1/engine/__init__.py (modified, +1/-2)
  • vllm/v1/engine/output_processor.py (modified, +31/-2)
  • vllm/v1/outputs.py (modified, +3/-0)
  • vllm/v1/worker/gpu_model_runner.py (modified, +67/-47)

[RFC] Replace routing replay with CUDA-graph-compatible device cache approach

Motivation

Routing replay (--enable-return-routed-experts) captures which MoE experts process each token during inference. This is needed by RL training pipelines (GRPO, RLHF) where the training step reconstructs expert routing decisions from the inference pass.

We've been running a fork with a production-grade routing replay replacement on our internal GPU clusters for large MoE models (120B and 400B+ parameter). The current upstream implementation has fundamental issues that prevent real-world use - it breaks under CUDA graphs, doesn't work multi-node, misses the monolithic kernel path entirely, and has no MTP or prefix caching support.

We'd like to upstream our replacement. The implementation is code complete and validated across 9 configurations, with 2.0% to 5.5% throughput overhead in the benchmarks reported under Validation and zero accuracy impact. A companion FlashInfer PR (adding routing_replay_out to MoE kernel launchers) is currently under review.

What's wrong with the current implementation

The current approach uses a capture() callback in the router combined with multiprocessing.SharedMemory and fcntl file locking. We found 8 issues:

1. CUDA graph incompatibility. The device buffer (routed_experts_capturer.py:133) is not marked with cudagraph_mark_tensor_static. During CUDA graph capture, the graph snapshots this buffer. On replay, it restores the snapshot before executing, overwriting clear_buffer() calls made outside the graph. Positions not rewritten by the current forward retain stale snapshot data instead of zeros. This took us 5 different approaches to solve correctly.

2. Monolithic kernels not captured. Routing is only captured via select_experts() in the router (base_router.py:280). The monolithic trtllm-gen path (apply_monolithic()) bypasses the router, so no routing data is captured for FP8/MXFP8 models using FlashInfer's fused kernels. This is the default serving path for quantized MoE models.

3. No multi-node support. SharedMemory is node-local (routed_experts_capturer.py:14,58-60). On multi-node TP setups (needed for 400B+ models), the scheduler on node 0 can't read shared memory from workers on other nodes. Our replacement removes shared memory entirely - routing data flows through ModelRunnerOutput via Ray DAG to the scheduler, same as other per-request outputs.

4. No MTP support. There's no handling of speculative tokens from MTP anywhere in the routing replay code path. With MTP, the model captures routing for all tokens including rejected speculative positions - the output needs trimming to match actual accepted tokens.

5. No prefix caching support. The host buffer is initialized with fill(0) (routed_experts_capturer.py:154). This means cached positions (prefix hits) are indistinguishable from "routed to expert 0" - 0 is a valid expert ID.

6. Synchronous D2H. save_captured_experts does .cpu().numpy() (routed_experts_capturer.py:230) which blocks the GPU, then takes a file lock (line 232) for shared memory access.

7. Buffer layout mismatch. Device buffer is (N, L, K) int32 (routed_experts_capturer.py:133-137). FlashInfer's routing_replay_out parameter needs contiguous (N, K) per layer, which requires (L, N, K) for zero-copy slicing.

8. Memory overhead. Uses int32 for expert IDs that fit in int16.
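
Issues 7 and 8 come down to index arithmetic, which can be checked without a GPU. The sketch below is a pure-Python model, not vLLM code: it computes row-major strides for both layouts and tests whether fixing the layer index yields one solid block of memory. The sizes (4 layers, 8 token slots, top-2) and all helper names are hypothetical.

```python
from math import prod

def row_major_strides(shape):
    """Element strides of a C-contiguous array with the given shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))

def layer_slice_offsets(shape, layer_axis, layer_id):
    """Flat element offsets touched when `layer_axis` is fixed to `layer_id`."""
    strides = row_major_strides(shape)
    offsets = [0]
    for axis, (dim, stride) in enumerate(zip(shape, strides)):
        indices = [layer_id] if axis == layer_axis else range(dim)
        offsets = [base + i * stride for base in offsets for i in indices]
    return sorted(offsets)

def is_contiguous(offsets):
    """True if the offsets form one unbroken run of elements."""
    return offsets == list(range(offsets[0], offsets[0] + len(offsets)))

# Hypothetical sizes: 4 MoE layers, 8 token slots, top-2 routing.
L, N, K = 4, 8, 2

lnk = layer_slice_offsets((L, N, K), layer_axis=0, layer_id=1)
nlk = layer_slice_offsets((N, L, K), layer_axis=1, layer_id=1)

print(is_contiguous(lnk))  # layer-major layout: one solid (N, K) block
print(is_contiguous(nlk))  # token-major layout: strided, needs a copy

# Same element count, half the bytes when IDs are stored as int16.
bytes_int32 = prod((L, N, K)) * 4
bytes_int16 = prod((L, N, K)) * 2
print(bytes_int32, bytes_int16)
```

With (L, N, K), layer 1 occupies elements 16 through 31 of the flat buffer, so a per-layer slice can be handed to a kernel as-is. With (N, L, K), the same layer is scattered in K-sized pieces every L*K elements.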

| Feature | Current upstream | Our replacement |
|---|---|---|
| CUDA graphs | Broken (snapshot/restore overwrites buffer, no mark_static) | Working (persistent tensor attribute + mark_static) |
| Buffer layout | (N,L,K) int32 | (L,N,K) int16 |
| Monolithic kernels (FP8/MXFP8) | Not captured | Full support via FlashInfer routing_replay_out |
| Multi-node | Broken (SharedMemory is node-local) | Working (data flows through Ray DAG) |
| MTP speculative decoding | Not handled | Seqlen clamping + output trim |
| Prefix caching | Ambiguous (0 = expert 0 or cache hit?) | -1 sentinel for cache hits |
| D2H transfer | Synchronous .cpu().numpy() + file lock | Async pinned memory + CUDA events |

Proposed design

Device cache + async D2H pipeline replacing the shared-memory capturer/reader:

  • Pre-allocated (L, N, K) int16 device buffer. buffer[layer_id] gives a contiguous (N, K) view per layer.
  • Each FusedMoE layer gets a persistent module attribute module._routing_replay_out = buffer[layer_id]. torch.compile captures module attributes by reference, so CUDA graph replay always writes to the live buffer.
  • cudagraph_mark_tensor_static on each per-layer view prevents snapshot/restore from zeroing the data.
  • Async D2H via CUDA events + pinned memory scatter, only on TP rank 0.
  • Per-request numpy host cache initialized with -1 sentinel.
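
The "persistent view into one pre-allocated buffer" idea in the list above can be modelled on the host with the standard library. This is a sketch under stated assumptions, not the PR's implementation: an `array("h")` (signed 16-bit) stands in for the device buffer, a `memoryview` slice stands in for `buffer[layer_id]`, and `FakeMoELayer` is a hypothetical stand-in for a FusedMoE module. CUDA graphs and `cudagraph_mark_tensor_static` are not modelled at all.

```python
from array import array

L, N, K = 3, 4, 2  # hypothetical sizes: layers, token slots, top-k

# One flat int16 allocation standing in for the (L, N, K) device buffer.
storage = array("h", [0] * (L * N * K))
buffer = memoryview(storage)

def layer_view(layer_id):
    """Zero-copy window over one layer's (N, K) block, flattened to N*K."""
    start = layer_id * N * K
    return buffer[start:start + N * K]

class FakeMoELayer:
    """Hypothetical stand-in for a MoE layer holding a persistent view."""

    def __init__(self, layer_id):
        # Bound once at setup; every later write goes to the live buffer.
        self._routing_replay_out = layer_view(layer_id)

    def forward(self, topk_ids):
        # topk_ids: one list of K expert IDs per token.
        for token, experts in enumerate(topk_ids):
            for k, expert in enumerate(experts):
                self._routing_replay_out[token * K + k] = expert

layers = [FakeMoELayer(i) for i in range(L)]
layers[1].forward([[5, 9], [2, 7]])

print(list(layer_view(1)[:4]))               # written through the view
print(storage[N * K:N * K + 4].tolist())     # same data in the backing store
```

Because the view is created once and kept as an attribute, no per-step allocation or rebinding happens; that is the property the design relies on when the write is recorded inside a captured graph.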

Monolithic kernel integration: Thread routing_replay_out through the apply_monolithic() call chain so FlashInfer's fused kernels write expert IDs directly during routing. For the non-monolithic Triton path: write topk_ids.to(int16) after select_experts(). A _monolithic_writes_routing_replay flag distinguishes kernel capabilities.

MTP + prefix caching: Seqlen clamping using authoritative token count from request state. Output trim in output_processor to match actual accepted tokens. Host cache init with -1 sentinel instead of 0.
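
A minimal pure-Python sketch of the three steps named above (sentinel init, seqlen clamping, output trim). All names here (`CACHE_HIT`, `new_host_cache`, `scatter`, `trim`) are hypothetical and the nested-list cache is only a model of the numpy host cache; how the real code obtains the authoritative token count is not shown in the RFC and is simply passed in as `num_tokens`.

```python
CACHE_HIT = -1  # sentinel: position was served from the prefix cache

def new_host_cache(max_len, num_layers, top_k):
    """Per-request cache shaped [max_len][num_layers][top_k], all sentinel."""
    return [[[CACHE_HIT] * top_k for _ in range(num_layers)]
            for _ in range(max_len)]

def scatter(cache, start, rows, num_tokens):
    """Write captured rows from `start`, clamped to the authoritative count."""
    limit = min(num_tokens, len(cache))
    for offset, row in enumerate(rows):
        pos = start + offset
        if pos >= limit:
            break  # rejected speculative position: never stored
        cache[pos] = [list(layer) for layer in row]

def trim(cache, num_tokens):
    """Cut the output down to the tokens the request actually kept."""
    return cache[:num_tokens]

# Scenario: 2 tokens hit the prefix cache, the forward pass captures routing
# for positions 2..5, but only 4 tokens in total are accepted.
captured = [
    [[0, 3], [1, 2]],  # position 2 (accepted); note expert 0 is a real ID
    [[0, 1], [4, 5]],  # position 3 (accepted)
    [[6, 7], [6, 7]],  # position 4 (rejected draft token)
    [[6, 7], [6, 7]],  # position 5 (rejected draft token)
]
cache = new_host_cache(max_len=8, num_layers=2, top_k=2)
scatter(cache, start=2, rows=captured, num_tokens=4)
out = trim(cache, num_tokens=4)

print(out[0])  # prefix-cache hit, still the sentinel
print(out[2])  # real routing, including expert 0
```

With a zero-initialised cache, `out[0]` would read as "routed to expert 0 twice" and be indistinguishable from a genuine decision like the first entry of `out[2]`; the -1 fill removes that ambiguity.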

Multi-node: Device buffer on all TP ranks (symmetric CUDA graphs). Only rank 0 does D2H and host cache management. Data flows through ModelRunnerOutput -> Ray DAG -> scheduler, replacing the SharedMemory path entirely.

FlashInfer dependency

This depends on a companion FlashInfer PR that adds routing_replay_out as an optional parameter to all MoE kernel launchers (FP8 block, FP8 per-tensor, BF16, FP4, MXINT4) and routing kernels (Custom, DeepSeek, Llama4). When None (the default), there is zero overhead - the kernel skips the write entirely.

Files changed

Full rewrite: routed_experts_capturer.py

Modified (~12 files): gpu_model_runner.py (capturer lifecycle), layer.py (moe_layer_id), moe_runner_base.py (non-monolithic write), modelopt.py / modular_kernel.py / fp8.py / trtllm_fp8_moe.py (monolithic threading), output_processor.py (MTP trim), OpenAI API protocol + serving files (response field).

Removed: RoutedExpertsReader, SharedMemory usage, fcntl locking, capture() callback in router, KV cache slot_mapping retrieval.

Validation

Tested on GB200 GPUs with a 120B MoE model (BF16 dummy weights) and a 400B+ MoE model (MXFP8 and BF16, real weights). All configs use expert parallelism.

Functional tests (9 configs, all passing - 5 correctness checks + 1000-prompt scale test at 256 concurrency, ISL=1024/OSL=1024):

  • FP8/MXFP8 monolithic kernel path (single-node and multi-node)
  • BF16 non-monolithic Triton path (single-node and multi-node)
  • Prefix caching enabled
  • Data parallelism (TP=4, DP=2)
  • MTP speculative decoding (5 draft tokens)
  • MTP + prefix caching combined

Throughput (400B+ MoE, MXFP8, 2 nodes, TP=8, 1000 prompts, 256 concurrency):

| Dataset | Baseline | RR enabled | Overhead |
|---|---|---|---|
| Random (ISL=1024, OSL=1024) | 4,779 tok/s | 4,685 tok/s | 2.0% |
| Sonnet (ISL=1024, OSL=128) | 2,489 tok/s | 2,353 tok/s | 5.5% |
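
The overhead column can be reproduced from the two throughput columns. The helper name `overhead_pct` is ours; the formula assumed is relative throughput loss against the baseline.

```python
def overhead_pct(baseline_tok_s, enabled_tok_s):
    """Relative throughput loss, as a percentage of the baseline."""
    return (baseline_tok_s - enabled_tok_s) / baseline_tok_s * 100.0

random_overhead = overhead_pct(4779, 4685)
sonnet_overhead = overhead_pct(2489, 2353)

print(f"{random_overhead:.1f}% {sonnet_overhead:.1f}%")  # 2.0% 5.5%
```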

Accuracy (GSM8K, 4 seeds, 1319 problems): pass@1 = 95.77% for both baseline and RR. Zero accuracy impact.

API compatibility

Fully preserved - same CLI flag (--enable-return-routed-experts), same output field (routed_experts), same shape ([seq_len, num_moe_layers, top_k]). Only internals change.
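
Since the internal buffer is layer-major (L, N, K) while the API shape is token-major [seq_len, num_moe_layers, top_k], a per-request gather has to swap the first two axes somewhere along the way. The sketch below illustrates that reshaping with nested lists; `gather_request` and the `slots` argument are hypothetical, and where vLLM actually performs this step is not specified in the RFC.

```python
def gather_request(device_buffer, slots):
    """Return [seq_len][num_layers][top_k] for one request.

    device_buffer: nested lists shaped [L][N][K] (layer-major).
    slots: buffer positions (indices into N) owned by the request, in order.
    """
    return [[list(layer[slot]) for layer in device_buffer] for slot in slots]

# Hypothetical buffer: L=2 layers, N=4 token slots, K=2 experts per token.
device_buffer = [
    [[0, 1], [2, 3], [4, 5], [6, 7]],  # layer 0
    [[1, 0], [3, 2], [5, 4], [7, 6]],  # layer 1
]

routed_experts = gather_request(device_buffer, slots=[1, 3])
print(routed_experts)  # [[[2, 3], [3, 2]], [[6, 7], [7, 6]]]
```

The result has seq_len=2, num_moe_layers=2, top_k=2, matching the documented output shape regardless of how the buffer is laid out internally.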

Plan

  1. FlashInfer PR merges (under review)
  2. vLLM pins FlashInfer version with routing_replay_out support
  3. vLLM PR: 3 commits (core plumbing, MTP+PC support, tests)

Happy to adjust the PR structure or split differently based on feedback.

extent analysis

TL;DR

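Replace the current routing replay implementation with a CUDA-graph-compatible device cache approach. The replacement addresses the identified issues (CUDA graph incompatibility, no multi-node support, uncaptured monolithic kernels, missing MTP and prefix caching handling, synchronous D2H) at a reported throughput overhead of 2.0% to 5.5% and with no accuracy change.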

Guidance

  • Review the proposed design for the device cache and async D2H pipeline to ensure it meets the requirements for CUDA graph compatibility and multi-node support.
  • Verify that the monolithic kernel integration is correctly implemented to thread routing_replay_out through the apply_monolithic() call chain.
  • Test the MTP and prefix caching support to ensure correct seqlen clamping and output trimming.
  • Validate the performance and accuracy of the new implementation using the provided test configurations and benchmarks.

Example

No code snippet is provided as the issue is focused on the design and implementation of the routing replay replacement, and the code changes are already outlined in the proposal.

Notes

The proposed solution depends on the companion FlashInfer PR, which adds routing_replay_out as an optional parameter to all MoE kernel launchers. The vLLM PR should pin the FlashInfer version with routing_replay_out support before merging the changes.

Recommendation

Adopt the proposed replacement: swap the current routing replay implementation for the device cache approach. It addresses the identified issues with no measured accuracy change (GSM8K pass@1 = 95.77% for both baseline and RR), at a reported throughput overhead of 2.0% to 5.5%.
