vllm - ✅(Solved) Fix [RFC] Replace routing replay with CUDA-graph-compatible device cache approach [1 pull requests, 1 participants]

vllm • 2026-04-13 12:41:03


GitHub stats

vllm-project/vllm#39701•Fetched 2026-04-15 06:20:51


Author

TomerBN-Nvidia

Participants

TomerBN-Nvidia

Timeline (top)

labeled ×3 • subscribed ×2 • added_to_project_v2 ×1 • issue_type_added ×1

PR fix notes

PR #39917: [Core] Replace routing replay with device cache and async D2H pipeline

Repository: vllm-project/vllm
Author: TomerBN-Nvidia
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/39917

Description (problem / solution / changelog)

Summary

Replace upstream vLLM routing replay with a device-cache approach that works correctly with CUDA graphs, multi-node TP, and data parallelism. This PR focuses on the core architecture change — monolithic kernel support and prefix caching are in a follow-up PR.

RFC: #39701

What this PR does

Replaces the SharedMemory-based routing replay with:

Pre-allocated (L, N, K) int16 device buffer with per-layer views
Async D2H pipeline via CUDA events + pinned memory
Per-request host cache (no shared memory, no file locks)
Data flows through ModelRunnerOutput → Ray DAG → scheduler (enables multi-node)

What this PR removes

RoutedExpertsReader (shared memory reader)
multiprocessing.SharedMemory usage
fcntl file-based locking
capture() callback mechanism in router
KV cache slot_mapping retrieval for routing data
int32 dtype (replaced with int16)
(N, L, K) buffer layout (replaced with (L, N, K))

Changes

Rewrite routed_experts_capturer.py: device cache + async D2H pipeline
Add moe_layer_id auto-increment to FusedMoE for buffer binding
bind_routing_capture_to_model(): persistent tensor attribute + cudagraph_mark_tensor_static
Capture routing in non-monolithic (Triton) path via topk_ids.to(int16) copy
Route data through ModelRunnerOutput instead of shared memory
Wire routed_experts to OpenAI API response
Unit tests for device cache and host cache
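
The layer-ID binding listed above can be illustrated with a minimal sketch. `ToyMoELayer`, `bind_routing_buffers`, and the nested-list buffer are hypothetical stand-ins chosen for illustration; they are not the PR's `FusedMoE` or `bind_routing_capture_to_model()` code, which is not reproduced on this page.

```python
import itertools


class ToyMoELayer:
    """Hypothetical stand-in for an MoE layer; not vLLM's FusedMoE."""

    _next_id = itertools.count()  # counter shared by all instances

    def __init__(self):
        # Each layer takes the next free ID at construction time.
        self.moe_layer_id = next(ToyMoELayer._next_id)
        self.routing_out = None  # bound later


def bind_routing_buffers(layers, buffer):
    """Attach buffer[layer_id] to each layer as a persistent attribute."""
    for layer in layers:
        layer.routing_out = buffer[layer.moe_layer_id]


num_layers, num_tokens, top_k = 3, 4, 2
# One [num_tokens][top_k] table per layer, i.e. an (L, N, K) layout.
buffer = [[[0] * top_k for _ in range(num_tokens)] for _ in range(num_layers)]
layers = [ToyMoELayer() for _ in range(num_layers)]
bind_routing_buffers(layers, buffer)

# Writing through the layer attribute updates the shared buffer in place.
layers[1].routing_out[0] = [5, 7]
```

Because each layer holds a reference to its own slice rather than a copy, a write made through the layer is visible to whoever later reads the shared buffer, which is the property the persistent-attribute design relies on.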

What is NOT in this PR (follow-up)

Monolithic kernel path (FP8/MXFP8 via FlashInfer routing_replay_out) — depends on flashinfer-ai/flashinfer#3024
Prefix caching -1 sentinel
_monolithic_writes_routing_replay flag

Validation

Tested on GB200 GPUs with a 120B MoE model (BF16 Triton path, non-monolithic):

Single-node TP=4: PASS, 7,767 tok/s
Prefix caching: PASS, 7,136 tok/s
DP=2 (2 nodes, TP=4): PASS, 10,170 tok/s

Performance: 2.0% throughput overhead on random data. Accuracy: GSM8K pass@1 = 95.77% (identical to baseline).

API Compatibility

Fully preserved — same CLI flag (--enable-return-routed-experts), same output field (routed_experts), same shape [seq_len, num_moe_layers, top_k].
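
For a client consuming the `routed_experts` field, a small shape check can catch mismatches early. The sketch below assumes the field has already been decoded into a nested Python list of shape [seq_len][num_moe_layers][top_k]; the wire encoding is not described on this page, and `check_routed_experts_shape` is a hypothetical helper, not part of vLLM.

```python
def check_routed_experts_shape(routed_experts, num_moe_layers, top_k):
    """Return seq_len if the nested list is [seq_len][num_moe_layers][top_k]."""
    for pos, per_token in enumerate(routed_experts):
        if len(per_token) != num_moe_layers:
            raise ValueError(f"token {pos}: expected {num_moe_layers} layers")
        for layer, expert_ids in enumerate(per_token):
            if len(expert_ids) != top_k:
                raise ValueError(
                    f"token {pos}, layer {layer}: expected {top_k} expert IDs"
                )
    return len(routed_experts)


# Made-up data: 2 tokens, 3 MoE layers, top_k = 2.
sample = [
    [[0, 5], [3, 1], [7, 2]],
    [[4, 5], [0, 6], [1, 2]],
]
seq_len = check_routed_experts_shape(sample, num_moe_layers=3, top_k=2)
```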

Test Plan

BF16 Triton path functional tests pass
Multi-node DP functional tests pass
Performance degradation < 5%
Accuracy unchanged
Unit tests pass (CI)

Changed files

docs/features/routed_experts_replay.md (added, +285/-0)
tests/model_executor/test_routed_experts_capture.py (modified, +83/-56)
vllm/entrypoints/openai/chat_completion/protocol.py (modified, +4/-0)
vllm/entrypoints/openai/chat_completion/serving.py (modified, +15/-0)
vllm/entrypoints/openai/completion/protocol.py (modified, +4/-0)
vllm/entrypoints/openai/completion/serving.py (modified, +11/-0)
vllm/model_executor/layers/fused_moe/layer.py (modified, +7/-0)
vllm/model_executor/layers/fused_moe/routed_experts_capturer.py (modified, +706/-289)
vllm/model_executor/layers/fused_moe/runner/moe_runner_base.py (modified, +8/-0)
vllm/outputs.py (modified, +4/-0)
vllm/v1/core/sched/scheduler.py (modified, +7/-70)
vllm/v1/engine/__init__.py (modified, +1/-2)
vllm/v1/engine/output_processor.py (modified, +31/-2)
vllm/v1/outputs.py (modified, +3/-0)
vllm/v1/worker/gpu_model_runner.py (modified, +67/-47)


[RFC] Replace routing replay with CUDA-graph-compatible device cache approach

Motivation

Routing replay (--enable-return-routed-experts) captures which MoE experts process each token during inference. This is needed by RL training pipelines (GRPO, RLHF) where the training step reconstructs expert routing decisions from the inference pass.

We've been running a fork with a production-grade routing replay replacement on our internal GPU clusters for large MoE models (120B and 400B+ parameter). The current upstream implementation has fundamental issues that prevent real-world use - it breaks under CUDA graphs, doesn't work multi-node, misses the monolithic kernel path entirely, and has no MTP or prefix caching support.

We'd like to upstream our replacement. The implementation is code complete and validated across 9 configurations with <5% throughput overhead and zero accuracy impact. A companion FlashInfer PR (adding routing_replay_out to MoE kernel launchers) is currently under review.

What's wrong with the current implementation

The current approach uses a capture() callback in the router combined with multiprocessing.SharedMemory and fcntl file locking. We found 8 issues:

1. CUDA graph incompatibility. The device buffer (routed_experts_capturer.py:133) is not marked with cudagraph_mark_tensor_static. During CUDA graph capture, the graph snapshots this buffer. On replay, it restores the snapshot before executing, overwriting clear_buffer() calls made outside the graph. Positions not rewritten by the current forward retain stale snapshot data instead of zeros. This took us 5 different approaches to solve correctly.

2. Monolithic kernels not captured. Routing is only captured via select_experts() in the router (base_router.py:280). The monolithic trtllm-gen path (apply_monolithic()) bypasses the router, so no routing data is captured for FP8/MXFP8 models using FlashInfer's fused kernels. This is the default serving path for quantized MoE models.

3. No multi-node support. SharedMemory is node-local (routed_experts_capturer.py:14,58-60). On multi-node TP setups (needed for 400B+ models), the scheduler on node 0 can't read shared memory from workers on other nodes. Our replacement removes shared memory entirely - routing data flows through ModelRunnerOutput via Ray DAG to the scheduler, same as other per-request outputs.

4. No MTP support. There's no handling of speculative tokens from MTP anywhere in the routing replay code path. With MTP, the model captures routing for all tokens including rejected speculative positions - the output needs trimming to match actual accepted tokens.

5. No prefix caching support. The host buffer is initialized with fill(0) (routed_experts_capturer.py:154). This means cached positions (prefix hits) are indistinguishable from "routed to expert 0" - 0 is a valid expert ID.

6. Synchronous D2H. save_captured_experts does .cpu().numpy() (routed_experts_capturer.py:230) which blocks the GPU, then takes a file lock (line 232) for shared memory access.

7. Buffer layout mismatch. Device buffer is (N, L, K) int32 (routed_experts_capturer.py:133-137). FlashInfer's routing_replay_out parameter needs contiguous (N, K) per layer, which requires (L, N, K) for zero-copy slicing.

8. Memory overhead. Uses int32 for expert IDs that fit in int16.
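
The ambiguity in issue 5 can be shown with a toy host cache. This is a sketch using plain Python lists; `fill_host_cache` and `uncomputed_positions` are hypothetical helpers, and the expert IDs are made up.

```python
def fill_host_cache(seq_len, top_k, computed, fill_value):
    """Build a [seq_len][top_k] cache; `computed` maps position -> expert IDs."""
    cache = [[fill_value] * top_k for _ in range(seq_len)]
    for pos, expert_ids in computed.items():
        cache[pos] = list(expert_ids)
    return cache


def uncomputed_positions(cache, sentinel=-1):
    """Positions whose entries all still hold the sentinel value."""
    return [
        pos for pos, ids in enumerate(cache)
        if all(i == sentinel for i in ids)
    ]


# Positions 0-1 are prefix-cache hits (never computed in this pass);
# position 2 is genuinely routed to experts 0 and 3.
computed = {2: [0, 3], 3: [4, 1]}

zero_filled = fill_host_cache(4, 2, computed, fill_value=0)
sentinel_filled = fill_host_cache(4, 2, computed, fill_value=-1)

hits_with_zero_fill = uncomputed_positions(zero_filled)
hits_with_sentinel = uncomputed_positions(sentinel_filled)
```

With a zero fill, the skipped positions read as `[0, 0]`, which a consumer cannot tell apart from a real routing decision involving expert 0. With `-1`, the same positions are recoverable because no valid expert ID is negative.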

| Feature | Current upstream | Our replacement |
| --- | --- | --- |
| CUDA graphs | Broken (snapshot/restore overwrites buffer, no mark_static) | Working (persistent tensor attribute + mark_static) |
| Buffer layout | `(N,L,K)` int32 | `(L,N,K)` int16 |
| Monolithic kernels (FP8/MXFP8) | Not captured | Full support via FlashInfer `routing_replay_out` |
| Multi-node | Broken (SharedMemory is node-local) | Working (data flows through Ray DAG) |
| MTP speculative decoding | Not handled | Seqlen clamping + output trim |
| Prefix caching | Ambiguous (0 = expert 0 or cache hit?) | `-1` sentinel for cache hits |
| D2H transfer | Synchronous `.cpu().numpy()` + file lock | Async pinned memory + CUDA events |

Proposed design

Device cache + async D2H pipeline replacing the shared-memory capturer/reader:

Pre-allocated (L, N, K) int16 device buffer. buffer[layer_id] gives a contiguous (N, K) view per layer.
Each FusedMoE layer gets a persistent module attribute module._routing_replay_out = buffer[layer_id]. torch.compile captures module attributes by reference, so CUDA graph replay always writes to the live buffer.
cudagraph_mark_tensor_static on each per-layer view prevents snapshot/restore from zeroing the data.
Async D2H via CUDA events + pinned memory scatter, only on TP rank 0.
Per-request numpy host cache initialized with -1 sentinel.
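
Why the (L, N, K) layout gives a contiguous per-layer view can be checked with flat offsets. The sketch below uses only the Python standard library (`array` plus `memoryview`) as a stand-in for a device tensor; the sizes are made up and `layer_view` / `offsets_nlk` are hypothetical helpers.

```python
from array import array

L, N, K = 3, 4, 2  # layers, token slots, top-k

# Flat int16 storage interpreted as (L, N, K) in row-major order.
storage = array("h", [0] * (L * N * K))
flat = memoryview(storage)


def layer_view(layer_id):
    """Zero-copy contiguous slice of N*K elements for one layer."""
    start = layer_id * N * K
    return flat[start:start + N * K]


def offsets_nlk(layer_id):
    """Flat offsets the same layer would occupy in an (N, L, K) layout."""
    return [n * L * K + layer_id * K + k for n in range(N) for k in range(K)]


view = layer_view(1)
view[0] = 7  # writes through to the shared storage

lnk_offsets = list(range(1 * N * K, 2 * N * K))  # one contiguous run
nlk_offsets = offsets_nlk(1)                     # strided, with gaps

# Element counts are identical; only the element width differs.
bytes_int16 = L * N * K * 2
bytes_int32 = L * N * K * 4
```

In the (L, N, K) layout a layer's data is a single run of offsets, so a slice can be handed to a kernel without copying. In the (N, L, K) layout the same layer is spread across the buffer in K-element pieces, which is why a per-layer (N, K) view would not be contiguous there.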

Monolithic kernel integration: Thread routing_replay_out through the apply_monolithic() call chain so FlashInfer's fused kernels write expert IDs directly during routing. For the non-monolithic Triton path: write topk_ids.to(int16) after select_experts(). A _monolithic_writes_routing_replay flag distinguishes kernel capabilities.

MTP + prefix caching: Seqlen clamping using authoritative token count from request state. Output trim in output_processor to match actual accepted tokens. Host cache init with -1 sentinel instead of 0.
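
The output trim for MTP can be sketched as a clamp followed by a slice. The exact logic in `output_processor` is not shown on this page, so `clamp_and_trim` is a hypothetical helper illustrating one plausible reading: keep routing rows only for tokens that were accepted.

```python
def clamp_and_trim(captured_rows, num_accepted_tokens):
    """Keep routing rows only for tokens that were actually accepted."""
    # Clamp so a stale or oversized count can never index past the capture.
    keep = max(0, min(num_accepted_tokens, len(captured_rows)))
    return captured_rows[:keep]


# Made-up example: 6 positions were captured (1 verified token plus
# 5 draft positions), but only 3 tokens were accepted.
captured = [[[i, i + 1]] for i in range(6)]  # [seq_len][num_layers=1][top_k=2]
trimmed = clamp_and_trim(captured, num_accepted_tokens=3)
```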

Multi-node: Device buffer on all TP ranks (symmetric CUDA graphs). Only rank 0 does D2H and host cache management. Data flows through ModelRunnerOutput -> Ray DAG -> scheduler, replacing the SharedMemory path entirely.

FlashInfer dependency

This depends on a companion FlashInfer PR that adds routing_replay_out as an optional parameter to all MoE kernel launchers (FP8 block, FP8 per-tensor, BF16, FP4, MXINT4) and routing kernels (Custom, DeepSeek, Llama4). When None (the default), there is zero overhead - the kernel skips the write entirely.

Files changed

Full rewrite: routed_experts_capturer.py

Modified (~12 files): gpu_model_runner.py (capturer lifecycle), layer.py (moe_layer_id), moe_runner_base.py (non-monolithic write), modelopt.py / modular_kernel.py / fp8.py / trtllm_fp8_moe.py (monolithic threading), output_processor.py (MTP trim), OpenAI API protocol + serving files (response field).

Removed: RoutedExpertsReader, SharedMemory usage, fcntl locking, capture() callback in router, KV cache slot_mapping retrieval.

Validation

Tested on GB200 GPUs with a 120B MoE model (BF16 dummy weights) and a 400B+ MoE model (MXFP8 and BF16, real weights). All configs use expert parallelism.

Functional tests (9 configs, all passing - 5 correctness checks + 1000-prompt scale test at 256 concurrency, ISL=1024/OSL=1024):

FP8/MXFP8 monolithic kernel path (single-node and multi-node)
BF16 non-monolithic Triton path (single-node and multi-node)
Prefix caching enabled
Data parallelism (TP=4, DP=2)
MTP speculative decoding (5 draft tokens)
MTP + prefix caching combined

Throughput (400B+ MoE, MXFP8, 2 nodes, TP=8, 1000 prompts, 256 concurrency):

| Dataset | Baseline | RR enabled | Overhead |
| --- | --- | --- | --- |
| Random (ISL=1024, OSL=1024) | 4,779 tok/s | 4,685 tok/s | 2.0% |
| Sonnet (ISL=1024, OSL=128) | 2,489 tok/s | 2,353 tok/s | 5.5% |
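
The overhead column follows from the two throughput columns as (baseline - enabled) / baseline. A short check of that arithmetic, using the figures from the table (`overhead_pct` is a helper name introduced here):

```python
def overhead_pct(baseline_tok_s, enabled_tok_s):
    """Relative throughput loss, as a percentage of the baseline."""
    return (baseline_tok_s - enabled_tok_s) / baseline_tok_s * 100.0


random_overhead = round(overhead_pct(4779, 4685), 1)
sonnet_overhead = round(overhead_pct(2489, 2353), 1)
```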

Accuracy (GSM8K, 4 seeds, 1319 problems): pass@1 = 95.77% for both baseline and RR. Zero accuracy impact.

API compatibility

Fully preserved - same CLI flag (--enable-return-routed-experts), same output field (routed_experts), same shape ([seq_len, num_moe_layers, top_k]). Only internals change.

Plan

1. FlashInfer PR merges (under review)
2. vLLM pins FlashInfer version with routing_replay_out support
3. vLLM PR: 3 commits (core plumbing, MTP+PC support, tests)

Happy to adjust the PR structure or split differently based on feedback.

extent analysis

TL;DR

Replace the current routing replay implementation with a CUDA-graph-compatible device cache approach that addresses the eight identified issues, at a reported throughput overhead of 2.0% to 5.5% and no change in GSM8K accuracy.

Guidance

Review the proposed design for the device cache and async D2H pipeline to ensure it meets the requirements for CUDA graph compatibility and multi-node support.
Verify that the monolithic kernel integration is correctly implemented to thread routing_replay_out through the apply_monolithic() call chain.
Test the MTP and prefix caching support to ensure correct seqlen clamping and output trimming.
Validate the performance and accuracy of the new implementation using the provided test configurations and benchmarks.

Example

No code snippet is provided as the issue is focused on the design and implementation of the routing replay replacement, and the code changes are already outlined in the proposal.

Notes

The proposed solution depends on the companion FlashInfer PR, which adds routing_replay_out as an optional parameter to all MoE kernel launchers. The vLLM PR should pin the FlashInfer version with routing_replay_out support before merging the changes.

Recommendation

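Adopt the proposed replacement of the current routing replay implementation with the device cache approach once the FlashInfer dependency is available. It addresses the identified issues; the reported cost is a small throughput overhead (2.0% to 5.5% in the benchmarks above), with accuracy unchanged from baseline.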

