vllm - ✅(Solved) Fix [RFC]: Support Mooncake Based ECConnector for EPD [3 pull requests, 1 comments, 2 participants]

GitHub stats
vllm-project/vllm#39766 · Fetched 2026-04-16 06:36:48
Comments: 1
Participants: 2
Timeline: 18
Reactions: 0
Timeline (top): subscribed ×9, mentioned ×7, commented ×1, labeled ×1

PR fix notes

PR #33714: [EC Connector] SHMConnector: Share Memory based EC Connector

Description (problem / solution / changelog)

Purpose

This PR introduces SHMConnector, a new ECConnector implementation that leverages Shared Memory (SHM) and PyTorch RPC to enable low-latency transfer of encoder caches between ECConnector Producer and ECConnector Consumer.

Key Changes:

  1. Shared Memory Transport: Uses torch.multiprocessing.reductions.reduce_tensor to create shared memory handles for encoder cache tensors, enabling zero-copy-like inter-process transfer without raw data duplication.
  2. PyTorch RPC Control Plane: Implements a TensorPipe-backed PyTorch RPC layer to reliably broadcast shared memory handles and metadata from Producers to all Consumers across processes/nodes.
  3. Asynchronous Processing: Background threads and thread-safe queues handle cache serialization and RPC transmission, avoiding blocking of the main inference loop.
  4. Explicit Resource Management: Triggers gc.collect() and torch.cuda.empty_cache() in Producers on request completion to prevent memory fragmentation in long-running deployments; adds graceful cleanup for RPC agents and background threads.
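
The asynchronous hand-off described in item 3 can be pictured with a small stdlib-only sketch. This is not the PR's code: `CacheSender`, `send_fn`, and the handle dictionaries are hypothetical stand-ins for the shared memory handle and the RPC broadcast, and a plain list plays the role of the Consumer.

```python
import queue
import threading


class CacheSender:
    """Hypothetical stand-in: drains a queue on a background thread."""

    _STOP = object()  # sentinel that tells the worker thread to exit

    def __init__(self, send_fn):
        self._send_fn = send_fn      # stand-in for the RPC broadcast
        self._queue = queue.Queue()  # thread-safe hand-off to the worker
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, handle):
        # Called from the inference loop; returns without waiting for the send.
        self._queue.put(handle)

    def _run(self):
        while True:
            item = self._queue.get()
            if item is self._STOP:
                break
            self._send_fn(item)

    def close(self):
        # Graceful cleanup: pending items are sent first, then the thread joins.
        self._queue.put(self._STOP)
        self._thread.join()


sent = []
sender = CacheSender(sent.append)
for i in range(3):
    sender.submit({"mm_hash": f"img-{i}", "nbytes": 1024 * (i + 1)})
sender.close()
print([h["mm_hash"] for h in sent])
```

Because a single worker thread drains a FIFO queue, handles are delivered in submission order, and `close()` only returns once everything queued before it has been sent.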

Test Plan

cd examples/online_serving/disaggregated_encoder_shm
# Terminal 1
bash run_e.sh
# Terminal 2
bash run_pd.sh
# Terminal 3
bash 1e1pd_proxy.sh
# Terminal 4
vllm bench serve ...

Test Result

Tested on a single NVIDIA RTX 4090 (1E-1PD Colocated) with cc=4 and cc=16:

SHMConnector (cc=4)

============ Serving Benchmark Result ============
Successful requests:                     16        
Failed requests:                         0         
Maximum request concurrency:             4         
Benchmark duration (s):                  9.53      
Total input tokens:                      320       
Total generated tokens:                  4096      
Request throughput (req/s):              1.68      
Output token throughput (tok/s):         430.02    
Peak output token throughput (tok/s):    612.00    
Peak concurrent requests:                8.00      
Total token throughput (tok/s):          463.62    
---------------Time to First Token----------------
Mean TTFT (ms):                          625.72    
Median TTFT (ms):                        666.30    
P75 TTFT (ms):                           727.82    
P90 TTFT (ms):                           750.67    
P99 TTFT (ms):                           823.90    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          6.87      
Median TPOT (ms):                        6.76      
P75 TPOT (ms):                           6.96      
P90 TPOT (ms):                           7.26      
P99 TPOT (ms):                           7.84      
==================================================

ExampleConnector (cc=4)

============ Serving Benchmark Result ============
Successful requests:                     16        
Failed requests:                         0         
Maximum request concurrency:             4         
Benchmark duration (s):                  9.92      
Total input tokens:                      320       
Total generated tokens:                  4096      
Request throughput (req/s):              1.61      
Output token throughput (tok/s):         413.07    
Peak output token throughput (tok/s):    612.00    
Peak concurrent requests:                8.00      
Total token throughput (tok/s):          445.34    
---------------Time to First Token----------------
Mean TTFT (ms):                          681.57    
Median TTFT (ms):                        696.82    
P75 TTFT (ms):                           776.43    
P90 TTFT (ms):                           819.00    
P99 TTFT (ms):                           952.85    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          6.89      
Median TPOT (ms):                        6.80      
P75 TPOT (ms):                           7.10      
P90 TPOT (ms):                           7.46      
P99 TPOT (ms):                           7.54      
==================================================

SHMConnector (cc=16)

============ Serving Benchmark Result ============
Successful requests:                     64        
Failed requests:                         0         
Maximum request concurrency:             16        
Benchmark duration (s):                  19.90     
Total input tokens:                      1280      
Total generated tokens:                  16384     
Request throughput (req/s):              3.22      
Output token throughput (tok/s):         823.34    
Peak output token throughput (tok/s):    1616.00   
Peak concurrent requests:                25.00     
Total token throughput (tok/s):          887.66    
---------------Time to First Token----------------
Mean TTFT (ms):                          1123.68   
Median TTFT (ms):                        893.69    
P75 TTFT (ms):                           1268.01   
P90 TTFT (ms):                           2819.99   
P99 TTFT (ms):                           3226.53   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.72     
Median TPOT (ms):                        15.19     
P75 TPOT (ms):                           16.03     
P90 TPOT (ms):                           16.96     
P99 TPOT (ms):                           17.39     
==================================================

ExampleConnector (cc=16)

============ Serving Benchmark Result ============
Successful requests:                     64        
Failed requests:                         0         
Maximum request concurrency:             16        
Benchmark duration (s):                  20.89     
Total input tokens:                      1280      
Total generated tokens:                  16384     
Request throughput (req/s):              3.06      
Output token throughput (tok/s):         784.28    
Peak output token throughput (tok/s):    1369.00   
Peak concurrent requests:                23.00     
Total token throughput (tok/s):          845.55    
---------------Time to First Token----------------
Mean TTFT (ms):                          1149.32   
Median TTFT (ms):                        676.39    
P75 TTFT (ms):                           1051.28   
P90 TTFT (ms):                           3063.60   
P99 TTFT (ms):                           3814.79   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          15.14     
Median TPOT (ms):                        16.08     
P75 TPOT (ms):                           16.54     
P90 TPOT (ms):                           16.83     
P99 TPOT (ms):                           17.08     
==================================================
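
To make the four tables easier to compare, the short script below computes the relative change of SHMConnector against ExampleConnector for a few of the reported metrics. The numbers are copied from the tables above; `pct_change` and the dictionary layout are illustrative only.

```python
def pct_change(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# (ExampleConnector, SHMConnector) pairs copied from the tables above.
results = {
    "cc=4": {
        "req/s": (1.61, 1.68),
        "mean TTFT ms": (681.57, 625.72),
        "median TTFT ms": (696.82, 666.30),
        "P99 TTFT ms": (952.85, 823.90),
    },
    "cc=16": {
        "req/s": (3.06, 3.22),
        "mean TTFT ms": (1149.32, 1123.68),
        "median TTFT ms": (676.39, 893.69),
        "P99 TTFT ms": (3814.79, 3226.53),
    },
}

deltas = {
    cc: {name: round(pct_change(base, shm), 2) for name, (base, shm) in rows.items()}
    for cc, rows in results.items()
}
for cc, rows in deltas.items():
    for name, delta in rows.items():
        print(f"{cc:6s} {name:15s} {delta:+.2f}%")
```

On these numbers SHMConnector has higher request throughput and lower mean and P99 TTFT at both concurrency levels, but its median TTFT at cc=16 is higher than ExampleConnector's (893.69 ms versus 676.39 ms). The runs cover only 16 and 64 requests, so the differences are best read as indicative.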
<details> <summary> Essential Elements of an Effective PR Description Checklist </summary>
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the [Google Doc](https://docs.google.com/document/d/1YyVqrgX4gHTtrstbq8oWUImOyPCKSGnJ7xtTpmXzlRs/edit?tab=t.0). </details>

Changed files

  • examples/online_serving/disaggregated_encoder_shm/1e1p1d_proxy.sh (added, +7/-0)
  • examples/online_serving/disaggregated_encoder_shm/1e1pd_proxy.sh (added, +7/-0)
  • examples/online_serving/disaggregated_encoder_shm/run_d.sh (added, +17/-0)
  • examples/online_serving/disaggregated_encoder_shm/run_e.sh (added, +37/-0)
  • examples/online_serving/disaggregated_encoder_shm/run_p.sh (added, +34/-0)
  • examples/online_serving/disaggregated_encoder_shm/run_pd.sh (added, +37/-0)
  • vllm/distributed/ec_transfer/ec_connector/factory.py (modified, +5/-0)
  • vllm/distributed/ec_transfer/ec_connector/shm_connector.py (added, +399/-0)

PR #34051: [Core] Configurable encoder compute and cache budget

Description (problem / solution / changelog)

Purpose

  • Rename SchedulerConfig.max_num_encoder_input_tokens -> SchedulerConfig.max_num_batched_encoder_embeds, since it's now based on the actual number of multimodal embeddings after applying is_embeds mask.
  • Make both max_num_batched_encoder_embeds and encoder_cache_size configurable instead of being hardcoded to max_num_batched_tokens.
  • Add corresponding validation:
    • max_num_batched_encoder_embeds and encoder_cache_size should be at least max_tokens_per_mm_item, but we automatically override the value and only display a warning if the user sets a value that is too small, because it may be difficult for the user to know what max_tokens_per_mm_item is beforehand.
    • Do not allow encoder_cache_size < max_num_batched_encoder_embeds since that effectively reduces max_num_batched_encoder_embeds to the value of encoder_cache_size (we stop scheduling multimodal items if either compute budget or cache budget is exhausted).
  • Update corresponding profiling code:
    • Add more embeddings to the encoder cache if encoder_cache_size > max_num_batched_encoder_embeds.
    • Improve related logs, for example for Qwen/Qwen3-VL-2B-Instruct:
    (EngineCore_DP0 pid=2931601) INFO 02-08 13:44:52 [gpu_model_runner.py:4225] Model loading took 4.24 GiB memory and 4.791106 seconds
    (EngineCore_DP0 pid=2931601) INFO 02-08 13:44:52 [gpu_model_runner.py:5144] Multimodal encoder will be profiled with 1 image item of the maximum feature size (16384 embeds/item).
    (EngineCore_DP0 pid=2931601) INFO 02-08 13:44:55 [gpu_model_runner.py:5184] Encoder cache contains up to 16384 embeddings (0.25 GiB).
  • Rename Scheduler.max_num_encoder_input_tokens to Scheduler.encoder_compute_budget to match its usage.
  • Adjust the definition of GPUModelRunner.max_encoder_len to be non-zero for all MM models, not just ones that use cross-attention, so that it is consistent with Scheduler. Updated downstream usages to take this into account.
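
The two validation rules above can be sketched as follows. This is a minimal illustration, not the code in vllm/config/scheduler.py: `resolve_encoder_budgets` is a hypothetical helper, and applying the automatic override before the ordering check is an assumption.

```python
import warnings


def resolve_encoder_budgets(compute_budget, cache_size, max_tokens_per_mm_item):
    """Hypothetical helper; returns (compute_budget, cache_size) after validation."""
    # Too-small values are raised to fit one multimodal item, with a warning,
    # because users may not know max_tokens_per_mm_item in advance.
    if compute_budget < max_tokens_per_mm_item:
        warnings.warn(
            f"max_num_batched_encoder_embeds raised to {max_tokens_per_mm_item}"
        )
        compute_budget = max_tokens_per_mm_item
    if cache_size < max_tokens_per_mm_item:
        warnings.warn(f"encoder_cache_size raised to {max_tokens_per_mm_item}")
        cache_size = max_tokens_per_mm_item
    # A cache smaller than the compute budget would silently cap the budget,
    # so that combination is rejected (ordering relative to the override is assumed).
    if cache_size < compute_budget:
        raise ValueError(
            "encoder_cache_size must be >= max_num_batched_encoder_embeds"
        )
    return compute_budget, cache_size


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    resolved = resolve_encoder_budgets(8192, 8192, max_tokens_per_mm_item=16384)
print(resolved, len(caught))
```

In this sketch, undersized values are corrected with a warning, while a cache size below the compute budget is treated as a configuration error.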

This addresses the TODOs by @ywang96

Thoughts:

  • For general-purpose VLMs, it is mostly useless to adjust max_num_batched_encoder_embeds since it is inflated by max_tokens_per_mm_item already; but for OCR models and older VLMs like LLaVA, the feature size can become smaller than max_num_batched_tokens. Setting max_num_batched_encoder_embeds < max_num_batched_tokens causes some compute budget to be "reserved" for decoder-only tokens (including text tokens as well as cached multimodal embeddings) instead of being allocated to encoder inputs, though I'm not sure why you would want to do that.
  • There is a much stronger case for configuring encoder_cache_size, since it enables more MM features to be shared between requests. Even for Qwen3-VL, an encoder cache size of 16k only takes up 0.25 GiB, so it is reasonable to increase this by an order of magnitude or even more in situations with a high MM cache hit rate. There are some cases where the encoder cache is useful:
    • Prefix caching does not apply if the initial system prompt is different.
    • Prefix caching does not apply if MM items come in a different order from the previous request, even if individual items are the same.
    • If the cache size is too small, requests with large MM size filling up the cache will exclude other MM requests from being prefilled until prefill is completed for the MM tokens in the earlier requests.
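
The memory estimate in the second point can be worked through from the log line quoted earlier (16384 embeddings, 0.25 GiB). The arithmetic below assumes the logged 0.25 GiB is exact rather than rounded and that the footprint scales linearly with capacity; `cache_gib` is an illustrative helper.

```python
GIB = 1024 ** 3

logged_embeds = 16384  # "up to 16384 embeddings" from the log above
logged_gib = 0.25      # "(0.25 GiB)" from the same log line, possibly rounded

bytes_per_embed = logged_gib * GIB / logged_embeds


def cache_gib(num_embeds):
    """Estimated encoder cache footprint in GiB for a given capacity."""
    return num_embeds * bytes_per_embed / GIB


print(f"{bytes_per_embed / 1024:.0f} KiB per embedding")
for factor in (1, 10, 100):
    print(f"{factor:>3}x -> {cache_gib(logged_embeds * factor):.2f} GiB")
```

Under these assumptions each cached embedding costs 16 KiB, so a tenfold larger cache needs about 2.5 GiB.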

Test Plan

Test Result



Changed files

  • docs/configuration/optimization.md (modified, +38/-37)
  • examples/online_serving/disaggregated_encoder/README.md (modified, +1/-1)
  • tests/v1/core/test_scheduler.py (modified, +125/-4)
  • tests/v1/core/utils.py (modified, +5/-1)
  • vllm/config/scheduler.py (modified, +44/-17)
  • vllm/engine/arg_utils.py (modified, +14/-0)
  • vllm/v1/core/encoder_cache_manager.py (modified, +127/-10)
  • vllm/v1/core/sched/scheduler.py (modified, +2/-2)
  • vllm/v1/kv_cache_interface.py (modified, +9/-1)
  • vllm/v1/worker/gpu_model_runner.py (modified, +79/-29)

PR #38330: [Core] Add score encoder cache manager

Description (problem / solution / changelog)

Purpose

With the advancement of multimodal large models, repeated visual encoding from high-resolution images and long videos has emerged as a key bottleneck for inference latency and computational cost. While vLLM alleviates redundant computation by introducing an EMB Cache, its cache management mechanism still follows the design paradigm of KV/Prefix Cache and is not optimized for the unique resource characteristics of EMB Cache.

Specifically, the effectiveness of EMB Cache is primarily driven by cache hit rate, which is inherently constrained by cache capacity. However, HBM is characterized by high bandwidth but limited capacity, making it difficult to store large-scale reusable visual features. As a result, many high-value entries are frequently evicted, leading to significant resource mismatch.

Meanwhile, existing scheduling strategies struggle to capture the value characteristics of EMB Cache. Time-locality-based policies such as LRU approximate cache value using “recency of access.” However, in the EMB Cache setting, entries exhibit substantial heterogeneity in recomputation cost, memory footprint, and reuse patterns, making a “cache hit” clearly not equivalent to “high value.”

This PR will resolve the mismatch in the storage architecture and the lack of value-aware modeling of EMB Cache in existing scheduling strategies.
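
The gap between recency and value can be illustrated with a toy comparison. The PR description does not state its scoring formula, so the `score` function below (saved recomputation time per unit of cache space) is purely a hypothetical example, as are the `Entry` fields and the sample values.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    key: str
    size_embeds: int     # memory footprint
    recompute_ms: float  # cost of re-running the encoder
    hits: int            # observed reuse count
    last_access: int     # logical timestamp of the latest access


def lru_victim(entries):
    # LRU looks only at recency.
    return min(entries, key=lambda e: e.last_access)


def score(e):
    # Hypothetical value density: encoder time saved per cached embedding.
    return e.recompute_ms * (1 + e.hits) / e.size_embeds


def score_victim(entries):
    # A value-aware policy evicts the entry with the lowest score.
    return min(entries, key=score)


entries = [
    Entry("long-video", size_embeds=16000, recompute_ms=900.0, hits=5, last_access=1),
    Entry("thumbnail", size_embeds=64, recompute_ms=4.0, hits=0, last_access=9),
    Entry("hi-res-image", size_embeds=4000, recompute_ms=200.0, hits=1, last_access=5),
]
print("LRU evicts:  ", lru_victim(entries).key)
print("score evicts:", score_victim(entries).key)
```

With these made-up values, LRU evicts the frequently reused, expensive video entry because it was touched least recently, while the score-based policy evicts the cheap, never-reused thumbnail.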

Test Plan

This PR is validated primarily through end-to-end multimodal inference tests with the new cache manager enabled on device, covering both single-GPU and multi-GPU settings. The covered test scenarios include:

  • First request: generates encoder outputs and stores them in the CPU cache.
  • Repeated request hits:
    • Hits in NPU cache.
    • Hits in CPU cache and triggers promotion.
    • Hits in CPU cache without promotion (served via temporary device cache).
  • After request completion:
    • Correctly releases references.
    • Marks cache entries as reclaimable.
  • Memory pressure scenarios:
    • Triggers eviction on the NPU side.
    • Performs reclamation on the CPU side.

Precision (accuracy) tests are also included.

Test Result

We evaluated our system on the Qwen2.5-VL model. Under a lossless setting, end-to-end throughput improves by up to 5.96% and TTFT is reduced by up to 32.29%. <img width="590" height="391" alt="image" src="https://github.com/user-attachments/assets/96fa3ccd-9242-45b2-9ccc-81d63980072e" />

We further evaluated our system in a multi-GPU setting on the Qwen2.5-VL model and the MMBench dataset. The system demonstrates stable performance, with throughput improvements ranging from 2.23% to 6.39%, P90 TTFT changes from −0.37% to 32.29%, and TPOT improvements ranging from 0.29% to 6.73%. <img width="200" height="211" alt="image" src="https://github.com/user-attachments/assets/bc17b411-12dc-4292-ba77-47218b5d7504" />



Changed files

  • vllm/config/score_encoder_cache.py (added, +50/-0)
  • vllm/v1/core/encoder_cache_manager.py (modified, +405/-1)
  • vllm/v1/core/sched/output.py (modified, +2/-0)
  • vllm/v1/core/sched/scheduler.py (modified, +29/-3)
  • vllm/v1/worker/gpu_model_runner.py (modified, +91/-13)


Motivation.

vLLM's EPD (Encoder-Prefill-Decode) disaggregation feature allows running vision encoders on separate instances from the language model prefill/decode stages. This enables independent scaling, lower TTFT for text-only requests, and cross-process reuse of encoder outputs.

The current EPD implementation uses ECExampleConnector, a file-based connector suitable for debugging and experimentation. For production deployments, we need a high-performance connector that supports multiple network transports (TCP, RDMA, SHM/NVLink).

MooncakeECConnector addresses this by leveraging the Mooncake TransferEngine, which provides a unified API across multiple transport backends.

Key Challenge:

Unlike KV Cache transfer (where block memory is pre-allocated at fixed addresses), encoder cache tensors are dynamically allocated with variable sizes. Each multimodal input produces an encoder output of different dimensions depending on image resolution, model architecture, etc. Mooncake's TransferEngine requires pre-registered memory for efficient transfers, so we introduce an EmbedBlockManager with a pre-registered GPU buffer.
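
One way to picture serving variable-size tensors from a fixed, pre-registered buffer is a block pool. The RFC text here does not describe the internals of EmbedBlockManager, so the class below is a hypothetical sketch: `EmbedBlockPool`, its block size, and its method names are assumptions, and it tracks byte offsets only, without touching GPU memory.

```python
import math


class EmbedBlockPool:
    """Hypothetical fixed-size block allocator over one pre-registered buffer."""

    def __init__(self, buffer_bytes, block_bytes):
        self.block_bytes = block_bytes
        self._free = list(range(buffer_bytes // block_bytes))
        self._owned = {}  # key -> list of block indices

    def allocate(self, key, nbytes):
        # Round the tensor size up to a whole number of blocks.
        need = math.ceil(nbytes / self.block_bytes)
        if need > len(self._free):
            return None  # not enough space; the caller must wait or evict
        blocks = [self._free.pop() for _ in range(need)]
        self._owned[key] = blocks
        # Byte offsets into the registered buffer that a transfer could target.
        return [b * self.block_bytes for b in blocks]

    def free(self, key):
        # Return the blocks of a finished entry to the pool.
        self._free.extend(self._owned.pop(key))

    @property
    def free_blocks(self):
        return len(self._free)


pool = EmbedBlockPool(buffer_bytes=1024 * 1024, block_bytes=64 * 1024)
small = pool.allocate("img-a", 150_000)  # needs 3 blocks
large = pool.allocate("img-b", 900_000)  # needs 14 blocks, only 13 are free
pool.free("img-a")
retry = pool.allocate("img-b", 900_000)  # fits once img-a is released
print(len(small), large, len(retry), pool.free_blocks)
```

The trade-off in such a design is internal fragmentation (the last block of each tensor is partly unused) in exchange for registering memory once instead of per tensor.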

Related Discussions:

Related Works:

Proposed Change.

System Architecture

<img width="2055" height="1104" alt="Image" src="https://github.com/user-attachments/assets/fd0cd393-c380-4c7d-8663-57535129f495" />

[!NOTE] Find more details at the design doc.

Key Components

<img width="965" height="1317" alt="Image" src="https://github.com/user-attachments/assets/11a4e0a2-440c-43f4-8f91-94c1589db5d9" />

MooncakeECConnector extends from ECConnectorBase and implements its interface in MooncakeECConnectorScheduler and MooncakeECConnectorWorker, respectively (separated according to their responsibility).

These interfaces are used exactly as in the standard EPD workflow (i.e., ECExampleConnector).

ECExampleConnector stores and transfers the encoder cache through the local filesystem (safetensors), which incurs disk I/O and CPU serialization and may perform poorly. MooncakeECConnector instead supports multiple protocols (TCP/RDMA/EFA) through the Mooncake TransferEngine.

[!NOTE] Find more details at the design doc.
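
The scheduler/worker split described above can be sketched with stand-in classes. Nothing here is the vLLM interface: `Role`, `SchedulerSide`, `WorkerSide`, `SketchConnector`, and all method names are hypothetical, and a dictionary stands in for the transfer engine. The sketch only shows the division of responsibility, with bookkeeping on the scheduler side and data movement on the worker side.

```python
from enum import Enum


class Role(Enum):
    SCHEDULER = "scheduler"
    WORKER = "worker"


class SchedulerSide:
    """Hypothetical: tracks which encoder outputs exist; moves no tensor data."""

    def __init__(self):
        self._known = set()

    def record(self, mm_hash):
        self._known.add(mm_hash)

    def has_cache(self, mm_hash):
        return mm_hash in self._known


class WorkerSide:
    """Hypothetical: owns the data path (a dict stands in for the transfer engine)."""

    def __init__(self, store):
        self._store = store

    def save(self, mm_hash, tensor):
        self._store[mm_hash] = tensor

    def load(self, mm_hash):
        return self._store[mm_hash]


class SketchConnector:
    """Hypothetical facade that instantiates only the half matching its role."""

    def __init__(self, role, store):
        self.scheduler = SchedulerSide() if role is Role.SCHEDULER else None
        self.worker = WorkerSide(store) if role is Role.WORKER else None


store = {}
producer = SketchConnector(Role.WORKER, store)
producer.worker.save("img-0", [0.1, 0.2])

sched = SketchConnector(Role.SCHEDULER, store)
sched.scheduler.record("img-0")

consumer = SketchConnector(Role.WORKER, store)
loaded = consumer.worker.load("img-0") if sched.scheduler.has_cache("img-0") else None
print(loaded)
```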

OOT Compatibility

Supports out-of-tree (OOT) transfer backends for Mooncake, such as ascend.

ret_value = self.engine.initialize(
    hostname,
    "P2PHANDSHAKE",
    "tcp/rdma/ascend/...",
    device_name if device_name is not None else "",
)
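
The string "tcp/rdma/ascend/..." in the snippet above reads as a placeholder for a single protocol name. A small helper can make that explicit when building the arguments. This is a sketch under that assumption: `transfer_engine_args` and the hostname are hypothetical, and the helper only assembles the argument tuple without importing or calling Mooncake.

```python
def transfer_engine_args(hostname, protocol, device_name=None):
    """Hypothetical helper: builds the positional arguments shown above."""
    # Assumption: exactly one protocol name is passed, not a slash-joined list.
    if not protocol or "/" in protocol:
        raise ValueError("pass exactly one protocol name, e.g. 'tcp' or 'rdma'")
    return (
        hostname,
        "P2PHANDSHAKE",
        protocol,
        device_name if device_name is not None else "",
    )


args = transfer_engine_args("encoder-node-0", "rdma")
print(args)
```

With a real engine object, the tuple would be unpacked into the call as `self.engine.initialize(*args)`.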

Roadmap

Future Plan:

Others Related (Check Compatibility):

Feedback Period.

No response

CC List.

@ywang96 @DarkLight1337 @NickLucche @fake0fan @wangxiyuan @PiratePai

Any Other Things.

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

Implement the MooncakeECConnector to replace the ECExampleConnector for production deployments, utilizing the Mooncake TransferEngine for efficient encoder cache transfers.

Guidance

  • Review the design document for MooncakeECConnector to understand its architecture and components.
  • Implement the MooncakeECConnectorScheduler and MooncakeECConnectorWorker classes, extending from ECConnectorBase and implementing its interface.
  • Ensure compatibility with various protocols (TCP/RDMA/EFA) and support for OOT transfer backends like ascend.
  • Test the MooncakeECConnector with different encoder cache sizes and network transports to verify its performance.

Example

ret_value = self.engine.initialize(
    hostname,
    "P2PHANDSHAKE",
    "tcp/rdma/ascend/...",
    device_name if device_name is not None else "",
)

This snippet initializes the Mooncake engine; the third argument names the transport protocol to use (for example tcp, rdma, or an out-of-tree backend such as ascend).

Notes

The implementation of MooncakeECConnector requires careful consideration of the dynamic allocation of encoder cache tensors and the pre-registration of memory for efficient transfers. The EmbedBlockManager with a pre-registered GPU buffer is introduced to address this challenge.

Recommendation

Use MooncakeECConnector in place of ECExampleConnector for production deployments: it is proposed as a high-performance connector that supports multiple network transports and efficient encoder cache transfers.
