vllm - ✅(Solved) Fix [Feature][Ray]: Support EEP for RayExecutorV2 [1 pull requests, 1 comments, 2 participants]

GitHub stats
vllm-project/vllm#38164 (fetched 2026-04-08 01:32:00)
Comments: 1
Participants: 2
Timeline: 7
Reactions: 0
Timeline (top): cross-referenced ×2, subscribed ×2, commented ×1, labeled ×1

Fix Action

Fixed

PR fix notes

PR #36836: [Feat][Executor] Introduce RayExecutorV2

Description (problem / solution / changelog)

Purpose

  • Implement RayExecutorV2, a new Ray-based distributed executor that uses MessageQueue (shared memory + TCP fallback) for the control plane instead of Ray compiled graphs. It reuses MultiprocExecutor's MQ-based RPC and NCCL data plane while spawning workers as Ray actors into placement group bundles.
  • Workers on the same node as the driver communicate via shared memory; cross-node workers automatically fall back to ZMQ TCP transport. Bundle assignments are sorted driver-node-first to ensure rank 0 is co-located with the executor.
  • Add VLLM_USE_RAY_V2_EXECUTOR_BACKEND env var feature flag (default off) to opt into the new executor when distributed_executor_backend="ray". Enable async scheduling support for the new backend.
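The driver-node-first ordering and the shared-memory/TCP split described above can be illustrated with a small sketch. This is not the code from the PR: the `Bundle` record, the `sort_bundles_driver_first` and `transport_for` helpers, and the IP addresses are all hypothetical, chosen only to show the idea that rank 0 lands on the driver's node and that only off-node workers need the TCP path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bundle:
    # Hypothetical record; the real placement-group bundle layout is not
    # shown in the PR description.
    index: int
    node_ip: str


def sort_bundles_driver_first(bundles, driver_ip):
    # Bundles on the driver node sort first (False < True), then bundles are
    # grouped by node and ordered by their original index.
    return sorted(bundles, key=lambda b: (b.node_ip != driver_ip, b.node_ip, b.index))


def transport_for(bundle, driver_ip):
    # Same node as the driver: shared memory. Any other node: TCP fallback.
    return "shm" if bundle.node_ip == driver_ip else "tcp"


DRIVER_IP = "10.0.0.1"  # made-up address for the example
bundles = [
    Bundle(0, "10.0.0.2"),
    Bundle(1, "10.0.0.1"),
    Bundle(2, "10.0.0.2"),
    Bundle(3, "10.0.0.1"),
]

ordered = sort_bundles_driver_first(bundles, DRIVER_IP)
# Map each rank to (original bundle index, transport).
ranks = {rank: (b.index, transport_for(b, DRIVER_IP)) for rank, b in enumerate(ordered)}
```

In this example, rank 0 is assigned to a bundle on the driver's node, so it can use shared memory, while the two bundles on the other node fall back to TCP.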

For more details, please refer to RFC: https://github.com/vllm-project/vllm/issues/35848.

EEP support is out-of-scope for this PR and is tracked here: https://github.com/vllm-project/vllm/issues/38164.

Test Plan

Unit tests

  • pytest tests/distributed/test_ray_v2_executor.py: executor init, TP/PP combos, placement groups, RPC, worker death, shutdown
  • pytest tests/utils_/test_ray_utils.py: bundle sorting logic
  • Validate cross-node TCP path for MessageQueue with test_mq_tcp_multinode.py

Integration tests

  • pytest tests/distributed/test_ray_v2_executor.py: Creates Ray actors which initialize AsyncLLMEngine internally and verify that they can serve requests.
  • pytest tests/distributed/test_pipeline_parallel.py -k "ray": PP correctness with the new backend
  • pytest tests/basic_correctness/test_basic_correctness.py -k "ray": basic correctness

Test Result

Benchmark results (Qwen/Qwen3-8B on L4)

Server:

# MP backend
vllm serve Qwen/Qwen3-8B --tensor-parallel-size 4 --distributed-executor-backend mp --port 8000

# Existing Ray backend
VLLM_USE_RAY_V2_EXECUTOR_BACKEND=0 vllm serve Qwen/Qwen3-8B --tensor-parallel-size 4 --distributed-executor-backend ray --port 8000

# Ray V2 backend
VLLM_USE_RAY_V2_EXECUTOR_BACKEND=1 vllm serve Qwen/Qwen3-8B --tensor-parallel-size 4 --distributed-executor-backend ray --port 8000
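The three server commands differ only in the `--distributed-executor-backend` value and the feature flag. A rough sketch of that selection logic follows. How vLLM actually parses the variable is not shown in the PR text, so the `use_ray_v2` and `pick_executor` helpers are assumptions, and the return values are descriptive labels rather than real class objects.

```python
import os


def use_ray_v2(environ=None):
    # Assumption: the flag is treated as on only when set to "1"; the
    # description says only that it defaults to off.
    environ = os.environ if environ is None else environ
    return environ.get("VLLM_USE_RAY_V2_EXECUTOR_BACKEND", "0") == "1"


def pick_executor(backend, environ=None):
    # Returns a label for the executor that would be chosen.
    if backend == "mp":
        return "MultiprocExecutor"
    if backend == "ray":
        return "RayExecutorV2" if use_ray_v2(environ) else "existing Ray executor"
    raise ValueError(f"unknown backend: {backend!r}")


# Mirrors the three server configurations in the benchmark.
choices = [
    pick_executor("mp", {}),
    pick_executor("ray", {"VLLM_USE_RAY_V2_EXECUTOR_BACKEND": "0"}),
    pick_executor("ray", {"VLLM_USE_RAY_V2_EXECUTOR_BACKEND": "1"}),
]
```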

Client:

vllm bench serve --model Qwen/Qwen3-8B --dataset-name random --input-len 512 --output-len 128 --num-prompts 500 --request-rate 10 --port 8000
  • TP=4; MP backend (async scheduling is on by default)
============ Serving Benchmark Result ============
Successful requests:                     500       
Failed requests:                         0         
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  53.64     
Total input tokens:                      256000    
Total generated tokens:                  64000     
Request throughput (req/s):              9.32      
Output token throughput (tok/s):         1193.20   
Peak output token throughput (tok/s):    1475.00   
Peak concurrent requests:                82.00     
Total token throughput (tok/s):          5965.99   
---------------Time to First Token----------------
Mean TTFT (ms):                          117.12    
Median TTFT (ms):                        117.26    
P99 TTFT (ms):                           156.28    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          40.95     
Median TPOT (ms):                        41.81     
P99 TPOT (ms):                           46.68     
---------------Inter-token Latency----------------
Mean ITL (ms):                           40.95     
Median ITL (ms):                         40.80     
P99 ITL (ms):                            54.51
  • TP=4; Ray backend
============ Serving Benchmark Result ============
Successful requests:                     500       
Failed requests:                         0         
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  53.93     
Total input tokens:                      256000    
Total generated tokens:                  64000     
Request throughput (req/s):              9.27      
Output token throughput (tok/s):         1186.80   
Peak output token throughput (tok/s):    1464.00   
Peak concurrent requests:                84.00     
Total token throughput (tok/s):          5934.02   
---------------Time to First Token----------------
Mean TTFT (ms):                          86.00     
Median TTFT (ms):                        86.32     
P99 TTFT (ms):                           120.62    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.88     
Median TPOT (ms):                        47.14     
P99 TPOT (ms):                           51.94     
---------------Inter-token Latency----------------
Mean ITL (ms):                           45.88     
Median ITL (ms):                         47.21     
P99 ITL (ms):                            58.59
  • TP=4; Ray V2 backend w/ async scheduling
============ Serving Benchmark Result ============
Successful requests:                     500       
Failed requests:                         0         
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  53.67     
Total input tokens:                      256000    
Total generated tokens:                  64000     
Request throughput (req/s):              9.32      
Output token throughput (tok/s):         1192.53   
Peak output token throughput (tok/s):    1442.00   
Peak concurrent requests:                82.00     
Total token throughput (tok/s):          5962.65   
---------------Time to First Token----------------
Mean TTFT (ms):                          119.11    
Median TTFT (ms):                        120.43    
P99 TTFT (ms):                           154.20    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          41.11     
Median TPOT (ms):                        42.06     
P99 TPOT (ms):                           46.64     
---------------Inter-token Latency----------------
Mean ITL (ms):                           41.11     
Median ITL (ms):                         40.82     
P99 ITL (ms):                            54.10
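
To make the three result blocks easier to compare, the following sketch computes the relative difference of the Ray V2 run against the other two backends. The numbers are copied from the tables above; the dictionary keys and the `pct_change` helper are names introduced here for illustration.

```python
# Selected metrics copied from the three benchmark result blocks.
results = {
    "mp": {"mean_ttft_ms": 117.12, "mean_tpot_ms": 40.95,
           "p99_itl_ms": 54.51, "output_tok_s": 1193.20},
    "ray": {"mean_ttft_ms": 86.00, "mean_tpot_ms": 45.88,
            "p99_itl_ms": 58.59, "output_tok_s": 1186.80},
    "ray_v2": {"mean_ttft_ms": 119.11, "mean_tpot_ms": 41.11,
               "p99_itl_ms": 54.10, "output_tok_s": 1192.53},
}


def pct_change(new, base):
    # Relative difference of `new` with respect to `base`, in percent.
    return (new - base) / base * 100.0


for metric in results["ray_v2"]:
    vs_ray = pct_change(results["ray_v2"][metric], results["ray"][metric])
    vs_mp = pct_change(results["ray_v2"][metric], results["mp"][metric])
    print(f"{metric}: {vs_ray:+.1f}% vs ray, {vs_mp:+.1f}% vs mp")
```

On these figures, Ray V2 has roughly 10% lower mean TPOT than the existing Ray backend but roughly 38% higher mean TTFT, and it stays within about 2% of the MP backend on both. Each backend appears to have been measured in a single run, so small differences may be noise.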


Changed files

  • .buildkite/test_areas/distributed.yaml (modified, +34/-0)
  • tests/distributed/ray_v2_utils.py (added, +32/-0)
  • tests/distributed/test_mq_tcp_multinode.py (added, +119/-0)
  • tests/distributed/test_ray_v2_executor.py (added, +344/-0)
  • tests/distributed/test_ray_v2_executor_e2e.py (added, +198/-0)
  • tests/utils_/test_ray_utils.py (added, +100/-0)
  • vllm/envs.py (modified, +7/-0)
  • vllm/v1/executor/abstract.py (modified, +8/-2)
  • vllm/v1/executor/ray_executor.py (modified, +2/-12)
  • vllm/v1/executor/ray_executor_v2.py (added, +486/-0)
  • vllm/v1/executor/ray_utils.py (modified, +133/-14)
  • vllm/v1/worker/worker_base.py (modified, +2/-2)

🚀 The feature, motivation and pitch

RayExecutorV2 is introduced in https://github.com/vllm-project/vllm/pull/36836 but still lacks EEP support.

RayExecutorV2 needs to implement reinitialize_distributed() and forward collective_rpc to workers.

Alternatives

No response

Additional context

No response


extent analysis

Fix Plan

To add EEP support to RayExecutorV2, we need to implement reinitialize_distributed() and forward collective_rpc to workers. Here are the steps:

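  • Implement the reinitialize_distributed() method (a placeholder stub; the issue does not give the real signature or body):

```python
def reinitialize_distributed(self):
    # Reinitialize distributed settings
    pass
```

  • Forward collective_rpc to workers (self.workers and worker.collective_rpc are placeholders from this plan, not confirmed RayExecutorV2 attributes):

```python
def collective_rpc(self, method_name, args=(), kwargs=None):
    # Forward the collective RPC to every worker and gather the results
    kwargs = kwargs or {}
    worker_results = []
    for worker in self.workers:
        result = worker.collective_rpc(method_name, args, kwargs)
        worker_results.append(result)
    return worker_results
```

  • Update RayExecutorV2 to use the new methods:

```python
class RayExecutorV2:
    # ...

    def reinitialize(self):
        self.reinitialize_distributed()

    # collective_rpc() is the forwarding method shown above. Do not wrap it
    # in a same-named method that calls self.collective_rpc(): that would
    # recurse until the interpreter raises RecursionError.
```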

Verification

To verify the fix, test the `reinitialize_distributed()` and `collective_rpc()` methods with a sample workflow.
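
One way to exercise such a workflow without a Ray cluster is to substitute in-process fakes for the workers. The sketch below is only an illustration of that idea: `FakeWorker` and `SketchExecutor` are hypothetical stand-ins, not vLLM classes, and the no-argument `reinitialize_distributed()` mirrors the stub in the fix plan rather than any confirmed signature.

```python
class FakeWorker:
    """Hypothetical in-process stand-in for a Ray worker actor."""

    def __init__(self, rank):
        self.rank = rank
        self.reinit_calls = 0

    def collective_rpc(self, method_name, args=(), kwargs=None):
        # Dispatch by method name, as an RPC layer would.
        return getattr(self, method_name)(*args, **(kwargs or {}))

    def get_rank(self):
        return self.rank

    def reinitialize_distributed(self):
        self.reinit_calls += 1
        return True


class SketchExecutor:
    """Minimal stand-in for the executor methods named in the issue."""

    def __init__(self, workers):
        self.workers = workers

    def collective_rpc(self, method_name, args=(), kwargs=None):
        # Forward the call to every worker and collect results in rank order.
        return [w.collective_rpc(method_name, args, kwargs) for w in self.workers]

    def reinitialize_distributed(self):
        # Reuse the forwarding path so every worker is reinitialized.
        self.collective_rpc("reinitialize_distributed")


executor = SketchExecutor([FakeWorker(rank) for rank in range(4)])
rank_results = executor.collective_rpc("get_rank")
executor.reinitialize_distributed()
reinit_counts = [w.reinit_calls for w in executor.workers]
```

A real test would additionally need to check the behavior on actual Ray actors, which this sketch does not cover.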

Extra Tips

  • Make sure to handle any exceptions that may occur during the reinitialization and RPC forwarding processes.
  • Consider adding logging to track the progress and any issues that may arise during the execution.
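
The two tips can be combined in a forwarding helper that logs each call and wraps worker failures in a single error type. This is a sketch under assumptions: `forward_rpc`, `WorkerRPCError`, the logger name, and the demo worker classes are all hypothetical and not part of vLLM.

```python
import logging

logger = logging.getLogger("ray_v2_sketch")  # hypothetical logger name


class WorkerRPCError(RuntimeError):
    """Hypothetical error raised when a forwarded call fails on a worker."""


def forward_rpc(workers, method_name, args=(), kwargs=None):
    # Call `method_name` on each worker in order; stop at the first failure
    # and re-raise it with the failing rank attached.
    kwargs = kwargs or {}
    results = []
    for rank, worker in enumerate(workers):
        logger.debug("forwarding %s to worker %d", method_name, rank)
        try:
            results.append(getattr(worker, method_name)(*args, **kwargs))
        except Exception as exc:
            logger.error("worker %d failed in %s: %s", rank, method_name, exc)
            raise WorkerRPCError(f"worker {rank} failed in {method_name}") from exc
    return results


class GoodWorker:
    def ping(self):
        return "ok"


class BadWorker:
    def ping(self):
        raise ValueError("simulated failure")


ok_results = forward_rpc([GoodWorker(), GoodWorker()], "ping")
```

Chaining with `raise ... from exc` keeps the original exception available as `__cause__`, so the log line and the traceback both point at the failing worker.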
