Fix Action

PR fix notes

PR #36836: [Feat][Executor] Introduce RayExecutorV2

Repository: vllm-project/vllm
Author: jeffreywang-anyscale
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/36836

Description (problem / solution / changelog)

Purpose

Implement RayExecutorV2, a new Ray-based distributed executor that uses MessageQueue (shared memory + TCP fallback) for the control plane instead of Ray compiled graphs. It reuses MultiprocExecutor's MQ-based RPC and NCCL data plane while spawning workers as Ray actors into placement group bundles.
Workers on the same node as the driver communicate via shared memory; cross-node workers automatically fall back to ZMQ TCP transport. Bundle assignments are sorted driver-node-first to ensure rank 0 is co-located with the executor.
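The driver-node-first ordering and the per-worker transport choice described above can be sketched as follows. This is a minimal illustration, not the PR's code: `Bundle`, `sort_bundles_driver_first`, and `transport_for` are hypothetical names, and the node IDs are made up. The PR's actual sorting logic is what `tests/utils_/test_ray_utils.py` exercises.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bundle:
    """Hypothetical stand-in for a placement group bundle."""
    index: int
    node_id: str


def sort_bundles_driver_first(bundles, driver_node_id):
    # False sorts before True, so driver-node bundles come first.
    # sorted() is stable, so the original order is kept within each group.
    return sorted(bundles, key=lambda b: b.node_id != driver_node_id)


def transport_for(bundle, driver_node_id):
    # Same node as the driver: shared memory. Any other node: TCP fallback.
    return "shm" if bundle.node_id == driver_node_id else "tcp"


bundles = [
    Bundle(0, "node-b"),
    Bundle(1, "node-a"),
    Bundle(2, "node-b"),
    Bundle(3, "node-a"),
]
ordered = sort_bundles_driver_first(bundles, driver_node_id="node-a")
# Map each rank to (original bundle index, transport).
ranks = {
    rank: (b.index, transport_for(b, "node-a"))
    for rank, b in enumerate(ordered)
}
```

With the driver on `node-a`, rank 0 lands on a `node-a` bundle and uses shared memory, while the `node-b` bundles take the later ranks and use TCP.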
Add VLLM_USE_RAY_V2_EXECUTOR_BACKEND env var feature flag (default off) to opt into the new executor when distributed_executor_backend="ray". Enable async scheduling support for the new backend.
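How the feature flag gates the new executor can be sketched as below. Only the env var name, its default-off behavior, and the `distributed_executor_backend="ray"` condition come from the description; the helper `select_ray_executor`, the returned labels, and the rule that only the value `"1"` opts in are assumptions for illustration.

```python
import os


def select_ray_executor(distributed_executor_backend, env=None):
    """Hypothetical helper: return a label for the executor to use."""
    env = os.environ if env is None else env
    if distributed_executor_backend != "ray":
        # The flag only matters when the Ray backend is requested.
        return distributed_executor_backend
    # Default off; treating exactly "1" as "on" is an assumed parsing rule.
    use_v2 = env.get("VLLM_USE_RAY_V2_EXECUTOR_BACKEND", "0") == "1"
    return "ray_v2" if use_v2 else "ray"
```

Passing `env` explicitly keeps the helper easy to test without mutating the process environment.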

For more details, please refer to RFC: https://github.com/vllm-project/vllm/issues/35848.

EEP support is out-of-scope for this PR and is tracked here: https://github.com/vllm-project/vllm/issues/38164.

Test Plan

Unit tests

pytest tests/distributed/test_ray_v2_executor.py: executor init, TP/PP combos, placement groups, RPC, worker death, shutdown
pytest tests/utils_/test_ray_utils.py: bundle sorting logic
Validate cross-node TCP path for MessageQueue with test_mq_tcp_multinode.py

Integration tests

pytest tests/distributed/test_ray_v2_executor.py: creates Ray actors that initialize AsyncLLMEngine internally and verifies that they can serve requests
pytest tests/distributed/test_pipeline_parallel.py -k "ray": PP correctness with the new backend
pytest tests/basic_correctness/test_basic_correctness.py -k "ray": basic correctness

Test Result

Benchmark results (Qwen/Qwen3-8B on L4)

Server:

# MP backend
vllm serve Qwen/Qwen3-8B --tensor-parallel-size 4 --distributed-executor-backend mp --port 8000

# Existing Ray backend
VLLM_USE_RAY_V2_EXECUTOR_BACKEND=0 vllm serve Qwen/Qwen3-8B --tensor-parallel-size 4 --distributed-executor-backend ray --port 8000

# Ray V2 backend
VLLM_USE_RAY_V2_EXECUTOR_BACKEND=1 vllm serve Qwen/Qwen3-8B --tensor-parallel-size 4 --distributed-executor-backend ray --port 8000

Client:

vllm bench serve --model Qwen/Qwen3-8B --dataset-name random --input-len 512 --output-len 128 --num-prompts 500 --request-rate 10 --port 8000

TP=4; MP backend (async scheduling is on by default)

============ Serving Benchmark Result ============
Successful requests:                     500       
Failed requests:                         0         
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  53.64     
Total input tokens:                      256000    
Total generated tokens:                  64000     
Request throughput (req/s):              9.32      
Output token throughput (tok/s):         1193.20   
Peak output token throughput (tok/s):    1475.00   
Peak concurrent requests:                82.00     
Total token throughput (tok/s):          5965.99   
---------------Time to First Token----------------
Mean TTFT (ms):                          117.12    
Median TTFT (ms):                        117.26    
P99 TTFT (ms):                           156.28    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          40.95     
Median TPOT (ms):                        41.81     
P99 TPOT (ms):                           46.68     
---------------Inter-token Latency----------------
Mean ITL (ms):                           40.95     
Median ITL (ms):                         40.80     
P99 ITL (ms):                            54.51

TP=4; Ray backend

============ Serving Benchmark Result ============
Successful requests:                     500       
Failed requests:                         0         
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  53.93     
Total input tokens:                      256000    
Total generated tokens:                  64000     
Request throughput (req/s):              9.27      
Output token throughput (tok/s):         1186.80   
Peak output token throughput (tok/s):    1464.00   
Peak concurrent requests:                84.00     
Total token throughput (tok/s):          5934.02   
---------------Time to First Token----------------
Mean TTFT (ms):                          86.00     
Median TTFT (ms):                        86.32     
P99 TTFT (ms):                           120.62    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.88     
Median TPOT (ms):                        47.14     
P99 TPOT (ms):                           51.94     
---------------Inter-token Latency----------------
Mean ITL (ms):                           45.88     
Median ITL (ms):                         47.21     
P99 ITL (ms):                            58.59

TP=4; Ray V2 backend w/ async scheduling

============ Serving Benchmark Result ============
Successful requests:                     500       
Failed requests:                         0         
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  53.67     
Total input tokens:                      256000    
Total generated tokens:                  64000     
Request throughput (req/s):              9.32      
Output token throughput (tok/s):         1192.53   
Peak output token throughput (tok/s):    1442.00   
Peak concurrent requests:                82.00     
Total token throughput (tok/s):          5962.65   
---------------Time to First Token----------------
Mean TTFT (ms):                          119.11    
Median TTFT (ms):                        120.43    
P99 TTFT (ms):                           154.20    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          41.11     
Median TPOT (ms):                        42.06     
P99 TPOT (ms):                           46.64     
---------------Inter-token Latency----------------
Mean ITL (ms):                           41.11     
Median ITL (ms):                         40.82     
P99 ITL (ms):                            54.10
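To compare the three runs at a glance, the mean TTFT and TPOT values from the result blocks above can be turned into relative differences. This is a small arithmetic sketch; the dictionary keys and the `pct_change` helper are names chosen here, and the numbers are copied from the benchmark output.

```python
# Mean values in milliseconds, copied from the three result blocks.
results = {
    "mp": {"ttft": 117.12, "tpot": 40.95},
    "ray": {"ttft": 86.00, "tpot": 45.88},
    "ray_v2": {"ttft": 119.11, "tpot": 41.11},
}


def pct_change(new, base):
    """Relative change of `new` versus `base`, in percent."""
    return (new - base) / base * 100.0


for metric in ("ttft", "tpot"):
    for base in ("mp", "ray"):
        delta = pct_change(results["ray_v2"][metric], results[base][metric])
        print(f"Ray V2 vs {base:>3} mean {metric.upper()}: {delta:+.1f}%")
```

On this single run, Ray V2's mean TPOT is roughly 10% lower than the existing Ray backend's and within about half a percent of MP's, while its mean TTFT is within about 2% of MP's and roughly 38% higher than the existing Ray backend's.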


Changed files

.buildkite/test_areas/distributed.yaml (modified, +34/-0)
tests/distributed/ray_v2_utils.py (added, +32/-0)
tests/distributed/test_mq_tcp_multinode.py (added, +119/-0)
tests/distributed/test_ray_v2_executor.py (added, +344/-0)
tests/distributed/test_ray_v2_executor_e2e.py (added, +198/-0)
tests/utils_/test_ray_utils.py (added, +100/-0)
vllm/envs.py (modified, +7/-0)
vllm/v1/executor/abstract.py (modified, +8/-2)
vllm/v1/executor/ray_executor.py (modified, +2/-12)
vllm/v1/executor/ray_executor_v2.py (added, +486/-0)
vllm/v1/executor/ray_utils.py (modified, +133/-14)
vllm/v1/worker/worker_base.py (modified, +2/-2)

extent analysis

Fix Plan

To add EEP support to RayExecutorV2, the plan is to implement `reinitialize_distributed()` and forward `collective_rpc` to the workers. The snippets below are sketches of that plan, not code from the PR:

* Implement the `reinitialize_distributed()` method:
  ```python
  def reinitialize_distributed(self):
      # Placeholder: reinitialize the distributed settings here.
      pass
  ```

* Forward `collective_rpc` to workers:
  ```python
  def collective_rpc(self, method_name, args, kwargs):
      # Forward the RPC to every worker and collect results in worker order.
      worker_results = []
      for worker in self.workers:
          result = worker.collective_rpc(method_name, args, kwargs)
          worker_results.append(result)
      return worker_results
  ```

* Update `RayExecutorV2` to use the new methods:
  ```python
  class RayExecutorV2:
      # ...

      def reinitialize(self):
          self.reinitialize_distributed()

      # collective_rpc (defined above) already forwards to the workers.
      # A wrapper that calls self.collective_rpc(...) would recurse forever.
  ```


### Verification
To verify the fix, test the `reinitialize_distributed()` and `collective_rpc()` methods with a sample workflow.
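One way to run such a sample workflow without a Ray cluster is to exercise the two methods against in-process stubs. This is a sketch under assumptions: `StubWorker`, `StubExecutor`, `echo`, and `reinit_count` are hypothetical names that mirror the Fix Plan's method signatures, not RayExecutorV2's real classes.

```python
class StubWorker:
    """Hypothetical in-process stand-in for a Ray worker actor."""

    def __init__(self, rank):
        self.rank = rank

    def collective_rpc(self, method_name, args, kwargs):
        # Dispatch by name, as a real worker would for an RPC.
        return getattr(self, method_name)(*args, **kwargs)

    def echo(self, value, scale=1):
        return (self.rank, value * scale)


class StubExecutor:
    """Hypothetical executor exposing only the two methods under test."""

    def __init__(self, num_workers):
        self.workers = [StubWorker(rank) for rank in range(num_workers)]
        self.reinit_count = 0

    def reinitialize_distributed(self):
        # Count calls so the test can observe that reinitialization ran.
        self.reinit_count += 1

    def collective_rpc(self, method_name, args, kwargs):
        return [w.collective_rpc(method_name, args, kwargs) for w in self.workers]


def test_collective_rpc_and_reinitialize():
    executor = StubExecutor(num_workers=2)
    results = executor.collective_rpc("echo", (3,), {"scale": 2})
    assert results == [(0, 6), (1, 6)]
    executor.reinitialize_distributed()
    assert executor.reinit_count == 1
```

The test function follows the pytest naming convention used by the PR's own test commands, so it can be collected by `pytest` or called directly.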

### Extra Tips
* Make sure to handle any exceptions that may occur during the reinitialization and RPC forwarding processes.
* Consider adding logging to track the progress and any issues that may arise during the execution.
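
The two tips above can be combined in the forwarding loop. The following is a minimal sketch using only the standard `logging` module; `forward_rpc` and the logger name are hypothetical, and re-raising after logging is one possible policy rather than the PR's behavior.

```python
import logging

logger = logging.getLogger("ray_executor_v2_sketch")


def forward_rpc(workers, method_name, args, kwargs):
    """Hypothetical helper: forward an RPC to each worker, logging failures."""
    results = []
    for rank, worker in enumerate(workers):
        try:
            results.append(worker.collective_rpc(method_name, args, kwargs))
        except Exception:
            # Log the traceback with the failing rank, then propagate the error.
            logger.exception(
                "collective_rpc %r failed on worker %d", method_name, rank
            )
            raise
    logger.debug("collective_rpc %r completed on %d workers", method_name, len(results))
    return results
```

Re-raising keeps a worker failure visible to the caller instead of returning a partial result list.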


vllm - ✅(Solved) Fix [Feature][Ray]: Support EEP for RayExecutorV2 [1 pull requests, 1 comments, 2 participants]
