pytorch - 💡(How to fix) Fix [vllm] [2.12 regression] Sequence Parallel test_tp_sp_generation: RayChannelTimeoutError on tp=2 setups (Llama) [2 comments, 1 participants]

pytorch • 2026-04-27 19:05:43

GitHub stats

pytorch/pytorch#181632 • Fetched 2026-04-28 06:24:19

Author

atalman

Participants

atalman

Timeline (top)

mentioned ×20 • subscribed ×20 • labeled ×5 • commented ×2

Under torch 2.12.0 + triton 3.7.0, vLLM's test_tp_sp_generation fails on three parametrized configurations because the Ray-backed engine times out waiting for an inter-process object:

ray.exceptions.RayChannelTimeoutError: System error: Timed out waiting for object available to read.

Three parametrizations fail (tp_size=2, ray distributed backend):

test_tp_sp_generation[False-True-hmellor/tiny-random-LlamaForCausalLM-parallel_setup1-ray-auto-test_options1]
test_tp_sp_generation[False-True-hmellor/tiny-random-LlamaForCausalLM-parallel_setup3-ray-auto-test_options3]
test_tp_sp_generation[False-True-RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8-parallel_setup17-ray-auto-test_options17]

On the pytest side, the failure surfaces only as the wrapper assertion:

AssertionError: function test_tp_sp_generation failed when called with args () and kwargs {'model_id': 'hmellor/tiny-random-LlamaForCausalLM', 'parallel_setup': ParallelSetup(tp_size=2, pp_size=1, ...), 'distributed_backend': 'ray', ...}

The test passes on torch 2.11. The failure is blocking the torch 2.12 upgrade for vLLM (vllm-project/vllm#40077).

Error Message

File "vllm/v1/engine/core.py", line 1129, in run_engine_core
  engine_core.run_busy_loop()
...
File "python/ray/_raylet.pyx", line 3194, in ray._raylet.CoreWorker.get_objects
File "python/ray/includes/common.pxi", line 106, in ray._raylet.check_status
ray.exceptions.RayChannelTimeoutError: System error: Timed out waiting for object available to read.
ObjectID: 003d5d11e415fe881ac38fca360ef9053beab8b0010000000ae1f505

Environment

torch: 2.12.0+cu130 (test channel)
triton: 3.7.0
CUDA: 13.0
Python: 3.12.13
GPU: 2× NVIDIA H100 (test name says 2xH100)
Distributed backend: ray
Models: hmellor/tiny-random-LlamaForCausalLM, RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8
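
When comparing the failing and passing environments, it can help to capture the installed versions programmatically. The following is a minimal sketch using only the Python standard library. The helper name `collect_versions` and the default package list are illustrative assumptions, not part of the issue, and distribution names can differ between builds.

```python
from importlib import metadata
import platform


def collect_versions(packages=("torch", "triton", "ray", "vllm")):
    """Return a mapping of package name to installed version.

    Packages that are not installed map to the string "not installed".
    The default package names are assumptions; some builds ship a package
    under a different distribution name.
    """
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```

Running this on both the torch 2.11 and torch 2.12 environments gives two listings that can be compared line by line.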

Reproducibility

torch-2.12 branch: fails. https://buildkite.com/vllm/ci/builds/63095#019dcf15-7f5c-4f36-a0fb-53e4cf05b5e7
main (torch 2.11): passes on recent builds, apart from one interrupted job noted in the list:
- 2026-04-25 daily: https://buildkite.com/vllm/ci/builds/62981 (job was interrupted by a signal, which is not the same RayChannelTimeoutError signature)
- 2026-04-26 nightly: https://buildkite.com/vllm/ci/builds/62990
- 2026-04-26 daily: https://buildkite.com/vllm/ci/builds/63026
- 2026-04-27 nightly: https://buildkite.com/vllm/ci/builds/63061

Diagnosis request

A RayChannelTimeoutError during startup of the tp=2 Ray-backed engine suggests that something in the engine's worker-to-worker communication path is now significantly slower, or hanging, on torch 2.12. One possibility is compile or AOT-cache time spent on each worker before the workers synchronize. Could a maintainer (or the owner of the Ray + torch.compile interaction) check whether torch 2.12 introduces additional per-worker compilation latency that pushes startup past Ray's default channel timeout?
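
One way to test this hypothesis is to time the suspected slow step on each worker and compare the result against the timeout budget. The sketch below is generic Python and does not call vLLM or Ray. `timed_call`, `exceeds_budget`, and the stand-in workload are hypothetical names, and the budget value is an example only, not Ray's actual default.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def exceeds_budget(elapsed_s, budget_s):
    """Return True if a measured duration is longer than the allowed budget."""
    return elapsed_s > budget_s


if __name__ == "__main__":
    # Stand-in for the real per-worker step (for example warm-up or compilation).
    def fake_warmup():
        time.sleep(0.05)
        return "ready"

    result, elapsed = timed_call(fake_warmup)
    budget_s = 10.0  # example value only; substitute the timeout under test
    over = exceeds_budget(elapsed, budget_s)
    print(f"{result} after {elapsed:.3f}s, over budget: {over}")
```

Taking the same measurement under torch 2.11 and torch 2.12 would show whether the step became slower, independent of the timeout setting.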

extent analysis

TL;DR

The most likely workaround is to raise the Ray channel timeout while investigating whether torch 2.12 adds per-worker compilation latency, which is suspected but not yet confirmed.

Guidance

Review the Ray documentation for how to adjust the channel timeout, and consider raising it as a temporary workaround.
Measure the compilation latency under torch 2.12 and its effect on Ray-backed engine startup.
Check the torch 2.12 release notes and documentation for changes to compilation or AOT caching that could affect the engine's worker-to-worker communication.
Run the same tests on different torch versions to isolate the issue and confirm whether it is specific to torch 2.12.
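
As a sketch of the temporary workaround in the first item, a timeout can be raised through an environment variable before Ray is imported. This assumes the relevant limit is the compiled-graph get timeout read from `RAY_CGRAPH_get_timeout` (in seconds). The issue does not name the setting, so the variable name should be checked against the installed Ray and vLLM versions. The helper name `set_ray_channel_timeout` and the value 600 are illustrative.

```python
import os


def set_ray_channel_timeout(seconds, var_name="RAY_CGRAPH_get_timeout", override=True):
    """Set a timeout environment variable and return its final value.

    Call this before Ray (or vLLM) is imported so the value is picked up.
    var_name is an assumption: confirm which variable your Ray version reads.
    With override=False, an existing value is left untouched.
    """
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    if override or var_name not in os.environ:
        os.environ[var_name] = str(int(seconds))
    return os.environ[var_name]


if __name__ == "__main__":
    # Example value only; pick a limit that suits the test environment.
    print(set_ray_channel_timeout(600))
```

A larger timeout only hides the symptom if workers are slow. If a worker is actually hanging, the test will still fail, just later.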

Example

The issue does not include a code snippet; the failure appears to stem from configuration and version interactions rather than a specific code error.

Notes

The issue appears to be specific to the combination of torch 2.12 and the Ray-backed engine, and the exact cause is still under investigation. Any adjustments to the Ray channel timeout should be carefully considered to avoid introducing other issues.

Recommendation

Apply a workaround by raising the Ray channel timeout, since the root cause (suspected to be increased per-worker compilation latency in torch 2.12) is still being investigated. If the workers are slow rather than hung, this should let the engine start while the underlying issue is addressed.
