vllm - ✅(Solved) Fix [Bug/Regression]: CPU spikes to 100% on Multi-node (PP=2, TP=8) | Regression starting from v0.12.0 [2 pull requests, 5 comments, 4 participants]

vllm · 2026-03-13 00:07:39

vllm-project/vllm#36943 • Fetched 2026-04-08 00:43:27

Timeline (top): commented ×5, mentioned ×4, subscribed ×4, cross-referenced ×1

Proposal to improve performance

We are observing a severe regression in CPU management when running vLLM in a multi-node setup on OpenShift. While the deployment is perfectly stable on v0.11.0, any version from v0.12.0 up to v0.17.0 triggers complete CPU saturation.

Even under light inference load, CPU utilization on both nodes spikes to 100%, leading to a rapid decline in throughput and eventually to unresponsive nodes. This behavior is consistent across various models using a TP=8, PP=2 topology on two 8xH100 nodes deployed via LeaderWorkerSet (LWS).

The fact that the exact same configuration and infrastructure work seamlessly on v0.11.0 strongly suggests a regression introduced in v0.12.0, possibly related to how the distributed backend or the scheduler interacts with OpenShift's resource constraints.

Questions: Were there specific changes in the Ray/Distributed backend or scheduling logic in v0.12.0 that could cause this busy-wait loop or thread contention? Are there recommended environment variables or flags to tune Ray/vLLM resource management in hardened environments like OpenShift?
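One diagnostic that may help with the question about resource constraints is to compare the CPU count a process sees with the CPU quota its container is actually given. The sketch below is only an illustration, not a confirmed cause of this regression. It assumes cgroup v2 with the quota exposed at /sys/fs/cgroup/cpu.max, and the function names are hypothetical.

```python
import os
from typing import Optional


def parse_cpu_max(text: str) -> Optional[float]:
    """Convert cgroup v2 cpu.max content into a CPU count.

    The file holds "<quota> <period>" in microseconds, or "max <period>"
    when no limit is set.
    """
    quota, period = text.split()
    if quota == "max":
        return None  # no CPU limit configured
    return int(quota) / int(period)


def read_cpu_limit(path: str = "/sys/fs/cgroup/cpu.max") -> Optional[float]:
    """Read the container CPU limit, or None if it is unset or unreadable."""
    try:
        with open(path) as f:
            return parse_cpu_max(f.read())
    except (OSError, ValueError):
        return None


visible_cpus = os.cpu_count()  # usually reflects host cores, not the quota
cpu_limit = read_cpu_limit()
print(f"visible CPUs: {visible_cpus}, cgroup CPU limit: {cpu_limit}")
```

If the limit is much smaller than the visible CPU count, thread pools sized from the host core count can oversubscribe the container. Capping them with a standard variable such as OMP_NUM_THREADS is one thing to try, though whether it helps here is untested.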

Report of performance regression

No response

Misc discussion on performance

Infrastructure: OpenShift Container Platform (LWS). Topology: 2 Nodes, 8xH100 per node. Config: tensor_parallel_size=8, pipeline_parallel_size=2. Versions affected: v0.12.0 through v0.17.0.
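As a quick sanity check on the topology above, the sketch below multiplies the two parallelism degrees to get the number of GPU workers and compares it with the GPUs available. The helper name and return shape are illustrative only and are not part of vLLM.

```python
def describe_topology(tensor_parallel_size: int, pipeline_parallel_size: int,
                      nodes: int, gpus_per_node: int) -> dict:
    """Summarize how many GPU workers a TP x PP configuration needs."""
    world_size = tensor_parallel_size * pipeline_parallel_size
    total_gpus = nodes * gpus_per_node
    return {
        "world_size": world_size,          # one worker process per GPU
        "total_gpus": total_gpus,
        "fits": world_size <= total_gpus,  # enough GPUs for every worker?
    }


# Values taken from the report: TP=8, PP=2 on 2 nodes with 8 GPUs each.
topology = describe_topology(tensor_parallel_size=8, pipeline_parallel_size=2,
                             nodes=2, gpus_per_node=8)
print(topology)
```

With these values every GPU on both nodes is occupied, so any extra CPU work per worker is multiplied across 16 processes.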

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To address the CPU saturation issue in the multi-node setup on OpenShift, we will focus on tuning Ray/vLLM resource management.

Step-by-Step Solution:

Adjust Resource Allocation: Set CPU and memory resources.requests and resources.limits on the leader and worker pod templates so that they match what the workload actually uses.
Tune Scheduling Parameters: Experiment with Ray's task placement options, such as the scheduling_strategy argument (for example "SPREAD") and placement groups, to optimize task distribution.
Ray Settings: Cap object store memory with the object_store_memory argument to ray.init (or --object-store-memory on ray start). The names RAY_SCHEDULER_SPREAD_OUT and RAY_OBJECT_STORE_MEMORY do not appear to be documented Ray environment variables, so check them against the Ray documentation for your version before relying on them.

Example Code Snippets:

import ray

# Initialize Ray with an explicit resource configuration.
# object_store_memory is given in bytes; a string such as "10GB" is not accepted.
ray.init(
    num_cpus=16,  # Adjust based on your node's CPU count
    num_gpus=8,   # Adjust based on your node's GPU count
    resources={"CustomResource": 1},   # Define custom resources if needed
    object_store_memory=10 * 1024**3,  # About 10 GiB; adjust as needed
)

# Ray Tune schedulers manage hyperparameter trials, not Ray core task placement,
# and ray.tune.schedulers has no StandardScheduler class. To spread tasks across
# nodes, use the "SPREAD" scheduling strategy on the task itself.
@ray.remote(scheduling_strategy="SPREAD")
def probe():
    # Report which node the task landed on.
    return ray.get_runtime_context().get_node_id()

# Launch a few probe tasks and show where they ran.
print(ray.get([probe.remote() for _ in range(4)]))

Verification

To verify the fix, monitor CPU utilization and throughput after applying the configuration changes. You can use tools like top, htop, or OpenShift's built-in monitoring dashboards to track performance metrics.
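For a scriptable alternative to top or htop, the following minimal sketch estimates CPU utilization from two readings of the aggregate cpu line in /proc/stat. It assumes a Linux node, and the function names are hypothetical. Inside a container, /proc/stat typically reports node-wide figures rather than the container's own usage.

```python
import time
from typing import Tuple


def parse_cpu_line(line: str) -> Tuple[int, int]:
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line."""
    # Use the first 8 counters (user .. steal); guest time is already
    # included in user time, so it is left out of the total.
    fields = [int(x) for x in line.split()[1:9]]
    idle = fields[3] + (fields[4] if len(fields) > 4 else 0)  # idle + iowait
    return idle, sum(fields)


def busy_percent(prev: Tuple[int, int], curr: Tuple[int, int]) -> float:
    """Percentage of non-idle time between two samples."""
    idle_delta = curr[0] - prev[0]
    total_delta = curr[1] - prev[1]
    if total_delta <= 0:
        return 0.0
    return 100.0 * (1.0 - idle_delta / total_delta)


def sample_busy_percent(interval: float = 1.0, path: str = "/proc/stat") -> float:
    """Take two samples `interval` seconds apart and return CPU busy percent."""
    def read() -> Tuple[int, int]:
        with open(path) as f:
            return parse_cpu_line(f.readline())

    prev = read()
    time.sleep(interval)
    return busy_percent(prev, read())
```

Calling sample_busy_percent() in a loop before and after a configuration change gives a rough comparison of the two runs. Values pinned near 100 under light load would match the symptom described in the report.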

Extra Tips

Regularly review and adjust resource allocation and scheduling parameters based on your workload's requirements.
Consider implementing custom scheduling logic to optimize task distribution and resource utilization.
Keep your Ray and vLLM versions up-to-date to ensure you have the latest performance optimizations and bug fixes.

Fix Action

Fixed

PR fix notes

PR #36977: chore(pre-commit): make bash hooks runnable on Windows

PR #37743: [CI] replace shellcheck script with shellcheck-py hook