vllm - [Bug]: Performance drop when using multiple vLLM instances on different NUMA nodes --> bugs in ompmultiprocessing.py

vllm · 2026-05-20 19:47:07

Root Cause

Setting `VLLM_CPU_OMP_THREADS_BIND` to an explicit list of vCPUs taken from the cpuset always passes a correct CPU list (the balloons policy maintains the proper vCPU list, which can be read from /sys/fs/cgroup/cpuset.cpus.effective). However, there is a second bug: unless `VLLM_CPU_OMP_THREADS_BIND=auto` is set, `VLLM_CPU_NUM_OF_RESERVED_CPU` is silently ignored and the list of reserved cores stays empty, because the reserved-CPU logic exists only in the auto-bind path. The result is poor performance.

Your current environment

Environment

vLLM version: v0.21.0 (public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.21.0)
Platform: Intel Xeon 2-socket, 60 cores/socket, HT enabled (240 logical CPUs)
NUMA layout: Node 0 = CPUs 0-59,120-179; Node 1 = CPUs 60-119,180-239
Orchestrator: Kubernetes with NRI Balloons policy (cpuset allocation)
OMP library: Intel OpenMP (libiomp5.so via LD_PRELOAD)

The bug is not limited to how these variables are set: `VLLM_CPU_NUM_OF_RESERVED_CPU` only takes effect when `VLLM_CPU_OMP_THREADS_BIND` is set to `auto`.

🐛 Describe the bug

We noticed that when vLLM runs on the entire system, performance is fine, but when two vLLM instances run with each one pinned to specific cores, the performance of the two instances differs significantly.

We isolated the vLLM instances from the full solution and created Docker files that run two vLLM instances.

To reproduce, rename nginx_conf.txt to nginx.conf (the rename was needed to attach the file to this bug report) and run:

# Start both vLLM instances + nginx load balancer
./run_bench.sh start

# Run benchmark (48 prompts, 32 concurrent, through LB)
./run_bench.sh bench

[docker-compose.broken.yml](https://github.com/user-attachments/files/28071993/docker-compose.broken.yml)
[docker-compose.fixed.yml](https://github.com/user-attachments/files/28071992/docker-compose.fixed.yml)
[run_bench.sh](https://github.com/user-attachments/files/28071994/run_bench.sh)
[nginx_conf.txt](https://github.com/user-attachments/files/28072042/nginx_conf.txt)

The script builds and runs two vLLM instances in Docker, with CPU pinning taken from the K8s setup via the NRI driver (verified to be pinned correctly), and then runs the benchmark.

Results
When using VLLM_CPU_OMP_THREADS_BIND=auto, the first instance performs well (by accident):
(APIServer pid=1) INFO 05-20 19:14:31 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: **203.2 tokens/s**, Running: 16 reqs
while the second replica generates only:
(APIServer pid=1) INFO 05-20 19:16:10 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 17.6 tokens/s, Running: 16 reqs

Why this is happening:
Looking at the CPU binding that vLLM logs during startup:

Instance that is performing well:

(EngineCore pid=112) INFO 05-20 19:07:25 [ompmultiprocessing.py:187]    VLLM_CPU_OMP_THREADS_BIND='auto', auto_setup=True, skip_setup=False
(EngineCore pid=112) INFO 05-20 19:07:25 [ompmultiprocessing.py:187]    local_world_size=1, reserve_cpu_num=1
**(EngineCore pid=112) INFO 05-20 19:07:25 [ompmultiprocessing.py:187]    local_rank=0, core ids=[1, 10, 11, 12, 13, 122, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158]
(EngineCore pid=112) INFO 05-20 19:07:25 [ompmultiprocessing.py:187]    reserved_cpus=[159]**


Instance that performs poorly:

(EngineCore pid=112) INFO 05-20 19:07:24 [ompmultiprocessing.py:187]    VLLM_CPU_OMP_THREADS_BIND='auto', auto_setup=True, skip_setup=False
(EngineCore pid=112) INFO 05-20 19:07:24 [ompmultiprocessing.py:187]    local_world_size=1, reserve_cpu_num=1
**(EngineCore pid=112) INFO 05-20 19:07:24 [ompmultiprocessing.py:187]    local_rank=0, core ids=[]
(EngineCore pid=112) INFO 05-20 19:07:24 [ompmultiprocessing.py:187]    reserved_cpus=[]**

What happened with our second instance

Our 2-pod StatefulSet was allocated:
- **Pod 0** cpuset: `60,100-119,181-191` → all cores on **NUMA node 1**
- **Pod 1** cpuset: `1,10-13,122,134-159` → all cores on **NUMA node 0**

With auto-bind (`VLLM_CPU_OMP_THREADS_BIND=auto`, as in the logs above), both pods (each running as rank 0) tried to bind OMP threads to NUMA node 0 cores (CPUs 0-59 and 120-179).

- **Pod 1** (cpuset on node 0): worked correctly by coincidence, because its cpuset lies entirely on node 0.
- **Pod 0** (cpuset on node 1): vLLM attempted to bind to node 0 cores such as `0-59`, but those cores are **not in the container's cpuset** (`60,100-119,181-191`), so the resulting core list was empty (`core ids=[]`, `reserved_cpus=[]`).
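
The arithmetic behind the two cases can be checked with a small Python sketch. It is only a model of the behaviour seen in the logs, not vLLM's actual code: the helper names `parse_cpuset` and `bind_to_node` are hypothetical, and the NUMA layout and cpusets are the ones reported in this issue.

```python
def parse_cpuset(spec: str) -> list[int]:
    """Expand a Linux cpuset list such as '1,10-13,122' into sorted CPU ids."""
    cpus: set[int] = set()
    for part in spec.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)


# NUMA layout and pod cpusets as reported in this issue.
NUMA_NODES = {0: "0-59,120-179", 1: "60-119,180-239"}
PODS = {"pod0": "60,100-119,181-191", "pod1": "1,10-13,122,134-159"}


def bind_to_node(cpuset: str, node: int, reserve: int = 1):
    """Hypothetical model (NOT vLLM code): keep only the CPUs of `node`
    that the container is allowed to use, then set aside the last
    `reserve` of them."""
    allowed = set(parse_cpuset(cpuset))
    usable = [c for c in parse_cpuset(NUMA_NODES[node]) if c in allowed]
    if not usable:
        return [], []
    split = max(len(usable) - reserve, 0)
    return usable[:split], usable[split:]


for name, cpuset in PODS.items():
    cores, reserved = bind_to_node(cpuset, node=0)
    print(f"{name}: {len(cores)} core ids, reserved_cpus={reserved}")
```

Under these assumptions Pod 1 keeps 31 cores with CPU 159 reserved, while Pod 0 ends up with no cores at all, which matches the two log excerpts. Calling `bind_to_node(PODS["pod0"], node=1)` instead yields a non-empty list, which is why choosing the node from the container's own cpuset matters.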

How to fix it:

We can set `VLLM_CPU_OMP_THREADS_BIND` to an explicit list of vCPUs taken from the cpuset. This way we always pass a correct list of CPUs (the balloons policy takes care of the proper vCPU list; we can simply read it from /sys/fs/cgroup/cpuset.cpus.effective). However, this exposes a second bug: without `VLLM_CPU_OMP_THREADS_BIND=auto`, `VLLM_CPU_NUM_OF_RESERVED_CPU` is silently ignored and the list of reserved cores is empty, because the reserved-CPU logic exists only in the auto-bind path. This again ends in poor performance.
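
For illustration, the behaviour we would expect when an explicit list is combined with a reserved-CPU count can be sketched in Python as follows. This is an assumption about the desired outcome, not vLLM's implementation: `split_reserved` and `reserve_count` are hypothetical helpers, and the default of 1 reserved CPU is chosen only because the logs show `reserve_cpu_num=1`.

```python
def parse_cpuset(spec: str) -> list[int]:
    """Expand a Linux cpuset list such as '60,100-119' into sorted CPU ids."""
    cpus: set[int] = set()
    for part in spec.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)


def split_reserved(cpuset: str, reserve: int) -> tuple[str, list[int]]:
    """Return (comma-separated bind list, reserved CPUs) for an explicit cpuset.

    The last `reserve` CPUs are kept out of the bind list.
    """
    cpus = parse_cpuset(cpuset)
    if not 0 <= reserve < len(cpus):
        raise ValueError(f"cannot reserve {reserve} of {len(cpus)} CPUs")
    split = len(cpus) - reserve
    return ",".join(map(str, cpus[:split])), cpus[split:]


def reserve_count(env: dict, default: int = 1) -> int:
    """Read VLLM_CPU_NUM_OF_RESERVED_CPU from `env`; the default is an assumption."""
    return int(env.get("VLLM_CPU_NUM_OF_RESERVED_CPU", default))


# Pod 0's cpuset from this issue, with one CPU reserved.
bind, reserved = split_reserved(
    "60,100-119,181-191",
    reserve_count({"VLLM_CPU_NUM_OF_RESERVED_CPU": "1"}),
)
print(bind)
print(reserved)
```

In a real container the cpuset string would come from /sys/fs/cgroup/cpuset.cpus.effective and the environment from `os.environ`; both are passed in explicitly here so the sketch runs anywhere.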

How we worked around it to get proper performance numbers on both vLLM instances:
We set `VLLM_CPU_OMP_THREADS_BIND` to the list from /sys/fs/cgroup/cpuset.cpus.effective (the proper list defined by the balloons policy) with the last core removed.
```bash
# Expand the effective cpuset (e.g. "60,100-119") into individual CPU ids
# and drop the last one so that it stays free of OMP threads.
export VLLM_CPU_OMP_THREADS_BIND=$(python3 -c "
r = open('/sys/fs/cgroup/cpuset.cpus.effective').read().strip()
cores = [c for p in r.split(',')
         for c in (range(int(p.split('-')[0]), int(p.split('-')[1])+1) if '-' in p else [int(p)])]
print(','.join(map(str, cores[:-1])))
")
```

This results in proper performance on both instances:
- With fix: 277.38 tok/s, 177 s duration
- Without fix (auto-bind): 43.66 tok/s, 1126 s duration → 6.4× slower
