vllm - ✅(Solved) Fix [Bug]: Failed to run distributed inference due to error list index out of range in omp_cpuids_list [2 pull requests, 1 comments, 2 participants]

GitHub stats
vllm-project/vllm#37255, fetched 2026-04-08 00:48:31
Comments: 1
Participants: 2
Timeline: 4
Reactions: 0
Timeline (top): commented ×1, labeled ×1, mentioned ×1, subscribed ×1

Error Message

(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852] WorkerProc failed to start.
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852] Traceback (most recent call last):
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852]   File "/home/lialiu01/latest/vllm/vllm/v1/executor/multiproc_executor.py", line 821, in worker_main
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852]     worker = WorkerProc(*args, **kwargs)
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852]   File "/home/lialiu01/latest/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852]     return func(*args, **kwargs)
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852]   File "/home/lialiu01/latest/vllm/vllm/v1/executor/multiproc_executor.py", line 614, in __init__
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852]     self.worker.init_device()
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852]   File "/home/lialiu01/latest/vllm/vllm/v1/worker/worker_base.py", line 312, in init_device
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852]     self.worker.init_device()  # type: ignore
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852]   File "/home/lialiu01/latest/vllm/vllm/v1/worker/cpu_worker.py", line 100, in init_device
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852]     self.local_omp_cpuid = omp_cpuids_list[self.rank]
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852] IndexError: list index out of range

Fix Action

Fixed

PR fix notes

PR #37256: fix(cpu_worker): Fix list index out of range in omp_cpuids_list for multi-node distributed inference

Description (problem / solution / changelog)

Purpose

Fixes #37255 - In multi-node distributed setups with tensor parallelism, the CPU worker fails to start on secondary nodes with an IndexError: list index out of range when accessing "omp_cpuids_list[self.rank]".

Root Cause

In a multi-node setup with tensor_parallel_size=2 and 2 nodes:

  • Node 0 has workers with global ranks 0, 1
  • Node 1 has workers with global ranks 2, 3
  • The VLLM_CPU_OMP_THREADS_BIND environment variable contains entries for all workers (e.g., "0,1|2,3|4,5|6,7")
  • After slicing for the data parallel group, the list only has world_size entries
  • The old code used self.rank (global rank, e.g., 2 or 3) to index into this sliced list, causing IndexError
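
The failure described in the bullets above can be reproduced in isolation. The following is a simplified model, not vLLM's code: parse_bind and slice_for_dp_group are hypothetical helpers, and the slicing arithmetic is an assumption based on the PR description.

```python
def parse_bind(spec: str) -> list:
    """Split a VLLM_CPU_OMP_THREADS_BIND-style string into per-worker entries."""
    return spec.split("|")


def slice_for_dp_group(entries: list, local_dp_rank: int, world_size: int) -> list:
    """Hypothetical model of the slicing step: keep world_size entries for one DP group."""
    start = local_dp_rank * world_size
    return entries[start:start + world_size]


entries = parse_bind("0,1|2,3|4,5|6,7")
world_size = 2

# Node 0: global ranks 0 and 1 fit inside the two-entry slice.
node0_slice = slice_for_dp_group(entries, local_dp_rank=0, world_size=world_size)
node0_cpus = [node0_slice[rank] for rank in (0, 1)]

# Node 1: the slice also has two entries, but the old code indexed it
# with the global rank (2 or 3), which is past the end of the list.
node1_slice = slice_for_dp_group(entries, local_dp_rank=1, world_size=world_size)
try:
    node1_slice[2]
    raised_index_error = False
except IndexError:
    raised_index_error = True
```

In this model the slice is never too short for the node's workers; the index is simply expressed in the wrong coordinate system.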

Fix

Use self.rank % world_size to get the correct local index within the sliced list when local_dp_rank is not None.

Test Plan

  • Verified the indexing logic with unit tests simulating multi-node scenarios
  • Tested both single data parallel group and multiple data parallel group configurations

Test Result

The fix correctly maps global ranks to local indices:

  • Node 0: rank 0 -> index 0, rank 1 -> index 1
  • Node 1: rank 2 -> index 0, rank 3 -> index 1
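
The mapping listed above can be checked with a few lines of Python. This is a minimal sketch of the modulo arithmetic only; local_index is a hypothetical name, and world_size=2 is taken from the example in this PR.

```python
def local_index(rank: int, world_size: int) -> int:
    """Map a global rank to its position inside a slice of world_size entries."""
    return rank % world_size


world_size = 2
# Global ranks 0..3 spread over two nodes with two workers each.
mapping = {rank: local_index(rank, world_size) for rank in range(4)}
```

Every resulting index is below world_size, so it always falls inside a slice of that length.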

Changed files

  • vllm/v1/worker/cpu_worker.py (modified, +6/-1)

PR #37272: fix(distributed): resolve inference failure in cpu_worker

Description (problem / solution / changelog)

Description

Fixes #37255

When running distributed inference with multiple nodes, the CPU worker fails with IndexError: list index out of range when accessing omp_cpuids_list[self.rank].

Root Cause

When data_parallel_rank_local is not None, the code slices omp_cpuids_list to a subset based on the data parallel rank and world size. However, it then indexes into this sliced list using self.rank (the global rank), which can be out of bounds for the sliced list.

For example, with 2 nodes and world_size=2:

  • Global ranks are 0, 1, 2, 3
  • For local_dp_rank=0, the slice gives indices 0-1
  • For local_dp_rank=1, the slice gives indices 2-3, but the sliced list only has length 2
  • When rank=2 tries to access index 2 on a list of length 2, it fails

Fix

Calculate the local rank within the DP slice and use it to index into the sliced list.

Testing

Verified the fix with a test script that simulates the distributed scenario described in the issue:

  • Rank 0,1 on node 0 (local_dp_rank=0) correctly get indices 0,1
  • Rank 2,3 on node 1 (local_dp_rank=1) correctly get indices 0,1 from the sliced list

This fix is minimal and only affects the code path when data_parallel_rank_local is not None (distributed inference scenarios).

Changed files

  • vllm/v1/worker/cpu_worker.py (modified, +5/-1)

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Your output of `python collect_env.py` here
</details>

🐛 Describe the bug

Distributed inference across 2 nodes fails on node 1 with "list index out of range". @fadara01

Failed message:

(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852] WorkerProc failed to start.
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852] Traceback (most recent call last):
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852]   File "/home/lialiu01/latest/vllm/vllm/v1/executor/multiproc_executor.py", line 821, in worker_main
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852]     worker = WorkerProc(*args, **kwargs)
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852]   File "/home/lialiu01/latest/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852]     return func(*args, **kwargs)
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852]            ^^^^^^^^^^^^^^^^^^^^^
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852]   File "/home/lialiu01/latest/vllm/vllm/v1/executor/multiproc_executor.py", line 614, in __init__
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852]     self.worker.init_device()
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852]   File "/home/lialiu01/latest/vllm/vllm/v1/worker/worker_base.py", line 312, in init_device
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852]     self.worker.init_device()  # type: ignore
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852]     ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852]   File "/home/lialiu01/latest/vllm/vllm/v1/worker/cpu_worker.py", line 100, in init_device
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852]     self.local_omp_cpuid = omp_cpuids_list[self.rank]
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852]                            ~~~~~~~~~~~~~~~^^^^^^^^^^^
(Worker pid=4035378) ERROR 03-17 05:18:29 [multiproc_executor.py:852] IndexError: list index out of range

This is a regression: I don't see the issue on older versions, such as commit:

commit f23fb5a7c1b61350c5c40ca1115d3bf8cf2b8cc9 (HEAD)
Author: RickyChen / 陳昭儒 <ricky.chen@infinirc.com>
Date:   Wed Jan 21 15:27:30 2026 +0800

    [Bugfix] Support HF sharded weights for Mistral3/Pixtral models (#32673)
    
    Signed-off-by: ricky-chaoju <ricky.chen@infinirc.com>
    Signed-off-by: vllm-dev <ricky.chen@infinirc.com>

Reproduce method:

run on node 0:

#!/bin/bash

MODEL=Qwen/Qwen2.5-14B-Instruct
MASTER_IP=192.168.1.2
PORT=29508

export GLOO_SOCKET_IFNAME=enp1s0f1np1
export VLLM_HOST_IP=$MASTER_IP
export LD_LIBRARY_PATH=~/ComputeLibrary/build

vllm serve $MODEL --tensor-parallel-size 2 --master-addr $MASTER_IP --master-port $PORT --nnodes 2 --node-rank 0

run on node 1 (the failing node):

#!/bin/bash
MODEL=Qwen/Qwen2.5-14B-Instruct
MASTER_IP=192.168.1.2
PIPE=0
export GLOO_SOCKET_IFNAME=enp1s0f1np1
export VLLM_HOST_IP=192.168.1.3
PORT=29508
export LD_LIBRARY_PATH=~/ComputeLibrary/build

vllm serve $MODEL --tensor-parallel-size 2 --master-addr $MASTER_IP --master-port $PORT --nnodes 2 --node-rank 1 --headless

extent analysis

Fix Plan

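The root cause is the index, not the list. omp_cpuids_list is sliced down to world_size entries for the data parallel group, but init_device still indexes it with the global self.rank. Per PR #37256, the fix maps the global rank to a local index with self.rank % world_size when local_dp_rank is not None.

Here are the steps to fix the issue:

  • In the init_device method in cpu_worker.py, index the sliced omp_cpuids_list with the local index rather than self.rank.
  • Optionally keep a bounds check so that a mismatched list fails with a descriptive error instead of a bare IndexError.

Example code (a sketch based on the PR description, not the actual diff; world_size and local_dp_rank stand in for values the worker already has):

# in cpu_worker.py
def init_device(self):
    # ...
    # The list was already sliced to world_size entries, so map the
    # global rank to its position inside that slice.
    index = self.rank % world_size if local_dp_rank is not None else self.rank
    if index >= len(omp_cpuids_list):
        # Fail with a message that names both the rank and the list length
        raise ValueError(
            f"index {index} (rank {self.rank}) is out of range for "
            f"omp_cpuids_list of length {len(omp_cpuids_list)}"
        )
    self.local_omp_cpuid = omp_cpuids_list[index]
    # ...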

Verification

To verify the fix, run the same reproduce method as before:

# on node 0
vllm serve $MODEL --tensor-parallel-size 2 --master-addr $MASTER_IP --master-port $PORT --nnodes 2 --node-rank 0

# on node 1
vllm serve $MODEL --tensor-parallel-size 2 --master-addr $MASTER_IP --master-port $PORT --nnodes 2 --node-rank 1 --headless

If the fix is successful, the error should be resolved, and the program should run without the "list index out of range" error.

Extra Tips

  • Ensure that omp_cpuids_list has one entry per worker (not per node) before it is sliced.
  • A bounds check with a descriptive error makes a rank/list mismatch easier to diagnose, but it does not fix the indexing by itself.
  • Confirm that every index into omp_cpuids_list is the local index within the sliced list, not the global self.rank.
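
One way to act on the first tip before launching is to count the entries in VLLM_CPU_OMP_THREADS_BIND. The following is a rough sketch under stated assumptions: count_bind_entries and has_enough_entries are hypothetical helpers, only explicit "|"-separated lists such as "0,1|2,3|4,5|6,7" are handled, and any other value is reported as not checkable.

```python
import os
from typing import Optional


def count_bind_entries(spec: str) -> Optional[int]:
    """Return the number of per-worker entries, or None if spec is not an explicit list."""
    entries = [entry.strip() for entry in spec.split("|")]
    # An explicit entry is assumed to start with a CPU id, e.g. "0,1" or "0-31".
    if not all(entry and entry[0].isdigit() for entry in entries):
        return None
    return len(entries)


def has_enough_entries(spec: str, expected_workers: int) -> Optional[bool]:
    """True/False for explicit lists, None when the value cannot be checked."""
    count = count_bind_entries(spec)
    if count is None:
        return None
    return count >= expected_workers


def check_env(expected_workers: int) -> Optional[bool]:
    """Apply the check to the current process environment."""
    spec = os.environ.get("VLLM_CPU_OMP_THREADS_BIND", "")
    return has_enough_entries(spec, expected_workers)
```

For example, has_enough_entries("0,1|2,3|4,5|6,7", 4) passes, while the same string checked against 8 workers does not.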
