vllm - [Bug]: Port collision ([Errno 98] Address already in use) when launching multiple LLM(tensor_parallel_size=2) instances concurrently on a single node (V1 Engine)

vllm · 2026-05-20 10:22:37

Error Message

(EngineCore_DP0 pid=640) Process EngineCore_DP0:
(EngineCore_DP0 pid=640) Traceback (most recent call last):
(EngineCore_DP0 pid=640)   File "/opt/conda/envs/python3.10.13/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=640)     self.run()
(EngineCore_DP0 pid=640)   File "/opt/conda/envs/python3.10.13/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=640)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=640)   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1033, in run_engine_core
(EngineCore_DP0 pid=640)     raise e
(EngineCore_DP0 pid=640)   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1019, in run_engine_core
(EngineCore_DP0 pid=640)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=640)   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=640)     return func(*args, **kwargs)
(EngineCore_DP0 pid=640)   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 763, in __init__
(EngineCore_DP0 pid=640)     super().__init__(
(EngineCore_DP0 pid=640)   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 114, in __init__
(EngineCore_DP0 pid=640)     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=640)   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=640)     return func(*args, **kwargs)
(EngineCore_DP0 pid=640)   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 248, in _initialize_kv_caches
(EngineCore_DP0 pid=640)     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore_DP0 pid=640)   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 128, in determine_available_memory
(EngineCore_DP0 pid=640)     return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=640)   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 375, in collective_rpc
(EngineCore_DP0 pid=640)     return aggregate(get_response())
(EngineCore_DP0 pid=640)   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 358, in get_response
(EngineCore_DP0 pid=640)     raise RuntimeError(
(EngineCore_DP0 pid=640) RuntimeError: Worker failed with error '[Errno 98] Address already in use', please check the stack trace above for the root cause
Traceback (most recent call last):
  File "/checkpoint/binary/train_package/rynn_data/main_subtask_segment_caption.py", line 301, in <module>
    main()
  File "/checkpoint/binary/train_package/rynn_data/main_subtask_segment_caption.py", line 146, in main
    shared_llm = LLM(
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 352, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 176, in from_engine_args
    return cls(
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 110, in __init__
    self.engine_core = EngineCoreClient.make_client(
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 96, in make_client
    return SyncMPClient(vllm_config, executor_class, log_stats)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
    return func(*args, **kwargs)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 662, in __init__
    super().__init__(
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 493, in __init__
    with launch_core_engines(vllm_config, executor_class, log_stats) as (
  File "/opt/conda/envs/python3.10.13/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 925, in launch_core_engines
    wait_for_engine_startup(
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 984, in wait_for_engine_startup
    raise RuntimeError(
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Root Cause

It seems that the V1 engine's multiproc executor (vllm/v1/executor/multiproc_executor.py) either uses hardcoded ports for its ZMQ/RPC communication or hits a race condition when several processes look for free ports at the same time. In either case, the per-process environment variable isolation appears to be ignored.
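To make the second hypothesis concrete, here is a minimal sketch of how a "find a free port" race can end in [Errno 98]. It is not vLLM code and does not confirm what multiproc_executor.py actually does; the helper names (probe_free_port, try_bind, demo_collision) are hypothetical. It shows the general probe-then-release pattern: a port is looked up and released, and by the time it is bound for real, another party may already hold it.

```python
import errno
import socket


def probe_free_port() -> int:
    """Ask the OS for a free port, then release it immediately.

    The port number is only a hint: nothing reserves it after the
    socket is closed, so another process may take it in the meantime.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def try_bind(port: int) -> socket.socket:
    """Bind a fresh TCP socket to the given loopback port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("127.0.0.1", port))
    except OSError:
        s.close()
        raise
    return s


def demo_collision() -> int:
    """Simulate two parties that were both handed the same probed port.

    Returns the errno of the second bind attempt (0 if it succeeded).
    """
    port = probe_free_port()
    first = try_bind(port)  # party A binds the probed port first
    first.listen(1)
    try:
        second = try_bind(port)  # party B tries the same port
    except OSError as exc:
        return exc.errno
    else:
        second.close()
        return 0
    finally:
        first.close()


if __name__ == "__main__":
    code = demo_collision()
    print(code == errno.EADDRINUSE, errno.errorcode.get(code))
```

If the executor behaves this way, the usual remedy is to keep the socket bound from the moment the port is chosen instead of releasing it and binding again later. Whether that applies to vLLM 0.19.0 is an open question in this report.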




Your current environment

vllm 0.19.0

🐛 Describe the bug

I am trying to run data annotation on an 8-GPU node by launching 4 independent Python processes. Each process instantiates the LLM class with tensor_parallel_size=2 and handles 1/4 of the dataset, to maximize overall throughput.

To avoid port collisions, I explicitly set separate values of the relevant environment variables for each process (e.g., MASTER_PORT, VLLM_PORT, VLLM_RPC_PORT, VLLM_IPC_PATH). However, initialization still fails with [Errno 98] Address already in use during the V1 engine's collective_rpc initialization (EngineCore_DP0).
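For readers trying to reproduce the setup, the per-process isolation described above might look like the sketch below. The variable names MASTER_PORT, VLLM_PORT, VLLM_RPC_PORT and VLLM_IPC_PATH are taken from this report; whether vLLM 0.19.0 reads each of them is not confirmed here. CUDA_VISIBLE_DEVICES, the base port, the port stride, the IPC path pattern and the helper name build_worker_env are assumptions added for illustration.

```python
import os


def build_worker_env(
    worker_idx: int,
    tp_size: int = 2,
    base_port: int = 29500,  # assumed starting port, not from the report
    port_stride: int = 100,  # assumed gap between workers' port ranges
) -> dict[str, str]:
    """Build an isolated environment for one of the independent workers."""
    # Give each worker its own contiguous block of GPUs (assumption:
    # the report does not say how GPUs were assigned to processes).
    first_gpu = worker_idx * tp_size
    gpus = ",".join(str(first_gpu + i) for i in range(tp_size))

    # Keep every worker's ports in a separate, non-overlapping range.
    port = base_port + worker_idx * port_stride

    env = dict(os.environ)
    env.update(
        {
            "CUDA_VISIBLE_DEVICES": gpus,
            "MASTER_PORT": str(port),
            "VLLM_PORT": str(port + 1),
            "VLLM_RPC_PORT": str(port + 2),
            "VLLM_IPC_PATH": f"/tmp/vllm_ipc_{worker_idx}",  # hypothetical path
        }
    )
    return env


if __name__ == "__main__":
    for idx in range(4):
        env = build_worker_env(idx)
        print(idx, env["CUDA_VISIBLE_DEVICES"], env["MASTER_PORT"])
```

Each dictionary would be passed as env= to subprocess.Popen when starting a worker script. According to this report, isolation of this kind did not prevent the error, which suggests the colliding port is chosen somewhere these variables do not reach.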


