vllm - 💡(How to fix) Fix [Bug]: Port collision ([Errno 98] Address already in use) when launching multiple LLM(tensor_parallel_size=2) instances concurrently on a single node (V1 Engine)

Your current environment

vLLM 0.19.0

🐛 Describe the bug

I am trying to run data annotation on an 8-GPU node by launching 4 independent Python processes. Each process instantiates the LLM class with tensor_parallel_size=2 and handles one quarter of the dataset, to maximize overall throughput.
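For illustration, one way to give each of the 4 processes its own quarter of the dataset is strided slicing. This is only a sketch: the report does not show how the data is actually split, and the helper name `shard` and its parameters are hypothetical.

```python
def shard(items, index, num_shards=4):
    """Return the items assigned to worker `index` (hypothetical helper).

    Worker 0 gets items 0, 4, 8, ...; worker 1 gets items 1, 5, 9, ...; etc.
    """
    if not 0 <= index < num_shards:
        raise ValueError("index must be in [0, num_shards)")
    return items[index::num_shards]
```

With this scheme every item lands in exactly one shard, and shard sizes differ by at most one item.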

To avoid port collisions, I explicitly set different values of the port-related environment variables for each process (e.g., MASTER_PORT, VLLM_PORT, VLLM_RPC_PORT, VLLM_IPC_PATH, etc.). However, initialization still fails with [Errno 98] Address already in use when the V1 engine core (EngineCore_DP0) calls collective_rpc("determine_available_memory") while initializing the KV caches.
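A minimal sketch of the kind of per-process isolation described above, under several assumptions. The variable names are the ones listed in this report, and whether the V1 engine honors each of them is exactly what is in question. The use of CUDA_VISIBLE_DEVICES, the base port, the port spacing, the IPC path, the `--shard` flag, and the worker script are hypothetical choices for illustration, not the reporter's actual launcher.

```python
import os
import subprocess
import sys

# Names taken from the report; not a verified list of what vLLM reads.
PORT_VARS = ("MASTER_PORT", "VLLM_PORT", "VLLM_RPC_PORT")


def build_worker_env(index, gpus_per_worker=2, base_port=29500, stride=100,
                     base_env=None):
    """Return an environment for worker `index` with its own GPUs and ports."""
    env = dict(os.environ if base_env is None else base_env)
    first_gpu = index * gpus_per_worker
    env["CUDA_VISIBLE_DEVICES"] = ",".join(
        str(g) for g in range(first_gpu, first_gpu + gpus_per_worker))
    # Each worker gets a disjoint block of `stride` ports; the variables
    # inside one block are spaced 10 apart (an arbitrary choice).
    block_start = base_port + index * stride
    for offset, name in enumerate(PORT_VARS):
        env[name] = str(block_start + offset * 10)
    env["VLLM_IPC_PATH"] = f"/tmp/vllm_ipc_worker{index}"  # hypothetical path
    return env


def launch_workers(script, num_workers=4):
    """Start one Python process per worker; `script` is a hypothetical entry point."""
    return [
        subprocess.Popen(
            [sys.executable, script, "--shard", str(i)],  # hypothetical flag
            env=build_worker_env(i),
        )
        for i in range(num_workers)
    ]
```

Even with every listed variable distinct, as here, the report shows that startup can still fail, which suggests the colliding port is chosen somewhere these variables do not reach.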

It seems that the V1 engine's multiproc executor (vllm/v1/executor/multiproc_executor.py) either uses hardcoded ports for its ZMQ/RPC communication or has a race condition when several processes look for free ports concurrently; in either case, the per-process environment variables do not prevent the collision.
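To make the suspected race concrete, here is a generic sketch of the "find a free port, release it, bind it later" pattern. It is not vLLM's code, and whether the executor follows this pattern is unverified. The function names are hypothetical, and a second socket in the same process stands in for a concurrently starting LLM instance.

```python
import errno
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a free port, then release it (the racy step)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind((host, 0))  # port 0 lets the OS pick any free port
        return probe.getsockname()[1]


def demonstrate_race(host="127.0.0.1"):
    """Return the errno raised when a 'free' port is taken before it is used."""
    port = find_free_port(host)  # instance A picks a port and releases it
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as other:
        other.bind((host, port))  # instance B grabs the same port first
        other.listen()
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as late:
            try:
                late.bind((host, port))  # instance A binds too late
            except OSError as exc:
                return exc.errno
    return None


if __name__ == "__main__":
    # Compare against the symbolic constant; its numeric value is 98 on Linux.
    print(demonstrate_race() == errno.EADDRINUSE)
```

The port number is only guaranteed free at the moment it is probed. Any gap between probing and binding lets another process claim it, and fixed environment variables cannot help if the port is chosen dynamically in this way.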

(EngineCore_DP0 pid=640) Process EngineCore_DP0:
(EngineCore_DP0 pid=640) Traceback (most recent call last):
(EngineCore_DP0 pid=640)   File "/opt/conda/envs/python3.10.13/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=640)     self.run()
(EngineCore_DP0 pid=640)   File "/opt/conda/envs/python3.10.13/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=640)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=640)   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1033, in run_engine_core
(EngineCore_DP0 pid=640)     raise e
(EngineCore_DP0 pid=640)   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1019, in run_engine_core
(EngineCore_DP0 pid=640)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=640)   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=640)     return func(*args, **kwargs)
(EngineCore_DP0 pid=640)   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 763, in __init__
(EngineCore_DP0 pid=640)     super().__init__(
(EngineCore_DP0 pid=640)   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 114, in __init__
(EngineCore_DP0 pid=640)     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=640)   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=640)     return func(*args, **kwargs)
(EngineCore_DP0 pid=640)   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 248, in _initialize_kv_caches
(EngineCore_DP0 pid=640)     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore_DP0 pid=640)   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 128, in determine_available_memory
(EngineCore_DP0 pid=640)     return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=640)   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 375, in collective_rpc
(EngineCore_DP0 pid=640)     return aggregate(get_response())
(EngineCore_DP0 pid=640)   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 358, in get_response
(EngineCore_DP0 pid=640)     raise RuntimeError(
(EngineCore_DP0 pid=640) RuntimeError: Worker failed with error '[Errno 98] Address already in use', please check the stack trace above for the root cause
Traceback (most recent call last):
  File "/checkpoint/binary/train_package/rynn_data/main_subtask_segment_caption.py", line 301, in <module>
    main()
  File "/checkpoint/binary/train_package/rynn_data/main_subtask_segment_caption.py", line 146, in main
    shared_llm = LLM(
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 352, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 176, in from_engine_args
    return cls(
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 110, in __init__
    self.engine_core = EngineCoreClient.make_client(
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 96, in make_client
    return SyncMPClient(vllm_config, executor_class, log_stats)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
    return func(*args, **kwargs)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 662, in __init__
    super().__init__(
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 493, in __init__
    with launch_core_engines(vllm_config, executor_class, log_stats) as (
  File "/opt/conda/envs/python3.10.13/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 925, in launch_core_engines
    wait_for_engine_startup(
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 984, in wait_for_engine_startup
    raise RuntimeError(
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
