vllm - 💡(How to fix) Fix [Bug]: vLLM serve with tensor-parallel-size=8 on Kubernetes + vGPU fails: NCCL TCPStore broken pipe, EngineCore initialization failed

Error Message

Exception: WorkerProc initialization failed due to an exception in a background process. ... RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Your current environment

1

🐛 Describe the bug

vLLM version: vllm/vllm-openai:deepseekv4-cu129 (container image, used as the Dockerfile FROM base)

Python version: 3.12

Hardware/Cluster: Kubernetes cluster, nodes with H100 vGPUs (virtual GPUs), requesting 8 vGPUs via nvidia.com/vgpu resource

Operating System: Container image based on Ubuntu (inferred)

Full launch command:

vllm serve /shared/models/huggingface/hub/models--deepseek-ai--DeepSeek-V4-Flash/snapshots/6976c7ff1b30a1b2cb7805021b8ba4684041f136 \
  --host 0.0.0.0 \
  --port 5180 \
  --enable-prefix-caching \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tensor-parallel-size 8 \
  --max-model-len 8192 \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --kv-cache-dtype fp8

Kubernetes resource configuration:

  • Limits/requests: nvidia.com/vgpu: 8, CPU 32/16, memory 320Gi/160Gi
  • hostIPC: true and a /dev/shm volume were not set initially. We added both later, but whether vLLM is compatible with vGPUs still needs confirmation.
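
Since hostIPC and the /dev/shm volume were initially missing, it may help to confirm how much shared memory the container actually sees. The sketch below is only an illustration: the helper name mount_size_kib is hypothetical, and whether the /dev/shm size is related to this failure is an assumption, not something the report confirms. Container runtimes commonly default /dev/shm to 64 MiB unless a larger volume (for example an emptyDir with medium: Memory) is mounted there.

```shell
# Hypothetical helper: print the total size of a mount point in KiB.
# df -P forces the POSIX single-line format so the awk field index is stable.
mount_size_kib() {
  df -Pk "$1" 2>/dev/null | awk 'NR==2 {print $2}'
}

# Run inside the pod, e.g. via kubectl exec, to see what the workers get.
echo "/dev/shm size: $(mount_size_kib /dev/shm) KiB"
```

A value near 65536 KiB would suggest that the default shared-memory mount is still in effect.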

Problem Description

When starting the DeepSeek-V4-Flash model with --tensor-parallel-size 8 to run inference across 8 vGPUs, the service fails during initialization with the following two types of errors:

  1. EngineCore process exception:

    Exception: WorkerProc initialization failed due to an exception in a background process.
    ...
    RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
  2. NCCL communication error (critical):

    [rank6]:[W512 14:13:55.762631223 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=97, addr=[localhost]:46188, remote=[::ffff:127.0.0.1]:43715): Broken pipe
    [rank6]:[W512 14:13:55.765336703 ProcessGroupNCCL.cpp:1826] [PG ID 0 PG GUID 0 Rank 6] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe

Multiple ranks output similar errors, accompanied by a KeyboardInterrupt that eventually shuts down the entire APIServer.
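
The TCPStore warning itself hints that the store server "has shut down too early", so the Broken pipe lines may be a follow-on symptom of an earlier worker failure rather than the first error. As a rough triage aid, the sketch below lists which ranks logged Broken pipe in a saved log. The function name ranks_with_broken_pipe and the temporary file are illustrative; the sample lines are the two from this report.

```shell
# Hypothetical helper: list the distinct ranks that logged "Broken pipe".
# Usage: ranks_with_broken_pipe <logfile>
ranks_with_broken_pipe() {
  grep -F 'Broken pipe' "$1" | grep -oE '\[rank[0-9]+\]' | sort -u
}

# Sample input: the two warning lines quoted in this report.
log=$(mktemp)
cat > "$log" <<'EOF'
[rank6]:[W512 14:13:55.762631223 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=97, addr=[localhost]:46188, remote=[::ffff:127.0.0.1]:43715): Broken pipe
[rank6]:[W512 14:13:55.765336703 ProcessGroupNCCL.cpp:1826] [PG ID 0 PG GUID 0 Rank 6] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe
EOF

ranks_with_broken_pipe "$log"   # prints: [rank6]
```

If every rank except one appears in the output, the missing rank's earlier log lines are a reasonable place to look for the first real exception.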

In addition, there are many harmless environment variable warnings (e.g., VLLM_QWQ_SERVICE_*), which can be ignored.

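However, when I previously deployed the Qwen80B model (Qwen3-Next-80B-A3B-Instruct, with --tensor-parallel-size 4) on the same GPUs, I did not encounter similar issues.

vLLM version for that deployment: vllm/vllm-openai:v0.11.2

Other configuration for that deployment (as shown in the pod template):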

template:
  metadata:
    labels:
      app: ${APP_NAME}
    annotations:
      nvidia.com/use-gputype: H100
  spec:
    tolerations:
    - key: "node/has-no-internet"
      operator: "Exists"
      effect: "NoSchedule"
    containers:
    - name: ${APP_NAME}
      image: ${DOCKER_PULL_IMAGE}
      command: ["vllm", "serve", "/${SHARE}/models/huggingface/hub/models--Qwen--Qwen3-Next-80B-A3B-Instruct/snapshots/3eb90afa4e2fff2db323f75999fd31dd2bae4021", "--host", "0.0.0.0", "--port", "${PORT}", "--swap-space", "16", "--gpu-memory-utilization", "0.95", "--max-num-seqs", "128", "--max-num-batched-tokens", "32768", "--max-model-len", "5120", "--tensor-parallel-size", "4"]
      ports:
      - containerPort: ${PORT}
      resources:
        limits:
          nvidia.com/vgpu: 4
          cpu: 32
          memory: 160Gi
        requests:
          nvidia.com/vgpu: 4
          cpu: 16
          memory: 80Gi

The DeepSeek image ships NCCL 2.28.9, while the Qwen80B image ships NCCL 2.27.5. Could the newer NCCL version be the cause? I ask because deploying other, non-DeepSeek models (such as Qwen3.6 27B) with tensor parallelism on the newer image also produces NCCL errors. The same problem has also been present in the vllm/vllm-openai:nightly images for a long time.
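
One way to gather evidence on the NCCL question is to rerun with verbose logging. NCCL_DEBUG is an NCCL environment variable and VLLM_LOGGING_LEVEL is a vLLM one; with NCCL_DEBUG=INFO, NCCL logs its version and per-rank initialization details. A minimal sketch, assuming the same launch command as in this report and a hypothetical log file name vllm-debug.log:

```shell
# Enable verbose logging for the next launch.
export NCCL_DEBUG=INFO            # NCCL prints its version and init/transport details
export VLLM_LOGGING_LEVEL=DEBUG   # vLLM prints more detail during engine startup

# Then rerun the launch command from this report unchanged and keep the output.
# (vllm-debug.log is an illustrative file name.)
#   vllm serve <model snapshot path> --tensor-parallel-size 8 ... 2>&1 | tee vllm-debug.log
```

In a Kubernetes deployment the same two variables would go into the container's env section instead of being exported in a shell. Comparing the captured output from the 2.27.5 and 2.28.9 images should show where initialization starts to differ.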
