vllm [Bug]: DeepSeek-V4-Pro DP+EP with deepep_low_latency fails during startup: expected scalar type Long but found Int

Error Message

RuntimeError: Worker failed with error 'expected scalar type Long but found Int'
ERROR [multiproc_executor.py:283] Worker proc VllmWorker-5 died unexpectedly, shutting down executor.
Traceback:
RuntimeError: Worker failed with error 'expected scalar type Long but found Int', please check the stack trace above for the root cause

Root Cause

RuntimeError: Worker failed with error 'expected scalar type Long but found Int'
RuntimeError: Engine core initialization failed. See root cause above.
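
The engine-side error only relays the worker's message and points to a stack trace earlier in the log. As a rough aid for finding it, the sketch below scans log lines for the first dtype-mismatch message and returns it with some preceding context. The function name, the result type, and the context size are hypothetical choices for illustration and are not part of vLLM.

```python
import re
from typing import List, NamedTuple, Optional

# Matches messages such as "expected scalar type Long but found Int".
DTYPE_MISMATCH = re.compile(r"expected scalar type (\w+) but found (\w+)")


class DtypeMismatch(NamedTuple):
    expected: str        # dtype the failing op required, e.g. "Long"
    found: str           # dtype it actually received, e.g. "Int"
    line_index: int      # index of the first matching log line
    context: List[str]   # preceding lines plus the matching line


def first_dtype_mismatch(log_lines: List[str], context: int = 20) -> Optional[DtypeMismatch]:
    """Return the first dtype-mismatch message in the log, or None if absent."""
    for i, line in enumerate(log_lines):
        match = DTYPE_MISMATCH.search(line)
        if match:
            start = max(0, i - context)
            return DtypeMismatch(match.group(1), match.group(2), i, log_lines[start:i + 1])
    return None


# Example using lines quoted in this report.
sample_log = [
    "[TileLang:tilelang.jit.kernel:INFO] TileLang completes to compile kernel `mhc_post_tilelang`",
    "WARNING [multiproc_executor.py:884] WorkerProc was terminated",
    "RuntimeError: Worker failed with error 'expected scalar type Long but found Int', "
    "please check the stack trace above for the root cause",
]
hit = first_dtype_mismatch(sample_log, context=2)
```

In a full DEBUG log, the first match is the most likely place to find the worker-side traceback that names the failing call, though that is an assumption about log ordering rather than a guarantee.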

Your current environment

- Model: DeepSeek-V4-Pro local checkpoint
- Hardware: 2 nodes × 8× NVIDIA H100
- Network: InfiniBand between nodes
- Parallelism:
  - Tensor parallel size: 8
  - Data parallel size: 2
  - Expert parallel: enabled
  - Effective EP size: 16
- KV cache dtype: fp8
- Block size: 256
- Max model length: 262144
- Backend that triggers the issue: `--all2all-backend deepep_low_latency`
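
The reported effective EP size of 16 matches TP × DP = 8 × 2 across the two 8-GPU nodes. The following is a minimal sketch of that arithmetic, assuming (as the reported numbers suggest) that expert parallelism spans every TP × DP rank. The class and field names are hypothetical and are not vLLM API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    tensor_parallel_size: int
    data_parallel_size: int
    gpus_per_node: int

    @property
    def world_size(self) -> int:
        # Total number of GPU ranks across all data-parallel replicas.
        return self.tensor_parallel_size * self.data_parallel_size

    @property
    def expert_parallel_size(self) -> int:
        # Assumption from the report: with expert parallel enabled,
        # experts are sharded over all TP x DP ranks.
        return self.world_size

    @property
    def nodes_required(self) -> int:
        # Ceiling division: nodes needed to host every rank.
        return -(-self.world_size // self.gpus_per_node)


# Values taken from the environment described in this report.
layout = ParallelLayout(tensor_parallel_size=8, data_parallel_size=2, gpus_per_node=8)
print(layout.expert_parallel_size)
print(layout.nodes_required)
```

With the values above, each node hosts one data-parallel rank of eight tensor-parallel GPUs, so the all2all backend has to exchange expert traffic between nodes.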

🐛 Describe the bug

DeepSeek-V4-Pro fails during engine startup when running with TP=8, DP=2, expert parallel enabled, and `--all2all-backend deepep_low_latency` on a 2-node H100 InfiniBand setup.

The failure happens before serving any requests, during engine initialization / memory profiling / KV cache initialization. One worker process is terminated after TileLang kernels are compiled, and then the engine core fails with:

```text
RuntimeError: Worker failed with error 'expected scalar type Long but found Int'
RuntimeError: Engine core initialization failed. See root cause above.
```

Reproduce script:

```bash
export VLLM_ENGINE_READY_TIMEOUT_S=3600
export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NCCL_IB_DISABLE=0
export GLOO_SOCKET_IFNAME=eth0
export VLLM_LOGGING_LEVEL=DEBUG

vllm serve /path/to/deepseek-v4-pro \
  --served-model-name deepseek-v4-pro \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --enable-expert-parallel \
  --all2all-backend deepep_low_latency \
  -cc.pass_config.fuse_allreduce_rms=False \
  --tensor-parallel-size 8 \
  --data-parallel-size "${DP_SIZE}" \
  --data-parallel-rank "${DP_RANK}" \
  --data-parallel-address "${MASTER_IP}" \
  --data-parallel-rpc-port 13346 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 3 \
  --max-num-batched-tokens 512 \
  --no-enable-flashinfer-autotune \
  --compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}' \
  --tokenizer-mode deepseek_v4 \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-prefix-caching \
  --enable-prompt-tokens-details \
  --enable-force-include-usage
```

Log:

```text
[TileLang:tilelang.jit.kernel:INFO] TileLang completes to compile kernel `mhc_pre_big_fuse_tilelang`
[TileLang:tilelang.jit.kernel:INFO] TileLang begins to compile kernel `mhc_post_tilelang` with `out_idx=None`
[TileLang:tilelang.jit.kernel:INFO] TileLang completes to compile kernel `mhc_post_tilelang`

WARNING [multiproc_executor.py:884] WorkerProc was terminated
ERROR [multiproc_executor.py:283] Worker proc VllmWorker-5 died unexpectedly, shutting down executor.

Traceback:
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 128, in __init__
    kv_cache_config = self._initialize_kv_caches(vllm_config)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 250, in _initialize_kv_caches
    available_gpu_memory = self.model_executor.determine_available_memory()
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 147, in determine_available_memory
    return self.collective_rpc("determine_available_memory")
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 390, in get_response
    raise RuntimeError(

RuntimeError: Worker failed with error 'expected scalar type Long but found Int', please check the stack trace above for the root cause

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
```

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
