vllm [Bug]: DeepSeek-V4-Pro H200 DP+EP router dtype mismatch in topk_hash_softplus_sqrt (Long/Int inconsistency)

vllm-project/vllm#40862 (fetched 2026-04-26 05:06:22)

Error Message

(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971] Traceback (most recent call last):
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 966, in worker_busy_loop
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     output = func(*args, **kwargs)
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]              ^^^^^^^^^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     return func(*args, **kwargs)
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 370, in determine_available_memory
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     self.model_runner.profile_run()
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5826, in profile_run
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     hidden_states, last_hidden_states = self._dummy_run(
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]                                         ^^^^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     return func(*args, **kwargs)
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5514, in _dummy_run
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     outputs = self.model(
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]               ^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     return self._call_impl(*args, **kwargs)
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     return forward_call(*args, **kwargs)
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v4.py", line 833, in forward
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     hidden_states = self.model(
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]                     ^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 467, in __call__
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     return self.forward(*args, **kwargs)
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v4.py", line 629, in forward
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     hidden_states = layer(
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]                     ^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     return self._call_impl(*args, **kwargs)
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     return forward_call(*args, **kwargs)
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v4.py", line 531, in forward
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     x = self.ffn(x, input_ids)
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]         ^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     return self._call_impl(*args, **kwargs)
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     return forward_call(*args, **kwargs)
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v4.py", line 202, in forward
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     fused_moe_out = self.experts(
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]                     ^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/router/fused_topk_bias_router.py", line 71, in vllm_topk_softplus_sqrt
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     ops.topk_hash_softplus_sqrt(
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/_custom_ops.py", line 2390, in topk_hash_softplus_sqrt
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     torch.ops._moe_C.topk_softplus_sqrt(
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1269, in __call__
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     return self._op(*args, **kwargs)
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971] RuntimeError: expected scalar type Long but found Int

Root Cause

RuntimeError: Worker failed with error 'expected scalar type Long but found Int', please check the stack trace above for the root cause
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Fix Action

Fix / Workaround

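I tested a monkey-patch around vllm._custom_ops.topk_hash_softplus_sqrt to log dtypes and force suspected tensors to torch.long. At the op boundary, topk_indices, input_tokens, and hash_indices_table were already torch.int64; only token_expert_indices was torch.int32. After forcing it to Long as well, the error flipped to “expected scalar type Int but found Long”.

So a blanket “cast indices to Long” workaround is not correct. The result suggests the fused router op has mixed dtype expectations and that the bug is in the op contract / binding rather than a simple caller-side cast mistake.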

Code Example

import os
import sys


def _patch_topk_hash_softplus_sqrt() -> None:
    if os.environ.get("VLLM_PATCH_TOPK_HASH_SOFTPLUS_SQRT") != "1":
        return

    try:
        import torch
        import vllm._custom_ops as ops
    except Exception as exc:
        print(f"[sitecustomize] patch import skipped: {exc}", file=sys.stderr)
        return

    original = getattr(ops, "topk_hash_softplus_sqrt", None)
    if original is None:
        print("[sitecustomize] topk_hash_softplus_sqrt not found", file=sys.stderr)
        return

    if getattr(original, "_vllm_long_patch", False):
        return

    debug = os.environ.get("VLLM_DEBUG_TOPK_HASH_SOFTPLUS_SQRT") == "1"
    logged = {"done": False}
    log_path = os.environ.get("VLLM_TOPK_PATCH_LOG")

    def _log(msg: str) -> None:
        print(msg, file=sys.stderr)
        if log_path:
            try:
                with open(log_path, "a", encoding="utf-8") as f:
                    f.write(msg + "\n")
            except Exception:
                pass

    def patched(
        topk_weights,
        topk_indices,
        token_expert_indices,
        gating_output,
        renormalize,
        routed_scaling_factor,
        e_score_correction_bias,
        input_tokens,
        hash_indices_table,
    ):
        orig_topk_indices = topk_indices
        orig_token_expert_indices = token_expert_indices

        if topk_indices is not None and topk_indices.dtype != torch.long:
            topk_indices = torch.empty_like(topk_indices, dtype=torch.long)
        if (token_expert_indices is not None
                and token_expert_indices.dtype != torch.long):
            token_expert_indices = torch.empty_like(token_expert_indices,
                                                    dtype=torch.long)
        if input_tokens is not None and input_tokens.dtype != torch.long:
            input_tokens = input_tokens.to(dtype=torch.long)
        if (hash_indices_table is not None
                and hash_indices_table.dtype != torch.long):
            hash_indices_table = hash_indices_table.to(dtype=torch.long)

        if debug and not logged["done"]:
            _log(
                "[sitecustomize] topk_hash_softplus_sqrt dtypes "
                f"topk_indices={orig_topk_indices.dtype} -> {topk_indices.dtype}, "
                "token_expert_indices="
                f"{orig_token_expert_indices.dtype} -> {token_expert_indices.dtype}, "
                f"input_tokens={None if input_tokens is None else input_tokens.dtype}, "
                "hash_indices_table="
                f"{None if hash_indices_table is None else hash_indices_table.dtype}"
            )
            logged["done"] = True

        try:
            result = original(
                topk_weights,
                topk_indices,
                token_expert_indices,
                gating_output,
                renormalize,
                routed_scaling_factor,
                e_score_correction_bias,
                input_tokens,
                hash_indices_table,
            )
        except Exception as exc:
            _log(
                "[sitecustomize] topk_hash_softplus_sqrt exception "
                f"{type(exc).__name__}: {exc} "
                "with dtypes "
                f"topk_indices={topk_indices.dtype}, "
                f"token_expert_indices={token_expert_indices.dtype}, "
                f"input_tokens={None if input_tokens is None else input_tokens.dtype}, "
                "hash_indices_table="
                f"{None if hash_indices_table is None else hash_indices_table.dtype}"
            )
            raise

        if topk_indices is not orig_topk_indices:
            orig_topk_indices.copy_(topk_indices.to(dtype=orig_topk_indices.dtype))
        if token_expert_indices is not orig_token_expert_indices:
            orig_token_expert_indices.copy_(
                token_expert_indices.to(dtype=orig_token_expert_indices.dtype)
            )

        return result

    patched._vllm_long_patch = True
    ops.topk_hash_softplus_sqrt = patched
    _log("[sitecustomize] patched topk_hash_softplus_sqrt")


_patch_topk_hash_softplus_sqrt()

---

docker run -d --name deepseek-pro-dpep8-patched \
  --gpus all \
  --privileged \
  --ipc=host \
  --network host \
  --shm-size=32g \
  -v /dev/infiniband:/dev/infiniband \
  -v /home/amit/vllm-cache/hf:/root/.cache/huggingface \
  -v /tmp/deepseek_v4_sitecustomize.py:/opt/vllm-patches/sitecustomize.py:ro \
  -v /tmp:/host-tmp \
  -e HF_TOKEN="$HF_TOKEN" \
  -e HF_HUB_ENABLE_HF_TRANSFER=1 \
  -e VLLM_HOST_IP=172.25.131.103 \
  -e NCCL_SOCKET_IFNAME=enp188s0f1np1 \
  -e GLOO_SOCKET_IFNAME=enp188s0f1np1 \
  -e VLLM_DISABLE_COMPILE_CACHE=1 \
  -e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
  -e VLLM_RPC_TIMEOUT=600000 \
  -e TILELANG_CLEANUP_TEMP_FILES=1 \
  -e VLLM_LOG_STATS_INTERVAL=1 \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -e PYTHONPATH=/opt/vllm-patches \
  -e VLLM_PATCH_TOPK_HASH_SOFTPLUS_SQRT=1 \
  -e VLLM_DEBUG_TOPK_HASH_SOFTPLUS_SQRT=1 \
  -e VLLM_TOPK_PATCH_LOG=/host-tmp/vllm_topk_patch.log \
  vllm/vllm-openai:deepseekv4-cu130 \
  /root/.cache/huggingface/deepseek-v4-pro-local \
  --served-model-name deepseek-ai/DeepSeek-V4-Pro \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8001 \
  --tensor-parallel-size 1 \
  --data-parallel-size 8 \
  --data-parallel-size-local 8 \
  --data-parallel-address 172.25.131.103 \
  --data-parallel-rpc-port 13346 \
  --enable-expert-parallel \
  --all2all-backend deepep_low_latency \
  --api-server-count 8 \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --enforce-eager \
  --max-model-len 131072 \
  --no-disable-hybrid-kv-cache-manager \
  --disable-uvicorn-access-log

---

[sitecustomize] topk_hash_softplus_sqrt dtypes topk_indices=torch.int64 -> torch.int64, token_expert_indices=torch.int32 -> torch.int64, input_tokens=torch.int64, hash_indices_table=torch.int64
[sitecustomize] topk_hash_softplus_sqrt exception RuntimeError: expected scalar type Int but found Long with dtypes topk_indices=torch.int64, token_expert_indices=torch.int64, input_tokens=torch.int64, hash_indices_table=torch.int64
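
The two log lines above mix naming schemes: the dtype dump uses torch dtype names (torch.int32, torch.int64), while the RuntimeError uses ATen scalar-type names (Int, Long). As a reading aid, here is a minimal sketch that translates the error text into dtype names. The helper name explain_dtype_error and its two-entry table are illustrative assumptions and not part of vLLM or PyTorch.

```python
import re

# ATen scalar-type names as they appear in PyTorch error messages,
# mapped to the matching torch dtype names. Only the two names seen
# in this issue are listed; extend the table if others show up.
SCALAR_TYPE_TO_DTYPE = {
    "Int": "torch.int32",
    "Long": "torch.int64",
}

_PATTERN = re.compile(r"expected scalar type (\w+) but found (\w+)")


def explain_dtype_error(message):
    """Return (expected_dtype, found_dtype) or None if the text does not match."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    expected, found = match.groups()
    # Unknown names are passed through unchanged rather than guessed.
    return (
        SCALAR_TYPE_TO_DTYPE.get(expected, expected),
        SCALAR_TYPE_TO_DTYPE.get(found, found),
    )


original = explain_dtype_error("RuntimeError: expected scalar type Long but found Int")
patched = explain_dtype_error("RuntimeError: expected scalar type Int but found Long")
print(original)  # ('torch.int64', 'torch.int32')
print(patched)   # ('torch.int32', 'torch.int64')
```

Read this way, the unpatched run was handed an int32 tensor where int64 was wanted, and the patched run was handed an int64 tensor where int32 was wanted.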

Your current environment

Environment

  • vLLM image: vllm/vllm-openai:deepseekv4-cu130
  • Model: deepseek-ai/DeepSeek-V4-Pro
  • Hardware: NVIDIA H200
  • Reproduced on:
    • single-node 8x H200 with DP=8 + EP
    • multi-node 16x H200 with DP=16 + EP
  • Fabric/network is healthy; other large-model runs on this cluster are fine

🐛 Describe the bug

Observed failure

The intended DeepSeek-V4-Pro DP+EP path fails during startup/profiling in the fused router path, with dtype mismatch errors around topk_hash_softplus_sqrt.

Original failure:

  • expected scalar type Long but found Int

The failing stack goes through:

  • fused_topk_bias_router.py
  • ops.topk_hash_softplus_sqrt
  • torch.ops._moe_C.topk_softplus_sqrt

I tested a monkey-patch around vllm._custom_ops.topk_hash_softplus_sqrt to log dtypes and force suspected tensors to torch.long.

Observed dtypes at the op boundary:

  • topk_indices: torch.int64
  • input_tokens: torch.int64
  • hash_indices_table: torch.int64
  • token_expert_indices: torch.int32

After forcing the relevant integer tensors to Long, the error flipped to:

  • expected scalar type Int but found Long

So a blanket “cast indices to Long” workaround is not correct. The result suggests the fused router op has mixed dtype expectations and the bug is in the op contract / binding rather than a simple caller-side cast mistake.

Important behavior

  • TP=8 --enforce-eager works on the same H200 node
  • DP+EP still fails
  • --enforce-eager alone does not fix the intended DP+EP path

Request

Can you clarify the intended dtype contract for topk_hash_softplus_sqrt inputs, especially:

  • topk_indices
  • token_expert_indices
  • input_tokens
  • hash_indices_table
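
Until the contract is documented, one way to map it empirically is to try every int32/int64 assignment for the integer arguments and record which ones the op accepts. The sketch below shows only the search loop. probe_dtype_contract and fake_op are hypothetical names, and fake_op's two-argument contract is an arbitrary stand-in chosen for the demo, not the real contract of topk_hash_softplus_sqrt.

```python
import itertools


def probe_dtype_contract(op, arg_names, candidates=("int32", "int64")):
    """Call op once per dtype assignment; return (accepted, rejected)."""
    accepted, rejected = [], []
    for combo in itertools.product(candidates, repeat=len(arg_names)):
        assignment = dict(zip(arg_names, combo))
        try:
            op(**assignment)
        except RuntimeError as exc:
            # Keep the message: it says which dtype the op wanted.
            rejected.append((assignment, str(exc)))
        else:
            accepted.append(assignment)
    return accepted, rejected


# Stand-in for the real op (assumption for illustration only):
# it wants int64 for one argument and int32 for the other.
_FAKE_CONTRACT = {"indices_a": "int64", "indices_b": "int32"}
_ATEN_NAME = {"int32": "Int", "int64": "Long"}


def fake_op(**dtypes):
    for name, want in _FAKE_CONTRACT.items():
        got = dtypes[name]
        if got != want:
            raise RuntimeError(
                f"expected scalar type {_ATEN_NAME[want]} "
                f"but found {_ATEN_NAME[got]}"
            )


accepted, rejected = probe_dtype_contract(fake_op, list(_FAKE_CONTRACT))
print("accepted:", accepted)
for assignment, message in rejected:
    print("rejected:", assignment, "->", message)
```

Against the real op, `op` would be a small closure that allocates tensors with the requested dtypes and calls vllm._custom_ops.topk_hash_softplus_sqrt. With the four integer arguments listed above, that is 16 calls, which is cheap enough to run once during debugging.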

Exact Failing Command

Single-node DP=8 + EP reproducer on 8x H200:

docker run -d --name deepseek-pro-dpep8-unpatched \
  --gpus all \
  --privileged \
  --ipc=host \
  --network host \
  --shm-size=32g \
  -v /dev/infiniband:/dev/infiniband \
  -v /home/amit/vllm-cache/hf:/root/.cache/huggingface \
  -e HF_TOKEN="$HF_TOKEN" \
  -e HF_HUB_ENABLE_HF_TRANSFER=1 \
  -e VLLM_HOST_IP=172.25.131.103 \
  -e NCCL_SOCKET_IFNAME=enp188s0f1np1 \
  -e GLOO_SOCKET_IFNAME=enp188s0f1np1 \
  -e VLLM_DISABLE_COMPILE_CACHE=1 \
  -e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
  -e VLLM_RPC_TIMEOUT=600000 \
  -e TILELANG_CLEANUP_TEMP_FILES=1 \
  -e VLLM_LOG_STATS_INTERVAL=1 \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  vllm/vllm-openai:deepseekv4-cu130 \
  /root/.cache/huggingface/deepseek-v4-pro-local \
  --served-model-name deepseek-ai/DeepSeek-V4-Pro \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8001 \
  --tensor-parallel-size 1 \
  --data-parallel-size 8 \
  --data-parallel-size-local 8 \
  --data-parallel-address 172.25.131.103 \
  --data-parallel-rpc-port 13346 \
  --enable-expert-parallel \
  --all2all-backend deepep_low_latency \
  --api-server-count 8 \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --enforce-eager \
  --max-model-len 131072 \
  --no-disable-hybrid-kv-cache-manager \
  --disable-uvicorn-access-log

Exact Working Command

Working fallback on the same node:

docker run -d --name deepseek-pro-tp8 \
  --gpus all \
  --privileged \
  --ipc=host \
  --network host \
  --shm-size=32g \
  -v /dev/infiniband:/dev/infiniband \
  -v /home/amit/vllm-cache/hf:/root/.cache/huggingface \
  -e HF_TOKEN="$HF_TOKEN" \
  -e HF_HUB_ENABLE_HF_TRANSFER=1 \
  -e VLLM_HOST_IP=172.25.131.103 \
  -e NCCL_SOCKET_IFNAME=enp188s0f1np1 \
  -e GLOO_SOCKET_IFNAME=enp188s0f1np1 \
  -e VLLM_DISABLE_COMPILE_CACHE=1 \
  -e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
  -e VLLM_RPC_TIMEOUT=600000 \
  -e TILELANG_CLEANUP_TEMP_FILES=1 \
  -e VLLM_LOG_STATS_INTERVAL=1 \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  vllm/vllm-openai:deepseekv4-cu130 \
  /root/.cache/huggingface/deepseek-v4-pro-local \
  --served-model-name deepseek-ai/DeepSeek-V4-Pro \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 8 \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --enforce-eager \
  --max-model-len 131072 \
  --no-disable-hybrid-kv-cache-manager \
  --disable-uvicorn-access-log
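
To see which engine flags separate the failing DP+EP launch from this working TP=8 launch, the two flag lists can be compared mechanically. The following is a minimal sketch: parse_flags and diff_flags are illustrative helper names, not part of vLLM, and the flag text is copied from the two commands in this report.

```python
import shlex


def parse_flags(argv_text):
    """Map each --flag to its value, or to None for a boolean switch."""
    tokens = shlex.split(argv_text)
    flags = {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("--"):
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            if nxt is not None and not nxt.startswith("--"):
                flags[tok] = nxt
                i += 2
                continue
            flags[tok] = None
        i += 1
    return flags


def diff_flags(a, b):
    """Return (flags only in a, flags only in b, flags whose values differ)."""
    only_a = {k: a[k] for k in a.keys() - b.keys()}
    only_b = {k: b[k] for k in b.keys() - a.keys()}
    changed = {k: (a[k], b[k]) for k in a.keys() & b.keys() if a[k] != b[k]}
    return only_a, only_b, changed


# Flags shared by both launches, copied from the report.
SHARED = """
--served-model-name deepseek-ai/DeepSeek-V4-Pro --trust-remote-code
--host 0.0.0.0 --tokenizer-mode deepseek_v4 --tool-call-parser deepseek_v4
--enable-auto-tool-choice --reasoning-parser deepseek_v4 --kv-cache-dtype fp8
--block-size 256 --enforce-eager --max-model-len 131072
--no-disable-hybrid-kv-cache-manager --disable-uvicorn-access-log
"""

FAILING = SHARED + """
--port 8001 --tensor-parallel-size 1 --data-parallel-size 8
--data-parallel-size-local 8 --data-parallel-address 172.25.131.103
--data-parallel-rpc-port 13346 --enable-expert-parallel
--all2all-backend deepep_low_latency --api-server-count 8
"""

WORKING = SHARED + """
--port 8000 --tensor-parallel-size 8
"""

only_failing, only_working, changed = diff_flags(
    parse_flags(FAILING), parse_flags(WORKING)
)
for name in sorted(only_failing):
    print("only in failing:", name, only_failing[name])
for name in sorted(changed):
    print("changed:", name, changed[name])
```

The diff narrows the search to the data-parallel, expert-parallel, and all2all options; it does not by itself say which of them routes execution into the failing op.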

Original Unpatched Traceback

This is from the single-node DP=8 + EP reproducer above.

(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971] Traceback (most recent call last):
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 966, in worker_busy_loop
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     output = func(*args, **kwargs)
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]              ^^^^^^^^^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     return func(*args, **kwargs)
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 370, in determine_available_memory
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     self.model_runner.profile_run()
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5826, in profile_run
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     hidden_states, last_hidden_states = self._dummy_run(
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]                                         ^^^^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     return func(*args, **kwargs)
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5514, in _dummy_run
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     outputs = self.model(
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]               ^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     return self._call_impl(*args, **kwargs)
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     return forward_call(*args, **kwargs)
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v4.py", line 833, in forward
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     hidden_states = self.model(
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]                     ^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 467, in __call__
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     return self.forward(*args, **kwargs)
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v4.py", line 629, in forward
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     hidden_states = layer(
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]                     ^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     return self._call_impl(*args, **kwargs)
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     return forward_call(*args, **kwargs)
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v4.py", line 531, in forward
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     x = self.ffn(x, input_ids)
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]         ^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     return self._call_impl(*args, **kwargs)
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     return forward_call(*args, **kwargs)
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v4.py", line 202, in forward
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     fused_moe_out = self.experts(
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]                     ^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/router/fused_topk_bias_router.py", line 71, in vllm_topk_softplus_sqrt
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     ops.topk_hash_softplus_sqrt(
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/_custom_ops.py", line 2390, in topk_hash_softplus_sqrt
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     torch.ops._moe_C.topk_softplus_sqrt(
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1269, in __call__
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]     return self._op(*args, **kwargs)
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971] RuntimeError: expected scalar type Long but found Int

Top-level engine error:

RuntimeError: Worker failed with error 'expected scalar type Long but found Int', please check the stack trace above for the root cause
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
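
Because every traceback line carries the worker prefix, the root cause is easy to lose in a multi-worker log. The sketch below strips that prefix and keeps the last exception line seen per worker. The regular expressions are written against the log format shown in this report only, and root_causes is an illustrative name rather than a vLLM utility.

```python
import re

# Matches lines such as:
# (Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 [multiproc_executor.py:971] msg
LINE_RE = re.compile(
    r"^\((?P<worker>\S+) pid=(?P<pid>\d+)\) "
    r"(?P<level>[A-Z]+) (?P<ts>\d\d-\d\d \d\d:\d\d:\d\d) "
    r"\[(?P<src>[^\]]+)\] ?(?P<msg>.*)$"
)
# An exception summary line starts at column 0 of the message,
# e.g. "RuntimeError: ...". Traceback frames are indented and do not match.
EXC_RE = re.compile(r"^[A-Za-z_][\w.]*(Error|Exception): ")


def root_causes(log_text):
    """Return {worker name: last exception line logged by that worker}."""
    causes = {}
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m and EXC_RE.match(m.group("msg")):
            causes[m.group("worker")] = m.group("msg")
    return causes


sample = "\n".join([
    "(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 "
    "[multiproc_executor.py:971] Traceback (most recent call last):",
    "(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 "
    "[multiproc_executor.py:971]     return self._op(*args, **kwargs)",
    "(Worker_DP6_EP6 pid=3561) ERROR 04-25 05:03:54 "
    "[multiproc_executor.py:971] RuntimeError: expected scalar type Long "
    "but found Int",
])
print(root_causes(sample))
```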

Monkey Patch Used For Diagnosis

This patch was mounted via sitecustomize.py and only used for diagnosis.

import os
import sys


def _patch_topk_hash_softplus_sqrt() -> None:
    if os.environ.get("VLLM_PATCH_TOPK_HASH_SOFTPLUS_SQRT") != "1":
        return

    try:
        import torch
        import vllm._custom_ops as ops
    except Exception as exc:
        print(f"[sitecustomize] patch import skipped: {exc}", file=sys.stderr)
        return

    original = getattr(ops, "topk_hash_softplus_sqrt", None)
    if original is None:
        print("[sitecustomize] topk_hash_softplus_sqrt not found", file=sys.stderr)
        return

    if getattr(original, "_vllm_long_patch", False):
        return

    debug = os.environ.get("VLLM_DEBUG_TOPK_HASH_SOFTPLUS_SQRT") == "1"
    logged = {"done": False}
    log_path = os.environ.get("VLLM_TOPK_PATCH_LOG")

    def _log(msg: str) -> None:
        print(msg, file=sys.stderr)
        if log_path:
            try:
                with open(log_path, "a", encoding="utf-8") as f:
                    f.write(msg + "\n")
            except Exception:
                pass

    def patched(
        topk_weights,
        topk_indices,
        token_expert_indices,
        gating_output,
        renormalize,
        routed_scaling_factor,
        e_score_correction_bias,
        input_tokens,
        hash_indices_table,
    ):
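        # Keep references to the caller's buffers so results can be copied
        # back into them after the op has run on the int64 temporaries.
        orig_topk_indices = topk_indices
        orig_token_expert_indices = token_expert_indices

        # Treated as output buffers: allocate int64 tensors of the same shape.
        if topk_indices is not None and topk_indices.dtype != torch.long:
            topk_indices = torch.empty_like(topk_indices, dtype=torch.long)
        if (token_expert_indices is not None
                and token_expert_indices.dtype != torch.long):
            token_expert_indices = torch.empty_like(token_expert_indices,
                                                    dtype=torch.long)
        # Treated as inputs: convert the existing values to int64.
        # Note: the debug log below reports these two dtypes after the cast.
        if input_tokens is not None and input_tokens.dtype != torch.long:
            input_tokens = input_tokens.to(dtype=torch.long)
        if (hash_indices_table is not None
                and hash_indices_table.dtype != torch.long):
            hash_indices_table = hash_indices_table.to(dtype=torch.long)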

        if debug and not logged["done"]:
            _log(
                "[sitecustomize] topk_hash_softplus_sqrt dtypes "
                f"topk_indices={orig_topk_indices.dtype} -> {topk_indices.dtype}, "
                "token_expert_indices="
                f"{orig_token_expert_indices.dtype} -> {token_expert_indices.dtype}, "
                f"input_tokens={None if input_tokens is None else input_tokens.dtype}, "
                "hash_indices_table="
                f"{None if hash_indices_table is None else hash_indices_table.dtype}"
            )
            logged["done"] = True

        try:
            result = original(
                topk_weights,
                topk_indices,
                token_expert_indices,
                gating_output,
                renormalize,
                routed_scaling_factor,
                e_score_correction_bias,
                input_tokens,
                hash_indices_table,
            )
        except Exception as exc:
            _log(
                "[sitecustomize] topk_hash_softplus_sqrt exception "
                f"{type(exc).__name__}: {exc} "
                "with dtypes "
                f"topk_indices={topk_indices.dtype}, "
                f"token_expert_indices={token_expert_indices.dtype}, "
                f"input_tokens={None if input_tokens is None else input_tokens.dtype}, "
                "hash_indices_table="
                f"{None if hash_indices_table is None else hash_indices_table.dtype}"
            )
            raise

        if topk_indices is not orig_topk_indices:
            orig_topk_indices.copy_(topk_indices.to(dtype=orig_topk_indices.dtype))
        if token_expert_indices is not orig_token_expert_indices:
            orig_token_expert_indices.copy_(
                token_expert_indices.to(dtype=orig_token_expert_indices.dtype)
            )

        return result

    patched._vllm_long_patch = True
    ops.topk_hash_softplus_sqrt = patched
    _log("[sitecustomize] patched topk_hash_softplus_sqrt")


_patch_topk_hash_softplus_sqrt()

Exact Patched Command

docker run -d --name deepseek-pro-dpep8-patched \
  --gpus all \
  --privileged \
  --ipc=host \
  --network host \
  --shm-size=32g \
  -v /dev/infiniband:/dev/infiniband \
  -v /home/amit/vllm-cache/hf:/root/.cache/huggingface \
  -v /tmp/deepseek_v4_sitecustomize.py:/opt/vllm-patches/sitecustomize.py:ro \
  -v /tmp:/host-tmp \
  -e HF_TOKEN="$HF_TOKEN" \
  -e HF_HUB_ENABLE_HF_TRANSFER=1 \
  -e VLLM_HOST_IP=172.25.131.103 \
  -e NCCL_SOCKET_IFNAME=enp188s0f1np1 \
  -e GLOO_SOCKET_IFNAME=enp188s0f1np1 \
  -e VLLM_DISABLE_COMPILE_CACHE=1 \
  -e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
  -e VLLM_RPC_TIMEOUT=600000 \
  -e TILELANG_CLEANUP_TEMP_FILES=1 \
  -e VLLM_LOG_STATS_INTERVAL=1 \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -e PYTHONPATH=/opt/vllm-patches \
  -e VLLM_PATCH_TOPK_HASH_SOFTPLUS_SQRT=1 \
  -e VLLM_DEBUG_TOPK_HASH_SOFTPLUS_SQRT=1 \
  -e VLLM_TOPK_PATCH_LOG=/host-tmp/vllm_topk_patch.log \
  vllm/vllm-openai:deepseekv4-cu130 \
  /root/.cache/huggingface/deepseek-v4-pro-local \
  --served-model-name deepseek-ai/DeepSeek-V4-Pro \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8001 \
  --tensor-parallel-size 1 \
  --data-parallel-size 8 \
  --data-parallel-size-local 8 \
  --data-parallel-address 172.25.131.103 \
  --data-parallel-rpc-port 13346 \
  --enable-expert-parallel \
  --all2all-backend deepep_low_latency \
  --api-server-count 8 \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --enforce-eager \
  --max-model-len 131072 \
  --no-disable-hybrid-kv-cache-manager \
  --disable-uvicorn-access-log

Patched Diagnostic Result

The patch log showed:

[sitecustomize] topk_hash_softplus_sqrt dtypes topk_indices=torch.int64 -> torch.int64, token_expert_indices=torch.int32 -> torch.int64, input_tokens=torch.int64, hash_indices_table=torch.int64
[sitecustomize] topk_hash_softplus_sqrt exception RuntimeError: expected scalar type Int but found Long with dtypes topk_indices=torch.int64, token_expert_indices=torch.int64, input_tokens=torch.int64, hash_indices_table=torch.int64

This repeated across multiple workers.
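
The first log line can be read mechanically to separate what it does and does not show. This sketch parses the name=dtype pairs from that line. Arguments logged with an arrow have a recorded original dtype; arguments logged without one are printed by the patch only after its cast, so their original dtype is unknown. parse_dtypes is an illustrative name.

```python
import re

# name=torch.X            (only the dtype after the patch's cast is logged)
# name=torch.X -> torch.Y (original dtype -> dtype passed to the op)
PAIR_RE = re.compile(r"(\w+)=(torch\.\w+|None)(?: -> (torch\.\w+))?")


def parse_dtypes(line):
    """Return {arg name: (original dtype or None if unknown, passed dtype)}."""
    out = {}
    for name, first, second in PAIR_RE.findall(line):
        out[name] = (first, second) if second else (None, first)
    return out


log_line = (
    "[sitecustomize] topk_hash_softplus_sqrt dtypes "
    "topk_indices=torch.int64 -> torch.int64, "
    "token_expert_indices=torch.int32 -> torch.int64, "
    "input_tokens=torch.int64, hash_indices_table=torch.int64"
)

dtypes = parse_dtypes(log_line)
changed = [n for n, (before, after) in dtypes.items()
           if before is not None and before != after]
unknown_original = [n for n, (before, _) in dtypes.items() if before is None]
print("cast by the patch:", changed)
print("original dtype not recorded:", unknown_original)
```

Read this way, the log confirms only that token_expert_indices arrived as torch.int32 and was cast; it does not show what input_tokens and hash_indices_table looked like before the patch touched them.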


extent analysis

TL;DR

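The MoE router op topk_hash_softplus_sqrt fails a dtype check during the startup profile run under DP=8 + EP. Unpatched, it raises "expected scalar type Long but found Int". With the diagnostic patch casting every index tensor to torch.int64, it raises the opposite error, "expected scalar type Int but found Long". A blanket cast to Long therefore does not fix it; the kernel appears to expect different integer dtypes for different arguments.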

Guidance

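  • Record the original dtypes of all four index arguments. The patch log shows topk_indices arriving as torch.int64 and token_expert_indices as torch.int32; input_tokens and hash_indices_table are logged only after the patch's cast, so their original dtypes are not captured.
  • Check the implementation of torch.ops._moe_C.topk_softplus_sqrt (called from vllm/_custom_ops.py, line 2390, inside topk_hash_softplus_sqrt) for per-argument dtype checks. The two error messages suggest that at least one argument must be Long and at least one must be Int.
  • Do not cast every index tensor to the same dtype. With all four arguments at torch.int64 the op still fails, only with the expected and found types swapped.
  • Once the per-argument requirements are known, cast only the mismatched argument at the call site in fused_topk_bias_router.py (vllm_topk_softplus_sqrt, line 71) or in the op wrapper.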
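
One way to capture the original dtypes without altering them is a pass-through wrapper that snapshots the dtype of every argument before calling the wrapped function. The sketch below is duck-typed (anything with a .dtype attribute is recorded) so it runs without torch. audit_dtypes and fake_op are hypothetical names, and the stand-in arguments only mimic the dtypes reported in the patch log.

```python
import functools
import inspect
from types import SimpleNamespace


def audit_dtypes(fn, records):
    """Wrap fn so each call appends {arg name: dtype} to records, unchanged."""
    sig = inspect.signature(fn)

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        snapshot = {
            name: str(value.dtype)
            for name, value in bound.arguments.items()
            if hasattr(value, "dtype")
        }
        records.append(snapshot)
        # Arguments are forwarded untouched: this wrapper never casts.
        return fn(*args, **kwargs)

    return wrapper


# Stand-in for the real op, for demonstration only.
def fake_op(topk_indices, token_expert_indices, renormalize):
    return "ok"


records = []
audited = audit_dtypes(fake_op, records)
result = audited(
    SimpleNamespace(dtype="torch.int64"),   # mimics topk_indices
    SimpleNamespace(dtype="torch.int32"),   # mimics token_expert_indices
    renormalize=True,                       # no .dtype, so not recorded
)
print(records[0])
```

Applied to the real op, the analogous wiring would replace the attribute on vllm._custom_ops the same way the diagnostic patch does; whether that catches every call path is an assumption that the reporter's patch shares.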

Example

The issue does not include a confirmed code fix. The diagnostic monkey patch above is the closest thing to an example, and it changes the error rather than resolving it.

Notes

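The failure happens at startup, inside determine_available_memory, profile_run, and _dummy_run, so the engine never finishes initializing. The TP=8 command on the same node and image starts correctly. The report does not establish which argument triggers each error message: the unpatched error implicates a tensor that is Int where Long is expected, and the patched error implicates a tensor that is Long where Int is expected, but neither message names the argument. The patch also logs its dtype line only once per process, so later calls with different dtypes would go unrecorded.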

Recommendation

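Until the op's per-argument dtype requirements are identified, run the TP=8 configuration, which the reporter confirmed working on the same node. Treat the blanket cast to Long as ruled out, since it only swaps the error. A real fix probably needs to give each index tensor the dtype the kernel expects for that specific argument, and should be verified by confirming that the DP=8 + EP profile run completes.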
