vllm - [Bug]: vLLM 0.21: DeepSeek-V4-Pro crashes with tensor size mismatch & CUBLAS error during PP+TP inference on 2x8 H800

Error Message

[ERROR] Worker_PP1_TP4 (pid=44045, ip=172.21.6.7): Traceback (most recent call last):
[ERROR] Worker_PP0_TP4 (pid=458312): RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, std::is_same_v<C_Dtype, float> ? CUDA_R_32F : CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
Traceback (most recent call last):
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx(...)
[rank12] Process group watchdog terminated with exception: RECV timeout.
[rank4] Process group watchdog terminated with exception: BROADCAST timeout.
[raylet] Worker pid=44045 (ip=172.21.6.7) died, exit type: SYSTEM_ERROR, detail: connection error code 2 (end of file), possibly killed by OOM or SIGSEGV/SIGABRT.

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

- vLLM version: 0.21 (Docker image)  
- Model: DeepSeek-V4-Pro (`/models/DeepSeek-V4-Pro/`)  
- Hardware: 2 nodes, each with 8× NVIDIA H800 (80 GB)  
- Deployment: Ray cluster (multi-node) + multi-process executor  
- Parallelism: tensor-parallel-size=8, pipeline-parallel-size=2  
- Python: 3.12 (from logs)  
- CUDA / torch: not explicitly logged (error shows `cublasGemmEx`)

</details>

🐛 Describe the bug

[Warnings] Triton JIT compilation during inference (may indicate insufficient warmup):
05-18 08:23:00 (Worker_PP0_TP0) _compute_slot_mapping_kernel
05-18 08:23:00 (Worker_PP0_TP0) _build_prefill_chunk_metadata_kernel
05-18 08:23:00 (Worker_PP0_TP0) _compute_prefill_metadata_kernel
05-18 08:23:00 (Worker_PP0_TP0) _dequantize_and_gather_k_kernel
05-18 08:23:00 (Worker_PP0_TP0) _combine_topk_swa_indices_kernel
05-18 08:36:40 (Worker_PP0_TP0) _fused_inv_rope_fp8_quant_per_head
(Same warnings on other workers, e.g., Worker_PP1_TP0)

[INFO] 05-18 08:22:39 Graph capturing finished in 33 secs, took 1.22 GiB
[INFO] 05-18 08:22:40 CUDA graph pool memory: actual 1.22 GiB, estimated 3.15 GiB (157.8% difference)
[INFO] 05-18 08:22:40 Kernel JIT monitor activated
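To make the "insufficient warmup" hint concrete, the sketch below measures how long after the end of graph capture each Triton kernel was compiled. The timestamps and kernel names are copied from the log lines above; the helper names and the placeholder year are assumptions made only for this illustration, since the log does not print a year.

```python
from datetime import datetime

# Timestamps and kernel names copied from the warnings on Worker_PP0_TP0.
JIT_EVENTS = [
    ("05-18 08:23:00", "_compute_slot_mapping_kernel"),
    ("05-18 08:23:00", "_build_prefill_chunk_metadata_kernel"),
    ("05-18 08:23:00", "_compute_prefill_metadata_kernel"),
    ("05-18 08:23:00", "_dequantize_and_gather_k_kernel"),
    ("05-18 08:23:00", "_combine_topk_swa_indices_kernel"),
    ("05-18 08:36:40", "_fused_inv_rope_fp8_quant_per_head"),
]
GRAPH_CAPTURE_DONE = "05-18 08:22:39"  # "Graph capturing finished" line

PLACEHOLDER_YEAR = "2000"  # assumption: the log has no year, any fixed year works


def parse_log_time(stamp):
    """Parse an 'MM-DD HH:MM:SS' log timestamp into a datetime."""
    return datetime.strptime(f"{PLACEHOLDER_YEAR}-{stamp}", "%Y-%m-%d %H:%M:%S")


def seconds_after_capture(events, capture_done):
    """Map each kernel name to the delay (in seconds) after graph capture ended."""
    done = parse_log_time(capture_done)
    return {
        name: int((parse_log_time(stamp) - done).total_seconds())
        for stamp, name in events
    }


late_kernels = seconds_after_capture(JIT_EVENTS, GRAPH_CAPTURE_DONE)
for name, delay in late_kernels.items():
    print(f"{name}: compiled {delay}s after graph capture finished")
```

Every listed kernel has a positive delay, so each was compiled while the engine was already serving, which is what the warning points at.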

=== CRASH at 05-18 12:46:50 ===

[ERROR] Worker_PP1_TP4 (pid=44045, ip=172.21.6.7): RuntimeError: The size of tensor a (3360) must match the size of tensor b (4) at non-singleton dimension 0
Traceback (most recent call last):
  File ".../vllm/v1/executor/multiproc_executor.py", line 957, in worker_busy_loop
    output = func(*args, **kwargs)
  File ".../vllm/v1/worker/worker_base.py", line 337, in execute_model
    return self.worker.execute_model(scheduler_output)
  File ".../torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
  File ".../vllm/v1/worker/gpu_worker.py", line 843, in execute_model
    output = self.model_runner.execute_model(
  File ".../vllm/v1/worker/gpu_model_runner.py", line 4075, in execute_model
    ) = self._preprocess(
  File ".../vllm/v1/worker/gpu_model_runner.py", line 3375, in _preprocess
    intermediate_tensors = self.sync_and_gather_intermediate_tensors(
  File ".../vllm/v1/worker/gpu_model_runner.py", line 3154, in sync_and_gather_intermediate_tensors
    self.intermediate_tensors[k][:num_tokens].copy_(
RuntimeError: The size of tensor a (3360) must match the size of tensor b (4) at non-singleton dimension 0
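The first traceback ends in an in-place tensor copy whose operands disagree on dimension 0. As a rough, pure-Python sketch of the shape rule behind that message, the hypothetical helper `check_copy_shapes` below rejects a source shape that cannot be broadcast into the destination. The hidden size of 16 is a made-up value, and treating "tensor a" as the destination slice and "tensor b" as the incoming tensor is an assumption; the log does not say which operand is which.

```python
def check_copy_shapes(dst_shape, src_shape):
    """Raise ValueError if src_shape cannot be broadcast into dst_shape.

    Hypothetical helper for illustration only; it mimics the rule that each
    source dimension must equal the destination dimension or be 1.
    """
    if len(src_shape) > len(dst_shape):
        raise ValueError("source has more dimensions than destination")
    # Shapes are compared right-aligned, as in broadcasting.
    offset = len(dst_shape) - len(src_shape)
    for i, src_dim in enumerate(src_shape):
        dst_dim = dst_shape[offset + i]
        if src_dim != dst_dim and src_dim != 1:
            raise ValueError(
                f"The size of tensor a ({dst_dim}) must match the size of "
                f"tensor b ({src_dim}) at non-singleton dimension {offset + i}"
            )


HIDDEN = 16  # assumption: placeholder hidden size, not the model's real value

try:
    # Destination sliced to 3360 tokens, incoming tensor carrying 4 tokens.
    check_copy_shapes((3360, HIDDEN), (4, HIDDEN))
except ValueError as err:
    print(err)
```

Under that reading, the two sides of the copy would disagree about how many tokens the step contains (3360 versus 4), which may be worth checking against the scheduler output on both pipeline stages.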

[ERROR] Worker_PP0_TP4 (pid=458312): RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, std::is_same_v<C_Dtype, float> ? CUDA_R_32F : CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
Traceback (most recent call last):
  File ".../vllm/v1/executor/multiproc_executor.py", line 957, in worker_busy_loop
    output = func(*args, **kwargs)
  ...
  File ".../vllm/v1/worker/gpu_model_runner.py", line 4117, in execute_model
    model_output = self._model_forward(
  File ".../vllm/v1/worker/gpu_model_runner.py", line 3592, in _model_forward
    return self.model(
  File ".../vllm/compilation/cuda_graph.py", line 254, in __call__
    return self.runnable(*args, **kwargs)
  ...
  File ".../vllm/model_executor/models/deepseek_v4.py", line 1669, in forward
    hidden_states = self.model(
  ...
  File ".../vllm/model_executor/layers/deepseek_v4_attention.py", line 568, in deepseek_v4_attention
    self.attention_impl(hidden_states, positions, out)
  File ".../vllm/model_executor/layers/deepseek_v4_attention.py", line 426, in attention_impl
    self.attn_gemm_parallel_execute(hidden_states)
  File ".../vllm/model_executor/layers/deepseek_v4_attention.py", line 404, in attn_gemm_parallel_execute
    qr_kv, (kv_score, indexer_weights, indexer_kv_score) = execute_in_parallel(
  File ".../vllm/utils/multi_stream_utils.py", line 103, in execute_in_parallel
    aux_results = [fn() if fn is not None else None for fn in aux_fns]
  File ".../vllm/model_executor/layers/deepseek_v4_attention.py", line 390, in indexer_compressor_kv_score
    return torch.mm(
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx(...)

=== Subsequent NCCL Timeouts and Crash ===

05-18 12:46:52 Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 139.1 tokens/s, Running: 3 reqs, GPU KV cache usage: 11.8%, Prefix cache hit rate: 69.0%

[rank12] (Worker_PP1_TP4, pid=44045) NCCL timeout: RECV (SeqNum=478570, Timeout=600000ms) Stack: irecv -> irecv_tensor_dict -> gpu_worker.execute_model -> ... -> ray_executor_v2.run

[rank12] NCCL timeout: ALLGATHER (SeqNum=957141, Timeout=600000ms) Stack: all_gather_into_tensor -> vllm all_gather -> _postprocess -> wait_for_comm -> sync_and_gather_intermediate_tensors -> ...

[rank4] (Worker_PP0_TP4, pid=458312) NCCL timeout: BROADCAST (SeqNum=477803, Timeout=600080ms) Stack: broadcast -> _pp_receive_prev_sampled_token_ids_to_input_batch -> sample_tokens -> ...
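The rank numbers and worker names in these timeout lines line up if tensor-parallel ranks are laid out contiguously inside each pipeline stage. The sketch below shows that mapping for the reported TP=8, PP=2 setup. The layout is an assumption inferred from the two pairs visible in the log (rank12 with Worker_PP1_TP4, rank4 with Worker_PP0_TP4), and the function names are hypothetical.

```python
TP_SIZE = 8  # tensor-parallel-size from the report
PP_SIZE = 2  # pipeline-parallel-size from the report


def global_rank(pp_rank, tp_rank, tp_size=TP_SIZE):
    """Global rank under the assumed layout: TP ranks contiguous per PP stage."""
    return pp_rank * tp_size + tp_rank


def worker_name(rank, tp_size=TP_SIZE):
    """Worker label in the log's naming style for a given global rank."""
    pp_rank, tp_rank = divmod(rank, tp_size)
    return f"Worker_PP{pp_rank}_TP{tp_rank}"


for rank in (4, 12):
    print(f"rank{rank} -> {worker_name(rank)}")
```

With this mapping, the two failing workers share the same tensor-parallel index (TP4) on different pipeline stages, so rank 12 is the stage-1 peer that receives activations from rank 4.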

Multiple ranks observe the flight recorder dump signal, the process group watchdogs terminate with exceptions, and the processes receive SIGABRT and abort.

[rank12] Process group watchdog terminated with exception: RECV timeout.
[rank4] Process group watchdog terminated with exception: BROADCAST timeout.
[raylet] Worker pid=44045 (ip=172.21.6.7) died, exit type: SYSTEM_ERROR, detail: connection error code 2 (end of file), possibly killed by OOM or SIGSEGV/SIGABRT.

05-18 12:57:51 EngineCore: RayWorkerProc rank=[12] died unexpectedly, shutting down executor.
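The gap between the crash and the executor shutdown is close to the 600000 ms NCCL timeout shown in the log. The arithmetic below works that out from the two timestamps. It assumes the stalled collectives were issued at roughly the crash time, which the log does not confirm, and the helper names are hypothetical.

```python
CRASH_TIME = "12:46:50"     # "=== CRASH at 05-18 12:46:50 ==="
SHUTDOWN_TIME = "12:57:51"  # "RayWorkerProc rank=[12] died unexpectedly"
NCCL_TIMEOUT_MS = 600000    # Timeout value printed in the RECV/ALLGATHER lines


def to_seconds(hms):
    """Convert 'HH:MM:SS' into seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds):
    """Convert seconds since midnight back into 'HH:MM:SS'."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


timeout_s = NCCL_TIMEOUT_MS // 1000
total_gap_s = to_seconds(SHUTDOWN_TIME) - to_seconds(CRASH_TIME)
# Assumption: the timed-out collectives started at about the crash time.
earliest_watchdog = to_hms(to_seconds(CRASH_TIME) + timeout_s)
slack_s = total_gap_s - timeout_s

print(f"crash to shutdown: {total_gap_s}s")
print(f"earliest watchdog expiry: {earliest_watchdog}")
print(f"remaining slack after the timeout: {slack_s}s")
```

Under that assumption, the surviving ranks spent about ten minutes blocked in collectives before the watchdog aborted them, so the timeouts read as a consequence of the two worker errors rather than a separate fault.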

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
