[Bug]: vLLM 0.21: DeepSeek-V4-Pro crashes with tensor size mismatch & CUBLAS error during PP+TP inference on 2x8 H800

vllm-project/vllm#43080 (fetched 2026-05-20 03:39:59)
Error Message

  • CUDA / torch: not explicitly logged (error shows cublasGemmEx)
  • [ERROR] Worker_PP1_TP4 (pid=44045, ip=172.21.6.7): Traceback (most recent call last):
  • [ERROR] Worker_PP0_TP4 (pid=458312): RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, std::is_same_v<C_Dtype, float> ? CUDA_R_32F : CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
  • Traceback (most recent call last): RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx(...)
  • [rank12] Process group watchdog terminated with exception: RECV timeout.
  • [rank4] Process group watchdog terminated with exception: BROADCAST timeout.
  • [raylet] Worker pid=44045 (ip=172.21.6.7) died, exit type: SYSTEM_ERROR, detail: connection error code 2 (end of file), possibly killed by OOM or SIGSEGV/SIGABRT.


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
- vLLM version: 0.21 (Docker image)  
- Model: DeepSeek-V4-Pro (`/models/DeepSeek-V4-Pro/`)  
- Hardware: 2 nodes, each with 8× NVIDIA H800 (80 GB)  
- Deployment: Ray cluster (multi-node) + multi-process executor  
- Parallelism: tensor-parallel-size=8, pipeline-parallel-size=2  
- Python: 3.12 (from logs)  
- CUDA / torch: not explicitly logged (error shows `cublasGemmEx`)
</details>

🐛 Describe the bug

[Warnings] Triton JIT compilation during inference (may indicate insufficient warmup):

- 05-18 08:23:00 (Worker_PP0_TP0) _compute_slot_mapping_kernel
- 05-18 08:23:00 (Worker_PP0_TP0) _build_prefill_chunk_metadata_kernel
- 05-18 08:23:00 (Worker_PP0_TP0) _compute_prefill_metadata_kernel
- 05-18 08:23:00 (Worker_PP0_TP0) _dequantize_and_gather_k_kernel
- 05-18 08:23:00 (Worker_PP0_TP0) _combine_topk_swa_indices_kernel
- 05-18 08:36:40 (Worker_PP0_TP0) _fused_inv_rope_fp8_quant_per_head

(Same warnings on other workers, e.g., Worker_PP1_TP0)

- [INFO] 05-18 08:22:39 Graph capturing finished in 33 secs, took 1.22 GiB
- [INFO] 05-18 08:22:40 CUDA graph pool memory: actual 1.22 GiB, estimated 3.15 GiB (157.8% difference)
- [INFO] 05-18 08:22:40 Kernel JIT monitor activated
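The log does not say how the "157.8% difference" figure is derived. One plausible reading, assumed here and not taken from vLLM's source, is the gap between the estimated and measured pool size relative to the measured size. A minimal sketch with the rounded values from the log (`relative_difference_pct` is a hypothetical helper name):

```python
def relative_difference_pct(actual_gib: float, estimated_gib: float) -> float:
    """Gap between estimate and measurement, as a percentage of the measurement.

    Assumption: this is only one way the logged percentage could be computed.
    """
    return abs(estimated_gib - actual_gib) / actual_gib * 100.0


# Rounded values copied from the INFO line.
pct = relative_difference_pct(actual_gib=1.22, estimated_gib=3.15)
print(f"{pct:.1f}%")
```

With these inputs the result is roughly 158%, close to the logged 157.8%. The small gap would be consistent with the log rounding the GiB values before printing. Either way, the estimate is more than 2.5 times the measured pool size.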

=== CRASH at 05-18 12:46:50 ===

[ERROR] Worker_PP1_TP4 (pid=44045, ip=172.21.6.7): RuntimeError: The size of tensor a (3360) must match the size of tensor b (4) at non-singleton dimension 0

    Traceback (most recent call last):
      File ".../vllm/v1/executor/multiproc_executor.py", line 957, in worker_busy_loop
        output = func(*args, **kwargs)
      File ".../vllm/v1/worker/worker_base.py", line 337, in execute_model
        return self.worker.execute_model(scheduler_output)
      File ".../torch/utils/_contextlib.py", line 124, in decorate_context
        return func(*args, **kwargs)
      File ".../vllm/v1/worker/gpu_worker.py", line 843, in execute_model
        output = self.model_runner.execute_model(
      File ".../vllm/v1/worker/gpu_model_runner.py", line 4075, in execute_model
        ) = self._preprocess(
      File ".../vllm/v1/worker/gpu_model_runner.py", line 3375, in _preprocess
        intermediate_tensors = self.sync_and_gather_intermediate_tensors(
      File ".../vllm/v1/worker/gpu_model_runner.py", line 3154, in sync_and_gather_intermediate_tensors
        self.intermediate_tensors[k][:num_tokens].copy_(
    RuntimeError: The size of tensor a (3360) must match the size of tensor b (4) at non-singleton dimension 0
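The phrase "non-singleton dimension 0" refers to the broadcasting rule that an in-place copy applies: aligned from the trailing dimension, each source size must either equal the destination size or be 1. The sketch below reproduces that check in plain Python so the failure can be reasoned about without a GPU. It assumes the destination slice is the 3360-row side and the incoming tensor is the 4-row side, which the log does not state. `first_copy_mismatch` and the hidden size `7168` are hypothetical and only for illustration.

```python
def first_copy_mismatch(dst_shape, src_shape):
    """Return (dim, dst_size, src_size) for the first dimension, scanning from
    the trailing one, where src cannot be broadcast into dst; None if legal."""
    if len(src_shape) > len(dst_shape):
        raise ValueError("source has more dimensions than destination")
    offset = len(dst_shape) - len(src_shape)
    for dim in range(len(dst_shape) - 1, offset - 1, -1):
        dst_size = dst_shape[dim]
        src_size = src_shape[dim - offset]
        # Equal sizes copy element-wise; a size of 1 is broadcast.
        if src_size != dst_size and src_size != 1:
            return (dim, dst_size, src_size)
    return None


HIDDEN = 7168  # hypothetical hidden size, not taken from the report

# Destination expects 3360 tokens, source carries 4: fails at dimension 0.
print(first_copy_mismatch((3360, HIDDEN), (4, HIDDEN)))
# A single-row source would have been broadcast instead of failing.
print(first_copy_mismatch((3360, HIDDEN), (1, HIDDEN)))
```

Under that assumption, the two pipeline stages disagreed about the token count for the same step (3360 versus 4), which would point at a scheduling or synchronization mismatch between stages instead of a bad tensor value.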

[ERROR] Worker_PP0_TP4 (pid=458312): RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, std::is_same_v<C_Dtype, float> ? CUDA_R_32F : CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)

    Traceback (most recent call last):
      File ".../vllm/v1/executor/multiproc_executor.py", line 957, in worker_busy_loop
        output = func(*args, **kwargs)
      ...
      File ".../vllm/v1/worker/gpu_model_runner.py", line 4117, in execute_model
        model_output = self._model_forward(
      File ".../vllm/v1/worker/gpu_model_runner.py", line 3592, in _model_forward
        return self.model(
      File ".../vllm/compilation/cuda_graph.py", line 254, in __call__
        return self.runnable(*args, **kwargs)
      ...
      File ".../vllm/model_executor/models/deepseek_v4.py", line 1669, in forward
        hidden_states = self.model(
      ...
      File ".../vllm/model_executor/layers/deepseek_v4_attention.py", line 568, in deepseek_v4_attention
        self.attention_impl(hidden_states, positions, out)
      File ".../vllm/model_executor/layers/deepseek_v4_attention.py", line 426, in attention_impl
        self.attn_gemm_parallel_execute(hidden_states)
      File ".../vllm/model_executor/layers/deepseek_v4_attention.py", line 404, in attn_gemm_parallel_execute
        qr_kv, (kv_score, indexer_weights, indexer_kv_score) = execute_in_parallel(
      File ".../vllm/utils/multi_stream_utils.py", line 103, in execute_in_parallel
        aux_results = [fn() if fn is not None else None for fn in aux_fns]
      File ".../vllm/model_executor/layers/deepseek_v4_attention.py", line 390, in indexer_compressor_kv_score
        return torch.mm(
    RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx(...)

=== Subsequent NCCL Timeouts and Crash ===

05-18 12:46:52 Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 139.1 tokens/s, Running: 3 reqs, GPU KV cache usage: 11.8%, Prefix cache hit rate: 69.0%

[rank12] (Worker_PP1_TP4, pid=44045) NCCL timeout: RECV (SeqNum=478570, Timeout=600000ms) Stack: irecv -> irecv_tensor_dict -> gpu_worker.execute_model -> ... -> ray_executor_v2.run

[rank12] NCCL timeout: ALLGATHER (SeqNum=957141, Timeout=600000ms) Stack: all_gather_into_tensor -> vllm all_gather -> _postprocess -> wait_for_comm -> sync_and_gather_intermediate_tensors -> ...

[rank4] (Worker_PP0_TP4, pid=458312) NCCL timeout: BROADCAST (SeqNum=477803, Timeout=600080ms) Stack: broadcast -> _pp_receive_prev_sampled_token_ids_to_input_batch -> sample_tokens -> ...

Multiple ranks observe the flight recorder dump signal, the process-group watchdogs terminate with exceptions, and the processes receive SIGABRT and abort.

- [rank12] Process group watchdog terminated with exception: RECV timeout.
- [rank4] Process group watchdog terminated with exception: BROADCAST timeout.
- [raylet] Worker pid=44045 (ip=172.21.6.7) died, exit type: SYSTEM_ERROR, detail: connection error code 2 (end of file), possibly killed by OOM or SIGSEGV/SIGABRT.
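The logs pair `Worker_PP1_TP4` with `rank12` and `Worker_PP0_TP4` with `rank4`. Both pairings are consistent with a layout of `rank = pp_rank * tp_size + tp_rank` at tensor-parallel size 8. That layout is inferred from these two data points, not confirmed from vLLM's source, and the helper names below are hypothetical. A small sketch for translating between the two naming schemes when reading mixed worker and NCCL logs:

```python
import re

TP_SIZE = 8  # tensor-parallel-size from the report

WORKER_RE = re.compile(r"Worker_PP(\d+)_TP(\d+)")


def global_rank(worker: str, tp_size: int = TP_SIZE) -> int:
    """Map a worker label such as 'Worker_PP1_TP4' to a global rank.

    Assumption: ranks are laid out as pp_rank * tp_size + tp_rank.
    """
    match = WORKER_RE.fullmatch(worker)
    if match is None:
        raise ValueError(f"unrecognized worker label: {worker!r}")
    pp_rank, tp_rank = int(match.group(1)), int(match.group(2))
    return pp_rank * tp_size + tp_rank


def worker_label(rank: int, tp_size: int = TP_SIZE) -> str:
    """Inverse mapping under the same assumed layout."""
    pp_rank, tp_rank = divmod(rank, tp_size)
    return f"Worker_PP{pp_rank}_TP{tp_rank}"


print(global_rank("Worker_PP1_TP4"))  # 12, matches [rank12] in the log
print(global_rank("Worker_PP0_TP4"))  # 4, matches [rank4] in the log
```

Under this mapping, the two failing workers share tensor-parallel index 4 and sit on opposite pipeline stages, so they are direct send/receive peers across the stage boundary.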

05-18 12:57:51 EngineCore: RayWorkerProc rank=[12] died unexpectedly, shutting down executor.
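The timestamps above can be lined up to separate the primary failure from its aftermath. The sketch below computes how long the server ran after graph capture before the first error, and how long it took from the first error to the executor shutdown, then compares the latter with the 600000 ms NCCL timeout from the log. The year is an assumption (the log prints none), and the variable names are illustrative.

```python
from datetime import datetime, timedelta

ASSUMED_YEAR = 2026  # assumption: log lines carry no year; taken from the fetch date


def parse_log_time(stamp: str) -> datetime:
    """Parse an 'MM-DD HH:MM:SS' log timestamp using the assumed year."""
    return datetime.strptime(f"{ASSUMED_YEAR}-{stamp}", "%Y-%m-%d %H:%M:%S")


graph_capture_done = parse_log_time("05-18 08:22:39")
first_error = parse_log_time("05-18 12:46:50")
executor_shutdown = parse_log_time("05-18 12:57:51")
nccl_timeout = timedelta(milliseconds=600000)

uptime_before_crash = first_error - graph_capture_done
error_to_shutdown = executor_shutdown - first_error

print("uptime before first error:", uptime_before_crash)   # 4:24:11
print("first error to shutdown:  ", error_to_shutdown)     # 0:11:01
print("NCCL timeout:             ", nccl_timeout)          # 0:10:00
```

The shutdown follows the first error by about one minute more than the NCCL timeout, which fits the reading that the RECV, ALLGATHER, and BROADCAST timeouts are a consequence of the two worker errors at 12:46:50 and not independent failures.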

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
