vllm - ✅(Solved) Fix [Installation]: Blackwell SM120 + CUDA 13 pip install: 5 sequential failures before Qwen3.5 27B+ runs [1 pull requests, 1 comments, 1 participants]

vllm · 2026-03-20 20:59:44

GitHub stats

vllm-project/vllm#37714•Fetched 2026-04-08 01:08:37

Author

Gavin-Qiao

Participants

Gavin-Qiao

Timeline (top)

subscribed ×3, cross-referenced ×2, commented ×1, labeled ×1

Error Message

ImportError: libcudart.so.12: cannot open shared object file: No such file or directory

Root Cause

Root cause: vLLM's C extensions (_C.abi3.so, _moe_C.abi3.so, _vllm_fa2_C.abi3.so, _vllm_fa3_C.abi3.so, _flashmla_C.abi3.so, _flashmla_extension_C.abi3.so) are compiled against CUDA 12 and link libcudart.so.12. CUDA 13 systems only ship libcudart.so.13.

Fix Action

Fix / Workaround

Commonly suggested workaround: LD_LIBRARY_PATH. This is fragile — it doesn't survive process forks (vLLM's multiprocess EngineCore architecture), systemd services, background tasks, or Docker entrypoints.

What actually worked: Installing nvidia-cuda-runtime-cu12 (provides libcudart.so.12) and then running patchelf --set-rpath on all 8 affected .so files to bake in the path. No env vars needed.
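A minimal sketch of the patchelf step, assuming the pip wheel places libcudart.so.12 under site-packages/nvidia/cuda_runtime/lib and that the extension modules sit directly in the vllm package directory. Both paths and the helper name patchelf_commands are assumptions for illustration; the file names are the six listed in the root cause. The sketch only builds the commands so they can be reviewed before anything is modified.

```python
from pathlib import Path

# Extension modules named in the root-cause list.
AFFECTED = [
    "_C.abi3.so",
    "_moe_C.abi3.so",
    "_vllm_fa2_C.abi3.so",
    "_vllm_fa3_C.abi3.so",
    "_flashmla_C.abi3.so",
    "_flashmla_extension_C.abi3.so",
]


def patchelf_commands(site_packages: str) -> list[list[str]]:
    """Build one `patchelf --set-rpath` command per affected extension."""
    root = Path(site_packages)
    # Assumed location of libcudart.so.12 from nvidia-cuda-runtime-cu12.
    cudart_dir = root / "nvidia" / "cuda_runtime" / "lib"
    return [
        ["patchelf", "--set-rpath", str(cudart_dir), str(root / "vllm" / name)]
        for name in AFFECTED
    ]


if __name__ == "__main__":
    import sysconfig

    for cmd in patchelf_commands(sysconfig.get_paths()["platlib"]):
        print(" ".join(cmd))  # review, then run each command in a shell
```

Note that --set-rpath replaces any RPATH already present on the file, so it is worth checking the existing value with patchelf --print-rpath first.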

Users may set --attention-backend FLASH_ATTN as a workaround for FlashInfer crashes (#36828), not knowing it's incompatible with --kv-cache-dtype fp8.

PR fix notes

PR #37757: [UX] Logging - Improve Startup Error Logs

Repository: vllm-project/vllm
Author: Waknis
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/37757

Description (problem / solution / changelog)

Fixes #31683. Also addresses #37714.

Summary

This PR scopes #31683 to startup failures only, matching the guidance in the issue thread.

propagate structured startup failures from the worker ready pipe through the engine startup handshake
preserve the innermost child exception type, message, source, and traceback in the surfaced startup error
improve failed process summaries during startup and add focused regression coverage for the worker and engine startup paths

Why this is not duplicating an existing PR

gh pr list --repo vllm-project/vllm --state open --search "31683 in:body" returned no matching open PRs
gh pr list --repo vllm-project/vllm --state open --search "startup error propagation multiproc executor engine core" returned no matching open PRs
I also left a courtesy comment on #31683 because the issue is currently assigned

Testing

python -m pytest --noconftest tests/v1/executor/test_startup_error_reporting.py tests/v1/engine/test_startup_error_reporting.py -v -s
- Passed (5 passed)
pre-commit run ruff-check --files vllm/v1/engine/utils.py vllm/v1/engine/core.py vllm/v1/executor/multiproc_executor.py tests/v1/engine/test_engine_core_client.py tests/v1/executor/test_startup_error_reporting.py tests/v1/engine/test_startup_error_reporting.py
- Passed
pre-commit run ruff-format --files vllm/v1/engine/utils.py vllm/v1/engine/core.py vllm/v1/executor/multiproc_executor.py tests/v1/engine/test_engine_core_client.py tests/v1/executor/test_startup_error_reporting.py tests/v1/engine/test_startup_error_reporting.py
- Passed
pre-commit run typos --files vllm/v1/engine/utils.py vllm/v1/engine/core.py vllm/v1/executor/multiproc_executor.py tests/v1/engine/test_engine_core_client.py tests/v1/executor/test_startup_error_reporting.py tests/v1/engine/test_startup_error_reporting.py
- Passed
pre-commit run mypy-local --files vllm/v1/engine/utils.py vllm/v1/engine/core.py vllm/v1/executor/multiproc_executor.py tests/v1/engine/test_engine_core_client.py tests/v1/executor/test_startup_error_reporting.py tests/v1/engine/test_startup_error_reporting.py
- Passed
pre-commit run check-spdx-header --files vllm/v1/engine/utils.py vllm/v1/engine/core.py vllm/v1/executor/multiproc_executor.py tests/v1/engine/test_engine_core_client.py tests/v1/executor/test_startup_error_reporting.py tests/v1/engine/test_startup_error_reporting.py
- Passed
pre-commit run check-forbidden-imports --files vllm/v1/engine/utils.py vllm/v1/engine/core.py vllm/v1/executor/multiproc_executor.py tests/v1/engine/test_engine_core_client.py tests/v1/executor/test_startup_error_reporting.py tests/v1/engine/test_startup_error_reporting.py
- Passed
pre-commit run check-torch-cuda-call --files vllm/v1/engine/utils.py vllm/v1/engine/core.py vllm/v1/executor/multiproc_executor.py tests/v1/engine/test_engine_core_client.py tests/v1/executor/test_startup_error_reporting.py tests/v1/engine/test_startup_error_reporting.py
- Passed
pre-commit run check-boolean-context-manager --files vllm/v1/engine/utils.py vllm/v1/engine/core.py vllm/v1/executor/multiproc_executor.py tests/v1/engine/test_engine_core_client.py tests/v1/executor/test_startup_error_reporting.py tests/v1/engine/test_startup_error_reporting.py
- Passed
pre-commit run check-root-lazy-imports --files vllm/v1/engine/utils.py vllm/v1/engine/core.py vllm/v1/executor/multiproc_executor.py tests/v1/engine/test_engine_core_client.py tests/v1/executor/test_startup_error_reporting.py tests/v1/engine/test_startup_error_reporting.py
- Passed
python -m pytest --noconftest tests/v1/engine/test_engine_core_client.py -v -s -k startup_failure
- Not runnable here because this Windows non-CUDA environment skips the CUDA-only module before the startup test can execute
pre-commit run --files ...
- Not fully runnable in this Windows environment because several repo hooks require bash

Disclosure

Prepared with AI assistance; all changes and test results were reviewed before submission.

Changed files

tests/v1/engine/test_engine_core_client.py (modified, +41/-14)
tests/v1/engine/test_startup_error_reporting.py (added, +324/-0)
tests/v1/executor/test_startup_error_reporting.py (added, +110/-0)
vllm/v1/engine/core.py (modified, +56/-31)
vllm/v1/engine/utils.py (modified, +263/-23)
vllm/v1/executor/multiproc_executor.py (modified, +97/-21)

Your current environment

Collecting environment information...
uv is set

System Info
OS: Ubuntu 24.04.4 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.39

PyTorch Info
PyTorch version: 2.10.0+cu130
Is debug build: False
CUDA used to build PyTorch: 13.0
ROCM used to build PyTorch: N/A

Python Environment
Python version: 3.12.3 (main, Mar 3 2026, 12:15:18) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.39

CUDA / GPU Info
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to:
GPU models and configuration: GPU 0: NVIDIA RTX PRO 5000 Blackwell
Nvidia driver version: 595.79
cuDNN version: Could not collect

CPU Info
Architecture: x86_64
Model name: Intel(R) Core(TM) Ultra 9 285K
CPU(s): 24
Hypervisor vendor: Microsoft

Versions of relevant libraries
[pip3] flashinfer-python==0.6.4
[pip3] numpy==2.2.6
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cuda-runtime-cu12==12.9.79
[pip3] torch==2.10.0+cu130
[pip3] torchvision==0.25.0+cu130
[pip3] transformers==4.57.6
[pip3] triton==3.6.0

vLLM Info
vLLM Version: 0.17.1
vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled

How you are installing vllm

uv pip install vllm --torch-backend=auto

Summary

Setting up vLLM 0.17.1 from PyPI on a Blackwell GPU (RTX PRO 5000, SM120) with CUDA 13.0 on WSL2 requires solving 5 undocumented problems before any Qwen3.5 27B+ model will run. Each failure produces a cryptic error with no actionable guidance. This issue documents the full chain and proposes fixes.

Problem 1: `libcudart.so.12` not found

Error:

ImportError: libcudart.so.12: cannot open shared object file: No such file or directory

Proposed fix: Either:

(a) Add nvidia-cuda-runtime-cu12 as a dependency and set RPATH in vLLM's build system (CMakeLists.txt)
(b) Publish versioned wheels (cu12, cu130) on PyPI like PyTorch does, with clear install instructions
(c) At minimum, detect the mismatch at import time and print an actionable error message
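
Option (c) could look roughly like the sketch below. It is not vLLM code: the function name cudart_hint and the wording of the message are assumptions. It probes the runtime library by soname with ctypes, which uses the process-wide library search path and therefore does not see an RPATH baked into the extension modules themselves.

```python
import ctypes
from typing import Optional


def cudart_hint(soname: str = "libcudart.so.12") -> Optional[str]:
    """Return None if `soname` can be loaded, else an actionable message."""
    try:
        ctypes.CDLL(soname)  # dlopen by soname via the default search path
    except OSError as exc:
        return (
            f"{soname} could not be loaded ({exc}). "
            "The installed extensions appear to need the CUDA 12 runtime; "
            "try `pip install nvidia-cuda-runtime-cu12` or install a wheel "
            "built for your CUDA version."
        )
    return None


if __name__ == "__main__":
    print(cudart_hint() or "libcudart.so.12 is loadable")
```

Running such a check before importing the compiled extensions would turn the bare ImportError into a message that names a next step.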

Related: #30435, #31018, #28669, #35432

Problem 2: FlashInfer JIT requires undocumented system dependencies

FlashInfer JIT-compiles CUDA kernels at runtime for GDN/Mamba attention patterns (used by Qwen3.5 27B+). This requires four system packages that aren't documented anywhere:

| Missing dependency | Error message | Fix |
|---|---|---|
| `gcc` | `RuntimeError: Failed to find C compiler` | `apt install gcc` |
| `python3.12-dev` | gcc compilation fails (missing `Python.h`) | `apt install python3.12-dev` |
| CUDA toolkit (`nvcc`) | `RuntimeError: Could not find nvcc and default cuda_home='/usr/local/cuda' doesn't exist` | `apt install cuda-toolkit` |
| `ninja` | `FileNotFoundError: [Errno 2] No such file or directory: 'ninja'` | `apt install ninja-build` |

Each failure is discovered sequentially — you fix one, hit the next. The errors come from deep inside FlashInfer's JIT layer, not vLLM, making them hard to trace.

Proposed fix:

Document these as system requirements in the installation guide, at least for Blackwell/GDN models
Detect missing dependencies at startup and print a single actionable message: "FlashInfer JIT compilation requires: gcc, python3-dev, nvcc (cuda-toolkit), ninja-build"
Consider pre-compiling FlashInfer kernels for common architectures (SM120, SM100) and shipping them in the wheel
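
The single-message startup check proposed above might be sketched as follows. The function name missing_jit_deps is hypothetical, and the lookup is deliberately simple: it only searches PATH for gcc, nvcc and ninja and looks for Python.h in the interpreter's include directory, so an nvcc that lives outside PATH would be reported as missing.

```python
import os
import shutil
import sysconfig
from typing import Callable, List, Optional


def missing_jit_deps(
    which: Callable[[str], Optional[str]] = shutil.which,
    include_dir: Optional[str] = None,
) -> List[str]:
    """Return the package names from the table whose tool is not found."""
    missing = []
    if which("gcc") is None:
        missing.append("gcc")
    include_dir = include_dir or sysconfig.get_paths()["include"]
    if not os.path.exists(os.path.join(include_dir, "Python.h")):
        missing.append("python3-dev")
    if which("nvcc") is None:
        missing.append("cuda-toolkit")
    if which("ninja") is None:
        missing.append("ninja-build")
    return missing


if __name__ == "__main__":
    gaps = missing_jit_deps()
    if gaps:
        print("FlashInfer JIT compilation requires: " + ", ".join(gaps))
```

The `which` and `include_dir` parameters exist only so the check can be exercised without changing the machine it runs on.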

Related: #21960, #32826

Problem 3: Qwen3.5 27B+ GDN models silently require `--max-num-batched-tokens 2096`

Error:

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Actual root cause (buried in EngineCore subprocess log):

AssertionError: In Mamba cache align mode, block_size (X) must be <= max_num_batched_tokens

All Qwen3.5 models with GDN (Gated DeltaNet) layers — including the 27B dense model (not just MoE variants) — require --max-num-batched-tokens 2096 due to mamba/GDN cache alignment constraints. vLLM's default of 8192 violates this.

The 27B has 64 layers: 16 groups × (3 GDN + 1 Attention). vLLM treats GDN like Mamba for cache alignment, setting attention block size to 1568 tokens to ensure that attention page size is >= mamba page size.
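
The layer arithmetic and the quoted condition can be checked with a small sketch. The function name check_cache_align is hypothetical and only restates the assertion message from the subprocess log; it is not vLLM's implementation.

```python
GROUPS = 16
GDN_PER_GROUP = 3
ATTN_PER_GROUP = 1

gdn_layers = GROUPS * GDN_PER_GROUP      # 48
attn_layers = GROUPS * ATTN_PER_GROUP    # 16
total_layers = gdn_layers + attn_layers  # 64


def check_cache_align(block_size: int, max_num_batched_tokens: int) -> None:
    """Restate the condition quoted in the AssertionError."""
    if block_size > max_num_batched_tokens:
        raise AssertionError(
            f"In Mamba cache align mode, block_size ({block_size}) "
            "must be <= max_num_batched_tokens"
        )


# The block size and flag value reported in this issue satisfy the condition.
check_cache_align(block_size=1568, max_num_batched_tokens=2096)
```

Any max_num_batched_tokens value below the block size, for example an illustrative 1024, would raise the same AssertionError.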

Proposed fix:

Auto-detect GDN models and cap max_num_batched_tokens at the mamba page size
At minimum, surface the EngineCore subprocess error message to the user (see Problem 5)
Document this in the Qwen3.5 recipe

Related: #36010, #35502

Problem 4: `--attention-backend FLASH_ATTN` + `--kv-cache-dtype fp8` = silent crash

Error (again, buried in EngineCore subprocess):

ValueError: Selected backend AttentionBackendEnum.FLASH_ATTN is not valid for this configuration.
Reason: ['kv_cache_dtype not supported']

Users may set --attention-backend FLASH_ATTN as a workaround for FlashInfer crashes (#36828), not knowing it's incompatible with --kv-cache-dtype fp8.

Proposed fix:

Validate flag combinations before spawning EngineCore subprocesses
Raise a clear error: "FLASH_ATTN does not support kv-cache-dtype=fp8. Use the default FlashInfer backend or remove --kv-cache-dtype fp8."

Related: #12543, #35577, PR #14221

Problem 5: "Engine core initialization failed" provides no useful information

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

This error appears for every problem above (and many others — see #17618, #18730, #19002, #21882, #23176, #26898, #33245). The actual root cause is always in the EngineCore subprocess log, which users must manually find at /tmp/vllm-*.log or by adding --log-level DEBUG.

I understand a logging redesign is tracked at #31683. In the meantime, a minimal fix would be:

Capture the EngineCore subprocess's last exception and include it in the parent's error message
Change Failed core proc(s): {} to actually list the subprocess PIDs and their exit codes/signals
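
The two points above can be illustrated with a generic parent/child sketch. It does not use vLLM's process machinery: start_child is a hypothetical helper that launches a plain Python child with subprocess, reads its stderr, and raises a parent-side error that names the PID, the exit code, and the last traceback line.

```python
import subprocess
import sys


def start_child(code: str) -> None:
    """Run `code` in a child interpreter and surface its failure in the parent."""
    proc = subprocess.Popen(
        [sys.executable, "-c", code],
        stderr=subprocess.PIPE,
        text=True,
    )
    _, stderr = proc.communicate()
    if proc.returncode != 0:
        lines = stderr.strip().splitlines()
        # The last traceback line carries the exception type and message.
        root_cause = lines[-1] if lines else "<no stderr output>"
        summary = f"pid {proc.pid} exited with code {proc.returncode}"
        raise RuntimeError(
            "Engine core initialization failed. "
            f"Failed core proc(s): {summary}. Root cause: {root_cause}"
        )


if __name__ == "__main__":
    try:
        start_child("raise ValueError('kv_cache_dtype not supported')")
    except RuntimeError as err:
        print(err)
```

A real implementation would more likely pass a structured error over the existing startup pipe than parse stderr, but the user-facing result is the same: the root cause appears in the one message the user actually sees.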

Reproduction

# On any Blackwell GPU with CUDA 13.0 driver, WSL2 or native Linux
pip install vllm  # or: uv pip install vllm

# Problem 1: immediate crash
python -c "import vllm"
# ImportError: libcudart.so.12: cannot open shared object file

# After fixing libcudart (patchelf or LD_LIBRARY_PATH):
# Problem 2: crash on GDN model load (gcc/nvcc/ninja missing)
vllm serve Qwen/Qwen3.5-27B-FP8

# After installing system deps:
# Problem 3: crash with default --max-num-batched-tokens
vllm serve Qwen/Qwen3.5-27B-FP8 --kv-cache-dtype fp8
# Engine core initialization failed

# Problem 4: crash with explicit FLASH_ATTN + fp8
vllm serve Qwen/Qwen3.5-27B-FP8 --attention-backend FLASH_ATTN --kv-cache-dtype fp8
# Engine core initialization failed

# What actually works (after all fixes):
vllm serve Qwen/Qwen3.5-27B-FP8 --max-num-batched-tokens 2096 --kv-cache-dtype fp8 --dtype bfloat16

Suggested priority

Problems 1 and 5 are the highest impact — they affect every Blackwell user installing from PyPI, and the error messages give no path to resolution. Problems 2-4 compound the frustration but are solvable once you know what to look for.

extent analysis

Fix Plan

To address the issues outlined, the following steps can be taken:

Problem 1: `libcudart.so.12` not found

Install nvidia-cuda-runtime-cu12 to provide libcudart.so.12.
Run patchelf --set-rpath on affected .so files to set the path.

Example:

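pip install nvidia-cuda-runtime-cu12
SITE=$(python -c "import sysconfig; print(sysconfig.get_paths()['platlib'])")
# Assumed wheel layout: libcudart.so.12 under $SITE/nvidia/cuda_runtime/lib; repeat for each affected .so
patchelf --set-rpath "$SITE/nvidia/cuda_runtime/lib" "$SITE/vllm/_C.abi3.so"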

Problem 2: FlashInfer JIT requires undocumented system dependencies

Install required system packages: gcc, python3.12-dev, cuda-toolkit, and ninja-build.
Document these as system requirements in the installation guide.

Example:

apt install gcc python3.12-dev cuda-toolkit ninja-build

Problem 3: Qwen3.5 27B+ GDN models silently require `--max-num-batched-tokens 2096`

Auto-detect GDN models and cap max_num_batched_tokens at the mamba page size.
Surface the EngineCore subprocess error message to the user.

Example (in Python):

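from vllm import LLM

MODEL = "Qwen/Qwen3.5-27B-FP8"

def is_gdn_model(model: str) -> bool:
    # Hypothetical check for illustration only; vLLM exposes no such helper.
    return "Qwen3.5" in model

# Pass the value the issue reports as working when the model has GDN layers.
extra = {"max_num_batched_tokens": 2096} if is_gdn_model(MODEL) else {}
llm = LLM(model=MODEL, **extra)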

Problem 4: `--attention-backend FLASH_ATTN` + `--kv-cache-dtype fp8` = silent crash

Validate flag combinations before spawning EngineCore subprocesses.
Raise a clear error for incompatible combinations.

Example (in Python):

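def validate_flags(attention_backend, kv_cache_dtype):
    # Sketch of the proposed pre-spawn check; not vLLM's actual validation code.
    if attention_backend == "FLASH_ATTN" and kv_cache_dtype == "fp8":
        raise ValueError(
            "FLASH_ATTN does not support kv-cache-dtype=fp8. "
            "Use the default FlashInfer backend or remove --kv-cache-dtype fp8."
        )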

Problem 5: "Engine core initialization failed" provides no useful information

Capture the EngineCore subprocess's last exception and include it in the parent's error message.
Change Failed core proc(s): {} to list subprocess PIDs and their exit codes/signals.

Example (in Python):

from vllm import LLM

try:
    # Engine startup spawns the EngineCore subprocess.
    llm = LLM(model="Qwen/Qwen3.5-27B-FP8")
except RuntimeError as e:
    # With the proposed change, e would carry the child's exception, PID, and exit code.
    print(f"Startup failed: {e}")

Verification

To verify the fixes, run the following commands:

vllm serve Qwen/Qwen3.5-
