vllm - ✅ (Solved) [Installation]: Blackwell SM120 + CUDA 13 pip install: 5 sequential failures before Qwen3.5 27B+ runs [1 pull request, 1 comment, 1 participant]

GitHub stats
vllm-project/vllm#37714 · Fetched 2026-04-08 01:08:37
Comments: 1 · Participants: 1 · Timeline events: 7 · Reactions: 0
Timeline (top): subscribed ×3, cross-referenced ×2, commented ×1, labeled ×1

Error Message

ImportError: libcudart.so.12: cannot open shared object file: No such file or directory

Root Cause

Root cause: vLLM's C extensions (_C.abi3.so, _moe_C.abi3.so, _vllm_fa2_C.abi3.so, _vllm_fa3_C.abi3.so, _flashmla_C.abi3.so, _flashmla_extension_C.abi3.so) are compiled against CUDA 12 and link libcudart.so.12. CUDA 13 systems only ship libcudart.so.13.

Fix Action

Fix / Workaround

Commonly suggested workaround: LD_LIBRARY_PATH. This is fragile — it doesn't survive process forks (vLLM's multiprocess EngineCore architecture), systemd services, background tasks, or Docker entrypoints.

What actually worked: Installing nvidia-cuda-runtime-cu12 (provides libcudart.so.12) and then running patchelf --set-rpath on all 8 affected .so files to bake in the path. No env vars needed.

Users may set --attention-backend FLASH_ATTN as a workaround for FlashInfer crashes (#36828), not knowing it's incompatible with --kv-cache-dtype fp8.

PR fix notes

PR #37757: [UX] Logging - Improve Startup Error Logs

Description (problem / solution / changelog)

Fixes #31683. Also addresses #37714.

Summary

This PR scopes #31683 to startup failures only, matching the guidance in the issue thread.

  • propagate structured startup failures from the worker ready pipe through the engine startup handshake
  • preserve the innermost child exception type, message, source, and traceback in the surfaced startup error
  • improve failed process summaries during startup and add focused regression coverage for the worker and engine startup paths

Why this is not duplicating an existing PR

  • gh pr list --repo vllm-project/vllm --state open --search "31683 in:body" returned no matching open PRs
  • gh pr list --repo vllm-project/vllm --state open --search "startup error propagation multiproc executor engine core" returned no matching open PRs
  • I also left a courtesy comment on #31683 because the issue is currently assigned

Testing

  • python -m pytest --noconftest tests/v1/executor/test_startup_error_reporting.py tests/v1/engine/test_startup_error_reporting.py -v -s
    • Passed (5 passed)
  • pre-commit run ruff-check --files vllm/v1/engine/utils.py vllm/v1/engine/core.py vllm/v1/executor/multiproc_executor.py tests/v1/engine/test_engine_core_client.py tests/v1/executor/test_startup_error_reporting.py tests/v1/engine/test_startup_error_reporting.py
    • Passed
  • pre-commit run ruff-format --files vllm/v1/engine/utils.py vllm/v1/engine/core.py vllm/v1/executor/multiproc_executor.py tests/v1/engine/test_engine_core_client.py tests/v1/executor/test_startup_error_reporting.py tests/v1/engine/test_startup_error_reporting.py
    • Passed
  • pre-commit run typos --files vllm/v1/engine/utils.py vllm/v1/engine/core.py vllm/v1/executor/multiproc_executor.py tests/v1/engine/test_engine_core_client.py tests/v1/executor/test_startup_error_reporting.py tests/v1/engine/test_startup_error_reporting.py
    • Passed
  • pre-commit run mypy-local --files vllm/v1/engine/utils.py vllm/v1/engine/core.py vllm/v1/executor/multiproc_executor.py tests/v1/engine/test_engine_core_client.py tests/v1/executor/test_startup_error_reporting.py tests/v1/engine/test_startup_error_reporting.py
    • Passed
  • pre-commit run check-spdx-header --files vllm/v1/engine/utils.py vllm/v1/engine/core.py vllm/v1/executor/multiproc_executor.py tests/v1/engine/test_engine_core_client.py tests/v1/executor/test_startup_error_reporting.py tests/v1/engine/test_startup_error_reporting.py
    • Passed
  • pre-commit run check-forbidden-imports --files vllm/v1/engine/utils.py vllm/v1/engine/core.py vllm/v1/executor/multiproc_executor.py tests/v1/engine/test_engine_core_client.py tests/v1/executor/test_startup_error_reporting.py tests/v1/engine/test_startup_error_reporting.py
    • Passed
  • pre-commit run check-torch-cuda-call --files vllm/v1/engine/utils.py vllm/v1/engine/core.py vllm/v1/executor/multiproc_executor.py tests/v1/engine/test_engine_core_client.py tests/v1/executor/test_startup_error_reporting.py tests/v1/engine/test_startup_error_reporting.py
    • Passed
  • pre-commit run check-boolean-context-manager --files vllm/v1/engine/utils.py vllm/v1/engine/core.py vllm/v1/executor/multiproc_executor.py tests/v1/engine/test_engine_core_client.py tests/v1/executor/test_startup_error_reporting.py tests/v1/engine/test_startup_error_reporting.py
    • Passed
  • pre-commit run check-root-lazy-imports --files vllm/v1/engine/utils.py vllm/v1/engine/core.py vllm/v1/executor/multiproc_executor.py tests/v1/engine/test_engine_core_client.py tests/v1/executor/test_startup_error_reporting.py tests/v1/engine/test_startup_error_reporting.py
    • Passed
  • python -m pytest --noconftest tests/v1/engine/test_engine_core_client.py -v -s -k startup_failure
    • Not runnable here because this Windows non-CUDA environment skips the CUDA-only module before the startup test can execute
  • pre-commit run --files ...
    • Not fully runnable in this Windows environment because several repo hooks require bash

Disclosure

Prepared with AI assistance; all changes and test results were reviewed before submission.

Changed files

  • tests/v1/engine/test_engine_core_client.py (modified, +41/-14)
  • tests/v1/engine/test_startup_error_reporting.py (added, +324/-0)
  • tests/v1/executor/test_startup_error_reporting.py (added, +110/-0)
  • vllm/v1/engine/core.py (modified, +56/-31)
  • vllm/v1/engine/utils.py (modified, +263/-23)
  • vllm/v1/executor/multiproc_executor.py (modified, +97/-21)

Code Example

uv pip install vllm --torch-backend=auto

---

ImportError: libcudart.so.12: cannot open shared object file: No such file or directory

---

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

---

AssertionError: In Mamba cache align mode, block_size (X) must be <= max_num_batched_tokens

---

ValueError: Selected backend AttentionBackendEnum.FLASH_ATTN is not valid for this configuration.
  Reason: ['kv_cache_dtype not supported']

---

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

---

# On any Blackwell GPU with CUDA 13.0 driver, WSL2 or native Linux
  pip install vllm  # or: uv pip install vllm

  # Problem 1: immediate crash
  python -c "import vllm"
  # ImportError: libcudart.so.12: cannot open shared object file

  # After fixing libcudart (patchelf or LD_LIBRARY_PATH):
  # Problem 2: crash on GDN model load (gcc/nvcc/ninja missing)
  vllm serve Qwen/Qwen3.5-27B-FP8

  # After installing system deps:
  # Problem 3: crash with default --max-num-batched-tokens
  vllm serve Qwen/Qwen3.5-27B-FP8 --kv-cache-dtype fp8
  # Engine core initialization failed

  # Problem 4: crash with explicit FLASH_ATTN + fp8
  vllm serve Qwen/Qwen3.5-27B-FP8 --attention-backend FLASH_ATTN --kv-cache-dtype fp8
  # Engine core initialization failed

  # What actually works (after all fixes):
  vllm serve Qwen/Qwen3.5-27B-FP8 --max-num-batched-tokens 2096 --kv-cache-dtype fp8 --dtype bfloat16

Your current environment

Collecting environment information...
uv is set

==============================
System Info
==============================
OS : Ubuntu 24.04.4 LTS (x86_64)
GCC version : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version : Could not collect
CMake version : Could not collect
Libc version : glibc-2.39

==============================
PyTorch Info
==============================
PyTorch version : 2.10.0+cu130
Is debug build : False
CUDA used to build PyTorch : 13.0
ROCM used to build PyTorch : N/A

==============================
Python Environment
==============================
Python version : 3.12.3 (main, Mar 3 2026, 12:15:18) [GCC 13.3.0] (64-bit runtime)
Python platform : Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.39

==============================
CUDA / GPU Info
==============================
Is CUDA available : True
CUDA runtime version : Could not collect
CUDA_MODULE_LOADING set to :
GPU models and configuration : GPU 0: NVIDIA RTX PRO 5000 Blackwell
Nvidia driver version : 595.79
cuDNN version : Could not collect

==============================
CPU Info
==============================
Architecture: x86_64
Model name: Intel(R) Core(TM) Ultra 9 285K
CPU(s): 24
Hypervisor vendor: Microsoft

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.4
[pip3] numpy==2.2.6
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cuda-runtime-cu12==12.9.79
[pip3] torch==2.10.0+cu130
[pip3] torchvision==0.25.0+cu130
[pip3] transformers==4.57.6
[pip3] triton==3.6.0

==============================
vLLM Info
==============================
vLLM Version : 0.17.1
vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled

How you are installing vllm

uv pip install vllm --torch-backend=auto

Summary

Setting up vLLM 0.17.1 from PyPI on a Blackwell GPU (RTX PRO 5000, SM120) with CUDA 13.0 on WSL2 requires solving 5 undocumented problems before any Qwen3.5 27B+ model will run. Each failure produces a cryptic error with no actionable guidance. This issue documents the full chain and proposes fixes.

Problem 1: libcudart.so.12 not found

Error:

ImportError: libcudart.so.12: cannot open shared object file: No such file or directory

Root cause: vLLM's C extensions (_C.abi3.so, _moe_C.abi3.so, _vllm_fa2_C.abi3.so, _vllm_fa3_C.abi3.so, _flashmla_C.abi3.so, _flashmla_extension_C.abi3.so) are compiled against CUDA 12 and link libcudart.so.12. CUDA 13 systems only ship libcudart.so.13.

Commonly suggested workaround: LD_LIBRARY_PATH. This is fragile — it doesn't survive process forks (vLLM's multiprocess EngineCore architecture), systemd services, background tasks, or Docker entrypoints.

What actually worked: Installing nvidia-cuda-runtime-cu12 (provides libcudart.so.12) and then running patchelf --set-rpath on all 8 affected .so files to bake in the path. No env vars needed.
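The patchelf step can be scripted. The following is a minimal sketch, not the issue author's actual script: it only builds the command lines for every compiled extension in a directory so they can be reviewed before anything is modified. The helper name `build_patchelf_commands`, the `*.abi3.so` glob pattern, and the directory arguments are assumptions; the real set of affected files and the directory that holds libcudart.so.12 depend on the installation.

```python
from pathlib import Path


def build_patchelf_commands(ext_dir: Path, cudart_dir: Path) -> list[list[str]]:
    """Return one `patchelf --set-rpath` command per compiled extension in ext_dir."""
    commands = []
    for so_file in sorted(ext_dir.glob("*.abi3.so")):
        commands.append(["patchelf", "--set-rpath", str(cudart_dir), str(so_file)])
    return commands


if __name__ == "__main__":
    import sys

    # Usage (both paths are placeholders):
    #   python patch_rpath.py <vllm package dir> <dir containing libcudart.so.12>
    if len(sys.argv) == 3:
        for cmd in build_patchelf_commands(Path(sys.argv[1]), Path(sys.argv[2])):
            print(" ".join(cmd))
```

Printing the commands rather than executing them keeps the sketch side-effect free. Note that `--set-rpath` replaces any RPATH already stored in the file, so it is worth inspecting the existing value with `patchelf --print-rpath` first.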

Proposed fix: Either:

  • (a) Add nvidia-cuda-runtime-cu12 as a dependency and set RPATH in vLLM's build system (CMakeLists.txt)
  • (b) Publish versioned wheels (cu12, cu130) on PyPI like PyTorch does, with clear install instructions
  • (c) At minimum, detect the mismatch at import time and print an actionable error message
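Option (c) could look roughly like the sketch below. It is an illustration of the idea, not vLLM code: `check_shared_library` is a hypothetical helper that asks the dynamic loader for the library via `ctypes.CDLL` and turns a failure into a message that names a next step.

```python
import ctypes
from typing import Optional


def check_shared_library(name: str) -> Optional[str]:
    """Return None if the dynamic loader can open `name`, else an actionable message."""
    try:
        ctypes.CDLL(name)
    except OSError as exc:
        return (
            f"Cannot load {name} ({exc}). The installed extensions may have been built "
            "against a different CUDA major version; installing nvidia-cuda-runtime-cu12 "
            "provides libcudart.so.12."
        )
    return None


if __name__ == "__main__":
    message = check_shared_library("libcudart.so.12")
    print(message or "libcudart.so.12 is loadable")
```

A check like this only reports what the loader sees in the current process, so it reflects any RPATH or environment settings already in effect.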

Related: #30435, #31018, #28669, #35432


Problem 2: FlashInfer JIT requires undocumented system dependencies

FlashInfer JIT-compiles CUDA kernels at runtime for GDN/Mamba attention patterns (used by Qwen3.5 27B+). This requires four system packages that aren't documented anywhere:

| Missing dependency | Error message | Fix |
|---|---|---|
| gcc | RuntimeError: Failed to find C compiler | apt install gcc |
| python3.12-dev | gcc compilation fails (missing Python.h) | apt install python3.12-dev |
| CUDA toolkit (nvcc) | RuntimeError: Could not find nvcc and default cuda_home='/usr/local/cuda' doesn't exist | apt install cuda-toolkit |
| ninja | FileNotFoundError: [Errno 2] No such file or directory: 'ninja' | apt install ninja-build |

Each failure is discovered sequentially — you fix one, hit the next. The errors come from deep inside FlashInfer's JIT layer, not vLLM, making them hard to trace.

Proposed fix:

  • Document these as system requirements in the installation guide, at least for Blackwell/GDN models
  • Detect missing dependencies at startup and print a single actionable message: "FlashInfer JIT compilation requires: gcc, python3-dev, nvcc (cuda-toolkit), ninja-build"
  • Consider pre-compiling FlashInfer kernels for common architectures (SM120, SM100) and shipping them in the wheel
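The "single actionable message" idea can be sketched as a pre-flight check. This is a hypothetical helper, not part of vLLM or FlashInfer: `find_missing_jit_deps` and the `REQUIRED_TOOLS` mapping are names invented here, and the check assumes the tools are expected on `PATH`, which may not hold for an nvcc installed under a CUDA home directory.

```python
import shutil
import sysconfig
from pathlib import Path

# Executable name -> apt package named in the issue's dependency list
REQUIRED_TOOLS = {"gcc": "gcc", "nvcc": "cuda-toolkit", "ninja": "ninja-build"}


def find_missing_jit_deps(which=shutil.which) -> list[str]:
    """Return the apt packages that appear to be missing for JIT compilation."""
    missing = [pkg for tool, pkg in REQUIRED_TOOLS.items() if which(tool) is None]
    # Python.h sits in the interpreter's include directory when the dev headers are installed
    include_dir = sysconfig.get_paths().get("include")
    if not include_dir or not (Path(include_dir) / "Python.h").exists():
        missing.append("python3-dev")
    return missing


if __name__ == "__main__":
    missing = find_missing_jit_deps()
    if missing:
        print("FlashInfer JIT compilation requires: " + ", ".join(missing))
```

Reporting every missing package at once avoids the fix-one, hit-the-next cycle described above. The `which` parameter exists only so the lookup can be replaced in tests.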

Related: #21960, #32826


Problem 3: Qwen3.5 27B+ GDN models silently require --max-num-batched-tokens 2096

Error:

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Actual root cause (buried in EngineCore subprocess log):

AssertionError: In Mamba cache align mode, block_size (X) must be <= max_num_batched_tokens

All Qwen3.5 models with GDN (Gated DeltaNet) layers — including the 27B dense model (not just MoE variants) — require --max-num-batched-tokens 2096 due to mamba/GDN cache alignment constraints. vLLM's default of 8192 violates this.

The 27B has 64 layers: 16 groups × (3 GDN + 1 Attention). vLLM treats GDN like Mamba for cache alignment, setting attention block size to 1568 tokens to ensure that attention page size is >= mamba page size.
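As a quick check of the layer arithmetic stated above, using only the counts given in the issue:

```python
GROUPS = 16
GDN_PER_GROUP = 3
ATTENTION_PER_GROUP = 1

gdn_layers = GROUPS * GDN_PER_GROUP              # 48
attention_layers = GROUPS * ATTENTION_PER_GROUP  # 16
total_layers = gdn_layers + attention_layers     # 64

print(f"{gdn_layers} GDN + {attention_layers} attention = {total_layers} layers")
```

So three quarters of the layers are GDN layers, which is why the Mamba-style cache alignment rules apply to the dense 27B model as well.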

Proposed fix:

  • Auto-detect GDN models and cap max_num_batched_tokens at the mamba page size
  • At minimum, surface the EngineCore subprocess error message to the user (see Problem 5)
  • Document this in the Qwen3.5 recipe
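A pre-flight version of the check might look like the sketch below. It mirrors only the inequality quoted in the assertion message (block_size must be <= max_num_batched_tokens); `check_mamba_align` is a hypothetical name, and how vLLM actually derives the block size is not modeled here.

```python
def check_mamba_align(block_size: int, max_num_batched_tokens: int) -> None:
    """Raise a descriptive error if the cache-align constraint would be violated."""
    if block_size > max_num_batched_tokens:
        raise ValueError(
            f"block_size ({block_size}) must be <= max_num_batched_tokens "
            f"({max_num_batched_tokens}); pass --max-num-batched-tokens "
            f"{block_size} or higher."
        )


if __name__ == "__main__":
    # Values taken from the issue: block size 1568, working setting 2096
    check_mamba_align(block_size=1568, max_num_batched_tokens=2096)
    print("constraint satisfied")
```

Running such a check in the parent process, before the engine subprocess starts, would put the real cause in front of the user instead of in a subprocess log.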

Related: #36010, #35502


Problem 4: --attention-backend FLASH_ATTN + --kv-cache-dtype fp8 = silent crash

Error (again, buried in EngineCore subprocess):

ValueError: Selected backend AttentionBackendEnum.FLASH_ATTN is not valid for this configuration.
Reason: ['kv_cache_dtype not supported']

Users may set --attention-backend FLASH_ATTN as a workaround for FlashInfer crashes (#36828), not knowing it's incompatible with --kv-cache-dtype fp8.

Proposed fix:

  • Validate flag combinations before spawning EngineCore subprocesses
  • Raise a clear error: "FLASH_ATTN does not support kv-cache-dtype=fp8. Use the default FlashInfer backend or remove --kv-cache-dtype fp8."

Related: #12543, #35577, PR #14221


Problem 5: "Engine core initialization failed" provides no useful information

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

This error appears for every problem above (and many others — see #17618, #18730, #19002, #21882, #23176, #26898, #33245). The actual root cause is always in the EngineCore subprocess log, which users must manually find at /tmp/vllm-*.log or by adding --log-level DEBUG.

I understand a logging redesign is tracked at #31683. In the meantime, a minimal fix would be:

  • Capture the EngineCore subprocess's last exception and include it in the parent's error message
  • Change Failed core proc(s): {} to actually list the subprocess PIDs and their exit codes/signals
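The two bullets above can be illustrated with a generic parent/child sketch. It does not use vLLM's actual process machinery: `run_child` is a hypothetical helper that launches a child Python interpreter with `subprocess`, and on failure raises an error that names the child's PID, its exit code, and the last line of its traceback.

```python
import subprocess
import sys


def run_child(code: str) -> None:
    """Run `code` in a child interpreter; on failure, raise with the child's details."""
    proc = subprocess.Popen(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    _, stderr = proc.communicate()
    if proc.returncode != 0:
        lines = stderr.strip().splitlines()
        # The final traceback line carries the exception type and message
        last_line = lines[-1] if lines else "<no output>"
        raise RuntimeError(
            "Engine core initialization failed. "
            f"Failed proc(s): pid={proc.pid}, exit code {proc.returncode}. "
            f"Root cause: {last_line}"
        )


if __name__ == "__main__":
    try:
        run_child("raise ValueError('kv_cache_dtype not supported')")
    except RuntimeError as err:
        print(err)
```

Even this small amount of context (which process failed, how it exited, and the child's final exception line) removes the need to search for a separate log file.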

Reproduction

# On any Blackwell GPU with CUDA 13.0 driver, WSL2 or native Linux
pip install vllm  # or: uv pip install vllm

# Problem 1: immediate crash
python -c "import vllm"
# ImportError: libcudart.so.12: cannot open shared object file

# After fixing libcudart (patchelf or LD_LIBRARY_PATH):
# Problem 2: crash on GDN model load (gcc/nvcc/ninja missing)
vllm serve Qwen/Qwen3.5-27B-FP8

# After installing system deps:
# Problem 3: crash with default --max-num-batched-tokens
vllm serve Qwen/Qwen3.5-27B-FP8 --kv-cache-dtype fp8
# Engine core initialization failed

# Problem 4: crash with explicit FLASH_ATTN + fp8
vllm serve Qwen/Qwen3.5-27B-FP8 --attention-backend FLASH_ATTN --kv-cache-dtype fp8
# Engine core initialization failed

# What actually works (after all fixes):
vllm serve Qwen/Qwen3.5-27B-FP8 --max-num-batched-tokens 2096 --kv-cache-dtype fp8 --dtype bfloat16

Suggested priority

Problems 1 and 5 are the highest impact — they affect every Blackwell user installing from PyPI, and the error messages give no path to resolution. Problems 2-4 compound the frustration but are solvable once you know what to look for.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To address the issues outlined, the following steps can be taken:

Problem 1: libcudart.so.12 not found

  • Install nvidia-cuda-runtime-cu12 to provide libcudart.so.12.
  • Run patchelf --set-rpath on affected .so files to set the path.

Example:

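pip install nvidia-cuda-runtime-cu12  # pip package (not apt); provides libcudart.so.12
# Assumed layout: extensions in site-packages/vllm/, libcudart.so.12 in site-packages/nvidia/cuda_runtime/lib/
# Repeat for each affected .so file, adjusting the relative path for files in subdirectories
patchelf --set-rpath '$ORIGIN/../nvidia/cuda_runtime/lib' _C.abi3.so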

Problem 2: FlashInfer JIT requires undocumented system dependencies

  • Install required system packages: gcc, python3.12-dev, cuda-toolkit, and ninja-build.
  • Document these as system requirements in the installation guide.

Example:

apt install gcc python3.12-dev cuda-toolkit ninja-build

Problem 3: Qwen3.5 27B+ GDN models silently require --max-num-batched-tokens 2096

  • Auto-detect GDN models and cap max_num_batched_tokens at the mamba page size.
  • Surface the EngineCore subprocess error message to the user.

Example (in Python):

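from vllm import LLM

# Hypothetical helper: vLLM does not expose is_gdn_model(); a name check stands in for real detection
def is_gdn_model(model_name: str) -> bool:
    return "Qwen3.5" in model_name

model = "Qwen/Qwen3.5-27B-FP8"
# Apply the value from the issue only for GDN models; otherwise keep vLLM's default
extra_args = {"max_num_batched_tokens": 2096} if is_gdn_model(model) else {}
llm = LLM(model=model, kv_cache_dtype="fp8", **extra_args)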

Problem 4: --attention-backend FLASH_ATTN + --kv-cache-dtype fp8 = silent crash

  • Validate flag combinations before spawning EngineCore subprocesses.
  • Raise a clear error for incompatible combinations.

Example (in Python):

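from typing import Optional

# Sketch of a pre-spawn check; validate_flags is a hypothetical name, not a vLLM function
def validate_flags(attention_backend: Optional[str], kv_cache_dtype: str) -> None:
    if attention_backend == "FLASH_ATTN" and kv_cache_dtype == "fp8":
        raise ValueError(
            "FLASH_ATTN does not support kv-cache-dtype=fp8. Use the default "
            "FlashInfer backend or remove --kv-cache-dtype fp8."
        )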

Problem 5: "Engine core initialization failed" provides no useful information

  • Capture the EngineCore subprocess's last exception and include it in the parent's error message.
  • Change Failed core proc(s): {} to list subprocess PIDs and their exit codes/signals.

Example (in Python):

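from vllm import LLM

try:
    # Constructing the engine starts the EngineCore subprocess
    llm = LLM(model="Qwen/Qwen3.5-27B-FP8")
except RuntimeError as e:
    # __cause__ holds the chained exception, if the child's error was propagated
    print(f"Engine core initialization failed: {e} (cause: {e.__cause__!r})")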

Verification

To verify the fixes, run the following commands:

vllm serve Qwen/Qwen3.5-
