vllm - ✅(Solved) Fix [Bug]: Engine crashes on startup with 'DeepGEMM backend not available' for standard bf16 models on H100 [2 pull requests, 1 participant]

vllm-project/vllm#41849, fetched 2026-05-07 03:32:29

Error Message

(EngineCore pid=...) ERROR [core.py:1136] EngineCore failed to start.
(EngineCore pid=...) ERROR [core.py:1136] Traceback (most recent call last):
  ...
  File ".../vllm/v1/worker/gpu_worker.py", line 586, in compile_or_warm_up_model
      kernel_warmup(self)
  File ".../vllm/model_executor/warmup/kernel_warmup.py", line 37, in kernel_warmup
      deep_gemm_warmup(model, max_tokens)
  File ".../vllm/model_executor/warmup/deep_gemm_warmup.py", line 364, in deep_gemm_warmup
      total = _count_warmup_iterations(model, max_tokens)
  File ".../vllm/model_executor/warmup/deep_gemm_warmup.py", line 342, in _count_warmup_iterations
      if _fp8_linear_may_use_deep_gemm(m):
  File ".../vllm/model_executor/warmup/deep_gemm_warmup.py", line 136, in _fp8_linear_may_use_deep_gemm
      block_size = get_mk_alignment_for_contiguous_layout()[0]
  File ".../vllm/utils/deep_gemm.py", line 266, in get_mk_alignment_for_contiguous_layout
      return _missing()
RuntimeError: DeepGEMM backend is not available or outdated.
Please install or update the `deep_gemm` to a newer version to enable FP8 kernels.

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Root Cause

kernel_warmup calls deep_gemm_warmup unconditionally. Inside deep_gemm_warmup, _count_warmup_iterations iterates over model modules and calls _fp8_linear_may_use_deep_gemm(m), which calls into the deep_gemm package via get_mk_alignment_for_contiguous_layout(). When deep_gemm is not installed, this raises RuntimeError via _missing() — even for models with zero FP8 layers.

The guard function (_fp8_linear_may_use_deep_gemm) requires deep_gemm to be installed just to answer "does this layer use deep_gemm?" — causing it to fail before it can return False.
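
The ordering problem can be reproduced in isolation. The sketch below is not vLLM code: PlainLinear, Fp8Linear, get_alignment, and optional_backend_installed are hypothetical stand-ins for an unquantized layer, an FP8 layer, get_mk_alignment_for_contiguous_layout(), and the deep_gemm availability check. It only illustrates why a predicate that touches the optional backend before its type check raises for every module, and how an early return avoids that.

```python
class PlainLinear:
    """Stand-in for an unquantized layer (hypothetical)."""


class Fp8Linear:
    """Stand-in for an FP8-quantized layer (hypothetical)."""


def optional_backend_installed() -> bool:
    # Assumption for this sketch: the optional package is absent.
    return False


def get_alignment() -> list:
    # Mirrors the behaviour described above: raise when the backend is missing.
    if not optional_backend_installed():
        raise RuntimeError("backend is not available")
    return [128, 128]


def may_use_backend_buggy(module: object) -> bool:
    # The backend call runs before the type check, so it raises for every module.
    block_size = get_alignment()[0]
    return isinstance(module, Fp8Linear) and block_size > 0


def may_use_backend_guarded(module: object) -> bool:
    # Answer "no" without touching the backend when it is unavailable.
    if not optional_backend_installed():
        return False
    block_size = get_alignment()[0]
    return isinstance(module, Fp8Linear) and block_size > 0


def raises_runtime_error(fn, module: object) -> bool:
    """Return True if calling fn(module) raises RuntimeError."""
    try:
        fn(module)
    except RuntimeError:
        return True
    return False
```

With the backend absent, the buggy predicate raises even for PlainLinear, while the guarded one returns False for both layer types.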

Fix Action

Workaround

export VLLM_USE_DEEP_GEMM=0
vllm serve ...
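
If the server is launched from a Python wrapper, the same workaround can be applied only when deep_gemm is actually missing. This is a minimal sketch under stated assumptions: deep_gemm_importable and serve_environment are hypothetical helper names, and the only vLLM-specific detail used is the VLLM_USE_DEEP_GEMM variable from the workaround above.

```python
import importlib.util
import os


def deep_gemm_importable() -> bool:
    """True if a top-level `deep_gemm` package can be found by this interpreter."""
    return importlib.util.find_spec("deep_gemm") is not None


def serve_environment(base_env: dict, backend_present: bool) -> dict:
    """Return a copy of base_env with the workaround set when the backend is absent."""
    env = dict(base_env)
    if not backend_present:
        env["VLLM_USE_DEEP_GEMM"] = "0"
    return env


if __name__ == "__main__":
    env = serve_environment(dict(os.environ), deep_gemm_importable())
    print("VLLM_USE_DEEP_GEMM =", env.get("VLLM_USE_DEEP_GEMM", "<unset>"))
```

The returned mapping can be passed as the env argument of subprocess.run when starting the server process.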

PR fix notes

PR #41851: warmup: skip deep_gemm warmup when DeepGEMM is not available

Description (problem / solution / changelog)

Summary

Fixes #41849.

_fp8_linear_may_use_deep_gemm calls get_mk_alignment_for_contiguous_layout() unconditionally before the FP8 type check, so it raises RuntimeError for every module — including plain nn.Linear layers in unquantized bf16 models — when deep_gemm is not installed.

Reproduction environment

Provider: Lambda Labs (gpu_2x_h100_sxm5)
Hardware: NVIDIA H100 80 GB HBM3 SXM5
Driver: 580.105.08 — CUDA runtime 13.0 — Python 3.12 — vLLM 0.20.1 — torch 2.11.0+cu130 — deep_gemm not installed

pip install vllm   # 0.20.1 — deep_gemm is not listed as a required dependency
vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --gpu-memory-utilization 0.85 --max-model-len 2048

Full traceback (from live vllm.log on that H100)

(EngineCore pid=3838313) ERROR [core.py:1136] EngineCore failed to start.
  File ".../vllm/v1/engine/core.py", line 1110, in run_engine_core
      engine_core = EngineCoreProc(...)
  File ".../vllm/v1/engine/core.py", line 128, in __init__
      kv_cache_config = self._initialize_kv_caches(vllm_config)
  File ".../vllm/v1/worker/gpu_worker.py", line 586, in compile_or_warm_up_model
      kernel_warmup(self)
  File ".../vllm/model_executor/warmup/kernel_warmup.py", line 37, in kernel_warmup
      deep_gemm_warmup(model, max_tokens)
  File ".../vllm/model_executor/warmup/deep_gemm_warmup.py", line 364, in deep_gemm_warmup
      total = _count_warmup_iterations(model, max_tokens)
  File ".../vllm/model_executor/warmup/deep_gemm_warmup.py", line 342, in _count_warmup_iterations
      if _fp8_linear_may_use_deep_gemm(m):
  File ".../vllm/model_executor/warmup/deep_gemm_warmup.py", line 135, in _fp8_linear_may_use_deep_gemm
      block_size = get_mk_alignment_for_contiguous_layout()[0]
  File ".../vllm/utils/deep_gemm.py", line 266, in get_mk_alignment_for_contiguous_layout
      return _missing()
RuntimeError: DeepGEMM backend is not available or outdated.
Please install or update the `deep_gemm` to a newer version to enable FP8 kernels.

Note: this crash happens after weights load, torch.compile runs (10.57 s), and KV cache is profiled (645,472 tokens available) — so it appears late and is easy to miss without reading engine-process logs.

Root cause

def _fp8_linear_may_use_deep_gemm(module: torch.nn.Module) -> bool:
    # FIXME: this logic is brittle and incorrect
    block_size = get_mk_alignment_for_contiguous_layout()[0]  # ← crash for ALL modules
    if not (
        isinstance(module, LinearBase)
        and isinstance(module.quant_method, Fp8LinearMethod)   # ← never reached
        ...
    ):
        return False

get_mk_alignment_for_contiguous_layout() is called for every nn.Module before the FP8 guard, so it fails for plain nn.Linear layers too.

The existing VLLM_USE_DEEP_GEMM=0 workaround works because is_deep_gemm_supported() returns False — but that check is never reached by _fp8_linear_may_use_deep_gemm.

Fix — vllm/model_executor/warmup/deep_gemm_warmup.py

+from vllm.utils.deep_gemm import (
     fp8_gemm_nt,
     get_mk_alignment_for_contiguous_layout,
+    is_deep_gemm_supported,
     m_grouped_fp8_gemm_nt_contiguous,
 )

 def _fp8_linear_may_use_deep_gemm(module: torch.nn.Module) -> bool:
     """Return True if the module could be processed with DeepGEMM."""
+    if not is_deep_gemm_supported():
+        return False
     block_size = get_mk_alignment_for_contiguous_layout()[0]
     ...

is_deep_gemm_supported() already combines three checks: has_deep_gemm(), the VLLM_USE_DEEP_GEMM setting, and the sm_90/sm_100 architecture requirement. Reusing it here makes the guard consistent with the rest of the file.
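
To make the three conditions concrete, here is a hedged sketch of how such a combined guard could be composed. It is not the vLLM implementation: backend_supported is a hypothetical name, the capability is passed in as an integer (90 for sm_90, 100 for sm_100), and treating an unset VLLM_USE_DEEP_GEMM as enabled is an assumption inferred from the crash occurring without the variable set.

```python
import importlib.util
import os
from typing import Mapping, Optional

# sm_90 and sm_100, as listed in the description above.
SUPPORTED_CAPABILITIES = {90, 100}


def backend_supported(
    capability: int,
    env: Optional[Mapping[str, str]] = None,
    installed: Optional[bool] = None,
) -> bool:
    """Combine package presence, the env switch, and the GPU architecture."""
    env = os.environ if env is None else env
    if installed is None:
        installed = importlib.util.find_spec("deep_gemm") is not None
    # Assumption: unset means enabled, and "0" disables the backend.
    enabled = env.get("VLLM_USE_DEEP_GEMM", "1") != "0"
    return bool(installed and enabled and capability in SUPPORTED_CAPABILITIES)
```

Keeping all three conditions behind one function means every caller gives the same answer, which is the consistency argument made above.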

Tests added — tests/model_executor/test_deep_gemm_warmup.py

| Test | What it checks |
| --- | --- |
| test_fp8_linear_returns_false_when_deep_gemm_unavailable_plain_module | nn.Linear returns False with no exception when is_deep_gemm_supported=False |
| test_fp8_linear_returns_false_when_deep_gemm_unavailable_mock_fp8 | Mock FP8 linear: get_mk_alignment is never called when unsupported |
| test_deep_gemm_warmup_noop_when_unavailable | Full Sequential bf16 model: deep_gemm_warmup is a no-op |

The key assertion in test 2 is mock_align.assert_not_called() — proving the fix prevents the call that caused the crash.
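
The shape of that assertion can be shown without importing vLLM. In the sketch below, warmup is a hypothetical namespace standing in for the warmup module, and may_use_deep_gemm is a simplified stand-in for the patched predicate; only the unittest.mock calls are real APIs.

```python
from types import SimpleNamespace
from unittest import mock


def _raise_missing():
    raise RuntimeError("DeepGEMM backend is not available or outdated.")


# Hypothetical stand-in for the warmup module's two dependencies.
warmup = SimpleNamespace(
    is_supported=lambda: False,
    get_alignment=_raise_missing,
)


def may_use_deep_gemm(module: object) -> bool:
    """Simplified predicate with the early return from the fix."""
    if not warmup.is_supported():
        return False
    block_size = warmup.get_alignment()[0]
    return block_size > 0


def test_alignment_not_called_when_unsupported() -> None:
    with mock.patch.object(warmup, "is_supported", return_value=False):
        with mock.patch.object(warmup, "get_alignment") as mock_align:
            assert may_use_deep_gemm(object()) is False
            # The call that used to crash must never happen.
            mock_align.assert_not_called()
```

Replacing get_alignment with a mock, rather than only checking the return value, is what demonstrates that the crashing call is skipped and not merely caught.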

Security

_fp8_linear_may_use_deep_gemm is a pure predicate. Returning False early when deep_gemm is unavailable matches the intent of VLLM_USE_DEEP_GEMM=0 — no new behaviour is introduced, no I/O or code execution is affected.

Verification on hardware

After applying this fix (plus the separate MIG UUID fix #41850), vLLM started on the H100 without deep_gemm installed:

INFO  [monitor.py] torch.compile took 10.57 s in total
INFO  [gpu_worker.py] Available KV cache memory: 13.54 GiB
INFO  [kv_cache_utils.py] GPU KV cache size: 645,472 tokens
# Engine started — no DeepGEMM crash
INFO  vLLM ready — TinyLlama/TinyLlama-1.1B-Chat-v1.0 on port 8001

Changed files

  • tests/model_executor/test_deep_gemm_warmup.py (added, +149/-0)
  • vllm/model_executor/warmup/deep_gemm_warmup.py (modified, +4/-1)

PR #41850: platforms: handle MIG device UUID in device_id_to_physical_device_id

Description (problem / solution / changelog)

Summary

Fixes #41848.

When CUDA_VISIBLE_DEVICES is set to a MIG device UUID (e.g. MIG-377e0049-554c-540b-93c6-d0976f8426cb), device_id_to_physical_device_id() raises ValueError and vLLM crashes before loading any model. This blocks all MIG-partitioned H100/A100 deployments using UUID-based device assignment.

Reproduction environment

Provider: Lambda Labs (gpu_2x_h100_sxm5)
Hardware: 2× NVIDIA H100 80 GB HBM3 SXM5, MIG enabled on GPU 0
Driver: 580.105.08 — CUDA runtime 13.0 — Python 3.12 — vLLM 0.20.1 — torch 2.11.0+cu130

nvidia-smi -L
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-7f57f032-34ab-a064-c37c-7a4f75601cc5)
  MIG 2g.20gb     Device  0: (UUID: MIG-377e0049-554c-540b-93c6-d0976f8426cb)
  MIG 1g.10gb     Device  1: (UUID: MIG-f39aa248-6f1b-5769-b119-8d650bb34b27)
GPU 1: NVIDIA H100 80GB HBM3 (UUID: GPU-8c7c9c30-f12f-20bf-e14e-d3c4c965af32)

# Exact command that triggered the crash
CUDA_VISIBLE_DEVICES="MIG-377e0049-554c-540b-93c6-d0976f8426cb" \
  vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --gpu-memory-utilization 0.85 --max-model-len 2048 --port 8001

Full traceback (from live vllm.log on that H100)

Engine subprocess:

  File ".../vllm/platforms/cuda.py", line 174, in has_device_capability
      return super().has_device_capability(capability, device_id)
  File ".../vllm/platforms/interface.py", line 141, in has_device_capability
      current_capability = cls.get_device_capability(device_id=device_id)
  File ".../vllm/platforms/cuda.py", line 158, in get_device_capability
      physical_device_id = device_id_to_physical_device_id(device_id)
  File ".../vllm/platforms/cuda.py", line 57, in device_id_to_physical_device_id
      return int(physical_device_id)
             ^^^^^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: 'MIG-377e0049-554c-540b-93c6-d0976f8426cb'

API server:

RuntimeError: Engine process failed to start. See stack trace for the root cause.

Crash happens inside create_model_config(), before weights load or CUDA graphs are touched.

Root cause

# vllm/platforms/interface.py
device_ids = os.environ[cls.device_control_env_var].split(",")
physical_device_id = device_ids[device_id]
return int(physical_device_id)   # ← ValueError for "MIG-<uuid>" strings

CUDA accepts MIG UUID strings in CUDA_VISIBLE_DEVICES and maps them transparently to logical device 0. I confirmed this with PyTorch directly on the same H100:

import os, torch
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-377e0049-554c-540b-93c6-d0976f8426cb"
torch.cuda.device_count()      # → 1
torch.cuda.get_device_name(0)  # → "NVIDIA H100 80GB HBM3 MIG 2g.20gb"

So when int() fails, device_id (the CUDA-adjusted logical index) is the correct value.

Fix — vllm/platforms/interface.py

-            return int(physical_device_id)
+            try:
+                return int(physical_device_id)
+            except ValueError:
+                # MIG device UUID (e.g. "MIG-<uuid>") — CUDA already remaps
+                # it to logical device 0, so device_id is the correct index.
+                return device_id

Tests added — tests/cuda/test_mig_uuid_device_id.py

Five unit tests using unittest.mock.patch.dict on CUDA_VISIBLE_DEVICES:

| Test | What it checks |
| --- | --- |
| test_device_id_integer_passthrough | Integer device IDs still map correctly |
| test_device_id_mig_uuid_returns_logical_index | MIG UUID returns device_id=0, no exception |
| test_device_id_multiple_mig_uuids | Two comma-separated MIG UUIDs → indices 0 and 1 |
| test_device_id_unset_env_returns_device_id | Unset env falls back to device_id |
| test_device_id_empty_env_returns_device_id | Empty string treated as unset (Ray compat) |

All five pass against the patched code. The pre-fix code raises ValueError on tests 2 and 3.
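
A self-contained approximation of the patched mapping and a few of these tests is sketched below. The function body is a simplified reconstruction from the diff and the test descriptions above, not a copy of vLLM's code; it is a free function reading os.environ directly, which is an assumption made to keep the example runnable on its own.

```python
import os
from unittest import mock

ENV_VAR = "CUDA_VISIBLE_DEVICES"
MIG_A = "MIG-377e0049-554c-540b-93c6-d0976f8426cb"
MIG_B = "MIG-f39aa248-6f1b-5769-b119-8d650bb34b27"


def device_id_to_physical_device_id(device_id: int) -> int:
    """Map a logical device index to a physical one, tolerating MIG UUIDs."""
    visible = os.environ.get(ENV_VAR, "")
    if not visible:
        # Unset or empty: the logical index is used unchanged.
        return device_id
    entry = visible.split(",")[device_id]
    try:
        return int(entry)
    except ValueError:
        # Non-integer entry such as "MIG-<uuid>": fall back to the logical index.
        return device_id


def test_integer_passthrough() -> None:
    with mock.patch.dict(os.environ, {ENV_VAR: "2,3"}):
        assert device_id_to_physical_device_id(1) == 3


def test_mig_uuid_returns_logical_index() -> None:
    with mock.patch.dict(os.environ, {ENV_VAR: MIG_A}):
        assert device_id_to_physical_device_id(0) == 0


def test_multiple_mig_uuids() -> None:
    with mock.patch.dict(os.environ, {ENV_VAR: f"{MIG_A},{MIG_B}"}):
        assert device_id_to_physical_device_id(0) == 0
        assert device_id_to_physical_device_id(1) == 1


def test_unset_env_returns_device_id() -> None:
    with mock.patch.dict(os.environ, {}, clear=True):
        assert device_id_to_physical_device_id(0) == 0
```

mock.patch.dict restores os.environ when each with block exits, so the tests do not leak CUDA_VISIBLE_DEVICES into one another.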

Security

device_id_to_physical_device_id is a pure mapping function — no I/O, no subprocess calls, no user-controlled code execution. Returning device_id when int() fails is the same behaviour as when CUDA_VISIBLE_DEVICES is unset: the already-validated logical index is used. No new attack surface is introduced.

Verification on hardware

After applying this patch together with VLLM_USE_DEEP_GEMM=0 (workaround for the separate #41849), vLLM started successfully on the 2g.20gb MIG partition and served inference requests:

INFO  [cuda.py] Using FLASH_ATTN attention backend (FlashAttention v3)
INFO  [kv_cache_utils.py] GPU KV cache size: 645,472 tokens
INFO  vLLM ready — TinyLlama/TinyLlama-1.1B-Chat-v1.0 on port 8001

Changed files

  • tests/cuda/test_mig_uuid_device_id.py (added, +77/-0)
  • vllm/platforms/cuda.py (modified, +12/-5)
  • vllm/platforms/interface.py (modified, +9/-2)

Original issue

Describe the bug

vLLM 0.20.1 crashes during engine initialization with RuntimeError: DeepGEMM backend is not available even when running a standard bf16 model with no FP8 quantization. deep_gemm_warmup is called unconditionally for all models, making it impossible to start vLLM on H100 without installing the optional deep_gemm package.

Environment

  • vLLM version: 0.20.1
  • Python: 3.12
  • GPU: NVIDIA H100 80 GB HBM3 SXM5
  • CUDA: 13.0 / torch 2.11.0+cu130
  • OS: Ubuntu 24.04
  • Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0 (bf16, no quantization)
  • deep_gemm not installed

Steps to reproduce

pip install vllm  # 0.20.1, without deep_gemm
vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --gpu-memory-utilization 0.85 --max-model-len 2048

Error

(EngineCore pid=...) ERROR [core.py:1136] EngineCore failed to start.
(EngineCore pid=...) ERROR [core.py:1136] Traceback (most recent call last):
  ...
  File ".../vllm/v1/worker/gpu_worker.py", line 586, in compile_or_warm_up_model
      kernel_warmup(self)
  File ".../vllm/model_executor/warmup/kernel_warmup.py", line 37, in kernel_warmup
      deep_gemm_warmup(model, max_tokens)
  File ".../vllm/model_executor/warmup/deep_gemm_warmup.py", line 364, in deep_gemm_warmup
      total = _count_warmup_iterations(model, max_tokens)
  File ".../vllm/model_executor/warmup/deep_gemm_warmup.py", line 342, in _count_warmup_iterations
      if _fp8_linear_may_use_deep_gemm(m):
  File ".../vllm/model_executor/warmup/deep_gemm_warmup.py", line 136, in _fp8_linear_may_use_deep_gemm
      block_size = get_mk_alignment_for_contiguous_layout()[0]
  File ".../vllm/utils/deep_gemm.py", line 266, in get_mk_alignment_for_contiguous_layout
      return _missing()
RuntimeError: DeepGEMM backend is not available or outdated.
Please install or update the `deep_gemm` to a newer version to enable FP8 kernels.

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Root cause

kernel_warmup calls deep_gemm_warmup unconditionally. Inside deep_gemm_warmup, _count_warmup_iterations iterates over model modules and calls _fp8_linear_may_use_deep_gemm(m), which calls into the deep_gemm package via get_mk_alignment_for_contiguous_layout(). When deep_gemm is not installed, this raises RuntimeError via _missing() — even for models with zero FP8 layers.

The guard function (_fp8_linear_may_use_deep_gemm) requires deep_gemm to be installed just to answer "does this layer use deep_gemm?" — causing it to fail before it can return False.

Workaround

export VLLM_USE_DEEP_GEMM=0
vllm serve ...

Suggested fix

Guard the warmup at the top of deep_gemm_warmup() with an availability check. The helper below is is_deep_gemm_supported, the function that PR #41851 imports from vllm.utils.deep_gemm:

def deep_gemm_warmup(model, max_tokens):
    from vllm.utils.deep_gemm import is_deep_gemm_supported
    if not is_deep_gemm_supported():
        return  # DeepGEMM unavailable or disabled: skip warmup silently
    ...

Or make _fp8_linear_may_use_deep_gemm return False (instead of raising) when deep_gemm is absent.

Impact

Any user running vLLM 0.20.1 on H100 (sm_90) without deep_gemm installed cannot start the server for standard unquantized models. Since deep_gemm is not a required dependency and requires special installation steps, this is a regression from 0.19.x for H100 users.

I have a fix ready and can submit a PR.
