vllm - [Bug]: vLLM fails to start on RDNA 4 (gfx1201) inside containers: amdsmi, circular import, and torch.cuda.device_count() all broken [1 comment, 2 participants]

vllm-project/vllm#40081 (fetched 2026-04-17 08:27:18)
Your current environment

<details> <summary>Environment details (no collect_env.py — vLLM won't start without the patches)</summary>
OS: CachyOS (Arch-based), kernel 6.19.x-cachyos
CPU: AMD Ryzen 9 5950X
GPU: AMD Radeon RX 9070 XT (gfx1201 / RDNA 4, 16 GB VRAM)
ROCm: 7.12 nightly (TheRock toolchain, via bluefalcon13/vllm-rocm container)
PyTorch: 2.7.0a0 (built against ROCm 7.12)
vLLM: v0.16.0rc0
Container runtime: k3s (containerd) / podman
Flash Attention: hyoon1/flash-attention enable-ck-gfx12 branch (Composable Kernel)

Note: python collect_env.py is not available because vLLM crashes during platform detection before reaching any usable state. The patches in this report are required to get it to start.

</details>

🐛 Describe the bug

Running vLLM inside a container on an RDNA 4 GPU (RX 9070 XT, gfx1201) hits three sequential failures that prevent the engine from starting. The GPU is fully functional via HIP — hipGetDeviceCount() returns 1, inference works once you get past these bugs. The issues are all in vLLM's platform detection and initialization path, not in the GPU or ROCm stack.

I've published workaround patches and a full writeup at sleeepss/vllm-rdna4-container-patches. Filing this to get the issues on upstream's radar.

Note: I tested on v0.16.0rc0 via bluefalcon13/vllm-rocm. I've confirmed that rocm_platform_plugin() in platforms/__init__.py is still amdsmi-only on current main (no HIP fallback). Open issues #24576, #34573, and #39378 report the same detection failure on other hardware — this issue adds the RDNA 4 container-specific angle with root cause analysis and working patches.


Bug 1: amdsmi fails to initialize → platform detection fails

rocm_platform_plugin() in vllm/platforms/__init__.py calls amdsmi to detect ROCm. Inside a container, amdsmi fails with AMDSMI_STATUS_NOT_INIT despite the GPU being fully accessible via HIP. vLLM's platform detection returns None and the engine exits with "no platform found."

Root cause: amdsmi requires sysfs/hwmon paths that aren't always exposed inside unprivileged containers. HIP (via hipGetDeviceCount in libamdhip64.so) works fine because it goes through /dev/kfd + /dev/dri, which are mounted. This is a known upstream ROCm issue: see ROCm/ROCm#5000 (amdsmi Error code 34) and ROCm/amdsmi#75 (driver not initialized despite working rocminfo). Additionally, ROCm/k8s-device-plugin#65 documents that the k8s device plugin sets /dev/dri permissions to rw instead of rwm, causing amdgpu_device_initialize to fail.
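To check the root cause above from inside the container, it can help to confirm that the device nodes HIP relies on are visible to the process. The following is a minimal sketch, not part of the published patches: the helper name rocm_device_node_access and its dev_root parameter are hypothetical, and it assumes the render nodes appear as /dev/dri/renderD*. It only tests read/write access with os.access, so it cannot detect the rw versus rwm cgroup difference mentioned above.

```python
import glob
import os


def rocm_device_node_access(dev_root: str = "/dev") -> dict[str, bool]:
    """Report whether this process can read and write the ROCm device nodes.

    Hypothetical diagnostic helper. It checks <dev_root>/kfd plus every
    <dev_root>/dri/renderD* entry that exists. A missing kfd node is
    reported as False; missing render nodes simply do not appear.
    """
    paths = [os.path.join(dev_root, "kfd")]
    paths += sorted(glob.glob(os.path.join(dev_root, "dri", "renderD*")))
    return {path: os.access(path, os.R_OK | os.W_OK) for path in paths}


if __name__ == "__main__":
    for path, ok in rocm_device_node_access().items():
        print(f"{path}: {'accessible' if ok else 'NOT accessible'}")
```

If /dev/kfd shows as not accessible, the failure is likely in the container's device mounts rather than in vLLM's detection code.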

Suggested fix: Fall back to HIP-based detection when amdsmi fails. A ctypes call to hipGetDeviceCount is sufficient:

def rocm_platform_plugin() -> str | None:
    # Try amdsmi first (existing code)...
    # ...
    # Fallback: HIP-based detection
    try:
        import ctypes
        hip = ctypes.CDLL("libamdhip64.so")
        count = ctypes.c_int()
        result = hip.hipGetDeviceCount(ctypes.byref(count))
        if result == 0 and count.value > 0:
            return "vllm.platforms.rocm.RocmPlatform"
    except Exception:
        pass
    return None
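The fallback above can also be factored into a small standalone probe, which makes it easier to test outside vLLM. This is a sketch under assumptions: the function name hip_device_count and the lib_name parameter are hypothetical, and it declares the C signature explicitly (hipGetDeviceCount takes an int pointer and returns a status code, with 0 meaning success). It narrows the exception handling to the two failures ctypes raises when the library or symbol is missing.

```python
import ctypes


def hip_device_count(lib_name: str = "libamdhip64.so") -> int:
    """Return the number of HIP devices, or 0 if HIP cannot be queried.

    Hypothetical helper: loads the HIP runtime by name and calls
    hipGetDeviceCount without importing torch or amdsmi.
    """
    try:
        hip = ctypes.CDLL(lib_name)          # OSError if the library is absent
        get_count = hip.hipGetDeviceCount    # AttributeError if symbol is absent
    except (OSError, AttributeError):
        return 0

    # Declare the C signature so ctypes does not guess argument types.
    get_count.argtypes = [ctypes.POINTER(ctypes.c_int)]
    get_count.restype = ctypes.c_int

    count = ctypes.c_int(0)
    status = get_count(ctypes.byref(count))
    return count.value if status == 0 else 0


if __name__ == "__main__":
    print("HIP devices visible:", hip_device_count())
```

Returning 0 instead of raising keeps the caller simple: platform detection only needs to test whether the result is greater than zero.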

Bug 2: Circular import in GCN arch fallback via logger.warning_once()

Once you get past Bug 1, _get_gcn_arch() in vllm/platforms/rocm.py tries amdsmi for the GCN architecture string. That fails (same amdsmi issue). The except block calls logger.warning_once(), which during module init triggers:

warning_once() → vllm.distributed.parallel_state → vllm.utils.system_utils → vllm.platforms

This is a circular import back into the module that's still initializing. Result: ImportError: cannot import name 'current_platform'.
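The failure mode can be reproduced without vLLM. The sketch below is illustrative only: it writes two throwaway modules, demo_platforms and demo_logging (hypothetical names standing in for vllm.platforms and the module reached via warning_once), and imports them so that the second one asks for a name the first has not defined yet.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path


def reproduce_circular_import() -> str:
    """Trigger a circular import with two temp modules; return the error text."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Stand-in for vllm.platforms: imports a helper before defining the name.
        (root / "demo_platforms.py").write_text(textwrap.dedent("""\
            import demo_logging
            current_platform = "rocm"
        """))
        # Stand-in for the helper: imports back into the half-initialized module.
        (root / "demo_logging.py").write_text(textwrap.dedent("""\
            from demo_platforms import current_platform
        """))
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            importlib.import_module("demo_platforms")
        except ImportError as exc:
            return str(exc)
        finally:
            # Leave the interpreter state as we found it.
            sys.path.remove(tmp)
            sys.modules.pop("demo_platforms", None)
            sys.modules.pop("demo_logging", None)
    return ""


if __name__ == "__main__":
    print(reproduce_circular_import())
```

The error text begins with "cannot import name 'current_platform'", the same shape as the ImportError reported above. Any fix has to avoid touching the helper module until the platform module has finished initializing.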

Suggested fix: Check PYTORCH_ROCM_ARCH env var before calling logger.warning_once(). If set, return it immediately. Also replace warning_once() with regular warning() to avoid the circular import path entirely:

except Exception as e:
    logger.debug("Failed to get GCN arch via amdsmi: %s", e)
# Env var fallback — avoids circular import from warning_once during module init
arch_env = os.environ.get("PYTORCH_ROCM_ARCH", "")
if arch_env:
    logger.info("Using PYTORCH_ROCM_ARCH=%s for GCN arch", arch_env)
    return arch_env
logger.warning(
    "Failed to get GCN arch via amdsmi, falling back to torch.cuda. "
    "This will initialize CUDA and may cause "
    "issues if CUDA_VISIBLE_DEVICES is not set yet."
)
return torch.cuda.get_device_properties("cuda").gcnArchName
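One caveat with the env var fallback: PYTORCH_ROCM_ARCH is primarily a build-time variable and may hold a semicolon-separated list of architectures. The sketch below shows one way to normalize it before returning it. This is an assumption-laden variant, not what the patch above does: the helper name gcn_arch_from_env is hypothetical, and picking the first entry that starts with "gfx" is a simplification that may be wrong on mixed-GPU hosts.

```python
import os
from collections.abc import Mapping
from typing import Optional


def gcn_arch_from_env(environ: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the first gfx architecture listed in PYTORCH_ROCM_ARCH, if any.

    Hypothetical helper. Uses only the standard library, so it is safe to
    call during module initialization without pulling in other packages.
    """
    env = os.environ if environ is None else environ
    raw = env.get("PYTORCH_ROCM_ARCH", "")
    for entry in raw.split(";"):
        candidate = entry.strip()
        if candidate.startswith("gfx"):
            return candidate
    return None


if __name__ == "__main__":
    print(gcn_arch_from_env({"PYTORCH_ROCM_ARCH": "gfx1201"}))
```

Returning None for an unset or malformed value lets the caller fall through to the existing torch.cuda path unchanged.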

Bug 3: torch.cuda.device_count() returns 0

With Bugs 1 and 2 fixed, vLLM starts but torch.cuda.device_count() returns 0 despite the GPU working via HIP. vLLM's data-parallel code then asserts rank 0 is out of bounds and crashes. This matches the pattern in ROCm/ROCm#5461 (rocminfo works, device_count 0) and ROCm/HIP#3710 (hipErrorNoDevice from PyTorch).

Workaround: Monkey-patch torch.cuda.device_count (and torch.accelerator.device_count) at module load time using HIP's hipGetDeviceCount:

def _patch_torch_device_count():
    import torch
    if torch.cuda.device_count() == 0:
        import ctypes
        hip = ctypes.CDLL("libamdhip64.so")
        count = ctypes.c_int()
        if hip.hipGetDeviceCount(ctypes.byref(count)) == 0 and count.value > 0:
            n = count.value
            torch.cuda.device_count = lambda: n
            if hasattr(torch, "accelerator"):
                torch.accelerator.device_count = lambda: n

This is a global mutation and not a clean fix — flagging it as something that should be handled properly in vLLM's ROCm platform layer or upstream in PyTorch's ROCm backend.
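Where a permanent global patch is undesirable, the same override can be made reversible. The following is only a sketch of that idea: override_device_count is a hypothetical helper, and whether vLLM's initialization can be wrapped in a single scope like this is untested here. It also does not affect code paths that query the device count below the Python level.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def override_device_count(module, count: int):
    """Temporarily replace module.device_count with a function returning count.

    Hypothetical helper: 'module' would be torch.cuda (or torch.accelerator)
    in practice. The original attribute is restored even if the body raises.
    """
    original = module.device_count
    module.device_count = lambda: count
    try:
        yield
    finally:
        module.device_count = original


if __name__ == "__main__":
    # Stand-in object so the example runs without torch installed.
    fake_cuda = SimpleNamespace(device_count=lambda: 0)
    with override_device_count(fake_cuda, 1):
        print("inside:", fake_cuda.device_count())
    print("after:", fake_cuda.device_count())
```

With torch installed, the call would look like `with override_device_count(torch.cuda, n): ...`, where n comes from hipGetDeviceCount as in the workaround above.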

Bonus: GGUF weight-loading OOM on 16 GB cards

Not a code bug, but a sharp edge worth documenting. Once vLLM starts and begins loading a 14B Q4_K_M GGUF (~9 GB on disk), _create_padded_weight_param allocates a temp FP16 buffer for merged/padded weights and OOMs:

torch.OutOfMemoryError: HIP out of memory. Tried to allocate 96.00 MiB.
GPU 0 has a total capacity of 15.92 GiB of which 5.71 MiB is free.
Of the allocated memory 6.84 GiB is allocated by PyTorch...

This happens during weight loading, before KV cache allocation, so --gpu-memory-utilization and --max-model-len do not help. The workaround is --cpu-offload-gb 4, which uses vLLM's UVA zero-copy offload path to keep 4 GB of weights in pinned CPU RAM.
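The numbers in the OOM message can be broken down to see how much memory is in use beyond what PyTorch reports as allocated. The sketch below parses the three lines quoted above; the helper names are hypothetical, and because the log is truncated, the remainder it computes could be PyTorch's reserved-but-unallocated cache, other processes, or both.

```python
import re

# Convert the units that appear in the message to GiB.
_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}


def _size_after(label: str, text: str) -> float:
    """Return the size (in GiB) that follows 'label' in the message."""
    match = re.search(re.escape(label) + r"\s+([\d.]+)\s+(MiB|GiB)", text)
    if match is None:
        raise ValueError(f"no size found after {label!r}")
    return float(match.group(1)) * _TO_GIB[match.group(2)]


def summarize_hip_oom(text: str) -> dict[str, float]:
    """Break an 'HIP out of memory' message into a few derived figures."""
    total = _size_after("total capacity of", text)
    free = _size_after("of which", text)
    allocated = _size_after("Of the allocated memory", text)
    return {
        "requested_gib": _size_after("Tried to allocate", text),
        "free_gib": free,
        "in_use_gib": total - free,
        "in_use_not_pytorch_allocated_gib": total - free - allocated,
    }


if __name__ == "__main__":
    message = (
        "torch.OutOfMemoryError: HIP out of memory. Tried to allocate 96.00 MiB.\n"
        "GPU 0 has a total capacity of 15.92 GiB of which 5.71 MiB is free.\n"
        "Of the allocated memory 6.84 GiB is allocated by PyTorch"
    )
    for key, value in summarize_hip_oom(message).items():
        print(f"{key}: {value:.2f}")
```

For the message above this yields roughly 9.07 GiB in use that is not counted as PyTorch-allocated. That fits a load-time spike in temporary buffers, though the truncated log does not confirm where that memory went.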

This may be related to #22814 (GGUF loader system RAM bloat via torch.tensor() vs torch.from_numpy()). I hit a similar symptom shape but in VRAM.


Related upstream issues

vLLM — same detection failure, different hardware:

  • #24576 — "No module named 'amdsmi'" on ROCm 6.4.2, gfx1151 (open)
  • #34573 — "No HIP GPUs are available" on gfx1151 in container (open)
  • #39378 — "Failed to infer device type" on 7900XTX + ROCm 7.2 + v0.19.0 (open)

vLLM — RDNA 4 specific:

  • #28649 — FP8 WMMA feature request for gfx1201 (open)
  • #28052 — Flash attention error on gfx1201 in Docker (closed, fixed by PR #31062)
  • PR #38086 — FP8 MoE enabled for gfx1201 on main (merged Apr 2, 2026)
  • PR #38455 — gfx1201 device ID mapping added on main (merged Apr 10, 2026)

ROCm upstream — amdsmi and container detection:

ROCm upstream — RDNA 4 in containers:

Summary

| Bug | File | Impact | Clean fix difficulty |
|---|---|---|---|
| amdsmi platform detection | platforms/__init__.py | Engine won't start | Low (add HIP fallback) |
| Circular import via warning_once | platforms/rocm.py | Engine won't start | Low (env var check + replace warning_once) |
| torch.cuda.device_count() = 0 | platforms/rocm.py | Engine crashes on DP init | Medium (needs proper HIP integration in torch or vLLM platform layer) |
| GGUF weight-loading OOM | Weight loader | OOM on 16 GB cards | Unknown (may need loader refactor) |

Full patches, Dockerfile diff, and k8s deployment manifest: sleeepss/vllm-rdna4-container-patches

AI disclosure: I used Claude to help draft this issue. All patches have been verified on my own hardware (RX 9070 XT, ROCm 7.12 nightly, vLLM v0.16.0rc0).

I obviously used Claude to draft this, but I reviewed it as a human. Apologies in advance if this isn't acceptable.

extent analysis

TL;DR

Apply the suggested fixes for the three bugs: add a HIP fallback for amdsmi platform detection, check the PYTORCH_ROCM_ARCH environment variable to avoid circular imports, and monkey-patch torch.cuda.device_count using HIP's hipGetDeviceCount.

Guidance

  • To fix Bug 1, modify the rocm_platform_plugin function in platforms/__init__.py to fall back to HIP-based detection when amdsmi fails.
  • For Bug 2, check the PYTORCH_ROCM_ARCH environment variable before calling logger.warning_once in platforms/rocm.py, and replace warning_once with a regular warning to avoid circular imports.
  • To address Bug 3, monkey-patch torch.cuda.device_count (and torch.accelerator.device_count) at module load time using HIP's hipGetDeviceCount.
  • Additionally, consider applying the workaround for the GGUF weight-loading OOM issue by using the --cpu-offload-gb option.

Example

The suggested fix for Bug 1 can be implemented as follows:

def rocm_platform_plugin() -> str | None:
    # Try amdsmi first (existing code)...
    # ...
    # Fallback: HIP-based detection
    try:
        import ctypes
        hip = ctypes.CDLL("libamdhip64.so")
        count = ctypes.c_int()
        result = hip.hipGetDeviceCount(ctypes.byref(count))
        if result == 0 and count.value > 0:
            return "vllm.platforms.rocm.RocmPlatform"
    except Exception:
        pass
    return None

Notes

The provided fixes are workarounds and may not be the final solution. The root causes of the issues are related to the interaction between vLLM, ROCm, and the container environment. Further investigation and collaboration with the upstream developers may be necessary to resolve the issues permanently.

Recommendation

Apply the suggested workarounds to get past the immediate startup failures, and move to a vLLM or ROCm release that includes proper fixes once one is available. The workarounds are temporary: they patch around the detection and import problems rather than resolving them upstream.
