vllm - 💡(How to fix) Fix [Bug]: vLLM fails to start on RDNA 4 (gfx1201) inside containers — amdsmi, circular import, and torch.cuda.device_count() all broken [1 comments, 2 participants]

vllm · 2026-04-17 01:33:54

vllm-project/vllm#40081•Fetched 2026-04-17 08:27:18

Author

sleeepss

Your current environment

<details> <summary>Environment details (no collect_env.py — vLLM won't start without the patches)</summary>

OS: CachyOS (Arch-based), kernel 6.19.x-cachyos
CPU: AMD Ryzen 9 5950X
GPU: AMD Radeon RX 9070 XT (gfx1201 / RDNA 4, 16 GB VRAM)
ROCm: 7.12 nightly (TheRock toolchain, via bluefalcon13/vllm-rocm container)
PyTorch: 2.7.0a0 (built against ROCm 7.12)
vLLM: v0.16.0rc0
Container runtime: k3s (containerd) / podman
Flash Attention: hyoon1/flash-attention enable-ck-gfx12 branch (Composable Kernel)

Note: python collect_env.py is not available because vLLM crashes during platform detection before reaching any usable state. The patches in this report are required to get it to start.

</details>

🐛 Describe the bug

Running vLLM inside a container on an RDNA 4 GPU (RX 9070 XT, gfx1201) hits three sequential failures that prevent the engine from starting. The GPU is fully functional via HIP — hipGetDeviceCount() returns 1, inference works once you get past these bugs. The issues are all in vLLM's platform detection and initialization path, not in the GPU or ROCm stack.

I've published workaround patches and a full writeup at sleeepss/vllm-rdna4-container-patches. Filing this to get the issues on upstream's radar.

Note: I tested on v0.16.0rc0 via bluefalcon13/vllm-rocm. I've confirmed that rocm_platform_plugin() in platforms/__init__.py is still amdsmi-only on current main (no HIP fallback). Open issues #24576, #34573, and #39378 report the same detection failure on other hardware — this issue adds the RDNA 4 container-specific angle with root cause analysis and working patches.

Bug 1: `amdsmi` fails to initialize → platform detection fails

vllm/platforms/__init__.py → rocm_platform_plugin() calls amdsmi to detect ROCm. Inside a container, amdsmi fails with AMDSMI_STATUS_NOT_INIT despite the GPU being fully accessible via HIP. vLLM's platform detection returns None and the engine exits with "no platform found."

Root cause: amdsmi requires sysfs/hwmon paths that aren't always exposed inside unprivileged containers. HIP (via libamdhip64.so → hipGetDeviceCount) works fine because it goes through /dev/kfd + /dev/dri, which are mounted. This is a known upstream ROCm issue — see ROCm/ROCm#5000 (amdsmi Error code 34), ROCm/amdsmi#75 (driver not initialized despite working rocminfo). Additionally, ROCm/k8s-device-plugin#65 documents that the k8s device plugin sets /dev/dri permissions to rw instead of rwm, causing amdgpu_device_initialize to fail.
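
A quick way to check the device-node side of this from inside the container is sketched below. It is only a diagnostic sketch: the helper name `check_gpu_device_nodes` is made up for illustration, and `os.access` only reports ordinary read/write permission, so it cannot distinguish the `rw` vs `rwm` cgroup device permissions described in ROCm/k8s-device-plugin#65.

```python
import os


def check_gpu_device_nodes(paths=("/dev/kfd", "/dev/dri")):
    """Report whether the ROCm device nodes are visible to this process.

    Hypothetical helper for illustration; not part of vLLM or ROCm.
    """
    report = {}
    for path in paths:
        report[path] = {
            "exists": os.path.exists(path),
            # os.access checks the effective permissions of the current user.
            "readable": os.access(path, os.R_OK),
            "writable": os.access(path, os.W_OK),
        }
    return report


if __name__ == "__main__":
    for path, status in check_gpu_device_nodes().items():
        print(path, status)
```

If both nodes show up as present and accessible while amdsmi still fails to initialize, that is consistent with the sysfs/hwmon explanation above, though it does not prove it.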

Suggested fix: Fall back to HIP-based detection when amdsmi fails. A ctypes call to hipGetDeviceCount is sufficient:

def rocm_platform_plugin() -> str | None:
    # Try amdsmi first (existing code)...
    # ...
    # Fallback: HIP-based detection
    try:
        import ctypes
        hip = ctypes.CDLL("libamdhip64.so")
        count = ctypes.c_int()
        result = hip.hipGetDeviceCount(ctypes.byref(count))
        if result == 0 and count.value > 0:
            return "vllm.platforms.rocm.RocmPlatform"
    except Exception:
        pass
    return None

Bug 2: Circular import in GCN arch fallback via `logger.warning_once()`

Once you get past Bug 1, vllm/platforms/rocm.py → _get_gcn_arch() tries amdsmi for the GCN architecture string. That fails (same amdsmi issue). The except block calls logger.warning_once(), which during module init triggers:

warning_once() → vllm.distributed.parallel_state → vllm.utils.system_utils → vllm.platforms

This is a circular import back into the module that's still initializing. Result: ImportError: cannot import name 'current_platform'.
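
The failure mode can be reproduced outside vLLM with two throwaway modules. The sketch below is not vLLM code: `demo_platforms` and `demo_logging` are invented stand-ins for `vllm.platforms` and the logging helper, and the import chain is shortened to a single hop.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path


def demo_circular_import() -> str:
    """Return the ImportError message produced by a minimal circular import."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Stand-in for the platform module: it calls a helper during its own
        # import, before current_platform has been defined.
        (root / "demo_platforms.py").write_text(textwrap.dedent("""
            import demo_logging
            demo_logging.warn_once("amdsmi failed")
            current_platform = "rocm"
        """))
        # Stand-in for the helper: it imports back into the module that is
        # still initializing.
        (root / "demo_logging.py").write_text(textwrap.dedent("""
            def warn_once(msg):
                from demo_platforms import current_platform
                print(msg, current_platform)
        """))
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            importlib.import_module("demo_platforms")
        except ImportError as exc:
            return str(exc)
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("demo_platforms", None)
            sys.modules.pop("demo_logging", None)
    return "import succeeded"


if __name__ == "__main__":
    print(demo_circular_import())
```

The name lookup fails because the first module is already registered in `sys.modules` but has not yet reached the line that defines `current_platform`.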

Suggested fix: Check PYTORCH_ROCM_ARCH env var before calling logger.warning_once(). If set, return it immediately. Also replace warning_once() with regular warning() to avoid the circular import path entirely:

except Exception as e:
    logger.debug("Failed to get GCN arch via amdsmi: %s", e)
# Env var fallback — avoids circular import from warning_once during module init
arch_env = os.environ.get("PYTORCH_ROCM_ARCH", "")
if arch_env:
    logger.info("Using PYTORCH_ROCM_ARCH=%s for GCN arch", arch_env)
    return arch_env
logger.warning(
    "Failed to get GCN arch via amdsmi, falling back to torch.cuda. "
    "This will initialize CUDA and may cause "
    "issues if CUDA_VISIBLE_DEVICES is not set yet."
)
return torch.cuda.get_device_properties("cuda").gcnArchName
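
The same fallback order can be written as a standalone function, which makes it easy to test without a GPU. This is a sketch under assumptions: `resolve_gcn_arch`, `amdsmi_probe`, and `torch_probe` are hypothetical names, the two probes are injected callables rather than the real amdsmi and torch calls, and it uses only the standard `logging` module so nothing vLLM-specific is imported during module init.

```python
import logging
import os
from typing import Callable, Mapping, Optional

logger = logging.getLogger(__name__)


def resolve_gcn_arch(
    amdsmi_probe: Callable[[], str],
    torch_probe: Callable[[], str],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Resolve the GCN arch: amdsmi first, then env var, then torch."""
    env = os.environ if env is None else env
    try:
        return amdsmi_probe()
    except Exception as exc:
        logger.debug("Failed to get GCN arch via amdsmi: %s", exc)

    # Env var fallback: checked before anything that could initialize the GPU.
    arch_env = env.get("PYTORCH_ROCM_ARCH", "").strip()
    if arch_env:
        logger.info("Using PYTORCH_ROCM_ARCH=%s for GCN arch", arch_env)
        return arch_env

    # Last resort: plain warning() rather than a helper with extra imports.
    logger.warning("Failed to get GCN arch via amdsmi, falling back to torch.cuda.")
    return torch_probe()
```

With `PYTORCH_ROCM_ARCH=gfx1201` set in the container, the torch probe is never reached in this sketch.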

Bug 3: `torch.cuda.device_count()` returns 0

With Bugs 1 and 2 fixed, vLLM starts but torch.cuda.device_count() returns 0 despite the GPU working via HIP. vLLM's data-parallel code then asserts rank 0 is out of bounds and crashes. This matches the pattern in ROCm/ROCm#5461 (rocminfo works, device_count 0) and ROCm/HIP#3710 (hipErrorNoDevice from PyTorch).

Workaround: Monkey-patch torch.cuda.device_count (and torch.accelerator.device_count) at module load time using HIP's hipGetDeviceCount:

def _patch_torch_device_count():
    import torch
    if torch.cuda.device_count() == 0:
        import ctypes
        hip = ctypes.CDLL("libamdhip64.so")
        count = ctypes.c_int()
        if hip.hipGetDeviceCount(ctypes.byref(count)) == 0 and count.value > 0:
            n = count.value
            torch.cuda.device_count = lambda: n
            if hasattr(torch, "accelerator"):
                torch.accelerator.device_count = lambda: n

This is a global mutation and not a clean fix — flagging it as something that should be handled properly in vLLM's ROCm platform layer or upstream in PyTorch's ROCm backend.
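
For anyone applying the workaround in the meantime, a slightly more defensive variant might look like the sketch below. It is not the patch from the linked repository: the function names are illustrative, the torch module is passed in as an argument so the logic can be exercised without a GPU, and a missing `libamdhip64.so` is treated as "no devices" instead of raising.

```python
import ctypes
from typing import Callable, Optional


def hip_device_count(lib_name: str = "libamdhip64.so") -> Optional[int]:
    """Return the HIP device count, or None if HIP cannot be queried."""
    try:
        hip = ctypes.CDLL(lib_name)
        get_count = hip.hipGetDeviceCount
    except (OSError, AttributeError):
        return None
    get_count.argtypes = [ctypes.POINTER(ctypes.c_int)]
    get_count.restype = ctypes.c_int
    count = ctypes.c_int(0)
    # A return value of 0 is hipSuccess.
    if get_count(ctypes.byref(count)) != 0:
        return None
    return count.value


def patch_device_count(
    torch_module,
    probe: Callable[[], Optional[int]] = hip_device_count,
) -> bool:
    """Override device_count only when torch reports 0 and HIP reports > 0.

    Returns True if the override was applied. Still a global mutation.
    """
    if torch_module.cuda.device_count() != 0:
        return False
    n = probe()
    if not n:
        return False
    torch_module.cuda.device_count = lambda: n
    accelerator = getattr(torch_module, "accelerator", None)
    if accelerator is not None:
        accelerator.device_count = lambda: n
    return True
```

Returning a boolean lets the caller log whether the override took effect, which helps when comparing behaviour across container runtimes.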

Bonus: GGUF weight-loading OOM on 16 GB cards

Not a code bug, but a sharp edge worth documenting. Once vLLM starts and begins loading a 14B Q4_K_M GGUF (~9 GB on disk), _create_padded_weight_param allocates a temp FP16 buffer for merged/padded weights and OOMs:

torch.OutOfMemoryError: HIP out of memory. Tried to allocate 96.00 MiB.
GPU 0 has a total capacity of 15.92 GiB of which 5.71 MiB is free.
Of the allocated memory 6.84 GiB is allocated by PyTorch...

This happens during weight loading, before KV cache allocation, so --gpu-memory-utilization and --max-model-len do not help. The workaround is --cpu-offload-gb 4, which uses vLLM's UVA zero-copy offload path to keep 4 GB of weights in pinned CPU RAM.
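
As a rough illustration of the workaround, the snippet below assembles the launch command. The model path is a placeholder, not the file used in this report, and `build_serve_command` is a hypothetical helper; the only flag taken from the text above is `--cpu-offload-gb`.

```python
import shlex


def build_serve_command(model: str, cpu_offload_gb: int = 4) -> list[str]:
    """Build a `vllm serve` argv list with CPU weight offload enabled."""
    return ["vllm", "serve", model, "--cpu-offload-gb", str(cpu_offload_gb)]


if __name__ == "__main__":
    # Placeholder path; substitute the GGUF file you are actually loading.
    print(shlex.join(build_serve_command("/models/example-14b-q4_k_m.gguf")))
```

The right offload size depends on the model and on how much VRAM the temporary buffers need, so 4 GB should be read as the value that worked here rather than a general recommendation.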

This may be related to #22814 (GGUF loader system RAM bloat via torch.tensor() vs torch.from_numpy()). I hit a similar symptom shape but in VRAM.

Related upstream issues

vLLM — same detection failure, different hardware:

#24576 — "No module named 'amdsmi'" on ROCm 6.4.2, gfx1151 (open)
#34573 — "No HIP GPUs are available" on gfx1151 in container (open)
#39378 — "Failed to infer device type" on 7900XTX + ROCm 7.2 + v0.19.0 (open)

vLLM — RDNA 4 specific:

#28649 — FP8 WMMA feature request for gfx1201 (open)
#28052 — Flash attention error on gfx1201 in Docker (closed, fixed by PR #31062)
PR #38086 — FP8 MoE enabled for gfx1201 on main (merged Apr 2, 2026)
PR #38455 — gfx1201 device ID mapping added on main (merged Apr 10, 2026)

ROCm upstream — amdsmi and container detection:

ROCm/ROCm#5000 — amdsmi Error code 34 init failure (closed without fix)
ROCm/amdsmi#75 — "driver not initialized" despite working rocminfo (open)
ROCm/k8s-device-plugin#65 — Container device permissions rw vs rwm causing PyTorch detection failure
ROCm/ROCm#5461 — rocminfo works, torch.cuda.device_count() returns 0
ROCm/HIP#3710 — hipErrorNoDevice from PyTorch

ROCm upstream — RDNA 4 in containers:

ROCm/ROCm#5812 — RX 9070 XT HSA discovery hangs
ROCm/ROCm#5581 — gfx1201 Docker stalls on ROCm 7.x
ROCm/ROCm#5656 — vLLM multi-node segfault with R9700 in Docker

Summary

| Bug | File | Impact | Clean fix difficulty |
| --- | --- | --- | --- |
| amdsmi platform detection | `platforms/__init__.py` | Engine won't start | Low: add HIP fallback |
| Circular import via warning_once | `platforms/rocm.py` | Engine won't start | Low: env var check + replace warning_once |
| torch.cuda.device_count() = 0 | `platforms/rocm.py` | Engine crashes on DP init | Medium: needs proper HIP integration in torch or vLLM platform layer |
| GGUF weight-loading OOM | Weight loader | OOM on 16 GB cards | Unknown: may need loader refactor |

Full patches, Dockerfile diff, and k8s deployment manifest: sleeepss/vllm-rdna4-container-patches

AI disclosure: I used Claude to help draft this issue. All patches have been verified on my own hardware (RX 9070 XT, ROCm 7.12 nightly, vLLM v0.16.0rc0).

I obviously used Claude to draft this, but I reviewed it as a human. Apologies in advance if this isn't acceptable.

extent analysis

TL;DR

Apply the suggested fixes for the three bugs: add a HIP fallback for amdsmi platform detection, check the PYTORCH_ROCM_ARCH environment variable to avoid circular imports, and monkey-patch torch.cuda.device_count using HIP's hipGetDeviceCount.

Guidance

To fix Bug 1, modify the rocm_platform_plugin function in platforms/__init__.py to fall back to HIP-based detection when amdsmi fails.
For Bug 2, check the PYTORCH_ROCM_ARCH environment variable before calling logger.warning_once in platforms/rocm.py, and replace warning_once with a regular warning to avoid circular imports.
To address Bug 3, monkey-patch torch.cuda.device_count (and torch.accelerator.device_count) at module load time using HIP's hipGetDeviceCount.
Additionally, consider applying the workaround for the GGUF weight-loading OOM issue by using the --cpu-offload-gb option.

Notes

The provided fixes are workarounds and may not be the final solution. The root causes of the issues are related to the interaction between vLLM, ROCm, and the container environment. Further investigation and collaboration with the upstream developers may be necessary to resolve the issues permanently.

Recommendation

Apply the suggested workarounds to unblock startup now, and re-test on newer vLLM or ROCm releases as they appear, since the workarounds are temporary and a later version may include proper fixes.
