ollama - ✅(Solved) Fix Bug: --ollama-engine runner ignores GGML_CUDA_INIT=1, breaking ROCm bootstrap on gfx1151 APUs [1 pull request, 8 comments, 2 participants]

GitHub stats: ollama/ollama#15420, fetched 2026-04-09 07:51:21
Comments: 8 · Participants: 2 · Timeline events: 11 (commented ×8, closed ×1, mentioned ×1, subscribed ×1) · Reactions: 0



PR fix notes

PR #15509: Add OLLAMA_SKIP_GPU_VALIDATION env var to bypass broken GPU validation on Strix Halo (gfx1151)

Description (problem / solution / changelog)

Problem

The GPU validation subprocess added in 0.18+ silently filters out AMD GPUs that crash during the deep init check. This affects AMD Strix Halo (gfx1151) and is reported in:

  • #15336 — "ollama 17.7 last version working on strix halo, all 18.x fallback to cpu"
  • #13589 — "gfx1151 silently falls back to CPU on Linux despite rocminfo detecting GPU"
  • #15261 — "Vulkan causing unrelated output with gemma4:e4b (AMD/Ryzen iGPU)"

Root cause

Two separate crashes prevent gfx1151 from working on 0.18+:

1. Bootstrap validation crash

NeedsInitValidation() triggers a runner subprocess with GGML_CUDA_INIT=1 that calls rocblas_initialize(). On gfx1151 with the bundled ROCm libraries, this crashes because TensileLibrary_lazy_gfx1151.dat cannot be loaded from the expected hipblaslt path. The Go discovery code interprets the empty subprocess output as "filtering device which didn't fully initialize" and removes the GPU.

2. Worst-case graph reservation crash

Even after working around the bootstrap, reserveWorstCaseGraph() in the new ollamarunner calls ggml_backend_sched_reserve() which crashes with SIGSEGV inside libamdhip64 — a HIP runtime memory allocator bug specific to gfx1151.

Fix

This patch adds an OLLAMA_SKIP_GPU_VALIDATION env var that:

  1. Skips NeedsInitValidation() for ROCm/CUDA devices (so the bootstrap subprocess uses bare device enumeration without the crashing rocblas init)
  2. Skips reserveWorstCaseGraph() in ollamarunner.allocModel() (memory is allocated lazily during inference instead, which works fine in practice)

The user takes responsibility for ensuring their GPU is actually compatible. This is documented in the env var description.
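
A minimal sketch of the first half of this change, assuming the opt-out is checked before the existing library test. The PR reads the variable through the envconfig package, whose helper is not shown here, so `skipGPUValidation` and the pared-down `DeviceInfo` below are hypothetical stand-ins rather than the merged code.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// DeviceInfo is a pared-down stand-in for the struct in ml/device.go.
type DeviceInfo struct {
	Library string
}

// skipGPUValidation reports whether OLLAMA_SKIP_GPU_VALIDATION holds a true
// value ("1", "true", ...). Hypothetical helper: the PR uses envconfig instead.
func skipGPUValidation() bool {
	v, err := strconv.ParseBool(os.Getenv("OLLAMA_SKIP_GPU_VALIDATION"))
	return err == nil && v
}

// needsInitValidation applies the opt-out first, then the original rule
// that ROCm and CUDA devices require the deep init check.
func needsInitValidation(d DeviceInfo, skip bool) bool {
	if skip {
		return false
	}
	return d.Library == "ROCm" || d.Library == "CUDA"
}

func main() {
	d := DeviceInfo{Library: "ROCm"}
	fmt.Println("validate:", needsInitValidation(d, skipGPUValidation()))
}
```

Passing the flag as a parameter keeps the decision testable without touching the process environment; with the variable unset, behavior is unchanged.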

Tested

  • Hardware: AMD Ryzen AI MAX+ PRO 395 (Strix Halo, gfx1151), 96GB GTT
  • OS: Debian 12 in unprivileged Proxmox LXC, kernel 6.17
  • Drivers: mesa-vulkan-drivers 25.0.7 from bookworm-backports
  • Ollama: built from this branch

Results with OLLAMA_SKIP_GPU_VALIDATION=1 and OLLAMA_VULKAN=1:

Model        Backend               Avg latency (warm)   Tokens/call
qwen3.5:4b   Vulkan (gfx1151)      1.89s                ~63
qwen3.5:4b   CPU (without patch)   15.6s                ~155

Performance via Vulkan is comparable to or faster than 0.17.7 with native ROCm support. Full 33/33 layers offload to GPU. KHR_coopmat cooperative matrix support is active.
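
Because the two rows report different token counts per call, latency alone does not give a like-for-like comparison. The sketch below converts the table into throughput, assuming the latency covers the whole call and that tokens/call counts generated tokens; the PR does not state either, so treat the result as a rough estimate.

```go
package main

import "fmt"

// tokensPerSecond converts tokens per call and average latency into throughput.
func tokensPerSecond(tokens, latencySeconds float64) float64 {
	return tokens / latencySeconds
}

func main() {
	vulkan := tokensPerSecond(63, 1.89) // Vulkan row of the table
	cpu := tokensPerSecond(155, 15.6)   // CPU row of the table
	fmt.Printf("Vulkan: %.1f tok/s\n", vulkan)
	fmt.Printf("CPU:    %.1f tok/s\n", cpu)
	fmt.Printf("ratio:  %.1fx\n", vulkan/cpu)
}
```

Under these assumptions the Vulkan path comes out at roughly 33 tokens per second against roughly 10 on CPU.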

Risk

  • Low blast radius: opt-in via env var, no behavior change for users who don't set it
  • No new dependencies: uses existing envconfig package
  • Backwards compatible: existing GPU validation logic untouched

Future work

The underlying bugs in rocblas tensile loading and HIP memory allocator should ideally be fixed upstream, but this gives Strix Halo users a working escape hatch in the meantime without forking.

Changed files

  • envconfig/config.go (modified, +6/-0)
  • ml/device.go (modified, +6/-0)
  • runner/ollamarunner/runner.go (modified, +9/-0)


Bug: --ollama-engine runner ignores GGML_CUDA_INIT=1, breaking ROCm bootstrap on gfx1151 APUs

System Information

  • CPU/GPU: AMD Ryzen AI MAX+ 395 (Strix Halo) — Radeon 8060S Graphics (gfx1151)
  • RAM: 128 GB unified memory (~92 GB available as GPU VRAM via Vulkan, ~64 GB dedicated VRAM)
  • OS: Ubuntu 25.10, kernel 6.x
  • ROCm: 7.2.1 at /opt/rocm-7.2.1
  • Ollama: v0.20.3
  • Vulkan: mesa-vulkan-drivers 25.2.8, libvulkan1 1.4.321

Summary

Ollama 0.20.3 uses a two-step GPU bootstrap:

  1. Discovery probe (no GGML_CUDA_INIT) → finds GPU ✓
  2. Verification probe (with GGML_CUDA_INIT=1) → supposed to confirm GPU can init → always fails

The verification probe starts the new --ollama-engine runner subprocess, which:

  • Starts and listens on a port
  • Returns {"status":2,"progress":0} from /health
  • Never opens /dev/kfd — GPU is never initialized
  • Status stays at 2 forever (never becomes 1 = "ready")
  • After ~91-101ms Ollama kills it and logs: filtering device which didn't fully initialize

Root cause: GGML_CUDA_INIT=1 was an env var for the old llama.cpp/ggml runner. The new --ollama-engine runner (Go-based) does not implement it. The runner ignores the variable entirely and never initializes the GPU during the bootstrap health check. This means ROCm devices that require NeedsInitValidation() (all ROCm and CUDA devices per ml/device.go:541) are always filtered out on 0.20.x.

What Works

  • rocminfo + rocm-smi recognize the GPU with HSA_OVERRIDE_GFX_VERSION=11.5.1
  • gfx1151 kernels confirmed compiled into /usr/local/lib/ollama/rocm/libggml-hip.so
  • All ROCm shared libraries load fine (ldd shows no missing deps)
  • ollama user is in render + video groups → has /dev/kfd and /dev/dri access
  • Discovery probe finds the GPU correctly (finds "Radeon 8060S Graphics, gfx1151")
  • Vulkan backend works with OLLAMA_VULKAN=1 — 4.12 tok/s on qwen2.5:72b-instruct-q5_K_M (100% GPU, 92 GiB available)
  • CPU inference works, but at roughly half the Vulkan speed (2.13 tok/s on the same model)

Detailed Evidence

1. Discovery succeeds, verification fails

From OLLAMA_DEBUG=1 journal output:

discovering available GPUs...
bootstrap discovery took duration=189.56ms OLLAMA_LIBRARY_PATH="[... /rocm]"
evaluating which, if any, devices to filter out initial_count=1
verifying if device is supported library=/usr/local/lib/ollama/rocm description="Radeon 8060S Graphics" compute=gfx1151 id=0
starting runner cmd="/usr/local/bin/ollama runner --ollama-engine --port 37243"
subprocess ... LD_LIBRARY_PATH=.../rocm ROCR_VISIBLE_DEVICES=0 GGML_CUDA_INIT=1
bootstrap discovery took duration=101.133ms ... extra_envs="map[GGML_CUDA_INIT:1 ROCR_VISIBLE_DEVICES:0]"
filtering device which didn't fully initialize id=0 libdir=/usr/local/lib/ollama/rocm library=ROCm
inference compute id=cpu library=cpu ... total="61.2 GiB" available="54.4 GiB"

2. Runner permanently stuck at status:2

Manually starting the runner with the exact same env vars Ollama uses:

HSA_OVERRIDE_GFX_VERSION=11.5.1 \
LD_LIBRARY_PATH=/usr/local/lib/ollama:/usr/local/lib/ollama/rocm \
ROCR_VISIBLE_DEVICES=0 \
GGML_CUDA_INIT=1 \
/usr/local/bin/ollama runner --ollama-engine --port 55555

Polling health:

sleep 1;  curl -s http://127.0.0.1:55555/health → {"status":2,"progress":0}
sleep 5;  curl -s http://127.0.0.1:55555/health → {"status":2,"progress":0}
sleep 10; curl -s http://127.0.0.1:55555/health → {"status":2,"progress":0}

Same result without GGML_CUDA_INIT=1 — confirming the variable is completely ignored.

lsof confirms the runner never opens /dev/kfd — the GPU is never touched.
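
The health check above can be expressed as a small decoder. This is a sketch only: the field names come from the JSON shown in the curl output, and the meaning of status 1 ("ready") is taken from the summary; the `health` type and helper names are hypothetical, not the runner's own.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// health mirrors the two fields seen in the /health responses above.
type health struct {
	Status   int     `json:"status"`
	Progress float64 `json:"progress"`
}

// statusReady is the value the issue describes as "ready".
const statusReady = 1

// parseHealth decodes a /health response body.
func parseHealth(body []byte) (health, error) {
	var h health
	err := json.Unmarshal(body, &h)
	return h, err
}

// isReady reports whether the runner has reached the ready state.
func isReady(h health) bool {
	return h.Status == statusReady
}

func main() {
	h, err := parseHealth([]byte(`{"status":2,"progress":0}`))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("ready:", isReady(h))
}
```

Fed the response captured above, the decoder reports that the runner is not ready, which matches the stuck state observed at 1, 5 and 10 seconds.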

3. Vulkan works perfectly

With OLLAMA_VULKAN=1 and VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.json:

inference compute id=... library=Vulkan name=Vulkan0 description="AMD Radeon Graphics (RADV GFX1151)" total="94.6 GiB" available="92.2 GiB"
ollama ps → qwen2.5:72b-instruct-q5_K_M  68 GB  100% GPU
eval rate: 4.12 tokens/s

4. OLLAMA_NEW_ENGINE=false has no effect

Setting OLLAMA_NEW_ENGINE=false in the systemd override does not change behavior — the runner still starts with --ollama-engine.

Code Analysis

The issue is in the interaction between two functions in ml/device.go:

// Line 538
func (d DeviceInfo) NeedsInitValidation() bool {
    return d.Library == "ROCm" || d.Library == "CUDA"
}

// Line 545
func (d DeviceInfo) AddInitValidation(env map[string]string) {
    env["GGML_CUDA_INIT"] = "1"
}

And discover/runner.go line 152:

if len(bootstrapDevices(ctx2ndPass, devices[i].LibraryPath, extraEnvs)) == 0 {
    // filtering device which didn't fully initialize
    needsDelete[i] = true
}

The old llama.cpp runner honored GGML_CUDA_INIT=1 by deeply initializing the GPU during startup. The new --ollama-engine runner doesn't implement this — it starts in an idle state (status:2) waiting for a model load, regardless of GGML_CUDA_INIT.
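
The interaction can be sketched as a filter over the discovered devices. The `device` type and `probeFunc` below are hypothetical stand-ins: `probeFunc` plays the role of the second `bootstrapDevices` call and returns how many devices the verification runner reported.

```go
package main

import "fmt"

// device is a simplified stand-in for a discovered GPU.
type device struct {
	ID      string
	Library string
}

// probeFunc stands in for the verification pass: it returns the number of
// devices the runner reported when started with the init-validation env.
type probeFunc func(d device) int

// filterUnverified drops devices that require init validation but whose
// verification probe reported nothing.
func filterUnverified(devices []device, probe probeFunc) []device {
	kept := make([]device, 0, len(devices))
	for _, d := range devices {
		needsValidation := d.Library == "ROCm" || d.Library == "CUDA"
		if needsValidation && probe(d) == 0 {
			continue // "filtering device which didn't fully initialize"
		}
		kept = append(kept, d)
	}
	return kept
}

func main() {
	devices := []device{{ID: "0", Library: "ROCm"}}
	// A probe that never reports a device, as observed on gfx1151.
	neverReady := func(device) int { return 0 }
	fmt.Println("remaining devices:", len(filterUnverified(devices, neverReady)))
}
```

With a probe that always comes back empty, every ROCm or CUDA device is removed and only CPU inference remains, which is the behavior shown in the log.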

Proposed Fix

Option A (minimal, targeted): Skip init validation for gfx115x APUs since the verification mechanism is broken:

func (d DeviceInfo) NeedsInitValidation() bool {
    if d.Library == "ROCm" && strings.HasPrefix(d.Compute(), "gfx115") {
        return false
    }
    return d.Library == "ROCm" || d.Library == "CUDA"
}

Option B (proper fix): Make the --ollama-engine runner implement GGML_CUDA_INIT=1 by actually initializing the GPU backend during startup and reporting status:1 once successful.

Option C (pragmatic): Increase the bootstrap timeout and have the runner respond to GGML_CUDA_INIT=1 with a quick GPU probe that returns status:1 if /dev/kfd is accessible and the ROCm runtime loads.
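
A minimal sketch of the device-node half of Option C, assuming that opening /dev/kfd read-write is an acceptable accessibility check. It does not load the ROCm runtime, and `canOpenDevice` is a hypothetical helper, not an existing Ollama function.

```go
package main

import (
	"fmt"
	"os"
)

// canOpenDevice reports whether path can be opened read-write by the
// current process. It only tests access; it does not initialize the GPU.
func canOpenDevice(path string) bool {
	f, err := os.OpenFile(path, os.O_RDWR, 0)
	if err != nil {
		return false
	}
	f.Close()
	return true
}

func main() {
	fmt.Println("/dev/kfd accessible:", canOpenDevice("/dev/kfd"))
}
```

A probe like this would catch missing render/video group membership quickly, but it cannot detect failures inside the ROCm libraries themselves.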

Current Workaround

Vulkan backend via systemd override:

[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="HSA_OVERRIDE_GFX_VERSION=11.5.1"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_DEBUG=1"
Environment="OLLAMA_VULKAN=1"
Environment="VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.json"

This works (4.12 tok/s on 72B) but native ROCm should be significantly faster based on community reports (~40 tok/s on 30B models with v0.18.0 where the old runner still worked).

Related Issues

  • #14855 — Working ROCm guide for gfx1151 (uses v0.18.0, before the --ollama-engine change)
  • #13589 — gfx1151 silently falls back to CPU on Linux
  • #12062 — APU VRAM/GTT memory reporting issue
  • rjmalagon/ollama-linux-amd-apu#37 — Identical bootstrap failure with ROCm v7 and gfx1151

extent analysis

TL;DR

The most likely fix is either to skip init validation for gfx115x APUs in NeedsInitValidation (ml/device.go) or to implement GGML_CUDA_INIT=1 in the --ollama-engine runner.

Guidance

  • The issue is caused by the new --ollama-engine runner not implementing GGML_CUDA_INIT=1, which is required for ROCm devices to initialize properly.
  • To fix this, you can try one of the proposed options: skip init validation for gfx115x APUs, implement GGML_CUDA_INIT=1 in the runner, or increase the bootstrap timeout and have the runner respond to GGML_CUDA_INIT=1 with a quick GPU probe.
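  • To verify a fix, check that the OLLAMA_DEBUG=1 log no longer prints "filtering device which didn't fully initialize", that the "inference compute" line reports the GPU rather than id=cpu, and that the eval rate improves over the CPU baseline.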
  • As a temporary workaround, you can use the Vulkan backend by setting OLLAMA_VULKAN=1 in the systemd override.

Notes

  • The issue is specific to the --ollama-engine runner and ROCm devices, so the fix may not apply to other configurations.
  • The proposed fixes are based on the code analysis and may require further testing and verification.

Recommendation

As an immediate workaround, set OLLAMA_VULKAN=1 in the systemd override; it requires no code changes. For native ROCm performance, the proper fix is to implement GGML_CUDA_INIT=1 in the --ollama-engine runner.
