ollama - ✅(Solved) Fix Vulkan causing unrelated output with gemma4:e4b (AMD/Ryzen iGPU) [1 pull request, 9 comments, 10 participants]

GitHub stats
ollama/ollama#15261 · Fetched 2026-04-08 02:33:31
Comments: 9 · Participants: 10 · Timeline: 17 · Reactions: 9
Timeline (top): commented ×9, subscribed ×7, labeled ×1

Fix Action

Fix / Workaround


PR fix notes

PR #15509: Add OLLAMA_SKIP_GPU_VALIDATION env var to bypass broken GPU validation on Strix Halo (gfx1151)

Description (problem / solution / changelog)

Problem

The GPU validation subprocess added in 0.18+ silently filters out AMD GPUs that crash during the deep init check. This affects AMD Strix Halo (gfx1151) and is reported in:

  • #15336 — "ollama 17.7 last version working on strix halo, all 18.x fallback to cpu"
  • #13589 — "gfx1151 silently falls back to CPU on Linux despite rocminfo detecting GPU"
  • #15261 — "Vulkan causing unrelated output with gemma4:e4b (AMD/Ryzen iGPU)"

Root cause

Two separate crashes prevent gfx1151 from working on 0.18+:

1. Bootstrap validation crash

NeedsInitValidation() triggers a runner subprocess with GGML_CUDA_INIT=1 that calls rocblas_initialize(). On gfx1151 with the bundled ROCm libraries, this crashes because TensileLibrary_lazy_gfx1151.dat cannot be loaded from the expected hipblaslt path. The Go discovery code interprets the empty subprocess output as "filtering device which didn't fully initialize" and removes the GPU.

2. Worst-case graph reservation crash

Even after working around the bootstrap, reserveWorstCaseGraph() in the new ollamarunner calls ggml_backend_sched_reserve() which crashes with SIGSEGV inside libamdhip64 — a HIP runtime memory allocator bug specific to gfx1151.
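
The filtering behaviour described in the first crash can be illustrated with a minimal Python sketch. This is not Ollama's Go discovery code: `filter_devices` and `fake_probe` are hypothetical names, and the only behaviour modelled is the one stated above, namely that a probe producing no output is treated as a device that did not fully initialize.

```python
from typing import Callable, Iterable, List


def filter_devices(devices: Iterable[str], probe: Callable[[str], str]) -> List[str]:
    """Keep only the devices whose validation probe produced some output.

    A crashed probe subprocess is modelled as an empty string, so a crash in
    the probe looks the same as a device that is genuinely unusable.
    """
    kept = []
    for device in devices:
        output = probe(device)
        if not output.strip():
            # Corresponds to "filtering device which didn't fully initialize".
            continue
        kept.append(device)
    return kept


def fake_probe(device: str) -> str:
    # Hypothetical stand-in: pretend the deep init check crashes on gfx1151 only.
    return "" if device == "gfx1151" else f"{device}: ok"


print(filter_devices(["gfx1100", "gfx1151"], fake_probe))
```

The sketch shows why the fallback is silent: nothing in the empty output distinguishes a crash inside the validation step from a GPU that cannot run at all, so the device is simply dropped.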

Fix

This patch adds an OLLAMA_SKIP_GPU_VALIDATION env var that:

  1. Skips NeedsInitValidation() for ROCm/CUDA devices (so the bootstrap subprocess uses bare device enumeration without the crashing rocblas init)
  2. Skips reserveWorstCaseGraph() in ollamarunner.allocModel() (memory is allocated lazily during inference instead, which works fine in practice)

The user takes responsibility for ensuring their GPU is actually compatible. This is documented in the env var description.
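
As a rough illustration of the opt-in gating described above, the sketch below models the two decisions in Python. The actual patch is Go code in the files listed under "Changed files"; the function names, the set of accepted truthy values, and the default rule for ROCm/CUDA are assumptions made for this example, not taken from the PR.

```python
import os
from typing import Mapping, Optional

# Assumed truthy spellings; the PR's own parsing may differ.
TRUTHY = {"1", "true", "yes", "on"}


def skip_gpu_validation(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the user has opted out of GPU validation."""
    env = os.environ if env is None else env
    return env.get("OLLAMA_SKIP_GPU_VALIDATION", "").strip().lower() in TRUTHY


def needs_init_validation(library: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Model of step 1: the deep init check applies to ROCm/CUDA unless skipped."""
    if skip_gpu_validation(env):
        return False
    return library in {"ROCm", "CUDA"}


def should_reserve_worst_case_graph(env: Optional[Mapping[str, str]] = None) -> bool:
    """Model of step 2: reserve the worst-case graph up front unless skipped."""
    return not skip_gpu_validation(env)


if __name__ == "__main__":
    opted_out = {"OLLAMA_SKIP_GPU_VALIDATION": "1"}
    print(needs_init_validation("ROCm", {}))
    print(needs_init_validation("ROCm", opted_out))
    print(should_reserve_worst_case_graph(opted_out))
```

With the variable unset, both checks behave as before, which matches the "no behavior change for users who don't set it" point in the Risk section.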

Tested

  • Hardware: AMD Ryzen AI MAX+ PRO 395 (Strix Halo, gfx1151), 96GB GTT
  • OS: Debian 12 in unprivileged Proxmox LXC, kernel 6.17
  • Drivers: mesa-vulkan-drivers 25.0.7 from bookworm-backports
  • Ollama: built from this branch

Results with OLLAMA_SKIP_GPU_VALIDATION=1 and OLLAMA_VULKAN=1:

| Model | Backend | Avg latency (warm) | Tokens/call |
|---|---|---|---|
| qwen3.5:4b | Vulkan (gfx1151) | 1.89s | ~63 |
| qwen3.5:4b | CPU (without patch) | 15.6s | ~155 |

Performance via Vulkan is comparable to or faster than 0.17.7 with native ROCm support. Full 33/33 layers offload to GPU. KHR_coopmat cooperative matrix support is active.
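
The two rows report different token counts per call, so latency alone is not a like-for-like comparison. A small Python sketch, using only the figures from the table and treating the approximate token counts as exact, derives a per-token view as well:

```python
def tokens_per_second(tokens: float, latency_s: float) -> float:
    """Average generation throughput for one call."""
    return tokens / latency_s


# Figures copied from the results table; token counts are approximate ("~").
vulkan = {"latency_s": 1.89, "tokens": 63}
cpu = {"latency_s": 15.6, "tokens": 155}

latency_ratio = cpu["latency_s"] / vulkan["latency_s"]
vulkan_tps = tokens_per_second(vulkan["tokens"], vulkan["latency_s"])
cpu_tps = tokens_per_second(cpu["tokens"], cpu["latency_s"])
throughput_ratio = vulkan_tps / cpu_tps

print(f"latency ratio (CPU / Vulkan): {latency_ratio:.2f}x")
print(f"Vulkan throughput: {vulkan_tps:.1f} tokens/s")
print(f"CPU throughput: {cpu_tps:.1f} tokens/s")
print(f"throughput ratio (Vulkan / CPU): {throughput_ratio:.2f}x")
```

On these numbers the CPU run takes about 8.3 times as long per call, while the per-token gap is smaller (roughly 33 versus 10 tokens/s, about 3.4 times), because the CPU calls produced more tokens.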

Risk

  • Low blast radius: opt-in via env var, no behavior change for users who don't set it
  • No new dependencies: uses existing envconfig package
  • Backwards compatible: existing GPU validation logic untouched

Future work

The underlying bugs in rocblas tensile loading and HIP memory allocator should ideally be fixed upstream, but this gives Strix Halo users a working escape hatch in the meantime without forking.

Changed files

  • envconfig/config.go (modified, +6/-0)
  • ml/device.go (modified, +6/-0)
  • runner/ollamarunner/runner.go (modified, +9/-0)


What is the issue?

I use Ollama with Vulkan on an AMD Ryzen CPU/iGPU. When using gemma4:e4b or gemma4:e2b, I get strange, unrelated responses. Gemma also skips the thinking part.

I have included the log from loading gemma4:e4b with Vulkan.

Here's a sample prompt and response (notice that there's no thinking output):

>>> how many r's are there in word strawberry
...I can understand! You are asking for a very literal translation or clarification of the phrase "How many **isso**?"

Since "isso" is not a standard English word, and it wasn't preceded by a picture or context, I have to make some educated
guesses about what you might mean.

When I use qwen (with Vulkan), it behaves as expected:

❯ ollama run qwen3.5:2b
>>> how can r's are there in word strawberry?
Thinking...
Thinking Process:

1.  **Analyze the Request:**
    *   Question: "how can r's are there in word strawberry?"
    *   Core Task: Identify the occurrences of the letter 'r' in the word "strawberry".
    *   Target: Explain where the 'r's are in the word.
...clipped thinking output for brevity...
...done thinking.

There are **three** "r"s in the word "strawberry".

Here is how they are located:

**s** **t** **r** **a** **w** **b** **e** **r** **r** **y**

1.  **1st "r":** This is the **3rd letter** (inside the word **s-t-r-...**)
2.  **2nd "r":** This is the **8th letter** (inside the word ...-e-r-r-**r**-y? No, let's look closer).
3.  **3rd "r":** This is the **9th letter**.

Without Vulkan, gemma4:e4b works fine (notice that it actually thinks):

❯ OLLAMA_VULKAN=0 ollama serve
❯ ollama run gemma4:e4b
>>> how can r's are there in word strawberry?
Thinking...
Thinking Process:

1.  **Analyze the Request:** The user is asking "how can r's are there in word strawberry?". This is a counting/letter
identification task.
2.  **Identify the Target Word:** The word is "strawberry".
3.  **Identify the Target Letter:** The letter is 'r'.
4.  **Count the Occurrences:**
    *   S T R A W B E R R Y
    *   1st 'r': R (index 2)
    *   2nd 'r': R (index 8)
    *   3rd 'r': R (index 9)
5.  **Determine the Total Count:** There are three 'r's.
6.  **Formulate the Answer:** State the count clearly and provide the process (optional, but helpful).
...done thinking.

There are **3** 'r's in the word "strawberry".

Not sure if this bug is related to #15248.
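
The log below contains a failed Vulkan allocation followed by a layer backoff. As a reading aid, the Python sketch here extracts the failed request size and the reported available GPU memory from lines copied out of that log and converts the size to GiB. The regular expressions and function names are this example's own, written against these specific lines only.

```python
import re

GIB = 1024 ** 3

# Lines copied verbatim from the log in this issue.
LOG_LINES = [
    'time=2026-04-03T10:58:13.049+03:00 level=INFO source=sched.go:491 msg="gpu memory" '
    'id=00000000-0500-0000-0000-000000000000 library=Vulkan available="30.5 GiB" '
    'free="31.0 GiB" minimum="457.0 MiB" overhead="0 B"',
    "ggml_vulkan: Device memory allocation of size 5637144576 failed.",
    "ggml_vulkan: Requested buffer size exceeds device buffer size limit: ErrorOutOfDeviceMemory",
]

ALLOC_FAILED = re.compile(r"Device memory allocation of size (\d+) failed")
AVAILABLE = re.compile(r'msg="gpu memory".*available="([\d.]+) GiB"')


def failed_allocations_gib(lines):
    """Sizes, in GiB, of every allocation the Vulkan backend reported as failed."""
    sizes = []
    for line in lines:
        match = ALLOC_FAILED.search(line)
        if match:
            sizes.append(int(match.group(1)) / GIB)
    return sizes


def available_gib(lines):
    """Available GPU memory, in GiB, from each "gpu memory" scheduler line."""
    values = []
    for line in lines:
        match = AVAILABLE.search(line)
        if match:
            values.append(float(match.group(1)))
    return values


print(failed_allocations_gib(LOG_LINES), available_gib(LOG_LINES))
```

The failed request works out to 5.25 GiB against 30.5 GiB reported as available, which is consistent with the log's own message that a per-buffer size limit, rather than total memory, was exceeded. The log does not establish whether this is connected to the unrelated output.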

Relevant log output

OLLAMA_VULKAN=1 ollama serve
time=2026-04-03T10:58:04.777+03:00 level=INFO source=routes.go:1744 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:0 OLLAMA_DEBUG:INFO OLLAMA_DEBUG_LOG_REQUESTS:false OLLAMA_EDITOR: OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/alper/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NO_CLOUD:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false OLLAMA_VULKAN:true ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2026-04-03T10:58:04.777+03:00 level=INFO source=routes.go:1746 msg="Ollama cloud disabled: false"
time=2026-04-03T10:58:04.777+03:00 level=INFO source=images.go:499 msg="total blobs: 18"
time=2026-04-03T10:58:04.778+03:00 level=INFO source=images.go:506 msg="total unused blobs removed: 0"
time=2026-04-03T10:58:04.778+03:00 level=INFO source=routes.go:1802 msg="Listening on 127.0.0.1:11434 (version 0.20.0)"
time=2026-04-03T10:58:04.778+03:00 level=INFO source=runner.go:67 msg="discovering available GPUs..."
time=2026-04-03T10:58:04.779+03:00 level=INFO source=server.go:432 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 38177"
time=2026-04-03T10:58:04.840+03:00 level=INFO source=types.go:42 msg="inference compute" id=00000000-0500-0000-0000-000000000000 filter_id="" library=Vulkan compute=0.0 name=Vulkan0 description="AMD Radeon Graphics (RADV RENOIR)" libdirs=ollama,vulkan driver=0.0 pci_id=0000:05:00.0 type=iGPU total="32.3 GiB" available="31.0 GiB"
time=2026-04-03T10:58:04.840+03:00 level=INFO source=routes.go:1852 msg="vram-based default context" total_vram="32.3 GiB" default_num_ctx=32768


[GIN] 2026/04/03 - 10:58:12 | 200 |      41.695µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/04/03 - 10:58:12 | 200 |  228.390428ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2026/04/03 - 10:58:12 | 200 |  225.689364ms |       127.0.0.1 | POST     "/api/show"
time=2026-04-03T10:58:12.843+03:00 level=INFO source=server.go:432 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 40981"
time=2026-04-03T10:58:13.048+03:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
time=2026-04-03T10:58:13.049+03:00 level=INFO source=server.go:432 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /home/alper/.ollama/models/blobs/sha256-4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a --port 37115"
time=2026-04-03T10:58:13.049+03:00 level=INFO source=sched.go:484 msg="system memory" total="60.7 GiB" free="54.1 GiB" free_swap="12.0 GiB"
time=2026-04-03T10:58:13.049+03:00 level=INFO source=sched.go:491 msg="gpu memory" id=00000000-0500-0000-0000-000000000000 library=Vulkan available="30.5 GiB" free="31.0 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-04-03T10:58:13.049+03:00 level=INFO source=server.go:759 msg="loading model" "model layers"=43 requested=-1
time=2026-04-03T10:58:13.066+03:00 level=INFO source=runner.go:1417 msg="starting ollama engine"
time=2026-04-03T10:58:13.066+03:00 level=INFO source=runner.go:1452 msg="Server listening on 127.0.0.1:37115"
time=2026-04-03T10:58:13.071+03:00 level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:32768 KvCacheType: NumThreads:8 GPULayers:43[ID:00000000-0500-0000-0000-000000000000 Layers:43(0..42)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-03T10:58:13.135+03:00 level=INFO source=ggml.go:136 msg="" architecture=gemma4 file_type=Q4_K_M name="" description="" num_tensors=2131 num_key_values=55
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /usr/lib/ollama/vulkan/libggml-vulkan.so
time=2026-04-03T10:58:13.170+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
ggml_backend_vk_get_device_memory called: uuid 00000000-0500-0000-0000-000000000000
ggml_backend_vk_get_device_memory called: luid 0x0000000000000000
time=2026-04-03T10:58:13.193+03:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
time=2026-04-03T10:58:13.221+03:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=2.268673ms bounds=(0,0)-(2048,2048)
time=2026-04-03T10:58:13.335+03:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=114.602148ms size="[768 768]"
time=2026-04-03T10:58:13.335+03:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
time=2026-04-03T10:58:13.335+03:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
time=2026-04-03T10:58:13.336+03:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=117.853427ms shape="[2560 256]"
time=2026-04-03T10:58:13.461+03:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:32768 KvCacheType: NumThreads:8 GPULayers:43[ID:00000000-0500-0000-0000-000000000000 Layers:43(0..42)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
ggml_backend_vk_get_device_memory called: uuid 00000000-0500-0000-0000-000000000000
ggml_backend_vk_get_device_memory called: luid 0x0000000000000000
ggml_vulkan: Device memory allocation of size 5637144576 failed.
ggml_vulkan: Requested buffer size exceeds device buffer size limit: ErrorOutOfDeviceMemory
alloc_tensor_range: failed to allocate Vulkan0 buffer of size 5637144576
time=2026-04-03T10:58:13.788+03:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.10
time=2026-04-03T10:58:13.788+03:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.20
time=2026-04-03T10:58:13.789+03:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.30
time=2026-04-03T10:58:13.789+03:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.40
time=2026-04-03T10:58:13.789+03:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.50
time=2026-04-03T10:58:13.790+03:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.60
time=2026-04-03T10:58:13.790+03:00 level=INFO source=server.go:881 msg="model layout did not fit, applying backoff" backoff=0.70
time=2026-04-03T10:58:13.791+03:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:32768 KvCacheType: NumThreads:8 GPULayers:42[ID:00000000-0500-0000-0000-000000000000 Layers:42(0..41)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
ggml_backend_vk_get_device_memory called: uuid 00000000-0500-0000-0000-000000000000
ggml_backend_vk_get_device_memory called: luid 0x0000000000000000
time=2026-04-03T10:58:14.046+03:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
time=2026-04-03T10:58:14.068+03:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=2.508371ms bounds=(0,0)-(2048,2048)
time=2026-04-03T10:58:14.183+03:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=114.498922ms size="[768 768]"
time=2026-04-03T10:58:14.186+03:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
time=2026-04-03T10:58:14.186+03:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
time=2026-04-03T10:58:14.187+03:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=121.567548ms shape="[2560 256]"
time=2026-04-03T10:58:14.644+03:00 level=INFO source=runner.go:1290 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:32768 KvCacheType: NumThreads:8 GPULayers:42[ID:00000000-0500-0000-0000-000000000000 Layers:42(0..41)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-03T10:58:14.644+03:00 level=INFO source=ggml.go:482 msg="offloading 42 repeating layers to GPU"
time=2026-04-03T10:58:14.644+03:00 level=INFO source=ggml.go:486 msg="offloading output layer to CPU"
time=2026-04-03T10:58:14.644+03:00 level=INFO source=ggml.go:494 msg="offloaded 42/43 layers to GPU"
time=2026-04-03T10:58:14.644+03:00 level=INFO source=device.go:240 msg="model weights" device=Vulkan0 size="2.8 GiB"
time=2026-04-03T10:58:14.644+03:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="6.6 GiB"
time=2026-04-03T10:58:14.644+03:00 level=INFO source=device.go:251 msg="kv cache" device=Vulkan0 size="692.0 MiB"
time=2026-04-03T10:58:14.644+03:00 level=INFO source=device.go:262 msg="compute graph" device=Vulkan0 size="654.3 MiB"
time=2026-04-03T10:58:14.644+03:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="21.0 MiB"
time=2026-04-03T10:58:14.644+03:00 level=INFO source=device.go:272 msg="total memory" size="10.8 GiB"
time=2026-04-03T10:58:14.644+03:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
time=2026-04-03T10:58:14.644+03:00 level=INFO source=server.go:1352 msg="waiting for llama runner to start responding"
time=2026-04-03T10:58:14.644+03:00 level=INFO source=server.go:1386 msg="waiting for server to become available" status="llm server loading model"
time=2026-04-03T10:58:19.913+03:00 level=INFO source=server.go:1390 msg="llama runner started in 6.86 seconds"
[GIN] 2026/04/03 - 10:58:19 | 200 |  7.313091191s |       127.0.0.1 | POST     "/api/generate"

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.20.0

extent analysis

TL;DR

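The log shows the first Vulkan allocation (5637144576 bytes, about 5.25 GiB) failing against the device's buffer size limit. Ollama then applies backoff and loads 42 of 43 layers on the GPU, with the output layer on the CPU. This fallback layout is the likely reason the gemma4:e4b model skips the thinking part and produces unrelated responses.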

Guidance

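  • The error message ggml_vulkan: Device memory allocation of size 5637144576 failed is followed by "Requested buffer size exceeds device buffer size limit", while the scheduler reports 30.5 GiB available on the Vulkan device. The failure therefore looks like a single buffer exceeding the device's per-buffer limit, not total GPU memory running out.
  • Try a smaller model or quantization so that the requested buffers are smaller; increasing the memory assigned to the GPU may not help if the per-buffer limit is the constraint.
  • As a temporary workaround, try running the model without Vulkan (OLLAMA_VULKAN=0 ollama serve) to see if it works as expected.
  • Compare the sizes Ollama reports at load time: the log lists 10.8 GiB total memory for the model, split into 2.8 GiB of weights on Vulkan0 and 6.6 GiB on the CPU, plus a 692.0 MiB KV cache and a 654.3 MiB compute graph on Vulkan0.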

Example

No code example is provided as this issue is related to GPU memory allocation and model configuration.
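
That said, the figures in the log can be sanity-checked with a short script. The following is a minimal Python sketch, not part of the original issue: it extracts the failed allocation size, the reported available GPU memory, and the offload count from log text in the format shown above, and converts bytes to GiB. The function name and the regular expressions are illustrative assumptions tied to this exact log wording; other Ollama versions may phrase these lines differently.

```python
import re

# Patterns assume the log wording seen in this issue (Ollama 0.20.0).
ALLOC_FAILED = re.compile(r"ggml_vulkan: Device memory allocation of size (\d+) failed")
GPU_AVAILABLE = re.compile(r'msg="gpu memory".*?available="([\d.]+) GiB"')
OFFLOADED = re.compile(r'msg="offloaded (\d+)/(\d+) layers to GPU"')

GIB = 1024 ** 3  # bytes per GiB


def summarize(log_text: str) -> dict:
    """Collect failed allocation sizes, available GPU memory, and offload counts."""
    failed = [int(m.group(1)) for m in ALLOC_FAILED.finditer(log_text)]
    avail = GPU_AVAILABLE.search(log_text)
    offload = OFFLOADED.search(log_text)
    return {
        "failed_alloc_gib": [size / GIB for size in failed],
        "available_gib": float(avail.group(1)) if avail else None,
        "offloaded": (int(offload.group(1)), int(offload.group(2))) if offload else None,
    }


# Three lines copied from the log in this issue.
SAMPLE = '''
time=2026-04-03T10:58:13.049+03:00 level=INFO source=sched.go:491 msg="gpu memory" id=00000000-0500-0000-0000-000000000000 library=Vulkan available="30.5 GiB" free="31.0 GiB" minimum="457.0 MiB" overhead="0 B"
ggml_vulkan: Device memory allocation of size 5637144576 failed.
time=2026-04-03T10:58:14.644+03:00 level=INFO source=ggml.go:494 msg="offloaded 42/43 layers to GPU"
'''

summary = summarize(SAMPLE)
print(summary)
```

On these lines the script reports a failed request of 5.25 GiB against 30.5 GiB reported as available, with 42 of 43 layers offloaded, which is consistent with a per-buffer limit being hit and not a shortage of total memory.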

Notes

The issue may be specific to the gemma4:e4b model and the AMD GPU, and may not be reproducible with other models or hardware configurations.

Recommendation

Apply a workaround by running the model without Vulkan (OLLAMA_VULKAN=0 ollama serve) until a more permanent solution can be found, such as optimizing the model for the available GPU memory or increasing the GPU memory.
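
To check whether the workaround changes the output, the same prompt can be sent once to a server started with Vulkan and once to a server started with OLLAMA_VULKAN=0, and the two responses compared. The following is a minimal Python sketch using the /api/generate endpoint that appears in the log. It assumes the server listens on Ollama's default address, http://127.0.0.1:11434; the helper names and the prompt are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # assumed default address


def build_request(model: str, prompt: str, base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming POST request for /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the prompt to a running Ollama server and return the response text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=300) as resp:
        return json.loads(resp.read())["response"]
```

With a server running, calling generate("gemma4:e4b", "What is 17 * 23? Answer with the number only.") under each configuration gives two strings to compare. A short prompt with a single checkable answer makes an unrelated response easy to spot.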

TRENDING