ollama: Vulkan backend ignores maxComputeWorkGroupInvocations and creates 512-invocation pipelines (breaks Raspberry Pi 5 / V3D, cap = 256)

ollama/ollama#15877 · Fetched 2026-04-30 06:18:50

Ollama's Vulkan backend hardcodes {512, 1, 1} workgroup sizes in dozens of ggml_vk_create_pipeline(...) calls and in the corresponding .comp shaders (layout(local_size_x = 512, ...)). On Raspberry Pi 5 (Broadcom V3D, V3DV driver) the hardware advertises VkPhysicalDeviceLimits::maxComputeWorkGroupInvocations = 256, so every one of those pipelines exceeds the device limit and is rejected by the V3D shader compiler:

nir_to_vir.c:  assert(c->local_invocation_index_bits <= 8);

(8 bits = 256 invocations.)

This is the same root cause behind the "Vulkan doesn't work on Raspberry Pi 5" reports against llama.cpp.

This issue identifies the specific bug and points to the V3DV-side workaround that has yet to land in Mesa.
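
To make the arithmetic behind the assertion concrete, here is a minimal C++ sketch. The helper names `invocation_index_bits` and `fits_v3d` are hypothetical and are not taken from Mesa or ggml; the sketch only assumes that the compiler needs ceil(log2(x * y * z)) bits to index every invocation in a workgroup.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not Mesa code): number of bits needed to give every
// invocation in an x*y*z workgroup a distinct index, i.e. ceil(log2(x*y*z)).
inline uint32_t invocation_index_bits(uint32_t x, uint32_t y, uint32_t z) {
    assert(x > 0 && y > 0 && z > 0);
    const uint64_t total = uint64_t(x) * y * z;
    uint32_t bits = 0;
    while ((uint64_t(1) << bits) < total) {
        ++bits;
    }
    return bits;
}

// True when the workgroup shape satisfies the "<= 8 bits" condition
// quoted from nir_to_vir.c above.
inline bool fits_v3d(uint32_t x, uint32_t y, uint32_t z) {
    return invocation_index_bits(x, y, z) <= 8;
}
```

Under these assumptions a {256, 1, 1} workgroup needs exactly 8 bits and passes, while {512, 1, 1} needs 9 bits and trips the assertion.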


What is the issue?

Environment

  • Hardware: Raspberry Pi 5 (Broadcom BCM2712, V3D 7.1.7)
  • OS: Debian Trixie (aarch64)
  • Driver: Mesa V3DV (Vulkan 1.3)
  • Ollama: built with -DGGML_VULKAN=ON, server launched with OLLAMA_VULKAN=1
  • Device limits (relevant ones):
    • maxComputeWorkGroupInvocations = 256
    • maxComputeWorkGroupSize = [256, 256, 256]
    • subgroupSize = 16

What's happening

  1. ggml-vulkan.cpp:4478 correctly reads the device limit:

    device->max_workgroup_size_log2 =
        uint32_t(log2f(float(device->properties.limits.maxComputeWorkGroupInvocations)));

    On V3D this yields log2(256) = 8.

  2. A handful of code paths honor that cap but most of the pipelines hardcode {512, 1, 1} and ignore max_workgroup_size_log2. The matching .comp shaders declare the same hardcoded size, e.g. abs.comp, concat.comp, copy.comp, … all local_size_x = 512.
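
One way the cap could be applied is sketched below in C++. The function names (`floor_log2`, `clamp_workgroup_size`, `num_workgroups`) are illustrative assumptions, not ggml functions. The sketch shows the two things that have to change together: the local size is capped by the device limit, and the number of dispatched workgroups is recomputed from the capped size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Integer floor(log2(v)) for v > 0; avoids the float round trip through log2f.
inline uint32_t floor_log2(uint32_t v) {
    assert(v > 0);
    uint32_t r = 0;
    while (v >>= 1) {
        ++r;
    }
    return r;
}

// The requested size, capped at the largest power of two the device allows.
inline uint32_t clamp_workgroup_size(uint32_t requested, uint32_t max_invocations) {
    const uint32_t cap = 1u << floor_log2(max_invocations);
    return std::min(requested, cap);
}

// Workgroups needed to cover n_elements with wg_size invocations each.
inline uint32_t num_workgroups(uint32_t n_elements, uint32_t wg_size) {
    assert(wg_size > 0);
    return n_elements / wg_size + (n_elements % wg_size != 0 ? 1u : 0u);
}
```

With the V3D limit of 256, a requested size of 512 becomes 256 and the dispatch for the same element count doubles its workgroup count; on a device that allows 1024, the requested 512 is left unchanged.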

Steps to reproduce

On a Raspberry Pi 5 with stock Mesa V3DV (no workaround applied):

# 1. Start the server with Vulkan enabled (it's off by default).
OLLAMA_VULKAN=1 ollama serve

# 2. In another shell, run any model.
ollama run gemma3:1b

The runner crashes during shader pipeline creation on the first compute dispatch. The affected ops (cpy, get_rows, add/mul, gelu, soft_max, rope, …) are on the hot path of every transformer forward pass, so any model fails immediately.

Expected behavior

Ollama's Vulkan backend should respect VkPhysicalDeviceLimits::maxComputeWorkGroupInvocations for every pipeline it creates, the same way it already does for argsort.
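
A check along these lines could catch oversized pipelines before they reach the driver. This is a sketch only: `ComputeLimits` and `PipelineRequest` are hypothetical stand-ins for the relevant `VkPhysicalDeviceLimits` fields and for whatever bookkeeping the backend keeps per pipeline, and the real Vulkan headers are deliberately not used.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical mirror of the two relevant VkPhysicalDeviceLimits fields.
struct ComputeLimits {
    uint32_t max_invocations;  // maxComputeWorkGroupInvocations
    uint32_t max_size[3];      // maxComputeWorkGroupSize
};

// Hypothetical record of one pipeline's requested local size.
struct PipelineRequest {
    std::string name;
    uint32_t local_size[3];
};

// A request is valid when each dimension and the total product fit the limits.
inline bool within_limits(const PipelineRequest& p, const ComputeLimits& lim) {
    uint64_t total = 1;
    for (int i = 0; i < 3; ++i) {
        assert(p.local_size[i] > 0);
        if (p.local_size[i] > lim.max_size[i]) {
            return false;
        }
        total *= p.local_size[i];
    }
    return total <= lim.max_invocations;
}

// Names of all requests that exceed the device limits.
inline std::vector<std::string> oversized(const std::vector<PipelineRequest>& requests,
                                          const ComputeLimits& lim) {
    std::vector<std::string> out;
    for (const auto& p : requests) {
        if (!within_limits(p, lim)) {
            out.push_back(p.name);
        }
    }
    return out;
}
```

Fed with the V3D limits listed under Environment, a {512, 1, 1} request is reported while {256, 1, 1} and {16, 16, 1} pass.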

Driver-side workaround (V3DV)

We therefore implemented a driver-side workaround for applications that don't honour maxComputeWorkGroupInvocations: the driver lowers oversized workgroups itself so that the compiler no longer asserts. Mesa MRs:

With the three Mesa MRs above, the RPi5 can run Ollama, although there are many opportunities for performance improvements in both Mesa and ggml, such as making adjustments for the Broadcom architecture as is done for other vendors.

These let RPi5 users run Ollama today, but the wrapping loop the driver inserts is pure overhead; fixing it in the backend would let V3D run dispatches at native size.
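
The cost of such a wrapping loop can be illustrated with a small CPU-side C++ model. This is not Mesa's implementation: `wrap_iterations` and `run_wrapped` are hypothetical names, and the model ignores barriers and shared memory, which a real lowering would have to handle.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Loop iterations each hardware invocation runs to cover the logical workgroup.
inline uint32_t wrap_iterations(uint32_t logical_size, uint32_t hw_size) {
    assert(hw_size > 0);
    return logical_size / hw_size + (logical_size % hw_size != 0 ? 1u : 0u);
}

// Calls body(logical_id) once for every logical invocation, using only
// hw_size "hardware" invocations. The outer loop stands in for invocations
// that run in parallel on a GPU; the inner loop is the wrapping loop.
inline void run_wrapped(uint32_t logical_size, uint32_t hw_size,
                        const std::function<void(uint32_t)>& body) {
    const uint32_t iters = wrap_iterations(logical_size, hw_size);
    for (uint32_t hw_id = 0; hw_id < hw_size; ++hw_id) {
        for (uint32_t it = 0; it < iters; ++it) {
            const uint32_t logical_id = it * hw_size + hw_id;
            if (logical_id < logical_size) {
                body(logical_id);
            }
        }
    }
}
```

In this model a 512-invocation shader on 256-wide hardware runs two loop iterations per hardware invocation, which is the extra work a backend that requests 256 directly would avoid.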

Relevant log output

  • OS: Linux
  • GPU: Other
  • CPU: Other
  • Ollama version: 0.21.2

extent analysis

TL;DR

The most likely fix is to update the Vulkan backend to respect the VkPhysicalDeviceLimits::maxComputeWorkGroupInvocations limit for every pipeline creation.

Guidance

  • Review the ggml-vulkan.cpp file and update the hardcoded workgroup sizes in the ggml_vk_create_pipeline calls to use the max_workgroup_size_log2 value instead of hardcoding {512, 1, 1}.
  • Update the corresponding .comp shaders to use the same dynamic workgroup size.
  • Verify that the updated pipelines respect the device limit by checking the maxComputeWorkGroupInvocations value.
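  • Until the backend is fixed, consider using a Mesa build that includes the driver-side workaround (MRs 41257, 41256, and 41255) so that Ollama runs at all on the Raspberry Pi 5. The workaround adds overhead; it does not improve performance.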

Example

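// Sketch: derive the workgroup size from the device limit instead of hardcoding 512.
device->max_workgroup_size_log2 = uint32_t(log2f(float(device->properties.limits.maxComputeWorkGroupInvocations)));
// Pass a size, not the log2 value; cap at 512 on devices that allow more (needs <algorithm>).
const uint32_t wg_size = std::min(512u, 1u << device->max_workgroup_size_log2);
// ...
// The matching .comp shader must use the same local_size_x.
ggml_vk_create_pipeline(..., {wg_size, 1, 1}, ...);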

Notes

The fix requires updating the Vulkan backend to respect the device limits, which may involve modifying multiple code paths. The driver-side workaround can provide a temporary solution, but fixing the backend is recommended for optimal performance.

Recommendation

Apply the workaround by merging the Mesa MRs (41257, 41256, and 41255) to enable running Ollama on Raspberry Pi 5, and then update the Vulkan backend to respect the device limits for a more permanent fix.
