ollama 0.30.0-RC15: `GGML_ASSERT(buffer) failed` during loading of multimodal model `gemma4:26b` due to CUDA OOM

Error Message

When attempting to load the multimodal model gemma4:26b, the process crashes with a GGML_ASSERT(buffer) failed error. The main LLM layers (31/31) appear to be successfully offloaded to the GPU, but the llama-server then encounters a fatal assertion error and terminates. The model should instead either load successfully (by using system RAM if VRAM is insufficient) or report a graceful error indicating that there is not enough VRAM to load the vision component. The scheduler reports:

time=2026-05-14T16:06:34.047+02:00 level=INFO source=sched.go:580 msg="Load failed" model=/home/user/.ollama/models/blobs/... error="llama-server reported out-of-memory during startup: 
GGML_ASSERT(buffer) failed\ncudaMalloc failed: out of memory"

What is the issue?

When attempting to load the multimodal model gemma4:26b, the process crashes with a GGML_ASSERT(buffer) failed error. While the main LLM layers (31/31) appear to be successfully offloaded to the GPU, the crash occurs specifically during the initialization of the vision component (CLIP/mmproj). The logs indicate a cudaMalloc failure (Out of Memory) when trying to allocate the buffer for the CLIP model.

Environment:

  • Ollama Version: 0.30.0-RC15
  • Model: gemma4:26b (Multimodal)
  • GPU: RTX 3090 (24GB VRAM)
  • GPU Backend: CUDA 13.2
  • OS: Linux (Debian 13)

Steps to Reproduce:

  1. Install Ollama version 0.30.0-RC15.
  2. Pull/Run the model: ollama run gemma4:26b.
  3. Observe the logs during the loading phase.

Expected Behavior:

The model should either load successfully (falling back to system RAM if VRAM is insufficient) or fail with a graceful error message stating that there is not enough VRAM to load the vision component, rather than crashing the entire llama-server via a GGML_ASSERT. NB: the crash does not reproduce with 0.23.4.
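The kind of graceful check described above could look roughly like the following. This is only a sketch of the idea, not how Ollama or llama.cpp handle it: the helper names (`parse_free_mib`, `query_free_mib`, `check_headroom`) and the 256 MiB safety margin are hypothetical, and it assumes `nvidia-smi` is on the PATH.

```python
import subprocess


def parse_free_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one free-memory value (MiB) per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def query_free_mib() -> list[int]:
    """Ask nvidia-smi for the free VRAM of every GPU, in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_free_mib(out)


def check_headroom(free_mib: float, required_mib: float, margin_mib: float = 256.0) -> str:
    """Return "ok" or a readable message; the margin is an arbitrary assumption."""
    if free_mib >= required_mib + margin_mib:
        return "ok"
    return (
        f"not enough VRAM for the vision component: need {required_mib:.2f} MiB "
        f"(+{margin_mib:.0f} MiB margin), only {free_mib:.0f} MiB free"
    )
```

On a machine with an NVIDIA driver installed, `check_headroom(query_free_mib()[0], 1139.46)` would compare the free memory of GPU 0 against the 1139.46 MiB allocation that fails in this report.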

Actual Behavior:

The llama-server encounters a fatal assertion error and terminates:

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1139.46 MiB on device 0: cudaMalloc failed: out of memory
GGML_ASSERT(buffer) failed

Additional Context:

The logs show that the main LLM weights already occupy 16001.49 MiB (roughly 16 GB) of VRAM. The crash occurs when the system attempts to allocate an additional 1139.46 MiB (~1.1 GB) for the CLIP/vision projection layer. It seems that the failed allocation for the multimodal projector is not handled gracefully, which results in a hard crash of the backend.
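To put those numbers side by side, here is a small arithmetic sketch using only the figures from the log. The nominal 24 GiB (24576 MiB) capacity of the RTX 3090 is an assumption; the memory actually usable by the process is somewhat lower.

```python
MIB = 1024 * 1024

# Figures taken from the log output in this report.
cuda_model_buffer_mib = 16001.49  # "CUDA0 model buffer size"
clip_alloc_bytes = 1194810624     # "failed to allocate CUDA0 buffer of size"

# Assumption: nominal card capacity, not the amount free at load time.
nominal_vram_mib = 24 * 1024

clip_alloc_mib = clip_alloc_bytes / MIB
logged_total_mib = cuda_model_buffer_mib + clip_alloc_mib
unaccounted_mib = nominal_vram_mib - logged_total_mib

print(f"vision allocation:  {clip_alloc_mib:.2f} MiB")
print(f"weights + vision:   {logged_total_mib:.2f} MiB")
print(f"nominal remainder:  {unaccounted_mib:.2f} MiB")
```

The two logged allocations add up to about 17141 MiB, well under the nominal capacity. If that assumption holds, the out-of-memory condition would have to come from VRAM consumers that the excerpt does not show (for example a KV cache, compute buffers, or other processes), which the elided log lines might confirm.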

Relevant log output

time=2026-05-14T16:06:32.247+02:00 level=INFO source=llama_server.go:837 msg="waiting for llama-server to become available" status="llm server not responding"
...
load_tensors: offloading 29 repeating layers to GPU
load_tensors: offloaded 31/31 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   577.50 MiB
load_tensors:        CUDA0 model buffer size = 16001.49 MiB
...
handle_gemma4_clip: detected Ollama-format gemma4 GGUF used as mmproj; translating
...
load_hparams: model size:         17140.88 MiB
load_hparams: metadata size:      0.32 MiB
/build/llama-server-cpu/_deps/llama_cpp-src/ggml/src/ggml-backend.cpp:179: GGML_ASSERT(buffer) failed
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1139.46 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 1194810624
...
time=2026-05-14T16:06:34.047+02:00 level=INFO source=sched.go:580 msg="Load failed" model=/home/user/.ollama/models/blobs/... error="llama-server reported out-of-memory during startup: 
GGML_ASSERT(buffer) failed\ncudaMalloc failed: out of memory"
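When triaging similar reports, the relevant figures can be pulled out of such a log mechanically. The following is a minimal sketch: the regular expressions are written against the exact line shapes in this excerpt and may need adjusting for other builds, and `summarize` is a hypothetical helper name.

```python
import re

# Four lines copied from the log output above.
LOG_EXCERPT = """\
load_tensors:   CPU_Mapped model buffer size =   577.50 MiB
load_tensors:        CUDA0 model buffer size = 16001.49 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1139.46 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 1194810624
"""

BUFFER_RE = re.compile(r"load_tensors:\s+(\S+) model buffer size =\s+([\d.]+) MiB")
OOM_RE = re.compile(r"allocating ([\d.]+) MiB on device (\d+): cudaMalloc failed: out of memory")


def summarize(log: str) -> dict:
    """Collect model buffer sizes and the first failed CUDA allocation, if any."""
    buffers = {name: float(size) for name, size in BUFFER_RE.findall(log)}
    oom = OOM_RE.search(log)
    failed = None
    if oom:
        failed = {"mib": float(oom.group(1)), "device": int(oom.group(2))}
    return {"buffers_mib": buffers, "failed_alloc": failed}


summary = summarize(LOG_EXCERPT)
print(summary)
```

Here `summary["failed_alloc"]` identifies the 1139.46 MiB request on device 0 as the allocation that triggers the assertion.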

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.30.0-RC15
