ollama - 💡(How to fix) Fix 0.30.0-RC15 : `GGML_ASSERT(buffer) failed` during loading of multimodal model `gemma4:26b` due to CUDA OOM

Error Message

When attempting to load the multimodal model gemma4:26b, the process crashes with a GGML_ASSERT(buffer) failed error. The main LLM layers (31/31) appear to be offloaded to the GPU successfully, but llama-server then hits a fatal assertion and terminates, instead of falling back to system RAM or reporting that there is not enough VRAM for the vision component: time=2026-05-14T16:06:34.047+02:00 level=INFO source=sched.go:580 msg="Load failed" model=/home/user/.ollama/models/blobs/... error="llama-server reported out-of-memory during startup:

Code Example

time=2026-05-14T16:06:32.247+02:00 level=INFO source=llama_server.go:837 msg="waiting for llama-server to become available" status="llm server not responding"
...
load_tensors: offloading 29 repeating layers to GPU
load_tensors: offloaded 31/31 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   577.50 MiB
load_tensors:        CUDA0 model buffer size = 16001.49 MiB
...
handle_gemma4_clip: detected Ollama-format gemma4 GGUF used as mmproj; translating
...
load_hparams: model size:         17140.88 MiB
load_hparams: metadata size:      0.32 MiB
/build/llama-server-cpu/_deps/llama_cpp-src/ggml/src/ggml-backend.cpp:179: GGML_ASSERT(buffer) failed
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1139.46 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 1194810624
...
time=2026-05-14T16:06:34.047+02:00 level=INFO source=sched.go:580 msg="Load failed" model=/home/user/.ollama/models/blobs/... error="llama-server reported out-of-memory during startup: 
GGML_ASSERT(buffer) failed\ncudaMalloc failed: out of memory"

What is the issue?

When attempting to load the multimodal model gemma4:26b, the process crashes with a GGML_ASSERT(buffer) failed error. While the main LLM layers (31/31) appear to be successfully offloaded to the GPU, the crash occurs specifically during the initialization of the vision component (CLIP/mmproj). The logs indicate a cudaMalloc failure (Out of Memory) when trying to allocate the buffer for the CLIP model.
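To make the failure point easier to spot in a long log, the relevant numbers can be pulled out with a small script. The following is a minimal sketch, assuming the log lines keep the exact wording quoted in this report; the function name `summarize` and both regular expressions are illustrative and are not part of Ollama or llama.cpp.

```python
import re

# Sample lines copied from the log in this report.
LOG = """\
load_tensors:   CPU_Mapped model buffer size =   577.50 MiB
load_tensors:        CUDA0 model buffer size = 16001.49 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1139.46 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 1194810624
"""

# Hypothetical patterns written against the wording seen above.
BUFFER_RE = re.compile(r"load_tensors:\s+(\S+) model buffer size =\s+([\d.]+) MiB")
OOM_RE = re.compile(r"allocating ([\d.]+) MiB on device (\d+): cudaMalloc failed: out of memory")


def summarize(log: str) -> dict:
    """Collect model buffer sizes per backend and failed CUDA allocations, in MiB."""
    buffers = {name: float(size) for name, size in BUFFER_RE.findall(log)}
    failed = [(int(dev), float(size)) for size, dev in OOM_RE.findall(log)]
    return {"buffers_mib": buffers, "failed_allocs": failed}


if __name__ == "__main__":
    print(summarize(LOG))
```

Run against the full server log, this would list each model buffer next to the allocation that failed, which shows that the weights were placed before the vision buffer was requested.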

Environment:

Ollama Version: 0.30.0-RC15
Model: gemma4:26b (Multimodal)
GPU: RTX 3090 (24GB VRAM)
GPU Backend: CUDA 13.2
OS: Linux (Debian 13)

Steps to Reproduce:

Install Ollama version 0.30.0-RC15.
Pull/Run the model: ollama run gemma4:26b.
Observe the logs during the loading phase.
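When reproducing, it may help to record how much VRAM is already in use before the model is loaded, since other processes on the same GPU reduce what is available. The sketch below assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output; the helper names `parse_memory` and `free_mib_per_gpu` are hypothetical.

```python
import subprocess

# nvidia-smi reports these fields in MiB when "nounits" is requested.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' rows (MiB), one row per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def free_mib_per_gpu() -> list[int]:
    """Run the query and return free VRAM in MiB for each GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [total - used for used, total in parse_memory(out)]
```

Calling `free_mib_per_gpu()` immediately before `ollama run gemma4:26b` gives a baseline to compare against the buffer sizes printed in the log.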

Expected Behavior:

The model should either load successfully (by utilizing system RAM if VRAM is insufficient) or provide a graceful error message indicating that there is not enough VRAM to load the vision component, rather than crashing the entire llama-server via a GGML_ASSERT.

Note: the crash is not reproduced with 0.23.4.

Actual Behavior:

The llama-server encounters a fatal assertion error and terminates:

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1139.46 MiB on device 0: cudaMalloc failed: out of memory
GGML_ASSERT(buffer) failed

Additional Context:

The logs show that the main LLM weights are already consuming approximately 16 GB of VRAM (16001.49 MiB). The crash occurs when the system attempts to allocate an additional ~1.1 GB (1139.46 MiB) for the CLIP/vision projection layer. The failed allocation for the multimodal projector appears to be unhandled, which results in a hard crash of the backend.
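The figures in the log can be checked with a little arithmetic. This is a rough sketch only: it treats the 24 GB card as 24 GiB of usable VRAM, which is an assumption, and it counts only the two buffers that appear in the quoted log.

```python
MIB = 1024 * 1024

weights_mib = 16001.49        # "CUDA0 model buffer size" from the log
clip_bytes = 1194810624       # size of the failed allocation from the log
clip_mib = clip_bytes / MIB   # converts to the 1139.46 MiB shown in the cudaMalloc line

nominal_vram_mib = 24 * 1024  # assumption: 24 GB card treated as 24 GiB

requested_mib = weights_mib + clip_mib
unaccounted_mib = nominal_vram_mib - requested_mib

print(f"vision buffer:    {clip_mib:.2f} MiB")
print(f"weights + vision: {requested_mib:.2f} MiB")
print(f"nominal headroom: {unaccounted_mib:.2f} MiB")
```

Under these assumptions the weights and the vision buffer together stay well below the nominal capacity, so the two logged buffers do not account for the out-of-memory condition on their own. The remaining VRAM was presumably taken by allocations that fall in the elided parts of the log or by other processes on the GPU, though the report does not confirm either.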

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.30.0-RC15
