ollama - ✅(Solved) Fix gemma4:31b-it-q4_K_M fails to load with CUDA error: cublasGemmBatchedEx internal operation failed (v0.20.0, RTX 4090) [3 pull requests, 2 comments, 2 participants]

RAFOLIE · 2026-04-03T01:41:59Z

[ollama] When trying to run gemma4:31b-it-q4 K M , the model fails to load with a 500 error. The server log shows a CUDA internal error during the vision encod… When trying to run `gemma4:31b-it-q4_K_M`, the model fails to load with a 500 error. The server log shows a CUDA internal error during the vision encoder initialization phase. # PR #15296: gemma4: enable flash attention - Repository: ollama/ollama - Author: dhiltgen - State: closed | merged: True - Link: https://github.com/ollama/ollama/pull/15296 ## Description (problem / solution / changelog) ~~This patches additional code paths in the GGML CUDA backend for the memory prediction flow.~~ ~~Fixes #15249~~ Enable flash attention for gemma4 ## Changed files - `fs/ggml/ggml.go` (modified, +1/-0) --- # PR #15301: ggml: skip cublasGemmBatchedEx during graph reservation - Repository: ollama/ollama - Author: jessegross - State: closed | merged: True - Link: https://github.com/ollama/ollama/pull/15301 ## Description (problem / solution / changelog) cublasGemmBatchedEx fails during graph capture when pool allocations return fake pointers. This is triggered when NUM_PARALLEL is greater than 1 for models like gemma4 that use batched matmuls. Skip it during reservation since the memory tracking is already handled by the pool allocations. Fixes #15249 ## Changed files - `llama/patches/0020-ggml-No-alloc-mode.patch` (modified, +24/-3) - `ml/backend/ggml/ggml/src/ggml-cuda/common.cuh` (modified, +21/-0) --- # PR #1995: Update ollama/ollama Docker tag to v0.19.0 - Repository: claytono/infra - Author: renovate[bot] - State: closed | merged: True - Link: https://github.com/claytono/infra/pull/1995 ## Description (problem / solution / changelog) This PR contains the following updates: | Package | Update | Change | Pending | |---|---|---|---| | ollama/ollama | minor | `0.18.2` → `0.19.0` | `0.20.7` (+6) | --- ### Configuration 📅 **Schedule**: (in timezone America/New_York) - Branch creation - "after 2am and before 8am on monday" - Automerge - At any time (no schedule defined) 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR is behind base branch, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] If you want to rebase/retry this PR, check this box --- This PR was generated by [Mend Renovate](https://mend.io/renovate/). View the [repository job log](https://developer.mend.io/github/claytono/infra). ## Changed files - `kubernetes/ollama/deployment.yaml` (modified, +1/-1) ## Description When trying to run `gemma4:31b-it-q4_K_M`, the model fails to load with a 500 error. The server log shows a CUDA internal error during the vision encoder initialization phase. ## Environment - **Ollama version**: 0.20.0 - **OS**: Windows 11 - **GPU**: NVIDIA GeForce RTX 4090 (24 GiB VRAM, ~20.5 GiB available) - **CUDA Driver version**: 595.79 (CUDA 13.2) - **RAM**: 64 GiB (~49 GiB free) - **Model**: `gemma4:31b-it-q4_K_M` (19 GB, Q4_K_M) ## Steps to Reproduce 1. Pull `gemma4:31b-it-q4_K_M` 2. Send a chat request (text only, no image) 3. Model fails to load with HTTP 500 ## Error from server.log ``` time=2026-04-03T09:26:08.106+08:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=99.5538ms shape="[5376 256]" CUDA error: an internal operation failed current device: 0, in function ggml_cuda_mul_mat_batched_cublas_impl at ggml-cuda.cu:2130 cublasGemmBatchedEx(ctx.cublas_handle(), CUBLAS_OP_T, CUBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), cu_data_type_a, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), cu_data_type_b, s11, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne0, ne23, cu_compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP) C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:94: CUDA error time=2026-04-03T09:26:30.753+08:00 level=ERROR source=server.go:1207 msg="do load request" error="Post http://127.0.0.1:60141/load: wsarecv: An existing connection was forcibly closed by the remote host." time=2026-04-03T09:26:30.753+08:00 level=INFO source=sched.go:511 msg="Load failed" error="model failed to load, this may be due to resource limitations or an internal error" time=2026-04-03T09:26:31.036+08:00 level=ERROR source=server.go:304 msg="llama runner terminated" error="exit status 1" ``` ## Notes - The error consistently occurs right after `vision: encoded` log line, during the vision projector's batched matrix multiplication (`cublasGemmBatchedEx`). - Resources are sufficient: 20+ GiB free VRAM, 49 GiB free RAM. - The same error occurs on every repeated attempt. - `OLLAMA_FLASH_ATTENTION` is currently disabled (false). - `OLLAMA_CONTEXT_LENGTH` is set to 262144. - `OLLAMA_NUM_PARALLEL` is set to 4.

ollama2026-04-03 01:41:59

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

ollama/ollama#15249•Fetched 2026-04-08 02:33:41

View on GitHub

Comments

Participants

Timeline

Reactions

Author

Participants

Assignees

Timeline (top)

referenced ×6commented ×2cross-referenced ×2marked_as_duplicate ×2

When trying to run gemma4:31b-it-q4_K_M, the model fails to load with a 500 error. The server log shows a CUDA internal error during the vision encoder initialization phase.

Error Message

time=2026-04-03T09:26:08.106+08:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=99.5538ms shape="[5376 256]" CUDA error: an internal operation failed current device: 0, in function ggml_cuda_mul_mat_batched_cublas_impl at ggml-cuda.cu:2130 cublasGemmBatchedEx(ctx.cublas_handle(), CUBLAS_OP_T, CUBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0ne23), cu_data_type_a, nb01/nb00, (const void **) (ptrs_src.get() + 1ne23), cu_data_type_b, s11, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne0, ne23, cu_compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP) C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:94: CUDA error

time=2026-04-03T09:26:30.753+08:00 level=ERROR source=server.go:1207 msg="do load request" error="Post http://127.0.0.1:60141/load: wsarecv: An existing connection was forcibly closed by the remote host." time=2026-04-03T09:26:30.753+08:00 level=INFO source=sched.go:511 msg="Load failed" error="model failed to load, this may be due to resource limitations or an internal error" time=2026-04-03T09:26:31.036+08:00 level=ERROR source=server.go:304 msg="llama runner terminated" error="exit status 1"

Root Cause

When trying to run gemma4:31b-it-q4_K_M, the model fails to load with a 500 error. The server log shows a CUDA internal error during the vision encoder initialization phase.

PR fix notes

PR #15296: gemma4: enable flash attention

Repository: ollama/ollama
Author: dhiltgen
State: closed | merged: True
Link: https://github.com/ollama/ollama/pull/15296

Description (problem / solution / changelog)

~~This patches additional code paths in the GGML CUDA backend for the memory prediction flow.~~

~~Fixes #15249~~

Enable flash attention for gemma4

Changed files

fs/ggml/ggml.go (modified, +1/-0)

PR #15301: ggml: skip cublasGemmBatchedEx during graph reservation

Repository: ollama/ollama
Author: jessegross
State: closed | merged: True
Link: https://github.com/ollama/ollama/pull/15301

Description (problem / solution / changelog)

cublasGemmBatchedEx fails during graph capture when pool allocations return fake pointers. This is triggered when NUM_PARALLEL is greater than 1 for models like gemma4 that use batched matmuls. Skip it during reservation since the memory tracking is already handled by the pool allocations.

Fixes #15249

Changed files

llama/patches/0020-ggml-No-alloc-mode.patch (modified, +24/-3)
ml/backend/ggml/ggml/src/ggml-cuda/common.cuh (modified, +21/-0)

PR #1995: Update ollama/ollama Docker tag to v0.19.0

Repository: claytono/infra
Author: renovate[bot]
State: closed | merged: True
Link: https://github.com/claytono/infra/pull/1995

Description (problem / solution / changelog)

This PR contains the following updates:

Package	Update	Change	Pending
ollama/ollama	minor	`0.18.2` → `0.19.0`	`0.20.7` (+6)

Configuration

📅 Schedule: (in timezone America/New_York)

Branch creation
- "after 2am and before 8am on monday"
Automerge
- At any time (no schedule defined)

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

♻ Rebasing: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.

If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.

Changed files

kubernetes/ollama/deployment.yaml (modified, +1/-1)

Code Example

time=2026-04-03T09:26:08.106+08:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=99.5538ms shape="[5376 256]"
CUDA error: an internal operation failed
  current device: 0, in function ggml_cuda_mul_mat_batched_cublas_impl at ggml-cuda.cu:2130
  cublasGemmBatchedEx(ctx.cublas_handle(), CUBLAS_OP_T, CUBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), cu_data_type_a, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), cu_data_type_b, s11, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne0, ne23, cu_compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:94: CUDA error

time=2026-04-03T09:26:30.753+08:00 level=ERROR source=server.go:1207 msg="do load request" error="Post http://127.0.0.1:60141/load: wsarecv: An existing connection was forcibly closed by the remote host."
time=2026-04-03T09:26:30.753+08:00 level=INFO source=sched.go:511 msg="Load failed" error="model failed to load, this may be due to resource limitations or an internal error"
time=2026-04-03T09:26:31.036+08:00 level=ERROR source=server.go:304 msg="llama runner terminated" error="exit status 1"

RAW_BUFFERClick to expand / collapse

Description

When trying to run gemma4:31b-it-q4_K_M, the model fails to load with a 500 error. The server log shows a CUDA internal error during the vision encoder initialization phase.

Environment

Ollama version: 0.20.0
OS: Windows 11
GPU: NVIDIA GeForce RTX 4090 (24 GiB VRAM, ~20.5 GiB available)
CUDA Driver version: 595.79 (CUDA 13.2)
RAM: 64 GiB (~49 GiB free)
Model: gemma4:31b-it-q4_K_M (19 GB, Q4_K_M)

Steps to Reproduce

Pull gemma4:31b-it-q4_K_M
Send a chat request (text only, no image)
Model fails to load with HTTP 500

Error from server.log

time=2026-04-03T09:26:08.106+08:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=99.5538ms shape="[5376 256]"
CUDA error: an internal operation failed
  current device: 0, in function ggml_cuda_mul_mat_batched_cublas_impl at ggml-cuda.cu:2130
  cublasGemmBatchedEx(ctx.cublas_handle(), CUBLAS_OP_T, CUBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), cu_data_type_a, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), cu_data_type_b, s11, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne0, ne23, cu_compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:94: CUDA error

time=2026-04-03T09:26:30.753+08:00 level=ERROR source=server.go:1207 msg="do load request" error="Post http://127.0.0.1:60141/load: wsarecv: An existing connection was forcibly closed by the remote host."
time=2026-04-03T09:26:30.753+08:00 level=INFO source=sched.go:511 msg="Load failed" error="model failed to load, this may be due to resource limitations or an internal error"
time=2026-04-03T09:26:31.036+08:00 level=ERROR source=server.go:304 msg="llama runner terminated" error="exit status 1"

Notes

The error consistently occurs right after vision: encoded log line, during the vision projector's batched matrix multiplication (cublasGemmBatchedEx).
Resources are sufficient: 20+ GiB free VRAM, 49 GiB free RAM.
The same error occurs on every repeated attempt.
OLLAMA_FLASH_ATTENTION is currently disabled (false).
OLLAMA_CONTEXT_LENGTH is set to 262144.
OLLAMA_NUM_PARALLEL is set to 4.

extent analysis

TL;DR

The most likely fix is to investigate and potentially update the CUDA driver or the ggml-cuda library to resolve the internal CUDA error during the vision encoder initialization phase.

Guidance

Verify that the CUDA driver version 595.79 is compatible with the NVIDIA GeForce RTX 4090 GPU and the CUDA 13.2 version.
Check the ggml-cuda library for any known issues or updates related to the cublasGemmBatchedEx function.
Consider reducing the OLLAMA_CONTEXT_LENGTH or OLLAMA_NUM_PARALLEL settings to decrease the resource utilization and see if it resolves the issue.
Monitor the system resources (VRAM and RAM) during the model loading process to ensure that the error is not caused by resource limitations.

Notes

The error is consistently occurring during the vision projector's batched matrix multiplication, which suggests a potential issue with the CUDA driver or the ggml-cuda library. The fact that resources are sufficient (20+ GiB free VRAM, 49 GiB free RAM) reduces the likelihood of a resource-related issue.

Recommendation

Apply a workaround by reducing the OLLAMA_CONTEXT_LENGTH or OLLAMA_NUM_PARALLEL settings to decrease the resource utilization, as this may help resolve the issue until a more permanent fix is available.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#task chaining #parallel task #integration issue #index setup #retrieval issue

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

ollama - ✅(Solved) Fix gemma4:31b-it-q4_K_M fails to load with CUDA error: cublasGemmBatchedEx internal operation failed (v0.20.0, RTX 4090) [3 pull requests, 2 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

PR fix notes

PR #15296: gemma4: enable flash attention

Description (problem / solution / changelog)

Changed files

PR #15301: ggml: skip cublasGemmBatchedEx during graph reservation

Description (problem / solution / changelog)

Changed files

PR #1995: Update ollama/ollama Docker tag to v0.19.0

Description (problem / solution / changelog)

Configuration

Changed files

Code Example

Description

Environment

Steps to Reproduce

Error from server.log

Notes

extent analysis

TL;DR

Guidance

Notes

Recommendation

Still need to ship something?

TRENDING

ollama - ✅(Solved) Fix gemma4:31b-it-q4_K_M fails to load with CUDA error: cublasGemmBatchedEx internal operation failed (v0.20.0, RTX 4090) [3 pull requests, 2 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

PR fix notes

PR #15296: gemma4: enable flash attention

Description (problem / solution / changelog)

Changed files

PR #15301: ggml: skip cublasGemmBatchedEx during graph reservation

Description (problem / solution / changelog)

Changed files

PR #1995: Update ollama/ollama Docker tag to v0.19.0

Description (problem / solution / changelog)

Configuration

Changed files

Code Example

Description

Environment

Steps to Reproduce

Error from server.log

Notes

extent analysis

TL;DR

Guidance

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING