ollama - ✅(Solved) Fix gemma4:31b-it-q4_K_M fails to load with CUDA error: cublasGemmBatchedEx internal operation failed (v0.20.0, RTX 4090) [3 pull requests, 2 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
ollama/ollama#15249Fetched 2026-04-08 02:33:41
View on GitHub
Comments
2
Participants
2
Timeline
15
Reactions
3
Author
Participants
Assignees
Timeline (top)
referenced ×6commented ×2cross-referenced ×2marked_as_duplicate ×2

When trying to run gemma4:31b-it-q4_K_M, the model fails to load with a 500 error. The server log shows a CUDA internal error during the vision encoder initialization phase.

Error Message

time=2026-04-03T09:26:08.106+08:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=99.5538ms shape="[5376 256]" CUDA error: an internal operation failed current device: 0, in function ggml_cuda_mul_mat_batched_cublas_impl at ggml-cuda.cu:2130 cublasGemmBatchedEx(ctx.cublas_handle(), CUBLAS_OP_T, CUBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0ne23), cu_data_type_a, nb01/nb00, (const void **) (ptrs_src.get() + 1ne23), cu_data_type_b, s11, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne0, ne23, cu_compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP) C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:94: CUDA error

time=2026-04-03T09:26:30.753+08:00 level=ERROR source=server.go:1207 msg="do load request" error="Post http://127.0.0.1:60141/load: wsarecv: An existing connection was forcibly closed by the remote host." time=2026-04-03T09:26:30.753+08:00 level=INFO source=sched.go:511 msg="Load failed" error="model failed to load, this may be due to resource limitations or an internal error" time=2026-04-03T09:26:31.036+08:00 level=ERROR source=server.go:304 msg="llama runner terminated" error="exit status 1"

Root Cause

When trying to run gemma4:31b-it-q4_K_M, the model fails to load with a 500 error. The server log shows a CUDA internal error during the vision encoder initialization phase.

PR fix notes

PR #15296: gemma4: enable flash attention

Description (problem / solution / changelog)

This patches additional code paths in the GGML CUDA backend for the memory prediction flow.

Fixes #15249

Enable flash attention for gemma4

Changed files

  • fs/ggml/ggml.go (modified, +1/-0)

PR #15301: ggml: skip cublasGemmBatchedEx during graph reservation

Description (problem / solution / changelog)

cublasGemmBatchedEx fails during graph capture when pool allocations return fake pointers. This is triggered when NUM_PARALLEL is greater than 1 for models like gemma4 that use batched matmuls. Skip it during reservation since the memory tracking is already handled by the pool allocations.

Fixes #15249

Changed files

  • llama/patches/0020-ggml-No-alloc-mode.patch (modified, +24/-3)
  • ml/backend/ggml/ggml/src/ggml-cuda/common.cuh (modified, +21/-0)

PR #1995: Update ollama/ollama Docker tag to v0.19.0

Description (problem / solution / changelog)

This PR contains the following updates:

PackageUpdateChangePending
ollama/ollamaminor0.18.20.19.00.20.7 (+6)

Configuration

📅 Schedule: (in timezone America/New_York)

  • Branch creation
    • "after 2am and before 8am on monday"
  • Automerge
    • At any time (no schedule defined)

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • <!-- rebase-check -->If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.

<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiI0My4xMTAuMiIsInVwZGF0ZWRJblZlciI6IjQzLjExMC4yIiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6WyJyZW5vdmF0ZSJdfQ==-->

Changed files

  • kubernetes/ollama/deployment.yaml (modified, +1/-1)

Code Example

time=2026-04-03T09:26:08.106+08:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=99.5538ms shape="[5376 256]"
CUDA error: an internal operation failed
  current device: 0, in function ggml_cuda_mul_mat_batched_cublas_impl at ggml-cuda.cu:2130
  cublasGemmBatchedEx(ctx.cublas_handle(), CUBLAS_OP_T, CUBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), cu_data_type_a, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), cu_data_type_b, s11, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne0, ne23, cu_compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:94: CUDA error

time=2026-04-03T09:26:30.753+08:00 level=ERROR source=server.go:1207 msg="do load request" error="Post http://127.0.0.1:60141/load: wsarecv: An existing connection was forcibly closed by the remote host."
time=2026-04-03T09:26:30.753+08:00 level=INFO source=sched.go:511 msg="Load failed" error="model failed to load, this may be due to resource limitations or an internal error"
time=2026-04-03T09:26:31.036+08:00 level=ERROR source=server.go:304 msg="llama runner terminated" error="exit status 1"
RAW_BUFFERClick to expand / collapse

Description

When trying to run gemma4:31b-it-q4_K_M, the model fails to load with a 500 error. The server log shows a CUDA internal error during the vision encoder initialization phase.

Environment

  • Ollama version: 0.20.0
  • OS: Windows 11
  • GPU: NVIDIA GeForce RTX 4090 (24 GiB VRAM, ~20.5 GiB available)
  • CUDA Driver version: 595.79 (CUDA 13.2)
  • RAM: 64 GiB (~49 GiB free)
  • Model: gemma4:31b-it-q4_K_M (19 GB, Q4_K_M)

Steps to Reproduce

  1. Pull gemma4:31b-it-q4_K_M
  2. Send a chat request (text only, no image)
  3. Model fails to load with HTTP 500

Error from server.log

time=2026-04-03T09:26:08.106+08:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=99.5538ms shape="[5376 256]"
CUDA error: an internal operation failed
  current device: 0, in function ggml_cuda_mul_mat_batched_cublas_impl at ggml-cuda.cu:2130
  cublasGemmBatchedEx(ctx.cublas_handle(), CUBLAS_OP_T, CUBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), cu_data_type_a, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), cu_data_type_b, s11, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne0, ne23, cu_compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:94: CUDA error

time=2026-04-03T09:26:30.753+08:00 level=ERROR source=server.go:1207 msg="do load request" error="Post http://127.0.0.1:60141/load: wsarecv: An existing connection was forcibly closed by the remote host."
time=2026-04-03T09:26:30.753+08:00 level=INFO source=sched.go:511 msg="Load failed" error="model failed to load, this may be due to resource limitations or an internal error"
time=2026-04-03T09:26:31.036+08:00 level=ERROR source=server.go:304 msg="llama runner terminated" error="exit status 1"

Notes

  • The error consistently occurs right after vision: encoded log line, during the vision projector's batched matrix multiplication (cublasGemmBatchedEx).
  • Resources are sufficient: 20+ GiB free VRAM, 49 GiB free RAM.
  • The same error occurs on every repeated attempt.
  • OLLAMA_FLASH_ATTENTION is currently disabled (false).
  • OLLAMA_CONTEXT_LENGTH is set to 262144.
  • OLLAMA_NUM_PARALLEL is set to 4.

extent analysis

TL;DR

The most likely fix is to investigate and potentially update the CUDA driver or the ggml-cuda library to resolve the internal CUDA error during the vision encoder initialization phase.

Guidance

  • Verify that the CUDA driver version 595.79 is compatible with the NVIDIA GeForce RTX 4090 GPU and the CUDA 13.2 version.
  • Check the ggml-cuda library for any known issues or updates related to the cublasGemmBatchedEx function.
  • Consider reducing the OLLAMA_CONTEXT_LENGTH or OLLAMA_NUM_PARALLEL settings to decrease the resource utilization and see if it resolves the issue.
  • Monitor the system resources (VRAM and RAM) during the model loading process to ensure that the error is not caused by resource limitations.

Notes

The error is consistently occurring during the vision projector's batched matrix multiplication, which suggests a potential issue with the CUDA driver or the ggml-cuda library. The fact that resources are sufficient (20+ GiB free VRAM, 49 GiB free RAM) reduces the likelihood of a resource-related issue.

Recommendation

Apply a workaround by reducing the OLLAMA_CONTEXT_LENGTH or OLLAMA_NUM_PARALLEL settings to decrease the resource utilization, as this may help resolve the issue until a more permanent fix is available.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING