ollama - 💡(How to fix) Fix Gemma4:31b not working with Cloud Run + Nvidia RTX Pro 6000 [3 comments, 2 participants]

GitHub stats
ollama/ollama#15238 (fetched 2026-04-08 02:33:53)
Comments: 3 · Participants: 2 · Timeline events: 5 · Reactions: 0
Timeline (top): commented ×3, closed ×1, labeled ×1

What is the issue?

Please see the attached logs.

To deploy Ollama on Cloud Run with an RTX PRO 6000:

gcloud beta run deploy ollama-rtx-6000 \
  --image ollama/ollama:0.20.0-rc0 --port 11434 \
  --gpu 1 --gpu-type nvidia-rtx-pro-6000 \
  --no-gpu-zonal-redundancy \
  --region us-central1

Relevant log output

[GIN] 2026/04/02 - 18:05:05 | 200 |  942.602238ms | 2a00:79e0:2f00:c:31aa:fdb1:d172:fb53 | GET      "/api/tags"
4/2/2026, 11:05:05 AM
[DEFAULT]
[GIN] 2026/04/02 - 18:05:05 | 200 |   974.63453ms | 2a00:79e0:2f00:c:31aa:fdb1:d172:fb53 | GET      "/api/tags"
4/2/2026, 11:05:06 AM
[ERROR]
undefined
4/2/2026, 11:05:06 AM
[ERROR]
undefined
4/2/2026, 11:05:08 AM
[INFO]
time=2026-04-02T18:05:08.241Z level=INFO source=server.go:432 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 42491"
4/2/2026, 11:05:09 AM
[INFO]
time=2026-04-02T18:05:09.172Z level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
4/2/2026, 11:05:09 AM
[INFO]
time=2026-04-02T18:05:09.172Z level=INFO source=server.go:432 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /gcs/odeds-west4-models/ollama/models/blobs/sha256-280af6832eca23cb322c4dcc65edfea98a21b8f8ab07dc7553bd6f7e6e7a3313 --port 37111"
4/2/2026, 11:05:09 AM
[INFO]
time=2026-04-02T18:05:09.173Z level=INFO source=sched.go:484 msg="system memory" total="78.5 GiB" free="76.8 GiB" free_swap="0 B"
4/2/2026, 11:05:09 AM
[INFO]
time=2026-04-02T18:05:09.173Z level=INFO source=sched.go:491 msg="gpu memory" id=GPU-b570c311-ff62-2b55-c998-37aed23891ac library=CUDA available="94.5 GiB" free="95.0 GiB" minimum="457.0 MiB" overhead="0 B"
4/2/2026, 11:05:09 AM
[INFO]
time=2026-04-02T18:05:09.173Z level=INFO source=server.go:759 msg="loading model" "model layers"=61 requested=-1
4/2/2026, 11:05:09 AM
[INFO]
time=2026-04-02T18:05:09.183Z level=INFO source=runner.go:1417 msg="starting ollama engine"
4/2/2026, 11:05:09 AM
[INFO]
time=2026-04-02T18:05:09.184Z level=INFO source=runner.go:1452 msg="Server listening on 127.0.0.1:37111"
4/2/2026, 11:05:09 AM
[INFO]
time=2026-04-02T18:05:09.194Z level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:12 BatchSize:512 FlashAttention:Disabled KvSize:3145728 KvCacheType: NumThreads:10 GPULayers:61[ID:GPU-b570c311-ff62-2b55-c998-37aed23891ac Layers:61(0..60)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
4/2/2026, 11:05:09 AM
[INFO]
time=2026-04-02T18:05:09.249Z level=INFO source=ggml.go:136 msg="" architecture=gemma4 file_type=Q4_K_M name="" description="" num_tensors=1189 num_key_values=49
4/2/2026, 11:05:09 AM
[DEFAULT]
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
4/2/2026, 11:05:10 AM
[DEFAULT]
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
4/2/2026, 11:05:10 AM
[DEFAULT]
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
4/2/2026, 11:05:10 AM
[DEFAULT]
ggml_cuda_init: found 1 CUDA devices:
4/2/2026, 11:05:10 AM
[DEFAULT]
  Device 0: NVIDIA RTX PRO 6000 Blackwell Server Edition, compute capability 12.0, VMM: yes, ID: GPU-b570c311-ff62-2b55-c998-37aed23891ac
4/2/2026, 11:05:10 AM
[DEFAULT]
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v13/libggml-cuda.so
4/2/2026, 11:05:10 AM
[INFO]
time=2026-04-02T18:05:10.221Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
4/2/2026, 11:05:10 AM
[INFO]
time=2026-04-02T18:05:10.228Z level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
4/2/2026, 11:05:10 AM
[INFO]
time=2026-04-02T18:05:10.255Z level=INFO source=model.go:138 msg="vision: decode" elapsed=976.33µs bounds=(0,0)-(2048,2048)
4/2/2026, 11:05:10 AM
[INFO]
time=2026-04-02T18:05:10.348Z level=INFO source=model.go:145 msg="vision: preprocess" elapsed=93.282476ms size="[768 768]"
4/2/2026, 11:05:10 AM
[INFO]
time=2026-04-02T18:05:10.348Z level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
4/2/2026, 11:05:10 AM
[INFO]
time=2026-04-02T18:05:10.348Z level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
4/2/2026, 11:05:10 AM
[INFO]
time=2026-04-02T18:05:10.349Z level=INFO source=model.go:156 msg="vision: encoded" elapsed=95.044897ms shape="[5376 256]"
4/2/2026, 11:05:51 AM
[DEFAULT]
CUDA error: an internal operation failed
4/2/2026, 11:05:51 AM
[DEFAULT]
  current device: 0, in function ggml_cuda_mul_mat_batched_cublas_impl at /ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:2122
4/2/2026, 11:05:51 AM
[DEFAULT]
  cublasGemmBatchedEx(ctx.cublas_handle(), CUBLAS_OP_T, CUBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), cu_data_type_a, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), cu_data_type_b, s11, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne0, ne23, cu_compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
4/2/2026, 11:05:51 AM
[DEFAULT]
/ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:94: CUDA error
4/2/2026, 11:05:51 AM
[DEFAULT]
/usr/lib/ollama/libggml-base.so.0(+0x1bae8)[0x7f89f0b63ae8]
4/2/2026, 11:05:51 AM
[DEFAULT]
/usr/lib/ollama/libggml-base.so.0(ggml_print_backtrace+0x1e6)[0x7f89f0b63eb6]
4/2/2026, 11:05:51 AM
[DEFAULT]
/usr/lib/ollama/libggml-base.so.0(ggml_abort+0x11d)[0x7f89f0b6403d]
4/2/2026, 11:05:51 AM
[DEFAULT]
/usr/lib/ollama/cuda_v13/libggml-cuda.so(+0x15a102)[0x7f899168a102]
4/2/2026, 11:05:51 AM
[DEFAULT]
/usr/lib/ollama/cuda_v13/libggml-cuda.so(+0x167b84)[0x7f8991697b84]
4/2/2026, 11:05:51 AM
[DEFAULT]
/usr/lib/ollama/cuda_v13/libggml-cuda.so(+0x16a454)[0x7f899169a454]
4/2/2026, 11:05:51 AM
[DEFAULT]
/usr/lib/ollama/cuda_v13/libggml-cuda.so(+0x16d2ca)[0x7f899169d2ca]
4/2/2026, 11:05:51 AM
[DEFAULT]
/usr/lib/ollama/cuda_v13/libggml-cuda.so(+0x170416)[0x7f89916a0416]
4/2/2026, 11:05:51 AM
[DEFAULT]
/usr/bin/ollama(+0x1580551)[0x56021e9b1551]
4/2/2026, 11:05:51 AM
[DEFAULT]
/usr/bin/ollama(+0x14f4e0b)[0x56021e925e0b]
4/2/2026, 11:05:51 AM
[DEFAULT]
/usr/bin/ollama(+0x3fefc1)[0x56021d82ffc1]
4/2/2026, 11:05:52 AM
[DEFAULT]
SIGABRT: abort
4/2/2026, 11:05:52 AM
[DEFAULT]

OS

No response

GPU

No response

CPU

No response

Ollama version

No response

extent analysis

TL;DR

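The Cloud Run service starts and the runner gets as far as encoding an image, but about 41 seconds later it aborts with SIGABRT: cublasGemmBatchedEx, called from ggml_cuda_mul_mat_batched_cublas_impl (ggml-cuda.cu:2122), fails with "CUDA error: an internal operation failed". The GPU is an NVIDIA RTX PRO 6000 Blackwell Server Edition (compute capability 12.0) and the backend is the cuda_v13 build of ggml-cuda bundled in the image. The root cause is not identified here; checking CUDA compatibility and moving to a build with a newer ggml-cuda library may resolve it.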

Guidance

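  • Verify the CUDA version in use and ensure it is compatible with the ggml-cuda library. The log shows the backend loaded from /usr/lib/ollama/cuda_v13/libggml-cuda.so inside the image, and CUDA.0.ARCHS includes 1200, which corresponds to the GPU's compute capability 12.0.
  • Check the ggml-cuda library version and update it if necessary. Because the library ships inside the Ollama image, in practice this means deploying a newer image tag than ollama/ollama:0.20.0-rc0 and retesting.
  • Review the deployment command and make sure --gpu 1 and --gpu-type nvidia-rtx-pro-6000 are passed as separate flags. The log confirms that one CUDA device was detected.
  • Investigate whether memory pressure contributes to the CUDA error. The log reports 94.5 GiB of GPU memory available, but the load request asks for KvSize:3145728 with Parallel:12 and FlashAttention:Disabled, so a smaller context or lower parallelism is worth testing before increasing the memory allocation for the deployment.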
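To put a number on the memory question in the last bullet, the load request from the log can be broken down per request slot. This is a minimal Python sketch, assuming that KvSize is the total KV-cache size in tokens across all Parallel slots; that interpretation is an assumption and is not stated in the issue. The function name tokens_per_slot is hypothetical.

```python
import re

# Load-request line copied from the log in this issue (runner.go:1290).
LOAD_REQUEST = (
    'msg=load request="{Operation:fit LoraPath:[] Parallel:12 BatchSize:512 '
    "FlashAttention:Disabled KvSize:3145728 KvCacheType: NumThreads:10 "
    "GPULayers:61[ID:GPU-b570c311-ff62-2b55-c998-37aed23891ac Layers:61(0..60)] "
    'MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"'
)


def int_field(line: str, name: str) -> int:
    """Return the integer value of a Name:123 field in a load-request line."""
    match = re.search(rf"\b{name}:(\d+)", line)
    if match is None:
        raise ValueError(f"field {name!r} not found")
    return int(match.group(1))


def tokens_per_slot(line: str) -> int:
    """Assumption: KvSize is the total token budget shared by Parallel slots."""
    return int_field(line, "KvSize") // int_field(line, "Parallel")


if __name__ == "__main__":
    print("Parallel:", int_field(LOAD_REQUEST, "Parallel"))
    print("KvSize:", int_field(LOAD_REQUEST, "KvSize"))
    print("tokens per slot:", tokens_per_slot(LOAD_REQUEST))
```

Under that assumption, each of the 12 slots is sized for 262144 tokens, which is why reducing parallelism or context length is a cheap experiment before changing hardware settings.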

Example

The issue does not include a minimal reproduction beyond the deployment command and the log output.

Notes

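The issue template fields for OS, GPU, CPU, and Ollama version were left blank, but the log and the deploy command fill most of the gaps: the GPU is an NVIDIA RTX PRO 6000 Blackwell Server Edition, the CPU backend loaded is libggml-cpu-icelake.so, and the image tag is ollama/ollama:0.20.0-rc0. The exact cause of the cuBLAS failure is still unclear, and further investigation is needed to determine the root cause.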
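Several of the facts missing from the issue template can be read directly from the log. The following Python sketch shows one way to pull them out. The sample lines are copied from the log in this issue; the PATTERNS table and the summarize function are hypothetical helpers written for illustration, not part of Ollama.

```python
import re

# A few lines copied verbatim from the log in this issue.
LOG_LINES = [
    "  Device 0: NVIDIA RTX PRO 6000 Blackwell Server Edition, compute capability 12.0, "
    "VMM: yes, ID: GPU-b570c311-ff62-2b55-c998-37aed23891ac",
    "load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v13/libggml-cuda.so",
    "CUDA error: an internal operation failed",
    "  current device: 0, in function ggml_cuda_mul_mat_batched_cublas_impl at "
    "/ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:2122",
]

# Hypothetical patterns; each captures one fact from a matching line.
PATTERNS = {
    "gpu": re.compile(r"Device \d+: (.+?), compute capability"),
    "compute_capability": re.compile(r"compute capability (\d+\.\d+)"),
    "cuda_backend": re.compile(r"loaded CUDA backend from (\S+)"),
    "cuda_error": re.compile(r"CUDA error: (.+)"),
    "failing_function": re.compile(r"in function (\w+) at"),
}


def summarize(lines):
    """Collect the first match of each pattern across the given log lines."""
    summary = {}
    for line in lines:
        for key, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match and key not in summary:
                summary[key] = match.group(1)
    return summary


summary = summarize(LOG_LINES)

if __name__ == "__main__":
    for key, value in summary.items():
        print(f"{key}: {value}")
```

A summary like this is a convenient way to fill in the blank template fields when reporting the problem upstream.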

Recommendation

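As a workaround, update the ggml-cuda library by redeploying with a newer Ollama image, since the library is bundled in the image, and verify that the CUDA version in use supports the Blackwell GPU. This may resolve the CUDA error and allow inference to succeed, but this page does not record a confirmed fix.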
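To test the memory hypothesis from the guidance alongside this recommendation, the deployment can be repeated with a smaller configuration. The Python sketch below only builds and prints the command line; it does not run it. The base arguments come from the issue, and --set-env-vars is an existing gcloud run deploy flag. The values OLLAMA_NUM_PARALLEL=1 and OLLAMA_CONTEXT_LENGTH=8192 are arbitrary illustrative choices, and the with_env helper is hypothetical. This is an experiment to narrow down the cause, not a confirmed fix.

```python
import shlex

# Deploy arguments as given in the issue.
BASE_DEPLOY = [
    "gcloud", "beta", "run", "deploy", "ollama-rtx-6000",
    "--image", "ollama/ollama:0.20.0-rc0", "--port", "11434",
    "--gpu", "1", "--gpu-type", "nvidia-rtx-pro-6000",
    "--no-gpu-zonal-redundancy",
    "--region", "us-central1",
]


def with_env(base, env):
    """Return a copy of the command with KEY=VALUE pairs passed via --set-env-vars."""
    pairs = ",".join(f"{key}={value}" for key, value in env.items())
    return [*base, "--set-env-vars", pairs]


# Assumed experiment values: one request slot and a much smaller context.
experiment = with_env(
    BASE_DEPLOY,
    {"OLLAMA_NUM_PARALLEL": "1", "OLLAMA_CONTEXT_LENGTH": "8192"},
)

if __name__ == "__main__":
    print(shlex.join(experiment))
```

If the crash disappears with the reduced settings, memory pressure or the size of the batched matrix multiply is a likely factor; if it persists, the bundled CUDA build is the more probable suspect.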
