ollama - Performance regression in Ollama v0.30 on 5x GTX 1070 Ti: much slower than v0.24, low GPU clocks/utilization

ollama 2026-06-02 06:58:33

Error Message

More details and the exact log error are explained below in the thread. The following warning is logged five times in a row (the machine has five GPUs):

Jun 02 13:52:52 debian ollama[88170]: time=2026-06-02T13:52:52.474+03:00 level=WARN source=cuda_compat.go:38 msg="NVIDIA driver too old" device="NVIDIA GeForce GTX 1070 Ti" compute=6.1 driver=535 required_driver="570 or newer"

Root Cause

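The root cause was an outdated NVIDIA driver. I was using driver 535, but Ollama v0.30 required driver 570 or newer for the CUDA runtime it was using. The log shows the consequence: the cuda_v12 backend is flagged with "NVIDIA driver too old", the cuda_v13 backend fails with "CUDA driver version is insufficient for CUDA runtime version", and all five GPUs end up registered with library=Vulkan instead of CUDA.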
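
To check whether another machine is hitting the same condition, the "NVIDIA driver too old" warnings can be extracted from the service log. This is a minimal sketch, assuming the warning keeps the key=value layout shown in the log on this page. The function name `find_driver_warnings` is hypothetical and not part of Ollama.

```python
import re

# Matches the WARN line emitted from cuda_compat.go in the log on this page.
# Assumption: the field order (device, compute, driver, required_driver)
# stays exactly as shown there.
WARN_RE = re.compile(
    r'msg="NVIDIA driver too old" device="(?P<device>[^"]+)" '
    r'compute=(?P<compute>[\d.]+) driver=(?P<driver>\d+) '
    r'required_driver="(?P<required>\d+) or newer"'
)


def find_driver_warnings(log_text):
    """Return one dict per 'NVIDIA driver too old' warning found in log_text."""
    return [
        {
            "device": m.group("device"),
            "compute": m.group("compute"),
            "driver": int(m.group("driver")),
            "required": int(m.group("required")),
        }
        for m in WARN_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = (
        'level=WARN source=cuda_compat.go:38 msg="NVIDIA driver too old" '
        'device="NVIDIA GeForce GTX 1070 Ti" compute=6.1 driver=535 '
        'required_driver="570 or newer"'
    )
    for w in find_driver_warnings(sample):
        print(f'{w["device"]}: driver {w["driver"]} < required {w["required"]}')
```

The text to scan could come from something like `journalctl -u ollama`, assuming Ollama runs as a systemd unit with that name.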

Code Example

n_decoded = 100, tg = 32.66 t/s
n_decoded = 280, tg = 18.42 t/s
n_decoded = 456, tg = 9.90 t/s
n_decoded = 722, tg = 6.83 t/s

---

Jun 02 13:52:49 debian ollama[87947]: time=2026-06-02T13:52:49.127+03:00 level=DEBUG source=model_recommendations.go:181 msg="stopping model recommendations cache"
Jun 02 13:52:49 debian ollama[87947]: time=2026-06-02T13:52:49.127+03:00 level=DEBUG source=sched.go:227 msg="shutting down scheduler pending loop"
Jun 02 13:52:49 debian ollama[87947]: time=2026-06-02T13:52:49.127+03:00 level=DEBUG source=sched.go:368 msg="shutting down scheduler completed loop"
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.219+03:00 level=INFO source=routes.go:1919 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: LLAMA_ARG_FIT: LLAMA_ARG_FIT_TARGET: NO_PROXY: OLLAMA_CONTEXT_LENGTH:32768 OLLAMA_DEBUG:DEBUG-4 OLLAMA_DEBUG_LOG_REQUESTS:false OLLAMA_EDITOR: OLLAMA_FLASH_ATTENTION:false OLLAMA_GO_TEMPLATE:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_IGPU_ENABLE: OLLAMA_KEEP_ALIVE:2h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_TRANSFER_STREAMS:4 OLLAMA_MODELS:/home/boris/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NO_CLOUD:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false OLLAMA_VULKAN:true ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.220+03:00 level=INFO source=routes.go:1921 msg="Ollama cloud disabled: false"
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.225+03:00 level=INFO source=images.go:754 msg="total blobs: 44"
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.226+03:00 level=INFO source=images.go:761 msg="total unused blobs removed: 0"
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.226+03:00 level=DEBUG source=model_recommendations.go:57 msg="starting model recommendations cache" default_recommendations=6 refresh_interval=4h0m0s fetch_timeout=3s
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.226+03:00 level=DEBUG source=model_show_cache.go:128 msg="starting model show cache"
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.226+03:00 level=DEBUG source=model_list_cache.go:70 msg="starting model list cache"
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.226+03:00 level=INFO source=routes.go:1981 msg="Listening on [::]:11434 (version 0.30.0)"
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.227+03:00 level=DEBUG source=sched.go:211 msg="starting llm scheduler"
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.227+03:00 level=DEBUG source=model_recommendations.go:262 msg="loaded model recommendations snapshot" path=/home/boris/.ollama/cache/model-recommendations.json count=8
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.227+03:00 level=DEBUG source=model_recommendations.go:192 msg="refreshing model recommendations from remote" url=https://ollama.com/api/experimental/model-recommendations
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.227+03:00 level=INFO source=runner.go:55 msg="discovering available GPUs..."
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.227+03:00 level=TRACE source=llama_server.go:71 msg="running llama-server for discovery" cmd=/usr/local/lib/ollama/llama-server libDirs="[/usr/local/lib/ollama /usr/local/lib/ollama/cuda_v12]"
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.235+03:00 level=INFO source=model_list_cache.go:111 msg="model list cache hydration complete" models=12 failures=0 elapsed=8.243871ms
Jun 02 13:52:49 debian ollama[88170]: 0.00.171.343 I common_params_print_info: build 1 (19620004f) with GNU 11.2.1 for Linux x86_64
Jun 02 13:52:49 debian ollama[88170]: 0.00.171.349 I log_info: verbosity = 2147483647 (adjust with the `-lv N` CLI arg)
Jun 02 13:52:49 debian ollama[88170]: 0.00.171.441 I system_info: n_threads = 24 (n_threads_batch = 24) / 48 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | F16C = 1 | LLAMAFILE = 1 | REPACK = 1 | CUDA : ARCHS = 500,520,600,610,700,750,800,860,890,900,1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 |
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.438+03:00 level=DEBUG source=llama_server.go:121 msg="llama-server discovery: stopped subprocess after collecting GPU info" exit=unknown libDirs="[/usr/local/lib/ollama /usr/local/lib/ollama/cuda_v12]"
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.441+03:00 level=DEBUG source=model_recommendations.go:225 msg="model recommendations refreshed" count=8
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.443+03:00 level=DEBUG source=model_recommendations.go:302 msg="persisted model recommendations snapshot" path=/home/boris/.ollama/cache/model-recommendations.json count=8
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.443+03:00 level=INFO source=model_recommendations.go:177 msg="model recommendations cache sleep scheduled" wait=3h37m22.469775638s consecutive_failures=0
Jun 02 13:52:51 debian ollama[88170]: Available devices:
Jun 02 13:52:51 debian ollama[88170]:   CUDA0: NVIDIA GeForce GTX 1070 Ti (8113 MiB, 8003 MiB free)
Jun 02 13:52:51 debian ollama[88170]:   CUDA1: NVIDIA GeForce GTX 1070 Ti (8113 MiB, 8003 MiB free)
Jun 02 13:52:51 debian ollama[88170]:   CUDA2: NVIDIA GeForce GTX 1070 Ti (8111 MiB, 7821 MiB free)
Jun 02 13:52:51 debian ollama[88170]:   CUDA3: NVIDIA GeForce GTX 1070 Ti (8113 MiB, 8003 MiB free)
Jun 02 13:52:51 debian ollama[88170]:   CUDA4: NVIDIA GeForce GTX 1070 Ti (8113 MiB, 8003 MiB free)
Jun 02 13:52:52 debian ollama[88170]: ggml_cuda_init: found 5 CUDA devices (Total VRAM: 40566 MiB):
Jun 02 13:52:52 debian ollama[88170]:   Device 0: NVIDIA GeForce GTX 1070 Ti, compute capability 6.1, VMM: yes, VRAM: 8113 MiB
Jun 02 13:52:52 debian ollama[88170]:   Device 1: NVIDIA GeForce GTX 1070 Ti, compute capability 6.1, VMM: yes, VRAM: 8113 MiB
Jun 02 13:52:52 debian ollama[88170]:   Device 2: NVIDIA GeForce GTX 1070 Ti, compute capability 6.1, VMM: yes, VRAM: 8111 MiB
Jun 02 13:52:52 debian ollama[88170]:   Device 3: NVIDIA GeForce GTX 1070 Ti, compute capability 6.1, VMM: yes, VRAM: 8113 MiB
Jun 02 13:52:52 debian ollama[88170]:   Device 4: NVIDIA GeForce GTX 1070 Ti, compute capability 6.1, VMM: yes, VRAM: 8113 MiB
Jun 02 13:52:52 debian ollama[88170]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Jun 02 13:52:52 debian ollama[88170]: time=2026-06-02T13:52:52.473+03:00 level=DEBUG source=llama_server.go:52 msg="llama-server device discovery took" duration=3.246096903s libDirs="[/usr/local/lib/ollama /usr/local/lib/ollama/cuda_v12]"
Jun 02 13:52:52 debian ollama[88170]: time=2026-06-02T13:52:52.473+03:00 level=DEBUG source=runner.go:472 msg="bootstrap discovery took" duration=3.246358257s OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/cuda_v12]" extra_envs=map[]
Jun 02 13:52:52 debian ollama[88170]: time=2026-06-02T13:52:52.474+03:00 level=WARN source=cuda_compat.go:38 msg="NVIDIA driver too old" device="NVIDIA GeForce GTX 1070 Ti" compute=6.1 driver=535 required_driver="570 or newer"
Jun 02 13:52:52 debian ollama[88170]: time=2026-06-02T13:52:52.474+03:00 level=WARN source=cuda_compat.go:38 msg="NVIDIA driver too old" device="NVIDIA GeForce GTX 1070 Ti" compute=6.1 driver=535 required_driver="570 or newer"
Jun 02 13:52:52 debian ollama[88170]: time=2026-06-02T13:52:52.474+03:00 level=WARN source=cuda_compat.go:38 msg="NVIDIA driver too old" device="NVIDIA GeForce GTX 1070 Ti" compute=6.1 driver=535 required_driver="570 or newer"
Jun 02 13:52:52 debian ollama[88170]: time=2026-06-02T13:52:52.474+03:00 level=WARN source=cuda_compat.go:38 msg="NVIDIA driver too old" device="NVIDIA GeForce GTX 1070 Ti" compute=6.1 driver=535 required_driver="570 or newer"
Jun 02 13:52:52 debian ollama[88170]: time=2026-06-02T13:52:52.474+03:00 level=WARN source=cuda_compat.go:38 msg="NVIDIA driver too old" device="NVIDIA GeForce GTX 1070 Ti" compute=6.1 driver=535 required_driver="570 or newer"
Jun 02 13:52:52 debian ollama[88170]: time=2026-06-02T13:52:52.475+03:00 level=TRACE source=llama_server.go:71 msg="running llama-server for discovery" cmd=/usr/local/lib/ollama/llama-server libDirs="[/usr/local/lib/ollama /usr/local/lib/ollama/cuda_v13]"
Jun 02 13:52:52 debian ollama[88170]: 0.00.081.210 E ggml_cuda_init: failed to initialize CUDA: CUDA driver version is insufficient for CUDA runtime version
Jun 02 13:52:52 debian ollama[88170]: 0.00.082.821 I common_params_print_info: build 1 (19620004f) with GNU 11.2.1 for Linux x86_64
Jun 02 13:52:52 debian ollama[88170]: 0.00.082.825 I log_info: verbosity = 2147483647 (adjust with the `-lv N` CLI arg)
Jun 02 13:52:52 debian ollama[88170]: 0.00.082.906 I system_info: n_threads = 24 (n_threads_batch = 24) / 48 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | F16C = 1 | LLAMAFILE = 1 | REPACK = 1 | CUDA : ARCHS = 750,800,860,890,900,1000,1030,1100,1200,1210 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 |
Jun 02 13:52:52 debian ollama[88170]: time=2026-06-02T13:52:52.570+03:00 level=DEBUG source=llama_server.go:121 msg="llama-server discovery: stopped subprocess after collecting GPU info" exit=unknown libDirs="[/usr/local/lib/ollama /usr/local/lib/ollama/cuda_v13]"
Jun 02 13:52:52 debian ollama[88170]: 0.00.081.946 E ggml_cuda_init: failed to initialize CUDA: CUDA driver version is insufficient for CUDA runtime version
Jun 02 13:52:52 debian ollama[88170]: Available devices:
Jun 02 13:52:52 debian ollama[88170]: ggml_cuda_init: failed to initialize CUDA: CUDA driver version is insufficient for CUDA runtime version
Jun 02 13:52:52 debian ollama[88170]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v13/libggml-cuda.so
Jun 02 13:52:52 debian ollama[88170]: time=2026-06-02T13:52:52.886+03:00 level=DEBUG source=llama_server.go:52 msg="llama-server device discovery took" duration=411.332702ms libDirs="[/usr/local/lib/ollama /usr/local/lib/ollama/cuda_v13]"
Jun 02 13:52:52 debian ollama[88170]: time=2026-06-02T13:52:52.886+03:00 level=DEBUG source=runner.go:472 msg="bootstrap discovery took" duration=411.850154ms OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/cuda_v13]" extra_envs=map[]
Jun 02 13:52:52 debian ollama[88170]: time=2026-06-02T13:52:52.887+03:00 level=TRACE source=llama_server.go:71 msg="running llama-server for discovery" cmd=/usr/local/lib/ollama/llama-server libDirs="[/usr/local/lib/ollama /usr/local/lib/ollama/vulkan]"
Jun 02 13:52:53 debian ollama[88170]: 0.00.262.083 I common_params_print_info: build 1 (19620004f) with GNU 11.2.1 for Linux x86_64
Jun 02 13:52:53 debian ollama[88170]: 0.00.262.089 I log_info: verbosity = 2147483647 (adjust with the `-lv N` CLI arg)
Jun 02 13:52:53 debian ollama[88170]: 0.00.262.151 I system_info: n_threads = 24 (n_threads_batch = 24) / 48 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | F16C = 1 | LLAMAFILE = 1 | REPACK = 1 |
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.167+03:00 level=DEBUG source=llama_server.go:121 msg="llama-server discovery: stopped subprocess after collecting GPU info" exit=unknown libDirs="[/usr/local/lib/ollama /usr/local/lib/ollama/vulkan]"
Jun 02 13:52:53 debian ollama[88170]: Available devices:
Jun 02 13:52:53 debian ollama[88170]:   Vulkan0: NVIDIA GeForce GTX 1070 Ti (8438 MiB, 8153 MiB free)
Jun 02 13:52:53 debian ollama[88170]:   Vulkan1: NVIDIA GeForce GTX 1070 Ti (8438 MiB, 8335 MiB free)
Jun 02 13:52:53 debian ollama[88170]:   Vulkan2: NVIDIA GeForce GTX 1070 Ti (8438 MiB, 8340 MiB free)
Jun 02 13:52:53 debian ollama[88170]:   Vulkan3: NVIDIA GeForce GTX 1070 Ti (8438 MiB, 8341 MiB free)
Jun 02 13:52:53 debian ollama[88170]:   Vulkan4: NVIDIA GeForce GTX 1070 Ti (8438 MiB, 8340 MiB free)
Jun 02 13:52:53 debian ollama[88170]: ggml_vulkan: Found 5 Vulkan devices:
Jun 02 13:52:53 debian ollama[88170]: ggml_vulkan: 0 = NVIDIA GeForce GTX 1070 Ti (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
Jun 02 13:52:53 debian ollama[88170]: ggml_vulkan: 1 = NVIDIA GeForce GTX 1070 Ti (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
Jun 02 13:52:53 debian ollama[88170]: ggml_vulkan: 2 = NVIDIA GeForce GTX 1070 Ti (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
Jun 02 13:52:53 debian ollama[88170]: ggml_vulkan: 3 = NVIDIA GeForce GTX 1070 Ti (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
Jun 02 13:52:53 debian ollama[88170]: ggml_vulkan: 4 = NVIDIA GeForce GTX 1070 Ti (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
Jun 02 13:52:53 debian ollama[88170]: load_backend: loaded Vulkan backend from /usr/local/lib/ollama/vulkan/libggml-vulkan.so
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.784+03:00 level=DEBUG source=llama_server.go:52 msg="llama-server device discovery took" duration=897.811145ms libDirs="[/usr/local/lib/ollama /usr/local/lib/ollama/vulkan]"
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.784+03:00 level=DEBUG source=runner.go:472 msg="bootstrap discovery took" duration=897.991102ms OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/vulkan]" extra_envs=map[]
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.784+03:00 level=DEBUG source=runner.go:117 msg="evaluating which, if any, devices to filter out" initial_count=5
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.784+03:00 level=TRACE source=runner.go:167 msg="supported GPU library combinations before filtering" supported="map[Vulkan:map[/usr/local/lib/ollama/vulkan:map[0:0 1:1 2:2 3:3 4:4]]]"
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.784+03:00 level=DEBUG source=runner.go:186 msg="adjusting filtering IDs" FilterID=0 new_ID=0
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.784+03:00 level=DEBUG source=runner.go:186 msg="adjusting filtering IDs" FilterID=1 new_ID=1
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.784+03:00 level=DEBUG source=runner.go:186 msg="adjusting filtering IDs" FilterID=2 new_ID=2
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.784+03:00 level=DEBUG source=runner.go:186 msg="adjusting filtering IDs" FilterID=3 new_ID=3
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.784+03:00 level=DEBUG source=runner.go:186 msg="adjusting filtering IDs" FilterID=4 new_ID=4
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.784+03:00 level=DEBUG source=runner.go:37 msg="GPU bootstrap discovery took" duration=4.557781248s
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.784+03:00 level=INFO source=types.go:32 msg="inference compute" id=3 filter_id=3 library=Vulkan compute=0.0 name=Vulkan3 description="NVIDIA GeForce GTX 1070 Ti" libdirs=ollama,vulkan driver=0.0 pci_id=0000:83:00.0 type=discrete total="8.2 GiB" available="8.1 GiB"
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.785+03:00 level=INFO source=types.go:32 msg="inference compute" id=2 filter_id=2 library=Vulkan compute=0.0 name=Vulkan2 description="NVIDIA GeForce GTX 1070 Ti" libdirs=ollama,vulkan driver=0.0 pci_id=0000:04:00.0 type=discrete total="8.2 GiB" available="8.1 GiB"
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.785+03:00 level=INFO source=types.go:32 msg="inference compute" id=4 filter_id=4 library=Vulkan compute=0.0 name=Vulkan4 description="NVIDIA GeForce GTX 1070 Ti" libdirs=ollama,vulkan driver=0.0 pci_id=0000:84:00.0 type=discrete total="8.2 GiB" available="8.1 GiB"
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.785+03:00 level=INFO source=types.go:32 msg="inference compute" id=1 filter_id=1 library=Vulkan compute=0.0 name=Vulkan1 description="NVIDIA GeForce GTX 1070 Ti" libdirs=ollama,vulkan driver=0.0 pci_id=0000:03:00.0 type=discrete total="8.2 GiB" available="8.1 GiB"
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.785+03:00 level=INFO source=types.go:32 msg="inference compute" id=0 filter_id=0 library=Vulkan compute=0.0 name=Vulkan0 description="NVIDIA GeForce GTX 1070 Ti" libdirs=ollama,vulkan driver=0.0 pci_id=0000:82:00.0 type=discrete total="8.2 GiB" available="8.0 GiB"
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.785+03:00 level=INFO source=routes.go:2031 msg="vram-based default context" total_vram="41.2 GiB" default_num_ctx=32768

Jun 02 09:32:10 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    100, tg =  32.66 t/s
Jun 02 09:32:13 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    178, tg =  29.30 t/s
Jun 02 09:32:16 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    229, tg =  25.05 t/s
Jun 02 09:32:19 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    257, tg =  21.13 t/s
Jun 02 09:32:22 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    280, tg =  18.42 t/s
Jun 02 09:32:25 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    302, tg =  16.59 t/s
Jun 02 09:32:28 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    321, tg =  15.09 t/s
Jun 02 09:32:31 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    340, tg =  14.00 t/s
Jun 02 09:32:34 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    359, tg =  13.09 t/s
Jun 02 09:32:37 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    377, tg =  12.34 t/s
Jun 02 09:32:41 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    395, tg =  11.73 t/s
Jun 02 09:32:44 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    412, tg =  11.22 t/s
Jun 02 09:32:47 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    428, tg =  10.73 t/s
Jun 02 09:32:50 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    442, tg =  10.29 t/s
Jun 02 09:32:53 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    456, tg =   9.90 t/s
Jun 02 09:32:56 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    470, tg =   9.56 t/s
Jun 02 09:32:59 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    484, tg =   9.25 t/s
Jun 02 09:33:02 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    498, tg =   8.97 t/s
Jun 02 09:33:06 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    512, tg =   8.74 t/s
Jun 02 09:33:09 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    526, tg =   8.52 t/s
Jun 02 09:33:12 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    540, tg =   8.33 t/s
Jun 02 09:33:15 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    554, tg =   8.15 t/s
Jun 02 09:33:18 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    568, tg =   7.99 t/s
Jun 02 09:33:21 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    582, tg =   7.84 t/s
Jun 02 09:33:24 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    596, tg =   7.71 t/s
Jun 02 09:33:27 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    610, tg =   7.58 t/s
Jun 02 09:33:31 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    624, tg =   7.46 t/s
Jun 02 09:33:34 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    638, tg =   7.35 t/s
Jun 02 09:33:37 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    652, tg =   7.25 t/s
Jun 02 09:33:40 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    666, tg =   7.16 t/s
Jun 02 09:33:43 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    680, tg =   7.07 t/s
Jun 02 09:33:46 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    694, tg =   6.98 t/s
Jun 02 09:33:50 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    708, tg =   6.90 t/s
Jun 02 09:33:53 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    722, tg =   6.83 t/s


What is the issue?

Problem

After upgrading from Ollama v0.24.0 to v0.30, performance became much slower on the same machine with the same model and same prompt.

This is not related to RAG or Open WebUI retrieval. I tested with a normal prompt without RAG and the slowdown is still present.

Hardware

OS: Debian Linux
GPUs: 5x NVIDIA GTX 1070 Ti, 8 GB each
Total VRAM: around 40 GB
NVIDIA CUDA backend
Multi-GPU setup

Model

Model: gemma4:26b-a4b-it-q4_K_M
Also tested a forced GPU-offload copy using: PARAMETER num_gpu 999
The forced version did not fix the problem.

Ollama v0.24.0 behavior

VRAM usage: around 25 GB
GPU utilization: all 5 GPUs are used steadily, around 25%
GPU clocks go high, around 1.9 GHz
Same simple prompt without RAG:
- input_tokens: 50
- output_tokens: 978
- prompt_token/s: around 303.8
- response_token/s: around 34.26
- total_duration: around 30 seconds

Ollama v0.30 behavior

VRAM usage: around 20 GB
GPU utilization: abnormal
GPU clocks stay very low, sometimes around 139 MHz / 405 MHz
Power usage stays very low compared to v0.24
Same simple prompt without RAG:
- input_tokens: 39
- output_tokens: 812
- prompt_token/s: around 11.73
- response_token/s: around 5.62
- total_duration: around 2m29s
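
Putting the two runs side by side gives the size of the regression. The short sketch below only divides the throughput figures reported above. The two runs used slightly different token counts, so the ratios are approximate.

```python
# Throughput figures copied from the two runs reported above (tokens per second).
V024 = {"prompt": 303.8, "response": 34.26}
V030 = {"prompt": 11.73, "response": 5.62}


def slowdown(before, after):
    """Return how many times slower `after` is compared with `before`."""
    return before / after


if __name__ == "__main__":
    for phase in ("prompt", "response"):
        print(f"{phase}: {slowdown(V024[phase], V030[phase]):.1f}x slower")
```

By these numbers prompt processing is about 26x slower and token generation about 6x slower on v0.30.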

Additional log example from v0.30

Generation starts faster, then slows down continuously:

n_decoded = 100, tg = 32.66 t/s
n_decoded = 280, tg = 18.42 t/s
n_decoded = 456, tg = 9.90 t/s
n_decoded = 722, tg = 6.83 t/s
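
The tg values above appear to be running averages, so the speed at the end of the run is lower than the last figure suggests. A minimal sketch that estimates the rate between samples, assuming tg equals n_decoded divided by the time elapsed since generation started (an assumption, though it is consistent with the roughly 3-second spacing of the journal timestamps). The function names are hypothetical.

```python
import re

# Matches the "n_decoded = N, tg = X t/s" part of a print_timing line.
TIMING_RE = re.compile(r"n_decoded\s*=\s*(\d+),\s*tg\s*=\s*([\d.]+)\s*t/s")


def parse_timings(log_text):
    """Return (n_decoded, tg) pairs in the order they appear in log_text."""
    return [(int(n), float(tg)) for n, tg in TIMING_RE.findall(log_text)]


def interval_rates(samples):
    """Estimate tokens/s between consecutive samples.

    Assumption: tg is the cumulative average, so elapsed = n_decoded / tg.
    """
    rates = []
    for (n0, tg0), (n1, tg1) in zip(samples, samples[1:]):
        dt = n1 / tg1 - n0 / tg0  # seconds between the two samples
        rates.append((n1 - n0) / dt)
    return rates


if __name__ == "__main__":
    excerpt = """
    n_decoded = 100, tg = 32.66 t/s
    n_decoded = 280, tg = 18.42 t/s
    n_decoded = 456, tg = 9.90 t/s
    n_decoded = 722, tg = 6.83 t/s
    """
    for rate in interval_rates(parse_timings(excerpt)):
        print(f"{rate:.1f} t/s")
```

Under that assumption, the rate over the last interval of the excerpt is between 4 and 5 t/s, below the 6.83 t/s average printed on the final line.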

Expected behavior

Ollama v0.30 should have similar performance to v0.24.0 on the same model and hardware, or at least should not be 5x slower.

Actual behavior

v0.30 uses less VRAM, but performance is much worse. GPU clocks and power usage stay low, and GPU utilization becomes bursty instead of steady.
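
The clock, power and utilization figures can be sampled per GPU with `nvidia-smi`. The sketch below is one way to do that, assuming `nvidia-smi` is on the PATH and that every queried field returns a number (a field reported as "[N/A]" would make the parser raise). The helper names are hypothetical.

```python
import subprocess

# Fields passed to nvidia-smi --query-gpu, in the order they are parsed below.
QUERY = "index,utilization.gpu,clocks.sm,power.draw"


def parse_gpu_csv(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, util, sm_clock, power = [f.strip() for f in line.split(",")]
        rows.append({
            "index": int(index),
            "util_pct": float(util),
            "sm_mhz": float(sm_clock),
            "power_w": float(power),
        })
    return rows


def read_gpus():
    """Run nvidia-smi once and return the parsed rows."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_gpu_csv(out)


if __name__ == "__main__":
    try:
        for gpu in read_gpus():
            print(gpu)
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"could not query GPUs: {exc}")
```

Running this in a loop while a prompt is being generated would show whether SM clocks stay near idle values, as described for v0.30, or rise under load as they did on v0.24.0.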

This looks like a regression in CUDA / multi-GPU scheduling / offload behavior in the newer llama.cpp / llama-server based runtime, especially on older Pascal GPUs.

Request

Can you check if v0.30 changed the GPU split/offload behavior compared to v0.24.0?

It seems the new version is choosing a worse execution plan for this 5x GTX 1070 Ti setup.

Update / Solution

This issue is solved in my case.

The root cause was an outdated NVIDIA driver. I was using driver 535, but the newer Ollama version required driver 570 or newer for the CUDA runtime it was using.

After upgrading the NVIDIA driver to 570.211.01, Ollama started working correctly with CUDA/GPU acceleration again.
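
One way to confirm which backend Ollama actually selected is to look at the `library=` field of the "inference compute" lines printed at startup. With driver 535 the log on this page shows `library=Vulkan` for all five GPUs. A minimal sketch, with a hypothetical function name. The value printed when CUDA is selected does not appear in this thread, so the code only counts whatever it finds.

```python
import re
from collections import Counter

# Captures the library= value on each 'inference compute' startup line.
COMPUTE_RE = re.compile(r'msg="inference compute".*?\blibrary=(\S+)')


def backends_in_use(log_text):
    """Count the library= value of every 'inference compute' log line."""
    return Counter(COMPUTE_RE.findall(log_text))


if __name__ == "__main__":
    sample = (
        'level=INFO source=types.go:32 msg="inference compute" id=3 filter_id=3 '
        'library=Vulkan compute=0.0 name=Vulkan3 '
        'description="NVIDIA GeForce GTX 1070 Ti"\n'
    )
    for library, count in backends_in_use(sample).items():
        print(f"{library}: {count} device(s)")
```

If every device still reports Vulkan after a driver upgrade, the CUDA backends were presumably rejected again, and the discovery section of the log should say why.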

More details and the exact log error are explained below in the thread.

Relevant log output

Jun 02 13:52:49 debian ollama[87947]: time=2026-06-02T13:52:49.127+03:00 level=DEBUG source=model_recommendations.go:181 msg="stopping model recommendations cache"
Jun 02 13:52:49 debian ollama[87947]: time=2026-06-02T13:52:49.127+03:00 level=DEBUG source=sched.go:227 msg="shutting down scheduler pending loop"
Jun 02 13:52:49 debian ollama[87947]: time=2026-06-02T13:52:49.127+03:00 level=DEBUG source=sched.go:368 msg="shutting down scheduler completed loop"
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.219+03:00 level=INFO source=routes.go:1919 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: LLAMA_ARG_FIT: LLAMA_ARG_FIT_TARGET: NO_PROXY: OLLAMA_CONTEXT_LENGTH:32768 OLLAMA_DEBUG:DEBUG-4 OLLAMA_DEBUG_LOG_REQUESTS:false OLLAMA_EDITOR: OLLAMA_FLASH_ATTENTION:false OLLAMA_GO_TEMPLATE:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_IGPU_ENABLE: OLLAMA_KEEP_ALIVE:2h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_TRANSFER_STREAMS:4 OLLAMA_MODELS:/home/boris/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NO_CLOUD:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false OLLAMA_VULKAN:true ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.220+03:00 level=INFO source=routes.go:1921 msg="Ollama cloud disabled: false"
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.225+03:00 level=INFO source=images.go:754 msg="total blobs: 44"
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.226+03:00 level=INFO source=images.go:761 msg="total unused blobs removed: 0"
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.226+03:00 level=DEBUG source=model_recommendations.go:57 msg="starting model recommendations cache" default_recommendations=6 refresh_interval=4h0m0s fetch_timeout=3s
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.226+03:00 level=DEBUG source=model_show_cache.go:128 msg="starting model show cache"
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.226+03:00 level=DEBUG source=model_list_cache.go:70 msg="starting model list cache"
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.226+03:00 level=INFO source=routes.go:1981 msg="Listening on [::]:11434 (version 0.30.0)"
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.227+03:00 level=DEBUG source=sched.go:211 msg="starting llm scheduler"
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.227+03:00 level=DEBUG source=model_recommendations.go:262 msg="loaded model recommendations snapshot" path=/home/boris/.ollama/cache/model-recommendations.json count=8
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.227+03:00 level=DEBUG source=model_recommendations.go:192 msg="refreshing model recommendations from remote" url=https://ollama.com/api/experimental/model-recommendations
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.227+03:00 level=INFO source=runner.go:55 msg="discovering available GPUs..."
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.227+03:00 level=TRACE source=llama_server.go:71 msg="running llama-server for discovery" cmd=/usr/local/lib/ollama/llama-server libDirs="[/usr/local/lib/ollama /usr/local/lib/ollama/cuda_v12]"
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.235+03:00 level=INFO source=model_list_cache.go:111 msg="model list cache hydration complete" models=12 failures=0 elapsed=8.243871ms
Jun 02 13:52:49 debian ollama[88170]: 0.00.171.343 I common_params_print_info: build 1 (19620004f) with GNU 11.2.1 for Linux x86_64
Jun 02 13:52:49 debian ollama[88170]: 0.00.171.349 I log_info: verbosity = 2147483647 (adjust with the `-lv N` CLI arg)
Jun 02 13:52:49 debian ollama[88170]: 0.00.171.441 I system_info: n_threads = 24 (n_threads_batch = 24) / 48 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | F16C = 1 | LLAMAFILE = 1 | REPACK = 1 | CUDA : ARCHS = 500,520,600,610,700,750,800,860,890,900,1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 |
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.438+03:00 level=DEBUG source=llama_server.go:121 msg="llama-server discovery: stopped subprocess after collecting GPU info" exit=unknown libDirs="[/usr/local/lib/ollama /usr/local/lib/ollama/cuda_v12]"
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.441+03:00 level=DEBUG source=model_recommendations.go:225 msg="model recommendations refreshed" count=8
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.443+03:00 level=DEBUG source=model_recommendations.go:302 msg="persisted model recommendations snapshot" path=/home/boris/.ollama/cache/model-recommendations.json count=8
Jun 02 13:52:49 debian ollama[88170]: time=2026-06-02T13:52:49.443+03:00 level=INFO source=model_recommendations.go:177 msg="model recommendations cache sleep scheduled" wait=3h37m22.469775638s consecutive_failures=0
Jun 02 13:52:51 debian ollama[88170]: Available devices:
Jun 02 13:52:51 debian ollama[88170]:   CUDA0: NVIDIA GeForce GTX 1070 Ti (8113 MiB, 8003 MiB free)
Jun 02 13:52:51 debian ollama[88170]:   CUDA1: NVIDIA GeForce GTX 1070 Ti (8113 MiB, 8003 MiB free)
Jun 02 13:52:51 debian ollama[88170]:   CUDA2: NVIDIA GeForce GTX 1070 Ti (8111 MiB, 7821 MiB free)
Jun 02 13:52:51 debian ollama[88170]:   CUDA3: NVIDIA GeForce GTX 1070 Ti (8113 MiB, 8003 MiB free)
Jun 02 13:52:51 debian ollama[88170]:   CUDA4: NVIDIA GeForce GTX 1070 Ti (8113 MiB, 8003 MiB free)
Jun 02 13:52:52 debian ollama[88170]: ggml_cuda_init: found 5 CUDA devices (Total VRAM: 40566 MiB):
Jun 02 13:52:52 debian ollama[88170]:   Device 0: NVIDIA GeForce GTX 1070 Ti, compute capability 6.1, VMM: yes, VRAM: 8113 MiB
Jun 02 13:52:52 debian ollama[88170]:   Device 1: NVIDIA GeForce GTX 1070 Ti, compute capability 6.1, VMM: yes, VRAM: 8113 MiB
Jun 02 13:52:52 debian ollama[88170]:   Device 2: NVIDIA GeForce GTX 1070 Ti, compute capability 6.1, VMM: yes, VRAM: 8111 MiB
Jun 02 13:52:52 debian ollama[88170]:   Device 3: NVIDIA GeForce GTX 1070 Ti, compute capability 6.1, VMM: yes, VRAM: 8113 MiB
Jun 02 13:52:52 debian ollama[88170]:   Device 4: NVIDIA GeForce GTX 1070 Ti, compute capability 6.1, VMM: yes, VRAM: 8113 MiB
Jun 02 13:52:52 debian ollama[88170]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Jun 02 13:52:52 debian ollama[88170]: time=2026-06-02T13:52:52.473+03:00 level=DEBUG source=llama_server.go:52 msg="llama-server device discovery took" duration=3.246096903s libDirs="[/usr/local/lib/ollama /usr/local/lib/ollama/cuda_v12]"
Jun 02 13:52:52 debian ollama[88170]: time=2026-06-02T13:52:52.473+03:00 level=DEBUG source=runner.go:472 msg="bootstrap discovery took" duration=3.246358257s OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/cuda_v12]" extra_envs=map[]
Jun 02 13:52:52 debian ollama[88170]: time=2026-06-02T13:52:52.474+03:00 level=WARN source=cuda_compat.go:38 msg="NVIDIA driver too old" device="NVIDIA GeForce GTX 1070 Ti" compute=6.1 driver=535 required_driver="570 or newer"
Jun 02 13:52:52 debian ollama[88170]: time=2026-06-02T13:52:52.474+03:00 level=WARN source=cuda_compat.go:38 msg="NVIDIA driver too old" device="NVIDIA GeForce GTX 1070 Ti" compute=6.1 driver=535 required_driver="570 or newer"
Jun 02 13:52:52 debian ollama[88170]: time=2026-06-02T13:52:52.474+03:00 level=WARN source=cuda_compat.go:38 msg="NVIDIA driver too old" device="NVIDIA GeForce GTX 1070 Ti" compute=6.1 driver=535 required_driver="570 or newer"
Jun 02 13:52:52 debian ollama[88170]: time=2026-06-02T13:52:52.474+03:00 level=WARN source=cuda_compat.go:38 msg="NVIDIA driver too old" device="NVIDIA GeForce GTX 1070 Ti" compute=6.1 driver=535 required_driver="570 or newer"
Jun 02 13:52:52 debian ollama[88170]: time=2026-06-02T13:52:52.474+03:00 level=WARN source=cuda_compat.go:38 msg="NVIDIA driver too old" device="NVIDIA GeForce GTX 1070 Ti" compute=6.1 driver=535 required_driver="570 or newer"
Jun 02 13:52:52 debian ollama[88170]: time=2026-06-02T13:52:52.475+03:00 level=TRACE source=llama_server.go:71 msg="running llama-server for discovery" cmd=/usr/local/lib/ollama/llama-server libDirs="[/usr/local/lib/ollama /usr/local/lib/ollama/cuda_v13]"
Jun 02 13:52:52 debian ollama[88170]: 0.00.081.210 E ggml_cuda_init: failed to initialize CUDA: CUDA driver version is insufficient for CUDA runtime version
Jun 02 13:52:52 debian ollama[88170]: 0.00.082.821 I common_params_print_info: build 1 (19620004f) with GNU 11.2.1 for Linux x86_64
Jun 02 13:52:52 debian ollama[88170]: 0.00.082.825 I log_info: verbosity = 2147483647 (adjust with the `-lv N` CLI arg)
Jun 02 13:52:52 debian ollama[88170]: 0.00.082.906 I system_info: n_threads = 24 (n_threads_batch = 24) / 48 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | F16C = 1 | LLAMAFILE = 1 | REPACK = 1 | CUDA : ARCHS = 750,800,860,890,900,1000,1030,1100,1200,1210 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 |
Jun 02 13:52:52 debian ollama[88170]: time=2026-06-02T13:52:52.570+03:00 level=DEBUG source=llama_server.go:121 msg="llama-server discovery: stopped subprocess after collecting GPU info" exit=unknown libDirs="[/usr/local/lib/ollama /usr/local/lib/ollama/cuda_v13]"
Jun 02 13:52:52 debian ollama[88170]: 0.00.081.946 E ggml_cuda_init: failed to initialize CUDA: CUDA driver version is insufficient for CUDA runtime version
Jun 02 13:52:52 debian ollama[88170]: Available devices:
Jun 02 13:52:52 debian ollama[88170]: ggml_cuda_init: failed to initialize CUDA: CUDA driver version is insufficient for CUDA runtime version
Jun 02 13:52:52 debian ollama[88170]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v13/libggml-cuda.so
Jun 02 13:52:52 debian ollama[88170]: time=2026-06-02T13:52:52.886+03:00 level=DEBUG source=llama_server.go:52 msg="llama-server device discovery took" duration=411.332702ms libDirs="[/usr/local/lib/ollama /usr/local/lib/ollama/cuda_v13]"
Jun 02 13:52:52 debian ollama[88170]: time=2026-06-02T13:52:52.886+03:00 level=DEBUG source=runner.go:472 msg="bootstrap discovery took" duration=411.850154ms OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/cuda_v13]" extra_envs=map[]
Jun 02 13:52:52 debian ollama[88170]: time=2026-06-02T13:52:52.887+03:00 level=TRACE source=llama_server.go:71 msg="running llama-server for discovery" cmd=/usr/local/lib/ollama/llama-server libDirs="[/usr/local/lib/ollama /usr/local/lib/ollama/vulkan]"
Jun 02 13:52:53 debian ollama[88170]: 0.00.262.083 I common_params_print_info: build 1 (19620004f) with GNU 11.2.1 for Linux x86_64
Jun 02 13:52:53 debian ollama[88170]: 0.00.262.089 I log_info: verbosity = 2147483647 (adjust with the `-lv N` CLI arg)
Jun 02 13:52:53 debian ollama[88170]: 0.00.262.151 I system_info: n_threads = 24 (n_threads_batch = 24) / 48 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | F16C = 1 | LLAMAFILE = 1 | REPACK = 1 |
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.167+03:00 level=DEBUG source=llama_server.go:121 msg="llama-server discovery: stopped subprocess after collecting GPU info" exit=unknown libDirs="[/usr/local/lib/ollama /usr/local/lib/ollama/vulkan]"
Jun 02 13:52:53 debian ollama[88170]: Available devices:
Jun 02 13:52:53 debian ollama[88170]:   Vulkan0: NVIDIA GeForce GTX 1070 Ti (8438 MiB, 8153 MiB free)
Jun 02 13:52:53 debian ollama[88170]:   Vulkan1: NVIDIA GeForce GTX 1070 Ti (8438 MiB, 8335 MiB free)
Jun 02 13:52:53 debian ollama[88170]:   Vulkan2: NVIDIA GeForce GTX 1070 Ti (8438 MiB, 8340 MiB free)
Jun 02 13:52:53 debian ollama[88170]:   Vulkan3: NVIDIA GeForce GTX 1070 Ti (8438 MiB, 8341 MiB free)
Jun 02 13:52:53 debian ollama[88170]:   Vulkan4: NVIDIA GeForce GTX 1070 Ti (8438 MiB, 8340 MiB free)
Jun 02 13:52:53 debian ollama[88170]: ggml_vulkan: Found 5 Vulkan devices:
Jun 02 13:52:53 debian ollama[88170]: ggml_vulkan: 0 = NVIDIA GeForce GTX 1070 Ti (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
Jun 02 13:52:53 debian ollama[88170]: ggml_vulkan: 1 = NVIDIA GeForce GTX 1070 Ti (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
Jun 02 13:52:53 debian ollama[88170]: ggml_vulkan: 2 = NVIDIA GeForce GTX 1070 Ti (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
Jun 02 13:52:53 debian ollama[88170]: ggml_vulkan: 3 = NVIDIA GeForce GTX 1070 Ti (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
Jun 02 13:52:53 debian ollama[88170]: ggml_vulkan: 4 = NVIDIA GeForce GTX 1070 Ti (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
Jun 02 13:52:53 debian ollama[88170]: load_backend: loaded Vulkan backend from /usr/local/lib/ollama/vulkan/libggml-vulkan.so
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.784+03:00 level=DEBUG source=llama_server.go:52 msg="llama-server device discovery took" duration=897.811145ms libDirs="[/usr/local/lib/ollama /usr/local/lib/ollama/vulkan]"
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.784+03:00 level=DEBUG source=runner.go:472 msg="bootstrap discovery took" duration=897.991102ms OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/vulkan]" extra_envs=map[]
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.784+03:00 level=DEBUG source=runner.go:117 msg="evaluating which, if any, devices to filter out" initial_count=5
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.784+03:00 level=TRACE source=runner.go:167 msg="supported GPU library combinations before filtering" supported="map[Vulkan:map[/usr/local/lib/ollama/vulkan:map[0:0 1:1 2:2 3:3 4:4]]]"
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.784+03:00 level=DEBUG source=runner.go:186 msg="adjusting filtering IDs" FilterID=0 new_ID=0
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.784+03:00 level=DEBUG source=runner.go:186 msg="adjusting filtering IDs" FilterID=1 new_ID=1
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.784+03:00 level=DEBUG source=runner.go:186 msg="adjusting filtering IDs" FilterID=2 new_ID=2
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.784+03:00 level=DEBUG source=runner.go:186 msg="adjusting filtering IDs" FilterID=3 new_ID=3
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.784+03:00 level=DEBUG source=runner.go:186 msg="adjusting filtering IDs" FilterID=4 new_ID=4
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.784+03:00 level=DEBUG source=runner.go:37 msg="GPU bootstrap discovery took" duration=4.557781248s
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.784+03:00 level=INFO source=types.go:32 msg="inference compute" id=3 filter_id=3 library=Vulkan compute=0.0 name=Vulkan3 description="NVIDIA GeForce GTX 1070 Ti" libdirs=ollama,vulkan driver=0.0 pci_id=0000:83:00.0 type=discrete total="8.2 GiB" available="8.1 GiB"
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.785+03:00 level=INFO source=types.go:32 msg="inference compute" id=2 filter_id=2 library=Vulkan compute=0.0 name=Vulkan2 description="NVIDIA GeForce GTX 1070 Ti" libdirs=ollama,vulkan driver=0.0 pci_id=0000:04:00.0 type=discrete total="8.2 GiB" available="8.1 GiB"
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.785+03:00 level=INFO source=types.go:32 msg="inference compute" id=4 filter_id=4 library=Vulkan compute=0.0 name=Vulkan4 description="NVIDIA GeForce GTX 1070 Ti" libdirs=ollama,vulkan driver=0.0 pci_id=0000:84:00.0 type=discrete total="8.2 GiB" available="8.1 GiB"
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.785+03:00 level=INFO source=types.go:32 msg="inference compute" id=1 filter_id=1 library=Vulkan compute=0.0 name=Vulkan1 description="NVIDIA GeForce GTX 1070 Ti" libdirs=ollama,vulkan driver=0.0 pci_id=0000:03:00.0 type=discrete total="8.2 GiB" available="8.1 GiB"
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.785+03:00 level=INFO source=types.go:32 msg="inference compute" id=0 filter_id=0 library=Vulkan compute=0.0 name=Vulkan0 description="NVIDIA GeForce GTX 1070 Ti" libdirs=ollama,vulkan driver=0.0 pci_id=0000:82:00.0 type=discrete total="8.2 GiB" available="8.0 GiB"
Jun 02 13:52:53 debian ollama[88170]: time=2026-06-02T13:52:53.785+03:00 level=INFO source=routes.go:2031 msg="vram-based default context" total_vram="41.2 GiB" default_num_ctx=32768
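
In the discovery log above, the `cuda_v12` probe lists all five GPUs, but every card then gets an "NVIDIA driver too old" warning, the `cuda_v13` probe fails to initialize, and all five "inference compute" lines report `library=Vulkan`. A minimal Python sketch for pulling those two signals out of a saved log could look like this. The function name `summarize_discovery` and the abbreviated sample lines are illustrative assumptions, not part of Ollama, and the regular expressions only match the message format seen in this particular log.

```python
import re

# Patterns match the message format seen in this log; other Ollama versions may differ.
WARN_RE = re.compile(
    r'msg="NVIDIA driver too old".*?driver=(\d+) required_driver="(\d+) or newer"'
)
LIB_RE = re.compile(r'msg="inference compute".*?library=(\w+)')


def summarize_discovery(lines):
    """Return (set of (installed, required) driver pairs, list of selected libraries)."""
    warnings = set()
    libraries = []
    for line in lines:
        m = WARN_RE.search(line)
        if m:
            warnings.add((int(m.group(1)), int(m.group(2))))
        m = LIB_RE.search(line)
        if m:
            libraries.append(m.group(1))
    return warnings, libraries


# Sample lines shortened from the log above (timestamps elided).
sample = [
    'level=WARN source=cuda_compat.go:38 msg="NVIDIA driver too old" '
    'device="NVIDIA GeForce GTX 1070 Ti" compute=6.1 driver=535 '
    'required_driver="570 or newer"',
    'level=INFO source=types.go:32 msg="inference compute" id=3 filter_id=3 '
    'library=Vulkan compute=0.0 name=Vulkan3',
]

warnings, libraries = summarize_discovery(sample)
print("driver warnings (installed, required):", warnings)
print("libraries selected for inference:", libraries)
```

If the warnings set is non-empty and no selected library is `CUDA`, the server is probably not running on the CUDA backend, which matches the driver mismatch reported in this issue.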

Jun 02 09:32:10 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    100, tg =  32.66 t/s
Jun 02 09:32:13 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    178, tg =  29.30 t/s
Jun 02 09:32:16 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    229, tg =  25.05 t/s
Jun 02 09:32:19 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    257, tg =  21.13 t/s
Jun 02 09:32:22 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    280, tg =  18.42 t/s
Jun 02 09:32:25 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    302, tg =  16.59 t/s
Jun 02 09:32:28 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    321, tg =  15.09 t/s
Jun 02 09:32:31 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    340, tg =  14.00 t/s
Jun 02 09:32:34 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    359, tg =  13.09 t/s
Jun 02 09:32:37 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    377, tg =  12.34 t/s
Jun 02 09:32:41 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    395, tg =  11.73 t/s
Jun 02 09:32:44 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    412, tg =  11.22 t/s
Jun 02 09:32:47 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    428, tg =  10.73 t/s
Jun 02 09:32:50 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    442, tg =  10.29 t/s
Jun 02 09:32:53 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    456, tg =   9.90 t/s
Jun 02 09:32:56 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    470, tg =   9.56 t/s
Jun 02 09:32:59 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    484, tg =   9.25 t/s
Jun 02 09:33:02 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    498, tg =   8.97 t/s
Jun 02 09:33:06 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    512, tg =   8.74 t/s
Jun 02 09:33:09 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    526, tg =   8.52 t/s
Jun 02 09:33:12 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    540, tg =   8.33 t/s
Jun 02 09:33:15 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    554, tg =   8.15 t/s
Jun 02 09:33:18 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    568, tg =   7.99 t/s
Jun 02 09:33:21 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    582, tg =   7.84 t/s
Jun 02 09:33:24 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    596, tg =   7.71 t/s
Jun 02 09:33:27 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    610, tg =   7.58 t/s
Jun 02 09:33:31 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    624, tg =   7.46 t/s
Jun 02 09:33:34 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    638, tg =   7.35 t/s
Jun 02 09:33:37 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    652, tg =   7.25 t/s
Jun 02 09:33:40 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    666, tg =   7.16 t/s
Jun 02 09:33:43 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    680, tg =   7.07 t/s
Jun 02 09:33:46 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    694, tg =   6.98 t/s
Jun 02 09:33:50 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    708, tg =   6.90 t/s
Jun 02 09:33:53 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    722, tg =   6.83 t/s
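
The `tg` value in these lines appears to be a running average since the start of the task: 100 tokens at 32.66 t/s is about 3 s, and 722 tokens at 6.83 t/s is about 106 s, which is consistent with the 103 s between the first and last timestamps. The rate within each interval is therefore lower than the printed figure. The sketch below estimates it from the `n_decoded` deltas and the journal timestamps. The helper name `interval_rates` is hypothetical, the timestamps only have one-second resolution, and the code assumes the log does not cross midnight.

```python
import re

# Captures HH:MM:SS and n_decoded from a "slot print_timing" journal line.
TIMING_RE = re.compile(r"(\d{2}):(\d{2}):(\d{2}) .*?n_decoded =\s*(\d+), tg =")


def interval_rates(lines):
    """Return tokens/second between consecutive print_timing lines."""
    samples = []
    for line in lines:
        m = TIMING_RE.search(line)
        if m:
            h, mi, s, n = (int(g) for g in m.groups())
            samples.append((h * 3600 + mi * 60 + s, n))
    rates = []
    for (t0, n0), (t1, n1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt > 0:  # skip lines that share the same second
            rates.append((n1 - n0) / dt)
    return rates


# First two and last two timing lines from the log above.
head = [
    "Jun 02 09:32:10 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    100, tg =  32.66 t/s",
    "Jun 02 09:32:13 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    178, tg =  29.30 t/s",
]
tail = [
    "Jun 02 09:33:50 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    708, tg =   6.90 t/s",
    "Jun 02 09:33:53 debian ollama[16688]: slot print_timing: id  0 | task 1450 | n_decoded =    722, tg =   6.83 t/s",
]

print([round(r, 2) for r in interval_rates(head)])  # [26.0]
print([round(r, 2) for r in interval_rates(tail)])  # [4.67]
```

On these samples the interval rate falls from roughly 26 t/s to under 5 t/s, so the sustained generation speed is noticeably worse than the last printed `tg` of 6.83 t/s suggests.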

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.30.0

FAQ

Expected behavior

Ollama v0.30 should have similar performance to v0.24.0 on the same model and hardware, or at least should not be 5x slower.

