ollama - 💡(How to fix) Fix 0.17.7 sometimes can't get 5090 GPU [19 comments, 2 participants]

GitHub stats
ollama/ollama#14792 · Fetched 2026-04-08 00:31:40
Comments: 19
Participants: 2
Timeline: 21
Reactions: 0
Timeline (top): commented ×19, closed ×1, labeled ×1

Error Message

time=2026-03-12T07:13:58.531Z level=WARN source=runner.go:485 msg="user overrode visible devices" CUDA_VISIBLE_DEVICES=7
time=2026-03-12T07:13:58.531Z level=WARN source=runner.go:489 msg="if GPUs are not correctly discovered, unset and try again"

Code Example

ubuntu 24.04
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Tue_May_27_02:21:03_PDT_2025
Cuda compilation tools, release 12.9, V12.9.86
Build cuda_12.9.r12.9/compiler.36037853_0


(main) root@C.32098470:/workspace$ CUDA_VISIBLE_DEVICES=7 OLLAMA_HOST=0.0.0.0 ollama serve
time=2026-03-12T07:13:58.528Z level=INFO source=routes.go:1658 msg="server config" env="map[CUDA_VISIBLE_DEVICES:7 GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:0 OLLAMA_DEBUG:INFO OLLAMA_EDITOR: OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NO_CLOUD:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false OLLAMA_VULKAN:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2026-03-12T07:13:58.528Z level=INFO source=routes.go:1660 msg="Ollama cloud disabled: false"
time=2026-03-12T07:13:58.529Z level=INFO source=images.go:477 msg="total blobs: 24"
time=2026-03-12T07:13:58.530Z level=INFO source=images.go:484 msg="total unused blobs removed: 0"
time=2026-03-12T07:13:58.530Z level=INFO source=routes.go:1713 msg="Listening on [::]:11434 (version 0.17.7)"
time=2026-03-12T07:13:58.531Z level=INFO source=runner.go:67 msg="discovering available GPUs..."
time=2026-03-12T07:13:58.531Z level=WARN source=runner.go:485 msg="user overrode visible devices" CUDA_VISIBLE_DEVICES=7
time=2026-03-12T07:13:58.531Z level=WARN source=runner.go:489 msg="if GPUs are not correctly discovered, unset and try again"
time=2026-03-12T07:13:58.532Z level=INFO source=server.go:430 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 38551"
time=2026-03-12T07:14:02.659Z level=INFO source=server.go:430 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 35241"
time=2026-03-12T07:14:02.916Z level=INFO source=runner.go:106 msg="experimental Vulkan support disabled.  To enable, set OLLAMA_VULKAN=1"
time=2026-03-12T07:14:02.916Z level=INFO source=types.go:60 msg="inference compute" id=cpu library=cpu compute="" name=cpu description=cpu libdirs=ollama driver="" pci_id="" type="" total="483.4 GiB" available="34.5 GiB"
time=2026-03-12T07:14:02.916Z level=INFO source=routes.go:1763 msg="vram-based default context" total_vram="0 B" default_num_ctx=4096
time=2026-03-12T07:14:23.056Z level=INFO source=server.go:246 msg="enabling flash attention"
time=2026-03-12T07:14:23.056Z level=INFO source=server.go:430 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-eebf93fe1af74695f4768535489d2af75c862a6bad443fa4b16e1f5a96d04394 --port 39191"
time=2026-03-12T07:14:23.057Z level=INFO source=sched.go:489 msg="system memory" total="483.4 GiB" free="34.6 GiB" free_swap="0 B"
time=2026-03-12T07:14:23.057Z level=INFO source=server.go:757 msg="loading model" "model layers"=33 requested=-1
time=2026-03-12T07:14:23.075Z level=INFO source=runner.go:1429 msg="starting ollama engine"
time=2026-03-12T07:14:23.075Z level=INFO source=runner.go:1464 msg="Server listening on 127.0.0.1:39191"
time=2026-03-12T07:14:23.080Z level=INFO source=runner.go:1302 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:122 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-03-12T07:14:23.139Z level=INFO source=ggml.go:136 msg="" architecture=qwen35 file_type=Q4_K_M name="" description="" num_tensors=883 num_key_values=52
load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
time=2026-03-12T07:14:23.145Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
time=2026-03-12T07:14:23.625Z level=INFO source=runner.go:1302 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:122 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-03-12T07:14:24.820Z level=INFO source=runner.go:1302 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:122 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-03-12T07:14:24.820Z level=INFO source=ggml.go:482 msg="offloading 0 repeating layers to GPU"
time=2026-03-12T07:14:24.820Z level=INFO source=ggml.go:486 msg="offloading output layer to CPU"
time=2026-03-12T07:14:24.820Z level=INFO source=ggml.go:494 msg="offloaded 0/33 layers to GPU"
time=2026-03-12T07:14:24.820Z level=INFO source=device.go:245 msg="model weights" device=CPU size="6.1 GiB"
time=2026-03-12T07:14:24.820Z level=INFO source=device.go:256 msg="kv cache" device=CPU size="1.4 GiB"
time=2026-03-12T07:14:24.820Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="433.7 MiB"
time=2026-03-12T07:14:24.820Z level=INFO source=device.go:272 msg="total memory" size="7.9 GiB"
time=2026-03-12T07:14:24.820Z level=INFO source=sched.go:565 msg="loaded runners" count=1
time=2026-03-12T07:14:24.821Z level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
time=2026-03-12T07:14:24.821Z level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model"
time=2026-03-12T07:14:25.827Z level=INFO source=server.go:1388 msg="llama runner started in 2.77 seconds"

What is the issue?

Ollama 0.17.7 sometimes fails to detect the 5090 GPU. Sometimes it works, sometimes the GPU is not found at startup, and sometimes the GPU is lost while the server is running.

OS

No response

GPU

No response

CPU

No response

Ollama version

No response

extent analysis

Fix Plan

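GPU discovery failed in this run: the log reports only library=cpu under "inference compute", total_vram="0 B", and "offloaded 0/33 layers to GPU", so the model ran entirely on the CPU. The server also warns that CUDA_VISIBLE_DEVICES was overridden. Try the following steps:

  • Unset CUDA_VISIBLE_DEVICES: Unset the variable and let the Ollama server discover the available GPUs on its own, as the warning in the log suggests.
  • Check GPU availability: Run nvidia-smi and confirm that the driver sees the GPU. Because the server was started with CUDA_VISIBLE_DEVICES=7, also confirm that a device with index 7 is actually listed; if the environment exposes fewer than eight GPUs, that index matches no device.
  • Select the GPU by UUID: If you need to pin Ollama to one GPU, keep using CUDA_VISIBLE_DEVICES. OLLAMA_GPU_DEVICE_ORDINAL does not appear among the variables in the server config dump in the log, so do not rely on it. Ollama's GPU documentation notes that numeric ordering may vary and that UUIDs are more reliable; nvidia-smi -L prints each GPU's UUID.

Example code to unset CUDA_VISIBLE_DEVICES:

unset CUDA_VISIBLE_DEVICES

Example code to check GPU availability:

nvidia-smi

Example code to list GPU UUIDs and select one (replace the placeholder with a UUID from your own output):

nvidia-smi -L
export CUDA_VISIBLE_DEVICES=GPU-<uuid>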
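
Checking a CUDA_VISIBLE_DEVICES value against the GPUs that nvidia-smi reports can be automated. The following is a minimal Python sketch, not part of Ollama: the function names parse_gpu_list and check_visible_devices are hypothetical, and the sample UUID is made up. It assumes the usual `nvidia-smi -L` line format ("GPU 0: <name> (UUID: GPU-...)") and only accepts full UUIDs. Because CUDA may order devices differently from nvidia-smi, a passing index check only shows that the index is in range, not that it points at the GPU you expect.

```python
import re

# Matches lines such as:
#   GPU 0: NVIDIA GeForce RTX 5090 (UUID: GPU-1a2b3c4d-...)
GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: (GPU-[0-9a-fA-F-]+)\)\s*$")


def parse_gpu_list(listing: str) -> dict[int, tuple[str, str]]:
    """Map GPU index -> (name, uuid) from the text printed by `nvidia-smi -L`."""
    gpus = {}
    for line in listing.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            gpus[int(match.group(1))] = (match.group(2), match.group(3))
    return gpus


def check_visible_devices(value: str, listing: str) -> list[str]:
    """Return one message per entry of `value` that matches no listed GPU."""
    gpus = parse_gpu_list(listing)
    uuids = {uuid for _, uuid in gpus.values()}
    problems = []
    for entry in (part.strip() for part in value.split(",")):
        if not entry:
            continue
        if entry.isdigit():
            # Numeric entry: it must at least be an index that exists.
            if int(entry) not in gpus:
                problems.append(
                    f"index {entry} not listed (available: {sorted(gpus)})"
                )
        elif entry not in uuids:
            # Anything else is treated as a full GPU UUID.
            problems.append(f"UUID {entry} not listed")
    return problems


if __name__ == "__main__":
    # Hypothetical listing for a machine that exposes a single GPU.
    sample = (
        "GPU 0: NVIDIA GeForce RTX 5090 "
        "(UUID: GPU-11111111-2222-3333-4444-555555555555)\n"
    )
    for problem in check_visible_devices("7", sample):
        print(problem)
```

In practice the listing would come from running `nvidia-smi -L` in the same shell or container that starts `ollama serve`, since that is the environment whose device list matters.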

Verification

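To verify that the fix worked, check the Ollama server log for the GPU discovery result. In the failing run, the "inference compute" line lists only library=cpu and the load ends with "offloaded 0/33 layers to GPU"; after a successful discovery, the "inference compute" line should name the GPU and the offload count should be greater than zero. You can also use nvidia-smi to check whether the Ollama server is using the GPU.

Example command to check Ollama server logs (assuming the server output was saved to ollama.log):

grep -E "inference compute|offloaded" ollama.log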

Example command to check GPU utilization:

nvidia-smi --query-gpu=utilization.gpu --format=csv
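
Because the failure is intermittent, it may help to check saved logs automatically rather than by eye. The sketch below is a hypothetical helper (summarize_log is not an Ollama API) that looks only at the two message formats visible in the log from this issue: the "inference compute" line and the "offloaded N/M layers to GPU" line. It treats any library value other than cpu as a discovered GPU, which is an assumption about how successful runs are logged.

```python
import re

# Patterns taken from the log lines quoted in this issue.
COMPUTE = re.compile(r'msg="inference compute".*?\blibrary=(\S+)')
OFFLOAD = re.compile(r'msg="offloaded (\d+)/(\d+) layers to GPU"')


def summarize_log(text: str) -> dict:
    """Summarize GPU discovery and layer offload from Ollama server log text."""
    libraries = []
    offloaded = None
    for line in text.splitlines():
        compute = COMPUTE.search(line)
        if compute:
            libraries.append(compute.group(1))
        offload = OFFLOAD.search(line)
        if offload:
            # Keep the most recent offload report in the log.
            offloaded = (int(offload.group(1)), int(offload.group(2)))
    return {
        "libraries": libraries,
        "offloaded": offloaded,
        # Assumption: a GPU run reports a library other than "cpu".
        "gpu_found": any(lib != "cpu" for lib in libraries),
    }


if __name__ == "__main__":
    failing_run = "\n".join([
        'time=2026-03-12T07:14:02.916Z level=INFO source=types.go:60 '
        'msg="inference compute" id=cpu library=cpu compute="" name=cpu',
        'time=2026-03-12T07:14:24.820Z level=INFO source=ggml.go:494 '
        'msg="offloaded 0/33 layers to GPU"',
    ])
    print(summarize_log(failing_run))
```

Running this over the log of each server start would show how often discovery falls back to the CPU, which is useful evidence to attach to the issue.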

Extra Tips

  • Ensure that the GPU drivers are up-to-date and compatible with the Ollama server version.
  • Check the Ollama server documentation for any specific configuration requirements for GPU utilization.
  • If you are still experiencing issues, set the OLLAMA_DEBUG environment variable to 1 before starting the server to enable more detailed logging (the server config in the log shows it currently at the INFO level).
