ollama: Ollama underutilizes available GPU VRAM, causing out-of-memory errors [8 comments, 2 participants]

GitHub stats
ollama/ollama#14632 (fetched 2026-04-08 00:33:33)
Comments: 8
Participants: 2
Timeline events: 9 (commented ×8, labeled ×1)
Reactions: 0


What is the issue?

I'm trying to run multiple models side by side, but Ollama improperly puts both models on the same GPUs even when others are available. Tinyllama could easily fit in full on GPU0, but Ollama doesn't put it there.
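
The claim can be checked with simple arithmetic on the nvidia-smi figures included in this report. The sketch below hard-codes the reported used and total values (in MiB) and the 985 MB size that `ollama ps` lists for tinyllama. It ignores context and compute buffers, so treat the model size as a rough lower bound on what tinyllama actually needs.

```python
# Used and total VRAM per GPU in MiB, copied from the nvidia-smi output in this report.
REPORTED = {0: (264, 24576), 1: (23528, 24576), 2: (24042, 24564)}

# `ollama ps` lists tinyllama at 985 MB; convert decimal MB to MiB.
TINYLLAMA_MIB = 985 * 10**6 / 2**20


def free_mib(reported):
    """Return free VRAM per GPU index in MiB."""
    return {gpu: total - used for gpu, (used, total) in reported.items()}


free = free_mib(REPORTED)
emptiest = max(free, key=free.get)
print(free)
print(f"GPU {emptiest} has {free[emptiest] / TINYLLAMA_MIB:.0f}x the space tinyllama needs")
```

GPU 0 has by far the most free memory, while GPUs 1 and 2 have roughly 1 GiB and 0.5 GiB left.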

environment:
  - OLLAMA_KEEP_ALIVE="30m"
  - OLLAMA_FLASH_ATTENTION=true
  - OLLAMA_LOAD_TIMEOUT="15m"
  - OLLAMA_CONTEXT_LENGTH=80000
  - OLLAMA_NUM_PARALLEL=2

Relevant log output

ollama-gpu-1  | time=2026-03-05T03:53:42.460Z level=INFO source=sched.go:565 msg="loaded runners" count=2
ollama-gpu-1  | time=2026-03-05T03:53:42.460Z level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
ollama-gpu-1  | time=2026-03-05T03:53:42.460Z level=INFO source=server.go:1388 msg="llama runner started in 2.50 seconds"
ollama-gpu-1  | CUDA error: out of memory
ollama-gpu-1  |   current device: 0, in function ggml_backend_cuda_synchronize at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:2981
...
ollama-gpu-1  | time=2026-03-05T03:53:42.739Z level=ERROR source=server.go:1610 msg="post predict" error="Post \"http://127.0.0.1:33587/completion\": EOF"


$ nvidia-smi 
Thu Mar  5 03:53:59 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:0C:00.0 Off |                  N/A |
| 30%   47C    P8             24W /  350W |     264MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  |   00000000:0D:00.0 Off |                  N/A |
| 30%   41C    P8             21W /  350W |   23528MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        On  |   00000000:0E:00.0 Off |                  Off |
|  0%   44C    P8             39W /  480W |   24042MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            8444      C   /usr/bin/ollama                         254MiB |
|    1   N/A  N/A            8444      C   /usr/bin/ollama                       23518MiB |
|    2   N/A  N/A            8444      C   /usr/bin/ollama                       24032MiB |
+-----------------------------------------------------------------------------------------+
$ docker exec ollama-ollama-gpu-1 ollama ps
NAME                  ID              SIZE      PROCESSOR    CONTEXT    UNTIL               
tinyllama:latest      2644915ede35    985 MB    100% GPU     2048       29 minutes from now    
glm-4.7-flash:q8_0    a035bf4bc812    49 GB     100% GPU     80000      29 minutes from now

OS

Docker

GPU

Nvidia

CPU

AMD

Ollama version

0.17.6

extent analysis

Fix Plan

The report shows GPUs 1 and 2 nearly full (about 1 GiB and 0.5 GiB free) while GPU 0 sits almost empty (264 MiB of 24576 MiB used). glm-4.7-flash:q8_0 is listed at 49 GB, so it cannot fit on a single 24 GB card and has to be split across GPUs. The CUDA out-of-memory error appears right after the log reports two loaded runners. The steps below either shrink the memory demand or keep the two models on separate GPUs.

  1. Environment Variable Adjustment: OLLAMA_NUM_PARALLEL sets how many requests each model serves in parallel; it is unrelated to the number of GPUs. Per the Ollama FAQ, required memory scales with OLLAMA_NUM_PARALLEL × OLLAMA_CONTEXT_LENGTH, so lowering either value (here 2 and 80000) reduces the memory the large model reserves.
  2. GPU Allocation: CUDA_VISIBLE_DEVICES limits which GPUs an Ollama server process can see. It applies to the whole server, not to individual models, so pinning models to specific GPUs requires a separate Ollama instance (for example, one container) per GPU group.
  3. Model Size and GPU Capacity: Compare each model's size, including its context, against the free VRAM of the GPUs it will land on. If the large model leaves no headroom, consider a smaller quantization or a shorter context.
  4. Ollama Configuration: Review the scheduler-related settings, such as OLLAMA_MAX_LOADED_MODELS, OLLAMA_SCHED_SPREAD, and OLLAMA_GPU_OVERHEAD, and check their meaning in the Ollama documentation for your version before changing them.
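
One setting in the reported environment that directly affects memory is the pairing of OLLAMA_CONTEXT_LENGTH=80000 with OLLAMA_NUM_PARALLEL=2: per the Ollama FAQ, required memory scales with the product of the two. The sketch below uses the textbook KV-cache formula for standard attention to illustrate the effect. The layer count, KV-head count, and head size are hypothetical placeholders, not the real glm-4.7-flash architecture, and Ollama's actual allocation may differ.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_length,
                   num_parallel=1, bytes_per_element=2):
    """Estimate KV-cache size for standard attention.

    Keys and values (factor 2) are stored per layer, per KV head, per token.
    Each parallel slot is assumed to get its own full context window.
    """
    tokens = context_length * num_parallel
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element * tokens


# Hypothetical architecture numbers, for illustration only.
ARCH = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for parallel in (1, 2):
    size = kv_cache_bytes(context_length=80000, num_parallel=parallel, **ARCH)
    print(f"num_parallel={parallel}: {size / 2**30:.1f} GiB")
```

With these placeholder numbers the cache doubles from about 9.8 GiB to about 19.5 GiB when going from one parallel slot to two, which is the kind of difference that decides whether a second model still fits.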

Example Code Snippets

To set CUDA_VISIBLE_DEVICES for a Docker container, you can use the following command:

docker run -d --gpus all --env CUDA_VISIBLE_DEVICES=0,1,2 ollama/ollama:0.17.6

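This exposes all three GPUs (0, 1, and 2) to the container. In the reported setup Ollama already uses all three, so this alone does not change where models are placed. To restrict an instance to a subset, list only those indices (for example CUDA_VISIBLE_DEVICES=0).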

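CUDA_VISIBLE_DEVICES is read when the Ollama server starts, so changing it in a client script has no effect on where a model is loaded. One workaround is to run one server process per GPU group, each on its own port. A minimal sketch, assuming the ollama binary is on PATH:

import os
import subprocess

def serve(gpus, port):
    """Start one `ollama serve` process restricted to the given GPU indices."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpus, OLLAMA_HOST=f"127.0.0.1:{port}")
    return subprocess.Popen(["ollama", "serve"], env=env)

# The variable must be set per server process, before it starts;
# the port numbers and GPU assignments below are illustrative.
small = serve("0", 11434)    # e.g. tinyllama on GPU 0
large = serve("1,2", 11435)  # e.g. glm-4.7-flash:q8_0 on GPUs 1 and 2

Note: With this layout, clients must send each request to the port of the instance that hosts the model they want. In Docker, the equivalent is one container per GPU group, each with its own CUDA_VISIBLE_DEVICES and published port.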
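
If each model is served by its own Ollama instance (an assumption, for example one container per GPU group, each started with a different CUDA_VISIBLE_DEVICES and port), the client has to route requests by model. The sketch below builds a non-streaming request for Ollama's /api/generate endpoint; the port numbers and the SERVERS mapping are hypothetical.

```python
import json
import urllib.request

# Hypothetical mapping: which local Ollama server hosts which model.
SERVERS = {
    "tinyllama:latest": "http://127.0.0.1:11434",
    "glm-4.7-flash:q8_0": "http://127.0.0.1:11435",
}


def build_request(model, prompt):
    """Build a non-streaming /api/generate request for the server hosting `model`."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        SERVERS[model] + "/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=900):
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.load(resp)["response"]
```

Keeping the mapping in one place means a model can be moved to another GPU group by changing a single entry.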

Verification

After applying these changes, verify that the models are correctly distributed across the GPUs by:

  • Checking nvidia-smi: no GPU should sit within a few hundred MiB of its limit while another is nearly empty, as GPUs 1 and 2 versus GPU 0 do in the report.
  • Running ollama ps and confirming that both models are still listed as 100% GPU.
  • Watching the Ollama logs for "CUDA error: out of memory" while sending requests to both models.
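
The nvidia-smi check can be scripted. The sketch below parses the CSV output of `nvidia-smi --query-gpu` and flags GPUs that are close to full; the 95% threshold is an arbitrary assumption, and `check()` needs nvidia-smi on PATH.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' lines (MiB) into {index: (used, total)}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        usage[index] = (used, total)
    return usage


def nearly_full(usage, threshold=0.95):
    """Return the GPU indices whose used/total ratio exceeds the threshold."""
    return [gpu for gpu, (used, total) in usage.items() if used / total > threshold]


def check():
    """Query the local GPUs and return the indices that are nearly full."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return nearly_full(parse_gpu_memory(out))
```

Fed the figures from this report, `nearly_full` flags GPUs 1 and 2 and leaves GPU 0 out, which matches the imbalance described in the issue.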

Extra Tips

  • Regularly update Ollama and its dependencies to ensure you have the latest fixes and features.
  • Consider implementing a monitoring system to track GPU utilization and model performance in real-time.
  • If models are frequently updated or changed, automate the process of adjusting GPU allocation and model loading scripts to minimize manual intervention.
