ollama - 💡(How to fix) Ollama fails to use a coding model that doesn't fit into Vulkan memory [4 comments, 2 participants]

ollama/ollama#15156 (fetched 2026-04-08 01:52:55)

What is the issue?

The system has the Vulkan device NVIDIA GeForce RTX 2060 with 6 GB VRAM.

claude is configured to use the model gpt-oss:20b. At 16-bit precision, 20B parameters require about 40 GB of memory (less when quantized), which is far more than the 6 GB of VRAM.
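The 40 GB figure corresponds to storing 20 billion weights at 16 bits each. A minimal sketch of that arithmetic follows; the function name and the bit widths are illustrative assumptions, and the estimate covers weights only (no KV cache or compute buffers).

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in decimal GB."""
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / 1e9


# 20 billion parameters at a few illustrative precisions.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(20e9, bits):.0f} GB")
```

Even at 4 bits per weight the estimate exceeds 6 GB, so some layers would have to run on the CPU regardless of quantization.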

Ollama serve log.

ollama serve loaded the model very slowly, for a reason that is not clear from the log.

Ollama should place as many layers as fit onto the Vulkan device and run the rest on the CPU, but this does not happen.
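The expected split can be estimated roughly as follows. This is a sketch under stated assumptions, not Ollama's scheduler logic: it treats all layers as equally sized, and the layer count and the 1.5 GB reserve for buffers are hypothetical values chosen for illustration.

```python
def layers_on_gpu(model_bytes: float, n_layers: int, vram_bytes: float,
                  reserve_bytes: float = 1.5e9) -> int:
    """Estimate how many equally sized layers fit in VRAM after a reserve."""
    per_layer = model_bytes / n_layers
    usable = max(vram_bytes - reserve_bytes, 0.0)
    return min(n_layers, int(usable // per_layer))


# Hypothetical numbers: a 40 GB model with 24 layers on a 6 GB card.
total_layers = 24
gpu_layers = layers_on_gpu(40e9, total_layers, 6e9)
print(f"GPU layers: {gpu_layers}, CPU layers: {total_layers - gpu_layers}")
```

With these numbers only a couple of layers land on the GPU, so most of the computation would still run on the CPU.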

The log shows that after more than 15 minutes computation still has not started, apparently because of the slow model loading and the resulting timeouts.

Relevant log output

See above.

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.19.0

extent analysis

Fix Plan

To resolve the issue, Ollama needs to run the model in hybrid mode, using both the Vulkan device and the CPU. This involves:

  • Setting the number of layers offloaded to the Vulkan device so that they fit into the 6 GB of VRAM
  • Running the remaining layers on the CPU

Step-by-Step Solution

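  1. Set the GPU layer count: The number of layers offloaded to the GPU is controlled by the num_gpu option, set with a PARAMETER line in a Modelfile or in the options of an API request (Ollama does not read an ollama.yaml file and has no vulkan_layers parameter). For example, with 10 as a starting value to tune for the available VRAM:
    PARAMETER num_gpu 10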
  2. Implement hybrid computation: Ollama normally performs this split itself, so no code change should be needed when offloading works. If the Vulkan backend does not split the model, the loader has to place the first layers on the Vulkan device and the rest on the CPU. In pseudocode, using the hypothetical helpers from the Example Code section (these are not Ollama functions):
    def load_model(model_name, vulkan_layers):
        # Load the first vulkan_layers layers onto the Vulkan device
        vulkan_model = load_vulkan_model(model_name, vulkan_layers)

        # Load the remaining layers onto the CPU
        cpu_model = load_cpu_model(model_name, vulkan_layers)

        # Chain both parts so activations flow from the device to the CPU
        return combine_models(vulkan_model, cpu_model)
  3. Verify the fix: Restart the Ollama server, load the model, and check the logs to confirm that layers are offloaded and that computation starts within a reasonable time. Running ollama ps shows how the loaded model is split between CPU and GPU.
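One way to request a specific GPU/CPU split without changing any server code is to pass the num_gpu option in an API request. The sketch below assumes a server on Ollama's default local port; the helper names, the value 10, and the 600-second client timeout are illustrative choices, not recommendations from the issue.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_request(model: str, prompt: str, num_gpu: int) -> urllib.request.Request:
    """Build a non-streaming generate request that sets the GPU layer count."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_gpu": num_gpu},  # layers to place on the GPU
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, num_gpu: int, timeout_s: float = 600.0) -> dict:
    """Send the request to a running Ollama server and return the parsed reply."""
    request = build_request(model, prompt, num_gpu)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read())
```

Calling generate("gpt-oss:20b", "Say hello.", num_gpu=10) against a running server returns the reply as a dictionary; lowering num_gpu is the first thing to try if the load still stalls.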

Example Code

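import torch
from torch import nn

# Conceptual PyTorch sketch of a layer split. Ollama itself is written in Go and
# does not use PyTorch. Assumes the checkpoint is a pickled nn.Sequential.
DEVICE = "cuda:0"  # desktop PyTorch reaches the RTX 2060 through CUDA, not Vulkan

def load_vulkan_model(model_name, vulkan_layers):
    # Load on the CPU first, then move only the first layers to the device,
    # so the full model never has to fit into VRAM
    model = torch.load(model_name, map_location="cpu", weights_only=False)
    return model[:vulkan_layers].to(DEVICE)

def load_cpu_model(model_name, vulkan_layers):
    # Load the remaining layers onto the CPU
    model = torch.load(model_name, map_location="cpu", weights_only=False)
    return model[vulkan_layers:]

class ToCPU(nn.Module):
    # Moves activations from the device to the CPU between the two halves
    def forward(self, x):
        return x.to("cpu")

def combine_models(vulkan_model, cpu_model):
    # Inputs must already be on DEVICE; outputs are produced on the CPU
    return nn.Sequential(vulkan_model, ToCPU(), cpu_model)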
