ollama - 💡(How to fix) Fix GPU used with ollama run, but /v1 API forces CPU fallback (same model) [14 comments, 5 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
ollama/ollama#15016Fetched 2026-04-08 01:17:15
View on GitHub
Comments
14
Participants
5
Timeline
16
Reactions
0
Author
Timeline (top)
commented ×14closed ×1labeled ×1
RAW_BUFFERClick to expand / collapse

What is the issue?

Title: GPU works with ollama run, but falls back to CPU when using OpenAI /v1 API (same model)

Description:

When running a model manually using:

ollama run qwen3.5:27b

the model runs correctly on the GPU (100% GPU usage).

However, when using the same model via the OpenAI-compatible /v1 API endpoint, the model falls back to CPU or CPU/GPU hybrid usage.

This results in extremely high CPU load (up to 100%) and poor performance.

Expected behavior:

GPU usage should be consistent across CLI and API usage No silent fallback to CPU when GPU is available Same model + same system should produce the same execution behavior

Hypothesis:

The issue may be related to a race condition during model initialization.

When using the OpenAI-compatible /v1 API, requests can arrive while the model is still loading into GPU memory. Instead of waiting for the model to finish initializing, Ollama appears to start a separate execution path that falls back to CPU.

This results in:

GPU initialization happening in parallel Incoming API requests being handled prematurely A second (CPU-based) execution being triggered

As a consequence, the system ends up with:

GPU load from model initialization CPU load from premature inference handling

This suggests that the server does not properly block or queue incoming requests until the model is fully loaded and ready for GPU execution.

A possible fix would be:

Enforcing a strict “model ready” state before handling inference requests Queuing or delaying incoming /v1 requests until GPU initialization is complete

Actual behavior:

CLI (ollama run) → 100% GPU /v1 API → CPU or mixed CPU/GPU CPU spikes to 100% GPU usage drops significantly or is not used

Example:

ollama ps qwen3.5:27b 25 GB 12%/88% CPU/GPU

or:

0% GPU / 100% CPU

System:

OS: Windows 11 GPU: (e.g. RTX 3090 24GB) Ollama version: (run ollama --version)

Configuration:

Environment variables:

OLLAMA_NUM_THREADS=6 OLLAMA_NUM_PARALLEL=1 OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_KEEP_ALIVE=-1

Client configuration:

"contextWindow": 131072 "maxTokens": 8192

Important observations:

Issue occurs only when using /v1 API Native API (/api/generate) works correctly CLI works correctly GPU is functional and properly used outside /v1

Hypothesis:

The issue appears to be related to the OpenAI compatibility layer (/v1):

Different handling of context size or memory allocation Early context allocation during model load Possible fallback triggered before GPU execution stabilizes

Steps to reproduce:

Start server: ollama serve Run model via CLI: ollama run qwen3.5:27b

→ GPU works correctly

Run via API: POST /v1/responses

→ CPU usage spikes

Impact:

High CPU usage Severe performance degradation System instability under load Inconsistent behavior depending on API path

Relevant log output

OS

No response

GPU

No response

CPU

No response

Ollama version

No response

extent analysis

Fix Plan

To address the issue of GPU fallback to CPU when using the OpenAI-compatible /v1 API, we need to enforce a strict "model ready" state before handling inference requests. Here are the steps:

  • Modify the OpenAI compatibility layer to wait for the model to finish initializing before handling incoming requests.
  • Implement a request queue to delay incoming /v1 requests until GPU initialization is complete.

Example code snippet in Python:

import queue
import threading

# Create a queue to hold incoming requests
request_queue = queue.Queue()

# Flag to indicate when the model is ready
model_ready = False

# Lock to synchronize access to the model_ready flag
lock = threading.Lock()

def initialize_model():
    global model_ready
    # Initialize the model and load it into GPU memory
    # ...
    with lock:
        model_ready = True

def handle_request(request):
    # Check if the model is ready
    with lock:
        if not model_ready:
            # If not, put the request back into the queue and wait
            request_queue.put(request)
            return
    # If the model is ready, process the request
    # ...

def worker():
    while True:
        request = request_queue.get()
        handle_request(request)
        request_queue.task_done()

# Start the model initialization thread
init_thread = threading.Thread(target=initialize_model)
init_thread.start()

# Start the worker thread to handle requests
worker_thread = threading.Thread(target=worker)
worker_thread.start()

# Main thread to handle incoming requests
while True:
    request = # get incoming request
    request_queue.put(request)

Verification

To verify that the fix worked, you can:

  • Monitor the GPU usage and CPU load while running the model via the /v1 API.
  • Check the system logs for any errors or warnings related to GPU initialization or request handling.
  • Test the model with different input sizes and types to ensure that it works correctly and consistently.

Extra Tips

To prevent similar issues in the future, consider:

  • Implementing a more robust request handling mechanism that can handle concurrent requests and model initialization.
  • Adding more logging and monitoring to detect and diagnose issues related to GPU initialization and request handling.
  • Testing the model with different GPU configurations and drivers to ensure that it works correctly and consistently.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING