ollama - 💡(How to fix) Fix GPU used with ollama run, but /v1 API forces CPU fallback (same model) [14 comments, 5 participants]

What is the issue?

Title: GPU works with ollama run, but falls back to CPU when using OpenAI /v1 API (same model)

Description:

When running a model manually using:

ollama run qwen3.5:27b

the model runs correctly on the GPU (100% GPU usage).

However, when using the same model via the OpenAI-compatible /v1 API endpoint, the model falls back to CPU or CPU/GPU hybrid usage.

This results in extremely high CPU load (up to 100%) and poor performance.

Expected behavior:

GPU usage should be consistent across CLI and API usage
No silent fallback to CPU when GPU is available
Same model + same system should produce the same execution behavior

Hypothesis:

The issue may be related to a race condition during model initialization.

When using the OpenAI-compatible /v1 API, requests can arrive while the model is still loading into GPU memory. Instead of waiting for the model to finish initializing, Ollama appears to start a separate execution path that falls back to CPU.

This results in:

GPU initialization happening in parallel
Incoming API requests being handled prematurely
A second (CPU-based) execution being triggered

As a consequence, the system ends up with:

GPU load from model initialization
CPU load from premature inference handling

This suggests that the server does not properly block or queue incoming requests until the model is fully loaded and ready for GPU execution.

A possible fix would be:

Enforcing a strict "model ready" state before handling inference requests
Queuing or delaying incoming /v1 requests until GPU initialization is complete

Actual behavior:

CLI (ollama run) → 100% GPU
/v1 API → CPU or mixed CPU/GPU
CPU spikes to 100%
GPU usage drops significantly or is not used

Example:

ollama ps
qwen3.5:27b    25 GB    12%/88% CPU/GPU

or:

0% GPU / 100% CPU
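The CPU/GPU split reported by ollama ps can be checked programmatically when comparing a CLI run with an API run. The following is a minimal sketch, assuming the value takes either the split form "12%/88% CPU/GPU" quoted above or a single-device form such as "100% GPU"; parse_processor and is_fully_on_gpu are hypothetical helper names, not part of Ollama.

```python
import re

def parse_processor(text):
    """Return (cpu_percent, gpu_percent) from an `ollama ps` processor value.

    Assumes only two shapes: "12%/88% CPU/GPU" (split across devices)
    and "100% GPU" or "100% CPU" (single device).
    """
    text = text.strip()
    split = re.fullmatch(r"(\d+)%/(\d+)%\s+CPU/GPU", text)
    if split:
        return int(split.group(1)), int(split.group(2))
    single = re.fullmatch(r"(\d+)%\s+(CPU|GPU)", text)
    if single:
        pct = int(single.group(1))
        return (pct, 0) if single.group(2) == "CPU" else (0, pct)
    raise ValueError(f"unrecognized processor value: {text!r}")

def is_fully_on_gpu(text):
    """True only when the whole model is reported as resident on the GPU."""
    cpu, gpu = parse_processor(text)
    return cpu == 0 and gpu == 100
```

Running is_fully_on_gpu on the value captured after each request path makes the regression easy to spot in a script rather than by eye.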

System:

OS: Windows 11
GPU: (e.g. RTX 3090 24GB)
Ollama version: (run ollama --version)

Configuration:

Environment variables:

OLLAMA_NUM_THREADS=6
OLLAMA_NUM_PARALLEL=1
OLLAMA_MAX_LOADED_MODELS=1
OLLAMA_KEEP_ALIVE=-1

Client configuration:

"contextWindow": 131072
"maxTokens": 8192

Important observations:

Issue occurs only when using /v1 API
Native API (/api/generate) works correctly
CLI works correctly
GPU is functional and properly used outside /v1

Hypothesis:

The issue appears to be related to the OpenAI compatibility layer (/v1):

Different handling of context size or memory allocation
Early context allocation during model load
Possible fallback triggered before GPU execution stabilizes
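The context-size point can be made concrete with back-of-the-envelope arithmetic. The sketch below uses the usual key/value cache size estimate for a transformer. The layer count, KV head count, and head dimension are hypothetical placeholders, not the real qwen3.5:27b architecture, and the estimate ignores cache quantization and any model-specific layout.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Estimate KV cache size: one key and one value vector
    per layer, per KV head, per token in the context."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

GIB = 2**30

# Hypothetical architecture numbers, chosen only for illustration
small = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, context_len=4096)
large = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, context_len=131072)

print(f"4096 tokens:   {small / GIB:.1f} GiB")
print(f"131072 tokens: {large / GIB:.1f} GiB")
```

Under these assumed numbers, a 131072-token context would need about 32 GiB for the cache alone, more than a 24 GB card can hold, so part of the model would have to run on the CPU. If the client's contextWindow setting reaches the server through /v1 while the CLI uses a smaller context, that could produce a different CPU/GPU split. This is a possibility worth checking, not a confirmed cause.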

Steps to reproduce:

Start server: ollama serve
Run model via CLI: ollama run qwen3.5:27b

→ GPU works correctly

Run via API: POST /v1/responses

→ CPU usage spikes
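The API half of the reproduction can be scripted so both paths receive the same prompt. This is a minimal sketch, assuming the server listens on Ollama's default address http://localhost:11434 and that /v1/responses accepts a body with "model" and "input" as in the OpenAI Responses API; build_requests and post_json are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"  # assumed default Ollama address

def build_requests(model, prompt):
    """Build one native and one OpenAI-compatible payload for the same prompt."""
    return {
        # Native API: non-streaming generate call
        "/api/generate": {"model": model, "prompt": prompt, "stream": False},
        # OpenAI-compatible path from the report (body shape is an assumption)
        "/v1/responses": {"model": model, "input": prompt},
    }

def post_json(path, payload, timeout=600):
    """POST a JSON payload to the server and return the decoded JSON reply."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())
```

Calling post_json once per entry of build_requests("qwen3.5:27b", "Hello") and running ollama ps after each call shows whether the CPU/GPU split changes with the API path.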

Impact:

High CPU usage
Severe performance degradation
System instability under load
Inconsistent behavior depending on API path

extent analysis

Fix Plan

If the race-condition hypothesis above holds, addressing the CPU fallback on the OpenAI-compatible /v1 API means enforcing a strict "model ready" state before any inference request is handled. The steps would be:

Modify the OpenAI compatibility layer to wait for the model to finish initializing before handling incoming requests.
Implement a request queue to delay incoming /v1 requests until GPU initialization is complete.

Illustrative sketch of the pattern in Python (Ollama itself is written in Go, so this shows the idea rather than a patch):

import queue
import threading

# Queue holding incoming requests until the worker can process them
request_queue = queue.Queue()

# Event set once the model is fully loaded; replaces a hand-rolled flag plus lock
model_ready = threading.Event()

def initialize_model():
    # Initialize the model and load it into GPU memory (placeholder)
    # ...
    model_ready.set()

def handle_request(request):
    # Block until the model is ready. Re-queuing the request here instead
    # would make the worker spin in a busy loop while the model loads.
    model_ready.wait()
    # The model is ready, process the request (placeholder)
    # ...

def worker():
    while True:
        request = request_queue.get()
        if request is None:  # sentinel: stop the worker
            request_queue.task_done()
            break
        handle_request(request)
        request_queue.task_done()

# Start the model initialization thread
init_thread = threading.Thread(target=initialize_model)
init_thread.start()

# Start the worker thread to handle requests
worker_thread = threading.Thread(target=worker, daemon=True)
worker_thread.start()

# Main thread: enqueue incoming requests (placeholder request source)
for request in ["request-1", "request-2"]:
    request_queue.put(request)

# Signal shutdown and wait until every queued item has been handled
request_queue.put(None)
request_queue.join()

Verification

To verify that the fix worked, you can:

Monitor the GPU usage and CPU load while running the model via the /v1 API.
Check the system logs for any errors or warnings related to GPU initialization or request handling.
Test the model with different input sizes and types to ensure that it works correctly and consistently.
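The first check can be automated through Ollama's /api/ps endpoint, which lists loaded models with their total size and the portion held in VRAM. This is a minimal sketch, assuming the response carries "models" entries with "name", "size", and "size_vram" fields and that the server runs on the default port; gpu_share, loaded_models, and report are hypothetical helper names.

```python
import json
import urllib.request

def gpu_share(model_entry):
    """Fraction of a loaded model's memory that resides in VRAM (0.0 to 1.0)."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size

def loaded_models(base_url="http://localhost:11434"):
    """Fetch the list of currently loaded models from the running server."""
    with urllib.request.urlopen(base_url + "/api/ps", timeout=10) as resp:
        return json.loads(resp.read()).get("models", [])

def report(models):
    """Map each model name to its GPU-resident share in whole percent."""
    return {m.get("name", "?"): round(gpu_share(m) * 100) for m in models}
```

Calling report(loaded_models()) right after a /v1 request and again after a CLI or /api/generate request gives two numbers to compare. A fixed setup should report 100 for both.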

Extra Tips

To prevent similar issues in the future, consider:

Implementing a more robust request handling mechanism that can handle concurrent requests and model initialization.
Adding more logging and monitoring to detect and diagnose issues related to GPU initialization and request handling.
Testing the model with different GPU configurations and drivers to ensure that it works correctly and consistently.
