ollama - 💡(How to fix) Fix why the ollama get response very slow? (QWEN3.5 35B A3B) [13 comments, 10 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
ollama/ollama#14662Fetched 2026-04-08 00:33:11
View on GitHub
Comments
13
Participants
10
Timeline
24
Reactions
2
Author
Timeline (top)
commented ×13subscribed ×8labeled ×1referenced ×1
RAW_BUFFERClick to expand / collapse

What is the issue?

My hardware is an NVIDIA DGX Spark, running Ollama 0.17.7 and the latest version of OpenWebUI.

During testing, I noticed that the first prompt responds quickly. OpenWebUI immediately shows “think” and then generates the response.

However, when I ask a second question, it takes 30 to 60 seconds before the frontend webpage shows “think” and the response begins to appear.

This issue does not occur when I run GPT-OSS 120B.

<img width="1062" height="112" alt="Image" src="https://github.com/user-attachments/assets/32e1f49c-98e7-437f-b9f8-f38d8a165e8c" />

I test this problem in my RTX 3090 have the same problem.

i see the nvidia-smi, it use 100% cuda but not thing to show.

if restart ollama, the first qustion response quick.

Relevant log output

OS

Windows, Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.17.7

extent analysis

Fix Plan

The issue seems to be related to GPU resource management in Ollama. To resolve this, we can try the following steps:

  • Update Ollama configuration: Increase the GPU memory allocation or adjust the batch size to prevent GPU resource exhaustion.
  • Implement GPU warming: Add a GPU warming mechanism to ensure the GPU is properly initialized before handling requests.
  • Optimize GPU utilization: Modify the code to optimize GPU utilization, reducing the time it takes to generate responses.

Example code snippet to optimize GPU utilization:

import torch

# Set GPU device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Move model to GPU
model.to(device)

# Define a function to generate responses
def generate_response(prompt):
    # Move input to GPU
    input_ids = prompt.to(device)
    
    # Generate response
    output = model.generate(input_ids)
    
    # Move output back to CPU
    output = output.cpu()
    
    return output

# Define a GPU warming function
def warm_up_gpu():
    # Perform a dummy computation to warm up the GPU
    torch.randn(100, 100, device=device)

# Call the GPU warming function before handling requests
warm_up_gpu()

Verification

To verify the fix, restart Ollama and test the response time for multiple prompts. Check the GPU utilization using nvidia-smi to ensure it's not maxed out.

Extra Tips

  • Monitor GPU memory usage and adjust the configuration accordingly.
  • Consider implementing a queueing system to handle multiple requests and prevent GPU resource exhaustion.
  • Refer to the Ollama documentation for more information on optimizing GPU performance.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING