ollama - 💡(How to fix) Fix why the ollama get response very slow? (QWEN3.5 35B A3B) [13 comments, 10 participants]

ollama2026-03-06 10:05:49

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

ollama/ollama#14662•Fetched 2026-04-08 00:33:11

View on GitHub

Comments

Participants

Timeline

Reactions

Author

Participants

Timeline (top)

commented ×13subscribed ×8labeled ×1referenced ×1

RAW_BUFFERClick to expand / collapse

What is the issue?

My hardware is an NVIDIA DGX Spark, running Ollama 0.17.7 and the latest version of OpenWebUI.

During testing, I noticed that the first prompt responds quickly. OpenWebUI immediately shows “think” and then generates the response.

However, when I ask a second question, it takes 30 to 60 seconds before the frontend webpage shows “think” and the response begins to appear.

This issue does not occur when I run GPT-OSS 120B.

I test this problem in my RTX 3090 have the same problem.

i see the nvidia-smi, it use 100% cuda but not thing to show.

if restart ollama, the first qustion response quick.

Relevant log output

OS

Windows, Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.17.7

extent analysis

Fix Plan

The issue seems to be related to GPU resource management in Ollama. To resolve this, we can try the following steps:

Update Ollama configuration: Increase the GPU memory allocation or adjust the batch size to prevent GPU resource exhaustion.
Implement GPU warming: Add a GPU warming mechanism to ensure the GPU is properly initialized before handling requests.
Optimize GPU utilization: Modify the code to optimize GPU utilization, reducing the time it takes to generate responses.

Example code snippet to optimize GPU utilization:

import torch

# Set GPU device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Move model to GPU
model.to(device)

# Define a function to generate responses
def generate_response(prompt):
    # Move input to GPU
    input_ids = prompt.to(device)
    
    # Generate response
    output = model.generate(input_ids)
    
    # Move output back to CPU
    output = output.cpu()
    
    return output

# Define a GPU warming function
def warm_up_gpu():
    # Perform a dummy computation to warm up the GPU
    torch.randn(100, 100, device=device)

# Call the GPU warming function before handling requests
warm_up_gpu()

Verification

To verify the fix, restart Ollama and test the response time for multiple prompts. Check the GPU utilization using nvidia-smi to ensure it's not maxed out.

Extra Tips

Monitor GPU memory usage and adjust the configuration accordingly.
Consider implementing a queueing system to handle multiple requests and prevent GPU resource exhaustion.
Refer to the Ollama documentation for more information on optimizing GPU performance.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #ssr #installation #tensor shape #autograd error #retrieval issue #search optimization #API routing #API middleware #SSR setup

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

ollama - 💡(How to fix) Fix why the ollama get response very slow? (QWEN3.5 35B A3B) [13 comments, 10 participants]

Recommended Tools

GitHub issue graph ai analysis

What is the issue?

Relevant log output

OS

GPU

CPU

Ollama version

extent analysis

Fix Plan

Verification

Extra Tips

Still need to ship something?

TRENDING

ollama - 💡(How to fix) Fix why the ollama get response very slow? (QWEN3.5 35B A3B) [13 comments, 10 participants]

Recommended Tools

GitHub issue graph ai analysis

What is the issue?

Relevant log output

OS

GPU

CPU

Ollama version

extent analysis

Fix Plan

Verification

Extra Tips

Still need to ship something?

RELATED_DISCOVERY

TRENDING