ollama - 💡(How to fix) Fix qwen3:32b important performance regression (divided by 3!) after Ollama 0.15.5 to 0.15.6 (persists in 0.17.7) [9 comments, 2 participants]

ollama · 2026-03-09 15:26:17

ollama/ollama#14740 • Fetched 2026-04-08 00:32:17

Author

viba1

Participants

rick-github

viba1

Timeline (top)

commented ×9, closed ×1, labeled ×1

What is the issue?

Since updating Ollama from version 0.15.5 to 0.15.6, the performance of the qwen3:32b model has dropped sharply (for example, from 35 tokens/second to 12 tokens/second on a single RTX 3090). This regression has not been fixed in subsequent versions, including the current 0.17.7 (March 2026). This makes the model impractical for interactive tasks.

System
OS: Linux Debian 13
GPU: RTX 3090
NVIDIA Linux driver: 590.48.01
Ollama: 0.17.7
Model: qwen3:32b (default quantization, e.g., Q4_K_M)

Steps to Reproduce
1. Install Ollama 0.15.5.
2. Download and run the model with ollama run qwen3:32b, then measure the speed (~35 tokens/s).
3. Update to 0.15.6 or later (e.g., 0.17.7).
4. Relaunch the same model: the speed drops to ~12 tokens/s.

Logs / Evidence
Manual tokens/s measurements using ollama run qwen3:32b --verbose.
No hardware or config changes during the period.
Other models are not affected to the same extent (e.g., gemma3:27b, gpt-oss, qwen3:14b).

Expected Behavior
A return to Ollama 0.15.5 performance (~35 tokens/s), or an explanation of the changes (new scheduler, memory estimates, etc.) with options to disable them.

Relevant log output

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.17.7

extent analysis

Fix Plan

To address the performance drop in Ollama version 0.15.6 and later, we will attempt to revert the changes that caused the degradation.

Revert to Previous Model Configuration: Try to use the model configuration from version 0.15.5.
Disable New Features: Identify and disable any new features introduced in version 0.15.6 that might be causing the performance drop.
Update NVIDIA Drivers: Ensure the NVIDIA drivers are up-to-date, as newer drivers might include performance optimizations.
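Before reverting anything, it may help to check whether the model is still loaded entirely into GPU memory. One possible (unconfirmed) explanation for a slowdown of this size is that part of the model is now placed in system RAM. The sketch below assumes the official ollama Python client and a running server. It relies on the size and size_vram values reported for loaded models; the helper names gpu_fraction, describe and report_loaded_models are illustrative only.

```python
def gpu_fraction(size: int, size_vram: int) -> float:
    """Fraction of a loaded model's bytes that reside in GPU memory."""
    if size <= 0:
        raise ValueError("size must be positive")
    return size_vram / size


def describe(name: str, size: int, size_vram: int) -> str:
    """One-line summary of where a loaded model lives."""
    frac = gpu_fraction(size, size_vram)
    where = "fully on GPU" if frac >= 1.0 else "partly offloaded to CPU"
    return f"{name}: {frac:.0%} in VRAM ({where})"


def report_loaded_models() -> None:
    """Print a summary line for every model the server currently has loaded."""
    import ollama  # pip install ollama; needs a running Ollama server

    for m in ollama.ps().models:
        print(describe(m.model, m.size, m.size_vram))
```

Calling report_loaded_models() right after a prompt has been sent to qwen3:32b, once on 0.15.5 and once on the newer version, would show whether the placement differs between the two. If both report the model fully on GPU, this hypothesis can be ruled out.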

Example Code Changes

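A client script cannot restore the 0.15.5 behavior, because server-side settings are read by the ollama serve process and the quantization is selected by the model tag. It can, however, measure generation speed in a repeatable way on each version:

# Measure generation speed through the Ollama Python client (pip install ollama)
import ollama

# The quantization is part of the model tag, not a load-time argument;
# qwen3:32b resolves to that tag's default quantization.
response = ollama.generate(
    model="qwen3:32b",
    prompt="Explain the difference between a process and a thread.",
)

# No documented switch named OLLAMA_DISABLE_NEW_SCHEDULER is known to exist.
# Any server-side option would have to be set in the environment of the
# ollama serve process, not in this client script.

# eval_count is the number of generated tokens; eval_duration is in nanoseconds
tokens_per_second = response["eval_count"] / (response["eval_duration"] / 1e9)
print(f"{tokens_per_second:.1f} tokens/s")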

Verification

To verify that the fix worked, run ollama run qwen3:32b --verbose, read the reported eval rate, and compare it to the expected performance of ~35 tokens/s.
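The comparison can be made explicit with a few lines of arithmetic. This is a minimal sketch using the figures reported in the issue (35 and 12 tokens/s). The 10% tolerance is an arbitrary assumption, and the function names are illustrative.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation rate from the eval_count and eval_duration (ns) response fields."""
    return eval_count / (eval_duration_ns / 1e9)


def slowdown(baseline_tps: float, current_tps: float) -> float:
    """How many times slower the current rate is than the baseline."""
    return baseline_tps / current_tps


def is_recovered(baseline_tps: float, current_tps: float, tolerance: float = 0.10) -> bool:
    """True when the current rate is within `tolerance` of the baseline."""
    return current_tps >= baseline_tps * (1 - tolerance)


# Figures from the issue: ~35 tokens/s on 0.15.5, ~12 tokens/s on 0.15.6 and later
print(f"slowdown factor: {slowdown(35, 12):.2f}")
print(f"recovered: {is_recovered(35, 12)}")
```

With the reported numbers the slowdown factor is close to 3, which matches the "divided by 3" in the title. Averaging several runs with the same prompt gives a more reliable figure than a single measurement.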

Extra Tips

Check the Ollama documentation for any known issues or performance optimizations in version 0.15.6 and later.
Consider filing a bug report with the Ollama developers to investigate the performance drop further.
