ollama - 💡(How to fix) Fix qwen3:32b important performance regression (divided by 3!) after Ollama 0.15.5 to 0.15.6 (persists in 0.17.7) [9 comments, 2 participants]

GitHub stats
Issue: ollama/ollama#14740 (fetched 2026-04-08 00:32:17)
Comments: 9 · Participants: 2 · Timeline events: 11 · Reactions: 0
Timeline (top): commented ×9, closed ×1, labeled ×1

What is the issue?

Since updating Ollama from version 0.15.5 to 0.15.6, the performance of the qwen3:32b model has dropped drastically: for example, from 35 tokens/second to 12 tokens/second on a single RTX 3090. This degradation has not been fixed in subsequent versions, including the current 0.17.7 (March 2026), and it makes the model impractical for interactive tasks.

System:
  • OS: Linux Debian 13
  • GPU: RTX 3090
  • NVIDIA Linux driver: 590.48.01
  • Ollama: 0.17.7
  • Model: qwen3:32b (default quantization, e.g., Q4_K_M)

Steps to Reproduce
  1. Install Ollama 0.15.5.
  2. Download and run ollama run qwen3:32b → measure ~35 tokens/s.
  3. Update to 0.15.6 or later (e.g., 0.17.7).
  4. Relaunch the same model → speed drops to ~12 tokens/s.
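
When repeating these steps it helps to record which server version each measurement came from. The sketch below is one possible way to do that, assuming the server listens on the default local address and exposes the /api/version endpoint. FIRST_SLOW_VERSION and the helper names are illustrative, taken from the version numbers in this report rather than from any official changelog.

```python
import json
import urllib.request

# Assumption from this report: 0.15.6 is the first release showing the slowdown.
FIRST_SLOW_VERSION = (0, 15, 6)


def parse_version(text):
    """Turn a string such as '0.17.7' or 'v0.17.7-rc1' into a tuple of ints."""
    core = text.strip().lstrip("v").split("-")[0].split("+")[0]
    return tuple(int(part) for part in core.split("."))


def is_at_or_after_regression(version_text):
    """True if the version is 0.15.6 or newer, per the report's timeline."""
    return parse_version(version_text) >= FIRST_SLOW_VERSION


def server_version(base_url="http://localhost:11434"):
    """Ask a running Ollama server for its version via GET /api/version."""
    with urllib.request.urlopen(f"{base_url}/api/version") as resp:
        return json.load(resp)["version"]
```

Calling is_at_or_after_regression(server_version()) next to each throughput reading makes it harder to mix up measurements taken before and after the upgrade. Whether releases newer than 0.17.7 are still affected is not something this report establishes.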

Logs / Evidence
  • Manual tokens/s measurements using ollama run qwen3:32b --verbose.
  • No hardware or config changes during the period.
  • Other models are not affected to the same extent (e.g., gemma3:27b, gpt-oss, or qwen3:14b).

Expected Behavior
Return to Ollama 0.15.5 performance (~35 tokens/s), or an explanation of what changed (new scheduler, memory estimates, etc.) with options to disable it.

Relevant log output

OS: Linux
GPU: Nvidia
CPU: AMD
Ollama version: 0.17.7

extent analysis

Fix Plan

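To address the performance drop seen in Ollama 0.15.6 and later, the plan is to isolate and, where possible, undo whatever change caused the degradation. None of these steps is confirmed as a fix in the issue.

  1. Revert to Previous Model Configuration: Try to run the model with the same configuration that was in effect under version 0.15.5, and confirm that ~35 tokens/s is still reachable there.
  2. Disable New Features: Identify any features introduced in version 0.15.6 (the reporter suspects the scheduler or memory estimates) and disable them one at a time, if an option to do so exists.
  3. Update NVIDIA Drivers: Ensure the NVIDIA drivers are up to date, as newer drivers might include performance optimizations. The reporter already runs driver 590.48.01 and reports no hardware or config changes, so this is unlikely to be the sole cause.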
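
Since the issue mentions memory estimates as one suspect, a quick diagnostic is to check how much of the loaded model actually sits in GPU memory. The following is a minimal sketch, assuming the default local server address and the size and size_vram fields returned by GET /api/ps. The helper names are illustrative, and a partial offload is only a hypothesis here, not a cause confirmed by the issue.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: default local server address


def vram_fraction(model_entry):
    """Share of a loaded model's memory that resides in VRAM (0.0 to 1.0)."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size


def loaded_models(base_url=OLLAMA_URL):
    """Fetch the currently loaded models from GET /api/ps."""
    with urllib.request.urlopen(f"{base_url}/api/ps") as resp:
        return json.load(resp).get("models", [])


def report(models):
    """Build one human-readable line per loaded model."""
    return [f"{m.get('name', '?')}: {vram_fraction(m):.0%} in VRAM" for m in models]


def main():
    """Print the VRAM share of every model the server currently has loaded."""
    for line in report(loaded_models()):
        print(line)
```

Run main() while qwen3:32b is loaded on both 0.15.5 and a later version. If the VRAM share drops below 100% only on the newer version, part of the model is being served from system memory, which would be consistent with a slowdown of this size.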

Example Code Changes

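To compare configurations you first need a repeatable throughput measurement. The script below uses the official ollama Python client (pip install ollama) against a running server. Note that OLLAMA_DISABLE_NEW_SCHEDULER, suggested in an earlier draft of this plan, is an unverified placeholder and not a confirmed Ollama setting; any real server-side option must be set in the environment of the ollama serve process, not in this client script.

# Import necessary libraries
import ollama

# Default tag of the model; the issue reports the default quantization (e.g., Q4_K_M)
MODEL = "qwen3:32b"

# Illustrative prompt; use the same prompt for every run you compare
PROMPT = "Explain what a mutex is in three sentences."

# Run one non-streaming generation and keep the timing stats the server returns
response = ollama.generate(model=MODEL, prompt=PROMPT)

# eval_count is the number of generated tokens,
# eval_duration is the generation time in nanoseconds
tokens_per_second = response["eval_count"] / response["eval_duration"] * 1e9

# Report throughput so it can be compared with the ~35 tokens/s baseline
print(f"{MODEL}: {tokens_per_second:.1f} tokens/s")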

Verification

To verify that a change worked, run ollama run qwen3:32b --verbose, read the eval rate it prints after the response, and compare it to the expected ~35 tokens/s. Use the same prompt each time, since throughput varies with context length.
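
A single reading can be noisy, so it may be worth summarizing several runs before deciding whether performance has recovered. The sketch below assumes each run is recorded as an (eval_count, eval_duration in nanoseconds) pair, as reported by the Ollama API. The 10% tolerance and the function names are illustrative choices, not values from the issue.

```python
from statistics import median

BASELINE_TPS = 35.0  # ~35 tokens/s reported on Ollama 0.15.5 in this issue


def tokens_per_second(eval_count, eval_duration_ns):
    """Convert a token count and a duration in nanoseconds to tokens/s."""
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration_ns must be positive")
    return eval_count / (eval_duration_ns / 1e9)


def summarize(runs, baseline=BASELINE_TPS, tolerance=0.10):
    """Summarize runs given as (eval_count, eval_duration_ns) pairs.

    tolerance is a hypothetical threshold: the result counts as recovered
    when the median is within that fraction of the baseline.
    """
    rates = [tokens_per_second(count, duration) for count, duration in runs]
    mid = median(rates)
    return {
        "median_tps": mid,
        "slowdown": baseline / mid,
        "recovered": mid >= baseline * (1.0 - tolerance),
    }
```

With the numbers from this report, a median of 12 tokens/s against the 35 tokens/s baseline gives a slowdown of about 2.9, which matches the "divided by 3" in the title.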

Extra Tips

  • Check the Ollama documentation for any known issues or performance optimizations in version 0.15.6 and later.
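  • This page is derived from an existing bug report, ollama/ollama#14740. Rather than filing a duplicate, add your own measurements there or reference it in a new report so the Ollama developers can investigate the performance drop further.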
