ollama - 💡(How to fix) Fix Ollama very slow with LLM video on Mac mini M4 16gb [1 participants]

GitHub stats
ollama/ollama#14629 · Fetched 2026-04-08 00:33:36
Comments: 0 · Participants: 1 · Timeline: 1 (labeled ×1) · Reactions: 0

What is the issue?

Environment

  • Hardware: Mac mini M4 16GB unified memory
  • OS: macOS
  • Ollama version: 0.17.6
  • Model: llama3.2-vision:11b (running 100% GPU via Metal)

Problem

Every single inference call takes ~20 seconds regardless of any optimization attempted. The visual encoder latency is not reported in the API response breakdown, making it invisible but dominant.

What I tested

  • Image sizes: 24KB to 267KB → no difference
  • Image resolutions: 560x560 to 1920x1080 → no difference
  • OLLAMA_FLASH_ATTENTION=1 → no difference
  • OLLAMA_KV_CACHE_TYPE=q8_0 → no difference
  • Warm cache vs cold cache → no difference
  • Killed duplicate Ollama instance (app + CLI conflict) → no difference
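The two environment variables in the list only take effect if the Ollama server process itself sees them. A minimal sketch of starting the server with both set, assuming the `ollama` binary is on PATH and no other instance is already running; `build_ollama_env` and `start_server` are hypothetical helper names, not part of Ollama:

```python
import os
import subprocess

def build_ollama_env(base_env=None):
    """Return a copy of the environment with the two tested settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    env["OLLAMA_FLASH_ATTENTION"] = "1"
    env["OLLAMA_KV_CACHE_TYPE"] = "q8_0"
    return env

def start_server():
    """Start `ollama serve` with the settings above (assumes `ollama` is on PATH)."""
    return subprocess.Popen(["ollama", "serve"], env=build_ollama_env())
```

If Ollama runs as the macOS menu-bar app rather than from a shell, variables exported in a terminal may not reach it, so it is worth confirming which process is actually serving requests.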

API response breakdown

total_duration:        ~20s
load_duration:         ~0.1s
prompt_eval_duration:  ~0.75s
eval_duration:         ~0.46s
─────────────────────────
Accounted for:         ~1.3s
Unaccounted:           ~19s  ← visual encoder overhead, not reported
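The same arithmetic can be done on any response. Ollama reports these duration fields in nanoseconds; the sketch below subtracts the reported phases from `total_duration`. The function name and the sample values (rounded from the figures in the breakdown) are illustrative assumptions:

```python
NS_PER_S = 1_000_000_000  # Ollama reports durations in nanoseconds

def unaccounted_seconds(resp):
    """Seconds of total_duration not covered by the reported phases."""
    reported = (resp.get("load_duration", 0)
                + resp.get("prompt_eval_duration", 0)
                + resp.get("eval_duration", 0))
    return (resp["total_duration"] - reported) / NS_PER_S

# Illustrative values matching the approximate figures in the breakdown.
sample = {
    "total_duration": 20_000_000_000,
    "load_duration": 100_000_000,
    "prompt_eval_duration": 750_000_000,
    "eval_duration": 460_000_000,
}
print(round(unaccounted_seconds(sample), 2))  # 18.69
```

A large remainder does not prove where the time goes; it only shows that the reported phases do not explain the total.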

Expected behavior

On Apple M4 with Metal acceleration and model fully loaded in GPU, visual encoding should take 2-4 seconds, not 20.

Notes

When the same image is sent twice in rapid succession (within seconds), the second call returns in ~0.7s due to visual cache hit. This confirms the M4 is capable of fast inference — the problem is the first encoding of any new image takes ~20s with no way to pre-warm with a different image.
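The repeat-image observation can be reproduced with a small timing harness. This is a sketch only: `send` is a hypothetical callable that performs one inference request for the given image bytes, and the 0.25 ratio is an arbitrary threshold chosen for illustration.

```python
import time

def time_repeated_calls(send, image_bytes, repeats=2):
    """Call send(image_bytes) `repeats` times; return wall-clock seconds per call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        send(image_bytes)
        timings.append(time.perf_counter() - start)
    return timings

def looks_like_cache_hit(timings, ratio=0.25):
    """True if every repeat call took under `ratio` of the first call's time."""
    first, rest = timings[0], timings[1:]
    return bool(rest) and all(t < first * ratio for t in rest)
```

With the figures reported above, `looks_like_cache_hit([20.0, 0.7])` is true, while two calls with different images at ~20s each would not qualify.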

Relevant log output

OS

No response

GPU

No response

CPU

No response

Ollama version

No response

extent analysis

Fix Plan

No confirmed fix is recorded on the issue (0 comments at fetch time). The steps below are mitigation attempts aimed at the unreported visual encoder latency.

  • Step 1: Update Ollama. The issue was reported on 0.17.6. Check whether the latest release behaves differently, as updates often include performance improvements.
  • Step 2: Pre-warm the visual encoder. Send a dummy image before making actual inference calls. Note that the reporter saw a fast response (~0.7s) only when the same image was resent and found no way to pre-warm with a different image, so this may help only for repeated images.
  • Step 3: Normalize image size. Resize images to a consistent size before sending them for inference. The reporter saw no difference between 560x560 and 1920x1080, so treat this as a low-probability fix.

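Example Python snippet to pre-warm the visual encoder, assuming Ollama is serving on its default local address (http://localhost:11434):

import base64
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint

def describe(path, prompt="Describe this image."):
    # Ollama expects images as base64 strings in the "images" list
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    payload = {"model": "llama3.2-vision:11b", "prompt": prompt,
               "images": [image_b64], "stream": False}
    return requests.post(OLLAMA_URL, json=payload, timeout=120).json()

# Pre-warm the visual encoder with a dummy image
describe("dummy_image.jpg")
# Now make the actual inference call
response = describe("actual_image.jpg")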

Verification

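Verify the result by timing the first inference call with a previously unseen image after pre-warming, and compare it with the ~20s baseline. Because the encoder time is not reported separately, compare total_duration (or wall-clock time) rather than prompt_eval_duration or eval_duration. If the first call with a new image still takes ~20s, the pre-warm did not help.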
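One way to make that comparison explicit is a small helper that relates a measured first-call latency to the baseline. This is a sketch under assumptions: the function name is hypothetical, and the 4.0s default target is taken from the upper end of the 2-4 second expectation stated in the issue.

```python
def compare_to_baseline(baseline_s, measured_s, target_s=4.0):
    """Compare a measured first-call latency against the pre-fix baseline."""
    if baseline_s <= 0 or measured_s <= 0:
        raise ValueError("latencies must be positive")
    return {
        "speedup": baseline_s / measured_s,      # >1 means faster than baseline
        "saved_s": baseline_s - measured_s,      # seconds saved per call
        "meets_target": measured_s <= target_s,  # within the expected range
    }
```

For example, `compare_to_baseline(20.0, 19.0)` reports a speedup barely above 1 and `meets_target` as false, which would indicate the mitigation had no practical effect.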

Extra Tips

  • Monitor the GPU utilization to ensure it's being fully utilized during inference.
  • Consider using a more efficient image processing library to reduce the overhead of image resizing and processing.
  • If possible, process multiple images in a batch. This may improve overall throughput, though it will not by itself reduce the latency of a single new image.
