ollama - 💡(How to fix) Fix Ollama very slow with LLM video on Mac mini M4 16gb [1 participants]

ollama 2026-03-04 23:32:46

ollama/ollama#14629•Fetched 2026-04-08 00:33:36

Author

vpoma777

What is the issue?

Environment

Hardware: Mac mini M4 16GB unified memory
OS: macOS
Ollama version: 0.17.6
Model: llama3.2-vision:11b (running 100% GPU via Metal)

Problem

Every single inference call takes ~20 seconds regardless of any optimization attempted. The visual encoder latency is not reported in the API response breakdown, making it invisible but dominant.

What I tested

Image sizes: 24KB to 267KB → no difference
Image resolutions: 560x560 to 1920x1080 → no difference
OLLAMA_FLASH_ATTENTION=1 → no difference
OLLAMA_KV_CACHE_TYPE=q8_0 → no difference
Warm cache vs cold cache → no difference
Killed duplicate Ollama instance (app + CLI conflict) → no difference
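
The two environment variables in the list above are read by the Ollama server when it starts, so they have to be present in the environment of the server process rather than the client. A minimal sketch of assembling such an environment in Python follows; the helper name `build_ollama_env` is hypothetical, and the commented-out launch line assumes the `ollama` binary is on `PATH`.

```python
import os


def build_ollama_env(flash_attention=True, kv_cache_type="q8_0", base=None):
    """Return a copy of the environment with the two tested settings applied."""
    env = dict(os.environ if base is None else base)
    env["OLLAMA_FLASH_ATTENTION"] = "1" if flash_attention else "0"
    env["OLLAMA_KV_CACHE_TYPE"] = kv_cache_type
    return env


if __name__ == "__main__":
    env = build_ollama_env()
    print(env["OLLAMA_FLASH_ATTENTION"], env["OLLAMA_KV_CACHE_TYPE"])
    # Assumption: `ollama` is on PATH and no other instance is already running.
    # import subprocess; subprocess.Popen(["ollama", "serve"], env=env)
```

Building the environment as a copy keeps the settings scoped to the launched server and leaves the calling process untouched.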

API response breakdown

total_duration:       ~20s
load_duration:        ~0.1s
prompt_eval_duration: ~0.75s
eval_duration:        ~0.46s
─────────────────────────
Accounted for:        ~1.3s
Unaccounted:          ~19s  ← visual encoder overhead, not reported
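
The same arithmetic can be reproduced from any non-streaming API response. Ollama reports its duration fields in nanoseconds; the sketch below subtracts the three reported phases from `total_duration`. The function name is hypothetical and the sample values are the approximate figures from this report, not a captured response.

```python
NS_PER_S = 1_000_000_000


def unaccounted_seconds(response):
    """Seconds of total_duration not covered by the reported phases."""
    reported = ("load_duration", "prompt_eval_duration", "eval_duration")
    accounted = sum(response.get(key, 0) for key in reported)
    return (response["total_duration"] - accounted) / NS_PER_S


if __name__ == "__main__":
    # Approximate values from the breakdown above, converted to nanoseconds.
    sample = {
        "total_duration": 20_000_000_000,
        "load_duration": 100_000_000,
        "prompt_eval_duration": 750_000_000,
        "eval_duration": 460_000_000,
    }
    print(f"unaccounted: {unaccounted_seconds(sample):.2f}s")
```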

Expected behavior

On Apple M4 with Metal acceleration and model fully loaded in GPU, visual encoding should take 2-4 seconds, not 20.

Notes

When the same image is sent twice in rapid succession (within seconds), the second call returns in ~0.7s due to visual cache hit. This confirms the M4 is capable of fast inference — the problem is the first encoding of any new image takes ~20s with no way to pre-warm with a different image.
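
One way to reproduce this observation is to time back-to-back calls with the same image and compare the first and second wall-clock durations. The sketch below is a minimal timing harness; `time_calls` and `fake_send` are hypothetical names, and `send` stands for whatever function actually posts the image to Ollama.

```python
import time


def time_calls(send, image, repeats=2):
    """Call send(image) back to back; return wall-clock seconds for each call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        send(image)
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Stand-in for a real request function so the sketch runs without a server.
    def fake_send(image):
        time.sleep(0.01)

    print([f"{t:.3f}s" for t in time_calls(fake_send, b"image-bytes")])
```

Measuring on the client side captures the full request time, including any phase that the API response does not break out.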

Relevant log output

OS

No response

GPU

No response

CPU

No response

Ollama version

No response

extent analysis

Fix Plan

The report identifies no confirmed fix. The ~19s gap is attributed to the visual encoder, so the steps below are workarounds to try rather than a verified solution.

Step 1: Update Ollama Version. Ensure you are using the latest version of Ollama, as updates often include performance improvements.
Step 2: Pre-warm the Visual Encoder. Send a dummy image before making actual inference calls. Note that the reporter saw the speed-up only when the same image was resent, so pre-warming with a different image may not reduce latency for a new image.
Step 3: Optimize Image Processing. Resize images to a consistent size before sending them for inference. The reporter saw no difference between 560x560 and 1920x1080, so expect little gain from this step alone.

Example code snippet in Python to pre-warm the visual encoder:

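import base64
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint

def infer(path):
    # Ollama takes base64 images in a JSON body, not multipart file uploads
    with open(path, "rb") as f:
        image = base64.b64encode(f.read()).decode("ascii")
    body = {"model": "llama3.2-vision:11b", "prompt": "Describe this image.",
            "images": [image], "stream": False}
    return requests.post(OLLAMA_URL, json=body, timeout=120).json()

# Pre-warm the visual encoder with a dummy image
infer("dummy_image.jpg")

# Now make the actual inference call
response = infer("actual_image.jpg")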
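
Step 3 of the fix plan asks for a consistent image size. The sketch below only computes the target dimensions, keeping the aspect ratio; the actual resampling would need an image library of your choice. The function name `fit_within` is hypothetical, and the default of 560 is an assumption taken from the smallest resolution the reporter tested.

```python
def fit_within(width, height, max_side=560):
    """Scale (width, height) down so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough, leave unchanged
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    print(fit_within(1920, 1080))
    print(fit_within(560, 560))
```

Since the reporter measured the same latency across resolutions, the main benefit here is making timings comparable between runs.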

Verification

Verify the fix by comparing total_duration for the first inference call on a new image before and after pre-warming. If pre-warming helps, the value should drop well below the ~20s baseline.

Extra Tips

Monitor the GPU utilization to ensure it's being fully utilized during inference.
Consider using a more efficient image processing library to reduce the overhead of image resizing and processing.
If possible, use a batch processing approach to send multiple images for inference at once, which can help reduce the overall latency.


