ollama - Ollama 0.20.0: /v1/chat/completions hangs indefinitely on Apple Silicon M4

GitHub stats
ollama/ollama#15258 · Fetched 2026-04-08 02:33:36
Comments: 1 · Participants: 1 · Timeline events: 5 · Reactions: 0
Timeline (top): closed ×1, commented ×1, cross-referenced ×1, labeled ×1


What is the issue?

Ollama 0.20.0: /v1/chat/completions hangs indefinitely on Apple Silicon M4

Environment

  • Hardware: Mac Mini M4 (32GB unified memory)
  • OS: macOS (Apple Silicon arm64)
  • Ollama: 0.20.0 GA (both Homebrew bottle and official ollama-darwin.tgz from GitHub releases)
  • Previous working version: 0.20.0-rc1 (installed via curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.20.0-rc1 sh)

Bug Summary

The OpenAI-compatible /v1/chat/completions endpoint hangs indefinitely for all generative models on Ollama 0.20.0 GA running on Apple Silicon M4. The request is accepted (TCP connection established, POST sent) but zero bytes are ever returned. After curl's timeout, the server logs a 499 (client closed connection).

Additionally, we discovered that /api/chat and /api/generate (native endpoints) are ALSO broken on 0.20.0 GA — they exhibit the same hang behavior. The runner process spawns, loads the model successfully, but produces no output.

What works

  • /api/version — responds instantly
  • /api/tags — lists models correctly
  • /api/ps — shows loaded models
  • /api/pull — pulls models successfully
  • /api/embeddings — nomic-embed-text returns 768-dim vectors in ~30ms
  • /v1/embeddings — also works perfectly

What's broken

  • /v1/chat/completions — hangs, 0 bytes, eventually 499
  • /api/chat — hangs, 0 bytes
  • /api/generate — hangs, 0 bytes (both stream:true and stream:false)
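The working/broken split above can be re-checked in one pass. The following is a minimal sketch, not part of the original report: the function names (classify_exit, probe, probe_all) are hypothetical, and the URL, model, and timeout defaults are assumptions taken from the reproduction commands. It relies only on documented curl behavior: -m sets a total time limit, -f makes HTTP errors fail, and exit code 28 means the operation timed out.

```shell
# Assumed defaults taken from the report; override via the environment.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"
MODEL="${MODEL:-gemma4:e2b}"
TIMEOUT="${TIMEOUT:-60}"

# Map a curl exit status to a short label.
# 7 = could not connect, 22 = HTTP error (with -f), 28 = timed out (-m).
classify_exit() {
  case "$1" in
    0)  echo "ok" ;;
    7)  echo "connect failed" ;;
    22) echo "http error" ;;
    28) echo "timeout" ;;
    *)  echo "curl exit $1" ;;
  esac
}

# probe PATH [JSON_BODY]: send one request, discard the body, and print
# "PATH<TAB>label". An empty body means a plain GET.
probe() {
  path=$1
  body=$2
  rc=0
  if [ -n "$body" ]; then
    curl -sf -o /dev/null -m "$TIMEOUT" \
      -H "Content-Type: application/json" -d "$body" \
      "$OLLAMA_URL$path" || rc=$?
  else
    curl -sf -o /dev/null -m "$TIMEOUT" "$OLLAMA_URL$path" || rc=$?
  fi
  printf '%s\t%s\n' "$path" "$(classify_exit "$rc")"
}

# Probe a subset of the endpoints listed in the report.
probe_all() {
  probe /api/version ""
  probe /api/tags ""
  probe /api/embeddings '{"model":"nomic-embed-text","prompt":"test"}'
  probe /api/generate "{\"model\":\"$MODEL\",\"prompt\":\"Say hello\",\"stream\":false}"
  probe /api/chat "{\"model\":\"$MODEL\",\"messages\":[{\"role\":\"user\",\"content\":\"Say hello\"}],\"stream\":false}"
  probe /v1/chat/completions "{\"model\":\"$MODEL\",\"messages\":[{\"role\":\"user\",\"content\":\"Say hello\"}],\"max_tokens\":10}"
}
```

Calling probe_all against an affected server would be expected to print "ok" for the metadata and embedding endpoints and "timeout" for the three generative ones, if the behavior matches the report. Setting TIMEOUT=10 shortens the wait per hanging endpoint.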

Models tested (all fail)

  • gemma4:e2b (7.2GB)
  • gemma4:26b (18GB)
  • qwen3-vl:8b (6.1GB)
  • qwen3.5:9b (6.6GB)

Reproduction

# Server running with:
# OLLAMA_HOST=0.0.0.0:11434
# OLLAMA_FLASH_ATTENTION=1

# This hangs forever (or until timeout):
curl -m 60 http://localhost:11434/api/generate \
  -d '{"model":"gemma4:e2b","prompt":"Say hello","stream":false}'
# Returns: empty (0 bytes after 60s)

# This also hangs:
curl -m 60 http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma4:e2b","messages":[{"role":"user","content":"Say hello"}],"max_tokens":10}'
# Returns: empty (0 bytes after 60s)

# But this works instantly:
curl http://localhost:11434/api/embeddings \
  -d '{"model":"nomic-embed-text","prompt":"test"}'
# Returns: 768-dim embedding vector

Server logs during the hang

# Error log shows model loads successfully:
level=INFO source=server.go:1390 msg="llama runner started in 1.62 seconds"

# But the runner process consumes 200-380% CPU indefinitely:
vsw  64149 384.6 28.8 449640000 9672752 ?? R /opt/homebrew/Cellar/ollama/0.20.0/bin/ollama runner ...

# Serve log shows the request eventually times out:
[GIN] 2026/04/02 - 21:46:51 | 499 | 3m16s | ::1 | POST "/v1/chat/completions"
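Both symptoms in these logs can be filtered mechanically. The sketch below assumes the GIN access-log layout and the ps column order shown above; the function names list_client_aborts and busy_runners are hypothetical, and where the server log lives depends on how Ollama was installed, so the input is read from stdin rather than a fixed path.

```shell
# Print "latency request" for every GIN access-log line with status 499
# (client closed the connection before a response was sent).
# Fields are separated by "|": timestamp | status | latency | client | request.
list_client_aborts() {
  awk -F'|' '
    /^\[GIN\]/ {
      status = $2; latency = $3; request = $5
      gsub(/^[ \t]+|[ \t]+$/, "", status)
      gsub(/^[ \t]+|[ \t]+$/, "", latency)
      gsub(/^[ \t]+|[ \t]+$/, "", request)
      gsub(/[ \t]+/, " ", request)   # collapse column padding
      if (status == "499") print latency, request
    }'
}

# Print "PID %CPU" for every "ollama runner" process at or above a CPU
# threshold. Expects `ps aux`-style input: USER PID %CPU ... COMMAND.
# $1 = minimum %CPU (defaults to 100).
busy_runners() {
  awk -v min="${1:-100}" '
    /ollama runner/ && ($3 + 0) >= (min + 0) { print $2, $3 }'
}
```

For example, piping the serve log through list_client_aborts lists each abandoned request with how long it ran, and `ps aux | busy_runners 200` shows runner processes that are spinning. A runner that stays above the threshold while no tokens are returned matches the hang described here.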

Troubleshooting performed

  1. Stripped all env vars (removed OLLAMA_FLASH_ATTENTION, OLLAMA_KV_CACHE_TYPE, OLLAMA_NUM_PARALLEL) — same behavior
  2. Tested with bare OLLAMA_HOST=0.0.0.0:11434 only — same behavior
  3. Tested locally (127.0.0.1) and over LAN (10.0.3.161) — same behavior
  4. Tested both Homebrew bottle and official darwin tgz — same behavior
  5. Re-pulled models (/api/pull) — models pull successfully, still can't generate
  6. Tested streaming mode (stream:true) — also hangs, no output
  7. Confirmed non-multimodal models (qwen3.5:9b) also hang — not Gemma-specific

Working configuration

Reverted to Ollama 0.19.0 (ollama-darwin.tgz from GitHub releases v0.19.0). All native endpoints work. /v1/chat/completions works. Tool calling works. However, 0.19.0 does not support Gemma 4 models (500 error on load).

Key observation

0.20.0-rc1 worked perfectly on the same hardware with the same models and same configuration. The rc1 was installed via the direct install script with OLLAMA_VERSION=0.20.0-rc1 and ran Gemma 4 models with native /api/chat, /api/generate, tool calling, and even /v1/chat/completions (though /v1 was slower). The GA release introduced a regression between rc1 and the final 0.20.0 build.

Impact

This blocks usage of Gemma 4 models on Apple Silicon M4, since:

  • 0.19.0 doesn't support Gemma 4
  • 0.20.0 can't generate any output

Comparison: M1 Mac Mini works fine

On a Mac Mini M1 (16GB) running Ollama 0.18.2, all endpoints including /v1/chat/completions work correctly with llama3.1:8b. This issue appears specific to the 0.20.0 GA build on Apple Silicon (at minimum M4, untested on M1 with 0.20.0).

Relevant log output

OS

macOS

GPU

Apple M4

CPU

Apple M4

Ollama version

0.20.0

extent analysis

TL;DR

The most practical workaround is to roll back to Ollama 0.20.0-rc1, which the reporter confirmed working on the same hardware with the same models and configuration. This does not fix the regression in the 0.20.0 GA build; it only avoids it.

Guidance

  • Revert to Ollama version 0.20.0-rc1 by installing it via the direct install script with OLLAMA_VERSION=0.20.0-rc1.
  • Verify that the native endpoints (/api/chat, /api/generate) and /v1/chat/completions are working as expected with the reverted version.
  • If reverting to 0.20.0-rc1 is not feasible, consider testing Ollama 0.20.0 on a different hardware configuration, such as a Mac Mini M1, to determine if the issue is specific to the Apple Silicon M4.
  • Investigate the changes made between 0.20.0-rc1 and the final 0.20.0 build to identify the potential cause of the regression.
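Before re-running the endpoint checks, it helps to confirm which build is actually serving requests, since the report involved several install methods. The sketch below parses the JSON returned by /api/version, which has the form {"version":"..."}. The function names are hypothetical, and the assumption that the release candidate reports itself as "0.20.0-rc1" is not confirmed by the issue.

```shell
# Extract the "version" value from /api/version JSON read on stdin.
# Uses sed so that no JSON tool needs to be installed.
extract_version() {
  sed -n 's/.*"version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# version_matches EXPECTED JSON: succeed only if the JSON reports
# exactly the expected version string.
version_matches() {
  [ "$(printf '%s\n' "$2" | extract_version)" = "$1" ]
}
```

A possible usage after the rollback is `version_matches 0.20.0-rc1 "$(curl -s http://localhost:11434/api/version)" && echo "rollback confirmed"`. If this fails, an older server process or a different install (for example the Homebrew one) may still be the one listening on the port.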

Example

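The issue itself records the command the reporter used to install the release candidate: curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.20.0-rc1 sh. After installing, repeat the requests from the Reproduction section to confirm that generation works again.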

Notes

The issue appears to be specific to the 0.20.0 GA build on Apple Silicon M4, and reverting to 0.20.0-rc1 may be the most straightforward solution. However, this may not be a long-term fix, and further investigation is needed to resolve the issue in the 0.20.0 GA build.

Recommendation

Apply the workaround by reverting to Ollama version 0.20.0-rc1, as it has been confirmed to work on the same hardware with the same models and configuration. This will allow for temporary usage of Gemma 4 models on Apple Silicon M4 until a permanent fix is available.
