ollama - 💡(How to fix) Fix Ollama 0.20.0: /v1/chat/completions hangs indefinitely on Apple Silicon M4 [1 comments, 1 participants]

What is the issue?

Ollama 0.20.0: /v1/chat/completions hangs indefinitely on Apple Silicon M4

Environment

Hardware: Mac Mini M4 (32GB unified memory)
OS: macOS (Apple Silicon arm64)
Ollama: 0.20.0 GA (both Homebrew bottle and official ollama-darwin.tgz from GitHub releases)
Previous working version: 0.20.0-rc1 (installed via curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.20.0-rc1 sh)

Bug Summary

The OpenAI-compatible /v1/chat/completions endpoint hangs indefinitely for all generative models on Ollama 0.20.0 GA running on Apple Silicon M4. The request is accepted (TCP connection established, POST sent) but zero bytes are ever returned. After curl's timeout, the server logs a 499 (client closed connection).

Additionally, we discovered that /api/chat and /api/generate (native endpoints) are ALSO broken on 0.20.0 GA — they exhibit the same hang behavior. The runner process spawns, loads the model successfully, but produces no output.

What works

/api/version — responds instantly
/api/tags — lists models correctly
/api/ps — shows loaded models
/api/pull — pulls models successfully
/api/embeddings — nomic-embed-text returns 768-dim vectors in ~30ms
/v1/embeddings — also works perfectly

What's broken

/v1/chat/completions — hangs, 0 bytes, eventually 499
/api/chat — hangs, 0 bytes
/api/generate — hangs, 0 bytes (both stream:true and stream:false)
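
The working/broken split above can be re-checked with a small probe script. What follows is a minimal sketch, not part of the original report: `classify_probe`, `probe`, `OLLAMA_URL` and `PROBE_TIMEOUT` are illustrative names, and the verdict labels are arbitrary. It relies only on curl's documented exit codes (28 for a timeout, 7 for a failed connection) and the `http_code` and `size_download` write-out variables.

```sh
# Sketch only: OLLAMA_URL, PROBE_TIMEOUT, classify_probe and probe are
# illustrative names, not part of Ollama.

# Map curl's exit status, the HTTP status and the body size to a verdict.
classify_probe() {
  rc=$1; code=$2; bytes=$3
  if [ "$rc" -eq 28 ]; then
    echo "HANG"    # curl exit 28: the -m time limit was reached
  elif [ "$rc" -eq 7 ]; then
    echo "DOWN"    # curl exit 7: could not connect to the server
  elif [ "$rc" -eq 0 ] && [ "$code" = "200" ] && [ "$bytes" -gt 0 ]; then
    echo "OK"
  else
    echo "ERROR"   # anything else: non-200 status, empty body, other curl failure
  fi
}

# probe PATH [JSON_BODY]: GET when no body is given, POST otherwise.
probe() {
  url="${OLLAMA_URL:-http://localhost:11434}$1"
  fmt='%{http_code} %{size_download}'
  if [ -n "${2-}" ]; then
    out=$(curl -s -o /dev/null -m "${PROBE_TIMEOUT:-60}" -w "$fmt" \
      -H 'Content-Type: application/json' -d "$2" "$url")
    rc=$?
  else
    out=$(curl -s -o /dev/null -m "${PROBE_TIMEOUT:-60}" -w "$fmt" "$url")
    rc=$?
  fi
  code=${out% *}; bytes=${out#* }
  printf '%-22s %s\n' "$1" "$(classify_probe "$rc" "$code" "${bytes:-0}")"
}
```

Calling `probe /api/version` followed by `probe /api/generate '{"model":"gemma4:e2b","prompt":"Say hello","stream":false}'` should, if the behavior described in this report holds, print OK for the first and HANG for the second.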

Models tested (all fail)

gemma4:e2b (7.2GB)
gemma4:26b (18GB)
qwen3-vl:8b (6.1GB)
qwen3.5:9b (6.6GB)

Reproduction

# Server running with:
# OLLAMA_HOST=0.0.0.0:11434
# OLLAMA_FLASH_ATTENTION=1

# This hangs forever (or until timeout):
curl -m 60 http://localhost:11434/api/generate \
  -d '{"model":"gemma4:e2b","prompt":"Say hello","stream":false}'
# Returns: empty (0 bytes after 60s)

# This also hangs:
curl -m 60 http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma4:e2b","messages":[{"role":"user","content":"Say hello"}],"max_tokens":10}'
# Returns: empty (0 bytes after 60s)

# But this works instantly:
curl http://localhost:11434/api/embeddings \
  -d '{"model":"nomic-embed-text","prompt":"test"}'
# Returns: 768-dim embedding vector

Server logs during the hang

# Error log shows model loads successfully:
level=INFO source=server.go:1390 msg="llama runner started in 1.62 seconds"

# But the runner process consumes 200-380% CPU indefinitely:
vsw  64149 384.6 28.8 449640000 9672752 ?? R /opt/homebrew/Cellar/ollama/0.20.0/bin/ollama runner ...

# Serve log shows the request eventually times out:
[GIN] 2026/04/02 - 21:46:51 | 499 | 3m16s | ::1 | POST "/v1/chat/completions"
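
To spot the same symptoms on another machine, the two signals above (a 499 in the access log and a runner pinned at high CPU) can be filtered out mechanically. This is a hedged sketch: `gin_client_aborts` and `busy_runners` are hypothetical helper names, the log format is assumed to match the single GIN line quoted above, and the process line is assumed to be `ps aux` style as in the excerpt.

```sh
# List client-aborted requests (HTTP 499) from GIN access-log lines.
# Reads files given as arguments, or stdin when none are given.
# Prints: status, latency, and the method plus path.
gin_client_aborts() {
  awk -F'|' '/^\[GIN\]/ {
    # Trim the padding around each pipe-separated field.
    for (i = 2; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
    if ($2 == "499") print $2, $3, $5
  }' "$@"
}

# From `ps aux` style input on stdin, print PID and %CPU of "ollama runner"
# processes above a CPU threshold (default 100, i.e. more than one full core).
busy_runners() {
  awk -v min="${1:-100}" '/ollama runner/ && $3 + 0 > min { print $2, $3 }'
}
```

Typical use would be `gin_client_aborts` on a saved copy of the serve log and `ps aux | busy_runners` while a request is hanging; where the serve log lives depends on how Ollama was installed, so no path is assumed here.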

Troubleshooting performed

Stripped all env vars (removed OLLAMA_FLASH_ATTENTION, OLLAMA_KV_CACHE_TYPE, OLLAMA_NUM_PARALLEL) — same behavior
Tested with bare OLLAMA_HOST=0.0.0.0:11434 only — same behavior
Tested locally (127.0.0.1) and over LAN (10.0.3.161) — same behavior
Tested both Homebrew bottle and official darwin tgz — same behavior
Re-pulled models (/api/pull) — models pull successfully, still can't generate
Tested streaming mode (stream:true) — also hangs, no output
Confirmed non-multimodal models (qwen3.5:9b) also hang — not Gemma-specific

Working configuration

Reverted to Ollama 0.19.0 (ollama-darwin.tgz from GitHub releases v0.19.0). All native endpoints work. /v1/chat/completions works. Tool calling works. However, 0.19.0 does not support Gemma 4 models (500 error on load).

Key observation

0.20.0-rc1 worked perfectly on the same hardware with the same models and same configuration. The rc1 was installed via the direct install script with OLLAMA_VERSION=0.20.0-rc1 and ran Gemma 4 models with native /api/chat, /api/generate, tool calling, and even /v1/chat/completions (though /v1 was slower). The regression was therefore introduced between rc1 and the final 0.20.0 build.

Impact

This blocks usage of Gemma 4 models on Apple Silicon M4, since:

0.19.0 doesn't support Gemma 4
0.20.0 can't generate any output

Comparison: M1 Mac Mini works fine

On a Mac Mini M1 (16GB) running Ollama 0.18.2, all endpoints including /v1/chat/completions work correctly with llama3.1:8b. This issue appears specific to the 0.20.0 GA build on Apple Silicon. It is confirmed on M4 and untested on M1 with 0.20.0.

Relevant log output

OS: macOS
GPU: Apple M4
CPU: Apple M4
Ollama version: 0.20.0

extent analysis

TL;DR

The most likely workaround is to revert to Ollama 0.20.0-rc1, which worked on the same hardware with the same models and configuration.

Guidance

Revert to Ollama version 0.20.0-rc1 by installing it via the direct install script with OLLAMA_VERSION=0.20.0-rc1.
Verify that the native endpoints (/api/chat, /api/generate) and /v1/chat/completions are working as expected with the reverted version.
If reverting to 0.20.0-rc1 is not feasible, consider testing Ollama 0.20.0 on a different hardware configuration, such as a Mac Mini M1, to determine if the issue is specific to the Apple Silicon M4.
Investigate the changes made between 0.20.0-rc1 and the final 0.20.0 build to identify the potential cause of the regression.
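
Before and after switching versions, it helps to confirm which build is actually serving requests. The sketch below assumes the server answers GET /api/version with a JSON object containing a "version" string. The helper names (`parse_version`, `is_affected_version`, `check_server`) and the `OLLAMA_URL` variable are illustrative, and "affected" here means only the exact 0.20.0 GA version string described in this report.

```sh
# Extract the version string from the JSON returned by GET /api/version,
# e.g. {"version":"0.20.0"}. Uses sed so no JSON tool is required.
parse_version() {
  sed -n 's/.*"version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Exit 0 when the given version is the build this report describes as broken.
is_affected_version() {
  [ "$1" = "0.20.0" ]
}

# Hypothetical wrapper: ask the running server which build it is.
# Returns 2 when there is no usable answer, 1 for the affected build, 0 otherwise.
check_server() {
  v=$(curl -s -m 5 "${OLLAMA_URL:-http://localhost:11434}/api/version" | parse_version)
  if [ -z "$v" ]; then
    echo "no version reported by server"
    return 2
  fi
  if is_affected_version "$v"; then
    echo "$v: affected build"
    return 1
  fi
  echo "$v: not the affected build"
}
```

Running `check_server` after a reinstall is a quick way to catch the case where an older binary (for example a Homebrew copy earlier on the PATH) is still the one serving requests.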

Example

No client-side code change addresses this: the issue is tied to a specific Ollama build on Apple Silicon M4, so the remedy is a version change rather than a patch to request code.

Notes

The issue appears to be specific to the 0.20.0 GA build on Apple Silicon M4, and reverting to 0.20.0-rc1 may be the most straightforward solution. However, this may not be a long-term fix, and further investigation is needed to resolve the issue in the 0.20.0 GA build.

Recommendation

Apply the workaround by reverting to Ollama version 0.20.0-rc1, as it has been confirmed to work on the same hardware with the same models and configuration. This will allow for temporary usage of Gemma 4 models on Apple Silicon M4 until a permanent fix is available.
