ollama - 💡(How to fix) Fix [0.20.5][macOS Apple Silicon] Runner crashes under sustained multi-turn tool-calling on /v1/chat/completions (72% crash rate across 7 models) [1 participants]

emcee777 · 2026-05-01T20:40:48Z

[ollama] What is the issue? The Ollama runner crashes reliably under sustained multi-turn tool-calling on 0.20.5 and other 0.20.x releases . The crash manifest… ## Fix / Workaround - Not OOM at the OS level (free memory is 50%+ during crashes). - Not context overflow — single-turn single-tool requests crash too, no growing history. - Not concurrency — `OLLAMA_NUM_PARALLEL=2` and a single in-flight request both reproduce. - Not specific to one quant — happens on Q4_K_M and Q5_K_M variants we tested. - Not specific to one model family — Gemma, Mistral, GLM, and Nemotron all reproduce. - The only mitigation we've found that meaningfully reduces crash frequency is restricting tool-using workloads to small models (gemma4:e4b, ~9.6 GB) — which suggests a per-runner state path that scales poorly with model size. ## What is the issue? The Ollama runner crashes reliably under sustained multi-turn tool-calling on **0.20.5** (and other 0.20.x releases). The crash manifests as one of three error signatures returned to the client mid-conversation: 1. `model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details` (HTTP 500) 2. `an error was encountered while running the model: unexpected EOF` (HTTP 500) 3. `connection refused` (the entire `ollama serve` process dies; subsequent requests fail until restart) In a 138-run benchmark suite covering 7 tool-using tasks across 7 models, **99 of 138 runs (72%) failed to complete**. Of the 19 that returned a structured error, the three signatures above account for all of them. The remaining 78 `running` rows are runs where the runner crashed mid-stream and the client harness timed out without ever receiving `[DONE]` (consistent with the same root cause). This is a regression from 0.19.x, narrowed by user reports against 0.20.0–0.20.5. It is closely related to but distinct from #14611 (which targets 0.17.5 and `/api/generate`): - This report is specifically about the **OpenAI-compatible `/v1/chat/completions` path with `tools`** and **multi-turn loops** (agent-style: assistant→tool_call→tool_result→assistant…). - Single-shot `/api/generate` and single-shot `/api/chat` (no tools) are far more stable on the same machine and the same model versions. Issue surfaces specifically when the harness drives sequential tool-call/tool-result turns. ### Reproduction (minimal) Full script: see [`repro.sh` in the supporting gist](https://gist.github.com/emcee777/7485a16ec04a86d173c9cdcf17fa3572#file-repro-sh). The essence: 1. Pull a tool-capable model: `ollama pull gemma4:31b` (also reproduces with `gemma4:26b`, `mistral-small3.2`, `glm-4.7-flash`, `nemotron-cascade-2`). 2. In a loop, POST to `/v1/chat/completions` with `tools: [...]` and `tool_choice: "auto"`, providing a single `read_file`-style tool. Each iteration is a fresh single-turn request — so this is not a context-blowup issue, it's *frequency*. 3. After 1–10 iterations the runner exits and one of the three error signatures appears. 4. Larger models (`gemma4:31b`, `mistral-small3.2`) often crash on the **first** tool-using request. Smaller models crash within ~5 iterations. `gemma4:e4b` (4B effective) is the only model in our matrix that survives sustained tool-using loops. ### Aggregate benchmark data ([gist](https://gist.github.com/emcee777/7485a16ec04a86d173c9cdcf17fa3572)) 138 runs, harness = simple agent loop driving `/v1/chat/completions` with one tool, 120 s curl timeout, single in-flight request: | Model | Completed | Total | Pass rate | |---|---|---|---| | gemma4:e4b | 14 | 24 | 58% | | gemma4:26b | 23 | 40 | 58% | | gemma4:31b | 0 | 16 | 0% | | mistral-small3.2 | 0 | 16 | 0% | | glm-4.7-flash | 1 | 32 | 3% | | nemotron-cascade-2 | 1 | 8 | 12% | Three sample crash records (one per error class) attached as JSON in the gist: - [`sample-runner-stopped-gemma4-31b.json`](https://gist.github.com/emcee777/7485a16ec04a86d173c9cdcf17fa3572#file-sample-runner-stopped-gemma4-31b-json) — error #1 - [`sample-unexpected-eof-glm.json`](https://gist.github.com/emcee777/7485a16ec04a86d173c9cdcf17fa3572#file-sample-unexpected-eof-glm-json) — error #2 - [`sample-connection-refused-gemma4-31b.json`](https://gist.github.com/emcee777/7485a16ec04a86d173c9cdcf17fa3572#file-sample-connection-refused-gemma4-31b-json) — error #3 ### Environment - **Ollama:** 0.20.5 (also reproduced on 0.20.6, 0.20.7; 0.21.x improves but still crashes on `qwen3-coder-next` via `/v1/messages+tools`) - **Server config (excerpt from server log):** `OLLAMA_FLASH_ATTENTION=true OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_CONTEXT_LENGTH=32768 OLLAMA_NUM_PARALLEL=2 OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_MAX_QUEUE=512 OLLAMA_KEEP_ALIVE=5m0s OLLAMA_LOAD_TIMEOUT=5m0s OLLAMA_NEW_ENGINE=false` - **OS:** macOS 15.5 (24F74) - **Hardware:** Apple M3 Max, 128 GB unified memory (96 GB recommendedMaxWorkingSetSize per Metal init) - **Memory at crash:** 58.7 GiB free (system), 95.5 GiB GPU available — runner is not OOM-pressured

ollama2026-05-01 20:40:48

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

ollama/ollama#15923•Fetched 2026-05-02 05:27:39

View on GitHub

Comments

Participants

Timeline

Reactions

Author

emcee777

Participants

emcee777

Error Message

The Ollama runner crashes reliably under sustained multi-turn tool-calling on 0.20.5 (and other 0.20.x releases). The crash manifests as one of three error signatures returned to the client mid-conversation:

model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details (HTTP 500)
an error was encountered while running the model: unexpected EOF (HTTP 500) In a 138-run benchmark suite covering 7 tool-using tasks across 7 models, 99 of 138 runs (72%) failed to complete. Of the 19 that returned a structured error, the three signatures above account for all of them. The remaining 78 running rows are runs where the runner crashed mid-stream and the client harness timed out without ever receiving [DONE] (consistent with the same root cause).
After 1–10 iterations the runner exits and one of the three error signatures appears. Three sample crash records (one per error class) attached as JSON in the gist:

sample-runner-stopped-gemma4-31b.json — error #1
sample-unexpected-eof-glm.json — error #2
sample-connection-refused-gemma4-31b.json — error #3 time=... level=ERROR source=server.go:1611 msg="post predict" error="Post "http://127.0.0.1:NNNNN/completion\": EOF" time=... level=ERROR source=server.go:303 msg="llama runner terminated" error="exit status 2" A tool-using agent loop should not be able to terminate the runner. Whatever path the runner takes through tool-call decoding (the harmony-format / function-call grammar, KV cache reuse across turns, or whatever piece of state survives between sequential /v1/chat/completions requests with tools) should not be capable of producing exit status 2 with no recoverable error.

Root Cause

In a 138-run benchmark suite covering 7 tool-using tasks across 7 models, 99 of 138 runs (72%) failed to complete. Of the 19 that returned a structured error, the three signatures above account for all of them. The remaining 78 running rows are runs where the runner crashed mid-stream and the client harness timed out without ever receiving [DONE] (consistent with the same root cause).

Fix Action

Fix / Workaround

Not OOM at the OS level (free memory is 50%+ during crashes).
Not context overflow — single-turn single-tool requests crash too, no growing history.
Not concurrency — OLLAMA_NUM_PARALLEL=2 and a single in-flight request both reproduce.
Not specific to one quant — happens on Q4_K_M and Q5_K_M variants we tested.
Not specific to one model family — Gemma, Mistral, GLM, and Nemotron all reproduce.
The only mitigation we've found that meaningfully reduces crash frequency is restricting tool-using workloads to small models (gemma4:e4b, ~9.6 GB) — which suggests a per-runner state path that scales poorly with model size.

Code Example

time=...  level=ERROR  source=server.go:1611  msg="post predict"  error="Post \"http://127.0.0.1:NNNNN/completion\": EOF"
[GIN] ... | 500 | ...s | 127.0.0.1 | POST "/v1/chat/completions"
time=... level=ERROR source=server.go:303 msg="llama runner terminated" error="exit status 2"

RAW_BUFFERClick to expand / collapse

What is the issue?

model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details (HTTP 500)
an error was encountered while running the model: unexpected EOF (HTTP 500)
connection refused (the entire ollama serve process dies; subsequent requests fail until restart)

This is a regression from 0.19.x, narrowed by user reports against 0.20.0–0.20.5. It is closely related to but distinct from #14611 (which targets 0.17.5 and /api/generate):

This report is specifically about the OpenAI-compatible /v1/chat/completions path with tools and multi-turn loops (agent-style: assistant→tool_call→tool_result→assistant…).
Single-shot /api/generate and single-shot /api/chat (no tools) are far more stable on the same machine and the same model versions. Issue surfaces specifically when the harness drives sequential tool-call/tool-result turns.

Reproduction (minimal)

Full script: see repro.sh in the supporting gist. The essence:

Pull a tool-capable model: ollama pull gemma4:31b (also reproduces with gemma4:26b, mistral-small3.2, glm-4.7-flash, nemotron-cascade-2).
In a loop, POST to /v1/chat/completions with tools: [...] and tool_choice: "auto", providing a single read_file-style tool. Each iteration is a fresh single-turn request — so this is not a context-blowup issue, it's frequency.
After 1–10 iterations the runner exits and one of the three error signatures appears.
Larger models (gemma4:31b, mistral-small3.2) often crash on the first tool-using request. Smaller models crash within ~5 iterations. gemma4:e4b (4B effective) is the only model in our matrix that survives sustained tool-using loops.

Aggregate benchmark data (gist)

138 runs, harness = simple agent loop driving /v1/chat/completions with one tool, 120 s curl timeout, single in-flight request:

Model	Completed	Total	Pass rate
gemma4:e4b	14	24	58%
gemma4:26b	23	40	58%
gemma4:31b	0	16	0%
mistral-small3.2	0	16	0%
glm-4.7-flash	1	32	3%
nemotron-cascade-2	1	8	12%

Three sample crash records (one per error class) attached as JSON in the gist:

sample-runner-stopped-gemma4-31b.json — error #1
sample-unexpected-eof-glm.json — error #2
sample-connection-refused-gemma4-31b.json — error #3

Environment

Ollama: 0.20.5 (also reproduced on 0.20.6, 0.20.7; 0.21.x improves but still crashes on qwen3-coder-next via /v1/messages+tools)
Server config (excerpt from server log): OLLAMA_FLASH_ATTENTION=true OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_CONTEXT_LENGTH=32768 OLLAMA_NUM_PARALLEL=2 OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_MAX_QUEUE=512 OLLAMA_KEEP_ALIVE=5m0s OLLAMA_LOAD_TIMEOUT=5m0s OLLAMA_NEW_ENGINE=false
OS: macOS 15.5 (24F74)
Hardware: Apple M3 Max, 128 GB unified memory (96 GB recommendedMaxWorkingSetSize per Metal init)
Memory at crash: 58.7 GiB free (system), 95.5 GiB GPU available — runner is not OOM-pressured at the OS level, yet still terminates.

Server-side log fragments

From the runner subprocess that backs /v1/chat/completions during a typical crash window (preserved from the related Apr 2026 server log; the original 0.20.5 logs were rotated, but the same termination pattern persists on 0.21.2):

time=...  level=ERROR  source=server.go:1611  msg="post predict"  error="Post \"http://127.0.0.1:NNNNN/completion\": EOF"
[GIN] ... | 500 | ...s | 127.0.0.1 | POST "/v1/chat/completions"
time=... level=ERROR source=server.go:303 msg="llama runner terminated" error="exit status 2"

This is identical in shape to what #14611 reports on 0.17.5 /api/generate, suggesting the runner's exit-status-2 path is a long-standing hot edge that the multi-turn tool-call protocol now exercises far more frequently than single-shot generation did.

Expected behavior

A tool-using agent loop should not be able to terminate the runner. Whatever path the runner takes through tool-call decoding (the harmony-format / function-call grammar, KV cache reuse across turns, or whatever piece of state survives between sequential /v1/chat/completions requests with tools) should not be capable of producing exit status 2 with no recoverable error.

What we've ruled out / additional notes

Not OOM at the OS level (free memory is 50%+ during crashes).
Not context overflow — single-turn single-tool requests crash too, no growing history.
Not concurrency — OLLAMA_NUM_PARALLEL=2 and a single in-flight request both reproduce.
Not specific to one quant — happens on Q4_K_M and Q5_K_M variants we tested.
Not specific to one model family — Gemma, Mistral, GLM, and Nemotron all reproduce.
The only mitigation we've found that meaningfully reduces crash frequency is restricting tool-using workloads to small models (gemma4:e4b, ~9.6 GB) — which suggests a per-runner state path that scales poorly with model size.

Happy to provide additional logs, run a test build, or tighten the reproducer if helpful. Thank you for the work on Ollama — the dual-endpoint architecture is genuinely useful and we'd love to keep building on it.

extent analysis

TL;DR

The Ollama runner crashes under sustained multi-turn tool-calling on version 0.20.5, likely due to a resource management issue or internal error, and a potential workaround is to restrict tool-using workloads to small models.

Guidance

Review the server configuration, particularly OLLAMA_MAX_LOADED_MODELS, OLLAMA_MAX_QUEUE, and OLLAMA_KEEP_ALIVE, to ensure they are adequately set for the workload.
Consider reducing the model size or complexity to mitigate the crash frequency, as seen with the gemma4:e4b model.
Investigate the server-side log fragments to understand the error patterns and potential resource constraints leading to the runner termination.
Test the reproducer with different model families and quantization variants to confirm the issue is not specific to one model or quantization.

Example

No code snippet is provided as the issue is more related to configuration and resource management.

Notes

The issue seems to be related to the multi-turn tool-call protocol and the runner's ability to manage resources, particularly with larger models. The fact that smaller models like gemma4:e4b are more stable suggests a potential scaling issue.

Recommendation

Apply a workaround by restricting tool-using workloads to small models, such as gemma4:e4b, until a more permanent fix is available. This is because the issue is likely related to resource management and model size, and using smaller models has been shown to reduce crash frequency.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

FAQ

Expected behavior

#api #request error #file not found #serialization error #model compatibility

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

ollama - 💡(How to fix) Fix [0.20.5][macOS Apple Silicon] Runner crashes under sustained multi-turn tool-calling on /v1/chat/completions (72% crash rate across 7 models) [1 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fix / Workaround

Code Example

What is the issue?

Reproduction (minimal)

Aggregate benchmark data (gist)

Environment

Server-side log fragments

Expected behavior

What we've ruled out / additional notes

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

FAQ

Expected behavior

Still need to ship something?

TRENDING

ollama - 💡(How to fix) Fix [0.20.5][macOS Apple Silicon] Runner crashes under sustained multi-turn tool-calling on /v1/chat/completions (72% crash rate across 7 models) [1 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fix / Workaround

Code Example

What is the issue?

Reproduction (minimal)

Aggregate benchmark data (gist)

Environment

Server-side log fragments

Expected behavior

What we've ruled out / additional notes

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

FAQ

Expected behavior

Still need to ship something?

RELATED_DISCOVERY

TRENDING