ollama - 💡(How to fix) Fix [0.20.5][macOS Apple Silicon] Runner crashes under sustained multi-turn tool-calling on /v1/chat/completions (72% crash rate across 7 models) [1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
ollama/ollama#15923Fetched 2026-05-02 05:27:39
View on GitHub
Comments
0
Participants
1
Timeline
0
Reactions
0
Author
Participants

Error Message

The Ollama runner crashes reliably under sustained multi-turn tool-calling on 0.20.5 (and other 0.20.x releases). The crash manifests as one of three error signatures returned to the client mid-conversation:

  1. model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details (HTTP 500)
  2. an error was encountered while running the model: unexpected EOF (HTTP 500) In a 138-run benchmark suite covering 7 tool-using tasks across 7 models, 99 of 138 runs (72%) failed to complete. Of the 19 that returned a structured error, the three signatures above account for all of them. The remaining 78 running rows are runs where the runner crashed mid-stream and the client harness timed out without ever receiving [DONE] (consistent with the same root cause).
  3. After 1–10 iterations the runner exits and one of the three error signatures appears. Three sample crash records (one per error class) attached as JSON in the gist:
  • sample-runner-stopped-gemma4-31b.json — error #1
  • sample-unexpected-eof-glm.json — error #2
  • sample-connection-refused-gemma4-31b.json — error #3 time=... level=ERROR source=server.go:1611 msg="post predict" error="Post "http://127.0.0.1:NNNNN/completion\": EOF" time=... level=ERROR source=server.go:303 msg="llama runner terminated" error="exit status 2" A tool-using agent loop should not be able to terminate the runner. Whatever path the runner takes through tool-call decoding (the harmony-format / function-call grammar, KV cache reuse across turns, or whatever piece of state survives between sequential /v1/chat/completions requests with tools) should not be capable of producing exit status 2 with no recoverable error.

Root Cause

In a 138-run benchmark suite covering 7 tool-using tasks across 7 models, 99 of 138 runs (72%) failed to complete. Of the 19 that returned a structured error, the three signatures above account for all of them. The remaining 78 running rows are runs where the runner crashed mid-stream and the client harness timed out without ever receiving [DONE] (consistent with the same root cause).

Fix Action

Fix / Workaround

  • Not OOM at the OS level (free memory is 50%+ during crashes).
  • Not context overflow — single-turn single-tool requests crash too, no growing history.
  • Not concurrency — OLLAMA_NUM_PARALLEL=2 and a single in-flight request both reproduce.
  • Not specific to one quant — happens on Q4_K_M and Q5_K_M variants we tested.
  • Not specific to one model family — Gemma, Mistral, GLM, and Nemotron all reproduce.
  • The only mitigation we've found that meaningfully reduces crash frequency is restricting tool-using workloads to small models (gemma4:e4b, ~9.6 GB) — which suggests a per-runner state path that scales poorly with model size.

Code Example

time=...  level=ERROR  source=server.go:1611  msg="post predict"  error="Post \"http://127.0.0.1:NNNNN/completion\": EOF"
[GIN] ... | 500 | ...s | 127.0.0.1 | POST "/v1/chat/completions"
time=... level=ERROR source=server.go:303 msg="llama runner terminated" error="exit status 2"
RAW_BUFFERClick to expand / collapse

What is the issue?

The Ollama runner crashes reliably under sustained multi-turn tool-calling on 0.20.5 (and other 0.20.x releases). The crash manifests as one of three error signatures returned to the client mid-conversation:

  1. model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details (HTTP 500)
  2. an error was encountered while running the model: unexpected EOF (HTTP 500)
  3. connection refused (the entire ollama serve process dies; subsequent requests fail until restart)

In a 138-run benchmark suite covering 7 tool-using tasks across 7 models, 99 of 138 runs (72%) failed to complete. Of the 19 that returned a structured error, the three signatures above account for all of them. The remaining 78 running rows are runs where the runner crashed mid-stream and the client harness timed out without ever receiving [DONE] (consistent with the same root cause).

This is a regression from 0.19.x, narrowed by user reports against 0.20.0–0.20.5. It is closely related to but distinct from #14611 (which targets 0.17.5 and /api/generate):

  • This report is specifically about the OpenAI-compatible /v1/chat/completions path with tools and multi-turn loops (agent-style: assistant→tool_call→tool_result→assistant…).
  • Single-shot /api/generate and single-shot /api/chat (no tools) are far more stable on the same machine and the same model versions. Issue surfaces specifically when the harness drives sequential tool-call/tool-result turns.

Reproduction (minimal)

Full script: see repro.sh in the supporting gist. The essence:

  1. Pull a tool-capable model: ollama pull gemma4:31b (also reproduces with gemma4:26b, mistral-small3.2, glm-4.7-flash, nemotron-cascade-2).
  2. In a loop, POST to /v1/chat/completions with tools: [...] and tool_choice: "auto", providing a single read_file-style tool. Each iteration is a fresh single-turn request — so this is not a context-blowup issue, it's frequency.
  3. After 1–10 iterations the runner exits and one of the three error signatures appears.
  4. Larger models (gemma4:31b, mistral-small3.2) often crash on the first tool-using request. Smaller models crash within ~5 iterations. gemma4:e4b (4B effective) is the only model in our matrix that survives sustained tool-using loops.

Aggregate benchmark data (gist)

138 runs, harness = simple agent loop driving /v1/chat/completions with one tool, 120 s curl timeout, single in-flight request:

ModelCompletedTotalPass rate
gemma4:e4b142458%
gemma4:26b234058%
gemma4:31b0160%
mistral-small3.20160%
glm-4.7-flash1323%
nemotron-cascade-21812%

Three sample crash records (one per error class) attached as JSON in the gist:

Environment

  • Ollama: 0.20.5 (also reproduced on 0.20.6, 0.20.7; 0.21.x improves but still crashes on qwen3-coder-next via /v1/messages+tools)
  • Server config (excerpt from server log): OLLAMA_FLASH_ATTENTION=true OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_CONTEXT_LENGTH=32768 OLLAMA_NUM_PARALLEL=2 OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_MAX_QUEUE=512 OLLAMA_KEEP_ALIVE=5m0s OLLAMA_LOAD_TIMEOUT=5m0s OLLAMA_NEW_ENGINE=false
  • OS: macOS 15.5 (24F74)
  • Hardware: Apple M3 Max, 128 GB unified memory (96 GB recommendedMaxWorkingSetSize per Metal init)
  • Memory at crash: 58.7 GiB free (system), 95.5 GiB GPU available — runner is not OOM-pressured at the OS level, yet still terminates.

Server-side log fragments

From the runner subprocess that backs /v1/chat/completions during a typical crash window (preserved from the related Apr 2026 server log; the original 0.20.5 logs were rotated, but the same termination pattern persists on 0.21.2):

time=...  level=ERROR  source=server.go:1611  msg="post predict"  error="Post \"http://127.0.0.1:NNNNN/completion\": EOF"
[GIN] ... | 500 | ...s | 127.0.0.1 | POST "/v1/chat/completions"
time=... level=ERROR source=server.go:303 msg="llama runner terminated" error="exit status 2"

This is identical in shape to what #14611 reports on 0.17.5 /api/generate, suggesting the runner's exit-status-2 path is a long-standing hot edge that the multi-turn tool-call protocol now exercises far more frequently than single-shot generation did.

Expected behavior

A tool-using agent loop should not be able to terminate the runner. Whatever path the runner takes through tool-call decoding (the harmony-format / function-call grammar, KV cache reuse across turns, or whatever piece of state survives between sequential /v1/chat/completions requests with tools) should not be capable of producing exit status 2 with no recoverable error.

What we've ruled out / additional notes

  • Not OOM at the OS level (free memory is 50%+ during crashes).
  • Not context overflow — single-turn single-tool requests crash too, no growing history.
  • Not concurrency — OLLAMA_NUM_PARALLEL=2 and a single in-flight request both reproduce.
  • Not specific to one quant — happens on Q4_K_M and Q5_K_M variants we tested.
  • Not specific to one model family — Gemma, Mistral, GLM, and Nemotron all reproduce.
  • The only mitigation we've found that meaningfully reduces crash frequency is restricting tool-using workloads to small models (gemma4:e4b, ~9.6 GB) — which suggests a per-runner state path that scales poorly with model size.

Happy to provide additional logs, run a test build, or tighten the reproducer if helpful. Thank you for the work on Ollama — the dual-endpoint architecture is genuinely useful and we'd love to keep building on it.

extent analysis

TL;DR

The Ollama runner crashes under sustained multi-turn tool-calling on version 0.20.5, likely due to a resource management issue or internal error, and a potential workaround is to restrict tool-using workloads to small models.

Guidance

  • Review the server configuration, particularly OLLAMA_MAX_LOADED_MODELS, OLLAMA_MAX_QUEUE, and OLLAMA_KEEP_ALIVE, to ensure they are adequately set for the workload.
  • Consider reducing the model size or complexity to mitigate the crash frequency, as seen with the gemma4:e4b model.
  • Investigate the server-side log fragments to understand the error patterns and potential resource constraints leading to the runner termination.
  • Test the reproducer with different model families and quantization variants to confirm the issue is not specific to one model or quantization.

Example

No code snippet is provided as the issue is more related to configuration and resource management.

Notes

The issue seems to be related to the multi-turn tool-call protocol and the runner's ability to manage resources, particularly with larger models. The fact that smaller models like gemma4:e4b are more stable suggests a potential scaling issue.

Recommendation

Apply a workaround by restricting tool-using workloads to small models, such as gemma4:e4b, until a more permanent fix is available. This is because the issue is likely related to resource management and model size, and using smaller models has been shown to reduce crash frequency.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

FAQ

Expected behavior

A tool-using agent loop should not be able to terminate the runner. Whatever path the runner takes through tool-call decoding (the harmony-format / function-call grammar, KV cache reuse across turns, or whatever piece of state survives between sequential /v1/chat/completions requests with tools) should not be capable of producing exit status 2 with no recoverable error.

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

ollama - 💡(How to fix) Fix [0.20.5][macOS Apple Silicon] Runner crashes under sustained multi-turn tool-calling on /v1/chat/completions (72% crash rate across 7 models) [1 participants]