ollama: [0.20.5] Runner accepts TCP connection but request never reaches work loop; same shape as resolved-in-0.20.2 #15258

ollama/ollama#15950 (fetched 2026-05-04 04:58:22)

Summary

We hit the same shape as #15258 on Ollama 0.20.5, despite that issue being closed as resolved in 0.20.2 by its reporter. The pattern: large models pinned in memory for hours, then /api/generate hangs indefinitely with zero bytes returned, while the lightweight endpoints we checked (/api/version, /api/tags, /api/ps) keep responding normally. /api/embeddings was not tested during the incident; per #15258 it typically keeps working too.

Forensic analysis (stack samples + lsof) shows this is not a daemon-scheduler issue and not a deadlock primitive — it's a runner-side request-receive failure: the daemon dispatches correctly (TCP ESTABLISHED conn to runner), the runner accepts the connection at the kernel level, but the request payload never reaches the runner's main work loop. The main thread stays in its idle wait-for-work _pthread_cond_wait instead of waking to process the inbound request.

Filing this as a probable regression after #15258's 0.20.2 fix (or an adjacent variant the original fix didn't cover). May be related to but distinct from #15923 (which is multi-turn tool-calling crashes — error-emitting, not silent-hang) and #15350 (Gemma 4 Flash Attention hang).

Environment

  • Hardware: Mac Studio M3 Ultra, 256 GB unified memory (Mac15,14, 28 physical cores, hw.memsize=274877906944)
  • OS: macOS 26.4.1 (25E253)
  • Ollama: 0.20.5 (homebrew /opt/homebrew/opt/ollama/bin/ollama serve, supervised by com.ollama.serve LaunchAgent)
  • Env vars (per launchctl print):
    • OLLAMA_NUM_PARALLEL=4
    • OLLAMA_MAX_LOADED_MODELS=4
    • OLLAMA_FLASH_ATTENTION=1
    • OLLAMA_KV_CACHE_TYPE=q4_0
    • OLLAMA_KEEP_ALIVE=5m (daemon default — but see note below)
  • Effective keep-alive: A cron job (*/2 * * * *) actively pings each loaded model with keep_alive: -1 to pin them indefinitely. So at incident time, qwen3.5:122b-32k + glm-4.7-flash + gemma4 had been resident for ~3 hours of pinned idle.
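
The pinning described in the last bullet can be sketched as a small script for the cron job to call. This is an illustration, not the reporter's actual job: the function names and the script path are assumptions, and the URL assumes Ollama's default port 11434. The request shape (a /api/generate call with no prompt and keep_alive: -1) is Ollama's documented way to keep a model loaded.

```shell
#!/bin/sh
# Hypothetical keep-alive pinger (not the reporter's actual cron job).
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"   # Ollama's default address

# Build the JSON body that asks Ollama to keep a model loaded indefinitely.
# A /api/generate request without a prompt loads (or refreshes) the model.
pin_payload() {
    printf '{"model":"%s","keep_alive":-1}' "$1"
}

# Send one pin request. --max-time keeps the cron job itself from hanging
# if the runner behind the model is stuck.
pin_model() {
    curl -s -o /dev/null --max-time 30 \
        "$OLLAMA_URL/api/generate" -d "$(pin_payload "$1")"
}

# Example crontab entry (script path is hypothetical):
#   */2 * * * * /usr/local/bin/ollama-pin.sh
# and, inside that script:
#   for m in qwen3.5:122b-32k glm-4.7-flash:latest gemma4:latest; do
#       pin_model "$m"
#   done
```

Note that such a pinger goes through the same /api/generate path that hangs, so its own exit status (curl returns non-zero on timeout) is an early signal worth logging.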

Affected models

  • qwen3.5:122b-32k (84 GB resident, MoE) — STUCK
  • glm-4.7-flash:latest (45 GB resident, MoE) — STUCK
  • gemma4:latest (15 GB resident, dense) — UNAFFECTED, kept responding cleanly to direct /api/generate throughout the incident

The smaller dense model survived. Both larger MoE models hung. Pattern correlates with model size and/or MoE arch, but I don't have enough samples to claim either with confidence.

Symptom

After ~3 hours of pinned-idle state:

  • /api/version → 200, immediate
  • /api/tags → 200, lists all models
  • /api/ps → 200, lists qwen3.5 + glm + gemma4 as loaded with expires_at: "2318-08-12..." (the keep-alive=-1 sentinel)
  • /api/embeddings (untested at incident, but per #15258 these typically work)
  • /api/generate for qwen3.5:122b-32k → hangs indefinitely (60s curl timeout, then HTTP 000)
  • /api/generate for glm-4.7-flash:latest → same hang
  • /api/generate for gemma4:latest → 200 in 3-4s, normal output
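
The endpoint checks listed above can be repeated with a small probe. A minimal sketch, assuming the daemon listens on Ollama's default http://localhost:11434; the function names and the 10 s GET timeout are illustrative, and the 60 s generate timeout mirrors the curl timeout mentioned in the list. curl's %{http_code} write-out prints 000 when no HTTP response arrives, which is the "HTTP 000" reported above.

```shell
#!/bin/sh
# Hypothetical endpoint probe; names and timeouts are illustrative.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Print the HTTP status of a GET request to the given path.
# curl prints 000 when no response arrives before the timeout.
probe_get() {
    curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$OLLAMA_URL$1"
}

# Print the HTTP status of a minimal non-streaming /api/generate call
# for the given model.
probe_generate() {
    curl -s -o /dev/null -w '%{http_code}' --max-time 60 \
        "$OLLAMA_URL/api/generate" \
        -d "{\"model\":\"$1\",\"prompt\":\"ping\",\"stream\":false}"
}

# Map a status code to a short verdict.
classify_code() {
    case "$1" in
        200) echo ok ;;
        000) echo no-response ;;
        *)   echo "http-$1" ;;
    esac
}

# Usage on the affected host:
#   for ep in /api/version /api/tags /api/ps; do
#       echo "$ep: $(classify_code "$(probe_get "$ep")")"
#   done
#   echo "generate: $(classify_code "$(probe_generate qwen3.5:122b-32k)")"
```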

This matches #15258's symptom except in two respects: (a) we are on 0.20.5, not 0.20.0; (b) only the two larger models, both MoE, exhibited it, while gemma4 was fine.

Forensic smoking gun

I captured sample <pid> 5 (5-sec stack profile) for both stuck runners while in the hung state:

Both runners' main threads:

Thread_NNNNNNNN   DispatchQueue_1: com.apple.main-thread  (serial)
  + N ???  (in <unknown binary>)  [...]
  +   N ???  (in ollama)  load address 0x10XXXXXXX + 0x8b5d4
  +     N ???  (in ollama)  load address 0x10XXXXXXX + 0x8c7d8
  +       N _pthread_cond_wait  (in libsystem_pthread.dylib) + 980
  +         N __psynch_cvwait   (in libsystem_kernel.dylib) + 8

The two relative offsets (+0x8b5d4 and +0x8c7d8) are identical across both stuck runners despite different load addresses. This is the runner's normal idle "wait for work" loop. Not a deadlock primitive — these threads are correctly sleeping waiting for work to be enqueued.
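
The offset comparison can be automated across two sample captures. A sketch under assumptions: the helper names are hypothetical, and the parsing assumes the layout shown in the excerpt above, where a thread header contains "Thread_" and each frame inside the ollama binary ends with its offset as the last field. Only the main thread's frames are compared.

```shell
#!/bin/sh
# Print the offsets of main-thread frames inside the ollama binary,
# in stack order. Reads `sample` output on stdin.
ollama_offsets() {
    awk '
        # A thread header starts a new section; remember whether it is
        # the main thread.
        /Thread_/ { in_main = /com\.apple\.main-thread/ }
        # Inside the main thread, the offset is the last field of each
        # "(in ollama)" frame line.
        in_main && /\(in ollama\)/ { print $NF }
    '
}

# Succeed if two sample files show the same main-thread offset sequence,
# regardless of each process's load address.
same_idle_path() {
    [ "$(ollama_offsets < "$1")" = "$(ollama_offsets < "$2")" ]
}

# Usage (file names follow the artifact naming used in this report):
#   same_idle_path sample-57826-qwen.txt sample-58100-glm.txt && echo identical
```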

lsof comparison at incident time:

Runner                         TCP state
glm-4.7-flash (PID 58100)      LISTEN only on its port
qwen3.5:122b-32k (PID 57826)   LISTEN + 1 ESTABLISHED (localhost:64107->localhost:65255)

The qwen3.5 ESTABLISHED connection is the proof: the daemon DID dispatch a request to this runner (TCP handshake completed). The runner accepted at the kernel level (FD 5 is in ESTABLISHED state). But the runner's main work loop never woke to process it — the request payload was received into kernel buffers but never reached the work queue or signaled the cond_var.
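
The lsof comparison above can be reduced to a per-state socket count for each runner. A minimal sketch; the helper name is hypothetical, and it relies on lsof printing the TCP state as the last field of each socket line, for example "(LISTEN)" or "(ESTABLISHED)".

```shell
#!/bin/sh
# Count TCP sockets in a given state from `lsof` output on stdin.
# The state is matched against the last field, e.g. "(ESTABLISHED)".
count_tcp_state() {
    awk -v want="($1)" '$NF == want { n++ } END { print n + 0 }'
}

# Usage on the affected host (PID is a placeholder for a runner's pid):
#   lsof -nP -a -p "$PID" -iTCP | count_tcp_state ESTABLISHED
# -n and -P skip host and port name lookups; -a ANDs the -p and -i filters.
```

A stuck runner with a non-zero ESTABLISHED count matches the qwen3.5 case; a count of zero matches the glm-4.7-flash case.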

Possible failure modes:

  1. The runner's HTTP-receive thread is alive but isn't notifying the work loop
  2. The runner's HTTP-receive thread is itself stuck (would also show in stack samples — and it does, ~33 of 38 threads in cond_wait state, consistent with all worker threads idle)
  3. A lost wake signal between the receive path and the main loop

Either way, the runner has the daemon's request bytes sitting in its socket buffer (or already read into userland) but is failing to act on them.

Mitigation that worked

Full daemon bootout + bootstrap:

launchctl bootout "gui/$(id -u)/com.ollama.serve"
sleep 3
launchctl bootstrap "gui/$(id -u)" ~/Library/LaunchAgents/com.ollama.serve.plist

After respawn, all three models cold-loaded on first request and responded normally.
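
Until the underlying bug is fixed, the restart could be driven by a watchdog. This is a hedged sketch, not something the report describes running: the function names, the 60 s threshold, and the default URL are assumptions, and only the launchctl sequence is taken from above.

```shell
#!/bin/sh
# Hypothetical watchdog: restart the LaunchAgent when /api/generate
# stops answering.
LABEL="com.ollama.serve"
PLIST="$HOME/Library/LaunchAgents/$LABEL.plist"

# Decide whether a probe result warrants a restart. Only a missing
# response (curl's 000) counts; HTTP errors are left for a human.
should_restart() {
    [ "$1" = "000" ]
}

# Same bootout/bootstrap sequence as the mitigation above.
restart_daemon() {
    launchctl bootout "gui/$(id -u)/$LABEL"
    sleep 3
    launchctl bootstrap "gui/$(id -u)" "$PLIST"
}

# Probe one model and restart the daemon if it does not answer in 60 s.
watch_model() {
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 60 \
        "${OLLAMA_URL:-http://localhost:11434}/api/generate" \
        -d "{\"model\":\"$1\",\"prompt\":\"ping\",\"stream\":false}")
    if should_restart "$code"; then
        restart_daemon
    fi
}

# Usage: watch_model qwen3.5:122b-32k
```

Two caveats on the design: a cold load of a large model may legitimately exceed 60 s, and 000 is also what curl reports when the daemon is not running at all, so the threshold and the restart decision would need tuning before unattended use.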

Mitigation that did NOT work

kill -9 on the stuck runner PIDs alone. After SIGKILL:

  • Runner processes terminated correctly (verified via ps -p)
  • /api/ps continued listing the killed runners' models as loaded with expires_at in the future — stale daemon scheduler state
  • New /api/generate requests for those models hung waiting for the (now-dead) runner

The daemon's runner-tracking logic doesn't notice runner death until something else times out or cleans up, which apparently doesn't happen on its own. Only a daemon restart cleared the scheduler's fictional "loaded" entries.

This bears on the bug location: the daemon's runner-lifecycle bookkeeping is also fragile when runners die in this state, but that's a secondary concern — the primary bug is the runner-side request-receive failure.

Reproduction (untested)

I did NOT attempt reproduction: the test conditions would deliberately hang another runner and disrupt our production fleet. If you want me to attempt reproduction, please indicate what conditions to vary (KEEP_ALIVE pinning duration, model size, NUM_PARALLEL boundary, FLASH_ATTENTION on/off, etc.) and I'll set up a controlled run.

Forensic capture artifacts

Available on request:

  • sample-<pid>-{qwen,glm}.txt: 5-sec stack profiles per stuck runner (~115 KB each, 38 threads each)
  • lsof-<pid>-{qwen,glm}.txt: full FD inventory at stuck-time
  • api-ps-pre-kill.json: daemon's view of stuck state
  • system-memory.txt: vm_stat / sysctl vm.swapusage / top at incident time
  • ioreg-metal.txt: Metal GPU state
  • ollama-server-tail.log: server.log tail at incident time
  • com.ollama.serve.{pre,post}-restart.dump: launchctl print env captures

Related issues

  • #15258 — ancestor; same shape on 0.20.0 Apple Silicon, reporter said resolved in 0.20.2. We're hitting the same shape on 0.20.5.
  • #15923 — open; 0.20.5 + Apple Silicon. Different shape (multi-turn tool-calling crashes with explicit error signatures). Adjacent area, probably distinct cause.
  • #15350 — closed; Gemma 4 Flash Attention hang. We have FLASH_ATTENTION=1; may or may not be a contributing factor here.

What we'd find useful from maintainers

  1. Confirmation whether this is the same as #15258 returning, or a distinct adjacent bug
  2. Whether the 0.20.2 fix has unit-test coverage for the runner-side receive path that would catch a regression
  3. Guidance on safe reproduction (which env knob most reliably triggers it) before we attempt repro on a production machine
  4. Whether anyone else has filed this on 0.20.3-0.20.5 that we missed in our pre-filing search

extent analysis

TL;DR

Restarting the Ollama daemon with launchctl bootout followed by launchctl bootstrap cleared the hang in the reported incident. This is a workaround; the root cause is not yet identified.

Guidance

  1. Verify the issue: Confirm that the problem is indeed related to the runner-side request-receive failure by checking the stack samples and lsof output for the stuck runners.
  2. Check daemon version: Confirm which Ollama version the daemon is running. This report is against 0.20.5 and describes a probable regression after the 0.20.2 fix for #15258.
  3. Test mitigation: Apply the mitigation that worked, which is a full daemon bootout and bootstrap, to see if it resolves the issue.
  4. Gather more information: Collect the forensic capture artifacts, such as stack profiles and lsof output, to help maintainers investigate the issue further.

Example

launchctl bootout "gui/$(id -u)/com.ollama.serve"
sleep 3
launchctl bootstrap "gui/$(id -u)" ~/Library/LaunchAgents/com.ollama.serve.plist

Notes

The issue seems to be related to the runner-side request-receive failure, and the mitigation that worked involves restarting the Ollama daemon. However, the root cause of the issue is still unknown, and more investigation is needed to determine whether it is the same as #15258 or a distinct adjacent bug.

Recommendation

Use the daemon restart (launchctl bootout, then launchctl bootstrap) as a workaround: it restored normal responses in the reported incident. It does not address the underlying runner-side failure, so the hang may recur.
