ollama - ✅(Solved) Fix [Mac/MLX] Regression: MLX Runner killed by 10s watchdog timeout during large context prefill/generation [1 pull requests, 1 participants]

GitHub stats
ollama/ollama#16081 (fetched 2026-05-11 03:13:21)
Comments: 0 · Participants: 1 · Timeline: 3 · Reactions: 0
Timeline (top): assigned ×1, cross-referenced ×1, labeled ×1

Error Message

Since v0.23.1/v0.23.2, the MLX runner on Apple Silicon is being terminated by a strict 10-second heartbeat/status check timeout. When processing large context windows (32k+) or running slow-prefill MoE models (like Qwen 3.5/3.6 35B), the GPU may not respond to a status check within exactly 10.00 seconds. The server then cancels the context, resulting in a 500 Internal Server Error.

time=2026-05-10T13:48:46.848+02:00 level=ERROR source=server.go:76 msg="Failed to read MLX memory status" error="context canceled"
time=2026-05-10T13:48:46.848+02:00 level=ERROR source=server.go:213 msg=ServeHTTP method=GET path=/v1/status took=10.001061292s status="500 Internal Server Error"

The issue is purely a synchronization/timeout logic error in the new runner architecture. The GPU is not hanging; it is simply busy. v0.22.1 handles this by allowing the runner more time to report back.

Fix Action

Fixed

PR fix notes

PR #16086: mlx: avoid status timeout during inference

Description (problem / solution / changelog)

The MLX runner now routes model work through a locked worker thread. Status also used that worker only to sample memory, so a scheduler health ping could sit behind long prefill or generation until its 10s context expired, causing /v1/status to return 500 and the server to treat the runner as unhealthy.

While Metal doesn't change VRAM reporting, CUDA does. Cache the last memory sample and make status perform only a short best-effort refresh. If the worker is busy, status returns the cached value while a single background refresh continues and updates the cache when the worker becomes available. The in-flight guard and lifecycle context keep this from spawning unbounded refreshes while preserving live VRAM refresh behavior for CUDA.

Fixes #16081

Changed files

  • x/mlxrunner/server.go (modified, +20/-10)
  • x/mlxrunner/status_memory.go (added, +111/-0)
  • x/mlxrunner/status_memory_test.go (added, +246/-0)


What is the issue?

Description: Since v0.23.1/v0.23.2, the MLX runner on Apple Silicon is being terminated by a strict 10-second heartbeat/status check timeout. When processing large context windows (32k+) or running slow-prefill MoE models (like Qwen 3.5/3.6 35B), the GPU may not respond to a status check within exactly 10.00 seconds. The server then cancels the context, resulting in a 500 Internal Server Error.

Environment:

OS: macOS (26.2)

Hardware: Mac Studio M1 Max (64GB RAM)

Ollama Version: 0.23.2 (Verified also on 0.23.1)

Regression: v0.22.1 works perfectly (logs show successful runs taking 2m 50s without timeout).

Steps to Reproduce:

Use any model with the MLX runner (e.g., qwen3.6:35b-a3b-coding-nvfp4).

Set OLLAMA_CONTEXT_LENGTH=32768.

Send a request with a large prompt (18k+ tokens) or one that triggers long "thinking" generation.

Observe the logs. At exactly 10.00 seconds of the runner being "silent" (busy with GPU tasks), the server kills the process.

When I downgraded to version 0.22.1 (MLX version 0.31.1-23-g38ad257), all problems went away, so I think the regression was introduced in MLX version 0.31.2-7-ge8ebdeb.

Thanks

Relevant log output

Plaintext
time=2026-05-10T13:48:46.848+02:00 level=ERROR source=server.go:76 msg="Failed to read MLX memory status" error="context canceled"
time=2026-05-10T13:48:46.848+02:00 level=ERROR source=server.go:213 msg=ServeHTTP method=GET path=/v1/status took=10.001061292s status="500 Internal Server Error"
Additional Notes:
The issue is purely a synchronization/timeout logic error in the new runner architecture. The GPU is not hanging; it is simply busy. v0.22.1 handles this by allowing the runner more time to report back.

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

v0.23.1/v0.23.2
