ollama - ✅(Solved) Fix [Mac/MLX] Regression: MLX Runner killed by 10s watchdog timeout during large context prefill/generation [1 pull requests, 1 participants]

ollama · 2026-05-10 12:09:55


GitHub stats

ollama/ollama#16081 • Fetched 2026-05-11 03:13:21

Author: iggori
Participants: iggori
Assignees: dhiltgen
Timeline (top): assigned ×1, cross-referenced ×1, labeled ×1

Error Message

Since v0.23.1/v0.23.2, the MLX runner on Apple Silicon is being terminated by a strict 10-second heartbeat/status check timeout. When processing large context windows (32k+) or running slow-prefill MoE models (like Qwen 3.5/3.6 35B), the GPU may not respond to a status check within exactly 10.00 seconds. The server then cancels the context, resulting in a 500 Internal Server Error.

time=2026-05-10T13:48:46.848+02:00 level=ERROR source=server.go:76 msg="Failed to read MLX memory status" error="context canceled"
time=2026-05-10T13:48:46.848+02:00 level=ERROR source=server.go:213 msg=ServeHTTP method=GET path=/v1/status took=10.001061292s status="500 Internal Server Error"

The issue is purely a synchronization/timeout logic error in the new runner architecture. The GPU is not hanging; it is simply busy. v0.22.1 handles this by allowing the runner more time to report back.

Fix Action

Fixed

Fixed by PR: mlx: avoid status timeout during inference (https://github.com/ollama/ollama/pull/16086)

PR fix notes

PR #16086: mlx: avoid status timeout during inference

Repository: ollama/ollama
Author: dhiltgen
State: open | merged: False
Link: https://github.com/ollama/ollama/pull/16086

Description (problem / solution / changelog)

The MLX runner now routes model work through a locked worker thread. Status also used that worker only to sample memory, so a scheduler health ping could sit behind long prefill or generation until its 10s context expired, causing /v1/status to return 500 and the server to treat the runner as unhealthy.
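The failure mode described above can be reproduced in miniature. The following is a minimal sketch, not the runner's actual code: the `worker` type, `statusDuringInference`, and the millisecond timings are all hypothetical stand-ins for the locked worker thread and the 10s status context. Because the sketch puts the deadline on its own context, the error it sees is `context.DeadlineExceeded`, whereas the log in the issue shows `context canceled`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Hypothetical timings, scaled down from minutes and 10s to milliseconds.
const (
	longJob       = 200 * time.Millisecond // stand-in for prefill or generation
	shortDeadline = 20 * time.Millisecond  // stand-in for the status timeout
)

// worker serializes every job onto one goroutine (hypothetical type).
type worker struct {
	jobs chan func()
}

func newWorker() *worker {
	w := &worker{jobs: make(chan func())}
	go func() {
		for job := range w.jobs {
			job()
		}
	}()
	return w
}

// do runs fn on the worker, or gives up when ctx expires first.
func (w *worker) do(ctx context.Context, fn func()) error {
	done := make(chan struct{})
	select {
	case w.jobs <- func() { fn(); close(done) }:
	case <-ctx.Done():
		return ctx.Err() // never reached the worker: it was busy
	}
	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// statusDuringInference starts a long job, then issues a status call with
// its own deadline and returns the status call's error.
func statusDuringInference(inference, statusTimeout time.Duration) error {
	w := newWorker()
	started := make(chan struct{})
	go w.do(context.Background(), func() {
		close(started)
		time.Sleep(inference)
	})
	<-started // the worker is now busy

	ctx, cancel := context.WithTimeout(context.Background(), statusTimeout)
	defer cancel()
	return w.do(ctx, func() { /* sample memory here */ })
}

// timedOut reports whether err is a context deadline error.
func timedOut(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	err := statusDuringInference(longJob, shortDeadline)
	fmt.Println("status timed out:", timedOut(err))
}
```

The status call does no heavy work itself; it fails only because it waits in line behind the long job, which matches the description of a health ping sitting behind prefill or generation.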

While Metal doesn't change VRAM reporting, CUDA does. Cache the last memory sample and make status perform only a short best-effort refresh. If the worker is busy, status returns the cached value while a single background refresh continues and updates the cache when the worker becomes available. The in-flight guard and lifecycle context keep this from spawning unbounded refreshes while preserving live VRAM refresh behavior for CUDA.
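One way to picture the caching approach described above is the sketch below. It is an illustration under assumptions, not the code from `x/mlxrunner/status_memory.go`: the `memoryCache` type, its fields, the `Get` method, and the sample values are hypothetical. It shows a short best-effort wait, a fallback to the cached sample, a guard that allows only one refresh in flight, and a lifecycle context that ends pending work at shutdown.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// memoryCache is a hypothetical type for illustration only.
type memoryCache struct {
	lifetime context.Context                       // cancelled on runner shutdown
	sample   func(context.Context) (uint64, error) // may queue behind model work

	mu      sync.Mutex
	last    uint64        // most recent successful sample
	pending chan struct{} // non-nil while a refresh is in flight
}

// Get waits up to wait for a fresh sample, then returns whatever is cached.
func (c *memoryCache) Get(wait time.Duration) uint64 {
	c.mu.Lock()
	if c.pending == nil { // in-flight guard: at most one refresh at a time
		c.pending = make(chan struct{})
		go c.refresh(c.pending)
	}
	done := c.pending
	c.mu.Unlock()

	timer := time.NewTimer(wait)
	defer timer.Stop()
	select {
	case <-done: // a fresh sample arrived in time
	case <-timer.C: // worker busy: fall back to the cached value
	case <-c.lifetime.Done(): // runner is shutting down
	}

	c.mu.Lock()
	defer c.mu.Unlock()
	return c.last
}

// refresh keeps running after Get returns and updates the cache on success.
func (c *memoryCache) refresh(done chan struct{}) {
	v, err := c.sample(c.lifetime)
	c.mu.Lock()
	if err == nil {
		c.last = v
	}
	c.pending = nil
	c.mu.Unlock()
	close(done)
}

// demo simulates a busy worker, reads status, then frees the worker.
func demo() (cached, fresh uint64) {
	release := make(chan struct{})
	c := &memoryCache{
		lifetime: context.Background(),
		last:     1 << 30, // pretend an earlier sample reported 1 GiB
		sample: func(ctx context.Context) (uint64, error) {
			select {
			case <-release: // the worker became available
				return 2 << 30, nil
			case <-ctx.Done():
				return 0, ctx.Err()
			}
		},
	}
	cached = c.Get(10 * time.Millisecond) // worker busy: cached value
	close(release)
	fresh = c.Get(time.Second) // refresh completes: new value
	return cached, fresh
}

func main() {
	cached, fresh := demo()
	fmt.Println("while busy:", cached, "after refresh:", fresh)
}
```

In this sketch a status call never waits longer than its own short budget, so a scheduler health ping cannot be held up by a long prefill, while the background refresh still picks up a changed value once the worker is free.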

Fixes #16081

Changed files

x/mlxrunner/server.go (modified, +20/-10)
x/mlxrunner/status_memory.go (added, +111/-0)
x/mlxrunner/status_memory_test.go (added, +246/-0)

Code Example

Plaintext
time=2026-05-10T13:48:46.848+02:00 level=ERROR source=server.go:76 msg="Failed to read MLX memory status" error="context canceled"
time=2026-05-10T13:48:46.848+02:00 level=ERROR source=server.go:213 msg=ServeHTTP method=GET path=/v1/status took=10.001061292s status="500 Internal Server Error"
Additional Notes:
The issue is purely a synchronization/timeout logic error in the new runner architecture. The GPU is not hanging; it is simply busy. v0.22.1 handles this by allowing the runner more time to report back.
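To check whether a server log shows the same pattern, a small filter over the log text can help. The sketch below assumes the key=value layout seen in the two lines above; the function name `slowStatusCalls` and the `watchdog` constant are hypothetical and not part of Ollama.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
	"time"
)

// watchdog is the status timeout reported in the issue.
const watchdog = 10 * time.Second

// sample holds the two log lines quoted in the issue.
const sample = `time=2026-05-10T13:48:46.848+02:00 level=ERROR source=server.go:76 msg="Failed to read MLX memory status" error="context canceled"
time=2026-05-10T13:48:46.848+02:00 level=ERROR source=server.go:213 msg=ServeHTTP method=GET path=/v1/status took=10.001061292s status="500 Internal Server Error"`

var tookRE = regexp.MustCompile(`\btook=(\S+)`)

// slowStatusCalls returns the duration of every /v1/status request in logs
// that ended in a 500 and took at least threshold.
func slowStatusCalls(logs string, threshold time.Duration) []time.Duration {
	var out []time.Duration
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		line := sc.Text()
		if !strings.Contains(line, "path=/v1/status") ||
			!strings.Contains(line, `status="500`) {
			continue
		}
		m := tookRE.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		d, err := time.ParseDuration(m[1])
		if err != nil || d < threshold {
			continue
		}
		out = append(out, d)
	}
	return out
}

func main() {
	for _, d := range slowStatusCalls(sample, watchdog) {
		fmt.Println("status call hit the watchdog after", d)
	}
}
```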


What is the issue?

Description: Since v0.23.1/v0.23.2, the MLX runner on Apple Silicon is being terminated by a strict 10-second heartbeat/status check timeout. When processing large context windows (32k+) or running slow-prefill MoE models (like Qwen 3.5/3.6 35B), the GPU may not respond to a status check within exactly 10.00 seconds. The server then cancels the context, resulting in a 500 Internal Server Error.

Environment:

OS: macOS (26.2)

Hardware: Mac Studio M1 Max (64GB RAM)

Ollama Version: 0.23.2 (Verified also on 0.23.1)

Regression: v0.22.1 works perfectly (logs show successful runs taking 2m 50s without timeout).

Steps to Reproduce:

Use any model with the MLX runner (e.g., qwen3.6:35b-a3b-coding-nvfp4).

Set OLLAMA_CONTEXT_LENGTH=32768.

Send a request with a large prompt (18k+ tokens) or one that triggers long "thinking" generation.

Observe the logs. At exactly 10.00 seconds of the runner being "silent" (busy with GPU tasks), the server kills the process.
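The request in the steps above could be sent with a small client like the following sketch. It assumes a local server on Ollama's default port with `OLLAMA_CONTEXT_LENGTH=32768` already set in the server's environment. The filler prompt and the 20000-word count are assumptions; how many tokens that produces depends on the model's tokenizer, so adjust it until the prompt is in the 18k+ token range.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
	"time"
)

// generateRequest covers only the fields this sketch needs.
type generateRequest struct {
	Model  string `json:"model"`
	Prompt string `json:"prompt"`
	Stream bool   `json:"stream"`
}

// buildRequest makes a JSON body whose prompt repeats a filler word.
func buildRequest(model string, words int) ([]byte, error) {
	prompt := strings.Repeat("lorem ", words) + "\nSummarize the text above."
	return json.Marshal(generateRequest{Model: model, Prompt: prompt, Stream: false})
}

func main() {
	body, err := buildRequest("qwen3.6:35b-a3b-coding-nvfp4", 20000)
	if err != nil {
		fmt.Println("build request:", err)
		return
	}

	start := time.Now()
	resp, err := http.Post("http://localhost:11434/api/generate",
		"application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Println("request failed after", time.Since(start), "-", err)
		return
	}
	defer resp.Body.Close()

	// On an affected version, watch the server log for the 500 on
	// /v1/status while this request is still being processed.
	fmt.Println(resp.Status, "after", time.Since(start).Round(time.Millisecond))
}
```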

When I downgraded to version 0.22.1 (MLX version 0.31.1-23-g38ad257), all problems went away, so I think the regression was introduced with MLX version 0.31.2-7-ge8ebdeb.

Thanks


OS: macOS

GPU: Apple

CPU: Apple

Ollama version: v0.23.1/v0.23.2
