ollama - 💡(How to fix) Fix Tool calling is not streaming on macOS with MLX, causing timeout when write tool outputs large code

ollama2026-05-24 02:23:55

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

When using tool calling with MLX backend on macOS, the tool call responses are not streaming. This causes severe issues when a tool like write needs to output large amounts of code — the client waits a long time without any response and eventually times out.

Root Cause

Code Example

ollama pull qwen3.6:27b-coding-nvfp4

RAW_BUFFERClick to expand / collapse

Summary

Environment

OS: macOS (Apple Silicon)
Backend: MLX
Model: qwen3.6:27b-coding-nvfp4
Ollama version: Latest main

Expected behavior

Tool call outputs should be streamed incrementally to the client, just like regular text completion streaming. This matches the behavior when not using MLX (e.g., CUDA backend), where tool calls stream properly and clients don't timeout.

Actual behavior

When a tool call is triggered (e.g., write tool generating a large file with hundreds or thousands of lines of code), the entire tool output is buffered and only sent at the very end. This means:

The client receives no intermediate chunks for an extended period
For large code outputs, the wait can be 30 seconds or more
Client-side timeouts are triggered (typically around 30-60s depending on the client)
The request fails even though Ollama is still generating in the background

Steps to reproduce

Pull the model on macOS with MLX support:

ollama pull qwen3.6:27b-coding-nvfp4

Use the chat API with tool/function calling enabled, providing a prompt that triggers a tool to generate a large file (e.g., 500+ lines of code)
Observe that during tool execution, no streaming chunks are received by the client until the entire tool output is complete
The client times out before receiving the final response

Additional context

This is particularly problematic for coding models used in agent workflows where code generation via tools is very common. Non-streaming tool calls significantly degrade the user experience and reliability of AI coding assistants on macOS.

This issue may be related to how the MLX backend handles tool call chunking/streaming compared to other backends.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

FAQ

Expected behavior

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

ollama - 💡(How to fix) Fix Tool calling is not streaming on macOS with MLX, causing timeout when write tool outputs large code

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Code Example

Summary

Environment

Expected behavior

Actual behavior

Steps to reproduce

Additional context

FAQ

Expected behavior

Still need to ship something?

TRENDING