ollama - ✅(Solved) Fix create: MLX panic "no Stream(gpu, 1)" quantizing safetensors on Apple Silicon — regression in v0.23.1 [2 pull requests, 1 comments, 1 participants]

ollama · 2026-05-10 03:19:20


GitHub stats

ollama/ollama#16070 • Fetched 2026-05-11 03:13:33

Author: ooki1jp
Participants: ooki1jp
Assignees: dhiltgen
Timeline (top): cross-referenced ×2, assigned ×1, commented ×1, mentioned ×1

ollama create --experimental --quantize <any> panics in MLX during tensor quantization on macOS Apple Silicon. Regression bisected to v0.23.1. v0.23.0 works correctly with the same Modelfile and same input safetensors.

Fixed

Fixed by PR: create: route quantize through MLX worker thread (https://github.com/ollama/ollama/pull/16071)
Fixed by PR: Create safetensors models through a hybrid local/remote pipeline. (https://github.com/ollama/ollama/pull/14969)

PR fix notes

PR #16071: create: route quantize through MLX worker thread

Repository: ollama/ollama
Author: ooki1jp
State: open | merged: False
Link: https://github.com/ollama/ollama/pull/16071

Description (problem / solution / changelog)

Fixes #16070.

Problem

Since v0.23.1, ollama create --experimental --quantize <any> panics with mlx: There is no Stream(gpu, 1) in current thread on macOS Apple Silicon. v0.23.0 worked correctly with the same Modelfile and same input safetensors.

The root cause is that #15845 tightened MLX's per-thread state requirement: every CGO entry point now runs on a runtime.LockOSThread()-locked OS thread, and MLX expects that thread to have been initialized with SetDefaultDeviceGPU (which is what registers Stream(gpu, *) for the current thread).

The inference path was migrated to satisfy this invariant via a long-lived mlxthread.Thread worker started in x/mlxrunner/server.go. The create-quantize path, however, still calls mlx.Eval / mlx.Quantize / mlx.Contiguous directly from whatever goroutine the caller happens to use:

$ grep -rn "mlxthread" x/mlxrunner/
x/mlxrunner/server.go:37: worker, err := mlxthread.Start("mlxrunner", ...)
x/mlxrunner/runner.go:44: mlxThread     *mlxthread.Thread

$ grep -rn "mlxthread" x/create/
(no matches)

The first MLX call from such a goroutine locks a fresh OS thread, that thread has no MLX state, and mlx_eval panics inside transforms.cpp:73.

Fix

This PR adds the same pattern in x/create/client: a package-level mlxthread.Thread, lazily started, whose init callback runs mlx.CheckInit + mlx.SetDefaultDeviceGPU. The two public entry points that call MLX — quantizeTensor and quantizePackedGroup — are now thin wrappers that dispatch the real work via runOnMLXWorker.

Internal helpers (loadAndQuantizeArray, stackAndQuantizeExpertGroup, decodeSourceFP8Tensor) are unchanged because they always run inside one of the two entry points and therefore on the worker thread.

Diff is two files:

x/create/client/mlxworker.go (new, ~60 lines): worker singleton + runOnMLXWorker helper.
x/create/client/quantize.go (~35 lines added): wrap quantizeTensor / quantizePackedGroup; rename their bodies to *OnMLXThread.
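The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `thread` type is a hypothetical stand-in for `mlxthread.Thread`, `runOnWorker` and `quantize` are illustrative names, and the setup callback is a stub where the PR would call `mlx.CheckInit` and `mlx.SetDefaultDeviceGPU`. It assumes only the standard library.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// thread is a hypothetical stand-in for mlxthread.Thread: one goroutine
// pinned to one OS thread that runs every submitted function in order.
type thread struct {
	jobs chan func()
}

// start launches the worker and runs setup on the pinned thread before any job.
func start(setup func() error) (*thread, error) {
	t := &thread{jobs: make(chan func())}
	ready := make(chan error, 1)
	go func() {
		// Pin this goroutine to its OS thread for its whole lifetime, so
		// thread-local native state created by setup is visible to every job.
		runtime.LockOSThread()
		err := setup()
		ready <- err
		if err != nil {
			return
		}
		for job := range t.jobs {
			job()
		}
	}()
	if err := <-ready; err != nil {
		return nil, err
	}
	return t, nil
}

// do runs fn on the worker thread and waits for its result.
func (t *thread) do(fn func() error) error {
	done := make(chan error, 1)
	t.jobs <- func() { done <- fn() }
	return <-done
}

var (
	workerOnce sync.Once
	worker     *thread
	workerErr  error
	setupCalls int // counts setup runs; for illustration only
)

// runOnWorker lazily starts the singleton worker, then dispatches fn to it.
func runOnWorker(fn func() error) error {
	workerOnce.Do(func() {
		worker, workerErr = start(func() error {
			// Stub: the PR's callback would initialize MLX for this thread here.
			setupCalls++
			return nil
		})
	})
	if workerErr != nil {
		return workerErr
	}
	return worker.do(fn)
}

// quantize is a thin public wrapper; the real work runs on the worker thread.
func quantize(name string) (string, error) {
	var out string
	err := runOnWorker(func() error {
		out = name + ".int4" // placeholder for the actual quantization
		return nil
	})
	return out, err
}

func main() {
	for _, n := range []string{"a", "b"} {
		out, err := quantize(n)
		fmt.Println(out, err)
	}
	fmt.Println("setup calls:", setupCalls)
}
```

Because the worker starts inside `sync.Once`, callers that never quantize never pay for the thread, and every caller that does shares the same initialized OS thread regardless of which goroutine it runs on.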

Verification

Tested on M3 Air 16 GB (the host that hit the regression).

Inputs: google/gemma-4-E4B-it (target, 15 GB BF16 safetensors) and google/gemma-4-E4B-it-assistant (drafter, 152 MB).

Scenario	Before	After
`ollama create test -f Modelfile --experimental --quantize int4` (no DRAFT)	panic	✅ creates 9.7 GB int4 model, digest `aab76c72c49c` (matches v0.23.0 output from same inputs)
Same with `DRAFT ./gemma-4-E4B-it-assistant` (MTP)	panic	✅ creates 9.9 GB model `cd730619b202`
`ollama run gemma4-mtp` after	n/a	✅ generates tokens at ~36 tok/s vs ~28 tok/s for stock `gemma4:e4b` (the expected ~1.3× E4B drafter speedup, in line with the docs)

Other notes

gofmt clean, go vet clean.
Existing client package tests still pass (go test ./x/create/client/...).
I considered a smaller fix (adding SetDefaultDeviceGPU inside mlxCheck), but that conflicts with the existing architectural decision in #15845 to centralize MLX work on a dedicated worker — applying the same pattern here keeps x/create/client consistent with x/mlxrunner.
The worker is lazily initialized so there is no startup-time cost for non-quantize create paths.

Changed files

x/create/client/mlxworker.go (added, +57/-0)
x/create/client/quantize.go (modified, +35/-0)

PR #14969: Create safetensors models through a hybrid local/remote pipeline.

Repository: ollama/ollama
Author: dhiltgen
State: open | merged: False
Link: https://github.com/ollama/ollama/pull/14969

Description (problem / solution / changelog)

Local OLLAMA_HOST uses the optimized local create path by default, while non-local hosts use the API upload/create flow. OLLAMA_CREATE_REMOTE can force the remote path for testing, and OLLAMA_CREATE_SERVER_QUANTIZE can force remote quantization to happen on the server.
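One way to picture the selection rule in the paragraph above is the sketch below. It is an assumption-laden illustration, not the PR's logic: `isLocalHost` and `useRemoteCreate` are hypothetical names, "local" is taken to mean `localhost` or a loopback IP, the host is assumed to be a full URL with a scheme, and any non-empty `OLLAMA_CREATE_REMOTE` value is treated as "force remote". The PR may parse these differently.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"os"
)

// isLocalHost reports whether rawURL names this machine, assuming "local"
// means the literal host "localhost" or a loopback IP address.
func isLocalHost(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	host := u.Hostname()
	if host == "localhost" {
		return true
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}

// useRemoteCreate picks the API upload/create flow when the host is not
// local, or when the caller forces it (hypothetical helper).
func useRemoteCreate(host string, forceRemote bool) bool {
	return forceRemote || !isLocalHost(host)
}

func main() {
	// Assumption: any non-empty value of the variable forces the remote path.
	force := os.Getenv("OLLAMA_CREATE_REMOTE") != ""
	for _, h := range []string{"http://127.0.0.1:11434", "http://gpu-box.example:11434"} {
		fmt.Printf("%s remote=%v\n", h, useRemoteCreate(h, force))
	}
}
```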

Remote safetensors creation now reuses x/create.CreateSafetensorsModel as the single traversal/import path. This keeps remote behavior aligned with local handling for architecture-specific transforms, source FP8 and prequantized tensors, skipped companion tensors, expert grouping, and per-tensor quantization decisions.

Fix server-side safetensors assembly to preserve Modelfile parameters and FileType metadata, and handle packed tensor groups during server-side quantization without dropping tensors. Packed groups are quantized as multi-tensor blobs instead of attempting to quantize the group name as a single tensor.

Improve local packed-group creation by streaming from file-backed TensorData for unquantized groups instead of extracting whole tensor byte slices.

Reduce GGUF creation memory by passing through BF16 tensors when no conversion is needed and by making large tensor writes exclusive while still allowing smaller tensor writes to run concurrently.
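The PR text does not say how large writes are made exclusive. One plausible mechanism, shown here purely as an assumption, is a reader/writer lock: small writes take the shared side and can overlap, while a large write takes the exclusive side. `writeGate`, `largeThreshold`, and `runWrites` are hypothetical names, and the cutoff value is arbitrary.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

const largeThreshold = 1 << 20 // hypothetical cutoff in bytes

// writeGate lets small writes overlap but gives large writes exclusive access.
type writeGate struct{ mu sync.RWMutex }

func (g *writeGate) write(size int, fn func()) {
	if size >= largeThreshold {
		g.mu.Lock() // exclusive: waits for all in-flight writes to finish
		defer g.mu.Unlock()
	} else {
		g.mu.RLock() // shared: may run alongside other small writes
		defer g.mu.RUnlock()
	}
	fn()
}

// runWrites issues one write per size concurrently and returns how many
// times a large write was observed overlapping any other write.
func runWrites(sizes []int) int32 {
	var (
		g           writeGate
		wg          sync.WaitGroup
		active      atomic.Int32
		largeActive atomic.Bool
		violations  atomic.Int32
	)
	for _, size := range sizes {
		wg.Add(1)
		go func(size int) {
			defer wg.Done()
			g.write(size, func() {
				n := active.Add(1)
				defer active.Add(-1)
				if size >= largeThreshold {
					if n != 1 {
						violations.Add(1)
					}
					largeActive.Store(true)
					defer largeActive.Store(false)
				} else if largeActive.Load() {
					violations.Add(1)
				}
			})
		}(size)
	}
	wg.Wait()
	return violations.Load()
}

func main() {
	fmt.Println("overlap violations:", runWrites([]int{4096, 8 << 20, 4096, 2 << 20, 4096}))
}
```

The trade-off in this design is that a pending large write briefly stalls new small writes, which bounds peak memory at the cost of some throughput.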

Add focused tests for remote request metadata, server/client quantization selection, packed quantized tensor preservation, local packed streaming, remote upload/create failures, and GGUF write concurrency.

Harden MLX thread handling to enforce the bound-thread requirement.

Test results when creating from gemma-4-31b-it

Flow	Wall Time	CLI Max RSS	Server Max RSS	Notes
Safetensors local create, 4-bit quant	`29.78s`	`7.4 GiB`	`N/A`	Direct local path. Model size `18.85 GiB`; `1195` manifest layers.
Safetensors remote create, 4-bit quant	`41.72s`	`7.4 GiB`	`68 MiB`	Forced remote path with client-side quantization. Model size `18.85 GiB`; `1195` manifest layers.
Safetensors remote create, 4-bit quant, server-side	`105.38s`	`53 MiB`	`7.46 GiB`	Forced remote path with server-side quantization. Model size `18.85 GiB`; `1195` manifest layers.
GGUF 31B, q4_k_m, dynamic memory budget final	220.63s CLI / 2m50s server create	47.0 MiB	42.0 GiB	CPU saturated, no swaps, uses subset of free memory budget
GGUF create, 4-bit quant, unmodified `main`	`-`	`-`	`72.05 GiB`	Upstream baseline for the same 31B model and `q4_k_m`; peak memory footprint was `79.84 GiB`.

Fixes #16070

Changed files

api/client.go (modified, +13/-0)
api/types.go (modified, +11/-0)
cmd/cmd.go (modified, +54/-5)
cmd/cmd_test.go (modified, +27/-0)
convert/reader_safetensors.go (modified, +1/-0)
discover/gpu_info_darwin.m (modified, +12/-4)
docs/api.md (modified, +39/-1)
fs/ggml/gguf.go (modified, +217/-7)
fs/ggml/gguf_test.go (modified, +599/-0)
integration/create_test.go (modified, +239/-81)
progress/bar.go (modified, +4/-0)
server/create.go (modified, +396/-1)
server/create_test.go (modified, +85/-0)
server/images.go (modified, +92/-54)
server/images_test.go (modified, +20/-0)
server/quantization.go (modified, +31/-1)
server/routes_create_test.go (modified, +551/-119)
server/sched_test.go (modified, +4/-4)
x/create/capabilities.go (added, +147/-0)
x/create/client/create.go (modified, +35/-169)
x/create/client/create_test.go (modified, +237/-15)
x/create/client/quantize.go (modified, +691/-176)
x/create/client/quantize_test.go (modified, +160/-3)
x/create/client/remote.go (added, +650/-0)
x/create/client/remote_test.go (added, +1024/-0)
x/create/create.go (modified, +475/-84)
x/create/create_test.go (modified, +2/-2)
x/create/gemma4.go (modified, +4/-22)
x/create/quantization_planner_test.go (added, +235/-0)
x/create/qwen35.go (modified, +6/-0)
x/internal/mlxtest/mlxtest.go (added, +87/-0)
x/mlxrunner/cache/cache_test.go (modified, +2/-3)
x/mlxrunner/cache/rotating_attention_test.go (modified, +125/-97)
x/mlxrunner/mlx/array_test.go (modified, +27/-27)
x/mlxrunner/mlx/compile_test.go (modified, +123/-123)
x/mlxrunner/mlx/io.go (modified, +1/-0)
x/mlxrunner/mlx/memory.go (modified, +5/-0)
x/mlxrunner/mlx/mlx.go (modified, +34/-0)
x/mlxrunner/mlx/stream.go (modified, +48/-24)
x/mlxrunner/mlx/thread_test.go (modified, +25/-0)
x/mlxrunner/model/embedding_test.go (modified, +2/-3)
x/mlxrunner/sample/sample_test.go (modified, +2/-3)
x/mlxrunner/server.go (modified, +1/-0)
x/models/gemma4/gemma4_moe_test.go (modified, +31/-19)
x/models/gemma4/gemma4_test.go (modified, +3/-3)
x/models/laguna/laguna_test.go (modified, +2/-3)
x/models/nn/nn_test.go (modified, +2/-3)
x/models/nn/sdpa_test.go (modified, +5/-4)
x/models/qwen3_5/qwen3_5_test.go (modified, +2/-3)
x/safetensors/extractor.go (modified, +156/-4)
x/safetensors/extractor_test.go (modified, +113/-0)


Summary

Environment

Ollama v0.23.2 (released 2026-05-07)
macOS, MacBook Air M3, 16 GB unified memory
Install via brew install --cask ollama-app

Bisection

Version	Released	Result
0.23.0	2026-05-03	✅ Succeeds, produces valid 9.7 GB int4 model
0.23.1	2026-05-05	❌ Panics
0.23.2	2026-05-07	❌ Panics

Same Modelfile and same safetensors input on each version.

Minimum reproduction (no `DRAFT` — bug is not DRAFT-specific)

FROM ./gemma-4-E4B-it

Input model: google/gemma-4-E4B-it (public on HF, BF16 safetensors, 15.0 GB).

ollama create test -f Modelfile --experimental --quantize int4

Actual

Safetensors import succeeds; panic occurs during quantization of the 2130 main tensors:

importing model.safetensors (2130 tensors, quantizing to int4) ⠧
panic: mlx: There is no Stream(gpu, 1) in current thread.
  at /Users/runner/work/ollama/ollama/build/metal-v3/_deps/mlx-c-src/mlx/c/transforms.cpp:73

goroutine 1 [running]:
github.com/ollama/ollama/x/mlxrunner/mlx.mlxCheck(...) /x/mlxrunner/mlx/mlx.go:68
github.com/ollama/ollama/x/mlxrunner/mlx.doEval(...) /x/mlxrunner/mlx/mlx.go:86
github.com/ollama/ollama/x/create/client.loadAndQuantizeArray(...) /x/create/client/quantize.go:106
github.com/ollama/ollama/x/create/client.quantizeTensor(...) /x/create/client/quantize.go:140
github.com/ollama/ollama/x/create/client.createQuantizedLayers(...) /x/create/client/create.go:365

Expected

Quantized model should be created. v0.23.0 produces a valid 9.7 GB int4 model from the same input.

Controls

All quantize formats panic identically: int4, int8, mxfp4, mxfp8, nvfp4
q4_K_M is rejected as unsupported; that rejection is separate from this panic
DRAFT removed from Modelfile: same panic — not DRAFT-specific
--quantize omitted (BF16): create succeeds (the resulting BF16 then OOMs at load on this 16 GB host, expected)
Ollama process restart, no models loaded: same panic
Pre-existing quantized gemma4:e4b (Q4) inference works fine on 0.23.2 → runtime path is OK; bug is in create-time quantize path through the MLX runner

Likely culprits in 0.23.1

The bisection narrows to one of three v0.23.1 changes:

"Update MLX and MLX-C with threading fixes" — strong candidate; the panic is literally a thread-local MLX Stream missing
Go 1.26 bump
#15980 (Gemma4 MTP / DRAFT) — less likely, bug repros without DRAFT

Related (different code paths, for searchability)

#15775 — same panic string but on inference of MoE model (closed)
#15746 — open MLX MoE NVFP4 issue, different symptom

Notes

16 GB host's BF16 OOM at load is unrelated. The data point is: v0.23.0 succeeds and v0.23.1 panics on the same host with same input.
Happy to commit-level bisect between v0.23.0 and v0.23.1, attach full server log, or repro on a smaller public safetensors model if helpful.
