ollama - ✅(Solved) Fix create: MLX panic "no Stream(gpu, 1)" quantizing safetensors on Apple Silicon — regression in v0.23.1 [2 pull requests, 1 comments, 1 participants]

ollama/ollama#16070 (fetched 2026-05-11 03:13:33)

ollama create --experimental --quantize <any> panics in MLX during tensor quantization on macOS Apple Silicon. Regression bisected to v0.23.1. v0.23.0 works correctly with the same Modelfile and same input safetensors.

Fix Action

Fixed

PR fix notes

PR #16071: create: route quantize through MLX worker thread

Description (problem / solution / changelog)

Fixes #16070.

Problem

Since v0.23.1, ollama create --experimental --quantize <any> panics with mlx: There is no Stream(gpu, 1) in current thread on macOS Apple Silicon. v0.23.0 worked correctly with the same Modelfile and same input safetensors.

The root cause is that #15845 tightened MLX's per-thread state requirement: every CGO entry point now runs on a runtime.LockOSThread()-locked OS thread, and MLX expects that thread to have been initialized with SetDefaultDeviceGPU (which is what registers Stream(gpu, *) for the current thread).

The inference path was migrated to satisfy this invariant via a long-lived mlxthread.Thread worker started in x/mlxrunner/server.go. The create-quantize path, however, still calls mlx.Eval / mlx.Quantize / mlx.Contiguous directly from whatever goroutine the caller happens to use:

$ grep -rn "mlxthread" x/mlxrunner/
x/mlxrunner/server.go:37: worker, err := mlxthread.Start("mlxrunner", ...)
x/mlxrunner/runner.go:44: mlxThread     *mlxthread.Thread

$ grep -rn "mlxthread" x/create/
(no matches)

The first MLX call from such a goroutine locks a fresh OS thread, that thread has no MLX state, and mlx_eval panics inside transforms.cpp:73.

Fix

This PR adds the same pattern in x/create/client: a package-level mlxthread.Thread, lazily started, whose init callback runs mlx.CheckInit + mlx.SetDefaultDeviceGPU. The two public entry points that call MLX — quantizeTensor and quantizePackedGroup — are now thin wrappers that dispatch the real work via runOnMLXWorker.

Internal helpers (loadAndQuantizeArray, stackAndQuantizeExpertGroup, decodeSourceFP8Tensor) are unchanged because they always run inside one of the two entry points and therefore on the worker thread.

Diff is two files:

  • x/create/client/mlxworker.go (new, ~60 lines): worker singleton + runOnMLXWorker helper.
  • x/create/client/quantize.go (~35 lines added): wrap quantizeTensor / quantizePackedGroup; rename their bodies to *OnMLXThread.
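The worker pattern described above can be sketched roughly as follows. This is a minimal illustration only, not the code from the PR: `worker`, `startWorker`, `runOnWorker`, `quantize`, and `quantizeOnWorker` are hypothetical names, the init callback is a stand-in for the per-thread MLX setup, and the "quantization" is placeholder arithmetic. It assumes only the standard library's `runtime.LockOSThread` and `sync.Once`.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// worker owns one OS thread for its whole lifetime.
// All names in this file are hypothetical; this is not the mlxthread API.
type worker struct {
	jobs chan func()
}

// startWorker launches a goroutine, pins it to its OS thread, runs init once
// on that thread, then serves jobs until the jobs channel is closed.
func startWorker(init func() error) (*worker, error) {
	w := &worker{jobs: make(chan func())}
	ready := make(chan error, 1)
	go func() {
		runtime.LockOSThread() // never unlocked: the thread stays dedicated
		if err := init(); err != nil {
			ready <- err
			return
		}
		ready <- nil
		for job := range w.jobs {
			job()
		}
	}()
	if err := <-ready; err != nil {
		return nil, err
	}
	return w, nil
}

var (
	once      sync.Once
	shared    *worker
	sharedErr error
	initCalls int // written only on the worker thread, before ready is signalled
)

// runOnWorker lazily starts the shared worker and runs fn on its thread,
// blocking until fn returns.
func runOnWorker(fn func() error) error {
	once.Do(func() {
		shared, sharedErr = startWorker(func() error {
			initCalls++ // stand-in for per-thread setup such as selecting a device
			return nil
		})
	})
	if sharedErr != nil {
		return sharedErr
	}
	done := make(chan error, 1)
	shared.jobs <- func() { done <- fn() }
	return <-done
}

// quantize is the thin public wrapper; the real work runs on the worker.
func quantize(x []float32) (out []int8, err error) {
	err = runOnWorker(func() error {
		out = quantizeOnWorker(x)
		return nil
	})
	return out, err
}

// quantizeOnWorker must only be called from the worker thread.
func quantizeOnWorker(x []float32) []int8 {
	out := make([]int8, len(x))
	for i, v := range x {
		out[i] = int8(v) // placeholder arithmetic, not a real quantizer
	}
	return out
}

func main() {
	for i := 0; i < 3; i++ {
		out, err := quantize([]float32{1, 2, 3})
		fmt.Println(out, err)
	}
	fmt.Println("init calls:", initCalls)
}
```

However many goroutines call `quantize`, the init callback runs once and every job executes on the same locked OS thread, which is the property the per-thread state requirement depends on. Helpers called from inside a job inherit that thread, which is why only the entry points need wrapping.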

Verification

Tested on M3 Air 16 GB (the host that hit the regression).

Inputs: google/gemma-4-E4B-it (target, 15 GB BF16 safetensors) and google/gemma-4-E4B-it-assistant (drafter, 152 MB).

| Scenario | Before | After |
| --- | --- | --- |
| ollama create test -f Modelfile --experimental --quantize int4 (no DRAFT) | panic | ✅ creates 9.7 GB int4 model, digest aab76c72c49c (matches v0.23.0 output from same inputs) |
| Same with DRAFT ./gemma-4-E4B-it-assistant (MTP) | panic | ✅ creates 9.9 GB model cd730619b202 |
| ollama run gemma4-mtp after | n/a | ✅ generates tokens at ~36 tok/s vs ~28 tok/s for stock gemma4:e4b (the expected ~1.3× E4B drafter speedup, in line with the docs) |

Other notes

  • gofmt clean, go vet clean.
  • Existing client package tests still pass (go test ./x/create/client/...).
  • I considered a smaller fix (adding SetDefaultDeviceGPU inside mlxCheck), but that conflicts with the existing architectural decision in #15845 to centralize MLX work on a dedicated worker — applying the same pattern here keeps x/create/client consistent with x/mlxrunner.
  • The worker is lazily initialized so there is no startup-time cost for non-quantize create paths.

Changed files

  • x/create/client/mlxworker.go (added, +57/-0)
  • x/create/client/quantize.go (modified, +35/-0)

PR #14969: Create safetensors models through a hybrid local/remote pipeline.

Description (problem / solution / changelog)

Local OLLAMA_HOST uses the optimized local create path by default, while non-local hosts use the API upload/create flow. OLLAMA_CREATE_REMOTE can force the remote path for testing, and OLLAMA_CREATE_SERVER_QUANTIZE can force remote quantization to happen on the server.
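The selection between the local and remote create paths could look roughly like the sketch below. Only the three environment variable names come from the description; everything else is an assumption made for illustration, including the rule for what counts as a "local" host (empty, `localhost`, or a loopback IP), the accepted truthy values, and the names `createPlan`, `isLocalHost`, `truthy`, and `planCreate`.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"os"
	"strings"
)

// createPlan records which path a create request would take (hypothetical type).
type createPlan struct {
	Remote         bool // use the API upload/create flow
	ServerQuantize bool // quantize on the server instead of the client
}

// isLocalHost is an assumed rule: an unset host, "localhost", or a loopback
// IP counts as local. The real check may differ.
func isLocalHost(raw string) bool {
	if raw == "" {
		return true
	}
	if !strings.Contains(raw, "://") {
		raw = "http://" + raw
	}
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	host := u.Hostname()
	if host == "" || host == "localhost" {
		return true
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}

// truthy is an assumed parsing rule for boolean-like environment values.
func truthy(v string) bool {
	switch strings.ToLower(strings.TrimSpace(v)) {
	case "1", "true", "yes", "on":
		return true
	}
	return false
}

// planCreate takes a getenv function so the decision can be tested without
// touching the process environment.
func planCreate(getenv func(string) string) createPlan {
	remote := !isLocalHost(getenv("OLLAMA_HOST")) || truthy(getenv("OLLAMA_CREATE_REMOTE"))
	return createPlan{
		Remote: remote,
		// Assumption: server-side quantization only applies on the remote path.
		ServerQuantize: remote && truthy(getenv("OLLAMA_CREATE_SERVER_QUANTIZE")),
	}
}

func main() {
	plan := planCreate(os.Getenv)
	fmt.Printf("remote=%v serverQuantize=%v\n", plan.Remote, plan.ServerQuantize)
}
```

Under these assumptions, the three safetensors rows in the test results correspond to a local host with no overrides, `OLLAMA_CREATE_REMOTE` set, and both override variables set.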

Remote safetensors creation now reuses x/create.CreateSafetensorsModel as the single traversal/import path. This keeps remote behavior aligned with local handling for architecture-specific transforms, source FP8 and prequantized tensors, skipped companion tensors, expert grouping, and per-tensor quantization decisions.

Fix server-side safetensors assembly to preserve Modelfile parameters and FileType metadata, and handle packed tensor groups during server-side quantization without dropping tensors. Packed groups are quantized as multi-tensor blobs instead of attempting to quantize the group name as a single tensor.

Improve local packed-group creation by streaming from file-backed TensorData for unquantized groups instead of extracting whole tensor byte slices.

Reduce GGUF creation memory by passing through BF16 tensors when no conversion is needed and by making large tensor writes exclusive while still allowing smaller tensor writes to run concurrently.
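One way to get "large writes exclusive, small writes concurrent" is a reader/writer lock used as a gate: small writes take the shared side and large writes take the exclusive side. The sketch below illustrates that idea only; it is an assumption that the PR works this way, and `writeGate`, `largeThreshold`, and `maxOverlapDuringLarge` are hypothetical names with an arbitrary size cutoff.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// largeThreshold is a hypothetical cutoff in bytes.
const largeThreshold = 1 << 20

// writeGate lets small writes overlap each other while a large write
// runs alone.
type writeGate struct {
	mu sync.RWMutex
}

// do runs write under the shared lock for small sizes and under the
// exclusive lock for large sizes.
func (g *writeGate) do(size int, write func()) {
	if size >= largeThreshold {
		g.mu.Lock()
		defer g.mu.Unlock()
	} else {
		g.mu.RLock()
		defer g.mu.RUnlock()
	}
	write()
}

// maxOverlapDuringLarge runs one goroutine per size and reports the highest
// number of writers that were active while a large write was in progress.
func maxOverlapDuringLarge(sizes []int) int32 {
	var (
		g      writeGate
		active atomic.Int32
		worst  int32 // only touched while holding the exclusive lock
		wg     sync.WaitGroup
	)
	for _, size := range sizes {
		wg.Add(1)
		go func(size int) {
			defer wg.Done()
			g.do(size, func() {
				n := active.Add(1)
				defer active.Add(-1)
				if size >= largeThreshold && n > worst {
					worst = n
				}
			})
		}(size)
	}
	wg.Wait()
	return worst
}

func main() {
	sizes := []int{4 << 10, 2 << 20, 8 << 10, 3 << 20, 16 << 10}
	fmt.Println("max writers active during a large write:", maxOverlapDuringLarge(sizes))
}
```

Because a large write holds the exclusive lock, at most one large tensor buffer is in flight at a time, which bounds peak memory, while small writes still overlap freely between large ones.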

Add focused tests for remote request metadata, server/client quantization selection, packed quantized tensor preservation, local packed streaming, remote upload/create failures, and GGUF write concurrency.

Hardens MLX thread handling to enforce the bound-thread requirement.

Test results when creating from gemma-4-31b-it

| Flow | Wall Time | CLI Max RSS | Server Max RSS | Notes |
| --- | --- | --- | --- | --- |
| Safetensors local create, 4-bit quant | 29.78s | 7.4 GiB | N/A | Direct local path. Model size 18.85 GiB; 1195 manifest layers. |
| Safetensors remote create, 4-bit quant | 41.72s | 7.4 GiB | 68 MiB | Forced remote path with client-side quantization. Model size 18.85 GiB; 1195 manifest layers. |
| Safetensors remote create, 4-bit quant, server-side | 105.38s | 53 MiB | 7.46 GiB | Forced remote path with server-side quantization. Model size 18.85 GiB; 1195 manifest layers. |
| GGUF 31B, q4_k_m, dynamic memory budget final | 220.63s CLI / 2m50s server create | 47.0 MiB | 42.0 GiB | CPU saturated, no swaps, uses subset of free memory budget |
| GGUF create, 4-bit quant, unmodified main | - | - | 72.05 GiB | Upstream baseline for the same 31B model and q4_k_m; peak memory footprint was 79.84 GiB. |

Fixes #16070

Changed files

  • api/client.go (modified, +13/-0)
  • api/types.go (modified, +11/-0)
  • cmd/cmd.go (modified, +54/-5)
  • cmd/cmd_test.go (modified, +27/-0)
  • convert/reader_safetensors.go (modified, +1/-0)
  • discover/gpu_info_darwin.m (modified, +12/-4)
  • docs/api.md (modified, +39/-1)
  • fs/ggml/gguf.go (modified, +217/-7)
  • fs/ggml/gguf_test.go (modified, +599/-0)
  • integration/create_test.go (modified, +239/-81)
  • progress/bar.go (modified, +4/-0)
  • server/create.go (modified, +396/-1)
  • server/create_test.go (modified, +85/-0)
  • server/images.go (modified, +92/-54)
  • server/images_test.go (modified, +20/-0)
  • server/quantization.go (modified, +31/-1)
  • server/routes_create_test.go (modified, +551/-119)
  • server/sched_test.go (modified, +4/-4)
  • x/create/capabilities.go (added, +147/-0)
  • x/create/client/create.go (modified, +35/-169)
  • x/create/client/create_test.go (modified, +237/-15)
  • x/create/client/quantize.go (modified, +691/-176)
  • x/create/client/quantize_test.go (modified, +160/-3)
  • x/create/client/remote.go (added, +650/-0)
  • x/create/client/remote_test.go (added, +1024/-0)
  • x/create/create.go (modified, +475/-84)
  • x/create/create_test.go (modified, +2/-2)
  • x/create/gemma4.go (modified, +4/-22)
  • x/create/quantization_planner_test.go (added, +235/-0)
  • x/create/qwen35.go (modified, +6/-0)
  • x/internal/mlxtest/mlxtest.go (added, +87/-0)
  • x/mlxrunner/cache/cache_test.go (modified, +2/-3)
  • x/mlxrunner/cache/rotating_attention_test.go (modified, +125/-97)
  • x/mlxrunner/mlx/array_test.go (modified, +27/-27)
  • x/mlxrunner/mlx/compile_test.go (modified, +123/-123)
  • x/mlxrunner/mlx/io.go (modified, +1/-0)
  • x/mlxrunner/mlx/memory.go (modified, +5/-0)
  • x/mlxrunner/mlx/mlx.go (modified, +34/-0)
  • x/mlxrunner/mlx/stream.go (modified, +48/-24)
  • x/mlxrunner/mlx/thread_test.go (modified, +25/-0)
  • x/mlxrunner/model/embedding_test.go (modified, +2/-3)
  • x/mlxrunner/sample/sample_test.go (modified, +2/-3)
  • x/mlxrunner/server.go (modified, +1/-0)
  • x/models/gemma4/gemma4_moe_test.go (modified, +31/-19)
  • x/models/gemma4/gemma4_test.go (modified, +3/-3)
  • x/models/laguna/laguna_test.go (modified, +2/-3)
  • x/models/nn/nn_test.go (modified, +2/-3)
  • x/models/nn/sdpa_test.go (modified, +5/-4)
  • x/models/qwen3_5/qwen3_5_test.go (modified, +2/-3)
  • x/safetensors/extractor.go (modified, +156/-4)
  • x/safetensors/extractor_test.go (modified, +113/-0)


Summary

ollama create --experimental --quantize <any> panics in MLX during tensor quantization on macOS Apple Silicon. Regression bisected to v0.23.1. v0.23.0 works correctly with the same Modelfile and same input safetensors.

Environment

  • Ollama v0.23.2 (released 2026-05-07)
  • macOS, MacBook Air M3, 16 GB unified memory
  • Install via brew install --cask ollama-app

Bisection

| Version | Released | Result |
| --- | --- | --- |
| 0.23.0 | 2026-05-03 | ✅ Succeeds, produces valid 9.7 GB int4 model |
| 0.23.1 | 2026-05-05 | ❌ Panics |
| 0.23.2 | 2026-05-07 | ❌ Panics |

Same Modelfile and same safetensors input on each version.

Minimum reproduction (no DRAFT — bug is not DRAFT-specific)

FROM ./gemma-4-E4B-it

Input model: google/gemma-4-E4B-it (public on HF, BF16 safetensors, 15.0 GB).

ollama create test -f Modelfile --experimental --quantize int4

Actual

Safetensors import succeeds; panic occurs during quantization of the 2130 main tensors:

importing model.safetensors (2130 tensors, quantizing to int4) ⠧
panic: mlx: There is no Stream(gpu, 1) in current thread.
  at /Users/runner/work/ollama/ollama/build/metal-v3/_deps/mlx-c-src/mlx/c/transforms.cpp:73

goroutine 1 [running]:
github.com/ollama/ollama/x/mlxrunner/mlx.mlxCheck(...) /x/mlxrunner/mlx/mlx.go:68
github.com/ollama/ollama/x/mlxrunner/mlx.doEval(...) /x/mlxrunner/mlx/mlx.go:86
github.com/ollama/ollama/x/create/client.loadAndQuantizeArray(...) /x/create/client/quantize.go:106
github.com/ollama/ollama/x/create/client.quantizeTensor(...) /x/create/client/quantize.go:140
github.com/ollama/ollama/x/create/client.createQuantizedLayers(...) /x/create/client/create.go:365

Expected

Quantized model should be created. v0.23.0 produces a valid 9.7 GB int4 model from the same input.

Controls

  • All quantize formats panic identically: int4, int8, mxfp4, mxfp8, nvfp4
  • q4_K_M is rejected as unsupported; that rejection is separate from this panic
  • DRAFT removed from Modelfile: same panic — not DRAFT-specific
  • --quantize omitted (BF16): create succeeds (the resulting BF16 then OOMs at load on this 16 GB host, expected)
  • Ollama process restart, no models loaded: same panic
  • Pre-existing quantized gemma4:e4b (Q4) inference works fine on 0.23.2 → runtime path is OK; bug is in create-time quantize path through the MLX runner

Likely culprits in 0.23.1

The bisection narrows to one of three v0.23.1 changes:

  1. "Update MLX and MLX-C with threading fixes" — strong candidate; the panic is literally a thread-local MLX Stream missing
  2. Go 1.26 bump
  3. #15980 (Gemma4 MTP / DRAFT) — less likely, bug repros without DRAFT

Related (different code paths, for searchability)

  • #15775 — same panic string but on inference of MoE model (closed)
  • #15746 — open MLX MoE NVFP4 issue, different symptom

Notes

  • 16 GB host's BF16 OOM at load is unrelated. The data point is: v0.23.0 succeeds and v0.23.1 panics on the same host with same input.
  • Happy to commit-level bisect between v0.23.0 and v0.23.1, attach full server log, or repro on a smaller public safetensors model if helpful.
