ollama: Image generation crashes on NVIDIA Blackwell GPUs (RTX 5070): MLX-C rms_norm returns 0-dim array

GitHub stats
ollama/ollama#15531 · Fetched 2026-04-15 06:20:30
Comments: 0 · Participants: 1 · Timeline: 0 · Reactions: 0


Environment

  • Ollama version: v0.20.6
  • OS: Ubuntu 26.04 (kernel 7.0.0-13-generic)
  • GPU: NVIDIA GeForce RTX 5070 (Blackwell, sm_120, compute 12.0)
  • Driver: 580.142, CUDA 13.0
  • MLX lib: mlx_cuda_v13/libmlx.so — version 0.31.1-23-g38ad257, has native sm_120 code

Steps to Reproduce

ollama pull x/flux2-klein
ollama run x/flux2-klein "a red apple on a table"

Also reproducible with x/z-image-turbo.

Expected Behavior

Image generated and saved to current directory.

Actual Behavior

Error: 500 Internal Server Error: Post "http://127.0.0.1:<port>/completion": EOF

Server log shows the MLX runner panics:

runtime error: index out of range [0] with length 0
goroutine 66 [running]:
github.com/ollama/ollama/x/imagegen/models/qwen3.applyRoPEQwen3(...)
    x/imagegen/models/qwen3/text_encoder.go:47

The crash occurs at text_encoder.go:47 where x.Shape() returns an empty slice after mlx_fast_rms_norm produces a 0-dimensional array.

Root Cause Analysis

The MLX runner loads the model successfully (tokenizer ✓, text encoder ✓, transformer ✓, VAE ✓, 5.3 GB VRAM), starts listening, then panics on the first /completion request.

The call chain is:

  1. Attention.Forward() calls QNorm.Forward(q, 1e-6)
  2. Which calls mlx.RMSNorm(x, weight, eps), which in turn calls C.mlx_fast_rms_norm(&res, x.c, weight.c, eps, stream)
  3. The returned array has ndim=0 (empty shape)
  4. applyRoPEQwen3 then panics accessing shape[0] on the empty slice
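For reference, RMS normalization only rescales each vector along the last axis, so its output should have exactly the same shape as its input. The pure-Python sketch below is an illustration of that property, not the MLX or Ollama implementation; the helper names rms_norm and shape are hypothetical.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm over the last axis of a nested list (illustrative only)."""
    if not isinstance(x[0], list):
        # Innermost axis: scale by 1 / sqrt(mean(x^2) + eps), then by weight.
        mean_sq = sum(v * v for v in x) / len(x)
        inv_rms = 1.0 / math.sqrt(mean_sq + eps)
        return [v * inv_rms * w for v, w in zip(x, weight)]
    # Outer axes are passed through unchanged, so rank and shape are preserved.
    return [rms_norm(row, weight, eps) for row in x]

def shape(x):
    """Return the shape of a regular nested list as a tuple."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)

x = [[[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]]  # shape (1, 2, 4)
y = rms_norm(x, [1.0] * 4)
print(shape(x), shape(y))  # (1, 2, 4) (1, 2, 4)
```

A result with ndim=0, as reported in step 3, therefore cannot be a valid output for a 4-dimensional input.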

Key finding: The Python MLX package at the same version (0.31.1) works correctly on this GPU:

pip install "mlx[cuda13]"

import mlx.core as mx
x = mx.random.normal((1, 512, 32, 128))
weight = mx.ones((128,))
result = mx.fast.rms_norm(x, weight, eps=1e-6)
mx.eval(result)
print(result.shape)  # (1, 512, 32, 128) — correct!

The shipped libmlx.so (0.31.1-23-g38ad257, 23 commits ahead of release) has the bug. The pip release libmlx.so (0.31.1 clean) does not. The issue appears to be in the 23 extra commits in the Ollama fork.
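The version string of the shipped library looks like `git describe` output (tag, commits since the tag, abbreviated commit hash). Assuming that format, a short sketch like the following can confirm that two builds share a release tag but differ in extra commits. The function name parse_describe and the regular expression are illustrative, not part of Ollama or MLX.

```python
import re

# Assumed format: <tag>[-<commits ahead>-g<abbreviated sha>], as produced by `git describe`.
DESCRIBE_RE = re.compile(
    r"^v?(?P<tag>\d+\.\d+\.\d+)(?:-(?P<ahead>\d+)-g(?P<sha>[0-9a-f]+))?$"
)

def parse_describe(version):
    """Split a git-describe style version into tag, commits ahead, and commit."""
    m = DESCRIBE_RE.match(version)
    if m is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return {
        "tag": m["tag"],
        "commits_ahead": int(m["ahead"] or 0),
        "commit": m["sha"],  # None for a clean release tag
    }

shipped = parse_describe("0.31.1-23-g38ad257")
pip_release = parse_describe("0.31.1")
print(shipped)  # {'tag': '0.31.1', 'commits_ahead': 23, 'commit': '38ad257'}
print(shipped["tag"] == pip_release["tag"], shipped["commits_ahead"])  # True 23
```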

Additionally, these same image models work perfectly on the same GPU via ComfyUI (PyTorch CUDA), confirming the hardware and CUDA drivers are fine.

Workaround

We built the Linux port of the Ollama desktop app and worked around this by routing image generation through a local PyTorch/diffusers server instead of the broken MLX runner. The desktop app detects CapabilityImage models and calls a local server using diffusers.AutoPipelineForText2Image with enable_model_cpu_offload(). Same models, same GPU, works perfectly.
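The server code for this workaround is not included in the issue, so the following is only a minimal sketch of the diffusers side under stated assumptions. The function name generate_image is hypothetical, and model_id is a placeholder for whichever Hugging Face checkpoint corresponds to the Ollama model. Whether a given checkpoint loads through AutoPipelineForText2Image depends on the installed diffusers version.

```python
def generate_image(prompt, model_id, out_path="output.png"):
    """Generate one image with diffusers and save it (sketch, not the app's code)."""
    # Imports are local so this module loads even without torch/diffusers installed.
    import torch
    from diffusers import AutoPipelineForText2Image

    # model_id is an assumption: the issue does not name the checkpoint used.
    pipe = AutoPipelineForText2Image.from_pretrained(
        model_id, torch_dtype=torch.bfloat16
    )
    # Keep submodules on the CPU and move each to the GPU only while it runs,
    # which lowers peak VRAM use at some cost in speed.
    pipe.enable_model_cpu_offload()

    image = pipe(prompt).images[0]
    image.save(out_path)
    return out_path
```

Calling generate_image("a red apple on a table", model_id=...) would mirror the reproduction prompt, with the PyTorch CUDA backend taking the place of the MLX runner.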

Suggested Fix

Any of the following:

  1. Rebuild the shipped libmlx.so from the clean 0.31.1 release tag (the pip wheels work)
  2. Investigate what the 23 extra commits (38ad257) broke in the CUDA rms_norm kernel path
  3. Add a bounds check in applyRoPEQwen3 so it returns a meaningful error instead of panicking on 0-dim arrays
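The check in option 3 would live in the Go function applyRoPEQwen3, whose source is not shown here. As a language-neutral illustration of the idea, the Python sketch below validates the rank before any index into the shape is taken. The helper require_rank and its message format are hypothetical.

```python
def require_rank(shape, min_rank, name="array"):
    """Raise a descriptive error if an array has fewer dimensions than expected."""
    if len(shape) < min_rank:
        raise ValueError(
            f"{name}: expected at least {min_rank} dims, got shape {tuple(shape)}"
        )
    return tuple(shape)

# A 0-dim result, like the one reported from mlx_fast_rms_norm, has an empty shape.
try:
    require_rank((), 4, name="rms_norm output")
except ValueError as err:
    print(err)  # rms_norm output: expected at least 4 dims, got shape ()
```

A guard like this does not fix the underlying kernel problem. It only turns the index-out-of-range panic into an error that names the offending tensor.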

extent analysis

TL;DR

The most likely fix is to rebuild the shipped libmlx.so from the clean 0.31.1 release tag to resolve the issue with the rms_norm kernel path.

Guidance

  • Investigate the 23 extra commits (38ad257) in the Ollama fork to identify what broke the CUDA rms_norm kernel path.
  • Add a bounds check in applyRoPEQwen3 to return a meaningful error instead of panicking on 0-dim arrays as a temporary workaround.
  • Consider using the local PyTorch/diffusers server workaround used in the Ollama desktop app as an alternative solution.
  • Verify the fix by running the ollama pull and ollama run commands with the updated libmlx.so and checking for the expected image generation behavior.

Example

No code snippet is provided as the issue is related to a specific library version and commit history.

Notes

The issue appears to be specific to the Ollama fork of the MLX library, and the clean 0.31.1 release tag does not exhibit the same behavior. The pip release of the MLX library works correctly on the same GPU.

Recommendation

Rebuild the shipped libmlx.so from the clean 0.31.1 release tag. This is a fix rather than a workaround, and it is the most direct option because the pip wheels built from that tag already work on this GPU.
