ollama - Fix image generation crashes on NVIDIA Blackwell GPUs (RTX 5070): MLX-C rms_norm returns 0-dim array

ollama · 2026-04-13 04:17:02


GitHub issue: ollama/ollama#15531 (fetched 2026-04-15 06:20:30)


Author: johnohhh1

Participants: johnohhh1

Error Message

Error: 500 Internal Server Error: Post "http://127.0.0.1:<port>/completion": EOF

Root Cause

The MLX runner loads the model successfully (tokenizer ✓, text encoder ✓, transformer ✓, VAE ✓, 5.3 GB VRAM), starts listening, then panics on the first /completion request.

The call chain is:

1. Attention.Forward() calls QNorm.Forward(q, 1e-6).
2. QNorm.Forward calls mlx.RMSNorm(x, weight, eps), which calls C.mlx_fast_rms_norm(&res, x.c, weight.c, eps, stream).
3. The returned array has ndim=0 (an empty shape).
4. applyRoPEQwen3 then panics when it reads shape[0] from the empty shape slice.
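For reference, RMSNorm over the last axis divides each row by its root-mean-square and scales it element-wise by `weight`, i.e. y = x / sqrt(mean(x²) + eps) * weight. The operation never changes the array's shape, so a 0-dim result from mlx_fast_rms_norm points to a bug rather than a legitimate edge case. The pure-Python reference below is a sketch of that invariant under simplifying assumptions (2-D input only; `rms_norm_ref` and `shape_2d` are hypothetical helpers), not MLX's implementation.

```python
import math
from typing import List, Sequence


def rms_norm_ref(
    x: Sequence[Sequence[float]], weight: Sequence[float], eps: float
) -> List[List[float]]:
    """Hypothetical reference RMSNorm over the last axis of a 2-D input.

    y[i][j] = x[i][j] / sqrt(mean(x[i] ** 2) + eps) * weight[j]
    """
    out = []
    for row in x:
        if len(row) != len(weight):
            raise ValueError(f"last axis {len(row)} != weight size {len(weight)}")
        mean_sq = sum(v * v for v in row) / len(row)
        scale = 1.0 / math.sqrt(mean_sq + eps)
        out.append([v * scale * w for v, w in zip(row, weight)])
    return out


def shape_2d(a) -> tuple:
    """Shape of a rectangular nested list, e.g. [[1, 2]] -> (1, 2)."""
    return (len(a), len(a[0]) if a else 0)


x = [[3.0, 4.0], [1.0, -1.0]]
y = rms_norm_ref(x, [1.0, 1.0], eps=1e-6)
# The invariant the Go caller relies on: output shape == input shape.
assert shape_2d(y) == shape_2d(x)
```

The same invariant holds in higher dimensions: an input of shape (1, 512, 32, 128) should come back as (1, 512, 32, 128), which is exactly what the Python MLX check in this issue reports.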

Key finding: the Python MLX package at the same base version (0.31.1) works correctly on this GPU:

pip install "mlx[cuda13]"

import mlx.core as mx
x = mx.random.normal((1, 512, 32, 128))
weight = mx.ones((128,))
result = mx.fast.rms_norm(x, weight, eps=1e-6)
mx.eval(result)
print(result.shape)  # (1, 512, 32, 128) — correct!

The libmlx.so shipped with Ollama (version 0.31.1-23-g38ad257, i.e. 23 commits ahead of the 0.31.1 release) has the bug; the libmlx.so in the pip wheel (clean 0.31.1) does not. The regression therefore appears to come from the 23 extra commits in the Ollama fork.
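The shipped version string follows the `git describe` convention: the nearest tag, the number of commits since that tag, and `g` followed by the abbreviated commit hash. A small parser makes the comparison with the pip release explicit; `parse_describe` and `Describe` are hypothetical names, and the sketch assumes the string was produced by `git describe` (a tag that itself contains hyphens is still handled, because the suffix is matched from the right).

```python
import re
from typing import NamedTuple, Optional


class Describe(NamedTuple):
    tag: str              # nearest reachable tag, e.g. "0.31.1"
    commits_ahead: int    # number of commits on top of that tag
    sha: Optional[str]    # abbreviated commit hash, None for an exact tag


# "<tag>-<n>-g<sha>"; the greedy tag group backtracks to the last valid suffix.
_SUFFIX = re.compile(r"^(?P<tag>.+)-(?P<n>\d+)-g(?P<sha>[0-9a-f]+)$")


def parse_describe(version: str) -> Describe:
    """Split a `git describe`-style version string into its parts."""
    m = _SUFFIX.match(version)
    if m is None:
        # No "-N-gSHA" suffix: the build sits exactly on a tag.
        return Describe(version, 0, None)
    return Describe(m.group("tag"), int(m.group("n")), m.group("sha"))


shipped = parse_describe("0.31.1-23-g38ad257")  # libmlx.so bundled with Ollama
pip_release = parse_describe("0.31.1")          # clean release wheel
print(shipped)
print(pip_release)
```

Both builds share the base tag 0.31.1; only the shipped one carries the 23 extra commits ending at 38ad257, which is why those commits are the prime suspects.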

Additionally, the same image models work perfectly on the same GPU via ComfyUI (PyTorch CUDA), which indicates that the hardware and CUDA drivers are working.

Fix Action

Workaround

We built a Linux port of the Ollama desktop app and worked around the crash by routing image generation through a local PyTorch/diffusers server instead of the broken MLX runner. The desktop app detects models with CapabilityImage and sends those requests to a local server that uses diffusers.AutoPipelineForText2Image with enable_model_cpu_offload(). Same models, same GPU: it works perfectly.
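A minimal sketch of the generation step such a server might perform, assuming torch, diffusers and accelerate (needed by `enable_model_cpu_offload`) are installed and that the model is available under a diffusers-compatible model ID. The issue does not say how Ollama names like x/flux2-klein map to such IDs, whether AutoPipelineForText2Image supports every one of these models, or which dtype the reporter used; bfloat16 below is an assumption. `generate_image` is a hypothetical helper, and the HTTP server and CapabilityImage routing are not shown.

```python
def generate_image(model_id: str, prompt: str, out_path: str) -> str:
    """Hypothetical diffusers fallback: render `prompt` and save it to `out_path`."""
    # Imports are deferred so this module loads even where torch/diffusers are
    # not installed; they are only needed when a request is actually served.
    import torch
    from diffusers import AutoPipelineForText2Image

    pipe = AutoPipelineForText2Image.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # assumption: bf16 weights are appropriate
    )
    # Keep submodules on the CPU and move each one to the GPU only while it
    # runs, trading speed for lower peak VRAM.
    pipe.enable_model_cpu_offload()
    image = pipe(prompt=prompt).images[0]  # first PIL image in the output
    image.save(out_path)
    return out_path
```

Usage would look like `generate_image("<diffusers model id>", "a red apple on a table", "apple.png")`; the model ID is deliberately left as a placeholder.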


Environment

Ollama version: v0.20.6
OS: Ubuntu 26.04 (kernel 7.0.0-13-generic)
GPU: NVIDIA GeForce RTX 5070 (Blackwell, sm_120, compute 12.0)
Driver: 580.142, CUDA 13.0
MLX lib: mlx_cuda_v13/libmlx.so — version 0.31.1-23-g38ad257, has native sm_120 code

Steps to Reproduce

ollama pull x/flux2-klein
ollama run x/flux2-klein "a red apple on a table"

Also reproducible with x/z-image-turbo.

Expected Behavior

The image is generated and saved to the current directory.

Actual Behavior

Error: 500 Internal Server Error: Post "http://127.0.0.1:<port>/completion": EOF

The server log shows that the MLX runner panics:

runtime error: index out of range [0] with length 0
goroutine 66 [running]:
github.com/ollama/ollama/x/imagegen/models/qwen3.applyRoPEQwen3(...)
    x/imagegen/models/qwen3/text_encoder.go:47

The crash occurs at text_encoder.go:47, where x.Shape() returns an empty slice because mlx_fast_rms_norm produced a 0-dimensional array.


Suggested Fix

Possible fixes:

Rebuild the shipped libmlx.so from the clean 0.31.1 release tag (the pip wheels built from it work).
Investigate what the 23 extra commits (up to 38ad257) broke in the CUDA rms_norm kernel path.
Add a bounds check in applyRoPEQwen3 so it returns a meaningful error instead of panicking on 0-dim arrays (this hardens the runner but does not fix the underlying kernel bug).
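The bounds check could look like the sketch below. It is written in Python for consistency with the other snippets in this issue; the real change would be Go code in applyRoPEQwen3 (x/imagegen/models/qwen3/text_encoder.go). `check_rope_input` and `ShapeError` are hypothetical names, and `expected_ndim=4` is an assumption based on the 4-D shape used in the Python repro.

```python
from typing import Sequence


class ShapeError(ValueError):
    """Raised instead of an index-out-of-range crash on malformed inputs."""


def check_rope_input(shape: Sequence[int], expected_ndim: int = 4) -> None:
    """Validate an array shape before indexing shape[0], shape[1], ...

    Hypothetical guard mirroring the bounds check proposed for applyRoPEQwen3.
    """
    if len(shape) != expected_ndim:
        raise ShapeError(
            f"RoPE input must be {expected_ndim}-D, got {len(shape)}-D shape "
            f"{tuple(shape)}; the preceding rms_norm may have returned a 0-dim array"
        )


check_rope_input((1, 512, 32, 128))  # well-formed input passes silently
```

In Go, the equivalent is a `len(shape)` check that hands an error back to the caller before any indexing. The current signature of applyRoPEQwen3 is not shown in the issue, so how that error propagates to the /completion handler is left open.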

extent analysis

TL;DR

The most likely fix is to rebuild the shipped libmlx.so from the clean 0.31.1 release tag, which avoids the regression in the CUDA rms_norm path that appears to come from the fork's extra commits.

Guidance

Investigate the 23 extra commits (up to 38ad257) in the Ollama fork to identify what broke the CUDA rms_norm kernel path.
Add a bounds check in applyRoPEQwen3 so that it returns a meaningful error instead of panicking on 0-dim arrays; this turns the crash into a clear error but does not by itself make image generation work.
As an interim alternative, consider the reporter's workaround of routing image generation through a local PyTorch/diffusers server, as done in their Linux port of the Ollama desktop app.
Verify the fix by running the ollama pull and ollama run commands from Steps to Reproduce with the updated libmlx.so and confirming that an image is generated and saved.
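To narrow down the 23 commits, `git bisect run` can drive a check script: exit code 0 marks a commit good, 125 tells bisect to skip a commit that cannot be tested, and other codes from 1 to 127 mark it bad. The sketch below assumes the MLX Python bindings are rebuilt with CUDA from each candidate commit of the fork, so that `mx.fast.rms_norm` exercises the same kernel as the shipped libmlx.so; that rebuild step, the script name check_rms_norm.py and the `--bisect` flag are all assumptions, not part of the issue.

```python
import sys
from typing import Sequence

EXPECTED_SHAPE = (1, 512, 32, 128)  # same input shape as the reporter's repro


def verdict(output_shape: Sequence[int]) -> int:
    """Map the rms_norm output shape to a `git bisect run` exit code."""
    return 0 if tuple(output_shape) == EXPECTED_SHAPE else 1


def main() -> int:
    try:
        import mlx.core as mx  # bindings built from the commit under test
    except ImportError:
        return 125  # build or import failed: let bisect skip this commit
    x = mx.random.normal(EXPECTED_SHAPE)
    weight = mx.ones((EXPECTED_SHAPE[-1],))
    result = mx.fast.rms_norm(x, weight, eps=1e-6)
    mx.eval(result)
    # An uncaught exception above exits with status 1, which bisect treats as bad.
    return verdict(result.shape)


# Only run the check when invoked explicitly, e.g. by `git bisect run`.
if __name__ == "__main__" and "--bisect" in sys.argv[1:]:
    sys.exit(main())
```

A session might start with `git bisect start 38ad257 <clean-0.31.1-tag>` followed by `git bisect run python check_rms_norm.py --bisect`, with the rebuild step folded into a wrapper script if needed.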

Example

No code snippet is provided as the issue is related to a specific library version and commit history.

Notes

The issue appears to be specific to the Ollama fork of the MLX library: the pip release, built from the clean 0.31.1 tag, works correctly on the same GPU.

Recommendation

Rebuild the shipped libmlx.so from the clean 0.31.1 release tag; this is the most straightforward way to resolve the issue.
