ollama - 💡(How to fix) Fix Imported Qwen3-VL-8B GGUF + mmproj registers as vision-capable but crashes on first image request on Apple Silicon (exit status 2)

Error Message

post predict error="Post \"http://127.0.0.1:<port>/completion\": EOF"
llama runner terminated" error="exit status 2" HTTP/1.1 500 Internal Server Error {"error":"model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details"} HTTP/1.1 500 Internal Server Error {"error":{"message":"model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details","type":"api_error","param":null,"code":null}} time=2026-05-22T10:36:54.042-04:00 level=ERROR source=server.go:1654 msg="post predict" error="Post "http://127.0.0.1:64595/completion\": EOF" time=2026-05-22T10:36:54.042-04:00 level=ERROR source=server.go:316 msg="llama runner terminated" error="exit status 2" time=2026-05-22T10:37:13.646-04:00 level=ERROR source=server.go:316 msg="llama runner terminated" error="exit status 2" time=2026-05-22T10:37:13.646-04:00 level=ERROR source=server.go:1654 msg="post predict" error="Post "http://127.0.0.1:64608/completion\": EOF"

Code Example

FROM /path/to/Qwen_Qwen3-VL-8B-Instruct-Q8_0.gguf
FROM /path/to/mmproj-Qwen_Qwen3-VL-8B-Instruct-f16.gguf

PARAMETER num_ctx 32768
PARAMETER num_gpu 99
PARAMETER temperature 0.7
PARAMETER top_p 0.9

---

ollama create qwen3-vl-8b-instruct -f Modelfile

---

ollama show qwen3-vl-8b-instruct:latest

---

Model
  architecture        qwen3vl
  parameters          8.2B
  context length      262144
  embedding length    4096
  quantization        Q8_0

Capabilities
  completion
  vision

Projector
  architecture        clip
  parameters          576.39M
  embedding length    1152
  dimensions          4096

---

env OLLAMA_HOST=127.0.0.1:11434 /Applications/Ollama.app/Contents/Resources/ollama serve

---

curl http://127.0.0.1:11434/v1/models

---

IMG=$(base64 < test.png | tr -d '\n')
curl -sS -D - http://127.0.0.1:11434/api/chat \
  -H 'Content-Type: application/json' \
  -d "{\"model\":\"qwen3-vl-8b-instruct:latest\",\"messages\":[{\"role\":\"user\",\"content\":\"Describe this image briefly.\",\"images\":[\"$IMG\"]}],\"stream\":false}"

---

HTTP/1.1 500 Internal Server Error
{"error":"model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details"}

---

IMG=$(base64 < test.png | tr -d '\n')
curl -sS -D - http://127.0.0.1:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d "{\"model\":\"qwen3-vl-8b-instruct:latest\",\"messages\":[{\"role\":\"user\",\"content\":[{\"type\":\"text\",\"text\":\"Describe this image briefly.\"},{\"type\":\"image_url\",\"image_url\":{\"url\":\"data:image/png;base64,$IMG\"}}]}],\"stream\":false}"

---

HTTP/1.1 500 Internal Server Error
{"error":{"message":"model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details","type":"api_error","param":null,"code":null}}

---

time=2026-05-22T10:36:54.042-04:00 level=ERROR source=server.go:1654 msg="post predict" error="Post \"http://127.0.0.1:64595/completion\": EOF"
[GIN] 2026/05/22 - 10:36:54 | 500 |  4.553328875s | 127.0.0.1 | POST "/api/chat"
time=2026-05-22T10:36:54.042-04:00 level=ERROR source=server.go:316 msg="llama runner terminated" error="exit status 2"

---

time=2026-05-22T10:37:13.646-04:00 level=ERROR source=server.go:316 msg="llama runner terminated" error="exit status 2"
time=2026-05-22T10:37:13.646-04:00 level=ERROR source=server.go:1654 msg="post predict" error="Post \"http://127.0.0.1:64608/completion\": EOF"
[GIN] 2026/05/22 - 10:37:13 | 500 | 4.599645s | 127.0.0.1 | POST "/v1/chat/completions"

What is the issue?

On Apple Silicon / Metal, an imported Qwen3-VL-8B-Instruct GGUF + mmproj model registers successfully, is reported as vision-capable by ollama show, but the runner crashes on the first real image request.

I reproduced the crash through both API surfaces:

POST /api/chat
POST /v1/chat/completions

Both return HTTP 500, and the server logs show:

post predict error="Post \"http://127.0.0.1:<port>/completion\": EOF"
llama runner terminated" error="exit status 2"

This does not look like a bad import or a Merlin integration bug:

ollama create succeeds
/v1/models lists the model
ollama show qwen3-vl-8b-instruct:latest reports:
- architecture qwen3vl
- capability vision
- projector architecture clip
the same Ollama instance serves a separate text-only qwen3-coder-30b-a3b-instruct:latest model successfully

Environment

Ollama client/server version: 0.24.0
OS: macOS 26.5 (25F71)
Hardware: Apple Silicon M4 Max
Runtime: Metal

Model import

Modelfile used for import:

FROM /path/to/Qwen_Qwen3-VL-8B-Instruct-Q8_0.gguf
FROM /path/to/mmproj-Qwen_Qwen3-VL-8B-Instruct-f16.gguf

PARAMETER num_ctx 32768
PARAMETER num_gpu 99
PARAMETER temperature 0.7
PARAMETER top_p 0.9

Create command:

ollama create qwen3-vl-8b-instruct -f Modelfile

After import:

ollama show qwen3-vl-8b-instruct:latest

reported:

Model
  architecture        qwen3vl
  parameters          8.2B
  context length      262144
  embedding length    4096
  quantization        Q8_0

Capabilities
  completion
  vision

Projector
  architecture        clip
  parameters          576.39M
  embedding length    1152
  dimensions          4096

Reproduction

Start Ollama:

env OLLAMA_HOST=127.0.0.1:11434 /Applications/Ollama.app/Contents/Resources/ollama serve

Confirm model is listed:

curl http://127.0.0.1:11434/v1/models

Send a native vision request:

IMG=$(base64 < test.png | tr -d '\n')
curl -sS -D - http://127.0.0.1:11434/api/chat \
  -H 'Content-Type: application/json' \
  -d "{\"model\":\"qwen3-vl-8b-instruct:latest\",\"messages\":[{\"role\":\"user\",\"content\":\"Describe this image briefly.\",\"images\":[\"$IMG\"]}],\"stream\":false}"

Observed response:

HTTP/1.1 500 Internal Server Error
{"error":"model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details"}

Send the OpenAI-compatible vision request:

IMG=$(base64 < test.png | tr -d '\n')
curl -sS -D - http://127.0.0.1:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d "{\"model\":\"qwen3-vl-8b-instruct:latest\",\"messages\":[{\"role\":\"user\",\"content\":[{\"type\":\"text\",\"text\":\"Describe this image briefly.\"},{\"type\":\"image_url\",\"image_url\":{\"url\":\"data:image/png;base64,$IMG\"}}]}],\"stream\":false}"

Observed response:

HTTP/1.1 500 Internal Server Error
{"error":{"message":"model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details","type":"api_error","param":null,"code":null}}

Relevant log output

Server log excerpts from the two failing requests:

time=2026-05-22T10:36:54.042-04:00 level=ERROR source=server.go:1654 msg="post predict" error="Post \"http://127.0.0.1:64595/completion\": EOF"
[GIN] 2026/05/22 - 10:36:54 | 500 |  4.553328875s | 127.0.0.1 | POST "/api/chat"
time=2026-05-22T10:36:54.042-04:00 level=ERROR source=server.go:316 msg="llama runner terminated" error="exit status 2"

time=2026-05-22T10:37:13.646-04:00 level=ERROR source=server.go:316 msg="llama runner terminated" error="exit status 2"
time=2026-05-22T10:37:13.646-04:00 level=ERROR source=server.go:1654 msg="post predict" error="Post \"http://127.0.0.1:64608/completion\": EOF"
[GIN] 2026/05/22 - 10:37:13 | 500 | 4.599645s | 127.0.0.1 | POST "/v1/chat/completions"

The runner also emitted a fatal native crash dump immediately before those lines. I can add the full dump if useful, but the key observable is that both API surfaces trigger the same runner termination as soon as an image is actually processed.

Expected behavior

If the imported model is accepted, listed, and advertised as vision, then real image requests should execute successfully.

If this import shape is not actually supported, ollama create or model load should fail earlier and clearly instead of advertising a working vision model and then crashing on first use.

Related issues

This looks related to existing Qwen3-VL crash reports, but I did not find an exact Apple Silicon / Metal report for:

imported Qwen3-VL-8B-Instruct GGUF + mmproj
successful registration + vision capability detection
crash on first real image request through both /api/chat and /v1/chat/completions

Possibly related:

#13150
#13113
#15898

FAQ

Expected behavior

If the imported model is accepted, listed, and advertised as vision, then real image requests should execute successfully.

If this import shape is not actually supported, ollama create or model load should fail earlier and clearly instead of advertising a working vision model and then crashing on first use.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

ollama - 💡(How to fix) Fix Imported Qwen3-VL-8B GGUF + mmproj registers as vision-capable but crashes on first image request on Apple Silicon (exit status 2)

Recommended Tools

GitHub issue graph ai analysis

Error Message

Code Example

What is the issue?

Environment

Model import

Reproduction

Relevant log output

Expected behavior

Related issues

FAQ

Expected behavior

Still need to ship something?

TRENDING