ollama - 💡(How to fix) Fix Bug: `granite-4.0-1b-GGUF:Q4_K_M` crashes with assertion failure in `llama_sampler_dist_apply` [4 comments, 2 participants]

ollama · 2026-04-06 18:37:14

ollama/ollama#15369•Fetched 2026-04-08 03:01:28

Author: kndtran

Participants: gabe-l-hart, kndtran

Timeline (top): commented ×4, mentioned ×4, subscribed ×4, labeled ×1

Loading hf.co/ibm-granite/granite-4.0-1b-GGUF:Q4_K_M from HuggingFace crashes immediately on the first inference call with an assertion failure in llama-sampling.cpp. The model loads successfully (all layers offloaded), but the sampler aborts during the first token generation.

The BF16 version from the Ollama library (granite4:1b) works fine. Other quantizations of the same GGUF repo have not been tested.

Error Message

Actual: Error: 500 Internal Server Error: model runner has unexpectedly stopped

Root Cause

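Not yet determined. The assertion (found) in llama_sampler_dist_apply (llama-sampling.cpp:660) suggests the sampler cannot find an expected token in the probability distribution. The BF16 version from the Ollama library (granite4:1b) works fine; other quantizations of the same GGUF repo have not been tested.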

What is the issue?

cc: @gabe-l-hart

Description

The BF16 version from the Ollama library (granite4:1b) works fine. Other quantizations of the same GGUF repo have not been tested.

Reproduction

ollama run hf.co/ibm-granite/granite-4.0-1b-GGUF:Q4_K_M "Hello"

Expected: Model generates a response.
Actual: Error: 500 Internal Server Error: model runner has unexpectedly stopped

Environment

Ollama versions tested: 0.18.3, 0.20.0, 0.20.2 (latest as of 2026-04-06, all crash)
OS: macOS 26.3.1 (arm64)
Hardware: Apple M1 Max, 64 GB unified memory
Model: hf.co/ibm-granite/granite-4.0-1b-GGUF:Q4_K_M from https://huggingface.co/ibm-granite/granite-4.0-1b-GGUF

Crash Log

From ~/.ollama/logs/server.log:

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   160.78 MiB
load_tensors: Metal_Mapped model buffer size =   972.82 MiB
llama_context: constructing llama_context
llama_context: n_ctx         = 131072
llama_context: n_batch       = 512
llama_context: flash_attn    = auto
llama_kv_cache: size = 10240.00 MiB (131072 cells, 40 layers, 1/1 seqs)
llama_context: Flash Attention was auto, set to enabled
time=2026-04-04T08:50:42.148-07:00 level=INFO source=server.go:1390 msg="llama runner started in 1.42 seconds"
Assertion failed: (found), function llama_sampler_dist_apply, file llama-sampling.cpp, line 660.
SIGABRT: abort
PC=0x196b8d5b0 m=7 sigcode=0
signal arrived during cgo execution

The Go stack trace shows the crash originates in:

github.com/ollama/ollama/llama._Cfunc_common_sampler_csample(0x1056e1110, 0x730c9db00, 0x1e)
    _cgo_gotypes.go:425
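
To pull the assertion details out of a longer server.log, a small parser can help. The following is a minimal sketch, assuming the assertion line has exactly the format shown in the crash log above; the name find_assertions and the regular expression are illustrative and not part of Ollama.

```python
import re

# Matches lines such as:
#   Assertion failed: (found), function llama_sampler_dist_apply, file llama-sampling.cpp, line 660.
# Assumption: the format is the one seen in this issue's log; other platforms may print it differently.
ASSERT_RE = re.compile(
    r"Assertion failed: \((?P<expr>.+?)\), function (?P<func>\w+), "
    r"file (?P<file>[^,]+), line (?P<line>\d+)\."
)


def find_assertions(log_text):
    """Return one dict per assertion failure found in the log text."""
    return [
        {
            "expr": m["expr"],
            "function": m["func"],
            "file": m["file"],
            "line": int(m["line"]),
        }
        for m in ASSERT_RE.finditer(log_text)
    ]
```

Running find_assertions on the contents of ~/.ollama/logs/server.log would list each failed assertion with its function, file, and line, which makes it easier to check whether different Ollama versions crash at the same place.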

Notes

The model loads and the runner starts successfully. The crash occurs on the first sampling call, not during model loading.
The assertion (found) in llama_sampler_dist_apply (llama-sampling.cpp:660) suggests the sampler cannot find an expected token in the probability distribution.
granite4:1b from the Ollama library (BF16, 1.6B params, same architecture) works correctly.
Other Granite 4.0 GGUF models from HuggingFace work fine: granite-4.0-micro-GGUF (3.4B) and granite-4.0-350m-GGUF (0.4B), all quantizations (Q4_K_M, Q5_K_M, Q8_0, F16).
The issue appears specific to the granite-4.0-1b-GGUF Q4_K_M file, not to the Ollama version or hardware.
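
One way a "token not found" condition can arise in a distribution sampler is sketched below. This is an illustration under stated assumptions, not llama.cpp's code: it assumes a sampler that walks the cumulative probabilities until they reach a random target, and it assumes NaN logits as the trigger. Nothing in this issue confirms that NaN values are the actual cause. The names softmax and sample_index are hypothetical.

```python
import math


def softmax(logits):
    """Convert logits to probabilities; a single NaN logit makes every output NaN."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(probs, rnd):
    """Pick the first index whose cumulative probability reaches the target.

    rnd is a random number in [0, 1). Returns None when no index qualifies,
    which is the situation a `found` assertion would guard against.
    """
    target = rnd * sum(probs)
    running = 0.0
    for i, p in enumerate(probs):
        running += p
        # Any comparison against NaN is False, so this never fires for NaN input.
        if running >= target:
            return i
    return None


healthy = sample_index([0.1, 0.2, 0.7], 0.25)
broken = sample_index(softmax([1.0, float("nan"), 2.0]), 0.25)
print(healthy, broken)
```

With well-formed probabilities the loop always selects an index. Once a NaN enters the distribution, every comparison fails and the loop finishes without a selection, so checking a model's logits for NaN or infinite values is a reasonable first debugging step.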

extent analysis

TL;DR

The model runner aborts on an assertion failure in llama_sampler_dist_apply (llama-sampling.cpp) during the first sampling call. The root cause is not yet known; the crash can be avoided by using a different model or, possibly, a different quantization.

Guidance

The crash occurs on the first sampling call, not during model loading, which points to the sampler or to the probability distribution the model produces rather than to the loader.
The assertion (found) in llama_sampler_dist_apply implies that the sampler cannot find an expected token in the probability distribution, which may be related to this specific quantized file.
Trying a different quantization from the same repo, such as Q5_K_M or F16, may avoid the crash. This is untested for granite-4.0-1b-GGUF; the suggestion rests on those quantizations working for other Granite 4.0 GGUF models.
Using granite4:1b from the Ollama library, which has the same architecture but ships BF16 weights, is the confirmed working alternative.

Example

No code change is required: the issue is tied to a specific model file and quantization, so the workaround is to switch models rather than to modify code.

Notes

The issue appears to be specific to the granite-4.0-1b-GGUF Q4_K_M file rather than to the Ollama version or hardware. The root cause cannot be determined without further debugging.

Recommendation

Apply a workaround by using a different model or quantization, such as granite4:1b or a different Granite 4.0 GGUF model with a working quantization. This may resolve the issue until the root cause can be determined and fixed.
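
For scripted use, the workaround can be expressed as a fallback: try the preferred model and move to the next one when the server answers with HTTP 500. The following is a minimal sketch, assuming a local Ollama server on its default port and the /api/generate endpoint with streaming disabled; the helper names send_generate and generate_with_fallback are hypothetical.

```python
import json
import urllib.error
import urllib.request

# Assumption: default local Ollama endpoint; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def send_generate(model, prompt, url=OLLAMA_URL):
    """Send one non-streaming generate request and return the response text."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]


def generate_with_fallback(models, prompt, send=send_generate):
    """Try each model in order; skip a model only when it fails with HTTP 500."""
    failed = {}
    for model in models:
        try:
            return model, send(model, prompt)
        except urllib.error.HTTPError as exc:
            if exc.code != 500:
                raise  # other errors (e.g. 404 for an unknown model) are not this bug
            failed[model] = exc.code
    raise RuntimeError(f"all models failed: {failed}")
```

A call such as generate_with_fallback(["hf.co/ibm-granite/granite-4.0-1b-GGUF:Q4_K_M", "granite4:1b"], "Hello") would return the name of the model that answered together with its output. The send parameter is injectable so the fallback logic can be exercised without a running server.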
