ollama - 💡(How to fix) Fix Bug: `granite-4.0-1b-GGUF:Q4_K_M` crashes with assertion failure in `llama_sampler_dist_apply` [4 comments, 2 participants]

GitHub stats
ollama/ollama#15369 · Fetched 2026-04-08 03:01:28
Comments: 4 · Participants: 2 · Timeline events: 13 · Reactions: 0
Timeline (top): commented ×4, mentioned ×4, subscribed ×4, labeled ×1

Loading hf.co/ibm-granite/granite-4.0-1b-GGUF:Q4_K_M from HuggingFace crashes immediately on the first inference call with an assertion failure in llama-sampling.cpp. The model loads successfully (all layers offloaded), but the sampler aborts during the first token generation.

The BF16 version from the Ollama library (granite4:1b) works fine. Other quantizations of the same GGUF repo have not been tested.

Error Message

Actual: Error: 500 Internal Server Error: model runner has unexpectedly stopped

Code Example

ollama run hf.co/ibm-granite/granite-4.0-1b-GGUF:Q4_K_M "Hello"

---

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   160.78 MiB
load_tensors: Metal_Mapped model buffer size =   972.82 MiB
llama_context: constructing llama_context
llama_context: n_ctx         = 131072
llama_context: n_batch       = 512
llama_context: flash_attn    = auto
llama_kv_cache: size = 10240.00 MiB (131072 cells, 40 layers, 1/1 seqs)
llama_context: Flash Attention was auto, set to enabled
time=2026-04-04T08:50:42.148-07:00 level=INFO source=server.go:1390 msg="llama runner started in 1.42 seconds"
Assertion failed: (found), function llama_sampler_dist_apply, file llama-sampling.cpp, line 660.
SIGABRT: abort
PC=0x196b8d5b0 m=7 sigcode=0
signal arrived during cgo execution

---

github.com/ollama/ollama/llama._Cfunc_common_sampler_csample(0x1056e1110, 0x730c9db00, 0x1e)
    _cgo_gotypes.go:425

---

What is the issue?

cc: @gabe-l-hart

Description

Loading hf.co/ibm-granite/granite-4.0-1b-GGUF:Q4_K_M from HuggingFace crashes immediately on the first inference call with an assertion failure in llama-sampling.cpp. The model loads successfully (all layers offloaded), but the sampler aborts during the first token generation.

The BF16 version from the Ollama library (granite4:1b) works fine. Other quantizations of the same GGUF repo have not been tested.

Reproduction

ollama run hf.co/ibm-granite/granite-4.0-1b-GGUF:Q4_K_M "Hello"

Expected: Model generates a response.
Actual: Error: 500 Internal Server Error: model runner has unexpectedly stopped

Environment

Crash Log

From ~/.ollama/logs/server.log:

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   160.78 MiB
load_tensors: Metal_Mapped model buffer size =   972.82 MiB
llama_context: constructing llama_context
llama_context: n_ctx         = 131072
llama_context: n_batch       = 512
llama_context: flash_attn    = auto
llama_kv_cache: size = 10240.00 MiB (131072 cells, 40 layers, 1/1 seqs)
llama_context: Flash Attention was auto, set to enabled
time=2026-04-04T08:50:42.148-07:00 level=INFO source=server.go:1390 msg="llama runner started in 1.42 seconds"
Assertion failed: (found), function llama_sampler_dist_apply, file llama-sampling.cpp, line 660.
SIGABRT: abort
PC=0x196b8d5b0 m=7 sigcode=0
signal arrived during cgo execution

The Go stack trace shows the crash originates in:

github.com/ollama/ollama/llama._Cfunc_common_sampler_csample(0x1056e1110, 0x730c9db00, 0x1e)
    _cgo_gotypes.go:425
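
When triaging a longer server.log, it can help to pull out the assertion location automatically. The following is a minimal sketch, assuming the assertion line always has the shape seen in the log above ("Assertion failed: (expr), function NAME, file FILE, line N."). The function name `find_assertions` is hypothetical and not part of Ollama or llama.cpp.

```python
import re

# Matches the assertion format seen in the crash log above.
# Assumption: other platforms or compilers may print a different format.
ASSERT_RE = re.compile(
    r"Assertion failed: \((?P<expr>.+)\), function (?P<func>\w+), "
    r"file (?P<file>[^,]+), line (?P<line>\d+)\."
)


def find_assertions(log_text):
    """Return one dict per 'Assertion failed' line found in log_text."""
    hits = []
    for match in ASSERT_RE.finditer(log_text):
        hit = match.groupdict()
        hit["line"] = int(hit["line"])  # line number as an integer
        hits.append(hit)
    return hits


if __name__ == "__main__":
    sample = (
        "Assertion failed: (found), function llama_sampler_dist_apply, "
        "file llama-sampling.cpp, line 660.\n"
        "SIGABRT: abort\n"
    )
    for hit in find_assertions(sample):
        print(hit["func"], hit["file"], hit["line"])
```

Passing the contents of ~/.ollama/logs/server.log to `find_assertions` would list every assertion recorded there, which makes it easier to check whether repeated runs fail at the same location.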

Notes

  • The model loads and the runner starts successfully. The crash occurs on the first sampling call, not during model loading.
  • The assertion (found) in llama_sampler_dist_apply (llama-sampling.cpp:660) suggests the sampler cannot find an expected token in the probability distribution.
  • granite4:1b from the Ollama library (BF16, 1.6B params, same architecture) works correctly.
  • Other Granite 4.0 GGUF models from HuggingFace work fine: granite-4.0-micro-GGUF (3.4B) and granite-4.0-350m-GGUF (0.4B), all quantizations (Q4_K_M, Q5_K_M, Q8_0, F16).
  • The issue is specific to the granite-4.0-1b-GGUF GGUF file, not Ollama version or hardware.

Relevant log output

OS

No response

GPU

No response

CPU

No response

Ollama version

No response

extent analysis

TL;DR

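The model runner aborts on an assertion failure in llama_sampler_dist_apply (llama-sampling.cpp, line 660) during the first sampling call. The underlying cause has not been determined; until it is, the crash can be avoided by using a different model or quantization, such as granite4:1b from the Ollama library.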

Guidance

  • The crash occurs on the first sampling call, not during model loading, suggesting an issue with the sampler or the model's probability distribution.
  • The assertion (found) in llama_sampler_dist_apply implies that the sampler cannot find an expected token in the probability distribution, which may be related to the specific quantization or model architecture.
  • Trying a different quantization from the same repo, such as Q5_K_M or F16, may avoid the crash, but this is untested: the report only confirms those quantizations for other Granite 4.0 GGUF models (granite-4.0-micro-GGUF and granite-4.0-350m-GGUF).
  • Using granite4:1b from the Ollama library, which has the same architecture but ships BF16 weights, is reported to work.
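
To make the "cannot find an expected token" reading concrete, here is a simplified sketch of sampling from a probability distribution by walking its cumulative sum. It is not the llama.cpp source, and the name `sample_dist` is hypothetical. It also assumes one possible trigger, NaN logits, purely for illustration; the issue does not confirm that this is what happens with the Q4_K_M file.

```python
import math
import random


def sample_dist(logits, rng=random):
    """Pick a token index by inverse-CDF sampling over softmax(logits).

    Illustrative only: a simplified stand-in for a 'dist' sampler.
    """
    # Subtract the largest non-NaN logit for numerical stability.
    peak = max((x for x in logits if not math.isnan(x)), default=0.0)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    probs = [w / total for w in weights]

    r = rng.random()  # uniform draw in [0, 1)
    cumulative = 0.0
    found = None
    for index, p in enumerate(probs):
        cumulative += p
        if cumulative >= r:  # always False once cumulative is NaN
            found = index
            break

    # Assumed analogue of the 'found' assertion in the crash log.
    assert found is not None, "no token found in distribution"
    return found


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_dist([1.0, 2.0, 3.0], rng))  # a valid index
    try:
        sample_dist([1.0, float("nan"), 3.0], rng)
    except AssertionError as err:
        print("assertion:", err)
```

A single NaN logit makes every probability NaN after normalization, so the cumulative comparison never succeeds and the search ends without selecting a token. Whether corrupted or mis-dequantized weights in this particular GGUF produce such values would need to be checked by inspecting the logits directly.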

Example

No code change is needed on the user side: the crash is tied to a specific model file and quantization, so the workaround is to run a different model or quantization rather than to modify any code.

Notes

According to the reporter, the issue is specific to the granite-4.0-1b-GGUF file and is not related to the Ollama version or hardware, although the OS, GPU, CPU, and Ollama version fields of the report were left blank. Without further debugging, the root cause remains undetermined.

Recommendation

Apply a workaround by using a different model or quantization, such as granite4:1b or a different Granite 4.0 GGUF model with a working quantization. This may resolve the issue until the root cause can be determined and fixed.
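
The workaround can be scripted as a simple fallback: try the preferred model first and move on to the next candidate if the request fails. This is a minimal sketch, assuming a local Ollama server on the default port (11434) and that each candidate model has already been pulled. The helper names `generate` and `first_working_model` are hypothetical; the endpoint and request fields follow Ollama's documented /api/generate API.

```python
import json
import urllib.error
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def generate(model, prompt, url=OLLAMA_URL, timeout=300):
    """Send one non-streaming generate request and return the reply text."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


def first_working_model(candidates, prompt, generate_fn=generate):
    """Return (model, text) for the first candidate that answers, else (None, None)."""
    for model in candidates:
        try:
            return model, generate_fn(model, prompt)
        except (urllib.error.URLError, OSError, KeyError) as err:
            # An HTTP 500 from a crashed runner surfaces as urllib.error.HTTPError,
            # which is a subclass of URLError.
            print(f"{model}: failed ({err})")
    return None, None


if __name__ == "__main__":
    candidates = [
        "hf.co/ibm-granite/granite-4.0-1b-GGUF:Q4_K_M",  # crashes per this issue
        "granite4:1b",  # BF16 build reported to work
    ]
    model, text = first_working_model(candidates, "Hello")
    print(model, text)
```

The `generate_fn` parameter exists so the fallback logic can be exercised without a running server, for example by passing a stub that raises for the failing model.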
