ollama - ✅ (Solved) Fix: When flash attention is not supported, the quantized KV cache should be disregarded instead of aborting the model run. [1 pull request, 4 comments, 2 participants]

GitHub stats
ollama/ollama#15043 (fetched 2026-04-08 01:26:36)
Comments: 4
Participants: 2
Timeline events: 8
Reactions: 0
Timeline (top): commented ×4, closed ×1, cross-referenced ×1, labeled ×1

Root Cause

When OLLAMA_FLASH_ATTENTION=1 is set and an incompatible model is used, flash_attn is disabled automatically, which is reasonable behaviour. If OLLAMA_KV_CACHE_TYPE=q8_0 is also set, the requested cache quantization is kept even though flash_attn has been turned off, and ollama panics because V cache quantization requires flash_attn.

Fix Action

Fixed

PR fix notes

PR #15050: ggml: force flash attention off for grok

Description (problem / solution / changelog)

By default, ollama supports flash attention (FA) for grok, but llama.cpp disables it. If KV cache quantization has been set, this mismatch causes the runner to crash.

Fixes: #15043

Changed files

  • fs/ggml/ggml.go (modified, +1/-1)
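The one-line change itself is not shown here. As a rough illustration of the idea in the PR title, a per-architecture gate could look like the sketch below. The deny-list, the helper name supportsFlashAttention and the "llama" example are assumptions made for illustration, not code taken from fs/ggml/ggml.go.

```go
package main

import "fmt"

// flashAttentionDenied lists architectures for which flash attention is
// forced off. The map and its single entry are illustrative assumptions
// based on the PR title, not a copy of the ollama source.
var flashAttentionDenied = map[string]bool{
	"grok": true,
}

// supportsFlashAttention is a hypothetical helper: it reports whether the
// caller should treat flash attention as available for an architecture.
func supportsFlashAttention(arch string) bool {
	return !flashAttentionDenied[arch]
}

func main() {
	for _, arch := range []string{"grok", "llama"} {
		fmt.Printf("%s: flash attention allowed = %v\n", arch, supportsFlashAttention(arch))
	}
}
```

Deciding this before the runner starts means the KV cache type can be chosen with the same answer that llama.cpp will reach, instead of discovering the mismatch at context creation.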


What is the issue?

With OLLAMA_FLASH_ATTENTION=1 when an incompatible model is used, flash_attn is automatically disabled. This is reasonable behaviour. When OLLAMA_KV_CACHE_TYPE=q8_0 is also set, but flash_attn was auto-disabled due to incompatibility, ollama panics because v cache quantization requires flash_attn.

What should happen: When flash_attn is not enabled for any reason, V cache quantization should be automatically disabled for that model and the setting ignored, so that it does not have to be turned off globally.

Relevant log output

Mar 24 21:17:29 jarvis ollama[2742822]: llama_init_from_model: flash_attn is not compatible with Grok - forcing off
Mar 24 21:17:29 jarvis ollama[2742822]: llama_init_from_model: V cache quantization requires flash_attn
Mar 24 21:17:29 jarvis ollama[2742822]: panic: unable to create llama context
Mar 24 21:17:29 jarvis ollama[2742822]: goroutine 221 [running]:
Mar 24 21:17:29 jarvis ollama[2742822]: github.com/ollama/ollama/runner/llamarunner.(*Server).loadModel(0xc0005943c0, {{0xc000596540, 0x2, 0x2}, 0xe, 0x0, 0x1, {0xc000596528, 0x2, 0x2}, ...}, ...)
Mar 24 21:17:29 jarvis ollama[2742822]: #011/builddir/build/BUILD/ollama-0.17.7/runner/llamarunner/runner.go:849 +0x333
Mar 24 21:17:29 jarvis ollama[2742822]: created by github.com/ollama/ollama/runner/llamarunner.(*Server).load in goroutine 227
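The two llama_init_from_model lines in the log above are what identify this failure. A small scanner such as the following sketch could confirm whether a captured log shows the same pattern. The function name matchesCrashSignature is hypothetical, and the matched substrings are copied from the excerpt, so they may differ in other versions.

```go
package main

import (
	"fmt"
	"strings"
)

// Substrings copied from the log excerpt in the issue.
const (
	faForcedOff   = "forcing off"
	vCacheNeedsFA = "V cache quantization requires flash_attn"
)

// matchesCrashSignature reports whether logText contains both messages:
// flash attention being forced off and the V cache complaint.
func matchesCrashSignature(logText string) bool {
	sawForcedOff, sawVCache := false, false
	for _, line := range strings.Split(logText, "\n") {
		if strings.Contains(line, "flash_attn") && strings.Contains(line, faForcedOff) {
			sawForcedOff = true
		}
		if strings.Contains(line, vCacheNeedsFA) {
			sawVCache = true
		}
	}
	return sawForcedOff && sawVCache
}

func main() {
	sample := "llama_init_from_model: flash_attn is not compatible with Grok - forcing off\n" +
		"llama_init_from_model: V cache quantization requires flash_attn"
	fmt.Println(matchesCrashSignature(sample))
}
```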

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.17.7

extent analysis

Fix Plan

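PR #15050 takes a narrow route: per its description, it forces flash attention off for grok so that ollama no longer assumes support that llama.cpp disables. A more general fix, matching the behaviour the issue asks for, is to disable V cache quantization automatically whenever flash attention is not enabled:

  • Before creating the llama context, check whether flash attention is actually enabled for the model, not merely requested.
  • If it is not, disable V cache quantization and log that the requested setting was ignored.

Example code changes (Go-style sketch; the variable names are illustrative and not taken from the ollama source):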

if !flashAttnEnabled {
    // Disable v cache quantization
    vCacheQuantizationEnabled = false
    log.Println("V cache quantization disabled due to flash attention incompatibility")
}

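The fallback must run before the llama context is created, so that the combination llama.cpp rejects is never passed down. A defensive check at that point should downgrade rather than exit, since aborting is exactly the behaviour the issue reports:

if vCacheQuantizationEnabled && !flashAttnEnabled {
    // Should be unreachable after the fallback; recover instead of aborting.
    log.Println("V cache quantization requires flash attention; falling back to an unquantized cache")
    vCacheQuantizationEnabled = false
}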

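If the cache type is tracked as a string rather than a flag, reset it in the same place. An empty string is assumed here to select the default, unquantized cache type:

if !flashAttnEnabled {
    // Ignore the requested quantized type and use the default.
    vCacheQuantizationType = ""
}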
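The fragments above can be folded into one self-contained function. This is a minimal sketch under stated assumptions: effectiveKVCacheType and defaultKVCacheType are hypothetical names, and "f16" is assumed to be the unquantized default. It is not the change made in the PR.

```go
package main

import (
	"fmt"
	"log"
)

// defaultKVCacheType is assumed to be the unquantized cache type.
const defaultKVCacheType = "f16"

// effectiveKVCacheType returns the cache type to use for a model load.
// A quantized type is honoured only when flash attention is really on;
// otherwise the request is logged and ignored instead of causing a crash.
func effectiveKVCacheType(flashAttnEnabled bool, requested string) string {
	if requested == "" || requested == defaultKVCacheType {
		return defaultKVCacheType
	}
	if !flashAttnEnabled {
		log.Printf("flash attention is off; ignoring quantized KV cache type %q", requested)
		return defaultKVCacheType
	}
	return requested
}

func main() {
	fmt.Println(effectiveKVCacheType(true, "q8_0"))  // flash attention on
	fmt.Println(effectiveKVCacheType(false, "q8_0")) // flash attention off
}
```

Returning a value instead of mutating shared flags keeps the decision in one place, so later code cannot see a quantized type alongside disabled flash attention.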

Verification

To verify the fix, load a model in each of the following configurations and confirm that it loads without the "unable to create llama context" panic:

  • OLLAMA_FLASH_ATTENTION=1 and OLLAMA_KV_CACHE_TYPE=q8_0 with a model that is not compatible with flash attention (grok in the report): V cache quantization should be disabled.
  • OLLAMA_FLASH_ATTENTION=0 and OLLAMA_KV_CACHE_TYPE=q8_0: V cache quantization should be disabled.
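The scenarios listed above can be written down as a small table-driven check. In this sketch, resolveCacheType is a stand-in for the behaviour under test rather than ollama code, modelSupportsFA is a hypothetical input, "f16" is assumed to be the unquantized default, and the third row is a control case added for contrast.

```go
package main

import (
	"fmt"
	"strconv"
)

// scenario describes one configuration to verify. modelSupportsFA is a
// hypothetical input standing in for the model compatibility check.
type scenario struct {
	flashAttentionEnv string // value of OLLAMA_FLASH_ATTENTION
	kvCacheTypeEnv    string // value of OLLAMA_KV_CACHE_TYPE
	modelSupportsFA   bool
	wantCacheType     string // cache type expected after the fix
}

// resolveCacheType is a stand-in for the behaviour under test: a quantized
// type survives only when flash attention is both requested and supported.
func resolveCacheType(s scenario) string {
	requested, err := strconv.ParseBool(s.flashAttentionEnv)
	flashAttn := err == nil && requested && s.modelSupportsFA
	if !flashAttn || s.kvCacheTypeEnv == "" {
		return "f16"
	}
	return s.kvCacheTypeEnv
}

var scenarios = []scenario{
	{"1", "q8_0", false, "f16"}, // incompatible model
	{"0", "q8_0", true, "f16"},  // flash attention switched off
	{"1", "q8_0", true, "q8_0"}, // control case, added for contrast
}

// failures returns one message per scenario whose result differs from
// the expected cache type.
func failures(cases []scenario) []string {
	var out []string
	for _, s := range cases {
		if got := resolveCacheType(s); got != s.wantCacheType {
			out = append(out, fmt.Sprintf("FA=%s type=%s: got %s, want %s",
				s.flashAttentionEnv, s.kvCacheTypeEnv, got, s.wantCacheType))
		}
	}
	return out
}

func main() {
	fmt.Println("failing scenarios:", len(failures(scenarios)))
}
```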

Extra Tips

  • Make sure to update the documentation to reflect the changes to the v cache quantization behavior.
  • Consider adding additional logging to help diagnose issues related to flash attention and v cache quantization.
