ollama - ✅(Solved) Fix When flash attention is not supported, quantized KV cache should be disregarded instead of aborting the model run. [1 pull requests, 4 comments, 2 participants]

ollama · 2026-03-24 19:41:59

GitHub stats

ollama/ollama#15043 • Fetched 2026-04-08 01:26:36

Author: gordan-bobic
Participants: gordan-bobic, rick-github
Timeline (top): commented ×4, closed ×1, cross-referenced ×1, labeled ×1

Root Cause

When OLLAMA_FLASH_ATTENTION=1 is set and the loaded model is incompatible with flash attention, flash_attn is disabled automatically. This is reasonable behaviour. If OLLAMA_KV_CACHE_TYPE=q8_0 is also set, however, the quantized cache type is still requested after flash_attn has been auto-disabled, and ollama panics because V cache quantization requires flash_attn.

Fix Action

Fixed

Fixed by PR: ggml: force flash attention off for grok (https://github.com/ollama/ollama/pull/15050)

PR fix notes

PR #15050: ggml: force flash attention off for grok

Repository: ollama/ollama
Author: rick-github
State: open | merged: False
Link: https://github.com/ollama/ollama/pull/15050

Description (problem / solution / changelog)

By default, ollama supports FA for grok, but llama.cpp disables it. If KV cache quantization has been set this causes the runner to crash.
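The mismatch described here is two layers disagreeing about whether an architecture supports flash attention. A minimal Go sketch of one way to keep them in agreement, assuming a hypothetical deny list and helper name: neither is taken from fs/ggml/ggml.go, and the actual one-line change in the PR is not shown on this page.

```go
package main

import (
	"fmt"
	"slices"
)

// noFlashAttnArchs is a hypothetical deny list of architectures for which
// flash attention is treated as unsupported. "grok" comes from the PR title;
// the list itself is an assumption made for illustration.
var noFlashAttnArchs = []string{"grok"}

// supportsFlashAttention is a hypothetical helper that reports whether flash
// attention may be requested for the given architecture name.
func supportsFlashAttention(arch string) bool {
	return !slices.Contains(noFlashAttnArchs, arch)
}

func main() {
	for _, arch := range []string{"llama", "grok"} {
		fmt.Printf("%s: flash attention supported = %v\n", arch, supportsFlashAttention(arch))
	}
}
```

With a check of this kind made before the runner starts, a quantized KV cache type would not be requested for a model on which flash attention is going to be forced off.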

Fixes: #15043

Changed files

fs/ggml/ggml.go (modified, +1/-1)

Code Example

Mar 24 21:17:29 jarvis ollama[2742822]: llama_init_from_model: flash_attn is not compatible with Grok - forcing off
Mar 24 21:17:29 jarvis ollama[2742822]: llama_init_from_model: V cache quantization requires flash_attn
Mar 24 21:17:29 jarvis ollama[2742822]: panic: unable to create llama context
Mar 24 21:17:29 jarvis ollama[2742822]: goroutine 221 [running]:
Mar 24 21:17:29 jarvis ollama[2742822]: github.com/ollama/ollama/runner/llamarunner.(*Server).loadModel(0xc0005943c0, {{0xc000596540, 0x2, 0x2}, 0xe, 0x0, 0x1, {0xc000596528, 0x2, 0x2}, ...}, ...)
Mar 24 21:17:29 jarvis ollama[2742822]: #011/builddir/build/BUILD/ollama-0.17.7/runner/llamarunner/runner.go:849 +0x333
Mar 24 21:17:29 jarvis ollama[2742822]: created by github.com/ollama/ollama/runner/llamarunner.(*Server).load in goroutine 227
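
The two llama_init_from_model lines form a recognisable signature for this failure. As a rough diagnostic sketch in Go, the check below scans log lines for that sequence. The function name is hypothetical, and the matched substrings are copied from the excerpt above, so they may not match other versions.

```go
package main

import (
	"fmt"
	"strings"
)

// Substrings copied from the log excerpt in this report.
const (
	faForcedOff   = "forcing off"
	vCacheNeedsFA = "V cache quantization requires flash_attn"
)

// hitsKVCacheFlashAttnCrash (hypothetical name) reports whether the lines
// contain "flash_attn ... forcing off" followed later by the V cache
// quantization error.
func hitsKVCacheFlashAttnCrash(lines []string) bool {
	forcedOff := false
	for _, line := range lines {
		if strings.Contains(line, "flash_attn") && strings.Contains(line, faForcedOff) {
			forcedOff = true
			continue
		}
		if forcedOff && strings.Contains(line, vCacheNeedsFA) {
			return true
		}
	}
	return false
}

func main() {
	logLines := []string{
		"llama_init_from_model: flash_attn is not compatible with Grok - forcing off",
		"llama_init_from_model: V cache quantization requires flash_attn",
		"panic: unable to create llama context",
	}
	fmt.Println("matches this issue:", hitsKVCacheFlashAttnCrash(logLines))
}
```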

What is the issue?

What should happen: When flash_attn isn't enabled for whatever reason, v cache quantization for the model should get automatically disabled and ignored so that it doesn't have to be disabled globally.

OS: Linux
GPU: Nvidia
CPU: Intel
Ollama version: 0.17.7

extent analysis

Fix Plan

To resolve the issue, V cache quantization should be disabled automatically whenever flash attention is not enabled, instead of being passed through to context creation. The steps are:

Check if flash attention is enabled before attempting to use v cache quantization.
If flash attention is not enabled, disable v cache quantization.

Example code changes (illustrative Go; the variable names are placeholders, not identifiers from the ollama source):

if !flashAttnEnabled {
    // Disable v cache quantization
    vCacheQuantizationEnabled = false
    log.Println("V cache quantization disabled due to flash attention incompatibility")
}

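Additionally, the vCacheQuantizationEnabled flag should be checked before the llama context is created. Once the fallback above has run, this condition can no longer be true, so the check acts only as a safeguard against a code path that skipped the fallback: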

if vCacheQuantizationEnabled && !flashAttnEnabled {
    log.Fatal("V cache quantization requires flash attention, but it is not enabled")
}

Alternatively, if the configuration stores a cache type rather than a boolean flag, the same fallback can be expressed by clearing the requested type:

if !flashAttnEnabled {
    // Ignore v cache quantization
    vCacheQuantizationType = ""
}
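
The fragments above can be combined into one self-contained function. This is a minimal sketch, not the change made in the linked PR. The function name is hypothetical, and it assumes "f16" is the unquantized cache type to fall back to.

```go
package main

import (
	"fmt"
	"log"
)

// defaultKVCacheType is the unquantized cache type assumed as the fallback.
const defaultKVCacheType = "f16"

// effectiveKVCacheType (hypothetical name) returns the KV cache type to use.
// A quantized type is honoured only when flash attention is active; otherwise
// the request is logged and ignored rather than treated as a fatal error.
func effectiveKVCacheType(requested string, flashAttn bool) string {
	if requested == "" || requested == defaultKVCacheType {
		return defaultKVCacheType
	}
	if !flashAttn {
		log.Printf("flash attention is off; ignoring KV cache type %q and using %s",
			requested, defaultKVCacheType)
		return defaultKVCacheType
	}
	return requested
}

func main() {
	fmt.Println(effectiveKVCacheType("q8_0", true))  // flash attention on: request honoured
	fmt.Println(effectiveKVCacheType("q8_0", false)) // flash attention off: falls back
}
```

Returning the fallback type instead of aborting keeps a global OLLAMA_KV_CACHE_TYPE setting usable across models that do and do not support flash attention, which is the behaviour the issue asks for.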

Verification

To verify that the fix worked, we can test the following scenarios:

Set OLLAMA_FLASH_ATTENTION=1 and OLLAMA_KV_CACHE_TYPE=q8_0, load a model that is incompatible with flash attention (Grok in the report above), and verify that the model loads with V cache quantization disabled instead of panicking.
Set OLLAMA_FLASH_ATTENTION=0 and OLLAMA_KV_CACHE_TYPE=q8_0 and verify that V cache quantization is disabled and the model still loads.
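
The two scenarios can be prepared from Go as follows. This sketch only builds the `ollama serve` command with the environment variables named above. It does not start the server, and the type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// scenario pairs the two environment variables from the verification steps.
type scenario struct {
	flashAttention string // value for OLLAMA_FLASH_ATTENTION
	kvCacheType    string // value for OLLAMA_KV_CACHE_TYPE
}

// overrides returns the environment entries for a scenario.
func (s scenario) overrides() []string {
	return []string{
		"OLLAMA_FLASH_ATTENTION=" + s.flashAttention,
		"OLLAMA_KV_CACHE_TYPE=" + s.kvCacheType,
	}
}

// serveCommand prepares, but does not start, `ollama serve` with the
// scenario's variables appended to the current environment.
func serveCommand(s scenario) *exec.Cmd {
	cmd := exec.Command("ollama", "serve")
	cmd.Env = append(os.Environ(), s.overrides()...)
	return cmd
}

func main() {
	for _, s := range []scenario{{"1", "q8_0"}, {"0", "q8_0"}} {
		cmd := serveCommand(s)
		fmt.Println(s.overrides(), cmd.Args)
		// Calling cmd.Run() here would start the server for this scenario;
		// the model load and log inspection are left to the tester.
	}
}
```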

Extra Tips

Make sure to update the documentation to reflect the changes to the v cache quantization behavior.
Consider adding additional logging to help diagnose issues related to flash attention and v cache quantization.
