ollama - ✅(Solved) Fix Gemma 4 31B Dense Specific Issue: Flash Attention hangs indefinitely on large prompt eval (>3-4K tokens) — CUDA/RTX 3090 [2 pull requests, 16 comments, 4 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
ollama/ollama#15350Fetched 2026-04-08 02:52:23
View on GitHub
Comments
16
Participants
4
Timeline
28
Reactions
2
Author
Timeline (top)
commented ×16subscribed ×7cross-referenced ×4closed ×1

Root Cause

The bug is in how Ollama's FA implementation handles the Dense model's attention during large batched prefill:

  • Gemma 4 uses a hybrid attention architecture: 50 sliding window layers (512-1024 token window) + 10 global attention layers
  • The Dense model processes all 31B parameters on every token
  • The MoE model only activates 4B parameters per token via expert routing — this appears to change how FA processes the batched prefill, explaining why MoE succeeds
  • Token-by-token generation works fine with FA (short prompts succeed because the prefill batch is small)
  • The hang is specifically in FA processing a large batch through the Dense model's full-width hybrid attention layers simultaneously

Fix Action

Fix / Workaround

  • #15258 — Gemma 4 hanging on M4 Macs (fixed by PR #15296, but didn't address large prompt eval)
  • #15237 — Gemma 4 on 5090 showing GPU→CPU jump with FA
  • #15286 — Gemma 4 31B performance issues on M1 Max

PR fix notes

PR #15332: ggml: add CUDA flash attention support for head dimension 512 for Gemma4

Description (problem / solution / changelog)

Summary

Backport of ggml-org/llama.cpp#20998 into ollama's ggml backend. I am not sure if there is a formal way how this is done for ollama. The llama.cpp release that contains this fix is: https://github.com/ggml-org/llama.cpp/releases/tag/b8609

Why this is needed:

Gemma4's global attention layers use head_dim=512, which has no CUDA flash attention kernel in the current llama.cpp snapshot. When FA is enabled, these ops silently fall back to CPU, during inference.

  • ollama run with short prompts did not noticeably trigger the fallback, but ollama launch claude (and VS Code Copilot) did. Maybe due to large system prompts with tool definitions.

Changes:

Follows ggml-org/llama.cpp#20998

  • Add case 512 to MMA and tile kernel dispatch
  • Add kernel configs for Ampere, Turing, Volta, and RDNA architectures
  • Add template instances for D=512
  • Exclude D=512 from WMMA path and vector kernel (no D=512 vec templates)
  • Add gemma4 to flash attention default whitelist
    • this has been added and revoked in #15311 - unclear why revoked and locally this works so I suggest to re-add

Related issues:

Fixes #15237, #15350

Test plan

  • Verified on RTX 5090 + RTX 3090 Ti with gemma4:31b Q4_K_M (FA on, 128K context, 100% GPU)
    • verified that no CPU spike during ollama launch claude/vscode with long system prompts
    • verified no regression on other tool-enabled models: nemotron-cascade-2, qwen3.5:35b-a3b, gpt-oss:20b
  • go test ./fs/ggml/ ./ml/backend/ggml/ passes

Evaluation steps used:

# Dont have vulkan locally, used PATH to CUDA 13.0 nvcc compiler:
cmake -B build -DCMAKE_DISABLE_FIND_PACKAGE_Vulkan=TRUE
cmake --build build -j$(nproc) 
go build -o ./ollama .

# Deploy
sudo systemctl stop ollama
sudo cp ./ollama /usr/local/bin/ollama
sudo cp ./build/lib/ollama/libggml-cuda.so /usr/local/lib/ollama/cuda_v13/libggml-cuda.so
sudo systemctl daemon-reload
sudo systemctl start ollama

# Enable FA (not needed with whitelist)
# In /etc/systemd/system/ollama.service.d/override.conf:
#   Environment="OLLAMA_FLASH_ATTENTION=1"
# Then: sudo systemctl daemon-reload && sudo systemctl restart ollama

# Test
ollama launch claude
# select model
# "hi"

Checks:

  1. ollama ps #if running
  2. nvidia-smi # careful: Memory will be filled but util ramps up then falls to basically 0% after the prompt is triggered
  3. perf top for cpu util - if FA doesnt work, you should see things like following (I use a Q8 KV cache but it will max out CPU regardless):
    48.23%  ollama          libggml-base.so.0.0.0   [.] dequantize_row_q8_0
            |--11.46%--ggml_compute_forward_flash_attn_ext
    22.67%  ollama          libggml-cpu-haswell.so  [.] ggml_vec_dot_q8_0_q8_0
            |--5.89%--ggml_compute_forward_flash_attn_ext
    17.05%  ollama          libggml-cpu-haswell.so  [.] ggml_compute_forward_flash_attn_ext
            |--2.66%--ggml_compute_forward_flash_attn_ext
    1. Note that Gemma4 has some vision modules on the CPU - these would still be there and not a sign of FA not working

AI disclaimer: AI was used in the triaging and resolution of the issue.

Changed files

  • fs/ggml/ggml.go (modified, +1/-0)
  • ml/backend/ggml/ggml/src/ggml-cuda/fattn-mma-f16.cuh (modified, +23/-1)
  • ml/backend/ggml/ggml/src/ggml-cuda/fattn-tile.cu (modified, +4/-0)
  • ml/backend/ggml/ggml/src/ggml-cuda/fattn-tile.cuh (modified, +29/-8)
  • ml/backend/ggml/ggml/src/ggml-cuda/fattn.cu (modified, +10/-1)
  • ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_1-ncols2_8.cu (modified, +1/-0)
  • ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_16-ncols2_4.cu (modified, +1/-0)
  • ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_2-ncols2_4.cu (modified, +1/-0)
  • ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_2-ncols2_8.cu (modified, +1/-0)
  • ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_4-ncols2_4.cu (modified, +1/-0)
  • ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_4-ncols2_8.cu (modified, +1/-0)
  • ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_8-ncols2_4.cu (modified, +1/-0)
  • ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_8-ncols2_8.cu (modified, +1/-0)
  • ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-tile-instance-dkq512-dv512.cu (added, +5/-0)

PR #15378: gemma4: enable flash attention

Description (problem / solution / changelog)

Backport GGML kernels so we can enable flash attention for the gemma 4 model on Metal and CUDA.

No significant performance change, but this does reduce VRAM usage thus allowing larger context sizes.

Fixes #15368 Fixes #15350 Fixes #15237

Changed files

  • fs/ggml/ggml.go (modified, +1/-0)
  • llama/patches/0020-ggml-No-alloc-mode.patch (modified, +23/-22)
  • llama/patches/0022-ggml-Enable-resetting-backend-devices.patch (modified, +2/-2)
  • llama/patches/0024-GPU-discovery-enhancements.patch (modified, +2/-2)
  • llama/patches/0036-backport-kernels-for-gemma4.patch (added, +416/-0)
  • ml/backend/ggml/ggml/src/ggml-cuda/fattn-mma-f16.cuh (modified, +25/-1)
  • ml/backend/ggml/ggml/src/ggml-cuda/fattn-tile.cu (modified, +4/-0)
  • ml/backend/ggml/ggml/src/ggml-cuda/fattn-tile.cuh (modified, +29/-8)
  • ml/backend/ggml/ggml/src/ggml-cuda/fattn.cu (modified, +10/-1)
  • ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_1-ncols2_8.cu (modified, +1/-0)
  • ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_16-ncols2_4.cu (modified, +1/-0)
  • ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_2-ncols2_4.cu (modified, +1/-0)
  • ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_2-ncols2_8.cu (modified, +1/-0)
  • ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_4-ncols2_4.cu (modified, +1/-0)
  • ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_4-ncols2_8.cu (modified, +1/-0)
  • ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_8-ncols2_4.cu (modified, +1/-0)
  • ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_8-ncols2_8.cu (modified, +1/-0)
  • ml/backend/ggml/ggml/src/ggml-metal/ggml-metal-device.m (modified, +1/-0)
  • ml/backend/ggml/ggml/src/ggml-metal/ggml-metal-embed.metal (modified, +19/-0)
  • ml/backend/ggml/ggml/src/ggml-metal/ggml-metal.metal (modified, +19/-0)

Code Example

# Start Ollama with FA enabled
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q4_0 ollama serve

# This works instantly (short prompt):
curl http://localhost:11434/api/generate \
  -d '{"model":"gemma4:31b","prompt":"Say hello","stream":false,"options":{"num_predict":5}}'
# Returns in <1 second

# This hangs forever (large prompt, Dense model):
python3 -c "
import json, sys
large_prompt = 'You are a helpful AI assistant with extensive knowledge. ' * 800
payload = json.dumps({'model':'gemma4:31b','messages':[{'role':'system','content':large_prompt},{'role':'user','content':'Say hello in one sentence.'}],'stream':False,'options':{'num_predict':20}})
sys.stdout.write(payload)
" | curl -s -m 120 -X POST http://localhost:11434/api/chat \
  -H 'Content-Type: application/json' -d @-
# Hangs indefinitely. GPU shows 0% utilization via nvidia-smi.
# Returns empty after timeout.

# Same payload works fine with MoE model:
# Change gemma4:31b → gemma4:26b in the above command
# Completes in ~88 seconds with 8,021 tokens processed

# Same payload works fine with FA disabled (Dense model):
# Set OLLAMA_FLASH_ATTENTION=0, restart, run the 31b command
# Completes in ~40s with CPU offload at ~6 tok/s
RAW_BUFFERClick to expand / collapse

What is the issue?

Flash Attention causes Gemma 4 31B Dense to hang indefinitely during prompt evaluation when the prompt exceeds ~3-4K tokens. Short prompts work perfectly at full speed. The 26B MoE variant handles the same large prompts without issue — the bug is specific to the Dense model.

This blocks all agentic use cases (OpenClaw, coding agents, any tool with a system prompt) since those tools inject 10-20K+ tokens of system prompt, tools, memory, and context before the user's message.

Environment

  • OS: Ubuntu 24.04
  • GPU: NVIDIA RTX 3090 (24GB)
  • Ollama: v0.20.2
  • Model: gemma4:31b (Q4_K_M, ~20GB) and gemma4:26b for comparison
  • CUDA: 12.x
  • Settings: OLLAMA_FLASH_ATTENTION=1, OLLAMA_KV_CACHE_TYPE=q4_0

Key finding: Dense hangs, MoE doesn't

Same server, same FA settings, same KV cache, same prompt (~8K tokens of system prompt):

ModelArchitectureResultPrompt EvalTime
gemma4:26bMoE (4B active)✅ Works8,021 tokens88s
gemma4:31bDense (31B all active)❌ HANG0 tokens processed>120s, 0% GPU

The MoE model processes the same large prompt successfully. The Dense model hangs with 0% GPU utilization — it's not slow processing, it's a complete stall.

Systematic test results (Dense model only)

All tests on same hardware, one variable changed at a time:

TestFAKV CachePrompt SizeResultNotes
1ONq4_0~13K tokens (system prompt)❌ HANGGPU 0% utilization, indefinite
2ONf16~13K tokens (system prompt)❌ HANGSame — KV type doesn't matter
3OFFf16~13K tokens (system prompt)✅ Works~40s, CPU offload, ~6 tok/s
4OFFq4_0~13K tokens (system prompt)✅ WorksFalls back to f16 silently
5ONq4_0~26 tokens (short prompt)✅ Works30 tok/s, instant
6ONq4_0~2,479 tokens✅ Works134 tok/s prompt eval
7ONq4_0~3,541 tokens✅ Works74 tok/s prompt eval
8ONq4_0~8K+ tokens (agent payload)❌ HANG3+ min, 0% GPU, aborted

The pattern: FA + Dense model works under ~3-4K tokens, hangs above that threshold. FA + MoE works at all sizes. FA off + Dense works at all sizes (slowly, with CPU offload).

Steps to reproduce

# Start Ollama with FA enabled
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q4_0 ollama serve

# This works instantly (short prompt):
curl http://localhost:11434/api/generate \
  -d '{"model":"gemma4:31b","prompt":"Say hello","stream":false,"options":{"num_predict":5}}'
# Returns in <1 second

# This hangs forever (large prompt, Dense model):
python3 -c "
import json, sys
large_prompt = 'You are a helpful AI assistant with extensive knowledge. ' * 800
payload = json.dumps({'model':'gemma4:31b','messages':[{'role':'system','content':large_prompt},{'role':'user','content':'Say hello in one sentence.'}],'stream':False,'options':{'num_predict':20}})
sys.stdout.write(payload)
" | curl -s -m 120 -X POST http://localhost:11434/api/chat \
  -H 'Content-Type: application/json' -d @-
# Hangs indefinitely. GPU shows 0% utilization via nvidia-smi.
# Returns empty after timeout.

# Same payload works fine with MoE model:
# Change gemma4:31b → gemma4:26b in the above command
# Completes in ~88 seconds with 8,021 tokens processed

# Same payload works fine with FA disabled (Dense model):
# Set OLLAMA_FLASH_ATTENTION=0, restart, run the 31b command
# Completes in ~40s with CPU offload at ~6 tok/s

Why this matters — blocks all agentic use cases

This blocks every agentic use case for the Gemma 4 Dense model on Ollama:

  • OpenClaw injects ~27K chars (~8-10K tokens) of bootstrap, tools, memory, and system prompt. Multiple open issues trace back to this root cause: openclaw/openclaw#59916 (Gemma 4 hangs, filed 3 days ago), openclaw/openclaw#41871, openclaw/openclaw#31399, openclaw/openclaw#24756 — all reporting "local Ollama hangs, direct curl works fine." The community is filing these against OpenClaw, but the root cause is here in Ollama's FA implementation.
  • OpenCode, Continue, and other coding agents send large system prompts with tool definitions
  • Any application using Ollama's /api/chat with a system prompt + conversation history exceeding ~3-4K tokens

Gemma 4 31B Dense is the #1 ranked dense model in its class right now. OpenClaw is the #1 open-source agent platform. The intersection of these two is completely broken for anyone running locally with FA enabled on NVIDIA GPUs.

Analysis

The bug is in how Ollama's FA implementation handles the Dense model's attention during large batched prefill:

  • Gemma 4 uses a hybrid attention architecture: 50 sliding window layers (512-1024 token window) + 10 global attention layers
  • The Dense model processes all 31B parameters on every token
  • The MoE model only activates 4B parameters per token via expert routing — this appears to change how FA processes the batched prefill, explaining why MoE succeeds
  • Token-by-token generation works fine with FA (short prompts succeed because the prefill batch is small)
  • The hang is specifically in FA processing a large batch through the Dense model's full-width hybrid attention layers simultaneously

Gemma 3 precedent

Gemma 3 had the same architecture (sliding window + global attention) and required specific FA fixes in earlier Ollama releases:

  • #9683, #8158 — KV cache + FA speed issues with Gemma 3
  • #9857 — Gemma 3 27B on RTX 3090 becoming unresponsive (same GPU, same arch family)
  • Ollama changelog notes prior fixes: "Fixed handling of long contexts with Gemma 3 models" and "Flash attention is now enabled by default for Gemma 3"

PR #15296 enabled FA for Gemma 4 but may not have included the equivalent large-batch prefill handling that was eventually added for Gemma 3.

Related issues

  • #15258 — Gemma 4 hanging on M4 Macs (fixed by PR #15296, but didn't address large prompt eval)
  • #15237 — Gemma 4 on 5090 showing GPU→CPU jump with FA
  • #15286 — Gemma 4 31B performance issues on M1 Max

extent analysis

TL;DR

Disable Flash Attention (FA) for the Gemma 4 31B Dense model to prevent hanging during large prompt evaluations.

Guidance

  • Identify the threshold for prompt size that causes the hang, which appears to be around 3-4K tokens.
  • Consider disabling FA for large prompts or using the MoE model as a workaround.
  • Review prior fixes for Gemma 3, such as #9683, #8158, and #9857, to see if similar changes can be applied to Gemma 4.
  • Test with different KV cache types, such as f16, to see if it affects the hang.

Example

No specific code changes are suggested at this time, but the following command can be used to disable FA:

OLLAMA_FLASH_ATTENTION=0 ollama serve

Notes

The root cause of the issue appears to be related to how Ollama's FA implementation handles large batched prefill for the Dense model. The MoE model does not exhibit this issue, suggesting that the problem is specific to the Dense model's architecture.

Recommendation

Apply workaround: Disable Flash Attention for the Gemma 4 31B Dense model when evaluating large prompts. This will prevent the hang, but may affect performance.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING