ollama - ✅(Solved) Fix Gemma 4 31B Dense Specific Issue: Flash Attention hangs indefinitely on large prompt eval (>3-4K tokens) — CUDA/RTX 3090 [2 pull requests, 16 comments, 4 participants]

ollama2026-04-05 20:47:40

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

ollama/ollama#15350•Fetched 2026-04-08 02:52:23

View on GitHub

Comments

Participants

Timeline

Reactions

Author

Participants

Timeline (top)

commented ×16subscribed ×7cross-referenced ×4closed ×1

Root Cause

The bug is in how Ollama's FA implementation handles the Dense model's attention during large batched prefill:

Gemma 4 uses a hybrid attention architecture: 50 sliding window layers (512-1024 token window) + 10 global attention layers
The Dense model processes all 31B parameters on every token
The MoE model only activates 4B parameters per token via expert routing — this appears to change how FA processes the batched prefill, explaining why MoE succeeds
Token-by-token generation works fine with FA (short prompts succeed because the prefill batch is small)
The hang is specifically in FA processing a large batch through the Dense model's full-width hybrid attention layers simultaneously

Fix Action

Fix / Workaround

#15258 — Gemma 4 hanging on M4 Macs (fixed by PR #15296, but didn't address large prompt eval)
#15237 — Gemma 4 on 5090 showing GPU→CPU jump with FA
#15286 — Gemma 4 31B performance issues on M1 Max

PR fix notes

PR #15332: ggml: add CUDA flash attention support for head dimension 512 for Gemma4

Repository: ollama/ollama
Author: mazphilip
State: open | merged: False
Link: https://github.com/ollama/ollama/pull/15332

Description (problem / solution / changelog)

Summary

Backport of ggml-org/llama.cpp#20998 into ollama's ggml backend. I am not sure if there is a formal way how this is done for ollama. The llama.cpp release that contains this fix is: https://github.com/ggml-org/llama.cpp/releases/tag/b8609

Why this is needed:

Gemma4's global attention layers use head_dim=512, which has no CUDA flash attention kernel in the current llama.cpp snapshot. When FA is enabled, these ops silently fall back to CPU, during inference.

ollama run with short prompts did not noticeably trigger the fallback, but ollama launch claude (and VS Code Copilot) did. Maybe due to large system prompts with tool definitions.

Changes:

Follows ggml-org/llama.cpp#20998

Add case 512 to MMA and tile kernel dispatch
Add kernel configs for Ampere, Turing, Volta, and RDNA architectures
Add template instances for D=512
Exclude D=512 from WMMA path and vector kernel (no D=512 vec templates)
Add gemma4 to flash attention default whitelist
- this has been added and revoked in #15311 - unclear why revoked and locally this works so I suggest to re-add

Related issues:

Fixes #15237, #15350

Test plan

Verified on RTX 5090 + RTX 3090 Ti with gemma4:31b Q4_K_M (FA on, 128K context, 100% GPU)
- verified that no CPU spike during ollama launch claude/vscode with long system prompts
- verified no regression on other tool-enabled models: nemotron-cascade-2, qwen3.5:35b-a3b, gpt-oss:20b
go test ./fs/ggml/ ./ml/backend/ggml/ passes

Evaluation steps used:

# Dont have vulkan locally, used PATH to CUDA 13.0 nvcc compiler:
cmake -B build -DCMAKE_DISABLE_FIND_PACKAGE_Vulkan=TRUE
cmake --build build -j$(nproc) 
go build -o ./ollama .

# Deploy
sudo systemctl stop ollama
sudo cp ./ollama /usr/local/bin/ollama
sudo cp ./build/lib/ollama/libggml-cuda.so /usr/local/lib/ollama/cuda_v13/libggml-cuda.so
sudo systemctl daemon-reload
sudo systemctl start ollama

# Enable FA (not needed with whitelist)
# In /etc/systemd/system/ollama.service.d/override.conf:
#   Environment="OLLAMA_FLASH_ATTENTION=1"
# Then: sudo systemctl daemon-reload && sudo systemctl restart ollama

# Test
ollama launch claude
# select model
# "hi"

Checks:

ollama ps #if running
nvidia-smi # careful: Memory will be filled but util ramps up then falls to basically 0% after the prompt is triggered

perf top for cpu util - if FA doesnt work, you should see things like following (I use a Q8 KV cache but it will max out CPU regardless):

48.23%  ollama          libggml-base.so.0.0.0   [.] dequantize_row_q8_0
        |--11.46%--ggml_compute_forward_flash_attn_ext
22.67%  ollama          libggml-cpu-haswell.so  [.] ggml_vec_dot_q8_0_q8_0
        |--5.89%--ggml_compute_forward_flash_attn_ext
17.05%  ollama          libggml-cpu-haswell.so  [.] ggml_compute_forward_flash_attn_ext
        |--2.66%--ggml_compute_forward_flash_attn_ext

Note that Gemma4 has some vision modules on the CPU - these would still be there and not a sign of FA not working

AI disclaimer: AI was used in the triaging and resolution of the issue.

Changed files

fs/ggml/ggml.go (modified, +1/-0)
ml/backend/ggml/ggml/src/ggml-cuda/fattn-mma-f16.cuh (modified, +23/-1)
ml/backend/ggml/ggml/src/ggml-cuda/fattn-tile.cu (modified, +4/-0)
ml/backend/ggml/ggml/src/ggml-cuda/fattn-tile.cuh (modified, +29/-8)
ml/backend/ggml/ggml/src/ggml-cuda/fattn.cu (modified, +10/-1)
ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_1-ncols2_8.cu (modified, +1/-0)
ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_16-ncols2_4.cu (modified, +1/-0)
ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_2-ncols2_4.cu (modified, +1/-0)
ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_2-ncols2_8.cu (modified, +1/-0)
ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_4-ncols2_4.cu (modified, +1/-0)
ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_4-ncols2_8.cu (modified, +1/-0)
ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_8-ncols2_4.cu (modified, +1/-0)
ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_8-ncols2_8.cu (modified, +1/-0)
ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-tile-instance-dkq512-dv512.cu (added, +5/-0)

PR #15378: gemma4: enable flash attention

Repository: ollama/ollama
Author: dhiltgen
State: closed | merged: True
Link: https://github.com/ollama/ollama/pull/15378

Description (problem / solution / changelog)

Backport GGML kernels so we can enable flash attention for the gemma 4 model on Metal and CUDA.

No significant performance change, but this does reduce VRAM usage thus allowing larger context sizes.

Fixes #15368 Fixes #15350 Fixes #15237

Changed files

fs/ggml/ggml.go (modified, +1/-0)
llama/patches/0020-ggml-No-alloc-mode.patch (modified, +23/-22)
llama/patches/0022-ggml-Enable-resetting-backend-devices.patch (modified, +2/-2)
llama/patches/0024-GPU-discovery-enhancements.patch (modified, +2/-2)
llama/patches/0036-backport-kernels-for-gemma4.patch (added, +416/-0)
ml/backend/ggml/ggml/src/ggml-cuda/fattn-mma-f16.cuh (modified, +25/-1)
ml/backend/ggml/ggml/src/ggml-cuda/fattn-tile.cu (modified, +4/-0)
ml/backend/ggml/ggml/src/ggml-cuda/fattn-tile.cuh (modified, +29/-8)
ml/backend/ggml/ggml/src/ggml-cuda/fattn.cu (modified, +10/-1)
ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_1-ncols2_8.cu (modified, +1/-0)
ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_16-ncols2_4.cu (modified, +1/-0)
ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_2-ncols2_4.cu (modified, +1/-0)
ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_2-ncols2_8.cu (modified, +1/-0)
ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_4-ncols2_4.cu (modified, +1/-0)
ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_4-ncols2_8.cu (modified, +1/-0)
ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_8-ncols2_4.cu (modified, +1/-0)
ml/backend/ggml/ggml/src/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_8-ncols2_8.cu (modified, +1/-0)
ml/backend/ggml/ggml/src/ggml-metal/ggml-metal-device.m (modified, +1/-0)
ml/backend/ggml/ggml/src/ggml-metal/ggml-metal-embed.metal (modified, +19/-0)
ml/backend/ggml/ggml/src/ggml-metal/ggml-metal.metal (modified, +19/-0)

Code Example

# Start Ollama with FA enabled
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q4_0 ollama serve

# This works instantly (short prompt):
curl http://localhost:11434/api/generate \
  -d '{"model":"gemma4:31b","prompt":"Say hello","stream":false,"options":{"num_predict":5}}'
# Returns in <1 second

# This hangs forever (large prompt, Dense model):
python3 -c "
import json, sys
large_prompt = 'You are a helpful AI assistant with extensive knowledge. ' * 800
payload = json.dumps({'model':'gemma4:31b','messages':[{'role':'system','content':large_prompt},{'role':'user','content':'Say hello in one sentence.'}],'stream':False,'options':{'num_predict':20}})
sys.stdout.write(payload)
" | curl -s -m 120 -X POST http://localhost:11434/api/chat \
  -H 'Content-Type: application/json' -d @-
# Hangs indefinitely. GPU shows 0% utilization via nvidia-smi.
# Returns empty after timeout.

# Same payload works fine with MoE model:
# Change gemma4:31b → gemma4:26b in the above command
# Completes in ~88 seconds with 8,021 tokens processed

# Same payload works fine with FA disabled (Dense model):
# Set OLLAMA_FLASH_ATTENTION=0, restart, run the 31b command
# Completes in ~40s with CPU offload at ~6 tok/s

RAW_BUFFERClick to expand / collapse

What is the issue?

Flash Attention causes Gemma 4 31B Dense to hang indefinitely during prompt evaluation when the prompt exceeds ~3-4K tokens. Short prompts work perfectly at full speed. The 26B MoE variant handles the same large prompts without issue — the bug is specific to the Dense model.

This blocks all agentic use cases (OpenClaw, coding agents, any tool with a system prompt) since those tools inject 10-20K+ tokens of system prompt, tools, memory, and context before the user's message.

Environment

OS: Ubuntu 24.04
GPU: NVIDIA RTX 3090 (24GB)
Ollama: v0.20.2
Model: gemma4:31b (Q4_K_M, ~20GB) and gemma4:26b for comparison
CUDA: 12.x
Settings: OLLAMA_FLASH_ATTENTION=1, OLLAMA_KV_CACHE_TYPE=q4_0

Key finding: Dense hangs, MoE doesn't

Same server, same FA settings, same KV cache, same prompt (~8K tokens of system prompt):

Model	Architecture	Result	Prompt Eval	Time
gemma4:26b	MoE (4B active)	✅ Works	8,021 tokens	88s
gemma4:31b	Dense (31B all active)	❌ HANG	0 tokens processed	>120s, 0% GPU

The MoE model processes the same large prompt successfully. The Dense model hangs with 0% GPU utilization — it's not slow processing, it's a complete stall.

Systematic test results (Dense model only)

All tests on same hardware, one variable changed at a time:

Test	FA	KV Cache	Prompt Size	Result	Notes
1	ON	q4_0	~13K tokens (system prompt)	❌ HANG	GPU 0% utilization, indefinite
2	ON	f16	~13K tokens (system prompt)	❌ HANG	Same — KV type doesn't matter
3	OFF	f16	~13K tokens (system prompt)	✅ Works	~40s, CPU offload, ~6 tok/s
4	OFF	q4_0	~13K tokens (system prompt)	✅ Works	Falls back to f16 silently
5	ON	q4_0	~26 tokens (short prompt)	✅ Works	30 tok/s, instant
6	ON	q4_0	~2,479 tokens	✅ Works	134 tok/s prompt eval
7	ON	q4_0	~3,541 tokens	✅ Works	74 tok/s prompt eval
8	ON	q4_0	~8K+ tokens (agent payload)	❌ HANG	3+ min, 0% GPU, aborted

The pattern: FA + Dense model works under ~3-4K tokens, hangs above that threshold. FA + MoE works at all sizes. FA off + Dense works at all sizes (slowly, with CPU offload).

Steps to reproduce

# Start Ollama with FA enabled
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q4_0 ollama serve

# This works instantly (short prompt):
curl http://localhost:11434/api/generate \
  -d '{"model":"gemma4:31b","prompt":"Say hello","stream":false,"options":{"num_predict":5}}'
# Returns in <1 second

# This hangs forever (large prompt, Dense model):
python3 -c "
import json, sys
large_prompt = 'You are a helpful AI assistant with extensive knowledge. ' * 800
payload = json.dumps({'model':'gemma4:31b','messages':[{'role':'system','content':large_prompt},{'role':'user','content':'Say hello in one sentence.'}],'stream':False,'options':{'num_predict':20}})
sys.stdout.write(payload)
" | curl -s -m 120 -X POST http://localhost:11434/api/chat \
  -H 'Content-Type: application/json' -d @-
# Hangs indefinitely. GPU shows 0% utilization via nvidia-smi.
# Returns empty after timeout.

# Same payload works fine with MoE model:
# Change gemma4:31b → gemma4:26b in the above command
# Completes in ~88 seconds with 8,021 tokens processed

# Same payload works fine with FA disabled (Dense model):
# Set OLLAMA_FLASH_ATTENTION=0, restart, run the 31b command
# Completes in ~40s with CPU offload at ~6 tok/s

Why this matters — blocks all agentic use cases

This blocks every agentic use case for the Gemma 4 Dense model on Ollama:

OpenClaw injects ~27K chars (~8-10K tokens) of bootstrap, tools, memory, and system prompt. Multiple open issues trace back to this root cause: openclaw/openclaw#59916 (Gemma 4 hangs, filed 3 days ago), openclaw/openclaw#41871, openclaw/openclaw#31399, openclaw/openclaw#24756 — all reporting "local Ollama hangs, direct curl works fine." The community is filing these against OpenClaw, but the root cause is here in Ollama's FA implementation.
OpenCode, Continue, and other coding agents send large system prompts with tool definitions
Any application using Ollama's /api/chat with a system prompt + conversation history exceeding ~3-4K tokens

Gemma 4 31B Dense is the #1 ranked dense model in its class right now. OpenClaw is the #1 open-source agent platform. The intersection of these two is completely broken for anyone running locally with FA enabled on NVIDIA GPUs.

Analysis

The bug is in how Ollama's FA implementation handles the Dense model's attention during large batched prefill:

Gemma 4 uses a hybrid attention architecture: 50 sliding window layers (512-1024 token window) + 10 global attention layers
The Dense model processes all 31B parameters on every token
The MoE model only activates 4B parameters per token via expert routing — this appears to change how FA processes the batched prefill, explaining why MoE succeeds
Token-by-token generation works fine with FA (short prompts succeed because the prefill batch is small)
The hang is specifically in FA processing a large batch through the Dense model's full-width hybrid attention layers simultaneously

Gemma 3 precedent

Gemma 3 had the same architecture (sliding window + global attention) and required specific FA fixes in earlier Ollama releases:

#9683, #8158 — KV cache + FA speed issues with Gemma 3
#9857 — Gemma 3 27B on RTX 3090 becoming unresponsive (same GPU, same arch family)
Ollama changelog notes prior fixes: "Fixed handling of long contexts with Gemma 3 models" and "Flash attention is now enabled by default for Gemma 3"

PR #15296 enabled FA for Gemma 4 but may not have included the equivalent large-batch prefill handling that was eventually added for Gemma 3.

Related issues

#15258 — Gemma 4 hanging on M4 Macs (fixed by PR #15296, but didn't address large prompt eval)
#15237 — Gemma 4 on 5090 showing GPU→CPU jump with FA
#15286 — Gemma 4 31B performance issues on M1 Max

extent analysis

TL;DR

Disable Flash Attention (FA) for the Gemma 4 31B Dense model to prevent hanging during large prompt evaluations.

Guidance

Identify the threshold for prompt size that causes the hang, which appears to be around 3-4K tokens.
Consider disabling FA for large prompts or using the MoE model as a workaround.
Review prior fixes for Gemma 3, such as #9683, #8158, and #9857, to see if similar changes can be applied to Gemma 4.
Test with different KV cache types, such as f16, to see if it affects the hang.

Example

No specific code changes are suggested at this time, but the following command can be used to disable FA:

OLLAMA_FLASH_ATTENTION=0 ollama serve

Notes

The root cause of the issue appears to be related to how Ollama's FA implementation handles large batched prefill for the Dense model. The MoE model does not exhibit this issue, suggesting that the problem is specific to the Dense model's architecture.

Recommendation

Apply workaround: Disable Flash Attention for the Gemma 4 31B Dense model when evaluating large prompts. This will prevent the hang, but may affect performance.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #conversation history #prompt issue #agent setup #task chaining

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

ollama - ✅(Solved) Fix Gemma 4 31B Dense Specific Issue: Flash Attention hangs indefinitely on large prompt eval (>3-4K tokens) — CUDA/RTX 3090 [2 pull requests, 16 comments, 4 participants]

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Fix Action

Fix / Workaround

PR fix notes

PR #15332: ggml: add CUDA flash attention support for head dimension 512 for Gemma4

Description (problem / solution / changelog)

Summary

Why this is needed:

Changes:

Related issues:

Test plan

Evaluation steps used:

Changed files

PR #15378: gemma4: enable flash attention

Description (problem / solution / changelog)

Changed files

Code Example

What is the issue?

Environment

Key finding: Dense hangs, MoE doesn't

Systematic test results (Dense model only)

Steps to reproduce

Why this matters — blocks all agentic use cases

Analysis

Gemma 3 precedent

Related issues

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING