ollama: gemma4 flash attention disabled (GPU: Tesla V100, Ollama version 0.20.7)

ollama/ollama#15641 (fetched 2026-04-17 08:26:57)

Error Message

Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.615+08:00 level=WARN source=server.go:270 msg="quantized kv cache requested but flash attention disabled" type=q8_0
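
The warning is an ordinary key=value log record, so it can be picked apart mechanically when scanning a long journal. The following is a minimal sketch, not part of Ollama: the helper name `parse_ollama_line` is hypothetical, and it assumes every structured record starts at the `time=` key, as the lines in this report do. The `type=q8_0` field is presumably the requested KV cache type.

```python
import shlex


def parse_ollama_line(line: str) -> dict[str, str]:
    """Parse the key=value payload of one journald-wrapped Ollama log line.

    Hypothetical helper: assumes the structured payload begins at "time=".
    """
    start = line.find("time=")
    if start == -1:
        return {}  # e.g. [GIN] access-log lines carry no key=value payload
    fields = {}
    # shlex keeps quoted values such as msg="..." together as one token
    for token in shlex.split(line[start:]):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


WARNING_LINE = (
    "Apr 17 11:17:42 LLM-T01-Server ollama[256526]: "
    "time=2026-04-17T11:17:42.615+08:00 level=WARN source=server.go:270 "
    'msg="quantized kv cache requested but flash attention disabled" type=q8_0'
)

fields = parse_ollama_line(WARNING_LINE)
print(fields["level"], fields["source"], fields["type"])
print(fields["msg"])
```

Filtering parsed records on `level == "WARN"` is one way to pull this line out of the full log without reading every INFO record.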

Fix Action

Fix / Workaround

Apr 17 11:17:33 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:17:33 | 200 |       89.75µs |       127.0.0.1 | HEAD     "/"
Apr 17 11:17:33 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:17:33 | 200 |     268.722µs |       127.0.0.1 | GET      "/api/ps"
Apr 17 11:17:35 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:17:35 | 200 |      67.843µs |       127.0.0.1 | HEAD     "/"
Apr 17 11:17:35 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:17:35 | 200 |      26.107µs |       127.0.0.1 | GET      "/api/ps"
Apr 17 11:17:39 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:17:39 | 200 |      48.112µs |       127.0.0.1 | HEAD     "/"
Apr 17 11:17:40 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:17:40 | 200 |  666.375868ms |       127.0.0.1 | POST     "/api/show"
Apr 17 11:17:40 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:17:40 | 200 |  671.534583ms |       127.0.0.1 | POST     "/api/show"
Apr 17 11:17:41 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:41.703+08:00 level=INFO source=server.go:444 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 46699"
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.615+08:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.615+08:00 level=WARN source=server.go:270 msg="quantized kv cache requested but flash attention disabled" type=q8_0
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.616+08:00 level=INFO source=server.go:444 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /mnt/data/ollama/models/blobs/sha256-a0feadb736f521df6de4b1bd3cbf06c00f9fd04570ddc1e47b8ec9ecbbd6b51d --port 35557"
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.616+08:00 level=INFO source=sched.go:484 msg="system memory" total="503.7 GiB" free="498.6 GiB" free_swap="8.0 GiB"
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.616+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-6d4e40b5-a4e2-5758-8195-4f0cae3b229f library=CUDA available="31.3 GiB" free="31.7 GiB" minimum="457.0 MiB" overhead="0 B"
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.616+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-8903761c-f5a9-23c1-398c-0536a7886912 library=CUDA available="31.0 GiB" free="31.4 GiB" minimum="457.0 MiB" overhead="0 B"
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.616+08:00 level=INFO source=server.go:771 msg="loading model" "model layers"=61 requested=-1
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.641+08:00 level=INFO source=runner.go:1417 msg="starting ollama engine"
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.641+08:00 level=INFO source=runner.go:1452 msg="Server listening on 127.0.0.1:35557"
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.651+08:00 level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:87040 KvCacheType: NumThreads:36 GPULayers:61[ID:GPU-6d4e40b5-a4e2-5758-8195-4f0cae3b229f Layers:61(0..60)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.783+08:00 level=INFO source=ggml.go:136 msg="" architecture=gemma4 file_type=Q8_0 name="" description="" num_tensors=1189 num_key_values=49
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: ggml_cuda_init: found 2 CUDA devices:
Apr 17 11:17:42 LLM-T01-Server ollama[256526]:   Device 0: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-6d4e40b5-a4e2-5758-8195-4f0cae3b229f
Apr 17 11:17:42 LLM-T01-Server ollama[256526]:   Device 1: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-8903761c-f5a9-23c1-398c-0536a7886912
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.975+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.996+08:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
Apr 17 11:17:43 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:43.039+08:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=3.971884ms bounds=(0,0)-(2048,2048)
Apr 17 11:17:43 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:43.353+08:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=313.311386ms size="[768 768]"
Apr 17 11:17:43 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:43.353+08:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
Apr 17 11:17:43 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:43.353+08:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
Apr 17 11:17:43 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:43.355+08:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=319.29888ms shape="[5376 256]"
Apr 17 11:17:44 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:44.212+08:00 level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:87040 KvCacheType: NumThreads:36 GPULayers:61[ID:GPU-6d4e40b5-a4e2-5758-8195-4f0cae3b229f Layers:28(0..27) ID:GPU-8903761c-f5a9-23c1-398c-0536a7886912 Layers:33(28..60)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Apr 17 11:17:44 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:44.378+08:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
Apr 17 11:17:44 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:44.411+08:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=2.413845ms bounds=(0,0)-(2048,2048)
Apr 17 11:17:44 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:44.696+08:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=285.244238ms size="[768 768]"
Apr 17 11:17:44 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:44.696+08:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
Apr 17 11:17:44 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:44.696+08:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
Apr 17 11:17:44 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:44.698+08:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=289.605084ms shape="[5376 256]"
Apr 17 11:17:45 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:45.503+08:00 level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:87040 KvCacheType: NumThreads:36 GPULayers:61[ID:GPU-6d4e40b5-a4e2-5758-8195-4f0cae3b229f Layers:32(0..31) ID:GPU-8903761c-f5a9-23c1-398c-0536a7886912 Layers:29(32..60)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Apr 17 11:17:45 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:45.694+08:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
Apr 17 11:17:45 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:45.742+08:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=6.097521ms bounds=(0,0)-(2048,2048)
Apr 17 11:17:46 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:46.045+08:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=303.313213ms size="[768 768]"
Apr 17 11:17:46 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:46.046+08:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
Apr 17 11:17:46 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:46.046+08:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
Apr 17 11:17:46 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:46.047+08:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=311.410631ms shape="[5376 256]"
Apr 17 11:17:46 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:46.665+08:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:87040 KvCacheType: NumThreads:36 GPULayers:61[ID:GPU-6d4e40b5-a4e2-5758-8195-4f0cae3b229f Layers:32(0..31) ID:GPU-8903761c-f5a9-23c1-398c-0536a7886912 Layers:29(32..60)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Apr 17 11:17:46 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:46.869+08:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
Apr 17 11:17:46 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:46.933+08:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=4.648597ms bounds=(0,0)-(2048,2048)
Apr 17 11:17:47 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:47.172+08:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=238.741814ms size="[768 768]"
Apr 17 11:17:47 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:47.176+08:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
Apr 17 11:17:47 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:47.176+08:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
Apr 17 11:17:47 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:47.177+08:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=248.797295ms shape="[5376 256]"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.094+08:00 level=INFO source=runner.go:1290 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:87040 KvCacheType: NumThreads:36 GPULayers:61[ID:GPU-6d4e40b5-a4e2-5758-8195-4f0cae3b229f Layers:32(0..31) ID:GPU-8903761c-f5a9-23c1-398c-0536a7886912 Layers:29(32..60)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=ggml.go:482 msg="offloading 60 repeating layers to GPU"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=ggml.go:494 msg="offloaded 61/61 layers to GPU"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="15.4 GiB"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA1 size="16.0 GiB"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="1.5 GiB"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="5.2 GiB"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA1 size="4.9 GiB"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="8.3 GiB"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA1 size="8.2 GiB"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="10.5 MiB"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=device.go:272 msg="total memory" size="59.6 GiB"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=server.go:1364 msg="waiting for llama runner to start responding"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.096+08:00 level=INFO source=server.go:1398 msg="waiting for server to become available" status="llm server loading model"
Apr 17 11:18:02 LLM-T01-Server ollama[256526]: time=2026-04-17T11:18:02.173+08:00 level=INFO source=server.go:1402 msg="llama runner started in 19.56 seconds"
Apr 17 11:18:02 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:18:02 | 200 | 21.225892456s |       127.0.0.1 | POST     "/api/generate"
Apr 17 11:19:06 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:19:06 | 200 |       64.15µs |       127.0.0.1 | HEAD     "/"
Apr 17 11:19:06 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:19:06 | 200 |      51.718µs |       127.0.0.1 | GET      "/api/ps"
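
The `device.go` lines in the log give a per-device memory breakdown that can be checked against the logged total and the `available` figures from the `sched.go` lines. The sketch below only re-adds numbers copied from the log; it assumes that `CUDA0` and `CUDA1` correspond to Device 0 and Device 1 as listed by `ggml_cuda_init`, which the log does not state explicitly.

```python
# Sizes in GiB, copied from the device.go lines of the log above.
ALLOCATIONS = [
    ("model weights", "CUDA0", 15.4),
    ("model weights", "CUDA1", 16.0),
    ("model weights", "CPU", 1.5),
    ("kv cache", "CUDA0", 5.2),
    ("kv cache", "CUDA1", 4.9),
    ("compute graph", "CUDA0", 8.3),
    ("compute graph", "CUDA1", 8.2),
    ("compute graph", "CPU", 10.5 / 1024),  # logged as 10.5 MiB
]

# "available" values from the sched.go "gpu memory" lines.
# Assumption: CUDA0 is Device 0 (GPU-6d4e...), CUDA1 is Device 1 (GPU-8903...).
AVAILABLE = {"CUDA0": 31.3, "CUDA1": 31.0}

per_device: dict[str, float] = {}
for _kind, device, gib in ALLOCATIONS:
    per_device[device] = per_device.get(device, 0.0) + gib

total = sum(per_device.values())

for device, used in per_device.items():
    line = f"{device}: {used:.1f} GiB"
    if device in AVAILABLE:
        line += f" of {AVAILABLE[device]:.1f} GiB available"
    print(line)
print(f"total: {total:.1f} GiB")
```

The sum comes to roughly 59.5 GiB, which agrees with the logged `total memory` of 59.6 GiB once rounding of the individual entries is taken into account. Both GPUs end up within about 2 to 2.5 GiB of their available memory.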

What is the issue?

I tested two models, gemma4:31b-it-q8_0 and gemma4:26b-a4b-it-q8_0, and flash attention is disabled for both. The server has two Tesla V100 32 GB GPUs and runs Ollama 0.20.7, which, as far as I can see, claims to support flash attention for Gemma 4. On the same server I also tested qwen3.5:27b-q8_0 and qwen3.6:35b-a3b-q8_0, and both load with flash attention enabled.

Environment: Ubuntu 22.04.5 LTS; CUDA version 12.2; Ollama 0.20.7.
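
To compare the Gemma 4 and Qwen runs described above, it can help to pull the flash attention state out of each `load request` record instead of reading the journal by eye. The following is a rough sketch under stated assumptions: the regular expression and the `summarize` helper are hypothetical, the field names (`Operation`, `FlashAttention`, `KvCacheType`) are taken from the log lines in this report, and the sample payloads are shortened copies of those lines.

```python
import re

# Matches the fields of interest inside a 'msg=load request="{...}"' payload.
LOAD_REQUEST = re.compile(
    r"Operation:(?P<op>\w+).*?"
    r"FlashAttention:(?P<fa>\w+).*?"
    r"KvCacheType:(?P<kv>\S*)"
)

# Shortened copies of the load request payloads from this report.
SAMPLE = [
    'msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 '
    "FlashAttention:Disabled KvSize:87040 KvCacheType: NumThreads:36",
    'msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 '
    "FlashAttention:Disabled KvSize:87040 KvCacheType: NumThreads:36",
    'msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 '
    "FlashAttention:Disabled KvSize:87040 KvCacheType: NumThreads:36",
    'msg="starting ollama engine"',  # unrelated record, ignored
]


def summarize(lines):
    """Return (operation, flash attention state, kv cache type) per load request."""
    rows = []
    for line in lines:
        match = LOAD_REQUEST.search(line)
        if match:
            rows.append((match["op"], match["fa"], match["kv"] or "(empty)"))
    return rows


for op, fa, kv in summarize(SAMPLE):
    print(f"{op:<7} FlashAttention={fa:<9} KvCacheType={kv}")
```

In this log every load request reports `FlashAttention:Disabled` with an empty `KvCacheType`, which is consistent with the warning that the requested q8_0 KV cache was not applied. Running the same scan over a journal from one of the Qwen models would show whether only the Gemma 4 loads differ.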

Relevant log output

Apr 17 11:17:33 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:17:33 | 200 |       89.75µs |       127.0.0.1 | HEAD     "/"
Apr 17 11:17:33 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:17:33 | 200 |     268.722µs |       127.0.0.1 | GET      "/api/ps"
Apr 17 11:17:35 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:17:35 | 200 |      67.843µs |       127.0.0.1 | HEAD     "/"
Apr 17 11:17:35 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:17:35 | 200 |      26.107µs |       127.0.0.1 | GET      "/api/ps"
Apr 17 11:17:39 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:17:39 | 200 |      48.112µs |       127.0.0.1 | HEAD     "/"
Apr 17 11:17:40 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:17:40 | 200 |  666.375868ms |       127.0.0.1 | POST     "/api/show"
Apr 17 11:17:40 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:17:40 | 200 |  671.534583ms |       127.0.0.1 | POST     "/api/show"
Apr 17 11:17:41 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:41.703+08:00 level=INFO source=server.go:444 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 46699"
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.615+08:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.615+08:00 level=WARN source=server.go:270 msg="quantized kv cache requested but flash attention disabled" type=q8_0
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.616+08:00 level=INFO source=server.go:444 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /mnt/data/ollama/models/blobs/sha256-a0feadb736f521df6de4b1bd3cbf06c00f9fd04570ddc1e47b8ec9ecbbd6b51d --port 35557"
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.616+08:00 level=INFO source=sched.go:484 msg="system memory" total="503.7 GiB" free="498.6 GiB" free_swap="8.0 GiB"
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.616+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-6d4e40b5-a4e2-5758-8195-4f0cae3b229f library=CUDA available="31.3 GiB" free="31.7 GiB" minimum="457.0 MiB" overhead="0 B"
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.616+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-8903761c-f5a9-23c1-398c-0536a7886912 library=CUDA available="31.0 GiB" free="31.4 GiB" minimum="457.0 MiB" overhead="0 B"
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.616+08:00 level=INFO source=server.go:771 msg="loading model" "model layers"=61 requested=-1
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.641+08:00 level=INFO source=runner.go:1417 msg="starting ollama engine"
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.641+08:00 level=INFO source=runner.go:1452 msg="Server listening on 127.0.0.1:35557"
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.651+08:00 level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:87040 KvCacheType: NumThreads:36 GPULayers:61[ID:GPU-6d4e40b5-a4e2-5758-8195-4f0cae3b229f Layers:61(0..60)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.783+08:00 level=INFO source=ggml.go:136 msg="" architecture=gemma4 file_type=Q8_0 name="" description="" num_tensors=1189 num_key_values=49
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: ggml_cuda_init: found 2 CUDA devices:
Apr 17 11:17:42 LLM-T01-Server ollama[256526]:   Device 0: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-6d4e40b5-a4e2-5758-8195-4f0cae3b229f
Apr 17 11:17:42 LLM-T01-Server ollama[256526]:   Device 1: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-8903761c-f5a9-23c1-398c-0536a7886912
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.975+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.996+08:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
Apr 17 11:17:43 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:43.039+08:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=3.971884ms bounds=(0,0)-(2048,2048)
Apr 17 11:17:43 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:43.353+08:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=313.311386ms size="[768 768]"
Apr 17 11:17:43 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:43.353+08:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
Apr 17 11:17:43 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:43.353+08:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
Apr 17 11:17:43 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:43.355+08:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=319.29888ms shape="[5376 256]"
Apr 17 11:17:44 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:44.212+08:00 level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:87040 KvCacheType: NumThreads:36 GPULayers:61[ID:GPU-6d4e40b5-a4e2-5758-8195-4f0cae3b229f Layers:28(0..27) ID:GPU-8903761c-f5a9-23c1-398c-0536a7886912 Layers:33(28..60)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Apr 17 11:17:44 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:44.378+08:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
Apr 17 11:17:44 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:44.411+08:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=2.413845ms bounds=(0,0)-(2048,2048)
Apr 17 11:17:44 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:44.696+08:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=285.244238ms size="[768 768]"
Apr 17 11:17:44 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:44.696+08:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
Apr 17 11:17:44 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:44.696+08:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
Apr 17 11:17:44 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:44.698+08:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=289.605084ms shape="[5376 256]"
Apr 17 11:17:45 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:45.503+08:00 level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:87040 KvCacheType: NumThreads:36 GPULayers:61[ID:GPU-6d4e40b5-a4e2-5758-8195-4f0cae3b229f Layers:32(0..31) ID:GPU-8903761c-f5a9-23c1-398c-0536a7886912 Layers:29(32..60)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Apr 17 11:17:45 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:45.694+08:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
Apr 17 11:17:45 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:45.742+08:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=6.097521ms bounds=(0,0)-(2048,2048)
Apr 17 11:17:46 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:46.045+08:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=303.313213ms size="[768 768]"
Apr 17 11:17:46 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:46.046+08:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
Apr 17 11:17:46 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:46.046+08:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
Apr 17 11:17:46 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:46.047+08:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=311.410631ms shape="[5376 256]"
Apr 17 11:17:46 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:46.665+08:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:87040 KvCacheType: NumThreads:36 GPULayers:61[ID:GPU-6d4e40b5-a4e2-5758-8195-4f0cae3b229f Layers:32(0..31) ID:GPU-8903761c-f5a9-23c1-398c-0536a7886912 Layers:29(32..60)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Apr 17 11:17:46 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:46.869+08:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
Apr 17 11:17:46 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:46.933+08:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=4.648597ms bounds=(0,0)-(2048,2048)
Apr 17 11:17:47 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:47.172+08:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=238.741814ms size="[768 768]"
Apr 17 11:17:47 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:47.176+08:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
Apr 17 11:17:47 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:47.176+08:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
Apr 17 11:17:47 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:47.177+08:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=248.797295ms shape="[5376 256]"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.094+08:00 level=INFO source=runner.go:1290 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:87040 KvCacheType: NumThreads:36 GPULayers:61[ID:GPU-6d4e40b5-a4e2-5758-8195-4f0cae3b229f Layers:32(0..31) ID:GPU-8903761c-f5a9-23c1-398c-0536a7886912 Layers:29(32..60)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=ggml.go:482 msg="offloading 60 repeating layers to GPU"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=ggml.go:494 msg="offloaded 61/61 layers to GPU"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="15.4 GiB"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA1 size="16.0 GiB"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="1.5 GiB"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="5.2 GiB"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA1 size="4.9 GiB"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="8.3 GiB"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA1 size="8.2 GiB"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="10.5 MiB"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=device.go:272 msg="total memory" size="59.6 GiB"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.095+08:00 level=INFO source=server.go:1364 msg="waiting for llama runner to start responding"
Apr 17 11:17:48 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:48.096+08:00 level=INFO source=server.go:1398 msg="waiting for server to become available" status="llm server loading model"
Apr 17 11:18:02 LLM-T01-Server ollama[256526]: time=2026-04-17T11:18:02.173+08:00 level=INFO source=server.go:1402 msg="llama runner started in 19.56 seconds"
Apr 17 11:18:02 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:18:02 | 200 | 21.225892456s |       127.0.0.1 | POST     "/api/generate"
Apr 17 11:19:06 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:19:06 | 200 |       64.15µs |       127.0.0.1 | HEAD     "/"
Apr 17 11:19:06 LLM-T01-Server ollama[256526]: [GIN] 2026/04/17 - 11:19:06 | 200 |      51.718µs |       127.0.0.1 | GET      "/api/ps"

OS

No response

GPU

No response

CPU

No response

Ollama version

No response

extent analysis

TL;DR

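A q8_0 quantized KV cache was requested, but Flash Attention is disabled for the gemma4 model, so every load request in the log shows FlashAttention:Disabled with an empty KvCacheType. Enabling Flash Attention on the server may clear the warning and let the q8_0 setting take effect.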

Guidance

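  • The server logs the warning "quantized kv cache requested but flash attention disabled" with type=q8_0, and every subsequent load request for the gemma4 model reports FlashAttention:Disabled and an empty KvCacheType. This suggests the q8_0 request was not applied.
  • Try enabling Flash Attention on the Ollama server (normally controlled by the OLLAMA_FLASH_ATTENTION environment variable), then reload the model and check whether the warning is gone.
  • The rest of the log shows no failure: all 61 layers are offloaded across the two Tesla V100-SXM2-32GB GPUs and the /api/generate request returns 200. The warning therefore points to a missed memory optimization rather than a broken load.
  • After the change, compare the reported "kv cache" sizes (currently 5.2 GiB on CUDA0 and 4.9 GiB on CUDA1) to confirm whether the quantized cache is actually in use.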
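
The check described in the guidance can be automated. The following is a minimal sketch, not part of the original issue: the function name `summarize_flash_attention` and the regular expression are assumptions based only on the log lines shown above, and the sample text is copied from that log.

```python
import re

# Warning text exactly as it appears in the log above.
WARNING_TEXT = "quantized kv cache requested but flash attention disabled"

# Assumed layout of the runner's load request, taken from the log lines above:
# "... FlashAttention:Disabled KvSize:87040 KvCacheType: NumThreads:36 ..."
LOAD_REQUEST = re.compile(r"FlashAttention:(\S+) KvSize:(\d+) KvCacheType:(\S*)")


def summarize_flash_attention(log_text):
    """Report whether the warning appeared and what each load request used."""
    warned = False
    load_requests = []
    for line in log_text.splitlines():
        if WARNING_TEXT in line:
            warned = True
        match = LOAD_REQUEST.search(line)
        if match:
            load_requests.append({
                "flash_attention": match.group(1),
                "kv_size": int(match.group(2)),
                # An empty value means no explicit cache type reached the runner.
                "kv_cache_type": match.group(3) or "(default)",
            })
    return {"warned": warned, "load_requests": load_requests}


# Two lines copied from the log above.
SAMPLE_LOG = '''Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.615+08:00 level=WARN source=server.go:270 msg="quantized kv cache requested but flash attention disabled" type=q8_0
Apr 17 11:17:42 LLM-T01-Server ollama[256526]: time=2026-04-17T11:17:42.651+08:00 level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:87040 KvCacheType: NumThreads:36 GPULayers:61[ID:GPU-6d4e40b5-a4e2-5758-8195-4f0cae3b229f Layers:61(0..60)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
'''

summary = summarize_flash_attention(SAMPLE_LOG)
print(summary)
```

Running the same function on the log captured after a configuration change gives a quick before-and-after comparison.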

Example

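The issue is a server configuration problem rather than a code defect. The relevant settings are the Ollama server environment variables OLLAMA_FLASH_ATTENTION (enables Flash Attention) and OLLAMA_KV_CACHE_TYPE (selects the KV cache quantization, here q8_0).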
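
As a hedged illustration, the sketch below starts an Ollama server with both variables set. The helper names `build_ollama_env` and `start_server` are hypothetical, and the sketch assumes the `ollama` binary is on PATH. The server in this issue appears to run as a system service (the log lines carry a journal prefix), in which case the same two variables would need to be set in that service's environment instead.

```python
import os
import subprocess


def build_ollama_env(base_env=None):
    """Return a copy of the environment with Flash Attention and q8_0 KV cache requested."""
    env = dict(os.environ if base_env is None else base_env)
    env["OLLAMA_FLASH_ATTENTION"] = "1"    # ask the server to enable Flash Attention
    env["OLLAMA_KV_CACHE_TYPE"] = "q8_0"   # cache type named in the warning
    return env


def start_server():
    """Launch `ollama serve` with the adjusted environment (assumes ollama is on PATH)."""
    return subprocess.Popen(["ollama", "serve"], env=build_ollama_env())


# Show only the variables this sketch adds; start_server() is not called here.
added = {k: v for k, v in build_ollama_env({}).items() if k.startswith("OLLAMA_")}
print(added)
```

Whether Flash Attention then actually becomes active for gemma4 on a Tesla V100 is not established by the log, so the load request should be checked again after the restart.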

Notes

The log does not state why Flash Attention is disabled. The warning only establishes that a q8_0 KV cache was requested and could not be honored without Flash Attention. It does not show whether Flash Attention was never enabled on the server or was turned off for this model or GPU (Tesla V100, compute capability 7.0). Enabling it and reloading the model is the quickest way to tell these cases apart.
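
For a rough sense of what the unapplied q8_0 setting is worth, the sketch below scales the KV cache sizes reported in the log. It rests on two assumptions that the log does not confirm: that the current cache is stored as 16-bit floats (the empty KvCacheType suggests a default), and that the whole reported cache shrinks in proportion to element size. The q8_0 ratio comes from its block layout of 32 one-byte values plus one 16-bit scale.

```python
# Bytes needed to store 32 cache elements in each format.
F16_BLOCK_BYTES = 32 * 2      # 32 values, 2 bytes each
Q8_0_BLOCK_BYTES = 32 + 2     # 32 int8 values plus one 16-bit scale

# "kv cache" sizes in GiB as reported in the log above.
KV_CACHE_GIB = {"CUDA0": 5.2, "CUDA1": 4.9}


def estimate_q8_0_gib(f16_gib):
    """Scale an f16 cache size to its approximate q8_0 size (assumes f16 today)."""
    return f16_gib * Q8_0_BLOCK_BYTES / F16_BLOCK_BYTES


current_total = sum(KV_CACHE_GIB.values())
estimated_total = sum(estimate_q8_0_gib(size) for size in KV_CACHE_GIB.values())

print(f"current KV cache:   {current_total:.1f} GiB")
print(f"estimated at q8_0:  {estimated_total:.1f} GiB")
```

The figures are an estimate only; the sizes the server logs after a reload are the ones to trust.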

Recommendation

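Enable Flash Attention on the Ollama server (version 0.20.7 according to the issue title), reload the gemma4 model, and confirm that the load request no longer reports FlashAttention:Disabled and that the warning is gone. The log does not establish whether this Ollama version supports Flash Attention for gemma4 on the Tesla V100, so if it stays disabled, the practical fallback is to drop the q8_0 KV cache request and run with the default cache type.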
