ollama - 💡(How to fix) Fix gemma4:26b with flash attention on 3090 with num_ctx 32k running slow on gpu+cpu [2 comments, 3 participants]

ollama/ollama#15634, fetched 2026-04-17 08:27:05

Fix Action

Fix / Workaround

while true; do
  OLLAMA_KEEP_ALIVE="-1" \
  OLLAMA_MODELS=/usr/share/ollama/.ollama/models \
  OLLAMA_FLASH_ATTENTION=1 \
  OLLAMA_HOST=0.0.0.0:11434 \
  ollama serve
  echo "ollama crashed, restarting in 2 seconds..."
  sleep 2
done
time=2026-04-17T00:37:46.230+03:00 level=INFO source=routes.go:1744 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:0 OLLAMA_DEBUG:INFO OLLAMA_DEBUG_LOG_REQUESTS:false OLLAMA_EDITOR: OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NO_CLOUD:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false OLLAMA_VULKAN:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2026-04-17T00:37:46.231+03:00 level=INFO source=routes.go:1746 msg="Ollama cloud disabled: false"
time=2026-04-17T00:37:46.236+03:00 level=INFO source=images.go:499 msg="total blobs: 50"
time=2026-04-17T00:37:46.237+03:00 level=INFO source=images.go:506 msg="total unused blobs removed: 0"
time=2026-04-17T00:37:46.237+03:00 level=INFO source=routes.go:1802 msg="Listening on [::]:11434 (version 0.20.2)"
time=2026-04-17T00:37:46.238+03:00 level=INFO source=runner.go:67 msg="discovering available GPUs..."
time=2026-04-17T00:37:46.238+03:00 level=INFO source=runner.go:106 msg="experimental Vulkan support disabled.  To enable, set OLLAMA_VULKAN=1"
time=2026-04-17T00:37:46.238+03:00 level=INFO source=server.go:432 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 41551"
time=2026-04-17T00:37:46.521+03:00 level=INFO source=server.go:432 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 43635"
time=2026-04-17T00:37:46.627+03:00 level=INFO source=server.go:432 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 47847"
time=2026-04-17T00:37:46.888+03:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-a94b4107-51b7-5a6f-b872-3e121139ed72 filter_id="" library=CUDA compute=8.6 name=CUDA0 description="NVIDIA GeForce RTX 3090" libdirs=ollama,cuda_v12 driver=12.8 pci_id=0000:81:00.0 type=discrete total="24.0 GiB" available="23.5 GiB"
time=2026-04-17T00:37:46.888+03:00 level=INFO source=routes.go:1852 msg="vram-based default context" total_vram="24.0 GiB" default_num_ctx=32768
time=2026-04-17T00:38:01.143+03:00 level=INFO source=server.go:432 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 43511"
time=2026-04-17T00:38:01.775+03:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
time=2026-04-17T00:38:01.775+03:00 level=INFO source=server.go:247 msg="enabling flash attention"
time=2026-04-17T00:38:01.775+03:00 level=INFO source=server.go:432 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /usr/share/ollama/.ollama/models/blobs/sha256-7121486771cbfe218851513210c40b35dbdee93ab1ef43fe36283c883980f0df --port 37345"
time=2026-04-17T00:38:01.776+03:00 level=INFO source=sched.go:484 msg="system memory" total="503.5 GiB" free="491.1 GiB" free_swap="0 B"
time=2026-04-17T00:38:01.776+03:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-a94b4107-51b7-5a6f-b872-3e121139ed72 library=CUDA available="23.0 GiB" free="23.5 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-04-17T00:38:01.776+03:00 level=INFO source=server.go:759 msg="loading model" "model layers"=31 requested=99
time=2026-04-17T00:38:01.795+03:00 level=INFO source=runner.go:1417 msg="starting ollama engine"
time=2026-04-17T00:38:01.796+03:00 level=INFO source=runner.go:1452 msg="Server listening on 127.0.0.1:37345"
time=2026-04-17T00:38:01.799+03:00 level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:32768 KvCacheType: NumThreads:48 GPULayers:31[ID:GPU-a94b4107-51b7-5a6f-b872-3e121139ed72 Layers:31(0..30)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-17T00:38:01.930+03:00 level=INFO source=ggml.go:136 msg="" architecture=gemma4 file_type=Q4_K_M name="" description="" num_tensors=1014 num_key_values=52
load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, ID: GPU-a94b4107-51b7-5a6f-b872-3e121139ed72
load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
time=2026-04-17T00:38:02.140+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2026-04-17T00:38:02.148+03:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
time=2026-04-17T00:38:02.176+03:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=5.370284ms bounds=(0,0)-(2048,2048)
time=2026-04-17T00:38:02.362+03:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=186.589034ms size="[768 768]"
time=2026-04-17T00:38:02.362+03:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
time=2026-04-17T00:38:02.362+03:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
time=2026-04-17T00:38:02.364+03:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=193.358759ms shape="[2816 256]"
time=2026-04-17T00:38:02.914+03:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:32768 KvCacheType: NumThreads:48 GPULayers:31[ID:GPU-a94b4107-51b7-5a6f-b872-3e121139ed72 Layers:31(0..30)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-17T00:38:03.043+03:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
time=2026-04-17T00:38:03.062+03:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=777.192µs bounds=(0,0)-(2048,2048)
time=2026-04-17T00:38:03.232+03:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=170.482516ms size="[768 768]"
time=2026-04-17T00:38:03.237+03:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
time=2026-04-17T00:38:03.237+03:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
time=2026-04-17T00:38:03.238+03:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=176.789484ms shape="[2816 256]"
time=2026-04-17T00:38:03.498+03:00 level=INFO source=runner.go:1290 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:32768 KvCacheType: NumThreads:48 GPULayers:31[ID:GPU-a94b4107-51b7-5a6f-b872-3e121139ed72 Layers:31(0..30)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-17T00:38:03.499+03:00 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="16.6 GiB"
time=2026-04-17T00:38:03.499+03:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="667.5 MiB"
time=2026-04-17T00:38:03.499+03:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="1.5 GiB"
time=2026-04-17T00:38:03.499+03:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="243.9 MiB"
time=2026-04-17T00:38:03.499+03:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="192.0 MiB"
time=2026-04-17T00:38:03.499+03:00 level=INFO source=device.go:272 msg="total memory" size="19.2 GiB"
time=2026-04-17T00:38:03.499+03:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
time=2026-04-17T00:38:03.499+03:00 level=INFO source=server.go:1352 msg="waiting for llama runner to start responding"
time=2026-04-17T00:38:03.499+03:00 level=INFO source=ggml.go:482 msg="offloading 30 repeating layers to GPU"
time=2026-04-17T00:38:03.499+03:00 level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
time=2026-04-17T00:38:03.499+03:00 level=INFO source=ggml.go:494 msg="offloaded 31/31 layers to GPU"
time=2026-04-17T00:38:03.499+03:00 level=INFO source=server.go:1386 msg="waiting for server to become available" status="llm server loading model"
time=2026-04-17T00:38:07.013+03:00 level=INFO source=server.go:1390 msg="llama runner started in 5.24 seconds"
[GIN] 2026/04/17 - 00:39:56 | 200 |         1m56s |       127.0.0.1 | POST     "/api/generate"
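
The allocation lines near the end of the log place a small share of the load on the CPU (667.5 MiB of model weights and a 192.0 MiB compute graph), even though 31/31 layers are reported as offloaded to the GPU. The following is a minimal sketch for totalling the per-device sizes from such lines. The regular expression and the function names are illustrative and not part of Ollama, and the sketch assumes the log format matches the `device.go` lines shown above.

```python
import re

# Matches per-device allocation lines such as:
#   msg="model weights" device=CUDA0 size="16.6 GiB"
# The "total memory" line has no device= field, so it is skipped.
ALLOC_RE = re.compile(
    r'msg="(?P<kind>[^"]+)" device=(?P<device>\S+) '
    r'size="(?P<value>[\d.]+) (?P<unit>MiB|GiB)"'
)


def to_gib(value, unit):
    """Convert a size string with a MiB or GiB unit to GiB."""
    size = float(value)
    return size / 1024 if unit == "MiB" else size


def memory_by_device(log_text):
    """Sum the reported allocation sizes (in GiB) for each device."""
    totals = {}
    for match in ALLOC_RE.finditer(log_text):
        device = match["device"]
        totals[device] = totals.get(device, 0.0) + to_gib(match["value"], match["unit"])
    return totals


# Allocation lines copied from the log above (timestamps trimmed).
SAMPLE = '''
source=device.go:240 msg="model weights" device=CUDA0 size="16.6 GiB"
source=device.go:245 msg="model weights" device=CPU size="667.5 MiB"
source=device.go:251 msg="kv cache" device=CUDA0 size="1.5 GiB"
source=device.go:262 msg="compute graph" device=CUDA0 size="243.9 MiB"
source=device.go:267 msg="compute graph" device=CPU size="192.0 MiB"
source=device.go:272 msg="total memory" size="19.2 GiB"
'''

if __name__ == "__main__":
    for device, gib in memory_by_device(SAMPLE).items():
        print(f"{device}: {gib:.2f} GiB")
```

Summed this way, the per-device figures add up to the 19.2 GiB reported on the "total memory" line, which makes it easy to compare the CPU share between a run with flash attention and a run without it.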

What is the issue?

Without flash attention, gemma4:26b runs fast on the GPU alone. With flash attention enabled, it runs very slowly and is split across GPU and CPU.
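
To put a number on "fast" versus "very slow", the two configurations can be compared by generation throughput. The sketch below assumes the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama returns in the final response of a non-streaming `/api/generate` call. The function names are illustrative, and the sample values are placeholders, not measurements from this issue.

```python
def tokens_per_second(response):
    """Generation throughput from a parsed /api/generate response.

    Assumes `eval_count` is the number of generated tokens and
    `eval_duration` is the generation time in nanoseconds.
    """
    seconds = response["eval_duration"] / 1e9
    return response["eval_count"] / seconds


def slowdown(fast_response, slow_response):
    """How many times slower the second run generated tokens."""
    return tokens_per_second(fast_response) / tokens_per_second(slow_response)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    without_fa = {"eval_count": 100, "eval_duration": 2_000_000_000}
    with_fa = {"eval_count": 100, "eval_duration": 20_000_000_000}
    print(f"without flash attention: {tokens_per_second(without_fa):.1f} tok/s")
    print(f"with flash attention:    {tokens_per_second(with_fa):.1f} tok/s")
    print(f"slowdown: {slowdown(without_fa, with_fa):.1f}x")
```

Running the same prompt once with `OLLAMA_FLASH_ATTENTION=1` and once without it, then feeding both responses to `slowdown`, gives a single figure that can be attached to the report.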

MainGPU:0 UseMmap:false}" time=2026-04-17T00:38:03.499+03:00 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="16.6 GiB" time=2026-04-17T00:38:03.499+03:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="667.5 MiB" time=2026-04-17T00:38:03.499+03:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="1.5 GiB" time=2026-04-17T00:38:03.499+03:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="243.9 MiB" time=2026-04-17T00:38:03.499+03:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="192.0 MiB" time=2026-04-17T00:38:03.499+03:00 level=INFO source=device.go:272 msg="total memory" size="19.2 GiB" time=2026-04-17T00:38:03.499+03:00 level=INFO source=sched.go:561 msg="loaded runners" count=1 time=2026-04-17T00:38:03.499+03:00 level=INFO source=server.go:1352 msg="waiting for llama runner to start responding" time=2026-04-17T00:38:03.499+03:00 level=INFO source=ggml.go:482 msg="offloading 30 repeating layers to GPU" time=2026-04-17T00:38:03.499+03:00 level=INFO source=ggml.go:489 msg="offloading output layer to GPU" time=2026-04-17T00:38:03.499+03:00 level=INFO source=ggml.go:494 msg="offloaded 31/31 layers to GPU" time=2026-04-17T00:38:03.499+03:00 level=INFO source=server.go:1386 msg="waiting for server to become available" status="llm server loading model" time=2026-04-17T00:38:07.013+03:00 level=INFO source=server.go:1390 msg="llama runner started in 5.24 seconds" [GIN] 2026/04/17 - 00:39:56 | 200 | 1m56s | 127.0.0.1 | POST "/api/generate"

Relevant log output

while true; do   OLLAMA_KEEP_ALIVE="-1"   OLLAMA_MODELS=/usr/share/ollama/.ollama/models OLLAMA_FLASH_ATTENTION=1  OLLAMA_HOST=0.0.0.0:11434   ollama serve;    echo "ollama crashed, restarting in 2 seconds...";   sleep 2; done
time=2026-04-17T00:37:46.230+03:00 level=INFO source=routes.go:1744 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:0 OLLAMA_DEBUG:INFO OLLAMA_DEBUG_LOG_REQUESTS:false OLLAMA_EDITOR: OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NO_CLOUD:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false OLLAMA_VULKAN:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2026-04-17T00:37:46.231+03:00 level=INFO source=routes.go:1746 msg="Ollama cloud disabled: false"
time=2026-04-17T00:37:46.236+03:00 level=INFO source=images.go:499 msg="total blobs: 50"
time=2026-04-17T00:37:46.237+03:00 level=INFO source=images.go:506 msg="total unused blobs removed: 0"
time=2026-04-17T00:37:46.237+03:00 level=INFO source=routes.go:1802 msg="Listening on [::]:11434 (version 0.20.2)"
time=2026-04-17T00:37:46.238+03:00 level=INFO source=runner.go:67 msg="discovering available GPUs..."
time=2026-04-17T00:37:46.238+03:00 level=INFO source=runner.go:106 msg="experimental Vulkan support disabled.  To enable, set OLLAMA_VULKAN=1"
time=2026-04-17T00:37:46.238+03:00 level=INFO source=server.go:432 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 41551"
time=2026-04-17T00:37:46.521+03:00 level=INFO source=server.go:432 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 43635"
time=2026-04-17T00:37:46.627+03:00 level=INFO source=server.go:432 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 47847"
time=2026-04-17T00:37:46.888+03:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-a94b4107-51b7-5a6f-b872-3e121139ed72 filter_id="" library=CUDA compute=8.6 name=CUDA0 description="NVIDIA GeForce RTX 3090" libdirs=ollama,cuda_v12 driver=12.8 pci_id=0000:81:00.0 type=discrete total="24.0 GiB" available="23.5 GiB"
time=2026-04-17T00:37:46.888+03:00 level=INFO source=routes.go:1852 msg="vram-based default context" total_vram="24.0 GiB" default_num_ctx=32768
time=2026-04-17T00:38:01.143+03:00 level=INFO source=server.go:432 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 43511"
time=2026-04-17T00:38:01.775+03:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
time=2026-04-17T00:38:01.775+03:00 level=INFO source=server.go:247 msg="enabling flash attention"
time=2026-04-17T00:38:01.775+03:00 level=INFO source=server.go:432 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /usr/share/ollama/.ollama/models/blobs/sha256-7121486771cbfe218851513210c40b35dbdee93ab1ef43fe36283c883980f0df --port 37345"
time=2026-04-17T00:38:01.776+03:00 level=INFO source=sched.go:484 msg="system memory" total="503.5 GiB" free="491.1 GiB" free_swap="0 B"
time=2026-04-17T00:38:01.776+03:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-a94b4107-51b7-5a6f-b872-3e121139ed72 library=CUDA available="23.0 GiB" free="23.5 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-04-17T00:38:01.776+03:00 level=INFO source=server.go:759 msg="loading model" "model layers"=31 requested=99
time=2026-04-17T00:38:01.795+03:00 level=INFO source=runner.go:1417 msg="starting ollama engine"
time=2026-04-17T00:38:01.796+03:00 level=INFO source=runner.go:1452 msg="Server listening on 127.0.0.1:37345"
time=2026-04-17T00:38:01.799+03:00 level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:32768 KvCacheType: NumThreads:48 GPULayers:31[ID:GPU-a94b4107-51b7-5a6f-b872-3e121139ed72 Layers:31(0..30)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-17T00:38:01.930+03:00 level=INFO source=ggml.go:136 msg="" architecture=gemma4 file_type=Q4_K_M name="" description="" num_tensors=1014 num_key_values=52
load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, ID: GPU-a94b4107-51b7-5a6f-b872-3e121139ed72
load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
time=2026-04-17T00:38:02.140+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2026-04-17T00:38:02.148+03:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
time=2026-04-17T00:38:02.176+03:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=5.370284ms bounds=(0,0)-(2048,2048)
time=2026-04-17T00:38:02.362+03:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=186.589034ms size="[768 768]"
time=2026-04-17T00:38:02.362+03:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
time=2026-04-17T00:38:02.362+03:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
time=2026-04-17T00:38:02.364+03:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=193.358759ms shape="[2816 256]"
time=2026-04-17T00:38:02.914+03:00 level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:32768 KvCacheType: NumThreads:48 GPULayers:31[ID:GPU-a94b4107-51b7-5a6f-b872-3e121139ed72 Layers:31(0..30)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-17T00:38:03.043+03:00 level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
time=2026-04-17T00:38:03.062+03:00 level=INFO source=model.go:138 msg="vision: decode" elapsed=777.192µs bounds=(0,0)-(2048,2048)
time=2026-04-17T00:38:03.232+03:00 level=INFO source=model.go:145 msg="vision: preprocess" elapsed=170.482516ms size="[768 768]"
time=2026-04-17T00:38:03.237+03:00 level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
time=2026-04-17T00:38:03.237+03:00 level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
time=2026-04-17T00:38:03.238+03:00 level=INFO source=model.go:156 msg="vision: encoded" elapsed=176.789484ms shape="[2816 256]"
time=2026-04-17T00:38:03.498+03:00 level=INFO source=runner.go:1290 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:32768 KvCacheType: NumThreads:48 GPULayers:31[ID:GPU-a94b4107-51b7-5a6f-b872-3e121139ed72 Layers:31(0..30)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-04-17T00:38:03.499+03:00 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="16.6 GiB"
time=2026-04-17T00:38:03.499+03:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="667.5 MiB"
time=2026-04-17T00:38:03.499+03:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="1.5 GiB"
time=2026-04-17T00:38:03.499+03:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="243.9 MiB"
time=2026-04-17T00:38:03.499+03:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="192.0 MiB"
time=2026-04-17T00:38:03.499+03:00 level=INFO source=device.go:272 msg="total memory" size="19.2 GiB"
time=2026-04-17T00:38:03.499+03:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
time=2026-04-17T00:38:03.499+03:00 level=INFO source=server.go:1352 msg="waiting for llama runner to start responding"
time=2026-04-17T00:38:03.499+03:00 level=INFO source=ggml.go:482 msg="offloading 30 repeating layers to GPU"
time=2026-04-17T00:38:03.499+03:00 level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
time=2026-04-17T00:38:03.499+03:00 level=INFO source=ggml.go:494 msg="offloaded 31/31 layers to GPU"
time=2026-04-17T00:38:03.499+03:00 level=INFO source=server.go:1386 msg="waiting for server to become available" status="llm server loading model"
time=2026-04-17T00:38:07.013+03:00 level=INFO source=server.go:1390 msg="llama runner started in 5.24 seconds"
[GIN] 2026/04/17 - 00:39:56 | 200 |         1m56s |       127.0.0.1 | POST     "/api/generate"

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

No response

extent analysis

TL;DR

Disabling flash attention (OLLAMA_FLASH_ATTENTION=0) may resolve the slow generation seen with gemma4:26b at a 32k context on an RTX 3090.

Guidance

  1. Verify the impact of flash attention: restart the server with OLLAMA_FLASH_ATTENTION=0 and repeat the same request to compare response times.
  2. Check system resources: monitor CPU and GPU utilization during generation to see where the time is spent when flash attention is enabled.
  3. Review the Ollama configuration: the log shows a 32768-token context (KvSize:32768), all 31 layers offloaded to the GPU, and 19.2 GiB of total memory (GPU plus CPU); confirm these match what you expect for a 24 GiB card.
  4. Consider updating Ollama: the issue form leaves the version blank, but the log reports version 0.20.2. A newer release might include performance optimizations or fixes related to flash attention.
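
The resource check in step 2 can be sketched with nvidia-smi. This is a minimal helper, not part of Ollama: the function name gpu_summary is made up for illustration, and it assumes the CSV output of nvidia-smi's --query-gpu option with the memory.used, memory.total and utilization.gpu fields.

```shell
# Summarize one CSV line produced by:
#   nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu \
#              --format=csv,noheader,nounits
# Input looks like "19660, 24576, 97" (MiB used, MiB total, percent busy).
# gpu_summary is an illustrative name, not an Ollama or NVIDIA command.
gpu_summary() {
  awk -F', *' '{
    printf "vram_used=%d MiB vram_free=%d MiB gpu_util=%d%%\n", $1, $2 - $1, $3
  }'
}
```

Run the nvidia-smi command above piped into gpu_summary while a request is generating. If GPU utilization stays low while CPU cores are busy, part of the work is probably running on the CPU, which would be worth comparing with flash attention on and off.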

Example

No code changes are suggested based on the provided information. To disable flash attention, replace OLLAMA_FLASH_ATTENTION=1 with OLLAMA_FLASH_ATTENTION=0 in the launch command and restart ollama serve.
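
To compare the two settings with a number rather than an impression, one option is to compute tokens per second from the eval_count and eval_duration fields of a non-streaming /api/generate response. The sketch below assumes those fields are present; the function name tokens_per_second and the prompt are placeholders, and the model tag is taken from the issue title.

```shell
# Read an Ollama /api/generate response (stream=false) on stdin and print
# generation speed in tokens per second. eval_count is the number of
# generated tokens; eval_duration is the generation time in nanoseconds.
# tokens_per_second is an illustrative helper name.
tokens_per_second() {
  resp=$(cat)
  count=$(printf '%s\n' "$resp" | sed -n -E 's/.*"eval_count":[[:space:]]*([0-9]+).*/\1/p')
  nanos=$(printf '%s\n' "$resp" | sed -n -E 's/.*"eval_duration":[[:space:]]*([0-9]+).*/\1/p')
  if [ -z "$count" ] || [ -z "$nanos" ] || [ "$nanos" -eq 0 ]; then
    echo "no eval statistics found" >&2
    return 1
  fi
  awk -v c="$count" -v n="$nanos" 'BEGIN { printf "%.2f\n", c / (n / 1e9) }'
}

# Usage (run once per server configuration, then compare the two numbers):
#   curl -s http://localhost:11434/api/generate \
#     -d '{"model": "gemma4:26b", "prompt": "Why is the sky blue?", "stream": false}' \
#     | tokens_per_second
```

Using the same prompt for both runs keeps the comparison fair, since prompt length affects how much of the 32k context is actually exercised.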

Notes

The provided logs do not pinpoint the cause. They show all 31 layers offloaded to the GPU, with 667.5 MiB of model weights and a 192.0 MiB compute graph placed on the CPU, and a single /api/generate request taking 1m56s. Disabling flash attention is a straightforward first troubleshooting step. The reported difference between GPU-only and GPU+CPU runs suggests a potential issue with how Ollama uses system resources when flash attention is enabled.
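
The memory placement lines are easy to miss in a long log. As a rough aid, the filter below keeps only the source=device.go lines and the layer offload summary and strips the timestamp prefix. It assumes the log format shown in this issue; summarize_memory is a made-up helper name.

```shell
# Read an Ollama server log on stdin and print only the load-time memory
# placement lines (source=device.go) and the "offloaded N/M layers" summary,
# with everything up to msg= removed.
# summarize_memory is an illustrative name; it relies on the log layout
# seen in this issue, which may differ between Ollama versions.
summarize_memory() {
  grep -E 'source=device\.go:|msg="offloaded [0-9]+/[0-9]+ layers' \
    | sed -E 's/^.*msg=//'
}
```

Feeding it a saved copy of the server log (for example, summarize_memory < ollama.log, where the file name is a placeholder) makes it quick to compare how weights, KV cache and compute graph are split between CUDA0 and CPU across two runs.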

Recommendation

Apply workaround: disable flash attention and repeat the same request to check whether performance improves, since it appears to be a contributing factor to the slowdown.
