ollama: Performance drop from v0.17 onwards [4 comments, 3 participants]

GitHub stats
ollama/ollama#14772 · Fetched 2026-04-08 00:31:49
Comments: 4 · Participants: 3 · Timeline events: 13 · Reactions: 1
Timeline (top): commented ×4, subscribed ×4, labeled ×2, mentioned ×2


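It seems that from v0.17 onwards, the speed benchmark has degraded substantially, at least on a GH200 NVL2 server (single GPU, 144 GB) running Ubuntu 24 aarch64 with gpt-oss:120b. In the results below, v0.17.7 is roughly 10% to 29% lower than v0.16.3, depending on the number of parallel requests.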

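| N Parallel | Ollama (v0.15.6) | Ollama (v0.16.3) | Ollama (v0.17.7) |
|-----------:|-----------------:|-----------------:|-----------------:|
|          1 |           146.79 |           145.68 |           127.46 |
|          2 |           109.17 |           108.19 |            97.57 |
|          4 |           168.35 |           169.23 |           145.76 |
|          8 |           233.95 |           252.31 |           203.71 |
|         12 |           304.79 |           303.98 |           233.43 |
|         16 |           356.92 |           339.49 |           256.56 |
|         24 |           408.89 |           397.94 |           301.12 |
|         32 |           441.42 |           441.93 |           315.62 |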
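
To put a number on the regression, the benchmark values above can be compared per parallelism level. The following is a minimal Python sketch, not part of the original report: the names `RESULTS`, `percent_change`, and `regression_table` are hypothetical, and the issue does not state the unit of the figures, so the code only assumes that higher means faster.

```python
# Benchmark values copied from the table above:
# N parallel -> (v0.15.6, v0.16.3, v0.17.7). The unit is not stated in the issue.
RESULTS = {
    1: (146.79, 145.68, 127.46),
    2: (109.17, 108.19, 97.57),
    4: (168.35, 169.23, 145.76),
    8: (233.95, 252.31, 203.71),
    12: (304.79, 303.98, 233.43),
    16: (356.92, 339.49, 256.56),
    24: (408.89, 397.94, 301.12),
    32: (441.42, 441.93, 315.62),
}


def percent_change(old: float, new: float) -> float:
    """Relative change from old to new in percent (negative means lower)."""
    return (new - old) / old * 100.0


def regression_table(results: dict) -> dict:
    """Map each N to the percent change of v0.17.7 relative to v0.16.3."""
    return {
        n: round(percent_change(v0163, v0177), 1)
        for n, (_v0156, v0163, v0177) in results.items()
    }


if __name__ == "__main__":
    for n, change in regression_table(RESULTS).items():
        print(f"N={n:>2}: {change:+.1f}% vs v0.16.3")
```

With these figures the change is about -12.5% at N=1 and about -28.6% at N=32, so the gap widens as parallelism increases.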

Ollama service configuration:

[Service]
Environment="OLLAMA_NUM_PARALLEL=32"
Environment="OLLAMA_CONTEXT_LENGTH=32500"
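
These two settings line up with the `KvSize:1040000` value in the runner's load request in the log output: 32 × 32500 = 1,040,000. The sketch below checks that arithmetic in Python. It is an illustration only: the helper names `parse_environment` and `expected_kv_size` are hypothetical, and the multiplicative relationship is inferred from the logged values in this issue rather than taken from Ollama's documentation.

```python
import re

# The systemd drop-in from the issue, reproduced as a string for parsing.
UNIT_SNIPPET = '''
[Service]
Environment="OLLAMA_NUM_PARALLEL=32"
Environment="OLLAMA_CONTEXT_LENGTH=32500"
'''

ENV_LINE = re.compile(r'^Environment="([^=]+)=([^"]*)"$')


def parse_environment(unit_text: str) -> dict:
    """Collect KEY=VALUE pairs from Environment="KEY=VALUE" lines."""
    env = {}
    for line in unit_text.splitlines():
        match = ENV_LINE.match(line.strip())
        if match:
            env[match.group(1)] = match.group(2)
    return env


def expected_kv_size(env: dict) -> int:
    """Assumed relationship: total KV size = parallel slots x context length."""
    return int(env["OLLAMA_NUM_PARALLEL"]) * int(env["OLLAMA_CONTEXT_LENGTH"])


if __name__ == "__main__":
    env = parse_environment(UNIT_SNIPPET)
    print(expected_kv_size(env))  # compare with KvSize in the load request
```

If the inference holds, lowering either variable shrinks the KV cache proportionally, which may be worth trying when comparing versions under the same memory footprint.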

Hope the developer team has noticed. Thanks!

Relevant log output

Mar 11 12:02:35 computation9 systemd[1]: Started ollama.service - Ollama Service.
Mar 11 12:02:35 computation9 ollama[551222]: time=2026-03-11T12:02:35.923+08:00 level=INFO source=routes.go:1658 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:32500 OLLAMA_DEBUG:INFO OLLAMA_EDITOR: OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NO_CLOUD:false OLLAMA_NUM_PARALLEL:32 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false OLLAMA_VULKAN:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Mar 11 12:02:35 computation9 ollama[551222]: time=2026-03-11T12:02:35.923+08:00 level=INFO source=routes.go:1660 msg="Ollama cloud disabled: false"
Mar 11 12:02:35 computation9 ollama[551222]: time=2026-03-11T12:02:35.926+08:00 level=INFO source=images.go:477 msg="total blobs: 91"
Mar 11 12:02:35 computation9 ollama[551222]: time=2026-03-11T12:02:35.927+08:00 level=INFO source=images.go:484 msg="total unused blobs removed: 0"
Mar 11 12:02:35 computation9 ollama[551222]: time=2026-03-11T12:02:35.928+08:00 level=INFO source=routes.go:1713 msg="Listening on 127.0.0.1:11434 (version 0.17.7)"
Mar 11 12:02:35 computation9 ollama[551222]: time=2026-03-11T12:02:35.928+08:00 level=INFO source=runner.go:67 msg="discovering available GPUs..."
Mar 11 12:02:35 computation9 ollama[551222]: time=2026-03-11T12:02:35.928+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 39363"
Mar 11 12:02:37 computation9 ollama[551222]: time=2026-03-11T12:02:37.079+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 40109"
Mar 11 12:02:37 computation9 ollama[551222]: time=2026-03-11T12:02:37.119+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 42013"
Mar 11 12:02:37 computation9 ollama[551222]: time=2026-03-11T12:02:37.119+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 37303"
Mar 11 12:02:37 computation9 ollama[551222]: time=2026-03-11T12:02:37.814+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-bc1dca20-2fec-fd2c-1a62-bf91452b6134 filter_id="" library=CUDA compute=9.0 name=CUDA0 description="NVIDIA GH200 144G HBM3e" libdirs=ollama,cuda_v12 driver=12.9 pci_id=0009:01:00.0 type=discrete total="143.4 GiB" available="142.5 GiB"
Mar 11 12:02:37 computation9 ollama[551222]: time=2026-03-11T12:02:37.814+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-906a60a9-f6a7-80b4-08fc-0a69c146aa90 filter_id="" library=CUDA compute=9.0 name=CUDA1 description="NVIDIA GH200 144G HBM3e" libdirs=ollama,cuda_v12 driver=12.9 pci_id=0019:01:00.0 type=discrete total="143.4 GiB" available="142.0 GiB"
Mar 11 12:02:37 computation9 ollama[551222]: time=2026-03-11T12:02:37.814+08:00 level=INFO source=routes.go:1763 msg="vram-based default context" total_vram="286.8 GiB" default_num_ctx=262144
Mar 11 12:02:46 computation9 ollama[551222]: time=2026-03-11T12:02:46.828+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 35939"
Mar 11 12:02:47 computation9 ollama[551222]: time=2026-03-11T12:02:47.911+08:00 level=INFO source=server.go:246 msg="enabling flash attention"
Mar 11 12:02:47 computation9 ollama[551222]: time=2026-03-11T12:02:47.911+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /usr/share/ollama/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 39807"
Mar 11 12:02:47 computation9 ollama[551222]: time=2026-03-11T12:02:47.912+08:00 level=INFO source=sched.go:489 msg="system memory" total="1242.8 GiB" free="1163.8 GiB" free_swap="0 B"
Mar 11 12:02:47 computation9 ollama[551222]: time=2026-03-11T12:02:47.912+08:00 level=INFO source=sched.go:496 msg="gpu memory" id=GPU-bc1dca20-2fec-fd2c-1a62-bf91452b6134 library=CUDA available="142.0 GiB" free="142.5 GiB" minimum="457.0 MiB" overhead="0 B"
Mar 11 12:02:47 computation9 ollama[551222]: time=2026-03-11T12:02:47.912+08:00 level=INFO source=sched.go:496 msg="gpu memory" id=GPU-906a60a9-f6a7-80b4-08fc-0a69c146aa90 library=CUDA available="141.5 GiB" free="142.0 GiB" minimum="457.0 MiB" overhead="0 B"
Mar 11 12:02:47 computation9 ollama[551222]: time=2026-03-11T12:02:47.912+08:00 level=INFO source=server.go:757 msg="loading model" "model layers"=37 requested=-1
Mar 11 12:02:47 computation9 ollama[551222]: time=2026-03-11T12:02:47.921+08:00 level=INFO source=runner.go:1429 msg="starting ollama engine"
Mar 11 12:02:47 computation9 ollama[551222]: time=2026-03-11T12:02:47.921+08:00 level=INFO source=runner.go:1464 msg="Server listening on 127.0.0.1:39807"
Mar 11 12:02:47 computation9 ollama[551222]: time=2026-03-11T12:02:47.923+08:00 level=INFO source=runner.go:1302 msg=load request="{Operation:fit LoraPath:[] Parallel:32 BatchSize:512 FlashAttention:Enabled KvSize:1040000 KvCacheType: NumThreads:144 GPULayers:37[ID:GPU-bc1dca20-2fec-fd2c-1a62-bf91452b6134 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Mar 11 12:02:47 computation9 ollama[551222]: time=2026-03-11T12:02:47.962+08:00 level=INFO source=ggml.go:136 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
Mar 11 12:02:47 computation9 ollama[551222]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu.so
Mar 11 12:02:48 computation9 ollama[551222]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Mar 11 12:02:48 computation9 ollama[551222]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Mar 11 12:02:48 computation9 ollama[551222]: ggml_cuda_init: found 2 CUDA devices:
Mar 11 12:02:48 computation9 ollama[551222]:   Device 0: NVIDIA GH200 144G HBM3e, compute capability 9.0, VMM: yes, ID: GPU-bc1dca20-2fec-fd2c-1a62-bf91452b6134
Mar 11 12:02:48 computation9 ollama[551222]:   Device 1: NVIDIA GH200 144G HBM3e, compute capability 9.0, VMM: yes, ID: GPU-906a60a9-f6a7-80b4-08fc-0a69c146aa90
Mar 11 12:02:48 computation9 ollama[551222]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Mar 11 12:02:48 computation9 ollama[551222]: time=2026-03-11T12:02:48.555+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.LLAMAFILE=1 CPU.1.NEON=1 CPU.1.ARM_FMA=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
Mar 11 12:02:49 computation9 ollama[551222]: time=2026-03-11T12:02:49.980+08:00 level=INFO source=runner.go:1302 msg=load request="{Operation:alloc LoraPath:[] Parallel:32 BatchSize:512 FlashAttention:Enabled KvSize:1040000 KvCacheType: NumThreads:144 GPULayers:37[ID:GPU-bc1dca20-2fec-fd2c-1a62-bf91452b6134 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Mar 11 12:02:59 computation9 ollama[551222]: time=2026-03-11T12:02:59.460+08:00 level=INFO source=runner.go:1302 msg=load request="{Operation:commit LoraPath:[] Parallel:32 BatchSize:512 FlashAttention:Enabled KvSize:1040000 KvCacheType: NumThreads:144 GPULayers:37[ID:GPU-bc1dca20-2fec-fd2c-1a62-bf91452b6134 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Mar 11 12:02:59 computation9 ollama[551222]: time=2026-03-11T12:02:59.460+08:00 level=INFO source=ggml.go:482 msg="offloading 36 repeating layers to GPU"
Mar 11 12:02:59 computation9 ollama[551222]: time=2026-03-11T12:02:59.460+08:00 level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
Mar 11 12:02:59 computation9 ollama[551222]: time=2026-03-11T12:02:59.460+08:00 level=INFO source=ggml.go:494 msg="offloaded 37/37 layers to GPU"
Mar 11 12:02:59 computation9 ollama[551222]: time=2026-03-11T12:02:59.460+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="59.8 GiB"
Mar 11 12:02:59 computation9 ollama[551222]: time=2026-03-11T12:02:59.460+08:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="1.1 GiB"
Mar 11 12:02:59 computation9 ollama[551222]: time=2026-03-11T12:02:59.460+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="40.2 GiB"
Mar 11 12:02:59 computation9 ollama[551222]: time=2026-03-11T12:02:59.460+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="3.4 GiB"
Mar 11 12:02:59 computation9 ollama[551222]: time=2026-03-11T12:02:59.460+08:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="5.6 MiB"
Mar 11 12:02:59 computation9 ollama[551222]: time=2026-03-11T12:02:59.460+08:00 level=INFO source=device.go:272 msg="total memory" size="104.5 GiB"
Mar 11 12:02:59 computation9 ollama[551222]: time=2026-03-11T12:02:59.460+08:00 level=INFO source=sched.go:565 msg="loaded runners" count=1
Mar 11 12:02:59 computation9 ollama[551222]: time=2026-03-11T12:02:59.460+08:00 level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
Mar 11 12:02:59 computation9 ollama[551222]: time=2026-03-11T12:02:59.461+08:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model"
Mar 11 12:03:01 computation9 ollama[551222]: time=2026-03-11T12:03:01.722+08:00 level=INFO source=server.go:1388 msg="llama runner started in 13.81 seconds"
Mar 11 12:03:03 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:03:03 | 200 | 16.513907281s |       127.0.0.1 | POST     "/api/chat"
Mar 11 12:03:05 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:03:05 | 200 |  2.325891396s |       127.0.0.1 | POST     "/api/chat"
Mar 11 12:03:06 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:03:06 | 200 |  1.082572352s |       127.0.0.1 | POST     "/api/chat"
Mar 11 12:03:08 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:03:08 | 200 |  1.979476779s |       127.0.0.1 | POST     "/api/chat"
Mar 11 12:03:31 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:03:31 | 200 |    1.916989ms |       127.0.0.1 | GET      "/api/tags"
Mar 11 12:04:02 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:04:02 | 200 |  31.27136917s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:04:03 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:04:03 | 200 |    1.854138ms |       127.0.0.1 | GET      "/api/tags"
Mar 11 12:05:24 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:05:24 | 200 |         1m21s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:05:24 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:05:24 | 200 |         1m21s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:05:24 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:05:24 | 200 |    1.809527ms |       127.0.0.1 | GET      "/api/tags"
Mar 11 12:07:14 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:07:14 | 200 |         1m49s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:07:14 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:07:14 | 200 |         1m49s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:07:14 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:07:14 | 200 |         1m49s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:07:14 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:07:14 | 200 |         1m49s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:07:14 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:07:14 | 200 |    1.740371ms |       127.0.0.1 | GET      "/api/tags"
Mar 11 12:09:56 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:09:56 | 200 |         2m41s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:09:56 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:09:56 | 200 |         2m41s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:09:56 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:09:56 | 200 |         2m41s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:09:56 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:09:56 | 200 |         2m41s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:09:56 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:09:56 | 200 |         2m41s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:09:56 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:09:56 | 200 |         2m41s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:09:56 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:09:56 | 200 |         2m41s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:09:56 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:09:56 | 200 |         2m41s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:09:56 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:09:56 | 200 |     2.30562ms |       127.0.0.1 | GET      "/api/tags"
Mar 11 12:13:23 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:13:23 | 200 |         3m27s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:13:24 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:13:23 | 200 |         3m27s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:13:24 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:13:23 | 200 |         3m27s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:13:24 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:13:23 | 200 |         3m27s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:13:24 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:13:23 | 200 |         3m27s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:13:24 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:13:23 | 200 |         3m27s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:13:24 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:13:24 | 200 |         3m27s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:13:24 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:13:24 | 200 |         3m27s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:13:24 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:13:24 | 200 |         3m27s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:13:24 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:13:24 | 200 |         3m27s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:13:24 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:13:24 | 200 |         3m27s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:13:24 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:13:24 | 200 |         3m27s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:13:24 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:13:24 | 200 |    1.811661ms |       127.0.0.1 | GET      "/api/tags"
Mar 11 12:16:14 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:16:14 | 200 |         2m50s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:17:26 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:17:26 | 200 |          4m2s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:17:26 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:17:26 | 200 |          4m2s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:17:26 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:17:26 | 200 |          4m2s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:17:26 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:17:26 | 200 |          4m2s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:17:26 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:17:26 | 200 |          4m2s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:17:26 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:17:26 | 200 |          4m2s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:17:26 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:17:26 | 200 |          4m2s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:17:26 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:17:26 | 200 |          4m2s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:17:26 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:17:26 | 200 |          4m2s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:17:26 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:17:26 | 200 |          4m2s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:17:26 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:17:26 | 200 |          4m2s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:17:26 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:17:26 | 200 |          4m2s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:17:26 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:17:26 | 200 |          4m2s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:17:26 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:17:26 | 200 |          4m2s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:17:26 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:17:26 | 200 |          4m2s |       127.0.0.1 | POST     "/api/generate"
Mar 11 12:17:27 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:17:27 | 200 |    2.269165ms |       127.0.0.1 | GET      "/api/tags"

OS

Linux / Ubuntu

GPU

Nvidia

CPU

aarch64

Ollama version

multiple

extent analysis

Fix Plan

The proposed fix lowers the concurrency and context settings that the server was started with (OLLAMA_NUM_PARALLEL=32 and OLLAMA_CONTEXT_LENGTH=32500 in the reported unit file) and turns on flash attention.

  • Lower OLLAMA_NUM_PARALLEL, for example from 32 to 16, to reduce the load on the GPU and CPU.
  • Lower OLLAMA_CONTEXT_LENGTH, for example from 32500 to 2048, to reduce memory usage. This only suits workloads whose prompts fit in the smaller window.
  • Set OLLAMA_FLASH_ATTENTION to true (the startup log shows it as false) to enable flash attention, which can improve performance.

Example configuration (for instance in a systemd override created with systemctl edit ollama.service):

[Service]
Environment="OLLAMA_NUM_PARALLEL=16"
Environment="OLLAMA_CONTEXT_LENGTH=2048"
Environment="OLLAMA_FLASH_ATTENTION=true"

After updating the configuration, reload systemd and restart the Ollama service (systemctl daemon-reload, then systemctl restart ollama).
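
The first two settings interact: according to Ollama's documentation on concurrency, the context allocated for a loaded model grows with the number of parallel request slots, so the server holds roughly OLLAMA_NUM_PARALLEL × OLLAMA_CONTEXT_LENGTH tokens of context. The following is a minimal sketch of that arithmetic using the values from this issue. The helper name is hypothetical, and the real memory footprint also depends on the model and the KV cache type, which this sketch does not attempt to estimate.

```python
def total_context_tokens(num_parallel: int, context_length: int) -> int:
    """Tokens of context held across all parallel slots (illustrative helper)."""
    return num_parallel * context_length


# Values from the reported unit file versus the proposed configuration.
reported = total_context_tokens(32, 32500)
proposed = total_context_tokens(16, 2048)

print(reported)                       # 1040000
print(proposed)                       # 32768
print(round(reported / proposed, 1))  # 31.7
```

Under this assumption the proposed settings cut the total context allocation by a factor of about 32, which is why they are expected to lower memory pressure.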

Verification

To verify the fix, restart the service, rerun the same speed benchmark, and compare the request durations in the [GIN] log lines with those from the earlier run. Watch GPU, CPU, and memory usage while the benchmark runs.
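
Comparing runs by eye is error-prone, so the sketch below extracts request durations from [GIN] lines like the ones in the log above and reports, for each group of consecutive POST /api/generate requests, how many there were and how long the slowest took. It assumes the benchmark separates its batches with another request (here GET /api/tags), as the log above suggests. The function names are illustrative and not part of Ollama.

```python
import re

# Matches the Gin access-log part of a journal line:
# [GIN] <date> - <time> | <status> | <duration> | <client ip> | <METHOD> "<path>"
GIN_RE = re.compile(
    r'\[GIN\] (?P<ts>\S+ - \S+) \| (?P<status>\d{3}) \|\s*(?P<dur>\S+) \|'
    r'\s*\S+ \|\s*(?P<method>\S+)\s+"(?P<path>[^"]+)"'
)
# Go-style duration components such as "1m21s", "31.27136917s", "1.916989ms".
PART_RE = re.compile(r'(\d+(?:\.\d+)?)(ms|µs|us|ns|h|m|s)')
UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0,
                "ms": 1e-3, "µs": 1e-6, "us": 1e-6, "ns": 1e-9}


def parse_duration(text: str) -> float:
    """Convert a Go-style duration string to seconds."""
    return sum(float(n) * UNIT_SECONDS[u] for n, u in PART_RE.findall(text))


def summarize_batches(lines):
    """Return (request_count, slowest_seconds) per run of consecutive generate calls."""
    batches, current = [], []
    for line in lines:
        m = GIN_RE.search(line)
        if not m:
            continue  # not an access-log line
        if m["method"] == "POST" and m["path"] == "/api/generate":
            current.append(parse_duration(m["dur"]))
        elif current:
            # Any other request closes the current batch.
            batches.append((len(current), max(current)))
            current = []
    if current:
        batches.append((len(current), max(current)))
    return batches


sample = [
    'Mar 11 12:04:02 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:04:02 | 200 |  31.27136917s |       127.0.0.1 | POST     "/api/generate"',
    'Mar 11 12:04:03 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:04:03 | 200 |    1.854138ms |       127.0.0.1 | GET      "/api/tags"',
    'Mar 11 12:05:24 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:05:24 | 200 |         1m21s |       127.0.0.1 | POST     "/api/generate"',
    'Mar 11 12:05:24 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:05:24 | 200 |         1m21s |       127.0.0.1 | POST     "/api/generate"',
    'Mar 11 12:05:24 computation9 ollama[551222]: [GIN] 2026/03/11 - 12:05:24 | 200 |    1.809527ms |       127.0.0.1 | GET      "/api/tags"',
]

for count, slowest in summarize_batches(sample):
    print(f"{count:>2} concurrent: slowest {slowest:.1f}s")
```

Running the same function over the service log captured before and after the configuration change (for example, the output of journalctl -u ollama saved to a file and read line by line) gives two lists that can be compared batch by batch.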

Extra Tips

  • Keep Ollama up to date so that you pick up later performance optimizations and bug fixes, and rerun the benchmark after each upgrade.
  • Consider adjusting the OLLAMA_GPU_OVERHEAD environment variable, which reserves a portion of VRAM per GPU (it is 0 in the startup log above), to tune GPU memory allocation.
  • Monitor GPU, CPU, and memory usage during the benchmark and adjust the settings above based on what you observe.
