ollama: generate completion API hangs with certain models but not with others

Error Message

time=2026-05-08T18:24:02.280+02:00 level=WARN source=sched.go:423 msg="model architecture does not currently support parallel requests" architecture=qwen35
time=2026-05-08T18:32:43.540+02:00 level=WARN source=sched.go:423 msg="model architecture does not currently support parallel requests" architecture=qwen35
time=2026-05-08T18:38:25.054+02:00 level=WARN source=sched.go:423 msg="model architecture does not currently support parallel requests" architecture=qwen35
time=2026-05-08T18:41:08.341+02:00 level=WARN source=server.go:169 msg="requested context size too large for model" num_ctx=262144 n_ctx_train=131072
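
As a rough sanity check on the last warning, the sketch below parses that log line and estimates how much memory a KV cache of the requested size would need. `parse_logfmt` and `kv_cache_bytes` are hypothetical helper names, not part of Ollama. The 2-byte element size assumes an f16 KV cache, which the log does not state (`KvCacheType` is empty). The values `n_layer=28`, `n_embd_k_gqa=1024`, `n_embd_v_gqa=1024` and `KvSize:262144` are taken from the Llama 3.2 3B load in the log below.

```python
import shlex


def parse_logfmt(line: str) -> dict[str, str]:
    """Split a key=value log line into a dict, honouring double-quoted values."""
    fields: dict[str, str] = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            fields[key] = value
    return fields


def kv_cache_bytes(n_layer: int, n_embd_k_gqa: int, n_embd_v_gqa: int,
                   n_tokens: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size: one K and one V vector per layer per token.

    bytes_per_elem=2 is an assumption (f16 cache), not something the log confirms.
    """
    return n_layer * (n_embd_k_gqa + n_embd_v_gqa) * n_tokens * bytes_per_elem


# Warning line copied from the error message above.
WARNING = (
    'time=2026-05-08T18:41:08.341+02:00 level=WARN source=server.go:169 '
    'msg="requested context size too large for model" '
    'num_ctx=262144 n_ctx_train=131072'
)

fields = parse_logfmt(WARNING)
num_ctx = int(fields["num_ctx"])
n_ctx_train = int(fields["n_ctx_train"])
ratio = num_ctx / n_ctx_train

# Model dimensions and KvSize as printed in the server log for Llama 3.2 3B.
kv_gib = kv_cache_bytes(n_layer=28, n_embd_k_gqa=1024, n_embd_v_gqa=1024,
                        n_tokens=262144) / 2**30

print(f"{fields['msg']}: requested {num_ctx}, trained {n_ctx_train} ({ratio:.1f}x)")
print(f"estimated KV cache: {kv_gib:.1f} GiB")
```

Under these assumptions the estimate comes out at 28.0 GiB, which matches the `kv cache` size the server reports for this load. Whether that allocation is related to the hang is not established by the log.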

Code Example

time=2026-05-08T18:24:02.280+02:00 level=WARN source=sched.go:423 msg="model architecture does not currently support parallel requests" architecture=qwen35
time=2026-05-08T18:24:02.320+02:00 level=INFO source=server.go:259 msg="enabling flash attention"
time=2026-05-08T18:24:02.320+02:00 level=INFO source=server.go:433 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/flavio/.ollama/models/blobs/sha256-b709d81508a078a686961de6ca07a953b895d9b286c46e17f00fb267f4f2d297 --port 49335"
time=2026-05-08T18:24:02.322+02:00 level=INFO source=sched.go:484 msg="system memory" total="64.0 GiB" free="47.4 GiB" free_swap="0 B"
time=2026-05-08T18:24:02.322+02:00 level=INFO source=sched.go:491 msg="gpu memory" id=0 library=Metal available="51.3 GiB" free="51.8 GiB" minimum="512.0 MiB" overhead="0 B"
time=2026-05-08T18:24:02.322+02:00 level=INFO source=server.go:792 msg="loading model" "model layers"=25 requested=-1
time=2026-05-08T18:24:02.379+02:00 level=INFO source=runner.go:1450 msg="starting ollama engine"
time=2026-05-08T18:24:02.379+02:00 level=INFO source=runner.go:1485 msg="Server listening on 127.0.0.1:49335"
time=2026-05-08T18:24:02.382+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:24:02.407+02:00 level=INFO source=ggml.go:144 msg="" architecture=qwen35 file_type=Q8_0 name="" description="" num_tensors=728 num_key_values=52
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.019 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   Apple M4 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 60129.54 MB
load_backend: loaded CPU backend from /Applications/Ollama.app/Contents/Resources/libggml-cpu.so
time=2026-05-08T18:24:02.439+02:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 CPU.1.NEON=1 CPU.1.ARM_FMA=1 CPU.1.FP16_VA=1 CPU.1.DOTPROD=1 CPU.1.LLAMAFILE=1 compiler=cgo(clang)
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
time=2026-05-08T18:24:03.227+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=ggml.go:490 msg="offloading 24 repeating layers to GPU"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=ggml.go:497 msg="offloading output layer to GPU"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=ggml.go:502 msg="offloaded 25/25 layers to GPU"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=device.go:240 msg="model weights" device=Metal size="2.5 GiB"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="523.8 MiB"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=device.go:251 msg="kv cache" device=Metal size="3.5 GiB"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=device.go:262 msg="compute graph" device=Metal size="3.2 GiB"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="26.7 MiB"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=device.go:272 msg="total memory" size="9.7 GiB"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
time=2026-05-08T18:24:04.482+02:00 level=INFO source=server.go:1385 msg="waiting for llama runner to start responding"
time=2026-05-08T18:24:04.483+02:00 level=INFO source=server.go:1428 msg="waiting for server to become available" status="llm server loading model"
time=2026-05-08T18:24:05.237+02:00 level=INFO source=server.go:1432 msg="llama runner started in 2.91 seconds"
[GIN] 2026/05/08 - 18:25:21 | 200 |     804.167µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:25:21 | 200 |          46µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2026/05/08 - 18:31:26 | 200 |      16.125µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:31:26 | 200 |    6.700084ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2026/05/08 - 18:31:26 | 200 |    6.635458ms |       127.0.0.1 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:32:06 | 200 |      13.042µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:32:06 | 200 |      18.125µs |       127.0.0.1 | GET      "/api/ps"
time=2026-05-08T18:32:08.392+02:00 level=INFO source=model_recommendations.go:177 msg="model recommendations cache sleep scheduled" wait=3h32m45.889623412s consecutive_failures=0
[GIN] 2026/05/08 - 18:32:42 | 200 |         8m40s |  129.152.22.122 | POST     "/api/generate"
time=2026-05-08T18:32:43.540+02:00 level=WARN source=sched.go:423 msg="model architecture does not currently support parallel requests" architecture=qwen35
time=2026-05-08T18:32:43.579+02:00 level=INFO source=server.go:259 msg="enabling flash attention"
time=2026-05-08T18:32:43.579+02:00 level=INFO source=server.go:433 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/flavio/.ollama/models/blobs/sha256-b709d81508a078a686961de6ca07a953b895d9b286c46e17f00fb267f4f2d297 --port 49346"
time=2026-05-08T18:32:43.581+02:00 level=INFO source=sched.go:484 msg="system memory" total="64.0 GiB" free="49.2 GiB" free_swap="0 B"
time=2026-05-08T18:32:43.581+02:00 level=INFO source=sched.go:491 msg="gpu memory" id=0 library=Metal available="51.3 GiB" free="51.8 GiB" minimum="512.0 MiB" overhead="0 B"
time=2026-05-08T18:32:43.581+02:00 level=INFO source=server.go:792 msg="loading model" "model layers"=25 requested=-1
time=2026-05-08T18:32:43.612+02:00 level=INFO source=runner.go:1450 msg="starting ollama engine"
time=2026-05-08T18:32:43.612+02:00 level=INFO source=runner.go:1485 msg="Server listening on 127.0.0.1:49346"
time=2026-05-08T18:32:43.613+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:32:43.637+02:00 level=INFO source=ggml.go:144 msg="" architecture=qwen35 file_type=Q8_0 name="" description="" num_tensors=728 num_key_values=52
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.006 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   Apple M4 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 60129.54 MB
load_backend: loaded CPU backend from /Applications/Ollama.app/Contents/Resources/libggml-cpu.so
time=2026-05-08T18:32:43.652+02:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 CPU.1.NEON=1 CPU.1.ARM_FMA=1 CPU.1.FP16_VA=1 CPU.1.DOTPROD=1 CPU.1.LLAMAFILE=1 compiler=cgo(clang)
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
time=2026-05-08T18:32:44.441+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=ggml.go:490 msg="offloading 24 repeating layers to GPU"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=ggml.go:497 msg="offloading output layer to GPU"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=ggml.go:502 msg="offloaded 25/25 layers to GPU"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=device.go:240 msg="model weights" device=Metal size="2.5 GiB"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="523.8 MiB"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=device.go:251 msg="kv cache" device=Metal size="3.5 GiB"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=device.go:262 msg="compute graph" device=Metal size="3.2 GiB"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="26.7 MiB"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=device.go:272 msg="total memory" size="9.7 GiB"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
time=2026-05-08T18:32:45.669+02:00 level=INFO source=server.go:1385 msg="waiting for llama runner to start responding"
time=2026-05-08T18:32:45.670+02:00 level=INFO source=server.go:1428 msg="waiting for server to become available" status="llm server loading model"
time=2026-05-08T18:32:45.921+02:00 level=INFO source=server.go:1432 msg="llama runner started in 2.34 seconds"
[GIN] 2026/05/08 - 18:34:09 | 200 |         1m26s |  129.152.22.122 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:34:56 | 200 |      21.541µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:34:56 | 200 |      42.292µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2026/05/08 - 18:35:09 | 200 |      22.375µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:35:09 | 200 |   16.207667ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2026/05/08 - 18:35:09 | 200 |   15.012042ms |       127.0.0.1 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:35:11 | 200 |      19.917µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:35:11 | 200 |       8.083µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2026/05/08 - 18:38:24 | 200 |      31.833µs |  129.152.22.122 | GET      "/api/version"
time=2026-05-08T18:38:25.054+02:00 level=WARN source=sched.go:423 msg="model architecture does not currently support parallel requests" architecture=qwen35
time=2026-05-08T18:38:25.093+02:00 level=INFO source=server.go:259 msg="enabling flash attention"
time=2026-05-08T18:38:25.093+02:00 level=INFO source=server.go:433 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/flavio/.ollama/models/blobs/sha256-b709d81508a078a686961de6ca07a953b895d9b286c46e17f00fb267f4f2d297 --port 49355"
time=2026-05-08T18:38:25.094+02:00 level=INFO source=sched.go:484 msg="system memory" total="64.0 GiB" free="47.6 GiB" free_swap="0 B"
time=2026-05-08T18:38:25.094+02:00 level=INFO source=sched.go:491 msg="gpu memory" id=0 library=Metal available="51.3 GiB" free="51.8 GiB" minimum="512.0 MiB" overhead="0 B"
time=2026-05-08T18:38:25.094+02:00 level=INFO source=server.go:792 msg="loading model" "model layers"=25 requested=-1
time=2026-05-08T18:38:25.126+02:00 level=INFO source=runner.go:1450 msg="starting ollama engine"
time=2026-05-08T18:38:25.126+02:00 level=INFO source=runner.go:1485 msg="Server listening on 127.0.0.1:49355"
time=2026-05-08T18:38:25.127+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:38:25.151+02:00 level=INFO source=ggml.go:144 msg="" architecture=qwen35 file_type=Q8_0 name="" description="" num_tensors=728 num_key_values=52
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.006 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   Apple M4 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 60129.54 MB
load_backend: loaded CPU backend from /Applications/Ollama.app/Contents/Resources/libggml-cpu.so
time=2026-05-08T18:38:25.165+02:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 CPU.1.NEON=1 CPU.1.ARM_FMA=1 CPU.1.FP16_VA=1 CPU.1.DOTPROD=1 CPU.1.LLAMAFILE=1 compiler=cgo(clang)
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
time=2026-05-08T18:38:25.924+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=ggml.go:490 msg="offloading 24 repeating layers to GPU"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=ggml.go:497 msg="offloading output layer to GPU"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=ggml.go:502 msg="offloaded 25/25 layers to GPU"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=device.go:240 msg="model weights" device=Metal size="2.5 GiB"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="523.8 MiB"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=device.go:251 msg="kv cache" device=Metal size="3.5 GiB"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=device.go:262 msg="compute graph" device=Metal size="3.2 GiB"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="26.7 MiB"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=device.go:272 msg="total memory" size="9.7 GiB"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
time=2026-05-08T18:38:27.165+02:00 level=INFO source=server.go:1385 msg="waiting for llama runner to start responding"
time=2026-05-08T18:38:27.166+02:00 level=INFO source=server.go:1428 msg="waiting for server to become available" status="llm server loading model"
time=2026-05-08T18:38:27.417+02:00 level=INFO source=server.go:1432 msg="llama runner started in 2.32 seconds"
[GIN] 2026/05/08 - 18:39:03 | 200 |      11.125µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:39:03 | 200 |      18.833µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2026/05/08 - 18:39:27 | 200 |      12.916µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:39:27 | 200 |    7.448292ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2026/05/08 - 18:39:27 | 200 |      6.8675ms |       127.0.0.1 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:39:31 | 200 |      18.667µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:39:31 | 200 |       32.75µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2026/05/08 - 18:39:58 | 200 |         1m33s |  129.152.22.122 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:40:18 | 200 |      19.917µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:40:59 | 200 |      55.417µs |  129.152.22.122 | GET      "/api/version"
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 7.989 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   Apple M4 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 60129.54 MB
llama_model_load_from_file_impl: using device Metal (Apple M4 Pro) (unknown id) - 57343 MiB free
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /Users/flavio/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.87 GiB (5.01 BPW) 
load: printing all EOG tokens:
load:   - 128001 ('<|end_of_text|>')
load:   - 128008 ('<|eom_id|>')
load:   - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 1
print_info: no_alloc         = 0
print_info: model type       = ?B
print_info: model params     = 3.21 B
print_info: general.name     = Llama 3.2 3B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128001 '<|end_of_text|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2026-05-08T18:41:08.341+02:00 level=WARN source=server.go:169 msg="requested context size too large for model" num_ctx=262144 n_ctx_train=131072
time=2026-05-08T18:41:08.341+02:00 level=INFO source=server.go:433 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --model /Users/flavio/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --port 49363"
time=2026-05-08T18:41:08.344+02:00 level=INFO source=sched.go:484 msg="system memory" total="64.0 GiB" free="47.6 GiB" free_swap="0 B"
time=2026-05-08T18:41:08.344+02:00 level=INFO source=sched.go:491 msg="gpu memory" id=0 library=Metal available="51.3 GiB" free="51.8 GiB" minimum="512.0 MiB" overhead="0 B"
time=2026-05-08T18:41:08.344+02:00 level=INFO source=server.go:532 msg="loading model" "model layers"=29 requested=-1
time=2026-05-08T18:41:08.344+02:00 level=INFO source=device.go:240 msg="model weights" device=Metal size="1.9 GiB"
time=2026-05-08T18:41:08.344+02:00 level=INFO source=device.go:251 msg="kv cache" device=Metal size="28.0 GiB"
time=2026-05-08T18:41:08.344+02:00 level=INFO source=device.go:262 msg="compute graph" device=Metal size="12.5 GiB"
time=2026-05-08T18:41:08.344+02:00 level=INFO source=device.go:272 msg="total memory" size="42.4 GiB"
time=2026-05-08T18:41:08.376+02:00 level=INFO source=runner.go:965 msg="starting go runner"
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 8.010 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   Apple M4 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 60129.54 MB
load_backend: loaded CPU backend from /Applications/Ollama.app/Contents/Resources/libggml-cpu.so
time=2026-05-08T18:41:16.395+02:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 CPU.1.NEON=1 CPU.1.ARM_FMA=1 CPU.1.FP16_VA=1 CPU.1.DOTPROD=1 CPU.1.LLAMAFILE=1 compiler=cgo(clang)
time=2026-05-08T18:41:16.396+02:00 level=INFO source=runner.go:1001 msg="Server listening on 127.0.0.1:49363"
time=2026-05-08T18:41:16.396+02:00 level=INFO source=runner.go:895 msg=load request="{Operation:commit LoraPath:[] Parallel:2 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:29[ID:0 Layers:29(0..28)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:true}"
llama_model_load_from_file_impl: using device Metal (Apple M4 Pro) (unknown id) - 57342 MiB free
time=2026-05-08T18:41:16.397+02:00 level=INFO source=server.go:1385 msg="waiting for llama runner to start responding"
time=2026-05-08T18:41:16.397+02:00 level=INFO source=server.go:1428 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /Users/flavio/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.87 GiB (5.01 BPW) 
load: printing all EOG tokens:
load:   - 128001 ('<|end_of_text|>')
load:   - 128008 ('<|eom_id|>')
load:   - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: no_alloc         = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 3072
print_info: n_embd_inp       = 3072
print_info: n_layer          = 28
print_info: n_head           = 24
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 3
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 8192
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned   = unknown
print_info: model type       = 3B
print_info: model params     = 3.21 B
print_info: general.name     = Llama 3.2 3B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128001 '<|end_of_text|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   308.23 MiB
load_tensors: Metal_Mapped model buffer size =  1918.35 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 2
llama_context: n_ctx         = 262144
llama_context: n_ctx_seq     = 131072
llama_context: n_batch       = 1024
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
llama_context:        CPU  output buffer size =     1.00 MiB
llama_kv_cache:      Metal KV buffer size = 28672.00 MiB
llama_kv_cache: size = 28672.00 MiB (131072 cells,  28 layers,  2/2 seqs), K (f16): 14336.00 MiB, V (f16): 14336.00 MiB
llama_context:      Metal compute buffer size =   396.01 MiB
llama_context:        CPU compute buffer size =   262.01 MiB
llama_context: graph nodes  = 931
llama_context: graph splits = 2
time=2026-05-08T18:41:18.657+02:00 level=INFO source=server.go:1432 msg="llama runner started in 10.31 seconds"
time=2026-05-08T18:41:18.657+02:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
time=2026-05-08T18:41:18.657+02:00 level=INFO source=server.go:1385 msg="waiting for llama runner to start responding"
time=2026-05-08T18:41:18.657+02:00 level=INFO source=server.go:1432 msg="llama runner started in 10.31 seconds"
[GIN] 2026/05/08 - 18:41:19 | 200 | 19.028101083s |  129.152.22.122 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:41:23 | 200 |      20.083µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:41:23 | 200 |      36.708µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2026/05/08 - 18:42:04 | 200 |      19.625µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:42:04 | 200 |  383.160792ms |  129.152.22.122 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:42:37 | 200 |      18.125µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:42:37 | 200 |   278.98225ms |  129.152.22.122 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:43:09 | 200 |       18.75µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:43:12 | 200 |        19.5µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:43:22 | 200 |      18.583µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:43:26 | 200 |      18.125µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:43:36 | 200 |      20.333µs |  129.152.22.122 | GET      "/api/version"
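
The KV cache figures in the llama3.2:3b log above (28672.00 MiB total, 14336.00 MiB each for K and V) can be reproduced with simple arithmetic from values printed in the same log: n_layer = 28, n_embd_k_gqa = n_embd_v_gqa = 1024, 131072 cells, 2 sequences, and f16 (2 bytes per element). The sketch below is a back-of-the-envelope check under those assumptions, not a description of how llama.cpp computes the size internally; the function name kv_cache_mib is made up for illustration.

```python
def kv_cache_mib(n_layer, n_embd_k_gqa, n_embd_v_gqa, n_cells, n_seqs, bytes_per_elem=2):
    """Estimate K, V and total KV cache size in MiB (hypothetical helper)."""
    mib = 1024 * 1024
    # One K row and one V row per layer, per cell, per sequence.
    k_bytes = n_layer * n_embd_k_gqa * n_cells * n_seqs * bytes_per_elem
    v_bytes = n_layer * n_embd_v_gqa * n_cells * n_seqs * bytes_per_elem
    return k_bytes / mib, v_bytes / mib, (k_bytes + v_bytes) / mib


# Values taken from the log: 28 layers, 1024-wide K/V, 131072 cells, 2 seqs, f16.
k_mib, v_mib, total_mib = kv_cache_mib(28, 1024, 1024, 131072, 2)
print(k_mib, v_mib, total_mib)  # 14336.0 14336.0 28672.0
```

The total of 28672 MiB is 28 GiB, which agrees with the kv cache size="28.0 GiB" line that the scheduler logs for this model.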

What is the issue?

The generate completion API (POST /api/generate) "hangs" when a request is submitted to model qwen3.5:2b (Q8_0), but completes successfully in less than 2 s when submitted to model llama3.2:3b (Q4_K_M). Environment: Ollama 0.23.2 on macOS Tahoe 26.4.1.

After submitting the request to qwen3.5:2b, the Mac's GPU usage goes up to 100% and stays there indefinitely (I had to kill the process with kill -9 after 5 minutes).

This happens regardless of the prompt.
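
For anyone trying to reproduce this, the following is a minimal sketch of a non-streaming /api/generate call with a client-side timeout, so a hung request fails instead of blocking forever. The URL assumes Ollama's default local address (the original requests came from a remote client), and the helper names build_generate_request and generate are made up for illustration. The optional num_ctx override is only a variable to change while narrowing the problem down (the log below shows KvSize:262144), not a confirmed fix.

```python
import json
import urllib.request

# Assumption: Ollama listening on its default local address and port.
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def build_generate_request(model, prompt, num_ctx=None):
    """Build the JSON body for a non-streaming /api/generate call."""
    body = {"model": model, "prompt": prompt, "stream": False}
    if num_ctx is not None:
        # Optional per-request context window override.
        body["options"] = {"num_ctx": num_ctx}
    return body


def generate(model, prompt, num_ctx=None, timeout_s=60.0, url=OLLAMA_URL):
    """POST the request; raises if the server stays silent for timeout_s seconds."""
    data = json.dumps(build_generate_request(model, prompt, num_ctx)).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling generate("llama3.2:3b", "Hello") and then generate("qwen3.5:2b", "Hello") with the same timeout gives a direct comparison of the two models; repeating the second call with a smaller num_ctx, for example 8192, would show whether the context size plays a role.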

Relevant log output

time=2026-05-08T18:24:02.280+02:00 level=WARN source=sched.go:423 msg="model architecture does not currently support parallel requests" architecture=qwen35
time=2026-05-08T18:24:02.320+02:00 level=INFO source=server.go:259 msg="enabling flash attention"
time=2026-05-08T18:24:02.320+02:00 level=INFO source=server.go:433 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/flavio/.ollama/models/blobs/sha256-b709d81508a078a686961de6ca07a953b895d9b286c46e17f00fb267f4f2d297 --port 49335"
time=2026-05-08T18:24:02.322+02:00 level=INFO source=sched.go:484 msg="system memory" total="64.0 GiB" free="47.4 GiB" free_swap="0 B"
time=2026-05-08T18:24:02.322+02:00 level=INFO source=sched.go:491 msg="gpu memory" id=0 library=Metal available="51.3 GiB" free="51.8 GiB" minimum="512.0 MiB" overhead="0 B"
time=2026-05-08T18:24:02.322+02:00 level=INFO source=server.go:792 msg="loading model" "model layers"=25 requested=-1
time=2026-05-08T18:24:02.379+02:00 level=INFO source=runner.go:1450 msg="starting ollama engine"
time=2026-05-08T18:24:02.379+02:00 level=INFO source=runner.go:1485 msg="Server listening on 127.0.0.1:49335"
time=2026-05-08T18:24:02.382+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:24:02.407+02:00 level=INFO source=ggml.go:144 msg="" architecture=qwen35 file_type=Q8_0 name="" description="" num_tensors=728 num_key_values=52
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.019 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   Apple M4 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 60129.54 MB
load_backend: loaded CPU backend from /Applications/Ollama.app/Contents/Resources/libggml-cpu.so
time=2026-05-08T18:24:02.439+02:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 CPU.1.NEON=1 CPU.1.ARM_FMA=1 CPU.1.FP16_VA=1 CPU.1.DOTPROD=1 CPU.1.LLAMAFILE=1 compiler=cgo(clang)
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
time=2026-05-08T18:24:03.227+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=ggml.go:490 msg="offloading 24 repeating layers to GPU"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=ggml.go:497 msg="offloading output layer to GPU"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=ggml.go:502 msg="offloaded 25/25 layers to GPU"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=device.go:240 msg="model weights" device=Metal size="2.5 GiB"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="523.8 MiB"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=device.go:251 msg="kv cache" device=Metal size="3.5 GiB"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=device.go:262 msg="compute graph" device=Metal size="3.2 GiB"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="26.7 MiB"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=device.go:272 msg="total memory" size="9.7 GiB"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
time=2026-05-08T18:24:04.482+02:00 level=INFO source=server.go:1385 msg="waiting for llama runner to start responding"
time=2026-05-08T18:24:04.483+02:00 level=INFO source=server.go:1428 msg="waiting for server to become available" status="llm server loading model"
time=2026-05-08T18:24:05.237+02:00 level=INFO source=server.go:1432 msg="llama runner started in 2.91 seconds"
[GIN] 2026/05/08 - 18:25:21 | 200 |     804.167µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:25:21 | 200 |          46µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2026/05/08 - 18:31:26 | 200 |      16.125µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:31:26 | 200 |    6.700084ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2026/05/08 - 18:31:26 | 200 |    6.635458ms |       127.0.0.1 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:32:06 | 200 |      13.042µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:32:06 | 200 |      18.125µs |       127.0.0.1 | GET      "/api/ps"
time=2026-05-08T18:32:08.392+02:00 level=INFO source=model_recommendations.go:177 msg="model recommendations cache sleep scheduled" wait=3h32m45.889623412s consecutive_failures=0
[GIN] 2026/05/08 - 18:32:42 | 200 |         8m40s |  129.152.22.122 | POST     "/api/generate"
time=2026-05-08T18:32:43.540+02:00 level=WARN source=sched.go:423 msg="model architecture does not currently support parallel requests" architecture=qwen35
time=2026-05-08T18:32:43.579+02:00 level=INFO source=server.go:259 msg="enabling flash attention"
time=2026-05-08T18:32:43.579+02:00 level=INFO source=server.go:433 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/flavio/.ollama/models/blobs/sha256-b709d81508a078a686961de6ca07a953b895d9b286c46e17f00fb267f4f2d297 --port 49346"
time=2026-05-08T18:32:43.581+02:00 level=INFO source=sched.go:484 msg="system memory" total="64.0 GiB" free="49.2 GiB" free_swap="0 B"
time=2026-05-08T18:32:43.581+02:00 level=INFO source=sched.go:491 msg="gpu memory" id=0 library=Metal available="51.3 GiB" free="51.8 GiB" minimum="512.0 MiB" overhead="0 B"
time=2026-05-08T18:32:43.581+02:00 level=INFO source=server.go:792 msg="loading model" "model layers"=25 requested=-1
time=2026-05-08T18:32:43.612+02:00 level=INFO source=runner.go:1450 msg="starting ollama engine"
time=2026-05-08T18:32:43.612+02:00 level=INFO source=runner.go:1485 msg="Server listening on 127.0.0.1:49346"
time=2026-05-08T18:32:43.613+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:32:43.637+02:00 level=INFO source=ggml.go:144 msg="" architecture=qwen35 file_type=Q8_0 name="" description="" num_tensors=728 num_key_values=52
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.006 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   Apple M4 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 60129.54 MB
load_backend: loaded CPU backend from /Applications/Ollama.app/Contents/Resources/libggml-cpu.so
time=2026-05-08T18:32:43.652+02:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 CPU.1.NEON=1 CPU.1.ARM_FMA=1 CPU.1.FP16_VA=1 CPU.1.DOTPROD=1 CPU.1.LLAMAFILE=1 compiler=cgo(clang)
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
time=2026-05-08T18:32:44.441+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=ggml.go:490 msg="offloading 24 repeating layers to GPU"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=ggml.go:497 msg="offloading output layer to GPU"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=ggml.go:502 msg="offloaded 25/25 layers to GPU"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=device.go:240 msg="model weights" device=Metal size="2.5 GiB"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="523.8 MiB"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=device.go:251 msg="kv cache" device=Metal size="3.5 GiB"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=device.go:262 msg="compute graph" device=Metal size="3.2 GiB"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="26.7 MiB"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=device.go:272 msg="total memory" size="9.7 GiB"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
time=2026-05-08T18:32:45.669+02:00 level=INFO source=server.go:1385 msg="waiting for llama runner to start responding"
time=2026-05-08T18:32:45.670+02:00 level=INFO source=server.go:1428 msg="waiting for server to become available" status="llm server loading model"
time=2026-05-08T18:32:45.921+02:00 level=INFO source=server.go:1432 msg="llama runner started in 2.34 seconds"
[GIN] 2026/05/08 - 18:34:09 | 200 |         1m26s |  129.152.22.122 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:34:56 | 200 |      21.541µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:34:56 | 200 |      42.292µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2026/05/08 - 18:35:09 | 200 |      22.375µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:35:09 | 200 |   16.207667ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2026/05/08 - 18:35:09 | 200 |   15.012042ms |       127.0.0.1 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:35:11 | 200 |      19.917µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:35:11 | 200 |       8.083µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2026/05/08 - 18:38:24 | 200 |      31.833µs |  129.152.22.122 | GET      "/api/version"
time=2026-05-08T18:38:25.054+02:00 level=WARN source=sched.go:423 msg="model architecture does not currently support parallel requests" architecture=qwen35
time=2026-05-08T18:38:25.093+02:00 level=INFO source=server.go:259 msg="enabling flash attention"
time=2026-05-08T18:38:25.093+02:00 level=INFO source=server.go:433 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/flavio/.ollama/models/blobs/sha256-b709d81508a078a686961de6ca07a953b895d9b286c46e17f00fb267f4f2d297 --port 49355"
time=2026-05-08T18:38:25.094+02:00 level=INFO source=sched.go:484 msg="system memory" total="64.0 GiB" free="47.6 GiB" free_swap="0 B"
time=2026-05-08T18:38:25.094+02:00 level=INFO source=sched.go:491 msg="gpu memory" id=0 library=Metal available="51.3 GiB" free="51.8 GiB" minimum="512.0 MiB" overhead="0 B"
time=2026-05-08T18:38:25.094+02:00 level=INFO source=server.go:792 msg="loading model" "model layers"=25 requested=-1
time=2026-05-08T18:38:25.126+02:00 level=INFO source=runner.go:1450 msg="starting ollama engine"
time=2026-05-08T18:38:25.126+02:00 level=INFO source=runner.go:1485 msg="Server listening on 127.0.0.1:49355"
time=2026-05-08T18:38:25.127+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:38:25.151+02:00 level=INFO source=ggml.go:144 msg="" architecture=qwen35 file_type=Q8_0 name="" description="" num_tensors=728 num_key_values=52
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.006 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   Apple M4 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 60129.54 MB
load_backend: loaded CPU backend from /Applications/Ollama.app/Contents/Resources/libggml-cpu.so
time=2026-05-08T18:38:25.165+02:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 CPU.1.NEON=1 CPU.1.ARM_FMA=1 CPU.1.FP16_VA=1 CPU.1.DOTPROD=1 CPU.1.LLAMAFILE=1 compiler=cgo(clang)
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
time=2026-05-08T18:38:25.924+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=ggml.go:490 msg="offloading 24 repeating layers to GPU"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=ggml.go:497 msg="offloading output layer to GPU"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=ggml.go:502 msg="offloaded 25/25 layers to GPU"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=device.go:240 msg="model weights" device=Metal size="2.5 GiB"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="523.8 MiB"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=device.go:251 msg="kv cache" device=Metal size="3.5 GiB"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=device.go:262 msg="compute graph" device=Metal size="3.2 GiB"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="26.7 MiB"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=device.go:272 msg="total memory" size="9.7 GiB"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
time=2026-05-08T18:38:27.165+02:00 level=INFO source=server.go:1385 msg="waiting for llama runner to start responding"
time=2026-05-08T18:38:27.166+02:00 level=INFO source=server.go:1428 msg="waiting for server to become available" status="llm server loading model"
time=2026-05-08T18:38:27.417+02:00 level=INFO source=server.go:1432 msg="llama runner started in 2.32 seconds"
[GIN] 2026/05/08 - 18:39:03 | 200 |      11.125µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:39:03 | 200 |      18.833µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2026/05/08 - 18:39:27 | 200 |      12.916µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:39:27 | 200 |    7.448292ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2026/05/08 - 18:39:27 | 200 |      6.8675ms |       127.0.0.1 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:39:31 | 200 |      18.667µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:39:31 | 200 |       32.75µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2026/05/08 - 18:39:58 | 200 |         1m33s |  129.152.22.122 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:40:18 | 200 |      19.917µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:40:59 | 200 |      55.417µs |  129.152.22.122 | GET      "/api/version"
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 7.989 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   Apple M4 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 60129.54 MB
llama_model_load_from_file_impl: using device Metal (Apple M4 Pro) (unknown id) - 57343 MiB free
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /Users/flavio/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.87 GiB (5.01 BPW) 
load: printing all EOG tokens:
load:   - 128001 ('<|end_of_text|>')
load:   - 128008 ('<|eom_id|>')
load:   - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 1
print_info: no_alloc         = 0
print_info: model type       = ?B
print_info: model params     = 3.21 B
print_info: general.name     = Llama 3.2 3B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128001 '<|end_of_text|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2026-05-08T18:41:08.341+02:00 level=WARN source=server.go:169 msg="requested context size too large for model" num_ctx=262144 n_ctx_train=131072
time=2026-05-08T18:41:08.341+02:00 level=INFO source=server.go:433 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --model /Users/flavio/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --port 49363"
time=2026-05-08T18:41:08.344+02:00 level=INFO source=sched.go:484 msg="system memory" total="64.0 GiB" free="47.6 GiB" free_swap="0 B"
time=2026-05-08T18:41:08.344+02:00 level=INFO source=sched.go:491 msg="gpu memory" id=0 library=Metal available="51.3 GiB" free="51.8 GiB" minimum="512.0 MiB" overhead="0 B"
time=2026-05-08T18:41:08.344+02:00 level=INFO source=server.go:532 msg="loading model" "model layers"=29 requested=-1
time=2026-05-08T18:41:08.344+02:00 level=INFO source=device.go:240 msg="model weights" device=Metal size="1.9 GiB"
time=2026-05-08T18:41:08.344+02:00 level=INFO source=device.go:251 msg="kv cache" device=Metal size="28.0 GiB"
time=2026-05-08T18:41:08.344+02:00 level=INFO source=device.go:262 msg="compute graph" device=Metal size="12.5 GiB"
time=2026-05-08T18:41:08.344+02:00 level=INFO source=device.go:272 msg="total memory" size="42.4 GiB"
time=2026-05-08T18:41:08.376+02:00 level=INFO source=runner.go:965 msg="starting go runner"
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 8.010 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   Apple M4 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 60129.54 MB
load_backend: loaded CPU backend from /Applications/Ollama.app/Contents/Resources/libggml-cpu.so
time=2026-05-08T18:41:16.395+02:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 CPU.1.NEON=1 CPU.1.ARM_FMA=1 CPU.1.FP16_VA=1 CPU.1.DOTPROD=1 CPU.1.LLAMAFILE=1 compiler=cgo(clang)
time=2026-05-08T18:41:16.396+02:00 level=INFO source=runner.go:1001 msg="Server listening on 127.0.0.1:49363"
time=2026-05-08T18:41:16.396+02:00 level=INFO source=runner.go:895 msg=load request="{Operation:commit LoraPath:[] Parallel:2 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:29[ID:0 Layers:29(0..28)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:true}"
llama_model_load_from_file_impl: using device Metal (Apple M4 Pro) (unknown id) - 57342 MiB free
time=2026-05-08T18:41:16.397+02:00 level=INFO source=server.go:1385 msg="waiting for llama runner to start responding"
time=2026-05-08T18:41:16.397+02:00 level=INFO source=server.go:1428 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /Users/flavio/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.87 GiB (5.01 BPW) 
load: printing all EOG tokens:
load:   - 128001 ('<|end_of_text|>')
load:   - 128008 ('<|eom_id|>')
load:   - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: no_alloc         = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 3072
print_info: n_embd_inp       = 3072
print_info: n_layer          = 28
print_info: n_head           = 24
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 3
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 8192
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned   = unknown
print_info: model type       = 3B
print_info: model params     = 3.21 B
print_info: general.name     = Llama 3.2 3B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128001 '<|end_of_text|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   308.23 MiB
load_tensors: Metal_Mapped model buffer size =  1918.35 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 2
llama_context: n_ctx         = 262144
llama_context: n_ctx_seq     = 131072
llama_context: n_batch       = 1024
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
llama_context:        CPU  output buffer size =     1.00 MiB
llama_kv_cache:      Metal KV buffer size = 28672.00 MiB
llama_kv_cache: size = 28672.00 MiB (131072 cells,  28 layers,  2/2 seqs), K (f16): 14336.00 MiB, V (f16): 14336.00 MiB
llama_context:      Metal compute buffer size =   396.01 MiB
llama_context:        CPU compute buffer size =   262.01 MiB
llama_context: graph nodes  = 931
llama_context: graph splits = 2
time=2026-05-08T18:41:18.657+02:00 level=INFO source=server.go:1432 msg="llama runner started in 10.31 seconds"
time=2026-05-08T18:41:18.657+02:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
time=2026-05-08T18:41:18.657+02:00 level=INFO source=server.go:1385 msg="waiting for llama runner to start responding"
time=2026-05-08T18:41:18.657+02:00 level=INFO source=server.go:1432 msg="llama runner started in 10.31 seconds"
[GIN] 2026/05/08 - 18:41:19 | 200 | 19.028101083s |  129.152.22.122 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:41:23 | 200 |      20.083µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:41:23 | 200 |      36.708µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2026/05/08 - 18:42:04 | 200 |      19.625µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:42:04 | 200 |  383.160792ms |  129.152.22.122 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:42:37 | 200 |      18.125µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:42:37 | 200 |   278.98225ms |  129.152.22.122 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:43:09 | 200 |       18.75µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:43:12 | 200 |        19.5µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:43:22 | 200 |      18.583µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:43:26 | 200 |      18.125µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:43:36 | 200 |      20.333µs |  129.152.22.122 | GET      "/api/version"
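
The 28672 MiB KV cache reported by `llama_kv_cache` in the log can be reproduced from values the loader prints. The sketch below is a rough check, not Ollama's own accounting. It assumes the cache is sized as cells per sequence × sequences × layers × K (or V) width × 2 bytes for f16. The function name `kv_cache_mib` is hypothetical. All inputs are copied from the log.

```python
# Reproduce the KV cache size reported by llama_kv_cache in the log.
# Inputs come from the log; the sizing formula is an assumption
# (cells x sequences x layers x per-layer K/V width x bytes per element).

BYTES_PER_F16 = 2  # the log reports "K (f16)" and "V (f16)"


def kv_cache_mib(cells_per_seq, n_seqs, n_layers, n_embd_k_gqa, n_embd_v_gqa):
    """Return (K, V, total) sizes in MiB for a non-unified f16 KV cache."""
    cells = cells_per_seq * n_seqs
    k_bytes = cells * n_layers * n_embd_k_gqa * BYTES_PER_F16
    v_bytes = cells * n_layers * n_embd_v_gqa * BYTES_PER_F16
    mib = 1024 * 1024
    return k_bytes / mib, v_bytes / mib, (k_bytes + v_bytes) / mib


k, v, total = kv_cache_mib(
    cells_per_seq=131072,  # "131072 cells" / n_ctx_seq
    n_seqs=2,              # n_seq_max = 2 (Parallel:2 in the load request)
    n_layers=28,           # n_layer
    n_embd_k_gqa=1024,     # n_embd_k_gqa
    n_embd_v_gqa=1024,     # n_embd_v_gqa
)
print(f"K = {k:.2f} MiB, V = {v:.2f} MiB, total = {total:.2f} MiB")
# prints "K = 14336.00 MiB, V = 14336.00 MiB, total = 28672.00 MiB"
```

Under these assumptions the total matches both the `llama_kv_cache` line and the `kv cache` estimate of 28.0 GiB logged by `device.go`. By the same arithmetic, halving either the per-sequence context or the number of sequences would halve the figure.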

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.23.2
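
The log warns that the request asked for `num_ctx=262144` while the model reports `n_ctx_train=131072`. As a minimal sketch, a client could cap `num_ctx` before calling `/api/generate` and set a socket timeout so that a stalled request fails instead of blocking indefinitely. This is not a confirmed fix for the hang described in this issue. The helper names, the model tag `llama3.2:3b`, the 120-second timeout, and the server address (Ollama's default port on localhost) are assumptions for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"  # assumed default address
N_CTX_TRAIN = 131072  # n_ctx_train reported in the log


def build_generate_payload(model, prompt, requested_ctx, n_ctx_train=N_CTX_TRAIN):
    """Build a non-streaming /api/generate body with num_ctx capped."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": min(requested_ctx, n_ctx_train)},
    }


def generate(payload, url=OLLAMA_URL, timeout_s=120):
    """POST the payload; urlopen raises if the socket blocks past timeout_s."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical model tag; the payload is only built here, not sent.
payload = build_generate_payload("llama3.2:3b", "Say hello.", 262144)
print(payload["options"])
```

Calling `generate(payload)` requires a reachable Ollama server. Comparing its behaviour across the models that hang and those that do not, with the same capped `num_ctx`, may help narrow down whether context size is involved.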
