ollama - How to fix: generate completion API hangs with certain models but not with others

ollama 2026-05-08 17:12:12

Error Message

time=2026-05-08T18:24:02.280+02:00 level=WARN source=sched.go:423 msg="model architecture does not currently support parallel requests" architecture=qwen35
time=2026-05-08T18:32:43.540+02:00 level=WARN source=sched.go:423 msg="model architecture does not currently support parallel requests" architecture=qwen35
time=2026-05-08T18:38:25.054+02:00 level=WARN source=sched.go:423 msg="model architecture does not currently support parallel requests" architecture=qwen35
time=2026-05-08T18:41:08.341+02:00 level=WARN source=server.go:169 msg="requested context size too large for model" num_ctx=262144 n_ctx_train=131072
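
The context-size warning above (num_ctx=262144, n_ctx_train=131072) can be checked against the memory figures in the log under Code Example. For the Llama 3.2 3B load, that log reports n_layer = 28, n_embd_k_gqa = n_embd_v_gqa = 1024, a load request with KvSize:262144, and a "kv cache" size of 28.0 GiB. The following is a minimal sketch of that arithmetic, not Ollama's own accounting: the helper name `kv_cache_bytes` is hypothetical, and 2 bytes per cached element (f16) is an assumption, since the log leaves KvCacheType empty.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layer, n_embd_k_gqa, n_embd_v_gqa, kv_size, bytes_per_elem=2):
    """Estimate KV cache size in bytes.

    Each layer stores one K vector and one V vector per cached token.
    bytes_per_elem=2 assumes an f16 cache (an assumption, not taken from the log).
    """
    per_token_per_layer = (n_embd_k_gqa + n_embd_v_gqa) * bytes_per_elem
    return n_layer * kv_size * per_token_per_layer


# Values copied from the Llama 3.2 3B section of the log.
full = kv_cache_bytes(n_layer=28, n_embd_k_gqa=1024, n_embd_v_gqa=1024, kv_size=262144)
# The same model with a KV size equal to n_ctx_train, for comparison.
half = kv_cache_bytes(n_layer=28, n_embd_k_gqa=1024, n_embd_v_gqa=1024, kv_size=131072)

print(f"KvSize=262144: {full / GIB:.1f} GiB")  # prints "KvSize=262144: 28.0 GiB"
print(f"KvSize=131072: {half / GIB:.1f} GiB")  # prints "KvSize=131072: 14.0 GiB"
```

Under these assumptions the estimate matches the logged 28.0 GiB, so the KV cache accounts for most of the 42.4 GiB total reported for that load. Whether this allocation is related to the hang is not established by the log itself.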

Code Example
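
The server log that follows contains [GIN] access lines whose third field is the request duration. Local requests from 127.0.0.1 to /api/generate complete in milliseconds, while requests from 129.152.22.122 take 8m40s, 1m26s and 1m33s. As a rough aid for spotting such entries in a longer log, here is a sketch that filters access lines by duration. The function names and the one-second threshold are illustrative choices, and the parser only assumes the Go-style duration strings visible in this log.

```python
import re

# Go-style duration units as they appear in the access lines, in seconds.
# Both the micro sign (U+00B5) and Greek mu (U+03BC) are accepted.
UNIT_SECONDS = {
    "h": 3600.0, "m": 60.0, "s": 1.0, "ms": 1e-3,
    "\u00b5s": 1e-6, "\u03bcs": 1e-6, "us": 1e-6, "ns": 1e-9,
}
TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ns|[\u00b5\u03bc]s|us|ms|h|m|s)")


def parse_duration(text):
    """Convert a duration such as '8m40s' or '6.700084ms' to seconds."""
    parts = TOKEN.findall(text)
    # Reject input that is not made up entirely of number+unit tokens.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unrecognised duration: {text!r}")
    return sum(float(n) * UNIT_SECONDS[u] for n, u in parts)


def slow_requests(lines, threshold_s=1.0):
    """Yield (timestamp, client, method, path, seconds) for slow [GIN] lines."""
    for line in lines:
        if not line.startswith("[GIN]"):
            continue
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 5:
            continue
        stamp = fields[0][len("[GIN]"):].strip()
        method, _, path = fields[4].partition(" ")
        seconds = parse_duration(fields[2])
        if seconds >= threshold_s:
            yield stamp, fields[3], method, path.strip().strip('"'), seconds


# Three access lines copied from the log.
sample = [
    '[GIN] 2026/05/08 - 18:31:26 | 200 |    6.635458ms |       127.0.0.1 | POST     "/api/generate"',
    '[GIN] 2026/05/08 - 18:32:42 | 200 |         8m40s |  129.152.22.122 | POST     "/api/generate"',
    '[GIN] 2026/05/08 - 18:34:09 | 200 |         1m26s |  129.152.22.122 | POST     "/api/generate"',
]

for stamp, client, method, path, seconds in slow_requests(sample):
    print(f"{stamp}  {client}  {method} {path}  {seconds:.0f}s")
```

On the three sample lines this reports the two remote requests (520 s and 86 s) and skips the local one.
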

time=2026-05-08T18:24:02.280+02:00 level=WARN source=sched.go:423 msg="model architecture does not currently support parallel requests" architecture=qwen35
time=2026-05-08T18:24:02.320+02:00 level=INFO source=server.go:259 msg="enabling flash attention"
time=2026-05-08T18:24:02.320+02:00 level=INFO source=server.go:433 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/flavio/.ollama/models/blobs/sha256-b709d81508a078a686961de6ca07a953b895d9b286c46e17f00fb267f4f2d297 --port 49335"
time=2026-05-08T18:24:02.322+02:00 level=INFO source=sched.go:484 msg="system memory" total="64.0 GiB" free="47.4 GiB" free_swap="0 B"
time=2026-05-08T18:24:02.322+02:00 level=INFO source=sched.go:491 msg="gpu memory" id=0 library=Metal available="51.3 GiB" free="51.8 GiB" minimum="512.0 MiB" overhead="0 B"
time=2026-05-08T18:24:02.322+02:00 level=INFO source=server.go:792 msg="loading model" "model layers"=25 requested=-1
time=2026-05-08T18:24:02.379+02:00 level=INFO source=runner.go:1450 msg="starting ollama engine"
time=2026-05-08T18:24:02.379+02:00 level=INFO source=runner.go:1485 msg="Server listening on 127.0.0.1:49335"
time=2026-05-08T18:24:02.382+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:24:02.407+02:00 level=INFO source=ggml.go:144 msg="" architecture=qwen35 file_type=Q8_0 name="" description="" num_tensors=728 num_key_values=52
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.019 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   Apple M4 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 60129.54 MB
load_backend: loaded CPU backend from /Applications/Ollama.app/Contents/Resources/libggml-cpu.so
time=2026-05-08T18:24:02.439+02:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 CPU.1.NEON=1 CPU.1.ARM_FMA=1 CPU.1.FP16_VA=1 CPU.1.DOTPROD=1 CPU.1.LLAMAFILE=1 compiler=cgo(clang)
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
time=2026-05-08T18:24:03.227+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=ggml.go:490 msg="offloading 24 repeating layers to GPU"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=ggml.go:497 msg="offloading output layer to GPU"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=ggml.go:502 msg="offloaded 25/25 layers to GPU"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=device.go:240 msg="model weights" device=Metal size="2.5 GiB"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="523.8 MiB"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=device.go:251 msg="kv cache" device=Metal size="3.5 GiB"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=device.go:262 msg="compute graph" device=Metal size="3.2 GiB"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="26.7 MiB"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=device.go:272 msg="total memory" size="9.7 GiB"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
time=2026-05-08T18:24:04.482+02:00 level=INFO source=server.go:1385 msg="waiting for llama runner to start responding"
time=2026-05-08T18:24:04.483+02:00 level=INFO source=server.go:1428 msg="waiting for server to become available" status="llm server loading model"
time=2026-05-08T18:24:05.237+02:00 level=INFO source=server.go:1432 msg="llama runner started in 2.91 seconds"
[GIN] 2026/05/08 - 18:25:21 | 200 |     804.167µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:25:21 | 200 |          46µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2026/05/08 - 18:31:26 | 200 |      16.125µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:31:26 | 200 |    6.700084ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2026/05/08 - 18:31:26 | 200 |    6.635458ms |       127.0.0.1 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:32:06 | 200 |      13.042µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:32:06 | 200 |      18.125µs |       127.0.0.1 | GET      "/api/ps"
time=2026-05-08T18:32:08.392+02:00 level=INFO source=model_recommendations.go:177 msg="model recommendations cache sleep scheduled" wait=3h32m45.889623412s consecutive_failures=0
[GIN] 2026/05/08 - 18:32:42 | 200 |         8m40s |  129.152.22.122 | POST     "/api/generate"
time=2026-05-08T18:32:43.540+02:00 level=WARN source=sched.go:423 msg="model architecture does not currently support parallel requests" architecture=qwen35
time=2026-05-08T18:32:43.579+02:00 level=INFO source=server.go:259 msg="enabling flash attention"
time=2026-05-08T18:32:43.579+02:00 level=INFO source=server.go:433 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/flavio/.ollama/models/blobs/sha256-b709d81508a078a686961de6ca07a953b895d9b286c46e17f00fb267f4f2d297 --port 49346"
time=2026-05-08T18:32:43.581+02:00 level=INFO source=sched.go:484 msg="system memory" total="64.0 GiB" free="49.2 GiB" free_swap="0 B"
time=2026-05-08T18:32:43.581+02:00 level=INFO source=sched.go:491 msg="gpu memory" id=0 library=Metal available="51.3 GiB" free="51.8 GiB" minimum="512.0 MiB" overhead="0 B"
time=2026-05-08T18:32:43.581+02:00 level=INFO source=server.go:792 msg="loading model" "model layers"=25 requested=-1
time=2026-05-08T18:32:43.612+02:00 level=INFO source=runner.go:1450 msg="starting ollama engine"
time=2026-05-08T18:32:43.612+02:00 level=INFO source=runner.go:1485 msg="Server listening on 127.0.0.1:49346"
time=2026-05-08T18:32:43.613+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:32:43.637+02:00 level=INFO source=ggml.go:144 msg="" architecture=qwen35 file_type=Q8_0 name="" description="" num_tensors=728 num_key_values=52
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.006 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   Apple M4 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 60129.54 MB
load_backend: loaded CPU backend from /Applications/Ollama.app/Contents/Resources/libggml-cpu.so
time=2026-05-08T18:32:43.652+02:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 CPU.1.NEON=1 CPU.1.ARM_FMA=1 CPU.1.FP16_VA=1 CPU.1.DOTPROD=1 CPU.1.LLAMAFILE=1 compiler=cgo(clang)
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
time=2026-05-08T18:32:44.441+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=ggml.go:490 msg="offloading 24 repeating layers to GPU"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=ggml.go:497 msg="offloading output layer to GPU"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=ggml.go:502 msg="offloaded 25/25 layers to GPU"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=device.go:240 msg="model weights" device=Metal size="2.5 GiB"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="523.8 MiB"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=device.go:251 msg="kv cache" device=Metal size="3.5 GiB"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=device.go:262 msg="compute graph" device=Metal size="3.2 GiB"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="26.7 MiB"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=device.go:272 msg="total memory" size="9.7 GiB"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
time=2026-05-08T18:32:45.669+02:00 level=INFO source=server.go:1385 msg="waiting for llama runner to start responding"
time=2026-05-08T18:32:45.670+02:00 level=INFO source=server.go:1428 msg="waiting for server to become available" status="llm server loading model"
time=2026-05-08T18:32:45.921+02:00 level=INFO source=server.go:1432 msg="llama runner started in 2.34 seconds"
[GIN] 2026/05/08 - 18:34:09 | 200 |         1m26s |  129.152.22.122 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:34:56 | 200 |      21.541µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:34:56 | 200 |      42.292µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2026/05/08 - 18:35:09 | 200 |      22.375µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:35:09 | 200 |   16.207667ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2026/05/08 - 18:35:09 | 200 |   15.012042ms |       127.0.0.1 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:35:11 | 200 |      19.917µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:35:11 | 200 |       8.083µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2026/05/08 - 18:38:24 | 200 |      31.833µs |  129.152.22.122 | GET      "/api/version"
time=2026-05-08T18:38:25.054+02:00 level=WARN source=sched.go:423 msg="model architecture does not currently support parallel requests" architecture=qwen35
time=2026-05-08T18:38:25.093+02:00 level=INFO source=server.go:259 msg="enabling flash attention"
time=2026-05-08T18:38:25.093+02:00 level=INFO source=server.go:433 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/flavio/.ollama/models/blobs/sha256-b709d81508a078a686961de6ca07a953b895d9b286c46e17f00fb267f4f2d297 --port 49355"
time=2026-05-08T18:38:25.094+02:00 level=INFO source=sched.go:484 msg="system memory" total="64.0 GiB" free="47.6 GiB" free_swap="0 B"
time=2026-05-08T18:38:25.094+02:00 level=INFO source=sched.go:491 msg="gpu memory" id=0 library=Metal available="51.3 GiB" free="51.8 GiB" minimum="512.0 MiB" overhead="0 B"
time=2026-05-08T18:38:25.094+02:00 level=INFO source=server.go:792 msg="loading model" "model layers"=25 requested=-1
time=2026-05-08T18:38:25.126+02:00 level=INFO source=runner.go:1450 msg="starting ollama engine"
time=2026-05-08T18:38:25.126+02:00 level=INFO source=runner.go:1485 msg="Server listening on 127.0.0.1:49355"
time=2026-05-08T18:38:25.127+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:38:25.151+02:00 level=INFO source=ggml.go:144 msg="" architecture=qwen35 file_type=Q8_0 name="" description="" num_tensors=728 num_key_values=52
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.006 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   Apple M4 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 60129.54 MB
load_backend: loaded CPU backend from /Applications/Ollama.app/Contents/Resources/libggml-cpu.so
time=2026-05-08T18:38:25.165+02:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 CPU.1.NEON=1 CPU.1.ARM_FMA=1 CPU.1.FP16_VA=1 CPU.1.DOTPROD=1 CPU.1.LLAMAFILE=1 compiler=cgo(clang)
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
time=2026-05-08T18:38:25.924+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=ggml.go:490 msg="offloading 24 repeating layers to GPU"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=ggml.go:497 msg="offloading output layer to GPU"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=ggml.go:502 msg="offloaded 25/25 layers to GPU"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=device.go:240 msg="model weights" device=Metal size="2.5 GiB"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="523.8 MiB"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=device.go:251 msg="kv cache" device=Metal size="3.5 GiB"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=device.go:262 msg="compute graph" device=Metal size="3.2 GiB"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="26.7 MiB"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=device.go:272 msg="total memory" size="9.7 GiB"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
time=2026-05-08T18:38:27.165+02:00 level=INFO source=server.go:1385 msg="waiting for llama runner to start responding"
time=2026-05-08T18:38:27.166+02:00 level=INFO source=server.go:1428 msg="waiting for server to become available" status="llm server loading model"
time=2026-05-08T18:38:27.417+02:00 level=INFO source=server.go:1432 msg="llama runner started in 2.32 seconds"
[GIN] 2026/05/08 - 18:39:03 | 200 |      11.125µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:39:03 | 200 |      18.833µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2026/05/08 - 18:39:27 | 200 |      12.916µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:39:27 | 200 |    7.448292ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2026/05/08 - 18:39:27 | 200 |      6.8675ms |       127.0.0.1 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:39:31 | 200 |      18.667µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:39:31 | 200 |       32.75µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2026/05/08 - 18:39:58 | 200 |         1m33s |  129.152.22.122 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:40:18 | 200 |      19.917µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:40:59 | 200 |      55.417µs |  129.152.22.122 | GET      "/api/version"
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 7.989 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   Apple M4 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 60129.54 MB
llama_model_load_from_file_impl: using device Metal (Apple M4 Pro) (unknown id) - 57343 MiB free
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /Users/flavio/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.87 GiB (5.01 BPW) 
load: printing all EOG tokens:
load:   - 128001 ('<|end_of_text|>')
load:   - 128008 ('<|eom_id|>')
load:   - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 1
print_info: no_alloc         = 0
print_info: model type       = ?B
print_info: model params     = 3.21 B
print_info: general.name     = Llama 3.2 3B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128001 '<|end_of_text|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2026-05-08T18:41:08.341+02:00 level=WARN source=server.go:169 msg="requested context size too large for model" num_ctx=262144 n_ctx_train=131072
time=2026-05-08T18:41:08.341+02:00 level=INFO source=server.go:433 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --model /Users/flavio/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --port 49363"
time=2026-05-08T18:41:08.344+02:00 level=INFO source=sched.go:484 msg="system memory" total="64.0 GiB" free="47.6 GiB" free_swap="0 B"
time=2026-05-08T18:41:08.344+02:00 level=INFO source=sched.go:491 msg="gpu memory" id=0 library=Metal available="51.3 GiB" free="51.8 GiB" minimum="512.0 MiB" overhead="0 B"
time=2026-05-08T18:41:08.344+02:00 level=INFO source=server.go:532 msg="loading model" "model layers"=29 requested=-1
time=2026-05-08T18:41:08.344+02:00 level=INFO source=device.go:240 msg="model weights" device=Metal size="1.9 GiB"
time=2026-05-08T18:41:08.344+02:00 level=INFO source=device.go:251 msg="kv cache" device=Metal size="28.0 GiB"
time=2026-05-08T18:41:08.344+02:00 level=INFO source=device.go:262 msg="compute graph" device=Metal size="12.5 GiB"
time=2026-05-08T18:41:08.344+02:00 level=INFO source=device.go:272 msg="total memory" size="42.4 GiB"
time=2026-05-08T18:41:08.376+02:00 level=INFO source=runner.go:965 msg="starting go runner"
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 8.010 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   Apple M4 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 60129.54 MB
load_backend: loaded CPU backend from /Applications/Ollama.app/Contents/Resources/libggml-cpu.so
time=2026-05-08T18:41:16.395+02:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 CPU.1.NEON=1 CPU.1.ARM_FMA=1 CPU.1.FP16_VA=1 CPU.1.DOTPROD=1 CPU.1.LLAMAFILE=1 compiler=cgo(clang)
time=2026-05-08T18:41:16.396+02:00 level=INFO source=runner.go:1001 msg="Server listening on 127.0.0.1:49363"
time=2026-05-08T18:41:16.396+02:00 level=INFO source=runner.go:895 msg=load request="{Operation:commit LoraPath:[] Parallel:2 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:29[ID:0 Layers:29(0..28)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:true}"
llama_model_load_from_file_impl: using device Metal (Apple M4 Pro) (unknown id) - 57342 MiB free
time=2026-05-08T18:41:16.397+02:00 level=INFO source=server.go:1385 msg="waiting for llama runner to start responding"
time=2026-05-08T18:41:16.397+02:00 level=INFO source=server.go:1428 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /Users/flavio/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.87 GiB (5.01 BPW) 
load: printing all EOG tokens:
load:   - 128001 ('<|end_of_text|>')
load:   - 128008 ('<|eom_id|>')
load:   - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: no_alloc         = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 3072
print_info: n_embd_inp       = 3072
print_info: n_layer          = 28
print_info: n_head           = 24
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 3
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 8192
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned   = unknown
print_info: model type       = 3B
print_info: model params     = 3.21 B
print_info: general.name     = Llama 3.2 3B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128001 '<|end_of_text|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   308.23 MiB
load_tensors: Metal_Mapped model buffer size =  1918.35 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 2
llama_context: n_ctx         = 262144
llama_context: n_ctx_seq     = 131072
llama_context: n_batch       = 1024
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
llama_context:        CPU  output buffer size =     1.00 MiB
llama_kv_cache:      Metal KV buffer size = 28672.00 MiB
llama_kv_cache: size = 28672.00 MiB (131072 cells,  28 layers,  2/2 seqs), K (f16): 14336.00 MiB, V (f16): 14336.00 MiB
llama_context:      Metal compute buffer size =   396.01 MiB
llama_context:        CPU compute buffer size =   262.01 MiB
llama_context: graph nodes  = 931
llama_context: graph splits = 2
time=2026-05-08T18:41:18.657+02:00 level=INFO source=server.go:1432 msg="llama runner started in 10.31 seconds"
time=2026-05-08T18:41:18.657+02:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
time=2026-05-08T18:41:18.657+02:00 level=INFO source=server.go:1385 msg="waiting for llama runner to start responding"
time=2026-05-08T18:41:18.657+02:00 level=INFO source=server.go:1432 msg="llama runner started in 10.31 seconds"
[GIN] 2026/05/08 - 18:41:19 | 200 | 19.028101083s |  129.152.22.122 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:41:23 | 200 |      20.083µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:41:23 | 200 |      36.708µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2026/05/08 - 18:42:04 | 200 |      19.625µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:42:04 | 200 |  383.160792ms |  129.152.22.122 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:42:37 | 200 |      18.125µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:42:37 | 200 |   278.98225ms |  129.152.22.122 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:43:09 | 200 |       18.75µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:43:12 | 200 |        19.5µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:43:22 | 200 |      18.583µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:43:26 | 200 |      18.125µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:43:36 | 200 |      20.333µs |  129.152.22.122 | GET      "/api/version"
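
As a sanity check on the llama3.2:3b load above, the logged KV cache size (28672.00 MiB, split evenly between K and V) can be reproduced from the other values in the same log. The sketch below is only arithmetic on those logged numbers. The function name kv_cache_mib is hypothetical, and the assumption that the runtime sizes the buffer as cells x sequences x layers x width x bytes per element is inferred from the fact that the result matches the log, not taken from the source code.

```python
# Values copied from the llama3.2:3b load log above.
N_CTX_SEQ = 131072    # llama_context: n_ctx_seq (cells per sequence)
N_SEQ_MAX = 2         # llama_context: n_seq_max
N_LAYER = 28          # print_info: n_layer
N_EMBD_K_GQA = 1024   # print_info: n_embd_k_gqa
N_EMBD_V_GQA = 1024   # print_info: n_embd_v_gqa
BYTES_F16 = 2         # K and V are logged as f16


def kv_cache_mib(cells, seqs, layers, width, bytes_per_elem):
    """Size in MiB of one half (K or V) of the cache.

    Assumed formula: one `width`-wide vector per cell, per sequence, per layer.
    """
    return cells * seqs * layers * width * bytes_per_elem / (1024 * 1024)


k_mib = kv_cache_mib(N_CTX_SEQ, N_SEQ_MAX, N_LAYER, N_EMBD_K_GQA, BYTES_F16)
v_mib = kv_cache_mib(N_CTX_SEQ, N_SEQ_MAX, N_LAYER, N_EMBD_V_GQA, BYTES_F16)

print(f"K: {k_mib:.2f} MiB, V: {v_mib:.2f} MiB, total: {k_mib + v_mib:.2f} MiB")
```

With these inputs the result is 14336 MiB each for K and V, 28672 MiB in total, which agrees with the llama_kv_cache line in the log.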


What is the issue?

The generate completion API (POST /api/generate) "hangs" when a request is submitted to model qwen3.5:2b (Q8_0), but completes successfully in less than 2 s when submitted to model llama3.2:3b (Q4_K_M). Both were run on Ollama 0.23.2 and macOS Tahoe 26.4.1.

After the request is submitted to qwen3.5:2b, the Mac's GPU usage climbs to 100% and stays there indefinitely (I had to kill the process with kill -9 after 5 minutes).

This happens regardless of the prompt.
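
A minimal sketch of how the comparison could be reproduced with a client-side timeout, so the caller does not block indefinitely while the qwen3.5:2b request spins. It uses only the Python standard library against the /api/generate endpoint. The URL and port, the helper names build_generate_request and generate, the 60-second timeout, and the optional num_ctx override are all assumptions for illustration. None of them come from the original report, and lowering num_ctx is only a diagnostic idea, not a confirmed fix.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def build_generate_request(model, prompt, num_ctx=None):
    """Build the JSON payload for a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if num_ctx is not None:
        # Optional context-size override, useful only for narrowing down the hang.
        payload["options"] = {"num_ctx": num_ctx}
    return payload


def generate(model, prompt, timeout_s=60.0, num_ctx=None):
    """POST the request and return the decoded JSON response.

    Raises OSError (which covers urllib.error.URLError and socket timeouts)
    if the server is unreachable or sends nothing for `timeout_s` seconds.
    """
    body = json.dumps(build_generate_request(model, prompt, num_ctx)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.load(resp)
```

Calling generate("llama3.2:3b", "Say hello.") and then generate("qwen3.5:2b", "Say hello.") inside a try/except OSError block would show which model fails to answer within the timeout. Note that the timeout only releases the client. It is not expected to stop the runner on the server side, so the GPU may stay busy until the Ollama process is stopped.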

Relevant log output

time=2026-05-08T18:24:02.280+02:00 level=WARN source=sched.go:423 msg="model architecture does not currently support parallel requests" architecture=qwen35
time=2026-05-08T18:24:02.320+02:00 level=INFO source=server.go:259 msg="enabling flash attention"
time=2026-05-08T18:24:02.320+02:00 level=INFO source=server.go:433 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/flavio/.ollama/models/blobs/sha256-b709d81508a078a686961de6ca07a953b895d9b286c46e17f00fb267f4f2d297 --port 49335"
time=2026-05-08T18:24:02.322+02:00 level=INFO source=sched.go:484 msg="system memory" total="64.0 GiB" free="47.4 GiB" free_swap="0 B"
time=2026-05-08T18:24:02.322+02:00 level=INFO source=sched.go:491 msg="gpu memory" id=0 library=Metal available="51.3 GiB" free="51.8 GiB" minimum="512.0 MiB" overhead="0 B"
time=2026-05-08T18:24:02.322+02:00 level=INFO source=server.go:792 msg="loading model" "model layers"=25 requested=-1
time=2026-05-08T18:24:02.379+02:00 level=INFO source=runner.go:1450 msg="starting ollama engine"
time=2026-05-08T18:24:02.379+02:00 level=INFO source=runner.go:1485 msg="Server listening on 127.0.0.1:49335"
time=2026-05-08T18:24:02.382+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:24:02.407+02:00 level=INFO source=ggml.go:144 msg="" architecture=qwen35 file_type=Q8_0 name="" description="" num_tensors=728 num_key_values=52
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.019 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   Apple M4 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 60129.54 MB
load_backend: loaded CPU backend from /Applications/Ollama.app/Contents/Resources/libggml-cpu.so
time=2026-05-08T18:24:02.439+02:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 CPU.1.NEON=1 CPU.1.ARM_FMA=1 CPU.1.FP16_VA=1 CPU.1.DOTPROD=1 CPU.1.LLAMAFILE=1 compiler=cgo(clang)
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
time=2026-05-08T18:24:03.227+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=ggml.go:490 msg="offloading 24 repeating layers to GPU"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=ggml.go:497 msg="offloading output layer to GPU"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=ggml.go:502 msg="offloaded 25/25 layers to GPU"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=device.go:240 msg="model weights" device=Metal size="2.5 GiB"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="523.8 MiB"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=device.go:251 msg="kv cache" device=Metal size="3.5 GiB"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=device.go:262 msg="compute graph" device=Metal size="3.2 GiB"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="26.7 MiB"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=device.go:272 msg="total memory" size="9.7 GiB"
time=2026-05-08T18:24:04.482+02:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
time=2026-05-08T18:24:04.482+02:00 level=INFO source=server.go:1385 msg="waiting for llama runner to start responding"
time=2026-05-08T18:24:04.483+02:00 level=INFO source=server.go:1428 msg="waiting for server to become available" status="llm server loading model"
time=2026-05-08T18:24:05.237+02:00 level=INFO source=server.go:1432 msg="llama runner started in 2.91 seconds"
[GIN] 2026/05/08 - 18:25:21 | 200 |     804.167µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:25:21 | 200 |          46µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2026/05/08 - 18:31:26 | 200 |      16.125µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:31:26 | 200 |    6.700084ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2026/05/08 - 18:31:26 | 200 |    6.635458ms |       127.0.0.1 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:32:06 | 200 |      13.042µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:32:06 | 200 |      18.125µs |       127.0.0.1 | GET      "/api/ps"
time=2026-05-08T18:32:08.392+02:00 level=INFO source=model_recommendations.go:177 msg="model recommendations cache sleep scheduled" wait=3h32m45.889623412s consecutive_failures=0
[GIN] 2026/05/08 - 18:32:42 | 200 |         8m40s |  129.152.22.122 | POST     "/api/generate"
time=2026-05-08T18:32:43.540+02:00 level=WARN source=sched.go:423 msg="model architecture does not currently support parallel requests" architecture=qwen35
time=2026-05-08T18:32:43.579+02:00 level=INFO source=server.go:259 msg="enabling flash attention"
time=2026-05-08T18:32:43.579+02:00 level=INFO source=server.go:433 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/flavio/.ollama/models/blobs/sha256-b709d81508a078a686961de6ca07a953b895d9b286c46e17f00fb267f4f2d297 --port 49346"
time=2026-05-08T18:32:43.581+02:00 level=INFO source=sched.go:484 msg="system memory" total="64.0 GiB" free="49.2 GiB" free_swap="0 B"
time=2026-05-08T18:32:43.581+02:00 level=INFO source=sched.go:491 msg="gpu memory" id=0 library=Metal available="51.3 GiB" free="51.8 GiB" minimum="512.0 MiB" overhead="0 B"
time=2026-05-08T18:32:43.581+02:00 level=INFO source=server.go:792 msg="loading model" "model layers"=25 requested=-1
time=2026-05-08T18:32:43.612+02:00 level=INFO source=runner.go:1450 msg="starting ollama engine"
time=2026-05-08T18:32:43.612+02:00 level=INFO source=runner.go:1485 msg="Server listening on 127.0.0.1:49346"
time=2026-05-08T18:32:43.613+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:32:43.637+02:00 level=INFO source=ggml.go:144 msg="" architecture=qwen35 file_type=Q8_0 name="" description="" num_tensors=728 num_key_values=52
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.006 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   Apple M4 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 60129.54 MB
load_backend: loaded CPU backend from /Applications/Ollama.app/Contents/Resources/libggml-cpu.so
time=2026-05-08T18:32:43.652+02:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 CPU.1.NEON=1 CPU.1.ARM_FMA=1 CPU.1.FP16_VA=1 CPU.1.DOTPROD=1 CPU.1.LLAMAFILE=1 compiler=cgo(clang)
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
time=2026-05-08T18:32:44.441+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=ggml.go:490 msg="offloading 24 repeating layers to GPU"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=ggml.go:497 msg="offloading output layer to GPU"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=ggml.go:502 msg="offloaded 25/25 layers to GPU"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=device.go:240 msg="model weights" device=Metal size="2.5 GiB"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="523.8 MiB"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=device.go:251 msg="kv cache" device=Metal size="3.5 GiB"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=device.go:262 msg="compute graph" device=Metal size="3.2 GiB"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="26.7 MiB"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=device.go:272 msg="total memory" size="9.7 GiB"
time=2026-05-08T18:32:45.669+02:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
time=2026-05-08T18:32:45.669+02:00 level=INFO source=server.go:1385 msg="waiting for llama runner to start responding"
time=2026-05-08T18:32:45.670+02:00 level=INFO source=server.go:1428 msg="waiting for server to become available" status="llm server loading model"
time=2026-05-08T18:32:45.921+02:00 level=INFO source=server.go:1432 msg="llama runner started in 2.34 seconds"
[GIN] 2026/05/08 - 18:34:09 | 200 |         1m26s |  129.152.22.122 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:34:56 | 200 |      21.541µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:34:56 | 200 |      42.292µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2026/05/08 - 18:35:09 | 200 |      22.375µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:35:09 | 200 |   16.207667ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2026/05/08 - 18:35:09 | 200 |   15.012042ms |       127.0.0.1 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:35:11 | 200 |      19.917µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:35:11 | 200 |       8.083µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2026/05/08 - 18:38:24 | 200 |      31.833µs |  129.152.22.122 | GET      "/api/version"
time=2026-05-08T18:38:25.054+02:00 level=WARN source=sched.go:423 msg="model architecture does not currently support parallel requests" architecture=qwen35
time=2026-05-08T18:38:25.093+02:00 level=INFO source=server.go:259 msg="enabling flash attention"
time=2026-05-08T18:38:25.093+02:00 level=INFO source=server.go:433 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/flavio/.ollama/models/blobs/sha256-b709d81508a078a686961de6ca07a953b895d9b286c46e17f00fb267f4f2d297 --port 49355"
time=2026-05-08T18:38:25.094+02:00 level=INFO source=sched.go:484 msg="system memory" total="64.0 GiB" free="47.6 GiB" free_swap="0 B"
time=2026-05-08T18:38:25.094+02:00 level=INFO source=sched.go:491 msg="gpu memory" id=0 library=Metal available="51.3 GiB" free="51.8 GiB" minimum="512.0 MiB" overhead="0 B"
time=2026-05-08T18:38:25.094+02:00 level=INFO source=server.go:792 msg="loading model" "model layers"=25 requested=-1
time=2026-05-08T18:38:25.126+02:00 level=INFO source=runner.go:1450 msg="starting ollama engine"
time=2026-05-08T18:38:25.126+02:00 level=INFO source=runner.go:1485 msg="Server listening on 127.0.0.1:49355"
time=2026-05-08T18:38:25.127+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:38:25.151+02:00 level=INFO source=ggml.go:144 msg="" architecture=qwen35 file_type=Q8_0 name="" description="" num_tensors=728 num_key_values=52
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.006 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   Apple M4 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 60129.54 MB
load_backend: loaded CPU backend from /Applications/Ollama.app/Contents/Resources/libggml-cpu.so
time=2026-05-08T18:38:25.165+02:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 CPU.1.NEON=1 CPU.1.ARM_FMA=1 CPU.1.FP16_VA=1 CPU.1.DOTPROD=1 CPU.1.LLAMAFILE=1 compiler=cgo(clang)
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
time=2026-05-08T18:38:25.924+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=runner.go:1297 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=ggml.go:490 msg="offloading 24 repeating layers to GPU"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=ggml.go:497 msg="offloading output layer to GPU"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=ggml.go:502 msg="offloaded 25/25 layers to GPU"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=device.go:240 msg="model weights" device=Metal size="2.5 GiB"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="523.8 MiB"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=device.go:251 msg="kv cache" device=Metal size="3.5 GiB"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=device.go:262 msg="compute graph" device=Metal size="3.2 GiB"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="26.7 MiB"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=device.go:272 msg="total memory" size="9.7 GiB"
time=2026-05-08T18:38:27.165+02:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
time=2026-05-08T18:38:27.165+02:00 level=INFO source=server.go:1385 msg="waiting for llama runner to start responding"
time=2026-05-08T18:38:27.166+02:00 level=INFO source=server.go:1428 msg="waiting for server to become available" status="llm server loading model"
time=2026-05-08T18:38:27.417+02:00 level=INFO source=server.go:1432 msg="llama runner started in 2.32 seconds"
[GIN] 2026/05/08 - 18:39:03 | 200 |      11.125µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:39:03 | 200 |      18.833µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2026/05/08 - 18:39:27 | 200 |      12.916µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:39:27 | 200 |    7.448292ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2026/05/08 - 18:39:27 | 200 |      6.8675ms |       127.0.0.1 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:39:31 | 200 |      18.667µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:39:31 | 200 |       32.75µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2026/05/08 - 18:39:58 | 200 |         1m33s |  129.152.22.122 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:40:18 | 200 |      19.917µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:40:59 | 200 |      55.417µs |  129.152.22.122 | GET      "/api/version"
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 7.989 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   Apple M4 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 60129.54 MB
llama_model_load_from_file_impl: using device Metal (Apple M4 Pro) (unknown id) - 57343 MiB free
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /Users/flavio/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.87 GiB (5.01 BPW) 
load: printing all EOG tokens:
load:   - 128001 ('<|end_of_text|>')
load:   - 128008 ('<|eom_id|>')
load:   - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 1
print_info: no_alloc         = 0
print_info: model type       = ?B
print_info: model params     = 3.21 B
print_info: general.name     = Llama 3.2 3B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128001 '<|end_of_text|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2026-05-08T18:41:08.341+02:00 level=WARN source=server.go:169 msg="requested context size too large for model" num_ctx=262144 n_ctx_train=131072
time=2026-05-08T18:41:08.341+02:00 level=INFO source=server.go:433 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --model /Users/flavio/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --port 49363"
time=2026-05-08T18:41:08.344+02:00 level=INFO source=sched.go:484 msg="system memory" total="64.0 GiB" free="47.6 GiB" free_swap="0 B"
time=2026-05-08T18:41:08.344+02:00 level=INFO source=sched.go:491 msg="gpu memory" id=0 library=Metal available="51.3 GiB" free="51.8 GiB" minimum="512.0 MiB" overhead="0 B"
time=2026-05-08T18:41:08.344+02:00 level=INFO source=server.go:532 msg="loading model" "model layers"=29 requested=-1
time=2026-05-08T18:41:08.344+02:00 level=INFO source=device.go:240 msg="model weights" device=Metal size="1.9 GiB"
time=2026-05-08T18:41:08.344+02:00 level=INFO source=device.go:251 msg="kv cache" device=Metal size="28.0 GiB"
time=2026-05-08T18:41:08.344+02:00 level=INFO source=device.go:262 msg="compute graph" device=Metal size="12.5 GiB"
time=2026-05-08T18:41:08.344+02:00 level=INFO source=device.go:272 msg="total memory" size="42.4 GiB"
time=2026-05-08T18:41:08.376+02:00 level=INFO source=runner.go:965 msg="starting go runner"
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 8.010 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   Apple M4 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 60129.54 MB
load_backend: loaded CPU backend from /Applications/Ollama.app/Contents/Resources/libggml-cpu.so
time=2026-05-08T18:41:16.395+02:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 CPU.1.NEON=1 CPU.1.ARM_FMA=1 CPU.1.FP16_VA=1 CPU.1.DOTPROD=1 CPU.1.LLAMAFILE=1 compiler=cgo(clang)
time=2026-05-08T18:41:16.396+02:00 level=INFO source=runner.go:1001 msg="Server listening on 127.0.0.1:49363"
time=2026-05-08T18:41:16.396+02:00 level=INFO source=runner.go:895 msg=load request="{Operation:commit LoraPath:[] Parallel:2 BatchSize:512 FlashAttention:Enabled KvSize:262144 KvCacheType: NumThreads:10 GPULayers:29[ID:0 Layers:29(0..28)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:true}"
llama_model_load_from_file_impl: using device Metal (Apple M4 Pro) (unknown id) - 57342 MiB free
time=2026-05-08T18:41:16.397+02:00 level=INFO source=server.go:1385 msg="waiting for llama runner to start responding"
time=2026-05-08T18:41:16.397+02:00 level=INFO source=server.go:1428 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /Users/flavio/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.87 GiB (5.01 BPW) 
load: printing all EOG tokens:
load:   - 128001 ('<|end_of_text|>')
load:   - 128008 ('<|eom_id|>')
load:   - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: no_alloc         = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 3072
print_info: n_embd_inp       = 3072
print_info: n_layer          = 28
print_info: n_head           = 24
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 3
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 8192
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned   = unknown
print_info: model type       = 3B
print_info: model params     = 3.21 B
print_info: general.name     = Llama 3.2 3B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128001 '<|end_of_text|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   308.23 MiB
load_tensors: Metal_Mapped model buffer size =  1918.35 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 2
llama_context: n_ctx         = 262144
llama_context: n_ctx_seq     = 131072
llama_context: n_batch       = 1024
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
llama_context:        CPU  output buffer size =     1.00 MiB
llama_kv_cache:      Metal KV buffer size = 28672.00 MiB
llama_kv_cache: size = 28672.00 MiB (131072 cells,  28 layers,  2/2 seqs), K (f16): 14336.00 MiB, V (f16): 14336.00 MiB
llama_context:      Metal compute buffer size =   396.01 MiB
llama_context:        CPU compute buffer size =   262.01 MiB
llama_context: graph nodes  = 931
llama_context: graph splits = 2
time=2026-05-08T18:41:18.657+02:00 level=INFO source=server.go:1432 msg="llama runner started in 10.31 seconds"
time=2026-05-08T18:41:18.657+02:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
time=2026-05-08T18:41:18.657+02:00 level=INFO source=server.go:1385 msg="waiting for llama runner to start responding"
time=2026-05-08T18:41:18.657+02:00 level=INFO source=server.go:1432 msg="llama runner started in 10.31 seconds"
[GIN] 2026/05/08 - 18:41:19 | 200 | 19.028101083s |  129.152.22.122 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:41:23 | 200 |      20.083µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/05/08 - 18:41:23 | 200 |      36.708µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2026/05/08 - 18:42:04 | 200 |      19.625µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:42:04 | 200 |  383.160792ms |  129.152.22.122 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:42:37 | 200 |      18.125µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:42:37 | 200 |   278.98225ms |  129.152.22.122 | POST     "/api/generate"
[GIN] 2026/05/08 - 18:43:09 | 200 |       18.75µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:43:12 | 200 |        19.5µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:43:22 | 200 |      18.583µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:43:26 | 200 |      18.125µs |  129.152.22.122 | GET      "/api/version"
[GIN] 2026/05/08 - 18:43:36 | 200 |      20.333µs |  129.152.22.122 | GET      "/api/version"
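
The KV cache size reported in the log above can be reproduced from the model metadata it prints. The following is a minimal sketch, assuming f16 entries occupy 2 bytes each and that the cache holds 131072 cells for each of the 2 sequences, as the `llama_kv_cache` line states. The helper name `kv_cache_mib` is hypothetical and not part of Ollama or llama.cpp.

```python
MIB = 1024 ** 2

def kv_cache_mib(n_cells, n_seqs, n_layer, n_embd_k_gqa, n_embd_v_gqa, bytes_per_elem=2):
    """Estimate K and V cache sizes in MiB (hypothetical helper, not an Ollama API)."""
    total_cells = n_cells * n_seqs
    # One K row and one V row per cell per layer, each n_embd_*_gqa elements wide.
    k = total_cells * n_layer * n_embd_k_gqa * bytes_per_elem / MIB
    v = total_cells * n_layer * n_embd_v_gqa * bytes_per_elem / MIB
    return k, v

# Values taken from the log: 131072 cells, 2 seqs, n_layer=28, n_embd_k_gqa=n_embd_v_gqa=1024.
k_mib, v_mib = kv_cache_mib(131072, 2, 28, 1024, 1024)
print(k_mib, v_mib, k_mib + v_mib)  # 14336.0 14336.0 28672.0
```

The result matches the logged `K (f16): 14336.00 MiB, V (f16): 14336.00 MiB` and the 28.0 GiB `kv cache` estimate. By the same formula, a single sequence with 8192 cells would need 896 MiB in total. Whether the large allocation is related to the hang is not established by the log.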

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.23.2
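
When a `/api/generate` call appears to hang, one way to narrow the problem down is to send the request with a client-side timeout and a `num_ctx` no larger than the model's training context, so that a stall surfaces as an error instead of blocking indefinitely. The following is a minimal diagnostic sketch using only the Python standard library, not a confirmed fix. The model name `llama3.2:3b`, the base URL, the timeout value, and both helper names are assumptions for illustration.

```python
import json
import urllib.request

def build_generate_payload(model, prompt, num_ctx, n_ctx_train):
    """Build a non-streaming /api/generate body with num_ctx capped at n_ctx_train."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": min(num_ctx, n_ctx_train)},
    }

def generate(base_url, payload, timeout_s=120):
    """POST the payload; a stalled connection raises instead of blocking forever."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))

# num_ctx=262144 and n_ctx_train=131072 are the values from the WARN line in the log.
payload = build_generate_payload("llama3.2:3b", "Say hello.", 262144, 131072)
print(payload["options"])  # {'num_ctx': 131072}
```

Calling `generate("http://localhost:11434", payload)` against a running server would send the request. The URL here assumes a local server on Ollama's default port, whereas the log shows requests arriving from a remote client.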
