ollama - How to fix [Windows] CUDA error: out of memory (cuMemAddressReserve) on 8x GPU setup [1 comment, 1 participant]

ollama 2026-03-24 03:36:28

GitHub stats

ollama/ollama#15032 • Fetched 2026-04-08 01:21:47

Author: shankangke

Participants: shankangke

Timeline (top): closed ×1, commented ×1, labeled ×1

Error Message

PS C:\Users\its> ollama serve time=2026-03-24T11:31:02.892+08:00 level=INFO source=routes.go:1727 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0, 1, 2, 3, 4, 5, 6, 7 GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:0 OLLAMA_DEBUG:INFO OLLAMA_EDITOR: OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\Users\its\.ollama\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NO_CLOUD:false OLLAMA_NUM_PARALLEL:8 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:true OLLAMA_VULKAN:false ROCR_VISIBLE_DEVICES:]" time=2026-03-24T11:31:02.922+08:00 level=INFO source=routes.go:1729 msg="Ollama cloud disabled: false" time=2026-03-24T11:31:02.934+08:00 level=INFO source=images.go:477 msg="total blobs: 25" time=2026-03-24T11:31:02.939+08:00 level=INFO source=images.go:484 msg="total unused blobs removed: 0" time=2026-03-24T11:31:02.942+08:00 level=INFO source=routes.go:1782 msg="Listening on [::]:11434 (version 0.18.2)" time=2026-03-24T11:31:02.944+08:00 level=INFO source=runner.go:67 msg="discovering available GPUs..." 
time=2026-03-24T11:31:02.981+08:00 level=WARN source=runner.go:485 msg="user overrode visible devices" CUDA_VISIBLE_DEVICES="0, 1, 2, 3, 4, 5, 6, 7" time=2026-03-24T11:31:02.981+08:00 level=WARN source=runner.go:489 msg="if GPUs are not correctly discovered, unset and try again" time=2026-03-24T11:31:03.001+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\Users\its\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --port 58225" time=2026-03-24T11:31:05.977+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\Users\its\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --port 58369" time=2026-03-24T11:31:08.823+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\Users\its\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --port 58526" time=2026-03-24T11:31:11.200+08:00 level=INFO source=runner.go:106 msg="experimental Vulkan support disabled. To enable, set OLLAMA_VULKAN=1" time=2026-03-24T11:31:11.205+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\Users\its\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --port 58761" time=2026-03-24T11:31:11.207+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\Users\its\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --port 58763" time=2026-03-24T11:31:11.207+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\Users\its\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --port 58762" time=2026-03-24T11:31:11.207+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\Users\its\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --port 58764" time=2026-03-24T11:31:11.209+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\Users\its\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --port 58767" time=2026-03-24T11:31:11.208+08:00 level=INFO source=server.go:430 msg="starting runner" 
cmd="C:\Users\its\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --port 58765" time=2026-03-24T11:31:11.209+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\Users\its\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --port 58766" time=2026-03-24T11:31:11.209+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\Users\its\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --port 58769" time=2026-03-24T11:31:11.209+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\Users\its\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --port 58768" time=2026-03-24T11:31:11.209+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\Users\its\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --port 58771" time=2026-03-24T11:31:11.210+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\Users\its\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --port 58770" time=2026-03-24T11:31:11.210+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\Users\its\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --port 58772" time=2026-03-24T11:31:11.211+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\Users\its\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --port 58773" time=2026-03-24T11:31:11.211+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\Users\its\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --port 58774" time=2026-03-24T11:31:11.211+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\Users\its\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --port 58775" time=2026-03-24T11:31:11.212+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\Users\its\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --port 58776" time=2026-03-24T11:31:14.490+08:00 level=INFO 
source=types.go:42 msg="inference compute" id=GPU-73d44d75-0d12-df63-c91d-ab76ac0c8b36 filter_id="" library=CUDA compute=7.5 name=CUDA0 description="Quadro RTX 6000" libdirs=ollama,cuda_v13 driver=13.2 pci_id=0000:04:00.0 type=discrete total="24.0 GiB" available="23.4 GiB" time=2026-03-24T11:31:14.490+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-f7ad384d-30ee-b723-a586-06a5b29b8900 filter_id="" library=CUDA compute=7.5 name=CUDA1 description="Quadro RTX 6000" libdirs=ollama,cuda_v13 driver=13.2 pci_id=0000:05:00.0 type=discrete total="24.0 GiB" available="23.4 GiB" time=2026-03-24T11:31:14.491+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-8c020d1f-280d-e705-8f69-3a5342688f1a filter_id="" library=CUDA compute=7.5 name=CUDA2 description="Quadro RTX 6000" libdirs=ollama,cuda_v13 driver=13.2 pci_id=0000:08:00.0 type=discrete total="24.0 GiB" available="23.4 GiB" time=2026-03-24T11:31:14.491+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-36588785-9363-4c15-053d-05548b16e1a1 filter_id="" library=CUDA compute=7.5 name=CUDA4 description="Quadro RTX 6000" libdirs=ollama,cuda_v13 driver=13.2 pci_id=0000:84:00.0 type=discrete total="24.0 GiB" available="23.4 GiB" time=2026-03-24T11:31:14.491+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-57ab7e6f-39e8-2f43-7071-5eb9bfc8a9d0 filter_id="" library=CUDA compute=7.5 name=CUDA5 description="Quadro RTX 6000" libdirs=ollama,cuda_v13 driver=13.2 pci_id=0000:85:00.0 type=discrete total="24.0 GiB" available="23.4 GiB" time=2026-03-24T11:31:14.491+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-cc5bbee3-57ee-b847-ba4e-1f3847a4325e filter_id="" library=CUDA compute=7.5 name=CUDA7 description="Quadro RTX 6000" libdirs=ollama,cuda_v13 driver=13.2 pci_id=0000:89:00.0 type=discrete total="24.0 GiB" available="23.4 GiB" time=2026-03-24T11:31:14.491+08:00 level=INFO source=types.go:42 msg="inference compute" 
id=GPU-61a8ad69-4903-8ef4-d663-ad91c49fc24e filter_id="" library=CUDA compute=7.5 name=CUDA6 description="Quadro RTX 6000" libdirs=ollama,cuda_v13 driver=13.2 pci_id=0000:88:00.0 type=discrete total="24.0 GiB" available="23.4 GiB" time=2026-03-24T11:31:14.491+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-1c1cd6d7-5b20-236b-70b9-78cb46647538 filter_id="" library=CUDA compute=7.5 name=CUDA3 description="Quadro RTX 6000" libdirs=ollama,cuda_v13 driver=13.2 pci_id=0000:09:00.0 type=discrete total="24.0 GiB" available="23.2 GiB" time=2026-03-24T11:31:14.491+08:00 level=INFO source=routes.go:1832 msg="vram-based default context" total_vram="192.0 GiB" default_num_ctx=262144 [GIN] 2026/03/24 - 11:31:14 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2026/03/24 - 11:31:14 | 200 | 345.3334ms | 127.0.0.1 | POST "/api/show" [GIN] 2026/03/24 - 11:31:15 | 200 | 333.648ms | 127.0.0.1 | POST "/api/show" time=2026-03-24T11:31:15.588+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\Users\its\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --port 62523" time=2026-03-24T11:31:18.579+08:00 level=INFO source=runner.go:464 msg="failure during GPU discovery" OLLAMA_LIBRARY_PATH="[C:\Users\its\AppData\Local\Programs\Ollama\lib\ollama C:\Users\its\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13]" extra_envs=map[] error="failed to finish discovery before timeout" time=2026-03-24T11:31:18.581+08:00 level=WARN source=runner.go:356 msg="unable to refresh free memory, using old values" time=2026-03-24T11:31:18.582+08:00 level=INFO source=cpu_windows.go:148 msg=packages count=2 time=2026-03-24T11:31:18.582+08:00 level=INFO source=cpu_windows.go:195 msg="" package=0 cores=14 efficiency=0 threads=28 time=2026-03-24T11:31:18.582+08:00 level=INFO source=cpu_windows.go:195 msg="" package=1 cores=14 efficiency=0 threads=28 llama_model_loader: loaded meta data with 53 key-value pairs and 809 tensors from 
C:\Users\its.ollama\models\blobs\sha256-d98bf8c3c536de17d554ba4a78919ea45717074fbb7184cdd1f0d3bbdca055bf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = minimax-m2 llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.sampling.top_k i32 = 40 llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.950000 llama_model_loader: - kv 4: general.sampling.temp f32 = 1.000000 llama_model_loader: - kv 5: general.name str = Minimax-M2.5 llama_model_loader: - kv 6: general.basename str = Minimax-M2.5 llama_model_loader: - kv 7: general.quantized_by str = Unsloth llama_model_loader: - kv 8: general.size_label str = 256x4.9B llama_model_loader: - kv 9: general.license str = other llama_model_loader: - kv 10: general.license.name str = modified-mit llama_model_loader: - kv 11: general.license.link str = https://github.com/MiniMax-AI/MiniMax... llama_model_loader: - kv 12: general.repo_url str = https://huggingface.co/unsloth llama_model_loader: - kv 13: general.base_model.count u32 = 1 llama_model_loader: - kv 14: general.base_model.0.name str = MiniMax M2.5 llama_model_loader: - kv 15: general.base_model.0.organization str = MiniMaxAI llama_model_loader: - kv 16: general.base_model.0.repo_url str = https://huggingface.co/MiniMaxAI/Mini... 
llama_model_loader: - kv 17: general.tags arr[str,2] = ["unsloth", "text-generation"] llama_model_loader: - kv 18: minimax-m2.block_count u32 = 62 llama_model_loader: - kv 19: minimax-m2.context_length u32 = 196608 llama_model_loader: - kv 20: minimax-m2.embedding_length u32 = 3072 llama_model_loader: - kv 21: minimax-m2.feed_forward_length u32 = 1536 llama_model_loader: - kv 22: minimax-m2.attention.head_count u32 = 48 llama_model_loader: - kv 23: minimax-m2.attention.head_count_kv u32 = 8 llama_model_loader: - kv 24: minimax-m2.rope.freq_base f32 = 5000000.000000 llama_model_loader: - kv 25: minimax-m2.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 26: minimax-m2.expert_count u32 = 256 llama_model_loader: - kv 27: minimax-m2.expert_used_count u32 = 8 llama_model_loader: - kv 28: minimax-m2.expert_gating_func u32 = 2 llama_model_loader: - kv 29: minimax-m2.attention.key_length u32 = 128 llama_model_loader: - kv 30: minimax-m2.attention.value_length u32 = 128 llama_model_loader: - kv 31: minimax-m2.expert_feed_forward_length u32 = 1536 llama_model_loader: - kv 32: minimax-m2.rope.dimension_count u32 = 64 llama_model_loader: - kv 33: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 34: tokenizer.ggml.pre str = minimax-m2 llama_model_loader: - kv 35: tokenizer.ggml.tokens arr[str,200064] = ["Ā", "ā", "Ă", "ă", "Ą", "ą", ... llama_model_loader: - kv 36: tokenizer.ggml.token_type arr[i32,200064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 37: tokenizer.ggml.merges arr[str,199744] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "e r... 
llama_model_loader: - kv 38: tokenizer.ggml.bos_token_id u32 = 200034 llama_model_loader: - kv 39: tokenizer.ggml.eos_token_id u32 = 200020 llama_model_loader: - kv 40: tokenizer.ggml.unknown_token_id u32 = 200021 llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 200004 llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 43: tokenizer.chat_template str = {# Unsloth template fixes #}\n{# -----... llama_model_loader: - kv 44: general.quantization_version u32 = 2 llama_model_loader: - kv 45: general.file_type u32 = 12 llama_model_loader: - kv 46: quantize.imatrix.file str = MiniMax-M2.5-GGUF/imatrix_unsloth.gguf llama_model_loader: - kv 47: quantize.imatrix.dataset str = unsloth_calibration_MiniMax-M2.5.txt llama_model_loader: - kv 48: quantize.imatrix.entries_count u32 = 496 llama_model_loader: - kv 49: quantize.imatrix.chunks_count u32 = 81 llama_model_loader: - kv 50: split.no u16 = 0 llama_model_loader: - kv 51: split.tensors.count i32 = 809 llama_model_loader: - kv 52: split.count u16 = 0 llama_model_loader: - type f32: 373 tensors llama_model_loader: - type q3_K: 173 tensors llama_model_loader: - type q4_K: 232 tensors llama_model_loader: - type q5_K: 20 tensors llama_model_loader: - type q6_K: 11 tensors print_info: file format = GGUF V3 (latest) print_info: file type = Q3_K - Medium print_info: file size = 94.33 GiB (3.54 BPW) load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect load: printing all EOG tokens: load: - 200004 ('<fim_pad>') load: - 200005 ('<reponame>') load: - 200020 ('[e~[') load: special tokens cache size = 54 load: token to piece cache size = 1.3355 MB print_info: arch = minimax-m2 print_info: vocab_only = 1 print_info: no_alloc = 0 print_info: model type = ?B print_info: model params = 228.69 B print_info: general.name = Minimax-M2.5 print_info: vocab type = BPE print_info: n_vocab = 200064 print_info: n_merges = 199744 print_info: BOS token = 
200034 ']~~!b[' print_info: EOS token = 200020 '[e~~[' print_info: UNK token = 200021 ']!d~[' print_info: PAD token = 200004 '<fim_pad>' print_info: LF token = 10 'Ċ' print_info: FIM PRE token = 200001 '<fim_prefix>' print_info: FIM SUF token = 200003 '<fim_suffix>' print_info: FIM MID token = 200002 '<fim_middle>' print_info: FIM PAD token = 200004 '<fim_pad>' print_info: FIM REP token = 200005 '<reponame>' print_info: EOG token = 200004 '<fim_pad>' print_info: EOG token = 200005 '<reponame>' print_info: EOG token = 200020 '[e~[' print_info: max token length = 256 llama_model_load: vocab only - skipping tensors time=2026-03-24T11:31:19.413+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\Users\its\AppData\Local\Programs\Ollama\ollama.exe runner --model C:\Users\its\.ollama\models\blobs\sha256-d98bf8c3c536de17d554ba4a78919ea45717074fbb7184cdd1f0d3bbdca055bf --port 62697" time=2026-03-24T11:31:19.451+08:00 level=INFO source=sched.go:484 msg="system memory" total="255.9 GiB" free="234.9 GiB" free_swap="238.6 GiB" time=2026-03-24T11:31:19.451+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-73d44d75-0d12-df63-c91d-ab76ac0c8b36 library=CUDA available="23.0 GiB" free="23.4 GiB" minimum="457.0 MiB" overhead="0 B" time=2026-03-24T11:31:19.451+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-f7ad384d-30ee-b723-a586-06a5b29b8900 library=CUDA available="23.0 GiB" free="23.4 GiB" minimum="457.0 MiB" overhead="0 B" time=2026-03-24T11:31:19.452+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-8c020d1f-280d-e705-8f69-3a5342688f1a library=CUDA available="23.0 GiB" free="23.4 GiB" minimum="457.0 MiB" overhead="0 B" time=2026-03-24T11:31:19.452+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-1c1cd6d7-5b20-236b-70b9-78cb46647538 library=CUDA available="22.7 GiB" free="23.2 GiB" minimum="457.0 MiB" overhead="0 B" time=2026-03-24T11:31:19.452+08:00 level=INFO source=sched.go:491 msg="gpu memory" 
id=GPU-36588785-9363-4c15-053d-05548b16e1a1 library=CUDA available="23.0 GiB" free="23.4 GiB" minimum="457.0 MiB" overhead="0 B" time=2026-03-24T11:31:19.452+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-57ab7e6f-39e8-2f43-7071-5eb9bfc8a9d0 library=CUDA available="23.0 GiB" free="23.4 GiB" minimum="457.0 MiB" overhead="0 B" time=2026-03-24T11:31:19.452+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-61a8ad69-4903-8ef4-d663-ad91c49fc24e library=CUDA available="22.9 GiB" free="23.4 GiB" minimum="457.0 MiB" overhead="0 B" time=2026-03-24T11:31:19.452+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-cc5bbee3-57ee-b847-ba4e-1f3847a4325e library=CUDA available="23.0 GiB" free="23.4 GiB" minimum="457.0 MiB" overhead="0 B" time=2026-03-24T11:31:19.452+08:00 level=INFO source=server.go:497 msg="loading model" "model layers"=63 requested=-1 time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="10.8 GiB" time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA1 size="11.8 GiB" time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA2 size="11.8 GiB" time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA3 size="11.3 GiB" time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA4 size="12.3 GiB" time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA5 size="12.0 GiB" time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA6 size="12.0 GiB" time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA7 size="12.1 GiB" time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="224.0 MiB" time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:251 msg="kv 
cache" device=CUDA1 size="256.0 MiB" time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA2 size="256.0 MiB" time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA3 size="224.0 MiB" time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA4 size="256.0 MiB" time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA5 size="256.0 MiB" time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA6 size="256.0 MiB" time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA7 size="256.0 MiB" time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="1.9 GiB" time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA1 size="1.9 GiB" time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA2 size="1.9 GiB" time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA3 size="1.9 GiB" time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA4 size="1.9 GiB" time=2026-03-24T11:31:19.457+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA5 size="1.9 GiB" time=2026-03-24T11:31:19.457+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA6 size="1.9 GiB" time=2026-03-24T11:31:19.457+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA7 size="1.9 GiB" time=2026-03-24T11:31:19.457+08:00 level=INFO source=device.go:272 msg="total memory" size="111.4 GiB" time=2026-03-24T11:31:20.810+08:00 level=INFO source=runner.go:965 msg="starting go runner" load_backend: loaded CPU backend from C:\Users\its\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: 
GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 8 CUDA devices: Device 0: Quadro RTX 6000, compute capability 7.5, VMM: yes, ID: GPU-73d44d75-0d12-df63-c91d-ab76ac0c8b36 Device 1: Quadro RTX 6000, compute capability 7.5, VMM: yes, ID: GPU-f7ad384d-30ee-b723-a586-06a5b29b8900 Device 2: Quadro RTX 6000, compute capability 7.5, VMM: yes, ID: GPU-8c020d1f-280d-e705-8f69-3a5342688f1a Device 3: Quadro RTX 6000, compute capability 7.5, VMM: yes, ID: GPU-1c1cd6d7-5b20-236b-70b9-78cb46647538 Device 4: Quadro RTX 6000, compute capability 7.5, VMM: yes, ID: GPU-36588785-9363-4c15-053d-05548b16e1a1 Device 5: Quadro RTX 6000, compute capability 7.5, VMM: yes, ID: GPU-57ab7e6f-39e8-2f43-7071-5eb9bfc8a9d0 Device 6: Quadro RTX 6000, compute capability 7.5, VMM: yes, ID: GPU-61a8ad69-4903-8ef4-d663-ad91c49fc24e Device 7: Quadro RTX 6000, compute capability 7.5, VMM: yes, ID: GPU-cc5bbee3-57ee-b847-ba4e-1f3847a4325e load_backend: loaded CUDA backend from C:\Users\its\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13\ggml-cuda.dll time=2026-03-24T11:31:21.082+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 CUDA.3.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.3.USE_GRAPHS=1 CUDA.3.PEER_MAX_BATCH_SIZE=128 CUDA.4.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.4.USE_GRAPHS=1 CUDA.4.PEER_MAX_BATCH_SIZE=128 CUDA.5.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.5.USE_GRAPHS=1 CUDA.5.PEER_MAX_BATCH_SIZE=128 CUDA.6.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.6.USE_GRAPHS=1 
CUDA.6.PEER_MAX_BATCH_SIZE=128 CUDA.7.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.7.USE_GRAPHS=1 CUDA.7.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang) time=2026-03-24T11:31:21.085+08:00 level=INFO source=runner.go:1001 msg="Server listening on 127.0.0.1:62697" time=2026-03-24T11:31:21.089+08:00 level=INFO source=runner.go:895 msg=load request="{Operation:commit LoraPath:[] Parallel:8 BatchSize:512 FlashAttention:Auto KvSize:8192 KvCacheType: NumThreads:28 GPULayers:63[ID:GPU-73d44d75-0d12-df63-c91d-ab76ac0c8b36 Layers:7(0..6) ID:GPU-f7ad384d-30ee-b723-a586-06a5b29b8900 Layers:8(7..14) ID:GPU-8c020d1f-280d-e705-8f69-3a5342688f1a Layers:8(15..22) ID:GPU-36588785-9363-4c15-053d-05548b16e1a1 Layers:8(23..30) ID:GPU-57ab7e6f-39e8-2f43-7071-5eb9bfc8a9d0 Layers:8(31..38) ID:GPU-cc5bbee3-57ee-b847-ba4e-1f3847a4325e Layers:8(39..46) ID:GPU-61a8ad69-4903-8ef4-d663-ad91c49fc24e Layers:8(47..54) ID:GPU-1c1cd6d7-5b20-236b-70b9-78cb46647538 Layers:8(55..62)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2026-03-24T11:31:21.089+08:00 level=INFO source=server.go:1350 msg="waiting for llama runner to start responding" time=2026-03-24T11:31:21.089+08:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model" ggml_backend_cuda_device_get_memory device GPU-73d44d75-0d12-df63-c91d-ab76ac0c8b36 utilizing NVML memory reporting free: 24976601088 total: 25769803776 llama_model_load_from_file_impl: using device CUDA0 (Quadro RTX 6000) (0000:04:00.0) - 23819 MiB free ggml_backend_cuda_device_get_memory device GPU-f7ad384d-30ee-b723-a586-06a5b29b8900 utilizing NVML memory reporting free: 24976601088 total: 25769803776 llama_model_load_from_file_impl: using device CUDA1 (Quadro RTX 6000) (0000:05:00.0) - 23819 MiB free ggml_backend_cuda_device_get_memory device GPU-8c020d1f-280d-e705-8f69-3a5342688f1a utilizing NVML memory reporting free: 24976601088 total: 25769803776 llama_model_load_from_file_impl: 
using device CUDA2 (Quadro RTX 6000) (0000:08:00.0) - 23819 MiB free ggml_backend_cuda_device_get_memory device GPU-36588785-9363-4c15-053d-05548b16e1a1 utilizing NVML memory reporting free: 24976601088 total: 25769803776 llama_model_load_from_file_impl: using device CUDA4 (Quadro RTX 6000) (0000:84:00.0) - 23819 MiB free ggml_backend_cuda_device_get_memory device GPU-57ab7e6f-39e8-2f43-7071-5eb9bfc8a9d0 utilizing NVML memory reporting free: 24976601088 total: 25769803776 llama_model_load_from_file_impl: using device CUDA5 (Quadro RTX 6000) (0000:85:00.0) - 23819 MiB free ggml_backend_cuda_device_get_memory device GPU-cc5bbee3-57ee-b847-ba4e-1f3847a4325e utilizing NVML memory reporting free: 24976601088 total: 25769803776 llama_model_load_from_file_impl: using device CUDA7 (Quadro RTX 6000) (0000:89:00.0) - 23819 MiB free ggml_backend_cuda_device_get_memory device GPU-61a8ad69-4903-8ef4-d663-ad91c49fc24e utilizing NVML memory reporting free: 24959668224 total: 25769803776 llama_model_load_from_file_impl: using device CUDA6 (Quadro RTX 6000) (0000:88:00.0) - 23803 MiB free ggml_backend_cuda_device_get_memory device GPU-1c1cd6d7-5b20-236b-70b9-78cb46647538 utilizing NVML memory reporting free: 24738156544 total: 25769803776 llama_model_load_from_file_impl: using device CUDA3 (Quadro RTX 6000) (0000:09:00.0) - 23592 MiB free llama_model_loader: loaded meta data with 53 key-value pairs and 809 tensors from C:\Users\its.ollama\models\blobs\sha256-d98bf8c3c536de17d554ba4a78919ea45717074fbb7184cdd1f0d3bbdca055bf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
llama_model_loader: - kv 0: general.architecture str = minimax-m2 llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.sampling.top_k i32 = 40 llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.950000 llama_model_loader: - kv 4: general.sampling.temp f32 = 1.000000 llama_model_loader: - kv 5: general.name str = Minimax-M2.5 llama_model_loader: - kv 6: general.basename str = Minimax-M2.5 llama_model_loader: - kv 7: general.quantized_by str = Unsloth llama_model_loader: - kv 8: general.size_label str = 256x4.9B llama_model_loader: - kv 9: general.license str = other llama_model_loader: - kv 10: general.license.name str = modified-mit llama_model_loader: - kv 11: general.license.link str = https://github.com/MiniMax-AI/MiniMax... llama_model_loader: - kv 12: general.repo_url str = https://huggingface.co/unsloth llama_model_loader: - kv 13: general.base_model.count u32 = 1 llama_model_loader: - kv 14: general.base_model.0.name str = MiniMax M2.5 llama_model_loader: - kv 15: general.base_model.0.organization str = MiniMaxAI llama_model_loader: - kv 16: general.base_model.0.repo_url str = https://huggingface.co/MiniMaxAI/Mini... 
llama_model_loader: - kv 17: general.tags arr[str,2] = ["unsloth", "text-generation"] llama_model_loader: - kv 18: minimax-m2.block_count u32 = 62 llama_model_loader: - kv 19: minimax-m2.context_length u32 = 196608 llama_model_loader: - kv 20: minimax-m2.embedding_length u32 = 3072 llama_model_loader: - kv 21: minimax-m2.feed_forward_length u32 = 1536 llama_model_loader: - kv 22: minimax-m2.attention.head_count u32 = 48 llama_model_loader: - kv 23: minimax-m2.attention.head_count_kv u32 = 8 llama_model_loader: - kv 24: minimax-m2.rope.freq_base f32 = 5000000.000000 llama_model_loader: - kv 25: minimax-m2.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 26: minimax-m2.expert_count u32 = 256 llama_model_loader: - kv 27: minimax-m2.expert_used_count u32 = 8 llama_model_loader: - kv 28: minimax-m2.expert_gating_func u32 = 2 llama_model_loader: - kv 29: minimax-m2.attention.key_length u32 = 128 llama_model_loader: - kv 30: minimax-m2.attention.value_length u32 = 128 llama_model_loader: - kv 31: minimax-m2.expert_feed_forward_length u32 = 1536 llama_model_loader: - kv 32: minimax-m2.rope.dimension_count u32 = 64 llama_model_loader: - kv 33: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 34: tokenizer.ggml.pre str = minimax-m2 llama_model_loader: - kv 35: tokenizer.ggml.tokens arr[str,200064] = ["Ā", "ā", "Ă", "ă", "Ą", "ą", ... llama_model_loader: - kv 36: tokenizer.ggml.token_type arr[i32,200064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 37: tokenizer.ggml.merges arr[str,199744] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "e r... 
llama_model_loader: - kv 38: tokenizer.ggml.bos_token_id u32 = 200034
llama_model_loader: - kv 39: tokenizer.ggml.eos_token_id u32 = 200020
llama_model_loader: - kv 40: tokenizer.ggml.unknown_token_id u32 = 200021
llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 200004
llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 43: tokenizer.chat_template str = {# Unsloth template fixes #}\n{# -----...
llama_model_loader: - kv 44: general.quantization_version u32 = 2
llama_model_loader: - kv 45: general.file_type u32 = 12
llama_model_loader: - kv 46: quantize.imatrix.file str = MiniMax-M2.5-GGUF/imatrix_unsloth.gguf
llama_model_loader: - kv 47: quantize.imatrix.dataset str = unsloth_calibration_MiniMax-M2.5.txt
llama_model_loader: - kv 48: quantize.imatrix.entries_count u32 = 496
llama_model_loader: - kv 49: quantize.imatrix.chunks_count u32 = 81
llama_model_loader: - kv 50: split.no u16 = 0
llama_model_loader: - kv 51: split.tensors.count i32 = 809
llama_model_loader: - kv 52: split.count u16 = 0
llama_model_loader: - type f32: 373 tensors
llama_model_loader: - type q3_K: 173 tensors
llama_model_loader: - type q4_K: 232 tensors
llama_model_loader: - type q5_K: 20 tensors
llama_model_loader: - type q6_K: 11 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q3_K - Medium
print_info: file size = 94.33 GiB (3.54 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 200004 ('<fim_pad>')
load: - 200005 ('<reponame>')
load: - 200020 ('[e~[')
load: special tokens cache size = 54
load: token to piece cache size = 1.3355 MB
print_info: arch = minimax-m2
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 196608
print_info: n_embd = 3072
print_info: n_embd_inp = 3072
print_info: n_layer = 62
print_info: n_head = 48
print_info: n_head_kv = 8
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 6
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 1536
print_info: n_expert = 256
print_info: n_expert_used = 8
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 196608
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned = unknown
print_info: model type = 230B.A10B
print_info: model params = 228.69 B
print_info: general.name = Minimax-M2.5
print_info: vocab type = BPE
print_info: n_vocab = 200064
print_info: n_merges = 199744
print_info: BOS token = 200034 ']~!b['
print_info: EOS token = 200020 '[e~['
print_info: UNK token = 200021 ']!d~['
print_info: PAD token = 200004 '<fim_pad>'
print_info: LF token = 10 'Ċ'
print_info: FIM PRE token = 200001 '<fim_prefix>'
print_info: FIM SUF token = 200003 '<fim_suffix>'
print_info: FIM MID token = 200002 '<fim_middle>'
print_info: FIM PAD token = 200004 '<fim_pad>'
print_info: FIM REP token = 200005 '<reponame>'
print_info: EOG token = 200004 '<fim_pad>'
print_info: EOG token = 200005 '<reponame>'
print_info: EOG token = 200020 '[e~['
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 62 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 63/63 layers to GPU
load_tensors: CPU model buffer size = 329.70 MiB
load_tensors: CUDA0 model buffer size = 11054.66 MiB
load_tensors: CUDA1 model buffer size = 12107.34 MiB
load_tensors: CUDA2 model buffer size = 12093.41 MiB
load_tensors: CUDA3 model buffer size = 11536.70 MiB
load_tensors: CUDA4 model buffer size = 12552.41 MiB
load_tensors: CUDA5 model buffer size = 12251.66 MiB
load_tensors: CUDA6 model buffer size = 12260.34 MiB
load_tensors: CUDA7 model buffer size = 12409.91 MiB
llama_context: constructing llama_context
llama_context: n_seq_max = 8
llama_context: n_ctx = 8192
llama_context: n_ctx_seq = 1024
llama_context: n_batch = 4096
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 5000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (1024) < n_ctx_train (196608) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 6.20 MiB
llama_kv_cache: CUDA0 KV buffer size = 224.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 256.00 MiB
llama_kv_cache: CUDA2 KV buffer size = 256.00 MiB
llama_kv_cache: CUDA3 KV buffer size = 224.00 MiB
llama_kv_cache: CUDA4 KV buffer size = 256.00 MiB
llama_kv_cache: CUDA5 KV buffer size = 256.00 MiB
llama_kv_cache: CUDA6 KV buffer size = 256.00 MiB
llama_kv_cache: CUDA7 KV buffer size = 256.00 MiB
llama_kv_cache: size = 1984.00 MiB ( 1024 cells, 62 layers, 8/8 seqs), K (f16): 992.00 MiB, V (f16): 992.00 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context: Flash Attention was auto, set to enabled
CUDA error: out of memory
  current device: 6, in function alloc at C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:576
  cuMemAddressReserve(&pool_addr, CUDA_POOL_VMM_MAX_SIZE, 0, 0, 0)
C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:94: CUDA error
time=2026-03-24T11:32:15.465+08:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server not responding"
time=2026-03-24T11:32:17.522+08:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server error"
time=2026-03-24T11:32:17.615+08:00 level=ERROR source=server.go:303 msg="llama runner terminated" error="exit status 1"
time=2026-03-24T11:32:17.772+08:00 level=INFO source=sched.go:511 msg="Load failed" model=C:\Users\its\.ollama\models\blobs\sha256-d98bf8c3c536de17d554ba4a78919ea45717074fbb7184cdd1f0d3bbdca055bf error="llama runner process has terminated: CUDA error"
[GIN] 2026/03/24 - 11:32:17 | 500 | 1m2s | 127.0.0.1 | POST "/api/generate"

Root Cause

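The failure occurs in cuMemAddressReserve, the CUDA driver call that reserves a range of virtual address space for the VMM memory pool (CUDA_POOL_VMM_MAX_SIZE); it does not allocate physical VRAM. In the log above the error was raised on device 6, but the affected device varies from run to run. Physical VRAM is not the constraint: each of the 8 cards has 24 GiB (192 GiB in total), while the server log estimates the total memory needed for the load at 111.4 GiB. The "out of memory" result therefore points to the address-space reservation failing, not to the cards running out of VRAM.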
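
As a rough check of the VRAM claim, the sketch below adds up the per-GPU figures that the server log reports (the "model weights", "kv cache" and "compute graph" lines from device.go, and the "available" value from the "gpu memory" lines). The helper name `headroom_per_gpu` is made up for this illustration, and the numbers are the rounded values printed in the log, so the result is an estimate rather than an exact accounting.

```python
# Per-GPU figures copied from the server log (device.go and sched.go lines),
# listed in CUDA0..CUDA7 order. Values are rounded as printed in the log.
MIB_PER_GIB = 1024

weights_gib = [10.8, 11.8, 11.8, 11.3, 12.3, 12.0, 12.0, 12.1]    # "model weights"
kv_mib = [224, 256, 256, 224, 256, 256, 256, 256]                 # "kv cache"
graph_gib = 1.9                                                   # "compute graph", same on every device
available_gib = [23.0, 23.0, 23.0, 22.7, 23.0, 23.0, 22.9, 23.0]  # "gpu memory" available


def headroom_per_gpu():
    """Return (device name, planned GiB, headroom GiB) for each GPU."""
    rows = []
    for i, (w, kv, avail) in enumerate(zip(weights_gib, kv_mib, available_gib)):
        planned = w + kv / MIB_PER_GIB + graph_gib
        rows.append((f"CUDA{i}", planned, avail - planned))
    return rows


rows = headroom_per_gpu()
for name, planned, headroom in rows:
    print(f"{name}: planned {planned:.2f} GiB, headroom {headroom:.2f} GiB")

# The device with the least spare VRAM under these estimates.
tightest = min(rows, key=lambda r: r[2])
print(f"tightest device: {tightest[0]} with {tightest[2]:.2f} GiB to spare")
```

Under these estimates every card keeps more than 8 GiB free, which is consistent with the failure being in the virtual address reservation rather than in physical memory.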

Code Example

FROM ./MiniMax-M2.5-UD-Q3_K_XL.gguf
PARAMETER num_ctx 1024
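
The Modelfile asks for only 1024 tokens of context, yet the load request in the log shows KvSize:8192. A minimal sketch of the arithmetic, assuming the context is multiplied by OLLAMA_NUM_PARALLEL and that an f16 KV cache takes layers × cells × KV width × 2 bytes for each of K and V (the variable names are illustrative; the values come from the log):

```python
# Values taken from the log: Modelfile num_ctx, OLLAMA_NUM_PARALLEL,
# and the print_info lines for the model.
num_ctx = 1024        # PARAMETER num_ctx in the Modelfile
num_parallel = 8      # OLLAMA_NUM_PARALLEL
n_layer = 62          # print_info: n_layer
n_embd_k_gqa = 1024   # print_info: n_embd_k_gqa
n_embd_v_gqa = 1024   # print_info: n_embd_v_gqa
bytes_per_elem = 2    # K and V are stored as f16 per the llama_kv_cache line

MIB = 1024 * 1024

# Total KV cells across all parallel sequences (assumed: num_ctx * parallel).
kv_size = num_ctx * num_parallel

k_bytes = n_layer * kv_size * n_embd_k_gqa * bytes_per_elem
v_bytes = n_layer * kv_size * n_embd_v_gqa * bytes_per_elem

print("KvSize:", kv_size)
print("K cache MiB:", k_bytes / MIB)
print("V cache MiB:", v_bytes / MIB)
print("total KV MiB:", (k_bytes + v_bytes) / MIB)
```

The totals agree with the llama_kv_cache line in the error message (1984.00 MiB, split evenly between K and V), so the KV cache is under 2 GiB across all eight cards and is not what exhausts memory.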

---

PS C:\Users\its> ollama serve
time=2026-03-24T11:31:02.892+08:00 level=INFO source=routes.go:1727 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0, 1, 2, 3, 4, 5, 6, 7 GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:0 OLLAMA_DEBUG:INFO OLLAMA_EDITOR: OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\its\\.ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NO_CLOUD:false OLLAMA_NUM_PARALLEL:8 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:true OLLAMA_VULKAN:false ROCR_VISIBLE_DEVICES:]"
time=2026-03-24T11:31:02.922+08:00 level=INFO source=routes.go:1729 msg="Ollama cloud disabled: false"
time=2026-03-24T11:31:02.934+08:00 level=INFO source=images.go:477 msg="total blobs: 25"
time=2026-03-24T11:31:02.939+08:00 level=INFO source=images.go:484 msg="total unused blobs removed: 0"
time=2026-03-24T11:31:02.942+08:00 level=INFO source=routes.go:1782 msg="Listening on [::]:11434 (version 0.18.2)"
time=2026-03-24T11:31:02.944+08:00 level=INFO source=runner.go:67 msg="discovering available GPUs..."
time=2026-03-24T11:31:02.981+08:00 level=WARN source=runner.go:485 msg="user overrode visible devices" CUDA_VISIBLE_DEVICES="0, 1, 2, 3, 4, 5, 6, 7"
time=2026-03-24T11:31:02.981+08:00 level=WARN source=runner.go:489 msg="if GPUs are not correctly discovered, unset and try again"
time=2026-03-24T11:31:03.001+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58225"
time=2026-03-24T11:31:05.977+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58369"
time=2026-03-24T11:31:08.823+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58526"
time=2026-03-24T11:31:11.200+08:00 level=INFO source=runner.go:106 msg="experimental Vulkan support disabled.  To enable, set OLLAMA_VULKAN=1"
time=2026-03-24T11:31:11.205+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58761"
time=2026-03-24T11:31:11.207+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58763"
time=2026-03-24T11:31:11.207+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58762"
time=2026-03-24T11:31:11.207+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58764"
time=2026-03-24T11:31:11.209+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58767"
time=2026-03-24T11:31:11.208+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58765"
time=2026-03-24T11:31:11.209+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58766"
time=2026-03-24T11:31:11.209+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58769"
time=2026-03-24T11:31:11.209+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58768"
time=2026-03-24T11:31:11.209+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58771"
time=2026-03-24T11:31:11.210+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58770"
time=2026-03-24T11:31:11.210+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58772"
time=2026-03-24T11:31:11.211+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58773"
time=2026-03-24T11:31:11.211+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58774"
time=2026-03-24T11:31:11.211+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58775"
time=2026-03-24T11:31:11.212+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58776"
time=2026-03-24T11:31:14.490+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-73d44d75-0d12-df63-c91d-ab76ac0c8b36 filter_id="" library=CUDA compute=7.5 name=CUDA0 description="Quadro RTX 6000" libdirs=ollama,cuda_v13 driver=13.2 pci_id=0000:04:00.0 type=discrete total="24.0 GiB" available="23.4 GiB"
time=2026-03-24T11:31:14.490+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-f7ad384d-30ee-b723-a586-06a5b29b8900 filter_id="" library=CUDA compute=7.5 name=CUDA1 description="Quadro RTX 6000" libdirs=ollama,cuda_v13 driver=13.2 pci_id=0000:05:00.0 type=discrete total="24.0 GiB" available="23.4 GiB"
time=2026-03-24T11:31:14.491+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-8c020d1f-280d-e705-8f69-3a5342688f1a filter_id="" library=CUDA compute=7.5 name=CUDA2 description="Quadro RTX 6000" libdirs=ollama,cuda_v13 driver=13.2 pci_id=0000:08:00.0 type=discrete total="24.0 GiB" available="23.4 GiB"
time=2026-03-24T11:31:14.491+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-36588785-9363-4c15-053d-05548b16e1a1 filter_id="" library=CUDA compute=7.5 name=CUDA4 description="Quadro RTX 6000" libdirs=ollama,cuda_v13 driver=13.2 pci_id=0000:84:00.0 type=discrete total="24.0 GiB" available="23.4 GiB"
time=2026-03-24T11:31:14.491+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-57ab7e6f-39e8-2f43-7071-5eb9bfc8a9d0 filter_id="" library=CUDA compute=7.5 name=CUDA5 description="Quadro RTX 6000" libdirs=ollama,cuda_v13 driver=13.2 pci_id=0000:85:00.0 type=discrete total="24.0 GiB" available="23.4 GiB"
time=2026-03-24T11:31:14.491+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-cc5bbee3-57ee-b847-ba4e-1f3847a4325e filter_id="" library=CUDA compute=7.5 name=CUDA7 description="Quadro RTX 6000" libdirs=ollama,cuda_v13 driver=13.2 pci_id=0000:89:00.0 type=discrete total="24.0 GiB" available="23.4 GiB"
time=2026-03-24T11:31:14.491+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-61a8ad69-4903-8ef4-d663-ad91c49fc24e filter_id="" library=CUDA compute=7.5 name=CUDA6 description="Quadro RTX 6000" libdirs=ollama,cuda_v13 driver=13.2 pci_id=0000:88:00.0 type=discrete total="24.0 GiB" available="23.4 GiB"
time=2026-03-24T11:31:14.491+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-1c1cd6d7-5b20-236b-70b9-78cb46647538 filter_id="" library=CUDA compute=7.5 name=CUDA3 description="Quadro RTX 6000" libdirs=ollama,cuda_v13 driver=13.2 pci_id=0000:09:00.0 type=discrete total="24.0 GiB" available="23.2 GiB"
time=2026-03-24T11:31:14.491+08:00 level=INFO source=routes.go:1832 msg="vram-based default context" total_vram="192.0 GiB" default_num_ctx=262144
[GIN] 2026/03/24 - 11:31:14 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2026/03/24 - 11:31:14 | 200 |    345.3334ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2026/03/24 - 11:31:15 | 200 |     333.648ms |       127.0.0.1 | POST     "/api/show"
time=2026-03-24T11:31:15.588+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 62523"
time=2026-03-24T11:31:18.579+08:00 level=INFO source=runner.go:464 msg="failure during GPU discovery" OLLAMA_LIBRARY_PATH="[C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\lib\\ollama C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v13]" extra_envs=map[] error="failed to finish discovery before timeout"
time=2026-03-24T11:31:18.581+08:00 level=WARN source=runner.go:356 msg="unable to refresh free memory, using old values"
time=2026-03-24T11:31:18.582+08:00 level=INFO source=cpu_windows.go:148 msg=packages count=2
time=2026-03-24T11:31:18.582+08:00 level=INFO source=cpu_windows.go:195 msg="" package=0 cores=14 efficiency=0 threads=28
time=2026-03-24T11:31:18.582+08:00 level=INFO source=cpu_windows.go:195 msg="" package=1 cores=14 efficiency=0 threads=28
llama_model_loader: loaded meta data with 53 key-value pairs and 809 tensors from C:\Users\its\.ollama\models\blobs\sha256-d98bf8c3c536de17d554ba4a78919ea45717074fbb7184cdd1f0d3bbdca055bf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = minimax-m2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 40
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.950000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 1.000000
llama_model_loader: - kv   5:                               general.name str              = Minimax-M2.5
llama_model_loader: - kv   6:                           general.basename str              = Minimax-M2.5
llama_model_loader: - kv   7:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   8:                         general.size_label str              = 256x4.9B
llama_model_loader: - kv   9:                            general.license str              = other
llama_model_loader: - kv  10:                       general.license.name str              = modified-mit
llama_model_loader: - kv  11:                       general.license.link str              = https://github.com/MiniMax-AI/MiniMax...
llama_model_loader: - kv  12:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  13:                   general.base_model.count u32              = 1
llama_model_loader: - kv  14:                  general.base_model.0.name str              = MiniMax M2.5
llama_model_loader: - kv  15:          general.base_model.0.organization str              = MiniMaxAI
llama_model_loader: - kv  16:              general.base_model.0.repo_url str              = https://huggingface.co/MiniMaxAI/Mini...
llama_model_loader: - kv  17:                               general.tags arr[str,2]       = ["unsloth", "text-generation"]
llama_model_loader: - kv  18:                     minimax-m2.block_count u32              = 62
llama_model_loader: - kv  19:                  minimax-m2.context_length u32              = 196608
llama_model_loader: - kv  20:                minimax-m2.embedding_length u32              = 3072
llama_model_loader: - kv  21:             minimax-m2.feed_forward_length u32              = 1536
llama_model_loader: - kv  22:            minimax-m2.attention.head_count u32              = 48
llama_model_loader: - kv  23:         minimax-m2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  24:                  minimax-m2.rope.freq_base f32              = 5000000.000000
llama_model_loader: - kv  25: minimax-m2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  26:                    minimax-m2.expert_count u32              = 256
llama_model_loader: - kv  27:               minimax-m2.expert_used_count u32              = 8
llama_model_loader: - kv  28:              minimax-m2.expert_gating_func u32              = 2
llama_model_loader: - kv  29:            minimax-m2.attention.key_length u32              = 128
llama_model_loader: - kv  30:          minimax-m2.attention.value_length u32              = 128
llama_model_loader: - kv  31:      minimax-m2.expert_feed_forward_length u32              = 1536
llama_model_loader: - kv  32:            minimax-m2.rope.dimension_count u32              = 64
llama_model_loader: - kv  33:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  34:                         tokenizer.ggml.pre str              = minimax-m2
llama_model_loader: - kv  35:                      tokenizer.ggml.tokens arr[str,200064]  = ["Ā", "ā", "Ă", "ă", "Ą", "ą", ...
llama_model_loader: - kv  36:                  tokenizer.ggml.token_type arr[i32,200064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  37:                      tokenizer.ggml.merges arr[str,199744]  = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "e r...
llama_model_loader: - kv  38:                tokenizer.ggml.bos_token_id u32              = 200034
llama_model_loader: - kv  39:                tokenizer.ggml.eos_token_id u32              = 200020
llama_model_loader: - kv  40:            tokenizer.ggml.unknown_token_id u32              = 200021
llama_model_loader: - kv  41:            tokenizer.ggml.padding_token_id u32              = 200004
llama_model_loader: - kv  42:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  43:                    tokenizer.chat_template str              = {# Unsloth template fixes #}\n{# -----...
llama_model_loader: - kv  44:               general.quantization_version u32              = 2
llama_model_loader: - kv  45:                          general.file_type u32              = 12
llama_model_loader: - kv  46:                      quantize.imatrix.file str              = MiniMax-M2.5-GGUF/imatrix_unsloth.gguf
llama_model_loader: - kv  47:                   quantize.imatrix.dataset str              = unsloth_calibration_MiniMax-M2.5.txt
llama_model_loader: - kv  48:             quantize.imatrix.entries_count u32              = 496
llama_model_loader: - kv  49:              quantize.imatrix.chunks_count u32              = 81
llama_model_loader: - kv  50:                                   split.no u16              = 0
llama_model_loader: - kv  51:                        split.tensors.count i32              = 809
llama_model_loader: - kv  52:                                split.count u16              = 0
llama_model_loader: - type  f32:  373 tensors
llama_model_loader: - type q3_K:  173 tensors
llama_model_loader: - type q4_K:  232 tensors
llama_model_loader: - type q5_K:   20 tensors
llama_model_loader: - type q6_K:   11 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q3_K - Medium
print_info: file size   = 94.33 GiB (3.54 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 200004 ('<fim_pad>')
load:   - 200005 ('<reponame>')
load:   - 200020 ('[e~[')
load: special tokens cache size = 54
load: token to piece cache size = 1.3355 MB
print_info: arch             = minimax-m2
print_info: vocab_only       = 1
print_info: no_alloc         = 0
print_info: model type       = ?B
print_info: model params     = 228.69 B
print_info: general.name     = Minimax-M2.5
print_info: vocab type       = BPE
print_info: n_vocab          = 200064
print_info: n_merges         = 199744
print_info: BOS token        = 200034 ']~!b['
print_info: EOS token        = 200020 '[e~['
print_info: UNK token        = 200021 ']!d~['
print_info: PAD token        = 200004 '<fim_pad>'
print_info: LF token         = 10 'Ċ'
print_info: FIM PRE token    = 200001 '<fim_prefix>'
print_info: FIM SUF token    = 200003 '<fim_suffix>'
print_info: FIM MID token    = 200002 '<fim_middle>'
print_info: FIM PAD token    = 200004 '<fim_pad>'
print_info: FIM REP token    = 200005 '<reponame>'
print_info: EOG token        = 200004 '<fim_pad>'
print_info: EOG token        = 200005 '<reponame>'
print_info: EOG token        = 200020 '[e~['
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2026-03-24T11:31:19.413+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\its\\.ollama\\models\\blobs\\sha256-d98bf8c3c536de17d554ba4a78919ea45717074fbb7184cdd1f0d3bbdca055bf --port 62697"
time=2026-03-24T11:31:19.451+08:00 level=INFO source=sched.go:484 msg="system memory" total="255.9 GiB" free="234.9 GiB" free_swap="238.6 GiB"
time=2026-03-24T11:31:19.451+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-73d44d75-0d12-df63-c91d-ab76ac0c8b36 library=CUDA available="23.0 GiB" free="23.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-03-24T11:31:19.451+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-f7ad384d-30ee-b723-a586-06a5b29b8900 library=CUDA available="23.0 GiB" free="23.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-03-24T11:31:19.452+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-8c020d1f-280d-e705-8f69-3a5342688f1a library=CUDA available="23.0 GiB" free="23.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-03-24T11:31:19.452+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-1c1cd6d7-5b20-236b-70b9-78cb46647538 library=CUDA available="22.7 GiB" free="23.2 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-03-24T11:31:19.452+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-36588785-9363-4c15-053d-05548b16e1a1 library=CUDA available="23.0 GiB" free="23.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-03-24T11:31:19.452+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-57ab7e6f-39e8-2f43-7071-5eb9bfc8a9d0 library=CUDA available="23.0 GiB" free="23.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-03-24T11:31:19.452+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-61a8ad69-4903-8ef4-d663-ad91c49fc24e library=CUDA available="22.9 GiB" free="23.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-03-24T11:31:19.452+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-cc5bbee3-57ee-b847-ba4e-1f3847a4325e library=CUDA available="23.0 GiB" free="23.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-03-24T11:31:19.452+08:00 level=INFO source=server.go:497 msg="loading model" "model layers"=63 requested=-1
time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="10.8 GiB"
time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA1 size="11.8 GiB"
time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA2 size="11.8 GiB"
time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA3 size="11.3 GiB"
time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA4 size="12.3 GiB"
time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA5 size="12.0 GiB"
time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA6 size="12.0 GiB"
time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA7 size="12.1 GiB"
time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="224.0 MiB"
time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA1 size="256.0 MiB"
time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA2 size="256.0 MiB"
time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA3 size="224.0 MiB"
time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA4 size="256.0 MiB"
time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA5 size="256.0 MiB"
time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA6 size="256.0 MiB"
time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA7 size="256.0 MiB"
time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="1.9 GiB"
time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA1 size="1.9 GiB"
time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA2 size="1.9 GiB"
time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA3 size="1.9 GiB"
time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA4 size="1.9 GiB"
time=2026-03-24T11:31:19.457+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA5 size="1.9 GiB"
time=2026-03-24T11:31:19.457+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA6 size="1.9 GiB"
time=2026-03-24T11:31:19.457+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA7 size="1.9 GiB"
time=2026-03-24T11:31:19.457+08:00 level=INFO source=device.go:272 msg="total memory" size="111.4 GiB"
time=2026-03-24T11:31:20.810+08:00 level=INFO source=runner.go:965 msg="starting go runner"
load_backend: loaded CPU backend from C:\Users\its\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
  Device 0: Quadro RTX 6000, compute capability 7.5, VMM: yes, ID: GPU-73d44d75-0d12-df63-c91d-ab76ac0c8b36
  Device 1: Quadro RTX 6000, compute capability 7.5, VMM: yes, ID: GPU-f7ad384d-30ee-b723-a586-06a5b29b8900
  Device 2: Quadro RTX 6000, compute capability 7.5, VMM: yes, ID: GPU-8c020d1f-280d-e705-8f69-3a5342688f1a
  Device 3: Quadro RTX 6000, compute capability 7.5, VMM: yes, ID: GPU-1c1cd6d7-5b20-236b-70b9-78cb46647538
  Device 4: Quadro RTX 6000, compute capability 7.5, VMM: yes, ID: GPU-36588785-9363-4c15-053d-05548b16e1a1
  Device 5: Quadro RTX 6000, compute capability 7.5, VMM: yes, ID: GPU-57ab7e6f-39e8-2f43-7071-5eb9bfc8a9d0
  Device 6: Quadro RTX 6000, compute capability 7.5, VMM: yes, ID: GPU-61a8ad69-4903-8ef4-d663-ad91c49fc24e
  Device 7: Quadro RTX 6000, compute capability 7.5, VMM: yes, ID: GPU-cc5bbee3-57ee-b847-ba4e-1f3847a4325e
load_backend: loaded CUDA backend from C:\Users\its\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13\ggml-cuda.dll
time=2026-03-24T11:31:21.082+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 CUDA.3.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.3.USE_GRAPHS=1 CUDA.3.PEER_MAX_BATCH_SIZE=128 CUDA.4.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.4.USE_GRAPHS=1 CUDA.4.PEER_MAX_BATCH_SIZE=128 CUDA.5.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.5.USE_GRAPHS=1 CUDA.5.PEER_MAX_BATCH_SIZE=128 CUDA.6.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.6.USE_GRAPHS=1 CUDA.6.PEER_MAX_BATCH_SIZE=128 CUDA.7.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.7.USE_GRAPHS=1 CUDA.7.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2026-03-24T11:31:21.085+08:00 level=INFO source=runner.go:1001 msg="Server listening on 127.0.0.1:62697"
time=2026-03-24T11:31:21.089+08:00 level=INFO source=runner.go:895 msg=load request="{Operation:commit LoraPath:[] Parallel:8 BatchSize:512 FlashAttention:Auto KvSize:8192 KvCacheType: NumThreads:28 GPULayers:63[ID:GPU-73d44d75-0d12-df63-c91d-ab76ac0c8b36 Layers:7(0..6) ID:GPU-f7ad384d-30ee-b723-a586-06a5b29b8900 Layers:8(7..14) ID:GPU-8c020d1f-280d-e705-8f69-3a5342688f1a Layers:8(15..22) ID:GPU-36588785-9363-4c15-053d-05548b16e1a1 Layers:8(23..30) ID:GPU-57ab7e6f-39e8-2f43-7071-5eb9bfc8a9d0 Layers:8(31..38) ID:GPU-cc5bbee3-57ee-b847-ba4e-1f3847a4325e Layers:8(39..46) ID:GPU-61a8ad69-4903-8ef4-d663-ad91c49fc24e Layers:8(47..54) ID:GPU-1c1cd6d7-5b20-236b-70b9-78cb46647538 Layers:8(55..62)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-03-24T11:31:21.089+08:00 level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
time=2026-03-24T11:31:21.089+08:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model"
ggml_backend_cuda_device_get_memory device GPU-73d44d75-0d12-df63-c91d-ab76ac0c8b36 utilizing NVML memory reporting free: 24976601088 total: 25769803776
llama_model_load_from_file_impl: using device CUDA0 (Quadro RTX 6000) (0000:04:00.0) - 23819 MiB free
ggml_backend_cuda_device_get_memory device GPU-f7ad384d-30ee-b723-a586-06a5b29b8900 utilizing NVML memory reporting free: 24976601088 total: 25769803776
llama_model_load_from_file_impl: using device CUDA1 (Quadro RTX 6000) (0000:05:00.0) - 23819 MiB free
ggml_backend_cuda_device_get_memory device GPU-8c020d1f-280d-e705-8f69-3a5342688f1a utilizing NVML memory reporting free: 24976601088 total: 25769803776
llama_model_load_from_file_impl: using device CUDA2 (Quadro RTX 6000) (0000:08:00.0) - 23819 MiB free
ggml_backend_cuda_device_get_memory device GPU-36588785-9363-4c15-053d-05548b16e1a1 utilizing NVML memory reporting free: 24976601088 total: 25769803776
llama_model_load_from_file_impl: using device CUDA4 (Quadro RTX 6000) (0000:84:00.0) - 23819 MiB free
ggml_backend_cuda_device_get_memory device GPU-57ab7e6f-39e8-2f43-7071-5eb9bfc8a9d0 utilizing NVML memory reporting free: 24976601088 total: 25769803776
llama_model_load_from_file_impl: using device CUDA5 (Quadro RTX 6000) (0000:85:00.0) - 23819 MiB free
ggml_backend_cuda_device_get_memory device GPU-cc5bbee3-57ee-b847-ba4e-1f3847a4325e utilizing NVML memory reporting free: 24976601088 total: 25769803776
llama_model_load_from_file_impl: using device CUDA7 (Quadro RTX 6000) (0000:89:00.0) - 23819 MiB free
ggml_backend_cuda_device_get_memory device GPU-61a8ad69-4903-8ef4-d663-ad91c49fc24e utilizing NVML memory reporting free: 24959668224 total: 25769803776
llama_model_load_from_file_impl: using device CUDA6 (Quadro RTX 6000) (0000:88:00.0) - 23803 MiB free
ggml_backend_cuda_device_get_memory device GPU-1c1cd6d7-5b20-236b-70b9-78cb46647538 utilizing NVML memory reporting free: 24738156544 total: 25769803776
llama_model_load_from_file_impl: using device CUDA3 (Quadro RTX 6000) (0000:09:00.0) - 23592 MiB free
llama_model_loader: loaded meta data with 53 key-value pairs and 809 tensors from C:\Users\its\.ollama\models\blobs\sha256-d98bf8c3c536de17d554ba4a78919ea45717074fbb7184cdd1f0d3bbdca055bf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = minimax-m2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 40
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.950000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 1.000000
llama_model_loader: - kv   5:                               general.name str              = Minimax-M2.5
llama_model_loader: - kv   6:                           general.basename str              = Minimax-M2.5
llama_model_loader: - kv   7:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   8:                         general.size_label str              = 256x4.9B
llama_model_loader: - kv   9:                            general.license str              = other
llama_model_loader: - kv  10:                       general.license.name str              = modified-mit
llama_model_loader: - kv  11:                       general.license.link str              = https://github.com/MiniMax-AI/MiniMax...
llama_model_loader: - kv  12:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  13:                   general.base_model.count u32              = 1
llama_model_loader: - kv  14:                  general.base_model.0.name str              = MiniMax M2.5
llama_model_loader: - kv  15:          general.base_model.0.organization str              = MiniMaxAI
llama_model_loader: - kv  16:              general.base_model.0.repo_url str              = https://huggingface.co/MiniMaxAI/Mini...
llama_model_loader: - kv  17:                               general.tags arr[str,2]       = ["unsloth", "text-generation"]
llama_model_loader: - kv  18:                     minimax-m2.block_count u32              = 62
llama_model_loader: - kv  19:                  minimax-m2.context_length u32              = 196608
llama_model_loader: - kv  20:                minimax-m2.embedding_length u32              = 3072
llama_model_loader: - kv  21:             minimax-m2.feed_forward_length u32              = 1536
llama_model_loader: - kv  22:            minimax-m2.attention.head_count u32              = 48
llama_model_loader: - kv  23:         minimax-m2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  24:                  minimax-m2.rope.freq_base f32              = 5000000.000000
llama_model_loader: - kv  25: minimax-m2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  26:                    minimax-m2.expert_count u32              = 256
llama_model_loader: - kv  27:               minimax-m2.expert_used_count u32              = 8
llama_model_loader: - kv  28:              minimax-m2.expert_gating_func u32              = 2
llama_model_loader: - kv  29:            minimax-m2.attention.key_length u32              = 128
llama_model_loader: - kv  30:          minimax-m2.attention.value_length u32              = 128
llama_model_loader: - kv  31:      minimax-m2.expert_feed_forward_length u32              = 1536
llama_model_loader: - kv  32:            minimax-m2.rope.dimension_count u32              = 64
llama_model_loader: - kv  33:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  34:                         tokenizer.ggml.pre str              = minimax-m2
llama_model_loader: - kv  35:                      tokenizer.ggml.tokens arr[str,200064]  = ["Ā", "ā", "Ă", "ă", "Ą", "ą", ...
llama_model_loader: - kv  36:                  tokenizer.ggml.token_type arr[i32,200064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  37:                      tokenizer.ggml.merges arr[str,199744]  = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "e r...
llama_model_loader: - kv  38:                tokenizer.ggml.bos_token_id u32              = 200034
llama_model_loader: - kv  39:                tokenizer.ggml.eos_token_id u32              = 200020
llama_model_loader: - kv  40:            tokenizer.ggml.unknown_token_id u32              = 200021
llama_model_loader: - kv  41:            tokenizer.ggml.padding_token_id u32              = 200004
llama_model_loader: - kv  42:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  43:                    tokenizer.chat_template str              = {# Unsloth template fixes #}\n{# -----...
llama_model_loader: - kv  44:               general.quantization_version u32              = 2
llama_model_loader: - kv  45:                          general.file_type u32              = 12
llama_model_loader: - kv  46:                      quantize.imatrix.file str              = MiniMax-M2.5-GGUF/imatrix_unsloth.gguf
llama_model_loader: - kv  47:                   quantize.imatrix.dataset str              = unsloth_calibration_MiniMax-M2.5.txt
llama_model_loader: - kv  48:             quantize.imatrix.entries_count u32              = 496
llama_model_loader: - kv  49:              quantize.imatrix.chunks_count u32              = 81
llama_model_loader: - kv  50:                                   split.no u16              = 0
llama_model_loader: - kv  51:                        split.tensors.count i32              = 809
llama_model_loader: - kv  52:                                split.count u16              = 0
llama_model_loader: - type  f32:  373 tensors
llama_model_loader: - type q3_K:  173 tensors
llama_model_loader: - type q4_K:  232 tensors
llama_model_loader: - type q5_K:   20 tensors
llama_model_loader: - type q6_K:   11 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q3_K - Medium
print_info: file size   = 94.33 GiB (3.54 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 200004 ('<fim_pad>')
load:   - 200005 ('<reponame>')
load:   - 200020 ('[e~[')
load: special tokens cache size = 54
load: token to piece cache size = 1.3355 MB
print_info: arch             = minimax-m2
print_info: vocab_only       = 0
print_info: no_alloc         = 0
print_info: n_ctx_train      = 196608
print_info: n_embd           = 3072
print_info: n_embd_inp       = 3072
print_info: n_layer          = 62
print_info: n_head           = 48
print_info: n_head_kv        = 8
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 6
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 1536
print_info: n_expert         = 256
print_info: n_expert_used    = 8
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 196608
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned   = unknown
print_info: model type       = 230B.A10B
print_info: model params     = 228.69 B
print_info: general.name     = Minimax-M2.5
print_info: vocab type       = BPE
print_info: n_vocab          = 200064
print_info: n_merges         = 199744
print_info: BOS token        = 200034 ']~!b['
print_info: EOS token        = 200020 '[e~['
print_info: UNK token        = 200021 ']!d~['
print_info: PAD token        = 200004 '<fim_pad>'
print_info: LF token         = 10 'Ċ'
print_info: FIM PRE token    = 200001 '<fim_prefix>'
print_info: FIM SUF token    = 200003 '<fim_suffix>'
print_info: FIM MID token    = 200002 '<fim_middle>'
print_info: FIM PAD token    = 200004 '<fim_pad>'
print_info: FIM REP token    = 200005 '<reponame>'
print_info: EOG token        = 200004 '<fim_pad>'
print_info: EOG token        = 200005 '<reponame>'
print_info: EOG token        = 200020 '[e~['
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 62 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 63/63 layers to GPU
load_tensors:          CPU model buffer size =   329.70 MiB
load_tensors:        CUDA0 model buffer size = 11054.66 MiB
load_tensors:        CUDA1 model buffer size = 12107.34 MiB
load_tensors:        CUDA2 model buffer size = 12093.41 MiB
load_tensors:        CUDA3 model buffer size = 11536.70 MiB
load_tensors:        CUDA4 model buffer size = 12552.41 MiB
load_tensors:        CUDA5 model buffer size = 12251.66 MiB
load_tensors:        CUDA6 model buffer size = 12260.34 MiB
load_tensors:        CUDA7 model buffer size = 12409.91 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 8
llama_context: n_ctx         = 8192
llama_context: n_ctx_seq     = 1024
llama_context: n_batch       = 4096
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 5000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (1024) < n_ctx_train (196608) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     6.20 MiB
llama_kv_cache:      CUDA0 KV buffer size =   224.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =   256.00 MiB
llama_kv_cache:      CUDA2 KV buffer size =   256.00 MiB
llama_kv_cache:      CUDA3 KV buffer size =   224.00 MiB
llama_kv_cache:      CUDA4 KV buffer size =   256.00 MiB
llama_kv_cache:      CUDA5 KV buffer size =   256.00 MiB
llama_kv_cache:      CUDA6 KV buffer size =   256.00 MiB
llama_kv_cache:      CUDA7 KV buffer size =   256.00 MiB
llama_kv_cache: size = 1984.00 MiB (  1024 cells,  62 layers,  8/8 seqs), K (f16):  992.00 MiB, V (f16):  992.00 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context: Flash Attention was auto, set to enabled
CUDA error: out of memory
  current device: 6, in function alloc at C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:576
  cuMemAddressReserve(&pool_addr, CUDA_POOL_VMM_MAX_SIZE, 0, 0, 0)
C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:94: CUDA error
time=2026-03-24T11:32:15.465+08:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server not responding"
time=2026-03-24T11:32:17.522+08:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server error"
time=2026-03-24T11:32:17.615+08:00 level=ERROR source=server.go:303 msg="llama runner terminated" error="exit status 1"
time=2026-03-24T11:32:17.772+08:00 level=INFO source=sched.go:511 msg="Load failed" model=C:\Users\its\.ollama\models\blobs\sha256-d98bf8c3c536de17d554ba4a78919ea45717074fbb7184cdd1f0d3bbdca055bf error="llama runner process has terminated: CUDA error"
[GIN] 2026/03/24 - 11:32:17 | 500 |          1m2s |       127.0.0.1 | POST     "/api/generate"

What is the issue?

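When running a model (MiniMax-M2.5-UD-Q3_K_XL.gguf) on a Windows machine with 8x NVIDIA Quadro RTX 6000 GPUs (24 GiB each), Ollama crashes with "CUDA error: out of memory" right after the loading phase. The failing call is cuMemAddressReserve on device 6, and it is reached only after the model weights and KV cache buffers have already been placed on all eight GPUs.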
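
As a rough sanity check on the "out of memory" message, the per-GPU figures that Ollama logs can be added up. The sketch below uses only numbers copied from the log on this page: the model buffer and KV buffer per device, the 1.9 GiB compute graph estimate, and the scheduler's "available" value per GPU. The variable and function names are illustrative. By Ollama's own estimate every card is left with several GiB of headroom, which fits the fact that the failure is in cuMemAddressReserve, a call that reserves a virtual address range rather than allocating physical VRAM. The arithmetic does not identify the root cause by itself.

```python
# Per-GPU figures copied from the log on this page.
# weights_mib / kv_mib: "load_tensors" and "llama_kv_cache" lines (MiB).
# available_gib: sched.go "gpu memory" lines, matched to CUDA names by GPU id (GiB).
MIB_PER_GIB = 1024.0
COMPUTE_GRAPH_GIB = 1.9  # device.go "compute graph" estimate, identical on every GPU

devices = {
    "CUDA0": {"weights_mib": 11054.66, "kv_mib": 224.0, "available_gib": 23.0},
    "CUDA1": {"weights_mib": 12107.34, "kv_mib": 256.0, "available_gib": 23.0},
    "CUDA2": {"weights_mib": 12093.41, "kv_mib": 256.0, "available_gib": 23.0},
    "CUDA3": {"weights_mib": 11536.70, "kv_mib": 224.0, "available_gib": 22.7},
    "CUDA4": {"weights_mib": 12552.41, "kv_mib": 256.0, "available_gib": 23.0},
    "CUDA5": {"weights_mib": 12251.66, "kv_mib": 256.0, "available_gib": 23.0},
    "CUDA6": {"weights_mib": 12260.34, "kv_mib": 256.0, "available_gib": 22.9},
    "CUDA7": {"weights_mib": 12409.91, "kv_mib": 256.0, "available_gib": 23.0},
}


def planned_gib(dev):
    """Weights + KV cache + compute graph estimate for one GPU, in GiB."""
    return (dev["weights_mib"] + dev["kv_mib"]) / MIB_PER_GIB + COMPUTE_GRAPH_GIB


# Headroom = what the scheduler says is available minus what it plans to place there.
headroom = {name: dev["available_gib"] - planned_gib(dev) for name, dev in devices.items()}

for name in sorted(devices):
    print(f"{name}: planned {planned_gib(devices[name]):5.2f} GiB, "
          f"headroom {headroom[name]:5.2f} GiB")
```

Device 6, where the error is reported, is planned at roughly 14 GiB against 22.9 GiB available, so the logged allocations alone do not appear to exhaust its VRAM.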

Modelfile

FROM ./MiniMax-M2.5-UD-Q3_K_XL.gguf
PARAMETER num_ctx 1024
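
The Modelfile's num_ctx 1024 and the OLLAMA_NUM_PARALLEL:8 setting in the server config together account for the n_ctx = 8192 and the 1984 MiB KV cache reported in the log. A back-of-the-envelope check follows. It assumes an f16 cache (2 bytes per element), as the llama_kv_cache line states, and takes n_layer, n_embd_k_gqa and n_embd_v_gqa from the print_info output. The variable names are illustrative.

```python
# Values taken from the Modelfile, the server config, and the print_info lines.
num_ctx = 1024        # PARAMETER num_ctx (per-sequence context, n_ctx_seq)
num_parallel = 8      # OLLAMA_NUM_PARALLEL (n_seq_max)
n_layer = 62          # print_info: n_layer
n_embd_k_gqa = 1024   # print_info: n_embd_k_gqa
n_embd_v_gqa = 1024   # print_info: n_embd_v_gqa
BYTES_F16 = 2         # cache type f16, per the llama_kv_cache line

# Total context across all parallel sequences (logged as n_ctx).
n_ctx_total = num_ctx * num_parallel

# One K row and one V row per cell, per layer.
k_mib = n_ctx_total * n_layer * n_embd_k_gqa * BYTES_F16 / 2**20
v_mib = n_ctx_total * n_layer * n_embd_v_gqa * BYTES_F16 / 2**20
kv_total_mib = k_mib + v_mib

print(f"n_ctx = {n_ctx_total}")
print(f"K = {k_mib:.2f} MiB, V = {v_mib:.2f} MiB, total = {kv_total_mib:.2f} MiB")
```

The result matches the logged K (f16): 992.00 MiB, V (f16): 992.00 MiB and size = 1984.00 MiB. At under 2 GiB spread over eight GPUs, the KV cache seems unlikely to be what exhausts memory here.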

Relevant log output

PS C:\Users\its> ollama serve
time=2026-03-24T11:31:02.892+08:00 level=INFO source=routes.go:1727 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0, 1, 2, 3, 4, 5, 6, 7 GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:0 OLLAMA_DEBUG:INFO OLLAMA_EDITOR: OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\its\\.ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NO_CLOUD:false OLLAMA_NUM_PARALLEL:8 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:true OLLAMA_VULKAN:false ROCR_VISIBLE_DEVICES:]"
time=2026-03-24T11:31:02.922+08:00 level=INFO source=routes.go:1729 msg="Ollama cloud disabled: false"
time=2026-03-24T11:31:02.934+08:00 level=INFO source=images.go:477 msg="total blobs: 25"
time=2026-03-24T11:31:02.939+08:00 level=INFO source=images.go:484 msg="total unused blobs removed: 0"
time=2026-03-24T11:31:02.942+08:00 level=INFO source=routes.go:1782 msg="Listening on [::]:11434 (version 0.18.2)"
time=2026-03-24T11:31:02.944+08:00 level=INFO source=runner.go:67 msg="discovering available GPUs..."
time=2026-03-24T11:31:02.981+08:00 level=WARN source=runner.go:485 msg="user overrode visible devices" CUDA_VISIBLE_DEVICES="0, 1, 2, 3, 4, 5, 6, 7"
time=2026-03-24T11:31:02.981+08:00 level=WARN source=runner.go:489 msg="if GPUs are not correctly discovered, unset and try again"
time=2026-03-24T11:31:03.001+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58225"
time=2026-03-24T11:31:05.977+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58369"
time=2026-03-24T11:31:08.823+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58526"
time=2026-03-24T11:31:11.200+08:00 level=INFO source=runner.go:106 msg="experimental Vulkan support disabled.  To enable, set OLLAMA_VULKAN=1"
time=2026-03-24T11:31:11.205+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58761"
time=2026-03-24T11:31:11.207+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58763"
time=2026-03-24T11:31:11.207+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58762"
time=2026-03-24T11:31:11.207+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58764"
time=2026-03-24T11:31:11.209+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58767"
time=2026-03-24T11:31:11.208+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58765"
time=2026-03-24T11:31:11.209+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58766"
time=2026-03-24T11:31:11.209+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58769"
time=2026-03-24T11:31:11.209+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58768"
time=2026-03-24T11:31:11.209+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58771"
time=2026-03-24T11:31:11.210+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58770"
time=2026-03-24T11:31:11.210+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58772"
time=2026-03-24T11:31:11.211+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58773"
time=2026-03-24T11:31:11.211+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58774"
time=2026-03-24T11:31:11.211+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58775"
time=2026-03-24T11:31:11.212+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58776"
time=2026-03-24T11:31:14.490+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-73d44d75-0d12-df63-c91d-ab76ac0c8b36 filter_id="" library=CUDA compute=7.5 name=CUDA0 description="Quadro RTX 6000" libdirs=ollama,cuda_v13 driver=13.2 pci_id=0000:04:00.0 type=discrete total="24.0 GiB" available="23.4 GiB"
time=2026-03-24T11:31:14.490+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-f7ad384d-30ee-b723-a586-06a5b29b8900 filter_id="" library=CUDA compute=7.5 name=CUDA1 description="Quadro RTX 6000" libdirs=ollama,cuda_v13 driver=13.2 pci_id=0000:05:00.0 type=discrete total="24.0 GiB" available="23.4 GiB"
time=2026-03-24T11:31:14.491+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-8c020d1f-280d-e705-8f69-3a5342688f1a filter_id="" library=CUDA compute=7.5 name=CUDA2 description="Quadro RTX 6000" libdirs=ollama,cuda_v13 driver=13.2 pci_id=0000:08:00.0 type=discrete total="24.0 GiB" available="23.4 GiB"
time=2026-03-24T11:31:14.491+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-36588785-9363-4c15-053d-05548b16e1a1 filter_id="" library=CUDA compute=7.5 name=CUDA4 description="Quadro RTX 6000" libdirs=ollama,cuda_v13 driver=13.2 pci_id=0000:84:00.0 type=discrete total="24.0 GiB" available="23.4 GiB"
time=2026-03-24T11:31:14.491+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-57ab7e6f-39e8-2f43-7071-5eb9bfc8a9d0 filter_id="" library=CUDA compute=7.5 name=CUDA5 description="Quadro RTX 6000" libdirs=ollama,cuda_v13 driver=13.2 pci_id=0000:85:00.0 type=discrete total="24.0 GiB" available="23.4 GiB"
time=2026-03-24T11:31:14.491+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-cc5bbee3-57ee-b847-ba4e-1f3847a4325e filter_id="" library=CUDA compute=7.5 name=CUDA7 description="Quadro RTX 6000" libdirs=ollama,cuda_v13 driver=13.2 pci_id=0000:89:00.0 type=discrete total="24.0 GiB" available="23.4 GiB"
time=2026-03-24T11:31:14.491+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-61a8ad69-4903-8ef4-d663-ad91c49fc24e filter_id="" library=CUDA compute=7.5 name=CUDA6 description="Quadro RTX 6000" libdirs=ollama,cuda_v13 driver=13.2 pci_id=0000:88:00.0 type=discrete total="24.0 GiB" available="23.4 GiB"
time=2026-03-24T11:31:14.491+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-1c1cd6d7-5b20-236b-70b9-78cb46647538 filter_id="" library=CUDA compute=7.5 name=CUDA3 description="Quadro RTX 6000" libdirs=ollama,cuda_v13 driver=13.2 pci_id=0000:09:00.0 type=discrete total="24.0 GiB" available="23.2 GiB"
time=2026-03-24T11:31:14.491+08:00 level=INFO source=routes.go:1832 msg="vram-based default context" total_vram="192.0 GiB" default_num_ctx=262144
[GIN] 2026/03/24 - 11:31:14 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2026/03/24 - 11:31:14 | 200 |    345.3334ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2026/03/24 - 11:31:15 | 200 |     333.648ms |       127.0.0.1 | POST     "/api/show"
time=2026-03-24T11:31:15.588+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 62523"
time=2026-03-24T11:31:18.579+08:00 level=INFO source=runner.go:464 msg="failure during GPU discovery" OLLAMA_LIBRARY_PATH="[C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\lib\\ollama C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v13]" extra_envs=map[] error="failed to finish discovery before timeout"
time=2026-03-24T11:31:18.581+08:00 level=WARN source=runner.go:356 msg="unable to refresh free memory, using old values"
time=2026-03-24T11:31:18.582+08:00 level=INFO source=cpu_windows.go:148 msg=packages count=2
time=2026-03-24T11:31:18.582+08:00 level=INFO source=cpu_windows.go:195 msg="" package=0 cores=14 efficiency=0 threads=28
time=2026-03-24T11:31:18.582+08:00 level=INFO source=cpu_windows.go:195 msg="" package=1 cores=14 efficiency=0 threads=28
llama_model_loader: loaded meta data with 53 key-value pairs and 809 tensors from C:\Users\its\.ollama\models\blobs\sha256-d98bf8c3c536de17d554ba4a78919ea45717074fbb7184cdd1f0d3bbdca055bf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = minimax-m2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 40
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.950000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 1.000000
llama_model_loader: - kv   5:                               general.name str              = Minimax-M2.5
llama_model_loader: - kv   6:                           general.basename str              = Minimax-M2.5
llama_model_loader: - kv   7:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   8:                         general.size_label str              = 256x4.9B
llama_model_loader: - kv   9:                            general.license str              = other
llama_model_loader: - kv  10:                       general.license.name str              = modified-mit
llama_model_loader: - kv  11:                       general.license.link str              = https://github.com/MiniMax-AI/MiniMax...
llama_model_loader: - kv  12:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  13:                   general.base_model.count u32              = 1
llama_model_loader: - kv  14:                  general.base_model.0.name str              = MiniMax M2.5
llama_model_loader: - kv  15:          general.base_model.0.organization str              = MiniMaxAI
llama_model_loader: - kv  16:              general.base_model.0.repo_url str              = https://huggingface.co/MiniMaxAI/Mini...
llama_model_loader: - kv  17:                               general.tags arr[str,2]       = ["unsloth", "text-generation"]
llama_model_loader: - kv  18:                     minimax-m2.block_count u32              = 62
llama_model_loader: - kv  19:                  minimax-m2.context_length u32              = 196608
llama_model_loader: - kv  20:                minimax-m2.embedding_length u32              = 3072
llama_model_loader: - kv  21:             minimax-m2.feed_forward_length u32              = 1536
llama_model_loader: - kv  22:            minimax-m2.attention.head_count u32              = 48
llama_model_loader: - kv  23:         minimax-m2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  24:                  minimax-m2.rope.freq_base f32              = 5000000.000000
llama_model_loader: - kv  25: minimax-m2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  26:                    minimax-m2.expert_count u32              = 256
llama_model_loader: - kv  27:               minimax-m2.expert_used_count u32              = 8
llama_model_loader: - kv  28:              minimax-m2.expert_gating_func u32              = 2
llama_model_loader: - kv  29:            minimax-m2.attention.key_length u32              = 128
llama_model_loader: - kv  30:          minimax-m2.attention.value_length u32              = 128
llama_model_loader: - kv  31:      minimax-m2.expert_feed_forward_length u32              = 1536
llama_model_loader: - kv  32:            minimax-m2.rope.dimension_count u32              = 64
llama_model_loader: - kv  33:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  34:                         tokenizer.ggml.pre str              = minimax-m2
llama_model_loader: - kv  35:                      tokenizer.ggml.tokens arr[str,200064]  = ["Ā", "ā", "Ă", "ă", "Ą", "ą", ...
llama_model_loader: - kv  36:                  tokenizer.ggml.token_type arr[i32,200064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  37:                      tokenizer.ggml.merges arr[str,199744]  = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "e r...
llama_model_loader: - kv  38:                tokenizer.ggml.bos_token_id u32              = 200034
llama_model_loader: - kv  39:                tokenizer.ggml.eos_token_id u32              = 200020
llama_model_loader: - kv  40:            tokenizer.ggml.unknown_token_id u32              = 200021
llama_model_loader: - kv  41:            tokenizer.ggml.padding_token_id u32              = 200004
llama_model_loader: - kv  42:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  43:                    tokenizer.chat_template str              = {# Unsloth template fixes #}\n{# -----...
llama_model_loader: - kv  44:               general.quantization_version u32              = 2
llama_model_loader: - kv  45:                          general.file_type u32              = 12
llama_model_loader: - kv  46:                      quantize.imatrix.file str              = MiniMax-M2.5-GGUF/imatrix_unsloth.gguf
llama_model_loader: - kv  47:                   quantize.imatrix.dataset str              = unsloth_calibration_MiniMax-M2.5.txt
llama_model_loader: - kv  48:             quantize.imatrix.entries_count u32              = 496
llama_model_loader: - kv  49:              quantize.imatrix.chunks_count u32              = 81
llama_model_loader: - kv  50:                                   split.no u16              = 0
llama_model_loader: - kv  51:                        split.tensors.count i32              = 809
llama_model_loader: - kv  52:                                split.count u16              = 0
llama_model_loader: - type  f32:  373 tensors
llama_model_loader: - type q3_K:  173 tensors
llama_model_loader: - type q4_K:  232 tensors
llama_model_loader: - type q5_K:   20 tensors
llama_model_loader: - type q6_K:   11 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q3_K - Medium
print_info: file size   = 94.33 GiB (3.54 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 200004 ('<fim_pad>')
load:   - 200005 ('<reponame>')
load:   - 200020 ('[e~[')
load: special tokens cache size = 54
load: token to piece cache size = 1.3355 MB
print_info: arch             = minimax-m2
print_info: vocab_only       = 1
print_info: no_alloc         = 0
print_info: model type       = ?B
print_info: model params     = 228.69 B
print_info: general.name     = Minimax-M2.5
print_info: vocab type       = BPE
print_info: n_vocab          = 200064
print_info: n_merges         = 199744
print_info: BOS token        = 200034 ']~!b['
print_info: EOS token        = 200020 '[e~['
print_info: UNK token        = 200021 ']!d~['
print_info: PAD token        = 200004 '<fim_pad>'
print_info: LF token         = 10 'Ċ'
print_info: FIM PRE token    = 200001 '<fim_prefix>'
print_info: FIM SUF token    = 200003 '<fim_suffix>'
print_info: FIM MID token    = 200002 '<fim_middle>'
print_info: FIM PAD token    = 200004 '<fim_pad>'
print_info: FIM REP token    = 200005 '<reponame>'
print_info: EOG token        = 200004 '<fim_pad>'
print_info: EOG token        = 200005 '<reponame>'
print_info: EOG token        = 200020 '[e~['
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2026-03-24T11:31:19.413+08:00 level=INFO source=server.go:430 msg="starting runner" cmd="C:\\Users\\its\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\its\\.ollama\\models\\blobs\\sha256-d98bf8c3c536de17d554ba4a78919ea45717074fbb7184cdd1f0d3bbdca055bf --port 62697"
time=2026-03-24T11:31:19.451+08:00 level=INFO source=sched.go:484 msg="system memory" total="255.9 GiB" free="234.9 GiB" free_swap="238.6 GiB"
time=2026-03-24T11:31:19.451+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-73d44d75-0d12-df63-c91d-ab76ac0c8b36 library=CUDA available="23.0 GiB" free="23.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-03-24T11:31:19.451+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-f7ad384d-30ee-b723-a586-06a5b29b8900 library=CUDA available="23.0 GiB" free="23.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-03-24T11:31:19.452+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-8c020d1f-280d-e705-8f69-3a5342688f1a library=CUDA available="23.0 GiB" free="23.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-03-24T11:31:19.452+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-1c1cd6d7-5b20-236b-70b9-78cb46647538 library=CUDA available="22.7 GiB" free="23.2 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-03-24T11:31:19.452+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-36588785-9363-4c15-053d-05548b16e1a1 library=CUDA available="23.0 GiB" free="23.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-03-24T11:31:19.452+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-57ab7e6f-39e8-2f43-7071-5eb9bfc8a9d0 library=CUDA available="23.0 GiB" free="23.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-03-24T11:31:19.452+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-61a8ad69-4903-8ef4-d663-ad91c49fc24e library=CUDA available="22.9 GiB" free="23.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-03-24T11:31:19.452+08:00 level=INFO source=sched.go:491 msg="gpu memory" id=GPU-cc5bbee3-57ee-b847-ba4e-1f3847a4325e library=CUDA available="23.0 GiB" free="23.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-03-24T11:31:19.452+08:00 level=INFO source=server.go:497 msg="loading model" "model layers"=63 requested=-1
time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="10.8 GiB"
time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA1 size="11.8 GiB"
time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA2 size="11.8 GiB"
time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA3 size="11.3 GiB"
time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA4 size="12.3 GiB"
time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA5 size="12.0 GiB"
time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA6 size="12.0 GiB"
time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA7 size="12.1 GiB"
time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="224.0 MiB"
time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA1 size="256.0 MiB"
time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA2 size="256.0 MiB"
time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA3 size="224.0 MiB"
time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA4 size="256.0 MiB"
time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA5 size="256.0 MiB"
time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA6 size="256.0 MiB"
time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA7 size="256.0 MiB"
time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="1.9 GiB"
time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA1 size="1.9 GiB"
time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA2 size="1.9 GiB"
time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA3 size="1.9 GiB"
time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA4 size="1.9 GiB"
time=2026-03-24T11:31:19.457+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA5 size="1.9 GiB"
time=2026-03-24T11:31:19.457+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA6 size="1.9 GiB"
time=2026-03-24T11:31:19.457+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA7 size="1.9 GiB"
time=2026-03-24T11:31:19.457+08:00 level=INFO source=device.go:272 msg="total memory" size="111.4 GiB"
time=2026-03-24T11:31:20.810+08:00 level=INFO source=runner.go:965 msg="starting go runner"
load_backend: loaded CPU backend from C:\Users\its\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
  Device 0: Quadro RTX 6000, compute capability 7.5, VMM: yes, ID: GPU-73d44d75-0d12-df63-c91d-ab76ac0c8b36
  Device 1: Quadro RTX 6000, compute capability 7.5, VMM: yes, ID: GPU-f7ad384d-30ee-b723-a586-06a5b29b8900
  Device 2: Quadro RTX 6000, compute capability 7.5, VMM: yes, ID: GPU-8c020d1f-280d-e705-8f69-3a5342688f1a
  Device 3: Quadro RTX 6000, compute capability 7.5, VMM: yes, ID: GPU-1c1cd6d7-5b20-236b-70b9-78cb46647538
  Device 4: Quadro RTX 6000, compute capability 7.5, VMM: yes, ID: GPU-36588785-9363-4c15-053d-05548b16e1a1
  Device 5: Quadro RTX 6000, compute capability 7.5, VMM: yes, ID: GPU-57ab7e6f-39e8-2f43-7071-5eb9bfc8a9d0
  Device 6: Quadro RTX 6000, compute capability 7.5, VMM: yes, ID: GPU-61a8ad69-4903-8ef4-d663-ad91c49fc24e
  Device 7: Quadro RTX 6000, compute capability 7.5, VMM: yes, ID: GPU-cc5bbee3-57ee-b847-ba4e-1f3847a4325e
load_backend: loaded CUDA backend from C:\Users\its\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13\ggml-cuda.dll
time=2026-03-24T11:31:21.082+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 CUDA.3.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.3.USE_GRAPHS=1 CUDA.3.PEER_MAX_BATCH_SIZE=128 CUDA.4.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.4.USE_GRAPHS=1 CUDA.4.PEER_MAX_BATCH_SIZE=128 CUDA.5.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.5.USE_GRAPHS=1 CUDA.5.PEER_MAX_BATCH_SIZE=128 CUDA.6.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.6.USE_GRAPHS=1 CUDA.6.PEER_MAX_BATCH_SIZE=128 CUDA.7.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.7.USE_GRAPHS=1 CUDA.7.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2026-03-24T11:31:21.085+08:00 level=INFO source=runner.go:1001 msg="Server listening on 127.0.0.1:62697"
time=2026-03-24T11:31:21.089+08:00 level=INFO source=runner.go:895 msg=load request="{Operation:commit LoraPath:[] Parallel:8 BatchSize:512 FlashAttention:Auto KvSize:8192 KvCacheType: NumThreads:28 GPULayers:63[ID:GPU-73d44d75-0d12-df63-c91d-ab76ac0c8b36 Layers:7(0..6) ID:GPU-f7ad384d-30ee-b723-a586-06a5b29b8900 Layers:8(7..14) ID:GPU-8c020d1f-280d-e705-8f69-3a5342688f1a Layers:8(15..22) ID:GPU-36588785-9363-4c15-053d-05548b16e1a1 Layers:8(23..30) ID:GPU-57ab7e6f-39e8-2f43-7071-5eb9bfc8a9d0 Layers:8(31..38) ID:GPU-cc5bbee3-57ee-b847-ba4e-1f3847a4325e Layers:8(39..46) ID:GPU-61a8ad69-4903-8ef4-d663-ad91c49fc24e Layers:8(47..54) ID:GPU-1c1cd6d7-5b20-236b-70b9-78cb46647538 Layers:8(55..62)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-03-24T11:31:21.089+08:00 level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
time=2026-03-24T11:31:21.089+08:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model"
ggml_backend_cuda_device_get_memory device GPU-73d44d75-0d12-df63-c91d-ab76ac0c8b36 utilizing NVML memory reporting free: 24976601088 total: 25769803776
llama_model_load_from_file_impl: using device CUDA0 (Quadro RTX 6000) (0000:04:00.0) - 23819 MiB free
ggml_backend_cuda_device_get_memory device GPU-f7ad384d-30ee-b723-a586-06a5b29b8900 utilizing NVML memory reporting free: 24976601088 total: 25769803776
llama_model_load_from_file_impl: using device CUDA1 (Quadro RTX 6000) (0000:05:00.0) - 23819 MiB free
ggml_backend_cuda_device_get_memory device GPU-8c020d1f-280d-e705-8f69-3a5342688f1a utilizing NVML memory reporting free: 24976601088 total: 25769803776
llama_model_load_from_file_impl: using device CUDA2 (Quadro RTX 6000) (0000:08:00.0) - 23819 MiB free
ggml_backend_cuda_device_get_memory device GPU-36588785-9363-4c15-053d-05548b16e1a1 utilizing NVML memory reporting free: 24976601088 total: 25769803776
llama_model_load_from_file_impl: using device CUDA4 (Quadro RTX 6000) (0000:84:00.0) - 23819 MiB free
ggml_backend_cuda_device_get_memory device GPU-57ab7e6f-39e8-2f43-7071-5eb9bfc8a9d0 utilizing NVML memory reporting free: 24976601088 total: 25769803776
llama_model_load_from_file_impl: using device CUDA5 (Quadro RTX 6000) (0000:85:00.0) - 23819 MiB free
ggml_backend_cuda_device_get_memory device GPU-cc5bbee3-57ee-b847-ba4e-1f3847a4325e utilizing NVML memory reporting free: 24976601088 total: 25769803776
llama_model_load_from_file_impl: using device CUDA7 (Quadro RTX 6000) (0000:89:00.0) - 23819 MiB free
ggml_backend_cuda_device_get_memory device GPU-61a8ad69-4903-8ef4-d663-ad91c49fc24e utilizing NVML memory reporting free: 24959668224 total: 25769803776
llama_model_load_from_file_impl: using device CUDA6 (Quadro RTX 6000) (0000:88:00.0) - 23803 MiB free
ggml_backend_cuda_device_get_memory device GPU-1c1cd6d7-5b20-236b-70b9-78cb46647538 utilizing NVML memory reporting free: 24738156544 total: 25769803776
llama_model_load_from_file_impl: using device CUDA3 (Quadro RTX 6000) (0000:09:00.0) - 23592 MiB free
llama_model_loader: loaded meta data with 53 key-value pairs and 809 tensors from C:\Users\its\.ollama\models\blobs\sha256-d98bf8c3c536de17d554ba4a78919ea45717074fbb7184cdd1f0d3bbdca055bf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = minimax-m2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 40
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.950000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 1.000000
llama_model_loader: - kv   5:                               general.name str              = Minimax-M2.5
llama_model_loader: - kv   6:                           general.basename str              = Minimax-M2.5
llama_model_loader: - kv   7:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   8:                         general.size_label str              = 256x4.9B
llama_model_loader: - kv   9:                            general.license str              = other
llama_model_loader: - kv  10:                       general.license.name str              = modified-mit
llama_model_loader: - kv  11:                       general.license.link str              = https://github.com/MiniMax-AI/MiniMax...
llama_model_loader: - kv  12:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  13:                   general.base_model.count u32              = 1
llama_model_loader: - kv  14:                  general.base_model.0.name str              = MiniMax M2.5
llama_model_loader: - kv  15:          general.base_model.0.organization str              = MiniMaxAI
llama_model_loader: - kv  16:              general.base_model.0.repo_url str              = https://huggingface.co/MiniMaxAI/Mini...
llama_model_loader: - kv  17:                               general.tags arr[str,2]       = ["unsloth", "text-generation"]
llama_model_loader: - kv  18:                     minimax-m2.block_count u32              = 62
llama_model_loader: - kv  19:                  minimax-m2.context_length u32              = 196608
llama_model_loader: - kv  20:                minimax-m2.embedding_length u32              = 3072
llama_model_loader: - kv  21:             minimax-m2.feed_forward_length u32              = 1536
llama_model_loader: - kv  22:            minimax-m2.attention.head_count u32              = 48
llama_model_loader: - kv  23:         minimax-m2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  24:                  minimax-m2.rope.freq_base f32              = 5000000.000000
llama_model_loader: - kv  25: minimax-m2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  26:                    minimax-m2.expert_count u32              = 256
llama_model_loader: - kv  27:               minimax-m2.expert_used_count u32              = 8
llama_model_loader: - kv  28:              minimax-m2.expert_gating_func u32              = 2
llama_model_loader: - kv  29:            minimax-m2.attention.key_length u32              = 128
llama_model_loader: - kv  30:          minimax-m2.attention.value_length u32              = 128
llama_model_loader: - kv  31:      minimax-m2.expert_feed_forward_length u32              = 1536
llama_model_loader: - kv  32:            minimax-m2.rope.dimension_count u32              = 64
llama_model_loader: - kv  33:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  34:                         tokenizer.ggml.pre str              = minimax-m2
llama_model_loader: - kv  35:                      tokenizer.ggml.tokens arr[str,200064]  = ["Ā", "ā", "Ă", "ă", "Ą", "ą", ...
llama_model_loader: - kv  36:                  tokenizer.ggml.token_type arr[i32,200064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  37:                      tokenizer.ggml.merges arr[str,199744]  = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "e r...
llama_model_loader: - kv  38:                tokenizer.ggml.bos_token_id u32              = 200034
llama_model_loader: - kv  39:                tokenizer.ggml.eos_token_id u32              = 200020
llama_model_loader: - kv  40:            tokenizer.ggml.unknown_token_id u32              = 200021
llama_model_loader: - kv  41:            tokenizer.ggml.padding_token_id u32              = 200004
llama_model_loader: - kv  42:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  43:                    tokenizer.chat_template str              = {# Unsloth template fixes #}\n{# -----...
llama_model_loader: - kv  44:               general.quantization_version u32              = 2
llama_model_loader: - kv  45:                          general.file_type u32              = 12
llama_model_loader: - kv  46:                      quantize.imatrix.file str              = MiniMax-M2.5-GGUF/imatrix_unsloth.gguf
llama_model_loader: - kv  47:                   quantize.imatrix.dataset str              = unsloth_calibration_MiniMax-M2.5.txt
llama_model_loader: - kv  48:             quantize.imatrix.entries_count u32              = 496
llama_model_loader: - kv  49:              quantize.imatrix.chunks_count u32              = 81
llama_model_loader: - kv  50:                                   split.no u16              = 0
llama_model_loader: - kv  51:                        split.tensors.count i32              = 809
llama_model_loader: - kv  52:                                split.count u16              = 0
llama_model_loader: - type  f32:  373 tensors
llama_model_loader: - type q3_K:  173 tensors
llama_model_loader: - type q4_K:  232 tensors
llama_model_loader: - type q5_K:   20 tensors
llama_model_loader: - type q6_K:   11 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q3_K - Medium
print_info: file size   = 94.33 GiB (3.54 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 200004 ('<fim_pad>')
load:   - 200005 ('<reponame>')
load:   - 200020 ('[e~[')
load: special tokens cache size = 54
load: token to piece cache size = 1.3355 MB
print_info: arch             = minimax-m2
print_info: vocab_only       = 0
print_info: no_alloc         = 0
print_info: n_ctx_train      = 196608
print_info: n_embd           = 3072
print_info: n_embd_inp       = 3072
print_info: n_layer          = 62
print_info: n_head           = 48
print_info: n_head_kv        = 8
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 6
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 1536
print_info: n_expert         = 256
print_info: n_expert_used    = 8
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 196608
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned   = unknown
print_info: model type       = 230B.A10B
print_info: model params     = 228.69 B
print_info: general.name     = Minimax-M2.5
print_info: vocab type       = BPE
print_info: n_vocab          = 200064
print_info: n_merges         = 199744
print_info: BOS token        = 200034 ']~!b['
print_info: EOS token        = 200020 '[e~['
print_info: UNK token        = 200021 ']!d~['
print_info: PAD token        = 200004 '<fim_pad>'
print_info: LF token         = 10 'Ċ'
print_info: FIM PRE token    = 200001 '<fim_prefix>'
print_info: FIM SUF token    = 200003 '<fim_suffix>'
print_info: FIM MID token    = 200002 '<fim_middle>'
print_info: FIM PAD token    = 200004 '<fim_pad>'
print_info: FIM REP token    = 200005 '<reponame>'
print_info: EOG token        = 200004 '<fim_pad>'
print_info: EOG token        = 200005 '<reponame>'
print_info: EOG token        = 200020 '[e~['
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 62 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 63/63 layers to GPU
load_tensors:          CPU model buffer size =   329.70 MiB
load_tensors:        CUDA0 model buffer size = 11054.66 MiB
load_tensors:        CUDA1 model buffer size = 12107.34 MiB
load_tensors:        CUDA2 model buffer size = 12093.41 MiB
load_tensors:        CUDA3 model buffer size = 11536.70 MiB
load_tensors:        CUDA4 model buffer size = 12552.41 MiB
load_tensors:        CUDA5 model buffer size = 12251.66 MiB
load_tensors:        CUDA6 model buffer size = 12260.34 MiB
load_tensors:        CUDA7 model buffer size = 12409.91 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 8
llama_context: n_ctx         = 8192
llama_context: n_ctx_seq     = 1024
llama_context: n_batch       = 4096
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 5000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (1024) < n_ctx_train (196608) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     6.20 MiB
llama_kv_cache:      CUDA0 KV buffer size =   224.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =   256.00 MiB
llama_kv_cache:      CUDA2 KV buffer size =   256.00 MiB
llama_kv_cache:      CUDA3 KV buffer size =   224.00 MiB
llama_kv_cache:      CUDA4 KV buffer size =   256.00 MiB
llama_kv_cache:      CUDA5 KV buffer size =   256.00 MiB
llama_kv_cache:      CUDA6 KV buffer size =   256.00 MiB
llama_kv_cache:      CUDA7 KV buffer size =   256.00 MiB
llama_kv_cache: size = 1984.00 MiB (  1024 cells,  62 layers,  8/8 seqs), K (f16):  992.00 MiB, V (f16):  992.00 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context: Flash Attention was auto, set to enabled
CUDA error: out of memory
  current device: 6, in function alloc at C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:576
  cuMemAddressReserve(&pool_addr, CUDA_POOL_VMM_MAX_SIZE, 0, 0, 0)
C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:94: CUDA error
time=2026-03-24T11:32:15.465+08:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server not responding"
time=2026-03-24T11:32:17.522+08:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server error"
time=2026-03-24T11:32:17.615+08:00 level=ERROR source=server.go:303 msg="llama runner terminated" error="exit status 1"
time=2026-03-24T11:32:17.772+08:00 level=INFO source=sched.go:511 msg="Load failed" model=C:\Users\its\.ollama\models\blobs\sha256-d98bf8c3c536de17d554ba4a78919ea45717074fbb7184cdd1f0d3bbdca055bf error="llama runner process has terminated: CUDA error"
[GIN] 2026/03/24 - 11:32:17 | 500 |          1m2s |       127.0.0.1 | POST     "/api/generate"

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.18.2

extent analysis

Fix Plan

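To resolve the CUDA out-of-memory error, try reducing memory pressure with the steps below. Note that the failing call in the log is cuMemAddressReserve (a reservation of virtual address space for the CUDA memory pool) on device 6, and it occurs after the model and KV buffers were already allocated, so these steps are things to try rather than a confirmed fix:

Reduce batch size: Lower the batch size to shrink the per-GPU compute buffers. BatchSize in the load request is filled in by the Ollama server; from the client side it corresponds to the num_batch option (512 in this log).
Disable pipeline parallelism: The log reports "pipeline parallelism enabled (n_copies=4)", which increases compute buffer usage. n_copies is reported by llama_context and is not a field of the load request; in upstream llama.cpp it is fixed at build time (GGML_SCHED_MAX_COPIES), so it may not be adjustable in a stock Ollama build.
Reduce KV cache size: KvSize (8192 here, split across 8 parallel sequences) follows from the context length and OLLAMA_NUM_PARALLEL, so lower num_ctx or OLLAMA_NUM_PARALLEL. The KV cache in this log totals only 1984 MiB across all GPUs, so expect a modest saving.
Use a smaller model: If possible, use a pruned or more aggressively quantized version of the model. The file loaded here is Q3_K - Medium at 94.33 GiB.
Update Ollama: The log shows version 0.18.2. Check whether a newer release is available, as updates may include memory-related fixes.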
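
To gauge how much the KV cache step can save, the figure in the log can be reproduced from the model metadata. The sketch below assumes an f16 cache (2 bytes per element) with one K and one V tensor per layer; the function name is illustrative, and the inputs are taken from the llama_kv_cache and print_info lines in the log.

```python
def kv_cache_mib(cells, layers, seqs, n_embd_k_gqa, n_embd_v_gqa, bytes_per_elem=2):
    """Estimate K and V cache sizes in MiB for a per-sequence (non-unified) KV cache."""
    k_bytes = cells * layers * seqs * n_embd_k_gqa * bytes_per_elem
    v_bytes = cells * layers * seqs * n_embd_v_gqa * bytes_per_elem
    mib = 1024 * 1024
    return k_bytes / mib, v_bytes / mib


# Values from the log: 1024 cells, 62 layers, 8 sequences, n_embd_k_gqa = n_embd_v_gqa = 1024
k_mib, v_mib = kv_cache_mib(cells=1024, layers=62, seqs=8,
                            n_embd_k_gqa=1024, n_embd_v_gqa=1024)
print(f"K: {k_mib:.2f} MiB, V: {v_mib:.2f} MiB, total: {k_mib + v_mib:.2f} MiB")
# K: 992.00 MiB, V: 992.00 MiB, total: 1984.00 MiB
```

This matches the "size = 1984.00 MiB" line in the log. Halving the context or the number of parallel sequences would free a bit under 1 GiB in total, which is small next to the roughly 94 GiB of model weights.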

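Example code changes:

# NOTE: these dictionaries mirror fields printed in the log above (the runner's
# load request and llama_context). They illustrate target values only; they are
# not a configuration file that Ollama reads.

# Reduce batch size and KV cache size
load_request = {
    "Operation": "commit",
    "LoraPath": [],
    "Parallel": 8,
    "BatchSize": 256,  # Reduced batch size (512 in the log)
    "FlashAttention": "Auto",
    "KvSize": 4096,  # Reduced KV cache size (8192 in the log)
    "KvCacheType": "",
    "NumThreads": 28,
    "GPULayers": 63,
    # ...
}

# Disable pipeline parallelism (other values are as reported in the log)
llama_context = {
    "n_seq_max": 8,
    "n_ctx": 8192,
    "n_ctx_seq": 1024,
    "n_batch": 4096,
    "n_ubatch": 512,
    "causal_attn": 1,
    "flash_attn": "auto",
    "kv_unified": False,
    "freq_base": 5000000.0,
    "freq_scale": 1,
    "n_copies": 1,  # Target state; the log reports n_copies=4
    # ...
}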
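
The load request printed in the log is assembled by the Ollama server, so a client normally influences it through request options. As a rough sketch, the snippet below builds a POST /api/generate request that asks for a smaller batch and context via the num_batch and num_ctx options. The model name is a placeholder, the helper name is made up for illustration, and the host and port are taken from the log; whether these values avoid the error on this setup is untested.

```python
import json
import urllib.request


def build_generate_request(model, prompt, num_batch=256, num_ctx=4096,
                           host="http://127.0.0.1:11434"):
    """Build (but do not send) a POST /api/generate request with memory-related options."""
    body = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_batch": num_batch, "num_ctx": num_ctx},
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "your-model-name" is a placeholder, not the model from the log.
req = build_generate_request("your-model-name", "Hello")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(req, timeout=600)
```

Building the request separately from sending it makes it easy to inspect the payload before a load attempt that can take a minute to fail.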

Verification

After applying these changes, restart the Ollama server and load the model again. In the log, the runner should get past "llama_context: Flash Attention was auto, set to enabled" without printing "CUDA error: out of memory", and the POST to /api/generate should no longer return 500. If the error persists, reduce memory usage further or try a different model.
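
When reading the logs, it can help to total the scheduler's own per-device estimates and compare them with the "gpu memory" lines. The following is a minimal sketch that assumes the log line format shown above (msg, device, and size fields from device.go); the regular expression and function name are illustrative.

```python
import re
from collections import defaultdict

# Matches the "model weights", "kv cache" and "compute graph" estimate lines.
LINE_RE = re.compile(
    r'msg="(model weights|kv cache|compute graph)" device=(\w+) size="([\d.]+) (MiB|GiB)"'
)


def per_device_gib(log_text):
    """Sum the weights, KV cache and compute graph estimates per device, in GiB."""
    totals = defaultdict(float)
    for _kind, device, value, unit in LINE_RE.findall(log_text):
        totals[device] += float(value) if unit == "GiB" else float(value) / 1024
    return dict(totals)


# Three lines copied from the log for CUDA6.
sample = '''
time=2026-03-24T11:31:19.455+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA6 size="12.0 GiB"
time=2026-03-24T11:31:19.456+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA6 size="256.0 MiB"
time=2026-03-24T11:31:19.457+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA6 size="1.9 GiB"
'''
for device, gib in per_device_gib(sample).items():
    print(f"{device}: {gib:.2f} GiB estimated")
```

For CUDA6 the estimate comes to about 14 GiB, while the matching "gpu memory" line reports 22.9 GiB available. A gap that large suggests the failure is not simply a matter of the planned buffers exceeding VRAM.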

Extra Tips

Keep your NVIDIA GPU drivers up to date so you have the latest fixes.
Use nvidia-smi to watch per-GPU memory usage while the model loads and see which device fills up first.
If the model still does not load, consider a smaller quantization or a pruned variant to reduce memory requirements.
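
A small script can poll nvidia-smi while the model loads. The sketch below uses nvidia-smi's CSV query mode; the helper names are illustrative, and the sample values are made up so the parsing can be run on a machine without a GPU.

```python
import subprocess

# Query per-GPU index, used memory and total memory in MiB, as bare CSV rows.
QUERY = ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_nvidia_smi(csv_text):
    """Parse 'index, used, total' rows (MiB) into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        rows.append({"index": int(index), "used_mib": int(used), "total_mib": int(total)})
    return rows


def read_gpu_memory():
    """Run nvidia-smi and return parsed per-GPU memory usage (needs the NVIDIA driver tools)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_nvidia_smi(out)


# Made-up sample rows, used here instead of calling nvidia-smi.
for gpu in parse_nvidia_smi("0, 1024, 24576\n1, 2048, 24576\n"):
    print(f"GPU {gpu['index']}: {gpu['used_mib']} / {gpu['total_mib']} MiB")
```

On the affected machine, calling read_gpu_memory() in a loop during a load attempt would show how much memory each of the eight GPUs holds at the moment the runner exits.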
