ollama - How to fix: GPT-OSS 120B not working with 0.30 [1 pull request]

Error Message

time=2026-06-02T07:17:33.302Z level=INFO source=llama_server.go:1110 msg="waiting for llama-server to become available" status="llm server error"
...............................................................................................
time=2026-06-02T07:22:33.268Z level=INFO source=sched.go:641 msg="Load failed" model=/root/.ollama/models/blobs/sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a error="timed out waiting for llama-server to start - "

Fix Action

Fixed

What is the issue?

On an AMD Strix Halo machine (128 GB of shared RAM/VRAM) running Ubuntu 24.04, Open WebUI 0.9.6, and Ollama 0.30.0, GPT-OSS 120B fails to load. After about five minutes, Open WebUI reports "timed out waiting for llama-server to start -".
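
The wait before the error can be quantified from the log timestamps: llama-server is started at 07:17:33.050Z and the load fails at 07:22:33.268Z, which lines up with the OLLAMA_LOAD_TIMEOUT:5m0s value in the server config. The following is a minimal sketch, assuming Python 3 and that the two timestamps and the timeout are copied by hand from the log; the function names are illustrative and not part of Ollama.

```python
from datetime import datetime, timedelta

# Values copied by hand from the log output in this report.
START = "2026-06-02T07:17:33.050Z"   # msg="starting llama-server"
FAILED = "2026-06-02T07:22:33.268Z"  # msg="Load failed"
LOAD_TIMEOUT = timedelta(minutes=5)  # OLLAMA_LOAD_TIMEOUT:5m0s


def parse_ts(ts: str) -> datetime:
    """Parse a log timestamp such as 2026-06-02T07:17:33.050Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two log timestamps."""
    return parse_ts(end) - parse_ts(start)


if __name__ == "__main__":
    waited = elapsed(START, FAILED)
    print(f"waited {waited.total_seconds():.3f} s")
    print("reached load timeout:", waited >= LOAD_TIMEOUT)
```

If the assumption holds, the 500 response on POST "/api/chat" after 5m2s is the load timeout firing rather than an immediate crash. It does not explain why llama-server never became ready.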

Relevant log output

time=2026-06-02T07:17:05.634Z level=INFO source=routes.go:1919 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: LLAMA_ARG_FIT: LLAMA_ARG_FIT_TARGET: NO_PROXY: OLLAMA_CONTEXT_LENGTH:0 OLLAMA_DEBUG:INFO OLLAMA_DEBUG_LOG_REQUESTS:false OLLAMA_EDITOR: OLLAMA_FLASH_ATTENTION:true OLLAMA_GO_TEMPLATE:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_IGPU_ENABLE:1 OLLAMA_KEEP_ALIVE:30m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:4 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_TRANSFER_STREAMS:4 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NO_CLOUD:false OLLAMA_NUM_PARALLEL:2 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false OLLAMA_VULKAN:true ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2026-06-02T07:17:05.635Z level=INFO source=routes.go:1921 msg="Ollama cloud disabled: false"
time=2026-06-02T07:17:05.658Z level=INFO source=images.go:754 msg="total blobs: 42"
time=2026-06-02T07:17:05.659Z level=INFO source=images.go:761 msg="total unused blobs removed: 0"
time=2026-06-02T07:17:05.659Z level=INFO source=routes.go:1981 msg="Listening on [::]:11434 (version 0.30.0)"
time=2026-06-02T07:17:05.665Z level=INFO source=runner.go:55 msg="discovering available GPUs..."
time=2026-06-02T07:17:05.734Z level=INFO source=model_list_cache.go:111 msg="model list cache hydration complete" models=13 failures=0 elapsed=74.705219ms
time=2026-06-02T07:17:06.008Z level=INFO source=model_recommendations.go:177 msg="model recommendations cache sleep scheduled" wait=4h37m7.09383206s consecutive_failures=0
time=2026-06-02T07:17:06.650Z level=INFO source=types.go:32 msg="inference compute" id=0 filter_id=0 library=ROCm compute=gfx1151 name=ROCm0 description="Radeon 8060S Graphics" libdirs=ollama,rocm_v7_2 driver=0.0 pci_id=0000:c5:00.0 type=iGPU total="112.5 GiB" available="118.0 GiB"
time=2026-06-02T07:17:06.650Z level=INFO source=routes.go:2031 msg="vram-based default context" total_vram="112.5 GiB" default_num_ctx=262144
[GIN] 2026/06/02 - 07:17:23 | 200 |     713.621µs |      172.18.0.4 | GET      "/api/tags"
[GIN] 2026/06/02 - 07:17:23 | 200 |      67.418µs |      172.18.0.4 | GET      "/api/ps"
[GIN] 2026/06/02 - 07:17:23 | 200 |     105.707µs |      172.18.0.4 | GET      "/api/version"
time=2026-06-02T07:17:33.050Z level=INFO source=server.go:109 msg="using llama-server for model" model=/root/.ollama/models/blobs/sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a
time=2026-06-02T07:17:33.050Z level=INFO source=llama_server.go:400 msg="starting llama-server" cmd="/usr/lib/ollama/llama-server --model /root/.ollama/models/blobs/sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a --port 32853 --host 127.0.0.1 --no-webui --offline -c 262144 -np 2 --log-verbosity 4 --no-log-prefix --no-log-timestamps --no-jinja --chat-template chatml --flash-attn on -b 1024 -ub 1024"
time=2026-06-02T07:17:33.051Z level=INFO source=sched.go:613 msg="system memory" total="124.9 GiB" free="124.8 GiB" free_swap="4.6 GiB"
time=2026-06-02T07:17:33.051Z level=INFO source=sched.go:620 msg="gpu memory" id=0 library=ROCm available="117.5 GiB" free="118.0 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-06-02T07:17:33.051Z level=INFO source=llama_server.go:882 msg="loading model via llama-server" model=/root/.ollama/models/blobs/sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a
time=2026-06-02T07:17:33.051Z level=INFO source=llama_server.go:1060 msg="waiting for llama-server to start responding"
time=2026-06-02T07:17:33.051Z level=INFO source=llama_server.go:1110 msg="waiting for llama-server to become available" status="llm server not responding"
common_params_print_info: build 1 (19620004f) with GNU 11.2.1 for Linux x86_64
log_info: verbosity = 4 (adjust with the `-lv N` CLI arg)
device_info:
  - CPU     : AMD RYZEN AI MAX+ 395 w/ Radeon 8060S (127937 MiB, 127937 MiB free)
  - ROCm0   : Radeon 8060S Graphics (115200 MiB, 120767 MiB free)
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | REPACK = 1 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | 
srv          init: using 31 threads for HTTP server
srv          init: The UI is disabled
srv          init: Use --ui/--no-ui (or deprecated --webui/--no-webui) to enable/disable
srv         start: binding port with default address family
srv  llama_server: loading model
srv    load_model: loading model '/root/.ollama/models/blobs/sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a'
common_init_result: fitting params to device memory ...
common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
common_params_fit_impl: getting device memory data for initial parameters:
handle_gptoss: detected Ollama-format gpt-oss GGUF; applying compatibility fixes
time=2026-06-02T07:17:33.302Z level=INFO source=llama_server.go:1110 msg="waiting for llama-server to become available" status="llm server error"
common_memory_breakdown_print: | memory breakdown [MiB]     |  total     free     self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - ROCm0 (8060S Graphics) | 115200 = 120459 + (71326 = 61223 +    9306 +     796) +      -76586 |
common_memory_breakdown_print: |   - Host                   |                     1385 =  1104 +       0 +     281                |
common_params_fit_impl: projected to use 71326 MiB of device memory vs. 120459 MiB of free device memory
common_params_fit_impl: will leave 49133 >= 1024 MiB of free device memory, no changes needed
common_fit_params: successfully fit params to free device memory
common_fit_params: fitting params to free memory took 0.94 seconds
handle_gptoss: detected Ollama-format gpt-oss GGUF; applying compatibility fixes
llama_model_loader: loaded meta data with 48 key-value pairs and 687 tensors from /root/.ollama/models/blobs/sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                          general.file_type u32              = 4
llama_model_loader: - kv   1:               general.quantization_version u32              = 2
llama_model_loader: - kv   2:                gptoss.attention.head_count u32              = 64
llama_model_loader: - kv   3:             gptoss.attention.head_count_kv u32              = 8
llama_model_loader: - kv   4:                gptoss.attention.key_length u32              = 64
llama_model_loader: - kv   5:    gptoss.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv   6:            gptoss.attention.sliding_window u32              = 128
llama_model_loader: - kv   7:              gptoss.attention.value_length u32              = 64
llama_model_loader: - kv   8:                         gptoss.block_count u32              = 36
llama_model_loader: - kv   9:                      gptoss.context_length u32              = 131072
llama_model_loader: - kv  10:                    gptoss.embedding_length u32              = 2880
llama_model_loader: - kv  11:                        gptoss.expert_count u32              = 128
llama_model_loader: - kv  12:                   gptoss.expert_used_count u32              = 4
llama_model_loader: - kv  13:                 gptoss.feed_forward_length u32              = 2880
llama_model_loader: - kv  14:                      gptoss.rope.freq_base f32              = 150000.000000
llama_model_loader: - kv  15:                 gptoss.rope.scaling.factor f32              = 32.000000
llama_model_loader: - kv  16: gptoss.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  17:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  18:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  19:           tokenizer.ggml.add_padding_token bool             = false
llama_model_loader: - kv  20:                tokenizer.ggml.bos_token_id u32              = 199998
llama_model_loader: - kv  21:                tokenizer.ggml.eos_token_id u32              = 199999
llama_model_loader: - kv  22:               tokenizer.ggml.eos_token_ids arr[i32,3]       = [199999, 200002, 200012]
llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,446189]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  25:            tokenizer.ggml.padding_token_id u32              = 199999
llama_model_loader: - kv  26:                      tokenizer.ggml.scores arr[f32,201088]  = [0.000000, 1.000000, 2.000000, 3.0000...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,201088]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,201088]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  29:                       general.architecture str              = gpt-oss
llama_model_loader: - kv  30:               gpt-oss.attention.head_count u32              = 64
llama_model_loader: - kv  31:            gpt-oss.attention.head_count_kv u32              = 8
llama_model_loader: - kv  32:               gpt-oss.attention.key_length u32              = 64
llama_model_loader: - kv  33:   gpt-oss.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  34:           gpt-oss.attention.sliding_window u32              = 128
llama_model_loader: - kv  35:             gpt-oss.attention.value_length u32              = 64
llama_model_loader: - kv  36:                        gpt-oss.block_count u32              = 36
llama_model_loader: - kv  37:                     gpt-oss.context_length u32              = 131072
llama_model_loader: - kv  38:                   gpt-oss.embedding_length u32              = 2880
llama_model_loader: - kv  39:                       gpt-oss.expert_count u32              = 128
llama_model_loader: - kv  40:                  gpt-oss.expert_used_count u32              = 4
llama_model_loader: - kv  41:                gpt-oss.feed_forward_length u32              = 2880
llama_model_loader: - kv  42:                     gpt-oss.rope.freq_base f32              = 150000.000000
llama_model_loader: - kv  43:                gpt-oss.rope.scaling.factor f32              = 32.000000
llama_model_loader: - kv  44: gpt-oss.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  45:         gpt-oss.expert_feed_forward_length u32              = 2880
llama_model_loader: - kv  46:                  gpt-oss.rope.scaling.type str              = yarn
llama_model_loader: - kv  47:                         tokenizer.ggml.pre str              = gpt-4o
llama_model_loader: - type  f32:  433 tensors
llama_model_loader: - type bf16:  146 tensors
llama_model_loader: - type mxfp4:  108 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = unknown, may not work
print_info: file size   = 60.87 GiB (4.48 BPW) 
llama_prepare_model_devices: using device ROCm0 (Radeon 8060S Graphics) (0000:c5:00.0) - 120460 MiB free
load: 0 unused tokens
load: setting token '<|message|>' (200008) attribute to USER_DEFINED (16), old attributes: 8
load: setting token '<|start|>' (200006) attribute to USER_DEFINED (16), old attributes: 8
load: setting token '<|constrain|>' (200003) attribute to USER_DEFINED (16), old attributes: 8
load: setting token '<|channel|>' (200005) attribute to USER_DEFINED (16), old attributes: 8
load: printing all EOG tokens:
load:   - 199999 ('<|endoftext|>')
load:   - 200002 ('<|return|>')
load:   - 200007 ('<|end|>')
load:   - 200012 ('<|call|>')
load: special_eog_ids contains both '<|return|>' and '<|call|>', or '<|calls|>' and '<|flush|>' tokens, removing '<|end|>' token from EOG list
load: special tokens cache size = 1090
load: token to piece cache size = 1.3413 MB
print_info: arch                  = gpt-oss
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 131072
print_info: n_embd                = 2880
print_info: n_embd_inp            = 2880
print_info: n_layer               = 36
print_info: n_head                = 64
print_info: n_head_kv             = 8
print_info: n_rot                 = 64
print_info: n_swa                 = 128
print_info: is_swa_any            = 1
print_info: n_embd_head_k         = 64
print_info: n_embd_head_v         = 64
print_info: n_gqa                 = 8
print_info: n_embd_k_gqa          = 512
print_info: n_embd_v_gqa          = 512
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-05
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: f_attn_value_scale    = 0.0000
print_info: n_ff                  = 2880
print_info: n_expert              = 128
print_info: n_expert_used         = 4
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = -1
print_info: rope type             = 2
print_info: rope scaling          = yarn
print_info: freq_base_train       = 150000.0
print_info: freq_scale_train      = 0.03125
print_info: freq_base_swa         = 150000.0
print_info: freq_scale_swa        = 0.03125
print_info: n_embd_head_k_swa     = 64
print_info: n_embd_head_v_swa     = 64
print_info: n_rot_swa             = 64
print_info: n_ctx_orig_yarn       = 4096
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: model type            = 120B
print_info: model params          = 116.83 B
print_info: general.name          = n/a
print_info: n_ff_exp              = 2880
print_info: vocab type            = BPE
print_info: n_vocab               = 201088
print_info: n_merges              = 446189
print_info: BOS token             = 199998 '<|startoftext|>'
print_info: EOS token             = 199999 '<|endoftext|>'
print_info: EOT token             = 199999 '<|endoftext|>'
print_info: PAD token             = 199999 '<|endoftext|>'
print_info: LF token              = 198 'Ċ'
print_info: EOG token             = 199999 '<|endoftext|>'
print_info: EOG token             = 200002 '<|return|>'
print_info: EOG token             = 200012 '<|call|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 35 repeating layers to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  1104.61 MiB
load_tensors:        ROCm0 model buffer size = 61223.74 MiB

[GIN] 2026/06/02 - 07:18:54 | 200 |   39.591351ms |      172.18.0.4 | GET      "/api/tags"
[GIN] 2026/06/02 - 07:18:54 | 200 |     682.732µs |      172.18.0.4 | GET      "/api/ps"
[GIN] 2026/06/02 - 07:18:57 | 200 |    2.556981ms |      172.18.0.4 | GET      "/api/tags"
[GIN] 2026/06/02 - 07:18:57 | 200 |       12.64µs |      172.18.0.4 | GET      "/api/ps"
[GIN] 2026/06/02 - 07:18:58 | 200 |    1.641806ms |      172.18.0.4 | GET      "/api/tags"
[GIN] 2026/06/02 - 07:18:58 | 200 |      78.888µs |      172.18.0.4 | GET      "/api/ps"
[GIN] 2026/06/02 - 07:19:01 | 200 |    1.697404ms |      172.18.0.4 | GET      "/api/tags"
[GIN] 2026/06/02 - 07:19:01 | 200 |     110.667µs |      172.18.0.4 | GET      "/api/ps"
...............................................................................................
time=2026-06-02T07:22:33.268Z level=INFO source=sched.go:641 msg="Load failed" model=/root/.ollama/models/blobs/sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a error="timed out waiting for llama-server to start - "
[GIN] 2026/06/02 - 07:22:35 | 500 |          5m2s |      172.18.0.4 | POST     "/api/chat"
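
The memory figures in the log above can be cross-checked with simple arithmetic. The sketch below uses values copied by hand from the common_memory_breakdown_print and common_params_fit_impl lines. It checks that the projected use is the sum of the model, context, and compute buffers (to within 1 MiB of rounding) and that the reported headroom matches. The constant and function names are illustrative only.

```python
# All values are in MiB, copied by hand from the log lines above.
FREE = 120459     # free device memory on ROCm0
SELF = 71326      # projected use ("self" column)
MODEL = 61223     # model buffer
CONTEXT = 9306    # context buffer
COMPUTE = 796     # compute buffer
MIN_FREE = 1024   # margin quoted by common_params_fit_impl


def headroom(free_mib: int, used_mib: int) -> int:
    """Free device memory left after the projected allocation."""
    return free_mib - used_mib


def fits(free_mib: int, used_mib: int, min_free_mib: int) -> bool:
    """True if the projected allocation leaves at least the margin free."""
    return headroom(free_mib, used_mib) >= min_free_mib


if __name__ == "__main__":
    print("sum of buffers:", MODEL + CONTEXT + COMPUTE)
    print("headroom:", headroom(FREE, SELF))
    print("fits:", fits(FREE, SELF, MIN_FREE))
```

By these numbers the model fits with roughly 48 GiB to spare, so the fit step on its own does not point to an out-of-memory condition.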

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.30.0
