ollama - 💡(How to fix) Qwen3-Next:80b doesn't load anymore after v0.18.x [1 participant]

ollama • 2026-03-29 09:07:50

ollama/ollama#15128 • Fetched 2026-04-08 01:45:31

Author

Aelentel

Timeline (top)

closed ×1, labeled ×1, renamed ×1

What is the issue?

The model loaded successfully before v0.18. After the update, I get this error:

$ ollama -v && ollama run qwen3-next:80b

ollama version is 0.18.3
Error: 500 Internal Server Error: failed to initialize model: qwen3next: layer 0 missing attn_qkv/attn_gate projections

Relevant log output

Mar 29 09:07:20 gof-zur-nut-vm490-023 ollama[816865]: [GIN] 2026/03/29 - 09:07:20 | 200 |      52.556µs |       127.0.0.1 | GET      "/api/version"
Mar 29 09:07:20 gof-zur-nut-vm490-023 ollama[816865]: [GIN] 2026/03/29 - 09:07:20 | 200 |       28.23µs |       127.0.0.1 | HEAD     "/"
Mar 29 09:07:20 gof-zur-nut-vm490-023 ollama[816865]: [GIN] 2026/03/29 - 09:07:20 | 200 |   90.213149ms |       127.0.0.1 | POST     "/api/show"
Mar 29 09:07:20 gof-zur-nut-vm490-023 ollama[816865]: [GIN] 2026/03/29 - 09:07:20 | 200 |   85.191693ms |       127.0.0.1 | POST     "/api/show"
Mar 29 09:07:20 gof-zur-nut-vm490-023 ollama[816865]: ggml_backend_cuda_device_get_memory device GPU-e34b0cab-38f4-fb3e-dfd1-538bd97bf7ab utilizing NVML memory reporting free: 44256460800 total: 100485038080
Mar 29 09:07:20 gof-zur-nut-vm490-023 ollama[816865]: time=2026-03-29T09:07:20.739Z level=INFO source=sched.go:627 msg="updated VRAM based on existing loaded models" gpu=GPU-e34b0cab-38f4-fb3e-dfd1-538bd97bf7a>
Mar 29 09:07:20 gof-zur-nut-vm490-023 ollama[816865]: time=2026-03-29T09:07:20.764Z level=WARN source=sched.go:423 msg="model architecture does not currently support parallel requests" architecture=qwen3next
Mar 29 09:07:20 gof-zur-nut-vm490-023 ollama[816865]: time=2026-03-29T09:07:20.796Z level=INFO source=server.go:247 msg="enabling flash attention"
Mar 29 09:07:20 gof-zur-nut-vm490-023 ollama[816865]: time=2026-03-29T09:07:20.796Z level=INFO source=server.go:432 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /usr/share/ol>
Mar 29 09:07:20 gof-zur-nut-vm490-023 ollama[816865]: time=2026-03-29T09:07:20.797Z level=INFO source=sched.go:484 msg="system memory" total="251.6 GiB" free="246.5 GiB" free_swap="0 B"
Mar 29 09:07:20 gof-zur-nut-vm490-023 ollama[816865]: time=2026-03-29T09:07:20.797Z level=INFO source=sched.go:491 msg="gpu memory" id=GPU-e34b0cab-38f4-fb3e-dfd1-538bd97bf7ab library=CUDA available="40.8 GiB">
Mar 29 09:07:20 gof-zur-nut-vm490-023 ollama[816865]: time=2026-03-29T09:07:20.797Z level=INFO source=server.go:759 msg="loading model" "model layers"=49 requested=-1
Mar 29 09:07:20 gof-zur-nut-vm490-023 ollama[816865]: time=2026-03-29T09:07:20.808Z level=INFO source=runner.go:1411 msg="starting ollama engine"
Mar 29 09:07:20 gof-zur-nut-vm490-023 ollama[816865]: time=2026-03-29T09:07:20.809Z level=INFO source=runner.go:1446 msg="Server listening on 127.0.0.1:35725"
Mar 29 09:07:20 gof-zur-nut-vm490-023 ollama[816865]: time=2026-03-29T09:07:20.818Z level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled>
Mar 29 09:07:20 gof-zur-nut-vm490-023 ollama[816865]: time=2026-03-29T09:07:20.840Z level=INFO source=ggml.go:136 msg="" architecture=qwen3next file_type=Q4_K_M name="Qwen3 Next 80B A3B Thinking" description=">
Mar 29 09:07:20 gof-zur-nut-vm490-023 ollama[816865]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Mar 29 09:07:20 gof-zur-nut-vm490-023 ollama[816865]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Mar 29 09:07:20 gof-zur-nut-vm490-023 ollama[816865]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Mar 29 09:07:20 gof-zur-nut-vm490-023 ollama[816865]: ggml_cuda_init: found 1 CUDA devices:
Mar 29 09:07:20 gof-zur-nut-vm490-023 ollama[816865]:   Device 0: NVIDIA H100 NVL, compute capability 9.0, VMM: yes, ID: GPU-e34b0cab-38f4-fb3e-dfd1-538bd97bf7ab
Mar 29 09:07:20 gof-zur-nut-vm490-023 ollama[816865]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v13/libggml-cuda.so
Mar 29 09:07:20 gof-zur-nut-vm490-023 ollama[816865]: time=2026-03-29T09:07:20.907Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.B>
Mar 29 09:07:20 gof-zur-nut-vm490-023 ollama[816865]: time=2026-03-29T09:07:20.914Z level=INFO source=server.go:1218 msg="llm load error: failed to initialize model: qwen3next: layer 0 missing attn_qkv/attn_ga>
Mar 29 09:07:20 gof-zur-nut-vm490-023 ollama[816865]: time=2026-03-29T09:07:20.915Z level=INFO source=runner.go:1284 msg=load request="{Operation:close LoraPath:[] Parallel:0 BatchSize:0 FlashAttention:Disable>
Mar 29 09:07:20 gof-zur-nut-vm490-023 ollama[816865]: time=2026-03-29T09:07:20.915Z level=INFO source=sched.go:511 msg="Load failed" model=/usr/share/ollama/.ollama/models/blobs/sha256-8476acca2ca7dc4dd86ad2e0>
Mar 29 09:07:20 gof-zur-nut-vm490-023 ollama[816865]: [GIN] 2026/03/29 - 09:07:20 | 500 |  310.450484ms |       127.0.0.1 | POST     "/api/generate"
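In a log this long, the two lines that matter are the "llm load error" entry and the "Load failed" entry that follows it. The sketch below is one way to pull them out of saved log text. It assumes the key=value layout seen in the lines above; the function names parse_fields and find_load_errors are made up for this example and are not part of Ollama. Truncated values (ending in ">") are returned as-is.

```python
import re

# Matches key=value fields; a value is either a double-quoted string
# (possibly cut off before its closing quote) or a bare token.
FIELD_RE = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"?|\S+)')


def parse_fields(line: str) -> dict[str, str]:
    """Return the key=value fields of one log line, with quotes stripped."""
    fields = {}
    for key, value in FIELD_RE.findall(line):
        fields[key] = value.strip('"')
    return fields


def find_load_errors(log_text: str) -> list[str]:
    """Collect the msg field of every line that reports a failed model load."""
    errors = []
    for line in log_text.splitlines():
        msg = parse_fields(line).get("msg", "")
        if "llm load error" in msg or msg == "Load failed":
            errors.append(msg)
    return errors
```

Feeding the log above to find_load_errors should return the "llm load error: failed to initialize model: ..." message followed by "Load failed", which makes it easier to see that the failure happens during model initialization.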

OS

Ubuntu 24.04.4 LTS

GPU

Nvidia H100

CPU

Intel Xeon Processor

Ollama version

v0.18.3

extent analysis

Fix Plan

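The error says that the qwen3next loader in Ollama 0.18.3 could not find the attn_qkv/attn_gate projections for layer 0 in the model file. The same model loaded before the update, which points to a mismatch between the locally stored model and what the new loader expects. The log shows the failure during model initialization, not an out-of-memory condition.

To fix this, you can try the following steps:

Pull the model again with ollama pull qwen3-next:80b so the local copy matches what the registry currently serves. This may help if the published model was updated for the new loader, which is not confirmed here.
Check whether a release newer than v0.18.3 is available and whether ollama/ollama#15128 mentions a fix.
If the issue persists, try downgrading Ollama to a previous version that worked with the model.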

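The snippet below is not a fix: the error concerns tensors stored in the model file that Ollama loads, and defining a PyTorch module does not change that file. It only illustrates, with made-up dimensions, what a pair of projections named attn_qkv and attn_gate might look like:

import torch
import torch.nn as nn

class Qwen3NextModel(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        # Made-up size, not taken from the real model.
        self.attn_qkv = nn.Linear(dim, dim)   # stand-in for the attn_qkv projection
        self.attn_gate = nn.Linear(dim, dim)  # stand-in for the attn_gate projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Hypothetical gating for illustration: scale one projection by the sigmoid of the other.
        return self.attn_qkv(x) * torch.sigmoid(self.attn_gate(x))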

Alternatively, until the loading problem is resolved, you can run a different model or model tag that the current version of Ollama loads.
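The version check and the re-pull step can be scripted. The following is a minimal sketch, assuming the ollama binary is on PATH and that ollama -v prints text like "ollama version is 0.18.3" as shown in the report. The helper names and the is_affected rule are assumptions made for this example; the report only confirms that 0.18.3 fails.

```python
import re
import subprocess

MODEL = "qwen3-next:80b"  # model tag taken from the report


def parse_version(output: str) -> tuple[int, ...]:
    """Extract (major, minor, patch) from text like 'ollama version is 0.18.3'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"no version number in {output!r}")
    return tuple(int(part) for part in match.groups())


def is_affected(version: tuple[int, ...]) -> bool:
    """Assumption: treat the whole 0.18.x line as affected, as the issue title suggests."""
    return version[:2] == (0, 18)


def installed_version() -> tuple[int, ...]:
    """Run `ollama -v` and parse the version it prints."""
    result = subprocess.run(
        ["ollama", "-v"], capture_output=True, text=True, check=True
    )
    return parse_version(result.stdout)


def repull_model(model: str = MODEL) -> bool:
    """Run `ollama pull` to refresh the local copy; True if the command succeeded."""
    return subprocess.run(["ollama", "pull", model]).returncode == 0
```

A typical use would be to call installed_version(), and if is_affected(...) returns True, call repull_model() before retrying ollama run qwen3-next:80b. Whether the re-pull resolves the error is not confirmed by the report.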

Verification

To verify that the fix worked, run ollama run qwen3-next:80b again and check the server log. The load succeeded if the "llm load error" and "Load failed" lines no longer appear and the POST "/api/generate" request returns 200 instead of 500.
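The same check can be made against the HTTP API directly. This is a sketch, assuming the server listens on the default local address http://127.0.0.1:11434 (the report does not show the server port) and using the /api/generate endpoint seen in the log. The names classify and try_generate are invented for this example.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # assumption: default local server address


def classify(status: int, body: str) -> str:
    """Map an HTTP status and response body to a short verdict."""
    if status == 200:
        return "loaded"
    if status == 500 and "failed to initialize model" in body:
        return "load-failed"
    return "other"


def try_generate(model: str = "qwen3-next:80b", base_url: str = OLLAMA_URL) -> str:
    """Send a tiny non-streaming generate request and classify the result."""
    payload = json.dumps({"model": model, "prompt": "ping", "stream": False}).encode()
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request) as response:
            return classify(response.status, response.read().decode())
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx; the error body still carries the message.
        return classify(err.code, err.read().decode())
```

Calling try_generate() should return "load-failed" while the error from the report persists and "loaded" once the model initializes. A connection error means the server is not reachable at the assumed address.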

Extra Tips

Check the Ollama release notes for changes to supported model architectures before upgrading.
If you are using a custom model, ensure that it is compatible with the current version of Ollama.
For more detailed logs, set the OLLAMA_DEBUG=1 environment variable for the Ollama server. The --verbose flag of ollama run prints timing statistics for the response, not additional server logs.
