ollama: ggml_cuda_cpy: unsupported type combination (q4_K to q4_K) on Blackwell (compute 12.0), RTX 5070 Ti Laptop GPU

Source: ollama/ollama#15939, fetched 2026-05-04 04:58:35


Summary

Loading any Q4_K_M GGUF model crashes the Ollama runner on RTX 50-series (Blackwell, compute capability 12.0). The CUDA backend hits an unsupported type combination in ggml_cuda_cpy and the runner exits with a 500 before producing any tokens. Forcing the Vulkan backend (OLLAMA_LLM_LIBRARY=vulkan) is a working workaround but introduces its own model-swap stability issue (separate from this report).

Environment

  • OS: Windows 11 Home 26200
  • Ollama: 0.22.1 (latest available via winget at time of report)
  • GPU: NVIDIA GeForce RTX 5070 Ti Laptop GPU (Blackwell, SM 12.0)
  • VRAM: 12 GB
  • NVIDIA Driver: 577.13
  • CUDA Version (driver-reported): 12.9
  • Models tested: 10 separate fulcrum-*-q4km.gguf files (Gemma 3 LoRA fine-tunes quantized to Q4_K_M; ~7.25 GB each)

Reproduce

  1. Install Ollama 0.22.1 on a Blackwell GPU box.
  2. Register any Q4_K_M GGUF model:
    ollama create fulcrum-test -f Modelfile
    where Modelfile is:
    FROM /path/to/fulcrum-command-q4km.gguf
    PARAMETER temperature 0.7
    PARAMETER num_ctx 4096
  3. Try to run it:
    ollama run fulcrum-test "hello"
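
The reproduction steps above can be scripted. The following is a minimal sketch, assuming `ollama` is on the PATH; the function names (`modelfile_text`, `repro_commands`, `run_repro`) are hypothetical helpers introduced here, and the GGUF path is the placeholder from the report, not a real file.

```python
import subprocess
from pathlib import Path


def modelfile_text(gguf_path: str, temperature: float = 0.7, num_ctx: int = 4096) -> str:
    """Build the three-line Modelfile used in the report."""
    return (
        f"FROM {gguf_path}\n"
        f"PARAMETER temperature {temperature}\n"
        f"PARAMETER num_ctx {num_ctx}\n"
    )


def repro_commands(model_name: str, modelfile_path: str) -> list:
    """Return the two CLI invocations from steps 2 and 3, as argument lists."""
    return [
        ["ollama", "create", model_name, "-f", str(modelfile_path)],
        ["ollama", "run", model_name, "hello"],
    ]


def run_repro(gguf_path: str, model_name: str = "fulcrum-test", workdir: str = ".") -> int:
    """Write the Modelfile, run both commands, and return the last exit code."""
    modelfile = Path(workdir) / "Modelfile"
    modelfile.write_text(modelfile_text(gguf_path), encoding="utf-8")
    code = 0
    for cmd in repro_commands(model_name, modelfile):
        code = subprocess.run(cmd, check=False).returncode
        if code != 0:
            break  # stop at the first failing step so the server log stays short
    return code
```

Calling `run_repro("/path/to/fulcrum-command-q4km.gguf")` on an affected machine would be expected to fail at the `ollama run` step; the server log, not the exit code, carries the actual kernel error.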

Expected

Model loads, inference proceeds.

Actual

Returns: Error: 500 Internal Server Error: model failed to load, this may be due to resource limitations or an internal error, check ollama server logs for details

Server log (relevant tail)

time=... level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:8 GPULayers:49[ID:GPU-... Layers:49(0..48)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=... level=INFO source=ggml.go:136 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1065 num_key_values=38
load_backend: loaded CPU backend from ...\ggml-cpu-alderlake.dll
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5070 Ti Laptop GPU, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from ...\cuda_v12\ggml-cuda.dll
time=... level=INFO source=ggml.go:104 msg=system ... CUDA.0.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\cpy.cu:574: ggml_cuda_cpy: unsupported type combination (q4_K to q4_K)

time=... level=ERROR source=server.go:1219 msg="do load request" error="Post \"http://127.0.0.1:.../load\": read tcp ...->...: wsarecv: An existing connection was forcibly closed by the remote host."

The CUDA.0.ARCHS line shows SM 1200 is in the compiled-arch list — the runner DOES try the CUDA path on Blackwell rather than skipping it. The crash is specifically in cpy.cu:574 for the q4_K to q4_K type combination.
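
To check a server log for the same signature, a small scanner along these lines may help. It is a sketch only: the regular expressions are written against the log lines quoted above, the `diagnose` helper is hypothetical, and the mapping from compute capability to arch code (major × 100 + minor × 10, so 12.0 becomes 1200) is an assumption inferred from the ARCHS list in this log.

```python
import re

DEVICE_RE = re.compile(r"Device (\d+): (.+?), compute capability (\d+)\.(\d+)")
ARCHS_RE = re.compile(r"CUDA\.\d+\.ARCHS=([\d,]+)")
CPY_RE = re.compile(r"ggml_cuda_cpy: unsupported type combination \((\w+) to (\w+)\)")


def diagnose(log_text: str) -> dict:
    """Extract the GPU, compiled arch list, and cpy failure from a server log."""
    result = {
        "device": None,
        "compute_capability": None,
        "arch_compiled": None,
        "cpy_failure": None,
    }
    device = DEVICE_RE.search(log_text)
    archs = ARCHS_RE.search(log_text)
    cpy = CPY_RE.search(log_text)

    if device:
        major, minor = int(device.group(3)), int(device.group(4))
        result["device"] = device.group(2)
        result["compute_capability"] = f"{major}.{minor}"
        if archs:
            compiled = {int(a) for a in archs.group(1).split(",") if a}
            # Assumed encoding: 12.0 -> 1200, 8.9 -> 890
            result["arch_compiled"] = (major * 100 + minor * 10) in compiled
    if cpy:
        result["cpy_failure"] = (cpy.group(1), cpy.group(2))
    return result
```

If `arch_compiled` is true and `cpy_failure` is `("q4_K", "q4_K")`, the log matches this report: the CUDA build targets the GPU, and the failure is in the copy kernel rather than in device selection.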

Other environment knobs tried (no effect on the CUDA path)

  • OLLAMA_FLASH_ATTENTION=0 — still crashes the same way
  • OLLAMA_NEW_ENGINE=1 — still crashes
  • OLLAMA_KV_CACHE_TYPE=q8_0 — still crashes
  • CUDA_VISIBLE_DEVICES="" — Ollama's scheduler still discovers and selects CUDA0 anyway; the CUDA path runs.

Workaround that works

Force Vulkan-only:

  • OLLAMA_LLM_LIBRARY=vulkan
  • OLLAMA_VULKAN=1

The Vulkan backend on the same GPU loads and serves the same Q4_K_M models without the kernel crash. Inference quality is good. Throughput is acceptable for our use case.

(Vulkan has its own separate model-swap stability issue under heavy load — happy to file that as a separate report if helpful.)
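
One way to apply the workaround without changing system-wide settings is to start the server from a script with the two variables set only for that process. This is a sketch under assumptions: `vulkan_env` is a hypothetical helper, and it presumes no other Ollama server instance is already running.

```python
import os
import subprocess

# The two variables named in the report's workaround.
VULKAN_OVERRIDES = {
    "OLLAMA_LLM_LIBRARY": "vulkan",
    "OLLAMA_VULKAN": "1",
}


def vulkan_env(base=None) -> dict:
    """Return a copy of the environment with the Vulkan overrides applied."""
    env = dict(os.environ if base is None else base)
    env.update(VULKAN_OVERRIDES)
    return env


def start_server_with_vulkan() -> subprocess.Popen:
    """Launch `ollama serve` with the overrides; the caller owns the process."""
    return subprocess.Popen(["ollama", "serve"], env=vulkan_env())
```

Because the overrides live only in the child process environment, removing the workaround later is just a matter of starting the server normally again.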

Why this matters

Blackwell consumer cards are now the default high-VRAM laptop GPU. Anyone who sets up Ollama on an RTX 50-series machine and tries any Q4_K_M model gets a hard 500 with no obvious next step — the error message ("resource limitations or an internal error") doesn't mention the kernel mismatch, so the workaround is hard to find without reading the server log directly.

A graceful fallback (CUDA → Vulkan when the runtime detects an unsupported kernel on the active arch) would resolve this without requiring users to know about the env-var workaround.
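
The shape of such a fallback might look like the sketch below. It is illustrative only and not Ollama's code: `UnsupportedKernelError`, `load_with_fallback`, and the `load` callable are all hypothetical. In the report the failure surfaces as the runner process exiting, so the exception here stands in for whatever signal the scheduler would actually observe.

```python
class UnsupportedKernelError(RuntimeError):
    """Stand-in for 'the backend has no kernel for this operation on this arch'."""


def load_with_fallback(load, backends=("cuda", "vulkan")):
    """Try each backend in order and return (backend, model) for the first that loads.

    `load` is a caller-supplied function taking a backend name and either
    returning a loaded model or raising UnsupportedKernelError.
    """
    failures = []
    for backend in backends:
        try:
            return backend, load(backend)
        except UnsupportedKernelError as exc:
            # Remember why this backend was skipped so the final error is actionable.
            failures.append(f"{backend}: {exc}")
    raise RuntimeError("no backend could load the model; " + "; ".join(failures))
```

Collecting the per-backend reasons addresses the second complaint in the report as well: if every backend fails, the final message names the kernel mismatch instead of a generic 500.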

Thanks for everything you build — happy to provide additional repros/logs if useful.

extent analysis

TL;DR

Forcing the Vulkan backend (OLLAMA_LLM_LIBRARY=vulkan, set together with OLLAMA_VULKAN=1 in the report) works around the ggml_cuda_cpy crash when loading Q4_K_M GGUF models on RTX 50-series GPUs.

Guidance

  • The CUDA backend aborts in ggml_cuda_cpy (cpy.cu:574) on an unsupported type combination, q4_K to q4_K, while loading the model on a Blackwell GPU.
  • To verify the issue, check the server log for the line "ggml_cuda_cpy: unsupported type combination (q4_K to q4_K)"; the client only sees a generic 500 error.
  • As a temporary workaround, set OLLAMA_LLM_LIBRARY=vulkan and OLLAMA_VULKAN=1 to force the Vulkan backend, which the report says loads and serves the same Q4_K_M models without crashing.
  • The Vulkan backend has its own model-swap stability issue under heavy load, which the reporter treats as a separate problem.

Example

No code snippet is provided as the issue is related to a specific hardware and software configuration.

Notes

The issue is specific to RTX 50-series GPUs with Blackwell architecture and Q4_K_M GGUF models. The workaround may not be necessary for other GPU models or software configurations.

Recommendation

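Apply the workaround by setting OLLAMA_LLM_LIBRARY=vulkan and OLLAMA_VULKAN=1 to force the Vulkan backend. In the report this avoids the crash on an RTX 5070 Ti Laptop GPU, but it is a stopgap: the Vulkan path has its own model-swap stability issue under heavy load.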
