ollama - 💡(How to fix) Fix ggml_cuda_cpy: unsupported type combination (q4_K to q4_K) on Blackwell (compute 12.0) — RTX 5070 Ti Laptop GPU [1 comments, 1 participants]

ollama • 2026-05-03 03:35:57


ollama/ollama#15939•Fetched 2026-05-04 04:58:35


Author

nickherahomes


Loading any Q4_K_M GGUF model crashes the Ollama runner on RTX 50-series (Blackwell, compute capability 12.0). The CUDA backend hits an unsupported type combination in ggml_cuda_cpy and the runner exits with a 500 before producing any tokens. Forcing the Vulkan backend (OLLAMA_LLM_LIBRARY=vulkan) is a working workaround but introduces its own model-swap stability issue (separate from this report).

Error Message

Returns: Error: 500 Internal Server Error: model failed to load, this may be due to resource limitations or an internal error, check ollama server logs for details

The server log shows the underlying failure, followed by the dropped connection to the runner:

C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\cpy.cu:574: ggml_cuda_cpy: unsupported type combination (q4_K to q4_K)

time=... level=ERROR source=server.go:1219 msg="do load request" error="Post \"http://127.0.0.1:.../load\": read tcp ...->...: wsarecv: An existing connection was forcibly closed by the remote host."

Root Cause

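The CUDA backend aborts in ggml_cuda_cpy (cpy.cu:574) because the q4_K to q4_K copy is an unsupported type combination, even though SM 1200 is in the compiled-arch list. The report does not identify which operation requests that copy.

Blackwell consumer cards are now the default high-VRAM laptop GPU. Anyone who sets up Ollama on an RTX 50-series machine and tries any Q4_K_M model gets a hard 500 with no obvious next step: the error message ("resource limitations or an internal error") doesn't mention the kernel mismatch, so the workaround is hard to find without reading the server log directly.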



Summary

Environment

OS: Windows 11 Home 26200
Ollama: 0.22.1 (latest available via winget at time of report)
GPU: NVIDIA GeForce RTX 5070 Ti Laptop GPU (Blackwell, SM 12.0)
VRAM: 12 GB
NVIDIA Driver: 577.13
CUDA Version (driver-reported): 12.9
Models tested: 10 separate fulcrum-*-q4km.gguf files (Gemma 3 LoRA fine-tunes quantized to Q4_K_M; ~7.25 GB each)

Reproduce

Install Ollama 0.22.1 on a Blackwell GPU box.

ollama create fulcrum-test -f Modelfile

where Modelfile is:

FROM /path/to/fulcrum-command-q4km.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096

Try to run it:
```
ollama run fulcrum-test "hello"
```

Expected

Model loads, inference proceeds.

Actual

Returns: Error: 500 Internal Server Error: model failed to load, this may be due to resource limitations or an internal error, check ollama server logs for details

Server log (relevant tail)

time=... level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:8 GPULayers:49[ID:GPU-... Layers:49(0..48)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=... level=INFO source=ggml.go:136 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1065 num_key_values=38
load_backend: loaded CPU backend from ...\ggml-cpu-alderlake.dll
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5070 Ti Laptop GPU, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from ...\cuda_v12\ggml-cuda.dll
time=... level=INFO source=ggml.go:104 msg=system ... CUDA.0.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\cpy.cu:574: ggml_cuda_cpy: unsupported type combination (q4_K to q4_K)

time=... level=ERROR source=server.go:1219 msg="do load request" error="Post \"http://127.0.0.1:.../load\": read tcp ...->...: wsarecv: An existing connection was forcibly closed by the remote host."

The CUDA.0.ARCHS line shows SM 1200 is in the compiled-arch list — the runner DOES try the CUDA path on Blackwell rather than skipping it. The crash is specifically in cpy.cu:574 for the q4_K to q4_K type combination.
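The two observations above can be checked mechanically against a saved server log. The following is a minimal sketch, assuming the log text has the same shape as the excerpt in this report; the function name `diagnose` and the sample constant are illustrative and not part of Ollama.

```python
import re

# Two lines copied from the server log above (elisions kept as-is).
SAMPLE_LOG = r"""
time=... level=INFO source=ggml.go:104 msg=system ... CUDA.0.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\cpy.cu:574: ggml_cuda_cpy: unsupported type combination (q4_K to q4_K)
"""

# Patterns are derived from the log excerpt, not from any documented log format.
ARCHS_RE = re.compile(r"CUDA\.\d+\.ARCHS=([\d,]+)")
CPY_RE = re.compile(r"ggml_cuda_cpy: unsupported type combination \((\w+) to (\w+)\)")


def diagnose(log_text):
    """Return (compiled_archs, failing_type_pair) found in the log text.

    compiled_archs is a list of ints (empty if the ARCHS field is absent);
    failing_type_pair is a (src, dst) tuple, or None if the crash line is absent.
    """
    archs = []
    match = ARCHS_RE.search(log_text)
    if match:
        archs = [int(a) for a in match.group(1).split(",") if a]

    pair = None
    match = CPY_RE.search(log_text)
    if match:
        pair = (match.group(1), match.group(2))
    return archs, pair


if __name__ == "__main__":
    archs, pair = diagnose(SAMPLE_LOG)
    print("SM 1200 compiled in:", 1200 in archs)
    print("failing copy:", pair)
```

If 1200 is in the list and the type pair is reported, the log matches this issue rather than a missing-architecture or out-of-memory failure.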

Other environment knobs tried (no effect on the CUDA path)

OLLAMA_FLASH_ATTENTION=0 — still crashes the same way
OLLAMA_NEW_ENGINE=1 — still crashes
OLLAMA_KV_CACHE_TYPE=q8_0 — still crashes
CUDA_VISIBLE_DEVICES="" — Ollama's scheduler still discovers and selects CUDA0 anyway; the CUDA path runs.

Workaround that works

Force Vulkan-only:

OLLAMA_LLM_LIBRARY=vulkan
OLLAMA_VULKAN=1

The Vulkan backend on the same GPU loads and serves the same Q4_K_M models without the kernel crash. Inference quality is good. Throughput is acceptable for our use case.

(Vulkan has its own separate model-swap stability issue under heavy load — happy to file that as a separate report if helpful.)
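For a scripted setup, the workaround can be applied by starting the server with both variables in its environment. This is a minimal sketch, assuming the `ollama` binary is on PATH and no other Ollama server is already running; the helper names are illustrative, and only the two variables come from the report.

```python
import os
import subprocess

# The two variables named in the report; nothing else about Ollama's
# configuration is assumed here.
VULKAN_ENV = {"OLLAMA_LLM_LIBRARY": "vulkan", "OLLAMA_VULKAN": "1"}


def vulkan_environment(base=None):
    """Return a copy of the base environment with the Vulkan overrides applied."""
    env = dict(os.environ if base is None else base)
    env.update(VULKAN_ENV)
    return env


def serve_with_vulkan():
    """Start `ollama serve` with the Vulkan overrides and return the process handle."""
    return subprocess.Popen(["ollama", "serve"], env=vulkan_environment())


if __name__ == "__main__":
    proc = serve_with_vulkan()
    proc.wait()
```

Copying the environment rather than mutating `os.environ` keeps the override scoped to the server process.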

Why this matters

A graceful fallback (CUDA → Vulkan when the runtime detects an unsupported kernel on the active arch) would resolve this without requiring users to know about the env-var workaround.
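The requested policy can be sketched at the level of a wrapper script. This is not how Ollama's scheduler works internally; `try_load` is a hypothetical callable standing in for "attempt to load the model with this environment and return the log", and `fake_load` only simulates the behaviour described in this report.

```python
# Substring of the crash line from the server log in this report.
MARKER = "ggml_cuda_cpy: unsupported type combination"
# The workaround variables from the report.
VULKAN_ENV = {"OLLAMA_LLM_LIBRARY": "vulkan", "OLLAMA_VULKAN": "1"}


def load_with_fallback(try_load, base_env):
    """Try the default backend first; retry once on Vulkan if the copy kernel is unsupported.

    try_load(env) is a hypothetical callable returning (ok, log_text).
    Returns "default" or "vulkan" depending on which attempt succeeded,
    and raises RuntimeError if neither did.
    """
    ok, log_text = try_load(dict(base_env))
    if ok:
        return "default"
    if MARKER not in log_text:
        # Do not mask unrelated failures (for example, running out of VRAM).
        raise RuntimeError("load failed for a reason other than the unsupported copy kernel")

    ok, _ = try_load({**base_env, **VULKAN_ENV})
    if ok:
        return "vulkan"
    raise RuntimeError("load failed on both the default backend and Vulkan")


def fake_load(env):
    """Simulated loader: fails on the default path as in this report, succeeds on Vulkan."""
    if env.get("OLLAMA_LLM_LIBRARY") == "vulkan":
        return True, ""
    return False, "cpy.cu:574: ggml_cuda_cpy: unsupported type combination (q4_K to q4_K)"


if __name__ == "__main__":
    print(load_with_fallback(fake_load, {}))
```

The fallback triggers only on the specific crash marker, so other load failures still surface as errors instead of silently switching backends.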

Thanks for everything you build — happy to provide additional repros/logs if useful.

extent analysis

TL;DR

Forcing the Vulkan backend by setting OLLAMA_LLM_LIBRARY=vulkan is a working workaround for the crash issue with Q4_K_M GGUF models on RTX 50-series GPUs.

Guidance

The crash is caused by an unsupported type combination in ggml_cuda_cpy when using the CUDA backend on Blackwell GPUs.
To verify the issue, check the server log for the error message "ggml_cuda_cpy: unsupported type combination (q4_K to q4_K)".
As a temporary workaround, set OLLAMA_LLM_LIBRARY=vulkan to force the Vulkan backend, which loads and serves Q4_K_M models without crashing.
Note that the Vulkan backend has its own model-swap stability issue under heavy load, which may need to be addressed separately.

Example

The workaround needs no code change: set the two environment variables from the report, OLLAMA_LLM_LIBRARY=vulkan and OLLAMA_VULKAN=1, in the environment of the Ollama server process, then restart it.

Notes

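The report covers a single RTX 5070 Ti Laptop GPU (Blackwell, compute capability 12.0) running Ollama 0.22.1 on Windows 11, tested with ten Gemma 3 fine-tunes quantized to Q4_K_M. Whether other GPUs, quantization types, or Ollama versions are affected is not established by the report, so the workaround may not be necessary elsewhere.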

Recommendation

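Until the CUDA copy path is fixed, set OLLAMA_LLM_LIBRARY=vulkan (the report also sets OLLAMA_VULKAN=1) to force the Vulkan backend. Treat this as a workaround rather than a fix: it avoids the load crash on the reporter's machine, but the reporter also notes a separate Vulkan model-swap stability issue under heavy load.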
