ollama - 💡(How to fix) Fix RTX 4060 CUDA illegal memory access on tool-calling requests (0.21.2) [1 participants]

cryptrix1598 · 2026-04-28T20:23:09Z

[ollama] Bug Description When using Ollama 0.21.2 with an NVIDIA RTX 4060 8GB VRAM on Windows 11, the model runner crashes with a CUDA illegal memory access er… ## Bug Description When using Ollama 0.21.2 with an NVIDIA RTX 4060 (8GB VRAM) on Windows 11, the model runner crashes with a CUDA illegal memory access error specifically when handling tool-calling requests. Simple inference requests (e.g., `ollama run qwen2.5:7b "say hello"`) work perfectly fine, but any request involving tool use causes an immediate crash. ## Error Message CUDA error: an illegal memory access was encountered current device: 0, in function ggml_backend_cuda_synchronize at ggml-cuda.cu:2981 cudaStreamSynchronize(cuda_ctx->stream()) ggml-cuda.cu:94: CUDA error ## System Information - **OS:** Windows 11 - **GPU:** NVIDIA GeForce RTX 4060 8GB VRAM - **GPU Driver Version:** 591.86 (also tested with latest driver, same issue) - **CUDA Version:** 13.1 - **Ollama Version:** 0.21.2 - **RAM:** 32GB - **Model:** qwen2.5:7b ## Steps to Reproduce 1. Install Ollama 0.21.2 on Windows 11 with RTX 4060 2. Pull qwen2.5:7b 3. Run a simple inference request — works fine 4. Run any tool-calling request (e.g., via OpenClaw gateway) — crashes immediately with CUDA illegal memory access ## Expected Behavior Tool-calling requests should execute successfully using GPU acceleration without crashing the model runner. ## Actual Behavior Model runner crashes immediately on tool-calling requests with CUDA illegal memory access error. The runner terminates with exit status 1. Simple inference works fine — the crash is specifically triggered by tool-calling workloads. ## Additional Context - Downgrading to Ollama 0.19.0 does not fix the issue — that version fails to discover the GPU entirely (timeout during GPU discovery) - Setting OLLAMA_NUM_GPU=0 to force CPU mode resolves the crash but makes the tool unusable due to extremely slow inference - Setting OLLAMA_GPU_OVERHEAD to 1GB and 2GB does not resolve the issue - The crash occurs consistently and reproducibly every single time a tool-calling request is made - Tested with both Game Ready Driver 591.86 and latest available driver — same result on both - Issue appears to be specific to tool-calling inference patterns that use more complex CUDA memory access patterns than simple generation

ollama2026-04-28 20:23:09

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

ollama/ollama#15863•Fetched 2026-04-29 06:11:43

View on GitHub

Comments

Participants

Timeline

Reactions

Author

cryptrix1598

Participants

cryptrix1598

Error Message

CUDA error: an illegal memory access was encountered current device: 0, in function ggml_backend_cuda_synchronize at ggml-cuda.cu:2981 cudaStreamSynchronize(cuda_ctx->stream()) ggml-cuda.cu:94: CUDA error

RAW_BUFFERClick to expand / collapse

Bug Description

When using Ollama 0.21.2 with an NVIDIA RTX 4060 (8GB VRAM) on Windows 11, the model runner crashes with a CUDA illegal memory access error specifically when handling tool-calling requests. Simple inference requests (e.g., ollama run qwen2.5:7b "say hello") work perfectly fine, but any request involving tool use causes an immediate crash.

Error Message

System Information

OS: Windows 11
GPU: NVIDIA GeForce RTX 4060 8GB VRAM
GPU Driver Version: 591.86 (also tested with latest driver, same issue)
CUDA Version: 13.1
Ollama Version: 0.21.2
RAM: 32GB
Model: qwen2.5:7b

Steps to Reproduce

Install Ollama 0.21.2 on Windows 11 with RTX 4060
Pull qwen2.5:7b
Run a simple inference request — works fine
Run any tool-calling request (e.g., via OpenClaw gateway) — crashes immediately with CUDA illegal memory access

Expected Behavior

Tool-calling requests should execute successfully using GPU acceleration without crashing the model runner.

Actual Behavior

Model runner crashes immediately on tool-calling requests with CUDA illegal memory access error. The runner terminates with exit status 1. Simple inference works fine — the crash is specifically triggered by tool-calling workloads.

Additional Context

Downgrading to Ollama 0.19.0 does not fix the issue — that version fails to discover the GPU entirely (timeout during GPU discovery)
Setting OLLAMA_NUM_GPU=0 to force CPU mode resolves the crash but makes the tool unusable due to extremely slow inference
Setting OLLAMA_GPU_OVERHEAD to 1GB and 2GB does not resolve the issue
The crash occurs consistently and reproducibly every single time a tool-calling request is made
Tested with both Game Ready Driver 591.86 and latest available driver — same result on both
Issue appears to be specific to tool-calling inference patterns that use more complex CUDA memory access patterns than simple generation

extent analysis

TL;DR

The most likely fix for the CUDA illegal memory access error in Ollama 0.21.2 with NVIDIA RTX 4060 is to investigate and adjust the memory allocation or CUDA settings to accommodate the more complex memory access patterns of tool-calling requests.

Guidance

Verify that the issue persists across different CUDA versions and GPU driver updates to rule out version-specific bugs.
Experiment with adjusting the OLLLAMA_GPU_OVERHEAD setting to a higher value than 2GB to see if increased GPU memory allocation resolves the crash.
Consider testing the model runner with a different GPU model or a reduced workload to isolate if the issue is specific to the RTX 4060 or the tool-calling requests.
Review the CUDA documentation and Ollama source code (if available) to understand the memory access patterns and potential bottlenecks in the ggml_backend_cuda_synchronize function.

Example

No specific code example can be provided without modifying the Ollama source code or CUDA settings, which is beyond the scope of this analysis.

Notes

The issue appears to be related to the specific memory access patterns of tool-calling requests, which may exceed the default memory allocation or cause conflicts with the CUDA stream synchronization. Further investigation into the CUDA settings, GPU memory allocation, and Ollama configuration is necessary to resolve the issue.

Recommendation

Apply a workaround by adjusting the OLLLAMA_GPU_OVERHEAD setting or exploring alternative CUDA settings to accommodate the complex memory access patterns of tool-calling requests, as upgrading to a fixed version is not implied in the given issue context.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#prompt template #agent execution #callback error #memory management #API rate limit

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

ollama - 💡(How to fix) Fix RTX 4060 CUDA illegal memory access on tool-calling requests (0.21.2) [1 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Bug Description

Error Message

System Information

Steps to Reproduce

Expected Behavior

Actual Behavior

Additional Context

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

TRENDING

ollama - 💡(How to fix) Fix RTX 4060 CUDA illegal memory access on tool-calling requests (0.21.2) [1 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Bug Description

Error Message

System Information

Steps to Reproduce

Expected Behavior

Actual Behavior

Additional Context

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING