ollama - 💡(How to fix) Fix RTX 4060 CUDA illegal memory access on tool-calling requests (0.21.2) [1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
ollama/ollama#15863Fetched 2026-04-29 06:11:43
View on GitHub
Comments
0
Participants
1
Timeline
0
Reactions
0
Participants

Error Message

CUDA error: an illegal memory access was encountered current device: 0, in function ggml_backend_cuda_synchronize at ggml-cuda.cu:2981 cudaStreamSynchronize(cuda_ctx->stream()) ggml-cuda.cu:94: CUDA error

RAW_BUFFERClick to expand / collapse

Bug Description

When using Ollama 0.21.2 with an NVIDIA RTX 4060 (8GB VRAM) on Windows 11, the model runner crashes with a CUDA illegal memory access error specifically when handling tool-calling requests. Simple inference requests (e.g., ollama run qwen2.5:7b "say hello") work perfectly fine, but any request involving tool use causes an immediate crash.

Error Message

CUDA error: an illegal memory access was encountered current device: 0, in function ggml_backend_cuda_synchronize at ggml-cuda.cu:2981 cudaStreamSynchronize(cuda_ctx->stream()) ggml-cuda.cu:94: CUDA error

System Information

  • OS: Windows 11
  • GPU: NVIDIA GeForce RTX 4060 8GB VRAM
  • GPU Driver Version: 591.86 (also tested with latest driver, same issue)
  • CUDA Version: 13.1
  • Ollama Version: 0.21.2
  • RAM: 32GB
  • Model: qwen2.5:7b

Steps to Reproduce

  1. Install Ollama 0.21.2 on Windows 11 with RTX 4060
  2. Pull qwen2.5:7b
  3. Run a simple inference request — works fine
  4. Run any tool-calling request (e.g., via OpenClaw gateway) — crashes immediately with CUDA illegal memory access

Expected Behavior

Tool-calling requests should execute successfully using GPU acceleration without crashing the model runner.

Actual Behavior

Model runner crashes immediately on tool-calling requests with CUDA illegal memory access error. The runner terminates with exit status 1. Simple inference works fine — the crash is specifically triggered by tool-calling workloads.

Additional Context

  • Downgrading to Ollama 0.19.0 does not fix the issue — that version fails to discover the GPU entirely (timeout during GPU discovery)
  • Setting OLLAMA_NUM_GPU=0 to force CPU mode resolves the crash but makes the tool unusable due to extremely slow inference
  • Setting OLLAMA_GPU_OVERHEAD to 1GB and 2GB does not resolve the issue
  • The crash occurs consistently and reproducibly every single time a tool-calling request is made
  • Tested with both Game Ready Driver 591.86 and latest available driver — same result on both
  • Issue appears to be specific to tool-calling inference patterns that use more complex CUDA memory access patterns than simple generation

extent analysis

TL;DR

The most likely fix for the CUDA illegal memory access error in Ollama 0.21.2 with NVIDIA RTX 4060 is to investigate and adjust the memory allocation or CUDA settings to accommodate the more complex memory access patterns of tool-calling requests.

Guidance

  • Verify that the issue persists across different CUDA versions and GPU driver updates to rule out version-specific bugs.
  • Experiment with adjusting the OLLLAMA_GPU_OVERHEAD setting to a higher value than 2GB to see if increased GPU memory allocation resolves the crash.
  • Consider testing the model runner with a different GPU model or a reduced workload to isolate if the issue is specific to the RTX 4060 or the tool-calling requests.
  • Review the CUDA documentation and Ollama source code (if available) to understand the memory access patterns and potential bottlenecks in the ggml_backend_cuda_synchronize function.

Example

No specific code example can be provided without modifying the Ollama source code or CUDA settings, which is beyond the scope of this analysis.

Notes

The issue appears to be related to the specific memory access patterns of tool-calling requests, which may exceed the default memory allocation or cause conflicts with the CUDA stream synchronization. Further investigation into the CUDA settings, GPU memory allocation, and Ollama configuration is necessary to resolve the issue.

Recommendation

Apply a workaround by adjusting the OLLLAMA_GPU_OVERHEAD setting or exploring alternative CUDA settings to accommodate the complex memory access patterns of tool-calling requests, as upgrading to a fixed version is not implied in the given issue context.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING