ollama - ✅(Solved) Fix Error in ollama using open-notebook [1 pull requests, 4 comments, 3 participants]

ollama2026-03-04 10:31:19

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

ollama/ollama#14615•Fetched 2026-04-08 00:33:46

View on GitHub

Comments

Participants

Timeline

Reactions

Author

Participants

Timeline (top)

commented ×4cross-referenced ×3closed ×1labeled ×1

Error Message

print_info: ssm_dt_b_c_rms = 0 print_info: model type = 1B print_info: model params = 6.94 B print_info: general.name = Granite 4.0 H Tiny print_info: f_embedding_scale = 12.000000 print_info: f_residual_scale = 0.220000 print_info: f_attention_scale = 0.007813 print_info: n_ff_shexp = 1024 print_info: vocab type = BPE print_info: n_vocab = 100352 print_info: n_merges = 100000 print_info: BOS token = 100257 '<|end_of_text|>' print_info: EOS token = 100257 '<|end_of_text|>' print_info: EOT token = 100257 '<|end_of_text|>' print_info: UNK token = 100269 '<|unk|>' print_info: PAD token = 100256 '<|pad|>' print_info: LF token = 198 'Ċ' print_info: FIM PRE token = 100258 '<|fim_prefix|>' print_info: FIM SUF token = 100260 '<|fim_suffix|>' print_info: FIM MID token = 100259 '<|fim_middle|>' print_info: FIM PAD token = 100261 '<|fim_pad|>' print_info: EOG token = 100257 '<|end_of_text|>' print_info: EOG token = 100261 '<|fim_pad|>' print_info: max token length = 256 load_tensors: loading model tensors, this can take a while... (mmap = false) load_tensors: offloading 10 repeating layers to GPU load_tensors: offloaded 10/41 layers to GPU load_tensors: CPU model buffer size = 120.59 MiB load_tensors: CUDA0 model buffer size = 1003.36 MiB load_tensors: CUDA_Host model buffer size = 3028.19 MiB time=2026-03-04T09:37:10.728+01:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server not responding" ggml_cuda_host_malloc: failed to allocate 1.00 MiB of pinned memory: out of memory CUDA error: out of memory current device: 0, in function ggml_backend_cuda_device_event_new at C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:4957 cudaEventCreateWithFlags(&event, 0x02) C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:94: CUDA error time=2026-03-04T09:37:23.287+01:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model" time=2026-03-04T09:37:24.385+01:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server not responding" time=2026-03-04T09:37:26.735+01:00 level=ERROR source=server.go:303 msg="llama runner terminated" error="exit status 1" time=2026-03-04T09:37:26.902+01:00 level=INFO source=sched.go:518 msg="Load failed" model=C:\Users\user.ollama\models\blobs\sha256-491ba81786c46a345a5da9a60cdb9f9a3056960c8411dd857153c194b1f91313 error="llama runner process has terminated: CUDA error" [GIN] 2026/03/04 - 09:37:27 | 500 | 29.7846277s | 127.0.0.1 | POST "/api/chat"

Fix Action

Fixed

Fixed by PR: cuda: graceful OOM fallback when creating events during partial GPU offload (https://github.com/ollama/ollama/pull/14620)

PR fix notes

PR #14620: cuda: graceful OOM fallback when creating events during partial GPU offload

Repository: ollama/ollama
Author: ssam18
State: open | merged: False
Link: https://github.com/ollama/ollama/pull/14620

Description (problem / solution / changelog)

When the model is too large to fit entirely on the GPU and some portions are transferred to the host system, then the CUDA Host Buffer for transfering data from CPU to GPU will run out of pinned memory resources on the host system. After that occurs, calling cudaEventCreateWithFlags() will fail -- like ggml_cuda_host_malloc(), it has a graceful recovery path when this happens. However, cudaEventCreateWithFlags() was wrapped in CUDA_CHECK() a fatal abort macro. Therefore, when the CUDA_HOST_BUFFER runs out of pinned memory resources, the cudaEventCreateWithFlags() call will result in a "CUDA error: out of memory" message when the llama runner terminates with a status code of 500. The error is reproducible using command-r7b:latest on a system with limited pinned memory resources as described in #14615.

This patch substitutes the fatal CUDA_CHECK() with an error checking routine that first resets the CUDA error state, writes a log warning and returns nullptr. This is identical to what ggml_cuda_host_malloc() does a couple hundred lines before, and returning nullptr is safe since the GGML Backend Scheduler performs a null check on every event usage, therefore instead of terminating the process, cross-device synchronization of events is simply skipped.

Changed files

llama/patches/0035-ggml-cuda-graceful-oom-fallback-for-event-creation.patch (added, +41/-0)

Code Example

print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 1B
print_info: model params     = 6.94 B
print_info: general.name     = Granite 4.0 H Tiny
print_info: f_embedding_scale = 12.000000
print_info: f_residual_scale  = 0.220000
print_info: f_attention_scale = 0.007813
print_info: n_ff_shexp        = 1024
print_info: vocab type       = BPE
print_info: n_vocab          = 100352
print_info: n_merges         = 100000
print_info: BOS token        = 100257 '<|end_of_text|>'
print_info: EOS token        = 100257 '<|end_of_text|>'
print_info: EOT token        = 100257 '<|end_of_text|>'
print_info: UNK token        = 100269 '<|unk|>'
print_info: PAD token        = 100256 '<|pad|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 100258 '<|fim_prefix|>'
print_info: FIM SUF token    = 100260 '<|fim_suffix|>'
print_info: FIM MID token    = 100259 '<|fim_middle|>'
print_info: FIM PAD token    = 100261 '<|fim_pad|>'
print_info: EOG token        = 100257 '<|end_of_text|>'
print_info: EOG token        = 100261 '<|fim_pad|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 10 repeating layers to GPU
load_tensors: offloaded 10/41 layers to GPU
load_tensors:          CPU model buffer size =   120.59 MiB
load_tensors:        CUDA0 model buffer size =  1003.36 MiB
load_tensors:    CUDA_Host model buffer size =  3028.19 MiB
time=2026-03-04T09:37:10.728+01:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server not responding"
ggml_cuda_host_malloc: failed to allocate 1.00 MiB of pinned memory: out of memory
CUDA error: out of memory
  current device: 0, in function ggml_backend_cuda_device_event_new at C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:4957
  cudaEventCreateWithFlags(&event, 0x02)
C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:94: CUDA error
time=2026-03-04T09:37:23.287+01:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model"
time=2026-03-04T09:37:24.385+01:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server not responding"
time=2026-03-04T09:37:26.735+01:00 level=ERROR source=server.go:303 msg="llama runner terminated" error="exit status 1"
time=2026-03-04T09:37:26.902+01:00 level=INFO source=sched.go:518 msg="Load failed" model=C:\Users\user\.ollama\models\blobs\sha256-491ba81786c46a345a5da9a60cdb9f9a3056960c8411dd857153c194b1f91313 error="llama runner process has terminated: CUDA error"
[GIN] 2026/03/04 - 09:37:27 | 500 |   29.7846277s |       127.0.0.1 | POST     "/api/chat"

RAW_BUFFERClick to expand / collapse

What is the issue?

When I use open-notebook, it allways gives error 500 in just 30 seconds on creating a insight. But if I use the "ollama run" featur, only with commandr-7b:latest gives this errors. I tried qwen3.5:4b, granite3.1-dense:2b, granite4:tiny-h and lfm2.5-thinking without errors. Look: PS C:\Users\user> ollama run command-r7b:latest "Summarize this text in 3 bullet points:

INPUT

Neural networks are computing systems inspired by biological neural networks..." Error: 500 Internal Server Error: llama runner process has terminated: CUDA error PS C:\Users\user> ollama run command-r7b:latest "Summarize this text in 3 bullet points:

INPUT

Neural networks are computing systems inspired by biological neural networks..."^C PS C:\Users\user> ollama rm command-r7b:latest deleted 'command-r7b:latest' PS C:\Users\user> ollama pull command-r7b:latest pulling manifest pulling b32d935e114c: 100% ▕███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 5.1 GB pulling 0d8282caa612: 100% ▕███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 7.2 KB pulling 945eaa8b1428: 100% ▕███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 13 KB pulling d8455b5dce0b: 100% ▕███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 110 B pulling 574fdc7616e8: 100% ▕███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 491 B verifying sha256 digest writing manifest success PS C:\Users\user> ollama run command-r7b:latest "Summarize this text in 3 bullet points:

INPUT

Neural networks are computing systems inspired by biological neural networks..." Error: 500 Internal Server Error: llama runner process has terminated: CUDA error PS C:\Users\user> ollama ps NAME ID SIZE PROCESSOR CONTEXT UNTIL

Relevant log output

print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 1B
print_info: model params     = 6.94 B
print_info: general.name     = Granite 4.0 H Tiny
print_info: f_embedding_scale = 12.000000
print_info: f_residual_scale  = 0.220000
print_info: f_attention_scale = 0.007813
print_info: n_ff_shexp        = 1024
print_info: vocab type       = BPE
print_info: n_vocab          = 100352
print_info: n_merges         = 100000
print_info: BOS token        = 100257 '<|end_of_text|>'
print_info: EOS token        = 100257 '<|end_of_text|>'
print_info: EOT token        = 100257 '<|end_of_text|>'
print_info: UNK token        = 100269 '<|unk|>'
print_info: PAD token        = 100256 '<|pad|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 100258 '<|fim_prefix|>'
print_info: FIM SUF token    = 100260 '<|fim_suffix|>'
print_info: FIM MID token    = 100259 '<|fim_middle|>'
print_info: FIM PAD token    = 100261 '<|fim_pad|>'
print_info: EOG token        = 100257 '<|end_of_text|>'
print_info: EOG token        = 100261 '<|fim_pad|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 10 repeating layers to GPU
load_tensors: offloaded 10/41 layers to GPU
load_tensors:          CPU model buffer size =   120.59 MiB
load_tensors:        CUDA0 model buffer size =  1003.36 MiB
load_tensors:    CUDA_Host model buffer size =  3028.19 MiB
time=2026-03-04T09:37:10.728+01:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server not responding"
ggml_cuda_host_malloc: failed to allocate 1.00 MiB of pinned memory: out of memory
CUDA error: out of memory
  current device: 0, in function ggml_backend_cuda_device_event_new at C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:4957
  cudaEventCreateWithFlags(&event, 0x02)
C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:94: CUDA error
time=2026-03-04T09:37:23.287+01:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model"
time=2026-03-04T09:37:24.385+01:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server not responding"
time=2026-03-04T09:37:26.735+01:00 level=ERROR source=server.go:303 msg="llama runner terminated" error="exit status 1"
time=2026-03-04T09:37:26.902+01:00 level=INFO source=sched.go:518 msg="Load failed" model=C:\Users\user\.ollama\models\blobs\sha256-491ba81786c46a345a5da9a60cdb9f9a3056960c8411dd857153c194b1f91313 error="llama runner process has terminated: CUDA error"
[GIN] 2026/03/04 - 09:37:27 | 500 |   29.7846277s |       127.0.0.1 | POST     "/api/chat"

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.17.5

extent analysis

Fix Plan

The issue is caused by a CUDA out-of-memory error. To fix this, we need to reduce the memory usage of the model or increase the available GPU memory.

Step 1: Reduce Model Memory Usage

We can try to reduce the model size or use a more efficient model architecture. However, since the model is already loaded, we can try to reduce the batch size or sequence length to reduce memory usage.

Step 2: Increase Available GPU Memory

We can try to close other GPU-intensive applications or increase the GPU memory allocation for the Ollama process.

Step 3: Modify Ollama Configuration

We can modify the Ollama configuration to use a smaller model or reduce the GPU memory allocation. We can add the following configuration options to the ollama.yaml file:

model:
  size: small
gpu:
  memory_allocation: 2048

Alternatively, we can use the --gpu-memory flag when running the Ollama command:

ollama run --gpu-memory 2048 command-r7b:latest "Summarize this text in 3 bullet points: ..."

Step 4: Update Ollama Version

If the issue persists, we can try updating Ollama to the latest version:

ollama update

Verification

To verify that the fix worked, we can run the Ollama command again and check for any errors:

ollama run command-r7b:latest "Summarize this text in 3 bullet points: ..."

If the command runs successfully without any errors, the fix has worked.

Extra Tips

Make sure to close other GPU-intensive applications before running Ollama.
Consider using a more efficient model architecture or reducing the model size to reduce memory usage.
If the issue persists, try increasing the GPU memory allocation or using a different GPU device.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #ssr #installation #tensor shape #autograd error #model save/load #optimization #mixed precision

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

ollama - ✅(Solved) Fix Error in ollama using open-notebook [1 pull requests, 4 comments, 3 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Fix Action

Fixed

PR fix notes

PR #14620: cuda: graceful OOM fallback when creating events during partial GPU offload

Description (problem / solution / changelog)

Changed files

Code Example

What is the issue?

INPUT

INPUT

INPUT

Relevant log output

OS

GPU

CPU

Ollama version

extent analysis

Fix Plan

Step 1: Reduce Model Memory Usage

Step 2: Increase Available GPU Memory

Step 3: Modify Ollama Configuration

Step 4: Update Ollama Version

Verification

Extra Tips

Still need to ship something?

RELATED_DISCOVERY

TRENDING