ollama - 💡(How to fix) Fix granite4.1 models ignoring Ollama default context window size on Ollama 0.22.0 [1 participants]

GitHub stats
ollama/ollama#15906 (fetched 2026-05-01 05:33:22)
Comments: 0
Participants: 1
Timeline: 1 (labeled ×1)
Reactions: 0


What is the issue?

I was running granite4.1:30b (17 GB) and noticed it was slow given the GPUs I have. When I ran ollama ps, I saw that the model was using 97 GB. I only have roughly 48 GB of VRAM, so the model spilled over to the CPU, which is where the slowness came from. I believe this is because the context size is set to 131072 rather than Ollama's default context size (8k tokens?). I reproduced this both with a simple Python app using the ollama library and by making sure no models were loaded and then running ollama run granite4.1:30b.

I tested the smaller granite4.1:8b model as well, and this 5.3 GB model took 55 GB once loaded. This is also because it is using the whole allowable context window rather than the default size.

It would be nice if the granite4.1 models used the default context size when none is specified, so that they fit on my GPUs.
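The sizes quoted in the report can be checked with some quick arithmetic. The sketch below takes the reported weights size, the loaded size from ollama ps, and the roughly 48 GB of VRAM, then estimates how much of the footprint is not weights and what the footprint might be at 8192 tokens. It assumes everything beyond the weights scales linearly with context length, which is a simplification, so treat the projection as a rough guide only. The function name is hypothetical.

```python
def project_loaded_size(weights_gb, loaded_gb, loaded_ctx, target_ctx):
    """Rough projection of the loaded size at a different context length.

    Assumption: all memory beyond the weights scales linearly with context.
    """
    overhead_gb = loaded_gb - weights_gb
    per_token_gb = overhead_gb / loaded_ctx
    return weights_gb + per_token_gb * target_ctx


VRAM_GB = 48  # "roughly 48 GB" from the report (two RTX 4090 cards)

# (weights GB, loaded GB at 131072 tokens), as quoted in the report
models = {
    "granite4.1:30b": (17.0, 97.0),
    "granite4.1:8b": (5.3, 55.0),
}

for name, (weights, loaded) in models.items():
    projected = project_loaded_size(weights, loaded, 131072, 8192)
    fits = "fits" if projected <= VRAM_GB else "does not fit"
    print(f"{name}: {loaded - weights:.1f} GB beyond weights at 131072 tokens; "
          f"about {projected:.1f} GB at 8192 tokens ({fits} in {VRAM_GB} GB)")
```

Under that linear assumption, the 30b model would need about 22 GB and the 8b model about 8.4 GB at 8192 tokens, both within the available VRAM.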

Relevant log output

$ nvidia-smi
Thu Apr 30 11:23:46 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:41:00.0 Off |                  Off |
|  0%   43C    P8             20W /  450W |      18MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off |   00000000:83:00.0  On |                  Off |
|  0%   48C    P8             18W /  450W |     104MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      3741      G   /usr/bin/gnome-shell                            6MiB |
|    1   N/A  N/A      3741      G   /usr/bin/gnome-shell                           66MiB |
|    1   N/A  N/A      3829      G   /usr/bin/Xwayland                               8MiB |
+-----------------------------------------------------------------------------------------+

$ ollama ps
NAME              ID              SIZE     PROCESSOR          CONTEXT    UNTIL
granite4.1:30b    3f3e5df8a021    97 GB    52%/48% CPU/GPU    131072     4 minutes from now


$ ollama ps
NAME             ID              SIZE     PROCESSOR          CONTEXT    UNTIL
granite4.1:8b    444af1c4b2fe    55 GB    15%/85% CPU/GPU    131072     4 minutes from now
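
For scripted checks, the ollama ps table above can be parsed as plain text. This is a minimal sketch that assumes columns are separated by two or more spaces, as in the output shown here; the helper name is hypothetical and the format may differ in other Ollama versions.

```python
import re


def parse_ollama_ps(text):
    """Parse the table printed by `ollama ps` into a list of dicts.

    Assumption: columns are separated by two or more spaces.
    """
    rows = []
    for line in text.strip().splitlines():
        cells = re.split(r"\s{2,}", line.strip())
        # Skip the header row and anything that is not a full data row.
        if cells[0] == "NAME" or len(cells) < 6:
            continue
        name, model_id, size, processor, context, until = cells[:6]
        rows.append({
            "name": name,
            "id": model_id,
            "size": size,
            "processor": processor,
            "context": int(context),
            "until": until,
        })
    return rows


sample = """\
NAME              ID              SIZE     PROCESSOR          CONTEXT    UNTIL
granite4.1:30b    3f3e5df8a021    97 GB    52%/48% CPU/GPU    131072     4 minutes from now
"""

for row in parse_ollama_ps(sample):
    spilled = "CPU" in row["processor"]
    print(row["name"], row["context"], "partly on CPU" if spilled else "no CPU share listed")
```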

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.22.0

extent analysis

TL;DR

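A likely workaround is to set the context size explicitly (for example 8192 tokens, the value the reporter assumed to be the default) when running the granite4.1 models, instead of letting them load with the full 131072-token window.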

Guidance

  • Verify the context size the models are currently using by running ollama ps and checking the CONTEXT column.
  • Try setting the context size explicitly to 8k tokens (num_ctx 8192), for example with /set parameter num_ctx 8192 inside an ollama run session, with the num_ctx option in an API request, or with the OLLAMA_CONTEXT_LENGTH environment variable on the server.
  • Monitor memory usage after setting the context size (the SIZE and PROCESSOR columns of ollama ps) to confirm the model fits within the available VRAM.
  • Test the smaller granite4.1:8b model with the same explicit context size to confirm that it also fits.
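
The API route in the guidance above can be sketched as follows. The example builds, but does not send, a request to Ollama's /api/generate endpoint with an explicit num_ctx in the options field. The URL assumes a default local install, and the helper name and prompt are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model, prompt, num_ctx=8192):
    """Build (but do not send) a generate request with an explicit context size."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # overrides the context length for this request
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_generate_request("granite4.1:30b", "Hello")
print(req.full_url)
print(req.data.decode("utf-8"))

# To actually send it (requires a running Ollama server):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["response"])
```

After a request like this has loaded the model, ollama ps should report the requested value in the CONTEXT column, which is a quick way to confirm the override took effect.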

Notes

Both ollama ps listings show a CONTEXT of 131072, and the loaded sizes (97 GB and 55 GB) far exceed the model weights (17 GB and 5.3 GB). This suggests that context size is the primary cause of the issue, but further testing may be necessary to confirm it.

Recommendation

Apply the workaround: set the context size explicitly (8k tokens) when running the granite4.1 models. This is likely to let the models fit within the available VRAM.
