ollama: Why is Gemma4:26b performance significantly slow on Ollama? [3 comments, 2 participants]

GitHub stats
ollama/ollama#15353 (fetched 2026-04-08 02:52:19)
Comments: 3 · Participants: 2 · Reactions: 0
Timeline: 5 (commented ×3, closed ×1, labeled ×1)

Root Cause

Update: After some investigation, I noticed that llama.cpp appears to achieve faster inference because it runs multimodal models in a "text-only" mode when no image is provided (its startup banner reports "modalities : text"). I am not certain whether Unsloth Studio uses a similar mechanism, but in my tests both were about 2x faster than Ollama.

Ollama server log (model load)

Apr 05 22:31:09 ubuntumainserver ollama[4010283]: [GIN] 2026/04/05 - 22:31:09 | 200 |       20.29µs |       127.0.0.1 | HEAD     "/"
Apr 05 22:31:09 ubuntumainserver ollama[4010283]: [GIN] 2026/04/05 - 22:31:09 | 200 |  226.505134ms |       127.0.0.1 | POST     "/api/show"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.117Z level=INFO source=server.go:432 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 38853"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.445Z level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.445Z level=INFO source=server.go:247 msg="enabling flash attention"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.445Z level=INFO source=server.go:432 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /usr/share/ollama/.ollama/models/blobs/sha256-7121486771cbfe218851513210c40b35dbdee93ab1ef43fe36283c883980f0df --port 42019"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.445Z level=INFO source=sched.go:484 msg="system memory" total="61.9 GiB" free="57.5 GiB" free_swap="7.1 GiB"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.445Z level=INFO source=sched.go:491 msg="gpu memory" id=GPU-0aa928c0-ece6-7698-4db1-ac130bfe47b7 library=CUDA available="30.9 GiB" free="31.4 GiB" minimum="457.0 MiB" overhead="0 B"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.445Z level=INFO source=server.go:759 msg="loading model" "model layers"=31 requested=-1
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.457Z level=INFO source=runner.go:1417 msg="starting ollama engine"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.457Z level=INFO source=runner.go:1452 msg="Server listening on 127.0.0.1:42019"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.467Z level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:16 GPULayers:31[ID:GPU-0aa928c0-ece6-7698-4db1-ac130bfe47b7 Layers:31(0..30)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.519Z level=INFO source=ggml.go:136 msg="" architecture=gemma4 file_type=Q4_K_M name="" description="" num_tensors=1014 num_key_values=52
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: ggml_cuda_init: found 1 CUDA devices:
Apr 05 22:31:10 ubuntumainserver ollama[4010283]:   Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, ID: GPU-0aa928c0-ece6-7698-4db1-ac130bfe47b7
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v13/libggml-cuda.so
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.671Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.675Z level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.688Z level=INFO source=model.go:138 msg="vision: decode" elapsed=3.098772ms bounds=(0,0)-(2048,2048)
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.763Z level=INFO source=model.go:145 msg="vision: preprocess" elapsed=75.059281ms size="[768 768]"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.763Z level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.763Z level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.763Z level=INFO source=model.go:156 msg="vision: encoded" elapsed=78.862504ms shape="[2816 256]"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.118Z level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:16 GPULayers:31[ID:GPU-0aa928c0-ece6-7698-4db1-ac130bfe47b7 Layers:31(0..30)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.161Z level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.168Z level=INFO source=model.go:138 msg="vision: decode" elapsed=308.777µs bounds=(0,0)-(2048,2048)
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.231Z level=INFO source=model.go:145 msg="vision: preprocess" elapsed=62.685692ms size="[768 768]"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.233Z level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.233Z level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.234Z level=INFO source=model.go:156 msg="vision: encoded" elapsed=65.965872ms shape="[2816 256]"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=runner.go:1290 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:16 GPULayers:31[ID:GPU-0aa928c0-ece6-7698-4db1-ac130bfe47b7 Layers:31(0..30)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=ggml.go:482 msg="offloading 30 repeating layers to GPU"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=ggml.go:494 msg="offloaded 31/31 layers to GPU"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="16.6 GiB"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=device.go:245 msg="model weights" device=CPU size="667.5 MiB"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="1.0 GiB"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="318.7 MiB"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="72.0 MiB"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=device.go:272 msg="total memory" size="18.7 GiB"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=sched.go:561 msg="loaded runners" count=1
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=server.go:1352 msg="waiting for llama runner to start responding"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=server.go:1386 msg="waiting for server to become available" status="llm server loading model"
Apr 05 22:31:12 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:12.794Z level=INFO source=server.go:1390 msg="llama runner started in 2.35 seconds"
Apr 05 22:31:15 ubuntumainserver ollama[4010283]: [GIN] 2026/04/05 - 22:31:15 | 200 |   5.93703499s |       127.0.0.1 | POST     "/api/generate"
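
The "vision:" lines in the log above appear during the fit and alloc load operations, before the "/api/generate" request completes, which suggests (as an interpretation, not something the log states) that the vision path is exercised at load time even for a text-only prompt. The short sketch below only checks the patch arithmetic those lines report; the function name `patch_grid` is made up for illustration.

```python
def patch_grid(image_side: int, patch_size: int) -> tuple[int, int]:
    """Return (patches per side, total patches) for a square image."""
    per_side = image_side // patch_size
    return per_side, per_side * per_side


# Values copied from the "vision:" log lines: size="[768 768]", patchSize=16.
per_side, total = patch_grid(image_side=768, patch_size=16)
print(per_side, total)  # 48 2304, matching patchesX=48 patchesY=48 total=2304
```

The encoded shape is logged as "[2816 256]". If 256 is the number of image tokens, that would be a 9:1 reduction from 2,304 patches, but the log does not say so explicitly.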


What is the issue?

I am seeing a significant performance gap when running the Gemma4:26b model on Ollama. In my tests, Ollama generates about 106 tokens/s, roughly half the rate of other inference engines on near-identical hardware (llama.cpp reaches about 212 tokens/s).

Update: After some investigation, I noticed that llama.cpp appears to achieve faster inference because it runs multimodal models in a "text-only" mode when no image is provided (its startup banner reports "modalities : text"). I am not certain whether Unsloth Studio uses a similar mechanism, but in my tests both were about 2x faster than Ollama.

Rather than a bug, please consider this a feature request: it would be a significant improvement if Ollama offered an optional "text-only" loading mode for multimodal models. When no image processing is required, bypassing the vision encoder and related modules could match the roughly 2x speedup seen in llama.cpp, a substantial gain for users with high-end hardware such as the RTX 5090.

Comparison

llama.cpp: about 2x faster than Ollama (212.5 vs 105.66 tokens/s generation in the runs below).
Unsloth Studio: about 2x faster than Ollama.
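
To make the Ollama side of this comparison repeatable, the generation rate can also be read from the REST API instead of the CLI. The sketch below is a minimal example, assuming a local server on the default port 11434; the helper names (`tokens_per_second`, `summarize`, `benchmark`) are made up for illustration. It relies on the timing fields that a non-streaming "/api/generate" response returns, with durations in nanoseconds.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(count: int, duration_ns: int) -> float:
    """Convert a token count and a nanosecond duration into tokens/s."""
    return count / (duration_ns / 1e9) if duration_ns else 0.0


def summarize(resp: dict) -> dict:
    """Extract prompt and generation rates from an /api/generate response."""
    return {
        "prompt_tps": tokens_per_second(
            resp.get("prompt_eval_count", 0), resp.get("prompt_eval_duration", 0)
        ),
        "eval_tps": tokens_per_second(
            resp.get("eval_count", 0), resp.get("eval_duration", 0)
        ),
    }


def benchmark(model: str, prompt: str) -> dict:
    """Send one non-streaming request and return the measured rates."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as r:
        return summarize(json.load(r))
```

Calling `benchmark("gemma4:26b", "5+8/16=?")` a few times and discarding the first run (which includes model load) should give a steadier figure than a single CLI invocation.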

Environment

I ran the benchmarks on two separate Ubuntu machines with nearly identical high-end specifications:

GPU: NVIDIA RTX 5090
CPU: AMD Ryzen 9 7950X3D
RAM: 64GB
OS: Ubuntu 24.04.3 LTS

Ollama

test@ubuntumainserver:~$ nvidia-smi
Sun Apr  5 22:18:01 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        On  |   00000000:01:00.0 Off |                  N/A |
|  0%   40C    P8              9W /  450W |   19032MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A         1643537      C   /usr/local/bin/ollama                 19022MiB |
+-----------------------------------------------------------------------------------------+
test@ubuntumainserver:~$ ollama run gemma4:26b "5+8/16=?" --verbose
Thinking...
The user is asking for the result of the arithmetic expression $5 + \frac{8}{16}$.

    *   Order of operations (PEMDAS/BODMAS) states that Division/Multiplication comes before Addition/Subtraction.
    *   Expression: $5 + (8 \div 16)$

    *   $8 / 16 = \frac{8}{16}$
    *   Simplify the fraction: $\frac{8}{16} = \frac{1}{2}$
    *   Decimal form: $0.5$

    *   $5 + 0.5 = 5.5$
    *   Fraction form: $5 \frac{1}{2}$

    *   Show the simplified fraction version.
    *   Show the decimal version.
...done thinking.

To solve **5 + 8/16**, follow the order of operations (division before addition):

1.  **Simplify the fraction:**
    $8/16$ can be simplified by dividing both the numerator and denominator by 8.
    $8 \div 8 = 1$
    $16 \div 8 = 2$
    So, $8/16 = 1/2$ (or **0.5** in decimal form).

2.  **Add to 5:**
    $5 + 0.5 = 5.5$

**Final Answer:**
**5.5** (or $5 \frac{1}{2}$)

total duration:       3.438829184s
load duration:        212.989801ms
prompt eval count:    22 token(s)
prompt eval duration: 18.179174ms
prompt eval rate:     1210.18 tokens/s
eval count:           328 token(s)
eval duration:        3.10435076s
eval rate:            105.66 tokens/s

llama.cpp

Sun Apr  5 22:18:10 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        On  |   00000000:01:00.0 Off |                  N/A |
|  0%   40C    P3             37W /  450W |       2MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
test@xserver:~/models$ ../llama.cpp/build/bin/llama-cli -m gemma-4-26B-A4B-it-UD-Q4_K_M.gguf   -p "5+8/16=?"
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 32109 MiB):
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32109 MiB

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8669-761797ffd
model      : gemma-4-26B-A4B-it-UD-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern


> 5+8/16=?

[Start thinking]
The user wants to solve the mathematical expression $5 + 8/16$.

    *   Addition ($+$)
    *   Division ($/$)

    *   Order of operations (PEMDAS/BODMAS) states that division should be performed before addition.

    *   Expression: $8 / 16$
    *   Calculation: $\frac{8}{16}$
    *   Simplification: Both 8 and 16 are divisible by 8.
    *   $\frac{8 \div 8}{16 \div 8} = \frac{1}{2}$
    *   Decimal form: $0.5$

    *   Expression: $5 + 0.5$
    *   Calculation: $5.5$

    *   Fraction form: $5 \frac{1}{2}$ or $\frac{11}{2}$
    *   Decimal form: $5.5$
[End thinking]

To solve this, follow the order of operations (PEMDAS/BODMAS), which dictates that you perform division before addition.

1.  **Divide 8 by 16:**
    $8 / 16 = 0.5$ (or $\frac{1}{2}$)

2.  **Add 5 to the result:**
    $5 + 0.5 = 5.5$

**Answer:**
**5.5** (or $5\frac{1}{2}$)

[ Prompt: 181.9 t/s | Generation: 212.5 t/s ]

> 

Exiting...
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 5090)   | 32109 = 9209 + (22311 = 16071 +    5420 +     819) +         587 |
llama_memory_breakdown_print: |   - Host               |                  1274 =   748 +       0 +     526                |

Relevant log output

Apr 05 22:31:09 ubuntumainserver ollama[4010283]: [GIN] 2026/04/05 - 22:31:09 | 200 |       20.29µs |       127.0.0.1 | HEAD     "/"
Apr 05 22:31:09 ubuntumainserver ollama[4010283]: [GIN] 2026/04/05 - 22:31:09 | 200 |  226.505134ms |       127.0.0.1 | POST     "/api/show"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.117Z level=INFO source=server.go:432 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 38853"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.445Z level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.445Z level=INFO source=server.go:247 msg="enabling flash attention"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.445Z level=INFO source=server.go:432 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /usr/share/ollama/.ollama/models/blobs/sha256-7121486771cbfe218851513210c40b35dbdee93ab1ef43fe36283c883980f0df --port 42019"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.445Z level=INFO source=sched.go:484 msg="system memory" total="61.9 GiB" free="57.5 GiB" free_swap="7.1 GiB"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.445Z level=INFO source=sched.go:491 msg="gpu memory" id=GPU-0aa928c0-ece6-7698-4db1-ac130bfe47b7 library=CUDA available="30.9 GiB" free="31.4 GiB" minimum="457.0 MiB" overhead="0 B"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.445Z level=INFO source=server.go:759 msg="loading model" "model layers"=31 requested=-1
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.457Z level=INFO source=runner.go:1417 msg="starting ollama engine"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.457Z level=INFO source=runner.go:1452 msg="Server listening on 127.0.0.1:42019"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.467Z level=INFO source=runner.go:1290 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:16 GPULayers:31[ID:GPU-0aa928c0-ece6-7698-4db1-ac130bfe47b7 Layers:31(0..30)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.519Z level=INFO source=ggml.go:136 msg="" architecture=gemma4 file_type=Q4_K_M name="" description="" num_tensors=1014 num_key_values=52
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: ggml_cuda_init: found 1 CUDA devices:
Apr 05 22:31:10 ubuntumainserver ollama[4010283]:   Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, ID: GPU-0aa928c0-ece6-7698-4db1-ac130bfe47b7
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v13/libggml-cuda.so
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.671Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.675Z level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.688Z level=INFO source=model.go:138 msg="vision: decode" elapsed=3.098772ms bounds=(0,0)-(2048,2048)
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.763Z level=INFO source=model.go:145 msg="vision: preprocess" elapsed=75.059281ms size="[768 768]"
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.763Z level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.763Z level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
Apr 05 22:31:10 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:10.763Z level=INFO source=model.go:156 msg="vision: encoded" elapsed=78.862504ms shape="[2816 256]"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.118Z level=INFO source=runner.go:1290 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:16 GPULayers:31[ID:GPU-0aa928c0-ece6-7698-4db1-ac130bfe47b7 Layers:31(0..30)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.161Z level=INFO source=model.go:97 msg="gemma4: token IDs" image=255999 image_end=258882 audio=256000 audio_end=258883
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.168Z level=INFO source=model.go:138 msg="vision: decode" elapsed=308.777µs bounds=(0,0)-(2048,2048)
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.231Z level=INFO source=model.go:145 msg="vision: preprocess" elapsed=62.685692ms size="[768 768]"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.233Z level=INFO source=model.go:148 msg="vision: pixelValues" shape="[768 768 3]" dim0=768 dim1=768 dim2=3
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.233Z level=INFO source=model.go:152 msg="vision: patches" patchesX=48 patchesY=48 total=2304 patchSize=16
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.234Z level=INFO source=model.go:156 msg="vision: encoded" elapsed=65.965872ms shape="[2816 256]"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=runner.go:1290 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:16 GPULayers:31[ID:GPU-0aa928c0-ece6-7698-4db1-ac130bfe47b7 Layers:31(0..30)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=ggml.go:482 msg="offloading 30 repeating layers to GPU"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=ggml.go:494 msg="offloaded 31/31 layers to GPU"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="16.6 GiB"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=device.go:245 msg="model weights" device=CPU size="667.5 MiB"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="1.0 GiB"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="318.7 MiB"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="72.0 MiB"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=device.go:272 msg="total memory" size="18.7 GiB"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=sched.go:561 msg="loaded runners" count=1
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=server.go:1352 msg="waiting for llama runner to start responding"
Apr 05 22:31:11 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:11.290Z level=INFO source=server.go:1386 msg="waiting for server to become available" status="llm server loading model"
Apr 05 22:31:12 ubuntumainserver ollama[4010283]: time=2026-04-05T22:31:12.794Z level=INFO source=server.go:1390 msg="llama runner started in 2.35 seconds"
Apr 05 22:31:15 ubuntumainserver ollama[4010283]: [GIN] 2026/04/05 - 22:31:15 | 200 |   5.93703499s |       127.0.0.1 | POST     "/api/generate"
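
The "vision: ..." lines in the log above carry an elapsed= field, so the time reported by the vision path during model load can be tallied directly. The following is a minimal sketch, assuming the logfmt-style layout shown above; the function name vision_elapsed_ms and the trimmed sample lines are illustrative, not part of Ollama.

```python
import re

# Two lines copied from the log above (journald prefix trimmed for brevity).
LOG = '''\
time=2026-04-05T22:31:10.763Z level=INFO source=model.go:156 msg="vision: encoded" elapsed=78.862504ms shape="[2816 256]"
time=2026-04-05T22:31:11.234Z level=INFO source=model.go:156 msg="vision: encoded" elapsed=65.965872ms shape="[2816 256]"
'''

# Conversion factors from the duration suffixes seen in the log to milliseconds.
_UNITS_MS = {"ns": 1e-6, "µs": 1e-3, "us": 1e-3, "ms": 1.0, "s": 1e3}
_PATTERN = re.compile(r'msg="vision: (\w+)" elapsed=([\d.]+)(ns|µs|us|ms|s)\b')


def vision_elapsed_ms(log_text):
    """Return {stage: [elapsed_ms, ...]} for every 'vision: <stage>' line."""
    stages = {}
    for stage, value, unit in _PATTERN.findall(log_text):
        stages.setdefault(stage, []).append(float(value) * _UNITS_MS[unit])
    return stages


if __name__ == "__main__":
    for stage, values in vision_elapsed_ms(LOG).items():
        print(f"{stage}: {len(values)} pass(es), {sum(values):.1f} ms total")
```

Totals are reported per stage rather than summed across stages, because the log does not say whether the stage timings overlap.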

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.20.2

extent analysis

TL;DR

Implement an optional "text-only" loading mode for multimodal models in Ollama to match the performance of llama.cpp.

Guidance

  • Investigate the feasibility of adding a "text-only" mode to Ollama, similar to llama.cpp, to bypass vision encoders/modules when no image is provided.
  • Review the Ollama codebase to identify areas where the vision encoder can be conditionally disabled or optimized for text-only workloads.
  • Consider adding a command-line flag or configuration option to enable/disable the "text-only" mode, allowing users to choose the optimal setting for their specific use case.
  • Benchmark the performance of Ollama with the proposed "text-only" mode against llama.cpp and Unsloth Studio to ensure the optimization is effective.

Example

No code fix is provided, as the issue requires a design and implementation change rather than a simple patch.
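
The benchmarking step in the guidance can still be sketched. Ollama's /api/generate response includes prompt_eval_count, prompt_eval_duration, eval_count and eval_duration (durations in nanoseconds), from which throughput in tokens per second can be derived and compared with the "[ Prompt: ... t/s | Generation: ... t/s ]" line printed by llama-cli. The helper names, the default host and the model tag below are assumptions for illustration.

```python
import json
import urllib.request


def tokens_per_second(count, duration_ns):
    """Convert a token count and a duration in nanoseconds to tokens/s."""
    if duration_ns <= 0:
        return 0.0
    return count / (duration_ns / 1e9)


def summarize(response):
    """Extract prompt and generation throughput from an /api/generate reply."""
    return {
        "prompt_tps": tokens_per_second(
            response.get("prompt_eval_count", 0),
            response.get("prompt_eval_duration", 0),
        ),
        "generation_tps": tokens_per_second(
            response.get("eval_count", 0),
            response.get("eval_duration", 0),
        ),
    }


def benchmark(model, prompt, host="http://127.0.0.1:11434"):
    """Send one non-streaming request to a local Ollama server and summarize it."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return summarize(json.load(reply))
```

With a server running, benchmark("gemma4:26b", "5+8/16=?") would return the two throughput figures for the prompt used in the issue; the tag "gemma4:26b" is taken from the issue title and may differ locally. A single short prompt gives a noisy measurement, so repeated runs with the model already loaded are likely to be more representative.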

Notes

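The performance difference between Ollama and llama.cpp/Unsloth Studio may be due to the lack of a "text-only" mode in Ollama, which can lead to unnecessary computation and memory allocation. The Ollama log above shows the vision path ("vision: decode", "vision: preprocess", "vision: encoded") running twice during model load for a text-only prompt, while llama-cli reports "modalities : text". Each logged vision pass takes under 80 ms, so the log alone does not establish how much of the gap this explains. Adding a "text-only" mode could still improve Ollama's performance for text-only workloads.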

Recommendation

The issue identifies no user-side workaround. The proposed change is an optional "text-only" mode in Ollama, which may provide a performance boost for text-only workloads, including on high-end hardware like the RTX 5090.
