ollama - ✅(Solved) Fix iGPU: reduce memory overhead, add RAM pressure guard, cap concurrent models, clarify OLLAMA_VULKAN [1 pull requests, 2 comments, 2 participants]

RajeshKumar11 · 2026-03-19T10:27:58Z



GitHub stats

ollama/ollama#14953 • Fetched 2026-04-08 01:02:04

Author: RajeshKumar11
Participants: RajeshKumar11, rick-github
Timeline: subscribed ×3, commented ×2, referenced ×2, cross-referenced ×1

Fix Action

Fixed

Fixed by PR: iGPU: reduce memory overhead, RAM pressure guard, cap concurrent models, clarify OLLAMA_VULKAN (https://github.com/ollama/ollama/pull/14954)

PR fix notes

PR #14954: iGPU: reduce memory overhead, RAM pressure guard, cap concurrent models, clarify OLLAMA_VULKAN

Repository: ollama/ollama
Author: RajeshKumar11
State: open | merged: False
Link: https://github.com/ollama/ollama/pull/14954

Description (problem / solution / changelog)

Summary

Integrated GPUs (Intel Iris Xe, AMD APU, etc.) share physical RAM with the CPU. Several scheduler decisions that make sense for discrete GPUs cause memory pressure or wasted headroom on iGPU-only systems. This PR addresses four issues:

- **`ml/device.go`**: `MinimumMemory()` now returns 256 MiB for integrated GPUs instead of 457 MiB. iGPUs have no separate VRAM management structures, so the smaller overhead is sufficient.
- **`server/sched.go`**: After `updateFreeSpace`, cap iGPU `FreeMemory` at 80% of current system free RAM. iGPU "VRAM" is shared physical memory, so over-allocating starves the OS and causes OOM under CPU load.
- **`server/sched.go`**: When all detected GPUs are integrated and no user override is set, auto-cap `maxRunners` at 1 (instead of `defaultModelsPerGPU=3`). Multiple concurrently loaded models on iGPU compete for the same RAM the CPU uses.
- **`envconfig/config.go`**: Add `OLLAMA_IGPU_MAX_MODELS` env var so users can explicitly control the concurrent model cap on iGPU systems (0 = use standard logic).
- **`envconfig/config.go` + `discover/runner.go`**: Clarify `OLLAMA_VULKAN`: the flag *forces* Vulkan over higher-priority backends rather than enabling it. Vulkan is auto-detected on iGPU. `OLLAMA_VULKAN=0` disables it entirely.
- **`discover/types.go`**: Emit a `selected backend` log line after GPU discovery so users can confirm which compute path is active without having to parse the `inference compute` line.
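The three `OLLAMA_VULKAN` states described in the summary (unset, forced, disabled) could be modelled roughly as follows. This is only an illustrative sketch: `vulkanMode` and `parseVulkanMode` are hypothetical names, not identifiers from the PR, and treating an unparseable value as "auto" is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// vulkanMode is a hypothetical three-state reading of OLLAMA_VULKAN.
type vulkanMode string

const (
	vulkanAuto     vulkanMode = "auto"     // unset: backend discovery decides
	vulkanForced   vulkanMode = "forced"   // truthy: prefer Vulkan over other backends
	vulkanDisabled vulkanMode = "disabled" // falsy: never use Vulkan
)

// parseVulkanMode maps the raw variable to a mode. set reports whether the
// variable was present at all. Unparseable values fall back to auto, which
// is an assumption made for this sketch.
func parseVulkanMode(raw string, set bool) vulkanMode {
	if !set {
		return vulkanAuto
	}
	on, err := strconv.ParseBool(raw)
	if err != nil {
		return vulkanAuto
	}
	if on {
		return vulkanForced
	}
	return vulkanDisabled
}

func main() {
	raw, set := os.LookupEnv("OLLAMA_VULKAN")
	fmt.Println("vulkan mode:", parseVulkanMode(raw, set))
}
```

Keeping "unset" distinct from "0" is what lets a config log report auto-selection instead of a misleading `false`.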

Test plan

- [ ] Build with Vulkan on Intel Iris Xe (Windows 11, Vulkan 1.4.341)
- [ ] Verify `selected backend backends=Vulkan` appears in server log on iGPU-only system
- [ ] Verify `OLLAMA_IGPU_MAX_MODELS` appears in server config log
- [ ] Verify iGPU free memory is capped in debug log when system RAM is under pressure
- [ ] Existing scheduler tests pass (`go test ./server/...`)

Closes #14953

Changed files

- `discover/runner.go` (modified, +1/-1)
- `discover/types.go` (modified, +20/-0)
- `envconfig/config.go` (modified, +12/-3)
- `ml/device.go` (modified, +14/-0)
- `server/sched.go` (modified, +52/-4)


Problem

Integrated GPUs (iGPU — Intel Iris Xe, AMD APU, etc.) share physical RAM with the CPU. The current Ollama scheduler treats iGPU the same as a discrete GPU in several places, causing:

1. **Over-reserved memory overhead**: `MinimumMemory()` reserves 457 MiB on all non-Metal backends, including iGPU. Since iGPU has no separate VRAM management structures, this wastes headroom that could be used to offload more model layers.
2. **No system RAM pressure guard**: iGPU `FreeMemory` is reported as available VRAM, but this is shared with the OS and CPU processes. Loading a large model can exhaust physical RAM and cause OOM crashes under CPU load.
3. **Too many concurrent models by default**: `defaultModelsPerGPU = 3` allows 3 models loaded simultaneously. On iGPU-only systems this multiplies RAM pressure since all loaded models share the same physical memory pool.
4. **Misleading `OLLAMA_VULKAN` log**: On iGPU-only systems Vulkan is auto-selected, but the server config log shows `OLLAMA_VULKAN:false`, making users think Vulkan is not active. There is also no log line explicitly stating which backend was chosen.

Related issues

- #13023: Intel Iris Xe not detected (0 VRAM)
- #13029: Vulkan fails to allocate memory buffer
- #12223: OLLAMA_GPU_OVERHEAD not respected
- #13212: OLLAMA_VULKAN=0 has no effect
- #11748: No shared-memory offload when VRAM full

Proposed fix

- `ml/device.go`: Return 256 MiB overhead for integrated GPUs (vs 457 MiB for discrete)
- `server/sched.go`: After `updateFreeSpace`, cap iGPU `FreeMemory` at 80% of current system free RAM
- `server/sched.go`: When all GPUs are integrated and no user override is set, auto-cap `maxRunners` at 1
- `envconfig/config.go`: Add `OLLAMA_IGPU_MAX_MODELS` env var for user override of the concurrent model cap
- `envconfig/config.go` + `discover/runner.go`: Clarify `OLLAMA_VULKAN` docs: it forces Vulkan rather than enabling it
- `discover/types.go`: Emit `selected backend` log line after GPU discovery

Test environment

- Intel Core Ultra 7 155H, Intel Iris Xe Graphics (iGPU)
- Windows 11, Vulkan 1.4.341, Ollama built from source

extent analysis

Fix Plan

To address the issues with integrated GPUs (iGPU), we will implement the following changes:

Update ml/device.go to return a reduced memory overhead for iGPUs:

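// Sketch only: isIntegratedGPU is a placeholder for however the device
// reports that it shares system RAM; the Metal case is omitted.
func MinimumMemory() uint64 {
    if isIntegratedGPU {
        return 256 * 1024 * 1024 // 256 MiB for integrated GPUs
    }
    return 457 * 1024 * 1024 // 457 MiB for other non-Metal devices
}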
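A self-contained version of the same idea, useful for checking the arithmetic: with the figures quoted in the PR description, an integrated GPU regains 457 − 256 = 201 MiB of headroom. The `device` struct and its `Integrated` field are assumptions made for this sketch and may not match the real type in `ml/device.go`.

```go
package main

import "fmt"

const mebibyte = 1024 * 1024

// device is a hypothetical stand-in for the scheduler's GPU record.
type device struct {
	Name       string
	Integrated bool // assumed flag: true when the GPU shares system RAM
}

// minimumMemory returns the overhead to reserve on a device, using the
// figures from the PR description: 256 MiB for integrated GPUs and
// 457 MiB otherwise. The Metal case mentioned in the issue is left out.
func (d device) minimumMemory() uint64 {
	if d.Integrated {
		return 256 * mebibyte
	}
	return 457 * mebibyte
}

func main() {
	igpu := device{Name: "iGPU", Integrated: true}
	dgpu := device{Name: "dGPU", Integrated: false}
	saved := dgpu.minimumMemory() - igpu.minimumMemory()
	fmt.Printf("headroom regained on iGPU: %d MiB\n", saved/mebibyte)
}
```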

Modify server/sched.go to cap FreeMemory at 80% of system free RAM for iGPUs:

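// Sketch only: freeMemory and getSystemFreeRAM are placeholder names.
func updateFreeSpace() {
    // ... refresh freeMemory from the driver as before ...
    if isIntegratedGPU {
        // iGPU memory is system RAM, so leave 20% of free RAM to the OS.
        freeMemory = min(freeMemory, uint64(0.8*float64(getSystemFreeRAM())))
    }
}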
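The cap itself is a small pure function, which makes it easy to test in isolation. The following is a minimal sketch under the assumption that both values are byte counts; `capIntegratedFree` is a hypothetical name, and integer arithmetic replaces the float conversion to avoid rounding surprises.

```go
package main

import "fmt"

// capIntegratedFree limits the free memory reported for an integrated GPU
// to 80% of the system's currently free RAM. Both inputs are in bytes.
// Values for discrete GPUs are returned unchanged.
func capIntegratedFree(gpuFree, systemFree uint64, integrated bool) uint64 {
	if !integrated {
		return gpuFree
	}
	// Divide before multiplying so the product cannot overflow uint64.
	// Truncation makes the limit at most a few bytes below an exact 80%.
	limit := systemFree / 10 * 8
	if gpuFree > limit {
		return limit
	}
	return gpuFree
}

func main() {
	const gib = 1 << 30
	// The driver reports 12 GiB free, but only 10 GiB of system RAM is free.
	capped := capIntegratedFree(12*gib, 10*gib, true)
	fmt.Printf("usable iGPU memory: %d GiB\n", capped/gib)
}
```

With 10 GiB of free system RAM the limit works out to 8 GiB, so a 12 GiB driver report is reduced while a 4 GiB report passes through untouched.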

Update server/sched.go to auto-cap maxRunners at 1 when all GPUs are integrated and no user override is set:

func init() {
    // ...
    // maxRunners == 0 stands for "no user override" in this sketch.
    if allGPUsAreIntegrated && maxRunners == 0 {
        maxRunners = 1
    }
}

Add OLLAMA_IGPU_MAX_MODELS env var for user override of concurrent model cap in envconfig/config.go:

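func init() {
    // ...
    // getenvInt is a placeholder helper. Unset defaults to a cap of 1;
    // per the PR description, 0 means "use standard logic".
    maxModelsPerIGPU = getenvInt("OLLAMA_IGPU_MAX_MODELS", 1)
}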
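Putting the auto-cap and the override together, the decision could look roughly like the sketch below. `resolveMaxRunners` is a hypothetical helper, not code from the PR; it simplifies "standard logic" to the `defaultModelsPerGPU = 3` value quoted in the issue and assumes that invalid or negative input falls back to the cap of 1.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

const defaultModelsPerGPU = 3 // default quoted in the issue

// resolveMaxRunners picks the concurrent model cap. raw is the value of
// OLLAMA_IGPU_MAX_MODELS and set reports whether the variable was present.
func resolveMaxRunners(allIntegrated bool, raw string, set bool) int {
	if !allIntegrated {
		return defaultModelsPerGPU // standard logic, simplified
	}
	if !set {
		return 1 // iGPU-only system with no override: auto-cap at 1
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n < 0 {
		return 1 // assumption: bad input keeps the safe cap
	}
	if n == 0 {
		return defaultModelsPerGPU // 0 = use standard logic
	}
	return n
}

func main() {
	raw, set := os.LookupEnv("OLLAMA_IGPU_MAX_MODELS")
	fmt.Println("max runners on an iGPU-only system:", resolveMaxRunners(true, raw, set))
}
```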

Clarify OLLAMA_VULKAN docs in envconfig/config.go and discover/runner.go:

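// OLLAMA_VULKAN=1 forces Vulkan over higher-priority backends rather than
// merely enabling it; OLLAMA_VULKAN=0 disables it entirely. When unset,
// Vulkan is auto-detected on iGPU.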

Emit selected backend log line after GPU discovery in discover/types.go:

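func discoverGPUs() {
    // ...
    // Structured form, matching the "selected backend backends=Vulkan"
    // line the test plan expects; backendName is a placeholder.
    slog.Info("selected backend", "backends", backendName)
}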

Verification

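To verify the fix, run the Ollama server on an iGPU-only system and confirm in the server log that `selected backend backends=Vulkan` appears, that `OLLAMA_IGPU_MAX_MODELS` is listed in the server config line, and that the debug log shows iGPU free memory being capped when system RAM is under pressure. Then load more than one model and confirm that the system does not run out of memory. The existing scheduler tests should still pass (`go test ./server/...`).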

Extra Tips

- Set `OLLAMA_IGPU_MAX_MODELS` only when the default iGPU cap of 1 needs overriding; per the PR description, 0 falls back to the standard logic.
- Monitor system memory usage and lower the `maxRunners` cap if OOM crashes still occur.
- Consider adding logging or monitoring to track memory usage and GPU utilization.
