ollama - ✅(Solved) Fix iGPU: reduce memory overhead, add RAM pressure guard, cap concurrent models, clarify OLLAMA_VULKAN [1 pull requests, 2 comments, 2 participants]

GitHub stats
ollama/ollama#14953 (fetched 2026-04-08 01:02:04)
Comments: 2
Participants: 2
Timeline events: 9
Reactions: 0
Timeline (top): subscribed ×3, commented ×2, referenced ×2, cross-referenced ×1

Fix Action

Fixed

PR fix notes

PR #14954: iGPU: reduce memory overhead, RAM pressure guard, cap concurrent models, clarify OLLAMA_VULKAN

Description (problem / solution / changelog)

Summary

Integrated GPUs (Intel Iris Xe, AMD APU, etc.) share physical RAM with the CPU. Several scheduler decisions that make sense for discrete GPUs cause memory pressure or wasted headroom on iGPU-only systems. This PR addresses four issues through the following changes:

  • ml/device.go: MinimumMemory() now returns 256 MiB for integrated GPUs instead of 457 MiB. iGPUs have no separate VRAM management structures so the smaller overhead is correct.
  • server/sched.go: After updateFreeSpace, cap iGPU FreeMemory at 80% of current system free RAM. iGPU "VRAM" is shared physical memory — over-allocating starves the OS and causes OOM under CPU load.
  • server/sched.go: When all detected GPUs are integrated and no user override is set, auto-cap maxRunners at 1 (instead of defaultModelsPerGPU=3). Multiple concurrently loaded models on iGPU compete for the same RAM the CPU uses.
  • envconfig/config.go: Add OLLAMA_IGPU_MAX_MODELS env var so users can explicitly control the concurrent model cap on iGPU systems (0 = use standard logic).
  • envconfig/config.go + discover/runner.go: Clarify OLLAMA_VULKAN — the flag forces Vulkan over higher-priority backends, not enables it. Vulkan is auto-detected on iGPU. OLLAMA_VULKAN=0 disables it entirely.
  • discover/types.go: Emit a selected backend log line after GPU discovery so users can confirm which compute path is active without having to parse the inference compute line.
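
The concurrent-model cap described in the bullets above can be sketched as a pure function. This is a minimal sketch under stated assumptions, not the PR's code: `resolveMaxRunners` and its parameters are hypothetical names, and the "standard logic" is assumed to be `defaultModelsPerGPU` multiplied by the GPU count.

```go
package main

import "fmt"

// defaultModelsPerGPU mirrors the value quoted in the PR description.
const defaultModelsPerGPU = 3

// resolveMaxRunners is a hypothetical helper, not a function from the PR.
//
//	userMax:       explicit user override for loaded models (0 = unset)
//	igpuMax:       value of OLLAMA_IGPU_MAX_MODELS (0 = use standard logic)
//	allIntegrated: true when every detected GPU is an iGPU
//	numGPUs:       number of detected GPUs
func resolveMaxRunners(userMax, igpuMax int, allIntegrated bool, numGPUs int) int {
	if userMax > 0 {
		return userMax // an explicit override always wins
	}
	if allIntegrated && igpuMax > 0 {
		return igpuMax // iGPU-only system: apply the iGPU cap
	}
	// Assumed standard logic: a fixed number of models per GPU.
	if numGPUs < 1 {
		numGPUs = 1
	}
	return defaultModelsPerGPU * numGPUs
}

func main() {
	fmt.Println(resolveMaxRunners(0, 1, true, 1))  // iGPU-only system, cap of 1
	fmt.Println(resolveMaxRunners(0, 0, true, 1))  // cap disabled with 0
	fmt.Println(resolveMaxRunners(4, 1, true, 1))  // explicit user override
	fmt.Println(resolveMaxRunners(0, 1, false, 2)) // two discrete GPUs
}
```

Keeping the decision in one side-effect-free function makes the precedence (user override, then iGPU cap, then standard logic) easy to cover with table-driven tests.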

Test plan

  • Build with Vulkan on Intel Iris Xe (Windows 11, Vulkan 1.4.341)
  • Verify selected backend backends=Vulkan appears in server log on iGPU-only system
  • Verify OLLAMA_IGPU_MAX_MODELS appears in server config log
  • Verify iGPU free memory is capped in debug log when system RAM is under pressure
  • Existing scheduler tests pass (go test ./server/...)

Closes #14953

Changed files

  • discover/runner.go (modified, +1/-1)
  • discover/types.go (modified, +20/-0)
  • envconfig/config.go (modified, +12/-3)
  • ml/device.go (modified, +14/-0)
  • server/sched.go (modified, +52/-4)

Problem

Integrated GPUs (iGPU — Intel Iris Xe, AMD APU, etc.) share physical RAM with the CPU. The current Ollama scheduler treats iGPU the same as a discrete GPU in several places, causing:

  1. Over-reserved memory overhead — `MinimumMemory()` reserves 457 MiB on all non-Metal backends, including iGPU. Since iGPU has no separate VRAM management structures, this wastes headroom that could be used to offload more model layers.

  2. No system RAM pressure guard — iGPU `FreeMemory` is reported as available VRAM, but this is shared with the OS and CPU processes. Loading a large model can exhaust physical RAM and cause OOM crashes under CPU load.

  3. Too many concurrent models by default — `defaultModelsPerGPU = 3` allows 3 models loaded simultaneously. On iGPU-only systems this multiplies RAM pressure since all loaded models share the same physical memory pool.

  4. Misleading `OLLAMA_VULKAN` log — On iGPU-only systems Vulkan is auto-selected, but the server config log shows `OLLAMA_VULKAN:false`, making users think Vulkan is not active. There is also no log line explicitly stating which backend was chosen.
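
To make the first two points concrete, here is a small worked example of the memory arithmetic. It is a sketch with made-up inputs: `capIGPUFree` and `usable` are hypothetical helpers, and treating the overhead as a plain subtraction from free memory is a simplifying assumption rather than a description of the scheduler.

```go
package main

import "fmt"

const mib = 1024 * 1024

// capIGPUFree limits the reported iGPU free memory to 80% of system free RAM.
func capIGPUFree(reportedFree, systemFree uint64) uint64 {
	limit := systemFree / 5 * 4 // 80%, using integer math
	if reportedFree > limit {
		return limit
	}
	return reportedFree
}

// usable subtracts a fixed overhead from free memory, saturating at zero.
func usable(free, overhead uint64) uint64 {
	if free <= overhead {
		return 0
	}
	return free - overhead
}

func main() {
	reported := uint64(12288 * mib) // hypothetical: iGPU reports 12 GiB free
	system := uint64(10240 * mib)   // hypothetical: 10 GiB of system RAM free

	capped := capIGPUFree(reported, system)
	fmt.Println("capped free (MiB):", capped/mib)
	fmt.Println("usable with 457 MiB overhead (MiB):", usable(capped, 457*mib)/mib)
	fmt.Println("usable with 256 MiB overhead (MiB):", usable(capped, 256*mib)/mib)
}
```

With these inputs the guard lowers the budget from 12288 MiB to 8192 MiB, and the smaller overhead frees an extra 201 MiB (457 minus 256) for model layers.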

Related issues

  • #13023 — Intel Iris Xe not detected (0 VRAM)
  • #13029 — Vulkan fails to allocate memory buffer
  • #12223 — OLLAMA_GPU_OVERHEAD not respected
  • #13212 — OLLAMA_VULKAN=0 has no effect
  • #11748 — No shared-memory offload when VRAM full

Proposed fix

  • `ml/device.go`: Return 256 MiB overhead for integrated GPUs (vs 457 MiB for discrete)
  • `server/sched.go`: After `updateFreeSpace`, cap iGPU `FreeMemory` at 80% of current system free RAM
  • `server/sched.go`: When all GPUs are integrated and no user override is set, auto-cap `maxRunners` at 1
  • `envconfig/config.go`: Add `OLLAMA_IGPU_MAX_MODELS` env var for user override of the concurrent model cap
  • `envconfig/config.go` + `discover/runner.go`: Clarify `OLLAMA_VULKAN` docs — it forces Vulkan, not enables it
  • `discover/types.go`: Emit `selected backend` log line after GPU discovery

Test environment

  • Intel Core Ultra 7 155H, Intel Iris Xe Graphics (iGPU)
  • Windows 11, Vulkan 1.4.341, Ollama built from source

extent analysis

Fix Plan

To address the issues with integrated GPUs (iGPU), we will implement the following changes:

  • Update ml/device.go to return a reduced memory overhead for iGPUs:
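// Sketch: the integrated parameter is a placeholder for however the device is flagged as an iGPU.
func MinimumMemory(integrated bool) uint64 {
    if integrated {
        return 256 * 1024 * 1024 // 256 MiB: no separate VRAM management structures
    }
    return 457 * 1024 * 1024 // 457 MiB: existing overhead on non-Metal backends
}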
  • Modify server/sched.go to cap FreeMemory at 80% of system free RAM for iGPUs:
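// Sketch: freeMemory, isIntegratedGPU, and getSystemFreeRAM are placeholders, not names from the PR.
func updateFreeSpace() {
    // ... existing free-memory refresh ...
    if isIntegratedGPU {
        // Cap at 80% of current system free RAM (the built-in min needs Go 1.21+).
        freeMemory = min(freeMemory, uint64(0.8 * float64(getSystemFreeRAM())))
    }
}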
  • Update server/sched.go to auto-cap maxRunners at 1 when all GPUs are integrated and no user override is set:
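func init() {
    // ...
    // Sketch: maxRunners == 0 is assumed to mean "no user override".
    if allGPUsAreIntegrated && maxRunners == 0 {
        maxRunners = 1 // instead of defaultModelsPerGPU = 3
    }
}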
  • Add OLLAMA_IGPU_MAX_MODELS env var for user override of concurrent model cap in envconfig/config.go:
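func init() {
    // ...
    // getenvInt is a hypothetical helper, and the default of 1 is this sketch's assumption.
    // A value of 0 means "use standard logic".
    maxModelsPerIGPU = getenvInt("OLLAMA_IGPU_MAX_MODELS", 1)
}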
  • Clarify OLLAMA_VULKAN docs in envconfig/config.go and discover/runner.go:
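// OLLAMA_VULKAN forces Vulkan over higher-priority backends; it does not enable it.
// Vulkan is auto-detected on iGPU, and OLLAMA_VULKAN=0 disables it entirely.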
  • Emit selected backend log line after GPU discovery in discover/types.go:
func discoverGPUs() {
    // ...
    // Structured logging (log/slog) so the line reads: selected backend backends=<name>
    slog.Info("selected backend", "backends", backendName)
}

Verification

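To verify the fix, run the Ollama server on an iGPU-only system and check the server log: the `selected backend backends=Vulkan` line should appear, `OLLAMA_IGPU_MAX_MODELS` should be listed in the server config log, and the debug log should show iGPU free memory being capped when system RAM is under pressure. Then load multiple models and confirm that the system does not run out of memory.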
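
The log check can be automated with a small scanner. This is a sketch only: `selectedBackend` is a hypothetical helper, the sample log lines are made up, and the `key=value` line format is an assumption based on the `selected backend backends=Vulkan` wording in the test plan.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// selectedBackend scans log text for a line containing "selected backend"
// and returns the value of its backends= field, if present.
func selectedBackend(logText string) (string, bool) {
	sc := bufio.NewScanner(strings.NewReader(logText))
	for sc.Scan() {
		line := sc.Text()
		if !strings.Contains(line, "selected backend") {
			continue
		}
		// Split on whitespace and look for the backends=<value> field.
		for _, field := range strings.Fields(line) {
			if strings.HasPrefix(field, "backends=") {
				value := strings.TrimPrefix(field, "backends=")
				return strings.Trim(value, `"`), true
			}
		}
	}
	return "", false
}

func main() {
	// Made-up sample; real server logs will contain more fields.
	sample := `level=INFO msg="starting server"
level=INFO msg="selected backend" backends=Vulkan`

	if backend, ok := selectedBackend(sample); ok {
		fmt.Println("backend:", backend)
	} else {
		fmt.Println("no selected backend line found")
	}
}
```

In practice the text would come from the server's log file or captured stderr rather than a string literal.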

Extra Tips

  • Set the OLLAMA_IGPU_MAX_MODELS env var to override the concurrent model cap on iGPU systems; a value of 0 falls back to the standard logic.
  • Monitor system memory usage and adjust the maxRunners cap as needed to prevent OOM crashes.
  • Consider adding additional logging or monitoring to track memory usage and GPU utilization.
