ollama - Feature Request: Hot-swappable model loading without server restart [1 comment, 2 participants]

ollama • 2026-03-07 01:49:47


GitHub stats

ollama/ollama#14684•Fetched 2026-04-08 00:32:57


Author

fuleinist

Participants

fuleinist

rick-github

Timeline (top)

closed ×1 • commented ×1 • subscribed ×1


Problem

Currently, switching between different LLM models in Ollama requires stopping and restarting the Ollama server. This is problematic because:

Time-consuming: Each model switch requires waiting for the server to fully restart
Memory inefficient: All loaded models are cleared from VRAM on restart
Disruptive: Active requests/interactions are interrupted during restart
Poor DX: Developers working with multiple models (e.g., coding assistant + chat) must constantly restart

Proposed Solution

Implement hot-swappable model loading that allows:

Multiple models loaded simultaneously: Keep several models in memory (within VRAM limits)
Dynamic model switching: Switch the active model via API without restarting
Model lifecycle management: Load/unload specific models on demand
Memory-aware scheduling: Automatically manage VRAM based on available resources

Use Cases

Multi-model workflows: Use a coding model for code assistance and a chat model for conversation without restart
Development pipelines: Quickly switch between different model versions during testing
Resource optimization: Keep a lightweight model always loaded while loading heavier models on-demand
Reduced latency: Eliminate cold-start delays when switching between frequently-used models

Implementation Suggestions

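API Endpoints:

- POST /api/model/load: Load a model into memory
- POST /api/model/unload: Unload a model from memory
- GET /api/models/loaded: List currently loaded models
- Add a parameter to the chat endpoint that auto-loads the requested model if it is not already loaded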

Configuration:

- Maximum concurrent models (default: 2)
- Max VRAM per model or total

Memory Management:

Implement LRU eviction when VRAM is exhausted
Add priority system for sticky models that shouldn't be evicted
Expose memory usage stats via an API endpoint
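The three memory-management ideas above (LRU eviction, sticky models, usage stats) could fit together roughly as follows. This is a sketch under stated assumptions, not Ollama's scheduler: `ModelRegistry` and all of its method names are hypothetical, eviction is triggered by a model count rather than by measured VRAM, and memory figures are supplied by the caller.

```python
from collections import OrderedDict


class ModelRegistry:
    """Tracks loaded models in least-recently-used order (oldest first)."""

    def __init__(self, max_models=2):
        self.max_models = max_models
        self._models = OrderedDict()  # name -> {"memory_gb": float, "sticky": bool}

    def load(self, name, memory_gb, sticky=False):
        """Register a model as loaded; return the names evicted to make room."""
        self._models[name] = {"memory_gb": memory_gb, "sticky": sticky}
        self._models.move_to_end(name)  # a fresh load counts as the latest use
        evicted = []
        while len(self._models) > self.max_models:
            # Oldest model that is neither sticky nor the one just loaded.
            victim = next((n for n, m in self._models.items()
                           if not m["sticky"] and n != name), None)
            if victim is None:
                break  # nothing evictable: the limit is exceeded on purpose
            del self._models[victim]
            evicted.append(victim)
        return evicted

    def touch(self, name):
        """Mark a model as just used, e.g. on every chat request."""
        self._models.move_to_end(name)

    def unload(self, name):
        """Remove a model; return False if it was not loaded."""
        return self._models.pop(name, None) is not None

    def stats(self):
        """Memory usage per loaded model, oldest use first."""
        return [{"name": n, "memory_gb": m["memory_gb"], "sticky": m["sticky"]}
                for n, m in self._models.items()]
```

One design choice worth noting: when only sticky models remain, the sketch lets the registry exceed `max_models` rather than refuse the load. A real implementation might prefer to reject the request instead.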

Example Workflow

# Pre-load multiple models
curl -X POST http://localhost:11434/api/model/load -d '{"model": "codellama"}'
curl -X POST http://localhost:11434/api/model/load -d '{"model": "llama3"}'

# List loaded models
curl http://localhost:11434/api/models/loaded
# [{"name": "codellama", "memory": "4GB"}, {"name": "llama3", "memory": "8GB"}]

# Chat using llama3 (auto-selected or specified)
curl http://localhost:11434/api/chat -d '{"model": "llama3", "messages": [...]}'

# Unload codellama when you need VRAM for other tasks
curl -X POST http://localhost:11434/api/model/unload -d '{"model": "codellama"}'
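The same workflow can be scripted. The sketch below uses only the Python standard library and targets the endpoints proposed in this issue (`/api/model/load`, `/api/model/unload`, `/api/models/loaded`), which are part of the proposal and should not be assumed to exist on a running server. The helper names `build_request`, `call`, and `run_workflow` are hypothetical. The chat call is left out because the example elides its message list.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"  # host and port used in the curl examples


def build_request(path, payload=None):
    """Return a Request: POST with a JSON body if payload is given, else GET."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"} if data else {}
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers,
                                  method="POST" if data else "GET")


def call(path, payload=None, timeout=30):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(path, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_workflow(send=call):
    """Replay the load/list/unload steps; pass a fake `send` to dry-run them."""
    send("/api/model/load", {"model": "codellama"})
    send("/api/model/load", {"model": "llama3"})
    loaded = send("/api/models/loaded")
    send("/api/model/unload", {"model": "codellama"})
    return loaded
```

Injecting `send` makes it possible to check the call sequence without a server, which is useful while the endpoints are still only a proposal.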

This feature would significantly improve developer experience and make Ollama more suitable for production multi-model deployments.

extent analysis

Fix Plan

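The steps below sketch hot-swappable model loading as a Python (Flask) prototype. Ollama itself is written in Go, so treat the code as an illustration of the design rather than a patch: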

Step 1: API Endpoints
- Create API endpoints for loading, unloading, and listing models:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

# Model management: maps model name -> loaded model handle
models = {}


@app.route('/api/model/load', methods=['POST'])
def load_model():
    model_name = request.json['model']
    # Load model into memory (load_model_into_memory is sketched in Step 3)
    models[model_name] = load_model_into_memory(model_name)
    return jsonify({'message': f'Model {model_name} loaded'})


@app.route('/api/model/unload', methods=['POST'])
def unload_model():
    model_name = request.json['model']
    # Return 404 instead of raising KeyError when the model is not loaded
    if model_name not in models:
        return jsonify({'error': f'Model {model_name} is not loaded'}), 404
    # Unload model from memory
    del models[model_name]
    return jsonify({'message': f'Model {model_name} unloaded'})


@app.route('/api/models/loaded', methods=['GET'])
def list_loaded_models():
    # get_model_memory_usage is a placeholder helper that this plan does not define
    return jsonify([{'name': model, 'memory': get_model_memory_usage(model)}
                    for model in models])
```

Step 2: Configuration
- Add configuration options for maximum concurrent models and max VRAM per model:

```python
max_concurrent_models = 2
max_vram_per_model = 16  # GB
```

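Step 3: Memory Management
- Implement LRU eviction when the number of loaded models exceeds the configured limit:

```python
import time
from collections import OrderedDict

# Model name -> last load time, ordered from least to most recently used
model_lru = OrderedDict()


def load_model_into_memory(model_name):
    # Load model into memory
    # ...
    model_lru[model_name] = time.time()
    model_lru.move_to_end(model_name)  # re-loading counts as the latest use
    # max_concurrent_models comes from Step 2, models from Step 1
    if len(model_lru) > max_concurrent_models:
        # Evict least recently used model (first key in the OrderedDict)
        lru_model, _ = model_lru.popitem(last=False)
        models.pop(lru_model, None)
    # The loading itself is elided above, so nothing is returned yet;
    # a real loader would return the model handle that Step 1 stores.
```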
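The step above evicts by model count, while the feature request asks for eviction when VRAM is exhausted. A minimal sketch of a budget-based variant follows. The function name `evict_to_fit` is hypothetical, and model sizes are assumed to be known in advance, which in practice depends on quantization and context length.

```python
from collections import OrderedDict


def evict_to_fit(loaded, new_size_gb, budget_gb):
    """Evict least recently used models until new_size_gb fits in budget_gb.

    loaded is an OrderedDict of model name -> size in GB, oldest use first.
    It is modified in place. Returns the evicted names in eviction order.
    """
    if new_size_gb > budget_gb:
        raise MemoryError("model is larger than the total VRAM budget")
    evicted = []
    # Terminates: once loaded is empty, the new model alone fits the budget.
    while sum(loaded.values()) + new_size_gb > budget_gb:
        name, _ = loaded.popitem(last=False)  # oldest entry first
        evicted.append(name)
    return evicted
```

With the sizes from the example output (codellama at 4 GB, llama3 at 8 GB) and a hypothetical 16 GB budget, loading a 6 GB model would evict only codellama, leaving 14 GB in use.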

Step 4: Auto-loading
- Make the chat endpoint auto-load the requested model if it is not loaded:

```python
@app.route('/api/chat', methods=['POST'])
def chat():
    model_name = request.json.get('model')
    if model_name and model_name not in models:
        # Auto-load model and record it so the next request finds it loaded
        models[model_name] = load_model_into_memory(model_name)
    # ...
```

Verification

To verify that the fix worked, test the following scenarios:

Load multiple models into memory using the /api/model/load endpoint
List loaded models using the /api/models/loaded endpoint
Unload a model using the /api/model/unload endpoint
Test auto-loading by specifying a model in the /api/chat endpoint that is not loaded

Extra Tips

Monitor VRAM usage and adjust configuration options as needed
Implement priority system for sticky models that shouldn't be evicted
Expose memory usage stats via API endpoint for monitoring and debugging purposes


#api #ssr #optimization #memory management #model loading #network issue
