ollama: Feature Request: Hot-swappable model loading without server restart [1 comment, 2 participants]

GitHub stats
ollama/ollama#14684 (fetched 2026-04-08 00:32:57)
Comments: 1
Participants: 2
Timeline: 3
Reactions: 0
Timeline (top): closed ×1, commented ×1, subscribed ×1


Problem

Currently, switching between different LLM models in Ollama requires stopping and restarting the Ollama server. This is problematic because:

  1. Time-consuming: Each model switch requires waiting for the server to fully restart
  2. Memory inefficient: All loaded models are cleared from VRAM on restart
  3. Disruptive: Active requests/interactions are interrupted during restart
  4. Poor DX: Developers working with multiple models (e.g., coding assistant + chat) must constantly restart

Proposed Solution

Implement hot-swappable model loading that allows:

  1. Multiple models loaded simultaneously: Keep several models in memory (within VRAM limits)
  2. Dynamic model switching: Switch the active model via API without restarting
  3. Model lifecycle management: Load/unload specific models on demand
  4. Memory-aware scheduling: Automatically manage VRAM based on available resources

Use Cases

  • Multi-model workflows: Use a coding model for code assistance and a chat model for conversation without restart
  • Development pipelines: Quickly switch between different model versions during testing
  • Resource optimization: Keep a lightweight model always loaded while loading heavier models on-demand
  • Reduced latency: Eliminate cold-start delays when switching between frequently-used models

Implementation Suggestions

API Endpoints:

    • Load a model into memory
    • Unload a model from memory
    • List currently loaded models
  • Add parameter to that auto-loads if not loaded

Configuration:

    • Maximum concurrent models (default: 2)
    • Max VRAM per model or total

Memory Management:

  • Implement LRU eviction when VRAM is exhausted
  • Add priority system for sticky models that shouldn't be evicted
  • Expose memory usage stats via
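One way the memory-management ideas above might fit together is sketched below: least-recently-used eviction that skips sticky models, plus a stats snapshot. The class `LoadedModels`, its method names, and the gigabyte-based bookkeeping are all assumptions made for illustration; the proposal does not specify any of them.

```python
from collections import OrderedDict


class LoadedModels:
    """Hypothetical tracker: LRU eviction that never evicts sticky models."""

    def __init__(self, vram_budget_gb):
        self.vram_budget_gb = vram_budget_gb
        # name -> (size_gb, sticky), ordered from least to most recently used
        self._models = OrderedDict()

    def used_gb(self):
        return sum(size for size, _sticky in self._models.values())

    def touch(self, name):
        # Call on every request that uses the model so recency stays accurate.
        self._models.move_to_end(name)

    def load(self, name, size_gb, sticky=False):
        """Register a model and return the names evicted to make room.

        Raises MemoryError, leaving state unchanged, if the model cannot fit
        even after evicting every non-sticky model.
        """
        if name in self._models:
            self.touch(name)
            return []
        shortfall = self.used_gb() + size_gb - self.vram_budget_gb
        victims = []
        for candidate, (size, is_sticky) in self._models.items():
            if shortfall <= 0:
                break
            if not is_sticky:
                victims.append(candidate)
                shortfall -= size
        if shortfall > 0:
            raise MemoryError(f"cannot fit {name!r} without evicting sticky models")
        for victim in victims:
            del self._models[victim]
        self._models[name] = (size_gb, sticky)
        return victims

    def stats(self):
        # Snapshot that a stats endpoint could serialize.
        return {
            "budget_gb": self.vram_budget_gb,
            "used_gb": self.used_gb(),
            "models": list(self._models),
        }
```

Choosing victims before deleting anything keeps the operation all-or-nothing: a load that cannot succeed does not evict models along the way.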

Example Workflow

# Pre-load multiple models
curl -X POST http://localhost:11434/api/model/load -d '{"model": "codellama"}'
curl -X POST http://localhost:11434/api/model/load -d '{"model": "llama3"}'

# List loaded models
curl http://localhost:11434/api/models/loaded
# [{"name": "codellama", "memory": "4GB"}, {"name": "llama3", "memory": "8GB"}]

# Chat using llama3 (auto-selected or specified)
curl http://localhost:11434/api/chat -d '{"model": "llama3", "messages": [...]}'

# Unload codellama when you need VRAM for other tasks
curl -X POST http://localhost:11434/api/model/unload -d '{"model": "codellama"}'
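The load, list, and unload steps of this workflow could also be driven from Python using only the standard library. This is a sketch: the helper names `build_request`, `call`, and `run_workflow` are made up for illustration, and the `/api/model/load`, `/api/models/loaded`, and `/api/model/unload` paths are the endpoints proposed in this request, so the calls only work against a server that implements them.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"


def build_request(path, payload=None):
    """Build (without sending) a request: POST with JSON if payload is given, else GET."""
    if payload is None:
        return urllib.request.Request(BASE_URL + path, method="GET")
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(path, payload=None, timeout=30):
    """Send the request and decode a JSON response body."""
    with urllib.request.urlopen(build_request(path, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_workflow():
    """Mirror the curl workflow against the proposed (hypothetical) endpoints."""
    call("/api/model/load", {"model": "codellama"})
    call("/api/model/load", {"model": "llama3"})
    loaded = call("/api/models/loaded")
    call("/api/model/unload", {"model": "codellama"})
    return loaded
```

Unlike the bare `curl -d` calls, this client sets an explicit JSON `Content-Type` header, which many server frameworks require before they will parse the body as JSON.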

This feature would significantly improve developer experience and make Ollama more suitable for production multi-model deployments.


Fix Plan

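The steps below sketch the proposed behavior as a small Python/Flask service. Ollama itself is written in Go, so treat this as an illustration of the design rather than a patch to Ollama: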

  • Step 1: API Endpoints
    • Create API endpoints for loading, unloading, and listing models:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

# Model management: model name -> loaded model handle
models = {}

@app.route('/api/model/load', methods=['POST'])
def load_model():
    # force=True parses the body even when curl -d sends no JSON Content-Type
    model_name = request.get_json(force=True)['model']
    # Load model into memory (load_model_into_memory is defined in Step 3)
    load_model_into_memory(model_name)
    return jsonify({'message': f'Model {model_name} loaded'})

@app.route('/api/model/unload', methods=['POST'])
def unload_model():
    model_name = request.get_json(force=True)['model']
    # Unload model from memory; pop() tolerates a model that was never loaded
    models.pop(model_name, None)
    model_lru.pop(model_name, None)  # model_lru is defined in Step 3
    return jsonify({'message': f'Model {model_name} unloaded'})

@app.route('/api/models/loaded', methods=['GET'])
def list_loaded_models():
    # get_model_memory_usage is not defined in this plan; supply your own helper
    return jsonify([{'name': model, 'memory': get_model_memory_usage(model)} for model in models])
```

  • Step 2: Configuration
    • Add configuration options for maximum concurrent models and max VRAM per model:

```python
max_concurrent_models = 2
max_vram_per_model = 16  # GB
```

  • Step 3: Memory Management
    • Implement LRU eviction when VRAM is exhausted:

```python
import time
from collections import OrderedDict

# Ordered from least to most recently used; values are last-use timestamps
model_lru = OrderedDict()

def load_model_into_memory(model_name):
    if model_name not in models:
        # Load model into memory
        # ...
        models[model_name] = model_name  # placeholder handle; the real load is elided above
    model_lru[model_name] = time.time()
    model_lru.move_to_end(model_name)  # mark as most recently used
    while len(model_lru) > max_concurrent_models:
        # Evict least recently used model
        lru_model, _ = model_lru.popitem(last=False)
        models.pop(lru_model, None)
```

  • Step 4: Auto-loading
    • Add parameter to API endpoint to auto-load model if not loaded:

```python
@app.route('/api/chat', methods=['POST'])
def chat():
    model_name = request.get_json(force=True).get('model')
    if model_name:
        # Auto-load the model if needed; this also refreshes its LRU position
        load_model_into_memory(model_name)
    # ...
```

Verification

To verify that the fix worked, test the following scenarios:

  • Load multiple models into memory using the /api/model/load endpoint
  • List loaded models using the /api/models/loaded endpoint
  • Unload a model using the /api/model/unload endpoint
  • Test auto-loading by specifying a model in the /api/chat endpoint that is not loaded
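The four scenarios above can be written down as one check function. The sketch below runs it against an in-memory stand-in, since no real server with these endpoints is assumed to exist; `FakeServer`, its method names, and `verify` are hypothetical, and a real test would replace the stand-in with HTTP calls to the endpoints under test.

```python
class FakeServer:
    """In-memory stand-in for a server exposing the proposed endpoints."""

    def __init__(self):
        self._loaded = []

    def load(self, model):
        if model not in self._loaded:
            self._loaded.append(model)

    def unload(self, model):
        if model in self._loaded:
            self._loaded.remove(model)

    def list_loaded(self):
        return list(self._loaded)

    def chat(self, model):
        self.load(model)  # auto-load on first use
        return f"reply from {model}"


def verify(server):
    """Run the four verification scenarios; raises AssertionError on failure."""
    # 1 and 2: load multiple models, then list them
    server.load("codellama")
    server.load("llama3")
    assert set(server.list_loaded()) == {"codellama", "llama3"}
    # 3: unload a model
    server.unload("codellama")
    assert server.list_loaded() == ["llama3"]
    # 4: chatting with an unloaded model should auto-load it
    server.chat("codellama")
    assert "codellama" in server.list_loaded()
    return True
```

Keeping the scenarios in a function that accepts any object with these four methods lets the same checks run first against the stand-in and later against a thin wrapper around the real API.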

Extra Tips

  • Monitor VRAM usage and adjust configuration options as needed
  • Implement priority system for sticky models that shouldn't be evicted
  • Expose memory usage stats via API endpoint for monitoring and debugging purposes
