litellm: Proxy silently drops all models after extended runtime (config-only, no DB) [1 comment, 1 participant]

GitHub stats

  • Issue: BerriAI/litellm#25350 (fetched 2026-04-09 07:52:37)
  • Comments: 1
  • Participants: 1
  • Timeline events: 3 (closed ×1, commented ×1, labeled ×1)
  • Reactions: 0


Bug Report

Environment

  • LiteLLM version: 1.83.4 (latest as of 2026-04-08)
  • OS: Ubuntu 24.04
  • Python: 3.12
  • Setup: Config-only (config.yaml), no Prisma/DB for model storage
  • Deployment: systemd user service, --num_workers 1

Description

After extended runtime (8-14+ hours), the LiteLLM proxy silently drops all models. /v1/models and /model/info return {"data": []} (0 models), while /health continues to report healthy with DB connected.

All downstream consumers that route through the proxy start failing with model-not-found errors. There is no log output indicating the models were dropped.

Reproduction Steps

  1. Start LiteLLM proxy with a config.yaml containing multiple model definitions (we have 26 models across Vertex AI and Anthropic providers)
  2. Verify models are loaded: curl /v1/models returns all 26
  3. Leave the proxy running for 8-14+ hours under normal load
  4. Check again: curl /v1/models returns {"data": []} — 0 models
  5. /health still reports healthy
  6. Restarting the proxy immediately restores all 26 models

Expected Behavior

Models loaded from config.yaml should persist for the lifetime of the proxy process.

Actual Behavior

Models silently disappear from llm_router.model_list after extended runtime. No error is logged. The health endpoint masks the issue by reporting healthy.

Suspected Root Cause

Based on code review of proxy_server.py:

  • _schedule_regular_config_update() runs a background task that periodically reloads config. If this reload silently fails or partially executes, it could wipe llm_router.model_list
  • The model_cost_map_reload scheduled task (runs every 24h by default) may also contribute — see #24349 and #24308 for related issues with this task causing unexpected proxy state
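
If a failed or partial reload is indeed what empties the list, one defensive pattern is to refuse an empty reload result and keep the models that were already loaded. The following is a minimal sketch of that idea only; `select_model_list` is a hypothetical helper, not a LiteLLM function, and it assumes the model list is a plain list of dicts:

```python
import logging
from typing import List, Optional

logger = logging.getLogger("config_reload_guard")


def select_model_list(current: List[dict], reloaded: Optional[List[dict]]) -> List[dict]:
    """Pick the model list to keep after a reload attempt.

    Hypothetical helper: an empty or missing reload result is treated as a
    failed reload, so the previously loaded models stay in place and the
    event is logged instead of passing silently.
    """
    if not reloaded:
        logger.error(
            "config reload produced no models; keeping %d existing model(s)",
            len(current),
        )
        return current
    # Copy so later mutation of the reload result cannot affect the kept list.
    return list(reloaded)
```

The trade-off is that a configuration that legitimately defines zero models would also be rejected, so a real fix would need to distinguish a failed reload from an intentionally empty one.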

Workaround

We run an external health check (systemd timer, every 10 minutes) that queries /v1/models, checks model count, and auto-restarts the proxy via systemctl restart if count drops to 0.
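
The check described above could look roughly like the sketch below. It is not the reporter's actual script: the proxy address, the unit name `litellm.service`, and the optional bearer token are assumptions, and treating an unreachable proxy the same as an empty model list is a design choice of this sketch.

```python
import json
import subprocess
import urllib.request
from typing import Callable, Optional

# Assumptions, not taken from the issue: adjust to the actual deployment.
MODELS_URL = "http://127.0.0.1:4000/v1/models"
SERVICE_NAME = "litellm.service"  # hypothetical systemd unit name


def fetch_models_body(url: str = MODELS_URL, api_key: Optional[str] = None,
                      timeout: float = 10.0) -> str:
    """Return the raw response body of the proxy's /v1/models endpoint."""
    request = urllib.request.Request(url)
    if api_key:
        request.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def count_models(body: str) -> int:
    """Count entries in the "data" list; malformed bodies count as 0."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return 0
    data = payload.get("data") if isinstance(payload, dict) else None
    return len(data) if isinstance(data, list) else 0


def restart_proxy() -> None:
    """Restart the user-level systemd service that runs the proxy."""
    subprocess.run(["systemctl", "--user", "restart", SERVICE_NAME], check=True)


def check_and_restart(fetch: Callable[[], str] = fetch_models_body,
                      restart: Callable[[], None] = restart_proxy) -> bool:
    """Restart the proxy when it serves zero models; return True if restarted."""
    try:
        count = count_models(fetch())
    except OSError:
        # urllib's URLError and timeouts are OSError subclasses; an
        # unreachable proxy is handled like an empty model list here.
        count = 0
    if count == 0:
        restart()
        return True
    return False
```

A systemd timer firing every 10 minutes would run a script that calls `check_and_restart()`. The `fetch` and `restart` parameters exist so the decision logic can be exercised without a live proxy.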

Related Issues

  • #12711 — config.yaml models disappear after /model/update API call (different trigger, same symptom)
  • PR #22215 — fixes a related architectural issue where proxy_server.py only reads from llm_router.model_list; if the router drops its list, /model/info silently returns 0

Impact

This is a silent failure that affects all consumers. The health endpoint reporting "healthy" while serving 0 models makes it particularly dangerous for production use. We've hit this bug multiple times over the past week.

extent analysis

TL;DR

The most likely fix is to harden the periodic config reload and model cost map reload tasks in proxy_server.py so that a silent failure in either one cannot wipe llm_router.model_list.

Guidance

  • Investigate _schedule_regular_config_update() to confirm that a config reload, including a failed or partial one, cannot clear llm_router.model_list.
  • Review the model_cost_map_reload task to determine whether it contributes to the issue, taking the related issues #24349 and #24308 into account.
  • Verify that llm_router.model_list still holds the expected models after each config reload and each model cost map reload.
  • Add logging that fires whenever llm_router.model_list is cleared or changes size unexpectedly, so the failure is no longer silent.
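
The logging suggestion in the last item can be sketched as a small monitor that compares the model count between checks. `ModelListMonitor` is a hypothetical class, not part of LiteLLM, and it assumes the caller can supply a callable that returns the current list (for example one that reads llm_router.model_list):

```python
import logging
from typing import Callable, Optional, Sequence

logger = logging.getLogger("model_list_monitor")


class ModelListMonitor:
    """Log whenever the observed model list changes size, loudly if it empties."""

    def __init__(self, get_model_list: Callable[[], Optional[Sequence]]):
        # Hypothetical hook: the caller decides where the list comes from.
        self._get_model_list = get_model_list
        self._last_count: Optional[int] = None

    def check(self) -> int:
        """Read the current list, log any size change, and return the count."""
        models = self._get_model_list()
        count = len(models) if models else 0
        if self._last_count is not None and count != self._last_count:
            # A drop to zero is the failure described in this issue.
            level = logging.ERROR if count == 0 else logging.INFO
            logger.log(level, "model list size changed: %d -> %d",
                       self._last_count, count)
        self._last_count = count
        return count
```

Calling `check()` right after each scheduled reload, or from a periodic background task, would leave a timestamped log entry near the moment the list empties, which narrows down which task is responsible.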

Example

The issue does not include the relevant proxy_server.py code, so no patch against LiteLLM itself is shown here.

Notes

The external health check that auto-restarts the proxy mitigates the outage but does not address the root cause. Issue #12711 and PR #22215 may provide additional context for resolving the problem.

Recommendation

Deploy the external health check as an immediate workaround, and in parallel investigate the periodic config reload and model cost map reload tasks as the suspected root cause. This limits the production impact while a permanent fix is developed.
