litellm - ✅ (Solved) Fix [Bug]: HealthCheckTable Unbounded Growth Causes Model Dashboard Performance Degradation [1 pull request, 1 comment, 2 participants]

GitHub stats

BerriAI/litellm#25623 (fetched 2026-04-14 05:38:40)
Comments: 1 · Participants: 2 · Timeline events: 5 · Reactions: 0
Timeline (top): labeled ×2, commented ×1, cross-referenced ×1, referenced ×1

Error Message

httpx.ReadError: Server disconnected
httpx.ConnectError: ClientNotConnectedError

Fix Action

Fixed

PR fix notes

PR #25652: fix(proxy): prevent LiteLLM_HealthCheckTable unbounded growth

Description (problem / solution / changelog)

Summary

Fixes #25623

LiteLLM_HealthCheckTable grows without bound when external monitoring tools (e.g. Uptime Kuma) ping health endpoints continuously. With 500k+ rows, the Model Dashboard query loads every row into Python memory, causing 8GB+ container memory and 400% CPU spikes.

Root cause

get_all_latest_health_checks ran a full-table find_many with no where filter and no take limit, then deduplicated in Python. On a large deployment this means hundreds of thousands of rows materialized in memory on every dashboard load.
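The difference between the two query shapes can be illustrated without a database. The following is a minimal sketch, not the LiteLLM code: the rows are plain dictionaries, the model_name column and the sample data are assumptions made for illustration, and recent_rows stands in for a filter that the database would apply.

```python
from datetime import datetime, timedelta, timezone


def latest_per_model(rows):
    """Python-side dedup: keep only the newest row for each model_name."""
    latest = {}
    for row in rows:
        seen = latest.get(row["model_name"])
        if seen is None or row["checked_at"] > seen["checked_at"]:
            latest[row["model_name"]] = row
    return latest


def recent_rows(rows, now, ttl_days=7):
    """Stand-in for a database-side filter: checked_at >= now - TTL."""
    cutoff = now - timedelta(days=ttl_days)
    return [r for r in rows if r["checked_at"] >= cutoff]


now = datetime(2026, 4, 14, tzinfo=timezone.utc)

# Hypothetical table: two models, one health check per hour for 30 days.
table = [
    {"model_name": name, "status": "healthy", "checked_at": now - timedelta(hours=h)}
    for name in ("model-a", "model-b")
    for h in range(30 * 24)
]

unbounded = latest_per_model(table)                  # dedup walks every row
bounded = latest_per_model(recent_rows(table, now))  # dedup walks recent rows only
```

Both calls return the same latest row per model; the filtered variant simply hands far fewer rows to the Python dedup step.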

Changes

litellm/proxy/utils.py

get_all_latest_health_checks — add a checked_at >= (now - TTL) WHERE clause so only recent rows are fetched. The DB does the filtering; Python only sees O(unique models) rows.

cleanup_old_health_checks — new method that deletes rows older than TTL via a single delete_many. Called from save_health_check_result at most once per hour (tracked via _health_check_last_cleanup_ts) to keep the table bounded without adding per-insert overhead.

TTL defaults to 7 days and is configurable via HEALTH_CHECK_TTL_DAYS env var.
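The throttled cleanup described above can be sketched as follows. This is an illustrative in-memory model under stated assumptions, not the code from the PR: InMemoryHealthChecks, its methods, and the last_cleanup attribute are hypothetical names, and a list filter stands in for the delete_many call. Only HEALTH_CHECK_TTL_DAYS, the 7-day default, and the one-hour interval come from the PR notes.

```python
import os
from datetime import datetime, timedelta, timezone

CLEANUP_INTERVAL = timedelta(hours=1)  # cleanup runs at most once per hour


def ttl_days() -> int:
    # Env var name and 7-day default are taken from the PR notes.
    return int(os.getenv("HEALTH_CHECK_TTL_DAYS", "7"))


class InMemoryHealthChecks:
    """Hypothetical stand-in for LiteLLM_HealthCheckTable and its writer."""

    def __init__(self):
        self.rows = []
        self.last_cleanup = None  # plays the role of _health_check_last_cleanup_ts

    def cleanup_old(self, now):
        """Drop rows older than the TTL and return how many were removed."""
        cutoff = now - timedelta(days=ttl_days())
        before = len(self.rows)
        self.rows = [r for r in self.rows if r["checked_at"] >= cutoff]
        return before - len(self.rows)

    def save(self, row, now=None):
        """Insert a row, then clean up only if an hour has passed since the last run."""
        now = now or datetime.now(timezone.utc)
        self.rows.append(row)
        if self.last_cleanup is None or now - self.last_cleanup >= CLEANUP_INTERVAL:
            self.last_cleanup = now
            self.cleanup_old(now)
```

Tying the cleanup to the write path with a timestamp guard avoids a separate scheduler while keeping the per-insert cost to a single comparison.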

Tests

Added 5 new tests in tests/test_litellm/proxy/test_health_check_functions.py:

  • test_get_all_latest_health_checks_applies_ttl_filter — verifies the WHERE filter is passed to the DB
  • test_get_all_latest_health_checks_ttl_env_override — verifies HEALTH_CHECK_TTL_DAYS controls the cutoff
  • test_cleanup_old_health_checks_calls_delete_many — verifies delete_many is called with a lt cutoff
  • test_save_health_check_result_triggers_cleanup_after_one_hour — cleanup runs when last run was > 1 hour ago
  • test_save_health_check_result_skips_cleanup_within_one_hour — cleanup is skipped when last run was < 1 hour ago

All 27 tests in the file pass.
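A mocked test of this kind might look like the sketch below. It is not copied from the PR: the cleanup_old_health_checks function here is a hypothetical stand-in that takes the table object as an argument, and the where shape ({"checked_at": {"lt": cutoff}}) is an assumption based on the "lt cutoff" wording above.

```python
import asyncio
from datetime import datetime, timedelta, timezone
from unittest.mock import AsyncMock


async def cleanup_old_health_checks(table, ttl_days=7, now=None):
    """Hypothetical stand-in for the method under test."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=ttl_days)
    return await table.delete_many(where={"checked_at": {"lt": cutoff}})


def test_cleanup_calls_delete_many_with_lt_cutoff():
    table = AsyncMock()                  # no real database involved
    table.delete_many.return_value = 3   # pretend three rows were deleted
    now = datetime(2026, 4, 14, tzinfo=timezone.utc)

    deleted = asyncio.run(cleanup_old_health_checks(table, now=now))

    table.delete_many.assert_awaited_once()
    where = table.delete_many.await_args.kwargs["where"]
    assert where == {"checked_at": {"lt": now - timedelta(days=7)}}
    assert deleted == 3
```

Passing now explicitly keeps the cutoff deterministic, so the assertion can compare the exact filter value instead of patching the clock.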

Checklist

  • Tests added (5 new unit tests, all mocked)
  • make test-unit passes for this test file
  • Black formatting applied
  • Scope isolated to this single bug
  • Uses Prisma model methods (find_many, delete_many) — no raw SQL

Changed files

  • docs/my-website/docs/proxy/config_settings.md (modified, +1/-0)
  • litellm/proxy/utils.py (modified, +76/-34)
  • tests/test_litellm/proxy/test_health_check_functions.py (modified, +248/-81)

---
Check for existing issues

  • I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

When external monitoring tools (e.g., Uptime Kuma) are configured to monitor LiteLLM endpoints, the LiteLLM_HealthCheckTable grows unboundedly, leading to severe performance degradation in the Model Dashboard. The health check data is not actively displayed in this dashboard but still gets queried, causing memory/CPU spikes.

Expected Behavior

  • Health check data should either be displayed in the dashboard or not queried on dashboard load
  • The Model Dashboard should remain responsive regardless of health check history size
  • Old health check entries should be purged automatically, with a configurable retention period

Actual Behavior

  • LiteLLM_HealthCheckTable grows to 500,000+ rows without bounds
  • Each health check ping creates a new row
  • Model Dashboard queries trigger full table scans on the bloated table
  • Container memory grows to 8GB+ with 400% CPU usage
  • Prisma query-engine enters death/reconnect loops:
    httpx.ReadError: Server disconnected
    httpx.ConnectError: ClientNotConnectedError

Possible fixes

  1. HealthCheckTable retention policy: Implement automatic cleanup of LiteLLM_HealthCheckTable entries older than N days (configurable, default 7 days)

  2. Dashboard query optimization: Exclude HealthCheckTable from Model Dashboard queries, or use paginated/cursor-based queries if health data is needed
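
The cursor-based option in the second fix can be sketched generically. This is a minimal illustration under assumptions, not LiteLLM code: iter_pages, fetch_page, the integer id column, and the in-memory TABLE are all hypothetical, and a real implementation would replace fetch_page with a database query ordered by the cursor column.

```python
from typing import Callable, Iterator, List, Optional

Row = dict


def iter_pages(
    fetch_page: Callable[[Optional[int], int], List[Row]],
    page_size: int = 1000,
) -> Iterator[Row]:
    """Yield rows one page at a time, using the last seen id as the cursor."""
    cursor: Optional[int] = None
    while True:
        page = fetch_page(cursor, page_size)
        if not page:
            return
        yield from page
        cursor = page[-1]["id"]  # the next page starts after this id


# Hypothetical in-memory table, sorted by id.
TABLE = [{"id": i, "model_name": f"model-{i % 3}"} for i in range(1, 2501)]


def fetch_page(after_id: Optional[int], limit: int) -> List[Row]:
    """Stand-in for a query such as: WHERE id > after_id ORDER BY id LIMIT limit."""
    matching = [r for r in TABLE if after_id is None or r["id"] > after_id]
    return matching[:limit]
```

With this shape, at most page_size rows are fetched per call, regardless of how large the table has grown.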

Steps to Reproduce

  1. Configure external monitoring (e.g., Uptime Kuma) to ping LiteLLM health endpoints
  2. Run for several days
  3. Observe memory/CPU spike in LiteLLM container when opening Model Dashboard → severe lag or unresponsiveness

Relevant log output

What part of LiteLLM is this about?

UI Dashboard

What LiteLLM version are you on?

v1.82.3.dev.9

extent analysis

TL;DR

Implement a retention policy for the LiteLLM_HealthCheckTable to automatically purge old health check entries and optimize Model Dashboard queries to exclude or paginate the HealthCheckTable.

Guidance

  • Implement a configurable retention policy (e.g., 7 days) to automatically delete old LiteLLM_HealthCheckTable entries to prevent unbounded growth.
  • Optimize Model Dashboard queries to exclude the HealthCheckTable if health data is not displayed, or use pagination/cursor-based queries if health data is needed.
  • Verify the fix by monitoring the LiteLLM_HealthCheckTable size and Model Dashboard performance after implementing the retention policy and query optimizations.
  • Consider adding logging or monitoring to track the number of purged health check entries and query performance to ensure the fix is effective.

Example

The issue itself does not include a code snippet or details of the underlying query. The PR fix notes describe the implementation: a Prisma find_many over LiteLLM_HealthCheckTable with no where filter and no take limit.

Notes

The provided solution assumes that the LiteLLM_HealthCheckTable growth is the primary cause of the performance degradation. Additional investigation may be necessary to rule out other contributing factors.

Recommendation

Implement a retention policy for the LiteLLM_HealthCheckTable and optimize the Model Dashboard queries; together these address the performance degradation directly. PR #25652 takes this approach with a TTL filter on reads and a cleanup that runs at most once per hour.
