litellm - ✅(Solved) Fix [Bug]: HealthCheckTable Unbounded Growth Causes Model Dashboard Performance Degradation [1 pull request, 1 comment, 2 participants]

litellm · 2026-04-13 08:16:53

GitHub stats

BerriAI/litellm#25623 • Fetched 2026-04-14 05:38:40

Author: checkmeck
Participants: balgaly, checkmeck
Timeline (top): labeled ×2, commented ×1, cross-referenced ×1, referenced ×1

Error Message

httpx.ReadError: Server disconnected
httpx.ConnectError: ClientNotConnectedError

Fix Action

Fix proposed (the linked PR was open and not yet merged at fetch time)

Fixed by PR: fix(proxy): prevent LiteLLM_HealthCheckTable unbounded growth (https://github.com/BerriAI/litellm/pull/25652)

PR fix notes

PR #25652: fix(proxy): prevent LiteLLM_HealthCheckTable unbounded growth

Repository: BerriAI/litellm
Author: balgaly
State: open | merged: False
Link: https://github.com/BerriAI/litellm/pull/25652

Description (problem / solution / changelog)

Summary

Fixes #25623

LiteLLM_HealthCheckTable grows without bound when external monitoring tools (e.g. Uptime Kuma) ping health endpoints continuously. With 500k+ rows, the Model Dashboard query loads every row into Python memory, pushing container memory past 8 GB and CPU usage to 400%.

Root cause

get_all_latest_health_checks ran a full-table find_many with no where filter and no take limit, then deduplicated in Python. On a large deployment this means hundreds of thousands of rows materialized in memory on every dashboard load.
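To make the cost concrete, here is a minimal sketch of the "fetch everything, deduplicate in Python" pattern described above. It is not the LiteLLM implementation: the row shape, the `model_name` field, and the helper `latest_per_model` are assumptions chosen for illustration (only `checked_at` is named in the PR notes).

```python
from datetime import datetime, timedelta, timezone


def latest_per_model(rows):
    """Keep only the newest row per model (the Python-side dedup step)."""
    latest = {}
    for row in rows:
        current = latest.get(row["model_name"])
        if current is None or row["checked_at"] > current["checked_at"]:
            latest[row["model_name"]] = row
    return list(latest.values())


now = datetime(2026, 4, 13, tzinfo=timezone.utc)

# Hypothetical table contents: 2 models, one ping per minute for 3 days.
rows = [
    {"model_name": name, "checked_at": now - timedelta(minutes=i), "status": "healthy"}
    for name in ("model-a", "model-b")
    for i in range(3 * 24 * 60)
]

result = latest_per_model(rows)
print(len(rows), "rows scanned ->", len(result), "rows kept")
```

Every row has to be materialized and visited even though only one row per model survives, which is why the cost tracks table size rather than the number of models.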

Changes

`litellm/proxy/utils.py`

get_all_latest_health_checks — add a checked_at >= (now - TTL) WHERE clause so only recent rows are fetched. The DB does the filtering; Python only sees O(unique models) rows.

cleanup_old_health_checks — new method that deletes rows older than TTL via a single delete_many. Called from save_health_check_result at most once per hour (tracked via _health_check_last_cleanup_ts) to keep the table bounded without adding per-insert overhead.

TTL defaults to 7 days and is configurable via HEALTH_CHECK_TTL_DAYS env var.
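The TTL cutoff and the "at most once per hour" throttle can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `HealthCheckStore` is a hypothetical in-memory stand-in for the table, the `clock` parameter exists only to make the throttle testable, and parsing `HEALTH_CHECK_TTL_DAYS` with `int()` is a guess at how the variable is read.

```python
import os
import time
from datetime import datetime, timedelta, timezone

CLEANUP_INTERVAL_SECONDS = 3600  # "at most once per hour", per the PR notes


def ttl_cutoff(now=None):
    """Return the oldest checked_at that is still retained."""
    days = int(os.getenv("HEALTH_CHECK_TTL_DAYS", "7"))  # default from the PR notes
    now = now or datetime.now(timezone.utc)
    return now - timedelta(days=days)


class HealthCheckStore:
    """Hypothetical in-memory stand-in for LiteLLM_HealthCheckTable."""

    def __init__(self):
        self.rows = []
        self._last_cleanup_ts = 0.0

    def cleanup_old(self, now=None):
        """Drop rows older than the TTL cutoff; return how many were removed."""
        cutoff = ttl_cutoff(now)
        before = len(self.rows)
        self.rows = [r for r in self.rows if r["checked_at"] >= cutoff]
        return before - len(self.rows)

    def save(self, row, clock=time.time):
        """Append a row, then run cleanup only if an hour has passed."""
        self.rows.append(row)
        ts = clock()
        if ts - self._last_cleanup_ts >= CLEANUP_INTERVAL_SECONDS:
            self._last_cleanup_ts = ts
            self.cleanup_old(datetime.fromtimestamp(ts, timezone.utc))
```

Tracking the last cleanup timestamp keeps the purge off the hot path: most inserts pay only for one comparison, and the delete runs roughly once per hour however often the health endpoint is pinged.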

Tests

Added 5 new tests in tests/test_litellm/proxy/test_health_check_functions.py:

test_get_all_latest_health_checks_applies_ttl_filter — verifies the WHERE filter is passed to the DB
test_get_all_latest_health_checks_ttl_env_override — verifies HEALTH_CHECK_TTL_DAYS controls the cutoff
test_cleanup_old_health_checks_calls_delete_many — verifies delete_many is called with a lt cutoff
test_save_health_check_result_triggers_cleanup_after_one_hour — cleanup runs when last run was > 1 hour ago
test_save_health_check_result_skips_cleanup_within_one_hour — cleanup is skipped when last run was < 1 hour ago

All 27 tests in the file pass.
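For readers unfamiliar with the "all mocked" style, a test like test_cleanup_old_health_checks_calls_delete_many might look roughly like the sketch below. The function under test here is a hypothetical stand-in written for this example, and the accessor name `litellm_healthchecktable` is an assumption; the PR's real tests live in the file named above and may differ.

```python
import asyncio
from datetime import datetime, timedelta, timezone
from unittest.mock import AsyncMock, MagicMock


async def cleanup_old_health_checks(db, ttl_days=7, now=None):
    """Hypothetical stand-in for the method under test, not the PR's code."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=ttl_days)
    # Single bulk delete; the accessor name on `db` is assumed for illustration.
    return await db.litellm_healthchecktable.delete_many(
        where={"checked_at": {"lt": cutoff}}
    )


def test_cleanup_calls_delete_many_with_lt_cutoff():
    db = MagicMock()
    db.litellm_healthchecktable.delete_many = AsyncMock(return_value=3)
    now = datetime(2026, 4, 13, tzinfo=timezone.utc)

    deleted = asyncio.run(cleanup_old_health_checks(db, ttl_days=7, now=now))

    assert deleted == 3
    db.litellm_healthchecktable.delete_many.assert_awaited_once_with(
        where={"checked_at": {"lt": now - timedelta(days=7)}}
    )


test_cleanup_calls_delete_many_with_lt_cutoff()
```

Passing `now` explicitly keeps the expected cutoff deterministic, so the assertion can compare the exact filter instead of patching the clock.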

Checklist

Tests added (5 new unit tests, all mocked)
make test-unit passes for this test file
Black formatting applied
Scope isolated to this single bug
Uses Prisma model methods (find_many, delete_many) — no raw SQL

Changed files

docs/my-website/docs/proxy/config_settings.md (modified, +1/-0)
litellm/proxy/utils.py (modified, +76/-34)
tests/test_litellm/proxy/test_health_check_functions.py (modified, +248/-81)

---

Check for existing issues

I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

When external monitoring tools (e.g., Uptime Kuma) are configured to monitor LiteLLM endpoints, the LiteLLM_HealthCheckTable grows unboundedly, leading to severe performance degradation in the Model Dashboard. The health check data is not actively displayed in this dashboard but still gets queried, causing memory/CPU spikes.

Expected Behavior

Health check data should either be displayed in the dashboard or not queried on dashboard load
The Model Dashboard should remain responsive regardless of health check history size
Old health check entries should be purged automatically, with a configurable retention period

Actual Behavior

LiteLLM_HealthCheckTable grows to 500,000+ rows without bounds
Each health check ping creates a new row
Model Dashboard queries trigger full table scans on the bloated table
Container memory grows to 8GB+ with 400% CPU usage

Prisma query-engine enters death/reconnect loops:

httpx.ReadError: Server disconnected
httpx.ConnectError: ClientNotConnectedError

Possible fixes

HealthCheckTable retention policy: Implement automatic cleanup of LiteLLM_HealthCheckTable entries older than N days (configurable, default 7 days)
Dashboard query optimization: Exclude HealthCheckTable from Model Dashboard queries, or use paginated/cursor-based queries if health data is needed

Steps to Reproduce

Configure external monitoring (e.g., Uptime Kuma) to ping LiteLLM health endpoints
Run for several days
Observe memory/CPU spike in LiteLLM container when opening Model Dashboard → severe lag or unresponsiveness

Relevant log output

What part of LiteLLM is this about?

UI Dashboard

What LiteLLM version are you on?

v1.82.3.dev.9

extent analysis

TL;DR

Implement a retention policy for the LiteLLM_HealthCheckTable to automatically purge old health check entries and optimize Model Dashboard queries to exclude or paginate the HealthCheckTable.

Guidance

Implement a configurable retention policy (e.g., 7 days) to automatically delete old LiteLLM_HealthCheckTable entries to prevent unbounded growth.
Optimize Model Dashboard queries to exclude the HealthCheckTable if health data is not displayed, or use pagination/cursor-based queries if health data is needed.
Verify the fix by monitoring the LiteLLM_HealthCheckTable size and Model Dashboard performance after implementing the retention policy and query optimizations.
Consider adding logging or monitoring to track the number of purged health check entries and query performance to ensure the fix is effective.
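The verification and logging steps in the guidance above can be tried out in isolation with a throwaway SQLite database. This is only an illustration of "purge, then report purged and remaining counts": LiteLLM itself goes through Prisma rather than raw SQL, and the table name `health_checks`, its columns, and the helper `purge_and_report` are assumptions made up for this sketch.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def purge_and_report(conn, cutoff):
    """Delete rows older than cutoff; return (purged, remaining)."""
    cur = conn.execute(
        "DELETE FROM health_checks WHERE checked_at < ?", (cutoff.isoformat(),)
    )
    conn.commit()
    remaining = conn.execute("SELECT COUNT(*) FROM health_checks").fetchone()[0]
    return cur.rowcount, remaining


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE health_checks (model_name TEXT, checked_at TEXT)")

now = datetime(2026, 4, 13, tzinfo=timezone.utc)
# Hypothetical history: one row per day for the last 30 days, stored as ISO text.
conn.executemany(
    "INSERT INTO health_checks VALUES (?, ?)",
    [("model-a", (now - timedelta(days=d)).isoformat()) for d in range(30)],
)

purged, remaining = purge_and_report(conn, now - timedelta(days=7))
print(f"purged={purged} remaining={remaining}")
```

Logging both numbers after each purge gives a simple signal that the table has stopped growing: the remaining count should level off instead of climbing with every ping.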

Example

No code snippet is provided as the issue does not contain sufficient information about the underlying database or query implementation.

Notes

The provided solution assumes that the LiteLLM_HealthCheckTable growth is the primary cause of the performance degradation. Additional investigation may be necessary to rule out other contributing factors.

Recommendation

Apply a workaround by implementing a retention policy for the LiteLLM_HealthCheckTable and optimizing Model Dashboard queries, as this is a more immediate and targeted solution to address the performance degradation issue.
