litellm - ✅(Solved) Fix [Bug]: Pass-through endpoint registry grows unbounded causing CPU to reach 100% [3 pull requests, 1 participant]

GitHub stats
BerriAI/litellm#24833 (fetched 2026-04-08 01:59:19)
Comments: 0 · Participants: 1 · Timeline events: 6 · Reactions: 0
Timeline (top): referenced ×3, cross-referenced ×2, labeled ×1

Root Cause

Root cause:

Fix Action

Fixed

PR fix notes

PR #24846: fix(pass-through): remove stale routes by key to prevent unbounded registry growth

Description (problem / solution / changelog)

Relevant issues

Fixes https://github.com/BerriAI/litellm/issues/24833

Pre-Submission checklist

  • I have Added testing in the tests/test_litellm/ directory, Adding at least 1 test is a hard requirement - see details
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Type

🐛 Bug Fix

Changes

One-line fix in initialize_pass_through_endpoints (line ~2294):

# Before — scans by endpoint_id, never matches because UUID changes every cycle
InitPassThroughEndpointHelpers.remove_endpoint_routes(endpoint_key)

# After — pops the stale route key directly in O(1)
_registered_pass_through_routes.pop(endpoint_key, None)

Root cause: When pass-through endpoints stored in DB have no id field, each 30s reload generates a new UUID. The old cleanup (remove_endpoint_routes) compared the full route key against endpoint_id (UUID only) — they never matched, so stale entries were never deleted. The dict grew without bound, making cleanup O(n²) and eventually pinning CPU at 100%.

Fix: Replace the remove_endpoint_routes() call with a direct dict.pop() on the route key, which removes the stale entry in O(1).
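
A minimal sketch of the difference, using a plain dict as a stand-in for _registered_pass_through_routes. The helper remove_by_endpoint_id is a simplified model of remove_endpoint_routes based on the snippet quoted in the issue, not the library's actual code, and the sample ID and path are made up for illustration.

```python
# Stand-in for the module-level registry: route_key -> route metadata.
registry = {
    "abc123:exact:/my/path:GET,POST": {"endpoint_id": "abc123"},
}


def remove_by_endpoint_id(routes, endpoint_id):
    """Simplified model of the old helper: full scan, match on endpoint_id."""
    keys = [k for k, v in routes.items() if v["endpoint_id"] == endpoint_id]
    for k in keys:
        del routes[k]
    return len(keys)


stale_key = "abc123:exact:/my/path:GET,POST"

# Old call: the full route key is compared against the short endpoint_id,
# so the scan finds nothing to delete.
removed_by_scan = remove_by_endpoint_id(registry, stale_key)

# New call: drop the stale key directly; the default avoids a KeyError
# when the key is already gone.
popped = registry.pop(stale_key, None)
```

In this model the scan removes nothing, while the pop empties the registry in a single dict operation.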

Tests added

3 unit tests in TestRemoveStaleEndpointRoute:

  • test_pop_removes_stale_route_by_key — verifies O(1) removal
  • test_pop_noop_for_unknown_key — verifies no-op for nonexistent keys
  • test_registry_does_not_grow_across_reload_cycles — core regression test: simulates 50 reload cycles with changing UUIDs and asserts the registry stays at constant size
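
The idea behind the core regression test can be sketched as follows. This is not the test from the PR: reload_once and route_key are hypothetical helpers that model one reload (register every endpoint under a fresh UUID, then pop keys that are no longer current), and the order of registration and cleanup in the real code is an assumption.

```python
import uuid


def route_key(endpoint_id, path, methods):
    # Key format described in the issue: {endpoint_id}:exact:{path}:{methods}
    return f"{endpoint_id}:exact:{path}:{','.join(sorted(methods))}"


def reload_once(registry, endpoints):
    """Hypothetical model of one reload cycle with changing UUIDs."""
    current = {}
    for ep in endpoints:
        endpoint_id = str(uuid.uuid4())  # new ID every cycle, as for DB endpoints
        key = route_key(endpoint_id, ep["path"], ep["methods"])
        current[key] = {"endpoint_id": endpoint_id}
    registry.update(current)
    # Cleanup as in the fix: pop every key that is not part of this cycle.
    for key in list(registry):
        if key not in current:
            registry.pop(key, None)


def simulate_reloads(cycles=50):
    registry = {}
    endpoints = [
        {"path": "/a", "methods": ["GET"]},
        {"path": "/b", "methods": ["POST", "GET"]},
    ]
    sizes = []
    for _ in range(cycles):
        reload_once(registry, endpoints)
        sizes.append(len(registry))
    return sizes


sizes = simulate_reloads()
```

A regression test would then assert that every recorded size equals the number of configured endpoints.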

Changed files

  • litellm/proxy/pass_through_endpoints/pass_through_endpoints.py (modified, +5/-1)
  • tests/test_litellm/proxy/pass_through_endpoints/test_pass_through_endpoints.py (modified, +80/-0)

PR #24872: fix: stable deterministic IDs + correct cleanup for pass-through registry leak (fixes #24833)

Description (problem / solution / changelog)

Summary

Fixes #24833 — _registered_pass_through_routes grows unbounded causing CPU to reach 100%.

Profiling showed remove_endpoint_routes consumed ~43 s of CPU on every 30-second reload cycle. Two independent root-cause bugs both needed to be fixed:


Bug 1: UUID churn — IDs change every reload

Location: _register_pass_through_endpoint() (line ~2154)

DB-sourced endpoints have no id field. The old code generated a fresh uuid.uuid4() on every call:

# before
if endpoint_data.get('id') is None:
    endpoint_data['id'] = str(uuid.uuid4())   # new UUID every 30 s!

Each reload produced a different route key in _registered_pass_through_routes, so old entries were never reused or cleaned up. With N endpoints and a 30-second reload interval the dict grew by N entries every cycle.

Fix: derive a stable, deterministic ID from path + methods via SHA-256:

# after
if endpoint_data.get('id') is None:
    _path_for_id = endpoint_data.get('path') or ''
    _methods_for_id = sorted(endpoint_data.get('methods') or [])
    _stable_key = f'path:{_path_for_id}|methods:{",".join(_methods_for_id)}'
    endpoint_data['id'] = 'auto-' + hashlib.sha256(_stable_key.encode()).hexdigest()[:16]

An unchanged endpoint now always maps to the same registry key across reloads. Explicit IDs (from YAML config or API) are preserved unchanged.
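
The derivation can be tried in isolation. The sketch below wraps the logic quoted above in a standalone function; the name stable_endpoint_id and the sample paths are illustrative assumptions, not identifiers from the PR.

```python
import hashlib


def stable_endpoint_id(path, methods):
    """Derive a deterministic ID from path + methods (hypothetical wrapper)."""
    methods_sorted = sorted(methods or [])  # sorting makes method order irrelevant
    stable_key = f"path:{path or ''}|methods:{','.join(methods_sorted)}"
    return "auto-" + hashlib.sha256(stable_key.encode()).hexdigest()[:16]


id_first = stable_endpoint_id("/v1/chat", ["POST", "GET"])
id_again = stable_endpoint_id("/v1/chat", ["GET", "POST"])  # same endpoint, reordered
id_other = stable_endpoint_id("/v1/embed", ["POST"])
```

Two consequences of this scheme are worth keeping in mind: editing an endpoint's path or methods produces a new ID, so the old key still has to be cleaned up, and method names are hashed as written, so "get" and "GET" would yield different IDs.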


Bug 2: Wrong argument to remove_endpoint_routes — cleanup was a no-op

Location: initialize_pass_through_endpoints() (line ~2294)

The cleanup loop collected route keys (format: {endpoint_id}:exact:{path}:{methods}) from get_all_registered_pass_through_routes(), then passed the full key to remove_endpoint_routes() which expects only the endpoint_id prefix:

# before — passes full key like '550e8400...:exact:/my/path:GET,POST'
InitPassThroughEndpointHelpers.remove_endpoint_routes(endpoint_key)

# inside remove_endpoint_routes: searches for value['endpoint_id'] == endpoint_key
# → never matches the stored short UUID → zero entries deleted → no-op

Even if IDs had been stable, this bug alone would have prevented any cleanup from ever happening.

Fix: split the route key to extract the endpoint_id before calling the helper:

# after
stale_endpoint_id = endpoint_key.split(':', 1)[0]
InitPassThroughEndpointHelpers.remove_endpoint_routes(stale_endpoint_id)
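
A small sketch of why extracting the prefix matters. The function below is a simplified model of remove_endpoint_routes built from the snippet quoted in the issue; the IDs, paths, and the "subpath" key variant are assumptions made for illustration.

```python
def remove_endpoint_routes(routes, endpoint_id):
    """Simplified model: delete every route whose endpoint_id matches."""
    keys = [k for k, v in routes.items() if v["endpoint_id"] == endpoint_id]
    for k in keys:
        del routes[k]
    return keys


routes = {
    "ep1:exact:/my/path:GET,POST": {"endpoint_id": "ep1"},
    "ep1:subpath:/my/path:GET,POST": {"endpoint_id": "ep1"},  # assumed key shape
    "ep2:exact:/other:GET": {"endpoint_id": "ep2"},
}

stale_key = "ep1:exact:/my/path:GET,POST"

# Split once on the first colon to recover the endpoint_id prefix.
stale_endpoint_id = stale_key.split(":", 1)[0]

removed = remove_endpoint_routes(routes, stale_endpoint_id)
```

Note that splitting on the first colon only works if the endpoint ID itself contains no colon. That holds for UUIDs and for the auto- prefixed hashes, but an explicit ID containing ":" would be truncated by this approach.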

Tests

Added 5 new regression tests in tests/pass_through_unit_tests/test_passthrough_registry_leak.py:

Test | What it verifies
test_db_endpoint_without_id_gets_stable_deterministic_id | Same path produces same ID across two simulated reloads
test_different_paths_produce_different_ids | Different paths → different IDs (no collision)
test_remove_endpoint_routes_called_with_endpoint_id_prefix | Extracted prefix correctly removes both exact + subpath routes
test_registry_does_not_grow_across_reload_cycles | 5 reload cycles of same endpoint → registry size stays constant
test_explicit_id_is_preserved | Endpoints with an explicit id keep their ID unchanged

Impact

  • All existing pass-through endpoint behavior is unchanged for config-file endpoints (these carry explicit IDs)
  • DB endpoints (no id) now get a deterministic, stable auto- prefixed ID
  • CPU usage caused by the O(n²) cleanup scan is eliminated

Changed files

  • litellm/proxy/pass_through_endpoints/pass_through_endpoints.py (modified, +37/-2)
  • tests/pass_through_unit_tests/test_passthrough_registry_leak.py (added, +291/-0)


Check for existing issues

  • I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

When using pass-through endpoints with store_model_in_db: true, CPU usage gradually climbs to 100% and stays there, even when idle. Profiling with py-spy shows that remove_endpoint_routes in pass_through_endpoints.py consumes 42.91 seconds of CPU — nearly the entire initialize_pass_through_endpoints cycle.

Root cause:

LiteLLM reloads pass-through endpoints from the DB every ~30 seconds. The DB records have no id field, so _register_pass_through_endpoint generates a new UUID on every reload cycle (line 2154-2155):

if endpoint_data.get("id") is None:
    endpoint_data["id"] = str(uuid.uuid4())

This new UUID becomes part of the route key ({endpoint_id}:exact:{path}:{methods}), which is inserted into _registered_pass_through_routes. During cleanup, the code iterates through all entries looking for matching endpoint_id:

# remove_endpoint_routes (line 2008-2027)
keys_to_remove = [
    key
    for key, value in _registered_pass_through_routes.items()
    if value["endpoint_id"] == endpoint_id
]

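Since the endpoint_id (UUID) changes every cycle, old entries never match and never get deleted. The dictionary grows by N entries every 30 seconds. Because each cleanup pass runs a full scan of the dict once per stale key, the work per reload grows quadratically with the number of accumulated entries instead of linearly.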
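
The growth can be illustrated with a toy model. The code below is not LiteLLM code: it assumes each reload registers every endpoint under a new ID and then calls a scan-based cleanup once per stale key with the full key as the argument, and it counts the comparisons the scans perform.

```python
def buggy_remove(registry, endpoint_key):
    """Model of the scan: compares every stored endpoint_id to the full key."""
    comparisons = 0
    keys_to_remove = []
    for key, value in registry.items():
        comparisons += 1
        if value["endpoint_id"] == endpoint_key:  # never true for a full key
            keys_to_remove.append(key)
    for key in keys_to_remove:
        del registry[key]
    return comparisons


def simulate(cycles, n_endpoints):
    registry = {}
    history = []  # (registry size, comparisons) per cycle
    for cycle in range(cycles):
        current = {}
        for i in range(n_endpoints):
            endpoint_id = f"id{cycle}-{i}"  # stands in for a fresh UUID
            current[f"{endpoint_id}:exact:/p{i}:GET"] = {"endpoint_id": endpoint_id}
        registry.update(current)
        comparisons = 0
        for key in list(registry):
            if key not in current:
                comparisons += buggy_remove(registry, key)
        history.append((len(registry), comparisons))
    return history


history = simulate(cycles=4, n_endpoints=2)
```

With 2 endpoints and 4 cycles this model yields (2, 0), (4, 8), (6, 24), (8, 48): the registry grows linearly while the comparisons per cycle grow roughly with the square of the cycle count.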

Impact:

  • _registered_pass_through_routes grows without bound
  • Every 30-second cleanup cycle scans the entire (ever-growing) dict
  • CPU usage climbs steadily with each reload and eventually pins at 100%
  • Affects all users with DB-stored pass-through endpoints (no id field)

Steps to Reproduce

  1. Configure pass-through endpoints stored in DB (via UI/API, not YAML config — these have no id field)
  2. Enable store_model_in_db: true in general_settings
  3. Start the proxy and send some requests through pass-through endpoints
  4. Monitor CPU usage over time — it will steadily increase
  5. Profile with py-spy: remove_endpoint_routes dominates CPU

Relevant log output

py-spy shows remove_endpoint_routes consumed 42.91 seconds of CPU,
accounting for nearly the entirety of initialize_pass_through_endpoints (43 seconds total).

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on ?

v1.83.7-stable (verified bug still present on main HEAD as of 2026-04-20)

Twitter / LinkedIn details

No response


Re-filed from #24833 with a fresh PR rebased onto current main. Original issue and PR #24846 stalled for 3 weeks without review.

extent analysis

TL;DR

  • The most likely fix is to modify the remove_endpoint_routes function to efficiently handle the growing dictionary of pass-through endpoint routes.

Guidance

  • Review the database schema to consider adding an id field to pass-through endpoint records to prevent UUID regeneration on every reload cycle.
  • Optimize the remove_endpoint_routes function to improve dictionary lookup efficiency, potentially by using a more efficient data structure or by reorganizing the cleanup logic.
  • Consider implementing a mechanism to limit the growth of the _registered_pass_through_routes dictionary, such as by removing old entries after a certain threshold is reached.
  • Investigate the possibility of caching or batching database queries to reduce the frequency of reload cycles.

Example

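# Example: index route keys by endpoint_id so removal avoids a full-dict scan.
# Illustrative only: _route_keys_by_endpoint is a hypothetical index, not part
# of LiteLLM, and would have to be updated whenever a route is registered.
from collections import defaultdict

_registered_pass_through_routes = {}  # route_key -> route info
_route_keys_by_endpoint = defaultdict(set)  # endpoint_id -> set of route keys

def remove_endpoint_routes(endpoint_id):
    # Remove only the routes recorded for this endpoint_id
    for key in _route_keys_by_endpoint.pop(endpoint_id, set()):
        _registered_pass_through_routes.pop(key, None)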

Notes

  • The provided information suggests a performance issue related to the growing dictionary, but the root cause may be more complex, and additional investigation may be necessary.
  • The proposed fix assumes that the id field can be added to the database schema or that an alternative solution can be implemented to prevent UUID regeneration.

Recommendation

  • Apply workaround: Modify the remove_endpoint_routes function to improve dictionary lookup efficiency and consider implementing a mechanism to limit dictionary growth, as the current implementation leads to significant performance degradation.
