litellm - ✅(Solved) Fix [Bug]: Pass-through endpoint registry grows unbounded causing CPU to reach 100% [3 pull requests, 1 participants]

litellm · 2026-03-31 03:49:31

GitHub stats

BerriAI/litellm#24833 • Fetched 2026-04-08 01:59:19

Author

silencedoctor

Participants

silencedoctor

Timeline (top)

referenced ×3, cross-referenced ×2, labeled ×1

Root Cause

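Root cause: DB-stored pass-through endpoints have no id field, so every ~30-second reload generates a new UUID and therefore a new route key. The cleanup step compared the full route key against the stored endpoint_id, which never matched, so stale entries accumulated in _registered_pass_through_routes.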

Fix Action

Fixed

Fixed by PR: fix(pass-through): pop stale route key directly to prevent unbounded registry growth (https://github.com/BerriAI/litellm/pull/26082)
Fixed by PR: fix(pass-through): remove stale routes by key to prevent unbounded registry growth (https://github.com/BerriAI/litellm/pull/24846)

PR fix notes

PR #24846: fix(pass-through): remove stale routes by key to prevent unbounded registry growth

Repository: BerriAI/litellm
Author: silencedoctor
State: open | merged: False
Link: https://github.com/BerriAI/litellm/pull/24846

Description (problem / solution / changelog)

Relevant issues

Fixes https://github.com/BerriAI/litellm/issues/24833

Pre-Submission checklist

I have Added testing in the tests/test_litellm/ directory, Adding at least 1 test is a hard requirement - see details
My PR passes all unit tests on make test-unit
My PR's scope is as isolated as possible, it only solves 1 specific problem
I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Type

🐛 Bug Fix

Changes

One-line fix in initialize_pass_through_endpoints (line ~2294):

# Before — scans by endpoint_id, never matches because UUID changes every cycle
InitPassThroughEndpointHelpers.remove_endpoint_routes(endpoint_key)

# After — pops the stale route key directly in O(1)
_registered_pass_through_routes.pop(endpoint_key, None)

Root cause: When pass-through endpoints stored in DB have no id field, each 30s reload generates a new UUID. The old cleanup (remove_endpoint_routes) compared the full route key against endpoint_id (UUID only) — they never matched, so stale entries were never deleted. The dict grew without bound, making cleanup O(n²) and eventually pinning CPU at 100%.

Fix: Replace the remove_endpoint_routes() call with a direct dict.pop() on the route key, which removes the stale entry in O(1).
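The mismatch described above can be reproduced in isolation. This is a minimal sketch, not LiteLLM's actual code: the `registry` dict, the `scan_remove` helper, and the sample key are hypothetical stand-ins that follow the route-key format `{endpoint_id}:exact:{path}:{methods}` described in this report.

```python
# Hypothetical stand-in for _registered_pass_through_routes.
registry = {
    "uuid-1:exact:/my/path:GET,POST": {"endpoint_id": "uuid-1"},
}


def scan_remove(registry, endpoint_id):
    """Mimic the scan-based cleanup: match on value['endpoint_id']."""
    keys_to_remove = [
        key for key, value in registry.items()
        if value["endpoint_id"] == endpoint_id
    ]
    for key in keys_to_remove:
        del registry[key]
    return len(keys_to_remove)


stale_key = "uuid-1:exact:/my/path:GET,POST"

# Passing the full route key: it never equals the stored endpoint_id,
# so the scan finds nothing to remove.
removed_by_scan = scan_remove(registry, stale_key)

# Popping by key removes the entry directly, without scanning the dict.
popped = registry.pop(stale_key, None)
```

In this sketch the scan removes zero entries while `pop` removes the stale entry, which is the behavior difference the one-line change relies on.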

Tests added

3 unit tests in TestRemoveStaleEndpointRoute:

test_pop_removes_stale_route_by_key — verifies O(1) removal
test_pop_noop_for_unknown_key — verifies no-op for nonexistent keys
test_registry_does_not_grow_across_reload_cycles — core regression test: simulates 50 reload cycles with changing UUIDs and asserts the registry stays at constant size
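The idea behind the reload-cycle regression test can be sketched as follows. This is an illustration under stated assumptions, not the PR's test code: `reload_once` and `simulate` are hypothetical helpers, and the registry is a plain dict standing in for _registered_pass_through_routes.

```python
import uuid


def reload_once(registry, paths):
    """One simulated reload: register fresh keys, then pop stale ones by key."""
    visited = set()
    for path in paths:
        # A new ID every cycle, as happens for DB endpoints without an id.
        endpoint_id = str(uuid.uuid4())
        key = f"{endpoint_id}:exact:{path}:GET"
        registry[key] = {"endpoint_id": endpoint_id, "path": path}
        visited.add(key)
    # Cleanup: remove every key that was not registered in this cycle.
    for key in list(registry):
        if key not in visited:
            registry.pop(key, None)


def simulate(cycles, paths):
    """Run several reload cycles and record the registry size after each."""
    registry = {}
    sizes = []
    for _ in range(cycles):
        reload_once(registry, paths)
        sizes.append(len(registry))
    return sizes


sizes = simulate(50, ["/a", "/b", "/c"])
```

With key-based removal, the recorded size stays equal to the number of endpoints in every cycle, even though the IDs change each time.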

Changed files

litellm/proxy/pass_through_endpoints/pass_through_endpoints.py (modified, +5/-1)
tests/test_litellm/proxy/pass_through_endpoints/test_pass_through_endpoints.py (modified, +80/-0)

PR #24872: fix: stable deterministic IDs + correct cleanup for pass-through registry leak (fixes #24833)

Repository: BerriAI/litellm
Author: voidborne-d
State: open | merged: False
Link: https://github.com/BerriAI/litellm/pull/24872

Description (problem / solution / changelog)

Summary

Fixes #24833 — _registered_pass_through_routes grows unbounded causing CPU to reach 100%.

Profiling showed remove_endpoint_routes consumed ~43 s of CPU on every 30-second reload cycle. Two independent root-cause bugs both needed to be fixed:

Bug 1: UUID churn — IDs change every reload

Location: _register_pass_through_endpoint() (line ~2154)

DB-sourced endpoints have no id field. The old code generated a fresh uuid.uuid4() on every call:

# before
if endpoint_data.get('id') is None:
    endpoint_data['id'] = str(uuid.uuid4())   # new UUID every 30 s!

Each reload produced a different route key in _registered_pass_through_routes, so old entries were never reused or cleaned up. With N endpoints and a 30-second reload interval the dict grew by N entries every cycle.

Fix: derive a stable, deterministic ID from path + methods via SHA-256:

# after
if endpoint_data.get('id') is None:
    _path_for_id = endpoint_data.get('path') or ''
    _methods_for_id = sorted(endpoint_data.get('methods') or [])
    _stable_key = f'path:{_path_for_id}|methods:{",".join(_methods_for_id)}'
    endpoint_data['id'] = 'auto-' + hashlib.sha256(_stable_key.encode()).hexdigest()[:16]

An unchanged endpoint now always maps to the same registry key across reloads. Explicit IDs (from YAML config or API) are preserved unchanged.
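The derivation can be tried on its own. The following sketch wraps the same hashing steps in a hypothetical helper, `stable_endpoint_id`, which is not a LiteLLM function; it shows that the result does not depend on the order in which methods are listed.

```python
import hashlib


def stable_endpoint_id(path, methods):
    """Derive a deterministic ID from path + sorted methods (sketch)."""
    methods_sorted = sorted(methods or [])
    stable_key = f"path:{path or ''}|methods:{','.join(methods_sorted)}"
    # Keep the first 16 hex characters (64 bits) of the SHA-256 digest.
    return "auto-" + hashlib.sha256(stable_key.encode()).hexdigest()[:16]


id_a = stable_endpoint_id("/my/path", ["POST", "GET"])
id_b = stable_endpoint_id("/my/path", ["GET", "POST"])  # same methods, reordered
id_c = stable_endpoint_id("/other/path", ["GET"])
```

Here `id_a` and `id_b` are identical because the methods are sorted before hashing. One consequence of this design is that changing an endpoint's path or methods produces a different ID.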

Bug 2: Wrong argument to remove_endpoint_routes — cleanup was a no-op

Location: initialize_pass_through_endpoints() (line ~2294)

The cleanup loop collected route keys (format: {endpoint_id}:exact:{path}:{methods}) from get_all_registered_pass_through_routes(), then passed the full key to remove_endpoint_routes() which expects only the endpoint_id prefix:

# before — passes full key like '550e8400...:exact:/my/path:GET,POST'
InitPassThroughEndpointHelpers.remove_endpoint_routes(endpoint_key)

# inside remove_endpoint_routes: searches for value['endpoint_id'] == endpoint_key
# → never matches the stored short UUID → zero entries deleted → no-op

Even if IDs had been stable, this bug alone would have prevented any cleanup from ever happening.

Fix: split the route key to extract the endpoint_id before calling the helper:

# after
stale_endpoint_id = endpoint_key.split(':', 1)[0]
InitPassThroughEndpointHelpers.remove_endpoint_routes(stale_endpoint_id)
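The prefix extraction can be checked in isolation. In this sketch, `endpoint_id_from_route_key` is a hypothetical helper, and the sample keys (including the `subpath` variant and the explicit ID `team:a`) are assumptions made for illustration. The last case suggests the split relies on endpoint IDs not containing a colon, which holds for UUIDs and for the `auto-` prefixed IDs.

```python
def endpoint_id_from_route_key(route_key: str) -> str:
    """Return the text before the first ':' in a route key."""
    return route_key.split(":", 1)[0]


uuid_key = "550e8400-e29b-41d4-a716-446655440000:exact:/my/path:GET,POST"
auto_key = "auto-0123456789abcdef:subpath:/my/path:GET"
colon_key = "team:a:exact:/my/path:GET"  # hypothetical explicit ID "team:a"

uuid_id = endpoint_id_from_route_key(uuid_key)
auto_id = endpoint_id_from_route_key(auto_key)
# An ID that itself contains ':' is truncated at its first colon.
colon_id = endpoint_id_from_route_key(colon_key)
```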

Tests

Added 5 new regression tests in tests/pass_through_unit_tests/test_passthrough_registry_leak.py:

| Test | What it verifies |
| --- | --- |
| `test_db_endpoint_without_id_gets_stable_deterministic_id` | Same path produces same ID across two simulated reloads |
| `test_different_paths_produce_different_ids` | Different paths → different IDs (no collision) |
| `test_remove_endpoint_routes_called_with_endpoint_id_prefix` | Extracted prefix correctly removes both exact + subpath routes |
| `test_registry_does_not_grow_across_reload_cycles` | 5 reload cycles of same endpoint → registry size stays constant |
| `test_explicit_id_is_preserved` | Endpoints with an explicit `id` keep their ID unchanged |

Impact

All existing pass-through endpoint behavior is unchanged for config-file endpoints (these carry explicit IDs)
DB endpoints (no id) now get a deterministic, stable auto- prefixed ID
CPU usage caused by the O(n²) cleanup scan is eliminated

Changed files

litellm/proxy/pass_through_endpoints/pass_through_endpoints.py (modified, +37/-2)
tests/pass_through_unit_tests/test_passthrough_registry_leak.py (added, +291/-0)

Check for existing issues

I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

When using pass-through endpoints with store_model_in_db: true, CPU usage gradually climbs to 100% and stays there, even when idle. Profiling with py-spy shows that remove_endpoint_routes in pass_through_endpoints.py consumes 42.91 seconds of CPU — nearly the entire initialize_pass_through_endpoints cycle.

Root cause:

LiteLLM reloads pass-through endpoints from the DB every ~30 seconds. The DB records have no id field, so _register_pass_through_endpoint generates a new UUID on every reload cycle (line 2154-2155):

if endpoint_data.get("id") is None:
    endpoint_data["id"] = str(uuid.uuid4())

This new UUID becomes part of the route key ({endpoint_id}:exact:{path}:{methods}), which is inserted into _registered_pass_through_routes. During cleanup, the code iterates through all entries looking for matching endpoint_id:

# remove_endpoint_routes (line 2008-2027)
keys_to_remove = [
    key
    for key, value in _registered_pass_through_routes.items()
    if value["endpoint_id"] == endpoint_id
]

Since the endpoint_id (UUID) changes every cycle, old entries never match and never get deleted. The dictionary grows by N entries every 30 seconds, and because every stale key triggers its own full scan of the dict, the cleanup cost per cycle degrades from O(n) to O(n²).
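The growth pattern can be modeled with a small simulation. This is a sketch under assumptions, not LiteLLM's code: `simulate_leak` is a hypothetical function, and it assumes that cleanup runs one full scan per stale route key and compares the stored endpoint_id against the full key, as the report describes.

```python
import uuid


def simulate_leak(cycles, n_endpoints):
    """Return (registry size, comparisons) after each simulated reload."""
    registry = {}
    stats = []
    for _ in range(cycles):
        visited = set()
        for i in range(n_endpoints):
            endpoint_id = str(uuid.uuid4())  # fresh ID every cycle
            key = f"{endpoint_id}:exact:/route/{i}:GET"
            registry[key] = {"endpoint_id": endpoint_id}
            visited.add(key)

        comparisons = 0
        stale_keys = [k for k in registry if k not in visited]
        for stale_key in stale_keys:
            # One full scan per stale key; the full key never equals
            # a stored endpoint_id, so nothing is ever deleted.
            matches = []
            for key, value in registry.items():
                comparisons += 1
                if value["endpoint_id"] == stale_key:
                    matches.append(key)
            for key in matches:
                del registry[key]
        stats.append((len(registry), comparisons))
    return stats


stats = simulate_leak(4, 2)
# stats == [(2, 0), (4, 8), (6, 24), (8, 48)]
```

In this model the registry grows by N entries per cycle while the number of comparisons per cycle grows with the product of stale keys and registry size, which matches the quadratic behavior described above.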

Impact:

_registered_pass_through_routes grows without bound
Every 30-second cleanup cycle scans the entire (ever-growing) dict
CPU usage increases linearly and eventually pins at 100%
Affects all users with DB-stored pass-through endpoints (no id field)

Steps to Reproduce

Configure pass-through endpoints stored in DB (via UI/API, not YAML config — these have no id field)
Enable store_model_in_db: true in general_settings
Start the proxy and send some requests through pass-through endpoints
Monitor CPU usage over time — it will steadily increase
Profile with py-spy: remove_endpoint_routes dominates CPU

Relevant log output

py-spy shows remove_endpoint_routes consumed 42.91 seconds of CPU,
accounting for nearly the entirety of initialize_pass_through_endpoints (43 seconds total).

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on ?

v1.83.7-stable (verified bug still present on main HEAD as of 2026-04-20)

Twitter / LinkedIn details

No response

Re-filed from #24833 with a fresh PR rebased onto current main. Original issue and PR #24846 stalled for 3 weeks without review.

extent analysis

TL;DR

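The most likely fix is to stop stale entries from accumulating in _registered_pass_through_routes, either by popping stale route keys directly (PR #24846) or by giving DB-sourced endpoints a stable ID and passing only the endpoint_id to remove_endpoint_routes (PR #24872).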

Guidance

Review the database schema to consider adding an id field to pass-through endpoint records to prevent UUID regeneration on every reload cycle.
Optimize the remove_endpoint_routes function to improve dictionary lookup efficiency, potentially by using a more efficient data structure or by reorganizing the cleanup logic.
Consider implementing a mechanism to limit the growth of the _registered_pass_through_routes dictionary, such as by removing old entries after a certain threshold is reached.
Investigate the possibility of caching or batching database queries to reduce the frequency of reload cycles.

Example

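# Example of using a per-endpoint set of route keys for efficient lookup.
# A sketch only: _route_keys_by_endpoint_id is a hypothetical index, not part of LiteLLM.
_registered_pass_through_routes = {}  # route key -> route info
_route_keys_by_endpoint_id = {}       # endpoint_id -> set of route keys

def remove_endpoint_routes(endpoint_id):
    # Drop the endpoint's index entry, then remove each of its routes by key
    for route_key in _route_keys_by_endpoint_id.pop(endpoint_id, set()):
        _registered_pass_through_routes.pop(route_key, None)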

Notes

The provided information suggests a performance issue related to the growing dictionary, but the root cause may be more complex, and additional investigation may be necessary.
The proposed fix assumes that the id field can be added to the database schema or that an alternative solution can be implemented to prevent UUID regeneration.

Recommendation

Apply workaround: Modify the remove_endpoint_routes function to improve dictionary lookup efficiency and consider implementing a mechanism to limit dictionary growth, as the current implementation leads to significant performance degradation.
