litellm - [Bug]: Worker with transient Postgres/HAProxy outage causing valid models to fail with "Invalid model name"

litellm · 2026-04-22 10:52:44


BerriAI/litellm#26237 • Fetched 2026-04-23 07:24:28

Author

mishaja12

Error Message

Response: {"error":{"message":"{'error': '/chat/completions: Invalid model name passed in model=google/gemini-2.5-flash. Call /v1/models to view available models for your key.'}","type":"None","param":"None","code":"400"}}

Root Cause

LiteLLM should not serve traffic until router/model state has been loaded successfully at least once. Alternatively, it should keep retrying and recover automatically once Postgres connectivity returns, rather than staying stuck in a bad state. Valid configured models should not be rejected permanently just because the first DB poll failed during startup.

Check for existing issues

I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

We hit an incident where LiteLLM started returning Invalid model name for a model that was valid and configured (e.g. google/gemini-2.5-flash, but actually all 140 models that we've configured).

This happened in a setup where LiteLLM uses a Postgres-backed config/state layer behind HAProxy. The leading pattern we observed is:

There is a transient Postgres / HAProxy instability.
A LiteLLM worker starts or refreshes during that instability.
The worker appears to miss its initial successful model/router load.
The worker still serves traffic.
All requests going through that worker fail with Invalid model name for otherwise valid models.
The issue stops after the worker is restarted or after a later successful refresh.

What we expected to happen:

This looks like a startup / readiness / router initialization bug in the DB-backed config path, rather than a real invalid-model request.

Steps to Reproduce

I believe this is quite hard to replicate, but our setup is PostgreSQL + HAProxy:

Run LiteLLM in Kubernetes with DB-backed configuration/state stored in Postgres, with Postgres accessed through HAProxy.
Configure a valid model such as google/gemini-2.5-flash (or any OpenAI model).
Introduce a short Postgres / HAProxy outage or connection flap.
Restart one LiteLLM worker during that outage window.
Allow the worker to come up.
Send a normal OpenAI-compatible request to /chat/completions using the valid configured model.
Observe that the restarted worker returns Invalid model name for that valid model, and continues doing so until restart or later successful refresh.

Example request:

curl -X POST "http://<litellm-host>/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <key>" \
  -d '{ "model": "google/gemini-2.5-flash", "messages": [ { "role": "user", "content": "hello" } ] }'

Relevant log output

Response: {"error":{"message":"{'error': '/chat/completions: Invalid model name passed in model=google/gemini-2.5-flash. Call `/v1/models` to view available models for your key.'}","type":"None","param":"None","code":"400"}} 

And the same for every other model
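
For monitoring, a worker stuck in this state can be recognized from the response body alone. The following is a minimal sketch, assuming the error payload keeps the JSON shape shown in the log output above. The helper name is_invalid_model_error is hypothetical and not part of LiteLLM.

```python
import json


def is_invalid_model_error(body: str) -> bool:
    """Return True if a proxy response body is the 'Invalid model name' rejection.

    Hypothetical helper; assumes the JSON shape shown in the log output.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    error = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(error, dict):
        return False
    message = str(error.get("message", ""))
    # The proxy reports the code as the string "400" in this payload.
    return str(error.get("code")) == "400" and "Invalid model name" in message


# Response body copied from the log output in this issue.
sample = (
    '{"error":{"message":"{\'error\': \'/chat/completions: Invalid model name '
    'passed in model=google/gemini-2.5-flash. Call `/v1/models` to view available '
    'models for your key.\'}","type":"None","param":"None","code":"400"}}'
)
print(is_invalid_model_error(sample))
```

A check like this only flags the symptom. It cannot tell a genuinely misconfigured model from a worker whose router state never loaded, so it is best paired with a list of models known to be configured.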

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on ?

v1.82.3

Twitter / LinkedIn details

No response

extent analysis

TL;DR

Implement a retry mechanism for the LiteLLM worker to reload the model/router state when the initial load fails due to Postgres/HAProxy instability.

Guidance

Verify that the Postgres/HAProxy instability is indeed the root cause by checking the logs for any connection errors or timeouts during the worker startup.
Implement a health check for the LiteLLM worker to ensure it does not serve traffic until the model/router state is successfully loaded.
Consider adding a circuit breaker pattern to detect when the Postgres/HAProxy connection is unstable and prevent the worker from serving traffic until the connection is restored.
Review the LiteLLM configuration to ensure that the worker is properly configured to handle transient errors and retry the model/router state load as needed.
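
The health-check idea above can be sketched as a small readiness gate. This is an illustrative sketch only: RouterReadiness and its methods are hypothetical names, not LiteLLM's actual implementation, and the status codes are an assumed policy. The worker reports "not ready" until the model/router state has loaded at least once, and until then requests are answered with 503 rather than a misleading 400.

```python
import threading
from typing import List


class RouterReadiness:
    """Tracks whether model/router state has loaded successfully at least once.

    Hypothetical sketch; not LiteLLM's real readiness logic.
    """

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._models: List[str] = []
        self._loaded_once = False

    def record_load(self, models: List[str]) -> None:
        """Call after a successful DB poll to publish the model list."""
        with self._lock:
            self._models = list(models)
            self._loaded_once = True

    def readiness_status(self) -> int:
        """Status a readiness probe would return: 503 until the first load."""
        with self._lock:
            return 200 if self._loaded_once else 503

    def check_model(self, model: str) -> int:
        """Status for a request: 503 if never loaded, 400 if unknown, 200 if known."""
        with self._lock:
            if not self._loaded_once:
                return 503
            return 200 if model in self._models else 400


state = RouterReadiness()
print(state.check_model("google/gemini-2.5-flash"))  # 503: state never loaded
state.record_load(["google/gemini-2.5-flash"])
print(state.check_model("google/gemini-2.5-flash"))  # 200: model is known
```

Returning 503 before the first successful load lets a load balancer or a Kubernetes readiness probe route traffic to other workers, whereas a 400 tells the client its request was wrong.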

Example

A possible implementation of the retry mechanism could involve using a library like tenacity to decorate the model/router state load function with a retry policy, e.g.:

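import tenacity

# Hypothetical loader; LiteLLM's real startup hook is not shown in the issue.
@tenacity.retry(
    wait=tenacity.wait_exponential(multiplier=1, min=4, max=10),  # wait 4s to 10s between attempts
    stop=tenacity.stop_after_delay(300),  # assumed limit: give up after 5 minutes
    reraise=True,  # surface the last error instead of tenacity.RetryError
)
def load_model_state():
    # code to load model/router state; let DB errors propagate so tenacity retries
    pass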
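
If adding a dependency is not an option, the same backoff idea can be sketched with the standard library alone. Everything here is an assumption for illustration: load_with_backoff and flaky_load are hypothetical names, and ConnectionError stands in for whatever exception the real database client raises. The sleep function is injectable so the example runs instantly.

```python
import time
from typing import Callable, List


def load_with_backoff(
    load: Callable[[], List[str]],
    max_attempts: int = 5,
    base_delay: float = 4.0,
    max_delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call `load` until it succeeds, doubling the wait up to `max_delay`.

    Re-raises the last ConnectionError once `max_attempts` is exhausted.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return load()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise ValueError("max_attempts must be at least 1")


# Simulated flaky loader: fails twice, then returns the model list.
calls = {"n": 0}


def flaky_load() -> List[str]:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("postgres unavailable")
    return ["google/gemini-2.5-flash"]


waits: List[float] = []
models = load_with_backoff(flaky_load, sleep=waits.append)
print(models, waits)
```

In the simulated run the loader succeeds on the third attempt after waits of 4 and 8 seconds. A real worker would also need to decide what to do while the retries are in progress, for example by reporting "not ready" instead of serving requests.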

Notes

The exact implementation of the retry mechanism and health check will depend on the specific requirements and constraints of the LiteLLM application.
It may be necessary to modify the LiteLLM configuration or code to properly handle transient errors and implement the retry mechanism.

Recommendation

Add a retry mechanism so that a LiteLLM worker reloads the model/router state when the initial load fails due to Postgres/HAProxy instability. This lets the worker recover from transient errors and serve traffic correctly.
