litellm - [Bug]: No Retry-After header on RouterRateLimitError (all deployments in cooldown) (Solved; 2 pull requests, 1 comment, 2 participants)

GitHub stats
BerriAI/litellm#27823 (fetched 2026-05-14 03:30:26)
Comments: 1
Participants: 2
Timeline events: 4
Reactions: 0
Timeline (top): cross-referenced ×2, commented ×1, labeled ×1

Error Message

When all deployments for a model are in cooldown, the proxy returns a 429 whose body carries the retry timing only as text:

{"error": {"message": "No deployments available for selected model, Try again in 60 seconds. ...", "type": "None", "param": "None", "code": "429"}}

The response has no Retry-After header, so the timing is only available by parsing the error message body. RouterRateLimitError already carries self.cooldown_time as a float attribute. This value should be exposed as a standard Retry-After HTTP header on the 429 response so clients can respect it without parsing error message strings.

Fix Action

Fix

In _handle_llm_api_exception in litellm/proxy/common_request_processing.py, after headers are assembled and before the raise ProxyException(...) branches, check if the exception is a RouterRateLimitError and add the header:

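from litellm.types.router import RouterRateLimitError

# ...

# Inside _handle_llm_api_exception, once `headers` has been assembled:
if isinstance(e, RouterRateLimitError):
    # getattr guards against instances that lack the attribute.
    cooldown_time = getattr(e, "cooldown_time", None)
    if cooldown_time is not None:
        # Retry-After takes whole seconds; int() truncates the float.
        headers["retry-after"] = str(int(cooldown_time))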

This is distinct from:

  • #21553 / PR #21648 — forwarding upstream provider Retry-After header (gap #1)
  • #26070 — exposing retry_after attribute on litellm.RateLimitError from provider messages

This issue covers the case where LiteLLM itself is the rate limiter (router-level cooldown), not the upstream provider.
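
The check described in this section can be tried outside the proxy. The following is a minimal sketch, not the LiteLLM implementation: FakeRouterRateLimitError and add_retry_after are hypothetical stand-ins so the snippet runs without litellm installed. Only the cooldown_time attribute and the retry-after header name come from the issue.

```python
from typing import Dict, Optional


class FakeRouterRateLimitError(ValueError):
    """Hypothetical stand-in for litellm's RouterRateLimitError."""

    def __init__(self, cooldown_time: Optional[float]) -> None:
        super().__init__(f"No deployments available, try again in {cooldown_time} seconds.")
        self.cooldown_time = cooldown_time


def add_retry_after(headers: Dict[str, str], exc: Exception) -> Dict[str, str]:
    """Mirror the proposed check: copy cooldown_time into a retry-after header."""
    if isinstance(exc, FakeRouterRateLimitError):
        cooldown_time = getattr(exc, "cooldown_time", None)
        if cooldown_time is not None:
            # int() truncates, so 60.0 becomes "60" and 0.5 becomes "0".
            headers["retry-after"] = str(int(cooldown_time))
    return headers
```

Because int() truncates, a fractional cooldown is rounded down. A proxy that wants to avoid inviting a retry before the cooldown ends could round up with math.ceil instead; the issue does not address that choice.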

PR fix notes

PR #27825: fix(proxy): add Retry-After header on RouterRateLimitError

Description (problem / solution / changelog)

Problem

When all deployments for a model are in cooldown (e.g. after upstream 429 rate limits), LiteLLM raises RouterRateLimitError with a human-readable message like:

No deployments available for selected model, Try again in 120 seconds.

However, no Retry-After HTTP header is set on the response. This means downstream clients (OpenAI SDK, custom agents, API gateways) cannot programmatically determine when to retry — they must parse the error message string, which is fragile and non-standard.

Fix

The cooldown_time is already available as a float attribute on RouterRateLimitError. This PR promotes it to a standard Retry-After HTTP header in _handle_llm_api_exception(), so clients get:

HTTP/1.1 429 Too Many Requests
Retry-After: 60
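
On the client side, the header can then be read directly instead of parsing the message. A minimal sketch, assuming the response headers are available as a plain mapping; retry_delay_seconds is a hypothetical helper, not part of LiteLLM or the OpenAI SDK. It accepts both forms HTTP allows for Retry-After: a number of seconds and an HTTP date.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def retry_delay_seconds(
    headers: Mapping[str, str], now: Optional[datetime] = None
) -> Optional[float]:
    """Return the wait in seconds from Retry-After, or None if absent or unparseable."""
    # Header names are case-insensitive, so compare in lowercase.
    value = next((v for k, v in headers.items() if k.lower() == "retry-after"), None)
    if value is None:
        return None
    value = value.strip()
    if value.isdecimal():
        # delay-seconds form, e.g. "60"
        return float(value)
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means the client may retry immediately.
    return max(0.0, (when - now).total_seconds())
```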

Changes

  • litellm/proxy/common_request_processing.py: After headers.update(custom_headers), check if the exception is a RouterRateLimitError and add retry-after header from e.cooldown_time.
  • tests/test_litellm/proxy/test_common_request_processing.py: New test class TestHandleLLMApiExceptionRetryAfterHeader with 3 tests:
    • RouterRateLimitError with cooldown_time=60 → header "60"
    • RouterRateLimitError with cooldown_time=0 → header "0"
    • Non-rate-limit error → no retry-after header

Testing

pytest tests/test_litellm/proxy/test_common_request_processing.py -k "TestHandleLLMApiExceptionRetryAfterHeader" -v

Fixes #27823

Changed files

  • litellm/llms/gemini/google_genai/transformation.py (modified, +53/-0)
  • litellm/llms/vertex_ai/google_genai/transformation.py (modified, +3/-0)
  • litellm/proxy/common_request_processing.py (modified, +7/-0)
  • tests/test_litellm/google_genai/test_google_genai_transformation.py (modified, +227/-0)
  • tests/test_litellm/proxy/test_common_request_processing.py (modified, +57/-0)

PR #27826: fix(proxy): add Retry-After header on RouterRateLimitError

Note: Previous PR #27825 was targeting main directly — closed and retargeted to litellm_oss_staging per repo contribution policy.

Changed files

  • litellm/proxy/common_request_processing.py (modified, +7/-0)
  • tests/test_litellm/proxy/test_common_request_processing.py (modified, +57/-0)

Issue body

What happened?

When all deployments for a model are in cooldown (e.g. after upstream 429s), LiteLLM raises a RouterRateLimitError with the message "No deployments available for selected model, Try again in X seconds". However, no Retry-After HTTP header is included in the response, so downstream clients (OpenAI SDK, custom clients, gateway agents) cannot programmatically determine when to retry.

What should happen?

The RouterRateLimitError already carries self.cooldown_time as a float attribute. This value should be exposed as a standard Retry-After HTTP header on the 429 response so clients can respect it without parsing error message strings.

Current behavior

HTTP/1.1 429 Too Many Requests
content-type: application/json
x-litellm-call-id: ...
x-litellm-version: ...

{"error": {"message": "No deployments available for selected model, Try again in 60 seconds. ...", "type": "None", "param": "None", "code": "429"}}

No retry-after header. The timing is only available by parsing the error message body.
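
For illustration, this is roughly what a client has to do without the header. A minimal sketch, assuming the message wording shown in the response body stays stable; cooldown_from_message is a hypothetical helper, and its pattern is tied to this exact phrasing, which is why the approach is fragile.

```python
import json
import re
from typing import Optional

# Matches the wording in the error body shown above; any rewording breaks it.
_TRY_AGAIN = re.compile(r"Try again in (\d+(?:\.\d+)?) seconds")


def cooldown_from_message(body: str) -> Optional[float]:
    """Extract the retry delay from the error body, or None if it cannot be found."""
    try:
        message = json.loads(body)["error"]["message"]
    except (ValueError, KeyError, TypeError):
        # Not JSON, or not the expected {"error": {"message": ...}} shape.
        return None
    match = _TRY_AGAIN.search(str(message))
    return float(match.group(1)) if match else None


body = (
    '{"error": {"message": "No deployments available for selected model, '
    'Try again in 60 seconds. ...", "type": "None", "param": "None", "code": "429"}}'
)
print(cooldown_from_message(body))  # 60.0
```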

Expected behavior

HTTP/1.1 429 Too Many Requests
retry-after: 60
content-type: application/json
...

{"error": {"message": "No deployments available for selected model, Try again in 60 seconds. ...", ...}}

Environment

  • LiteLLM version: 1.83.0
  • Deployment: LiteLLM Proxy
  • Upstream provider: z.ai (GLM-5.1)
