litellm - [Feature]: Parse retryDelay from JSON response body + Provider-specific cooldown_time configuration

GitHub stats
BerriAI/litellm#25507 · Fetched 2026-04-11 06:13:44
Comments: 0 · Participants: 1 · Timeline: 1 (labeled ×1) · Reactions: 0


The Feature

1. Parse retryDelay from JSON response body (not just HTTP headers)

Currently, LiteLLM only parses retry-after from HTTP headers for dynamic cooldown (PR #5358). However, some providers return retry information in the JSON response body, not as HTTP headers:

  • Google Gemini: Returns "retryDelay": "52s" and "Please retry in 52.XXXs." in the error JSON body
  • Ollama Cloud: Returns "weekly usage limit" message in JSON body (no Retry-After header)

Current behavior:

# utils.py - _get_retry_after_from_exception_header()
retry_header = response_headers.get("retry-after")  # Only reads HTTP header

Proposed behavior: Parse retry information from:

  1. HTTP Retry-After header (current behavior - keep)
  2. JSON body retryDelay field (new - for Gemini)
  3. JSON body error message pattern matching (new - for providers without explicit fields)

Example Gemini error that should be parsed:

{
  "error": {
    "message": "Please retry in 52.454646382s.",
    "retryDelay": "52s"
  }
}
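
The issue refers to a parse_retry_delay_string helper without defining it. A minimal sketch of what such a helper might look like follows, assuming the field always carries a seconds value with an "s" suffix, as in the example above. The function body, the regex, and the choice to round up are assumptions for illustration, not LiteLLM code.

```python
import math
import re

# Hypothetical helper: named in the issue, but its body here is an assumption.
_DURATION_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*s\s*$")


def parse_retry_delay_string(value):
    """Convert a duration string such as "52s" into whole seconds.

    Returns -1 when the value is not a recognizable seconds duration.
    Fractional values are rounded up so a retry never fires early.
    """
    if not isinstance(value, str):
        return -1
    match = _DURATION_RE.match(value)
    if match is None:
        return -1
    return math.ceil(float(match.group(1)))
```

Rounding up rather than truncating is a design choice: a cooldown that ends a fraction of a second early would likely hit the same rate limit again.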

2. Provider-specific cooldown_time configuration

Currently, cooldown_time is a global setting in router_settings. Different providers have vastly different rate limit durations:

Provider        Rate Limit Duration     Current Cooldown
OpenAI/Azure    Seconds to minutes      Global value
Gemini          Seconds (from JSON)     Global value
Ollama Cloud    Weekly (7 days)         Global value

Proposed: Allow cooldown_time per model group or provider:

router_settings:
  cooldown_time: 60  # Default fallback
  
  provider_cooldown_config:
    gemini:
      cooldown_time: 60  # Seconds (matches retryDelay)
    ollama:
      cooldown_time: 604800  # 7 days (weekly limits)
    openai:
      cooldown_time: 1  # Minimal - Retry-After header used

Or per model group:

router_settings:
  model_group_cooldown_time:
    gemini-*: 60
    glm-*: 604800
    deepseek-*: 3600
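
To make the two proposed config shapes concrete, here is a rough sketch of how a cooldown value could be resolved from them. The function name resolve_cooldown_time, the precedence order (model-group pattern, then provider, then global default), and the fallback of 60 are all assumptions; the issue presents the two shapes as alternatives and does not specify how they would combine.

```python
from fnmatch import fnmatchcase


def resolve_cooldown_time(model_group, provider, router_settings, default=60):
    """Pick a cooldown in seconds from the proposed settings (hypothetical logic)."""
    # 1. Model-group glob patterns; the first matching pattern wins.
    patterns = router_settings.get("model_group_cooldown_time", {})
    for pattern, seconds in patterns.items():
        if fnmatchcase(model_group, pattern):
            return seconds

    # 2. Provider-specific override.
    provider_config = router_settings.get("provider_cooldown_config", {})
    if "cooldown_time" in provider_config.get(provider, {}):
        return provider_config[provider]["cooldown_time"]

    # 3. Global cooldown_time, else the assumed default.
    return router_settings.get("cooldown_time", default)


# Settings dict mirroring the YAML shown in the issue.
settings = {
    "cooldown_time": 60,
    "provider_cooldown_config": {"ollama": {"cooldown_time": 604800}},
    "model_group_cooldown_time": {"gemini-*": 60, "deepseek-*": 3600},
}
```

With these settings, a model group named "deepseek-chat" resolves to 3600, an Ollama deployment with no matching pattern resolves to 604800, and anything else falls back to the global 60.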

Motivation, pitch

Problem 1: Over-cooldown for short rate limits

When Gemini returns "retryDelay": "52s", LiteLLM correctly logs this but doesn't parse it for cooldown. With global cooldown_time: 3600, deployments are cooled down for 1 hour instead of 52 seconds.

This causes:

  • Unnecessary deployment lockout
  • Reduced availability during short rate limits
  • Suboptimal load balancing

Problem 2: Under-cooldown for long rate limits

Ollama Cloud has weekly usage limits. With cooldown_time: 3600, each rate-limited deployment becomes eligible again after one hour, so LiteLLM keeps retrying keys that are still limited and generates repeated rate limit errors until it finds an available key.

With 18 API keys configured, this means:

  • ~18 exception logs per request during rate limit
  • Unnecessary API calls to rate-limited keys
  • Slower request resolution
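
A back-of-the-envelope estimate shows how the hourly cooldown compounds over a weekly limit. This is a worst-case sketch that assumes every key stays limited for the full 7 days and is probed once each time its cooldown expires; the function name is made up for illustration.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600  # 604800, the weekly limit duration


def wasted_probes(limit_seconds, cooldown_seconds, num_keys):
    """Upper bound on failed probes while every key remains rate limited."""
    probes_per_key = limit_seconds // cooldown_seconds
    return probes_per_key * num_keys


# 168 probes per key across 18 keys
print(wasted_probes(SECONDS_PER_WEEK, 3600, 18))  # 3024
```

With a cooldown equal to the limit duration (604800 seconds), the same bound drops to one probe per key.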

Problem 3: Callbacks don't intercept intermediate failures

Custom callbacks registered via litellm.failure_callback are only called after the final request result, not for intermediate failures during retry/fallback:

Request → Deployment A (rate limit) → Exception → Fallback to B → Success
         ↑ Callback NOT called here    ↑ Callback called here (success)

This makes it impossible to implement custom retryDelay parsing via callbacks.

Proposed Implementation

Option A: Extend _get_retry_after_from_exception_header()

import re

# Note: _get_retry_after_from_header() and parse_retry_delay_string() are
# helpers that this proposal assumes exist.
def _get_retry_after_from_exception(exception, response_headers=None):
    # 1. Try HTTP header first (current behavior)
    retry_after = _get_retry_after_from_header(response_headers)
    if retry_after > 0:
        return retry_after

    # 2. Try JSON body retryDelay field
    if hasattr(exception, 'response'):
        try:
            response_json = exception.response.json()
            retry_delay = response_json.get('error', {}).get('retryDelay')
            if retry_delay:
                return parse_retry_delay_string(retry_delay)  # "52s" -> 52
        except (AttributeError, TypeError, ValueError):
            # Body missing, not JSON, or not shaped as expected: fall through
            pass

    # 3. Try message pattern matching
    message = str(exception)
    match = re.search(r'retry\s+(?:in\s+)?(\d+(?:\.\d+)?)s', message.lower())
    if match:
        return int(float(match.group(1)))  # truncates, e.g. 52.45 -> 52

    return -1  # No retry hint found

Option B: Add provider_retry_delay_parser callback hook

Add a new callback that fires on each intermediate failure, before retry/fallback:

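import litellm

def my_parser(kwargs):
    # Called on EACH failure, not just the final result
    # Can modify response_headers to inject retry-after
    ...

# Proposed attribute: intermediate_failure_callback does not exist in LiteLLM today
litellm.intermediate_failure_callback = [my_parser]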
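
To illustrate where such a hook would fire, here is a self-contained simulation of a fallback loop that calls a hook on every failed attempt. Nothing here is LiteLLM API: RateLimitError, call_with_fallbacks, and the kwargs keys are hypothetical stand-ins.

```python
class RateLimitError(Exception):
    """Stand-in for a provider rate-limit exception (hypothetical)."""


def call_with_fallbacks(deployments, on_intermediate_failure, on_success):
    """Try each deployment in order, firing a hook on every failed attempt."""
    last_error = None
    for name, call in deployments:
        try:
            result = call()
        except RateLimitError as exc:
            last_error = exc
            # Fires before the fallback, so the hook can set a per-deployment cooldown.
            on_intermediate_failure({"deployment": name, "exception": exc})
            continue
        on_success({"deployment": name, "result": result})
        return result
    if last_error is not None:
        raise last_error
    raise RuntimeError("no deployments configured")


def failing():
    raise RateLimitError("Please retry in 52.454646382s.")


events = []
result = call_with_fallbacks(
    [("A", failing), ("B", lambda: "ok")],
    on_intermediate_failure=lambda kw: events.append(("failure", kw["deployment"])),
    on_success=lambda kw: events.append(("success", kw["deployment"])),
)
```

In this run the hook sees the failure on deployment A before the fallback to B succeeds, which is the point at which a custom retryDelay parser would need to run.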

Option C: Provider cooldown config

Add provider_cooldown_config to router settings with provider-specific defaults.

Related Issues

  • #1339 - Rate-limit awareness across routing strategies
  • #3065 - Router TPM/RPM monitoring improvements
  • #5358 - Cooldown individual models based on retry-after header (current implementation)

extent analysis

TL;DR

Cooldown times are wrong for some providers because LiteLLM does not read retry information from JSON response bodies. The proposed fix is to extend the _get_retry_after_from_exception_header() function to parse retry delays from JSON bodies and to add provider-specific cooldown configuration.

Guidance

  • Extend the _get_retry_after_from_exception_header() function to parse retryDelay from JSON response bodies for providers like Google Gemini.
  • Implement provider-specific cooldown configurations using provider_cooldown_config in router settings to handle different rate limit durations for each provider.
  • Consider adding a callback hook like provider_retry_delay_parser to allow custom parsing of retry delays on intermediate failures.
  • Review and adjust the cooldown times for each provider based on their specific rate limit durations to prevent over-cooldown or under-cooldown.

Notes

The proposed solution extends the existing _get_retry_after_from_exception_header() function and adds provider-specific cooldown configuration. How well it works may depend on requirements and constraints of LiteLLM that the issue description does not fully detail.

Recommendation

Extend the _get_retry_after_from_exception_header() function to parse retry delays from JSON response bodies, and implement provider-specific cooldown configuration. Together these allow rate limits and cooldown times to be handled per provider instead of through a single global value.
