litellm - 💡(How to fix) Fix feat(s3): Add retry with exponential backoff for transient S3 503/500 errors in s3_v2 callback [1 participants]

litellm • 2026-04-09 21:45:23


BerriAI/litellm#25446•Fetched 2026-04-10 03:41:02


Author

jimmychen-p72

Participants

jimmychen-p72


Root Cause

s3_v2.py sends the upload through a raw httpx client with SigV4 signing rather than through boto3's S3 client, which has built-in exponential backoff. The current code makes a single PUT attempt and, on failure, only logs the exception and reports the callback failure.


Feature Description

The S3 callback logger (litellm/integrations/s3_v2.py) uploads a JSON log file to S3 after every LLM request via async_upload_data_to_s3. When S3 returns a transient 503 Slow Down or 500 Internal Server Error, the upload fails permanently with no retry — the audit record is lost.

# s3_v2.py:406-413 (current); the enclosing try: is added here for context
try:
    response = await self.async_httpx_client.put(
        url, data=json_string, headers=signed_headers
    )
    response.raise_for_status()
except Exception as e:
    verbose_logger.exception(f"Error uploading to s3: {str(e)}")
    self.handle_callback_failure(callback_name="S3Logger")

Why S3 Returns 503

AWS S3 returns 503 "Slow Down" as expected behavior when request rates to a partition exceed internal limits. At scale (e.g., 1M+ requests/day, each generating an S3 PUT), transient 503s are normal and expected — AWS recommends implementing exponential backoff retry.

From AWS S3 docs:

"Amazon S3 automatically scales to high request rates... If your request rate increases quickly, Amazon S3 may return 503 Slow Down errors while it scales."

Evidence

Running litellm v1.81.12 in production with ~1.37M requests/day:

S3 503 failures: ~10-18/day (0.001% of uploads)
Distribution: Even across all pods and availability zones — confirms S3-side transient throttling, not a pod/network issue
7-day total: 124 permanently lost audit records
Bucket: Standard S3 bucket in us-east-1 with date-prefix partitioning ({YYYY-MM-DD}/time-{ts}_{id}.json)
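
The figures above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the reported numbers; it assumes one S3 upload per request, as the issue describes, and the variable and function names are illustrative.

```python
# Figures reported in the issue; one S3 PUT per request is assumed.
requests_per_day = 1_370_000
lost_in_7_days = 124
daily_failures_low, daily_failures_high = 10, 18


def failure_pct(failures_per_day: float) -> float:
    """Share of daily uploads that failed, as a percentage."""
    return 100 * failures_per_day / requests_per_day


avg_lost_per_day = lost_in_7_days / 7
print(f"average lost per day: {avg_lost_per_day:.1f}")
print(
    f"failure rate: {failure_pct(daily_failures_low):.4f}% "
    f"to {failure_pct(daily_failures_high):.4f}%"
)
```

The 7-day total works out to about 17.7 lost records per day, inside the reported 10-18 range, and the failure rate is on the order of 0.001% of uploads.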

Stack trace:

File "/app/litellm/integrations/s3_v2.py", line 346, in async_upload_data_to_s3
    response = await self.async_httpx_client.put(
        url, data=json_string, headers=signed_headers
    )
File "/app/litellm/llms/custom_httpx/http_handler.py", line 538, in put
    response.raise_for_status()
httpx.HTTPStatusError: Server error '503 Service Unavailable' for url 'https://<bucket>.s3.us-east-1.amazonaws.com/2026-04-09/time-20-11-30-629789_<id>.json'

Proposed Fix

Add exponential backoff retry for transient S3 errors (500, 503) in both async_upload_data_to_s3 and upload_data_to_s3:

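# In async_upload_data_to_s3, replace the single PUT with the loop below.
# Needs `import asyncio` and `import httpx` at module level.
max_retries = 3
for attempt in range(max_retries):
    try:
        response = await self.async_httpx_client.put(
            url, data=json_string, headers=signed_headers
        )
        response.raise_for_status()
        break
    except httpx.HTTPStatusError as e:
        # Per the stack trace above, put() already calls raise_for_status(),
        # so a 500/503 arrives here as an exception, not as a returned response.
        status_code = e.response.status_code
        if status_code not in (500, 503) or attempt == max_retries - 1:
            raise  # not transient, or retries exhausted: leave it to the existing except
        wait_time = (2 ** attempt) * 0.1  # 0.1s, then 0.2s
        verbose_logger.warning(
            f"S3 upload returned {status_code}, retrying in {wait_time}s "
            f"(attempt {attempt + 1}/{max_retries}) "
            f"key={batch_logging_element.s3_object_key}"
        )
        await asyncio.sleep(wait_time)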

Same pattern for the sync upload_data_to_s3 method using time.sleep().

This would eliminate virtually all failures since S3 503s are transient — a single retry resolves >99% of cases.
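
For the sync path, the same idea might look like the sketch below. It is not the litellm implementation: `upload_with_retry`, `UploadStatusError`, and the `do_put` callable are hypothetical names, with `UploadStatusError` standing in for whatever exception the real client raises on a 5xx response. The `sleep` parameter defaults to `time.sleep` and exists only so the delay can be swapped out in tests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

RETRYABLE_STATUS = frozenset({500, 503})


class UploadStatusError(Exception):
    """Hypothetical stand-in for the client's HTTP status exception."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"upload failed with HTTP {status_code}")
        self.status_code = status_code


def upload_with_retry(
    do_put: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call do_put(), retrying 500/503 failures with exponential backoff."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return do_put()
        except UploadStatusError as err:
            is_last_attempt = attempt == max_retries - 1
            if err.status_code not in RETRYABLE_STATUS or is_last_attempt:
                raise  # non-transient error, or retries exhausted
            sleep(base_delay * (2 ** attempt))  # 0.1s, then 0.2s by default
    raise AssertionError("unreachable: the loop always returns or raises")
```

Re-raising on the final attempt keeps the existing failure handling intact: the caller's `except` block still logs the error and reports the callback failure when every attempt fails.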

Impact

Without retry: Each S3 503 = permanent data loss for that request's audit/logging record
With retry: Near-zero data loss with negligible latency impact (0.1-0.3s delay only on the rare retry path)
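
The latency figure follows from the backoff schedule. A small sketch, assuming the proposed defaults of 3 attempts and a 0.1s base delay (`backoff_delays` is an illustrative helper, not a litellm function):

```python
def backoff_delays(max_retries: int = 3, base_delay: float = 0.1) -> list[float]:
    """Delays slept between attempts; no sleep follows the final attempt."""
    return [base_delay * (2 ** attempt) for attempt in range(max_retries - 1)]


delays = backoff_delays()
worst_case_delay = sum(delays)  # total sleep when every attempt is needed
print(delays, round(worst_case_delay, 3))  # [0.1, 0.2] 0.3
```

A single retry adds 0.1s of sleep, and using all three attempts adds 0.3s in total, on top of the time the extra PUT requests themselves take.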

Are you willing to submit a PR?

Yes — happy to submit a PR with this fix if maintainers agree with the approach.

extent analysis

TL;DR

Implement exponential backoff retry for transient S3 errors (500, 503) in async_upload_data_to_s3 and upload_data_to_s3 to prevent permanent data loss.

Guidance

Identify the lines of code where the single PUT attempt is made and replace it with a retry mechanism that catches 500 and 503 status codes.
Implement a loop that retries the upload up to a specified number of times (e.g., 3) with increasing wait times between attempts.
Use a wait time calculation like `(2 ** attempt) *
