litellm - feat(s3): Add retry with exponential backoff for transient S3 503/500 errors in s3_v2 callback

GitHub stats
BerriAI/litellm#25446 (fetched 2026-04-10 03:41:02)
Comments: 0
Participants: 1
Timeline: 0
Reactions: 0


Feature Description

The S3 callback logger (litellm/integrations/s3_v2.py) uploads a JSON log file to S3 after every LLM request via async_upload_data_to_s3. When S3 returns a transient 503 Slow Down or 500 Internal Server Error, the upload fails permanently with no retry — the audit record is lost.

This is because s3_v2.py uses raw httpx with SigV4 signing rather than boto3's S3 client (which has built-in exponential backoff). The current code does a single PUT attempt and on failure just logs the exception:

# s3_v2.py:406-413 (current)
try:  # opened earlier in the method, outside the quoted range
    response = await self.async_httpx_client.put(
        url, data=json_string, headers=signed_headers
    )
    response.raise_for_status()
except Exception as e:
    # A single failed PUT is only logged; the upload is never retried.
    verbose_logger.exception(f"Error uploading to s3: {str(e)}")
    self.handle_callback_failure(callback_name="S3Logger")

Why S3 Returns 503

AWS S3 returns 503 "Slow Down" as expected behavior when request rates to a partition exceed internal limits. At scale (e.g., 1M+ requests/day, each generating an S3 PUT), transient 503s are normal and expected — AWS recommends implementing exponential backoff retry.
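Exponential backoff is often combined with random jitter so that many clients throttled at the same moment do not all retry in lockstep. A minimal sketch of such a delay schedule follows. The helper name `backoff_delay` and the `base` and `cap` values are assumptions for illustration; jitter is not part of the proposal in this issue, and nothing here is litellm API.

```python
import random


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0,
                  jitter: bool = True) -> float:
    """Return the delay in seconds before retry number `attempt` (0-based).

    Hypothetical helper: `base` and `cap` are illustrative values only.
    """
    # Exponential growth (base, 2*base, 4*base, ...) limited to `cap`.
    ceiling = min(cap, base * (2 ** attempt))
    if jitter:
        # "Full jitter": pick uniformly in [0, ceiling] to spread out retries.
        return random.uniform(0.0, ceiling)
    return ceiling


if __name__ == "__main__":
    # Deterministic schedule without jitter, for inspection.
    print([backoff_delay(a, jitter=False) for a in range(3)])
```

With `jitter=False` the schedule doubles on each attempt until it reaches `cap`; with jitter enabled each delay is a random value no larger than that ceiling.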

From AWS S3 docs:

"Amazon S3 automatically scales to high request rates... If your request rate increases quickly, Amazon S3 may return 503 Slow Down errors while it scales."

Evidence

Running litellm v1.81.12 in production with ~1.37M requests/day:

  • S3 503 failures: ~10-18/day (0.001% of uploads)
  • Distribution: Even across all pods and availability zones — confirms S3-side transient throttling, not a pod/network issue
  • 7-day total: 124 permanently lost audit records
  • Bucket: Standard S3 bucket in us-east-1 with date-prefix partitioning ({YYYY-MM-DD}/time-{ts}_{id}.json)
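
As a quick sanity check on the figures above, the arithmetic can be reproduced in a few lines. This assumes exactly one S3 PUT per request, as described in the feature description; the variable names are illustrative.

```python
requests_per_day = 1_370_000  # ~1.37M requests/day, from the issue
failures_7_days = 124         # permanently lost records over 7 days

# Average daily failures and the share of uploads they represent.
failures_per_day = failures_7_days / 7
failure_rate_pct = failures_per_day / requests_per_day * 100

print(f"{failures_per_day:.1f} failures/day")
print(f"{failure_rate_pct:.4f}% of uploads")
```

The 7-day average lands near the top of the reported 10-18/day range, and the rate is on the order of 0.001% of uploads, consistent with the numbers quoted.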

Stack trace:

File "/app/litellm/integrations/s3_v2.py", line 346, in async_upload_data_to_s3
    response = await self.async_httpx_client.put(
        url, data=json_string, headers=signed_headers
    )
File "/app/litellm/llms/custom_httpx/http_handler.py", line 538, in put
    response.raise_for_status()
httpx.HTTPStatusError: Server error '503 Service Unavailable' for url 'https://<bucket>.s3.us-east-1.amazonaws.com/2026-04-09/time-20-11-30-629789_<id>.json'

Proposed Fix

Add exponential backoff retry for transient S3 errors (500, 503) in both async_upload_data_to_s3 and upload_data_to_s3:

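# In async_upload_data_to_s3, replace the single PUT with the loop below.
# Requires `import asyncio` and `import httpx` at module level.
# Per the stack trace, put() in http_handler.py already calls raise_for_status(),
# so a 500/503 surfaces as httpx.HTTPStatusError and must be caught to retry.
max_retries = 3
for attempt in range(max_retries):
    try:
        response = await self.async_httpx_client.put(
            url, data=json_string, headers=signed_headers
        )
        response.raise_for_status()
        break
    except httpx.HTTPStatusError as e:
        status_code = e.response.status_code
        if status_code not in (500, 503) or attempt == max_retries - 1:
            raise  # not transient, or retries exhausted: handled by the outer except
        wait_time = (2 ** attempt) * 0.1  # 0.1s, 0.2s
        verbose_logger.warning(
            f"S3 upload returned {status_code}, retrying in {wait_time}s "
            f"(attempt {attempt + 1}/{max_retries}) "
            f"key={batch_logging_element.s3_object_key}"
        )
        await asyncio.sleep(wait_time)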

Same pattern for the sync upload_data_to_s3 method using time.sleep().
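
A self-contained sketch of what the synchronous variant could look like is shown here. It is not the litellm implementation: `UploadHTTPError` and `upload_with_retry` are hypothetical names, the real code would catch the HTTP client's own status error instead, and the `sleep` parameter exists only so the backoff can be observed without actually waiting.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


class UploadHTTPError(Exception):
    """Hypothetical stand-in for an HTTP error carrying a status code."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"upload failed with status {status_code}")
        self.status_code = status_code


def upload_with_retry(
    do_put: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 0.1,
    retry_statuses: Tuple[int, ...] = (500, 503),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `do_put`, retrying transient statuses with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return do_put()
        except UploadHTTPError as err:
            last_attempt = attempt == max_retries - 1
            if err.status_code not in retry_statuses or last_attempt:
                raise  # non-transient error, or retries exhausted
            sleep((2 ** attempt) * base_delay)  # 0.1s, then 0.2s
    raise RuntimeError("max_retries must be at least 1")
```

Passing the PUT as a callable keeps the retry policy separate from request construction, so the same helper could wrap any upload call that raises on a bad status.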

This would eliminate virtually all failures since S3 503s are transient — a single retry resolves >99% of cases.
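
To illustrate the claim that one retry is usually enough, the following toy simulation uses a fake bucket that rejects its first PUT with a 503 and accepts the next. `FakeS3Error`, `FlakyBucket`, and `put_with_retry` are hypothetical names invented for this sketch; it exercises only the retry logic and makes no real S3 calls.

```python
import asyncio
from typing import Dict


class FakeS3Error(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"status {status_code}")
        self.status_code = status_code


class FlakyBucket:
    """Hypothetical fake that fails the first `failures` PUTs with 503."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.attempts = 0
        self.objects: Dict[str, str] = {}

    async def put(self, key: str, body: str) -> None:
        self.attempts += 1
        if self.attempts <= self.failures:
            raise FakeS3Error(503)
        self.objects[key] = body


async def put_with_retry(bucket: FlakyBucket, key: str, body: str,
                         max_retries: int = 3, base_delay: float = 0.1) -> None:
    for attempt in range(max_retries):
        try:
            await bucket.put(key, body)
            return
        except FakeS3Error as err:
            if err.status_code not in (500, 503) or attempt == max_retries - 1:
                raise  # non-transient error, or retries exhausted
            await asyncio.sleep((2 ** attempt) * base_delay)


async def main(failures: int = 1) -> FlakyBucket:
    bucket = FlakyBucket(failures=failures)
    # A short base_delay keeps the demo fast; the issue proposes 0.1s.
    await put_with_retry(bucket, "2026-04-09/example.json", "{}", base_delay=0.01)
    return bucket


if __name__ == "__main__":
    result = asyncio.run(main())
    print(result.attempts, list(result.objects))
```

If the bucket fails as many times as `max_retries`, the final error propagates to the caller, which in the real callback would fall through to the existing logging path.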

Impact

  • Without retry: Each S3 503 = permanent data loss for that request's audit/logging record
  • With retry: Near-zero data loss with negligible latency impact (0.1-0.3s delay only on the rare retry path)

Are you willing to submit a PR?

Yes — happy to submit a PR with this fix if maintainers agree with the approach.

extent analysis

TL;DR

Implement exponential backoff retry for transient S3 errors (500, 503) in async_upload_data_to_s3 and upload_data_to_s3 to prevent permanent data loss.

Guidance

  • Identify the lines of code where the single PUT attempt is made and replace it with a retry mechanism that catches 500 and 503 status codes.
  • Implement a loop that retries the upload up to a specified number of times (e.g., 3) with increasing wait times between attempts.
  • Use a wait time calculation like `(2 ** attempt) *
