litellm - [Feature]: Optional gateway-side queue metadata / queue status protocol for overloaded self-hosted backends

GitHub stats
BerriAI/litellm#26693 (fetched 2026-04-29 06:12:48)
Comments: 0
Participants: 1
Timeline: 3
Reactions: 0
Timeline (top): labeled ×2, cross-referenced ×1

Check for existing issues

  • I have searched the existing issues and checked that my issue is not a duplicate.

The Feature

Add an optional gateway-side queue protocol for overloaded/self-hosted backends, so LiteLLM Proxy can expose structured queue metadata instead of only returning a raw 429.

Today, when a self-hosted backend such as vLLM, Ollama, or an internal inference gateway is saturated, the practical outcomes are usually:

  • the upstream returns 429 Too Many Requests
  • the client retries blindly using retry-after
  • or the request is simply rejected and the client has no visibility into whether work is waiting, how long it may take, or whether the server is just overloaded

I am proposing a provider-agnostic, opt-in mechanism for the Proxy layer to expose structured waiting state, for example:

{
  "queue_position": 5,
  "estimated_wait_seconds": 18,
  "message": "Server busy, waiting for resources"
}

This does not have to be tied to one exact transport. Possible designs:

  1. SSE event during streaming requests, for example:
event: queue_status
data: {"queue_position": 5, "estimated_wait_seconds": 18, "message": "Server busy, waiting for resources"}
  2. Structured JSON response body for non-streaming overload cases
  3. Standardized response headers such as retry-after plus optional queue metadata headers
  4. A pluggable hook/adapter interface for custom backends/gateways to supply queue metadata
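As a rough illustration of the first three transports, here is a minimal Python sketch that serializes the same metadata as an SSE frame and as a non-streaming 429 reply. The function names and the X-Queue-* header names are hypothetical, chosen only for this example; they are not part of LiteLLM or of any standard. Only Retry-After is an existing HTTP header.

```python
import json
from typing import Dict, Tuple


def format_queue_status_sse(queue_position: int, estimated_wait_seconds: int, message: str) -> str:
    """Serialize queue metadata as one Server-Sent Events frame."""
    payload = {
        "queue_position": queue_position,
        "estimated_wait_seconds": estimated_wait_seconds,
        "message": message,
    }
    # An SSE frame is a set of "field: value" lines terminated by a blank line.
    return f"event: queue_status\ndata: {json.dumps(payload)}\n\n"


def build_overload_response(
    queue_position: int, estimated_wait_seconds: int, message: str
) -> Tuple[int, Dict[str, str], str]:
    """Build (status, headers, body) for a non-streaming overload reply."""
    body = {
        "queue_position": queue_position,
        "estimated_wait_seconds": estimated_wait_seconds,
        "message": message,
    }
    headers = {
        "Content-Type": "application/json",
        "Retry-After": str(estimated_wait_seconds),
        # Hypothetical header names, used here for illustration only.
        "X-Queue-Position": str(queue_position),
        "X-Queue-Estimated-Wait-Seconds": str(estimated_wait_seconds),
    }
    return 429, headers, json.dumps(body)
```

Keeping one payload shape across all transports would let a client reuse a single parser regardless of whether the request was streaming.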

The key point is: LiteLLM should be able to expose more than just a bare 429 when the upstream system actually has a real queue and can describe it.

Motivation, pitch

I am working on a setup where the problem is naturally split across two layers:

  1. Gateway/server side: decide whether requests should be queued, rejected, or retried, and expose structured waiting metadata
  2. Client side: render that metadata so the user sees something like "server busy, queue position: 5" instead of assuming the request is frozen
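The client-side half could look roughly like the sketch below, which assumes the gateway emits the queue_status SSE event proposed above. It is a simplified SSE reader (it handles only the event and data fields), and both function names are made up for this example.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs from an iterable of SSE text lines."""
    event = "message"  # SSE default event name when no "event:" line is sent
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip(" "))


def render_queue_status(data: str) -> str:
    """Turn a queue_status payload into a one-line status for the user."""
    status = json.loads(data)
    return (
        f"server busy, queue position: {status['queue_position']} "
        f"(about {status['estimated_wait_seconds']}s)"
    )
```

A client would call render_queue_status only for events named queue_status and pass every other event to its normal completion handling, so servers that never send the event are unaffected.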

Right now LiteLLM already has strong support for:

  • rate limiting
  • max_parallel_requests
  • retry/backoff
  • fallbacks
  • retry-after style signaling

But from what I can tell, it does not currently provide a standard way to expose queue position / estimated wait information from overloaded self-hosted gateways to downstream clients.

This would be especially useful for:

  • teams running self-hosted vLLM clusters
  • internal company gateways in front of multiple model servers
  • clients that want to provide better UX than "request failed with 429" or "spinning until retry succeeds"

One concrete downstream use case is an OpenCode client integration on the consumer side:

  • LiteLLM / gateway side: expose structured queue metadata
  • OpenCode side: render it in the TUI while the request is waiting

Related client-side discussion:

Some implementation questions I would love feedback on:

  • Should this be modeled as a new optional structured error / event contract?
  • Should LiteLLM standardize SSE queue_status events for streaming clients?
  • Should queue metadata be available on both streaming and non-streaming paths?
  • Should this be limited to proxy mode, where LiteLLM is acting as the gateway?
  • Would the maintainers prefer a hook-based approach for custom providers/backends first?
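To make the hook-based option more concrete, here is one possible shape for such an interface, written as a sketch. QueueMetadataProvider, StaticQueueProvider, and describe_overload are hypothetical names invented for this example and do not exist in LiteLLM; the point is only that a backend without a real queue returns nothing and the gateway falls back to a bare 429.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Optional, Protocol


@dataclass(frozen=True)
class QueueStatus:
    queue_position: int
    estimated_wait_seconds: int
    message: str


class QueueMetadataProvider(Protocol):
    """Hypothetical hook: a backend adapter reports its queue state, if any."""

    def get_queue_status(self, request_id: str) -> Optional[QueueStatus]: ...


class StaticQueueProvider:
    """Toy provider backed by a dict; a real adapter would query its backend."""

    def __init__(self, positions: Dict[str, int], seconds_per_slot: int) -> None:
        self._positions = positions
        self._seconds_per_slot = seconds_per_slot

    def get_queue_status(self, request_id: str) -> Optional[QueueStatus]:
        position = self._positions.get(request_id)
        if position is None:
            return None  # this backend has no queue entry for the request
        return QueueStatus(
            queue_position=position,
            estimated_wait_seconds=position * self._seconds_per_slot,
            message="Server busy, waiting for resources",
        )


def describe_overload(provider: QueueMetadataProvider, request_id: str) -> dict:
    """Return structured metadata when available, else a bare 429 description."""
    status = provider.get_queue_status(request_id)
    if status is None:
        return {"status_code": 429}
    return {"status_code": 429, **asdict(status)}
```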

I think this is a natural extension of LiteLLM's existing gateway responsibilities. It would let LiteLLM remain provider-agnostic while still supporting a much better overload experience for self-hosted deployments.

If there is interest in this direction, I would be happy to help refine the proposal further.

What part of LiteLLM is this about?

Proxy

LiteLLM is hiring a founding backend engineer, are you interested in joining us and shipping to all our users?

No

Twitter / LinkedIn details

No response

extent analysis

TL;DR

Implement a gateway-side queue protocol to expose structured queue metadata, allowing LiteLLM Proxy to provide more informative responses when the upstream system is overloaded.

Guidance

  • Define a new optional structured error/event contract to standardize the exposure of queue metadata, such as queue_position and estimated_wait_seconds.
  • Consider implementing SSE queue_status events for streaming clients to provide real-time updates on queue status.
  • Determine whether queue metadata should be available on both streaming and non-streaming paths, and whether this feature should be limited to proxy mode.
  • Explore a hook-based approach for custom providers/backends to supply queue metadata, ensuring LiteLLM remains provider-agnostic.

Example

{
  "queue_position": 5,
  "estimated_wait_seconds": 18,
  "message": "Server busy, waiting for resources"
}

This example illustrates the proposed structured queue metadata that could be exposed by LiteLLM Proxy.
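If this payload became a contract, consumers would probably want to validate it before trusting it. The sketch below assumes all three fields are required and that the numeric fields are non-negative integers; the issue does not specify either point, so treat those rules as placeholders.

```python
import json

# Assumed contract: field name -> expected Python type after JSON decoding.
REQUIRED_FIELDS = {
    "queue_position": int,
    "estimated_wait_seconds": int,
    "message": str,
}


def parse_queue_status(raw: str) -> dict:
    """Decode and validate a queue status payload, raising ValueError if invalid."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("queue status must be a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        value = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"{name} must be of type {expected.__name__}")
    if data["queue_position"] < 0 or data["estimated_wait_seconds"] < 0:
        raise ValueError("queue fields must be non-negative")
    return data
```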

Notes

The implementation details will depend on the specific requirements and constraints of the LiteLLM Proxy and its interactions with self-hosted gateways and clients. Further discussion and refinement of the proposal are necessary to determine the best approach.

Recommendation

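As far as the issue author can tell, LiteLLM does not currently offer a standard way to expose queue metadata. Until it does, the workaround is a custom solution at the gateway that tracks waiting requests and reports their queue position and estimated wait to clients. This improves the overload experience now and gives a concrete starting point for later standardization.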
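One way such a custom gateway could produce the numbers is sketched below. It assumes a FIFO queue and estimates the wait as position times average service time divided by the number of workers; that formula, the QueueTracker class, and its parameters are assumptions for illustration, not something the issue or LiteLLM defines.

```python
import math
from collections import OrderedDict
from typing import Optional


class QueueTracker:
    """Track waiting request IDs in arrival order and describe their state."""

    def __init__(self, workers: int, avg_service_seconds: float) -> None:
        if workers < 1:
            raise ValueError("workers must be at least 1")
        self._workers = workers
        self._avg_service_seconds = avg_service_seconds
        self._waiting: "OrderedDict[str, None]" = OrderedDict()

    def enqueue(self, request_id: str) -> None:
        self._waiting[request_id] = None

    def dequeue(self, request_id: str) -> None:
        # Called when the request starts running or is cancelled.
        self._waiting.pop(request_id, None)

    def status(self, request_id: str) -> Optional[dict]:
        if request_id not in self._waiting:
            return None
        position = list(self._waiting).index(request_id) + 1  # 1-based
        # Naive estimate: requests ahead are served in parallel by all workers.
        wait = math.ceil(position * self._avg_service_seconds / self._workers)
        return {
            "queue_position": position,
            "estimated_wait_seconds": wait,
            "message": "Server busy, waiting for resources",
        }
```

The estimate is deliberately crude; a real deployment would likely replace it with measurements from the inference backend.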
