litellm - [Feature]: Optional gateway-side queue metadata / queue status protocol for overloaded self-hosted backends [1 participant]

d4n-sec · 2026-04-28T11:43:08Z



GitHub stats

BerriAI/litellm#26693 • Fetched 2026-04-29 06:12:48


Author: d4n-sec

Participants: d4n-sec

Timeline (top): labeled ×2, cross-referenced ×1

Check for existing issues

I have searched the existing issues and checked that my issue is not a duplicate.

The Feature

Add an optional gateway-side queue protocol for overloaded/self-hosted backends, so LiteLLM Proxy can expose structured queue metadata instead of only returning a raw 429.

Today, when a self-hosted backend such as vLLM, Ollama, or an internal inference gateway is saturated, the practical outcomes are usually:

- the upstream returns `429 Too Many Requests`
- the client retries blindly using `retry-after`
- or the request is simply rejected and the client has no visibility into whether work is waiting, how long it may take, or whether the server is just overloaded

I am proposing a provider-agnostic, opt-in mechanism for the Proxy layer to expose structured waiting state, for example:

{
  "queue_position": 5,
  "estimated_wait_seconds": 18,
  "message": "Server busy, waiting for resources"
}
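A minimal sketch of how a client or gateway might model and validate this payload in Python. The `QueueStatus` class and its methods are hypothetical names used for illustration, not an existing LiteLLM API; only the three field names come from the example above, and treating the last two as optional is an assumption.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class QueueStatus:
    """Hypothetical queue metadata payload; field names follow the example above."""

    queue_position: int
    estimated_wait_seconds: Optional[float] = None  # assumed optional
    message: Optional[str] = None  # assumed optional

    def __post_init__(self) -> None:
        # Reject values that cannot describe a real queue.
        if self.queue_position < 0:
            raise ValueError("queue_position must be >= 0")
        wait = self.estimated_wait_seconds
        if wait is not None and wait < 0:
            raise ValueError("estimated_wait_seconds must be >= 0")

    def to_json(self) -> str:
        """Serialize to the JSON shape shown in the proposal."""
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "QueueStatus":
        """Parse a JSON payload, ignoring unknown keys."""
        data = json.loads(raw)
        return cls(
            queue_position=int(data["queue_position"]),
            estimated_wait_seconds=data.get("estimated_wait_seconds"),
            message=data.get("message"),
        )
```

Keeping everything except `queue_position` optional would let a backend that knows its queue depth but not its service time still report something useful.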

This does not have to be tied to one exact transport. Possible designs:

1. SSE event during streaming requests, for example:

   event: queue_status
   data: {"queue_position": 5, "estimated_wait_seconds": 18, "message": "Server busy, waiting for resources"}

2. Structured JSON response body for non-streaming overload cases
3. Standardized response headers such as `retry-after` plus optional queue metadata headers
4. A pluggable hook/adapter interface for custom backends/gateways to supply queue metadata

The key point is: LiteLLM should be able to expose more than just a bare 429 when the upstream system actually has a real queue and can describe it.
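To make the SSE option more concrete, here is a rough sketch of how a `queue_status` frame could be written by a gateway and picked out of a stream by a client. The helper names `format_sse_event` and `iter_sse_events` are hypothetical; the framing (an `event:` line, a `data:` line, a blank line as terminator) follows the general server-sent events format, and nothing here reflects current LiteLLM behavior.

```python
import json
from typing import Dict, Iterable, Iterator, List, Tuple


def format_sse_event(event: str, data: Dict[str, object]) -> str:
    """Serialize one SSE frame: `event:` line, `data:` line, blank-line terminator."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def _field_value(line: str, name: str) -> str:
    """Return the text after `name:`, dropping a single leading space if present."""
    value = line[len(name) + 1:]
    return value[1:] if value.startswith(" ") else value


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs; frames without `event:` default to "message"."""
    event = "message"
    data_parts: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the frame collected so far.
            if data_parts:
                yield event, "\n".join(data_parts)
            event, data_parts = "message", []
        elif line.startswith("event:"):
            event = _field_value(line, "event")
        elif line.startswith("data:"):
            data_parts.append(_field_value(line, "data"))
        # Comment lines (starting with ":") and other fields are ignored here.
```

A client that does not know about `queue_status` would need to skip events with unfamiliar names for this to be safely opt-in, which is one reason a named event may be preferable to overloading the regular `data:` chunks.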

Motivation, pitch

I am working on a setup where the problem is naturally split across two layers:

1. Gateway/server side: decide whether requests should be queued, rejected, or retried, and expose structured waiting metadata
2. Client side: render that metadata so the user sees something like "server busy, queue position: 5" instead of assuming the request is frozen
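As an illustration of the client half of that split, a small sketch of turning queue metadata into a one-line status message. The function name `render_queue_status` and the exact wording of the output are assumptions made for this example, not OpenCode or LiteLLM behavior.

```python
from typing import Mapping


def render_queue_status(meta: Mapping[str, object]) -> str:
    """Build a short, human-readable status line from optional queue metadata."""
    # Fall back to a generic message when the gateway sends none.
    parts = [str(meta.get("message") or "Server busy")]
    position = meta.get("queue_position")
    if position is not None:
        parts.append(f"queue position: {position}")
    wait = meta.get("estimated_wait_seconds")
    if wait is not None:
        parts.append(f"estimated wait: ~{round(float(wait))}s")
    return ", ".join(parts)


print(render_queue_status({"queue_position": 5, "estimated_wait_seconds": 18}))
```

Every field is treated as optional so the same renderer degrades gracefully when a gateway can only report part of the picture.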

Right now LiteLLM already has strong support for:

- rate limiting
- `max_parallel_requests`
- retry/backoff
- fallbacks
- `retry-after` style signaling

But from what I can tell, it does not currently provide a standard way to expose queue position / estimated wait information from overloaded self-hosted gateways to downstream clients.

This would be especially useful for:

- teams running self-hosted vLLM clusters
- internal company gateways in front of multiple model servers
- clients that want to provide better UX than "request failed with 429" or "spinning until retry succeeds"

One concrete downstream use case is an OpenCode client integration on the consumer side:

- LiteLLM / gateway side: expose structured queue metadata
- OpenCode side: render it in the TUI while the request is waiting

Related client-side discussion:

- OpenCode issue: https://github.com/anomalyco/opencode/issues/24763

Some implementation questions I would love feedback on:

- Should this be modeled as a new optional structured error / event contract?
- Should LiteLLM standardize SSE `queue_status` events for streaming clients?
- Should queue metadata be available on both streaming and non-streaming paths?
- Should this be limited to proxy mode, where LiteLLM is acting as the gateway?
- Would the maintainers prefer a hook-based approach for custom providers/backends first?
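To ground the hook-based question, one possible shape for such an interface is sketched below. Everything here is hypothetical: `QueueMetadataProvider`, `InMemoryQueueProvider`, and `describe_overload` are illustrative names, not existing LiteLLM hooks, and the wait estimate (queue depth times an average service time) is a deliberately naive assumption.

```python
from typing import Dict, Optional, Protocol


class QueueMetadataProvider(Protocol):
    """Hypothetical hook: report queue state for a model, or None if not queued."""

    def get_queue_metadata(self, model: str) -> Optional[Dict[str, object]]: ...


class InMemoryQueueProvider:
    """Toy adapter that estimates wait from queue depth and average service time."""

    def __init__(self, avg_seconds_per_request: float) -> None:
        self._avg = avg_seconds_per_request
        self._depth: Dict[str, int] = {}

    def set_depth(self, model: str, depth: int) -> None:
        self._depth[model] = depth

    def get_queue_metadata(self, model: str) -> Optional[Dict[str, object]]:
        depth = self._depth.get(model, 0)
        if depth == 0:
            return None  # nothing queued: caller keeps its existing behavior
        return {
            "queue_position": depth,
            "estimated_wait_seconds": round(depth * self._avg, 1),
            "message": "Server busy, waiting for resources",
        }


def describe_overload(provider: QueueMetadataProvider, model: str) -> Dict[str, object]:
    """Prefer structured queue metadata; otherwise fall back to a bare message."""
    meta = provider.get_queue_metadata(model)
    return meta if meta is not None else {"message": "Too Many Requests"}
```

Returning `None` when no queue exists would keep the feature opt-in: backends without a real queue continue to produce today's plain `429` behavior.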

I think this is a natural extension of LiteLLM's existing gateway responsibilities. It would let LiteLLM remain provider-agnostic while still supporting a much better overload experience for self-hosted deployments.

If there is interest in this direction, I would be happy to help refine the proposal further.

What part of LiteLLM is this about?

Proxy

LiteLLM is hiring a founding backend engineer, are you interested in joining us and shipping to all our users?

No

Twitter / LinkedIn details

No response

extent analysis

TL;DR

Implement a gateway-side queue protocol to expose structured queue metadata, allowing LiteLLM Proxy to provide more informative responses when the upstream system is overloaded.

Guidance

- Define a new optional structured error/event contract to standardize the exposure of queue metadata, such as `queue_position` and `estimated_wait_seconds`.
- Consider implementing SSE `queue_status` events for streaming clients to provide real-time updates on queue status.
- Determine whether queue metadata should be available on both streaming and non-streaming paths, and whether this feature should be limited to proxy mode.
- Explore a hook-based approach for custom providers/backends to supply queue metadata, ensuring LiteLLM remains provider-agnostic.
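For the non-streaming path, the guidance above could translate into something like the following sketch, which builds a `429` response carrying both a standard `Retry-After` header and a structured JSON body. The function name `build_overload_response` and the `X-Queue-*` header names are assumptions for illustration only; no such headers are defined by LiteLLM or by the issue.

```python
import json
import math
from typing import Dict, Tuple


def build_overload_response(
    queue_position: int, estimated_wait_seconds: float
) -> Tuple[int, Dict[str, str], str]:
    """Return (status_code, headers, body) for an overloaded non-streaming request."""
    # Retry-After takes whole seconds, so round the estimate up (minimum 1).
    retry_after = max(1, math.ceil(estimated_wait_seconds))
    headers = {
        "Content-Type": "application/json",
        "Retry-After": str(retry_after),
        # Hypothetical header names, shown only to illustrate the header option.
        "X-Queue-Position": str(queue_position),
        "X-Queue-Estimated-Wait-Seconds": str(retry_after),
    }
    body = json.dumps(
        {
            "queue_position": queue_position,
            "estimated_wait_seconds": estimated_wait_seconds,
            "message": "Server busy, waiting for resources",
        }
    )
    return 429, headers, body
```

Clients that only understand `Retry-After` keep working unchanged, while queue-aware clients can read the body or the extra headers.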

Example

{
  "queue_position": 5,
  "estimated_wait_seconds": 18,
  "message": "Server busy, waiting for resources"
}

This example illustrates the proposed structured queue metadata that could be exposed by LiteLLM Proxy.

Notes

The implementation details will depend on the specific requirements and constraints of the LiteLLM Proxy and its interactions with self-hosted gateways and clients. Further discussion and refinement of the proposal are necessary to determine the best approach.

Recommendation

LiteLLM currently has no standard way to expose queue metadata, so the practical workaround is a custom mechanism in your own gateway or client. This improves the user experience in the meantime and provides a foundation for future standardization.
