litellm - [Bug]: x-ratelimit-* headers dropped on streaming responses and plain-dict responses (v3 parallel_request_limiter) [1 pull request]


Fix Action

Fixed


What happened?

The v3 parallel_request_limiter (PROXY_HOOKS["parallel_request_limiter"], default since the v3 rewrite) populates x-ratelimit-{descriptor}-{remaining,limit}-{requests,tokens} headers from async_post_call_success_hook by mutating response._hidden_params["additional_headers"]. Two response paths silently drop these headers:

  1. Streaming responses — for stream=true requests the SSE response headers are flushed to the client before async_post_call_success_hook runs. The hook still mutates _hidden_params, but the client never sees the resulting x-ratelimit-* keys. Affected: /v1/chat/completions stream=true, /v1/messages stream=true, /v1/responses stream=true.
  2. Plain-dict responses — when response is a plain dict with no _hidden_params attribute the hook short-circuits at if hasattr(response, "_hidden_params") (parallel_request_limiter_v3.py:2751-2754 on main). Affected: /v1/messages non-streaming.
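The plain-dict case can be reproduced in isolation. The sketch below is not the LiteLLM hook; `ModelResponseLike` and `attach_rate_limit_headers` are hypothetical names that only mirror the `hasattr(response, "_hidden_params")` guard described above, and the dict body is an illustrative stand-in for a non-streaming /v1/messages response.

```python
from typing import Any, Dict


class ModelResponseLike:
    """Stand-in for a typed response object that carries _hidden_params."""

    def __init__(self) -> None:
        self._hidden_params: Dict[str, Any] = {}


def attach_rate_limit_headers(response: Any, headers: Dict[str, str]) -> bool:
    """Hypothetical guard mirroring the behaviour described in the report.

    Returns True if the headers were recorded, False if the response was skipped.
    """
    if not hasattr(response, "_hidden_params"):
        return False  # plain dicts land here, so nothing is recorded
    extra = response._hidden_params.setdefault("additional_headers", {})
    extra.update(headers)
    return True


headers = {"x-ratelimit-api_key-remaining-requests": "5999"}

typed = ModelResponseLike()
plain = {"id": "msg_123", "type": "message"}  # illustrative plain-dict body

print(attach_rate_limit_headers(typed, headers))  # True
print(attach_rate_limit_headers(plain, headers))  # False
```

A dict has keys, not attributes, so the guard fails even if the dict happened to contain a "_hidden_params" key.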

Both paths still increment the rate-limit counters correctly — only the visibility via headers is broken, which makes client-side quota tracking unreliable for any caller using the affected endpoints.

The hook is the only site that emits these headers; common_request_processing.py has no fallback. A grep -n "x-ratelimit\|apply_rate_limit\|litellm_proxy_rate_limit_response" litellm/proxy/common_request_processing.py on main returns zero matches.

Reproduction (single-instance, ghcr.io/berriai/litellm:main-stable)

Create a virtual key with tpm_limit and rpm_limit so the v3 limiter activates, then fire one request per shape and dump the response headers:

$ ./run_probes.sh
==============================================================
[chat/completions  (sync)]
==============================================================
HTTP/1.1 200 OK
content-type: application/json
x-ratelimit-api_key-remaining-requests: 5999
x-ratelimit-api_key-limit-requests: 6000
x-ratelimit-api_key-remaining-tokens: 99999
x-ratelimit-api_key-limit-tokens: 100000

==============================================================
[chat/completions  (stream=true)]
==============================================================
HTTP/1.1 200 OK
content-type: text/event-stream; charset=utf-8
                  ↑ no x-ratelimit-* headers

==============================================================
[embeddings        (sync)]
==============================================================
HTTP/1.1 200 OK
content-type: application/json
x-ratelimit-api_key-remaining-requests: 5997
x-ratelimit-api_key-limit-requests: 6000
x-ratelimit-api_key-remaining-tokens: 99955
x-ratelimit-api_key-limit-tokens: 100000

==============================================================
[messages          (sync)]
==============================================================
HTTP/1.1 200 OK
content-type: application/json
                  ↑ no x-ratelimit-* headers (plain-dict response)

==============================================================
[messages          (stream=true)]
==============================================================
HTTP/1.1 200 OK
content-type: text/event-stream; charset=utf-8
                  ↑ no x-ratelimit-* headers

Expected: every endpoint returns the same x-ratelimit-api_key-* set (4 headers when the key has both tpm and rpm limits). Actual: only /v1/chat/completions sync and /v1/embeddings sync emit them.
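The expected set can be checked mechanically against a `curl -D -` dump. This is a minimal sketch; `parse_headers` and `missing_rate_limit_headers` are hypothetical helper names, and the two sample dumps are copied from the probe output above.

```python
from typing import Dict, List


def parse_headers(raw: str) -> Dict[str, str]:
    """Parse a `curl -D -` style header dump into a lower-cased dict."""
    parsed: Dict[str, str] = {}
    for line in raw.splitlines():
        name, sep, value = line.partition(":")
        if sep:  # the status line has no colon and is skipped
            parsed[name.strip().lower()] = value.strip()
    return parsed


def missing_rate_limit_headers(raw: str, descriptor: str = "api_key") -> List[str]:
    """Return the expected x-ratelimit-* header names absent from the dump."""
    expected = [
        f"x-ratelimit-{descriptor}-{kind}-{unit}"
        for kind in ("remaining", "limit")
        for unit in ("requests", "tokens")
    ]
    present = parse_headers(raw)
    return [name for name in expected if name not in present]


sync_dump = """HTTP/1.1 200 OK
content-type: application/json
x-ratelimit-api_key-remaining-requests: 5999
x-ratelimit-api_key-limit-requests: 6000
x-ratelimit-api_key-remaining-tokens: 99999
x-ratelimit-api_key-limit-tokens: 100000
"""
stream_dump = """HTTP/1.1 200 OK
content-type: text/event-stream; charset=utf-8
"""

print(missing_rate_limit_headers(sync_dump))         # []
print(len(missing_rate_limit_headers(stream_dump)))  # 4
```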

Steps to Reproduce

  1. config.yaml:

    model_list:
      - model_name: test-embedding
        litellm_params:
          model: hosted_vllm/mock-embedding
          api_base: http://localhost:28100/v1
          api_key: dummy
    
      - model_name: test-chat
        litellm_params:
          model: openai/mock-chat
          api_base: http://localhost:28100/v1
          api_key: dummy
    
      - model_name: test-anthropic
        litellm_params:
          model: anthropic/mock-claude
          api_base: http://localhost:28100
          api_key: dummy
    
    general_settings:
      master_key: sk-1234
  2. Start postgres + redis + an OpenAI/Anthropic mock backend on port 28100 (mock implements /v1/chat/completions with stream=true SSE chunks, /v1/embeddings, and Anthropic-style /v1/messages with stream=true SSE events).

  3. Boot the proxy and create a virtual key (DB-backed so v3 limiter activates):

    curl -sS http://localhost:4000/key/generate \
      -H 'Authorization: Bearer sk-1234' \
      -H 'Content-Type: application/json' \
      -d '{"models":["test-embedding","test-chat","test-anthropic"],
           "tpm_limit": 100000, "rpm_limit": 6000}'
  4. Probe each shape with curl -D - and grep for x-ratelimit-:

    KEY=<key from step 3>
    PROXY=http://localhost:4000
    
    curl -sS -D - -o /dev/null -H "Authorization: Bearer $KEY" -H 'Content-Type: application/json' \
      "$PROXY/v1/chat/completions" \
      -d '{"model":"test-chat","messages":[{"role":"user","content":"hello"}]}' | grep -i ratelimit
    
    curl -sS -D - -o /dev/null -N -H "Authorization: Bearer $KEY" -H 'Content-Type: application/json' \
      "$PROXY/v1/chat/completions" \
      -d '{"model":"test-chat","stream":true,"messages":[{"role":"user","content":"hello"}]}' | grep -i ratelimit
    
    curl -sS -D - -o /dev/null -H "Authorization: Bearer $KEY" -H 'Content-Type: application/json' \
      "$PROXY/v1/messages" \
      -d '{"model":"test-anthropic","max_tokens":100,"messages":[{"role":"user","content":"hello"}]}' | grep -i ratelimit

    The streaming and /v1/messages probes return zero x-ratelimit-* headers even though the counters are incremented (verifiable via the next header-bearing request, which shows the remaining values decremented further).
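The counter check can be done by comparing two header-bearing responses. The sketch below assumes the probes ran in the order shown in the run_probes.sh output and that nothing else used the key in between; `remaining_requests` and `uncounted_in_headers` are hypothetical helper names.

```python
import re


def remaining_requests(raw: str, descriptor: str = "api_key") -> int:
    """Extract the remaining-requests value from a header dump."""
    pattern = rf"^x-ratelimit-{re.escape(descriptor)}-remaining-requests:\s*(\d+)\s*$"
    match = re.search(pattern, raw, flags=re.IGNORECASE | re.MULTILINE)
    if match is None:
        raise ValueError("no remaining-requests header in dump")
    return int(match.group(1))


def uncounted_in_headers(before: str, after: str) -> int:
    """Requests that hit the counter between two header-bearing responses."""
    return remaining_requests(before) - remaining_requests(after) - 1


# Values taken from the chat/completions (sync) and embeddings (sync) probes.
chat_sync = "HTTP/1.1 200 OK\nx-ratelimit-api_key-remaining-requests: 5999\n"
embeddings_sync = "HTTP/1.1 200 OK\nx-ratelimit-api_key-remaining-requests: 5997\n"

print(uncounted_in_headers(chat_sync, embeddings_sync))  # 1
```

Under those assumptions, the single request in between is the stream=true chat probe: it consumed quota but reported no headers.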

Relevant log output

No errors are logged — the headers are dropped silently. The v3 hook still runs and mutates _hidden_params, but those updates never reach the client because:

  • For streaming, headers are committed when the SSE response starts.
  • For plain-dict /v1/messages responses, hasattr(response, "_hidden_params") is False so the hook short-circuits.
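The streaming ordering problem can be shown with a toy generator. This is a simulation, not proxy code: `serve_stream` is a hypothetical name, and placing the hook after the last chunk is an assumption made only to illustrate that it runs after the headers have been sent.

```python
from typing import Dict, Iterator, List, Tuple


def serve_stream(
    chunks: List[str], hidden_params: Dict[str, Dict[str, str]]
) -> Iterator[Tuple[str, object]]:
    """Toy server loop: headers are snapshotted and sent before the first chunk."""
    sent_headers = {"content-type": "text/event-stream; charset=utf-8"}
    sent_headers.update(hidden_params.get("additional_headers", {}))
    yield ("headers", dict(sent_headers))  # committed here; cannot change later
    for chunk in chunks:
        yield ("data", chunk)
    # Stand-in for the post-call success hook (assumed to run after the stream).
    hidden_params.setdefault("additional_headers", {})[
        "x-ratelimit-api_key-remaining-requests"
    ] = "5998"


hidden: Dict[str, Dict[str, str]] = {}
events = list(serve_stream(["data: hello\n\n", "data: [DONE]\n\n"], hidden))
client_headers = events[0][1]

print("x-ratelimit-api_key-remaining-requests" in client_headers)  # False
print(hidden["additional_headers"])  # the update exists, but only server-side
```

Any fix for the streaming path therefore has to make the values available before the first byte of the response is sent, since headers cannot be amended afterwards.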

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on?

v1.83.14 (also verified against main on 2026-05-12)
