litellm - [Bug]: x-ratelimit-* headers dropped on streaming responses and plain-dict responses (v3 parallel_request_limiter)

What happened?

The v3 parallel_request_limiter (PROXY_HOOKS["parallel_request_limiter"], the default since the v3 rewrite) emits x-ratelimit-{descriptor}-{remaining,limit}-{requests,tokens} headers from async_post_call_success_hook by mutating response._hidden_params["additional_headers"]. Two response paths silently drop these headers:

1. Streaming responses: for stream=true requests, the SSE response headers are flushed to the client before async_post_call_success_hook runs. The hook still mutates _hidden_params, but the client never sees the resulting x-ratelimit-* keys. Affected: /v1/chat/completions stream=true, /v1/messages stream=true, /v1/responses stream=true.
2. Plain-dict responses: when the response is a plain dict with no _hidden_params attribute, the hook short-circuits at if hasattr(response, "_hidden_params") (parallel_request_limiter_v3.py:2751-2754 on main). Affected: /v1/messages non-streaming.

Both paths still increment the rate-limit counters correctly. Only the header visibility is broken, which makes client-side quota tracking unreliable for any caller using the affected endpoints.
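To make the plain-dict short-circuit concrete, here is a minimal sketch of the guard. The function attach_ratelimit_headers is a hypothetical stand-in written for this illustration, not LiteLLM's actual hook code; it only assumes the hasattr(response, "_hidden_params") check described above.

```python
from types import SimpleNamespace


def attach_ratelimit_headers(response, headers):
    """Hypothetical stand-in for the hook's header step (not LiteLLM code).

    Mirrors the guard described in the issue: headers are attached only
    when the response object exposes a `_hidden_params` attribute.
    """
    if hasattr(response, "_hidden_params"):
        hidden = response._hidden_params
        hidden.setdefault("additional_headers", {}).update(headers)
        return True
    # A plain dict has no such attribute, so nothing is attached.
    return False


headers = {"x-ratelimit-api_key-remaining-requests": "5999"}

# Object-style response that carries a `_hidden_params` attribute.
model_response = SimpleNamespace(_hidden_params={})
attached_to_object = attach_ratelimit_headers(model_response, headers)

# Plain-dict response, as reported for non-streaming /v1/messages.
dict_response = {"type": "message", "content": []}
attached_to_dict = attach_ratelimit_headers(dict_response, headers)

print(attached_to_object, model_response._hidden_params)
print(attached_to_dict, dict_response)
```

Note that hasattr checks attributes, not keys, so even a dict that contained a "_hidden_params" key would still fail this guard.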

The hook is the only site that emits these headers; common_request_processing.py has no fallback. A grep -n "x-ratelimit\|apply_rate_limit\|litellm_proxy_rate_limit_response" litellm/proxy/common_request_processing.py on main returns zero matches.

Reproduction (single-instance, `ghcr.io/berriai/litellm:main-stable`)

Create a virtual key with tpm_limit and rpm_limit so the v3 limiter activates, then fire one request per shape and dump the response headers:

$ ./run_probes.sh
==============================================================
[chat/completions  (sync)]
==============================================================
HTTP/1.1 200 OK
content-type: application/json
x-ratelimit-api_key-remaining-requests: 5999
x-ratelimit-api_key-limit-requests: 6000
x-ratelimit-api_key-remaining-tokens: 99999
x-ratelimit-api_key-limit-tokens: 100000

==============================================================
[chat/completions  (stream=true)]
==============================================================
HTTP/1.1 200 OK
content-type: text/event-stream; charset=utf-8
                  ↑ no x-ratelimit-* headers

==============================================================
[embeddings        (sync)]
==============================================================
HTTP/1.1 200 OK
content-type: application/json
x-ratelimit-api_key-remaining-requests: 5997
x-ratelimit-api_key-limit-requests: 6000
x-ratelimit-api_key-remaining-tokens: 99955
x-ratelimit-api_key-limit-tokens: 100000

==============================================================
[messages          (sync)]
==============================================================
HTTP/1.1 200 OK
content-type: application/json
                  ↑ no x-ratelimit-* headers (plain-dict response)

==============================================================
[messages          (stream=true)]
==============================================================
HTTP/1.1 200 OK
content-type: text/event-stream; charset=utf-8
                  ↑ no x-ratelimit-* headers

Expected: every endpoint returns the same x-ratelimit-api_key-* set (4 headers when the key has both tpm and rpm limits). Actual: only /v1/chat/completions sync and /v1/embeddings sync emit them.
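The expected-versus-actual check can be automated. The sketch below builds the four header names from the x-ratelimit-{descriptor}-{remaining,limit}-{requests,tokens} pattern and reports which ones a response lacks. The helper names are hypothetical, and the sample header dictionaries are copied from the probe output above.

```python
def expected_ratelimit_headers(descriptor="api_key"):
    """Return the four header names expected when both tpm and rpm limits are set."""
    return {
        f"x-ratelimit-{descriptor}-{kind}-{unit}"
        for kind in ("remaining", "limit")
        for unit in ("requests", "tokens")
    }


def missing_ratelimit_headers(response_headers, descriptor="api_key"):
    """Return the sorted list of expected headers absent from a response."""
    present = {name.lower() for name in response_headers}
    return sorted(expected_ratelimit_headers(descriptor) - present)


# Headers taken from the probe output for the sync and streaming chat calls.
chat_sync = {
    "content-type": "application/json",
    "x-ratelimit-api_key-remaining-requests": "5999",
    "x-ratelimit-api_key-limit-requests": "6000",
    "x-ratelimit-api_key-remaining-tokens": "99999",
    "x-ratelimit-api_key-limit-tokens": "100000",
}
chat_stream = {"content-type": "text/event-stream; charset=utf-8"}

missing_sync = missing_ratelimit_headers(chat_sync)
missing_stream = missing_ratelimit_headers(chat_stream)
print("sync missing:", missing_sync)
print("stream missing:", missing_stream)
```

A regression test for the fix could assert that this list is empty for every endpoint shape.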

Steps to Reproduce

1. config.yaml:

model_list:
  - model_name: test-embedding
    litellm_params:
      model: hosted_vllm/mock-embedding
      api_base: http://localhost:28100/v1
      api_key: dummy

  - model_name: test-chat
    litellm_params:
      model: openai/mock-chat
      api_base: http://localhost:28100/v1
      api_key: dummy

  - model_name: test-anthropic
    litellm_params:
      model: anthropic/mock-claude
      api_base: http://localhost:28100
      api_key: dummy

general_settings:
  master_key: sk-1234

2. Start postgres + redis + an OpenAI/Anthropic mock backend on port 28100 (the mock implements /v1/chat/completions with stream=true SSE chunks, /v1/embeddings, and Anthropic-style /v1/messages with stream=true SSE events).

3. Boot the proxy and create a virtual key (DB-backed so the v3 limiter activates):

curl -sS http://localhost:4000/key/generate \
  -H 'Authorization: Bearer sk-1234' \
  -H 'Content-Type: application/json' \
  -d '{"models":["test-embedding","test-chat","test-anthropic"],
       "tpm_limit": 100000, "rpm_limit": 6000}'

4. Probe each shape with curl -D - and grep for x-ratelimit-:

KEY=<key from step 3>
PROXY=http://localhost:4000

curl -sS -D - -o /dev/null -H "Authorization: Bearer $KEY" -H 'Content-Type: application/json' \
  "$PROXY/v1/chat/completions" \
  -d '{"model":"test-chat","messages":[{"role":"user","content":"hello"}]}' | grep -i ratelimit

curl -sS -D - -o /dev/null -N -H "Authorization: Bearer $KEY" -H 'Content-Type: application/json' \
  "$PROXY/v1/chat/completions" \
  -d '{"model":"test-chat","stream":true,"messages":[{"role":"user","content":"hello"}]}' | grep -i ratelimit

curl -sS -D - -o /dev/null -H "Authorization: Bearer $KEY" -H 'Content-Type: application/json' \
  "$PROXY/v1/messages" \
  -d '{"model":"test-anthropic","max_tokens":100,"messages":[{"role":"user","content":"hello"}]}' | grep -i ratelimit

The streaming and /v1/messages probes return zero x-ratelimit-* headers even though the counters are incremented (verifiable via the next request, which shows remaining-tokens decremented further).
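The counter claim can be checked from the header dumps themselves. The sketch below parses two curl -D - dumps and subtracts the remaining-requests values. The helper names are hypothetical, the dumps are copied from the probe output earlier in this report, and the interpretation assumes the probes ran in the order printed (chat sync, chat stream, embeddings sync).

```python
def parse_header_dump(raw):
    """Parse a `curl -D -` header dump into a lowercase-keyed dict."""
    headers = {}
    for line in raw.splitlines():
        # Skip the status line and any blank or malformed lines.
        if line.startswith("HTTP/") or ":" not in line:
            continue
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def requests_consumed(before, after, descriptor="api_key"):
    """Difference in remaining-requests between two parsed header dumps."""
    key = f"x-ratelimit-{descriptor}-remaining-requests"
    return int(before[key]) - int(after[key])


CHAT_SYNC_DUMP = """HTTP/1.1 200 OK
content-type: application/json
x-ratelimit-api_key-remaining-requests: 5999
x-ratelimit-api_key-limit-requests: 6000
x-ratelimit-api_key-remaining-tokens: 99999
x-ratelimit-api_key-limit-tokens: 100000
"""

EMBEDDINGS_SYNC_DUMP = """HTTP/1.1 200 OK
content-type: application/json
x-ratelimit-api_key-remaining-requests: 5997
x-ratelimit-api_key-limit-requests: 6000
x-ratelimit-api_key-remaining-tokens: 99955
x-ratelimit-api_key-limit-tokens: 100000
"""

before = parse_header_dump(CHAT_SYNC_DUMP)
after = parse_header_dump(EMBEDDINGS_SYNC_DUMP)
consumed = requests_consumed(before, after)
print("requests counted between the two dumps:", consumed)
```

Under that ordering assumption, a difference of 2 accounts for the embeddings request plus the streaming chat request that returned no headers in between.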

Relevant log output

No errors are logged; the headers are dropped silently. The v3 hook still runs and mutates _hidden_params, but those updates never reach the client because:

1. For streaming, headers are committed when the SSE response starts.
2. For plain-dict /v1/messages responses, hasattr(response, "_hidden_params") is False so the hook short-circuits.
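The streaming case comes down to ordering. The sketch below is a dependency-free simulation using ASGI-style messages, in which headers travel in the http.response.start message that precedes every body chunk. The streaming_app and late_hook functions and the value "5998" are hypothetical and written only for this illustration; this is not LiteLLM or Starlette code.

```python
import asyncio


async def streaming_app(send, headers, chunks, post_hook):
    """Hypothetical ASGI-style sender used only to illustrate ordering."""
    # Headers are serialized into the start message and sent first.
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(k.encode(), v.encode()) for k, v in headers.items()],
    })
    for chunk in chunks:
        await send({"type": "http.response.body", "body": chunk, "more_body": True})
    await send({"type": "http.response.body", "body": b"", "more_body": False})
    # A hook that runs after streaming can still mutate the dict,
    # but the start message has already gone out.
    post_hook(headers)


def run_demo():
    sent = []

    async def send(message):
        sent.append(message)

    headers = {"content-type": "text/event-stream; charset=utf-8"}

    def late_hook(h):
        # Hypothetical value, standing in for what the limiter would report.
        h["x-ratelimit-api_key-remaining-requests"] = "5998"

    asyncio.run(streaming_app(send, headers, [b"data: hi\n\n"], late_hook))
    wire_headers = dict(sent[0]["headers"])
    return headers, wire_headers


final_headers, wire_headers = run_demo()
print("in memory:", sorted(final_headers))
print("on the wire:", sorted(wire_headers))
```

Any fix for the streaming path therefore has to compute the rate-limit headers before the response start is sent, rather than in a hook that runs afterwards.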

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on?

v1.83.14 (also verified against main on 2026-05-12)
