[Bug]: Embedding / TextCompletion / Responses API responses are not counted toward key/team/org TPM limits (only `ModelResponse` is checked)


Check for existing issues

  • I have searched the existing issues and checked that my issue is not a duplicate.

Related but distinct:

  • #7486 (closed not_planned) — usage-based-routing-v2 fails to log embed event. The closer's evidence was anecdotal and not reproducible from the issue body. This bug is in parallel_request_limiter (key/team/org TPM), a separate code path from lowest_tpm_rpm_v2, and the reproduction below is concrete.
  • #27736 — Deployment-level TPM not cross-pod (different code path; this issue is about the per-key/per-team/per-org limits in parallel_request_limiter.py).

What happened?

parallel_request_limiter.py accumulates total_tokens into the per-key / per-user / per-team / per-end-user TPM counters using:

if isinstance(response_obj, ModelResponse):
    total_tokens = response_obj.usage.total_tokens

This check appears in four places (one per limit scope), at lines 511, 600, 633, and 666 of parallel_request_limiter.py. No branch handles EmbeddingResponse, TextCompletionResponse, or ResponsesAPIResponse, so for those responses total_tokens stays 0 and the TPM counter never increments. The tpm_limit set on a key/user/team/end-user is therefore silently ignored for those response types.
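To make the gap concrete, here is a minimal sketch that uses stand-in classes rather than LiteLLM's real types. The class bodies and the helper name `extract_total_tokens` are assumptions for illustration only, not the project's actual fix. The idea is that reading `usage.total_tokens` by attribute, rather than gating on a single class, would count any response type that reports usage.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Usage:
    total_tokens: int = 0


# Simplified, hypothetical stand-ins for LiteLLM's response types.
@dataclass
class ModelResponse:
    usage: Optional[Usage] = None


@dataclass
class EmbeddingResponse:
    usage: Optional[Usage] = None


def tokens_with_isinstance_gate(response_obj: Any) -> int:
    """Mirrors the check quoted above: only ModelResponse is counted."""
    total_tokens = 0
    if isinstance(response_obj, ModelResponse):
        total_tokens = response_obj.usage.total_tokens
    return total_tokens


def extract_total_tokens(response_obj: Any) -> int:
    """Hypothetical alternative: count any response exposing usage.total_tokens."""
    usage = getattr(response_obj, "usage", None)
    total = getattr(usage, "total_tokens", None)
    return total if isinstance(total, int) else 0


chat = ModelResponse(usage=Usage(total_tokens=42))
embedding = EmbeddingResponse(usage=Usage(total_tokens=17))

print(tokens_with_isinstance_gate(chat), tokens_with_isinstance_gate(embedding))  # 42 0
print(extract_total_tokens(chat), extract_total_tokens(embedding))  # 42 17
```

Whether the real response classes all expose `usage.total_tokens` in the same shape would need to be confirmed against the LiteLLM source before applying anything like this.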

Affected endpoints

  • /v1/embeddings → returns EmbeddingResponse → not counted
  • /v1/completions → returns TextCompletionResponse → not counted
  • /v1/responses → returns ResponsesAPIResponse → not counted
  • /v1/chat/completions → returns ModelResponse → counted correctly

Steps to Reproduce

Tested on litellm[proxy]==1.83.14 (PyPI latest stable). Same behavior on main and litellm_oss_staging by code inspection — no changes to the affected isinstance checks.

Setup

config.yaml:

model_list:
  - model_name: test-embedding
    litellm_params:
      model: openai/text-embedding-3-small
      api_key: os.environ/OPENAI_API_KEY

general_settings:
  master_key: sk-1234

Generate a key with a small tpm_limit:

litellm --config config.yaml --port 4000 &
sleep 5

# Create a virtual key with tpm_limit=100
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-1234" \
  -H "Content-Type: application/json" \
  -d '{"tpm_limit": 100}'
# returns {"key": "sk-..."}

Reproduction

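Send embedding requests with the new key. Each request is meant to exceed the 100-token limit on its own; the sample input below is only 16 words, so lengthen it if a single request has to cross the limit: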

TEST_KEY=sk-...  # the key returned above

for i in $(seq 1 10); do
  curl -s -X POST http://localhost:4000/v1/embeddings \
    -H "Authorization: Bearer $TEST_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model": "test-embedding", "input": "this is a long enough input to exceed the per-key tpm limit on a single request"}' \
    -o /dev/null -w "%{http_code}\n"
done

Expected: once the key has used more than 100 tokens within the minute (after ~1 request if each input exceeds 100 tokens), subsequent requests should return HTTP 429 with "Rate limit exceeded for api_key".

Observed: 200 on every request. The key's tpm_limit is silently ignored.
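The difference between the expected and observed behavior can be sketched with a toy in-memory limiter. This is not LiteLLM's implementation: the class `ToyTpmLimiter`, the pre-call check / post-call record split, and the figure of 120 tokens per request are all assumptions chosen to mirror the description in this report.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyTpmLimiter:
    """Toy single-window limiter: rejects once recorded usage reaches tpm_limit."""
    tpm_limit: int
    used: Dict[str, int] = field(default_factory=dict)

    def check(self, key: str) -> int:
        # Pre-call check: 429 if this key has already used up its budget.
        return 429 if self.used.get(key, 0) >= self.tpm_limit else 200

    def record(self, key: str, total_tokens: int) -> None:
        # Post-call hook: add the tokens reported for the finished request.
        self.used[key] = self.used.get(key, 0) + total_tokens


def run(requests: int, tokens_per_request: int, counted: bool) -> List[int]:
    """Return the status code of each request in order."""
    limiter = ToyTpmLimiter(tpm_limit=100)
    statuses = []
    for _ in range(requests):
        status = limiter.check("test-key")
        statuses.append(status)
        if status == 200:
            # counted=False models the bug: total_tokens stays 0 for this response type.
            limiter.record("test-key", tokens_per_request if counted else 0)
    return statuses


print(run(3, 120, counted=True))   # [200, 429, 429]
print(run(3, 120, counted=False))  # [200, 200, 200]
```

With `counted=False` the recorded usage never leaves 0, so the pre-call check can never fail, which matches the run of 200 responses observed above.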

Direct verification

After the 10 requests, query the cached token counter:

curl -s -H "Authorization: Bearer sk-1234" \
  "http://localhost:4000/key/info?key=$TEST_KEY"

The tpm_remaining / tpm_used values for the key show that the counter never moved (0 used), confirming that embedding tokens were never added.
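For scripted checks, the same /key/info query can be issued from Python's standard library. This is a sketch only: the helper names `build_key_info_request` and `fetch_key_info` are hypothetical, and the code makes no assumption about which fields the response contains, it simply decodes the JSON body.

```python
import json
import urllib.parse
import urllib.request


def build_key_info_request(base_url: str, master_key: str, key: str) -> urllib.request.Request:
    """Build the same GET /key/info request as the curl command above."""
    query = urllib.parse.urlencode({"key": key})
    return urllib.request.Request(
        f"{base_url}/key/info?{query}",
        headers={"Authorization": f"Bearer {master_key}"},
    )


def fetch_key_info(base_url: str, master_key: str, key: str) -> dict:
    """Send the request and decode the JSON body (requires a running proxy)."""
    request = build_key_info_request(base_url, master_key, key)
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


# Building the request does not touch the network; "sk-test" is a placeholder key.
req = build_key_info_request("http://localhost:4000", "sk-1234", "sk-test")
print(req.full_url)  # http://localhost:4000/key/info?key=sk-test
```

Calling `fetch_key_info("http://localhost:4000", "sk-1234", TEST_KEY)` against the running proxy before and after the request loop gives two snapshots of the key's usage to compare.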

Replacing the model with a chat-completions model and rerunning produces 429 on the second request as expected.

Relevant log output

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on?

v1.83.14
