litellm - [Bug]: Embedding / TextCompletion / Responses API responses are not counted toward key/team/org TPM limits (only `ModelResponse` is checked)

litellm, 2026-05-12 11:51:59


Fix Action

Fixed

Fixed by PR: fix(parallel_request_limiter): count Embedding / TextCompletion / ResponsesAPI responses toward key/team/org TPM (https://github.com/BerriAI/litellm/pull/27739)


Check for existing issues

I have searched the existing issues and checked that my issue is not a duplicate.

Related but distinct:

#7486 (closed not_planned) — usage-based-routing-v2 fails to log embed event. The closer's evidence was anecdotal and not reproducible from the issue body. This bug is in parallel_request_limiter (key/team/org TPM), a separate code path from lowest_tpm_rpm_v2, and the reproduction below is concrete.
#27736 — Deployment-level TPM not cross-pod (different code path; this issue is about the per-key/per-team/per-org limits in parallel_request_limiter.py).

What happened?

parallel_request_limiter.py accumulates total_tokens into the per-key / per-user / per-team / per-end-user TPM counters using:

if isinstance(response_obj, ModelResponse):
    total_tokens = response_obj.usage.total_tokens

This check appears in four places (one per limit scope) in parallel_request_limiter.py, at lines 511, 600, 633, and 666. No branch handles EmbeddingResponse, TextCompletionResponse, or ResponsesAPIResponse, so for those responses total_tokens stays at 0 and the TPM counter never increments. The tpm_limit set on a key/user/team/end-user is therefore silently ignored for those response types.
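One way to make the token accounting independent of the response class is to read usage.total_tokens by duck typing instead of gating on isinstance(response_obj, ModelResponse). The sketch below is illustrative only: Usage, FakeChatResponse, FakeEmbeddingResponse, and extract_total_tokens are hypothetical names used for the example. They are not LiteLLM's classes, and this is not necessarily the change made in the linked PR.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Usage:
    """Hypothetical stand-in for a usage object carrying a token total."""
    total_tokens: Optional[int] = None


@dataclass
class FakeChatResponse:
    """Hypothetical stand-in for a chat-completions response."""
    usage: Optional[Usage] = None


@dataclass
class FakeEmbeddingResponse:
    """Hypothetical stand-in for an embeddings response."""
    usage: Optional[Usage] = None


def extract_total_tokens(response_obj: Any) -> int:
    """Return usage.total_tokens for any response shape, or 0 if absent.

    Works on objects with a `usage` attribute and on plain dicts, so the
    caller does not need one isinstance branch per response class.
    """
    usage = getattr(response_obj, "usage", None)
    if usage is None and isinstance(response_obj, dict):
        usage = response_obj.get("usage")

    if isinstance(usage, dict):
        total = usage.get("total_tokens")
    else:
        total = getattr(usage, "total_tokens", None)

    # Treat missing or non-positive totals as zero tokens.
    return total if isinstance(total, int) and total > 0 else 0
```

With a helper of this shape, each of the four call sites could add whatever the helper returns to its counter, regardless of which endpoint produced the response.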

Affected endpoints

/v1/embeddings → returns EmbeddingResponse → not counted
/v1/completions → returns TextCompletionResponse → not counted
/v1/responses → returns ResponsesAPIResponse → not counted
/v1/chat/completions → returns ModelResponse → counted correctly

Steps to Reproduce

Tested on litellm[proxy]==1.83.14 (PyPI latest stable). Same behavior on main and litellm_oss_staging by code inspection — no changes to the affected isinstance checks.

Setup

config.yaml:

model_list:
  - model_name: test-embedding
    litellm_params:
      model: openai/text-embedding-3-small
      api_key: os.environ/OPENAI_API_KEY

general_settings:
  master_key: sk-1234

Generate a key with a small tpm_limit:

litellm --config config.yaml --port 4000 &
sleep 5

# Create a virtual key with tpm_limit=100
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-1234" \
  -H "Content-Type: application/json" \
  -d '{"tpm_limit": 100}'
# returns {"key": "sk-..."}

Reproduction

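Send embedding requests with the new key. The reproduction assumes each request consumes more than 100 tokens; the sample input below is only 16 words, so lengthen it if a single request must exceed the limit by itself: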

TEST_KEY=sk-...  # the key returned above

for i in $(seq 1 10); do
  curl -s -X POST http://localhost:4000/v1/embeddings \
    -H "Authorization: Bearer $TEST_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model": "test-embedding", "input": "this is a long enough input to exceed the per-key tpm limit on a single request"}' \
    -o /dev/null -w "%{http_code}\n"
done

Expected: Once the key's usage exceeds 100 tokens (after ~1 request when each input is over 100 tokens), subsequent requests should return HTTP 429 with "Rate limit exceeded for api_key".

Observed: 200 on every request. The key's tpm_limit is silently ignored.
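The gap between the expected and observed behavior can be illustrated with a toy per-minute token counter. This is a simplified model, not LiteLLM's limiter: ToyTpmLimiter, simulate, and the 120-token per-request figure are assumptions chosen for illustration, and the one-minute window reset is omitted. When tokens are recorded after each response, the second request is rejected; when recording is skipped, as happens for the uncounted response types, every request passes.

```python
from typing import List


class ToyTpmLimiter:
    """Toy model of a tokens-per-minute limit for a single key."""

    def __init__(self, tpm_limit: int) -> None:
        self.tpm_limit = tpm_limit
        self.tokens_used = 0  # tokens counted so far in the current window

    def check(self) -> int:
        """Return the HTTP status a pre-call check would produce."""
        return 429 if self.tokens_used >= self.tpm_limit else 200

    def record(self, total_tokens: int) -> None:
        """Add a finished response's tokens to the window counter."""
        self.tokens_used += total_tokens


def simulate(requests: int, tokens_per_request: int, counted: bool) -> List[int]:
    """Run `requests` calls; `counted=False` models the uncounted path."""
    limiter = ToyTpmLimiter(tpm_limit=100)
    statuses: List[int] = []
    for _ in range(requests):
        status = limiter.check()
        statuses.append(status)
        if status == 200 and counted:
            limiter.record(tokens_per_request)
    return statuses


# Counted path (chat completions in the report): limit enforced.
print(simulate(3, 120, counted=True))
# Uncounted path (embeddings in the report): limit never triggers.
print(simulate(3, 120, counted=False))
```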

Direct verification

After the 10 requests, query the cached token counter:

curl -s -H "Authorization: Bearer sk-1234" \
  "http://localhost:4000/key/info?key=$TEST_KEY"

The tpm_remaining / tpm_used values for the key show that the counter never moved (0 used), confirming that embedding tokens were never added.
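The same check can be scripted in Python using only the standard library. This is a minimal sketch that mirrors the curl command above; build_key_info_request and fetch_key_info are hypothetical helper names, and the response is returned as raw JSON because its exact field layout is not shown in this report.

```python
import json
import urllib.parse
import urllib.request


def build_key_info_request(
    base_url: str, master_key: str, key: str
) -> urllib.request.Request:
    """Build the GET /key/info request, authenticated with the master key."""
    query = urllib.parse.urlencode({"key": key})
    url = f"{base_url.rstrip('/')}/key/info?{query}"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {master_key}"}
    )


def fetch_key_info(base_url: str, master_key: str, key: str) -> dict:
    """Send the request to a running proxy and return the parsed JSON body."""
    req = build_key_info_request(base_url, master_key, key)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

Against the proxy from the setup step, calling fetch_key_info("http://localhost:4000", "sk-1234", TEST_KEY) before and after the request loop and comparing the two results would show whether the counter moved.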

Replacing the model with a chat-completions model and rerunning produces 429 on the second request as expected.

Relevant log output

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on?

v1.83.14

Twitter / LinkedIn details

No response
