litellm - 💡(How to fix) Fix [Bug]: Deployment-level TPM enforcement is per-pod, not cross-pod — effective limit becomes `tpm_limit × N_replica` [1 pull requests]

Fix Action

Fixed

Check for existing issues

  • I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

In a multi-replica LiteLLM proxy deployment with usage-based-routing-v2, the deployment-level TPM limit (litellm_params.tpm in model_list) is enforced against each replica's local in-memory counter rather than the cross-pod sum. The effective per-deployment TPM ceiling becomes tpm_limit × N_replica, and traffic up to that ceiling passes through with zero 429 responses.

Root cause (code-level)

RPM keys are batch-synced across replicas (since #9357 — "support batch writing increments to redis"), but TPM keys are not:

lowest_tpm_rpm_v2.async_log_success_event (line ~305):

await self.router_cache.async_increment_cache(
    key=tpm_key, value=total_tokens, ttl=..., parent_otel_span=...,
)

This calls DualCache.async_increment_cache, which writes to both in-memory and Redis directly, but does not register the key with the queue-based sync mechanism in base_routing_strategy.py. The periodic _sync_in_memory_spend_with_redis task only processes keys that go through _increment_value_in_current_window (which is how RPM and provider budget keys are handled).

So:

  • Redis counter = correct cross-pod sum (atomic INCRBYFLOAT)
  • Each replica's in-memory counter = only the responses it processed locally
  • The prefilter (_return_potential_deployments) reads counters via DualCache.async_batch_get_cache, which short-circuits on local in-memory hits and rarely falls back to Redis after the first read (the redis_batch_cache_expiry throttle defaults to 10 seconds)

Result: each replica compares local_in_memory_value + input_tokens > tpm_limit independently, so traffic up to tpm_limit × N_replica passes through with zero 429s.
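The effect described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not LiteLLM code: the `Replica` class and `run_minute` function are hypothetical names, requests are spread round-robin, every request costs 15 tokens, everything falls inside one minute window, and counters are bumped at admission time for simplicity.

```python
from dataclasses import dataclass

TPM_LIMIT = 500          # same limit as the reproduction config
N_REPLICAS = 5
TOKENS_PER_REQUEST = 15  # assumed fixed cost per request


@dataclass
class Replica:
    """Hypothetical stand-in for one proxy replica's in-memory counter."""
    local_tokens: int = 0


def run_minute(n_requests: int, use_shared_counter: bool) -> int:
    """Return the total tokens admitted within a single minute window."""
    replicas = [Replica() for _ in range(N_REPLICAS)]
    shared_tokens = 0  # plays the role of the cross-pod Redis sum
    admitted_tokens = 0
    for i in range(n_requests):
        replica = replicas[i % N_REPLICAS]  # round-robin load balancing
        seen = shared_tokens if use_shared_counter else replica.local_tokens
        if seen + TOKENS_PER_REQUEST > TPM_LIMIT:
            continue  # this request would receive a 429
        replica.local_tokens += TOKENS_PER_REQUEST
        shared_tokens += TOKENS_PER_REQUEST
        admitted_tokens += TOKENS_PER_REQUEST
    return admitted_tokens


local_total = run_minute(200, use_shared_counter=False)
shared_total = run_minute(200, use_shared_counter=True)
print(f"local counters:  {local_total} tokens admitted")
print(f"shared counter:  {shared_total} tokens admitted")
```

With local counters each of the five replicas admits 33 requests before its own check trips, so about five times the configured limit gets through; with the shared counter the total stops just under 500.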

Observed in production

5-replica deployment, embedding model, configured tpm_limit = 1,500,000:

Configured: tpm_limit = 1,500,000
Observed (40-minute sustained burst):
  - per-minute tokens started:  ~5.5M (~3.7x over limit)
  - per-minute tokens completed: ~3.1M  (counter visible to a single replica never reaches limit)
  - HTTP 429 count:               0  ← root cause
  - backend latency:              1.4s → 65s polynomial growth (queue saturation)
  - backend 504 timeouts:        31K+ (the only thing actually protecting the backend)

Per-replica math:  5.5M / 5 replicas = 1.1M per replica < 1.5M limit → all pass

The overrun is silent: no 429 is returned until traffic exceeds tpm_limit × N_replica. With a single-replica deployment the bug is invisible because that replica's local counter matches Redis, so the issue only manifests with multiple replicas.
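The production figures can be checked with a few lines of arithmetic. This sketch assumes traffic is spread evenly across the replicas, which the report implies but does not state; the variable names are illustrative.

```python
TPM_LIMIT = 1_500_000
N_REPLICAS = 5
STARTED_TOKENS_PER_MIN = 5_500_000  # "per-minute tokens started" from the report

# Share of the traffic each replica sees, assuming an even split.
per_replica = STARTED_TOKENS_PER_MIN / N_REPLICAS
# Ceiling that is effectively enforced when every replica checks only itself.
effective_ceiling = TPM_LIMIT * N_REPLICAS
# How far the observed traffic is above the configured limit.
overshoot = STARTED_TOKENS_PER_MIN / TPM_LIMIT

print(f"per-replica share:  {per_replica:,.0f} tokens/min")
print(f"effective ceiling:  {effective_ceiling:,} tokens/min")
print(f"overshoot factor:   {overshoot:.1f}x")
print(f"any replica trips?  {per_replica > TPM_LIMIT}")
```

Under that assumption no replica ever sees more than 1.1M tokens per minute, which is why the 1.5M limit is never hit even though the deployment as a whole is well above it.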

Steps to Reproduce

Setup

  1. Local Redis on port 16379
  2. Mock embedding backend with controlled latency:
# mock_backend.py
import asyncio, os, json
from fastapi import FastAPI, Request
import uvicorn

LATENCY_SEC = float(os.getenv("LATENCY_SEC", "0.5"))
app = FastAPI()

@app.post("/v1/embeddings")
async def embeddings(req: Request):
    body = await req.json()
    n_tokens = max(1, len(str(body.get("input", ""))) // 4)
    await asyncio.sleep(LATENCY_SEC)
    return {
        "object": "list",
        "data": [{"object": "embedding", "embedding": [0.1, 0.2, 0.3], "index": 0}],
        "model": body.get("model"),
        "usage": {"prompt_tokens": n_tokens, "total_tokens": n_tokens, "completion_tokens": 0},
    }

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=28000, log_level="warning")
  3. LiteLLM proxy config.yaml:
model_list:
  - model_name: test-model
    litellm_params:
      model: hosted_vllm/mock-embedding
      api_base: http://localhost:28000/v1
      api_key: dummy
      tpm: 500
      rpm: 600000

router_settings:
  routing_strategy: usage-based-routing-v2
  redis_host: localhost
  redis_port: 16379

general_settings:
  master_key: sk-1234
  4. Start 5 instances on ports 14000-14004 (simulating 5 replicas):
for PORT in 14000 14001 14002 14003 14004; do
  litellm --config config.yaml --port $PORT &
done

Reproduction — continuous burst (RPS 1.5 for 50 seconds)

import asyncio, json, urllib.request, urllib.error, time
from collections import Counter

PORTS = [14000, 14001, 14002, 14003, 14004]
KEY = "sk-1234"
PAYLOAD = json.dumps({"model": "test-model", "input": "hello world " * 10}).encode()

def call(idx):
    # Round-robin across the proxy instances.
    port = PORTS[idx % len(PORTS)]
    req = urllib.request.Request(f"http://localhost:{port}/v1/embeddings", method="POST", data=PAYLOAD,
        headers={"Authorization": f"Bearer {KEY}", "Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=30) as r:
            return "200"
    except urllib.error.HTTPError as e:
        b = e.read().decode("utf-8", errors="replace")
        return "429_LayerB" if "No deployments available" in b else f"err_{e.code}"

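async def main():
    # Inside a coroutine, get_running_loop() is the idiomatic way to reach the loop.
    loop = asyncio.get_running_loop()
    futs = []
    t0 = time.monotonic()
    idx = 0
    # Fire one request every 0.67 s (about 1.5 RPS) for 50 seconds.
    while time.monotonic() - t0 < 50:
        idx += 1
        # call() blocks on urllib, so run it in the default thread pool.
        futs.append(loop.run_in_executor(None, call, idx))
        await asyncio.sleep(0.67)
    results = [await f for f in futs]
    print(Counter(results))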

asyncio.run(main())

Expected: ~33 requests pass (until the counter reaches the limit of 500), then 429.

Observed: Counter({'200': 72}), i.e. all 72 requests pass, 0 rate-limited.

Per-replica math: 72 req / 5 replicas = 14.4 req per replica; 14.4 req × 15 tokens ≈ 216 tokens per replica < 500 limit → all pass.

Redis counter timeline (proving the counter exceeds the limit but isn't enforced):

T+5s:   redis = 120
T+25s:  redis = 540   ← exceeds limit of 500, but still passing
T+45s:  redis = 975   ← nearly 2x limit, still passing
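The reproduction numbers above can be cross-checked as follows. This is a small sketch that only restates figures from the report (15 tokens per request, 72 passed requests, the three Redis samples); the variable names are illustrative.

```python
TPM_LIMIT = 500
TOKENS_PER_REQUEST = 15
N_REPLICAS = 5
OBSERVED_PASSED = 72

# Redis samples from the timeline: seconds since start -> counter value.
REDIS_TIMELINE = {5: 120, 25: 540, 45: 975}

# Requests that fit under the limit if it were enforced on the shared counter.
expected_passed = TPM_LIMIT // TOKENS_PER_REQUEST
# Tokens each replica counted locally, assuming an even round-robin split.
per_replica_tokens = OBSERVED_PASSED * TOKENS_PER_REQUEST / N_REPLICAS
# Samples where the shared counter was already above the limit.
over_limit = {t: v for t, v in REDIS_TIMELINE.items() if v > TPM_LIMIT}

print(f"expected to pass:    {expected_passed} requests")
print(f"per-replica tokens:  {per_replica_tokens:.0f}")
print(f"samples over limit:  {over_limit}")
```

The shared counter is above 500 at two of the three samples, while each replica's local view stays around 216 tokens, which matches the absence of 429 responses.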

Relevant log output

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on ?

v1.83.14
