litellm - 💡(How to fix) Fix [Bug]: Deployment-level TPM enforcement is per-pod, not cross-pod — effective limit becomes `tpm_limit × N_replica`

Check for existing issues

I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

In a multi-replica LiteLLM proxy deployment with usage-based-routing-v2, the deployment-level TPM limit (litellm_params.tpm in model_list) is enforced against each replica's local in-memory counter rather than the cross-pod sum. The effective per-deployment TPM ceiling becomes tpm_limit × N_replica, and traffic up to that ceiling passes through with zero 429 responses.

Root cause (code-level)

RPM keys are batch-synced across replicas (since #9357 — "support batch writing increments to redis"), but TPM keys are not:

lowest_tpm_rpm_v2.async_log_success_event (line ~305):

await self.router_cache.async_increment_cache(
    key=tpm_key, value=total_tokens, ttl=..., parent_otel_span=...,
)

This calls DualCache.async_increment_cache, which writes to both in-memory and Redis directly, but does not register the key with the queue-based sync mechanism in base_routing_strategy.py. The periodic _sync_in_memory_spend_with_redis task only processes keys that go through _increment_value_in_current_window (which is how RPM and provider budget keys are handled).

So:

- Redis counter = correct cross-pod sum (atomic INCRBYFLOAT)
- Each replica's in-memory counter = only the responses it processed locally
- The prefilter (_return_potential_deployments) reads counters via DualCache.async_batch_get_cache, which short-circuits on local in-memory hits and rarely falls back to Redis after the first read (the redis_batch_cache_expiry throttle defaults to 10 seconds)

Result: each replica compares local_in_memory_value + input_tokens > tpm_limit independently, so traffic up to tpm_limit × N_replica passes through with zero 429s.
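The difference between the two write paths described above can be illustrated with a toy model. This is a minimal sketch, not LiteLLM code: `SharedStore`, `Replica`, `direct_increment`, `queued_increment`, and `sync` are hypothetical names standing in for Redis, a proxy pod, the direct DualCache write, the queue-based path, and the periodic sync task respectively.

```python
from dataclasses import dataclass, field


@dataclass
class SharedStore:
    """Hypothetical stand-in for Redis: one counter map shared by all replicas."""
    counters: dict = field(default_factory=dict)

    def incr(self, key, value):
        self.counters[key] = self.counters.get(key, 0) + value
        return self.counters[key]


@dataclass
class Replica:
    """Hypothetical stand-in for one proxy pod with its own in-memory counters."""
    store: SharedStore
    local: dict = field(default_factory=dict)
    pending: dict = field(default_factory=dict)  # increments awaiting the next sync
    tracked: set = field(default_factory=set)    # keys registered with the sync task

    def direct_increment(self, key, value):
        # Path the issue describes for TPM keys: write to local memory and to the
        # shared store, but never register the key with the sync task.
        self.local[key] = self.local.get(key, 0) + value
        self.store.incr(key, value)

    def queued_increment(self, key, value):
        # Path the issue describes for RPM keys: write locally, queue the
        # increment, and register the key for periodic syncing.
        self.local[key] = self.local.get(key, 0) + value
        self.pending[key] = self.pending.get(key, 0) + value
        self.tracked.add(key)

    def sync(self):
        # Push queued increments, then overwrite local values of tracked keys
        # with the shared (cross-replica) totals.
        for key, value in self.pending.items():
            self.store.incr(key, value)
        self.pending.clear()
        for key in self.tracked:
            self.local[key] = self.store.counters.get(key, 0)


def demo(n_replicas=3):
    store = SharedStore()
    replicas = [Replica(store) for _ in range(n_replicas)]
    for r in replicas:
        r.direct_increment("tpm", 100)  # 100 tokens handled by this replica
        r.queued_increment("rpm", 1)    # 1 request handled by this replica
    for _ in range(2):                  # two sync rounds so every replica converges
        for r in replicas:
            r.sync()
    return store.counters, [dict(r.local) for r in replicas]


if __name__ == "__main__":
    shared, views = demo()
    print("shared:", shared)
    for i, view in enumerate(views):
        print(f"replica {i} local view:", view)
```

In this model the shared store ends up with the correct totals for both keys, and every replica's local "rpm" converges to the shared value, but each replica's local "tpm" stays at the 100 tokens it handled itself. A limit check against the local "tpm" value therefore sees only a fraction of the real usage.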

Observed in production

5-replica deployment, embedding model, configured tpm_limit = 1,500,000:

Configured: tpm_limit = 1,500,000
Observed (40-minute sustained burst):
  - per-minute tokens started:  ~5.5M (~3.7x over limit)
  - per-minute tokens completed: ~3.1M  (counter visible to a single replica never reaches limit)
  - HTTP 429 count:               0  ← root cause
  - backend latency:              1.4s → 65s polynomial growth (queue saturation)
  - backend 504 timeouts:        31K+ (the only thing actually protecting the backend)

Per-replica math:  5.5M / 5 replicas = 1.1M per replica < 1.5M limit → all pass

The bug is silent: no 429 is returned until traffic exceeds tpm_limit × N_replica. With a single replica the bug is invisible, because that replica's local counter matches Redis. The issue only manifests with multiple replicas.

Steps to Reproduce

Setup

- Local Redis on port 16379
- Mock embedding backend with controlled latency:

# mock_backend.py
# Minimal stand-in for an OpenAI-compatible embeddings backend.
import asyncio
import os

from fastapi import FastAPI, Request
import uvicorn

# Artificial per-request latency in seconds, overridable via the environment.
LATENCY_SEC = float(os.getenv("LATENCY_SEC", "0.5"))
app = FastAPI()

@app.post("/v1/embeddings")
async def embeddings(req: Request):
    body = await req.json()
    # Rough token estimate: about 4 characters per token, at least 1.
    n_tokens = max(1, len(str(body.get("input", ""))) // 4)
    await asyncio.sleep(LATENCY_SEC)
    return {
        "object": "list",
        "data": [{"object": "embedding", "embedding": [0.1, 0.2, 0.3], "index": 0}],
        "model": body.get("model"),
        "usage": {"prompt_tokens": n_tokens, "total_tokens": n_tokens, "completion_tokens": 0},
    }

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=28000, log_level="warning")

LiteLLM proxy config.yaml:

model_list:
  - model_name: test-model
    litellm_params:
      model: hosted_vllm/mock-embedding
      api_base: http://localhost:28000/v1
      api_key: dummy
      tpm: 500
      rpm: 600000

router_settings:
  routing_strategy: usage-based-routing-v2
  redis_host: localhost
  redis_port: 16379

general_settings:
  master_key: sk-1234

Start 5 instances on ports 14000-14004 (simulating 5 replicas):

for PORT in 14000 14001 14002 14003 14004; do
  litellm --config config.yaml --port $PORT &
done

Reproduction — continuous burst (RPS 1.5 for 50 seconds)

import asyncio, json, time, urllib.error, urllib.request
from collections import Counter

PORTS = [14000, 14001, 14002, 14003, 14004]
KEY = "sk-1234"
PAYLOAD = json.dumps({"model": "test-model", "input": "hello world " * 10}).encode()

def call(idx):
    # Spread requests round-robin across the proxy instances.
    port = PORTS[idx % len(PORTS)]
    req = urllib.request.Request(f"http://localhost:{port}/v1/embeddings", method="POST", data=PAYLOAD,
        headers={"Authorization": f"Bearer {KEY}", "Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=30):
            return "200"
    except urllib.error.HTTPError as e:
        b = e.read().decode("utf-8", errors="replace")
        # A deployment-level rate-limit rejection is expected to carry this message.
        return "429_LayerB" if "No deployments available" in b else f"err_{e.code}"
    except urllib.error.URLError as e:
        # Connection-level failure, e.g. a proxy instance that is not listening yet.
        return f"conn_err_{type(e.reason).__name__}"

async def main():
    loop = asyncio.get_running_loop()
    futs = []
    t0 = time.monotonic()
    idx = 0
    # One request every 0.67 s (about 1.5 RPS) for 50 seconds.
    while time.monotonic() - t0 < 50:
        idx += 1
        futs.append(loop.run_in_executor(None, call, idx))
        await asyncio.sleep(0.67)
    results = [await f for f in futs]
    print(Counter(results))

asyncio.run(main())

Expected: ~33 requests pass (until the counter reaches the limit of 500), then 429s.
Observed: Counter({'200': 72}), i.e. all 72 requests pass and none are rate-limited.

Per-replica math: 72 requests / 5 replicas = 14.4 requests per replica; 14.4 × 15 tokens = ~216 tokens per replica < 500 limit → all pass.
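The expected and observed counts can be checked with a small admission model. This is a sketch under simplifying assumptions, not the router's actual logic: `admitted_requests` is a hypothetical helper, all traffic falls in a single rate-limit window, requests are spread round-robin, counters update immediately on admission, and every request costs 15 tokens (the figure used in the per-replica math above).

```python
def admitted_requests(limit, n_replicas, tokens_per_request, total_requests, shared_counter):
    """Count requests that pass a `counter + tokens > limit` check.

    shared_counter=True  -> every replica checks the cross-replica total.
    shared_counter=False -> every replica checks only its own local total.
    """
    local = [0] * n_replicas  # tokens each replica has seen locally
    global_total = 0          # tokens seen across all replicas
    admitted = 0
    for i in range(total_requests):
        r = i % n_replicas    # round-robin replica selection
        seen = global_total if shared_counter else local[r]
        if seen + tokens_per_request > limit:
            continue          # this request would be rejected with a 429
        local[r] += tokens_per_request
        global_total += tokens_per_request
        admitted += 1
    return admitted


if __name__ == "__main__":
    print("local check, 5 replicas: ", admitted_requests(500, 5, 15, 72, shared_counter=False))
    print("shared check, 5 replicas:", admitted_requests(500, 5, 15, 72, shared_counter=True))
    print("local check, 1 replica:  ", admitted_requests(500, 1, 15, 72, shared_counter=False))
```

Under these assumptions the local check admits all 72 requests, while a check against the shared total admits 33, matching the observed and expected figures. With one replica the local check also admits 33, which is consistent with the bug being invisible in single-replica deployments.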

Redis counter timeline (proving the counter exceeds the limit but isn't enforced):

T+5s:   redis = 120
T+25s:  redis = 540   ← exceeds limit of 500, but still passing
T+45s:  redis = 975   ← nearly 2x limit, still passing

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on?

v1.83.14
