litellm - ✅(Solved) Fix [Bug]: latency-based-routing degrades to random selection due to lost-update race condition in async_log_success_event [1 pull requests, 1 participants]

GitHub stats
BerriAI/litellm#24720 · Fetched 2026-04-08 01:42:01
Comments: 0
Participants: 1
Timeline events: 8
Reactions: 0
Timeline (top): cross-referenced ×3, labeled ×3, referenced ×2

Root Cause

Root cause: async_log_success_event() in litellm/router_strategy/lowest_latency.py performs a non-atomic read-modify-write on a shared cache key ({model_group}_map). When concurrent requests complete simultaneously, the last writer overwrites all previous updates, causing latency data to be constantly lost. Deployments with lost data fall back to latency: [0] (treated as fastest) and are randomly selected, producing a distribution proportional to deployment count.

Fix Action

Fixed

PR fix notes

PR #24726: fix(router): add asyncio.Lock to prevent lost-update race in lowest-latency async logger

Description (problem / solution / changelog)

Problem

Fixes #24720

async_log_success_event() and async_log_failure_event() in LowestLatencyLoggingHandler perform a non-atomic read → modify → write on a shared cache key ({model_group}_map) with no concurrency control:

# READ — no lock
request_count_dict = await self.router_cache.async_get_cache(key=latency_key) or {}

# MODIFY in-memory
request_count_dict[id]["latency"].append(final_value)

# WRITE — overwrites entire dict
await self.router_cache.async_set_cache(key=latency_key, value=request_count_dict)

When concurrent requests complete simultaneously, the last writer wins and all intermediate updates are lost. Deployments with zeroed-out latency data fall back to latency: [0] (treated as "fastest") and are randomly selected — producing distribution proportional to deployment count rather than actual speed.
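The lost update can be reproduced outside LiteLLM with a small simulation. The sketch below is only an illustration: `FakeCache`, `log_latency`, and `run_unlocked` are hypothetical names, and the `asyncio.sleep(0)` calls are an assumption standing in for cache I/O that yields to the event loop between the read and the write.

```python
import asyncio
import copy


class FakeCache:
    """Hypothetical in-memory stand-in for the router cache (not LiteLLM's class)."""

    def __init__(self):
        self._store = {}

    async def async_get_cache(self, key):
        await asyncio.sleep(0)  # assumption: real cache I/O yields to the event loop
        # Return a copy so every caller works on its own snapshot
        return copy.deepcopy(self._store.get(key))

    async def async_set_cache(self, key, value):
        await asyncio.sleep(0)
        self._store[key] = value  # overwrites the entire dict


async def log_latency(cache, key, deployment_id, latency):
    # READ, MODIFY, WRITE with no lock, mirroring the pattern in the issue
    snapshot = await cache.async_get_cache(key=key) or {}
    snapshot.setdefault(deployment_id, {"latency": []})["latency"].append(latency)
    await cache.async_set_cache(key=key, value=snapshot)


async def run_unlocked(n=20):
    cache = FakeCache()
    await asyncio.gather(
        *(log_latency(cache, "medium_map", f"deploy_{i}", 0.5) for i in range(n))
    )
    return await cache.async_get_cache(key="medium_map")


result = asyncio.run(run_unlocked())
print(f"{len(result)} of 20 deployments kept their latency sample")
```

Because every task reads the map before any task writes it back, the final dict holds fewer than 20 entries; the remaining samples were overwritten by later writers.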

Fix

Add a Dict[str, asyncio.Lock] keyed by model_group to LowestLatencyLoggingHandler. Each async event handler acquires the per-group lock before the read and releases it after the write, making the RMW atomic at the event-loop level.

async with self._get_async_lock(latency_key):
    request_count_dict = await self.router_cache.async_get_cache(...) or {}
    # ... modify ...
    await self.router_cache.async_set_cache(...)

The synchronous log_success_event path (used for non-async routers) already runs in a single-threaded event loop context and does not need locking.

Testing

  • Existing tests pass
  • Added test_async_log_concurrent_no_data_loss to verify that 20 concurrent calls preserve all latency entries

Checklist

  • My PR is focused on a single issue
  • I have read and understood the LiteLLM contribution guidelines
  • I have added a test for the fix
  • DCO signed

Changed files

  • litellm/router_strategy/lowest_latency.py (modified, +79/-69)


Check for existing issues

  • I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

latency-based-routing degrades to random selection weighted by deployment count when handling concurrent requests. With 16 Anthropic keys + 4 OpenAI keys per tier, traffic distributes 80/20 matching the key ratio regardless of actual latency differences between providers (e.g., OpenAI GPT-4.1 at ~583ms vs Anthropic Claude Sonnet at ~1,588ms).

Expected: Router should favor the faster provider. Actual: Traffic distributes proportionally to deployment count — identical to simple-shuffle.

Root cause: async_log_success_event() in litellm/router_strategy/lowest_latency.py performs a non-atomic read-modify-write on a shared cache key ({model_group}_map). When concurrent requests complete simultaneously, the last writer overwrites all previous updates, causing latency data to be constantly lost. Deployments with lost data fall back to latency: [0] (treated as fastest) and are randomly selected, producing a distribution proportional to deployment count.

# READ — no lock
request_count_dict = await self.router_cache.async_get_cache(
    key=latency_key, local_only=True
) or {}

# MODIFY — in memory
request_count_dict[id].setdefault("latency", []).append(final_value)

# WRITE — overwrites entire dict, no lock
await self.router_cache.async_set_cache(
    key=latency_key, value=request_count_dict, ttl=self.routing_args.ttl
)

Two concurrent completions:

  1. Request A reads medium_map → {deploy_1: {latency: [0.5]}}
  2. Request B reads medium_map → same stale snapshot
  3. Request A writes {deploy_1: {latency: [0.5, 0.6]}, deploy_5: {latency: [0.3]}}
  4. Request B writes {deploy_1: {latency: [0.5, 0.4]}} → overwrites A's update, deploy_5 data is lost

Since _get_available_deployments() assigns latency: [0] to deployments without cached data and randomly shuffles zero-latency ties, the probability of selecting a provider equals its share of total deployments.
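The effect of the zero-latency fallback on the traffic split can be checked with a short simulation. This is a simplified model, not LiteLLM's selection code: `pick_deployment` is a hypothetical function, and the empty latency map represents the extreme case in which every deployment has lost its cached samples.

```python
import random
from collections import Counter


def pick_deployment(deployments, latency_map, rng):
    """Simplified stand-in for the selection step, not LiteLLM's actual code."""
    averages = {}
    for d in deployments:
        # Deployments without cached data default to [0], as the issue describes
        samples = latency_map.get(d, {}).get("latency", [0])
        averages[d] = sum(samples) / len(samples)
    lowest = min(averages.values())
    tied = [d for d, avg in averages.items() if avg == lowest]
    return rng.choice(tied)  # ties are broken at random


# 16 Anthropic keys and 4 OpenAI keys, matching the reported setup
deployments = [f"anthropic_{i}" for i in range(16)] + [f"openai_{i}" for i in range(4)]
rng = random.Random(0)
trials = 10_000
counts = Counter(
    pick_deployment(deployments, {}, rng).split("_")[0] for _ in range(trials)
)
share = counts["anthropic"] / trials
print(f"anthropic share: {share:.1%}")
```

With all 20 deployments tied at zero, each one is equally likely to be picked, so the Anthropic share comes out close to 16/20 = 80%, the ratio observed in the issue.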

Steps to Reproduce

  1. Configure LiteLLM proxy with routing_strategy: "latency-based-routing" and routing_strategy_args: {ttl: 300, lowest_latency_buffer: 0.1}
  2. Add 20 deployments per model group — 16 Anthropic API keys + 4 OpenAI API keys for the same tier (e.g., model_name: medium)
  3. Send sustained traffic at ~3 req/sec (we used a CronJob sending 160 req/min)
  4. Collect REQUEST_START logs and count which model was selected per request
  5. Observe that traffic distributes ~80% Anthropic / ~20% OpenAI, matching the 16:4 key ratio rather than actual latency

# config.yaml
model_list:
  - model_name: medium
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY_0
  # ... repeat for 16 Anthropic keys
  - model_name: medium
    litellm_params:
      model: openai/gpt-4.1
      api_key: os.environ/OPENAI_API_KEY_0
  # ... repeat for 4 OpenAI keys

router_settings:
  routing_strategy: "latency-based-routing"
  routing_strategy_args:
    ttl: 300
    lowest_latency_buffer: 0.1

# Send test requests and check distribution
for i in $(seq 1 40); do
  curl -s -X POST "http://localhost:4000/v1/messages" \
    -H "Content-Type: application/json" \
    -H "x-api-key: $KEY" \
    -H "anthropic-version: 2023-06-01" \
    -d '{"model":"medium","max_tokens":5,"messages":[{"role":"user","content":"hi"}]}' &
done
wait
# Result: ~80% claude-sonnet-4-6, ~20% gpt-4.1

Relevant log output

Sent 40 concurrent medium-tier requests. Expected latency-based routing to favor OpenAI (583ms avg) over Anthropic (1,588ms avg).

Actual distribution:
  claude-sonnet-4-6: 32 (80.0%)
  gpt-4.1:            8 (20.0%)

Matches 16:4 Anthropic:OpenAI key ratio exactly — no latency differentiation.

Measured provider latencies from same cluster:
  medium/anthropic/claude-sonnet-4-6:  3483 samples, avg 1588ms, P50 1622ms, P95 2243ms
  medium/openai/gpt-4.1:               875 samples, avg  583ms, P50  514ms, P95  790ms

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on ?

v1.82.3

Twitter / LinkedIn details

No response

extent analysis

Fix Plan

To solve the issue of concurrent requests overwriting each other's updates to the shared cache key, we need to implement atomic updates.

Here are the steps:

  • Use a lock to prevent concurrent reads and writes to the cache key.
  • Update the cache key in a way that preserves previous updates.

Code Changes

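We can use asyncio.Lock to serialize the read-modify-write on the cache key, so that each update is applied on top of the previous one instead of a stale snapshot. Note that this sketch uses a single lock for all model groups, whereas PR #24726 keeps one lock per model group.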

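import asyncio

class LatencyRouter:
    def __init__(self, router_cache, ttl=300):
        self.router_cache = router_cache
        self.ttl = ttl  # cache TTL in seconds; 300 matches the config in the issue
        # One lock shared by every model group; it serializes the whole read-modify-write
        self.lock = asyncio.Lock()

    async def async_log_success_event(self, latency_key, id, final_value):
        async with self.lock:
            # READ the current map for this model group
            request_count_dict = await self.router_cache.async_get_cache(
                key=latency_key, local_only=True
            ) or {}

            # MODIFY: create the deployment entry if missing, then append the sample
            request_count_dict.setdefault(id, {}).setdefault("latency", []).append(final_value)

            # WRITE the updated map back while still holding the lock
            await self.router_cache.async_set_cache(
                key=latency_key, value=request_count_dict, ttl=self.ttl
            )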

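Alternatively, we can store the samples in a structure that supports atomic appends, such as a Redis list. RPUSH appends on the server in a single command, so the client never performs a read-modify-write.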

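import redis.asyncio as redis  # async client shipped with redis-py 4.2+

class LatencyRouter:
    def __init__(self, router_cache):
        self.router_cache = router_cache
        # Connects to localhost:6379 by default; adjust for your deployment
        self.redis_client = redis.Redis()

    async def async_log_success_event(self, latency_key, id, final_value):
        # RPUSH is atomic on the server, so concurrent appends cannot overwrite each other
        await self.redis_client.rpush(f"{latency_key}:{id}:latency", final_value)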

Verification

To verify that the fix worked, we can run the same test as before and check the distribution of requests.

# Send test requests and check distribution
for i in $(seq 1 40); do
  curl -s -X POST "http://localhost:4000/v1/messages" \
    -H "Content-Type: application/json" \
    -H "x-api-key: $KEY" \
    -H "anthropic-version: 2023-06-01" \
    -d '{"model":"medium","max_tokens":5,"messages":[{"role":"user","content":"hi"}]}' &
done
wait
# Expected after the fix: traffic shifts toward the lower-latency gpt-4.1 deployments (expectation, not a measured result)

Extra Tips

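  • Acquire the lock with async with so it is released even if a cache call raises; a lock that is acquired manually and never released would block every later update.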
  • Consider using a more robust data structure, such as a database, to store the latency data.
  • Monitor the performance of the system to ensure that the fix does not introduce any new issues.
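The first tip can be checked directly: `asyncio.Lock` used as an async context manager is released when the body raises. A minimal sketch, with `failing_update` as a hypothetical stand-in for a cache write that fails:

```python
import asyncio


async def failing_update(lock):
    async with lock:
        # Hypothetical failure inside the critical section
        raise RuntimeError("simulated cache failure")


async def main():
    lock = asyncio.Lock()
    try:
        await failing_update(lock)
    except RuntimeError:
        pass
    # The context manager released the lock on the way out
    return lock.locked()


still_locked = asyncio.run(main())
print(still_locked)
```

Since the lock is free again after the exception, later requests can still record their latency.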
