litellm - ✅(Solved) Fix [Bug]: tiktoken.encode() on event loop blocks liveness probes, kills pods [1 pull requests, 1 comments, 2 participants]

GitHub stats
BerriAI/litellm#26193 · Fetched 2026-04-22 07:45:46
Comments: 1 · Participants: 2 · Timeline: 5 · Reactions: 0
Timeline (top): labeled ×2, commented ×1, mentioned ×1, subscribed ×1

Root Cause

Related: #9145 (feature request for the same root cause, no liveness/pod-kill evidence)

Fix Action

Workaround

litellm_settings:
  disable_token_counter: true

Provider-supplied usage (including reasoning tokens) still flows through. Only the local tiktoken fallback is skipped.

PR fix notes

PR #26245: fix(proxy): skip redundant tiktoken recount when provider supplies reasoning_tokens

Description (problem / solution / changelog)

Relevant issues

Fixes #26193

Type

Bug Fix

Changes

Problem

stream_chunk_builder unconditionally calls ChunkProcessor.count_reasoning_tokens, which recomputes reasoning_tokens via tiktoken.encode() — a C extension that holds the GIL for tens of seconds on large reasoning responses (Claude extended thinking, OpenAI o1/o3, Gemini thinking, 100k+ tokens).

The computed value is already discarded by calculate_usage when the provider supplied reasoning_tokens in a streaming usage chunk. Specifically, calculate_usage only applies the recomputed value when completion_tokens_details.reasoning_tokens is None:

if reasoning_tokens is not None:
    if returned_usage.completion_tokens_details is None:
        returned_usage.completion_tokens_details = (
            CompletionTokensDetailsWrapper(reasoning_tokens=reasoning_tokens)
        )
    elif (
        returned_usage.completion_tokens_details is not None
        and returned_usage.completion_tokens_details.reasoning_tokens is None
    ):
        returned_usage.completion_tokens_details.reasoning_tokens = (
            reasoning_tokens
        )

All major reasoning providers (Anthropic extended thinking, OpenAI o1/o3, Gemini thinking) supply reasoning_tokens in streaming usage chunks, so in the common case the expensive recount runs purely to be thrown away.

The bug report (#26193) includes py-spy thread dumps showing a single tiktoken.encode() call monopolizing the event loop for the full liveness-probe window, resulting in pods being SIGKILLed by Kubernetes.

Fix

Inspect the already-parsed streaming chunks for any completion_tokens_details.reasoning_tokens entry. If one exists, skip the local tiktoken recount entirely — calculate_usage will pick up the provider-supplied value as before. If none exists (rare), fall through to count_reasoning_tokens so providers that emit reasoning_content without reporting usage are unaffected.

New helper ChunkProcessor.chunks_have_reasoning_tokens(chunks) handles both ModelResponseStream objects and plain-dict chunks, matching the existing chunk-parsing patterns in _calculate_usage_per_chunk.
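
The check described above can be sketched roughly as follows. This is a hypothetical stand-in written for illustration, not the code merged in the PR: the `_get` helper and the free-function form are assumptions, and only the field names (`usage`, `completion_tokens_details`, `reasoning_tokens`) come from the description.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Optional


def _get(obj: Any, key: str) -> Optional[Any]:
    """Read `key` from either a plain dict or an attribute-style object."""
    if isinstance(obj, dict):
        return obj.get(key)
    return getattr(obj, key, None)


def chunks_have_reasoning_tokens(chunks: Iterable[Any]) -> bool:
    """Return True if any chunk's usage block reports reasoning_tokens.

    Illustrative stand-in only; the real helper lives on ChunkProcessor.
    """
    for chunk in chunks:
        usage = _get(chunk, "usage")
        if usage is None:
            continue
        details = _get(usage, "completion_tokens_details")
        if details is None:
            continue
        # A reported value of 0 still counts as "provider supplied it".
        if _get(details, "reasoning_tokens") is not None:
            return True
    return False


# Dict-style chunk (proxy-style) and object-style chunk are both accepted.
dict_chunk = {"usage": {"completion_tokens_details": {"reasoning_tokens": 1200}}}
obj_chunk = SimpleNamespace(
    usage=SimpleNamespace(
        completion_tokens_details=SimpleNamespace(reasoning_tokens=0)
    )
)
no_usage_chunk = {"choices": []}
```

The scan only reads fields that are already parsed, so its cost grows with the number of chunks rather than with the length of the reasoning text.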

Call-site change in litellm/main.py:

if processor.chunks_have_reasoning_tokens(chunks):
    reasoning_tokens = None
else:
    reasoning_tokens = processor.count_reasoning_tokens(response)

Tests

Four new tests in tests/test_litellm/litellm_core_utils/test_streaming_chunk_builder_utils.py:

  1. test_chunks_have_reasoning_tokens_true_when_provider_supplies — regression test for the reported bug.
  2. test_chunks_have_reasoning_tokens_false_when_provider_omits — ensures the slow fallback is still exercised when providers omit usage.
  3. test_chunks_have_reasoning_tokens_handles_dict_chunks — dict-chunk path (proxy-style chunks) is covered.
  4. test_stream_chunk_builder_skips_count_reasoning_tokens_when_usage_present — end-to-end: stream_chunk_builder does not invoke count_reasoning_tokens when the provider already supplied the value, and the final response carries the provider-supplied number.

All four tests fail on litellm_oss_branch without the fix and pass with it.
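
As a rough illustration of what the end-to-end assertion checks, the sketch below mirrors the call-site branch with a mocked processor. `pick_reasoning_tokens` is a hypothetical wrapper introduced here, not a litellm function, and the tests are not the ones added in the PR.

```python
from unittest.mock import MagicMock


def pick_reasoning_tokens(processor, chunks, response):
    """Stand-in for the call-site branch quoted in the PR description."""
    if processor.chunks_have_reasoning_tokens(chunks):
        # Provider already reported the value; skip the local recount.
        return None
    return processor.count_reasoning_tokens(response)


def test_skips_recount_when_provider_supplies_usage():
    processor = MagicMock()
    processor.chunks_have_reasoning_tokens.return_value = True
    assert pick_reasoning_tokens(processor, chunks=[], response=None) is None
    processor.count_reasoning_tokens.assert_not_called()


def test_falls_back_when_provider_omits_usage():
    processor = MagicMock()
    processor.chunks_have_reasoning_tokens.return_value = False
    processor.count_reasoning_tokens.return_value = 42
    assert pick_reasoning_tokens(processor, chunks=[], response="r") == 42
    processor.count_reasoning_tokens.assert_called_once_with("r")
```

Asserting on the mock's call record, rather than on timing, keeps the regression test deterministic.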

Scope

  • No API changes: existing public method signatures untouched.
  • No new config flags or defaults.
  • No changes to count_reasoning_tokens itself (it remains the correct fallback).
  • Does not address the separate case where a misbehaving provider emits reasoning_content without any usage chunk. That path still blocks the event loop and could be fixed in a follow-up (e.g. asyncio.to_thread at the async caller), but is orthogonal to this issue's reported reproduction.

Pre-Submission checklist

  • Added testing in tests/test_litellm/
  • PR passes the touched test file (13/13 in test_streaming_chunk_builder_utils.py)
  • Scope isolated to a single bug
  • Greptile review (auto-triggered)

Changed files

  • litellm/litellm_core_utils/streaming_chunk_builder_utils.py (modified, +55/-9)
  • litellm/main.py (modified, +9/-1)
  • tests/test_litellm/litellm_core_utils/test_streaming_chunk_builder_utils.py (modified, +128/-11)


Check for existing issues

  • I have searched the existing issues and checked that my issue is not a duplicate.

Related: #9145 (feature request for the same root cause, no liveness/pod-kill evidence)

What happened?

/health/liveliness times out and pods get killed because tiktoken.encode() runs synchronously on the asyncio event loop, holding the GIL for 60+ seconds on large reasoning model responses.

The call chain is:

stream_chunk_builder (litellm/main.py)
  → count_reasoning_tokens (streaming_chunk_builder_utils.py:517)
    → token_counter (token_counter.py:404)
      → count_tokens (token_counter.py:546)
        → tiktoken.encode()  ← blocks event loop, holds GIL

With reasoning models (Claude extended thinking, o1/o3, Gemini thinking), reasoning_content can be hundreds of thousands of tokens. tiktoken.encode() is a synchronous C extension that holds the GIL for the entire encoding — no other Python code can run, including the trivial return "I'm alive!" liveness handler.

This token count is redundant: count_reasoning_tokens only fills completion_tokens_details.reasoning_tokens when the provider didn't already supply it (lines 696-707 of streaming_chunk_builder_utils.py). Every major reasoning model provider already returns reasoning_tokens in streaming usage chunks.

Steps to Reproduce

  1. Run LiteLLM proxy with 1 uvicorn worker (default)
  2. Send a streaming request to a reasoning model that produces a large thinking block (100k+ tokens)
  3. Observe /health/liveliness becomes unresponsive for 60+ seconds
  4. With K8s liveness probes configured (timeout=10s, failureThreshold=10), the pod gets killed

Relevant log output

py-spy thread dumps captured during 6 consecutive liveness probe failures (every 10s, same stack each time):

Thread 1 (active): "MainThread"
    encode (tiktoken/core.py:120)
    count_tokens (token_counter.py:546)
    token_counter (token_counter.py:404)
    token_counter (litellm/utils.py:2304)
    count_reasoning_tokens (streaming_chunk_builder_utils.py:517)
    stream_chunk_builder (litellm/main.py:7595)
    __anext__ (streaming_handler.py:2094)
    stream_with_fallbacks (litellm/router.py:1725)
    ...

All other threads (ThreadPoolExecutors, AnyIO workers) are idle. A single tiktoken.encode() call monopolizes the event loop for the entire duration.

Suggested fix

disable_token_counter should default to True unless token counting is offloaded from the event loop. The local tiktoken recount is a fallback for providers that don't return usage — it should not penalize every request by default, especially not with a GIL-holding synchronous call on the event loop.

If local token counting is kept, it should be offloaded via asyncio.to_thread() so it doesn't block the event loop.
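
The difference between running a long synchronous call on the event loop and offloading it can be seen in a small self-contained experiment. Everything here is a stand-in: `slow_count` simulates the tokenizer with `time.sleep`, and `max_heartbeat_gap` is a hypothetical helper that plays the role of the liveness handler by measuring how long the loop goes without scheduling a trivial task.

```python
import asyncio
import time


def slow_count(seconds: float) -> int:
    """Stand-in for a long tokenizer call (time.sleep releases the GIL)."""
    time.sleep(seconds)
    return 0


async def max_heartbeat_gap(work, interval: float = 0.01) -> float:
    """Run `work()` while a heartbeat ticks; return the largest gap between ticks."""
    gaps = []
    stop = asyncio.Event()

    async def heartbeat():
        last = time.perf_counter()
        while not stop.is_set():
            await asyncio.sleep(interval)
            now = time.perf_counter()
            gaps.append(now - last)
            last = now

    task = asyncio.create_task(heartbeat())
    await asyncio.sleep(0)  # yield once so the heartbeat starts first
    await work()
    stop.set()
    await task
    return max(gaps)


async def on_loop():
    # Runs on the event loop thread: no other task is scheduled meanwhile.
    slow_count(0.3)


async def off_loop():
    # Runs in a worker thread: the loop keeps serving other tasks.
    await asyncio.to_thread(slow_count, 0.3)


blocked_gap = asyncio.run(max_heartbeat_gap(on_loop))
offloaded_gap = asyncio.run(max_heartbeat_gap(off_loop))
```

One caveat: a worker thread only helps to the extent that the real tokenizer releases the GIL while it encodes. If it does not, the loop thread would still be starved, and a process pool would be the safer choice.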

Workaround

litellm_settings:
  disable_token_counter: true

Provider-supplied usage (including reasoning tokens) still flows through. Only the local tiktoken fallback is skipped.

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on?

v1.82.3-stable

Twitter / LinkedIn details

No response

extent analysis

TL;DR

To fix the issue, consider disabling the token counter by default or offloading it from the event loop using asyncio.to_thread() to prevent the GIL-holding synchronous tiktoken.encode() call from blocking the event loop.

Guidance

  • The tiktoken.encode() call is blocking the event loop due to its synchronous nature, causing the /health/liveliness endpoint to time out and pods to get killed.
  • Disabling the token counter by setting disable_token_counter to True in the litellm_settings can mitigate the issue, as provider-supplied usage will still flow through.
  • Offloading the token counting to a separate thread using asyncio.to_thread() can also resolve the issue without disabling the token counter.
  • Verify the fix by checking the /health/liveliness endpoint's responsiveness and ensuring that pods are no longer getting killed due to liveness probe failures.
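
One way to run the responsiveness check in the last bullet is to time a single GET against the liveness endpoint while a large streaming request is in flight. The sketch below uses only the standard library; the host and port in the usage comment are assumptions about the deployment, and the 10-second default simply mirrors the probe timeout quoted in the issue.

```python
import time
import urllib.request


def probe_latency(url: str, timeout: float = 10.0) -> float:
    """Return the seconds taken by one GET to `url`.

    Raises on timeout or on an HTTP error status, which is the same
    condition a liveness probe would count as a failure.
    """
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
    return time.perf_counter() - start


# Example (assumed local proxy address):
# probe_latency("http://localhost:4000/health/liveliness")
```

Calling this in a loop during a long reasoning response should show latencies well under the probe timeout once the recount is skipped or offloaded.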

Example

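import asyncio

# Offload a synchronous token counter to a worker thread so the event loop
# stays free to serve other requests (e.g. the liveness handler).
# `count_tokens` is whatever blocking counter the caller already uses.
async def count_tokens_offloaded(count_tokens, text):
    return await asyncio.to_thread(count_tokens, text)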

Notes

  • The suggested fix assumes that the token counting is not critical for the application's functionality and can be safely disabled or offloaded.
  • The workaround provided in the issue body, setting disable_token_counter to True, may not be suitable for all use cases and should be evaluated carefully.

Recommendation

Apply the workaround by setting disable_token_counter to True in the litellm_settings, as it is a simple and effective solution to mitigate the issue. This change can be made while further evaluating the feasibility of offloading the token counting to a separate thread.
