litellm - ✅(Solved) Fix [Bug]: tiktoken.encode() on event loop blocks liveness probes, kills pods [1 pull requests, 1 comments, 2 participants]

litellm · 2026-04-21 20:46:03


GitHub stats

BerriAI/litellm#26193 • Fetched 2026-04-22 07:45:46


Author: 6matt

Participants: 6matt, krrish-berri-2

Timeline (top): labeled ×2, commented ×1, mentioned ×1, subscribed ×1

Root Cause

Related: #9145 (feature request for the same root cause, no liveness/pod-kill evidence)

Fix Action

Workaround

litellm_settings:
  disable_token_counter: true

Provider-supplied usage (including reasoning tokens) still flows through. Only the local tiktoken fallback is skipped.

PR fix notes

PR #26245: fix(proxy): skip redundant tiktoken recount when provider supplies reasoning_tokens

Repository: BerriAI/litellm
Author: dschulmeist
State: open | merged: False
Link: https://github.com/BerriAI/litellm/pull/26245

Description (problem / solution / changelog)

Relevant issues

Fixes #26193

Type

Bug Fix

Changes

Problem

stream_chunk_builder unconditionally calls ChunkProcessor.count_reasoning_tokens, which recomputes reasoning_tokens via tiktoken.encode(). That call goes into a compiled native extension and, per the bug report, holds the GIL for tens of seconds on large reasoning responses (Claude extended thinking, OpenAI o1/o3, or Gemini thinking output of 100k+ tokens).

The computed value is already discarded by calculate_usage when the provider supplied reasoning_tokens in a streaming usage chunk. Specifically, calculate_usage only applies the recomputed value when completion_tokens_details.reasoning_tokens is None:

if reasoning_tokens is not None:
    if returned_usage.completion_tokens_details is None:
        returned_usage.completion_tokens_details = (
            CompletionTokensDetailsWrapper(reasoning_tokens=reasoning_tokens)
        )
    elif (
        returned_usage.completion_tokens_details is not None
        and returned_usage.completion_tokens_details.reasoning_tokens is None
    ):
        returned_usage.completion_tokens_details.reasoning_tokens = (
            reasoning_tokens
        )

All major reasoning providers (Anthropic extended thinking, OpenAI o1/o3, Gemini thinking) supply reasoning_tokens in streaming usage chunks, so in the common case the expensive recount runs purely to be thrown away.
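The fill-only-if-missing behaviour described above can be illustrated with a small self-contained sketch. The `TokenDetails`, `Usage`, and `apply_recount` names are hypothetical stand-ins written for this example; they are not LiteLLM's classes, and the token numbers are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenDetails:
    # Hypothetical stand-in for CompletionTokensDetailsWrapper.
    reasoning_tokens: Optional[int] = None


@dataclass
class Usage:
    # Hypothetical stand-in for the usage object that calculate_usage returns.
    completion_tokens_details: Optional[TokenDetails] = None


def apply_recount(usage: Usage, recounted: Optional[int]) -> Usage:
    """Apply a locally recounted value only where the provider left a gap."""
    if recounted is None:
        return usage
    if usage.completion_tokens_details is None:
        usage.completion_tokens_details = TokenDetails(reasoning_tokens=recounted)
    elif usage.completion_tokens_details.reasoning_tokens is None:
        usage.completion_tokens_details.reasoning_tokens = recounted
    return usage


# Provider already reported a value: the local recount is dropped.
provided = apply_recount(Usage(TokenDetails(reasoning_tokens=120000)), 119500)
# Provider reported nothing: the recount is the only value available.
missing = apply_recount(Usage(), 119500)
```

In the first case the expensive recount has no effect on the result, which is why skipping it when the provider value is present should not change the reported usage.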

The bug report (#26193) includes py-spy thread dumps showing a single tiktoken.encode() call monopolizing the event loop for the full liveness-probe window, resulting in pods being SIGKILLed by Kubernetes.

Fix

Inspect the already-parsed streaming chunks for any completion_tokens_details.reasoning_tokens entry. If one exists, skip the local tiktoken recount entirely — calculate_usage will pick up the provider-supplied value as before. If none exists (rare), fall through to count_reasoning_tokens so providers that emit reasoning_content without reporting usage are unaffected.

New helper ChunkProcessor.chunks_have_reasoning_tokens(chunks) handles both ModelResponseStream objects and plain-dict chunks, matching the existing chunk-parsing patterns in _calculate_usage_per_chunk.

Call-site change in litellm/main.py:

if processor.chunks_have_reasoning_tokens(chunks):
    reasoning_tokens = None
else:
    reasoning_tokens = processor.count_reasoning_tokens(response)
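A helper of this kind might look roughly like the sketch below. This is not the code from the PR: the `_get` accessor is hypothetical, and the `usage` → `completion_tokens_details` → `reasoning_tokens` field path is assumed from the OpenAI-style usage shape the description refers to.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def _get(obj: Any, key: str) -> Any:
    """Read a field from either a plain dict or an attribute-style object."""
    if isinstance(obj, dict):
        return obj.get(key)
    return getattr(obj, key, None)


def chunks_have_reasoning_tokens(chunks: Iterable[Any]) -> bool:
    """Return True if any chunk carries a provider-supplied reasoning_tokens value."""
    for chunk in chunks:
        usage = _get(chunk, "usage")
        details = _get(usage, "completion_tokens_details") if usage is not None else None
        if details is not None and _get(details, "reasoning_tokens") is not None:
            return True
    return False


# Dict-style chunks, as a proxy might pass them along.
dict_chunks = [
    {"choices": [{"delta": {"reasoning_content": "..."}}]},
    {"usage": {"completion_tokens_details": {"reasoning_tokens": 5}}},
]

# Attribute-style chunks, standing in for response objects.
object_chunks = [
    SimpleNamespace(
        usage=SimpleNamespace(
            completion_tokens_details=SimpleNamespace(reasoning_tokens=5)
        )
    )
]
```

The scan is a single linear pass over chunks that are already parsed, so its cost is small compared with re-tokenizing the full reasoning text.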

Tests

Four new tests in tests/test_litellm/litellm_core_utils/test_streaming_chunk_builder_utils.py:

test_chunks_have_reasoning_tokens_true_when_provider_supplies — regression test for the reported bug.
test_chunks_have_reasoning_tokens_false_when_provider_omits — ensures the slow fallback is still exercised when providers omit usage.
test_chunks_have_reasoning_tokens_handles_dict_chunks — dict-chunk path (proxy-style chunks) is covered.
test_stream_chunk_builder_skips_count_reasoning_tokens_when_usage_present — end-to-end: stream_chunk_builder does not invoke count_reasoning_tokens when the provider already supplied the value, and the final response carries the provider-supplied number.

All four tests fail on litellm_oss_branch without the fix and pass with it.
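The shape of the end-to-end check can be sketched with `unittest.mock`. The `build_usage` function below is a hypothetical stand-in for the guarded call site, not `stream_chunk_builder` itself, and the tests are illustrations rather than the ones added in the PR.

```python
from unittest.mock import Mock


def build_usage(chunks, count_reasoning_tokens):
    """Stand-in for the guarded call site: recount only when no provider value exists."""
    for chunk in chunks:
        details = (chunk.get("usage") or {}).get("completion_tokens_details") or {}
        if details.get("reasoning_tokens") is not None:
            return details["reasoning_tokens"]
    return count_reasoning_tokens(chunks)


def test_skips_recount_when_usage_present():
    counter = Mock(return_value=999)
    chunks = [{"usage": {"completion_tokens_details": {"reasoning_tokens": 42}}}]
    assert build_usage(chunks, counter) == 42
    counter.assert_not_called()


def test_falls_back_when_usage_missing():
    counter = Mock(return_value=7)
    assert build_usage([{"choices": []}], counter) == 7
    counter.assert_called_once()
```

Asserting that the mock was never called is what turns this into a regression test for the blocking behaviour, rather than only a check on the final number.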

Scope

No API changes: existing public method signatures untouched.
No new config flags or defaults.
No changes to count_reasoning_tokens itself (it remains the correct fallback).
Does not address the separate case where a misbehaving provider emits reasoning_content without any usage chunk. That path still blocks the event loop and could be fixed in a follow-up (e.g. asyncio.to_thread at the async caller), but is orthogonal to this issue's reported reproduction.
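The follow-up idea mentioned in the last scope item can be illustrated with a toy comparison. In this sketch `slow_count` is a hypothetical stand-in that uses `time.sleep`, which releases the GIL, so it models only the "synchronous call on the loop thread" part of the problem. If the real call holds the GIL for its full duration, as the issue states, a worker thread alone may not be enough.

```python
import asyncio
import time


def slow_count(seconds: float) -> int:
    # Stand-in for a long synchronous tokenizer call (assumption for this demo).
    time.sleep(seconds)
    return 0


async def heartbeat_ticks(run_blocking) -> int:
    """Count how often a 10 ms heartbeat fires while the work is in flight."""
    ticks = 0
    done = asyncio.Event()

    async def beat():
        nonlocal ticks
        while not done.is_set():
            await asyncio.sleep(0.01)
            ticks += 1

    task = asyncio.create_task(beat())
    await run_blocking()
    done.set()
    await task
    return ticks


async def inline():
    # Runs on the loop thread: no other task is scheduled until it returns.
    slow_count(0.3)


async def offloaded():
    # Runs on a worker thread: the loop keeps servicing the heartbeat.
    await asyncio.to_thread(slow_count, 0.3)


inline_ticks = asyncio.run(heartbeat_ticks(inline))
offloaded_ticks = asyncio.run(heartbeat_ticks(offloaded))
```

With the inline variant the heartbeat never gets a chance to run, which is the same starvation a liveness handler would see. With the offloaded variant the heartbeat keeps firing throughout.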

Pre-Submission checklist

Added testing in tests/test_litellm/
PR passes the touched test file (13/13 in test_streaming_chunk_builder_utils.py)
Scope isolated to a single bug
Greptile review (auto-triggered)

Changed files

litellm/litellm_core_utils/streaming_chunk_builder_utils.py (modified, +55/-9)
litellm/main.py (modified, +9/-1)
tests/test_litellm/litellm_core_utils/test_streaming_chunk_builder_utils.py (modified, +128/-11)


Raw issue body

Check for existing issues

I have searched the existing issues and checked that my issue is not a duplicate.

Related: #9145 (feature request for the same root cause, no liveness/pod-kill evidence)

What happened?

/health/liveliness times out and pods get killed because tiktoken.encode() runs synchronously on the asyncio event loop, holding the GIL for 60+ seconds on large reasoning model responses.

The call chain is:

stream_chunk_builder (litellm/main.py)
  → count_reasoning_tokens (streaming_chunk_builder_utils.py:517)
    → token_counter (token_counter.py:404)
      → count_tokens (token_counter.py:546)
        → tiktoken.encode()  ← blocks event loop, holds GIL

With reasoning models (Claude extended thinking, o1/o3, Gemini thinking), reasoning_content can be hundreds of thousands of tokens. tiktoken.encode() is a synchronous call into a native extension that holds the GIL for the entire encoding: no other Python code can run, including the trivial return "I'm alive!" liveness handler.

This token count is redundant in the common case: the value returned by count_reasoning_tokens is only used to fill completion_tokens_details.reasoning_tokens when the provider didn't already supply it (lines 696-707 of streaming_chunk_builder_utils.py). Every major reasoning model provider already returns reasoning_tokens in streaming usage chunks.

Steps to Reproduce

1. Run LiteLLM proxy with 1 uvicorn worker (default)
2. Send a streaming request to a reasoning model that produces a large thinking block (100k+ tokens)
3. Observe /health/liveliness becomes unresponsive for 60+ seconds
4. With K8s liveness probes configured (timeout=10s, failureThreshold=10), the pod gets killed

Relevant log output

py-spy thread dumps captured during 6 consecutive liveness probe failures (every 10s, same stack each time):

Thread 1 (active): "MainThread"
    encode (tiktoken/core.py:120)
    count_tokens (token_counter.py:546)
    token_counter (token_counter.py:404)
    token_counter (litellm/utils.py:2304)
    count_reasoning_tokens (streaming_chunk_builder_utils.py:517)
    stream_chunk_builder (litellm/main.py:7595)
    __anext__ (streaming_handler.py:2094)
    stream_with_fallbacks (litellm/router.py:1725)
    ...

All other threads (ThreadPoolExecutors, AnyIO workers) are idle. A single tiktoken.encode() call monopolizes the event loop for the entire duration.

Suggested fix

disable_token_counter should default to True unless token counting is offloaded from the event loop. The local tiktoken recount is a fallback for providers that don't return usage — it should not penalize every request by default, especially not with a GIL-holding synchronous call on the event loop.

If local token counting is kept, it should be offloaded via asyncio.to_thread() so it doesn't block the event loop.

Workaround

litellm_settings:
  disable_token_counter: true

Provider-supplied usage (including reasoning tokens) still flows through. Only the local tiktoken fallback is skipped.

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on ?

v1.82.3-stable

Twitter / LinkedIn details

No response

extent analysis

TL;DR

To fix the issue, consider disabling the token counter by default or offloading it from the event loop using asyncio.to_thread() to prevent the GIL-holding synchronous tiktoken.encode() call from blocking the event loop.

Guidance

The tiktoken.encode() call is blocking the event loop due to its synchronous nature, causing the /health/liveliness endpoint to time out and pods to get killed.
Disabling the token counter by setting disable_token_counter to True in the litellm_settings can mitigate the issue, as provider-supplied usage will still flow through.
Offloading the token counting to a separate thread using asyncio.to_thread() can also resolve the issue without disabling the token counter.
Verify the fix by checking the /health/liveliness endpoint's responsiveness and ensuring that pods are no longer getting killed due to liveness probe failures.
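One way to check responsiveness from inside the process, alongside probing the endpoint, is to measure event-loop lag. The sketch below is an assumption-laden illustration: `measure_max_lag` and `demo` are hypothetical names, and the `time.sleep` call simulates a blocking call on the loop thread.

```python
import asyncio
import time


async def measure_max_lag(duration: float, interval: float = 0.05) -> float:
    """Return the worst delay (seconds) between a timer's due time and when it ran."""
    loop = asyncio.get_running_loop()
    worst = 0.0
    deadline = loop.time() + duration
    while loop.time() < deadline:
        due = loop.time() + interval
        await asyncio.sleep(interval)
        worst = max(worst, loop.time() - due)
    return worst


async def demo() -> float:
    monitor = asyncio.create_task(measure_max_lag(0.5))
    await asyncio.sleep(0.1)  # let the monitor start ticking
    time.sleep(0.2)           # simulate a blocking call on the loop thread
    return await monitor


max_lag = asyncio.run(demo())
```

On a healthy loop the measured lag stays in the low milliseconds. A value approaching the liveness probe timeout would suggest something is still blocking the loop.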

Example

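import asyncio

import tiktoken

# Assumed encoding for illustration only; LiteLLM picks its tokenizer per model.
_ENCODING = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    # Synchronous and CPU-bound, so it should not run directly on the event loop.
    return len(_ENCODING.encode(text))

# Offload token counting to a worker thread (Python 3.9+)
async def count_tokens_offloaded(text: str) -> int:
    return await asyncio.to_thread(count_tokens, text)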

Notes

The suggested fix assumes that the token counting is not critical for the application's functionality and can be safely disabled or offloaded.
The workaround provided in the issue body, setting disable_token_counter to True, may not be suitable for all use cases and should be evaluated carefully.

Recommendation

Apply the workaround by setting disable_token_counter to True in the litellm_settings, as it is a simple and effective solution to mitigate the issue. This change can be made while further evaluating the feasibility of offloading the token counting to a separate thread.
