litellm - RFC: Cross-provider token-budget normalization at the Router layer (Router.budget + Router.check_fit preflight)

litellm · 2026-05-25 06:36:25

I'd like to gauge interest in a cross-provider token-budget normalization API at the LiteLLM Router layer: a single router.budget(model) call that returns a normalized budget object (input max, output max, effective input after tool/system overhead, per-token cost, tokenizer reference), plus a router.check_fit(model, messages, ...) preflight that returns a structured decision instead of a downstream ContextWindowExceededError.

Today model_cost.json exposes max_tokens / max_input_tokens / max_output_tokens / input_cost_per_token / output_cost_per_token as a static lookup, and token_counter() provides per-provider tokenization. But every caller has to wire these together themselves to answer the question "will this fit?" — and most don't, because the wiring is verbose and the cliff is invisible until the provider rejects the request.

The result: callers either over-trim (waste context) or under-trim and discover the limit at request time via a provider-specific error string. LiteLLM is uniquely positioned to fix this because it's already the layer that knows every provider's tokenizer and budget metadata.
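The manual wiring described above might look roughly like the sketch below. It is self-contained and does not call LiteLLM: `MODEL_METADATA` stands in for the static metadata lookup, `naive_token_count` stands in for a per-provider tokenizer, and both names (and all numbers) are hypothetical.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

# Stand-in for the static metadata lookup. The field names mirror the ones
# listed above; the numbers are illustrative only.
MODEL_METADATA = {
    "example-model": {"max_input_tokens": 1_000, "max_output_tokens": 200},
}


def naive_token_count(messages: List[Message]) -> int:
    """Hypothetical tokenizer stand-in: one token per whitespace-separated word."""
    return sum(len(m["content"].split()) for m in messages)


def will_it_fit(
    model: str,
    messages: List[Message],
    reserve_output: int,
    count_tokens: Callable[[List[Message]], int] = naive_token_count,
) -> bool:
    """Manual 'will this fit?' check that each caller assembles on its own."""
    meta = MODEL_METADATA[model]
    available = meta["max_input_tokens"] - reserve_output
    return count_tokens(messages) <= available
```

Even this simplified version already ignores tool-schema overhead, which is the kind of omission the proposal is meant to remove.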

Update 2026-05-25: I've edited this issue to clarify the framing. The original wording made stronger empirical claims ("I've been running a small router-wrapper", "preflight failure rate ~0.4%", "context utilization up ~12%") than I can actually support — the design is from reading Router, token_counter, and model_cost.json, not from a measured production deployment. I've also removed a reference to issue #21558 that I couldn't independently verify. The design discussion stands; the specific numbers and personal use-case framing have been removed.

Summary

Why the Router is the right layer

Three reasons app code shouldn't be doing this itself:

1. Tokenizer divergence is fiddly: Claude / OpenAI / Gemini / Mistral all count tokens differently for the same string; tool-call schemas add per-provider overhead (Anthropic's tools block, OpenAI's tool_choice); system prompts are charged differently across providers. Replicating LiteLLM's existing knowledge in every caller is brittle.
2. The effective budget is not max_input_tokens: callers need max_input_tokens - tool_schema_overhead - system_prompt_tokens - reserved_for_output to know how many messages they can safely include. Today every framework computes this slightly differently, and most ignore tool-schema overhead.
3. The error path is provider-specific: context-limit errors don't always surface as ContextWindowExceededError; the message string varies by provider. Even when callers handle the error, they discover the limit only after a round-trip. A pre-flight at the router layer eliminates the round-trip.
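As a worked instance of the effective-budget formula in the second reason, here is a minimal sketch. The function name and the sample numbers are illustrative assumptions, not LiteLLM values.

```python
def effective_input_budget(
    max_input_tokens: int,
    tool_schema_overhead: int,
    system_prompt_tokens: int,
    reserved_for_output: int,
) -> int:
    """Tokens left for conversation messages after fixed costs are subtracted."""
    remaining = (
        max_input_tokens
        - tool_schema_overhead
        - system_prompt_tokens
        - reserved_for_output
    )
    # A negative result would mean the fixed costs alone exceed the window,
    # so clamp at zero rather than report a negative budget.
    return max(remaining, 0)


budget = effective_input_budget(
    max_input_tokens=200_000,
    tool_schema_overhead=1_840,
    system_prompt_tokens=1_200,
    reserved_for_output=2_000,
)
print(budget)  # 194960
```

A caller that budgets against the raw 200,000 figure here would overshoot by a little over 5,000 tokens.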

Design sketch

from litellm import Router

router = Router(model_list=[...])

# 1) Static normalized budget for a model
b = router.budget("claude-opus-4-7")
# Budget(
#   model="claude-opus-4-7",
#   provider="anthropic",
#   input_max=200_000,
#   output_max=8_192,
#   tokenizer="anthropic",
#   input_cost_per_token=0.000015,
#   output_cost_per_token=0.000075,
#   tool_schema_overhead_fn=<callable>,
# )

# 2) Preflight — does this request fit?
decision = router.check_fit(
    model="claude-opus-4-7",
    messages=msgs,
    tools=tools,
    reserve_output=2000,
)
# FitDecision(
#   fits=False,
#   input_tokens=198_300,
#   tool_overhead_tokens=1_840,
#   effective_input_max=196_160,    # input_max - reserve_output - tool_overhead
#   over_by_tokens=2_140,           # input_tokens - effective_input_max
#   suggestion=TrimSuggestion(
#       trim_oldest_n_messages=4,    # estimated to bring it in budget
#       or_switch_model=["claude-opus-4-7[1m]", "gemini-2.5-pro"],
#   ),
# )

Three new public surfaces:

Router.budget(model) -> Budget — pure metadata, no LLM call
Router.check_fit(model, messages, tools=..., reserve_output=...) -> FitDecision — counts tokens locally, returns structured decision
Router.cross_provider_normalize(messages, tools=..., from_=..., to=...) -> int — gives the destination provider's count for a request currently formatted for the source provider (useful when switching providers mid-stream)

Budget.tool_schema_overhead_fn — provider-specific function that computes the tokens added by a given tool list, so callers don't have to know about Anthropic's tools block layout vs OpenAI's.
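To make the proposed shapes concrete, here is a minimal standalone sketch of how a preflight could be computed. None of this exists in LiteLLM today: the field names follow the design sketch, while the `count_tokens` parameter, the free-function form of `check_fit`, and the convention that `over_by_tokens` is positive when over budget are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]
Tool = Dict[str, object]


@dataclass(frozen=True)
class Budget:
    model: str
    input_max: int
    output_max: int
    # Hypothetical hook: tokens a given tool list adds to the request.
    tool_schema_overhead_fn: Callable[[List[Tool]], int]


@dataclass(frozen=True)
class FitDecision:
    fits: bool
    input_tokens: int
    tool_overhead_tokens: int
    effective_input_max: int
    over_by_tokens: int  # positive when over budget, 0 when the request fits


def check_fit(
    budget: Budget,
    messages: List[Message],
    count_tokens: Callable[[List[Message]], int],
    tools: Optional[List[Tool]] = None,
    reserve_output: int = 0,
) -> FitDecision:
    """Local preflight: no provider call, only counting and subtraction."""
    tool_overhead = budget.tool_schema_overhead_fn(tools or [])
    effective_max = budget.input_max - reserve_output - tool_overhead
    input_tokens = count_tokens(messages)
    over_by = max(input_tokens - effective_max, 0)
    return FitDecision(
        fits=over_by == 0,
        input_tokens=input_tokens,
        tool_overhead_tokens=tool_overhead,
        effective_input_max=effective_max,
        over_by_tokens=over_by,
    )
```

Passing the token counter in as a callable keeps the sketch independent of any particular tokenizer; in a real Router method the provider-specific counter would presumably be selected from the model name.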

What this is NOT

Not a model recommender — or_switch_model is a hint based on input_max, not a quality judgment; caller decides
Not a retry/fallback mechanism — Router already has fallbacks; this is the preflight that decides whether to invoke them
Not a streaming token counter — preflight is pre-request only; in-stream output token tracking stays where it is
Not changing existing APIs — purely additive
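The first two points can be illustrated with a capacity-only selection sketch: it picks the first candidate whose own token count fits its own window and makes no quality judgment. The function name, the `input_max` mapping, and the per-model `counters` mapping are hypothetical; whether to actually switch remains the caller's decision.

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]


def first_model_that_fits(
    candidates: List[str],
    input_max: Dict[str, int],
    counters: Dict[str, Callable[[List[Message]], int]],
    messages: List[Message],
    reserve_output: int = 0,
) -> Optional[str]:
    """Return the first candidate whose own token count fits its own window."""
    for model in candidates:
        # Each model counts the same messages with its own tokenizer.
        needed = counters[model](messages)
        if needed <= input_max[model] - reserve_output:
            return model
    # No candidate has room; the caller has to trim or bail.
    return None
```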

Why this matters in practice

Reading consumer-side code in the broader ecosystem: agent frameworks built on LiteLLM each implement their own version of "trim message history to fit". Most:

Ignore tool-schema overhead in the budget calculation
Differ in how they handle the system-prompt token cost across providers
Discover the actual limit at request time via a provider-specific error string and then either bail or retry with manually-trimmed input
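For reference, a typical "trim message history to fit" estimator can be sketched as below; it is also one way a value like trim_oldest_n_messages could be estimated. The function name and the `count_tokens` callable are hypothetical, and keeping system messages while dropping the oldest remaining turns is an assumed policy, not something the issue prescribes.

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]


def oldest_messages_to_drop(
    messages: List[Message],
    count_tokens: Callable[[List[Message]], int],
    effective_input_max: int,
) -> Optional[int]:
    """Smallest n such that dropping the n oldest non-system messages fits.

    Returns None when even the system messages alone exceed the budget.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    for n in range(len(rest) + 1):
        # Re-count after each candidate trim instead of summing per-message
        # counts, since some tokenizers add per-conversation overhead.
        if count_tokens(system + rest[n:]) <= effective_input_max:
            return n
    return None
```

The result is only as accurate as the budget passed in: if `effective_input_max` ignores tool-schema overhead, the trimmed request can still be rejected by the provider.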

A Router-layer preflight would centralize knowledge LiteLLM already has (provider tokenizers, schema layouts) and remove the duplicated-and-slightly-wrong "fit estimator" from each consumer. This is a code-reading argument about where the responsibility should live; I would be very interested in whether maintainers see preflight as in scope for Router or as a userland concern.

Questions before I open anything

Roadmap conflict? Is there an existing plan for unified budget metadata at the Router layer? I checked recent issues and didn't find a direct match; want to confirm.
Scope preference:
- (a) minimal demo PR — just Router.budget() returning normalized metadata, no preflight
- (b) full version with check_fit + cross_provider_normalize + TrimSuggestion
- (c) keep as a third-party package on top of LiteLLM; just expose the metadata hooks
Tool-schema overhead: the per-provider overhead function is the trickiest bit (and it changes whenever providers change their tool schemas). Is it acceptable to ship best-effort estimates with a ±N% margin baked in?
TrimSuggestion — useful or too opinionated? Could ship just over_by_tokens and let callers decide their trim strategy.
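For the tool-schema question, a best-effort estimate with a safety margin could be as simple as the sketch below. The characters-per-token ratio is a rough heuristic and not a property of any provider's tokenizer; the function name and the default margin are likewise assumptions.

```python
import json
import math
from typing import Dict, List


def estimate_tool_overhead(
    tools: List[Dict[str, object]],
    chars_per_token: float = 4.0,  # assumed heuristic, not a provider constant
    margin: float = 0.10,          # pad the estimate by 10% by default
) -> int:
    """Best-effort tool-schema token estimate, padded by a safety margin."""
    if not tools:
        return 0
    # Compact serialization so whitespace does not inflate the estimate.
    serialized = json.dumps(tools, separators=(",", ":"))
    raw_estimate = len(serialized) / chars_per_token
    # Round up so the estimate errs toward over-counting.
    return math.ceil(raw_estimate * (1.0 + margin))
```

Rounding up and padding both push the error toward over-trimming, which wastes some context but avoids the provider-side rejection.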

Not opening a PR yet — gauging interest before sinking time into the wrong shape.

Thanks!
