litellm - 💡(How to fix) Fix RFC: Cross-provider token-budget normalization at the Router layer (Router.budget + Router.check_fit preflight)

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

I'd like to gauge interest in a cross-provider token-budget normalization API at the LiteLLM Router layer — a single router.budget(model) call that returns a normalized budget object (input max, output max, effective input after tool/system overhead, per-token cost, tokenizer reference), plus a router.check_fit(messages, model, ...) preflight that returns a structured decision instead of a downstream ContextWindowExceededError.

Today model_cost.json exposes max_tokens / max_input_tokens / max_output_tokens / input_cost_per_token / output_cost_per_token as a static lookup, and token_counter() provides per-provider tokenization. But every caller has to wire these together themselves to answer the question "will this fit?" — and most don't, because the wiring is verbose and the cliff is invisible until the provider rejects the request.

The result: callers either over-trim (waste context) or under-trim and discover the limit at request time via a provider-specific error string. LiteLLM is uniquely positioned to fix this because it's already the layer that knows every provider's tokenizer and budget metadata.

Error Message

The result: callers either over-trim (waste context) or under-trim and discover the limit at request time via a provider-specific error string. LiteLLM is uniquely positioned to fix this because it's already the layer that knows every provider's tokenizer and budget metadata. 3. The error path is provider-specific — context-limit errors don't always surface as ContextWindowExceededError; the message string varies by provider. Even when callers handle the error, they discover the limit only after a round-trip. A pre-flight at the router layer eliminates the round-trip.

  • Discover the actual limit at request time via a provider-specific error string and then either bail or retry with manually-trimmed input

Root Cause

Today model_cost.json exposes max_tokens / max_input_tokens / max_output_tokens / input_cost_per_token / output_cost_per_token as a static lookup, and token_counter() provides per-provider tokenization. But every caller has to wire these together themselves to answer the question "will this fit?" — and most don't, because the wiring is verbose and the cliff is invisible until the provider rejects the request.

Code Example

from litellm import Router

router = Router(model_list=[...])

# 1) Static normalized budget for a model
b = router.budget("claude-opus-4-7")
# Budget(
#   model="claude-opus-4-7",
#   provider="anthropic",
#   input_max=200_000,
#   output_max=8_192,
#   tokenizer="anthropic",
#   input_cost_per_token=0.000015,
#   output_cost_per_token=0.000075,
#   tool_schema_overhead_fn=<callable>,
# )

# 2) Preflight — does this request fit?
decision = router.check_fit(
    model="claude-opus-4-7",
    messages=msgs,
    tools=tools,
    reserve_output=2000,
)
# FitDecision(
#   fits=False,
#   input_tokens=185_300,
#   tool_overhead_tokens=1_840,
#   effective_input_max=190_000,    # input_max - reserve_output - tool_overhead
#   over_by_tokens=-2_140,
#   suggestion=TrimSuggestion(
#       trim_oldest_n_messages=4,    # estimated to bring it in budget
#       or_switch_model=["claude-opus-4-7[1m]", "gemini-2.5-pro"],
#   ),
# )
RAW_BUFFERClick to expand / collapse

Update 2026-05-25: I've edited this issue to clarify the framing. The original wording made stronger empirical claims ("I've been running a small router-wrapper", "preflight failure rate ~0.4%", "context utilization up ~12%") than I can actually support — the design is from reading Router, token_counter, and model_cost.json, not from a measured production deployment. I've also removed a reference to issue #21558 that I couldn't independently verify. The design discussion stands; the specific numbers and personal use-case framing have been removed.


Summary

I'd like to gauge interest in a cross-provider token-budget normalization API at the LiteLLM Router layer — a single router.budget(model) call that returns a normalized budget object (input max, output max, effective input after tool/system overhead, per-token cost, tokenizer reference), plus a router.check_fit(messages, model, ...) preflight that returns a structured decision instead of a downstream ContextWindowExceededError.

Today model_cost.json exposes max_tokens / max_input_tokens / max_output_tokens / input_cost_per_token / output_cost_per_token as a static lookup, and token_counter() provides per-provider tokenization. But every caller has to wire these together themselves to answer the question "will this fit?" — and most don't, because the wiring is verbose and the cliff is invisible until the provider rejects the request.

The result: callers either over-trim (waste context) or under-trim and discover the limit at request time via a provider-specific error string. LiteLLM is uniquely positioned to fix this because it's already the layer that knows every provider's tokenizer and budget metadata.

Why the Router is the right layer

Three reasons app code shouldn't be doing this itself:

  1. Tokenizer divergence is fiddly — Claude / OpenAI / Gemini / Mistral all count tokens differently for the same string; tool-call schemas add per-provider overhead (Anthropic's tools block, OpenAI's tool_choice); system prompts are charged differently across providers. Replicating LiteLLM's existing knowledge in every caller is brittle.
  2. The effective budget is not max_input_tokens — callers need max_input_tokens - tool_schema_overhead - system_prompt_tokens - reserved_for_output to know how many messages they can safely include. Today every framework computes this slightly differently, and most ignore tool-schema overhead.
  3. The error path is provider-specific — context-limit errors don't always surface as ContextWindowExceededError; the message string varies by provider. Even when callers handle the error, they discover the limit only after a round-trip. A pre-flight at the router layer eliminates the round-trip.

Design sketch

from litellm import Router

router = Router(model_list=[...])

# 1) Static normalized budget for a model
b = router.budget("claude-opus-4-7")
# Budget(
#   model="claude-opus-4-7",
#   provider="anthropic",
#   input_max=200_000,
#   output_max=8_192,
#   tokenizer="anthropic",
#   input_cost_per_token=0.000015,
#   output_cost_per_token=0.000075,
#   tool_schema_overhead_fn=<callable>,
# )

# 2) Preflight — does this request fit?
decision = router.check_fit(
    model="claude-opus-4-7",
    messages=msgs,
    tools=tools,
    reserve_output=2000,
)
# FitDecision(
#   fits=False,
#   input_tokens=185_300,
#   tool_overhead_tokens=1_840,
#   effective_input_max=190_000,    # input_max - reserve_output - tool_overhead
#   over_by_tokens=-2_140,
#   suggestion=TrimSuggestion(
#       trim_oldest_n_messages=4,    # estimated to bring it in budget
#       or_switch_model=["claude-opus-4-7[1m]", "gemini-2.5-pro"],
#   ),
# )

Three new public surfaces:

  1. Router.budget(model) -> Budget — pure metadata, no LLM call
  2. Router.check_fit(model, messages, tools=..., reserve_output=...) -> FitDecision — counts tokens locally, returns structured decision
  3. Router.cross_provider_normalize(messages, tools=..., from_=..., to=...) -> int — gives the destination provider's count for a request currently formatted for the source provider (useful when switching providers mid-stream)

Budget.tool_schema_overhead_fn — provider-specific function that computes the tokens added by a given tool list, so callers don't have to know about Anthropic's tools block layout vs OpenAI's.

What this is NOT

  • Not a model recommender — or_switch_model is a hint based on input_max, not a quality judgment; caller decides
  • Not a retry/fallback mechanism — Router already has fallbacks; this is the preflight that decides whether to invoke them
  • Not a streaming token counter — preflight is pre-request only; in-stream output token tracking stays where it is
  • Not changing existing APIs — purely additive

Why this matters in practice

Reading consumer-side code in the broader ecosystem: agent frameworks built on LiteLLM each implement their own version of "trim message history to fit". Most:

  • Ignore tool-schema overhead in the budget calculation
  • Differ in how they handle the system-prompt token cost across providers
  • Discover the actual limit at request time via a provider-specific error string and then either bail or retry with manually-trimmed input

A Router-layer preflight would centralize knowledge LiteLLM already has (provider tokenizers, schema layouts) and remove the duplicated-and-slightly-wrong "fit estimator" from each consumer. This is a code-reading argument about where the responsibility should live; would be very interested in whether maintainers see preflight as in-scope for Router or as a userland concern.

Questions before I open anything

  1. Roadmap conflict? Is there an existing plan for unified budget metadata at the Router layer? I checked recent issues and didn't find a direct match; want to confirm.
  2. Scope preference:
    • (a) minimal demo PR — just Router.budget() returning normalized metadata, no preflight
    • (b) full version with check_fit + cross_provider_normalize + TrimSuggestion
    • (c) keep as a third-party package on top of LiteLLM; just expose the metadata hooks
  3. Tool-schema overhead — the per-provider overhead function is the trickiest bit (and changes when providers change tools schema). Acceptable to ship best-effort estimates with a ±N% margin baked in?
  4. TrimSuggestion — useful or too opinionated? Could ship just over_by_tokens and let callers decide their trim strategy.

Not opening a PR yet — gauging interest before sinking time into the wrong shape.

Thanks!

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING