litellm - feat: LLMLingua-2 in-place prompt compaction integration


PRD: LLMLingua-2 Prompt Compaction Integration

Problem Statement

Teams running long-context LLM workloads through LiteLLM hit provider context limits and pay for input tokens on every request. LiteLLM's existing drop-and-retrieve compaction (BM25 message stubbing + server-side retrieval) solves a different problem and only covers Anthropic Messages. Operators need in-place prompt compaction via LLMLingua-2 on the request hot path — opt-in per tenant, observable, fail-open, and isolated from existing compression code — for both proxy and SDK callers.

Solution

Add a self-contained LLMLingua-2 compaction integration that registers as a LiteLLM callback, compacts eligible messages in place before the LLM call, and exposes compaction status through metadata, spend logs, and response headers. Compaction is never on by default: a proxy master switch plus per-key/team/request prompt compaction settings gate every request. When compaction cannot run safely or within budget, the gateway passes the original messages through unchanged and still completes the LLM request.

User Stories

  1. As a platform operator, I want to enable LLMLingua-2 compaction on my LiteLLM proxy via YAML config, so that I can opt in without forking the codebase.
  2. As a platform operator, I want compaction disabled by default at the key/team level, so that I can roll out gradually without surprising existing clients.
  3. As a platform operator, I want to set prompt_compaction: { "enabled": true } on a virtual key, so that only designated keys compact prompts.
  4. As a platform operator, I want team-level prompt compaction settings inherited by keys, so that I can enable compaction for a whole team at once.
  5. As a platform operator, I want per-request prompt compaction settings to override key/team defaults, so that I can test compaction on individual calls.
  6. As a platform operator, I want compaction to run only when input tokens exceed a configurable trigger, so that short prompts are not penalized by model load or latency.
  7. As a platform operator, I want a configurable latency budget for compaction, so that slow compaction never blocks LLM requests beyond an acceptable threshold.
  8. As a platform operator, I want the proxy to pass the request through unchanged when compaction exceeds the latency budget, so that user requests always succeed.
  9. As a platform operator, I want the LLMLingua-2 model warmed in the background at startup, so that the first production request does not pay the full load spike.
  10. As a platform operator, I want pass-through when the model is not yet loaded, so that deploys and restarts do not cause hard failures.
  11. As a platform operator, I want a clear startup warning if the callback is configured but the optional dependency is not installed, so that misconfiguration is caught early.
  12. As a platform operator, I want LLMLingua-2 compaction to take precedence over drop-and-retrieve compaction when both are configured, so that behavior is predictable.
  13. As a platform operator, I want a warning logged when drop-and-retrieve compaction is skipped due to precedence, so that I can audit which algorithm ran.
  14. As a platform operator, I want compaction stats in spend logs (original_tokens, compressed_tokens, status, skip_reason), so that I can measure savings and debug skips.
  15. As a platform operator, I want x-litellm-compaction-status and x-litellm-compaction-ratio response headers, so that clients can observe compaction without parsing logs.
  16. As a platform operator, I want to install compaction via the optional litellm[llmlingua2] extra, so that default gateway images stay slim.
  17. As an application developer using the LiteLLM SDK directly, I want to enable compaction via request metadata, so that I can compact prompts without running the proxy.
  18. As an application developer, I want compaction to work for chat completions, so that OpenAI-format callers benefit.
  19. As an application developer, I want compaction to work for Anthropic Messages, so that /v1/messages callers benefit.
  20. As an application developer, I want system messages left unchanged, so that instructions remain intact.
  21. As an application developer, I want the latest user message left unchanged, so that the active query is never lossy-compacted.
  22. As an application developer, I want tool_use/tool_result exchanges skipped, so that tool-calling conversations stay structurally valid.
  23. As an application developer, I want messages with images or other non-text blocks skipped, so that multimodal requests are not corrupted.
  24. As an application developer, I want per-message compaction for eligible text messages, so that conversation structure is preserved.
  25. As an application developer, I want pass-through when I did not opt in, so that my requests behave exactly as before.
  26. As an SRE, I want compaction to run off the async event loop, so that concurrent requests are not blocked by CPU-bound model inference.
  27. As an SRE, I want no message content logged by the compaction integration, so that PII is not duplicated into logs.
  28. As an SRE, I want mutual exclusion with drop-and-retrieve compaction without modifying that integration's code paths, so that regressions in existing compression are avoided.
  29. As a security reviewer, I want compaction to be opt-in per tenant, so that prompt mutation is never silent.
  30. As a QA engineer, I want unit tests that mock the LLMLingua-2 compressor, so that CI does not require the heavy optional dependency.
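
Stories 20 to 24 together describe an eligibility rule. The following is a minimal sketch of that rule, not the shipped implementation. It assumes OpenAI-style message dicts whose content is either a string or a list of typed blocks, and it treats any non-text block (images, tool_use, tool_result) as disqualifying. The helper name _is_plain_text is hypothetical.

```python
from typing import Any

Message = dict[str, Any]


def _is_plain_text(content: Any) -> bool:
    """True when content is a string or a list made only of text blocks."""
    if isinstance(content, str):
        return True
    if isinstance(content, list):
        return all(isinstance(b, dict) and b.get("type") == "text" for b in content)
    return False


def eligible_indices(messages: list[Message]) -> set[int]:
    """Return indices of messages that may be compacted (assumed rules)."""
    # The latest user message carries the active query and is never compacted.
    last_user = max(
        (i for i, m in enumerate(messages) if m.get("role") == "user"),
        default=-1,
    )
    eligible: set[int] = set()
    for i, msg in enumerate(messages):
        if msg.get("role") in ("system", "tool"):
            continue  # instructions and tool results stay intact
        if i == last_user:
            continue
        if msg.get("tool_calls"):
            continue  # OpenAI-style assistant tool call
        if not _is_plain_text(msg.get("content")):
            continue  # images, tool_use / tool_result blocks, and so on
        eligible.add(i)
    return eligible
```

Because the function is pure, it lends itself to table-driven tests per call type.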

Implementation Decisions

Architectural shape (ADR-0001)

  • New LLMLingua-2 compaction integration module — does not extend drop-and-retrieve compaction or its interception handler.
  • In-place, per-message compaction only — no stubs, content cache, or retrieval tool loop.
  • Call surface: anthropic_messages, acompletion, completion.
  • Compaction precedence: LLMLingua-2 wins over drop-and-retrieve; latter skipped with warning.

Deep modules (testable in isolation)

| Module | Responsibility | Interface (conceptual) |
| --- | --- | --- |
| Compaction config | Parse proxy YAML + defaults | from_proxy_config(litellm_settings) → CompactionConfig |
| Opt-in resolver | Merge request → key → team prompt compaction settings | resolve(settings_chain) → ResolvedCompactionSettings |
| Eligibility engine | Decide which messages may be compacted | eligible_indices(messages, call_type) → set[int] |
| Message normalizers | Call-type-specific text extract + write-back | normalize_for_compaction(msg) → str, apply_compacted_text(msg, text) → msg |
| Compressor core | Lazy singleton, warm, LLMLingua-2 wrap | warm(), compact_text(text, config) → str, is_ready() → bool |
| Compaction orchestrator | Trigger check, budget timer, per-message loop, result aggregate | compact_messages(messages, call_type, config, settings) → CompactResult |
| Integration handler | CustomLogger hooks: deployment pre-call + response headers | hooks mutate kwargs metadata; headers hook returns status/ratio |
| Registration / dependency gate | Proxy callback init; fail loud if YAML + missing dep | initialize_from_proxy_config(...) → handler \| None |

CompactResult shape (from prototype decision)

CompactResult:
  messages: list[message]
  original_tokens: int
  compressed_tokens: int
  compression_ratio: float
  status: "applied" | "pass_through" | "skipped"
  skip_reason: str | None   # not_opted_in | below_trigger | model_not_ready |
                            # dependency_missing | compaction_timeout | ...
  messages_compacted: int
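
One way to express this shape in Python is a frozen dataclass. This is a sketch only: the helper pass_through_result is hypothetical, and it sets the ratio to 1.0 for unchanged messages, which holds whichever direction the ratio is eventually defined in.

```python
from dataclasses import dataclass
from typing import Any, Literal, Optional

Status = Literal["applied", "pass_through", "skipped"]


@dataclass(frozen=True)
class CompactResult:
    messages: list[dict[str, Any]]
    original_tokens: int
    compressed_tokens: int
    compression_ratio: float
    status: Status
    skip_reason: Optional[str] = None  # e.g. "below_trigger", "model_not_ready"
    messages_compacted: int = 0


def pass_through_result(
    messages: list[dict[str, Any]], tokens: int, reason: str
) -> CompactResult:
    """Hypothetical helper: describe a request whose messages were left untouched."""
    return CompactResult(
        messages=messages,
        original_tokens=tokens,
        compressed_tokens=tokens,
        compression_ratio=1.0,  # nothing changed
        status="pass_through",
        skip_reason=reason,
    )
```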

Hook behavior

  • Pre-deployment hook: After opt-in + trigger checks, run orchestrator inside asyncio.to_thread. Write results to litellm_metadata["prompt_compaction_result"]. Replace kwargs["messages"] only when status == "applied".
  • Response headers hook: Return x-litellm-compaction-status, x-litellm-compaction-ratio (and optionally original/compressed token counts).
  • Mutual exclusion: At handler init or on the first request, if both compaction callbacks are registered, log a warning. The drop-and-retrieve handler should no-op while the LLMLingua-2 handler is active. Prefer detecting this via a shared flag or init order, without editing drop-and-retrieve handler internals; if a one-line guard in callback registration is required, document it.
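
The first two hooks can be sketched as follows. This is an illustration under assumptions, not LiteLLM's actual hook signatures: the function names are hypothetical, the result is modelled as a plain dict, and the metadata is assumed to live at kwargs["litellm_metadata"]. Storing stats without message content is also an assumption, in line with story 27. asyncio.to_thread requires Python 3.9 or later.

```python
import asyncio
from typing import Any, Callable

Message = dict[str, Any]


async def pre_call_compaction(
    kwargs: dict[str, Any],
    compact: Callable[[list[Message]], dict[str, Any]],
) -> dict[str, Any]:
    """Run a blocking compaction function in a worker thread and record the result."""
    # CPU-bound work runs off the event loop so other requests keep moving.
    result = await asyncio.to_thread(compact, kwargs["messages"])
    metadata = kwargs.setdefault("litellm_metadata", {})
    # Keep stats only; message content is deliberately left out.
    metadata["prompt_compaction_result"] = {
        k: v for k, v in result.items() if k != "messages"
    }
    if result["status"] == "applied":
        kwargs["messages"] = result["messages"]
    return kwargs


def response_headers(metadata: dict[str, Any]) -> dict[str, str]:
    """Build the two compaction headers from stored stats, if any."""
    result = metadata.get("prompt_compaction_result")
    if not result:
        return {}
    return {
        "x-litellm-compaction-status": str(result["status"]),
        "x-litellm-compaction-ratio": f'{result["compression_ratio"]:.3f}',
    }
```

Any status other than "applied" leaves kwargs["messages"] pointing at the original list, which is what makes the pass-through paths safe.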

Opt-in metadata contract

"prompt_compaction": {
  "enabled": true,
  "compression_trigger": 200000,
  "max_compaction_ms": 2000
}
  • Boolean "prompt_compaction": true accepted as { "enabled": true }.
  • Precedence: request metadata → key metadata → team metadata.
  • Global YAML llmlingua2_compaction_params.enabled required to register capability.
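
The contract above could be resolved roughly like this. It is a sketch under two assumptions the PRD does not settle: that levels merge field by field (rather than the most specific object replacing the others wholesale), and that a level with no setting is represented as None. The function names are illustrative.

```python
from typing import Any, Optional, Union

RawSetting = Union[bool, dict[str, Any], None]


def normalize(raw: RawSetting) -> Optional[dict[str, Any]]:
    """Expand the boolean shorthand; return None when the level says nothing."""
    if raw is None:
        return None
    if isinstance(raw, bool):
        return {"enabled": raw}
    return dict(raw)


def resolve(request: RawSetting, key: RawSetting, team: RawSetting) -> dict[str, Any]:
    """Merge settings so that request beats key, and key beats team."""
    resolved: dict[str, Any] = {"enabled": False}  # off unless a level opts in
    for level in (team, key, request):  # least to most specific
        settings = normalize(level)
        if settings:
            resolved.update(settings)
    return resolved
```

For example, a team that sets all three fields and a request that sets only compression_trigger would resolve to the team's values with the request's trigger.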

Pass-through conditions (fail-open)

  • Not opted in
  • Below token trigger
  • Model not ready
  • Optional dependency missing (SDK programmatic path)
  • Compaction latency budget exceeded (whole request pass-through, no partial apply)
  • Unsupported call type
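
A sketch of how an orchestrator might honor the trigger and the latency budget without partial apply. The signature is simplified from the conceptual interface: the compressor and token counter are passed in as callables, eligible messages are assumed to have string content, and the result is a plain dict. None of this is the shipped code.

```python
import time
from typing import Any, Callable

Message = dict[str, Any]


def compact_messages(
    messages: list[Message],
    eligible: set[int],
    compress: Callable[[str], str],
    count_tokens: Callable[[list[Message]], int],
    trigger: int,
    budget_ms: float,
) -> dict[str, Any]:
    original = count_tokens(messages)

    def unchanged(reason: str) -> dict[str, Any]:
        # Hand back the caller's own list so nothing is mutated.
        return {"messages": messages, "status": "pass_through",
                "skip_reason": reason, "original_tokens": original,
                "compressed_tokens": original}

    if original <= trigger:
        return unchanged("below_trigger")

    deadline = time.monotonic() + budget_ms / 1000.0
    compacted = [dict(m) for m in messages]  # shallow copies; originals stay intact
    for i in sorted(eligible):
        if time.monotonic() > deadline:
            return unchanged("compaction_timeout")  # discard partial work
        compacted[i]["content"] = compress(messages[i]["content"])

    return {"messages": compacted, "status": "applied", "skip_reason": None,
            "original_tokens": original,
            "compressed_tokens": count_tokens(compacted)}
```

The deadline is only checked between messages, so a single slow compress call can still overrun the budget; a real implementation would need an additional guard for that case.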

Model lifecycle

  • Lazy singleton per process.
  • Optional warm_on_startup: true → fire-and-forget background task.
  • Pass-through while loading.
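
The lifecycle above might look like the sketch below. It uses a daemon thread in place of the "background task" the PRD mentions, and the class name LazyCompressor and its loader argument are hypothetical. The real loader would construct the LLMLingua-2 model.

```python
import threading
from typing import Any, Callable, Optional


class LazyCompressor:
    """Process-wide holder that loads a model once, optionally in the background."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self._model: Optional[Any] = None
        self._lock = threading.Lock()
        self._ready = threading.Event()

    def _load(self) -> None:
        with self._lock:  # only one thread performs the load
            if self._model is None:
                self._model = self._loader()
                self._ready.set()

    def warm(self) -> threading.Thread:
        """Start loading in a daemon thread and return immediately."""
        thread = threading.Thread(target=self._load, daemon=True)
        thread.start()
        return thread

    def is_ready(self) -> bool:
        return self._ready.is_set()

    def get(self) -> Optional[Any]:
        """Return the model if loaded, else None so the caller can pass through."""
        return self._model if self.is_ready() else None
```

A caller that receives None from get() would report model_not_ready and forward the original messages. If the loader raises, the holder simply stays not ready, which keeps the request path fail-open.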

Dependency gate

  • Optional extra: litellm[llmlingua2].
  • YAML callback configured + import fails → startup warning, callback not registered.
  • Programmatic SDK registration + import fails → pass-through at runtime.
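
The YAML path of the gate can be sketched with importlib.util.find_spec, which checks for a module without importing it. The import name "llmlingua" and the simplified signature of initialize_from_proxy_config are assumptions; make_handler stands in for whatever constructs the real handler.

```python
import importlib.util
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("compaction_sketch")  # hypothetical logger name


def dependency_available(module_name: str = "llmlingua") -> bool:
    """Check that the optional dependency can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def initialize_from_proxy_config(
    callbacks: list[str],
    make_handler: Callable[[], Any],
    module_name: str = "llmlingua",
) -> Optional[Any]:
    """Return a handler only when the callback is configured and importable."""
    if "llmlingua2_compaction" not in callbacks:
        return None
    if not dependency_available(module_name):
        logger.warning(
            "llmlingua2_compaction is configured but %r is not installed; "
            "install litellm[llmlingua2]. Callback not registered.",
            module_name,
        )
        return None
    return make_handler()
```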

Existing code touchpoints (minimal)

  • Proxy callback registration branch for string "llmlingua2_compaction".
  • Entry in known custom-logger-compatible callbacks list.
  • Optional dependency declaration in package metadata.

Prototype note

Logic prototype at litellm/compression/_prototype_llmlingua2/ validated integration pattern ranking and pass-through semantics. Delete prototype TUI after implementation; retain orchestration patterns if useful.

Testing Decisions

Principle: Test external behavior through public interfaces — config parsing, opt-in resolution, eligibility rules, hook mutations, headers, pass-through paths, and registration. Mock the LLMLingua-2 compressor in all CI tests; no real model load in unit tests.

Modules to test:

| Module | Priority | Prior art |
| --- | --- | --- |
| Opt-in resolver | High | Key metadata patterns in proxy pre-call utils tests |
| Eligibility engine | High | Pure functions; table-driven tests per call type |
| Message normalizers | High | Similar to compression text extraction tests |
| Compaction orchestrator | High | Mock compressor; assert CompactResult status/reasons |
| Integration handler (hooks) | High | compression_interception handler tests: mock compressor, assert kwargs/metadata/headers |
| Registration / dependency gate | Medium | test_callback_utils compression_interception instantiation pattern |
| Mutual exclusion | Medium | Both callbacks registered → assert drop-and-retrieve no-op or warning |

Not tested in unit CI: Real LLMLingua-2 inference quality, GPU performance, cross-provider end-to-end latency. Optional manual/load script out of scope.
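
As an illustration of the mocking principle, a test can swap the compressor for a unittest.mock.Mock so no model is loaded. The compact_text function here is a stand-in for the code under test, and the compress_prompt call with a rate argument and a "compressed_prompt" result key is an assumption about the LLMLingua-2 wrapper's interface.

```python
from typing import Any
from unittest.mock import Mock


def compact_text(text: str, compressor: Any) -> str:
    """Stand-in for the code under test: delegates to a compressor object."""
    out = compressor.compress_prompt(text, rate=0.5)  # assumed interface
    return out["compressed_prompt"]


def test_compact_text_uses_compressor() -> None:
    fake = Mock()
    fake.compress_prompt.return_value = {"compressed_prompt": "short"}

    assert compact_text("a very long prompt", fake) == "short"
    # The mock records the call, so the test can assert on the arguments.
    fake.compress_prompt.assert_called_once_with("a very long prompt", rate=0.5)
```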

Out of Scope

  • Extending drop-and-retrieve compress() or CompressionInterceptionLogger
  • Whole-conversation flatten compaction
  • Partial apply when latency budget exceeded
  • Fail-closed (4xx) when compaction fails
  • Dashboard UI for prompt compaction settings
  • Public docs in this repo (belongs in litellm-docs)
  • Benchmarking / quality evaluation suite for compaction fidelity
  • Per-message compaction quality metrics beyond token counts
  • Gateway split-deployment-specific sidecar (standard callback on gateway process is sufficient for v1)

Further Notes

  • Domain glossary: CONTEXT.md (repo root)
  • ADR: docs/adr/0001-llmlingua2-compaction-integration.md
  • Reference YAML in ADR
  • Delete throwaway prototype directory after implementation ships
