litellm - 💡(How to fix) Fix feat: LLMLingua-2 in-place prompt compaction integration

StepCodex · 2026-05-30T06:16:30Z

[litellm] PRD: LLMLingua-2 Prompt Compaction Integration Problem Statement Teams running long-context LLM workloads through LiteLLM hit provider context limits… ## Solution Add a self-contained **LLMLingua-2 compaction integration** that registers as a LiteLLM callback, compacts eligible messages in place before the LLM call, and exposes compaction status through metadata, spend logs, and response headers. Compaction is never on by default: a proxy master switch plus per-key/team/request **prompt compaction settings** gate every request. When compaction cannot run safely or within budget, the gateway **pass-through**s original messages and still completes the LLM request. # PRD: LLMLingua-2 Prompt Compaction Integration ## Problem Statement Teams running long-context LLM workloads through LiteLLM hit provider context limits and pay for input tokens on every request. LiteLLM's existing drop-and-retrieve compaction (BM25 message stubbing + server-side retrieval) solves a different problem and only covers Anthropic Messages. Operators need **in-place prompt compaction** via LLMLingua-2 on the request hot path — opt-in per tenant, observable, fail-open, and isolated from existing compression code — for both proxy and SDK callers. ## Solution Add a self-contained **LLMLingua-2 compaction integration** that registers as a LiteLLM callback, compacts eligible messages in place before the LLM call, and exposes compaction status through metadata, spend logs, and response headers. Compaction is never on by default: a proxy master switch plus per-key/team/request **prompt compaction settings** gate every request. When compaction cannot run safely or within budget, the gateway **pass-through**s original messages and still completes the LLM request. ## User Stories 1. As a platform operator, I want to enable LLMLingua-2 compaction on my LiteLLM proxy via YAML config, so that I can opt in without forking the codebase. 2. As a platform operator, I want compaction disabled by default at the key/team level, so that I can roll out gradually without surprising existing clients. 3. As a platform operator, I want to set `prompt_compaction: { "enabled": true }` on a virtual key, so that only designated keys compact prompts. 4. As a platform operator, I want team-level prompt compaction settings inherited by keys, so that I can enable compaction for a whole team at once. 5. As a platform operator, I want per-request prompt compaction settings to override key/team defaults, so that I can test compaction on individual calls. 6. As a platform operator, I want compaction to run only when input tokens exceed a configurable trigger, so that short prompts are not penalized by model load or latency. 7. As a platform operator, I want a configurable latency budget for compaction, so that slow compaction never blocks LLM requests beyond an acceptable threshold. 8. As a platform operator, I want the proxy to pass-through unchanged when compaction exceeds the latency budget, so that user requests always succeed. 9. As a platform operator, I want the LLMLingua-2 model warmed in the background at startup, so that the first production request does not pay the full load spike. 10. As a platform operator, I want pass-through when the model is not yet loaded, so that deploys and restarts do not cause hard failures. 11. As a platform operator, I want a clear startup warning if the callback is configured but the optional dependency is not installed, so that misconfiguration is caught early. 12. As a platform operator, I want LLMLingua-2 compaction to take precedence over drop-and-retrieve compaction when both are configured, so that behavior is predictable. 13. As a platform operator, I want a warning logged when drop-and-retrieve compaction is skipped due to precedence, so that I can audit which algorithm ran. 14. As a platform operator, I want compaction stats in spend logs (`original_tokens`, `compressed_tokens`, `status`, `skip_reason`), so that I can measure savings and debug skips. 15. As a platform operator, I want `x-litellm-compaction-status` and `x-litellm-compaction-ratio` response headers, so that clients can observe compaction without parsing logs. 16. As a platform operator, I want to install compaction via `litellm[llmlingua2]` optional extra, so that default gateway images stay slim. 17. As an application developer using the LiteLLM SDK directly, I want to enable compaction via request metadata, so that I can compact prompts without running the proxy. 18. As an application developer, I want compaction to work for chat completions, so that OpenAI-format callers benefit. 19. As an application developer, I want compaction to work for Anthropic Messages, so that `/v1/messages` callers benefit. 20. As an application developer, I want system messages left unchanged, so that instructions remain intact. 21. As an application developer, I want the latest user message left unchanged, so that the active query is never lossy-compacted. 22. As an a

litellm2026-05-30 06:16:30

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

Fix Action

Solution

Add a self-contained LLMLingua-2 compaction integration that registers as a LiteLLM callback, compacts eligible messages in place before the LLM call, and exposes compaction status through metadata, spend logs, and response headers. Compaction is never on by default: a proxy master switch plus per-key/team/request prompt compaction settings gate every request. When compaction cannot run safely or within budget, the gateway pass-throughs original messages and still completes the LLM request.

Code Example

CompactResult:
  messages: list[message]
  original_tokens: int
  compressed_tokens: int
  compression_ratio: float
  status: "applied" | "pass_through" | "skipped"
  skip_reason: str | None   # not_opted_in | below_trigger | model_not_ready |
                            # dependency_missing | compaction_timeout | ...
  messages_compacted: int

---

"prompt_compaction": {
  "enabled": true,
  "compression_trigger": 200000,
  "max_compaction_ms": 2000
}

RAW_BUFFERClick to expand / collapse

PRD: LLMLingua-2 Prompt Compaction Integration

Problem Statement

Teams running long-context LLM workloads through LiteLLM hit provider context limits and pay for input tokens on every request. LiteLLM's existing drop-and-retrieve compaction (BM25 message stubbing + server-side retrieval) solves a different problem and only covers Anthropic Messages. Operators need in-place prompt compaction via LLMLingua-2 on the request hot path — opt-in per tenant, observable, fail-open, and isolated from existing compression code — for both proxy and SDK callers.

Solution

User Stories

As a platform operator, I want to enable LLMLingua-2 compaction on my LiteLLM proxy via YAML config, so that I can opt in without forking the codebase.
As a platform operator, I want compaction disabled by default at the key/team level, so that I can roll out gradually without surprising existing clients.
As a platform operator, I want to set prompt_compaction: { "enabled": true } on a virtual key, so that only designated keys compact prompts.
As a platform operator, I want team-level prompt compaction settings inherited by keys, so that I can enable compaction for a whole team at once.
As a platform operator, I want per-request prompt compaction settings to override key/team defaults, so that I can test compaction on individual calls.
As a platform operator, I want compaction to run only when input tokens exceed a configurable trigger, so that short prompts are not penalized by model load or latency.
As a platform operator, I want a configurable latency budget for compaction, so that slow compaction never blocks LLM requests beyond an acceptable threshold.
As a platform operator, I want the proxy to pass-through unchanged when compaction exceeds the latency budget, so that user requests always succeed.
As a platform operator, I want the LLMLingua-2 model warmed in the background at startup, so that the first production request does not pay the full load spike.
As a platform operator, I want pass-through when the model is not yet loaded, so that deploys and restarts do not cause hard failures.
As a platform operator, I want a clear startup warning if the callback is configured but the optional dependency is not installed, so that misconfiguration is caught early.
As a platform operator, I want LLMLingua-2 compaction to take precedence over drop-and-retrieve compaction when both are configured, so that behavior is predictable.
As a platform operator, I want a warning logged when drop-and-retrieve compaction is skipped due to precedence, so that I can audit which algorithm ran.
As a platform operator, I want compaction stats in spend logs (original_tokens, compressed_tokens, status, skip_reason), so that I can measure savings and debug skips.
As a platform operator, I want x-litellm-compaction-status and x-litellm-compaction-ratio response headers, so that clients can observe compaction without parsing logs.
As a platform operator, I want to install compaction via litellm[llmlingua2] optional extra, so that default gateway images stay slim.
As an application developer using the LiteLLM SDK directly, I want to enable compaction via request metadata, so that I can compact prompts without running the proxy.
As an application developer, I want compaction to work for chat completions, so that OpenAI-format callers benefit.
As an application developer, I want compaction to work for Anthropic Messages, so that /v1/messages callers benefit.
As an application developer, I want system messages left unchanged, so that instructions remain intact.
As an application developer, I want the latest user message left unchanged, so that the active query is never lossy-compacted.
As an application developer, I want tool_use/tool_result exchanges skipped, so that tool-calling conversations stay structurally valid.
As an application developer, I want messages with images or other non-text blocks skipped, so that multimodal requests are not corrupted.
As an application developer, I want per-message compaction for eligible text messages, so that conversation structure is preserved.
As an application developer, I want pass-through when I did not opt in, so that my requests behave exactly as before.
As an SRE, I want compaction to run off the async event loop, so that concurrent requests are not blocked by CPU-bound model inference.
As an SRE, I want no message content logged by the compaction integration, so that PII is not duplicated into logs.
As an SRE, I want mutual exclusion with drop-and-retrieve compaction without modifying that integration's code paths, so that regressions in existing compression are avoided.
As a security reviewer, I want compaction to be opt-in per tenant, so that prompt mutation is never silent.
As a QA engineer, I want unit tests that mock the LLMLingua-2 compressor, so that CI does not require the heavy optional dependency.

Implementation Decisions

Architectural shape (ADR-0001)

New LLMLingua-2 compaction integration module — does not extend drop-and-retrieve compaction or its interception handler.
In-place, per-message compaction only — no stubs, content cache, or retrieval tool loop.
Call surface: anthropic_messages, acompletion, completion.
Compaction precedence: LLMLingua-2 wins over drop-and-retrieve; latter skipped with warning.

Deep modules (testable in isolation)

Module	Responsibility	Interface (conceptual)
Compaction config	Parse proxy YAML + defaults	`from_proxy_config(litellm_settings) → CompactionConfig`
Opt-in resolver	Merge request → key → team prompt compaction settings	`resolve(settings_chain) → ResolvedCompactionSettings`
Eligibility engine	Decide which messages may be compacted	`eligible_indices(messages, call_type) → set[int]`
Message normalizers	Call-type-specific text extract + write-back	`normalize_for_compaction(msg) → str`, `apply_compacted_text(msg, text) → msg`
Compressor core	Lazy singleton, warm, LLMLingua-2 wrap	`warm()`, `compact_text(text, config) → str`, `is_ready() → bool`
Compaction orchestrator	Trigger check, budget timer, per-message loop, result aggregate	`compact_messages(messages, call_type, config, settings) → CompactResult`
Integration handler	CustomLogger hooks: deployment pre-call + response headers	hooks mutate kwargs metadata; headers hook returns status/ratio
Registration / dependency gate	Proxy callback init; fail loud if YAML + missing dep	`initialize_from_proxy_config(...) → handler \| None`

CompactResult shape (from prototype decision)

CompactResult:
  messages: list[message]
  original_tokens: int
  compressed_tokens: int
  compression_ratio: float
  status: "applied" | "pass_through" | "skipped"
  skip_reason: str | None   # not_opted_in | below_trigger | model_not_ready |
                            # dependency_missing | compaction_timeout | ...
  messages_compacted: int

Hook behavior

Pre-deployment hook: After opt-in + trigger checks, run orchestrator inside asyncio.to_thread. Write results to litellm_metadata["prompt_compaction_result"]. Replace kwargs["messages"] only when status == "applied".
Response headers hook: Return x-litellm-compaction-status, x-litellm-compaction-ratio (and optionally original/compressed token counts).
Mutual exclusion: At handler init or first request, if both compaction callbacks registered, log warning; drop-and-retrieve handler should no-op when LLMLingua-2 handler is active (prefer detecting via shared flag or init order — implement without editing drop-and-retrieve handler internals if possible; if a one-line guard in callback registration is required, document it).

Opt-in metadata contract

"prompt_compaction": {
  "enabled": true,
  "compression_trigger": 200000,
  "max_compaction_ms": 2000
}

Boolean "prompt_compaction": true accepted as { "enabled": true }.
Precedence: request metadata → key metadata → team metadata.
Global YAML llmlingua2_compaction_params.enabled required to register capability.

Pass-through conditions (fail-open)

Not opted in
Below token trigger
Model not ready
Optional dependency missing (SDK programmatic path)
Compaction latency budget exceeded (whole request pass-through, no partial apply)
Unsupported call type

Model lifecycle

Lazy singleton per process.
Optional warm_on_startup: true → fire-and-forget background task.
Pass-through while loading.

Dependency gate

Optional extra: litellm[llmlingua2].
YAML callback configured + import fails → startup warning, callback not registered.
Programmatic SDK registration + import fails → pass-through at runtime.

Existing code touchpoints (minimal)

Proxy callback registration branch for string "llmlingua2_compaction".
Entry in known custom-logger-compatible callbacks list.
Optional dependency declaration in package metadata.

Prototype note

Logic prototype at litellm/compression/_prototype_llmlingua2/ validated integration pattern ranking and pass-through semantics. Delete prototype TUI after implementation; retain orchestration patterns if useful.

Testing Decisions

Principle: Test external behavior through public interfaces — config parsing, opt-in resolution, eligibility rules, hook mutations, headers, pass-through paths, and registration. Mock the LLMLingua-2 compressor in all CI tests; no real model load in unit tests.

Modules to test:

Module	Priority	Prior art
Opt-in resolver	High	Key metadata patterns in proxy pre-call utils tests
Eligibility engine	High	Pure functions — table-driven tests per call type
Message normalizers	High	Similar to compression text extraction tests
Compaction orchestrator	High	Mock compressor; assert CompactResult status/reasons
Integration handler (hooks)	High	`compression_interception` handler tests — mock compressor, assert kwargs/metadata/headers
Registration / dependency gate	Medium	`test_callback_utils` compression_interception instantiation pattern
Mutual exclusion	Medium	Both callbacks registered → assert drop-and-retrieve no-op or warning

Not tested in unit CI: Real LLMLingua-2 inference quality, GPU performance, cross-provider end-to-end latency. Optional manual/load script out of scope.

Out of Scope

Extending drop-and-retrieve compress() or CompressionInterceptionLogger
Whole-conversation flatten compaction
Partial apply when latency budget exceeded
Fail-closed (4xx) when compaction fails
Dashboard UI for prompt compaction settings
Public docs in this repo (belongs in litellm-docs)
Benchmarking / quality evaluation suite for compaction fidelity
Per-message compaction quality metrics beyond token counts
Gateway split-deployment-specific sidecar (standard callback on gateway process is sufficient for v1)

Further Notes

Domain glossary: CONTEXT.md (repo root)
ADR: docs/adr/0001-llmlingua2-compaction-integration.md
Reference YAML in ADR
Delete throwaway prototype directory after implementation ships

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

litellm - 💡(How to fix) Fix feat: LLMLingua-2 in-place prompt compaction integration

Recommended Tools

GitHub issue graph ai analysis

Fix Action

Solution

Code Example

PRD: LLMLingua-2 Prompt Compaction Integration

Problem Statement

Solution

User Stories

Implementation Decisions

Architectural shape (ADR-0001)

Deep modules (testable in isolation)

CompactResult shape (from prototype decision)

Hook behavior

Opt-in metadata contract

Pass-through conditions (fail-open)

Model lifecycle

Dependency gate

Existing code touchpoints (minimal)

Prototype note

Testing Decisions

Out of Scope

Further Notes

Still need to ship something?

TRENDING