litellm - ✅(Solved) Fix Bug: /v1/responses with stream=true is not persisted to cache or replayed from cache [1 pull request, 1 participant]

GitHub stats
BerriAI/litellm#24579 · Fetched 2026-04-08 01:32:35
Comments: 0 · Participants: 1 · Timeline: 2 · Reactions: 1
Timeline (top): cross-referenced ×1, labeled ×1

Fix Action

Fix / Workaround

I have a fork patch and can open a PR if this approach makes sense.

PR fix notes

PR #24580: fix(cache): persist and replay streamed Responses API requests

Description (problem / solution / changelog)

Relevant issues

Fixes #24579.

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have added testing for this change in tests/local_testing/test_caching_handler.py and tests/llm_responses_api_testing/test_responses_hooks.py
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review


Type

🐛 Bug Fix ✅ Test

Changes

Summary

This fixes a cache gap for the OpenAI Responses API when stream=true.

Before this change:

  • non-streaming /v1/responses requests were cacheable
  • streamed /v1/responses requests were not persisted to cache on completion
  • cache hits could return a cached ResponsesAPIResponse, but not a replayable streaming iterator

After this change:

  • completed streamed Responses requests are written to cache
  • cache hits for streamed Responses return a streaming iterator that replays the cached completed response as synthetic Responses SSE events
  • cached hit metadata reuses the previously computed cache key instead of recomputing from a narrower kwargs shape
  • cache namespace handling no longer breaks when metadata=None
  • cached replay now preserves richer Responses event shapes, including reasoning summary events
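
Two of the behaviors above can be sketched in isolation: tolerating metadata=None when reading a cache namespace, and reusing a key computed during lookup instead of recomputing it at write time. This is a minimal sketch, not the code in the PR. The function names, the `cache_namespace` metadata field, and the `precomputed_cache_key` kwarg are hypothetical names chosen for illustration.

```python
from typing import Any, Callable, Dict, Optional


def cache_namespace(kwargs: Dict[str, Any]) -> Optional[str]:
    """Read a namespace from request metadata, tolerating metadata=None."""
    # `or {}` covers both a missing key and an explicit None value.
    metadata = kwargs.get("metadata") or {}
    return metadata.get("cache_namespace")  # hypothetical field name


def resolve_cache_key(
    kwargs: Dict[str, Any], compute_key: Callable[[Dict[str, Any]], str]
) -> str:
    """Reuse the key stored by the lookup path; compute and store it otherwise."""
    key = kwargs.get("precomputed_cache_key")  # hypothetical kwarg name
    if key is None:
        key = compute_key(kwargs)
        kwargs["precomputed_cache_key"] = key
    return key
```

Carrying the key forward means the write path stores the response under the same key the read path looked up, even if the kwargs available at write time are narrower.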

Why

LiteLLM already supports response caching for non-streaming Responses requests and has streaming cache behavior for completion-style APIs. This closes the missing path for Responses streaming so identical streamed requests can benefit from Redis cache as expected.

Tests

  • add coverage that cached streamed Responses results are converted into a replayable iterator on cache hit
  • add coverage that response.completed on a streamed Responses request persists the completed response into cache
  • add coverage that the streaming Responses lookup path keeps the normalized cache key available for the later write path
  • add coverage that cached streamed Responses hits fire success callbacks exactly once
  • add coverage that cached reasoning summary replay preserves summary_index
  • preserve existing non-streaming Responses cache behavior
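
The "exactly once" callback check can be illustrated with a toy iterator. This is a sketch of the testing idea only, not the tests added in the PR. `ReplayIterator` is a hypothetical stand-in and does not mirror LiteLLM's streaming iterator classes.

```python
from unittest.mock import Mock


class ReplayIterator:
    """Hypothetical stand-in for a cached-stream replay iterator."""

    def __init__(self, events, on_success):
        self._events = iter(events)
        self._on_success = on_success
        self._notified = False

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self._events)
        except StopIteration:
            # Fire the success callback only the first time the stream ends.
            if not self._notified:
                self._notified = True
                self._on_success()
            raise


def test_success_callback_fires_once():
    callback = Mock()
    stream = ReplayIterator([{"type": "response.completed"}], callback)
    assert list(stream) == [{"type": "response.completed"}]
    # Exhausting the iterator again must not re-fire the callback.
    assert list(stream) == []
    callback.assert_called_once()
```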

Validation

I also validated this against a real local LiteLLM proxy + Redis setup:

  • first streamed /v1/responses call took normal model time
  • second identical streamed call returned quickly from cache with x-litellm-cache-key set
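
A manual check like this can be scripted. The sketch below assumes the same local proxy URL, API key, and model as the curl example in this report, and uses only the Python standard library. The helper names are illustrative.

```python
import json
import time
import urllib.request

# Assumed values, taken from the curl example in this report.
PROXY_URL = "http://localhost:4000/v1/responses"
API_KEY = "sk-1234"


def build_request(prompt):
    """Build a streamed POST /v1/responses request."""
    body = json.dumps(
        {"model": "gpt-4.1-mini", "input": prompt, "stream": True}
    ).encode("utf-8")
    return urllib.request.Request(
        PROXY_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )


def timed_call(prompt):
    """Send one streamed request; return (elapsed seconds, cache-key header or None)."""
    start = time.perf_counter()
    with urllib.request.urlopen(build_request(prompt)) as resp:
        cache_key = resp.headers.get("x-litellm-cache-key")
        for _ in resp:  # drain the SSE stream line by line
            pass
    return time.perf_counter() - start, cache_key


def main():
    prompt = "Write exactly five short bullet points about caching."
    for attempt in (1, 2):
        elapsed, cache_key = timed_call(prompt)
        print(f"call {attempt}: {elapsed:.2f}s, x-litellm-cache-key={cache_key}")
```

Calling `main()` against a running proxy should show the second call finishing much faster than the first if the streamed response was cached.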

Changed files

  • litellm/caching/caching.py (modified, +2/-1)
  • litellm/caching/caching_handler.py (modified, +84/-22)
  • litellm/responses/streaming_iterator.py (modified, +588/-144)
  • litellm/types/llms/openai.py (modified, +3/-2)
  • tests/llm_responses_api_testing/test_responses_hooks.py (modified, +892/-4)
  • tests/local_testing/test_caching_handler.py (modified, +512/-77)
  • tests/local_testing/test_responses_stream_cache_keys.py (added, +141/-0)

Code Example

curl -sS http://localhost:4000/v1/responses \
  -H 'Authorization: Bearer sk-1234' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-4.1-mini",
    "input": "Write exactly five short bullet points about caching.",
    "stream": true
  }'

What happened

LiteLLM caches non-streaming POST /v1/responses requests, but identical POST /v1/responses requests with stream=true do not get persisted to cache and are not replayed from cache on subsequent calls.

Reproduction

Use LiteLLM Proxy with Redis caching enabled for:

  • responses
  • aresponses

Then send the same request twice:

curl -sS http://localhost:4000/v1/responses \
  -H 'Authorization: Bearer sk-1234' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-4.1-mini",
    "input": "Write exactly five short bullet points about caching.",
    "stream": true
  }'

Expected:

  • first call hits provider
  • second identical call is served from cache quickly
  • completed stream output is replayed from cache

Actual:

  • first call succeeds
  • second call also hits provider
  • no cached replay occurs for streamed Responses requests

Control case

The same request with stream=false is cached correctly.

Why this looks like a LiteLLM gap

From source inspection:

  • /v1/responses routes through responses / aresponses
  • completed streamed Responses are logged, but not persisted back into cache
  • cache hit conversion exists for cached ResponsesAPIResponse, but there is no native streamed replay path for Responses API equivalent to the chat-completions streaming cache behavior

Impact

Applications using the OpenAI Responses API with streaming do not benefit from LiteLLM response caching, even when Redis caching is enabled for responses/aresponses.

Suggested fix direction

  • persist completed streamed ResponsesAPIResponse objects into cache when a stream finishes successfully
  • on cache hit for stream=true Responses requests, return a streaming iterator that replays the cached completed response as synthetic Responses SSE events
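
The second point can be sketched as a generator that turns a completed response into a short event sequence. This is a simplified illustration, not the implementation in the PR. It emits only three event types (`response.created`, `response.output_text.delta`, `response.completed`), sends each text part as a single delta, and assumes the cached response is a plain dict. A real Responses stream carries more event types and fields.

```python
import json
from typing import Any, Dict, Iterator


def synthetic_events(completed: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield a simplified event sequence rebuilt from a completed response dict."""
    # Opening event: the same response, marked in progress with no output yet.
    yield {
        "type": "response.created",
        "response": {**completed, "status": "in_progress", "output": []},
    }
    for output_index, item in enumerate(completed.get("output") or []):
        for content_index, part in enumerate(item.get("content") or []):
            if part.get("type") == "output_text":
                # The whole cached text is replayed as a single delta.
                yield {
                    "type": "response.output_text.delta",
                    "output_index": output_index,
                    "content_index": content_index,
                    "delta": part["text"],
                }
    yield {"type": "response.completed", "response": completed}


def to_sse(event: Dict[str, Any]) -> str:
    """Serialize one event in Server-Sent Events wire format."""
    return f"event: {event['type']}\ndata: {json.dumps(event)}\n\n"
```

Replaying from the completed object keeps the cache entry identical for streaming and non-streaming storage. The trade-off is that the original chunk boundaries are not reproduced.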

I have a fork patch and can open a PR if this approach makes sense.

extent analysis

Fix Plan

To address the caching issue for streamed POST /v1/responses requests, we need to modify the caching logic to persist completed streamed responses and implement a mechanism to replay them on subsequent cache hits.

Steps:

  1. Modify the cache persistence logic to store completed streamed ResponsesAPIResponse objects when a stream finishes successfully.
  2. Implement a cache hit handler for stream=true Responses requests that returns a streaming iterator.
  3. Create a synthetic Responses SSE event replay mechanism to replay the cached completed response.

Example Code (Python, illustrative; not LiteLLM's internal cache code):

import json

import redis

# Assumes a Redis server reachable on localhost:6379; adjust for your setup.
redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True)

def persist_streamed_response(cache_key, events):
    """Persist a completed streamed response (a list of event dicts) as JSON."""
    redis_client.set(cache_key, json.dumps(events))

def get_streamed_response(cache_key):
    """Return the cached list of event dicts, or None on a cache miss."""
    raw = redis_client.get(cache_key)
    return json.loads(raw) if raw is not None else None

def replay_streamed_response(cached_events):
    """Replay cached events one at a time, as a streaming iterator would."""
    yield from cached_events

def stream_from_cache(cache_key):
    """Return a replay iterator on a cache hit, or None on a miss."""
    cached_events = get_streamed_response(cache_key)
    if cached_events is None:
        return None
    return replay_streamed_response(cached_events)

# Example usage (illustrative key and payload, not LiteLLM's actual cache format):
cache_key = "responses:streamed:example"
events = [{"type": "response.completed", "response": {"id": "resp_example"}}]

# Persist to cache
persist_streamed_response(cache_key, events)

# On a cache hit, iterate over the replayed events
for event in stream_from_cache(cache_key) or []:
    print(event)

Verification

To verify the fix, send the same POST /v1/responses request with stream=true twice and check that the second request is served from the cache quickly, with the completed stream output replayed from the cache.

Extra Tips

  • Ensure proper cache key management to avoid cache collisions and optimize cache storage.
  • Consider implementing cache expiration and eviction policies to manage cache size and freshness.
  • Test the fix thoroughly to ensure correct behavior for various scenarios, including cache hits, misses, and edge cases.
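
The first two tips can be made concrete with a deterministic key function and a small expiring cache. This is a sketch under stated assumptions: the key scheme (SHA-256 over canonical JSON) and the `TTLCache` class are illustrative and are not LiteLLM's actual key format or cache backend.

```python
import hashlib
import json
import time


def make_cache_key(params, namespace=None):
    """Hash request parameters into a key that ignores dict ordering."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{namespace}:{digest}" if namespace else digest


class TTLCache:
    """In-memory stand-in for a cache with per-entry expiration."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}

    def set(self, key, value):
        self._store[key] = (self._clock() + self._ttl, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired entries are evicted lazily on read.
            del self._store[key]
            return None
        return value
```

Sorting keys before hashing means two requests with the same parameters in a different order map to the same entry, while any change in a parameter value produces a different key.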
