litellm - ✅(Solved) Fix Bug: /v1/responses with stream=true is not persisted to cache or replayed from cache [1 pull request, 1 participant]

litellm 2026-03-25 18:16:59

GitHub stats

BerriAI/litellm#24579 • Fetched 2026-04-08 01:32:35

Author

ohnoah

Participants

ohnoah

Fix Action

Fix / Workaround

I have a fork patch and can open a PR if this approach makes sense.

PR fix notes

PR #24580: fix(cache): persist and replay streamed Responses API requests

Repository: BerriAI/litellm
Author: ohnoah
State: closed | merged: True
Link: https://github.com/BerriAI/litellm/pull/24580

Description (problem / solution / changelog)

Relevant issues

Fixes #24579.

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

I have added testing for this change in tests/local_testing/test_caching_handler.py and tests/llm_responses_api_testing/test_responses_hooks.py
My PR passes all unit tests on make test-unit
My PR's scope is as isolated as possible, it only solves 1 specific problem
I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Delays in PR merge?

If you're seeing a delay in your PR being merged, ping the LiteLLM Team on Slack (#pr-review).

CI (LiteLLM team)

CI status guideline:

50-55 passing tests: main is stable with minor issues.

45-49 passing tests: acceptable but needs attention

<= 40 passing tests: unstable; be careful with your merges and assess the risk.

Branch creation CI run
Link: https://github.com/BerriAI/litellm/pull/24580/checks?sha=03417e2e23ef93fc06265f7d3e6afd675ae1ebba
CI run for the last commit
Link: https://github.com/BerriAI/litellm/pull/24580/checks?sha=343021af7d1dfc2546c7be188a1b3b83f951504e
Merge / cherry-pick CI run
Links: maintainer-owned after merge

Type

🐛 Bug Fix ✅ Test

Changes

Summary

This fixes a cache gap for the OpenAI Responses API when stream=true.

Before this change:

non-streaming /v1/responses requests were cacheable
streamed /v1/responses requests were not persisted to cache on completion
cache hits could return a cached ResponsesAPIResponse, but not a replayable streaming iterator

After this change:

completed streamed Responses requests are written to cache
cache hits for streamed Responses return a streaming iterator that replays the cached completed response as synthetic Responses SSE events
cached hit metadata reuses the previously computed cache key instead of recomputing from a narrower kwargs shape
cache namespace handling no longer breaks when metadata=None
cached replay now preserves richer Responses event shapes, including reasoning summary events
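The replay behavior described above can be illustrated with a minimal sketch. It is not the code from the PR: the cached shape (`{"id": ..., "text": ...}`), the function name `replay_cached_response`, and the `chunk_size` parameter are assumptions made for illustration, and the event sequence is reduced to `response.created`, `response.output_text.delta`, and `response.completed`.

```python
import json
from typing import Any, Dict, Iterator


def replay_cached_response(cached: Dict[str, Any], chunk_size: int = 16) -> Iterator[str]:
    """Yield synthetic SSE frames for an already-completed cached response.

    `cached` uses a hypothetical simplified shape: {"id": ..., "text": ...}.
    """

    def sse(event: Dict[str, Any]) -> str:
        # One SSE frame: an event name line, a data line, then a blank line.
        return f"event: {event['type']}\ndata: {json.dumps(event)}\n\n"

    # Announce the response, as a live stream would.
    yield sse({"type": "response.created", "response": {"id": cached["id"]}})

    # Re-chunk the stored text into delta events.
    text = cached["text"]
    for start in range(0, len(text), chunk_size):
        yield sse({
            "type": "response.output_text.delta",
            "delta": text[start:start + chunk_size],
        })

    # Finish with the full cached payload.
    yield sse({"type": "response.completed", "response": cached})
```

A client that concatenates the `delta` fields recovers the cached text, so the replay looks like a normal stream even though no provider call is made.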

Why

LiteLLM already supports response caching for non-streaming Responses requests and has streaming cache behavior for completion-style APIs. This closes the missing path for Responses streaming so identical streamed requests can benefit from Redis cache as expected.

Tests

add coverage that cached streamed Responses results are converted into a replayable iterator on cache hit
add coverage that response.completed on a streamed Responses request persists the completed response into cache
add coverage that the streaming Responses lookup path keeps the normalized cache key available for the later write path
add coverage that cached streamed Responses hits fire success callbacks exactly once
add coverage that cached reasoning summary replay preserves summary_index
preserve existing non-streaming Responses cache behavior
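The cache-key coverage listed above guards against the lookup and write paths disagreeing on the key. A rough sketch of that idea follows, assuming an in-memory dict as the cache; `compute_cache_key`, `lookup`, `write`, and the `stored_cache_key` field are hypothetical names, not LiteLLM identifiers.

```python
import hashlib
import json
from typing import Any, Dict, Optional


def compute_cache_key(kwargs: Dict[str, Any]) -> str:
    """Hash request parameters into a stable key (sorted for determinism)."""
    canonical = json.dumps(kwargs, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def lookup(cache: Dict[str, Any], request_kwargs: Dict[str, Any],
           call_state: Dict[str, Any]) -> Optional[Any]:
    """Read path: compute the key once and remember it for this call."""
    key = compute_cache_key(request_kwargs)
    call_state["stored_cache_key"] = key  # hypothetical field name
    return cache.get(key)


def write(cache: Dict[str, Any], narrower_kwargs: Dict[str, Any],
          call_state: Dict[str, Any], response: Any) -> None:
    """Write path: prefer the remembered key over recomputing it."""
    key = call_state.get("stored_cache_key") or compute_cache_key(narrower_kwargs)
    cache[key] = response
```

If the write path recomputed the key from the narrower kwargs, the entry would land under a different key and the next identical request would miss.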

Validation

I also validated this against a real local LiteLLM proxy + Redis setup:

first streamed /v1/responses call took normal model time
second identical streamed call returned quickly from cache with x-litellm-cache-key set

Changed files

litellm/caching/caching.py (modified, +2/-1)
litellm/caching/caching_handler.py (modified, +84/-22)
litellm/responses/streaming_iterator.py (modified, +588/-144)
litellm/types/llms/openai.py (modified, +3/-2)
tests/llm_responses_api_testing/test_responses_hooks.py (modified, +892/-4)
tests/local_testing/test_caching_handler.py (modified, +512/-77)
tests/local_testing/test_responses_stream_cache_keys.py (added, +141/-0)



What happened

LiteLLM caches non-streaming POST /v1/responses requests, but identical POST /v1/responses requests with stream=true do not get persisted to cache and are not replayed from cache on subsequent calls.

Reproduction

Use LiteLLM Proxy with Redis caching enabled for:

responses
aresponses

Then send the same request twice:

curl -sS http://localhost:4000/v1/responses \
  -H 'Authorization: Bearer sk-1234' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-4.1-mini",
    "input": "Write exactly five short bullet points about caching.",
    "stream": true
  }'

Expected:

first call hits provider
second identical call is served from cache quickly
completed stream output is replayed from cache

Actual:

first call succeeds
second call also hits provider
no cached replay occurs for streamed Responses requests

Control case

The same request with stream=false is cached correctly.

Why this looks like a LiteLLM gap

From source inspection:

/v1/responses routes through responses / aresponses
completed streamed Responses are logged, but not persisted back into cache
cache hit conversion exists for cached ResponsesAPIResponse, but there is no native streamed replay path for Responses API equivalent to the chat-completions streaming cache behavior

Impact

Applications using the OpenAI Responses API with streaming do not benefit from LiteLLM response caching, even when Redis caching is enabled for responses/aresponses.

Suggested fix direction

persist completed streamed ResponsesAPIResponse objects into cache when a stream finishes successfully
on cache hit for stream=true Responses requests, return a streaming iterator that replays the cached completed response as synthetic Responses SSE events
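The first half of this direction, persisting on successful completion, might look like the sketch below. It assumes events are plain dicts with a `type` field and that the final event carries the full response under `response`; `cache_on_completion` is a hypothetical name and the cache is any dict-like mapping.

```python
from typing import Any, Dict, Iterable, Iterator, MutableMapping


def cache_on_completion(
    events: Iterable[Dict[str, Any]],
    cache: MutableMapping[str, Any],
    cache_key: str,
) -> Iterator[Dict[str, Any]]:
    """Pass events through unchanged; store the final response on completion."""
    for event in events:
        if event.get("type") == "response.completed":
            # Only a successfully completed stream is written to the cache.
            cache[cache_key] = event["response"]
        yield event
```

Because the write happens only when `response.completed` is seen, a stream that fails or is abandoned partway leaves the cache untouched.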

I have a fork patch and can open a PR if this approach makes sense.

extent analysis

Fix Plan

To address the caching issue for streamed POST /v1/responses requests, we need to modify the caching logic to persist completed streamed responses and implement a mechanism to replay them on subsequent cache hits.

Steps:

Modify the cache persistence logic to store completed streamed ResponsesAPIResponse objects when a stream finishes successfully.
Implement a cache hit handler for stream=true Responses requests that returns a streaming iterator.
Create a synthetic Responses SSE event replay mechanism to replay the cached completed response.

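Example Code (Python), a simplified sketch and not the actual LiteLLM implementation:

import json

import redis

# Assumes a Redis server reachable on localhost:6379; adjust for your setup.
redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True)

def persist_streamed_response(cache_key, response):
    """Persist a completed streamed response (a list of event dicts) to cache."""
    # Redis stores strings/bytes, so serialize the events to JSON first.
    redis_client.set(cache_key, json.dumps(response))

def get_streamed_response(cache_key):
    """Return the cached list of events, or None on a cache miss."""
    raw = redis_client.get(cache_key)
    return json.loads(raw) if raw is not None else None

def replay_streamed_response(cached_response):
    """Replay a cached response as synthetic SSE events."""
    for event in cached_response:
        yield f"data: {json.dumps(event)}\n\n"

def handle_streamed_request(cache_key, call_provider):
    """Serve from cache on a hit; otherwise call the provider and cache the result."""
    cached = get_streamed_response(cache_key)
    if cached is not None:
        return replay_streamed_response(cached)
    # Cache miss: collect the completed stream, persist it, then emit it.
    events = list(call_provider())
    persist_streamed_response(cache_key, events)
    return replay_streamed_response(events)

# Example usage (call_provider is a hypothetical callable yielding event dicts):
# for chunk in handle_streamed_request("responses:streamed:example", call_provider):
#     print(chunk, end="")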

Verification

To verify the fix, send the same POST /v1/responses request with stream=true twice and check that the second request is served from the cache quickly, with the completed stream output replayed from the cache.
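A small script can make this check repeatable. The sketch below reuses the URL, key, and body from the reproduction and reads the `x-litellm-cache-key` header mentioned in the PR validation notes; the helper names `build_request`, `timed_call`, and `main` are illustrative, and a running proxy with Redis caching enabled is assumed.

```python
import json
import time
import urllib.request

PROXY_URL = "http://localhost:4000/v1/responses"  # from the reproduction
API_KEY = "sk-1234"  # placeholder key from the reproduction


def build_request(url: str = PROXY_URL, api_key: str = API_KEY) -> urllib.request.Request:
    """Build the streamed /v1/responses request used in the reproduction."""
    body = {
        "model": "gpt-4.1-mini",
        "input": "Write exactly five short bullet points about caching.",
        "stream": True,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def timed_call(request: urllib.request.Request):
    """Send the request, drain the stream, and return (seconds, cache key header)."""
    start = time.perf_counter()
    with urllib.request.urlopen(request) as resp:
        resp.read()  # drain the whole SSE stream
        cache_key = resp.headers.get("x-litellm-cache-key")
    return time.perf_counter() - start, cache_key


def main() -> None:
    """Send the identical request twice and print timing and cache header."""
    for attempt in (1, 2):
        elapsed, cache_key = timed_call(build_request())
        print(f"call {attempt}: {elapsed:.2f}s, x-litellm-cache-key={cache_key}")
```

Calling `main()` against a patched proxy should show the second call finishing noticeably faster than the first, with the cache key header set.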

Extra Tips

Derive cache keys from the full normalized request so that distinct requests never collide and identical requests share one entry.
Set a cache expiration (TTL) and an eviction policy so that cache size stays bounded and stale responses age out.
Test cache hits, cache misses, and edge cases such as streams that fail before completing.
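The expiration tip can be illustrated with a tiny in-memory cache. This is a generic sketch, not a LiteLLM or Redis API: `TTLCache` is a hypothetical class, and the injectable `clock` exists only to make expiry easy to test.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Minimal in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        # Store the value together with its absolute expiry time.
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: evict lazily on read and report a miss.
            del self._entries[key]
            return None
        return value
```

With Redis as the backend, the same effect is usually achieved by setting an expiry on each key rather than tracking timestamps in application code.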
