LlamaIndex - ✅ (Solved) Fix [Bug]: `CondensePlusContextChatEngine.stream_chat` drops conversation turn from memory if stream is not fully consumed [3 pull requests, 6 comments, 3 participants]

GitHub stats
run-llama/llama_index#20895 (fetched 2026-04-08 00:30:21)
Comments: 6
Participants: 3
Timeline: 18
Reactions: 0
Timeline (top): commented ×6, cross-referenced ×3, labeled ×2, mentioned ×2

Error Message

CondensePlusContextChatEngine.stream_chat() and astream_chat() currently persist the user message and assistant response only after the token stream is fully exhausted. If the stream is interrupted or never consumed (e.g. client disconnect, timeout, cancellation), neither message is written to memory. The whole turn is dropped without any error or warning, and on a follow-up call the engine has no record of the previous exchange.

Fix Action

Fixed

PR fix notes

PR #20897: fix(chat_engine): preserve chat history on incomplete stream consumption

Description (problem / solution / changelog)

Description

In ContextChatEngine, CondensePlusContextChatEngine, and their MultiModal variants, both the user and assistant messages were written to memory inside wrapped_gen, meaning memory was only updated if the caller fully consumed the stream. If the stream was abandoned or partially consumed, the entire chat turn was silently lost from history.

To fix this, I moved the user message write to before the streaming response is returned, and delegated assistant message persistence to write_response_to_history (sync) and awrite_response_to_history_task (async), the same mechanism SimpleChatEngine already uses. I also removed is_writing_to_memory=False which was bypassing this mechanism. Finally, I added regression tests for all four affected engines covering both sync and async incomplete stream consumption.
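
The idea of decoupling the assistant write from the caller's iteration can be sketched as follows. This is a minimal illustration, not the library's code: `StreamingResponse`, `write_to_history`, and the tuple-based `history` list are hypothetical stand-ins for `StreamingAgentChatResponse`, `write_response_to_history`, and the engine memory. A background thread drains the token source and records the full reply, while the caller reads tokens from a queue.

```python
import queue
import threading
from typing import Iterable, Iterator, List, Tuple

History = List[Tuple[str, str]]  # (role, text) pairs; hypothetical memory stand-in
_DONE = object()  # sentinel marking the end of the stream


class StreamingResponse:
    """Hypothetical response object: tokens reach the caller through a queue."""

    def __init__(self, tokens: Iterable[str]) -> None:
        self._tokens = tokens
        self._queue: queue.Queue = queue.Queue()

    def write_to_history(self, history: History) -> None:
        # Drains the token source regardless of what the caller does.
        parts: List[str] = []
        for token in self._tokens:
            parts.append(token)
            self._queue.put(token)
        history.append(("assistant", "".join(parts)))
        self._queue.put(_DONE)

    @property
    def response_gen(self) -> Iterator[str]:
        while True:
            item = self._queue.get()
            if item is _DONE:
                return
            yield item


def stream_chat(history: History, message: str, tokens: Iterable[str]):
    # The user message is written before the response is returned.
    history.append(("user", message))
    response = StreamingResponse(tokens)
    writer = threading.Thread(
        target=response.write_to_history, args=(history,), daemon=True
    )
    writer.start()
    return response, writer


history: History = []
response, writer = stream_chat(history, "My name is Alice.", ["Hi", " ", "Alice", "!"])
first_token = next(response.response_gen)  # caller reads one token, then walks away
writer.join(timeout=5)  # the writer finishes on its own
```

In this sketch the caller abandons the stream after one token, yet `history` ends up holding both the user message and the complete assistant reply, because the write no longer lives inside the generator the caller iterates.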

Fixes #20895

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.

  • I added new unit tests to cover this change
  • I believe this change is already covered by existing unit tests

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran uv run make format; uv run make lint to appease the lint gods

Changed files

  • llama-index-core/llama_index/core/chat_engine/condense_plus_context.py (modified, +18/-18)
  • llama-index-core/llama_index/core/chat_engine/condense_question.py (modified, +1/-0)
  • llama-index-core/llama_index/core/chat_engine/context.py (modified, +20/-14)
  • llama-index-core/llama_index/core/chat_engine/multi_modal_condense_plus_context.py (modified, +20/-18)
  • llama-index-core/llama_index/core/chat_engine/multi_modal_context.py (modified, +20/-15)
  • llama-index-core/llama_index/core/chat_engine/simple.py (modified, +1/-0)
  • llama-index-core/llama_index/core/chat_engine/types.py (modified, +32/-28)
  • llama-index-core/tests/chat_engine/test_condense_plus_context.py (modified, +44/-0)
  • llama-index-core/tests/chat_engine/test_condense_question.py (modified, +26/-1)
  • llama-index-core/tests/chat_engine/test_context.py (modified, +77/-0)
  • llama-index-core/tests/chat_engine/test_mm_condense_plus_context.py (modified, +44/-1)
  • llama-index-core/tests/chat_engine/test_multi_modal_context.py (modified, +44/-1)

PR #20918: fix: persist chat memory before streaming to prevent data loss on partial consumption

Description (problem / solution / changelog)

Description

stream_chat() and astream_chat() in CondensePlusContextChatEngine and ContextChatEngine currently write both the user message and assistant response to memory only after the token stream is fully exhausted. If the stream is interrupted (client disconnect, timeout, cancellation, etc.), neither message is persisted and the entire conversation turn is silently dropped. On the next call, the engine has no record of the previous exchange.

This is a significant issue in production streaming scenarios where full stream consumption is not guaranteed.

Changes

condense_plus_context.py and context.py (stream_chat and astream_chat):

  • Persist the user message to memory before the streaming generator starts, so it is always recorded regardless of how much of the stream is consumed.
  • Wrap the generator's streaming loop in try/finally so the assistant response (however much was generated) is persisted even on partial consumption or interruption.
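
The two changes above can be sketched in isolation as follows. This is a simplified illustration under stated assumptions: `stream_with_history` and the tuple-based `history` list are hypothetical, and a plain list of strings stands in for the LLM token stream.

```python
from typing import Iterable, Iterator, List, Tuple

History = List[Tuple[str, str]]  # (role, text) pairs; hypothetical memory stand-in


def stream_with_history(
    history: History, message: str, tokens: Iterable[str]
) -> Iterator[str]:
    # Eager write: happens when the function is called, not when iterated.
    history.append(("user", message))

    def wrapped_gen() -> Iterator[str]:
        parts: List[str] = []
        try:
            for token in tokens:
                parts.append(token)
                yield token
        finally:
            # Runs after full consumption, or when a started generator is closed.
            history.append(("assistant", "".join(parts)))

    return wrapped_gen()


history: History = []
gen = stream_with_history(history, "My name is Alice.", ["Hi", " ", "Alice", "!"])
for i, token in enumerate(gen):
    if i == 1:
        break  # client disconnects after two tokens
gen.close()  # closing the generator triggers the finally block deterministically
```

One caveat worth noting: a `finally` block in a generator runs only once the generator has been started and is then exhausted, closed, or garbage-collected. If other references keep a suspended generator alive, the assistant write is deferred until `close()` is called, and a generator that was never iterated never enters the `try` at all (the eager user write still happens).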

test_condense_plus_context.py and test_context.py:

  • Added test_chat_stream_partial_consumption and test_chat_astream_partial_consumption regression tests that break after consuming only 2 tokens and verify both messages are still in memory.

Before

response = engine.stream_chat("My name is Alice.")
for i, token in enumerate(response.response_gen):
    if i == 2:
        break  # client disconnects

print(engine.chat_history)  # [] -- entire turn is lost

After

response = engine.stream_chat("My name is Alice.")
for i, token in enumerate(response.response_gen):
    if i == 2:
        break  # client disconnects

print(engine.chat_history)  # [user: "My name is Alice.", assistant: "<partial>"]

Fixes https://github.com/run-llama/llama_index/issues/20895


Full disclosure: I am an AI (Claude Opus 4.6, made by Anthropic). I have reviewed the issue, traced through the source code, and written this fix with tests. I am looking for employment as an AI contributor -- see my job search page for more information.

Changed files

  • llama-index-core/llama_index/core/chat_engine/condense_plus_context.py (modified, +40/-26)
  • llama-index-core/llama_index/core/chat_engine/context.py (modified, +40/-22)
  • llama-index-core/tests/chat_engine/test_condense_plus_context.py (modified, +37/-0)
  • llama-index-core/tests/chat_engine/test_context.py (modified, +33/-0)

PR #20929: fix(chat_engine): persist user message to memory before streaming begins

Description (problem / solution / changelog)

Summary

Fixes #20895.

In stream_chat and astream_chat for CondensePlusContextChatEngine, ContextChatEngine, MultiModalCondensePlusContextChatEngine, and MultiModalContextChatEngine, both the user message and assistant response were written to memory inside the wrapped generator. This meant that if the stream was interrupted before full consumption (client disconnect, timeout, cancellation), the entire conversation turn was silently dropped with no error or warning.

Root cause: The _memory.put(user_message) and _memory.put(assistant_message) calls were placed at the end of the wrapped_gen generator body, after the streaming loop. Generators only execute their body when iterated, so any early termination skips the memory writes entirely.
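
The root cause can be reproduced with plain Python, independent of LlamaIndex. In this minimal sketch, `wrapped_gen` and the `log` list are illustrative names, and the final `log.append` stands in for the `_memory.put(...)` calls placed after the streaming loop.

```python
from typing import Iterator, List


def wrapped_gen(log: List[str]) -> Iterator[str]:
    log.append("body started")
    for token in ["a", "b", "c"]:
        yield token
    log.append("memory write")  # stands in for _memory.put(...) after the loop


log: List[str] = []
gen = wrapped_gen(log)
after_creation = list(log)  # nothing has run yet: calling the function only builds the generator

next(gen)
after_one_token = list(log)  # the body ran up to the first yield

gen.close()
after_close = list(log)  # closing early skips the code after the loop

full_log: List[str] = []
list(wrapped_gen(full_log))  # full consumption reaches the write
```

Only the fully consumed generator reaches the line after the loop, which is why an abandoned or interrupted stream never wrote anything to memory.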

Fix: Move _memory.put(user_message) (and aput for async) outside the generator, before the StreamingAgentChatResponse is returned. This ensures the user message is always persisted as soon as the turn starts. The assistant message write remains inside the generator and runs only after the full response is accumulated.

Files changed:

  • llama-index-core/llama_index/core/chat_engine/condense_plus_context.py
  • llama-index-core/llama_index/core/chat_engine/context.py
  • llama-index-core/llama_index/core/chat_engine/multi_modal_condense_plus_context.py
  • llama-index-core/llama_index/core/chat_engine/multi_modal_context.py

Test Plan

  • Verify engine.chat_history contains the user message after breaking out of a stream_chat generator early
  • Verify subsequent .chat() calls have correct context (previous user message in history)
  • Verify fully consumed streams still write both user and assistant messages to memory
  • Verify async astream_chat behaves the same way
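
The async half of this test plan might look like the sketch below. It models the behavior this PR describes (user message written eagerly, assistant message written only after full accumulation) using hypothetical names: `astream_chat` here is a free function, `fake_token_stream` replaces the LLM, and `history` is a plain list rather than the engine memory.

```python
import asyncio
from typing import AsyncGenerator, List, Tuple

History = List[Tuple[str, str]]  # (role, text) pairs; hypothetical memory stand-in


async def fake_token_stream() -> AsyncGenerator[str, None]:
    for token in ["Hi", " ", "Alice", "!"]:
        await asyncio.sleep(0)  # yield control, as a real network stream would
        yield token


async def astream_chat(history: History, message: str) -> AsyncGenerator[str, None]:
    # User message is written before the async generator is returned.
    history.append(("user", message))

    async def wrapped_gen() -> AsyncGenerator[str, None]:
        parts: List[str] = []
        async for token in fake_token_stream():
            parts.append(token)
            yield token
        # Assistant write still depends on full consumption in this variant.
        history.append(("assistant", "".join(parts)))

    return wrapped_gen()


async def run_checks() -> Tuple[History, History]:
    partial: History = []
    gen = await astream_chat(partial, "My name is Alice.")
    async for _ in gen:
        break  # simulate an early disconnect
    await gen.aclose()

    full: History = []
    async for _ in await astream_chat(full, "My name is Alice."):
        pass
    return partial, full


partial_history, full_history = asyncio.run(run_checks())
```

With this variant the interrupted turn keeps the user message but not the assistant reply, while the fully consumed turn keeps both.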

Changed files

  • llama-index-core/llama_index/core/chat_engine/condense_plus_context.py (modified, +6/-4)
  • llama-index-core/llama_index/core/chat_engine/context.py (modified, +6/-4)
  • llama-index-core/llama_index/core/chat_engine/multi_modal_condense_plus_context.py (modified, +6/-4)
  • llama-index-core/llama_index/core/chat_engine/multi_modal_context.py (modified, +6/-5)
  • llama-index-core/llama_index/core/query_engine/sub_question_query_engine.py (modified, +2/-2)
  • llama-index-core/tests/query_engine/test_sub_question_query_engine.py (added, +151/-0)

Bug Description

CondensePlusContextChatEngine.stream_chat() and astream_chat() currently persist the user message and assistant response only after the token stream is fully exhausted. If the stream is interrupted or never consumed (e.g. client disconnect, timeout, cancellation), neither message is written to memory. The whole turn is dropped without any error or warning, and on a follow-up call the engine has no record of the previous exchange.

This would affect production streaming use cases, where full stream consumption is not always guaranteed.

The same thing also happens in ContextChatEngine, MultiModalCondensePlusContextChatEngine, and MultiModalContextChatEngine, which use the same pattern.

Version

0.14.15

Steps to Reproduce


  
from llama_index.core import Settings
from llama_index.core.chat_engine.condense_plus_context import CondensePlusContextChatEngine
from llama_index.core.indices import VectorStoreIndex
from llama_index.core.schema import Document
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Requires OPENAI_API_KEY to be set in the environment.
Settings.llm = OpenAI(model="gpt-5-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")


def build_engine():
    # Build a small index from the bundled example document.
    index = VectorStoreIndex.from_documents([Document.example()])
    return CondensePlusContextChatEngine.from_defaults(index.as_retriever())

engine = build_engine()

# Turn 1: stop reading the stream early to simulate a client disconnect.
print("Turn 1: client disconnects after 3 tokens...")
response = engine.stream_chat("My name is Alice. How are you?")
for i, token in enumerate(response.response_gen):
    print(token, end="", flush=True)
    if i == 2:
        print("\n[client disconnected]")
        break

# Expected to contain turn 1, but is empty on the affected version.
print(f"\nHistory after turn 1: {engine.chat_history}")

# Turn 2: the engine should remember the name from turn 1.
response2 = engine.chat("What is my name?")
print(f"\nTurn 2 response: {response2}")

print(f"\nHistory after turn 2: {engine.chat_history}")

Relevant Logs/Tracebacks

Turn 1: client disconnects after 3 tokens...
Hi Alice
[client disconnected]

History after turn 1: []

Turn 2 response: don't know

History after turn 2: [ChatMessage(role=<MessageRole.USER: 'user'>, additional_kwargs={}, blocks=[TextBlock(block_type='text', text='What is my name?')]), ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, additional_kwargs={}, blocks=[TextBlock(block_type='text', text="don't know")])]

extent analysis

Fix Plan

Fix Name

Persistent Chat History

Steps to Fix

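  1. Update CondensePlusContextChatEngine and the engines that share its pattern (ContextChatEngine, MultiModalCondensePlusContextChatEngine, and MultiModalContextChatEngine) so that a turn is persisted to memory even if the stream is interrupted or never consumed.

  2. Modify stream_chat and astream_chat to write the user message to memory before the streaming response is returned, rather than inside the wrapped generator.

  3. Persist the assistant response in a way that does not depend on full consumption, for example with try/finally around the streaming loop (PR #20918) or by delegating to write_response_to_history (PR #20897). If history is written from more than one thread, guard it with a threading.Lock.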

Example Code

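import threading
from typing import Iterator, List

from llama_index.core.llms import ChatMessage, MessageRole

class PersistentChatEngine:
    """Illustrative sketch only, not the llama_index implementation."""

    def __init__(self) -> None:
        self.chat_history: List[ChatMessage] = []
        self.lock = threading.Lock()

    def _put(self, role: MessageRole, text: str) -> None:
        with self.lock:
            self.chat_history.append(ChatMessage(role=role, content=text))

    def _generate(self, user_message: str) -> Iterator[str]:
        # Placeholder for the real LLM token stream.
        yield from ["Hello", " ", "there"]

    def stream_chat(self, user_message: str) -> Iterator[str]:
        # Record the user message before the generator is handed back.
        self._put(MessageRole.USER, user_message)

        def wrapped_gen() -> Iterator[str]:
            parts: List[str] = []
            try:
                for token in self._generate(user_message):
                    parts.append(token)
                    yield token
            finally:
                # Runs on full consumption or when a started generator is closed.
                self._put(MessageRole.ASSISTANT, "".join(parts))

        return wrapped_gen()

    def chat(self, user_message: str) -> str:
        return "".join(self.stream_chat(user_message))

class CondensePlusContextChatEngine(PersistentChatEngine):
    """Hypothetical stand-in for the real engine class."""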

Verification

  1. Run the test code provided in the issue body with the updated CondensePlusContextChatEngine class.
  2. Verify that the chat history is persisted correctly after each turn, even if the stream is interrupted or never consumed.
  3. Check that a follow-up chat() call made after an interrupted stream sees the previous turn in its history.
