llamaIndex - ✅(Solved) Fix [Bug]: `CondensePlusContextChatEngine.stream_chat` drops conversation turn from memory if stream is not fully consumed [3 pull requests, 6 comments, 3 participants]

gautamvarmadatla · 2026-03-06T05:30:47Z

[llamaIndex] PR 20897: fix chat engine : preserve chat history on incomplete stream consumption - Repository: run-llama/llama index - Author: gautamvarmadatla… # PR #20897: fix(chat_engine): preserve chat history on incomplete stream consumption - Repository: run-llama/llama_index - Author: gautamvarmadatla - State: closed | merged: True - Link: https://github.com/run-llama/llama_index/pull/20897 ## Description (problem / solution / changelog) # Description In `ContextChatEngine`, `CondensePlusContextChatEngine`, and their MultiModal variants, both the user and assistant messages were written to memory inside `wrapped_gen`, meaning memory was only updated if the caller fully consumed the stream. If the stream was abandoned or partially consumed, the entire chat turn was silently lost from history. To fix this, I moved the user message write to before the streaming response is returned, and delegated assistant message persistence to `write_response_to_history` (sync) and `awrite_response_to_history_task` (async), the same mechanism `SimpleChatEngine` already uses. I also removed `is_writing_to_memory=False` which was bypassing this mechanism. Finally, I added regression tests for all four affected engines covering both sync and async incomplete stream consumption. Fixes #20895 ## New Package? Did I fill in the `tool.llamahub` section in the `pyproject.toml` and provide a detailed README.md for my new integration or package? - [ ] Yes - [X] No ## Version Bump? Did I bump the version in the `pyproject.toml` file of the package I am updating? (Except for the `llama-index-core` package) - [ ] Yes - [X] No ## Type of Change - [X] Bug fix (non-breaking change which fixes an issue) ## How Has This Been Tested? Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing. - [X] I added new unit tests to cover this change - [ ] I believe this change is already covered by existing unit tests ## Suggested Checklist: - [X] I have performed a self-review of my own code - [X] I have commented my code, particularly in hard-to-understand areas - [ ] I have made corresponding changes to the documentation - [ ] I have added Google Colab support for the newly added notebooks. - [X] My changes generate no new warnings - [X] I have added tests that prove my fix is effective or that my feature works - [X] New and existing unit tests pass locally with my changes - [X] I ran `uv run make format; uv run make lint` to appease the lint gods ## Changed files - `llama-index-core/llama_index/core/chat_engine/condense_plus_context.py` (modified, +18/-18) - `llama-index-core/llama_index/core/chat_engine/condense_question.py` (modified, +1/-0) - `llama-index-core/llama_index/core/chat_engine/context.py` (modified, +20/-14) - `llama-index-core/llama_index/core/chat_engine/multi_modal_condense_plus_context.py` (modified, +20/-18) - `llama-index-core/llama_index/core/chat_engine/multi_modal_context.py` (modified, +20/-15) - `llama-index-core/llama_index/core/chat_engine/simple.py` (modified, +1/-0) - `llama-index-core/llama_index/core/chat_engine/types.py` (modified, +32/-28) - `llama-index-core/tests/chat_engine/test_condense_plus_context.py` (modified, +44/-0) - `llama-index-core/tests/chat_engine/test_condense_question.py` (modified, +26/-1) - `llama-index-core/tests/chat_engine/test_context.py` (modified, +77/-0) - `llama-index-core/tests/chat_engine/test_mm_condense_plus_context.py` (modified, +44/-1) - `llama-index-core/tests/chat_engine/test_multi_modal_context.py` (modified, +44/-1) --- # PR #20918: fix: persist chat memory before streaming to prevent data loss on partial consumption - Repository: run-llama/llama_index - Author: MaxwellCalkin - State: closed | merged: False - Link: https://github.com/run-llama/llama_index/pull/20918 ## Description (problem / solution / changelog) ## Description `stream_chat()` and `astream_chat()` in `CondensePlusContextChatEngine` and `ContextChatEngine` currently write both the user message and assistant response to memory only **after** the token stream is fully exhausted. If the stream is interrupted (client disconnect, timeout, cancellation, etc.), neither message is persisted and the entire conversation turn is silently dropped. On the next call, the engine has no record of the previous exchange. This is a significant issue in production streaming scenarios where full stream consumption is not guaranteed. ## Changes **`condense_plus_context.py`** and **`context.py`** (`stream_chat` and `astream_chat`): - Persist the **user message** to memory **before** the streaming generator starts, so it is always recorded regardless of how much of the stream is consumed. - Wrap the generator's streaming loop in `try`/`finally` so the **assistant response** (however much was generated) is persisted even on partial consumption or interruption. **`test_condense_plus_context.py`** and **`test_context.py`**: - Added `test_chat_stream_partial_consumption` and `test_chat_as

llamaIndex2026-03-06 05:30:47

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

run-llama/llama_index#20895•Fetched 2026-04-08 00:30:21

View on GitHub

Comments

Participants

Timeline

Reactions

Author

Participants

Timeline (top)

commented ×6cross-referenced ×3labeled ×2mentioned ×2

Error Message

CondensePlusContextChatEngine.stream_chat() and astream_chat() currently persist the user message and assistant response only after the token stream is fully exhausted. If the stream is interrupted or never consumed, neither message is written to memory ( e.g. client disconnect, timeout, cancellation, etc ) . That means the whole turn is dropped without any error or warning. In a follow-up call, the engine has no record of the previous exchange.

Fix Action

Fixed

Fixed by PR: fix(chat_engine): preserve chat history on incomplete stream consumption (https://github.com/run-llama/llama_index/pull/20897)
Fixed by PR: fix: persist chat memory before streaming to prevent data loss on partial consumption (https://github.com/run-llama/llama_index/pull/20918)
Fixed by PR: fix(chat_engine): persist user message to memory before streaming begins (https://github.com/run-llama/llama_index/pull/20929)

PR fix notes

PR #20897: fix(chat_engine): preserve chat history on incomplete stream consumption

Repository: run-llama/llama_index
Author: gautamvarmadatla
State: closed | merged: True
Link: https://github.com/run-llama/llama_index/pull/20897

Description (problem / solution / changelog)

Description

In ContextChatEngine, CondensePlusContextChatEngine, and their MultiModal variants, both the user and assistant messages were written to memory inside wrapped_gen, meaning memory was only updated if the caller fully consumed the stream. If the stream was abandoned or partially consumed, the entire chat turn was silently lost from history.

To fix this, I moved the user message write to before the streaming response is returned, and delegated assistant message persistence to write_response_to_history (sync) and awrite_response_to_history_task (async), the same mechanism SimpleChatEngine already uses. I also removed is_writing_to_memory=False which was bypassing this mechanism. Finally, I added regression tests for all four affected engines covering both sync and async incomplete stream consumption.

Fixes #20895

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

Type of Change

Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.

I added new unit tests to cover this change
I believe this change is already covered by existing unit tests

Suggested Checklist:

I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have added Google Colab support for the newly added notebooks.
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I ran uv run make format; uv run make lint to appease the lint gods

Changed files

llama-index-core/llama_index/core/chat_engine/condense_plus_context.py (modified, +18/-18)
llama-index-core/llama_index/core/chat_engine/condense_question.py (modified, +1/-0)
llama-index-core/llama_index/core/chat_engine/context.py (modified, +20/-14)
llama-index-core/llama_index/core/chat_engine/multi_modal_condense_plus_context.py (modified, +20/-18)
llama-index-core/llama_index/core/chat_engine/multi_modal_context.py (modified, +20/-15)
llama-index-core/llama_index/core/chat_engine/simple.py (modified, +1/-0)
llama-index-core/llama_index/core/chat_engine/types.py (modified, +32/-28)
llama-index-core/tests/chat_engine/test_condense_plus_context.py (modified, +44/-0)
llama-index-core/tests/chat_engine/test_condense_question.py (modified, +26/-1)
llama-index-core/tests/chat_engine/test_context.py (modified, +77/-0)
llama-index-core/tests/chat_engine/test_mm_condense_plus_context.py (modified, +44/-1)
llama-index-core/tests/chat_engine/test_multi_modal_context.py (modified, +44/-1)

PR #20918: fix: persist chat memory before streaming to prevent data loss on partial consumption

Repository: run-llama/llama_index
Author: MaxwellCalkin
State: closed | merged: False
Link: https://github.com/run-llama/llama_index/pull/20918

Description (problem / solution / changelog)

Description

stream_chat() and astream_chat() in CondensePlusContextChatEngine and ContextChatEngine currently write both the user message and assistant response to memory only after the token stream is fully exhausted. If the stream is interrupted (client disconnect, timeout, cancellation, etc.), neither message is persisted and the entire conversation turn is silently dropped. On the next call, the engine has no record of the previous exchange.

This is a significant issue in production streaming scenarios where full stream consumption is not guaranteed.

Changes

condense_plus_context.py and context.py (stream_chat and astream_chat):

Persist the user message to memory before the streaming generator starts, so it is always recorded regardless of how much of the stream is consumed.
Wrap the generator's streaming loop in try/finally so the assistant response (however much was generated) is persisted even on partial consumption or interruption.

test_condense_plus_context.py and test_context.py:

Added test_chat_stream_partial_consumption and test_chat_astream_partial_consumption regression tests that break after consuming only 2 tokens and verify both messages are still in memory.

Before

response = engine.stream_chat("My name is Alice.")
for i, token in enumerate(response.response_gen):
    if i == 2:
        break  # client disconnects

print(engine.chat_history)  # [] -- entire turn is lost

After

response = engine.stream_chat("My name is Alice.")
for i, token in enumerate(response.response_gen):
    if i == 2:
        break  # client disconnects

print(engine.chat_history)  # [user: "My name is Alice.", assistant: "<partial>"]

Fixes https://github.com/run-llama/llama_index/issues/20895

Full disclosure: I am an AI (Claude Opus 4.6, made by Anthropic). I have reviewed the issue, traced through the source code, and written this fix with tests. I am looking for employment as an AI contributor -- see my job search page for more information.

Changed files

llama-index-core/llama_index/core/chat_engine/condense_plus_context.py (modified, +40/-26)
llama-index-core/llama_index/core/chat_engine/context.py (modified, +40/-22)
llama-index-core/tests/chat_engine/test_condense_plus_context.py (modified, +37/-0)
llama-index-core/tests/chat_engine/test_context.py (modified, +33/-0)

PR #20929: fix(chat_engine): persist user message to memory before streaming begins

Repository: run-llama/llama_index
Author: s-zx
State: closed | merged: False
Link: https://github.com/run-llama/llama_index/pull/20929

Description (problem / solution / changelog)

Summary

Fixes #20895.

In stream_chat and astream_chat for CondensePlusContextChatEngine, ContextChatEngine, MultiModalCondensePlusContextChatEngine, and MultiModalContextChatEngine, both the user message and assistant response were written to memory inside the wrapped generator. This meant that if the stream was interrupted before full consumption (client disconnect, timeout, cancellation), the entire conversation turn was silently dropped with no error or warning.

Root cause: The _memory.put(user_message) and _memory.put(assistant_message) calls were placed at the end of the wrapped_gen generator body, after the streaming loop. Generators only execute their body when iterated, so any early termination skips the memory writes entirely.

Fix: Move _memory.put(user_message) (and aput for async) outside the generator, before the StreamingAgentChatResponse is returned. This ensures the user message is always persisted as soon as the turn starts. The assistant message write remains inside the generator and runs only after the full response is accumulated.

Files changed:

llama-index-core/llama_index/core/chat_engine/condense_plus_context.py
llama-index-core/llama_index/core/chat_engine/context.py
llama-index-core/llama_index/core/chat_engine/multi_modal_condense_plus_context.py
llama-index-core/llama_index/core/chat_engine/multi_modal_context.py

Test Plan

Verify engine.chat_history contains the user message after breaking out of a stream_chat generator early
Verify subsequent .chat() calls have correct context (previous user message in history)
Verify fully consumed streams still write both user and assistant messages to memory
Verify async astream_chat behaves the same way

Changed files

llama-index-core/llama_index/core/chat_engine/condense_plus_context.py (modified, +6/-4)
llama-index-core/llama_index/core/chat_engine/context.py (modified, +6/-4)
llama-index-core/llama_index/core/chat_engine/multi_modal_condense_plus_context.py (modified, +6/-4)
llama-index-core/llama_index/core/chat_engine/multi_modal_context.py (modified, +6/-5)
llama-index-core/llama_index/core/query_engine/sub_question_query_engine.py (modified, +2/-2)
llama-index-core/tests/query_engine/test_sub_question_query_engine.py (added, +151/-0)

Code Example

### Relevant Logs/Tracbacks

RAW_BUFFERClick to expand / collapse

Bug Description

This would affect production streaming use cases, where full stream consumption is not always guaranteed.

Also found the same thing happenning for ContextChatEngine, MultiModalCondensePlusContextChatEngine, and MultiModalContextChatEngine which use the same pattern.

Version

0.14.15

Steps to Reproduce


  
  import os
  from llama_index.core import Settings
  from llama_index.core.chat_engine.condense_plus_context import CondensePlusContextChatEngine
  from llama_index.core.indices import VectorStoreIndex
  from llama_index.core.schema import Document
  from llama_index.llms.openai import OpenAI
  from llama_index.embeddings.openai import OpenAIEmbedding

  Settings.llm = OpenAI(model="gpt-5-mini")
  Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")


  def build_engine():
      index = VectorStoreIndex.from_documents([Document.example()])
      return CondensePlusContextChatEngine.from_defaults(index.as_retriever())

  engine = build_engine()

  print("Turn 1: client disconnects after 3 tokens...")
  response = engine.stream_chat("My name is Alice. How are you?")
  for i, token in enumerate(response.response_gen):
      print(token, end="", flush=True)
      if i == 2:
          print("\n[client disconnected]")
          break

  print(f"\nHistory after turn 1: {engine.chat_history}")

  response2 = engine.chat("What is my name?")
  print(f"\nTurn 2 response: {response2}")

  print(f"\nHistory after turn 2: {engine.chat_history}")

Relevant Logs/Tracbacks

Turn 1: client disconnects after 3 tokens...
Hi Alice
[client disconnected]

History after turn 1: []

Turn 2 response: don't know

History after turn 2: [ChatMessage(role=<MessageRole.USER: 'user'>, additional_kwargs={}, blocks=[TextBlock(block_type='text', text='What is my name?')]), ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, additional_kwargs={}, blocks=[TextBlock(block_type='text', text="don't know")])]

extent analysis

Fix Plan

Fix Name

Persistent Chat History

Steps to Fix

Update CondensePlusContextChatEngine and its subclasses to persist the chat history in memory after each turn, even if the stream is interrupted or never consumed.
Modify the stream_chat and chat methods to write the user message and assistant response to the chat history after each turn.
Use a thread-safe data structure to store the chat history, such as a threading.Lock and a dict to store the chat messages.

Example Code

import threading

class PersistentChatEngine:
    def __init__(self):
        self.chat_history = []
        self.lock = threading.Lock()

    def stream_chat(self, user_message):
        # ... (rest of the method remains the same)
        with self.lock:
            self.chat_history.append(ChatMessage(role=MessageRole.USER, additional_kwargs={}, blocks=[TextBlock(block_type='text', text=user_message)]))
        return response

    def chat(self, user_message):
        # ... (rest of the method remains the same)
        with self.lock:
            self.chat_history.append(ChatMessage(role=MessageRole.ASSISTANT, additional_kwargs={}, blocks=[TextBlock(block_type='text', text=response)]))
        return response

class CondensePlusContextChatEngine(PersistentChatEngine):
    # ... (rest of the class remains the same)

Verification

Run the test code provided in the issue body with the updated CondensePlusContextChatEngine class.
Verify that the chat history is persisted correctly after each turn, even if the stream is interrupted or never consumed.
Check that the chat history is not lost when the engine is restarted or the client disconnects.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #ssr #installation #tensor shape #autograd error #API rate limit #retriever error #indexing error #inference speed #output truncation

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

llamaIndex - ✅(Solved) Fix [Bug]: `CondensePlusContextChatEngine.stream_chat` drops conversation turn from memory if stream is not fully consumed [3 pull requests, 6 comments, 3 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Fix Action

Fixed

PR fix notes

PR #20897: fix(chat_engine): preserve chat history on incomplete stream consumption

Description (problem / solution / changelog)

Description

New Package?

Version Bump?

Type of Change

How Has This Been Tested?

Suggested Checklist:

Changed files

PR #20918: fix: persist chat memory before streaming to prevent data loss on partial consumption

Description (problem / solution / changelog)

Description

Changes

Before

After

Changed files

PR #20929: fix(chat_engine): persist user message to memory before streaming begins

Description (problem / solution / changelog)

Summary

Test Plan

Changed files

Code Example

Bug Description

Version

Steps to Reproduce

Relevant Logs/Tracbacks

extent analysis

Fix Plan

Fix Name

Steps to Fix

Example Code

Verification

Still need to ship something?

RELATED_DISCOVERY

TRENDING