
llamaIndex - 💡(How to fix) Fix [Bug]: StreamingResponse.response_gen only yields one complete chunk when using Ollama Qwen2.5 with QueryEngine streaming

llamaIndex, 2026-05-21 07:28:46





Bug Description

Title

StreamingResponse.response_gen only yields one complete chunk when using Ollama Qwen2.5 with QueryEngine streaming

Description

I am using LlamaIndex with a local Ollama model qwen2.5:7b to build a RAG query engine. I enabled streaming on the query engine:

query_engine = index.as_query_engine(
    similarity_top_k=5,
    streaming=True,
)

response = query_engine.query(question)

print("[rag]", type(response))

for chunk in response.response_gen:
    if chunk:
        yield stream_event("delta", text=str(chunk))

The printed response type is:

[rag] <class 'llama_index.core.base.response.schema.StreamingResponse'>

So the query engine does return a StreamingResponse.

However, response.response_gen only yields once, and the single chunk contains the entire final answer instead of incremental tokens/chunks.

Example output:

{"type": "delta", "text": "监督微调(SFT)是现代大模型训练的关键步骤，它通过模仿高质量的人类示范来塑造模型的行为。SFT 不增加新知识，而是着重于调整模型的回答风格、指令遵循、工具使用以及计划制定等能力。例如，在遇到用户“请用一句话回答”的指令时，经过 SFT 训练的模型会更可能给出简洁的答案而非冗长的内容。此外，SFT 还能帮助模型学会合理调用工具、执行 CoT（链式思考）以及产生结构化的输出。然而，SFT 也有其局限性，它主要通过模仿来学习，并不能有效处理未见过的新环境或复杂任务。因此，在某些情况下，强化学习(Reinforcement Learning, RL)则能发挥更大的作用，通过试错不断优化模型的表现，以应对延迟奖励和多步骤规划等问题。"}
{"type": "sources", "sources": [...]}
{"type": "done"}

Expected behavior

When streaming=True is enabled, I expected response.response_gen to yield multiple smaller chunks incrementally, for example:

chunk 1: 监督微调
chunk 2: (SFT)
chunk 3: 是现代大模型训练
...

This would let the frontend display the answer progressively.

Actual behavior

response.response_gen yields only one chunk, and that chunk contains the full generated answer.

The outer FastAPI StreamingResponse / NDJSON layer appears to work correctly, because the API returns the expected event sequence:

delta -> sources -> done

But the LlamaIndex response.response_gen itself does not appear to produce incremental chunks.
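One way to narrow this down is to record when each chunk arrives and how long it is. The sketch below is a hypothetical helper (`profile_stream` is not part of LlamaIndex) that accepts any iterable of chunks, so `response.response_gen` could be passed to it directly. The `fake_token_stream` generator is only a stand-in so the example runs without Ollama.

```python
import time
from typing import Iterable, List, Tuple


def profile_stream(chunks: Iterable[str]) -> List[Tuple[float, int]]:
    """Return (seconds since start, chunk length) for each non-empty chunk."""
    start = time.perf_counter()
    timeline: List[Tuple[float, int]] = []
    for chunk in chunks:
        text = str(chunk)
        if text:
            timeline.append((time.perf_counter() - start, len(text)))
    return timeline


def fake_token_stream():
    # Stand-in for response.response_gen; yields three small pieces.
    for piece in ["监督微调", "(SFT)", "是现代大模型训练"]:
        yield piece


timeline = profile_stream(fake_token_stream())
# One long entry would match the behavior reported here;
# many short entries would indicate incremental streaming.
print(len(timeline), [length for _, length in timeline])  # 3 [4, 5, 8]
```

A single entry whose timestamp is close to the total generation time would suggest the buffering happens before `response_gen`, not in the HTTP layer.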

Environment

LlamaIndex version: <your llama-index version>
llama-index-llms-ollama version: <your version>
Ollama version: <your ollama version>
Model: qwen2.5:7b
Python version: <your python version>
OS: Windows

Relevant code

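import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()


# Assumed request shape; the report only shows that `req.question` is read.
class ChatRequest(BaseModel):
    question: str


# `index` is loaded from storage as shown under "Steps to Reproduce".


def stream_event(event_type: str, **kwargs):
    # Serialize one event as a single NDJSON line.
    return json.dumps(
        {
            "type": event_type,
            **kwargs,
        },
        ensure_ascii=False,
    ) + "\n"


@app.post("/chat")
async def chat(req: ChatRequest):

    def generate():
        query_engine = index.as_query_engine(
            similarity_top_k=5,
            streaming=True,
        )

        response = query_engine.query(req.question)

        print("[rag]", type(response), flush=True)

        for chunk in response.response_gen:
            # Log a preview and the length of every chunk as it arrives.
            print("[chunk]", repr(str(chunk)[:50]), len(str(chunk)), flush=True)

            if chunk:
                yield stream_event("delta", text=str(chunk))

        yield stream_event("done")

    return StreamingResponse(
        generate(),
        media_type="application/x-ndjson; charset=utf-8",
    )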
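For reference, a client consuming this NDJSON stream could decode events as in the minimal sketch below. `iter_ndjson_events` is a hypothetical helper, not something from the original report; it assumes the client receives the body as an iterable of raw byte chunks, which may split a line (or a multi-byte UTF-8 character) at any position.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_ndjson_events(byte_chunks: Iterable[bytes]) -> Iterator[Dict]:
    """Decode NDJSON events from byte chunks that may split lines."""
    buffer = b""
    for chunk in byte_chunks:
        buffer += chunk
        # Split on newline bytes; a trailing partial line stays buffered.
        *lines, buffer = buffer.split(b"\n")
        for line in lines:
            if line.strip():
                # Decode only complete lines, so characters are never cut.
                yield json.loads(line.decode("utf-8"))
    if buffer.strip():
        # Handle a final line that arrived without a trailing newline.
        yield json.loads(buffer.decode("utf-8"))


# Simulate a body arriving in 5-byte pieces, cutting through characters.
raw = (
    json.dumps({"type": "delta", "text": "监督微调"}, ensure_ascii=False) + "\n"
    + json.dumps({"type": "done"}) + "\n"
).encode("utf-8")
pieces = [raw[i:i + 5] for i in range(0, len(raw), 5)]

events = list(iter_ndjson_events(pieces))
print([event["type"] for event in events])  # ['delta', 'done']
```

Buffering at the byte level and decoding only whole lines keeps the parser independent of how the transport splits the response.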

Additional context

I would like to confirm whether this is expected behavior for QueryEngine streaming, or whether this may be a bug in the Ollama integration / response synthesizer layer.

If this is expected behavior, what is the recommended way to get real token-level or chunk-level streaming when using RAG with LlamaIndex and Ollama?

For now, I am considering bypassing query_engine.query() and using:

retriever = index.as_retriever(similarity_top_k=5)
nodes = retriever.retrieve(question)

# manually build prompt with retrieved context
# then call llm.stream_complete(prompt)

But I would prefer to keep using query_engine if true streaming is supported.
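A minimal sketch of that workaround is shown below. It assumes the retriever returns nodes exposing `get_content()` and that `llm.stream_complete(prompt)` yields objects with a `delta` attribute, as LlamaIndex's `NodeWithScore` and `CompletionResponse` do. `PROMPT_TEMPLATE` and `stream_rag_answer` are hypothetical names, and the `_Fake*` classes are stand-ins so the example runs without an index or Ollama. Whether this path actually streams incrementally with the Ollama integration has not been verified here.

```python
from typing import Any, Iterator

# Hypothetical prompt; not the template used by the query engine.
PROMPT_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context}\n"
    "---------------------\n"
    "Answer the question using only the context above.\n"
    "Question: {question}\n"
    "Answer: "
)


def stream_rag_answer(retriever: Any, llm: Any, question: str) -> Iterator[str]:
    """Retrieve context, build one prompt, and yield the LLM's text deltas."""
    nodes = retriever.retrieve(question)
    context = "\n\n".join(node.get_content() for node in nodes)
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    for partial in llm.stream_complete(prompt):
        if partial.delta:
            yield partial.delta


# Stand-ins that mimic only the attributes used above.
class _FakeNode:
    def __init__(self, text: str) -> None:
        self._text = text

    def get_content(self) -> str:
        return self._text


class _FakeRetriever:
    def retrieve(self, question: str):
        return [_FakeNode("RL learns by trial and error.")]


class _FakeDelta:
    def __init__(self, delta: str) -> None:
        self.delta = delta


class _FakeLLM:
    def stream_complete(self, prompt: str):
        self.last_prompt = prompt  # kept so the prompt can be inspected
        for piece in ["RL ", "is ", "reinforcement learning."]:
            yield _FakeDelta(piece)


fake_llm = _FakeLLM()
answer_pieces = list(stream_rag_answer(_FakeRetriever(), fake_llm, "What is RL?"))
print(answer_pieces)
```

With real objects, the call would look like `stream_rag_answer(index.as_retriever(similarity_top_k=5), Settings.llm, question)`, and each yielded piece could be wrapped in `stream_event("delta", text=piece)`.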

Version

0.14.22

Steps to Reproduce

from llama_index.core import (
    StorageContext,
    load_index_from_storage,
    Settings,
)
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama

Settings.llm = Ollama(
    model="qwen2.5:7b",
    base_url="http://127.0.0.1:11434",
    request_timeout=120.0,
)

Settings.embed_model = OllamaEmbedding(
    model_name="bge-m3",
    base_url="http://127.0.0.1:11434",
)

storage_context = StorageContext.from_defaults(
    persist_dir=str("./storage/ae4f9477f0a185a7")
)

index = load_index_from_storage(storage_context)

query_engine = index.as_query_engine(
    similarity_top_k=5,
    streaming=True,
)

response = query_engine.query("Please tell me what is RL?")

for idx, value in enumerate(response.response_gen):
    print("The index is:", idx, "\n")
    print(value + "\n")

Relevant Logs/Tracebacks

Name: llama-index-llms-ollama
Version: 0.10.1
