[Bug]: StreamingResponse.response_gen only yields one complete chunk when using Ollama Qwen2.5 with QueryEngine streaming

I am using LlamaIndex with a local Ollama model qwen2.5:7b to build a RAG query engine. I enabled streaming on the query engine:

query_engine = index.as_query_engine(
    similarity_top_k=5,
    streaming=True,
)

response = query_engine.query(question)

print("[rag]", type(response))

for chunk in response.response_gen:
    if chunk:
        yield stream_event("delta", text=str(chunk))

The printed response type is:

[rag] <class 'llama_index.core.base.response.schema.StreamingResponse'>

So the query engine does return a StreamingResponse.

However, response.response_gen only yields once, and the single chunk contains the entire final answer instead of incremental tokens/chunks.

Example output:

{"type": "delta", "text": "监督微调(SFT)是现代大模型训练的关键步骤,它通过模仿高质量的人类示范来塑造模型的行为。SFT 不增加新知识,而是着重于调整模型的回答风格、指令遵循、工具使用以及计划制定等能力。例如,在遇到用户“请用一句话回答”的指令时,经过 SFT 训练的模型会更可能给出简洁的答案而非冗长的内容。此外,SFT 还能帮助模型学会合理调用工具、执行 CoT(链式思考)以及产生结构化的输出。然而,SFT 也有其局限性,它主要通过模仿来学习,并不能有效处理未见过的新环境或复杂任务。因此,在某些情况下,强化学习(Reinforcement Learning, RL)则能发挥更大的作用,通过试错不断优化模型的表现,以应对延迟奖励和多步骤规划等问题。"}
{"type": "sources", "sources": [...]}
{"type": "done"}

Expected behavior

When streaming=True is enabled, I expected response.response_gen to yield multiple smaller chunks incrementally, for example:

chunk 1: 监督微调
chunk 2: (SFT)
chunk 3: 是现代大模型训练
...

This would let the frontend display the answer progressively.

Actual behavior

response.response_gen yields only one chunk, and that chunk contains the full generated answer.

The outer FastAPI StreamingResponse / NDJSON layer appears to work correctly, because the API can return:

delta -> sources -> done

But the LlamaIndex response.response_gen itself does not appear to produce incremental chunks.
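One way to make this observation measurable is to record when each chunk arrives and how long it is. The sketch below is only a diagnostic aid: `profile_stream` and `looks_buffered` are hypothetical helper names, not part of LlamaIndex, and the "at most one non-empty chunk" rule is an assumed heuristic for what counts as buffered output.

```python
import time
from typing import Iterable, List, Tuple


def profile_stream(chunks: Iterable[str]) -> List[Tuple[float, int]]:
    """Consume a text stream and record (seconds since start, chunk length) per chunk."""
    start = time.perf_counter()
    timeline: List[Tuple[float, int]] = []
    for chunk in chunks:
        timeline.append((time.perf_counter() - start, len(str(chunk))))
    return timeline


def looks_buffered(timeline: List[Tuple[float, int]]) -> bool:
    """Assumed heuristic: at most one non-empty chunk means the answer arrived in one piece."""
    non_empty = [length for _, length in timeline if length > 0]
    return len(non_empty) <= 1
```

Running `profile_stream(response.response_gen)` on the query engine response, and separately on the text deltas from a direct `Settings.llm.stream_complete(...)` call, may help show whether the single chunk originates in the Ollama integration or in the query engine layer.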

Environment

LlamaIndex version: 0.14.22
llama-index-llms-ollama version: 0.10.1
Ollama version: <your ollama version>
Model: qwen2.5:7b
Python version: <your python version>
OS: Windows

Relevant code

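import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()


# Assumed request model (not shown in the report); only `question` is used below.
class ChatRequest(BaseModel):
    question: str


def stream_event(event_type: str, **kwargs):
    # Serialize one event as a single NDJSON line; keep non-ASCII text unescaped.
    return json.dumps(
        {
            "type": event_type,
            **kwargs,
        },
        ensure_ascii=False,
    ) + "\n"


@app.post("/chat")
async def chat(req: ChatRequest):
    # `index` is the index loaded as in "Steps to Reproduce".

    def generate():
        query_engine = index.as_query_engine(
            similarity_top_k=5,
            streaming=True,
        )

        response = query_engine.query(req.question)

        print("[rag]", type(response), flush=True)

        for chunk in response.response_gen:
            # Log a preview and the length of every chunk to see how the answer is split.
            print("[chunk]", repr(str(chunk)[:50]), len(str(chunk)), flush=True)

            if chunk:
                yield stream_event("delta", text=str(chunk))

        yield stream_event("done")

    return StreamingResponse(
        generate(),
        media_type="application/x-ndjson; charset=utf-8",
    )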
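To check from the client side how many `delta` events the endpoint actually emits, the NDJSON lines can be decoded and counted by type. This is a minimal sketch using only the standard library; `parse_ndjson` and `count_event_types` are hypothetical helper names, and the sample lines are shortened stand-ins for the real output.

```python
import json
from collections import Counter
from typing import Dict, Iterable, List


def parse_ndjson(lines: Iterable[str]) -> List[dict]:
    """Decode one JSON object per non-blank line."""
    return [json.loads(line) for line in lines if line.strip()]


def count_event_types(events: Iterable[dict]) -> Dict[str, int]:
    """Count events by their "type" field."""
    return dict(Counter(event.get("type", "unknown") for event in events))


if __name__ == "__main__":
    # Shortened stand-in for the response body described in this report.
    sample = [
        '{"type": "delta", "text": "full answer in one piece"}\n',
        '{"type": "sources", "sources": []}\n',
        '{"type": "done"}\n',
    ]
    print(count_event_types(parse_ndjson(sample)))
```

A single `delta` count for a long answer matches the behavior reported here; with incremental streaming the count would be expected to be much higher.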

Additional context

I would like to confirm whether this is expected behavior for QueryEngine streaming, or whether this may be a bug in the Ollama integration / response synthesizer layer.

If this is expected behavior, what is the recommended way to get real token-level or chunk-level streaming when using RAG with LlamaIndex and Ollama?

For now, I am considering bypassing query_engine.query() and using:

retriever = index.as_retriever(similarity_top_k=5)
nodes = retriever.retrieve(question)

# manually build prompt with retrieved context
# then call llm.stream_complete(prompt)

But I would prefer to keep using query_engine if true streaming is supported.
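The workaround outlined above could look roughly like the sketch below. It relies on duck typing: `retriever` is expected to offer `retrieve()`, each returned node `get_content()`, and `llm` a `stream_complete()` whose items carry a `delta` attribute, which is how I understand the LlamaIndex retriever and LLM interfaces to behave. `build_prompt`, `stream_rag_answer`, and the prompt wording are hypothetical and are not the template the query engine uses. Whether this yields incremental chunks in the environment described here has not been verified.

```python
from typing import Any, Iterator, List


def build_prompt(question: str, contexts: List[str]) -> str:
    """Assemble a simple RAG prompt (hypothetical wording, not LlamaIndex's own template)."""
    joined = "\n\n".join(contexts)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{joined}\n"
        "---------------------\n"
        "Answer the question using only the context above.\n"
        f"Question: {question}\n"
        "Answer: "
    )


def stream_rag_answer(retriever: Any, llm: Any, question: str) -> Iterator[str]:
    """Retrieve context, then yield the LLM's text deltas as they arrive."""
    nodes = retriever.retrieve(question)
    contexts = [node.get_content() for node in nodes]
    prompt = build_prompt(question, contexts)
    for part in llm.stream_complete(prompt):
        # Skip empty or missing deltas so callers only see real text.
        if part.delta:
            yield part.delta
```

Inside the FastAPI generator, the loop over `response.response_gen` would then become a loop over `stream_rag_answer(index.as_retriever(similarity_top_k=5), Settings.llm, req.question)`. The trade-off is that query engine features such as response synthesis modes and source node bookkeeping are no longer applied automatically.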

Version

0.14.22

Steps to Reproduce

from llama_index.core import (
    StorageContext,
    load_index_from_storage,
    Settings,
)
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama

Settings.llm = Ollama(
    model="qwen2.5:7b",
    base_url="http://127.0.0.1:11434",
    request_timeout=120.0,
)

Settings.embed_model = OllamaEmbedding(
    model_name="bge-m3",
    base_url="http://127.0.0.1:11434",
)

storage_context = StorageContext.from_defaults(
    persist_dir=str("./storage/ae4f9477f0a185a7")
)

index = load_index_from_storage(storage_context)

query_engine = index.as_query_engine(
    similarity_top_k=5,
    streaming=True,
)

response = query_engine.query("Please tell me what is RL?")

for idx, value in enumerate(response.response_gen):
    print("The index is:", idx, "\n")
    print(value + "\n")

Relevant Logs/Tracebacks

Name: llama-index-llms-ollama
Version: 0.10.1
