openclaw - 💡(How to fix) Fix [Bug]: /v1/chat/completions with stream:true buffers full response when agent runs through embedded pi-ai (anthropic-messages and other non-CLI providers); TTFB == total [1 comments, 2 participants]

GitHub stats
openclaw/openclaw#77994 · Fetched 2026-05-06 06:18:07
Comments: 1 · Participants: 2 · Timeline: 2 · Reactions: 2
Timeline (top): closed ×1, commented ×1


[Bug]: /v1/chat/completions with stream:true buffers full response when agent runs through embedded pi-ai (anthropic-messages and other non-CLI providers); TTFB == total

Summary

When /v1/chat/completions is called with stream: true and the resolved agent's primary provider goes through the embedded pi-ai runner (runEmbeddedPiAgent) rather than a CLI backend (claude-cli live session, generic CLI with JSONL parser) or an ACP turn, the response is emitted as well-formed SSE chunks but the first chunk is not flushed until the entire agent run is complete. Time-to-first-byte equals total time. All chat.completion.chunk payloads share the same created timestamp and are written back-to-back at the end of the run.

CLI and ACP runs stream correctly. The bug is specific to embedded pi-ai runs, which covers anthropic-messages (direct) and the openai-completions / openai-responses providers when they don't go through a CLI harness.

Reproduction

OpenClaw 2026.5.2 (gateway image ghcr.io/openclaw/openclaw:latest as of 2026-05-05).

openclaw.json agent definition:

{
  "id": "fastpath",
  "model": {
    "primary": "anthropic/claude-haiku-4-5-20251001",
    "fallbacks": ["mistral/mistral-small-latest"]
  }
}

Request:

curl -sN -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"model":"openclaw/fastpath","messages":[{"role":"user","content":"hello"}],"stream":true,"max_tokens":50}' \
  -w '\n[ttfb=%{time_starttransfer}s total=%{time_total}s]\n' \
  http://localhost:18789/v1/chat/completions

Observed (cached prompt, ~57k input tokens, ~13 output tokens):

data: {"id":"...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"}}]}
data: {"id":"...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hey"},"finish_reason":null}]}
data: {"id":"...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"! ..."},"finish_reason":null}]}
data: {"id":"...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]

[ttfb=9.169s total=9.176s]

Expected: TTFB approximately equal to upstream model TTFB (1–2s for Haiku at this prompt size), with content chunks flushed as the underlying provider produces them.
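
To make the observed-versus-expected comparison repeatable, the timing footer printed by the curl -w format string above can be parsed and checked. This is only a sketch: parseCurlTimings, looksBuffered, and the 0.95 threshold are illustrative names and values, not part of OpenClaw or curl.

```javascript
// Parse the footer produced by the -w format string in the request above,
// e.g. "[ttfb=9.169s total=9.176s]". Returns null if the line does not match.
function parseCurlTimings(line) {
  const match = /\[ttfb=([\d.]+)s total=([\d.]+)s\]/.exec(line);
  if (!match) return null;
  return { ttfb: Number(match[1]), total: Number(match[2]) };
}

// Hypothetical heuristic: if the first byte only arrives in the last 5% of
// the request, the response was effectively buffered rather than streamed.
function looksBuffered(timings, threshold = 0.95) {
  return timings.total > 0 && timings.ttfb / timings.total >= threshold;
}

const observed = parseCurlTimings("[ttfb=9.169s total=9.176s]");
console.log(observed, looksBuffered(observed));
```

With the figures from the observed run, the ratio is above 0.99, so the check flags the response as buffered. A run with TTFB near 1–2s against a similar total would not be flagged.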

Root cause

The HTTP endpoint at dist/openai-http-*.js is wired correctly. In the stream: true branch it subscribes via onAgentEvent(...) and forwards stream: "assistant" events as chat.completion.chunk deltas. It also has a buffered fallback that triggers when no assistant deltas arrive during the run — it writes result.text as a single content chunk after the run completes (the if (!sawAssistantDelta) { ... } block following await agentCommandFromIngress(...)).
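
The endpoint behaviour described above can be modelled roughly as follows. This is a synchronous simplification for illustration, not the code in dist/openai-http-*.js: streamChatCompletion, runAgent, and write are hypothetical stand-ins, and the real handler awaits agentCommandFromIngress(...) and writes to the HTTP response.

```javascript
// Simplified model of the stream: true branch. `runAgent(onAgentEvent)` is a
// hypothetical stand-in for the agent run; `write` stands in for res.write.
function streamChatCompletion(runAgent, write) {
  let sawAssistantDelta = false;

  // Encode one chat.completion.chunk as an SSE data line.
  const chunk = (delta, finishReason = null) =>
    write(`data: ${JSON.stringify({
      object: "chat.completion.chunk",
      choices: [{ index: 0, delta, finish_reason: finishReason }],
    })}\n\n`);

  chunk({ role: "assistant" });

  // Forward every stream: "assistant" event as a content delta.
  const onAgentEvent = (event) => {
    if (event.stream !== "assistant") return;
    sawAssistantDelta = true;
    chunk({ content: event.data.delta });
  };

  const result = runAgent(onAgentEvent);

  // Buffered fallback: no deltas arrived during the run, so the whole
  // result text is written as a single content chunk.
  if (!sawAssistantDelta) chunk({ content: result.text });

  chunk({}, "stop");
  write("data: [DONE]\n\n");
}
```

In this model the encoder is never the bottleneck: how early chunks reach the client depends entirely on when the runner calls onAgentEvent.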

The bug is that stream: "assistant" events are never emitted during generation when the agent is dispatched to runEmbeddedPiAgent:

Path | File | Per-token deltas?
Claude CLI live session | dist/execute.runtime-*.js (onAssistantDelta → emitAgentEvent) | Yes
Generic CLI with JSONL parser | dist/execute.runtime-*.js (createCliJsonlStreamingParser) | Yes
ACP turn (acpManager.runTurn) | dist/agent-command-*.js consumes text_delta → emitAcpAssistantDelta (dist/attempt-execution-*.js) → emitAgentEvent({ stream: "assistant", ... }) | Yes
Embedded pi-ai (runEmbeddedPiAgent) | dist/agent-runner.runtime-*.js | No. After awaiting the stage, it emits one stream: "assistant" event with the full assembled result.payloads[0].text

The provider-stream layer (dist/provider-stream-*.js) does produce text_delta, thinking_delta, and toolcall_delta events while the upstream provider's SSE is being consumed (content_block_delta / text_delta and friends are normalized there). They simply never reach the agent event bus for HTTP-spawned non-CLI runs. @mariozechner/pi-ai exposes both stream() (returns AssistantMessageEventStream) and complete() (returns Promise<AssistantMessage>) — the embedded runner consumes the result in complete()-shape and only emits one event after the message is fully assembled.
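
A toy model of the behaviour just described may help show where the deltas get lost. It does not use the real @mariozechner/pi-ai API: providerStream and runCompleteShape are hypothetical stand-ins, and a synchronous generator replaces the real event stream.

```javascript
// Hypothetical stand-in for the provider-stream layer: it yields normalized
// text_delta events and logs each one as it is produced.
function* providerStream(log) {
  for (const delta of ["Hey", "! How can I help?"]) {
    log.push(`provider text_delta: ${delta}`);
    yield { type: "text_delta", delta };
  }
}

// complete()-shaped consumption: drain the stream, assemble the text, and
// only then emit a single stream: "assistant" event on the agent event bus.
function runCompleteShape(log, emitAgentEvent) {
  let text = "";
  for (const event of providerStream(log)) {
    if (event.type === "text_delta") text += event.delta;
  }
  emitAgentEvent({ stream: "assistant", data: { text } });
  return { payloads: [{ text }] };
}

const log = [];
runCompleteShape(log, (event) => log.push(`bus assistant: ${event.data.text}`));
console.log(log);
```

The log shows both provider deltas first and the single bus event last, which matches the symptom: deltas exist upstream, but the HTTP listener only hears about them once the message is complete.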

Suggested fix

The fix is symmetric to what already exists for CLI runs and is fully opt-in — runs that don't request streaming, runs already covered by CLI / ACP paths, and consumers of the embedded runner that don't pass the new callback all keep their current behaviour exactly.

  1. Add an onAssistantDelta opt to runEmbeddedPiAgent's parameter shape (dist/agent-runner.runtime-*.js, alongside the existing onPartialReply, onAssistantMessageStart, onReasoningStream, onReasoningEnd, onAgentEvent opts). Same signature as the CLI path: ({ text, delta }) => void. Ignored when not provided.

  2. Inside the embedded runner, when consuming pi-ai stream events, fire the callback for each text_delta. The existing provider-stream-*.js already produces normalised text_delta events; route them to the new opt before they are folded into the assembled assistant message. Plumb the same way for thinking_delta (via onReasoningStream, which already exists) and toolcall_delta (via a new onToolCallDelta opt, optional, mirrors onAssistantDelta). The assembled AssistantMessage continues to be returned as today so post-run logic — tool dispatch, fallback retry, transcript persistence, usage accounting — is untouched.

  3. In the HTTP endpoint (dist/openai-http-*.js, stream: true branch), pass an onAssistantDelta callback on the commandInput that calls emitAgentEvent({ runId, stream: "assistant", data: { text, delta } }). This is the exact bridge already used by the CLI onAssistantDelta hook — the existing onAgentEvent listener in the same file already handles those events and writes them as chat.completion.chunk deltas. The buffered fallback (if (!sawAssistantDelta)) remains in place as a safety net for any path that still doesn't surface deltas (e.g. providers without streaming support, fallback-retry edge cases). Also pass onToolCallDelta so streamed tool_calls chunks are emitted in the OpenAI shape rather than landing in the post-run buffer; this also closes #54174.

  4. Lifecycle and abort behaviour are unchanged. The lifecycle "end" / "error" events that drive requestFinalize() are still emitted at the same point (after the embedded run resolves or rejects). watchClientDisconnect still aborts the run on early client disconnect. streamIncludeUsage / stream_options.include_usage still injects the trailing usage chunk after the stop chunk.

  5. No change needed for: the non-streaming branch (if (!stream) continues to await the final result and sendJson it as today), CLI and ACP code paths (already streaming correctly), session persistence, transcript writing, tool execution, fallback selection, prompt caching, or any consumer of runEmbeddedPiAgent that doesn't pass the new callback (callers including pipeline.runtime-*.js's Discord/Slack draft preview keep their existing onPartialReply semantics).

The change touches three modules — agent-runner.runtime-*.js (accept and propagate the callback), the embedded pi-ai stage helper (fire the callback during stream consumption), and openai-http-*.js (supply the callback) — and is gated behind a callback that defaults to undefined, so the diff is small and the blast radius is limited to consumers that opt in.
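
The opt-in plumbing in steps 1 to 3 could look roughly like the sketch below. It is a miniature under stated assumptions rather than the actual patch: runEmbedded, the events array, busEvents, and the runId value are hypothetical, and only the ({ text, delta }) callback shape and the emitAgentEvent({ runId, stream: "assistant", data: { text, delta } }) bridge are taken from the issue text.

```javascript
// Hypothetical miniature of the embedded runner with the proposed opt.
// `events` stands in for the normalized provider stream.
function runEmbedded(events, opts = {}) {
  let text = "";
  for (const event of events) {
    if (event.type !== "text_delta") continue;
    text += event.delta;
    // Fire the callback before the delta is folded into the final message.
    // Optional chaining keeps callers that pass no callback unaffected.
    opts.onAssistantDelta?.({ text, delta: event.delta });
  }
  // The assembled message is still returned, so post-run logic is untouched.
  return { payloads: [{ text }] };
}

// Stand-in for the agent event bus used by the HTTP endpoint.
const busEvents = [];
const emitAgentEvent = (event) => busEvents.push(event);

const events = [
  { type: "text_delta", delta: "Hey" },
  { type: "text_delta", delta: "!" },
];

// HTTP-side caller: opts in and bridges each delta onto the bus.
const runId = "run-1";
const streamed = runEmbedded(events, {
  onAssistantDelta: ({ text, delta }) =>
    emitAgentEvent({ runId, stream: "assistant", data: { text, delta } }),
});

// Existing caller: passes no callback and sees the same final result.
const legacy = runEmbedded(events);
console.log(busEvents.length, streamed.payloads[0].text, legacy.payloads[0].text);
```

Gating on an optional callback is what keeps the blast radius small: the legacy call produces no bus events and the same assembled text as before.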

Why it matters

/v1/chat/completions is the natural integration surface for OpenAI-compatible third-party clients. Today, perceived latency on that surface is bounded below by full agent-run time for any agent whose primary provider is anthropic-messages (and other non-CLI providers), even when both the upstream model and the openai-http SSE encoder are individually capable of true streaming. Prompt caching does not help — measurements show 99.99% cache hit (prompt_tokens: 3, total_tokens: 69357) producing the same 9s TTFB. Closing this gap brings the embedded-pi-ai HTTP path to parity with the CLI and ACP paths and makes streaming actually mean streaming for OpenAI-compat consumers.

Environment

  • Image: ghcr.io/openclaw/openclaw:latest (snapshot 2026-05-05).
  • Provider exercised: anthropic (api: anthropic-messages), model claude-haiku-4-5-20251001. Same symptom expected for any non-CLI provider routed through runEmbeddedPiAgent.
  • Deploy: single-host docker compose, Linux x86_64.
  • Confirmed with curl direct against the gateway port (no proxy, no Tailscale, no network hops): TTFB matches total.

extent analysis

TL;DR

The most likely fix is to add an onAssistantDelta callback to runEmbeddedPiAgent and have the HTTP endpoint supply it, so that runs through non-CLI providers stream their deltas.

Guidance

  • Add an optional onAssistantDelta option to runEmbeddedPiAgent that receives text deltas from the pi-ai stream.
  • Modify the embedded pi-ai stage helper to fire the onAssistantDelta callback for each text_delta during stream consumption.
  • In the stream: true branch of the HTTP endpoint, supply an onAssistantDelta callback that calls emitAgentEvent with stream: "assistant".
  • Verify that TTFB drops to roughly the upstream model's TTFB (1–2s for Haiku at this prompt size) and that content chunks are flushed as the provider produces them.

Example

No code snippet is provided as the issue already includes a detailed suggested fix.

Notes

The fix is specific to the embedded pi-ai runner and does not affect other code paths. The change is gated behind a callback that defaults to undefined, limiting the blast radius to consumers that opt in.

Recommendation

Apply the suggested fix: add the onAssistantDelta callback to runEmbeddedPiAgent and supply it from the HTTP endpoint. This enables streaming for non-CLI providers and brings perceived latency down from full agent-run time to roughly upstream TTFB.
