openclaw - ✅(Solved) Fix [Bug]: Agent runner always returns payloads=0, stopReason=stop [1 pull request, 2 comments, 2 participants]

openclaw · 2026-05-13 15:20:42

GitHub stats

openclaw/openclaw#81458 • Fetched 2026-05-14 03:32:01

Author: dwayner79-a11y

Participants: clawsweeper[bot], dwayner79-a11y

Timeline: commented ×2, closed ×1, cross-referenced ×1, labeled ×1

- OpenClaw 2026.5.7 on macOS arm64
- openclaw infer model run works
- Agent runner always returns payloads=0, stopReason=stop with zero tokens
- Affects both claude-opus-4-7 and claude-sonnet-4-6
- Error hash sha256:f97d506771d3 (sonnet) and sha256:f97d506771d3 (opus)

Error Message

Error hash sha256:f97d506771d3 (sonnet) and sha256:f97d506771d3 (opus)

Root Cause

Fix Action

Fixed

Fixed by PR: fix(embedded-runner): actionable diagnostic for empty-stream config errors (https://github.com/openclaw/openclaw/pull/69346)

PR fix notes

PR #69346: fix(embedded-runner): actionable diagnostic for empty-stream config errors

Repository: openclaw/openclaw
Author: abajirao
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/69346

Description (problem / solution / changelog)

Iteration log (review-driven changes since open)

This PR has been through three rounds of bot review. Summary for human reviewers:

| Commit | Source | Change |
| --- | --- | --- |
| `a0426ed6c8` | Initial | Original `isLikelyConfigErrorEmptyStream` + `buildConfigErrorDiagnosticText` + 13 unit tests |
| `04ea0af035` | `chatgpt-codex-connector` (P2) | Route `assistant.usage` through `normalizeUsage()` before `hasNonzeroUsage()` so providers reporting via `totalTokens` (e.g. OpenAI WS conversion) aren't misclassified as zero-token config errors |
| `fa33fa99e6` | (rebase) | Rebased onto current `main` (1,415 upstream commits); no logic changes |
| `83082603d7` | `greptile-apps` (P2 ×2) | (a) Scope the "OpenRouter uses /api/v1, not /v1" hint to OpenRouter providers (and unknown - it's the most common cause); other providers no longer see misleading advice. (b) Bundle the leading blank line into the identity block so missing provider/model doesn't render as two consecutive blank lines |
| `d56b404b3b` | `chatgpt-codex-connector` (P2) + CI | (a) Require `normalizeUsage` to return a defined object as positive proof the assistant reported usage telemetry; missing-usage now defers to the generic "try again" path instead of misdiagnosing usage-silent providers. (b) Add `hasNonzeroUsage` to the `vi.doMock("../usage.js", ...)` in `run.overflow-compaction.harness.ts` (fixes `checks-node-agentic-agents` lane that broke after the import was added in `04ea0af035`) |
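
The two usage-related review fixes in the iteration log can be illustrated with a minimal sketch. The type shapes, the `totalTokens` fallback, and the `isProvenZeroUsage` helper below are assumptions made for illustration; the real `normalizeUsage` and `hasNonzeroUsage` in `usage.js` may have different signatures and behavior.

```typescript
// Hypothetical raw usage shape; field names follow the PR text, not the real source.
interface RawUsage {
  input?: number;
  output?: number;
  cacheRead?: number;
  cacheWrite?: number;
  total?: number;
  totalTokens?: number; // some providers may report only this field
}

interface NormalizedUsage {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
  total: number;
}

// Returns undefined when no usage telemetry was reported at all,
// so callers can tell "reported zero" apart from "reported nothing".
function normalizeUsage(raw: RawUsage | undefined): NormalizedUsage | undefined {
  if (raw === undefined) return undefined;
  const fields = [raw.input, raw.output, raw.cacheRead, raw.cacheWrite, raw.total, raw.totalTokens];
  if (fields.every((value) => value === undefined)) return undefined;
  const input = raw.input ?? 0;
  const output = raw.output ?? 0;
  const cacheRead = raw.cacheRead ?? 0;
  const cacheWrite = raw.cacheWrite ?? 0;
  const total = raw.total ?? raw.totalTokens ?? input + output + cacheRead + cacheWrite;
  return { input, output, cacheRead, cacheWrite, total };
}

// True when any normalized counter is above zero.
function hasNonzeroUsage(usage: NormalizedUsage): boolean {
  return [usage.input, usage.output, usage.cacheRead, usage.cacheWrite, usage.total].some((n) => n > 0);
}

// Hypothetical helper: zero usage counts as evidence only when usage was actually reported.
function isProvenZeroUsage(raw: RawUsage | undefined): boolean {
  const usage = normalizeUsage(raw);
  return usage !== undefined && !hasNonzeroUsage(usage);
}
```

Under these assumptions, a provider that reports only `totalTokens: 42` is treated as having real usage, and a provider that reports no usage object is not treated as proof of a config error.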

All bot review threads are now resolved. Test count grew from 13 → 20 with regression coverage for every guardrail branch added during iteration.

The only currently failing CI check is the job "Run the GPT-5.4 / Opus 4.6 parity gate against the qa-lab mock" (subagent-fanout-synthesis scenario). The same scenario also fails on PR #69946 and other recent PRs against main, so this is a pre-existing flake unrelated to the patch.

codex review --base origin/main was run locally on the initial submission and reported no actionable issues; subsequent iterations were driven by the bots that ran server-side.

Summary

Problem: When a streaming response terminates with stopReason=stop but produces zero content, zero thinking, zero tool calls, and zero tokens in/out, OpenClaw surfaces the generic ⚠️ Agent couldn't generate a response. Please try again. message. That's almost always a provider misconfiguration (wrong baseUrl, bad API key, unknown model id), not a legitimate empty model reply - but the generic message hides the real cause and operators retry indefinitely against a broken endpoint.
Why it matters: Real incident that cost me hours to root-cause: an agent's models.json had https://openrouter.ai/v1 instead of https://openrouter.ai/api/v1. OpenRouter's marketing site serves HTML at /v1 with HTTP 200, the SSE parser found zero data: events, and the stream "completed" cleanly. Every Telegram reply came back as the generic retry message. No log or UI hint pointed at the URL.
What changed: Added a conservative detector (isLikelyConfigErrorEmptyStream) and a diagnostic-message builder (buildConfigErrorDiagnosticText) in incomplete-turn.ts. When resolveIncompleteTurnPayloadText sees the empty-stream fingerprint, it now surfaces an actionable message naming the provider + model and listing the most common causes (wrong baseUrl, invalid key, unknown model id, network returning HTML).
What did NOT change (scope boundary): No changes to stream parsing, HTTP transport, error-stop handling (stopReason === "error" still flows through the provider's own errorMessage), reasoning-only retry logic, empty-response retry logic, or the "side effects may have executed" caveat. The new helpers are pure, the guardrails are strict, and every existing test path is preserved.

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

Closes #
Related #
This PR fixes a bug or regression

(Not filing a separate issue first - CONTRIBUTING.md says "Bugs & small fixes → Open a PR!")

Root Cause (if applicable)

Root cause: resolveIncompleteTurnPayloadText treats a "normal stop with zero payloads" identically to a reasoning-only / empty-response turn and returns the generic "try again" text. That's the correct default when the model ran and legitimately returned nothing, but it's the wrong signal when the HTTP path was completely broken (zero tokens ever reached the model). The existing code has no way to distinguish the two cases because both land on the same stopReason: "stop", content: [], assistantTexts: [] shape.
Missing detection / guardrail: Usage tokens were never checked. Any legitimate model turn (including reasoning-only and empty answers) reports nonzero input tokens - so usage.input === 0 && usage.output === 0 is a reliable signal that the model endpoint was never actually reached.
Contributing context: OpenRouter is forgiving of bad paths - it serves HTML at /v1 with HTTP 200 rather than a 404, so the SSE parser sees an apparently-successful stream with no events, and nothing upstream in the stack raises an error.
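
To see why an HTML page served with HTTP 200 looks like a cleanly finished stream, consider a deliberately simplified SSE scan. This is a sketch only: `extractSseDataEvents` and both sample bodies are hypothetical, and a real SSE parser also handles multi-line events, comments, and chunk boundaries.

```typescript
// Simplified SSE scan: collect the payload of every line that starts with "data:".
function extractSseDataEvents(body: string): string[] {
  const events: string[] = [];
  for (const line of body.split(/\r?\n/)) {
    if (line.startsWith("data:")) {
      events.push(line.slice("data:".length).trim());
    }
  }
  return events;
}

// Hypothetical HTML body, standing in for a marketing page returned with HTTP 200.
const htmlBody = "<!DOCTYPE html>\n<html><head><title>Home</title></head>\n<body>Welcome</body></html>";

// Hypothetical streaming body from a correctly configured endpoint.
const sseBody = 'data: {"choices":[{"delta":{"content":"hi"}}]}\n\ndata: [DONE]\n';

const htmlEvents = extractSseDataEvents(htmlBody); // no "data:" lines, so no events
const sseEvents = extractSseDataEvents(sseBody); // two events, the last one is "[DONE]"
```

With no events and no transport error, the caller has nothing to report except an empty, apparently successful turn, which matches the `stopReason: "stop"`, `content: []` shape described above.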

Regression Test Plan (if applicable)

Coverage level that should have caught this:
- Unit test
- Seam / integration test
- End-to-end test
- Existing coverage already sufficient
Target test or file: src/agents/pi-embedded-runner/run/incomplete-turn.config-error.test.ts (new).
Scenario the test should lock in: An attempt with payloadCount === 0, zero assistantTexts, zero content blocks, zero-token usage, and stopReason === "stop" (or undefined) yields the detailed diagnostic; any of the negative cases (nonzero usage, nonzero texts, stopReason === "error", populated content, missing assistant) falls back to the previous generic text.
Why this is the smallest reliable guardrail: The helpers are pure functions over the attempt shape, so unit tests can exercise every guardrail branch deterministically without a transport seam or a live provider.
Existing test that already covers this (if any): None - incomplete-turn.ts had no direct unit test file before this PR. The resolver was covered only indirectly via run.ts-level integration paths, which didn't have an empty-stream config-error fixture.
If no new test is added, why not: N/A - 13 new cases added.

User-visible / Behavior Changes

The error rendered when the fingerprint triggers changes from:

⚠️ Agent couldn't generate a response. Please try again.

to:

⚠️ Provider returned an empty stream (0 content, 0 tokens).
This usually indicates a configuration error, not a model issue.

  Provider: openrouter
  Model:    moonshotai/kimi-k2.5

Common causes:
  1. Wrong baseUrl - check agents/<id>/agent/models.json
     (OpenRouter uses https://openrouter.ai/api/v1, not /v1)
  2. Invalid or expired API key for the provider
  3. Model id not recognized by the selected provider
  4. Network path returning HTML (e.g., captive portal, proxy)

Run `openclaw doctor` to inspect provider configuration.

No config or default changes. No other paths alter their output - the existing generic fallback still fires for every non-config-error empty case (reasoning-only turns, empty legitimate answers, error-stop, etc.).
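
The message layout above, together with the review fixes about hint scoping and blank-line handling, could be assembled roughly as follows. This is a sketch under assumptions: `buildDiagnosticSketch` and `DiagnosticIdentity` are hypothetical names, and the actual `buildConfigErrorDiagnosticText` in `incomplete-turn.ts` may be structured differently.

```typescript
// Hypothetical input shape: provider and model may each be unknown.
interface DiagnosticIdentity {
  provider?: string;
  model?: string;
}

function buildDiagnosticSketch(id: DiagnosticIdentity): string {
  const lines: string[] = [
    "⚠️ Provider returned an empty stream (0 content, 0 tokens).",
    "This usually indicates a configuration error, not a model issue.",
  ];

  // The identity block carries its own leading blank line, so omitting it
  // does not leave two consecutive blank lines in the output.
  if (id.provider !== undefined || id.model !== undefined) {
    lines.push("");
    if (id.provider !== undefined) lines.push(`  Provider: ${id.provider}`);
    if (id.model !== undefined) lines.push(`  Model:    ${id.model}`);
  }

  lines.push("", "Common causes:", "  1. Wrong baseUrl - check agents/<id>/agent/models.json");

  // Show the OpenRouter-specific hint only for OpenRouter or an unknown provider.
  const showOpenRouterHint =
    id.provider === undefined || id.provider.toLowerCase().includes("openrouter");
  if (showOpenRouterHint) {
    lines.push("     (OpenRouter uses https://openrouter.ai/api/v1, not /v1)");
  }

  lines.push(
    "  2. Invalid or expired API key for the provider",
    "  3. Model id not recognized by the selected provider",
    "  4. Network path returning HTML (e.g., captive portal, proxy)",
    "",
    "Run `openclaw doctor` to inspect provider configuration.",
  );
  return lines.join("\n");
}
```

Calling `buildDiagnosticSketch({ provider: "openrouter", model: "moonshotai/kimi-k2.5" })` reproduces the block shown above, while a non-OpenRouter provider gets the same checklist without the `/api/v1` hint.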

Diagram (if applicable)

Before:
[stream ends, 0 payloads, 0 tokens] -> [generic "try again"] -> [operator retries against broken endpoint]

After:
[stream ends, 0 payloads, 0 tokens] -> isLikelyConfigErrorEmptyStream?
                                        |-- yes -> [provider/model + fix checklist]
                                        |-- no  -> [generic "try again"] (unchanged)

Security Impact (required)

New permissions/capabilities? No
Secrets/tokens handling changed? No
New/changed network calls? No
Command/tool execution surface changed? No
Data access scope changed? No

The patch is limited to error-message composition from already-available attempt state (provider id, model id, usage). No new I/O, no secrets in the message, no behavior change to the replay-safety or side-effects accounting paths.

Repro + Verification

Environment

OS: macOS 15.4 (Darwin 25.4.0)
Runtime/container: Node 22.22.2, local openclaw gateway (launchd)
Model/provider: openrouter/moonshotai/kimi-k2.5
Integration/channel: Telegram direct chat
Relevant config (redacted): an agent's models.json baseUrl set to https://openrouter.ai/v1 (missing /api/)

Steps

Set a provider's baseUrl in ~/.openclaw/agents/<id>/agent/models.json to https://openrouter.ai/v1.
Restart the gateway.
Send a Telegram message routed to that provider.
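
Before restarting the gateway, the misconfiguration from step 1 can be spotted with a small static check. This helper is hypothetical and not part of OpenClaw or this PR; it only encodes the one fact stated in the diagnostic, that OpenRouter's API lives under `/api/v1`.

```typescript
// Hypothetical pre-flight check, not part of OpenClaw: flags an OpenRouter
// baseUrl whose path is anything other than /api/v1.
function hasSuspiciousOpenRouterPath(baseUrl: string): boolean {
  let url: URL;
  try {
    url = new URL(baseUrl);
  } catch {
    return true; // an unparseable URL is suspicious by definition
  }
  if (url.hostname !== "openrouter.ai") return false; // only OpenRouter is checked here
  const path = url.pathname.replace(/\/+$/, ""); // ignore trailing slashes
  return path !== "/api/v1";
}
```

For the value used in step 1, `hasSuspiciousOpenRouterPath("https://openrouter.ai/v1")` returns `true`, while the corrected `https://openrouter.ai/api/v1` returns `false`.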

Expected

User-facing reply names the provider/model and tells the operator to check baseUrl / key / model id / network.

Actual (before patch)

User-facing reply: generic ⚠️ Agent couldn't generate a response. Please try again. - no hint at the URL being wrong.

Actual (after patch)

User-facing reply: full diagnostic block shown in the "User-visible / Behavior Changes" section above, which immediately pointed at the /v1 vs /api/v1 issue.

Evidence

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording
Perf numbers (if relevant)

Log trace from the original incident showing the silent-drop pattern (30ms total, 0 chunks, stopReason=stop for a wrong-URL OpenRouter call):

[STREAM-DISPATCH] api=openai-completions provider=openrouter modelId=anthropic/claude-sonnet-4.5 baseUrl=https://openrouter.ai/v1
[openai-completions END] chunks=0 elapsedMs=36 blocks=0 types= stopReason=stop usage={"input":0,"output":0,...} aborted=false
[agent/embedded] incomplete turn detected: runId=… sessionId=… stopReason=stop payloads=0 - surfacing error to user

(The [STREAM-DISPATCH] / [openai-completions END] lines are from temporary debug instrumentation I added during triage - not part of this PR.)

Tests: 13 new cases in incomplete-turn.config-error.test.ts covering the detector, the builder, and the resolver integration with both positive and negative fixtures.

Human Verification (required)

Verified scenarios:
- New diagnostic renders end-to-end on a live gateway after deploying this commit locally - reproduced the original incident with a deliberately-wrong baseUrl and confirmed the new message renders with correct provider/model.
- Generic "try again" still fires for a legitimate nonzero-usage empty-content turn (simulated by reasoning-only fixture in the test).
- Generic "try again" with the side-effects caveat still fires when replayMetadata.hadPotentialSideEffects === true.
- stopReason === "error" path untouched - assistant error message still flows through the provider's errorMessage.
Edge cases checked:
- lastAssistant undefined → detector returns false (no misfire).
- currentAttemptAssistant with real usage beats an empty lastAssistant → detector returns false (tested).
- Any nonzero field among input/output/cacheRead/cacheWrite/total in usage → false (covered by hasNonzeroUsage).
- Nonempty assistantTexts (partial tool-use turn already captured) → false.
- Nonempty content blocks (thinking block without visible text) → false.
What I did not verify: I did not write an end-to-end test that exercises the real openai-completions transport against a wrong URL - the helpers are pure and the integration path is covered by the existing run.ts call site plus the unit-level resolver test. I also did not inspect Codex-specific error paths because those don't enter this resolver (Codex runs through its own pipeline).

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR. (No reviews yet at time of opening.)
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? Yes
Config/env changes? No
Migration needed? No
Upgrade steps: N/A - the new message is strictly more informative than the old one in exactly the narrow case that was previously a dead-end, and every other path is byte-identical.

Risks and Mitigations

Risk: False positive - a legitimate model response that somehow ends with stopReason=stop + 0 content + 0 tokens would now get the config-error message instead of the generic one.
- Mitigation: All public providers I know of report nonzero input tokens once the request reaches the model (even if output is empty). The guardrails require all of (a) payloadCount === 0, (b) empty assistantTexts, (c) empty content, (d) zero usage across every normalized field (hasNonzeroUsage checks input / output / cacheRead / cacheWrite / total), (e) assistant exists, (f) stopReason !== "error" and stopReason !== "toolUse" etc. - so the only realistic way to hit this path is when the request never actually reached the model, which is the case the message is designed for. Existing paths for reasoning-only and empty-response retries still sit upstream of this detector and fire first when their own conditions match.
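
The guardrails (a) through (f) listed in the mitigation can be read as a chain of early returns. The sketch below is an illustration under assumed type shapes; `looksLikeConfigErrorEmptyStream`, `AttemptSketch`, and the other names are hypothetical, and the real `isLikelyConfigErrorEmptyStream` may check additional state.

```typescript
// Hypothetical shapes for illustration; the real attempt and assistant types differ.
interface UsageSketch {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
  total: number;
}

interface AssistantSketch {
  stopReason?: string;
  content: unknown[];
  usage?: UsageSketch; // assumed to be already normalized
}

interface AttemptSketch {
  payloadCount: number;
  assistantTexts: string[];
  assistant?: AssistantSketch;
}

function looksLikeConfigErrorEmptyStream(attempt: AttemptSketch): boolean {
  const assistant = attempt.assistant;
  if (assistant === undefined) return false; // (e) an assistant message must exist
  if (attempt.payloadCount !== 0) return false; // (a) no payloads
  if (attempt.assistantTexts.length > 0) return false; // (b) no captured texts
  if (assistant.content.length > 0) return false; // (c) no content blocks
  // (f) only a plain stop (or an absent stop reason) qualifies; "error", "toolUse", etc. do not
  if (assistant.stopReason !== undefined && assistant.stopReason !== "stop") return false;
  const usage = assistant.usage;
  if (usage === undefined) return false; // missing telemetry is not proof of zero usage
  // (d) every normalized usage field must be zero
  const anyNonzero = [usage.input, usage.output, usage.cacheRead, usage.cacheWrite, usage.total]
    .some((n) => n > 0);
  return !anyNonzero;
}
```

Because every check can only turn the result into `false`, adding another guardrail narrows the detector and never widens it, which is what keeps the false-positive risk low.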

AI-assisted disclosure

AI-assisted - pair-debugged the original incident and authored this patch with Claude Opus 4.7 / Claude Code.
Lightly tested - deployed to my local gateway end-to-end (reproduced the wrong-URL scenario); 13 unit cases in the new test file; full negative coverage for the detector's guardrails.
I understand the code.
pnpm check / pnpm build / pnpm tsgo / pnpm lint / pnpm format - green.
codex review --base origin/main run locally - no actionable issues.

Changed files

src/agents/pi-embedded-runner/run.overflow-compaction.harness.ts (modified, +10/-0)
src/agents/pi-embedded-runner/run/incomplete-turn.config-error.test.ts (added, +326/-0)
src/agents/pi-embedded-runner/run/incomplete-turn.ts (modified, +142/-0)


Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

Summary

Steps to reproduce

Open the browser and ask anything, no response.

Expected behavior

get a response

Actual behavior

The agent in the browser never responds.

OpenClaw version

2026.5.7

Operating system

macOS 26.5

Install method

No response

Model

claude-opus-4-7 and claude-sonnet-4-6

Provider / routing chain

Don't know what this means.

Additional provider/model setup details

No response

Logs, screenshots, and evidence

Impact and severity

No response

Additional information

No response
