openclaw - ✅(Solved) Fix [Bug]: Claude CLI no-output timeout is collapsed into generic Telegram error; operators cannot distinguish timeout from gateway/provider failure [1 pull request, 2 comments, 3 participants]

GitHub stats
openclaw/openclaw#77007 · Fetched 2026-05-04 04:59:31
Comments: 2 · Participants: 3 · Timeline: 4 · Reactions: 2
Timeline (top): commented ×2, cross-referenced ×1, referenced ×1

When OpenClaw runs through the Claude CLI backend (agentRuntime.id = "claude-cli") and a turn produces no output for the configured CLI watchdog window, Telegram receives only the generic message:

Something went wrong while processing your request. Please try again, or use /new to start a fresh session.

The gateway remains healthy and the CLI OAuth path is valid, but the user/operator cannot distinguish a CLI no-output timeout from provider billing, auth, gateway downtime, or a Telegram channel failure. This is especially confusing for long-running Claude CLI Opus/resume sessions where 300s can be a normal-but-slow turn rather than a broken provider.

Error Message

[agent/cli-backend] claude live session turn failed: provider=claude-cli model=claude-opus-4-7 durationMs=300003 error=FailoverError

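Root Cause

The reply builder returns the generic failure text for most embedded-run failures unless verbose failures are enabled, so the CLI timeout strings that appear in the gateway logs (for example, "CLI exceeded timeout (300s) and was terminated.") never reach Telegram. Raising the timeouts mitigates premature aborts, but it does not fix the platform UX: operators must still inspect gateway logs to understand what happened.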

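Fix / Workaround

Environment

  • OpenClaw: 2026.4.29 CLI installed; gateway service description pinned to v2026.4.25
  • Platform: Linux server, systemd user service
  • Channel: Telegram bot
  • Runtime: claude-cli via OAuth/subscription
  • Primary model: anthropic/claude-opus-4-7
  • agents.defaults.agentRuntime.id = "claude-cli"
  • agents.defaults.timeoutSeconds = 300 before mitigation
  • CLI backend watchdog before mitigation: noOutputTimeoutMs = 300000 for both fresh and resume sessions

Local mitigation applied

agents.defaults.timeoutSeconds was raised to 900 and both watchdog noOutputTimeoutMs values (fresh and resume) to 900000. This only reduces premature aborts; it does not change the message Telegram shows.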

PR fix notes

PR #77015: fix(agent-reply): surface Claude CLI timeout copy (fixes #77007)

Description (problem / solution / changelog)

Root cause

buildExternalRunFailureReply() treated most embedded-run failures as generic whenever verbose failures were off, returning GENERIC_EXTERNAL_RUN_FAILURE_TEXT. Claude CLI emits stable gateway strings (CLI exceeded timeout (\d+s)... / CLI produced no output...) that appeared in gateway logs but collapsed to generic channel copy unless verbose failure detail was enabled.

Why this fix is safe

  • Visible text uses regex-captured durations plus fixed guidance; arbitrary payloads are not echoed verbatim.
  • Matches only aggregated errors that contain internally generated CLI timeout templates; existing billing/rate-limit/OAuth/session branching stays unchanged.
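The carve-out described above can be sketched roughly as follows. This is an illustration, not the code from the PR: the function name, constant names, and the wording of the specific replies are hypothetical. Only the generic failure text and the two message prefixes are taken from this report, and the full "no output" template is elided in the PR description, so the sketch matches its prefix only.

```typescript
// Hypothetical names; not the actual implementation in agent-runner-execution.ts.
const GENERIC_FAILURE_TEXT =
  "Something went wrong while processing your request. Please try again, or use /new to start a fresh session.";

// Matches e.g. "CLI exceeded timeout (300s) and was terminated."
const CLI_TIMEOUT_RE = /CLI exceeded timeout \((\d+)s\)/;
// The remainder of this template is not shown in the source, so only the prefix is matched.
const CLI_NO_OUTPUT_RE = /CLI produced no output/;

function buildFailureReplySketch(errorText: string): string {
  const timeout = CLI_TIMEOUT_RE.exec(errorText);
  if (timeout) {
    // Only the captured digits are interpolated; the raw error text is never echoed.
    return `Claude CLI timed out after ${timeout[1]}s. Try /new or increase the configured timeout.`;
  }
  if (CLI_NO_OUTPUT_RE.test(errorText)) {
    return "Claude CLI produced no output before the watchdog fired. Try /new or increase the configured timeout.";
  }
  // Everything else keeps the existing generic copy.
  return GENERIC_FAILURE_TEXT;
}
```

Because the visible text is built from a fixed template plus a numeric capture, an unexpected payload in the error string falls through to the generic reply rather than being forwarded to the channel.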

Security / runtime controls (unchanged)

  • No bypass of auth, channel ACLs, SSRF defenses, sandboxing, pairing or approval semantics, or model/routing policy.
  • Verbosity still gates detailed raw forwarding; this adds a narrow carve-out comparable to OAuth / missing-API-key copy.

Tests

  • pnpm install
  • pnpm exec vitest run src/auto-reply/reply/agent-runner-execution.test.ts -t "Claude CLI"
  • pnpm check:changed
  • git diff --check

Out of scope / follow-ups

  • Non-Claude CLI backends may emit different stall strings — only matching OpenClaw CLI … literals are classified here.
  • Broader diagnostics (gateway vs delivery vs billing) from the issue are not attempted.
  • Telegram transport fallback on delivery-only failures is untouched.

fixes #77007

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/auto-reply/reply/agent-runner-execution.test.ts (modified, +49/-0)
  • src/auto-reply/reply/agent-runner-execution.ts (modified, +35/-0)


Summary

When OpenClaw runs through the Claude CLI backend (agentRuntime.id = "claude-cli") and a turn produces no output for the configured CLI watchdog window, Telegram receives only the generic message:

Something went wrong while processing your request. Please try again, or use /new to start a fresh session.

The gateway remains healthy and the CLI OAuth path is valid, but the user/operator cannot distinguish a CLI no-output timeout from provider billing, auth, gateway downtime, or a Telegram channel failure. This is especially confusing for long-running Claude CLI Opus/resume sessions where 300s can be a normal-but-slow turn rather than a broken provider.

Environment

  • OpenClaw: 2026.4.29 CLI installed; gateway service description pinned to v2026.4.25
  • Platform: Linux server, systemd user service
  • Channel: Telegram bot
  • Runtime: claude-cli via OAuth/subscription
  • Primary model: anthropic/claude-opus-4-7
  • agents.defaults.agentRuntime.id = "claude-cli"
  • agents.defaults.timeoutSeconds = 300 before mitigation
  • CLI backend watchdog before mitigation:
{
  "fresh": { "noOutputTimeoutMs": 300000 },
  "resume": { "noOutputTimeoutMs": 300000 }
}

Observed incidents

The following user-visible Telegram errors all mapped to the same gateway-side timeout pattern:

2026-05-03 19:22:51 UTC
2026-05-03 22:34:27 UTC
2026-05-03 22:51:55 UTC
2026-05-03 23:58:18 UTC

Gateway log pattern:

[agent/cli-backend] claude live session close: provider=claude-cli model=claude-opus-4-7 reason=abort
[agent/cli-backend] claude live session turn failed: provider=claude-cli model=claude-opus-4-7 durationMs=300003 error=FailoverError
[model-fallback/decision] model fallback decision: decision=candidate_failed requested=anthropic/claude-opus-4-7 candidate=anthropic/claude-opus-4-7 reason=timeout next=none detail=CLI exceeded timeout (300s) and was terminated.
Embedded agent failed before reply: CLI exceeded timeout (300s) and was terminated.
[telegram] sendMessage ok chat=<redacted> message=<id>
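Until the channel message carries the failure class, operators have to read these lines by hand. A minimal sketch of pulling the key=value fields out of one such line is shown below. The function name is hypothetical, and the field layout is inferred only from the sample lines above, so it may not hold for other log formats.

```typescript
// Hypothetical helper; the layout is inferred from the sample log lines in this report.
function parseGatewayLogFields(line: string): Record<string, string> {
  const fields: Record<string, string> = {};
  let head = line;

  // "detail=" carries free text with spaces, so treat it as running to the end of the line.
  const marker = " detail=";
  const detailAt = line.indexOf(marker);
  if (detailAt !== -1) {
    fields["detail"] = line.slice(detailAt + marker.length);
    head = line.slice(0, detailAt);
  }

  // Remaining fields are whitespace-free key=value tokens.
  const pair = /(\w+)=(\S+)/g;
  let m: RegExpExecArray | null;
  while ((m = pair.exec(head)) !== null) {
    fields[m[1]] = m[2];
  }
  return fields;
}
```

For the model-fallback line above, this yields reason "timeout" and next "none", with the detail field holding the full "CLI exceeded timeout (300s) and was terminated." text.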

At the same time:

systemctl --user status openclaw-gateway.service => active/running
curl http://127.0.0.1:18788/health => {"ok":true,"status":"live"}

Direct Claude CLI smoke test worked:

claude --model claude-opus-4-7 --print "reply ok only"
# ok

So this was not gateway downtime and not a dead OAuth/CLI installation.

Expected behavior

  1. Telegram should surface the real failure class, for example:
     Claude CLI timed out after 300s with no output for claude-opus-4-7. Gateway is still healthy. Try /new, lower effort, use Sonnet, or increase agents.defaults.timeoutSeconds / cliBackends.claude-cli.reliability.watchdog.*.noOutputTimeoutMs.
  2. Status/diagnostics should distinguish:
     • gateway down
     • provider auth/billing failure
     • CLI subprocess no-output timeout
     • overall agent turn timeout
     • Telegram delivery failure
  3. For claude-cli backends, the runtime should consider a safer default for Opus/resume sessions or expose a clear recommendation when timeoutSeconds and CLI noOutputTimeoutMs are both 300s.
  4. If agentRuntime.id = "claude-cli", model-fallback/status messages should show the execution backend (claude-cli) prominently, not only the canonical model namespace (anthropic/...).
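One way to picture the distinction requested above is a tagged union with one variant per failure class. The sketch below is hypothetical: neither the type nor the function exists in OpenClaw as far as this report shows, and the message wording (apart from the timeout example, which follows the one above) is invented for illustration.

```typescript
// Hypothetical taxonomy mirroring the five cases listed above.
type FailureClass =
  | { kind: "gateway_down" }
  | { kind: "provider_auth_or_billing" }
  | { kind: "cli_no_output_timeout"; backend: string; model: string; timeoutSeconds: number }
  | { kind: "agent_turn_timeout"; timeoutSeconds: number }
  | { kind: "channel_delivery_failure"; channel: string };

// The exhaustive switch makes the compiler flag any class without a message.
function describeFailure(f: FailureClass): string {
  switch (f.kind) {
    case "gateway_down":
      return "Gateway is not reachable.";
    case "provider_auth_or_billing":
      return "Provider rejected the request (auth or billing).";
    case "cli_no_output_timeout":
      // Shows the execution backend, not only the canonical model namespace.
      return `${f.backend} timed out after ${f.timeoutSeconds}s with no output for ${f.model}. Gateway is still healthy.`;
    case "agent_turn_timeout":
      return `Agent turn exceeded ${f.timeoutSeconds}s.`;
    case "channel_delivery_failure":
      return `Reply could not be delivered to ${f.channel}.`;
  }
}
```

Keeping the class as data rather than a preformatted string would let the same value drive both the Telegram reply and a status/diagnostics view.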

Local mitigation applied

We increased the local deployment timeouts:

{
  "agents": {
    "defaults": {
      "timeoutSeconds": 900,
      "cliBackends": {
        "claude-cli": {
          "reliability": {
            "watchdog": {
              "fresh": { "noOutputTimeoutMs": 900000 },
              "resume": { "noOutputTimeoutMs": 900000 }
            }
          }
        }
      }
    }
  }
}

After restart:

/health => {"ok":true,"status":"live"}
runtime => {"id":"claude-cli"}
model unchanged => anthropic/claude-opus-4-7

This mitigates premature aborts, but it does not fix the platform UX: Telegram still collapses the root cause into a generic error and operators must inspect gateway logs to understand what happened.

Related issues

This appears related to, but broader than, #75264. That issue reports a specific Telegram + CLI backend stall during a file Read tool call. This report focuses on the generic platform behavior for Claude CLI no-output/turn timeouts and the lack of actionable error propagation to Telegram/status.

extent analysis

TL;DR

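Increase the timeoutSeconds and noOutputTimeoutMs values to reduce premature aborts of slow Claude CLI turns. This does not make Telegram distinguish CLI no-output timeouts from other failures; that requires the reply-text change in PR #77015.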

Guidance

  • Review the current agents.defaults.timeoutSeconds and cliBackends.claude-cli.reliability.watchdog settings to ensure they are adequate for the expected response times of the Claude CLI backend.
  • Consider increasing the timeoutSeconds value to a higher value (e.g., 900) to allow for longer-running CLI sessions.
  • Update the noOutputTimeoutMs values for fresh and resume sessions to match the increased timeoutSeconds value, keeping the units in mind (900 s corresponds to 900000 ms).
  • Monitor the gateway logs to verify that the increased timeouts are effective in reducing premature aborts and to identify any other potential issues.
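The unit mismatch between timeoutSeconds (seconds) and noOutputTimeoutMs (milliseconds) is easy to get wrong when editing the config by hand. The following sketch checks that the two agree, as the guidance above suggests. The interface mirrors the config fragment in this report; the function name is hypothetical, and treating equality as the target is the guidance given here, not a documented OpenClaw requirement.

```typescript
// Shape copied from the watchdog config shown in this report.
interface WatchdogConfig {
  fresh: { noOutputTimeoutMs: number };
  resume: { noOutputTimeoutMs: number };
}

// Hypothetical check: returns one message per session mode whose watchdog
// window differs from the overall turn timeout after unit conversion.
function checkTimeouts(timeoutSeconds: number, watchdog: WatchdogConfig): string[] {
  const mismatches: string[] = [];
  const turnMs = timeoutSeconds * 1000; // seconds to milliseconds
  for (const mode of ["fresh", "resume"] as const) {
    const ms = watchdog[mode].noOutputTimeoutMs;
    if (ms !== turnMs) {
      mismatches.push(`${mode}: noOutputTimeoutMs is ${ms} ms but timeoutSeconds is ${turnMs} ms`);
    }
  }
  return mismatches;
}
```

With the mitigated values from this report (900 and 900000 for both modes), the function returns an empty list.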

Example

{
  "agents": {
    "defaults": {
      "timeoutSeconds": 900,
      "cliBackends": {
        "claude-cli": {
          "reliability": {
            "watchdog": {
              "fresh": { "noOutputTimeoutMs": 900000 },
              "resume": { "noOutputTimeoutMs": 900000 }
            }
          }
        }
      }
    }
  }
}

Notes

The current implementation of the Claude CLI backend and the gateway may not provide sufficient error propagation to distinguish between different types of failures. Increasing the timeouts may help mitigate premature aborts, but a more comprehensive solution may be required to address the underlying issue.

Recommendation

Apply workaround: Increase the timeoutSeconds and noOutputTimeoutMs values as a temporary measure to reduce premature aborts. The generic Telegram error itself is addressed separately by PR #77015.
