openclaw - ✅(Solved) Fix [Bug]: Claude CLI no-output timeout is collapsed into generic Telegram error; operators cannot distinguish timeout from gateway/provider failure [1 pull requests, 2 comments, 3 participants]

openclaw, 2026-05-04 00:23:42

openclaw/openclaw#77007•Fetched 2026-05-04 04:59:31

When OpenClaw runs through the Claude CLI backend (agentRuntime.id = "claude-cli") and a turn produces no output for the configured CLI watchdog window, Telegram receives only the generic message:

Something went wrong while processing your request. Please try again, or use /new to start a fresh session.

The gateway remains healthy and the CLI OAuth path is valid, but the user/operator cannot distinguish a CLI no-output timeout from provider billing, auth, gateway downtime, or a Telegram channel failure. This is especially confusing for long-running Claude CLI Opus/resume sessions where 300s can be a normal-but-slow turn rather than a broken provider.

Error Message

[agent/cli-backend] claude live session turn failed: provider=claude-cli model=claude-opus-4-7 durationMs=300003 error=FailoverError

PR fix notes

PR #77015: fix(agent-reply): surface Claude CLI timeout copy (fixes #77007)

Repository: openclaw/openclaw
Author: neeravmakwana
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/77015

Description (problem / solution / changelog)

Root cause

buildExternalRunFailureReply() treated most embedded-run failures as generic whenever verbose failures were off, returning GENERIC_EXTERNAL_RUN_FAILURE_TEXT. Claude CLI emits stable gateway strings (CLI exceeded timeout (\d+s)... / CLI produced no output...) that appeared in gateway logs but collapsed to generic channel copy unless verbose failure detail was enabled.

Why this fix is safe

Visible text uses regex-captured durations plus fixed guidance; arbitrary payloads are not echoed verbatim.
Matches only aggregated errors that contain internally generated CLI timeout templates; existing billing/rate-limit/OAuth/session branching stays unchanged.
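
The narrow matching described above can be sketched roughly as follows. This is an illustration only, not the code from PR #77015: the names `classifyCliTimeout` and `buildTimeoutReply` are hypothetical, the first pattern is taken from the gateway log line quoted in the issue, and the no-output template is elided in the source, so only its known prefix is matched.

```typescript
// Hypothetical sketch; names and reply wording are assumptions, not the PR's code.
type CliTimeoutKind = "exceeded-timeout" | "no-output";

interface CliTimeoutMatch {
  kind: CliTimeoutKind;
  seconds?: number; // present only when a duration was captured
}

// Based on the log line "CLI exceeded timeout (300s) and was terminated."
const EXCEEDED_TIMEOUT = /CLI exceeded timeout \((\d+)s\)/;
// The full template is elided in the source; only the known prefix is matched.
const NO_OUTPUT_PREFIX = /CLI produced no output/;

function classifyCliTimeout(errorText: string): CliTimeoutMatch | undefined {
  const exceeded = EXCEEDED_TIMEOUT.exec(errorText);
  if (exceeded) {
    return { kind: "exceeded-timeout", seconds: Number(exceeded[1]) };
  }
  if (NO_OUTPUT_PREFIX.test(errorText)) {
    return { kind: "no-output" };
  }
  // Anything else falls through to the existing generic/billing/OAuth handling.
  return undefined;
}

function buildTimeoutReply(match: CliTimeoutMatch): string {
  // Only the captured duration is interpolated; the raw error text is never echoed.
  const duration = match.seconds !== undefined ? ` after ${match.seconds}s` : "";
  const what = match.kind === "no-output" ? "produced no output" : "timed out";
  return `Claude CLI ${what}${duration}. Try /new or increase the configured timeouts.`;
}
```

Because the reply is assembled from a captured number and fixed strings, an unexpected payload in the error text cannot leak into the channel message.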

Security / runtime controls (unchanged)

No bypass of auth, channel ACLs, SSRF defenses, sandboxing, pairing or approval semantics, or model/routing policy.
Verbosity still gates detailed raw forwarding; this adds a narrow carve-out comparable to OAuth / missing-API-key copy.

Tests

pnpm install
pnpm exec vitest run src/auto-reply/reply/agent-runner-execution.test.ts -t "Claude CLI"
pnpm check:changed
git diff --check

Out of scope / follow-ups

Non-Claude CLI backends may emit different stall strings — only matching OpenClaw CLI … literals are classified here.
Broader diagnostics (gateway vs. delivery vs. billing) requested in the issue are not attempted.
Telegram transport fallback on delivery-only failures is untouched.

fixes #77007

Changed files

CHANGELOG.md (modified, +1/-0)
src/auto-reply/reply/agent-runner-execution.test.ts (modified, +49/-0)
src/auto-reply/reply/agent-runner-execution.ts (modified, +35/-0)

Environment

OpenClaw: 2026.4.29 CLI installed; gateway service description pinned to v2026.4.25
Platform: Linux server, systemd user service
Channel: Telegram bot
Runtime: claude-cli via OAuth/subscription
Primary model: anthropic/claude-opus-4-7
agents.defaults.agentRuntime.id = "claude-cli"
agents.defaults.timeoutSeconds = 300 before mitigation
CLI backend watchdog before mitigation:

{
  "fresh": { "noOutputTimeoutMs": 300000 },
  "resume": { "noOutputTimeoutMs": 300000 }
}

Observed incidents

The following user-visible Telegram errors all mapped to the same gateway-side timeout pattern:

2026-05-03 19:22:51 UTC
2026-05-03 22:34:27 UTC
2026-05-03 22:51:55 UTC
2026-05-03 23:58:18 UTC

Gateway log pattern:

[agent/cli-backend] claude live session close: provider=claude-cli model=claude-opus-4-7 reason=abort
[agent/cli-backend] claude live session turn failed: provider=claude-cli model=claude-opus-4-7 durationMs=300003 error=FailoverError
[model-fallback/decision] model fallback decision: decision=candidate_failed requested=anthropic/claude-opus-4-7 candidate=anthropic/claude-opus-4-7 reason=timeout next=none detail=CLI exceeded timeout (300s) and was terminated.
Embedded agent failed before reply: CLI exceeded timeout (300s) and was terminated.
[telegram] sendMessage ok chat=<redacted> message=<id>

At the same time:

systemctl --user status openclaw-gateway.service => active/running
curl http://127.0.0.1:18788/health => {"ok":true,"status":"live"}

Direct Claude CLI smoke test worked:

claude --model claude-opus-4-7 --print "reply ok only"
# ok

So this was not gateway downtime and not a dead OAuth/CLI installation.
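
Until the channel message carries the failure class, an operator can recover it from the `model-fallback/decision` log line. The sketch below is a hypothetical helper, not part of OpenClaw: it assumes the `key=value` layout of the log line quoted above and that `detail=` is the last field and may contain spaces.

```typescript
// Hypothetical helper for reading key=value fields from a gateway log line.
function parseLogFields(line: string): Record<string, string> {
  const fields: Record<string, string> = {};
  // Assumption: "detail=" is the final field and runs to the end of the line.
  const marker = " detail=";
  const detailIndex = line.indexOf(marker);
  const head = detailIndex >= 0 ? line.slice(0, detailIndex) : line;

  const pattern = /(\w+)=(\S+)/g;
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(head)) !== null) {
    fields[m[1]] = m[2];
  }
  if (detailIndex >= 0) {
    fields["detail"] = line.slice(detailIndex + marker.length);
  }
  return fields;
}

// True when the candidate failed on a timeout and no fallback model was available.
function isTimeoutWithoutFallback(fields: Record<string, string>): boolean {
  return fields["reason"] === "timeout" && fields["next"] === "none";
}
```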

Expected behavior

1. Telegram should surface the real failure class, for example:

Claude CLI timed out after 300s with no output for claude-opus-4-7. Gateway is still healthy. Try /new, lower effort, use Sonnet, or increase agents.defaults.timeoutSeconds / cliBackends.claude-cli.reliability.watchdog.*.noOutputTimeoutMs.

2. Status/diagnostics should distinguish:

- gateway down
- provider auth/billing failure
- CLI subprocess no-output timeout
- overall agent turn timeout
- Telegram delivery failure

3. For claude-cli backends, the runtime should consider a safer default for Opus/resume sessions or expose a clear recommendation when timeoutSeconds and CLI noOutputTimeoutMs are both 300s.
4. If agentRuntime.id = "claude-cli", model-fallback/status messages should show the execution backend (claude-cli) prominently, not only the canonical model namespace (anthropic/...).
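
One way to picture the distinction requested above is a small typed failure report. This is a sketch under assumptions: the type names, class identifiers, and message layout are invented for illustration and do not come from OpenClaw.

```typescript
// Hypothetical failure taxonomy mirroring the five classes listed in the issue.
type FailureClass =
  | "gateway-down"
  | "provider-auth-billing"
  | "cli-no-output-timeout"
  | "agent-turn-timeout"
  | "telegram-delivery";

interface FailureReport {
  failureClass: FailureClass;
  backend: string; // execution backend, e.g. "claude-cli"
  model: string; // canonical model name, e.g. "anthropic/claude-opus-4-7"
}

const FAILURE_LABELS: Record<FailureClass, string> = {
  "gateway-down": "Gateway down",
  "provider-auth-billing": "Provider auth/billing failure",
  "cli-no-output-timeout": "CLI subprocess no-output timeout",
  "agent-turn-timeout": "Overall agent turn timeout",
  "telegram-delivery": "Telegram delivery failure",
};

// Puts the execution backend first so it is not hidden behind the model namespace.
function describeFailure(report: FailureReport): string {
  const label = FAILURE_LABELS[report.failureClass];
  return `[${report.backend}] ${label} (model: ${report.model})`;
}
```

Typing `FAILURE_LABELS` as `Record<FailureClass, string>` makes the compiler reject a new failure class that has no label, which helps keep any new case from silently falling back to generic copy.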

Local mitigation applied

We increased the local deployment timeouts:

{
  "agents": {
    "defaults": {
      "timeoutSeconds": 900,
      "cliBackends": {
        "claude-cli": {
          "reliability": {
            "watchdog": {
              "fresh": { "noOutputTimeoutMs": 900000 },
              "resume": { "noOutputTimeoutMs": 900000 }
            }
          }
        }
      }
    }
  }
}

After restart:

/health => {"ok":true,"status":"live"}
runtime => {"id":"claude-cli"}
model unchanged => anthropic/claude-opus-4-7

This mitigates premature aborts, but it does not fix the platform UX: Telegram still collapses the root cause into a generic error and operators must inspect gateway logs to understand what happened.

Related issues

This appears related to, but broader than, #75264. That issue reports a specific Telegram + CLI backend stall during a file Read tool call. This report focuses on the generic platform behavior for Claude CLI no-output/turn timeouts and the lack of actionable error propagation to Telegram/status.

extent analysis

TL;DR

Increase timeoutSeconds and noOutputTimeoutMs to reduce premature aborts of slow Claude CLI turns. This is a workaround only: it does not make Telegram distinguish a CLI no-output timeout from other failures.

Guidance

Review the current agents.defaults.timeoutSeconds and cliBackends.claude-cli.reliability.watchdog settings to ensure they are adequate for the expected response times of the Claude CLI backend.
Consider increasing the timeoutSeconds value to a higher value (e.g., 900) to allow for longer-running CLI sessions.
Update the noOutputTimeoutMs values for fresh and resume sessions to match the increased timeoutSeconds value, converting seconds to milliseconds (e.g., 900 s = 900000 ms).
Monitor the gateway logs to verify that the increased timeouts are effective in reducing premature aborts and to identify any other potential issues.
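
The unit mismatch between the two settings (seconds vs. milliseconds) is easy to get wrong, so a small check can help. The sketch below is hypothetical: `TimeoutSettings` is a flattened stand-in for the relevant config values, not OpenClaw's actual schema, and the "must match" policy simply follows the guidance above.

```typescript
// Hypothetical, flattened view of the two timeout settings discussed above.
type SessionKind = "fresh" | "resume";

interface TimeoutSettings {
  timeoutSeconds: number;
  watchdog: Record<SessionKind, { noOutputTimeoutMs: number }>;
}

// Returns the session kinds whose watchdog does not equal timeoutSeconds in ms.
function mismatchedWatchdogs(settings: TimeoutSettings): SessionKind[] {
  const expectedMs = settings.timeoutSeconds * 1000;
  const kinds: SessionKind[] = ["fresh", "resume"];
  return kinds.filter(
    (kind) => settings.watchdog[kind].noOutputTimeoutMs !== expectedMs,
  );
}
```

With `timeoutSeconds` raised to 900 but the resume watchdog left at 300000, the function returns only `"resume"`, pointing at the value that was not updated.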

Notes

The current implementation of the Claude CLI backend and the gateway may not provide sufficient error propagation to distinguish between different types of failures. Increasing the timeouts may help mitigate premature aborts, but a more comprehensive solution may be required to address the underlying issue.

Recommendation

Apply workaround: increase timeoutSeconds and noOutputTimeoutMs to reduce premature aborts. This is a temporary measure; making Telegram report the specific failure class requires a code change such as the one proposed in PR #77015.
