openclaw - 💡(How to fix) Fix CLI watchdog kills sessions that are correctly idle while waiting on a Monitor task [1 participants]

GitHub stats: openclaw/openclaw#71803, fetched 2026-04-26 05:08:05. Comments: 0. Participants: 1. Timeline: 0. Reactions: 1.

CLI watchdog kills sessions that are correctly idle while waiting on a Monitor task

Summary

The agent/cli-backend watchdog terminates the Claude CLI process after noOutputTimeoutMs (default 180s) of no stdout, even when the agent is deliberately idle inside a Monitor tool call waiting for a long-running shell job (Whisper, ffmpeg, yt-dlp, large builds, etc.). This destroys the agent session mid-flow and leaves the user holding the bag — they have to reconstruct the in-flight task from logs.

Reproduction

  1. Start a session that uses Monitor to wait on a long-running shell job, e.g. transcribing a 30-minute audio file with Whisper:
    Monitor({
      command: "until [ -f /tmp/coaching-call-XXX/audio.txt ]; do sleep 10; done; echo \"transcript ready\"",
      timeout_ms: 1800000
    })
  2. Whisper runs for ~10–15 min. The agent has no other work to do until it gets the transcript ready event, so it correctly produces no output.
  3. After 180 s the watchdog fires and kills the CLI process. The session dies.
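The steps above can be modelled with a small sketch. This is not OpenClaw's implementation; `NoOutputWatchdog`, `recordOutput`, and `shouldKill` are hypothetical names, and the only values taken from the report are the 180000 ms default and the 1800000 ms Monitor timeout. Time is passed in explicitly so the timeline can be replayed without real timers.

```typescript
// Hypothetical model of a no-output watchdog; names are illustrative only.
class NoOutputWatchdog {
  private readonly noOutputTimeoutMs: number;
  private lastOutputAt: number;

  constructor(noOutputTimeoutMs: number, startedAt: number) {
    this.noOutputTimeoutMs = noOutputTimeoutMs;
    this.lastOutputAt = startedAt;
  }

  // Called whenever the supervised CLI process writes to stdout.
  recordOutput(now: number): void {
    this.lastOutputAt = now;
  }

  // True once the process has been silent for at least noOutputTimeoutMs.
  shouldKill(now: number): boolean {
    return now - this.lastOutputAt >= this.noOutputTimeoutMs;
  }
}

// Timeline from the report: Monitor starts at t=0, then the agent stays silent.
const reproWatchdog = new NoOutputWatchdog(180000, 0);
const monitorTimeoutMs = 1800000;

const killedAt180s = reproWatchdog.shouldKill(180000); // true
const monitorStillPending = 180000 < monitorTimeoutMs; // true
console.log({ killedAt180s, monitorStillPending });
```

In this model the watchdog has no notion of why the process is silent, so a healthy pending Monitor is indistinguishable from a hung CLI.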

Observed log lines

2026-04-25T12:01:44.014-07:00  Monitor started (task bqdfrkncm, timeout 1800000ms, persistent=false). You will be notified on each event. Keep working — do not poll or sleep.
...
2026-04-25T12:04:46.024-07:00 [agent/cli-backend] cli watchdog timeout: provider=claude-cli model=claude-opus-4-7 session=f5a1ec80-0fd8-4fa7-a823-c9c30d124ded noOutputTimeoutMs=180000 pid=26674
2026-04-25T12:04:46.033-07:00 [model-fallback/decision] model fallback decision: decision=candidate_failed requested=claude-cli/claude-opus-4-7 candidate=claude-cli/claude-opus-4-7 reason=timeout next=none detail=CLI produced no output for 180s and was terminated.
2026-04-25T12:04:46.123-07:00 Embedded agent failed before reply: CLI produced no output for 180s and was terminated.

The Monitor task itself is healthy and still running — only the CLI process supervising it gets killed. The agent never gets the chance to consume the transcript ready event and continue.
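Because the control UI only shows the session disappearing, it can help to confirm from the logs that a watchdog timeout was the cause. The sketch below extracts the key=value fields from the `cli watchdog timeout:` line shown above. The function name `parseWatchdogTimeout` is made up for illustration, and the parser assumes the line format seen in this report, which may differ in other versions.

```typescript
interface WatchdogTimeout {
  timestamp: string;
  fields: Record<string, string>;
}

// Returns the parsed fields, or null if the line is not a watchdog timeout.
function parseWatchdogTimeout(line: string): WatchdogTimeout | null {
  const marker = "cli watchdog timeout:";
  const idx = line.indexOf(marker);
  if (idx === -1) return null;

  // The timestamp is the first whitespace-delimited token on the line.
  const timestamp = line.trim().split(/\s+/, 1)[0] ?? "";

  // Everything after the marker is a run of space-separated key=value pairs.
  const fields: Record<string, string> = {};
  for (const pair of line.slice(idx + marker.length).trim().split(/\s+/)) {
    const eq = pair.indexOf("=");
    if (eq > 0) fields[pair.slice(0, eq)] = pair.slice(eq + 1);
  }
  return { timestamp, fields };
}

// Log line copied from the report.
const sampleLine =
  "2026-04-25T12:04:46.024-07:00 [agent/cli-backend] cli watchdog timeout: " +
  "provider=claude-cli model=claude-opus-4-7 " +
  "session=f5a1ec80-0fd8-4fa7-a823-c9c30d124ded noOutputTimeoutMs=180000 pid=26674";

const parsedTimeout = parseWatchdogTimeout(sampleLine);
console.log(parsedTimeout?.fields.session, parsedTimeout?.fields.noOutputTimeoutMs);
```

Matching the `session` field against the lost chat session distinguishes a watchdog kill from an unrelated crash.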

Why this is a real problem

  • The Monitor tool actively tells the agent not to poll or sleep — "keep working — do not poll or sleep" — so the agent is following protocol when it produces no output.
  • Long-running shell tasks are a common, supported workflow (transcription, video clipping, Whisper, ffmpeg, yt-dlp, builds).
  • Losing the session mid-Monitor is the worst possible failure mode: the underlying task often does finish, but the agent that was supposed to act on the result is gone.
  • The user-visible symptom is "I lost my chat session." From the user's perspective it looks like a generic crash, not a watchdog timeout, because the control UI just shows the session disappearing.

Suggested fixes (any of these would be sufficient)

  1. Suspend the watchdog while a Monitor task is pending. When the CLI is in a tool-call waiting state with Monitor listed, the watchdog should not count that idle time against noOutputTimeoutMs.
  2. Have the Monitor wrapper emit a periodic heartbeat token (single token, no semantic content) that resets the watchdog, but only while a Monitor task is genuinely pending.
  3. Per-session watchdog override. Allow callers to raise noOutputTimeoutMs for sessions known to do long monitoring (e.g. coaching-call workflow, clip-factory cron).
  4. Treat the pending-Monitor state as 'expected idle' in the same way a paused/awaiting-user-input state should be — exempt from no-output timeout.

(1) and (2) are the cleanest because they don't push knowledge of the timeout up to every caller.
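A minimal sketch of suggestion (1), which also covers the "expected idle" framing of (4), is shown below. All names (`MonitorAwareWatchdog`, `monitorStarted`, `monitorFinished`) are hypothetical, and the sketch assumes the backend can observe when a Monitor task starts and finishes, which the report does not confirm. Time is injected so the behaviour can be checked deterministically.

```typescript
// Hypothetical watchdog that ignores idle time while Monitor tasks are pending.
class MonitorAwareWatchdog {
  private readonly noOutputTimeoutMs: number;
  private idleSince: number;
  private pendingMonitors = 0;

  constructor(noOutputTimeoutMs: number, startedAt: number) {
    this.noOutputTimeoutMs = noOutputTimeoutMs;
    this.idleSince = startedAt;
  }

  // Any stdout from the CLI restarts the countdown.
  recordOutput(now: number): void {
    this.idleSince = now;
  }

  monitorStarted(): void {
    this.pendingMonitors += 1;
  }

  monitorFinished(now: number): void {
    this.pendingMonitors = Math.max(0, this.pendingMonitors - 1);
    // Once nothing is pending, give the agent a full window to react to the event.
    if (this.pendingMonitors === 0) this.idleSince = now;
  }

  shouldKill(now: number): boolean {
    if (this.pendingMonitors > 0) return false; // expected idle, exempt from timeout
    return now - this.idleSince >= this.noOutputTimeoutMs;
  }
}

// Replay of the reported scenario with a 15-minute Monitor task.
const fixedWatchdog = new MonitorAwareWatchdog(180000, 0);
fixedWatchdog.monitorStarted();
const killedDuringMonitor = fixedWatchdog.shouldKill(180000); // false
fixedWatchdog.monitorFinished(900000);
const killedIfStillSilent = fixedWatchdog.shouldKill(900000 + 180000); // true
console.log({ killedDuringMonitor, killedIfStillSilent });
```

One design note: the exemption in this sketch is bounded only by the Monitor task finishing, so it relies on the Monitor's own timeout_ms to stop a stuck task from keeping the session alive indefinitely.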

Repro environment

  • macOS, OpenClaw gateway running locally
  • Provider: claude-cli, model: claude-opus-4-7
  • Workflow: 1-on-1 coaching call transcription skill (yt-dlp → Whisper → action plan generation)
  • noOutputTimeoutMs default: 180000

Related observations during the same session loss

While debugging the dropped session I also noticed a few less-severe but related logspam issues that may share a root cause or similar shape:

  • [skills] Skipping escaped skill path outside its configured root for post-bridge — a symlink outside the skills root is rejected on every cycle. (Workaround: replace symlink with a copy in ~/.openclaw/skills/.)
  • [cron] payload.model 'vamsi-local/qwen3' not allowed, falling back to agent defaults — the cron model allowlist is required to include each provider/model pair explicitly.
  • The openclaw-control-ui webchat client reconnects with code 1001 on every view change, which is normal but compounds the user-perceived chaos when a watchdog kill happens mid-session and the user starts switching views to find the lost transcript.

These are workaroundable; the watchdog-vs-Monitor interaction is the one that warrants a code fix.

extent analysis

TL;DR

The CLI watchdog can be fixed by suspending it while a Monitor task is pending or by having the Monitor wrapper emit a periodic heartbeat token to reset the watchdog.
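For the heartbeat option, the key constraint is that the heartbeat interval must be shorter than noOutputTimeoutMs. The sketch below is a timing simulation only, not a real Monitor wrapper; `survivesWithHeartbeat` is a hypothetical helper, and it assumes each heartbeat counts as output and that a tie between heartbeat and timeout goes to the watchdog.

```typescript
// Simulates whether a session survives a Monitor wait of monitorDurationMs
// when a heartbeat is emitted every heartbeatIntervalMs while the task is pending.
function survivesWithHeartbeat(
  monitorDurationMs: number,
  noOutputTimeoutMs: number,
  heartbeatIntervalMs: number,
): boolean {
  if (!(heartbeatIntervalMs > 0)) {
    throw new RangeError("heartbeatIntervalMs must be positive");
  }
  let lastOutputAt = 0;
  // Heartbeats fire at interval, 2*interval, ... strictly before the Monitor finishes.
  for (let t = heartbeatIntervalMs; t < monitorDurationMs; t += heartbeatIntervalMs) {
    if (t - lastOutputAt >= noOutputTimeoutMs) return false; // watchdog fired first
    lastOutputAt = t; // heartbeat resets the no-output timer
  }
  // Final gap between the last heartbeat and the Monitor's completion event.
  return monitorDurationMs - lastOutputAt < noOutputTimeoutMs;
}

// 15-minute Monitor wait against the 180000 ms default from the report.
const everyMinute = survivesWithHeartbeat(900000, 180000, 60000); // true
const everyFourMinutes = survivesWithHeartbeat(900000, 180000, 240000); // false
const noHeartbeat = survivesWithHeartbeat(900000, 180000, Number.POSITIVE_INFINITY); // false
console.log({ everyMinute, everyFourMinutes, noHeartbeat });
```

A real wrapper would presumably drive the heartbeat from a timer that is started when the Monitor task begins and cleared when it ends, so that heartbeats never mask a hang outside a pending Monitor.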

Guidance

  • Check the effective noOutputTimeoutMs (default 180000 ms) and consider raising it for sessions that run long Monitor tasks.
  • Suspend the watchdog while a Monitor task is pending, or have the Monitor wrapper emit a periodic heartbeat that resets it.
  • Consider adding a per-session watchdog override so callers can adjust noOutputTimeoutMs for specific workflows.
  • Retest against the Repro environment and check that the fix does not introduce new issues, for example a genuinely hung CLI process that is no longer detected.
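The per-session override in the guidance above could look roughly like the following. The `SessionOptions` shape and `resolveNoOutputTimeoutMs` are assumptions made for illustration; the report does not describe how OpenClaw sessions are configured, only that the default is 180000 ms.

```typescript
// Default reported in the issue.
const DEFAULT_NO_OUTPUT_TIMEOUT_MS = 180000;

// Hypothetical per-session options; not a confirmed OpenClaw config shape.
interface SessionOptions {
  noOutputTimeoutMs?: number;
}

// Picks the session override when present, otherwise the global default.
function resolveNoOutputTimeoutMs(session: SessionOptions): number {
  const value = session.noOutputTimeoutMs ?? DEFAULT_NO_OUTPUT_TIMEOUT_MS;
  if (!Number.isFinite(value) || value <= 0) {
    throw new RangeError("noOutputTimeoutMs must be a positive, finite number");
  }
  return value;
}

// A long-monitoring workflow raises the limit above its 1800000 ms Monitor timeout.
const defaultSessionTimeout = resolveNoOutputTimeoutMs({}); // 180000
const transcriptionTimeout = resolveNoOutputTimeoutMs({ noOutputTimeoutMs: 1830000 }); // 1830000
console.log({ defaultSessionTimeout, transcriptionTimeout });
```

As the issue notes, this approach pushes knowledge of the timeout up to every caller, which is why it ranks behind suspending the watchdog or emitting a heartbeat.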

Example

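// Pseudo-code: suspend the watchdog while a Monitor task is pending.
// watchdog, suspend(), resume(), and monitorTaskPending are illustrative names,
// not confirmed OpenClaw APIs. Re-evaluate whenever a Monitor task starts or finishes.
if (monitorTaskPending) {
  // Idle time must not count against noOutputTimeoutMs.
  watchdog.suspend();
} else {
  // Restart the no-output countdown once no Monitor task is pending.
  watchdog.resume();
}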

Notes

The provided suggestions (1-4) are potential solutions, but the best approach may depend on the specific requirements and constraints of the system. It is essential to test and validate any changes to ensure they do not introduce new issues.

Recommendation

Apply suggested fix (1) or (2): either suspend the watchdog while a Monitor task is pending, or have the Monitor wrapper emit a periodic heartbeat that resets it. These are the cleanest options because they do not push knowledge of the timeout up to every caller.
