openclaw - 💡(How to fix) Fix [Bug]: GitHub Copilot anthropic-messages transport silently hangs ~365s in isolated cron sessions (2026.5.27)

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

In v2026.5.27, isolated agentTurn cron runs using github-copilot/claude-sonnet-4.6 (and likely any github-copilot model on the anthropic-messages transport) silently hang for ~365s with zero progress events, then abort internally with promptError: "This operation was aborted". The cron run is then misclassified as status: ok in the cron history (despite the trajectory's finalStatus: error), suppressing all failure alerts — the cron appears healthy while delivering nothing.

This is the github-copilot variant of the watchdog-hang pattern fixed for claude-cli in #86895 / PR #87546. The fix in #87546 only added stream-json progress tracking for the claude-cli backend; the anthropic-messages transport used by github-copilot's claude models has no equivalent progress signal, so the gateway watchdog cannot distinguish a working-but-silent stream from a wedged one.

Error Message

In v2026.5.27, isolated agentTurn cron runs using github-copilot/claude-sonnet-4.6 (and likely any github-copilot model on the anthropic-messages transport) silently hang for ~365s with zero progress events, then abort internally with promptError: "This operation was aborted". The cron run is then misclassified as status: ok in the cron history (despite the trajectory's finalStatus: error), suppressing all failure alerts — the cron appears healthy while delivering nothing.

  • A short transport-level idle timeout (e.g., 30–60s with no first token) triggers a transport error that the failover chain can catch, OR
  • The cron runner correctly records status: error when the trajectory's finalStatus is error, so existing failure-alert plumbing fires. "finalStatus": "error", Note the contradiction: status: ok and delivered: false on the cron-side, vs finalStatus: error and aborted: true on the trajectory-side. The reason is that pi-ai's failover classifier keys on specific patterns (the documented "An unknown error occurred" stream-wrapper text, HTTP 429, transport errors, etc.). The internal AbortError: "This operation was aborted" from the gateway's stuck-session watchdog does not match any of those patterns, so failover is silently inert for this exact failure mode.
  1. Failover-classifier gap: internal AbortError from the stuck-session watchdog is not on pi-ai's failover-worthy error list. Configured fallback chains do not fire for this failure mode. The classifier should probably treat aborted: true + assistantTexts: [] + toolMetas: [] as a failover-worthy outcome regardless of the specific error string.
  2. Cron-runner status misclassification: when the underlying trajectory ends with finalStatus: error and aborted: true, the cron runner still records status: ok in the run history. This means failureAlert blocks don't fire, consecutiveErrors stays at zero, the daily Cron Health Monitor reports the cron as healthy, and the user has no signal that anything is wrong — the cron silently dies for days while reporting green. Suggest the cron runner inspect trace.artifacts.data.finalStatus (or equivalently model.completed.data.aborted combined with empty assistantTexts) to set the correct status, and report deliveryStatus: failed when didSendViaMessagingTool: false on a payload whose prompt requires delivery.

Root Cause

The reason is that pi-ai's failover classifier keys on specific patterns (the documented "An unknown error occurred" stream-wrapper text, HTTP 429, transport errors, etc.). The internal AbortError: "This operation was aborted" from the gateway's stuck-session watchdog does not match any of those patterns, so failover is silently inert for this exact failure mode.

Fix Action

Fix / Workaround

Mitigation in use

Working mitigation: change the primary to a model on a different transport (github-copilot/gpt-5.5 via openai-responses). That bypasses the broken anthropic-messages code path entirely. Configured fallbacks back to sonnet/opus are still inert if gpt-5.5 ever hits the same class of issue.

Code Example

{
  "finalStatus": "error",
  "aborted": true,
  "externalAbort": false,
  "timedOut": false,
  "idleTimedOut": false,
  "timedOutDuringCompaction": false,
  "timedOutDuringToolExecution": false,
  "promptError": "This operation was aborted",
  "promptErrorSource": "prompt",
  "compactionCount": 0,
  "assistantTexts": [],
  "itemLifecycle": { "startedCount": 0, "completedCount": 0, "activeCount": 0 },
  "toolMetas": [],
  "didSendViaMessagingTool": false,
  "messagingToolSentTexts": [],
  "messagingToolSentTargets": []
}

---

{
  "action": "finished",
  "status": "ok",
  "runAtMs": 1779973200025,
  "durationMs": 372636,
  "provider": "github-copilot",
  "model": "claude-sonnet-4.6",
  "deliveryStatus": "not-requested",
  "delivery": { "delivered": false, "fallbackUsed": false }
}
RAW_BUFFERClick to expand / collapse

Bug type

Crash (process/app hangs)

Beta release blocker

No

Summary

In v2026.5.27, isolated agentTurn cron runs using github-copilot/claude-sonnet-4.6 (and likely any github-copilot model on the anthropic-messages transport) silently hang for ~365s with zero progress events, then abort internally with promptError: "This operation was aborted". The cron run is then misclassified as status: ok in the cron history (despite the trajectory's finalStatus: error), suppressing all failure alerts — the cron appears healthy while delivering nothing.

This is the github-copilot variant of the watchdog-hang pattern fixed for claude-cli in #86895 / PR #87546. The fix in #87546 only added stream-json progress tracking for the claude-cli backend; the anthropic-messages transport used by github-copilot's claude models has no equivalent progress signal, so the gateway watchdog cannot distinguish a working-but-silent stream from a wedged one.

Steps to reproduce

  1. Configure an isolated agentTurn cron with model: "github-copilot/claude-sonnet-4.6", no fallbacks, timeoutSeconds: 600, simple bash-then-message-tool prompt.
  2. Trigger the cron (manual or scheduled).
  3. Observe: trajectory file freezes at prompt.submitted for ~365s, then model.completed fires with aborted: true. No tool calls are made. Discord delivery never happens.
  4. Cron history records status: ok, delivered: false, deliveryStatus: not-requested.

Reproduces deterministically across all my morning runs (5/5 attempts at 5:00, 5:15, 5:30, 5:45, 6:00 AM PT today, plus an additional manual run at 6:28 AM PT, all hung identically with durations 366,400–372,640 ms).

Expected behavior

Either:

  • The github-copilot anthropic-messages transport emits stream/heartbeat progress so the gateway's stuck-session watchdog can distinguish active streams from hung ones, OR
  • A short transport-level idle timeout (e.g., 30–60s with no first token) triggers a transport error that the failover chain can catch, OR
  • The cron runner correctly records status: error when the trajectory's finalStatus is error, so existing failure-alert plumbing fires.

Actual behavior

Run hangs for ~6 min with no progress; aborts internally; cron records "success"; nothing delivered; no alerts.

Trajectory artifact (trace.artifacts.data)

{
  "finalStatus": "error",
  "aborted": true,
  "externalAbort": false,
  "timedOut": false,
  "idleTimedOut": false,
  "timedOutDuringCompaction": false,
  "timedOutDuringToolExecution": false,
  "promptError": "This operation was aborted",
  "promptErrorSource": "prompt",
  "compactionCount": 0,
  "assistantTexts": [],
  "itemLifecycle": { "startedCount": 0, "completedCount": 0, "activeCount": 0 },
  "toolMetas": [],
  "didSendViaMessagingTool": false,
  "messagingToolSentTexts": [],
  "messagingToolSentTargets": []
}

Cron run history (same run, recorded by cron)

{
  "action": "finished",
  "status": "ok",
  "runAtMs": 1779973200025,
  "durationMs": 372636,
  "provider": "github-copilot",
  "model": "claude-sonnet-4.6",
  "deliveryStatus": "not-requested",
  "delivery": { "delivered": false, "fallbackUsed": false }
}

Note the contradiction: status: ok and delivered: false on the cron-side, vs finalStatus: error and aborted: true on the trajectory-side.

Environment

  • OpenClaw: 2026.5.27
  • Gateway runtime: Node v25.9.0 invoking NVM-installed openclaw at ~/.nvm/versions/node/v26.2.0/lib/node_modules/openclaw/dist/index.js
  • OS: macOS 26.5 (Darwin 25.5.0 arm64)
  • Provider: github-copilot (token exchanged via proxy.business.githubcopilot.com, enterprise SKU copilot_for_business_seat_quota, freshly refreshed mid-failure)
  • Model: github-copilot/claude-sonnet-4.6 via anthropic-messages API
  • Session target: isolated
  • Affected cron payload: simple bash + message(action=send) + message(action=upload-file)
  • Auth is healthy throughout (token refresh succeeded; gh auth status reports valid token with all required scopes)

Related issues / context

  • #86895 (closed today, 2026-05-28) — identical 365s watchdog hang pattern on claude-cli + webchat. PR #87546 (commit 4c3a0292) added stream-json progress tracking for claude-cli only; the github-copilot anthropic-messages transport was not covered. This issue is the github-copilot half of that bug.
  • #87327 (open, filed 2026-05-27) — similar isolated-cron stall but in the earlier runtime-plugins phase (78–80s, before model invocation). Distinct stall point, same overall family.
  • #65576 (closed 2026-04-25) — historical: cron runs silently disabled the LLM idle watchdog. Fixed, but with no per-attempt idle timeout for streams that open and then go silent, the same surface re-emerges in a different way.
  • #75332 (closed 2026-05-01) — HTTP 429 from github-copilot misclassified as idle timeout. Different trigger, same misclassification-as-success pattern.

Mitigation in use

The straightforward fix — adding ["github-copilot/gpt-5.5", "github-copilot/claude-opus-4.6"] as payload.fallbacksdoes not work. I confirmed this with a controlled retry: cron history shows fallbackUsed: false, delivered: false, status: ok after a fresh 374-second hang. The fallback chain is loaded but never triggers.

The reason is that pi-ai's failover classifier keys on specific patterns (the documented "An unknown error occurred" stream-wrapper text, HTTP 429, transport errors, etc.). The internal AbortError: "This operation was aborted" from the gateway's stuck-session watchdog does not match any of those patterns, so failover is silently inert for this exact failure mode.

Working mitigation: change the primary to a model on a different transport (github-copilot/gpt-5.5 via openai-responses). That bypasses the broken anthropic-messages code path entirely. Configured fallbacks back to sonnet/opus are still inert if gpt-5.5 ever hits the same class of issue.

Three distinct bugs in this one symptom

  1. Primary bug: github-copilot + anthropic-messages transport opens a stream, never emits a chunk, and is aborted by the internal watchdog ~365s later. (Sibling of #86895, fix in #87546 only covered claude-cli.)
  2. Failover-classifier gap: internal AbortError from the stuck-session watchdog is not on pi-ai's failover-worthy error list. Configured fallback chains do not fire for this failure mode. The classifier should probably treat aborted: true + assistantTexts: [] + toolMetas: [] as a failover-worthy outcome regardless of the specific error string.
  3. Cron-runner status misclassification: when the underlying trajectory ends with finalStatus: error and aborted: true, the cron runner still records status: ok in the run history. This means failureAlert blocks don't fire, consecutiveErrors stays at zero, the daily Cron Health Monitor reports the cron as healthy, and the user has no signal that anything is wrong — the cron silently dies for days while reporting green. Suggest the cron runner inspect trace.artifacts.data.finalStatus (or equivalently model.completed.data.aborted combined with empty assistantTexts) to set the correct status, and report deliveryStatus: failed when didSendViaMessagingTool: false on a payload whose prompt requires delivery.

Bugs 2 and 3 are likely worth separate issues, but I've kept them together here because they compose into the user-facing symptom: a cron that hangs, never delivers, never falls back, and never alerts.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

FAQ

Expected behavior

Either:

  • The github-copilot anthropic-messages transport emits stream/heartbeat progress so the gateway's stuck-session watchdog can distinguish active streams from hung ones, OR
  • A short transport-level idle timeout (e.g., 30–60s with no first token) triggers a transport error that the failover chain can catch, OR
  • The cron runner correctly records status: error when the trajectory's finalStatus is error, so existing failure-alert plumbing fires.

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING