openclaw [Bug]: Newly-created Telegram group-topic sessions wedge indefinitely on first inbound: claude-cli --resume hangs against UUID with no prior project transcript (2026.5.20 regression)

openclaw · 2026-05-24 15:47:47


Error Message

durationMs=195366 error=FailoverError
durationMs=183725 error=AbortError
[warn] Subagent announce give up (retry-limit) run=<x> child=<y>

Root Cause

After upgrading to 2026.5.20, every newly-created Telegram group-topic session wedges on its first inbound and stays wedged. The gateway mints a fresh session UUID, invokes claude -p --resume <uuid>, and claude-cli hangs for 195+ seconds because no ~/.claude/projects/<workspace>/<uuid>.jsonl exists for that UUID. Watchdog aborts the embedded run after ~6 minutes. Next inbound to the same lane creates a fresh UUID with the same fate — the lane is now permanently degraded. Telegram DM lanes are unaffected.


Environment

OpenClaw: 2026.5.20 (commit e510042)
Node.js: bundled with the published npm package
Backend: cliBackends.claude-cli pointing at Anthropic's Claude Code CLI binary
Channel: Telegram (polling mode), single Telegram supergroup with multiple forum topics
Agent: single agents.list[main] with model claude-cli/claude-opus-4-7 (legacy form) and fallbacks claude-cli/claude-sonnet-4-6, claude-cli/claude-haiku-4-5
Other plugins enabled: anthropic, browser, canvas, device-pair, file-transfer, memory-core, phone-control, slack, talk-voice, telegram (10 total)
Host: single-VPS deployment, no container

TL;DR

Symptom

For each affected lane:

User posts a message to a Telegram supergroup topic.
Gateway logs [telegram] Inbound message telegram:group:<chat>:topic:<id> -> @<bot> (group, N chars).

Embedded run starts. Diagnostic warnings begin ~2 minutes later:

[diagnostic] stalled session: sessionId=unknown
  sessionKey=agent:main:telegram:group:<chat>:topic:<id>
  state=processing age=Ns queueDepth=1
  reason=active_work_without_progress
  classification=stalled_agent_run
  activeWorkKind=embedded_run
  lastProgress=embedded_run:started
  lastProgressAge=Ns recovery=none

sessionId=unknown and lastProgress=embedded_run:started persist — the run never advances past start.

The underlying claude-cli calls each run for about three minutes before failing:

[agent/cli-backend] claude live session turn failed:
  provider=claude-cli model=claude-opus-4-7
  durationMs=195366 error=FailoverError
[model-fallback/decision] model fallback decision:
  decision=candidate_failed
  requested=claude-cli/claude-opus-4-7
  candidate=claude-cli/claude-opus-4-7
  reason=unknown next=claude-cli/claude-sonnet-4-6
  detail=Claude CLI failed.
[agent/cli-backend] cli exec:
  provider=claude-cli model=sonnet promptChars=N
  trigger=user useResume=true session=present
  resumeSession=<short> reuse=reusable historyPrompt=present
[agent/cli-backend] claude live session turn failed:
  provider=claude-cli model=claude-sonnet-4-6
  durationMs=183725 error=AbortError

Both the primary Opus attempt (~195s) and the Sonnet fallback (~183s) time out the same way.

Watchdog escalates to recovery at ~6 minutes:
```
[diagnostic] stuck session recovery:
  sessionId=<uuid> sessionKey=agent:main:telegram:group:<chat>:topic:<id>
  age=N action=abort_embedded_run aborted=true drained=true|false released=0
[diagnostic] stuck session recovery outcome:
  status=aborted action=abort_embedded_run
  sessionId=<uuid> ... activeWorkKind=embedded_run
  lane=session:agent:main:telegram:group:<chat>:topic:<id>
  aborted=true drained=true|false forceCleared=false released=0
```
- drained=false if the abort fired before any tokens were emitted → in-flight content is discarded silently and the user never sees a reply.
- drained=true if the abort fired after streaming started → the abort handler calls Telegram's deleteMessage on the in-flight partial message, so the user briefly sees a partial post that then disappears. The gateway journal does NOT log the deleteMessage HTTP call separately; the cleanup is hidden inside the abort.

The same session UUID is reused across multiple aborts on the same lane. The next inbound to that lane stalls the same way until the OpenClaw session record is manually purged.

Sub-agents spawned from a wedged parent complete their own work but can't deliver back:
```
[warn] Subagent announce give up (retry-limit) run=<x> child=<y>
  requester=agent:main:telegram:group:<chat>:topic:<id>
  retries=3 endedAgo=Ns
  deliveryError="completion agent did not deliver through the message tool;
                 direct-primary: completion agent did not deliver through the message tool"
```
When all parent-side delivery retries exhaust, the direct-primary fallback path routes the sub-agent's output to the user's DM with the originating bot instead. Result: messages composed in the context of a group topic appear in the user's DM. This is the cross-topic-to-DM "message jumping" symptom users report.
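
As a reading aid for the log excerpts above, here is a minimal Python sketch that splits a journal entry into its key=value fields and checks for the stalled signature this report describes. The helper names (`parse_fields`, `is_wedged`, `total_duration_ms`) are hypothetical and not part of OpenClaw. The sample text is copied from the excerpts, and the regex assumes fields are whitespace-separated with optional double-quoted values.

```python
import re
from typing import Dict

# key=value tokens; a value is either double-quoted or a run of non-whitespace.
_FIELD = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_fields(entry: str) -> Dict[str, str]:
    """Return the key=value pairs found in one (possibly multi-line) log entry."""
    return {key: value.strip('"') for key, value in _FIELD.findall(entry)}


def is_wedged(fields: Dict[str, str]) -> bool:
    """True when the entry matches the stalled group-topic signature above."""
    return (
        fields.get("classification") == "stalled_agent_run"
        and fields.get("lastProgress") == "embedded_run:started"
        and ":topic:" in fields.get("sessionKey", "")
    )


def total_duration_ms(text: str) -> int:
    """Sum every durationMs value that appears in the given text."""
    return sum(int(ms) for ms in re.findall(r"durationMs=(\d+)", text))


SAMPLE = """[diagnostic] stalled session: sessionId=unknown
  sessionKey=agent:main:telegram:group:<chat>:topic:<id>
  state=processing age=Ns queueDepth=1
  reason=active_work_without_progress
  classification=stalled_agent_run
  activeWorkKind=embedded_run
  lastProgress=embedded_run:started
  lastProgressAge=Ns recovery=none"""

FAILURES = """durationMs=195366 error=FailoverError
durationMs=183725 error=AbortError"""

fields = parse_fields(SAMPLE)
print(fields["sessionKey"], is_wedged(fields))
print(total_duration_ms(FAILURES))  # 379091
```

The two reported durations sum to 379091 ms, roughly 6.3 minutes, which is consistent with the ~6 minute recovery figure. The sketch does not establish that the two are causally linked.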

Root cause hypothesis

The gateway's OpenClaw session record and claude-cli's local project transcript at ~/.claude/projects/<workspace>/<uuid>.jsonl share the same UUID but are stored in two separate locations. When a NEW group-topic session is created post-2026.5.20:

| Store | State for a newly-minted lane |
| --- | --- |
| OpenClaw `agents/<id>/sessions/sessions.json` + `*-topic-<id>.jsonl` | Created, contains the user's first turn |
| `~/.claude/projects/<workspace>/<uuid>.jsonl` | Never written |

The next time the gateway invokes claude -p --resume <uuid>, claude-cli finds no matching transcript and hangs trying to resume a session that doesn't exist on its side, instead of failing fast or auto-creating fresh state. The two failover attempts (Opus then Sonnet, ~195s + ~183s) both hang the same way before the watchdog fires.

The documented safety net — "Stored session ids are verified against an existing readable project transcript before resume; phantom bindings are cleared with reason=transcript-missing instead of silently starting a fresh Claude CLI session under --resume" (per docs.openclaw.ai/gateway/cli-backends) — does not appear to be firing for group-topic lanes on 2026.5.20. Either:

- The check is wired only for lane=main, not for agent:main:telegram:group:*:topic:* lanes, or
- The check was bypassed by the new code path added in #19328 ("preserve fresh session overrides and metadata when stale cached agent-session entries race with store updates", shipped in 2026.5.20).

Notably, Telegram DM lanes are unaffected because the DM session predates 2026.5.20 and has long-standing ~/.claude/projects/<workspace>/<uuid>.jsonl state. Only newly-minted sessions hit the wedge.
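
To make the expected behaviour concrete, the following is a minimal Python sketch of a verify-before-resume guard of the kind the quoted documentation describes. It is an illustration under stated assumptions, not OpenClaw's implementation: `resume_decision` is a hypothetical name, and `project_dir` stands in for `~/.claude/projects/<workspace>` (how the workspace segment is derived is not covered here). Only the `transcript-missing` reason string comes from the quoted docs.

```python
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def resume_decision(
    project_dir: Path, stored_id: Optional[str]
) -> Tuple[Optional[str], Optional[str]]:
    """Return (session_id_to_resume, clear_reason) for a stored binding."""
    if stored_id is None:
        return None, None  # nothing stored: start fresh, no --resume
    transcript = project_dir / f"{stored_id}.jsonl"
    try:
        with transcript.open("rb") as fh:
            fh.read(1)  # prove the transcript can actually be read
    except OSError:
        # Missing or unreadable: drop the binding instead of resuming it.
        return None, "transcript-missing"
    return stored_id, None


# Demo against a throwaway directory with one real transcript.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "known-id.jsonl").write_text("{}\n")
    known = resume_decision(project, "known-id")
    phantom = resume_decision(project, "phantom-id")

print(known)    # ('known-id', None)
print(phantom)  # (None, 'transcript-missing')
```

If a guard like this ran for every lane shape, a newly-minted group-topic UUID with no transcript would take the second branch and never reach `--resume`.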

Reproducer

1. Run OpenClaw 2026.5.20 with cliBackends.claude-cli pointing at Claude Code CLI.
2. Configure a Telegram channel with a supergroup that has forum topics enabled; allow at least one user to post.
3. Have the user post a message to a topic that has no prior OpenClaw session for that lane (i.e. agents/<id>/sessions/ contains no <uuid>-topic-<topicId>.jsonl for that topic).
4. Observe the gateway journal: the lane enters processing state, embedded_run:started, no progress, and is aborted by the watchdog at ~6 minutes.
5. Inspect storage: the new session UUID exists in agents/<id>/sessions/sessions.json and as <uuid>-topic-<topicId>.jsonl, but no corresponding file exists at ~/.claude/projects/<workspace>/<uuid>.jsonl.
6. User posts again to the same topic. Same outcome.
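
The storage inspection step can be sketched in Python as a small audit that lists topic sessions with no matching claude-cli transcript. This is a rough sketch that relies only on the file names given in this report. `find_orphans` is a hypothetical helper, both directory arguments are supplied by the caller, and the regex assumes session ids are canonical 8-4-4-4-12 hex UUIDs.

```python
import re
import tempfile
from pathlib import Path
from typing import List, Tuple

# File name shape from this report: <uuid>-topic-<topicId>.jsonl
_TOPIC_FILE = re.compile(
    r"^(?P<uuid>[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
    r"-topic-(?P<topic>[^.]+)\.jsonl$",
    re.IGNORECASE,
)


def find_orphans(sessions_dir: Path, project_dir: Path) -> List[Tuple[str, str]]:
    """List (uuid, topic_id) pairs whose claude-cli transcript is absent."""
    orphans = []
    for session_file in sorted(sessions_dir.glob("*-topic-*.jsonl")):
        match = _TOPIC_FILE.match(session_file.name)
        if match and not (project_dir / f"{match['uuid']}.jsonl").is_file():
            orphans.append((match["uuid"], match["topic"]))
    return orphans


# Demo: two topic sessions, only the first has a transcript on the CLI side.
with tempfile.TemporaryDirectory() as tmp:
    sessions, project = Path(tmp, "sessions"), Path(tmp, "project")
    sessions.mkdir()
    project.mkdir()
    healthy = "11111111-1111-1111-1111-111111111111"
    wedged = "22222222-2222-2222-2222-222222222222"
    (sessions / f"{healthy}-topic-7.jsonl").write_text("{}\n")
    (sessions / f"{wedged}-topic-9.jsonl").write_text("{}\n")
    (project / f"{healthy}.jsonl").write_text("{}\n")
    orphans = find_orphans(sessions, project)

print(orphans)  # [('22222222-2222-2222-2222-222222222222', '9')]
```

On a real host the two arguments would be the `agents/<id>/sessions` directory and the `~/.claude/projects/<workspace>` directory for the affected agent.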

Workaround in use

Manual quarantine and store cleanup, per the unsustainable workaround documented in #44687:

mv agents/<id>/sessions/<uuid>-topic-<topicId>.* /quarantine/
openclaw sessions cleanup --fix-missing --enforce --active-key "agent:<id>:telegram:direct:<user>"

Confirmed: this unblocks the affected topic, the next inbound creates a fresh session UUID, and replies start flowing again. But within hours, new sessions wedge the same way — first run on the new UUID hits the same hang, watchdog aborts, lane re-enters the wedge state. A 24-hour cycle requires repeated manual cleanup.

Gateway restart alone does NOT fix this — verified. The OpenClaw session record is rehydrated from disk with the orphan UUID still present, and the resume hang reproduces on the first post-restart inbound.
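
Because the cleanup has to be repeated, a dry-run-capable version of the `mv` step may be safer than typing the glob by hand. The Python sketch below mirrors only that first command. `quarantine_session` is a hypothetical helper, the `.lock` file in the demo is invented for illustration, and the `openclaw sessions cleanup` command still has to be run separately as shown above.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def quarantine_session(
    sessions_dir: Path,
    quarantine_dir: Path,
    session_id: str,
    topic_id: str,
    dry_run: bool = True,
) -> List[Path]:
    """Move <session_id>-topic-<topic_id>.* out of the store; return the matches."""
    matches = sorted(sessions_dir.glob(f"{session_id}-topic-{topic_id}.*"))
    if dry_run:
        return matches  # report what would move, touch nothing
    quarantine_dir.mkdir(parents=True, exist_ok=True)
    for src in matches:
        shutil.move(str(src), str(quarantine_dir / src.name))
    return matches


# Demo in a throwaway directory with made-up ids.
with tempfile.TemporaryDirectory() as tmp:
    sessions, quarantine = Path(tmp, "sessions"), Path(tmp, "quarantine")
    sessions.mkdir()
    for name in ("abc-topic-9.jsonl", "abc-topic-9.jsonl.lock", "abc-topic-10.jsonl"):
        (sessions / name).write_text("")
    planned = [p.name for p in quarantine_session(sessions, quarantine, "abc", "9")]
    still_there = (sessions / "abc-topic-9.jsonl").exists()
    quarantine_session(sessions, quarantine, "abc", "9", dry_run=False)
    moved = sorted(p.name for p in quarantine.iterdir())
    remaining = sorted(p.name for p in sessions.iterdir())

print(planned, moved, remaining)
```

The default is `dry_run=True` so that a mistyped id cannot move files from a healthy lane. Files for topic 10 are left in place because the pattern requires the literal `9.` after `-topic-`.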

Related upstream issues

#44687 (closed, fixed in 2026.3.x): "Stale session resume at gateway startup blocks lane=main indefinitely". Same symptom family, but only for lane=main, only for sessions inherited from a prior gateway lifetime. Our case is for group-topic lanes and for newly-minted sessions.
#71127 (closed, fixed in 2026.4.x): "Stuck processing sessions are detected but never aborted". Our case has detection + abort working; the underlying resume hang re-establishes immediately after each abort.
#19328 (shipped in 2026.5.20): "preserve fresh session overrides and metadata when stale cached agent-session entries race with store updates". Suspected source of this regression.
#82964 (shipped in 2026.5.22-beta.1): "skip stale embedded-run wake probes for dormant completion requesters". May address the sub-agent delivery cascade (the cross-topic-to-DM routing symptom).
#84949 (shipped in 2026.5.22-beta.1): "bound embedded auto-compaction session write-lock watchdogs to the compaction timeout". Related lane-state cleanup work.
#81191 (closed): event-loop starvation from startAccount Telegram polling. Different root cause, but symptom magnitude (400+ second event-loop delays) overlaps with our 195s+183s timeouts.

Suggested investigation paths

Audit the new-session creation path for the agent:main:telegram:group:*:topic:* lane shape. Does it go through the same write-claude-cli-state step as agent:main:main and agent:main:telegram:direct:*? If not, that's the gap.
Re-verify the "verify-transcript-before-resume" code path still fires for all lane shapes in 2026.5.20. If it was a lane=main only check, it needs to be generalized.
Check whether #19328's fix introduced an early return / state-reuse path that bypasses the safety net for new sessions.
Confirm 2026.5.22-beta.1 fixes this — we have not upgraded.

What we did NOT include

Sanitized for privacy:

Specific user IDs, Telegram chat IDs, topic names, bot username
Workspace contents, third-party PII referenced in any wedged session
Per-skill names that reveal the deployment's business use

If maintainers need additional unredacted log excerpts or sessions.json snippets to reproduce, those can be shared privately on request.
