openclaw - ✅(Solved) Fix Bug Report: Stuck sessions cause permanent gateway hang with no auto-recovery (v2026.4.26) [1 pull request, 1 comment, 2 participants]

openclaw/openclaw#73510 · Fetched 2026-04-29 06:19:00

OpenClaw v2026.4.26 with Feishu WebSocket channel becomes permanently unresponsive when a session enters stuck state. The diagnostic system detects the problem but takes no recovery action, resulting in complete channel outage requiring manual intervention.

Error Message

AxiosError: timeout of 10000ms exceeded url: 'https://open.feishu.cn/open-apis/bot/v1/openclaw_bot/ping'

Root Cause

The failure chain is:

1. Feishu session transcript grows unbounded (Bug 2, Bug 3)
2. Large transcript (2.7MB) loaded on every dispatch
3. Model API request takes too long → timeout
4. Node.js event loop blocked (Bug 5 amplifies this)
5. Session enters stuck state (queueDepth=1, state=processing)
6. Stuck session detected but NO action taken (Bug 1)
7. All subsequent messages dropped (replies=0)
8. Complete channel outage until manual restart

Restarting does not fix the problem because the large transcript is reloaded without compaction (Bug 2).


Fix Action

Fix / Workaround

Time (UTC+8) | Event
16:25 | First stuck session detected (stuck session age=150s) on Feishu DM session
16:25–16:32 | Stuck session alarm repeats every 30s: 150s → 180s → 210s → 240s → 270s → 300s → 330s → 360s → 390s → 420s → 450s → 480s → 510s
16:33–16:38 | Feishu messages received but dispatch returns replies=0 (responses silently dropped)
16:42–18:36 | Gateway restarted 6+ times; each time briefly recovered then stuck again
18:36 | Gateway running but Feishu channel still non-responsive

Log Evidence:

{"subsystem":"diagnostic","message":"stuck session: sessionId=main sessionKey=agent:main:feishu:direct:ou_xxx state=processing age=150s queueDepth=1"}
{"subsystem":"diagnostic","message":"stuck session: sessionId=main sessionKey=agent:main:feishu:direct:ou_xxx state=processing age=510s queueDepth=1"}
{"subsystem":"gateway/channels/feishu","message":"feishu[default]: dispatch complete (queuedFinal=false, replies=0)"}

PR fix notes

PR #73243: fix(diagnostics): abort stuck sessions

Description (problem / solution / changelog)

Summary

  • Problem: Stuck processing sessions only emitted session.stuck diagnostics and were never recovered.
  • Why it matters: A wedged run could keep a session lane blocked indefinitely, preventing later messages from progressing.
  • What changed: Added diagnostics.stuckSessionAbortMs; the heartbeat now aborts stuck reply/embedded runs, clears queued work, emits session.aborted, and marks diagnostic state idle after recovery.
  • What did NOT change (scope boundary): This does not change normal run timeout behavior or message queue policy for healthy sessions.
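The heartbeat behavior summarized above can be sketched as a single pass over tracked sessions. This is a minimal illustration, not the code from the PR: only the config keys `stuckSessionWarnMs` and `stuckSessionAbortMs` and the event names `session.stuck` and `session.aborted` come from the description; `TrackedSession`, `heartbeat`, and the `abortRun` callback are hypothetical names.

```typescript
// Hypothetical sketch of a diagnostics heartbeat with a warn and an abort threshold.
// Only the two threshold names and the two event names come from the PR description.
type SessionState = "idle" | "processing";

interface TrackedSession {
  sessionKey: string;
  state: SessionState;
  processingSinceMs: number; // timestamp at which the session entered "processing"
  queueDepth: number;
}

interface Thresholds {
  stuckSessionWarnMs: number;
  stuckSessionAbortMs: number;
}

interface DiagnosticEvent {
  type: "session.stuck" | "session.aborted";
  sessionKey: string;
  ageMs: number;
}

function heartbeat(
  sessions: TrackedSession[],
  nowMs: number,
  thresholds: Thresholds,
  abortRun: (sessionKey: string) => void, // assumed hook into the existing abort path
): DiagnosticEvent[] {
  const events: DiagnosticEvent[] = [];
  for (const session of sessions) {
    if (session.state !== "processing") continue;
    const ageMs = nowMs - session.processingSinceMs;
    if (ageMs >= thresholds.stuckSessionAbortMs) {
      // Recovery: abort the run, drop queued work, and mark the session idle
      // so that the next heartbeat does not abort it a second time.
      abortRun(session.sessionKey);
      session.queueDepth = 0;
      session.state = "idle";
      events.push({ type: "session.aborted", sessionKey: session.sessionKey, ageMs });
    } else if (ageMs >= thresholds.stuckSessionWarnMs) {
      // Below the abort threshold the heartbeat only warns, as before the fix.
      events.push({ type: "session.stuck", sessionKey: session.sessionKey, ageMs });
    }
  }
  return events;
}
```

Marking the session idle inside the same pass is what makes recovery fire once per stuck run rather than on every later heartbeat.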

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #71127
  • Related #
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: The diagnostics heartbeat only warned when a session exceeded stuckSessionWarnMs; it had no recovery threshold or abort path.
  • Missing detection / guardrail: Existing coverage asserted session.stuck emission but did not assert that stuck sessions are eventually released.
  • Contributing context (if known): Session recovery already existed elsewhere (/stop, reply run aborts, embedded PI aborts), but diagnostics did not invoke it.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/logging/diagnostic.test.ts
  • Scenario the test should lock in: A session in processing past diagnostics.stuckSessionAbortMs triggers recovery once, emits session.aborted, and returns diagnostic state to idle.
  • Why this is the smallest reliable guardrail: The heartbeat threshold and recovery decision live in diagnostics, so a focused fake-timer unit test directly covers the bug.
  • Existing test that already covers this (if any): Existing stuck-session warning test covered only session.stuck.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

Adds optional config diagnostics.stuckSessionAbortMs. By default, stuck sessions are warned after 2 minutes and recovered after 15 minutes if still processing.
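As a rough illustration of how the two thresholds relate, the sketch below resolves a diagnostics config against the defaults stated above (2 minutes to warn, 15 minutes to recover). The key names `stuckSessionWarnMs` and `stuckSessionAbortMs` appear in the PR description; the `DiagnosticsConfig` shape and the `resolveStuckSessionThresholds` helper are assumptions made for this example.

```typescript
// Hypothetical config shape; only the two key names are taken from the PR description.
interface DiagnosticsConfig {
  stuckSessionWarnMs?: number;
  stuckSessionAbortMs?: number;
}

// Defaults as described in the PR: warn after 2 minutes, recover after 15 minutes.
const DEFAULT_STUCK_SESSION_WARN_MS = 2 * 60 * 1000;
const DEFAULT_STUCK_SESSION_ABORT_MS = 15 * 60 * 1000;

function resolveStuckSessionThresholds(config: DiagnosticsConfig = {}): {
  warnMs: number;
  abortMs: number;
} {
  return {
    warnMs: config.stuckSessionWarnMs ?? DEFAULT_STUCK_SESSION_WARN_MS,
    abortMs: config.stuckSessionAbortMs ?? DEFAULT_STUCK_SESSION_ABORT_MS,
  };
}

// Example override: recover after 5 minutes instead of the default 15.
const tightened = resolveStuckSessionThresholds({ stuckSessionAbortMs: 5 * 60 * 1000 });
```

If the defaults apply as described, a session would be released only once its age reaches 900s, so operators who need faster recovery would have to lower `diagnostics.stuckSessionAbortMs` explicitly.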

Changed files

  • src/agents/pi-embedded-runner/active-run-abort.ts (added, +11/-0)
  • src/config/schema.base.generated.ts (modified, +13/-0)
  • src/config/schema.help.ts (modified, +2/-0)
  • src/config/schema.labels.ts (modified, +1/-0)
  • src/config/types.base.ts (modified, +2/-0)
  • src/config/zod-schema.ts (modified, +1/-0)
  • src/gateway/config-reload-plan.ts (modified, +1/-0)
  • src/gateway/config-reload.test.ts (modified, +6/-2)
  • src/infra/diagnostic-events.ts (modified, +12/-0)
  • src/logging/diagnostic-session-state.ts (modified, +2/-0)
  • src/logging/diagnostic.test.ts (modified, +124/-0)
  • src/logging/diagnostic.ts (modified, +111/-1)


name: Bug Report
about: Stuck sessions cause gateway to become permanently unresponsive
labels: bug, stability, feishu

Summary

OpenClaw v2026.4.26 with Feishu WebSocket channel becomes permanently unresponsive when a session enters stuck state. The diagnostic system detects the problem but takes no recovery action, resulting in complete channel outage requiring manual intervention.

Environment

  • OpenClaw: v2026.4.26 (be8c246)
  • Channel: Feishu (WebSocket mode)
  • Model: bailian/qwen3.6-plus (Aliyun DashScope Coding Plan)
  • OS: Linux x64, 16GB RAM (Ubuntu)
  • Node.js: v22.22.2

Timeline of Events

Time (UTC+8) | Event
16:25 | First stuck session detected (stuck session age=150s) on Feishu DM session
16:25–16:32 | Stuck session alarm repeats every 30s: 150s → 180s → 210s → 240s → 270s → 300s → 330s → 360s → 390s → 420s → 450s → 480s → 510s
16:33–16:38 | Feishu messages received but dispatch returns replies=0 (responses silently dropped)
16:42–18:36 | Gateway restarted 6+ times; each time briefly recovered then stuck again
18:36 | Gateway running but Feishu channel still non-responsive

Total outage duration: >3 hours of repeated failures across multiple restarts.


Bug 1: Stuck Session Detection Has No Auto-Recovery (Critical)

Severity: 🔴 Critical — complete service outage

Description: The diagnostic subsystem correctly detects stuck sessions and logs warnings, but takes zero recovery action. Sessions remain permanently in state=processing with queueDepth=1, blocking all subsequent messages to that session.

Log Evidence:

{"subsystem":"diagnostic","message":"stuck session: sessionId=main sessionKey=agent:main:feishu:direct:ou_xxx state=processing age=150s queueDepth=1"}
{"subsystem":"diagnostic","message":"stuck session: sessionId=main sessionKey=agent:main:feishu:direct:ou_xxx state=processing age=510s queueDepth=1"}
{"subsystem":"gateway/channels/feishu","message":"feishu[default]: dispatch complete (queuedFinal=false, replies=0)"}

The diagnostic fires every 30 seconds with increasing age (150s → 510s), but no kill, reset, or restart is triggered. The process becomes permanently unresponsive.
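To confirm from logs that a session never recovered, the stuck-session lines can be parsed and their ages compared. The sketch below assumes the message layout shown in the log evidence above; `parseStuckSessionLine` and `agesOnlyIncrease` are hypothetical helper names, not part of OpenClaw.

```typescript
interface StuckSessionReport {
  sessionId: string;
  sessionKey: string;
  state: string;
  ageSeconds: number;
  queueDepth: number;
}

// Matches the message layout seen in the log evidence; other layouts return null.
const STUCK_PATTERN =
  /^stuck session: sessionId=(\S+) sessionKey=(\S+) state=(\S+) age=(\d+)s queueDepth=(\d+)$/;

function parseStuckSessionLine(jsonLine: string): StuckSessionReport | null {
  let record: { subsystem?: unknown; message?: unknown } | null;
  try {
    record = JSON.parse(jsonLine);
  } catch {
    return null; // not a JSON log line
  }
  if (record === null || typeof record !== "object") return null;
  if (record.subsystem !== "diagnostic" || typeof record.message !== "string") return null;
  const match = STUCK_PATTERN.exec(record.message);
  if (!match) return null;
  return {
    sessionId: match[1],
    sessionKey: match[2],
    state: match[3],
    ageSeconds: Number(match[4]),
    queueDepth: Number(match[5]),
  };
}

// True when every report is older than the one before it, i.e. the session
// was never reset between consecutive diagnostics.
function agesOnlyIncrease(reports: StuckSessionReport[]): boolean {
  for (let i = 1; i < reports.length; i++) {
    if (reports[i].ageSeconds <= reports[i - 1].ageSeconds) return false;
  }
  return true;
}
```

An age that keeps climbing for the same `sessionKey` while `queueDepth` stays at 1 is the signature described in this bug.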

Impact: Gateway is functionally dead. No messages can be processed. Only manual restart helps, and even that is temporary if the root cause persists.

Expected behavior:

  • Stuck session timeout → kill the hung request
  • Or: auto-reset the affected session
  • Or: trigger gateway restart after N consecutive stuck detections

Bug 2: Compaction Does Not Trigger at Gateway Startup (Critical)

Severity: 🔴 Critical — prevents self-healing after restart

Description: When compaction.mode: safeguard and compaction.maxActiveTranscriptBytes: "20mb" are configured, compaction only runs as a preflight check before a new run. At gateway startup, existing large transcript files are loaded without compaction.

Our Feishu session transcript was 2.7MB / 1008 lines (with a trajectory file that grew to 14MB before being deleted by OpenClaw). On every restart, this 2.7MB file is fully loaded into memory, contributing to the event loop overload that causes the stuck session in Bug 1.

Because 2.7MB is well below the 20MB threshold, the preflight compaction check never triggers either, and the transcript keeps accumulating.

Impact: Large transcripts survive restarts and immediately re-create the conditions that caused the original crash. The compaction config is effectively useless for existing sessions.

Expected behavior:

  • Gateway startup should check transcript size and run compaction if needed
  • Or: add a row-based compaction threshold (e.g., >500 lines) in addition to byte-based
  • Or: compact all sessions on startup regardless of size
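The first two options above could be combined into one startup check. The sketch below is illustrative only: `compaction.maxActiveTranscriptBytes` is the setting named in this report, while `maxActiveTranscriptLines`, `parseByteSize`, and `shouldCompactAtStartup` are hypothetical, and treating "mb" as 1024 × 1024 bytes is an assumption about how the size string is interpreted.

```typescript
interface TranscriptStats {
  bytes: number;
  lines: number;
}

interface CompactionThresholds {
  maxActiveTranscriptBytes: number;  // from compaction.maxActiveTranscriptBytes
  maxActiveTranscriptLines?: number; // hypothetical row-based threshold
}

// Assumption: binary units (1 mb = 1024 * 1024 bytes).
const UNIT_BYTES: Record<string, number> = {
  b: 1,
  kb: 1024,
  mb: 1024 * 1024,
  gb: 1024 * 1024 * 1024,
};

function parseByteSize(text: string): number {
  const match = /^(\d+(?:\.\d+)?)\s*(b|kb|mb|gb)$/i.exec(text.trim());
  if (!match) throw new Error(`unrecognized size: ${text}`);
  return Math.round(Number(match[1]) * UNIT_BYTES[match[2].toLowerCase()]);
}

// Compact when either the byte limit or the optional line limit is exceeded.
function shouldCompactAtStartup(
  stats: TranscriptStats,
  thresholds: CompactionThresholds,
): boolean {
  if (stats.bytes > thresholds.maxActiveTranscriptBytes) return true;
  const maxLines = thresholds.maxActiveTranscriptLines;
  return maxLines !== undefined && stats.lines > maxLines;
}
```

With the numbers from this report (2.7MB, 1008 lines), the byte check alone would not trigger under a 20mb limit, whereas a line limit of 500 would.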

Bug 3: Trajectory Files Grow Without Bound

Severity: 🟡 High — memory bomb

Description: The Feishu session trajectory file grew to 14MB before OpenClaw eventually deleted it (leaving a *.trajectory.jsonl.deleted.* artifact). There is no configured size limit for trajectory files.

Evidence:

-rw------- 1 openclaw 14M agents/main/sessions/3e9dd919.trajectory.jsonl.deleted.2026-04-28T10-05-50.972Z

Impact: Unbounded trajectory growth contributes to memory pressure and eventual crash.

Expected behavior: Trajectory files should have independent size limits with automatic truncation.
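One way to read "automatic truncation" for a JSONL file is to keep only the newest complete lines that fit within a byte budget. The sketch below works on file content held in a string; `truncateJsonlTail` is a hypothetical helper, and keeping the newest entries rather than the oldest is an assumption about the desired policy.

```typescript
// Keeps the newest complete JSONL lines whose total size (including newlines)
// fits within maxBytes. Lines are never cut in the middle.
function truncateJsonlTail(content: string, maxBytes: number): string {
  const encoder = new TextEncoder();
  const lines = content.split("\n").filter((line) => line.length > 0);
  const kept: string[] = [];
  let usedBytes = 0;
  // Walk backwards from the newest line and stop at the first one that does not fit.
  for (let i = lines.length - 1; i >= 0; i--) {
    const cost = encoder.encode(lines[i]).length + 1; // +1 for the trailing newline
    if (usedBytes + cost > maxBytes) break;
    kept.push(lines[i]);
    usedBytes += cost;
  }
  kept.reverse(); // restore chronological order
  return kept.length > 0 ? kept.join("\n") + "\n" : "";
}
```

A real implementation would more likely stream the file than load 14MB into memory, so this only illustrates the line-boundary logic.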


Bug 4: No Memory Cap on Gateway Process

Severity: 🟡 Medium — can affect other services

Description: Gateway RSS memory grew continuously: 718MB (post-upgrade) → 865MB → 1.0GB peak. No --max-old-space-size parameter is set on the Node.js process, and no systemd MemoryMax= limit exists.

Memory Timeline:

Time | RSS Memory | Notes
Upgrade (04/28) | 718MB | Fresh start
16:42 restart | ~818MB | Before manual cleanup
17:31 restart | 728MB | After config changes
18:10 restart | 671MB → 706MB | 7 min growth
18:36 restart | 865MB → peak 1.0GB | Current

Expected behavior: Gateway should have configurable memory limits with graceful degradation or restart.
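A configurable limit with graceful degradation could be expressed as a soft and a hard RSS threshold. The sketch below is purely illustrative: `MemoryLimits`, `classifyRss`, and the 800MB / 1GB values are hypothetical and not taken from OpenClaw. In Node.js the current RSS is available as `process.memoryUsage().rss`; note that `--max-old-space-size` caps the V8 old-space heap rather than total RSS, and systemd's `MemoryMax=` enforces a limit from outside the process.

```typescript
type MemoryAction = "ok" | "degrade" | "restart";

interface MemoryLimits {
  softLimitBytes: number; // above this: shed load, e.g. stop accepting new runs
  hardLimitBytes: number; // at or above this: request a controlled restart
}

const MB = 1024 * 1024;

// Pure decision function so it can be unit-tested without a live process.
// A caller in the gateway would pass process.memoryUsage().rss as rssBytes.
function classifyRss(rssBytes: number, limits: MemoryLimits): MemoryAction {
  if (rssBytes >= limits.hardLimitBytes) return "restart";
  if (rssBytes >= limits.softLimitBytes) return "degrade";
  return "ok";
}

// Hypothetical limits chosen only to illustrate the figures in the memory timeline.
const exampleLimits: MemoryLimits = { softLimitBytes: 800 * MB, hardLimitBytes: 1024 * MB };
```

Under these example limits, the 718MB reading would pass, 865MB would trigger degradation, and the 1.0GB peak would request a restart.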


Bug 5: Feishu Channel Heartbeat Ping Blocks Event Loop

Severity: 🟡 Medium — cascading failures

Description: The Feishu channel sends a periodic ping to https://open.feishu.cn/open-apis/bot/v1/openclaw_bot/ping. When this request times out (10 seconds), it blocks the Node.js event loop, causing all other HTTP requests to queue — including model API calls and other channel operations.

Log Evidence:

AxiosError: timeout of 10000ms exceeded
url: 'https://open.feishu.cn/open-apis/bot/v1/openclaw_bot/ping'

Impact: A single slow outbound request degrades the entire gateway. In our case, this created a cascading failure where the ping timeout contributed to model API timeouts, which in turn caused the stuck session.

Expected behavior: Heartbeat pings should run in a non-blocking manner with independent timeout handling.
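Besides independent timeout handling, the requested actions in this report mention a circuit breaker for failing outbound requests. A minimal sketch of that idea follows; the `CircuitBreaker` class, its thresholds, and the injected clock are assumptions for illustration and do not reflect how the Feishu channel is implemented.

```typescript
// Minimal circuit breaker: after `failureThreshold` consecutive failures the
// breaker opens and callers skip the request until `cooldownMs` has elapsed.
class CircuitBreaker {
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private consecutiveFailures = 0;
  private openedAtMs: number | null = null;

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // Returns false while the breaker is open and the cooldown has not elapsed.
  canAttempt(nowMs: number): boolean {
    if (this.openedAtMs === null) return true;
    return nowMs - this.openedAtMs >= this.cooldownMs;
  }

  // A success closes the breaker and clears the failure count.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAtMs = null;
  }

  // A failure at or past the threshold (re)opens the breaker at nowMs,
  // so a failed probe after the cooldown starts a fresh cooldown.
  recordFailure(nowMs: number): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) {
      this.openedAtMs = nowMs;
    }
  }
}
```

A ping loop would call `canAttempt` before each request and record the outcome afterwards, so repeated 10-second timeouts stop being issued once the breaker opens.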


Bug 6: Gateway Service PATH Configuration Incomplete

Severity: 🔵 Low — cosmetic but confusing

Description: openclaw gateway status reports:

Gateway service PATH missing required dirs: /home/openclaw/.local/share/fnm/aliases/default/bin
Recommendation: run "openclaw doctor --repair"

The PATH is not auto-fixed despite openclaw doctor --repair being available.


Root Cause Analysis

The failure chain is:

1. Feishu session transcript grows unbounded (Bug 2, Bug 3)
2. Large transcript (2.7MB) loaded on every dispatch
3. Model API request takes too long → timeout
4. Node.js event loop blocked (Bug 5 amplifies this)
5. Session enters stuck state (queueDepth=1, state=processing)
6. Stuck session detected but NO action taken (Bug 1)
7. All subsequent messages dropped (replies=0)
8. Complete channel outage until manual restart

Restarting does not fix the problem because the large transcript is reloaded without compaction (Bug 2).


Requested Actions

  1. Implement auto-recovery for stuck sessions (kill/reset/restart after configurable timeout)
  2. Trigger compaction at gateway startup for sessions exceeding configurable thresholds
  3. Add trajectory file size limits with automatic truncation
  4. Document recommended memory limits for Node.js gateway process
  5. Make Feishu ping non-blocking or add circuit breaker for failing outbound requests

extent analysis

TL;DR

Implement auto-recovery for stuck sessions and trigger compaction at gateway startup to prevent complete channel outage.

Guidance

  • Identify and implement a suitable auto-recovery mechanism for stuck sessions, such as killing or resetting the session after a configurable timeout.
  • Modify the gateway to trigger compaction at startup for sessions exceeding configurable thresholds, preventing large transcripts from causing issues.
  • Consider adding trajectory file size limits with automatic truncation to prevent memory pressure.
  • Review and document recommended memory limits for the Node.js gateway process to prevent excessive memory growth.
  • Investigate making the Feishu ping non-blocking or adding a circuit breaker to handle failing outbound requests.

Example

No specific code example is provided, as the issue requires a comprehensive solution involving multiple components and configurations.

Notes

The provided issue description is detailed, and the root cause analysis is clear. However, implementing the requested actions may require additional development, testing, and validation to ensure the stability and reliability of the gateway.

Recommendation

Apply workaround: Implement auto-recovery for stuck sessions and trigger compaction at gateway startup, as these are critical fixes to prevent complete channel outage and ensure the gateway's stability.
