openclaw - ✅(Solved) Fix Bug Report: Stuck sessions cause permanent gateway hang with no auto-recovery (v2026.4.26) [1 pull request, 1 comment, 2 participants]

openclaw · 2026-04-28 11:02:07


openclaw/openclaw#73510 • Fetched 2026-04-29 06:19:00


Author

WS-Q0758

Participants

clawsweeper[bot]

WS-Q0758

Timeline (top)

cross-referenced ×2, commented ×1

OpenClaw v2026.4.26 with Feishu WebSocket channel becomes permanently unresponsive when a session enters stuck state. The diagnostic system detects the problem but takes no recovery action, resulting in complete channel outage requiring manual intervention.




name: Bug Report
about: Stuck sessions cause gateway to become permanently unresponsive
labels: bug, stability, feishu

Summary

Environment

OpenClaw: v2026.4.26 (be8c246)
Channel: Feishu (WebSocket mode)
Model: bailian/qwen3.6-plus (Aliyun DashScope Coding Plan)
OS: Linux x64, 16GB RAM (Ubuntu)
Node.js: v22.22.2

Timeline of Events

Time (UTC+8)	Event
16:25	First stuck session detected (`stuck session age=150s`) on Feishu DM session
16:25–16:32	Stuck session alarm repeats every 30s: 150s → 180s → 210s → 240s → 270s → 300s → 330s → 360s → 390s → 420s → 450s → 480s → 510s
16:33–16:38	Feishu messages received but dispatch returns `replies=0` (responses silently dropped)
16:42–18:36	Gateway restarted 6+ times; each time briefly recovered then stuck again
18:36	Gateway running but Feishu channel still non-responsive

Total outage duration: >3 hours of repeated failures across multiple restarts.

Bug 1: Stuck Session Detection Has No Auto-Recovery (Critical)

Severity: 🔴 Critical — complete service outage

Description: The diagnostic subsystem correctly detects stuck sessions and logs warnings, but takes zero recovery action. Sessions remain permanently in state=processing with queueDepth=1, blocking all subsequent messages to that session.

Log Evidence:

{"subsystem":"diagnostic","message":"stuck session: sessionId=main sessionKey=agent:main:feishu:direct:ou_xxx state=processing age=150s queueDepth=1"}
{"subsystem":"diagnostic","message":"stuck session: sessionId=main sessionKey=agent:main:feishu:direct:ou_xxx state=processing age=510s queueDepth=1"}
{"subsystem":"gateway/channels/feishu","message":"feishu[default]: dispatch complete (queuedFinal=false, replies=0)"}

The diagnostic fires every 30 seconds with increasing age (150s → 510s), but no kill, reset, or restart is triggered. The process becomes permanently unresponsive.
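For anyone who wants to act on these warnings from outside the gateway, the lines can be parsed mechanically. The following is a minimal TypeScript sketch, not OpenClaw code: `parseStuckSession` and `StuckSessionEvent` are hypothetical names, and the regular expression assumes the message format stays exactly as in the log lines quoted above.

```typescript
// Hypothetical helper (not part of OpenClaw): extracts the fields of a
// "stuck session" diagnostic line. Assumes the message format shown in the
// log evidence of this report.
interface StuckSessionEvent {
  sessionId: string;
  sessionKey: string;
  state: string;
  ageSeconds: number;
  queueDepth: number;
}

const STUCK_PATTERN =
  /^stuck session: sessionId=(\S+) sessionKey=(\S+) state=(\S+) age=(\d+)s queueDepth=(\d+)$/;

function parseStuckSession(line: string): StuckSessionEvent | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return null; // not a JSON log line
  }
  if (typeof parsed !== "object" || parsed === null) return null;

  const entry = parsed as { subsystem?: unknown; message?: unknown };
  if (entry.subsystem !== "diagnostic" || typeof entry.message !== "string") {
    return null; // some other subsystem, e.g. gateway/channels/feishu
  }

  const m = STUCK_PATTERN.exec(entry.message);
  if (m === null) return null;

  return {
    sessionId: m[1],
    sessionKey: m[2],
    state: m[3],
    ageSeconds: Number(m[4]),
    queueDepth: Number(m[5]),
  };
}
```

An external supervisor could feed each log line through this function and count consecutive non-null results per `sessionKey`.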

Impact: Gateway is functionally dead. No messages can be processed. Only manual restart helps, and even that is temporary if the root cause persists.

Expected behavior:

Stuck session timeout → kill the hung request
Or: auto-reset the affected session
Or: trigger gateway restart after N consecutive stuck detections
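
The three expected behaviors can be read as an escalation ladder. A rough TypeScript sketch of such a policy follows; the type names, the `chooseRecovery` function, and every threshold value are illustrative assumptions and do not correspond to any existing OpenClaw option.

```typescript
// Hypothetical escalation policy for stuck sessions. All names and numbers
// here are assumptions made for illustration only.
type RecoveryAction = "none" | "abort-request" | "reset-session" | "restart-gateway";

interface RecoveryPolicy {
  abortAfterSeconds: number;      // kill the hung request once the session is this old
  resetAfterSeconds: number;      // reset the session if aborting did not help
  restartAfterDetections: number; // restart after N consecutive stuck detections
}

function chooseRecovery(
  ageSeconds: number,
  consecutiveDetections: number,
  policy: RecoveryPolicy,
): RecoveryAction {
  // Check the most drastic step first so that it wins when several apply.
  if (consecutiveDetections >= policy.restartAfterDetections) return "restart-gateway";
  if (ageSeconds >= policy.resetAfterSeconds) return "reset-session";
  if (ageSeconds >= policy.abortAfterSeconds) return "abort-request";
  return "none";
}

// Example thresholds (assumed values, not defaults of any real system).
const examplePolicy: RecoveryPolicy = {
  abortAfterSeconds: 120,
  resetAfterSeconds: 300,
  restartAfterDetections: 10,
};
```

With these example thresholds, the first warning in the timeline (`age=150s`) would already have triggered an abort, and the thirteenth consecutive warning (`age=510s`) would have triggered a restart.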

Bug 2: Compaction Does Not Trigger at Gateway Startup (Critical)

Severity: 🔴 Critical — prevents self-healing after restart

Description: When compaction.mode: safeguard and compaction.maxActiveTranscriptBytes: "20mb" are configured, compaction only runs as a preflight check before a new run. At gateway startup, existing large transcript files are loaded without compaction.

Our Feishu session transcript was 2.7MB / 1008 lines (with a trajectory file that grew to 14MB before being deleted by OpenClaw). On every restart, this 2.7MB file is fully loaded into memory, contributing to the event loop overload that causes the stuck session in Bug 1.

Since 2.7MB < 20MB threshold, the preflight compaction check never triggers either. The transcript accumulates indefinitely.

Impact: Large transcripts survive restarts and immediately re-create the conditions that caused the original crash. The compaction config is effectively useless for existing sessions.

Expected behavior:

Gateway startup should check transcript size and run compaction if needed
Or: add a row-based compaction threshold (e.g., >500 lines) in addition to byte-based
Or: compact all sessions on startup regardless of size
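
To make the threshold gap concrete, here is a small TypeScript sketch of a startup check that combines a byte limit with the proposed row-based limit. It is an illustration under assumptions: `parseSizeToBytes`, `needsCompaction`, and the `maxLines` field are hypothetical, and reading "20mb" as 20 × 1024 × 1024 bytes is a guess about how the setting is interpreted.

```typescript
// Hypothetical startup check (not OpenClaw code). Assumes binary units
// (1 mb = 1024 * 1024 bytes), which the report does not confirm.
interface TranscriptStats {
  bytes: number;
  lines: number;
}

interface CompactionThresholds {
  maxBytes: number;
  maxLines: number; // proposed row-based threshold; use Infinity to disable
}

function parseSizeToBytes(value: string): number {
  const m = /^(\d+(?:\.\d+)?)\s*(b|kb|mb|gb)$/i.exec(value.trim());
  if (m === null) throw new Error(`unrecognized size: ${value}`);
  const units: Record<string, number> = {
    b: 1,
    kb: 1024,
    mb: 1024 * 1024,
    gb: 1024 * 1024 * 1024,
  };
  return Math.round(Number(m[1]) * units[m[2].toLowerCase()]);
}

function needsCompaction(stats: TranscriptStats, limits: CompactionThresholds): boolean {
  // Either limit being exceeded is enough to request compaction.
  return stats.bytes > limits.maxBytes || stats.lines > limits.maxLines;
}
```

For the transcript described above (2.7MB, 1008 lines), a byte-only limit of 20mb reports that no compaction is needed, while adding a 500-line limit flags it.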

Bug 3: Trajectory Files Grow Without Bound

Severity: 🟡 High — memory bomb

Description: The Feishu session trajectory file grew to 14MB before OpenClaw eventually deleted it (leaving a *.trajectory.jsonl.deleted.* artifact). There is no configured size limit for trajectory files.

Evidence:

-rw------- 1 openclaw 14M agents/main/sessions/3e9dd919.trajectory.jsonl.deleted.2026-04-28T10-05-50.972Z

Impact: Unbounded trajectory growth contributes to memory pressure and eventual crash.

Expected behavior: Trajectory files should have independent size limits with automatic truncation.
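
One way the requested truncation could work is to keep only the newest records that fit a byte budget. The TypeScript sketch below assumes the trajectory file is newline-delimited JSON (as the `.jsonl` extension suggests) and that dropping the oldest records is acceptable; `truncateJsonl` is a hypothetical name, not an OpenClaw function.

```typescript
// Hypothetical tail-keeping truncation for a JSONL file held in memory.
// Whole lines are kept or dropped, so no record is ever cut in half.
function truncateJsonl(content: string, maxBytes: number): string {
  const encoder = new TextEncoder();
  const lines = content.split("\n").filter((line) => line.length > 0);

  const kept: string[] = [];
  let usedBytes = 0;

  // Walk from the newest record backwards until the budget is exhausted.
  for (let i = lines.length - 1; i >= 0; i--) {
    const cost = encoder.encode(lines[i]).length + 1; // +1 for the newline
    if (usedBytes + cost > maxBytes) break;
    kept.push(lines[i]);
    usedBytes += cost;
  }

  // Restore chronological order and terminate every record with a newline.
  return kept.reverse().map((line) => line + "\n").join("");
}
```

For a 14MB file, a streaming variant that reads from the end of the file would avoid loading everything at once; the in-memory version is shown only because it is shorter.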

Bug 4: No Memory Cap on Gateway Process

Severity: 🟡 Medium — can affect other services

Description: Gateway RSS memory grew continuously: 718MB (post-upgrade) → 865MB → 1.0GB peak. No --max-old-space-size parameter is set on the Node.js process, and no systemd MemoryMax= limit exists.

Memory Timeline:

Time	RSS Memory	Notes
Upgrade (04/28)	718MB	Fresh start
16:42 restart	~818MB	Before manual cleanup
17:31 restart	728MB	After config changes
18:10 restart	671MB → 706MB	7 min growth
18:36 restart	865MB → peak 1.0GB	Current

Expected behavior: Gateway should have configurable memory limits with graceful degradation or restart.
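
As an illustration of "graceful degradation or restart", the TypeScript sketch below maps an RSS reading to an action using a soft and a hard limit. The names `checkMemory` and `MemoryLimits` and the limit values in the usage note are assumptions; in Node.js the reading itself could come from `process.memoryUsage().rss`, which reports bytes.

```typescript
// Hypothetical memory watchdog decision (not OpenClaw code).
type MemoryAction = "ok" | "degrade" | "restart";

interface MemoryLimits {
  softLimitBytes: number; // above this: shed load, e.g. compact or refuse new sessions
  hardLimitBytes: number; // above this: exit so a supervisor can restart the process
}

const MB = 1024 * 1024;

function checkMemory(rssBytes: number, limits: MemoryLimits): MemoryAction {
  if (rssBytes >= limits.hardLimitBytes) return "restart";
  if (rssBytes >= limits.softLimitBytes) return "degrade";
  return "ok";
}
```

With an assumed soft limit of 800MB and hard limit of 1000MB, the 718MB reading from the table is fine, 865MB triggers degradation, and the 1.0GB peak triggers a restart. This complements, rather than replaces, an OS-level cap such as the systemd `MemoryMax=` setting mentioned above.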

Bug 5: Feishu Channel Heartbeat Ping Blocks Event Loop

Severity: 🟡 Medium — cascading failures

Description: The Feishu channel sends a periodic ping to https://open.feishu.cn/open-apis/bot/v1/openclaw_bot/ping. When this request times out (10 seconds), it blocks the Node.js event loop, causing all other HTTP requests to queue — including model API calls and other channel operations.

Log Evidence:

AxiosError: timeout of 10000ms exceeded
url: 'https://open.feishu.cn/open-apis/bot/v1/openclaw_bot/ping'

Impact: A single slow outbound request degrades the entire gateway. In our case, this created a cascading failure where the ping timeout contributed to model API timeouts, which in turn caused the stuck session.

Expected behavior: Heartbeat pings should run in a non-blocking manner with independent timeout handling.
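
The circuit breaker requested later in this report can be sketched as a small state machine that stops issuing pings after repeated failures and retries only after a cooldown. This TypeScript example is an assumption-laden illustration: the class name, the threshold of 3 failures, and the 30-second cooldown in the usage note are invented, and the caller is expected to pass in the current time in milliseconds.

```typescript
// Hypothetical circuit breaker for a failing outbound request such as the
// heartbeat ping. Time is injected so the logic stays deterministic.
class PingCircuitBreaker {
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private failures = 0;
  private openedAt: number | null = null;

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // True while the breaker is closed, or once the cooldown has elapsed.
  canAttempt(nowMs: number): boolean {
    if (this.openedAt === null) return true;
    return nowMs - this.openedAt >= this.cooldownMs;
  }

  // A successful ping closes the breaker and clears the failure count.
  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  // A failed ping (for example a 10000ms timeout) counts toward opening;
  // a failure after the cooldown re-opens the breaker immediately.
  recordFailure(nowMs: number): void {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.openedAt = nowMs;
  }
}
```

A heartbeat loop would call `canAttempt(Date.now())` before each ping and skip the request while the breaker is open, so that a failing endpoint is not contacted on every tick.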

Bug 6: Gateway Service PATH Configuration Incomplete

Severity: 🔵 Low — cosmetic but confusing

Description: openclaw gateway status reports:

Gateway service PATH missing required dirs: /home/openclaw/.local/share/fnm/aliases/default/bin
Recommendation: run "openclaw doctor --repair"

The PATH is not auto-fixed despite openclaw doctor --repair being available.

Root Cause Analysis

The failure chain is:

1. Feishu session transcript grows unbounded (Bug 2, Bug 3)
     ↓
2. Large transcript (2.7MB) loaded on every dispatch
     ↓
3. Model API request takes too long → timeout
     ↓
4. Node.js event loop blocked (Bug 5 amplifies this)
     ↓
5. Session enters stuck state (queueDepth=1, state=processing)
     ↓
6. Stuck session detected but NO action taken (Bug 1)
     ↓
7. All subsequent messages dropped (replies=0)
     ↓
8. Complete channel outage until manual restart

Restarting does not fix the problem because the large transcript is reloaded without compaction (Bug 2).

Requested Actions

Implement auto-recovery for stuck sessions (kill/reset/restart after configurable timeout)
Trigger compaction at gateway startup for sessions exceeding configurable thresholds
Add trajectory file size limits with automatic truncation
Document recommended memory limits for Node.js gateway process
Make Feishu ping non-blocking or add circuit breaker for failing outbound requests

extent analysis

TL;DR

Implement auto-recovery for stuck sessions and trigger compaction at gateway startup to prevent complete channel outage.

Guidance

Identify and implement a suitable auto-recovery mechanism for stuck sessions, such as killing or resetting the session after a configurable timeout.
Modify the gateway to trigger compaction at startup for sessions exceeding configurable thresholds, preventing large transcripts from causing issues.
Consider adding trajectory file size limits with automatic truncation to prevent memory pressure.
Review and document recommended memory limits for the Node.js gateway process to prevent excessive memory growth.
Investigate making the Feishu ping non-blocking or adding a circuit breaker to handle failing outbound requests.

Example

No specific code example is provided, as the issue requires a comprehensive solution involving multiple components and configurations.

Notes

The provided issue description is detailed, and the root cause analysis is clear. However, implementing the requested actions may require additional development, testing, and validation to ensure the stability and reliability of the gateway.

Recommendation

Apply workaround: Implement auto-recovery for stuck sessions and trigger compaction at gateway startup, as these are critical fixes to prevent complete channel outage and ensure the gateway's stability.



PR fix notes

PR #73243: fix(diagnostics): abort stuck sessions
