Issue: Feishu DM sessions not recovering after gateway restart

Summary

When the OpenClaw gateway restarts while an agent is actively processing a task (e.g., during a model call or tool execution), the interrupted session is not recovered on the Feishu channel. The user must send a new message to reactivate the session.

However, this works correctly on the Telegram channel — interrupted sessions are automatically recovered after restart.

Environment

OpenClaw version: 2026.5.12
Platform: Linux x64, Node.js v22.22.2
Channels affected: Feishu (WebSocket mode)
Channels not affected: Telegram (polling mode)
Model: primary ollama-lan/qwen3.5:9b, agents use bailian/deepseek-v4-flash etc.

Expected Behavior

After a gateway restart, any main session that was in status: "running" at the time of restart should be automatically recovered — the agent should read the session transcript, receive a system message like "Your previous turn was interrupted by a gateway restart...", and continue processing the interrupted task.

Actual Behavior

The interrupted Feishu session is never recovered. The agent does not continue, and no response is ever delivered to the user. The session remains in status: "running" in sessions.json with a stale .jsonl.lock file on disk, but the recovery mechanism either does not trigger or cannot deliver the response.

Root Cause Analysis

After investigating the behavior and the source code, I identified two potential issues:

Issue 1: Post-restart lock cleanup and session marking timing

The gateway startup sequence (simplified):

Gateway starts
  ├── HTTP server listens
  ├── Gateway startup log
  ├── Post-ready maintenance (lock cleanup + markRestartAbortedMainSessionsFromLocks)
  ├── 5s delay ──→ scheduleRestartAbortedMainSessionRecovery
  └── Feishu WebSocket connects

The markRestartAbortedMainSessionsFromLocks function scans for stale .jsonl.lock files and marks corresponding sessions as abortedLastRun: true in the in-memory session store. But it appears the session store (sessions.json) might not be loaded into memory yet when the cleanup runs, so the stale locks are cleaned up but no sessions actually get marked as abortedLastRun. The subsequent scheduleRestartAbortedMainSessionRecovery finds no sessions to recover.
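The suspected ordering problem can be illustrated with a minimal sketch. Everything here is hypothetical: the store shape, the markAbortedFromLocks and loadStore helpers, and the "session-1" id are stand-ins for illustration, not OpenClaw's actual internals. It only shows that marking against an empty store marks nothing, while awaiting the load first marks the interrupted session.

```typescript
// Hypothetical shapes; the real session store and function signatures may differ.
interface SessionEntry {
  sessionId: string;
  status: "running" | "idle";
  abortedLastRun: boolean;
}

type SessionStore = Map<string, SessionEntry>;

// Marks every running session whose transcript still has a stale lock.
// Returns the number of sessions that were marked.
function markAbortedFromLocks(store: SessionStore, staleLockIds: string[]): number {
  const stale = new Set(staleLockIds);
  let marked = 0;
  for (const entry of store.values()) {
    if (entry.status === "running" && stale.has(entry.sessionId)) {
      entry.abortedLastRun = true;
      marked++;
    }
  }
  return marked;
}

// Stand-in for reading sessions.json from disk (assumed to be async).
async function loadStore(): Promise<SessionStore> {
  return new Map<string, SessionEntry>([
    [
      "agent:main:feishu:direct:example",
      { sessionId: "session-1", status: "running", abortedLastRun: false },
    ],
  ]);
}

async function compareOrderings(): Promise<{ before: number; after: number }> {
  const staleLocks = ["session-1"];
  // Suspected ordering: marking runs while the in-memory store is still empty.
  const before = markAbortedFromLocks(new Map<string, SessionEntry>(), staleLocks);
  // Proposed ordering: await the load, then mark.
  const store = await loadStore();
  const after = markAbortedFromLocks(store, staleLocks);
  return { before, after };
}
```

If the real code path resembles the first call, the recovery scheduler would see zero aborted sessions even though the stale lock was found and removed.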

For comparison, Telegram's polling mechanism makes recovery less critical — after restart, getUpdates with the stored offset re-fetches any unacknowledged updates, so the agent simply starts processing the same message again. Feishu's WebSocket push has no equivalent re-delivery mechanism — events are delivered exactly once.
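To make the difference concrete, here is a small in-memory model of the two delivery styles. It is a simplification written for this issue, not the Telegram or Feishu client code: PollQueue and simulateRestart are hypothetical names, and the push side is modeled as a plain inbox that is emptied on hand-off.

```typescript
interface Update {
  update_id: number;
  text: string;
}

// In-memory model of an offset-acknowledged queue (not the Telegram API itself).
class PollQueue {
  private pending: Update[] = [];

  push(update: Update): void {
    this.pending.push(update);
  }

  // Calling with an offset confirms (drops) every update with a smaller id,
  // then returns whatever is still pending.
  getUpdates(offset: number): Update[] {
    this.pending = this.pending.filter((u) => u.update_id >= offset);
    return [...this.pending];
  }
}

function simulateRestart(): { polled: string[]; pushed: string[] } {
  const queue = new PollQueue();
  queue.push({ update_id: 7, text: "long task" });

  // The consumer persisted offset 7, fetched the update, then crashed
  // before it could advance the offset to 8.
  const storedOffset = 7;
  queue.getUpdates(storedOffset);

  // After restart, the same stored offset returns the same update again.
  const polled = queue.getUpdates(storedOffset).map((u) => u.text);

  // Push model: the event was handed to the handler once before the crash,
  // so there is nothing left to re-read after restart.
  const pushInbox: Update[] = [{ update_id: 7, text: "long task" }];
  pushInbox.shift();
  const pushed = pushInbox.map((u) => u.text);

  return { polled, pushed };
}
```

In this model the polled side sees "long task" again after the restart, while the pushed side sees nothing, which matches the behavior difference reported between the two channels.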

Issue 2: Feishu WebSocket not ready when recovery fires

Even if the recovery mechanism correctly marks and resumes the session, the recovered response needs to be delivered through the Feishu channel. The startup timing shows:

14:26:34 — Gateway HTTP listening
14:26:35 — Channels + sidecars starting  
14:26:40 — Recovery sidecar fires (5s after startup)
14:37:04 — Feishu WebSocket client finally connected (~11 min later)

The recovery fires before the Feishu WebSocket is ready, so any response it generates cannot be delivered. On subsequent startup attempts the Feishu WebSocket connected within 0.1s, which suggests the roughly 11-minute delay seen here was caused by channel restarts or config reloads during the investigation. The core timing mismatch remains, though: the hardcoded 5-second recovery delay (DEFAULT_RECOVERY_DELAY_MS = 5e3) may be too short for Feishu channel readiness in general.
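One way to remove the dependence on a fixed delay is to gate recovery on a readiness signal with an upper bound. The sketch below assumes the channel can expose a promise that resolves once its WebSocket is connected; waitForReady, recoverWhenReady, and that promise are all hypothetical and not part of the current OpenClaw code.

```typescript
// Resolves true if `ready` resolves before the timeout, false otherwise.
function waitForReady(ready: Promise<void>, timeoutMs: number): Promise<boolean> {
  return new Promise<boolean>((resolve) => {
    const timer = setTimeout(() => resolve(false), timeoutMs);
    ready.then(
      () => {
        clearTimeout(timer);
        resolve(true);
      },
      () => {
        // A failed connection attempt is treated the same as "not ready".
        clearTimeout(timer);
        resolve(false);
      },
    );
  });
}

// Runs `recover` only once the channel reports ready; otherwise defers.
async function recoverWhenReady(
  channelReady: Promise<void>,
  recover: () => Promise<void>,
  timeoutMs: number,
): Promise<"recovered" | "deferred"> {
  if (await waitForReady(channelReady, timeoutMs)) {
    await recover();
    return "recovered";
  }
  // Leave the session marked so a later attempt can retry delivery.
  return "deferred";
}
```

With this shape, a channel that connects in 0.1s recovers almost immediately, and a channel that takes minutes is deferred rather than losing the recovered response.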

Session State Evidence

After a clean restart, the stale session data persists:

{
  "agent:main:feishu:direct:ou_63c1a75f0ed65b60facae9fa9db7c73a": {
    "status": "running",
    "abortedLastRun": false,
    "updatedAt": 1779000083850
  }
}

And the lock file remains on disk:

/agents/main/sessions/b8061c94-7586-4002-ab01-467dcd17e433.jsonl.lock

The session transcript (b8061c94-7586-4002-ab01-467dcd17e433.jsonl, 1.3MB) is intact and contains the full conversation context. The recovery mechanism should be able to read it and continue.
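For anyone trying to confirm the same state on their own install, a check along these lines can flag entries that look stuck. It is a sketch that assumes sessions.json has the shape shown above (status, abortedLastRun, updatedAt in milliseconds); findStuckSessions and its age threshold are hypothetical helpers, not an existing OpenClaw command.

```typescript
interface SessionRecord {
  status: string;
  abortedLastRun: boolean;
  updatedAt: number; // epoch milliseconds, as in the snippet above
}

// Returns the keys of sessions that are still "running", were never marked
// as aborted, and have not been updated for at least `minAgeMs`.
function findStuckSessions(
  store: Record<string, SessionRecord>,
  nowMs: number,
  minAgeMs: number,
): string[] {
  return Object.entries(store)
    .filter(
      ([, s]) =>
        s.status === "running" && !s.abortedLastRun && nowMs - s.updatedAt >= minAgeMs,
    )
    .map(([key]) => key);
}

// Example input copied from the evidence above.
const evidence: Record<string, SessionRecord> = {
  "agent:main:feishu:direct:ou_63c1a75f0ed65b60facae9fa9db7c73a": {
    status: "running",
    abortedLastRun: false,
    updatedAt: 1779000083850,
  },
};

// Treat anything untouched for 30 seconds as stuck, evaluated one minute later.
const stuck = findStuckSessions(evidence, 1779000083850 + 60_000, 30_000);
```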

Key Code References

Main session restart recovery: dist/main-session-restart-recovery-D1yxkDUR.js
- DEFAULT_RECOVERY_DELAY_MS = 5e3 (hardcoded 5s delay)
- markRestartAbortedMainSessionsFromLocks() — marks sessions from stale locks
- scheduleRestartAbortedMainSessionRecovery() — resumes sessions after delay
Gateway startup post-attach: dist/server-startup-post-attach-Cd490zZC.js
- Post-ready maintenance → lock cleanup → mark sessions
- Sidecars: sidecars.main-session-recovery
Active-memory: dist/extensions/active-memory/index.js
- Uses runEmbeddedPiAgent with bootstrapContextMode: "lightweight" for sub-agents
Session store: dist/store-3qAZ3Zl6.js
- Persists to /agents/*/sessions/sessions.json

Suggested Fix

- Ensure session store is loaded from disk before lock cleanup: The markRestartAbortedMainSessionsFromLocks function needs the in-memory session store to be populated from sessions.json before it can mark sessions as abortedLastRun. If loading is async, await it.
- Make recovery delay configurable or dependent on channel readiness: The 5-second hardcoded delay is fragile for channels with longer startup times (Feishu WebSocket). A configurable recoveryDelayMs or a channel-readiness check before attempting delivery would be more robust.
- Consider adding Feishu event persistence: Unlike Telegram's polling (which naturally re-delivers events via getUpdates offset), Feishu's WebSocket push has no retry mechanism. Persisting inbound Feishu events to disk before processing would allow re-delivery after restart.
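The event persistence idea could look roughly like the append-only journal below. This is a sketch under assumptions: InboundJournal, its file format, and the eventId field are hypothetical, and a real implementation would also need compaction and deduplication against what the agent has already answered.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

interface JournalLine {
  kind: "received" | "done";
  eventId: string;
  payload?: string;
}

// Append-only journal: one JSON object per line.
class InboundJournal {
  private readonly file: string;

  constructor(file: string) {
    this.file = file;
  }

  // Call before any processing starts, so a crash cannot lose the event.
  recordReceived(eventId: string, payload: string): void {
    this.append({ kind: "received", eventId, payload });
  }

  // Call once the response has been delivered.
  recordDone(eventId: string): void {
    this.append({ kind: "done", eventId });
  }

  // Events that were received but never completed; candidates for replay.
  pending(): { eventId: string; payload: string }[] {
    if (!fs.existsSync(this.file)) return [];
    const open = new Map<string, string>();
    for (const raw of fs.readFileSync(this.file, "utf8").split("\n")) {
      if (raw.trim() === "") continue;
      const line = JSON.parse(raw) as JournalLine;
      if (line.kind === "received") open.set(line.eventId, line.payload ?? "");
      else open.delete(line.eventId);
    }
    return Array.from(open, ([eventId, payload]) => ({ eventId, payload }));
  }

  private append(line: JournalLine): void {
    fs.appendFileSync(this.file, JSON.stringify(line) + "\n");
  }
}

// Usage: one event finished, one was interrupted by a restart.
function journalDemo(): { eventId: string; payload: string }[] {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "inbound-journal-"));
  const file = path.join(dir, "inbound.jsonl");
  const journal = new InboundJournal(file);
  journal.recordReceived("evt-1", "first message");
  journal.recordDone("evt-1");
  journal.recordReceived("evt-2", "interrupted message");
  // A fresh instance stands in for the gateway process after restart.
  return new InboundJournal(file).pending();
}
```

On startup, the gateway would replay whatever pending() returns once the channel is connected, which gives the push channel a re-delivery path comparable to the polling offset.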

Workaround

Currently, users must send a new message to Feishu after a gateway restart to reactivate the session. The previous message's context is preserved in the session transcript, so the agent can continue the conversation naturally.
