openclaw: Session enters zombie state after embedded agent init failure (deactivated_workspace), no auto-recovery

GitHub stats
openclaw/openclaw#54964 (fetched 2026-04-08 01:34:03)
Comments: 0 · Participants: 1 · Timeline: 2 (cross-referenced ×2) · Reactions: 0

Problem Summary

When an embedded agent fails to initialize (e.g., `deactivated_workspace` error), the corresponding session enters a zombie state:

  • Subsequent messages are dispatched normally by the gateway
  • Gateway logs: `dispatch complete (queuedFinal=false, replies=0)`
  • The agent produces zero replies — users see complete silence
  • Session is never cleaned up; only manual deletion from `sessions.json` restores service
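
The symptom above can be spotted in gateway logs. The following is a minimal sketch that counts `dispatch complete` lines reporting zero replies. Only the quoted fragment `dispatch complete (queuedFinal=false, replies=0)` comes from the issue; the `gateway:` prefix in the sample lines and the function name `count_silent_dispatches` are hypothetical.

```python
import re

# Matches the gateway fragment quoted in the issue. The rest of the log
# line layout is an assumption, so the pattern searches anywhere in the line.
DISPATCH_RE = re.compile(
    r"dispatch complete \(queuedFinal=(true|false), replies=(\d+)\)"
)


def count_silent_dispatches(lines):
    """Count 'dispatch complete' lines that produced zero replies."""
    silent = 0
    for line in lines:
        match = DISPATCH_RE.search(line)
        if match and int(match.group(2)) == 0:
            silent += 1
    return silent


# Hypothetical sample lines; real gateway output may carry other prefixes.
sample = [
    "gateway: dispatch complete (queuedFinal=false, replies=0)",
    "gateway: dispatch complete (queuedFinal=true, replies=1)",
    "gateway: dispatch complete (queuedFinal=false, replies=0)",
]
print(count_silent_dispatches(sample))
```

A run of zero-reply dispatches for the same session is only a hint, not proof, since a healthy agent may also legitimately choose not to reply.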

Steps to Reproduce

  1. A group-chat session triggers embedded agent initialization
  2. The initialization fails with a `deactivated_workspace` error (e.g., due to workspace config issues)
  3. The session is left in a broken state: marked "initialized" but not actually running
  4. All subsequent messages dispatched to this session queue silently — `replies=0` forever
  5. Only manually deleting the session key from `sessions.json` and triggering a new session creation restores functionality

Current Workaround

Manually delete the session key from: `/root/.openclaw/agents/<agent>/sessions/sessions.json`

This forces a fresh session to be created on the next inbound message.
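
The manual workaround can be scripted. This is a sketch only: it assumes `sessions.json` holds a single JSON object keyed by session key, which the issue implies but does not state, and the helper name `remove_session_key` is hypothetical. It writes a `.bak` copy before modifying the file.

```python
import json
import shutil
from pathlib import Path


def remove_session_key(sessions_file, session_key):
    """Delete one key from sessions.json, keeping a .bak copy first.

    Assumes the file holds a JSON object keyed by session key.
    Returns True if the key existed and was removed, False otherwise.
    """
    path = Path(sessions_file)
    data = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(data, dict) or session_key not in data:
        return False
    # Back up the untouched file before rewriting it.
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    del data[session_key]
    path.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return True


# Example call (path from the issue; the session key is a placeholder):
# remove_session_key(
#     "/root/.openclaw/agents/<agent>/sessions/sessions.json",
#     "<session-key>",
# )
```

The issue does not say whether the gateway rewrites this file while running, so it may be safer to run such a script while the gateway is stopped.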


Root Cause Analysis

  • The `deactivated_workspace` error does not trigger session cleanup
  • The session state machine does not enter a proper error handling path when agent init fails
  • The gateway dispatch layer considers the message "delivered" (since it reached the agent), but the agent never actually processes it
  • No `abortedLastRun` flag is set, and no automatic recovery mechanism fires
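
To illustrate the missing error path, here is a small model of a session state machine in which a failed init lands in an explicit failed state instead of being left "initialized". This is not OpenClaw's code: `SessionState`, `Session`, and `failing_start` are hypothetical names, and only the `abortedLastRun` flag and the `deactivated_workspace` error name come from the issue.

```python
from enum import Enum


class SessionState(Enum):
    NEW = "new"
    INITIALIZED = "initialized"
    FAILED = "failed"


class Session:
    """Toy session that records init failures instead of hiding them."""

    def __init__(self):
        self.state = SessionState.NEW
        self.aborted_last_run = False  # mirrors the abortedLastRun flag

    def initialize(self, start_agent):
        """Run start_agent(); route any failure to FAILED, not INITIALIZED."""
        try:
            start_agent()
        except Exception:
            self.state = SessionState.FAILED
            self.aborted_last_run = True
            return False
        self.state = SessionState.INITIALIZED
        self.aborted_last_run = False
        return True


def failing_start():
    # Stand-in for an embedded agent init that hits the reported error.
    raise RuntimeError("deactivated_workspace")


session = Session()
session.initialize(failing_start)
print(session.state, session.aborted_last_run)
```

With the failure recorded on the session, a dispatcher has something concrete to check before it reuses the session.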

Suggested Fixes

The session lifecycle should handle embedded agent init failures gracefully:

  1. Auto-mark on failure: When embedded agent init fails, set `abortedLastRun=true` on the session so the next dispatch can detect it and create a fresh session
  2. Session health check on dispatch: Before dispatching to an existing session, check if the previous run was aborted/failed and auto-recover rather than silently reusing a dead session
  3. Graceful degradation: If the session cannot be initialized after N attempts, surface an explicit error to the user instead of silent `replies=0`
  4. Auto-cleanup of zombie sessions: A background cleanup task that detects sessions with repeated `abortedLastRun=true` and removes them proactively

Environment

  • OpenClaw version: latest (main branch)
  • Channel: Feishu group chat
  • Session type: `group` (embedded agent)

Labels: bug, session, recovery, embedded-agent

extent analysis

Fix Plan

To address the issue, we will implement the following steps:

  • Auto-mark on failure: Set `abortedLastRun=true` on the session when embedded agent init fails
  • Session health check on dispatch: Check if the previous run was aborted/failed before dispatching to an existing session
  • Graceful degradation: Surface an explicit error to the user after N attempts
  • Auto-cleanup of zombie sessions: Implement a background cleanup task

Example Code

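import logging

logger = logging.getLogger(__name__)

MAX_INIT_ATTEMPTS = 3  # "N" from the fix plan; the value is illustrative


def start_embedded_agent(session):
    """Placeholder for the real agent bootstrap; expected to raise on failure."""


# Auto-mark on failure
def initialize_agent(session):
    try:
        start_embedded_agent(session)
    except Exception as exc:
        session['abortedLastRun'] = True
        session['attempts'] = session.get('attempts', 0) + 1
        logger.error("embedded agent init failed: %s", exc)
        return False
    session['abortedLastRun'] = False
    session['attempts'] = 0
    return True


# Session health check on dispatch, with graceful degradation
def dispatch_message(session, message):
    if session.get('abortedLastRun', False):
        if session.get('attempts', 0) >= MAX_INIT_ATTEMPTS:
            # Surface an explicit error instead of silent replies=0
            return {'error': 'session could not be initialized'}
        # Re-initialize rather than silently reusing a dead session
        if not initialize_agent(session):
            return {'error': 'agent initialization failed'}
    return {'delivered': message}  # placeholder for the real dispatch


# Auto-cleanup of zombie sessions; `sessions` maps session key -> session dict
def cleanup_zombie_sessions(sessions):
    for key in list(sessions):
        entry = sessions[key]
        if entry.get('abortedLastRun', False) and entry.get('attempts', 0) >= MAX_INIT_ATTEMPTS:
            del sessions[key]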

Verification

To verify the fix, test the following scenarios:

  • Embedded agent init failure: Verify that the session is marked `abortedLastRun=true` and that a fresh session is created on the next dispatch
  • Session health check: Verify that the session is recreated when the previous run was aborted or failed
  • Graceful degradation: Verify that an explicit error is surfaced to the user after N failed attempts
  • Auto-cleanup of zombie sessions: Verify that the background task removes sessions with repeated `abortedLastRun=true`

Extra Tips

  • Make sure to handle edge cases, such as concurrent modifications to the session
  • Implement logging and monitoring to detect and respond to issues
  • Consider adding a retry mechanism for agent initialization to handle temporary failures
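
The retry tip can be sketched as a bounded retry with exponential backoff. The function name `init_with_retry` and its parameters are hypothetical, and the `sleep` argument exists only so the delay can be replaced in tests. A retry of this kind helps with temporary failures only. A persistent `deactivated_workspace` error would still fail after the last attempt and should then be surfaced to the user.

```python
import time


def init_with_retry(start_agent, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call start_agent() up to max_attempts times with exponential backoff.

    Returns the number of attempts used on success. Re-raises the last
    error once the attempts are exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            start_agent()
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise
            # Delay doubles each round: base_delay, 2*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Capping the attempts keeps the retry from masking the failure indefinitely, which is the silent behavior the issue reports.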
