openclaw - Session enters zombie state after embedded agent init failure (deactivated_workspace)

Error Message

When an embedded agent fails to initialize (for example, with a `deactivated_workspace` error caused by workspace config issues), the corresponding session enters a zombie state. The gateway keeps logging `dispatch complete (queuedFinal=false, replies=0)` and the agent never replies.

Problem Summary

When an embedded agent fails to initialize (e.g., `deactivated_workspace` error), the corresponding session enters a zombie state:

Subsequent messages are dispatched normally by the gateway
Gateway logs: `dispatch complete (queuedFinal=false, replies=0)`
The agent produces zero replies — users see complete silence
Session is never cleaned up; only manual deletion from `sessions.json` restores service
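The symptom above can be spotted from the gateway log alone. The following is a minimal sketch that scans log lines for the `dispatch complete (queuedFinal=..., replies=...)` fragment quoted in this report and counts how many dispatches in a row ended with zero replies. The function names and the threshold of 3 are assumptions for illustration, and a zero-reply dispatch can be legitimate, so treat the result as a hint rather than proof of a zombie session.

```python
import re

# Matches the gateway log fragment quoted in the report.
DISPATCH_RE = re.compile(
    r"dispatch complete \(queuedFinal=(true|false), replies=(\d+)\)"
)


def count_trailing_silent_dispatches(log_lines):
    """Count consecutive zero-reply dispatches at the end of the log."""
    streak = 0
    for line in log_lines:
        match = DISPATCH_RE.search(line)
        if match is None:
            continue  # not a dispatch line, ignore it
        replies = int(match.group(2))
        # A dispatch with replies resets the streak; a silent one extends it.
        streak = streak + 1 if replies == 0 else 0
    return streak


def looks_like_zombie(log_lines, threshold=3):
    """Hypothetical heuristic: several silent dispatches in a row."""
    return count_trailing_silent_dispatches(log_lines) >= threshold
```

In a real deployment the lines would first need to be filtered down to a single session; the report does not show how the session key appears in the log, so that step is left out here.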

Steps to Reproduce

A group-chat session triggers embedded agent initialization
The initialization fails with a `deactivated_workspace` error (e.g., due to workspace config issues)
The session is left in a broken state: marked "initialized" but not actually running
All subsequent messages dispatched to this session queue silently — `replies=0` forever
Only manually deleting the session key from `sessions.json` and triggering a new session creation restores functionality

Current Workaround

Manually delete the session key from: `/root/.openclaw/agents/<agent>/sessions/sessions.json`

This forces a fresh session to be created on the next inbound message.
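The manual edit can be scripted. This is a sketch under one loud assumption: that `sessions.json` is a JSON object whose top-level keys are session keys. The report does not show the file layout, so check it before running anything like this. The helper name `remove_session_key` is made up for illustration.

```python
import json
import shutil
from pathlib import Path


def remove_session_key(sessions_file, session_key):
    """Delete one session entry, keeping a .bak copy of the original file.

    Assumes sessions.json holds a JSON object keyed by session key.
    Returns True if the key was present and removed, False otherwise.
    """
    path = Path(sessions_file)
    sessions = json.loads(path.read_text(encoding="utf-8"))
    if session_key not in sessions:
        return False
    # Keep a backup so the edit can be undone by hand.
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    del sessions[session_key]
    # Write to a temp file first, then swap it in, so a crash cannot
    # leave a half-written sessions.json behind.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(sessions, indent=2), encoding="utf-8")
    tmp.replace(path)
    return True
```

A call would look like `remove_session_key("/root/.openclaw/agents/<agent>/sessions/sessions.json", "<session-key>")`, with both placeholders filled in. It is probably safest to run it while the gateway is not writing to the file.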

Root Cause Analysis

The `deactivated_workspace` error does not trigger session cleanup
The session state machine does not enter a proper error handling path when agent init fails
The gateway dispatch layer considers the message "delivered" (since it reached the agent), but the agent never actually processes it
No `abortedLastRun` flag is set, and no automatic recovery mechanism fires
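To make the missing error path concrete, here is a toy model of the behavior described above. It is not OpenClaw code: the `Session` class, the state names, and both methods are hypothetical. They contrast a lifecycle that swallows the init failure with one that records it.

```python
from enum import Enum


class SessionState(Enum):
    NEW = "new"
    INITIALIZED = "initialized"
    FAILED = "failed"


class Session:
    def __init__(self):
        self.state = SessionState.NEW
        self.aborted_last_run = False

    def init_agent_buggy(self, init_fn):
        """Reported behavior: the failure is swallowed, state still advances."""
        try:
            init_fn()
        except RuntimeError:
            pass  # no cleanup, no flag
        self.state = SessionState.INITIALIZED

    def init_agent_fixed(self, init_fn):
        """Proposed behavior: failure moves the session to an explicit FAILED state."""
        try:
            init_fn()
        except RuntimeError:
            self.state = SessionState.FAILED
            self.aborted_last_run = True
            return
        self.state = SessionState.INITIALIZED


def failing_init():
    """Stand-in for an agent init that hits deactivated_workspace."""
    raise RuntimeError("deactivated_workspace")
```

With `init_agent_buggy(failing_init)` the session ends up `INITIALIZED` with no flag set, which matches the "marked initialized but not actually running" description. With `init_agent_fixed(failing_init)` it ends up `FAILED` with `aborted_last_run` set, giving the dispatch layer something to check.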

Suggested Fixes

The session lifecycle should handle embedded agent init failures gracefully:

Auto-mark on failure: When embedded agent init fails, set `abortedLastRun=true` on the session so the next dispatch can detect it and create a fresh session
Session health check on dispatch: Before dispatching to an existing session, check if the previous run was aborted/failed and auto-recover rather than silently reusing a dead session
Graceful degradation: If the session cannot be initialized after N attempts, surface an explicit error to the user instead of silent `replies=0`
Auto-cleanup of zombie sessions: A background cleanup task that detects sessions with repeated `abortedLastRun=true` and removes them proactively

Environment

OpenClaw version: latest (main branch)
Channel: Feishu group chat
Session type: `group` (embedded agent)

Labels: bug, session, recovery, embedded-agent

extent analysis

Fix Plan

To address the issue, we will implement the following steps:

Auto-mark on failure: Set `abortedLastRun=true` on the session when embedded agent init fails
Session health check on dispatch: Check if the previous run was aborted/failed before dispatching to an existing session
Graceful degradation: Surface an explicit error to the user after N attempts
Auto-cleanup of zombie sessions: Implement a background cleanup task

Example Code

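# Illustrative sketch: sessions are plain dicts, and every helper below is a
# hypothetical placeholder rather than an OpenClaw API.
MAX_INIT_ATTEMPTS = 3  # the "N" from the fix plan; the value is an assumption


def start_agent(session):
    """Placeholder for the real embedded agent initialization."""


def create_new_session():
    """Placeholder: return a fresh session record."""
    return {"abortedLastRun": False, "attempts": 0}


# Auto-mark on failure
def initialize_agent(session):
    session["attempts"] = session.get("attempts", 0) + 1
    try:
        start_agent(session)
    except Exception as exc:
        session["abortedLastRun"] = True
        print(f"agent init failed: {exc}")  # swap in real logging
        return False
    session["abortedLastRun"] = False
    session["attempts"] = 0
    return True


# Session health check on dispatch, with graceful degradation
def dispatch_message(session, message):
    if session.get("abortedLastRun", False):
        attempts = session.get("attempts", 0)
        if attempts >= MAX_INIT_ATTEMPTS:
            # Surface an explicit error to the user instead of silent replies=0
            return session, "Agent could not be initialized; please contact an admin."
        # Create a fresh session, carrying over the failure count
        session = create_new_session()
        session["attempts"] = attempts
        if not initialize_agent(session):
            return session, "Agent initialization failed; will retry on the next message."
    # Dispatch the message to the (now healthy) session here
    return session, None


# Auto-cleanup of zombie sessions
def cleanup_zombie_sessions(sessions):
    """Return the sessions that remain after dropping zombies."""
    return [
        s for s in sessions
        if not (s.get("abortedLastRun", False)
                and s.get("attempts", 0) >= MAX_INIT_ATTEMPTS)
    ]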

Verification

To verify the fix, test the following scenarios:

Embedded agent init failure: Verify that the session is marked as `abortedLastRun=true` and a fresh session is created on the next dispatch
Session health check: Verify that the session is recreated when the previous run was aborted/failed
Graceful degradation: Verify that an explicit error is surfaced to the user after N attempts
Auto-cleanup of zombie sessions: Verify that the background task removes sessions that have repeatedly ended with `abortedLastRun=true`, and leaves healthy sessions untouched

Extra Tips

Make sure to handle edge cases, such as concurrent modifications to the session
Implement logging and monitoring to detect and respond to issues
Consider adding a retry mechanism for agent initialization to handle temporary failures
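The retry tip could look roughly like the sketch below: a generic retry wrapper with exponential backoff. The name `init_with_retry`, its parameters, and the default values are assumptions for illustration; `init_fn` stands in for whatever actually starts the embedded agent.

```python
import time


def init_with_retry(init_fn, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call init_fn until it succeeds or max_attempts is reached.

    Returns (True, None) on success or (False, last_exception) on failure.
    The delay doubles after every failed attempt.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            init_fn()
            return True, None
        except Exception as exc:  # narrow this to transient errors in practice
            last_error = exc
            if attempt < max_attempts - 1:
                # Back off before the next try: base_delay, 2x, 4x, ...
                sleep(base_delay * (2 ** attempt))
    return False, last_error
```

Retrying only helps with temporary failures. If `deactivated_workspace` reflects a persistent configuration problem, the wrapper will exhaust its attempts, and the caller should then mark the session as aborted and surface the error rather than loop forever. The `sleep` parameter is injectable so tests can run without real delays.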
