openclaw - 💡(How to fix) Fix [Feature]: Native stuck-session detection and health check per session [2 comments, 3 participants]

raul-santamaria · 2026-03-17T22:54:52Z

[openclaw] Request native support for detecting and recovering from stuck agent sessions — specifically sessions caught in infinite or very long tool-call loop… Request native support for detecting and recovering from stuck agent sessions — specifically sessions caught in infinite or very long tool-call loops, or sessions that become unresponsive after being killed. ## Fix / Workaround ## Current Workaround ## Summary Request native support for detecting and recovering from stuck agent sessions — specifically sessions caught in infinite or very long tool-call loops, or sessions that become unresponsive after being killed. ## Problem Currently there is no platform-level mechanism to detect when a session is stuck. This creates several compounding issues: 1. **No kill signal to main agent sessions.** `sessions_spawn` returns a `childSessionKey` but there is no `sessions_kill` tool available to agents. If a spawned agent loops, the parent cannot terminate it. 2. **Stale session state after kill.** When a session is killed externally (e.g. `openclaw agent --kill`), the session store may retain dirty state (pending tool calls, partial writes), causing the session to misbehave on the next wakeup. 3. **False positives on long tasks.** Any threshold-based detection (e.g. "more than N consecutive tool calls") produces false positives on legitimate long-running tasks (large swarms, bulk file processing). Without platform-level context about what a session is doing, agents cannot distinguish stuck from busy. 4. **No supervisor for the main agent.** In a multi-agent setup, sub-agents can be monitored by their parent. But the main agent has no supervisor — if it enters a loop, nothing detects it. ## Proposed Solution ### 1. `sessions_kill` tool Expose a `sessions_kill(sessionKey)` tool that allows a parent agent to terminate a spawned child session cleanly. Should: - Send a graceful interrupt before force-killing - Clean up session store state - Return a status indicating whether the session was running, idle, or already stopped ### 2. Session health check API Add a `sessions_health(sessionKey)` tool (or extend `sessions_list`) that returns: - `status`: `idle | running | stuck | dead` - `lastActivityAt`: timestamp of last tool call or message - `currentToolCall`: name of the tool currently executing (if any) - `consecutiveToolCalls`: count of tool calls since last user/assistant turn boundary ### 3. Stuck-session detection heuristic (platform-level) The gateway is better positioned than agents to detect stuck sessions. Suggested heuristic: - Session has been in `running` state for > configurable threshold (e.g. 10 min) - AND no new output tokens have been produced in > N seconds - → emit a `session.stuck` event that the parent agent or OOA can subscribe to ### 4. Clean state on kill When a session is killed (by agent or externally), guarantee the session store is left in a consistent state: no pending tool calls, no partial writes, model override reset. ## Motivation / Use Cases - **Multi-agent swarms:** Tech Lead agents spawn 5–10 subagents in parallel. If one gets stuck, the Tech Lead currently has no way to detect it or terminate it — it just waits forever. - **Main agent supervision:** In production deployments with a single main agent, there is no watchdog. A stuck main agent is invisible until a human notices. - **Token waste:** A stuck session in a tool-call loop can consume significant tokens before anyone notices. Platform detection would allow early termination. - **Session recovery:** After an external kill, agents should be able to resume without inheriting corrupt state. ## Current Workaround We currently implement a heartbeat-based polling loop that queries `sessions_history` and counts consecutive tool calls. This approach has inherent limitations: - Threshold tuning is imprecise (legitimate tasks vs. stuck) - Cannot kill a session from agent-side - Does not detect main agent loops - Adds token overhead to every heartbeat ## Environment - OpenClaw v2026.3.13 - Multi-agent setup with several specialized agents - Host: Raspberry Pi 5 ARM64, Linux 6.12

Summary

Request native support for detecting and recovering from stuck agent sessions — specifically sessions caught in infinite or very long tool-call loops, or sessions that become unresponsive after being killed.

Problem

Currently there is no platform-level mechanism to detect when a session is stuck. This creates several compounding issues:

No kill signal to main agent sessions. sessions_spawn returns a childSessionKey but there is no sessions_kill tool available to agents. If a spawned agent loops, the parent cannot terminate it.
Stale session state after kill. When a session is killed externally (e.g. openclaw agent --kill), the session store may retain dirty state (pending tool calls, partial writes), causing the session to misbehave on the next wakeup.
False positives on long tasks. Any threshold-based detection (e.g. "more than N consecutive tool calls") produces false positives on legitimate long-running tasks (large swarms, bulk file processing). Without platform-level context about what a session is doing, agents cannot distinguish stuck from busy.
No supervisor for the main agent. In a multi-agent setup, sub-agents can be monitored by their parent. But the main agent has no supervisor — if it enters a loop, nothing detects it.

Proposed Solution

1. `sessions_kill` tool

Expose a sessions_kill(sessionKey) tool that allows a parent agent to terminate a spawned child session cleanly. Should:

Send a graceful interrupt before force-killing
Clean up session store state
Return a status indicating whether the session was running, idle, or already stopped

2. Session health check API

Add a sessions_health(sessionKey) tool (or extend sessions_list) that returns:

status: idle | running | stuck | dead
lastActivityAt: timestamp of last tool call or message
currentToolCall: name of the tool currently executing (if any)
consecutiveToolCalls: count of tool calls since last user/assistant turn boundary

3. Stuck-session detection heuristic (platform-level)

The gateway is better positioned than agents to detect stuck sessions. Suggested heuristic:

Session has been in running state for > configurable threshold (e.g. 10 min)
AND no new output tokens have been produced in > N seconds
→ emit a session.stuck event that the parent agent or OOA can subscribe to

4. Clean state on kill

When a session is killed (by agent or externally), guarantee the session store is left in a consistent state: no pending tool calls, no partial writes, model override reset.

Motivation / Use Cases

Multi-agent swarms: Tech Lead agents spawn 5–10 subagents in parallel. If one gets stuck, the Tech Lead currently has no way to detect it or terminate it — it just waits forever.
Main agent supervision: In production deployments with a single main agent, there is no watchdog. A stuck main agent is invisible until a human notices.
Token waste: A stuck session in a tool-call loop can consume significant tokens before anyone notices. Platform detection would allow early termination.
Session recovery: After an external kill, agents should be able to resume without inheriting corrupt state.

Current Workaround

We currently implement a heartbeat-based polling loop that queries sessions_history and counts consecutive tool calls. This approach has inherent limitations:

Threshold tuning is imprecise (legitimate tasks vs. stuck)
Cannot kill a session from agent-side
Does not detect main agent loops
Adds token overhead to every heartbeat

Environment

OpenClaw v2026.3.13
Multi-agent setup with several specialized agents
Host: Raspberry Pi 5 ARM64, Linux 6.12

extent analysis

Fix Plan

To address the issue of stuck agent sessions, we will implement the following:

sessions_kill tool: Create a sessions_kill function that sends a graceful interrupt to the session before force-killing it, and cleans up the session store state.
Session health check API: Develop a sessions_health function that returns the session status, last activity timestamp, current tool call, and consecutive tool calls.
Stuck-session detection heuristic: Implement a heuristic that detects stuck sessions based on a configurable threshold and emits a session.stuck event.
Clean state on kill: Ensure the session store is left in a consistent state after a session is killed.

Example Code

import time
from datetime import datetime, timedelta

# sessions_kill tool
def sessions_kill(session_key):
    # Send a graceful interrupt
    try:
        # Attempt to interrupt the session
        interrupt_session(session_key)
    except Exception as e:
        # Force-kill the session if interrupt fails
        force_kill_session(session_key)
    # Clean up session store state
    cleanup_session_store(session_key)
    return get_session_status(session_key)

# Session health check API
def sessions_health(session_key):
    session_status = get_session_status(session_key)
    last_activity_at = get_last_activity_timestamp(session_key)
    current_tool_call = get_current_tool_call(session_key)
    consecutive_tool_calls = get_consecutive_tool_calls(session_key)
    return {
        'status': session_status,
        'lastActivityAt': last_activity_at,
        'currentToolCall': current_tool_call,
        'consecutiveToolCalls': consecutive_tool_calls
    }

# Stuck-session detection heuristic
def detect_stuck_sessions(threshold):
    stuck_sessions = []
    for session_key in get_all_session_keys():
        session_status = get_session_status(session_key)
        if session_status == 'running':
            last_activity_at = get_last_activity_timestamp(session_key)
            time_since_last_activity = datetime.now() - last_activity_at
            if time_since_last_activity > threshold:
                stuck_sessions.append(session_key)
    return stuck_sessions

# Clean state on kill
def cleanup_session_store(session_key):
    # Remove pending tool calls
    remove_pending_tool_calls(session_key)
    # Reset model override
    reset_model_override(session_key)

Verification

To verify the fix, test the following scenarios:

Kill a session using the sessions_kill tool and verify that the session store state is cleaned up.
Check the session health using the sessions_health function and verify that the status, last activity timestamp, current tool call, and consecutive tool calls are accurate.
Configure the stuck-session detection heuristic and verify that it correctly detects stuck sessions and emits a session.stuck event.
Kill a session and verify that the session store is left in a consistent state.

Extra Tips

Implement logging and monitoring to track

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

openclaw - 💡(How to fix) Fix [Feature]: Native stuck-session detection and health check per session [2 comments, 3 participants]

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Fix Action

Fix / Workaround

Current Workaround

Summary

Problem

Proposed Solution

1. `sessions_kill` tool

2. Session health check API

3. Stuck-session detection heuristic (platform-level)

4. Clean state on kill

Motivation / Use Cases

Current Workaround

Environment

extent analysis

Fix Plan

Example Code

Verification

Extra Tips

Still need to ship something?

TRENDING

openclaw - 💡(How to fix) Fix [Feature]: Native stuck-session detection and health check per session [2 comments, 3 participants]

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Fix Action

Fix / Workaround

Current Workaround

Summary

Problem

Proposed Solution

1. sessions_kill tool

2. Session health check API

3. Stuck-session detection heuristic (platform-level)

4. Clean state on kill

Motivation / Use Cases

Current Workaround

Environment

extent analysis

Fix Plan

Example Code

Verification

Extra Tips

Still need to ship something?

RELATED_DISCOVERY

TRENDING

1. `sessions_kill` tool