openclaw - 💡(How to fix) Fix [Feature]: Native stuck-session detection and health check per session [2 comments, 3 participants]

GitHub stats
openclaw/openclaw#49290 · Fetched 2026-04-08 00:56:55
Comments: 2 · Participants: 3 · Timeline: 5 · Reactions: 0
Timeline (top): commented ×2, closed ×1, cross-referenced ×1, locked ×1

Summary

Request native support for detecting and recovering from stuck agent sessions — specifically sessions caught in infinite or very long tool-call loops, or sessions that become unresponsive after being killed.

Problem

Currently there is no platform-level mechanism to detect when a session is stuck. This creates several compounding issues:

  1. No kill signal to main agent sessions. sessions_spawn returns a childSessionKey but there is no sessions_kill tool available to agents. If a spawned agent loops, the parent cannot terminate it.

  2. Stale session state after kill. When a session is killed externally (e.g. openclaw agent --kill), the session store may retain dirty state (pending tool calls, partial writes), causing the session to misbehave on the next wakeup.

  3. False positives on long tasks. Any threshold-based detection (e.g. "more than N consecutive tool calls") produces false positives on legitimate long-running tasks (large swarms, bulk file processing). Without platform-level context about what a session is doing, agents cannot distinguish stuck from busy.

  4. No supervisor for the main agent. In a multi-agent setup, sub-agents can be monitored by their parent. But the main agent has no supervisor — if it enters a loop, nothing detects it.

Proposed Solution

1. sessions_kill tool

Expose a sessions_kill(sessionKey) tool that allows a parent agent to terminate a spawned child session cleanly. Should:

  • Send a graceful interrupt before force-killing
  • Clean up session store state
  • Return a status indicating whether the session was running, idle, or already stopped
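
The interrupt-then-force-kill sequence described above can be sketched as follows. This is a minimal illustration, not OpenClaw code: FakeSession, its methods, the grace_seconds parameter, and the returned field names are all assumptions made for the example.

```python
import time
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    IDLE = "idle"
    RUNNING = "running"
    STOPPED = "stopped"


@dataclass
class FakeSession:
    """Hypothetical in-memory stand-in for a gateway session."""
    status: Status = Status.RUNNING
    honors_interrupt: bool = True
    pending_tool_calls: int = 0

    def interrupt(self) -> None:
        # A cooperative session stops on request; a looping one ignores it.
        if self.honors_interrupt:
            self.status = Status.STOPPED

    def force_kill(self) -> None:
        self.status = Status.STOPPED


def sessions_kill(session: FakeSession, grace_seconds: float = 0.05,
                  poll_seconds: float = 0.01) -> dict:
    """Interrupt first, force-kill once the grace period expires, then clean up."""
    previous = session.status
    forced = False
    if previous is not Status.STOPPED:
        session.interrupt()
        deadline = time.monotonic() + grace_seconds
        while session.status is not Status.STOPPED and time.monotonic() < deadline:
            time.sleep(poll_seconds)
        if session.status is not Status.STOPPED:
            session.force_kill()
            forced = True
    # Leave no dangling tool calls behind, whichever path was taken.
    session.pending_tool_calls = 0
    return {"previousStatus": previous.value, "forced": forced}


looping = FakeSession(honors_interrupt=False, pending_tool_calls=3)
print(sessions_kill(looping))
```

Reporting the status captured before the kill, rather than after it, is what lets the caller tell a running session from one that was idle or already stopped.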

2. Session health check API

Add a sessions_health(sessionKey) tool (or extend sessions_list) that returns:

  • status: idle | running | stuck | dead
  • lastActivityAt: timestamp of last tool call or message
  • currentToolCall: name of the tool currently executing (if any)
  • consecutiveToolCalls: count of tool calls since last user/assistant turn boundary
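
One possible shape for that health record is sketched below. The four field names come from the list above; the SessionHealth class, the to_payload helper, the ISO 8601 timestamp encoding, and the "exec" tool name are assumptions for illustration only.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Literal, Optional

SessionStatus = Literal["idle", "running", "stuck", "dead"]


@dataclass(frozen=True)
class SessionHealth:
    status: SessionStatus
    lastActivityAt: datetime          # last tool call or message
    currentToolCall: Optional[str]    # None when no tool is executing
    consecutiveToolCalls: int         # since the last turn boundary

    def to_payload(self) -> dict:
        """Serialize to a JSON-friendly dict with an ISO 8601 timestamp."""
        payload = asdict(self)
        payload["lastActivityAt"] = self.lastActivityAt.isoformat()
        return payload


example = SessionHealth(
    status="running",
    lastActivityAt=datetime(2026, 3, 13, 12, 0, tzinfo=timezone.utc),
    currentToolCall="exec",  # hypothetical tool name
    consecutiveToolCalls=42,
)
print(example.to_payload())
```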

3. Stuck-session detection heuristic (platform-level)

The gateway is better positioned than agents to detect stuck sessions. Suggested heuristic:

  • Session has been in running state for > configurable threshold (e.g. 10 min)
  • AND no new output tokens have been produced in > N seconds
  • → emit a session.stuck event that the parent agent or OOA can subscribe to
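
A minimal sketch of the two-condition heuristic, assuming the gateway tracks when a session entered the running state and when it last produced output. SessionSnapshot, is_stuck, scan, the emit callback, and the 120-second silence default are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict


@dataclass
class SessionSnapshot:
    """Hypothetical view of what the gateway knows about one session."""
    status: str
    running_since: datetime
    last_output_at: datetime


def is_stuck(snap: SessionSnapshot, now: datetime,
             max_running: timedelta = timedelta(minutes=10),
             max_silence: timedelta = timedelta(seconds=120)) -> bool:
    """Both conditions must hold: running too long AND silent too long."""
    if snap.status != "running":
        return False
    running_too_long = now - snap.running_since > max_running
    silent_too_long = now - snap.last_output_at > max_silence
    return running_too_long and silent_too_long


def scan(sessions: Dict[str, SessionSnapshot], now: datetime,
         emit: Callable[[str, str], None]) -> None:
    """Emit a session.stuck event for every session that matches."""
    for key, snap in sessions.items():
        if is_stuck(snap, now):
            emit("session.stuck", key)


now = datetime(2026, 3, 13, 12, 0, tzinfo=timezone.utc)
sessions = {
    # Long-running but still producing output: busy, not stuck.
    "busy": SessionSnapshot("running", now - timedelta(hours=1),
                            now - timedelta(seconds=5)),
    # Long-running and silent: stuck.
    "silent": SessionSnapshot("running", now - timedelta(minutes=30),
                              now - timedelta(minutes=20)),
}
events = []
scan(sessions, now, lambda name, key: events.append((name, key)))
print(events)
```

Requiring both conditions is what separates a long but productive task from a silent one, which is the false-positive case raised in the Problem section.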

4. Clean state on kill

When a session is killed (by agent or externally), guarantee the session store is left in a consistent state: no pending tool calls, no partial writes, model override reset.
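
The consistency guarantee can be expressed as an invariant that is checked after every kill. In the sketch below the record layout (pendingToolCalls, partialWrites, modelOverride, status) is an assumed shape for a session store entry, and discarding partial writes rather than completing them is likewise an assumption.

```python
from copy import deepcopy


def clean_on_kill(record: dict) -> dict:
    """Return a copy of a session store record reset to a consistent state."""
    cleaned = deepcopy(record)
    cleaned["pendingToolCalls"] = []   # nothing left waiting for a result
    cleaned["partialWrites"] = []      # assumption: partial writes are discarded
    cleaned["modelOverride"] = None    # fall back to the default model
    cleaned["status"] = "dead"
    return cleaned


def is_consistent(record: dict) -> bool:
    """Invariant the store should satisfy after any kill."""
    return (not record["pendingToolCalls"]
            and not record["partialWrites"]
            and record["modelOverride"] is None)


dirty = {
    "status": "running",
    "pendingToolCalls": ["call-1", "call-2"],   # hypothetical call ids
    "partialWrites": ["notes.md"],               # hypothetical file name
    "modelOverride": "some-model",               # hypothetical override value
}
cleaned = clean_on_kill(dirty)
print(is_consistent(dirty), is_consistent(cleaned))
```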

Motivation / Use Cases

  • Multi-agent swarms: Tech Lead agents spawn 5–10 subagents in parallel. If one gets stuck, the Tech Lead currently has no way to detect it or terminate it — it just waits forever.
  • Main agent supervision: In production deployments with a single main agent, there is no watchdog. A stuck main agent is invisible until a human notices.
  • Token waste: A stuck session in a tool-call loop can consume significant tokens before anyone notices. Platform detection would allow early termination.
  • Session recovery: After an external kill, agents should be able to resume without inheriting corrupt state.

Current Workaround

We currently implement a heartbeat-based polling loop that queries sessions_history and counts consecutive tool calls. This approach has inherent limitations:

  • Threshold tuning is imprecise (legitimate tasks vs. stuck)
  • Cannot kill a session from agent-side
  • Does not detect main agent loops
  • Adds token overhead to every heartbeat
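
For reference, the counting step of such a polling loop might look like the sketch below. The real output format of sessions_history is not shown in this issue, so the entry shape (a list of dicts with a "kind" field) and the threshold of 50 are assumptions.

```python
from typing import Dict, List


def count_trailing_tool_calls(history: List[Dict[str, str]]) -> int:
    """Count tool calls since the last user/assistant turn boundary."""
    count = 0
    for entry in reversed(history):
        if entry.get("kind") != "tool_call":
            break
        count += 1
    return count


def looks_stuck(history: List[Dict[str, str]], threshold: int = 50) -> bool:
    # A pure count cannot tell a loop from a legitimate bulk job,
    # which is the false-positive problem described above.
    return count_trailing_tool_calls(history) > threshold


history = [
    {"kind": "user"},
    {"kind": "tool_call"},
    {"kind": "tool_call"},
    {"kind": "assistant"},
    {"kind": "tool_call"},
    {"kind": "tool_call"},
    {"kind": "tool_call"},
]
print(count_trailing_tool_calls(history), looks_stuck(history))
```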

Environment

  • OpenClaw v2026.3.13
  • Multi-agent setup with several specialized agents
  • Host: Raspberry Pi 5 ARM64, Linux 6.12

extent analysis

Fix Plan

To address the issue of stuck agent sessions, we will implement the following:

  • sessions_kill tool: Create a sessions_kill function that sends a graceful interrupt to the session before force-killing it, and cleans up the session store state.
  • Session health check API: Develop a sessions_health function that returns the session status, last activity timestamp, current tool call, and consecutive tool calls.
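  • Stuck-session detection heuristic: Implement a gateway-level heuristic that flags a session once it has been running longer than a configurable threshold (e.g. 10 min) and has produced no new output tokens for more than N seconds, then emits a session.stuck event.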
  • Clean state on kill: Ensure the session store is left in a consistent state after a session is killed.

Example Code

from datetime import datetime, timedelta, timezone

# All helpers called below (get_session_status, interrupt_session,
# force_kill_session, get_last_activity_timestamp, get_current_tool_call,
# get_consecutive_tool_calls, get_all_session_keys, remove_pending_tool_calls,
# reset_model_override) are hypothetical gateway internals, not existing
# OpenClaw APIs.

# sessions_kill tool
def sessions_kill(session_key):
    # Capture the state before the kill so the caller learns whether the
    # session was running, idle, or already stopped
    previous_status = get_session_status(session_key)
    if previous_status != 'dead':
        try:
            # Send a graceful interrupt first
            interrupt_session(session_key)
        except Exception:
            # Force-kill the session if the interrupt fails
            force_kill_session(session_key)
    # Clean up session store state
    cleanup_session_store(session_key)
    return previous_status

# Session health check API
def sessions_health(session_key):
    return {
        'status': get_session_status(session_key),
        'lastActivityAt': get_last_activity_timestamp(session_key),
        'currentToolCall': get_current_tool_call(session_key),
        'consecutiveToolCalls': get_consecutive_tool_calls(session_key),
    }

# Stuck-session detection heuristic
# Note: this checks inactivity only; the proposal also requires that the
# session has been in the running state longer than a configurable threshold.
def detect_stuck_sessions(threshold=timedelta(minutes=10)):
    # Assumes get_last_activity_timestamp returns timezone-aware UTC datetimes
    now = datetime.now(timezone.utc)
    stuck_sessions = []
    for session_key in get_all_session_keys():
        if get_session_status(session_key) != 'running':
            continue
        if now - get_last_activity_timestamp(session_key) > threshold:
            stuck_sessions.append(session_key)
    return stuck_sessions

# Clean state on kill
def cleanup_session_store(session_key):
    # Remove pending tool calls
    remove_pending_tool_calls(session_key)
    # Reset model override
    reset_model_override(session_key)
    # Partial writes are not handled in this sketch

Verification

To verify the fix, test the following scenarios:

  • Kill a session using the sessions_kill tool and verify that the session store state is cleaned up.
  • Check the session health using the sessions_health function and verify that the status, last activity timestamp, current tool call, and consecutive tool calls are accurate.
  • Configure the stuck-session detection heuristic and verify that it correctly detects stuck sessions and emits a session.stuck event.
  • Kill a session and verify that the session store is left in a consistent state.

Extra Tips

  • Implement logging and monitoring to track
