openclaw - (How to fix) feat: Gateway response timeout watchdog for stale sessions [1 participant]

openclaw/openclaw#43851 · Fetched 2026-04-08 00:18:37

Root Cause

This was identified while building a healing system (checkpoint + heartbeat + auto-resume). Sub-agent monitoring works well via sessions_list + subagents list, but main session hangs remain a blind spot because there is no way to know if a user message is waiting for a response.

Fix / Workaround

When a session receives a user message but fails to respond (due to hanging, context overflow, or silent failure), there is no mechanism to detect this. The gateway currently uses fire-and-forget delivery — once a message is dispatched to a session, there is no tracking of whether a response was produced.

  1. When gateway dispatches a user message to a session, start a timer
  2. If no response is produced within N minutes (configurable, suggest default 5 min), emit a warning event
  3. The warning could be:
    • A system event that heartbeat can pick up
    • A direct notification to a configured channel
    • A webhook/callback

Code Example

gateway:
  responseTimeout:
    enabled: true
    warnAfter: 300    # seconds (5 min)
    action: "event"   # or "notify" or "callback"
    notifyChannel: "discord:channel_id"  # optional

Problem

When a session receives a user message but fails to respond (due to hanging, context overflow, or silent failure), there is no mechanism to detect this. The gateway currently uses fire-and-forget delivery — once a message is dispatched to a session, there is no tracking of whether a response was produced.

Current timeouts:

  • Discord listener: 120s (event handler)
  • Discord worker: 30 min (message processing) — too long for user-facing responsiveness
  • No "pending response" tracking exists

Impact

Users send messages and wait indefinitely with no feedback. The only current detection is via heartbeat scanning sessions_list for last_active, but this cannot distinguish between "idle (no messages)" and "stuck (message received, no reply)".

Proposed Solution

Add a response timeout watchdog at the gateway level:

  1. When gateway dispatches a user message to a session, start a timer
  2. If no response is produced within N minutes (configurable, suggest default 5 min), emit a warning event
  3. The warning could be:
    • A system event that heartbeat can pick up
    • A direct notification to a configured channel
    • A webhook/callback
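
The three steps above can be sketched with one asyncio timer per dispatched message. This is only an illustration under assumptions: the class name `ResponseWatchdog`, its method names, and the `on_timeout` callback signature are hypothetical and do not come from the openclaw codebase, and Python is used only to match the example further down this page.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# Hypothetical callback type: receives (session_id, message_id).
WarnHandler = Callable[[str, str], Awaitable[None]]


class ResponseWatchdog:
    """Keeps one timer per dispatched user message (illustrative only)."""

    def __init__(self, warn_after: float, on_timeout: WarnHandler) -> None:
        self.warn_after = warn_after      # seconds to wait before warning
        self.on_timeout = on_timeout      # emits the event/notification/webhook
        self._timers: Dict[str, asyncio.Task] = {}

    def message_dispatched(self, session_id: str, message_id: str) -> None:
        # Step 1: start a timer when the message is handed to a session.
        # Must be called from inside a running event loop.
        self._timers[message_id] = asyncio.create_task(
            self._wait_then_warn(session_id, message_id)
        )

    def response_produced(self, message_id: str) -> None:
        # A reply arrived in time: cancel the timer so no warning is emitted.
        task = self._timers.pop(message_id, None)
        if task is not None:
            task.cancel()

    async def _wait_then_warn(self, session_id: str, message_id: str) -> None:
        await asyncio.sleep(self.warn_after)
        # Step 2: still pending after warn_after seconds, so emit the warning.
        self._timers.pop(message_id, None)
        await self.on_timeout(session_id, message_id)
```

One timer per message is simple but costs one task per in-flight message; a periodic sweep over a table of timestamps is the usual alternative when message volume is high.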

Suggested Config

gateway:
  responseTimeout:
    enabled: true
    warnAfter: 300    # seconds (5 min)
    action: "event"   # or "notify" or "callback"
    notifyChannel: "discord:channel_id"  # optional
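
A gateway reading this section would likely want to validate it at startup. The sketch below assumes the config has already been parsed into a plain mapping; the class `ResponseTimeoutConfig`, the `from_mapping` helper, and the choice to default `enabled` to false when the section is missing are all assumptions made for illustration, not part of the proposal.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

# The three action values named in the suggested config.
VALID_ACTIONS = {"event", "notify", "callback"}


@dataclass(frozen=True)
class ResponseTimeoutConfig:
    enabled: bool = False                 # assumed default: off unless configured
    warn_after: int = 300                 # seconds (5 min), as suggested above
    action: str = "event"
    notify_channel: Optional[str] = None  # only needed for action "notify"

    @classmethod
    def from_mapping(cls, raw: Mapping[str, Any]) -> "ResponseTimeoutConfig":
        """Build and validate the config from an already-parsed mapping."""
        section = raw.get("gateway", {}).get("responseTimeout", {})
        cfg = cls(
            enabled=bool(section.get("enabled", False)),
            warn_after=int(section.get("warnAfter", 300)),
            action=str(section.get("action", "event")),
            notify_channel=section.get("notifyChannel"),
        )
        if cfg.warn_after <= 0:
            raise ValueError("warnAfter must be a positive number of seconds")
        if cfg.action not in VALID_ACTIONS:
            raise ValueError(f"unknown action: {cfg.action!r}")
        if cfg.action == "notify" and not cfg.notify_channel:
            raise ValueError("action 'notify' requires notifyChannel")
        return cfg
```

Rejecting an unknown `action` or a missing `notifyChannel` at load time avoids discovering the mistake only when the first timeout fires.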

Alternative: Expose pending message queue

Even without auto-notification, exposing a sessions_list field like pendingMessages or lastUserMessageAt would let heartbeat agents detect stale responses without gateway-level changes.
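
With such a field, a heartbeat agent could tell "idle" from "stuck" on its own. The sketch below assumes each session record carries a `lastUserMessageAt` timestamp plus a second, hypothetical `lastResponseAt` timestamp (the issue does not name the latter); both are passed in as seconds since the epoch, and the function name `classify_session` is likewise illustrative.

```python
from typing import Optional


def classify_session(
    last_user_message_at: Optional[float],
    last_response_at: Optional[float],
    now: float,
    warn_after: float = 300.0,
) -> str:
    """Return "idle", "pending", or "stuck" for one session record."""
    if last_user_message_at is None:
        return "idle"  # no user message has ever arrived
    if last_response_at is not None and last_response_at >= last_user_message_at:
        return "idle"  # the latest user message has been answered
    # A user message is waiting; how long has it been?
    waited = now - last_user_message_at
    return "stuck" if waited >= warn_after else "pending"
```

The comparison of the two timestamps is what `last_active` alone cannot provide: a session with no recent activity and no unanswered message is idle, not stuck.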

Context

This was identified while building a healing system (checkpoint + heartbeat + auto-resume). Sub-agent monitoring works well via sessions_list + subagents list, but main session hangs remain a blind spot because there is no way to know if a user message is waiting for a response.

extent analysis

Fix Overview

Add a pending‑response tracker in the gateway and a watchdog that fires after a configurable timeout (default 5 min).
When a user message is handed to a session we:

  1. Record the message ID + timestamp in a fast store (in‑memory dict or Redis).
  2. Start a background timer (or a periodic sweep) that checks for entries older than warnAfter.
  3. If the timeout expires and the entry is still present, emit the configured action (event, Discord notification, or webhook).
  4. When the session finally replies, remove the entry so the watchdog does nothing.
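
The periodic-sweep variant of these four steps can be sketched as a small tracker. All names here (`PendingTracker`, `PendingEntry`, `resolve`, `sweep`) are hypothetical, and the `warned` flag is an added assumption so that a message that stays unanswered produces one warning instead of one per sweep.

```python
import time
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PendingEntry:
    session_id: str
    dispatched_at: float
    warned: bool = False  # set once a warning has been emitted


class PendingTracker:
    def __init__(self, warn_after: float) -> None:
        self.warn_after = warn_after
        self._entries: Dict[str, PendingEntry] = {}

    def add(self, message_id: str, session_id: str,
            now: Optional[float] = None) -> None:
        # Step 1: record message ID + timestamp at dispatch time.
        stamp = time.monotonic() if now is None else now
        self._entries[message_id] = PendingEntry(session_id, stamp)

    def resolve(self, message_id: str) -> None:
        # Step 4: the session replied, so drop the entry.
        self._entries.pop(message_id, None)

    def sweep(self, now: Optional[float] = None) -> List[str]:
        # Steps 2-3: return IDs that just crossed the timeout; the caller
        # then performs the configured action for each of them.
        current = time.monotonic() if now is None else now
        expired = []
        for message_id, entry in self._entries.items():
            if not entry.warned and current - entry.dispatched_at >= self.warn_after:
                entry.warned = True
                expired.append(message_id)
        return expired
```

The optional `now` argument exists only to make the tracker easy to test with fixed timestamps; in normal use it falls back to `time.monotonic()`, which is not affected by wall-clock adjustments.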

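The change is limited to the dispatch path, the reply path, and one new config section; no architectural overhaul is required. With more than one gateway worker, the store has to be shared (for example in Redis) so that whichever worker sees the reply can clear the entry.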


Step‑by‑Step Implementation (Python example)

Adjust the language/structures to match your stack (Node, Go, etc.). The core ideas stay the same.

1. Add config

# config.yaml
gateway:
  responseTimeout:
    enabled: true
    warnAfter: 300          # seconds (5 min)
    action: "notify"        # "event" | "notify" | "callback"
    notifyChannel: "discord:123456789012345678"
    callbackUrl: "https://my.service/timeout"

2. Initialise a store for pending messages

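# gateway/pending.py
import time
from typing import Dict, List, Tuple

# In-memory store; replace with Redis if you need persistence across workers.
# Maps message_id -> (session_id, time the message was dispatched).
_pending: Dict[str, Tuple[str, float]] = {}

def add(message_id: str, session_id: str) -> None:
    """Record a dispatched user message that has not been answered yet."""
    _pending[message_id] = (session_id, time.time())

def remove(message_id: str) -> None:
    """Forget a message once its session has replied (no-op if unknown)."""
    _pending.pop(message_id, None)

def all_items() -> List[Tuple[str, Tuple[str, float]]]:
    """Return a snapshot, so callers can remove entries while iterating."""
    return list(_pending.items())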

3. Hook into the dispatch path

# gateway/dispatch.py
from .pending import add, remove

async def dispatch_user_message(message):
    """
    Called when the gateway receives a user message and forwards it to a session.
    """
    session_id = await route_to_session(message)          # existing logic
    add(message.id, session_id)                           # <‑‑ NEW
