openclaw - 💡(How to fix) Fix Stuck-session detector only logs, no abort path; wedges cascade into event-loop blocking [1 comment, 2 participants]

GitHub stats
openclaw/openclaw#77935 · Fetched 2026-05-06 06:19:06
Comments: 1 · Participants: 2 · Timeline: 1 (commented ×1) · Reactions: 2

Summary

The gateway's stuck-session detector at dist/diagnostic-DitKp9ni.js only logs a warning when a session has been in processing state past diagnostics.stuckSessionWarnMs. There is no abort path. A wedged session stays wedged indefinitely until the operator hard-restarts the gateway, and during that window the rest of the gateway can pile up behind it (event-loop pressure, websocket pong timeouts, session-write-lock contention, restart-queue coalescing).

Reproduction

dist/diagnostic-DitKp9ni.js:340-368:

heartbeatInterval = setInterval(() => {
  const stuckSessionWarnMs = resolveStuckSessionWarnMs(heartbeatConfig);
  const now = Date.now();
  // ... pruning, memory sample, etc. ...
  for (const [, state] of diagnosticSessionStates) {
    const ageMs = now - state.lastActivity;
    if (state.state === "processing" && ageMs > stuckSessionWarnMs) logSessionStuck({
      sessionId: state.sessionId,
      sessionKey: state.sessionKey,
      state: state.state,
      ageMs
    });
  }
}, 3e4);

logSessionStuck (line ~277) only writes a warn log + emits a diagnostic event:

function logSessionStuck(params) {
  if (!areDiagnosticsEnabledForProcess()) return;
  const state = getDiagnosticSessionState(params);
  diagnosticLogger.warn(`stuck session: sessionId=... ageMs=${params.ageMs}`);
  emitDiagnosticEvent({ type: "session.stuck", ... });
  markDiagnosticActivity();
}

There is no call into the run controller, no abort, no signal — only telemetry.

Observed impact

On 2026-05-04 ~21:11–21:46 EDT this gateway wedged for ~2 hours with:

  • 134% CPU sustained for 2+ hrs
  • 5 Discord/Slack sessions stuck in processing state
  • Oldest stuck session age: 883s
  • [session-write-lock] releasing lock held for 18082ms (max=15000ms)
  • Slack websockets disconnecting every ~30–45s (event-loop blocked, pong timeouts)
  • A /restart issued at 21:45 was coalesced and never executed (the restart queue silently dropped it because a prior restart was still draining the same stuck sessions)

The user had no recovery path other than a hard process kill.

Proposed fix

Two-stage timeout: warn, then abort.

  1. Add diagnostics.stuckSessionAbortMs config (default e.g. 600s; should be ≥ 2× stuckSessionWarnMs).

  2. After a session crosses the abort threshold, call into the run controller to forcibly resolve the in-flight tool/LLM call with a stuck-session abort error:

    for (const [key, state] of diagnosticSessionStates) {
      const ageMs = now - state.lastActivity;
      if (state.state !== "processing") continue;
      if (ageMs > stuckSessionAbortMs) {
        abortStuckSession(state, { reason: "stuck-session-watchdog", ageMs });
      } else if (ageMs > stuckSessionWarnMs) {
        logSessionStuck({ ...state, ageMs });
      }
    }
  3. abortStuckSession needs to:

    • Resolve any pending LLM call promise with an abort error (today there's a per-LLM-call 300s timeout; this should be a per-session ceiling that fires across tool transitions).
    • Mark the session idle so the queue can advance.
    • Emit a session.aborted diagnostic event with reason and age.
    • Be safe to call concurrently with normal completion (idempotent).
  4. Restart queue: when a queued restart's first attempt fails to drain stuck sessions within N seconds, escalate to hard restart instead of silently coalescing subsequent /restart requests. Today, the second /restart during a wedge is dropped on the floor; this is the worst possible UX (operator thinks they triggered recovery and walks away).
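A minimal sketch of what step 3 could look like. It assumes each session state carries an AbortController whose signal is handed to the in-flight tool/LLM call; that field, the StuckSessionAbortError class, and the emit callback are hypothetical and not confirmed OpenClaw APIs. Only the abortStuckSession name, the "stuck-session-watchdog" reason, and the session.aborted event type come from the proposal above.

```javascript
// Hypothetical error type for the stuck-session abort described in the issue.
class StuckSessionAbortError extends Error {
  constructor(sessionId, ageMs) {
    super(`stuck session aborted: sessionId=${sessionId} ageMs=${ageMs}`);
    this.name = "StuckSessionAbortError";
    this.sessionId = sessionId;
    this.ageMs = ageMs;
  }
}

// Assumption: state.abortController exists and its signal was passed to the
// pending tool/LLM call when the run started.
function abortStuckSession(state, { reason, ageMs }, emit = () => {}) {
  // Idempotency guard: a repeated call, or one that loses the race with
  // normal completion, sees a non-processing session and does nothing.
  if (state.state !== "processing") return false;

  // Mark the session idle so the queue can advance.
  state.state = "idle";
  state.lastActivity = Date.now();

  // Signal the in-flight call to stop. Calls that honor the signal can
  // reject with this error; abort() on an already-aborted signal is a no-op.
  state.abortController?.abort(new StuckSessionAbortError(state.sessionId, ageMs));

  emit({ type: "session.aborted", sessionId: state.sessionId, reason, ageMs });
  return true;
}
```

The boolean return value is a design choice for this sketch: it lets the watchdog count real aborts separately from no-op calls.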

Why this matters

The current state of the gateway is: stuck-session detection exists, has a configurable threshold, emits OTel metrics, and writes warning logs — but does not actually do anything to recover. From an operator's perspective, the warn log is misleading: it implies the system is monitoring and handling the condition, when it's only monitoring.

A stuckSessionAbortMs feature with a sensible default would make multi-session pile-up self-healing within ~10 minutes of onset, instead of indefinite.
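The "~10 minutes" figure can be checked with a little arithmetic. The sketch below assumes the suggested 600s default and the 30s heartbeat interval (3e4) from the reproduction snippet; the helper name is hypothetical. Because the watchdog only looks at sessions on a heartbeat tick, the abort fires at the first tick where the age exceeds the threshold.

```javascript
// Returns the session age (ms since lastActivity) at the first heartbeat tick
// where the age exceeds thresholdMs, given the age seen at some earlier tick.
function ageAtFirstAbortTick(thresholdMs, intervalMs, firstTickAgeMs) {
  let ageMs = firstTickAgeMs;
  // The issue's loop uses a strict ">" comparison, so an age exactly equal
  // to the threshold waits one more tick.
  while (ageMs <= thresholdMs) ageMs += intervalMs;
  return ageMs;
}

// Worst case: a tick lands exactly on the threshold, so the abort waits a full interval.
const worstCaseMs = ageAtFirstAbortTick(600_000, 30_000, 0); // 630000
// Best case: a tick lands just past the threshold.
const bestCaseMs = ageAtFirstAbortTick(600_000, 30_000, 1); // 600001
```

Under these assumptions a stuck session would be aborted between roughly 600s and 630s after its last recorded activity.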

Environment

  • OpenClaw installed from npm at /opt/homebrew/lib/node_modules/openclaw/ (macOS, M-series, node v22)
  • Diagnostic timestamp: 2026-05-05

extent analysis

TL;DR

Add an abort threshold after the existing warn threshold, making the timeout two-stage, so that stuck sessions are forcibly resolved instead of wedging the gateway indefinitely.

Guidance

  • Introduce a new configuration option diagnostics.stuckSessionAbortMs with a default value (e.g., 600s) to define the abort threshold.
  • Modify the stuck session detection logic to call abortStuckSession when a session exceeds the abort threshold, which should resolve any pending LLM call promises and mark the session as idle.
  • Update the restart queue logic to escalate to a hard restart if a queued restart fails to drain stuck sessions within a specified time frame (e.g., N seconds).
  • Ensure the abortStuckSession function is idempotent and safe to call concurrently with normal completion.
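The restart-queue escalation could be sketched as follows. The issue does not show OpenClaw's real restart queue, so the RestartQueue class, its options, and the returned status strings are all hypothetical. The clock is injected so the behaviour can be exercised without waiting.

```javascript
// Hypothetical sketch: escalates a repeated /restart instead of dropping it.
class RestartQueue {
  constructor({ drainTimeoutMs, hardRestart, now = Date.now }) {
    this.drainTimeoutMs = drainTimeoutMs; // the "N seconds" from the proposal
    this.hardRestart = hardRestart;       // callback that performs the hard restart
    this.now = now;
    this.drainStartedAt = null;           // null means no restart is in flight
  }

  // Returns "started", "coalesced", or "escalated" so the caller can tell
  // the operator what actually happened to the request.
  request() {
    if (this.drainStartedAt === null) {
      this.drainStartedAt = this.now();
      return "started";
    }
    const drainingForMs = this.now() - this.drainStartedAt;
    if (drainingForMs < this.drainTimeoutMs) return "coalesced";
    // The earlier restart has been draining too long: escalate rather than drop.
    this.drainStartedAt = null;
    this.hardRestart({ drainingForMs });
    return "escalated";
  }

  // Called when a graceful drain finishes normally.
  drained() {
    this.drainStartedAt = null;
  }
}
```

This version only escalates when a later /restart arrives. A fuller implementation would probably also arm a timer so that the first restart escalates on its own once the drain timeout passes.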

Example

for (const [key, state] of diagnosticSessionStates) {
  const ageMs = now - state.lastActivity;
  if (state.state !== "processing") continue;
  if (ageMs > stuckSessionAbortMs) {
    abortStuckSession(state, { reason: "stuck-session-watchdog", ageMs });
  } else if (ageMs > stuckSessionWarnMs) {
    logSessionStuck({ ...state, ageMs });
  }
}

Notes

The proposed fix assumes that abortStuckSession is implemented so that it correctly resolves pending LLM call promises and marks the session idle. The value of stuckSessionAbortMs should be chosen to balance preventing indefinite wedging against prematurely aborting legitimate long-running sessions.
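One way to keep the abort threshold from firing too early is to clamp it against the warn threshold when the config is resolved. The helper below is a sketch: resolveStuckSessionAbortMs is a hypothetical name modelled on resolveStuckSessionWarnMs, and the config shape (config.diagnostics.stuckSessionAbortMs) is an assumption based on the option name in the proposal.

```javascript
// 600s is the example default suggested in the issue.
const DEFAULT_STUCK_SESSION_ABORT_MS = 600_000;

// Hypothetical resolver; the real config shape may differ.
function resolveStuckSessionAbortMs(config, stuckSessionWarnMs) {
  const configured = config?.diagnostics?.stuckSessionAbortMs;
  // Fall back to the default when the option is missing or not a positive number.
  const requested = Number.isFinite(configured) && configured > 0
    ? configured
    : DEFAULT_STUCK_SESSION_ABORT_MS;
  // Enforce the constraint from the proposal: abort >= 2x the warn threshold.
  return Math.max(requested, 2 * stuckSessionWarnMs);
}
```

With a 120s warn threshold, an unset option resolves to 600000 ms, while a configured value of 100000 ms is raised to 240000 ms.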

Recommendation

Adopt the proposed two-stage timeout so that the gateway gains a stuck-session abort mechanism. This prevents indefinite wedging and gives operators a recovery path short of a hard process kill.
