openclaw: No circuit breaker when context overflow coincides with lane timeout; death spiral unrecoverable without restart

GitHub stats: openclaw/openclaw#77742, fetched 2026-05-06 06:22:05. Comments: 1. Participants: 2. Reactions: 2. Timeline events: 2 (closed ×1, commented ×1).

When context overflow and lane timeout conditions coincide, the gateway has no circuit breaker to prevent a death spiral. The system continues attempting the same failing operations (compaction, model retries) without escalation or session rotation, eventually starving the event loop and blocking all channels.

Error Message

[diagnostic] stuck session: sessionId=main state=processing age=41514s queueDepth=0
[diagnostic] liveness warning: eventLoopDelayP99Ms=12876.5 eventLoopUtilization=0.998
compactionAttempts=0 — auto-compaction silently fails to start
lane task error: timeout (630s hardcoded)

Root Cause

The gateway lacks a circuit breaker that would:

  • Detect repeated compaction failures + lane timeouts on same session
  • Escalate to aggressive recovery (force-truncate, session rotation with handoff summary)
  • Mark session as unhealthy and bypass normal lane queuing
  • Alert operators before event loop starvation becomes critical

Current behavior treats each failure as independent, retrying the same approach indefinitely.
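To illustrate what correlating failures per session (rather than treating each one as independent) could look like, here is a minimal sketch. The class name, method names, and failure kinds are hypothetical and are not taken from the OpenClaw codebase; the threshold is supplied by the caller.

```typescript
// Hypothetical failure kinds for illustration; not OpenClaw identifiers.
type FailureKind = "compaction" | "laneTimeout";

// Tracks consecutive failures per session so that repeated failures on the
// same session can be detected instead of being retried in isolation.
class SessionFailureTracker {
  private readonly counts = new Map<string, Record<FailureKind, number>>();

  // Records one failure and returns the session's combined failure count.
  recordFailure(sessionId: string, kind: FailureKind): number {
    const entry = this.counts.get(sessionId) ?? { compaction: 0, laneTimeout: 0 };
    entry[kind] += 1;
    this.counts.set(sessionId, entry);
    return entry.compaction + entry.laneTimeout;
  }

  // A successful operation ends the streak for that session.
  recordSuccess(sessionId: string): void {
    this.counts.delete(sessionId);
  }

  // True once compaction failures plus lane timeouts reach the threshold.
  isUnhealthy(sessionId: string, threshold: number): boolean {
    const entry = this.counts.get(sessionId);
    return entry !== undefined && entry.compaction + entry.laneTimeout >= threshold;
  }
}
```

With a threshold of 3, two compaction failures followed by one lane timeout on the same session would mark it unhealthy, which matches the "2-3 compaction attempts + 1 lane timeout" example given in the issue.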



Observed Failure Cascade (2026-05-05 Incident)

  1. Session hits context overflow → auto-compaction triggers
  2. Compaction fails to reduce context (compactionAttempts=0, or model produces inadequate summary)
  3. Lane task times out (hardcoded 630s, see #77741) but lane state remains stuck
  4. Gateway retries same operation with same model
  5. Event loop delay spikes (ELD P99 > 10s), CPU sustained at 100%
  6. All inbound messages queue behind stuck lane → complete unresponsiveness
  7. Only recovery: manual gateway restart


Expected Behavior

After N consecutive failures (e.g., 2-3 compaction attempts + 1 lane timeout):

  1. Circuit breaker trips — session marked unhealthy
  2. Escalation chain fires:
    • Try compaction with fallback model (if available)
    • Escalate to compact_then_truncate mode
    • Force session rotation with handoff summary from last assistant message
  3. Alert operators — Discord/Slack notification with failure context
  4. Bypass lane queue — critical recovery operations use priority lane
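The escalation chain described above can be sketched as an ordered list of recovery steps, tried one at a time until one succeeds. This is only an illustration: the step names mirror the list in the issue, but the `RecoveryStep` shape and `runEscalationChain` function are hypothetical, and the steps are synchronous here for brevity where a real gateway would most likely use asynchronous operations.

```typescript
// A hypothetical recovery action; `run` returns true if recovery succeeded.
interface RecoveryStep {
  name: string;
  run: () => boolean;
}

interface EscalationResult {
  recoveredBy: string | null; // name of the step that succeeded, if any
  attempted: string[];        // every step tried, in order
}

// Tries each step in order and stops at the first success.
// A step that throws is treated as a failed step, not a fatal error.
function runEscalationChain(steps: RecoveryStep[]): EscalationResult {
  const attempted: string[] = [];
  for (const step of steps) {
    attempted.push(step.name);
    let ok = false;
    try {
      ok = step.run();
    } catch {
      ok = false;
    }
    if (ok) {
      return { recoveredBy: step.name, attempted };
    }
  }
  return { recoveredBy: null, attempted };
}

// Example wiring with stubbed outcomes (the results are made up).
const exampleResult = runEscalationChain([
  { name: "fallback_model_compaction", run: () => false },
  { name: "compact_then_truncate", run: () => false },
  { name: "session_rotation_with_handoff", run: () => true },
]);
```

If `recoveredBy` comes back as `null`, every step failed, which is the point at which alerting operators becomes the only remaining option.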

Impact

  • Single session blocks entire gateway — all channels affected
  • No automatic recovery — requires manual intervention
  • Prolonged downtime — 10+ minutes minimum (lane timeout) to hours (if undetected)
  • Resource exhaustion — sustained 100% CPU, event loop delay > 10s

Evidence

From 2026-05-05 incident logs:

[diagnostic] stuck session: sessionId=main state=processing age=41514s queueDepth=0
[diagnostic] liveness warning: eventLoopDelayP99Ms=12876.5 eventLoopUtilization=0.998
compactionAttempts=0 — auto-compaction silently fails to start
lane task error: timeout (630s hardcoded)
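As a rough illustration of how the liveness figures in these logs could feed an early alert, the sketch below classifies a sample by event loop delay and utilization. The field names come from the log lines above; the threshold values and the `classifyLiveness` function are assumptions chosen for the example, not OpenClaw defaults.

```typescript
// Field names match the diagnostic log line; everything else is hypothetical.
interface LivenessSample {
  eventLoopDelayP99Ms: number;
  eventLoopUtilization: number;
}

type LivenessLevel = "ok" | "warning" | "critical";

// Assumed thresholds for illustration only.
const ASSUMED_THRESHOLDS = {
  warnDelayMs: 1000,
  criticalDelayMs: 10000, // the issue cites event loop delay P99 > 10s
  warnUtilization: 0.9,
  criticalUtilization: 0.99,
};

// Returns the most severe level triggered by either metric.
function classifyLiveness(sample: LivenessSample): LivenessLevel {
  const t = ASSUMED_THRESHOLDS;
  if (
    sample.eventLoopDelayP99Ms >= t.criticalDelayMs ||
    sample.eventLoopUtilization >= t.criticalUtilization
  ) {
    return "critical";
  }
  if (
    sample.eventLoopDelayP99Ms >= t.warnDelayMs ||
    sample.eventLoopUtilization >= t.warnUtilization
  ) {
    return "warning";
  }
  return "ok";
}
```

Under these assumed thresholds, the incident sample (`eventLoopDelayP99Ms=12876.5`, `eventLoopUtilization=0.998`) would classify as critical on both metrics.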

Related

  • #77738 (compactionAttempts=0 prevents auto-compaction from starting)
  • #77741 (Lane timeout hardcoded at 630s)
  • #48488 (Lane queue has no task-level timeout)
  • #70334 (Session lock stuck after compaction succeeds)
  • #76467 (Gateway unresponsive after compaction triggers)
  • #45686 (Compaction circuit breaker: fallback model + force-truncate)
  • #58580 (Session health check: auto-clear stuck sessions)

Proposed Fix

Add circuit breaker logic to the gateway session lane handler:

interface CircuitBreakerState {
  consecutiveCompactionFailures: number;
  consecutiveLaneTimeouts: number;
  lastFailureTime: number;
  tripThreshold: number; // e.g., 3
  resetAfterMs: number; // e.g., 300000 (5 min)
}

// When threshold exceeded:
// 1. Trip circuit breaker
// 2. Escalate to aggressive recovery
// 3. Alert operators
// 4. Bypass normal lane for recovery ops
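One possible reading of how `tripThreshold` and `resetAfterMs` interact is sketched below. The interface is repeated from the issue so the block is self-contained; `createBreaker`, `recordFailure`, and `recordSuccess` are hypothetical helpers, and treating `lastFailureTime` as a millisecond timestamp is an assumption.

```typescript
interface CircuitBreakerState {
  consecutiveCompactionFailures: number;
  consecutiveLaneTimeouts: number;
  lastFailureTime: number; // assumed: milliseconds, 0 means "no failure yet"
  tripThreshold: number;
  resetAfterMs: number;
}

function createBreaker(tripThreshold = 3, resetAfterMs = 300000): CircuitBreakerState {
  return {
    consecutiveCompactionFailures: 0,
    consecutiveLaneTimeouts: 0,
    lastFailureTime: 0,
    tripThreshold,
    resetAfterMs,
  };
}

// Records a failure at time `now` (ms) and returns true if the breaker trips.
function recordFailure(
  state: CircuitBreakerState,
  kind: "compaction" | "laneTimeout",
  now: number,
): boolean {
  // A quiet period longer than resetAfterMs ends the previous streak.
  if (state.lastFailureTime > 0 && now - state.lastFailureTime > state.resetAfterMs) {
    state.consecutiveCompactionFailures = 0;
    state.consecutiveLaneTimeouts = 0;
  }
  if (kind === "compaction") {
    state.consecutiveCompactionFailures += 1;
  } else {
    state.consecutiveLaneTimeouts += 1;
  }
  state.lastFailureTime = now;
  const total = state.consecutiveCompactionFailures + state.consecutiveLaneTimeouts;
  return total >= state.tripThreshold;
}

// A successful operation clears the streak.
function recordSuccess(state: CircuitBreakerState): void {
  state.consecutiveCompactionFailures = 0;
  state.consecutiveLaneTimeouts = 0;
  state.lastFailureTime = 0;
}
```

Passing `now` explicitly, instead of calling `Date.now()` inside the function, keeps the logic deterministic and easy to unit test. Summing both counters is a design choice for this sketch; separate thresholds per failure kind would be an equally valid design.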

Environment

  • OpenClaw 2026.5.3
  • Observed in production ghosting incident 2026-05-05
  • Affects all channels (Discord, Telegram, Webchat, cron)

extent analysis

TL;DR

Implement a circuit breaker in the gateway session lane handler to detect and respond to repeated compaction failures and lane timeouts.

Guidance

  • Introduce a CircuitBreakerState interface to track consecutive compaction failures, lane timeouts, and last failure time.
  • Set a trip threshold (e.g., 3) and reset time (e.g., 5 minutes) for the circuit breaker.
  • When the threshold is exceeded, trip the circuit breaker and escalate to aggressive recovery, alert operators, and bypass normal lane queuing.
  • Consider implementing a fallback model for compaction and force-truncate mode as part of the escalation chain.

Example

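// Assumes the CircuitBreakerState interface from the Proposed Fix section.
const circuitBreakerState: CircuitBreakerState = {
  consecutiveCompactionFailures: 0,
  consecutiveLaneTimeouts: 0,
  lastFailureTime: 0,
  tripThreshold: 3,
  resetAfterMs: 300000, // 5 minutes
};

// Placeholder hooks: the issue names these steps but does not define them.
const tripCircuitBreaker = () => console.warn("circuit breaker tripped");
const escalateToAggressiveRecovery = () => console.warn("escalating recovery");
const alertOperators = () => console.warn("alerting operators");
const bypassNormalLaneQueuing = () => console.warn("using priority lane");

// Count compaction failures and lane timeouts together, because the
// death spiral occurs when both kinds of failure hit the same session.
const consecutiveFailures =
  circuitBreakerState.consecutiveCompactionFailures +
  circuitBreakerState.consecutiveLaneTimeouts;

if (consecutiveFailures >= circuitBreakerState.tripThreshold) {
  // Trip the circuit breaker and escalate to aggressive recovery
  tripCircuitBreaker();
  escalateToAggressiveRecovery();
  alertOperators();
  bypassNormalLaneQueuing();
}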

Notes

The proposed fix requires careful tuning of the trip threshold and reset time to balance between preventing death spirals and allowing for legitimate retries. Additionally, the escalation chain and alerting mechanisms should be designed to minimize false positives and ensure timely operator intervention.

Recommendation

Apply the proposed circuit breaker so that repeated compaction failures and lane timeouts trigger escalation instead of a death spiral. This should keep a single session from blocking the entire gateway and reduce the need for manual restarts.
