openclaw - 💡(How to fix) Fix 2026.4.8: Large-session overflow/compaction timeout can cascade into GatewayDrainingError + subagent announce loss; fallback chain stale until restart [1 participants]

openclaw/openclaw#63279 · Fetched 2026-04-09 07:55:58

After upgrading to OpenClaw 2026.4.8, very large Telegram sessions repeatedly hit context overflow and compaction timeout. In this incident, that appeared to cascade into prolonged gateway draining/task rejection, repeated subagent announce failures, and surprising fallback behavior.

This does not look like a simple case of “Anthropic fallback is broken.” The evidence points more strongly to a chain of failures:

  1. huge-session overflow + compaction timeout
  2. gateway enters draining/restart state and rejects new tasks
  3. subagent completion announce retries fail/give up
  4. fallback decisions during drain can still route into stale runtime fallback candidates

The user reports this behavior did not happen before this version.

Environment

  • OpenClaw: 2026.4.8
  • OS: macOS arm64
  • Channel: Telegram (multiple accounts)
  • Large long-lived sessions (hundreds to >1300 messages)

Evidence (local)

1) Massive overflow + compaction timeout

From artifacts/openclaw-2026-04-08-incident-extract.txt:

  • estimatedPromptTokens=1014988 / overflowTokens=759372
  • messages=1331 / messages=1351+ on affected Telegram sessions
  • multiple compaction failures at ~900s:
    • outcome=failed reason=timeout durationMs=900342
    • outcome=failed reason=timeout durationMs=900533
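To put the overflow numbers in perspective, the arithmetic can be sketched as follows. This assumes that overflowTokens is simply the estimated prompt size minus the context budget; the report does not state that definition, so treat the derived budget as an inference. The `parse_fields` helper is hypothetical and only exists for this illustration.

```python
import re


def parse_fields(line: str) -> dict[str, int]:
    """Extract key=integer pairs from a log fragment."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}


fields = parse_fields("estimatedPromptTokens=1014988 / overflowTokens=759372")

# Assumption: overflowTokens = estimatedPromptTokens - context budget.
implied_budget = fields["estimatedPromptTokens"] - fields["overflowTokens"]
overflow_ratio = fields["estimatedPromptTokens"] / implied_budget

print(implied_budget)            # 255616
print(round(overflow_ratio, 2))  # 3.97
```

Under that assumption the prompt is roughly four times the available budget. The two compaction durations (900342 ms and 900533 ms) both sit just above 900000 ms, which is consistent with a 900-second timeout, although the report does not name the configured value.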

2) During failure chain, gateway rejects tasks as draining

From incident extract and ~/.openclaw/logs/gateway.err.log:

  • GatewayDrainingError: Gateway is draining for restart; new tasks are not accepted
  • drain timeout reached; proceeding with restart (logged repeatedly)
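A quick way to gauge how often each failure signature appears is to tally matching log lines. The sketch below uses only substrings quoted in this report; the `tally` function and the `SIGNATURES` tuple are illustrative names, not part of OpenClaw.

```python
from collections import Counter

# Substrings copied from the log lines quoted in this report.
SIGNATURES = (
    "GatewayDrainingError",
    "drain timeout reached",
    "announce give up",
    "reason=timeout",
)


def tally(lines):
    """Count how many lines contain each signature (case-insensitive)."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for signature in SIGNATURES:
            if signature.lower() in lowered:
                counts[signature] += 1
    return counts


sample = [
    "GatewayDrainingError: Gateway is draining for restart; new tasks are not accepted",
    "drain timeout reached; proceeding with restart",
    "Subagent completion direct announce failed ... GatewayDrainingError",
    "Subagent announce give up (retry-limit)",
]
counts = tally(sample)
print(counts["GatewayDrainingError"])  # 2
```

To run it against a real log, pass an open file handle for ~/.openclaw/logs/gateway.err.log instead of `sample` (expand the `~` with `os.path.expanduser` first).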

3) Subagent completion announce failures/retries during this state

  • Subagent completion direct announce failed ... GatewayDrainingError
  • Subagent announce completion ... transient failure, retrying
  • Subagent announce give up (retry-limit)
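One possible way to keep announces from being dropped is to treat a drain rejection differently from other transient failures: wait the drain out up to a deadline instead of spending the normal retry budget on it. The following is a minimal sketch of that idea, not OpenClaw's actual retry code. The `GatewayDrainingError` class here is a local stand-in, and `announce_with_drain_wait` and all of its parameters are hypothetical.

```python
import time


class GatewayDrainingError(Exception):
    """Local stand-in for the gateway's drain rejection (hypothetical)."""


def announce_with_drain_wait(send, *, max_attempts=3, drain_deadline_s=60.0,
                             drain_poll_s=2.0, sleep=time.sleep,
                             clock=time.monotonic):
    """Call send() until it succeeds.

    Drain rejections are waited out (up to drain_deadline_s) and do not
    consume the retry budget; any other exception does.
    """
    attempts = 0
    drain_started = None
    while True:
        try:
            return send()
        except GatewayDrainingError:
            now = clock()
            if drain_started is None:
                drain_started = now
            if now - drain_started >= drain_deadline_s:
                raise  # the drain outlasted the deadline; surface the error
            sleep(drain_poll_s)
        except Exception:
            attempts += 1
            if attempts >= max_attempts:
                raise
            sleep(min(2 ** attempts, 30))  # capped exponential backoff
```

With this shape, a send that is rejected several times during a short drain still succeeds once the gateway accepts tasks again, while genuinely failing sends still give up after `max_attempts`.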

4) Anthropic fallback still appeared in live fallback decisions during drain

Even after removing Anthropic fallback from config on disk, log lines during draining still showed:

  • next=anthropic/claude-haiku-4-5 detail=Gateway is draining for restart; new tasks are not accepted
  • and auth failure attempts:
    • candidate=anthropic/claude-haiku-4-5 reason=auth ... HTTP 401 authentication_error: invalid x-api-key

5) On-disk config had Anthropic removed, but runtime lagged until restart

  • Current ~/.openclaw/openclaw.json fallback list is only:
    • openai-codex/gpt-5.4 -> openai-codex/gpt-5.3-codex
  • Local commit removing Anthropic fallback:
    • 19172db Remove Anthropic model fallback config
  • openclaw.json metadata shows it was touched at 2026-04-08T17:10:23.793Z
  • But runtime logs still had next=anthropic/claude-haiku-4-5 at 2026-04-08T17:10:27.547+01:00

This suggests live runtime config/fallback chain can remain stale until gateway restart/reload.
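When comparing the two timestamps above, note that they carry different UTC offsets (`Z` versus `+01:00`), so they should be normalized before the ordering is relied upon. The sketch below does that with the standard library; `to_utc` is an illustrative helper.

```python
from datetime import datetime, timezone


def to_utc(stamp: str) -> datetime:
    """Parse an ISO 8601 timestamp and convert it to UTC."""
    # fromisoformat() accepts a trailing "Z" only on Python 3.11+,
    # so rewrite it as an explicit offset for older versions.
    return datetime.fromisoformat(stamp.replace("Z", "+00:00")).astimezone(timezone.utc)


config_touched = to_utc("2026-04-08T17:10:23.793Z")
log_line = to_utc("2026-04-08T17:10:27.547+01:00")

# Positive means the log line was written after the config was touched.
delta_s = (log_line - config_touched).total_seconds()
print(delta_s)
```

Taken literally, the offsets place the log line about an hour before the config touch rather than four seconds after it. That may only mean one of the two stamps is labeled with the wrong offset, so it is worth confirming which clock each source uses before treating this pair as proof of staleness.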

Doctor note

openclaw doctor --fix was run locally, but this alone did not reload the running gateway process.

Expected behavior

  1. Very large-session overflow/compaction failure should degrade gracefully without cascading into prolonged drain/task rejection loops.
  2. Subagent completion announces should not be lost or abandoned (give-up) during gateway drain windows.
  3. Runtime fallback chain should not continue using removed fallback providers after config changes are applied to disk.
  4. If restart/reload is required for fallback-chain changes, surface this clearly in CLI/doctor output.

Actual behavior

  • Overflow + compaction timeout chain coincided with gateway draining errors and task rejection.
  • Subagent announce retries frequently failed/gave up.
  • Fallback routing still referenced Anthropic during draining, causing 401 auth errors, despite Anthropic fallback being removed on disk.

Potentially related (but not exact duplicate)

  • #44031 (compaction timeout hangs)
  • #40295 (compaction deadlock/session recovery)
  • #55412 (GatewayDrainingError retry behavior)
  • #54276 (subagent announce give-up)
  • #62095 (doctor --fix expectations)

Request

Please investigate this as a possible 2026.4.8 regression/failure-chain interaction:

  • large-session overflow/compaction timeout
  • gateway drain/restart task rejection behavior
  • subagent announce resilience during drain
  • runtime config/fallback-chain reload semantics (especially after removing providers)

If useful, I can provide the extracted incident artifacts/log snippets listed above.

extent analysis

TL;DR

The most likely fix is to reload the runtime configuration whenever the on-disk configuration changes, so that the fallback chain is updated without requiring a gateway restart.
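A reload mechanism of this kind could be as simple as re-reading the file whenever its modification time changes. The sketch below illustrates the idea only. The `FallbackChain` class is hypothetical, and the `fallbacks` key is an assumed field name, since the real openclaw.json schema is not shown in this report.

```python
import json
import os


class FallbackChain:
    """Caches the fallback list and re-reads it when the file's mtime changes."""

    def __init__(self, path, key="fallbacks"):
        # `key` is a hypothetical field name; the real schema may differ.
        self.path = path
        self.key = key
        self._mtime_ns = None
        self._chain = []

    def current(self):
        """Return the fallback list, reloading it if the file was modified."""
        mtime_ns = os.stat(self.path).st_mtime_ns
        if mtime_ns != self._mtime_ns:
            with open(self.path, encoding="utf-8") as f:
                self._chain = list(json.load(f).get(self.key, []))
            self._mtime_ns = mtime_ns
        return self._chain
```

Calling `current()` before every fallback decision means a removed provider disappears from the candidate list on the next decision after the edit. Modification-time checks can miss two writes that land within the filesystem's timestamp granularity, so a production version might also compare a content hash.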

Guidance

  • Investigate the openclaw doctor --fix command to determine why it does not reload the running gateway process, and consider modifying it to trigger a reload when necessary.
  • Review the gateway draining and restart logic to prevent task rejection loops and subagent announce failures during this state.
  • Implement a mechanism to detect and handle compaction timeouts and overflow errors more gracefully, preventing them from cascading into prolonged drain and task rejection.
  • Consider adding clear messaging to the CLI/doctor output when a restart/reload is required for fallback chain changes.
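As one illustration of handling compaction timeouts more gracefully, compaction can be bounded by a short deadline with a cheap fallback when the deadline passes. This is a sketch under stated assumptions: `compact_or_truncate`, its parameters, and the truncation fallback are all hypothetical, and dropping old messages loses context, so it is only one of several possible degradation strategies.

```python
import asyncio


async def compact_or_truncate(messages, compact, *, timeout_s=60.0, keep_last=50):
    """Try compaction within a deadline; fall back to keeping the newest messages."""
    try:
        return await asyncio.wait_for(compact(messages), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Degrade instead of blocking the session: keep only the newest messages.
        return messages[-keep_last:]
```

Here `compact` is any coroutine function that takes the message list and returns a shorter one. A deadline far below the roughly 900 seconds seen in the incident would let the session recover quickly instead of holding the gateway in a long-running task.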

Example

No fix-level code example is provided: the issue spans several subsystems and the root cause needs further investigation.

Notes

The issue appears to be related to a regression in OpenClaw 2026.4.8, and resolving it may require changes to the gateway draining and restart logic, as well as the runtime configuration reload semantics.

Recommendation

Until a fix lands, restart the gateway after changing the on-disk configuration so that the running process picks up the new fallback chain; the report indicates the stale chain persisted until restart. Longer term, add a reload mechanism for the runtime configuration, either by having openclaw doctor --fix trigger a reload when necessary or by adding a dedicated command for it.
