openclaw - 💡(How to fix) Fix 2026.4.8: Large-session overflow/compaction timeout can cascade into GatewayDrainingError + subagent announce loss; fallback chain stale until restart [1 participants]

EthanSK · 2026-04-08T17:30:18Z

After upgrading to OpenClaw 2026.4.8, very large Telegram sessions repeatedly hit context overflow and compaction timeout. In this incident, that appeared to cascade into prolonged gateway draining/task rejection, repeated subagent announce failures, and surprising fallback behavior.

This does not look like just “Anthropic fallback is broken.” The stronger bug shape is a failure-chain:

1. huge-session overflow + compaction timeout
2. gateway enters draining/restart state and rejects new tasks
3. subagent completion announce retries fail/give up
4. fallback decisions during drain can still route into stale runtime fallback candidates

The user reports this behavior did not happen before this version.

Environment

OpenClaw: 2026.4.8
OS: macOS arm64
Channel: Telegram (multiple accounts)
Large long-lived sessions (hundreds to >1300 messages)

Evidence (local)

1) Massive overflow + compaction timeout

From artifacts/openclaw-2026-04-08-incident-extract.txt:

estimatedPromptTokens=1014988 / overflowTokens=759372
messages=1331 / messages=1351+ on affected Telegram sessions
multiple compaction failures at ~900s:
- outcome=failed reason=timeout durationMs=900342
- outcome=failed reason=timeout durationMs=900533
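
For readers who want to sanity-check the numbers above, here is a minimal sketch that parses the quoted log fragments and derives two figures. It assumes `overflowTokens` is simply the estimated prompt size minus the prompt budget; that relationship is an inference from the field names, not something the logs or OpenClaw documentation confirm. `parse_fields` is a hypothetical helper written for this example.

```python
import re

# Log fragments copied from the evidence above.
OVERFLOW_LINE = "estimatedPromptTokens=1014988 / overflowTokens=759372"
TIMEOUT_LINES = [
    "outcome=failed reason=timeout durationMs=900342",
    "outcome=failed reason=timeout durationMs=900533",
]


def parse_fields(line: str) -> dict[str, str]:
    """Split a log fragment into key=value pairs (values contain no spaces)."""
    return dict(re.findall(r"(\w+)=(\S+)", line))


fields = parse_fields(OVERFLOW_LINE)
estimated = int(fields["estimatedPromptTokens"])
overflow = int(fields["overflowTokens"])

# Assumption: overflowTokens = estimatedPromptTokens - prompt budget.
implied_budget = estimated - overflow
overflow_ratio = estimated / implied_budget

timeouts_s = [int(parse_fields(line)["durationMs"]) / 1000 for line in TIMEOUT_LINES]

print(f"implied prompt budget: {implied_budget} tokens")
print(f"prompt is {overflow_ratio:.1f}x the implied budget")
print(f"compaction timeouts (s): {timeouts_s}")
```

Under that assumption the prompt is roughly four times the implied budget of 255616 tokens, and both compaction attempts ran for just over 900 seconds (15 minutes), which looks like a fixed timeout rather than a variable failure.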

2) During failure chain, gateway rejects tasks as draining

From incident extract and ~/.openclaw/logs/gateway.err.log:

GatewayDrainingError: Gateway is draining for restart; new tasks are not accepted
repeated drain timeout reached; proceeding with restart

3) Subagent completion announce failures/retries during this state

Subagent completion direct announce failed ... GatewayDrainingError
Subagent announce completion ... transient failure, retrying
Subagent announce give up (retry-limit)

4) Anthropic fallback still appeared in live fallback decisions during drain

Even after removing Anthropic fallback from config on disk, log lines during draining still showed:

next=anthropic/claude-haiku-4-5 detail=Gateway is draining for restart; new tasks are not accepted
Auth failure attempts against the same candidate also appeared:
- candidate=anthropic/claude-haiku-4-5 reason=auth ... HTTP 401 authentication_error: invalid x-api-key

5) On-disk config had Anthropic removed, but runtime lagged until restart

Current ~/.openclaw/openclaw.json fallback list is only:
- openai-codex/gpt-5.4 -> openai-codex/gpt-5.3-codex
Local commit removing Anthropic fallback:
- 19172db Remove Anthropic model fallback config
openclaw.json metadata shows it was touched at 2026-04-08T17:10:23.793Z
But runtime logs still had next=anthropic/claude-haiku-4-5 at 2026-04-08T17:10:27.547+01:00

This suggests live runtime config/fallback chain can remain stale until gateway restart/reload.
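
One detail worth verifying before relying on the two timestamps above: they carry different UTC offsets (`Z` versus `+01:00`). The sketch below normalizes both to UTC using only the values quoted in this report. `parse_iso` is a hypothetical helper for this example.

```python
from datetime import datetime, timezone


def parse_iso(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp; map a trailing 'Z' to +00:00 for older Pythons."""
    if ts.endswith("Z"):
        ts = ts[:-1] + "+00:00"
    return datetime.fromisoformat(ts)


config_touched = parse_iso("2026-04-08T17:10:23.793Z")
log_line = parse_iso("2026-04-08T17:10:27.547+01:00")

# Positive means the log line came after the config was touched.
delta = (log_line - config_touched).total_seconds()

print("config touched (UTC):", config_touched.astimezone(timezone.utc).isoformat())
print("log line (UTC):      ", log_line.astimezone(timezone.utc).isoformat())
print(f"log minus config: {delta:+.3f} s")
```

Taken literally, the log line is about an hour earlier than the config touch (-3596.246 s). If one of the offsets is mislabelled and both values are really in the same zone, the log line is 3.754 s after the touch, which matches the stale-runtime reading. The report does not say which is the case, so confirming the time zone of each source would strengthen this piece of evidence.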

Doctor note

openclaw doctor --fix was run locally, but this alone did not reload the running gateway process.

Expected behavior

Very large-session overflow/compaction failure should degrade gracefully without cascading into prolonged drain/task rejection loops.
Subagent completion announces should not be lost or abandoned during gateway drain windows.
Runtime fallback chain should not continue using removed fallback providers after config changes are applied to disk.
If restart/reload is required for fallback-chain changes, surface this clearly in CLI/doctor output.
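
To illustrate the second expectation, here is a minimal sketch of an announce retry that is bounded by wall-clock time instead of a fixed attempt count, so a drain window of uncertain length does not exhaust the retry budget. Everything in it is hypothetical: the `GatewayDrainingError` class is a local stand-in named after the error in the logs, and `announce_with_drain_wait`, its parameters, and its defaults are assumptions for this example, not OpenClaw's actual API or retry policy.

```python
import time
from typing import Callable


class GatewayDrainingError(Exception):
    """Local stand-in named after the error in the logs; not OpenClaw's class."""


def announce_with_drain_wait(
    send: Callable[[], None],
    *,
    drain_deadline_s: float = 120.0,
    base_delay_s: float = 0.5,
    max_delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Retry `send` while the gateway reports draining.

    The loop is limited by elapsed time, not by attempt count.
    Returns the number of attempts it took to succeed.
    """
    start = clock()
    attempt = 0
    while True:
        attempt += 1
        try:
            send()
            return attempt
        except GatewayDrainingError:
            if clock() - start >= drain_deadline_s:
                raise  # caller should persist the announce for later replay
            # Exponential backoff, capped at max_delay_s.
            sleep(min(max_delay_s, base_delay_s * 2 ** (attempt - 1)))


# Usage: a fake sender that is rejected three times while "draining".
calls = {"n": 0}


def flaky_send() -> None:
    calls["n"] += 1
    if calls["n"] <= 3:
        raise GatewayDrainingError("Gateway is draining for restart")


attempts = announce_with_drain_wait(flaky_send, sleep=lambda _seconds: None)
print(attempts)  # 4
```

The design choice here is that only the draining error is retried; any other exception propagates immediately. When the deadline passes, the error is re-raised so the caller can store the announce and replay it after restart instead of dropping it.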

Actual behavior

Overflow + compaction timeout chain coincided with gateway draining errors and task rejection.
Subagent announce retries frequently failed/gave up.
Fallback routing still referenced Anthropic during draining, causing 401 auth errors, despite Anthropic fallback being removed on disk.

Potentially related (but not exact duplicate)

#44031 (compaction timeout hangs)
#40295 (compaction deadlock/session recovery)
#55412 (GatewayDrainingError retry behavior)
#54276 (subagent announce give-up)
#62095 (doctor --fix expectations)

Request

Please investigate this as a possible 2026.4.8 regression/failure-chain interaction:

large-session overflow/compaction timeout
gateway drain/restart task rejection behavior
subagent announce resilience during drain
runtime config/fallback-chain reload semantics (especially after removing providers)

If useful, I can provide the extracted incident artifacts/log snippets listed above.

extent analysis

TL;DR

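The most actionable part of the report is the stale fallback chain: the running gateway appears to keep using the fallback list it loaded at startup, so a mechanism that reloads runtime configuration after the on-disk config changes would remove the dependency on a gateway restart. The overflow, drain, and announce failures need separate investigation.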

Guidance

Investigate the openclaw doctor --fix command to determine why it does not reload the running gateway process, and consider modifying it to trigger a reload when necessary.
Review the gateway draining and restart logic to prevent task rejection loops and subagent announce failures during this state.
Implement a mechanism to detect and handle compaction timeouts and overflow errors more gracefully, preventing them from cascading into prolonged drain and task rejection.
Consider adding clear messaging to the CLI/doctor output when a restart/reload is required for fallback chain changes.
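
The reload idea in the guidance above can be sketched as a fallback chain that re-reads its config file whenever the file's modification time changes. This is an illustration only: the `FallbackChain` class, the `write_config` helper, and the `{"fallbacks": [...]}` schema are all hypothetical and do not reflect the real layout of `openclaw.json` or how the gateway loads it. The model names are taken from the report; the "before" list is an assumed example.

```python
import json
import os
import tempfile


class FallbackChain:
    """Caches a fallback list and re-reads it when the file's mtime changes."""

    def __init__(self, path: str) -> None:
        self.path = path
        self._mtime_ns = -1
        self._chain: list[str] = []

    def current(self) -> list[str]:
        mtime_ns = os.stat(self.path).st_mtime_ns
        if mtime_ns != self._mtime_ns:
            with open(self.path, encoding="utf-8") as fh:
                # Hypothetical schema: {"fallbacks": ["provider/model", ...]}
                self._chain = list(json.load(fh)["fallbacks"])
            self._mtime_ns = mtime_ns
        return list(self._chain)


def write_config(path: str, fallbacks: list[str], mtime_ns: int) -> None:
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"fallbacks": fallbacks}, fh)
    # Pin the mtime so the demo does not depend on filesystem timer granularity.
    os.utime(path, ns=(mtime_ns, mtime_ns))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    write_config(
        path,
        ["openai-codex/gpt-5.3-codex", "anthropic/claude-haiku-4-5"],
        1_700_000_000 * 10**9,
    )
    chain = FallbackChain(path)
    before = chain.current()

    # Simulate the on-disk edit that removes the Anthropic fallback.
    write_config(path, ["openai-codex/gpt-5.3-codex"], 1_700_000_010 * 10**9)
    after = chain.current()

print(before)
print(after)
```

Checking the mtime on each fallback decision costs one `stat` call and avoids a file watcher. A real implementation would also need to handle a partially written or invalid file, for example by keeping the previous chain when parsing fails.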

Example

No confirmed fix or upstream patch is available yet; the failure chain involves several subsystems and needs further investigation before a definitive code change can be proposed.

Notes

The issue appears to be related to a regression in OpenClaw 2026.4.8, and resolving it may require changes to the gateway draining and restart logic, as well as the runtime configuration reload semantics.

Recommendation

Until runtime reload semantics are clarified, restart or reload the gateway after changing the fallback chain on disk, since the report suggests the running process keeps the old chain until then and that openclaw doctor --fix alone does not reload it. Longer term, either openclaw doctor --fix or a dedicated command could trigger the reload, or at least state clearly that one is required.
