openclaw - ✅(Solved) Fix GatewayDrainingError should auto-retry, not surface to user [1 pull request, 1 participant]

GitHub stats
openclaw/openclaw#55412 · Fetched 2026-04-08 01:39:47
Comments: 0 · Participants: 1 · Timeline: 2 (cross-referenced ×1, referenced ×1) · Reactions: 0

Error Message

When the gateway restarts (e.g., after config.patch), any in-flight agent run that triggers a new command during the drain window gets GatewayDrainingError. The error falls through to the generic error handler in agent-runner.runtime and surfaces to the user as:

⚠️ Agent failed before reply: Gateway is draining for restart; new tasks are not accepted.
Logs: openclaw logs --follow

The error is transient: the gateway comes back up seconds later, but the user sees a failure and assumes something is broken.

Fix Action

Fix / Workaround

PR #55470 adds a GatewayDrainingError check before the generic error handler in the agent-runner error handling chain. When the error is detected, the runner waits 15 seconds for the gateway restart to complete and then retries the run, using a didRetryGatewayDrainingError flag to prevent infinite retry loops.

Environment

  • OpenClaw 2026.3.24
  • macOS, local gateway, config.patch triggered restart
  • Happens every time a restart occurs while agents are active

PR fix notes

PR #55470: fix: auto-retry GatewayDrainingError instead of surfacing to user (#55412)

Description (problem / solution / changelog)

Summary

When the gateway restarts (e.g., after config.patch), in-flight agent runs that trigger new commands during the drain window get GatewayDrainingError. This falls through to the generic error handler and surfaces to users as:

⚠️ Agent failed before reply: Gateway is draining for restart; new tasks are not accepted.
Logs: openclaw logs --follow

This is a transient error — the gateway comes back up seconds later. But the user sees an error and thinks something is broken.

Fix

Adds a check for GatewayDrainingError before the generic error handler in the agent-runner error handling chain. When detected:

  1. Wait 15 seconds for the gateway restart to complete
  2. Retry the run (same pattern as existing transient HTTP error handling)

The check matches both the error message string (Gateway is draining) and the error class name (GatewayDrainingError), using a didRetryGatewayDrainingError flag to prevent infinite retry loops.
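The behaviour described above can be sketched roughly as follows. This is not the code from the PR: runWithDrainRetry, the injectable sleep parameter, and the constant name are hypothetical, and only the 15-second delay, the message/name match, and the didRetryGatewayDrainingError flag come from the PR description.

```typescript
// Hypothetical names; only the 15-second delay, the message/name match, and the
// didRetryGatewayDrainingError flag are taken from the PR description.
const GATEWAY_DRAIN_RETRY_DELAY_MS = 15_000;

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function runWithDrainRetry<T>(
  run: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  let didRetryGatewayDrainingError = false;
  while (true) {
    try {
      return await run();
    } catch (err) {
      const name = err instanceof Error ? err.name : "";
      const message = err instanceof Error ? err.message : String(err);
      const isDraining =
        name === "GatewayDrainingError" || message.includes("Gateway is draining");
      if (isDraining && !didRetryGatewayDrainingError) {
        didRetryGatewayDrainingError = true; // permit a single retry
        await sleep(GATEWAY_DRAIN_RETRY_DELAY_MS);
        continue;
      }
      throw err; // other errors, or a second drain error, reach the caller
    }
  }
}
```

Passing sleep as a parameter is a design choice made for this sketch so the retry path can be exercised in a test without actually waiting 15 seconds.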

Why 15 seconds?

Gateway restarts typically complete within 5-10 seconds, but the delay accounts for slower systems. The existing transient HTTP retry uses 2.5 seconds since those are usually immediate provider hiccups. Gateway restarts need more time for the process to fully stop and restart.

Fixes #55412

Changed files

  • src/auto-reply/reply/agent-runner-execution.ts (modified, +19/-0)

Code Example

⚠️ Agent failed before reply: Gateway is draining for restart; new tasks are not accepted.
Logs: openclaw logs --follow

---

if (message.includes('Gateway is draining') || error?.name === 'GatewayDrainingError') {
  // Wait for restart to complete (poll gateway health or fixed delay)
  await new Promise(r => setTimeout(r, 15000));
  continue; // retry the run
}
Raw issue text

Problem

When the gateway restarts (e.g., after config.patch), any in-flight agent run that triggers a new command during the drain window gets GatewayDrainingError. This falls through to the generic error handler in agent-runner.runtime and surfaces to the user as:

⚠️ Agent failed before reply: Gateway is draining for restart; new tasks are not accepted.
Logs: openclaw logs --follow

This is a transient error — the gateway comes back up seconds later. But the user sees an error and thinks something is broken.

Expected behavior

GatewayDrainingError should be treated like isTransientHttp errors — auto-retry after a short delay (e.g., wait for the restart to complete, then retry). The error should never surface to the user since it always resolves on its own.

Current behavior

In agent-runner.runtime, the error handling chain checks for billing, context overflow, role ordering, session corruption, and transient HTTP — but GatewayDrainingError is not checked and falls to the generic Agent failed before reply message.

Suggested fix

Add a check before the generic error handler:

if (message.includes('Gateway is draining') || error?.name === 'GatewayDrainingError') {
  // Wait for restart to complete (poll gateway health or fixed delay)
  await new Promise(r => setTimeout(r, 15000));
  continue; // retry the run
}

Environment

  • OpenClaw 2026.3.24
  • macOS, local gateway, config.patch triggered restart
  • Happens every time a restart occurs while agents are active

extent analysis

Fix Plan

To resolve the GatewayDrainingError issue, we need to modify the error handling chain in agent-runner.runtime to auto-retry after a short delay when this error occurs. Here are the steps:

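  • Modify the error handling chain to check for GatewayDrainingError, guarded by the didRetryGatewayDrainingError flag (initialized to false before the retry loop) to prevent infinite retry loops:
if (
  !didRetryGatewayDrainingError &&
  (message.includes('Gateway is draining') || error?.name === 'GatewayDrainingError')
) {
  didRetryGatewayDrainingError = true; // do not retry a second drain error
  await new Promise(r => setTimeout(r, 15000)); // fixed 15-second delay for the restart to complete
  continue; // retry the run
}
  • Alternatively, poll the gateway health instead of using a fixed delay. getGatewayHealth is a placeholder that must be implemented, and the 30-second cap is an assumed value that keeps the loop from polling forever:
if (
  !didRetryGatewayDrainingError &&
  (message.includes('Gateway is draining') || error?.name === 'GatewayDrainingError')
) {
  didRetryGatewayDrainingError = true; // do not retry a second drain error
  const deadline = Date.now() + 30000; // assumed upper bound on the wait
  while (Date.now() < deadline) {
    const gatewayHealth = await getGatewayHealth(); // placeholder, not an existing function
    if (gatewayHealth === 'healthy') {
      break;
    }
    await new Promise(r => setTimeout(r, 1000)); // 1-second poll interval
  }
  continue; // retry the run
}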

Verification

To verify that the fix worked, restart the gateway while an agent run is in progress and check that the GatewayDrainingError does not surface to the user. The agent run should auto-retry after the gateway restart is complete.

Extra Tips

  • Make sure to implement the getGatewayHealth function to poll the gateway health.
  • Adjust the delay or poll interval as needed to ensure that the agent run retries after the gateway restart is complete.
  • Consider adding logging to track the number of retries and the time it takes for the gateway to become healthy again.
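One way to combine these tips is a small helper that polls until the gateway reports healthy, gives up after a timeout, and logs how many checks it made and how long it waited. This is only a sketch under assumptions: waitForGatewayHealthy, the options object, the result shape, and the 30-second default timeout are all hypothetical, and the caller still has to supply a real getGatewayHealth implementation that resolves to 'healthy' once the restart is done.

```typescript
// Hypothetical helper; OpenClaw may expose gateway health differently.
interface GatewayWaitResult {
  healthy: boolean;  // false means the timeout elapsed first
  attempts: number;  // number of health checks performed
  elapsedMs: number; // time spent waiting
}

async function waitForGatewayHealthy(
  getGatewayHealth: () => Promise<string>,
  options: { timeoutMs?: number; intervalMs?: number; log?: (line: string) => void } = {},
): Promise<GatewayWaitResult> {
  const { timeoutMs = 30_000, intervalMs = 1_000, log = console.log } = options;
  const startedAt = Date.now();
  let attempts = 0;
  while (true) {
    attempts += 1;
    let healthy = false;
    try {
      healthy = (await getGatewayHealth()) === "healthy";
    } catch {
      // A restarting gateway may refuse connections; count that as not yet healthy.
    }
    const elapsedMs = Date.now() - startedAt;
    // Stop when healthy, or when another poll interval would exceed the timeout.
    if (healthy || elapsedMs + intervalMs > timeoutMs) {
      log(`gateway wait: healthy=${healthy} attempts=${attempts} elapsedMs=${elapsedMs}`);
      return { healthy, attempts, elapsedMs };
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

A caller could retry the run when healthy is true and fall back to the normal error path when it is false, so a gateway that never comes back still produces a visible error.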
