openclaw - 💡(How to fix) Fix [Feature] Error-recovery plugin hook (on_run_error / on_failed_reply) for silent run death [1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openclaw/openclaw#63140Fetched 2026-04-09 07:57:56
View on GitHub
Comments
0
Participants
1
Timeline
0
Reactions
0
Author
Participants

Error Message

// Plugin hook — fires when an embedded run errors before producing a reply on_run_error(ctx: { sessionKey: string, // 'agent:main:discord:channel:1234' channelId?: string, // extracted from sessionKey when possible runId: string, error: Error, errorCategory: 'auth' | 'timeout' | 'overload' | 'unknown', attempts: number, // how many failover attempts were made elapsedMs: number, }): Promise<{ reply?: string } | void>

Fix Action

Fix / Workaround

Checked docs.openclaw.ai/automation/hooks and plugins/architecture on 2026.4.8: the 28 provider runtime hooks are all about model/transport/auth lifecycle (refreshOAuth, classifyFailoverReason, buildMissingAuthMessage, etc.) — none fire on a failed reply path. before_agent_reply (#20067) is success-path only. reply_dispatch is interception, not error recovery.

As a workaround we built a log-tailing bash watchdog that greps for Embedded agent failed before reply, extracts the channel ID from the preceding lane=session:... line, and posts a recovery message via openclaw message send --channel discord --target <id>. Works, but it's fragile (depends on log format) and should not be user-space code.

Code Example

// Plugin hook — fires when an embedded run errors before producing a reply
on_run_error(ctx: {
  sessionKey: string,        // 'agent:main:discord:channel:1234'
  channelId?: string,        // extracted from sessionKey when possible
  runId: string,
  error: Error,
  errorCategory: 'auth' | 'timeout' | 'overload' | 'unknown',
  attempts: number,          // how many failover attempts were made
  elapsedMs: number,
}): Promise<{ reply?: string } | void>
RAW_BUFFERClick to expand / collapse

Problem

When an embedded agent run fails before producing a reply (logged as Embedded agent failed before reply: <reason>), the error is surfaced only to the gateway log. On Discord specifically, users see the typing indicator disappear and assume the agent is still working — no recovery message is ever posted to the channel.

Checked docs.openclaw.ai/automation/hooks and plugins/architecture on 2026.4.8: the 28 provider runtime hooks are all about model/transport/auth lifecycle (refreshOAuth, classifyFailoverReason, buildMissingAuthMessage, etc.) — none fire on a failed reply path. before_agent_reply (#20067) is success-path only. reply_dispatch is interception, not error recovery.

Observed in production

Today on a 2026.4.8 Discord deployment:

  • 4 silent deaths in 5 hours (users had to reprompt)
  • 1 confirmed case where [diagnostic] lane task error: lane=session:agent:main:discord:channel:<id> was logged but the channel received no message — user waited 25+ minutes

As a workaround we built a log-tailing bash watchdog that greps for Embedded agent failed before reply, extracts the channel ID from the preceding lane=session:... line, and posts a recovery message via openclaw message send --channel discord --target <id>. Works, but it's fragile (depends on log format) and should not be user-space code.

Proposed hook shape

// Plugin hook — fires when an embedded run errors before producing a reply
on_run_error(ctx: {
  sessionKey: string,        // 'agent:main:discord:channel:1234'
  channelId?: string,        // extracted from sessionKey when possible
  runId: string,
  error: Error,
  errorCategory: 'auth' | 'timeout' | 'overload' | 'unknown',
  attempts: number,          // how many failover attempts were made
  elapsedMs: number,
}): Promise<{ reply?: string } | void>

Return a { reply } object to post a recovery message to the originating channel; return void to leave the silence in place (current behavior).

Related

  • #54964 (session zombie state after init failure — similar silent-death class)
  • #43661 (session hang during compaction timeout)
  • before_agent_reply hook (#20067) — success-path counterpart

extent analysis

TL;DR

Implement the proposed on_run_error plugin hook to handle embedded agent run failures and provide a recovery message to the user.

Guidance

  • Review the proposed on_run_error hook shape and implement it in the plugin to catch embedded agent run errors before reply.
  • Use the provided ctx object to extract relevant information, such as channelId and errorCategory, to determine the appropriate recovery message.
  • Return a { reply } object from the on_run_error hook to post a recovery message to the originating channel.
  • Consider logging the error and recovery message for auditing and debugging purposes.

Example

// Example implementation of the on_run_error hook
on_run_error(ctx: {
  sessionKey: string,
  channelId?: string,
  runId: string,
  error: Error,
  errorCategory: 'auth' | 'timeout' | 'overload' | 'unknown',
  attempts: number,
  elapsedMs: number,
}): Promise<{ reply?: string } | void> {
  const recoveryMessage = `Error occurred during agent run: ${ctx.errorCategory}`;
  return { reply: recoveryMessage };
}

Notes

The proposed on_run_error hook provides a way to handle embedded agent run failures and provide a recovery message to the user. However, the implementation details may vary depending on the specific requirements and constraints of the system.

Recommendation

Apply the proposed on_run_error hook workaround to handle embedded agent run failures and provide a recovery message to the user, as it provides a more robust and maintainable solution compared to the current log-tailing bash watchdog.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING