openclaw - ✅(Solved) Fix Gateway 'agents failed' alerts need actionable failure context [1 pull request, 2 comments, 3 participants]

GitHub stats
openclaw/openclaw#80156 (fetched 2026-05-11 03:18:15)
Comments: 2
Participants: 3
Timeline events: 4
Reactions: 2
Timeline (top): commented ×2, closed ×1, cross-referenced ×1

Root Cause

The current alert makes benign/transient issues look like a gateway outage. It also forces operators to dig through journald to distinguish:

  • stale subagent/session queue message (no_active_run)
  • tool failure (ENOENT, validation error, protected config path)
  • actual agent runtime failure
  • actual gateway outage

This causes unnecessary restarts and weak incident response.

Fix Action

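Fix / Workaround

Until the alert itself carries this context, the actionable details are available only in the gateway logs (journald), for example: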

queue message failed: sessionId=<uuid> reason=no_active_run
[tools] read failed: ENOENT .../agents/forge/agent/AGENTS.md
[tools] gateway failed: config.patch raw must be an object
[tools] gateway failed: gateway config.patch cannot change protected config paths: ...

PR fix notes

PR #80160: fix: add actionable queue failure diagnostics

Description (problem / solution / changelog)

Summary

Fixes openclaw/openclaw#80156 by making failed embedded-run message queue attempts actionable in logs/diagnostics.

Instead of the low-context debug line:

queue message failed: sessionId=<id> reason=no_active_run

this now emits a warning with:

  • session id
  • reason (no_active_run, not_streaming, compacting)
  • active embedded run count
  • known session key when available
  • snapshot status/run id when available
  • a hint clarifying that stale/completed runs are usually not gateway outages
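
The fields listed above could be assembled into a single warning line roughly as follows. This is a minimal sketch, not the code from the PR: the type `QueueFailureContext`, the function `formatQueueFailure`, and the key names `activeRuns`, `sessionKey`, and `snapshotStatus` are all hypothetical and chosen only for illustration.

```typescript
// Hypothetical shapes for illustration; none of these names come from the PR.
type QueueFailureReason = "no_active_run" | "not_streaming" | "compacting";

interface QueueFailureContext {
  sessionId: string;
  reason: QueueFailureReason;
  activeRunCount: number;
  sessionKey?: string; // included only when known
  snapshotStatus?: string; // included only when a snapshot exists
  runId?: string; // included only when a snapshot exists
}

// Build one key=value warning line; optional fields are skipped when absent.
function formatQueueFailure(ctx: QueueFailureContext): string {
  const parts: string[] = [
    `sessionId=${ctx.sessionId}`,
    `reason=${ctx.reason}`,
    `activeRuns=${ctx.activeRunCount}`,
  ];
  if (ctx.sessionKey !== undefined) parts.push(`sessionKey=${ctx.sessionKey}`);
  if (ctx.snapshotStatus !== undefined) parts.push(`snapshotStatus=${ctx.snapshotStatus}`);
  if (ctx.runId !== undefined) parts.push(`runId=${ctx.runId}`);
  // The hint tells the reader that this is usually not a gateway outage.
  parts.push('hint="stale or completed run; usually not a gateway outage"');
  return `queue message failed: ${parts.join(" ")}`;
}
```

Keeping the line in key=value form means a notification layer can parse it without a schema, and it keeps the same `queue message failed:` prefix as the old debug line.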

Why

Operators can currently receive generic alerts like:

⚠️ 🔌 Gateway: agents failed

while the gateway itself is healthy. The useful cause is buried in journald. This patch makes the underlying queue failure diagnostic self-explanatory so notification layers and operators can distinguish stale-run follow-ups from real gateway/agent failure.

Tests

Added a regression test for queueEmbeddedPiMessage when queueing into a missing run, asserting the log includes session id, reason, active count, and non-outage hint.
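
The shape of such a regression check might look like the sketch below. It does not use the repository's real `queueEmbeddedPiMessage`, logger, or vitest setup. `queueMessageSketch`, `activeRuns`, and `warnings` are hypothetical stand-ins so the example stays self-contained, and the log wording is assumed.

```typescript
// Hypothetical stand-ins; the real function and logger live in the repository.
const warnings: string[] = [];
const activeRuns = new Map<string, { runId: string; queue: string[] }>();

// Queue text into an active run, or record a warning when no run exists.
function queueMessageSketch(sessionId: string, text: string): boolean {
  const run = activeRuns.get(sessionId);
  if (run === undefined) {
    warnings.push(
      `queue message failed: sessionId=${sessionId} reason=no_active_run ` +
        `activeRuns=${activeRuns.size} ` +
        'hint="stale or completed run; usually not a gateway outage"',
    );
    return false;
  }
  run.queue.push(text);
  return true;
}

// Regression-style check: queueing into a missing run must log full context.
function checkMissingRunLog(): void {
  warnings.length = 0;
  const queued = queueMessageSketch("missing-session", "follow-up");
  if (queued) throw new Error("expected queueing into a missing run to fail");
  const line = warnings[0] ?? "";
  const expectedParts = [
    "sessionId=missing-session",
    "reason=no_active_run",
    "activeRuns=0",
    "not a gateway outage",
  ];
  for (const part of expectedParts) {
    if (!line.includes(part)) throw new Error(`log is missing ${part}`);
  }
}
```

In the actual test file, the same assertions would presumably be written with vitest matchers against a mocked logger.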

Local validation

I could not run the full vitest target in the sparse checkout because dependencies were not installed and /tmp is space constrained. I did run:

git diff --check

and this PR should be validated by upstream CI.

Changed files

  • src/agents/pi-embedded-runner/runs.test.ts (modified, +16/-0)
  • src/agents/pi-embedded-runner/runs.ts (modified, +19/-3)


Problem

Gateway/operator notifications can emit a generic alert like:

⚠️ 🔌 Gateway: agents failed

without enough actionable context to diagnose the failure. In a recent incident, the gateway itself was healthy, but the operator only saw the generic alert. The useful details existed only in gateway logs:

queue message failed: sessionId=<uuid> reason=no_active_run
[tools] read failed: ENOENT .../agents/forge/agent/AGENTS.md
[tools] gateway failed: config.patch raw must be an object
[tools] gateway failed: gateway config.patch cannot change protected config paths: ...

The alert should include at least the failing subsystem, reason/error code, session/run id, and whether the gateway is still live.

Why this matters

The current alert makes benign/transient issues look like a gateway outage. It also forces operators to dig through journald to distinguish:

  • stale subagent/session queue message (no_active_run)
  • tool failure (ENOENT, validation error, protected config path)
  • actual agent runtime failure
  • actual gateway outage

This causes unnecessary restarts and weak incident response.
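
To illustrate the triage operators currently do by hand, the sketch below classifies the sample log lines quoted in this issue. It is an assumption-heavy example: `classifyLogLine` and the `FailureKind` labels are hypothetical, and the patterns are derived only from the four log lines shown here. Real agent runtime failures and gateway outages would need signals the issue does not show, so they fall through to `unclassified`.

```typescript
// Hypothetical labels; only the two patterns visible in the issue are recognized.
type FailureKind = "stale_queue_message" | "tool_failure" | "unclassified";

interface Classified {
  kind: FailureKind;
  detail: string;
}

function classifyLogLine(line: string): Classified {
  // e.g. "queue message failed: sessionId=<uuid> reason=no_active_run"
  const queue = /^queue message failed: sessionId=(\S+) reason=(\S+)/.exec(line);
  if (queue !== null) {
    const reason = queue[2];
    return {
      kind: reason === "no_active_run" ? "stale_queue_message" : "unclassified",
      detail: `reason=${reason}`,
    };
  }
  // e.g. "[tools] read failed: ENOENT .../agents/forge/agent/AGENTS.md"
  const tool = /^\[tools\] (\S+) failed: (.*)$/.exec(line);
  if (tool !== null) {
    return { kind: "tool_failure", detail: `tool=${tool[1]} error=${tool[2]}` };
  }
  // Anything else needs a human or further signals (health checks, exit codes).
  return { kind: "unclassified", detail: line };
}
```

A notification layer could use a classification like this to decide whether an alert is a warning or a real outage page.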

Proposed improvement

Make gateway/agent failure notifications structured and actionable. For example:

⚠️ 🔌 Gateway agent warning: queue_message_failed
sessionId=...
reason=no_active_run
channel=telegram
runId=...
gatewayHealth=live
recentCause="attempted to queue follow-up into completed run"

For tool failures:

⚠️ 🔌 Gateway tool failure: read ENOENT
sessionId=...
tool=read
path=/.../AGENTS.md
fatal=false
gatewayHealth=live
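
One way to model the two proposed notification shapes is a discriminated union with a small renderer, sketched below. The type names, the `renderAlert` function, and the `GatewayHealth` values other than `live` are assumptions made for illustration. The issue specifies only the rendered text, not a payload schema.

```typescript
// Hypothetical payload types; only the rendered field names come from the issue.
type GatewayHealth = "live" | "down" | "unknown";

interface QueueAlert {
  kind: "queue_message_failed";
  sessionId: string;
  reason: string;
  channel?: string;
  runId?: string;
  gatewayHealth: GatewayHealth;
  recentCause?: string;
}

interface ToolAlert {
  kind: "tool_failure";
  sessionId: string;
  tool: string;
  errorCode: string;
  path?: string;
  fatal: boolean;
  gatewayHealth: GatewayHealth;
}

type Alert = QueueAlert | ToolAlert;

// Render an alert as a headline followed by key=value lines.
function renderAlert(alert: Alert): string {
  const lines: string[] = [];
  const field = (key: string, value: string | boolean | undefined, quote = false): void => {
    if (value !== undefined) lines.push(quote ? `${key}="${value}"` : `${key}=${value}`);
  };
  if (alert.kind === "queue_message_failed") {
    lines.push("⚠️ 🔌 Gateway agent warning: queue_message_failed");
    field("sessionId", alert.sessionId);
    field("reason", alert.reason);
    field("channel", alert.channel);
    field("runId", alert.runId);
    field("gatewayHealth", alert.gatewayHealth);
    field("recentCause", alert.recentCause, true);
  } else {
    lines.push(`⚠️ 🔌 Gateway tool failure: ${alert.tool} ${alert.errorCode}`);
    field("sessionId", alert.sessionId);
    field("tool", alert.tool);
    field("path", alert.path);
    field("fatal", alert.fatal);
    field("gatewayHealth", alert.gatewayHealth);
  }
  return lines.join("\n");
}
```

Carrying `gatewayHealth` as its own field, separate from `kind`, keeps gateway liveness from being conflated with agent or tool failure.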

Acceptance criteria

  • The generic "Gateway: agents failed" alert is replaced or supplemented with reason/session/run details.
  • queueEmbeddedPiMessage(... reason=no_active_run|not_streaming|compacting) exposes enough context for notification/reporting layers.
  • Tool-level failures include tool name, error code/message, session/run id, and fatal/nonfatal classification when possible.
  • Gateway health/live state is not conflated with agent/tool failure.
  • Add regression tests for no-active-run notification payload/summary.
