openclaw - ✅ (Solved) Fix Gateway 'agents failed' alerts need actionable failure context [1 pull request, 2 comments, 3 participants]

openclaw · 2026-05-10 07:27:58

GitHub stats

openclaw/openclaw#80156 • Fetched 2026-05-11 03:18:15

Timeline (top)

commented ×2, closed ×1, cross-referenced ×1

Error Message

The alert should include at least the failing subsystem, reason/error code, session/run id, and whether the gateway is still live.

tool failure (ENOENT, validation error, protected config path)
Tool-level failures include tool name, error code/message, session/run id, and fatal/nonfatal classification when possible.

Root Cause

The current alert makes benign/transient issues look like a gateway outage. It also forces operators to dig through journald to distinguish:

stale subagent/session queue message (no_active_run)
tool failure (ENOENT, validation error, protected config path)
actual agent runtime failure
actual gateway outage

This causes unnecessary restarts and weak incident response.

Fix Action

Fix / Workaround

queue message failed: sessionId=<uuid> reason=no_active_run
[tools] read failed: ENOENT .../agents/forge/agent/AGENTS.md
[tools] gateway failed: config.patch raw must be an object
[tools] gateway failed: gateway config.patch cannot change protected config paths: ...

PR fix notes

PR #80160: fix: add actionable queue failure diagnostics

Repository: openclaw/openclaw
Author: markus-lassfolk
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/80160

Description (problem / solution / changelog)

Summary

Fixes openclaw/openclaw#80156 by making failed embedded-run message queue attempts actionable in logs/diagnostics.

Instead of the low-context debug line:

queue message failed: sessionId=<id> reason=no_active_run

this now emits a warning with:

session id
reason (no_active_run, not_streaming, compacting)
active embedded run count
known session key when available
snapshot status/run id when available
a hint clarifying that stale/completed runs are usually not gateway outages
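
The fields listed above could be assembled into a single warning line roughly as follows. This is a minimal sketch, not the code from PR #80160: the `QueueFailureContext` type, the `formatQueueFailure` function, the key names, and the hint wording are all assumptions made for illustration. Only the three reason values come from the PR description.

```typescript
// Hypothetical shapes for illustration; the real types in runs.ts may differ.
type QueueFailureReason = "no_active_run" | "not_streaming" | "compacting";

interface QueueFailureContext {
  sessionId: string;
  reason: QueueFailureReason;
  activeRunCount: number;
  sessionKey?: string; // known session key, when available
  snapshotStatus?: string; // snapshot status, when available
  runId?: string; // snapshot run id, when available
}

// Build one key=value warning line, skipping fields that are not known.
function formatQueueFailure(ctx: QueueFailureContext): string {
  const parts: string[] = [
    `sessionId=${ctx.sessionId}`,
    `reason=${ctx.reason}`,
    `activeRuns=${ctx.activeRunCount}`,
  ];
  if (ctx.sessionKey !== undefined) parts.push(`sessionKey=${ctx.sessionKey}`);
  if (ctx.snapshotStatus !== undefined) parts.push(`status=${ctx.snapshotStatus}`);
  if (ctx.runId !== undefined) parts.push(`runId=${ctx.runId}`);
  // Assumed hint wording: a stale or completed run is usually not an outage.
  if (ctx.reason === "no_active_run") {
    parts.push(`hint="run is stale or completed; usually not a gateway outage"`);
  }
  return `queue message failed: ${parts.join(" ")}`;
}
```

Keeping the line in `key=value` form means a notification layer can pick out individual fields without parsing free text. Attaching the hint only to `no_active_run` is a choice made for this sketch.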

Why

Operators can currently receive generic alerts like:

⚠️ 🔌 Gateway: agents failed

while the gateway itself is healthy. The useful cause is buried in journald. This patch makes the underlying queue failure diagnostic self-explanatory so notification layers and operators can distinguish stale-run follow-ups from real gateway/agent failure.

Tests

Added a regression test for queueEmbeddedPiMessage when queueing into a missing run, asserting the log includes session id, reason, active count, and non-outage hint.
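
A regression check of this kind might be structured as below. It is a sketch under stated assumptions: `queueIntoRun` and `CapturingLogger` are hypothetical stand-ins, not the real `queueEmbeddedPiMessage` or the project's logger, and plain comparisons are used in place of vitest matchers so the block stays self-contained.

```typescript
// Stand-in logger that records warnings so a test can inspect them.
class CapturingLogger {
  readonly warnings: string[] = [];
  warn(message: string): void {
    this.warnings.push(message);
  }
}

// Hypothetical stand-in for the queueing call: with no active run for the
// session, it logs a diagnostic warning and reports that nothing was queued.
function queueIntoRun(
  activeRuns: Map<string, string[]>,
  sessionId: string,
  text: string,
  log: CapturingLogger,
): boolean {
  const queue = activeRuns.get(sessionId);
  if (queue === undefined) {
    log.warn(
      `queue message failed: sessionId=${sessionId} reason=no_active_run ` +
        `activeRuns=${activeRuns.size} hint="usually not a gateway outage"`,
    );
    return false;
  }
  queue.push(text);
  return true;
}

// The regression check: queue into a missing run, then inspect the warning.
function checkMissingRunDiagnostic(): string[] {
  const log = new CapturingLogger();
  const queued = queueIntoRun(new Map(), "missing-session", "follow-up", log);
  const [warning = ""] = log.warnings;
  const problems: string[] = [];
  if (queued) problems.push("message should not be queued");
  const expected = [
    "sessionId=missing-session",
    "reason=no_active_run",
    "activeRuns=0",
    "not a gateway outage",
  ];
  for (const needle of expected) {
    if (!warning.includes(needle)) problems.push(`missing ${needle}`);
  }
  return problems; // empty when the diagnostic carries every expected field
}
```

Passing the logger in as an argument is what makes the assertion possible without touching global logging state.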

Local validation

I could not run the full vitest target in the sparse checkout because dependencies were not installed and /tmp is space constrained. I did run:

git diff --check

and this PR should be validated by upstream CI.

Changed files

src/agents/pi-embedded-runner/runs.test.ts (modified, +16/-0)
src/agents/pi-embedded-runner/runs.ts (modified, +19/-3)

Code Example

⚠️ 🔌 Gateway: agents failed

---

queue message failed: sessionId=<uuid> reason=no_active_run
[tools] read failed: ENOENT .../agents/forge/agent/AGENTS.md
[tools] gateway failed: config.patch raw must be an object
[tools] gateway failed: gateway config.patch cannot change protected config paths: ...

---

⚠️ 🔌 Gateway agent warning: queue_message_failed
sessionId=...
reason=no_active_run
channel=telegram
runId=...
gatewayHealth=live
recentCause="attempted to queue follow-up into completed run"

---

⚠️ 🔌 Gateway tool failure: read ENOENT
sessionId=...
tool=read
path=/.../AGENTS.md
fatal=false
gatewayHealth=live

RAW_BUFFER

Problem

Gateway/operator notifications can emit a generic alert like:

⚠️ 🔌 Gateway: agents failed

without enough actionable context to diagnose the failure. In a recent incident, the gateway itself was healthy, but the operator only saw the generic alert. The useful details existed only in gateway logs:

queue message failed: sessionId=<uuid> reason=no_active_run
[tools] read failed: ENOENT .../agents/forge/agent/AGENTS.md
[tools] gateway failed: config.patch raw must be an object
[tools] gateway failed: gateway config.patch cannot change protected config paths: ...

The alert should include at least the failing subsystem, reason/error code, session/run id, and whether the gateway is still live.

Why this matters

The current alert makes benign/transient issues look like a gateway outage. It also forces operators to dig through journald to distinguish:

stale subagent/session queue message (no_active_run)
tool failure (ENOENT, validation error, protected config path)
actual agent runtime failure
actual gateway outage

This causes unnecessary restarts and weak incident response.
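
The triage that operators currently do by hand in journald can be sketched as a small classifier over the log lines quoted in this issue. This is only an illustration: `FailureKind` and `classifyLogLine` are hypothetical names, and the patterns are inferred from the four sample lines, not from a documented log format.

```typescript
type FailureKind = "stale_queue_message" | "tool_failure" | "unclassified";

// Classify one gateway log line using the prefixes quoted in this issue.
// Lines that match neither pattern stay "unclassified": telling an agent
// runtime failure from a gateway outage needs a separate health signal,
// which a single log line does not carry.
function classifyLogLine(line: string): FailureKind {
  if (/^queue message failed: .*\breason=no_active_run\b/.test(line)) {
    return "stale_queue_message";
  }
  if (/^\[tools\] \w+ failed: /.test(line)) {
    return "tool_failure";
  }
  return "unclassified";
}
```

For example, `classifyLogLine("[tools] read failed: ENOENT .../agents/forge/agent/AGENTS.md")` falls into the `tool_failure` bucket, which on its own says nothing about gateway liveness.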

Proposed improvement

Make gateway/agent failure notifications structured and actionable. For example:

⚠️ 🔌 Gateway agent warning: queue_message_failed
sessionId=...
reason=no_active_run
channel=telegram
runId=...
gatewayHealth=live
recentCause="attempted to queue follow-up into completed run"

For tool failures:

⚠️ 🔌 Gateway tool failure: read ENOENT
sessionId=...
tool=read
path=/.../AGENTS.md
fatal=false
gatewayHealth=live
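
One way to produce both alert shapes from a typed payload is sketched below. The type and function names (`GatewayAlert`, `renderAlert`) are hypothetical, the field names mirror the two examples above, and the `down` and `unknown` health values are assumptions; only `live` appears in the issue.

```typescript
// Hypothetical payload shapes; field names follow the examples in the issue.
type GatewayHealth = "live" | "down" | "unknown";

interface QueueWarning {
  kind: "queue_message_failed";
  sessionId: string;
  reason: string;
  channel?: string;
  runId?: string;
  gatewayHealth: GatewayHealth;
  recentCause?: string;
}

interface ToolFailure {
  kind: "tool_failure";
  sessionId: string;
  tool: string;
  errorCode: string;
  path?: string;
  fatal: boolean;
  gatewayHealth: GatewayHealth;
}

type GatewayAlert = QueueWarning | ToolFailure;

// Render a multi-line alert; gateway health is always its own field so it
// is never inferred from the agent or tool failure itself.
function renderAlert(alert: GatewayAlert): string {
  const lines: string[] = [];
  if (alert.kind === "queue_message_failed") {
    lines.push(`⚠️ 🔌 Gateway agent warning: ${alert.kind}`);
    lines.push(`sessionId=${alert.sessionId}`, `reason=${alert.reason}`);
    if (alert.channel !== undefined) lines.push(`channel=${alert.channel}`);
    if (alert.runId !== undefined) lines.push(`runId=${alert.runId}`);
    lines.push(`gatewayHealth=${alert.gatewayHealth}`);
    if (alert.recentCause !== undefined) {
      lines.push(`recentCause=${JSON.stringify(alert.recentCause)}`);
    }
  } else {
    lines.push(`⚠️ 🔌 Gateway tool failure: ${alert.tool} ${alert.errorCode}`);
    lines.push(`sessionId=${alert.sessionId}`, `tool=${alert.tool}`);
    if (alert.path !== undefined) lines.push(`path=${alert.path}`);
    lines.push(`fatal=${alert.fatal}`, `gatewayHealth=${alert.gatewayHealth}`);
  }
  return lines.join("\n");
}
```

Modelling the two alerts as a discriminated union on `kind` lets the compiler check that tool-only fields such as `fatal` are never read from a queue warning.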

Acceptance criteria

The generic "Gateway: agents failed" alert is replaced or supplemented with reason/session/run details.
queueEmbeddedPiMessage(... reason=no_active_run|not_streaming|compacting) exposes enough context for notification/reporting layers.
Tool-level failures include tool name, error code/message, session/run id, and fatal/nonfatal classification when possible.
Gateway health/live state is not conflated with agent/tool failure.
Add regression tests for no-active-run notification payload/summary.
