openclaw - ✅(Solved) Fix [Feature]: structured abort logging for embedded run aborts (OPENCLAW_LOG_ABORT_SOURCES env var) [1 pull requests, 1 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openclaw/openclaw#74165Fetched 2026-04-30 06:27:49
View on GitHub
Comments
1
Participants
2
Timeline
3
Reactions
0
Author
Timeline (top)
cross-referenced ×2commented ×1

Add an opt-in structured diagnostic for embedded run aborts, gated by OPENCLAW_LOG_ABORT_SOURCES=1. When enabled, every abortRun() entry in src/agents/pi-embedded-runner/run/attempt.ts emits one [abort-source] line classifying the trigger and capturing five caller stack frames.

Error Message

There is no log line distinguishing them. Operators triaging unexpected mid-stream cutoffs end up grepping bundled dist code, hand-instrumenting console.warn(new Error().stack) into selection-*.js:abortRun, redeploying, and waiting for another reproduction. We just walked through that loop on a real ~13 s cutoff in our self-hosted deployment.

Root Cause

Add an opt-in structured diagnostic for embedded run aborts, gated by OPENCLAW_LOG_ABORT_SOURCES=1. When enabled, every abortRun() entry in src/agents/pi-embedded-runner/run/attempt.ts emits one [abort-source] line classifying the trigger and capturing five caller stack frames.

Fix Action

Fix / Workaround

  • Affected: anyone running self-hosted OpenClaw who needs to triage mid-stream cutoffs (Happy operators, Control UI users, anyone seeing surface_error reason=timeout without context).

  • Severity: medium for triage time (hours saved per incident), low for everyday users.

  • Frequency: rare per user, but recurring across the user base — abort-cause questions surface in Discord.

  • Consequence: faster root-cause identification; no need to dist-patch.

  • Triggered the manual dist-patch loop ourselves on 2026-04-26 → 2026-04-27 chasing a 13 s cutoff in agents.defaults.timeoutSeconds=999999999 with default LLM idle 120 s. None of the known timers explained the timing; we needed a stack trace.

  • Convention precedent: OPENCLAW_RAW_STREAM is the same shape — operator-only diagnostic, off by default, blocked from workspace .env by the OPENCLAW_* prefix policy.

PR fix notes

PR #74110: feat(agents): add OPENCLAW_LOG_ABORT_SOURCES diagnostic for embedded run aborts

Description (problem / solution / changelog)

Summary

Add an opt-in structured diagnostic for embedded run aborts. Gated by env var OPENCLAW_LOG_ABORT_SOURCES=1; default off (no behavior change). When enabled, every abortRun() entry in src/agents/pi-embedded-runner/run/attempt.ts emits a [abort-source] line classifying the trigger and capturing five caller stack frames.

No linked upstream issue — this PR adds a missing operator-facing observability seam. Filing it because we hit the gap during real triage and want a first-class hook so the next operator doesn't have to dist-patch.

Problem

When an embedded agent run aborts mid-stream, the user-visible signal is a generic AbortError("This operation was aborted"). Five different paths can trip the abort:

  • External params.abortSignal from the caller
  • scheduleAbortTimer (run-level timer firing on params.timeoutMs)
  • idleTimeoutTrigger (LLM idle wrapper)
  • Compaction guard (shouldFlagCompactionTimeout path)
  • Public cancel() API

There is no log line distinguishing them. Operators triaging unexpected mid-stream cutoffs end up grepping bundled dist code, hand-instrumenting console.warn(new Error().stack) into selection-*.js:abortRun, redeploying, and waiting for another reproduction. We just walked through that loop on a real ~13s cutoff in our deployment.

Change

src/agents/pi-embedded-runner/run/abort-source-log.ts (new, 76 LOC):

  • isAbortSourceLoggingEnabled() reads the env var via existing isTruthyEnvValue helper.
  • classifyAbortSource(ctx) returns a discriminated union over five categories: external-signal | llm-idle-timeout | compaction-timeout | run-timer | explicit-cancel. Priority order matches actual abort semantics (external signal beats internal flags).
  • logAbortSource(ctx) formats the line via the existing agent/embedded subsystem logger when gated.

src/agents/pi-embedded-runner/run/attempt.ts: 1-line import + 9-line call at the head of the abortRun closure. Captures runId, sessionId, the four state flags (externalAbort, idleTimedOut, timedOutDuringCompaction, isTimeout), and the abort reason.

No public API change. No new dependency. No changelog entry (internal operator-facing diagnostic).

Why this shape

  • Default-off: zero log volume / behavior impact for users who don't enable it.
  • Single emission point inside abortRun covers every existing call site and any future ones automatically.
  • Existing logger keeps formatting consistent with surrounding agent/embedded lines.
  • Discriminated union for source prevents freeform-string drift in downstream consumers.
  • Stack capture via new Error().stack is gated, so cost is paid only when explicitly enabled.

Alternatives considered: always-on (rejected: log volume), per-call-site instrumentation (rejected: easy to miss new sources), structured AbortError subclass (rejected: API-surface ripple beyond the value of a diagnostic).

Tests

src/agents/pi-embedded-runner/run/abort-source-log.test.ts (new, 10 cases):

  • Classification × 5 — one per category, including priority ordering (external beats idle beats compaction beats timeout).
  • Gating × 3 — unset / "1" / empty string env var.
  • Emission × 2 — silent when gated off; emits expected fields when on, including source=, runId=, reason=, and a non-empty stack= segment.
$ pnpm test src/agents/pi-embedded-runner/run/abort-source-log.test.ts
Test Files  1 passed (1)
     Tests  10 passed (10)

pnpm build clean; the resulting selection-*.js bundle contains the [abort-source] log statement and classifyAbortSource symbol.

Notes

  • Env var name follows the existing OPENCLAW_RAW_STREAM precedent (operator-facing diagnostic, off by default, blocked from workspace .env by the OPENCLAW_* prefix policy in src/infra/dotenv.ts).
  • Could grow into richer abort tracing later (correlation IDs, downstream propagation) but kept minimal here — single diagnostic line, opt-in, easy to revert.

Changed files

  • src/agents/pi-embedded-runner/run/abort-source-log.test.ts (added, +121/-0)
  • src/agents/pi-embedded-runner/run/abort-source-log.ts (added, +76/-0)
  • src/agents/pi-embedded-runner/run/attempt.ts (modified, +10/-0)
RAW_BUFFERClick to expand / collapse

Summary

Add an opt-in structured diagnostic for embedded run aborts, gated by OPENCLAW_LOG_ABORT_SOURCES=1. When enabled, every abortRun() entry in src/agents/pi-embedded-runner/run/attempt.ts emits one [abort-source] line classifying the trigger and capturing five caller stack frames.

Problem to solve

When an embedded agent run aborts mid-stream, the user-visible signal is a generic AbortError("This operation was aborted"). Five different paths can trip the abort:

  • External params.abortSignal from the caller
  • scheduleAbortTimer (run-level timer)
  • idleTimeoutTrigger (LLM idle wrapper)
  • Compaction guard
  • Public cancel() API

There is no log line distinguishing them. Operators triaging unexpected mid-stream cutoffs end up grepping bundled dist code, hand-instrumenting console.warn(new Error().stack) into selection-*.js:abortRun, redeploying, and waiting for another reproduction. We just walked through that loop on a real ~13 s cutoff in our self-hosted deployment.

Proposed solution

New helper src/agents/pi-embedded-runner/run/abort-source-log.ts:

  • isAbortSourceLoggingEnabled() reads the env var via the existing isTruthyEnvValue helper.
  • classifyAbortSource(ctx) returns a discriminated union over five categories: external-signal | llm-idle-timeout | compaction-timeout | run-timer | explicit-cancel. Priority order matches actual abort semantics.
  • logAbortSource(ctx) emits via the existing agent/embedded subsystem logger when gated.

In attempt.ts: 1-line import + 9-line call at the top of the abortRun closure. No behavior change.

Default off — production logs unchanged.

Implementation ready as PR if accepted (closed with this issue): #74110 (feat/structured-abort-logging branch on chphch/openclaw-pr). 10 unit tests pass; live-deployed on personal infra; no public API change.

Alternatives considered

  • Always-on: high-frequency abort runs would flood logs.
  • Per-call-site instrumentation: easy to miss new abort sources added later; current proposal has one entry point covering all five paths.
  • Structured AbortError subclass: would require touching the AbortController contract and possibly downstream consumers — out of scope for a diagnostic.

Impact

  • Affected: anyone running self-hosted OpenClaw who needs to triage mid-stream cutoffs (Happy operators, Control UI users, anyone seeing surface_error reason=timeout without context).
  • Severity: medium for triage time (hours saved per incident), low for everyday users.
  • Frequency: rare per user, but recurring across the user base — abort-cause questions surface in Discord.
  • Consequence: faster root-cause identification; no need to dist-patch.

Evidence/examples

  • Triggered the manual dist-patch loop ourselves on 2026-04-26 → 2026-04-27 chasing a 13 s cutoff in agents.defaults.timeoutSeconds=999999999 with default LLM idle 120 s. None of the known timers explained the timing; we needed a stack trace.
  • Convention precedent: OPENCLAW_RAW_STREAM is the same shape — operator-only diagnostic, off by default, blocked from workspace .env by the OPENCLAW_* prefix policy.

Additional information

🤖 AI-assisted (Claude Code / Sonnet 4.6). Implementation written and fully tested locally. Marked per CONTRIBUTING.md.

Filing this issue first per CONTRIBUTING.md guidance for new features. PR #74110 was filed prematurely and is being closed; will reopen if maintainers agree this surface is welcome.

extent analysis

TL;DR

Enable structured diagnostic logging for embedded run aborts by setting OPENCLAW_LOG_ABORT_SOURCES=1 to help operators triage mid-stream cutoffs.

Guidance

  • Review the proposed solution in src/agents/pi-embedded-runner/run/abort-source-log.ts to understand how the abort source classification and logging work.
  • Consider the trade-offs of the proposed solution, such as the potential log flood with high-frequency abort runs, and the benefits of having a single entry point for all five abort paths.
  • Evaluate the impact of this feature on your specific use case, including the severity and frequency of mid-stream cutoffs, and the potential time savings for operators.
  • Test the implementation locally and verify that the logging works as expected before deploying it to production.

Example

No code snippet is provided as the issue does not require a specific code change, but rather a review and potential deployment of the proposed solution.

Notes

The proposed solution is currently available as a PR (#74110) and has been tested locally with 10 unit tests passing. However, it is essential to review and test the implementation thoroughly before deploying it to production.

Recommendation

Apply the workaround by setting OPENCLAW_LOG_ABORT_SOURCES=1 to enable structured diagnostic logging for embedded run aborts, as this will provide valuable information for operators to triage mid-stream cutoffs without modifying the existing AbortController contract.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

openclaw - ✅(Solved) Fix [Feature]: structured abort logging for embedded run aborts (OPENCLAW_LOG_ABORT_SOURCES env var) [1 pull requests, 1 comments, 2 participants]