openclaw - ✅(Solved) Fix [Bug]: Auto-compaction leaves session JSONL write lock held after timeout, blocking all later Discord turns [2 pull requests, 1 comment, 2 participants]

GitHub stats: openclaw/openclaw#84193 (fetched 2026-05-20 03:42:55)
Comments: 1 · Participants: 2 · Timeline events: 10 · Reactions: 1
Timeline (top): labeled ×6, cross-referenced ×2, commented ×1, referenced ×1

OpenClaw 2026.5.18 can finish an Anthropic/Opus Discord run, enter post-run auto-compaction, and then leave the session JSONL write lock held after the compaction path times out.

After that, every new request in the same Discord channel session waits 60000ms for the session file lock and fails before the agent can reply:

SessionWriteLockTimeoutError: session file locked (timeout 60000ms)

The only observed recovery was a Gateway restart, which removed the live lock state and allowed the channel to accept requests again.

This appears related to existing session-lock/event-loop/compaction reliability reports, but this reproduction is narrower: a successful Opus run is followed by auto-compaction that holds the same session JSONL lock long enough to make all subsequent channel turns fail with no useful in-channel recovery.

Error Message

SessionWriteLockTimeoutError: session file locked (timeout 60000ms)

Root Cause

Post-run Pi auto-compaction can hang while holding the session JSONL write lock. According to the description of PR #84220, once the run's outer abort/timeout fires, the wrapped await run() body never settles, so the lock controller's finally block never runs and the lock stays held in-process for the entire maxHoldMs window (by default runTimeout plus a 900s compaction grace).

Until that window expires, every new request in the same Discord channel session waits 60000ms for the session file lock and fails before the agent can reply. The only observed recovery was a Gateway restart, which removed the live lock state and allowed the channel to accept requests again.

Fix Action

Fix / Workaround

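Immediate workaround: restart the Gateway cleanly, verify that the affected lock file is gone, and retry work in the channel only after the lock is cleared.

Operational workaround until fixed: keep high-context Discord sessions short, split large tasks into smaller turns, and monitor for old *.jsonl.lock files in active session directories.

Suggested upstream fix areas: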

  • ensure session write locks are released in finally blocks around compaction
  • add timeout/cancellation cleanup for compaction-held session locks
  • make lock diagnostics identify the owning operation, not only the owning PID
  • surface a user-visible recovery event when compaction blocks a later interactive turn
  • optionally isolate compaction writes from normal interactive turn acquisition so a failed compaction cannot starve new user turns indefinitely

PR fix notes

PR #84220: fix(agents): abandon hung in-flight write lock on attempt cleanup (#84193)

Description (problem / solution / changelog)

Summary

Fixes #84193: post-run Pi auto-compaction can hang while holding the session JSONL write lock. After the run's outer abort/timeout fires, the wrapped await run() body never settles, so the lock-controller's finally never runs and the lock stays held in-process for the entire maxHoldMs window (which defaults to runTimeout + 900s compaction grace, i.e. many minutes). Every later turn on the same Discord channel then fails with SessionWriteLockTimeoutError: session file locked (timeout 60000ms) until a Gateway restart.

This change tracks every lock reacquired through withSessionWriteLock and abandons the still-held entries from acquireForCleanup, so the next turn can acquire the session file immediately. The existing transcript fence still rejects later writes that would have raced a partial-compaction mutation.

What changed

  • src/agents/pi-embedded-runner/run/attempt.session-lock.ts
    • Wrap each reacquired write lock in a small InFlightLockEntry and add it to a per-controller Set.
    • New abandonInFlightWriteLocks() flips the controller into takeover state and awaits force-release of every still-held entry (idempotent via the entry's released flag).
    • acquireForCleanup() calls it before re-acquiring, so a hung compact() can never keep the JSONL lock alive past attempt teardown. Also releases the lingering heldLock (initial coarse lock) if it was still held when takeover was detected.
    • Emit a stderr diagnostic line on abandon so maintainers can correlate journalctl -u openclaw traces with the cleanup path: [session-write-lock] abandoned N in-flight lock(s) on attempt cleanup: sessionFile=... owner=pid=... reason=stuck-compaction-or-hook.
  • src/agents/pi-embedded-runner/run/attempt.session-lock.compaction-leak.test.ts (new, unit)
    • Reproduces the leak with a stuck withSessionWriteLock(() => never-resolve) and proves cleanup releases it.
    • Asserts further writes after the abandon are rejected via the existing EmbeddedAttemptSessionTakeoverError fence path.
  • src/agents/pi-embedded-runner/run/attempt.session-lock.compaction-leak.integration.test.ts (new, real fs)
    • Uses the real acquireSessionWriteLock against a real tmp session.jsonl (no mocks). Holds the lock through a stuck run, verifies a competing acquireSessionWriteLock from a separate caller fails with SessionWriteLockTimeoutError BEFORE cleanup, then runs cleanup and verifies the same competing acquire succeeds AFTER cleanup. Captures the diagnostic line and confirms the on-disk .jsonl bytes are not torn.
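
The tracking-and-abandon idea in the bullets above can be sketched in isolation. This is a minimal illustration, not the code from attempt.session-lock.ts: the class name LockController, its acquire callback, and the methods withLock and abandonInFlight are hypothetical stand-ins. Only the InFlightLockEntry name and the released flag are borrowed from the PR description.

```typescript
// Hypothetical sketch; names other than InFlightLockEntry are assumptions.
type Release = () => Promise<void>;

interface InFlightLockEntry {
  release: Release;
  released: boolean;
}

class LockController {
  private readonly inFlight = new Set<InFlightLockEntry>();
  private readonly acquire: () => Promise<Release>;
  private takenOver = false;

  constructor(acquire: () => Promise<Release>) {
    this.acquire = acquire;
  }

  async withLock<T>(run: () => Promise<T>): Promise<T> {
    // After an abandon, later writes are refused (a stand-in for the fence).
    if (this.takenOver) throw new Error("session taken over");
    const entry: InFlightLockEntry = { release: await this.acquire(), released: false };
    this.inFlight.add(entry);
    try {
      return await run();
    } finally {
      // Never reached if run() never settles; abandonInFlight() covers that case.
      await this.releaseEntry(entry);
    }
  }

  // Called from attempt cleanup: force-release everything still held.
  async abandonInFlight(): Promise<number> {
    this.takenOver = true;
    const stuck = [...this.inFlight];
    await Promise.all(stuck.map((entry) => this.releaseEntry(entry)));
    return stuck.length;
  }

  private async releaseEntry(entry: InFlightLockEntry): Promise<void> {
    this.inFlight.delete(entry);
    if (entry.released) return; // idempotent: a late finally becomes a no-op
    entry.released = true;
    await entry.release();
  }
}
```

The design point is that release is owned by the entry rather than by the awaiting call frame, so cleanup can release a lock even when the code that acquired it is stuck forever. A late write from the stuck operation still has to be rejected separately, which is the role the PR assigns to the existing transcript fence.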

Real behavior proof

Behavior addressed: post-run Pi auto-compaction holds the session JSONL write lock past the run's outer timeout, blocking every later turn for the same session with SessionWriteLockTimeoutError until Gateway restart. (#84193)

Real environment tested: macOS arm64, Node v25.9.0, vitest 4.1.6 from node scripts/run-vitest.mjs. Real node:fs lock file + real acquireSessionWriteLock + real createEmbeddedAttemptSessionLockController — no mocks for the lock subsystem.

Exact steps or command run after this patch:

node scripts/run-vitest.mjs \
  src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts \
  src/agents/pi-embedded-runner/run/attempt.session-lock.compaction-leak.test.ts \
  src/agents/pi-embedded-runner/run/attempt.session-lock.compaction-leak.integration.test.ts \
  src/agents/pi-embedded-runner/run/compaction-retry-aggregate-timeout.test.ts \
  src/agents/session-write-lock.test.ts

Evidence after fix (vitest summary):

 Test Files  10 passed (10)
      Tests  114 passed (114)
   Duration  1.11s

Real-fs trace from a standalone repro script (real acquireSessionWriteLock, no vitest, no mocks — captured locally with tmp paths redacted to $TMP):

[trace] sessionFile=$TMP/repro-84193-pVdd3y/session.jsonl
[trace] process.pid=72484
[trace] BEFORE cleanup: lock file owner=pid=72484
[trace] BEFORE cleanup: competing acquire failed after 1503ms with SessionWriteLockTimeoutError: session file locked (timeout 1500ms): pid=72484 $TMP/repro-84193-pVdd3y/session.jsonl.lock
[session-write-lock] abandoned 1 in-flight lock(s) on attempt cleanup: sessionFile=$TMP/repro-84193-pVdd3y/session.jsonl owner=pid=72484 reason=stuck-compaction-or-hook
[trace] AFTER cleanup:  competing acquire succeeded in 0ms
[trace] session.jsonl bytes intact: "{\"type\":\"session\"}\n"

This is the same shape of evidence the issue reporter would see: lock file on disk with the live Gateway PID before cleanup, the SessionWriteLockTimeoutError matching the user-reported error name and pid format (pid=N), the stderr diagnostic line on abandon, then immediate acquire success on the same session file and a verified non-torn session.jsonl. Same regression suite on upstream/main without this patch fails at:

embedded attempt session lock — stuck auto-compaction (#84193)
  ✗ abandons in-flight write lock when cleanup runs while a hung compact() still owns it
    expected "vi.fn()" to be called 1 times, but got 0 times
  ✗ rejects further withSessionWriteLock calls after abandoning a hung in-flight lock
    expected [Function] to throw error matching /session file changed/ but got
    'Cannot read properties of undefined (reading 'release')'

Observed result after fix: cleanup releases the stuck reacquired lock immediately. The 60s acquire timeout no longer blocks the next turn. Transcript fencing still rejects further writes after abandon via the existing EmbeddedAttemptSessionTakeoverError. Diagnostic stderr line is observable so a maintainer running journalctl -u openclaw after the failure will see which session file/pid was abandoned.

What was not tested: live Discord + Anthropic Opus session against a running Gateway. The integration test covers the file-lock subsystem semantics that the bug actually leaks (on-disk .jsonl.lock, the controller's reacquire path, the cleanup hook called from attempt.ts:4832); a Discord/Opus run would only add transport plumbing on top of the same locking code, so the lock-side behavior is fully covered here.

Risk addressed

ClawSweeper flagged two session-state-sensitive risks. Addressed:

  • Force-releasing in-flight compaction lock. Abandon only fires from acquireForCleanup(), which the embedded attempt runner only calls in its outer finally block (attempt.ts:4832). At that point the run's outer abort/timeout has already fired, so any late compaction write is already on an error path. The transcript fence (assertSessionFileFence) on the next attempt's controller catches any post-abandon byte-level divergence on its first withSessionWriteLock call, so a late compaction write cannot silently corrupt a future turn.
  • Race with concurrent in-process acquire. The abandon now awaits each lock.release() before returning, so the .jsonl.lock file is gone on disk before cleanup returns. The integration test exercises exactly this race (competing acquire from the same process fired immediately after cleanup) and confirms it succeeds without bouncing off the file-lock-manager's stale-lock detection.

Test plan

  • node scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/attempt.session-lock.compaction-leak.integration.test.ts — real-fs proof, 2/2 pass
  • node scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/attempt.session-lock.compaction-leak.test.ts — unit regression, 4/4 pass
  • node scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts — pre-existing suite, 28/28 pass (unchanged)
  • node scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/compaction-retry-aggregate-timeout.test.ts — pass
  • node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts — pass
  • node scripts/run-oxlint.mjs on touched files — 0 errors / 0 warnings
  • pnpm exec oxfmt --check on touched files — formatted
  • Live Discord/Opus auto-compaction reproduction against a running Gateway (left for maintainer / Crabbox / Mantis)

Changed files

  • src/agents/pi-embedded-runner/run/attempt.session-lock.compaction-leak.integration.test.ts (added, +116/-0)
  • src/agents/pi-embedded-runner/run/attempt.session-lock.compaction-leak.test.ts (added, +107/-0)
  • src/agents/pi-embedded-runner/run/attempt.session-lock.ts (modified, +79/-5)

PR #84353: fix(agents): abandon hung in-flight write lock on attempt cleanup (#84193)

Description (problem / solution / changelog)

Makes https://github.com/openclaw/openclaw/pull/84220 merge-ready for the ClawSweeper automerge loop. The edit pass should inspect the live PR diff, review comments, and failing checks; rebase if needed; keep the contributor branch credited; and stop only when validation is green or an external blocker is proven.

ClawSweeper 🐠 replacement reef notes:

  • Repair fallback: GitHub rejected the repair branch push because it updates workflow files and the ClawSweeper app token does not have workflows permission

Inherited issue-closing references from the source PR: Fixes #84193

Co-author credit kept:

fish notes: model gpt-5.5, reasoning high; reviewed against c1b31439982f.

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/agents/pi-embedded-runner/run/attempt.session-lock.compaction-leak.integration.test.ts (added, +116/-0)
  • src/agents/pi-embedded-runner/run/attempt.session-lock.compaction-leak.test.ts (added, +109/-0)
  • src/agents/pi-embedded-runner/run/attempt.session-lock.ts (modified, +79/-5)

Original issue body

Bug type

Regression (worked before, now fails)

Beta release blocker

No

Summary

OpenClaw 2026.5.18 can finish an Anthropic/Opus Discord run, enter post-run auto-compaction, and then leave the session JSONL write lock held after the compaction path times out.

After that, every new request in the same Discord channel session waits 60000ms for the session file lock and fails before the agent can reply:

SessionWriteLockTimeoutError: session file locked (timeout 60000ms)

The only observed recovery was a Gateway restart, which removed the live lock state and allowed the channel to accept requests again.

This appears related to existing session-lock/event-loop/compaction reliability reports, but this reproduction is narrower: a successful Opus run is followed by auto-compaction that holds the same session JSONL lock long enough to make all subsequent channel turns fail with no useful in-channel recovery.

Steps to reproduce

  1. Run OpenClaw Gateway as a user systemd service with Discord enabled.
  2. Use a Discord channel session with Anthropic Opus as the active model.
  3. Start a larger file-producing task so the session crosses the auto-compaction threshold.
  4. Let the assistant finish the requested work.
  5. Observe post-run auto-compaction start for the same session.
  6. Send another user request in the same Discord channel while the compaction path is stuck.
  7. Observe the new request wait for the existing JSONL lock and fail after 60000ms.

Observed reproduction:

  • Discord channel: #mws
  • Session key: agent:main:discord:channel:1506258704541159484
  • Session file: /home/casper/.openclaw/agents/main/sessions/49e71c56-dcbc-40ab-be04-4a92fd2230be.jsonl
  • Lock file: /home/casper/.openclaw/agents/main/sessions/49e71c56-dcbc-40ab-be04-4a92fd2230be.jsonl.lock
  • Run id: 7d2170b5-3733-4439-8451-cad42efa577b

The final assistant answer for the original Opus run was written to the JSONL around 2026-05-19T13:33:13.804Z. The session lock was then created immediately after for the same gateway process:

{
  "pid": 963591,
  "createdAt": "2026-05-19T13:33:13.808Z",
  "starttime": 335126537
}

Later user requests in the same channel failed at 13:45, 13:46, and 13:54 UTC while waiting for the same lock.

Expected behavior

Post-run auto-compaction should not leave a live session write lock behind after timeout or abort.

Expected behavior:

  • compaction releases the session JSONL lock on success, failure, timeout, or cancellation
  • subsequent user turns in the same Discord channel are not blocked by stale in-process compaction state
  • if compaction cannot complete, OpenClaw surfaces a recoverable channel/session error
  • a Gateway restart should not be required to make the channel usable again
  • the stale-lock check should consider both PID and process start time, and should have a cleanup path for locks left by failed compaction
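
The last expectation, a stale-lock check that looks at both PID and process start time, could be sketched as a pure classifier. Everything here is an assumption for illustration: the LockPayload shape mirrors the lock JSON shown in this report, while ProcessProbe, classifyLock, and the verdict names are hypothetical and not part of OpenClaw.

```typescript
// Hypothetical sketch; not OpenClaw's stale-lock detection.
interface LockPayload {
  pid: number;
  createdAt: string;
  starttime?: number;
}

// Injected so the rule can be tested without touching real processes.
interface ProcessProbe {
  isAlive(pid: number): boolean;
  startTime(pid: number): number | undefined;
}

type LockVerdict = "live" | "stale-dead-pid" | "stale-pid-reused";

function classifyLock(lock: LockPayload, probe: ProcessProbe): LockVerdict {
  // A lock whose owner no longer exists is stale.
  if (!probe.isAlive(lock.pid)) return "stale-dead-pid";

  // Same PID but a different start time means the PID was recycled
  // by an unrelated process after the original owner exited.
  const actual = probe.startTime(lock.pid);
  if (lock.starttime !== undefined && actual !== undefined && actual !== lock.starttime) {
    return "stale-pid-reused";
  }

  // Owner is alive and matches; only the owner itself can decide to let go.
  return "live";
}
```

Note that in this incident the owner was the live Gateway process, so a check like this would return "live". That is why the report also asks for a cleanup path inside the process rather than relying on stale-lock detection alone.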

Actual behavior

The original Opus run completed useful work and wrote its final assistant output to the session JSONL.

Immediately afterward, auto-compaction held the session write lock. The compaction path timed out, but the lock remained held by the live Gateway process. New Discord requests in the same channel then failed before an embedded agent could start or reply.

User-visible result:

  • Discord typing/traffic stops
  • no final or error answer reaches the channel for later requests
  • each new request waits about 60 seconds and fails
  • the channel remains unusable until Gateway restart

OpenClaw version

OpenClaw 2026.5.18

Operating system

Ubuntu

Install method

npm global

Model

claude-opus-4-7

Provider / routing chain

anthropic/claude-opus-4-7 -> OpenClaw embedded run -> Discord channel session -> post-run auto-compaction

Additional provider/model setup details

Anthropic was used through the normal OpenClaw embedded runner path.

The incident happened after changing Discord group visible replies to automatic delivery to work around a separate message tool argument issue. The write-lock failure is independent of that delivery setting: the failing path is session persistence/auto-compaction before any later assistant reply can be generated.

The same environment also has separate reports for:

  • Codex app-server turns stalling after item/completed
  • model-generated SendMessage arguments being rejected instead of normalized to message

Those are distinct symptoms. This report is specifically about the session JSONL lock left behind by post-run auto-compaction.

Logs, screenshots, and evidence

Original compaction timeout signal:


May 19 13:43:45 casper node[963591]:
2026-05-19T13:43:45.077+00:00 [agent/embedded]
embedded run timeout reached during compaction; extending deadline:
runId=7d2170b5-3733-4439-8451-cad42efa577b
sessionId=49e71c56-dcbc-40ab-be04-4a92fd2230be
extraMs=900000



May 19 13:44:14 casper node[963591]:
CommandLaneTaskTimeoutError: Command lane "main" task timed out after 930000ms


Subsequent requests failed waiting for the same session JSONL lock:


May 19 13:45:17 casper node[963591]:
2026-05-19T13:45:17.707+00:00 [diagnostic]
lane task error: lane=main durationMs=61155
error="SessionWriteLockTimeoutError: session file locked (timeout 60000ms):
pid=963591 /home/casper/.openclaw/agents/main/sessions/49e71c56-dcbc-40ab-be04-4a92fd2230be.jsonl.lock"



May 19 13:45:17 casper node[963591]:
2026-05-19T13:45:17.712+00:00 [diagnostic]
lane task error: lane=session:agent:main:discord:channel:1506258704541159484 durationMs=61165
error="SessionWriteLockTimeoutError: session file locked (timeout 60000ms):
pid=963591 /home/casper/.openclaw/agents/main/sessions/49e71c56-dcbc-40ab-be04-4a92fd2230be.jsonl.lock"



May 19 13:45:17 casper node[963591]:
Embedded agent failed before reply:
session file locked (timeout 60000ms):
pid=963591 /home/casper/.openclaw/agents/main/sessions/49e71c56-dcbc-40ab-be04-4a92fd2230be.jsonl.lock


The same pattern repeated:


May 19 13:46:27 ... SessionWriteLockTimeoutError ... 49e71c56-dcbc-40ab-be04-4a92fd2230be.jsonl.lock
May 19 13:54:04 ... SessionWriteLockTimeoutError ... 49e71c56-dcbc-40ab-be04-4a92fd2230be.jsonl.lock


Lock file observed before Gateway restart:


/home/casper/.openclaw/agents/main/sessions/49e71c56-dcbc-40ab-be04-4a92fd2230be.jsonl.lock
mtime: 2026-05-19 13:33:13.807249173 +0000
pid: 963591
createdAt: 2026-05-19T13:33:13.808Z


After a Gateway restart, the lock file was gone and the channel could accept new work again:


LOCK_GONE


Related public issues found:

- https://github.com/openclaw/openclaw/issues/43367 mentions session lock timeouts and detached background work in multi-agent orchestration.
- https://github.com/openclaw/openclaw/issues/75882 mentions gateway stalls, lane waits, file lock timeouts, and missed replies.

Neither is an exact match for this post-run auto-compaction lock leak in a single Discord channel session.

Impact and severity

Severity: High / work-blocking.

Impact:

  • the affected Discord channel session becomes unusable
  • every new request waits about 60 seconds and fails before reply
  • users see no actionable recovery message in the channel
  • completed work may exist on disk, but the user receives no reliable completion signal
  • the only practical recovery observed is a Gateway restart

Additional information

Immediate workaround:

  1. Restart the Gateway cleanly.
  2. Verify the affected lock file is gone.
  3. Retry work in the channel only after the lock is cleared.

Operational workaround until fixed:

  • keep high-context Discord sessions short
  • use fresh channel/session context for large site/build tasks before auto-compaction is likely
  • split large tasks into smaller turns
  • avoid continuing work in a session that is close to compaction/context limits
  • monitor for old *.jsonl.lock files in active session directories
  • do not manually delete a lock while its owning Gateway PID is still alive unless there is strong evidence the lock is stale and the process is no longer using it
  • if the lock owner is the live Gateway process and the channel is blocked, prefer a clean Gateway restart over deleting the lock file
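
For the monitoring item above, a small report-only scan is one option. This is a sketch under stated assumptions: the function names selectOldLocks and scanSessionLocks and the age threshold are hypothetical, and the script deliberately never deletes anything, in line with the guidance about not removing a lock owned by a live Gateway process.

```typescript
import { readdirSync, statSync } from "node:fs";
import { join } from "node:path";

interface LockFileInfo {
  path: string;
  mtimeMs: number;
}

// Pure filter so the age rule can be checked without touching the filesystem.
function selectOldLocks(files: LockFileInfo[], nowMs: number, maxAgeMs: number): LockFileInfo[] {
  return files
    .filter((file) => file.path.endsWith(".jsonl.lock") && nowMs - file.mtimeMs > maxAgeMs)
    .sort((a, b) => a.mtimeMs - b.mtimeMs); // oldest first
}

// Lists lock files in one sessions directory that are older than maxAgeMs.
function scanSessionLocks(sessionsDir: string, maxAgeMs: number): LockFileInfo[] {
  const files = readdirSync(sessionsDir).flatMap((name) => {
    const path = join(sessionsDir, name);
    // Lock files are transient; tolerate one vanishing between readdir and stat.
    const stats = statSync(path, { throwIfNoEntry: false });
    return stats ? [{ path, mtimeMs: stats.mtimeMs }] : [];
  });
  return selectOldLocks(files, Date.now(), maxAgeMs);
}
```

A caller might run scanSessionLocks on a sessions directory such as the one in this report with a threshold of a few minutes and log the result. Whether mtime is refreshed while a lock is legitimately held is not stated in the issue, so an old lock is a signal to investigate, not proof of a leak.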

Suggested upstream fix areas:

  • ensure session write locks are released in finally blocks around compaction
  • add timeout/cancellation cleanup for compaction-held session locks
  • make lock diagnostics identify the owning operation, not only the owning PID
  • surface a user-visible recovery event when compaction blocks a later interactive turn
  • optionally isolate compaction writes from normal interactive turn acquisition so a failed compaction cannot starve new user turns indefinitely
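
The first three suggestions above (release in finally, timeout cleanup, and diagnostics that name the owning operation) can be combined in one small wrapper. This is a hedged sketch, not OpenClaw code: withBoundedLock, LockHoldTimeoutError, and the acquire callback are hypothetical names introduced for illustration.

```typescript
// Hypothetical sketch; none of these names exist in OpenClaw.
type ReleaseFn = () => Promise<void>;

class LockHoldTimeoutError extends Error {
  readonly operation: string;
  readonly maxHoldMs: number;

  constructor(operation: string, maxHoldMs: number) {
    // The message names the operation, not just the owning PID.
    super(`lock held too long by operation=${operation} (maxHoldMs=${maxHoldMs})`);
    this.name = "LockHoldTimeoutError";
    this.operation = operation;
    this.maxHoldMs = maxHoldMs;
  }
}

async function withBoundedLock<T>(
  acquire: () => Promise<ReleaseFn>,
  operation: string,
  maxHoldMs: number,
  run: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  const release = await acquire();
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // ask the work to stop, if it cooperates
      reject(new LockHoldTimeoutError(operation, maxHoldMs));
    }, maxHoldMs);
  });

  try {
    // Awaiting the race, not run() itself, guarantees that this await settles.
    return await Promise.race([run(controller.signal), timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
    await release(); // runs on success, failure, and timeout alike
  }
}
```

The key difference from a plain try/finally around await run() is that the awaited promise always settles, so the finally block is reachable even when the work hangs. The hung work may still attempt a write later, so a wrapper like this would need to be paired with something like the transcript fence described in PR #84220.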
