openclaw - 💡(How to fix) Fix [Bug]: Session-file lock cascade — single stale .lock blocks every subsequent agent invocation in container [1 comments, 1 participants]

GitHub stats
openclaw/openclaw#73684 · Fetched 2026-04-29 06:16:24
Comments: 1 · Participants: 1 · Reactions: 0
Timeline (3 events): closed ×1, commented ×1, cross-referenced ×1

Summary

When OpenClaw's GRAEAE agent crashes or is killed mid-invocation, it leaves a stale .lock file under ~/.openclaw/agents/main/sessions/. Every subsequent openclaw agent invocation in the same container then fails with the same lock-timeout error, falls back to the embedded model chain (which itself fails because the fallback model has too small a context window), and ultimately produces an empty/error response. The lock persists until container restart.

In our 30-prompt cross-runtime acceptance harness this single failure mode cascaded into 9 consecutive prompt failures in one barrage run (p07 through p15), tanking the score from ~26/30 to 19/30.

Repro

openclaw-demo:latest container, healthy on startup. We were running 30 prompts back-to-back via docker exec:

docker exec openclaw-demo-typhon openclaw agent --to "+17777777710" \
  --message "<prompt>" --timeout 300

Prompt 6 succeeded; prompts 7 onward all failed with this exact error:

Gateway agent failed; falling back to embedded:
  GatewayClientRequestError: FallbackSummaryError: All models failed (2):
  google/gemini-flash-latest: session file locked (timeout 10000ms):
    pid=4687 /home/node/.openclaw/agents/main/sessions/postfix-1777122625.jsonl.lock
    (timeout)
  | google/gemma-4-31b-it: Model context window too small
    (8192 tokens; source=modelsConfig). Minimum is 16000. (unknown)

Notable: pid=4687 is the same in every failure — the lock from prompt 6's interrupted/timeout-killed call never gets released.
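
A harness can confirm the "same pid in every failure" observation mechanically rather than by eye. The following is a minimal sketch, assuming the error text keeps the shape shown above; the function names `parse_lock_error` and `same_stale_holder` are hypothetical and not part of OpenClaw.

```python
import re

# Matches the lock-timeout fragment of the error shown above.
# \s* tolerates both the single-line and the wrapped form of the message.
LOCK_ERROR = re.compile(
    r"session file locked \(timeout (?P<timeout_ms>\d+)ms\):\s*"
    r"pid=(?P<pid>\d+)\s+(?P<path>\S+\.lock)"
)


def parse_lock_error(stderr_text):
    """Return (pid, lock_path, timeout_ms) from a lock-timeout error, or None."""
    match = LOCK_ERROR.search(stderr_text)
    if match is None:
        return None
    return int(match["pid"]), match["path"], int(match["timeout_ms"])


def same_stale_holder(errors):
    """True if every parsable failure names one identical (pid, lock path) pair."""
    parsed = [parse_lock_error(text) for text in errors]
    holders = {p[:2] for p in parsed if p is not None}
    return len(holders) == 1
```

If `same_stale_holder` returns True across consecutive failed prompts, the failures share one lock holder, which is consistent with a single stale lock rather than ordinary contention.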

Root cause hypothesis

postfix-1777122625.jsonl.lock is created when the agent enters a session-write critical section. If the agent process is killed (timeout, SIGTERM, OOM) before releasing it, the file stays on disk, and the next agent invocation runs neither a liveness check nor a staleness sweep. It waits out the full 10-second acquisition timeout, then falls through to the fallback chain.

Suggested fix

At session-acquire time, check the pid= recorded in the .lock file. If that pid is dead (kill(pid, 0) fails with ESRCH), treat the lock as stale and reclaim it. This liveness check is the usual idiom for pid-stamped lock files; today the lock is a flat .lock file with no such check. Kernel-managed flock/fcntl locks avoid the problem differently, because the kernel releases them when the holder exits.

Alternatively, running openclaw sessions cleanup --enforce as a startup step or periodic sweep would also work.
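
The liveness check described above can be sketched as follows. This is an illustration, not OpenClaw's implementation: it assumes a POSIX host, and it assumes the lock file body is just the holder's pid as decimal text, which the issue does not confirm. `pid_alive` and `reclaim_if_stale` are hypothetical names.

```python
import os
from pathlib import Path


def pid_alive(pid):
    """Probe a pid with signal 0: nothing is delivered, only existence is checked."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:  # ESRCH: no such process
        return False
    except PermissionError:  # EPERM: the process exists but belongs to another user
        return True
    return True


def reclaim_if_stale(lock_path):
    """Remove lock_path if its recorded holder is dead; return True if removed.

    ASSUMPTION: the lock file contains only the holder's pid as decimal text.
    """
    path = Path(lock_path)
    try:
        pid = int(path.read_text().strip())
    except FileNotFoundError:
        return False  # nothing to reclaim
    except ValueError:
        return False  # unreadable pid: leave the file for inspection
    if pid_alive(pid):
        return False  # holder is still running, so the lock is legitimate
    path.unlink(missing_ok=True)
    return True
```

Two caveats apply to any check of this kind: pids can be reused, so a recycled pid may make a stale lock look live, and there is a small window between the probe and the unlink in which another process could act on the same file.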

Workaround

Before each barrage prompt:

docker exec <container> sh -c \
  'rm -f /home/node/.openclaw/agents/main/sessions/*.lock 2>/dev/null'

This eliminated the cascade in our harness (harness/cobol/cobol_barrage_cross_runtime.py:pre_prompt_cleanup).
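
For readers wiring the same workaround into their own harness, a pre-prompt hook might look like the sketch below. The issue names `pre_prompt_cleanup` but does not show its body, so this implementation is a guess at its shape; `build_cleanup_cmd` and the injectable `run` parameter are hypothetical additions that let the hook be tested without Docker.

```python
import subprocess

# Lock location taken from the workaround command above.
LOCK_GLOB = "/home/node/.openclaw/agents/main/sessions/*.lock"


def build_cleanup_cmd(container):
    """Build the docker exec argv that deletes every session lock file."""
    # sh -c is needed so the glob expands inside the container, not on the host.
    return ["docker", "exec", container, "sh", "-c", f"rm -f {LOCK_GLOB} 2>/dev/null"]


def pre_prompt_cleanup(container, run=subprocess.run):
    """Clear session locks before a prompt; return True if the command exited 0."""
    result = run(build_cleanup_cmd(container), check=False)
    return result.returncode == 0
```

Because this removes every lock unconditionally, it is only safe when no other agent invocation is running in the container at the same time, as in a strictly sequential harness.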

Adjacent issue

The fallback chain selects google/gemma-4-31b-it which has contextWindow: 8192 per modelsConfig, but the agent's hard-coded minimum is 16000 — so the fallback is never usable. Either tighten the fallback list to only models with sufficient context window, or relax the minimum to match the smallest model in the chain. Today the cascade is silent because the immediate error ("context window too small") prevents any user-visible output.
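
The first option, restricting the chain to models that meet the minimum, can be sketched as a config-time check. The dictionary shape, the function names, and the context window given for gemini-flash-latest are all assumptions made for illustration; only the 8192 and 16000 figures come from the error message.

```python
MIN_CONTEXT_TOKENS = 16000  # the agent's minimum, per the error message

# Illustrative chain. The 8192 value is from the issue; the primary model's
# window is a placeholder, since the issue does not state it.
CHAIN = [
    {"id": "google/gemini-flash-latest", "contextWindow": 1_000_000},
    {"id": "google/gemma-4-31b-it", "contextWindow": 8192},
]


def usable_fallbacks(models, minimum=MIN_CONTEXT_TOKENS):
    """Keep only models whose context window meets the minimum, in chain order."""
    return [m for m in models if m["contextWindow"] >= minimum]


def validate_chain(models, minimum=MIN_CONTEXT_TOKENS):
    """Raise at config-load time instead of failing silently mid-cascade."""
    rejected = [m["id"] for m in models if m["contextWindow"] < minimum]
    if rejected:
        raise ValueError(
            f"fallback models below {minimum} tokens: {', '.join(rejected)}"
        )
```

Validating once at load time surfaces the mismatch as a configuration error, rather than as an empty response after the primary model has already failed.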

Versions

  • openclaw-demo:latest container, OpenClaw 2026.4.24
  • Models config: gemini-flash-latest primary; gemma-4-31b-it fallback

extent analysis

TL;DR

Implement a liveness check for the .lock file by verifying the pid recorded in the file, and treat the lock as stale if the pid is dead.

Guidance

  • Check the pid recorded in the .lock file at session-acquire time and treat the lock as stale if the pid is dead, that is, if kill(pid, 0) fails with ESRCH.
  • Consider running openclaw sessions cleanup --enforce as a startup step or periodic sweep to remove stale locks.
  • As a temporary workaround, remove all .lock files before each barrage prompt using rm -f /home/node/.openclaw/agents/main/sessions/*.lock.
  • Review the fallback model chain to ensure that the selected models have a sufficient context window to prevent silent failures.

Example

The issue calls for a design change to the locking mechanism rather than a specific patch, so it does not include an upstream code change.

Notes

The provided workaround may not be suitable for production environments, as it removes all .lock files without checking their validity. A more robust solution would be to implement a liveness check for the .lock file.

Recommendation

Apply the workaround of removing all .lock files before each barrage prompt, as it has been shown to eliminate the cascade in the harness. However, this should be considered a temporary solution until a more robust locking mechanism is implemented.
