openclaw - 💡(How to fix) Fix [Bug]: Session-file lock cascade — single stale .lock blocks every subsequent agent invocation in container [1 comments, 1 participants]

openclaw, 2026-04-28 16:43:07


GitHub stats

openclaw/openclaw#73684•Fetched 2026-04-29 06:16:24


Author

perlowja

Participants

perlowja

Timeline (top)

closed ×1, commented ×1, cross-referenced ×1

When OpenClaw's GRAEAE agent crashes or is killed mid-invocation, it leaves a stale .lock file under ~/.openclaw/agents/main/sessions/. Every subsequent openclaw agent invocation in the same container then fails with the same lock-timeout error, falls back to the embedded model chain (which itself fails because the fallback model has too small a context window), and ultimately produces an empty/error response. The lock persists until container restart.

In our 30-prompt cross-runtime acceptance harness this single failure mode cascaded into 9 consecutive prompt failures in one barrage run (p07 through p15), tanking the score from ~26/30 to 19/30.

Error Message

Gateway agent failed; falling back to embedded: GatewayClientRequestError: FallbackSummaryError: All models failed (2): google/gemini-flash-latest: session file locked (timeout 10000ms): pid=4687 /home/node/.openclaw/agents/main/sessions/postfix-1777122625.jsonl.lock (timeout) | google/gemma-4-31b-it: Model context window too small (8192 tokens; source=modelsConfig). Minimum is 16000. (unknown)


Summary

Repro

openclaw-demo:latest container, healthy on startup. We were running 30 prompts back-to-back via docker exec:

docker exec openclaw-demo-typhon openclaw agent --to "+17777777710" \
  --message "<prompt>" --timeout 300

Prompt 6 succeeded; prompts 7 onward all failed with this exact error:

Gateway agent failed; falling back to embedded:
  GatewayClientRequestError: FallbackSummaryError: All models failed (2):
  google/gemini-flash-latest: session file locked (timeout 10000ms):
    pid=4687 /home/node/.openclaw/agents/main/sessions/postfix-1777122625.jsonl.lock
    (timeout)
  | google/gemma-4-31b-it: Model context window too small
    (8192 tokens; source=modelsConfig). Minimum is 16000. (unknown)

Notable: pid=4687 is the same in every failure — the lock from prompt 6's interrupted/timeout-killed call never gets released.
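
A harness can detect this pattern automatically by extracting the pid and lock path from each failure and comparing consecutive results. The following is a minimal sketch, assuming the error text keeps the wording quoted above; `parse_lock_error` and `is_cascade` are hypothetical helper names, not part of OpenClaw or of the reporter's harness.

```python
import re

# Pattern matched against the error text quoted in this issue. The exact
# wording in other OpenClaw versions may differ (assumption).
LOCK_ERROR_RE = re.compile(
    r"session file locked \(timeout (?P<timeout_ms>\d+)ms\):\s*"
    r"pid=(?P<pid>\d+)\s+(?P<path>\S+\.lock)"
)


def parse_lock_error(error_text):
    """Return (pid, lock_path) if the text contains a lock-timeout error, else None."""
    match = LOCK_ERROR_RE.search(error_text)
    if match is None:
        return None
    return int(match.group("pid")), match.group("path")


def is_cascade(errors):
    """True when two consecutive failures name the same pid and lock file."""
    parsed = [parse_lock_error(text) for text in errors]
    return any(a is not None and a == b for a, b in zip(parsed, parsed[1:]))
```

Seeing the same pid and lock path in back-to-back failures is a strong hint that the lock holder is gone rather than merely slow.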

Root cause hypothesis

postfix-1777122625.jsonl.lock is created when the agent enters a session-write critical section. If the agent process is killed (timeout, SIGTERM, OOM) without releasing it, no liveness check / staleness sweep runs at the next agent invocation. The 10-second acquisition timeout fires unconditionally, then fallback kicks in.

Suggested fix

At session-acquire time, check the pid= recorded in the .lock file. If the pid is dead (kill(pid, 0) fails with ESRCH), treat the lock as stale and reclaim it. This is the standard stale-pidfile idiom; kernel locks taken with flock/fcntl, by contrast, are released automatically when the holder exits. Today the lock is a flat .lock file with no liveness check.

Alternatively: an openclaw sessions cleanup --enforce invocation as a startup step / periodic sweep would also work.
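
The liveness check could look roughly like the sketch below. It is written in Python for illustration only and is not OpenClaw's implementation. The on-disk format of the .lock file is not shown in this issue, so `read_lock_pid` assumes the file holds the owner's pid as plain decimal text; all three function names are hypothetical.

```python
import os


def pid_is_alive(pid):
    """Probe a pid with signal 0: no signal is delivered, only existence is checked."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:  # ESRCH: no such process
        return False
    except PermissionError:  # EPERM: the process exists but belongs to another user
        return True
    return True


def read_lock_pid(lock_path):
    """ASSUMPTION: the lock file contains the owner's pid as decimal text."""
    try:
        with open(lock_path) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return None


def reclaim_if_stale(lock_path):
    """Remove the lock only if its recorded owner is gone. Returns True if removed."""
    pid = read_lock_pid(lock_path)
    if pid is None or pid_is_alive(pid):
        return False  # unreadable or still held: leave it alone
    try:
        os.unlink(lock_path)
    except FileNotFoundError:
        return False  # another process reclaimed it first
    return True
```

Two caveats apply to any pid-based check: the kernel can reuse a pid for an unrelated process, and a gap remains between the check and the unlink. A startup sweep would simply call `reclaim_if_stale` on every *.lock file in the sessions directory.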

Workaround

Before each barrage prompt:

docker exec <container> sh -c \
  'rm -f /home/node/.openclaw/agents/main/sessions/*.lock 2>/dev/null'

This eliminated the cascade in our harness (harness/cobol/cobol_barrage_cross_runtime.py:pre_prompt_cleanup).
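
The harness source is not included in this issue, so the following is only a sketch of what a hook like pre_prompt_cleanup might do: build the docker exec command shown above and run it before each prompt. `cleanup_command` and the injectable `run` parameter are assumptions added for testability.

```python
import subprocess

# Sessions directory taken from the lock path in the error message.
SESSIONS_DIR = "/home/node/.openclaw/agents/main/sessions"


def cleanup_command(container):
    """Build the docker exec argv that deletes every session lock in the container."""
    return [
        "docker", "exec", container,
        "sh", "-c", f"rm -f {SESSIONS_DIR}/*.lock 2>/dev/null",
    ]


def pre_prompt_cleanup(container, run=subprocess.run):
    """Run the cleanup. `run` is injectable so this can be tested without Docker."""
    result = run(cleanup_command(container), check=False)
    return result.returncode == 0
```

Note that this removes every lock unconditionally, including one held by a live agent, so it is only safe when prompts run strictly one at a time.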

Adjacent issue

The fallback chain selects google/gemma-4-31b-it which has contextWindow: 8192 per modelsConfig, but the agent's hard-coded minimum is 16000 — so the fallback is never usable. Either tighten the fallback list to only models with sufficient context window, or relax the minimum to match the smallest model in the chain. Today the cascade is silent because the immediate error ("context window too small") prevents any user-visible output.
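
The first option, tightening the fallback list, amounts to a filter over the chain. Here is a minimal sketch, assuming the models config can be read as a mapping from model id to an entry with a contextWindow field; the real modelsConfig schema is not shown in this issue, and `usable_fallbacks` is a hypothetical name.

```python
MIN_CONTEXT_WINDOW = 16000  # minimum quoted in the error message


def usable_fallbacks(chain, models_config, minimum=MIN_CONTEXT_WINDOW):
    """Keep only the models whose configured context window meets the minimum."""
    usable = []
    for model_id in chain:
        # Models missing from the config are treated as unusable (window 0).
        window = models_config.get(model_id, {}).get("contextWindow", 0)
        if window >= minimum:
            usable.append(model_id)
    return usable
```

With the configuration reported here, google/gemma-4-31b-it (8192 tokens) would be filtered out, leaving an empty fallback chain. Reporting that at startup would be more visible than failing on every invocation.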

Versions

openclaw-demo:latest container, OpenClaw 2026.4.24
Models config: gemini-flash-latest primary; gemma-4-31b-it fallback

extent analysis

TL;DR

Implement a liveness check for the .lock file by verifying the pid recorded in the file, and treat the lock as stale if the pid is dead.

Guidance

Check the pid recorded in the .lock file at session-acquire time and treat the lock as stale if the pid is dead, i.e. if kill(pid, 0) fails with ESRCH.
Consider implementing a periodic openclaw sessions cleanup --enforce invocation as a startup step or sweep to remove stale locks.
As a temporary workaround, remove all .lock files before each barrage prompt using rm -f /home/node/.openclaw/agents/main/sessions/*.lock.
Review the fallback model chain to ensure that the selected models have a sufficient context window to prevent silent failures.

Example

The issue does not include a code change; the fix it proposes is a design change to the locking mechanism.

Notes

The provided workaround may not be suitable for production environments, as it removes all .lock files without checking their validity. A more robust solution would be to implement a liveness check for the .lock file.

Recommendation

Apply the workaround of removing all .lock files before each barrage prompt, as it has been shown to eliminate the cascade in the harness. However, this should be considered a temporary solution until a more robust locking mechanism is implemented.
