openclaw - 💡(How to fix) Fix 🐛 Bug: Auto-compaction triggers gateway crash loop — old process zombie-like after successful compaction [3 pull requests]



Summary

When context overflow triggers auto-compaction, the gateway enters a crash-restart loop after compaction succeeds ("auto-compaction succeeded; retrying prompt"). The old gateway process becomes zombie-like: it does not exit, but it also stops responding to new connections. Systemd sees "gateway startup failed" and restarts the service repeatedly. Each new process fails to acquire the lock within 5000ms (EEXIST plus a pid-alive check) and exits with code 78, and the cycle repeats. No connection survives more than a few seconds during this loop.

This is reproducible: it has occurred multiple times on this system (May 6, May 7, May 8).


Environment

  • OpenClaw version: 2026.5.7 (also reproduced on earlier 5.5/5.6 builds)
  • Node.js: v22.22.2
  • OS: Debian with systemd
  • Launcher: systemd (openclaw-gateway.service)
  • Start timestamp: 11:23:29 — compaction triggered
  • Compaction succeeded: 11:24:09 (~40 seconds later)
  • Gateway became unresponsive: within seconds after 11:24:09
  • Systemd restart counter: reached 2972+ before manual intervention

Symptoms (from logs)

Phase 1: Normal compaction (OK)

11:23:29 - context overflow detected (attempt 1/3); attempting auto-compaction
11:23:29 - [context-overflow-diag] sessionKey=agent:main:main provider=minimax/MiniMax-M2.7-highspeed
11:24:09 - [compaction] rotated active transcript after compaction
11:24:09 - auto-compaction succeeded for minimax/MiniMax-M2.7-highspeed; retrying prompt
11:24:09 - post-compaction guard armed for 3 attempts
11:24:09 - memory-lancedb-pro@1.1.0-beta.10: plugin registered
11:24:09 - memory-lancedb-pro: diagnostic build tag loaded
11:24:10 - memory-lancedb-pro: injecting 2 memories into context for agent main

Phase 2: Gateway enters crash loop (BUG)

11:24:57 - loading configuration…
11:25:04 - memory-lancedb-pro: llm-client [extract-candidates] request failed: Request timed out.
11:25:04 - memory-pro: smart-extractor: no memories extracted
11:25:04 - memory-lancedb-pro: regex fallback found 22 capturable text(s)
11:25:08 - loading configuration…
11:25:19 - webchat disconnected code=1001
11:25:19 - webchat connected (reconnect attempt)
11:26:02 - loading configuration…
11:26:18 - webchat disconnected code=1001
11:27:07 - webchat disconnected code=1001
11:27:37 - SIGTERM received; restarting  ← user manually restarted

Then the systemd crash loop begins:

Gateway failed to start: gateway already running under systemd; existing gateway is healthy, exiting with code 78 to prevent a systemd Restart=always loop | gateway already running (pid 56615); lock timeout after 5000ms
Port 18789 is already in use.
- pid 56615 root: /usr/bin/node /usr/lib/node_modules/openclaw/dist/index.js gateway --port 18789

This repeats every ~9-10 seconds, with the systemd restart counter incrementing each time.


Root Cause Analysis (Code Level)

Lock Mechanism

The gateway uses a file-based lock (gateway-lock-ARBtYsKu.js) to prevent multiple instances:

  1. Attempts to acquire the lock via open(path, 'wx')
  2. On EEXIST, reads the existing lockfile and checks whether its pid is alive via isPidAlive()
  3. If alive and same pid: waits up to 5000ms for the old process to release the lock
  4. On timeout: throws GatewayLockError with code 78 → systemd sees exit code 78
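The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code in gateway-lock-ARBtYsKu.js: the lockfile format (`{ pid }` as JSON), the stale-lock cleanup, the polling interval, and the helper names `tryAcquireLock` and `acquireLock` are all assumptions. Only `GatewayLockError`, `isPidAlive`, the `'wx'` flag, the 5000ms timeout, and exit code 78 come from the report.

```javascript
const fs = require('node:fs');

// Hypothetical error type mirroring the GatewayLockError named in the issue.
class GatewayLockError extends Error {
  constructor(message) {
    super(message);
    this.name = 'GatewayLockError';
    this.exitCode = 78; // exit code seen in the logs
  }
}

// Signal 0 checks for existence without delivering a signal.
function isPidAlive(pid) {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but belongs to another user.
    return err.code === 'EPERM';
  }
}

// One acquisition attempt. 'wx' fails with EEXIST if the file already exists.
function tryAcquireLock(lockPath) {
  try {
    const fd = fs.openSync(lockPath, 'wx');
    fs.writeSync(fd, JSON.stringify({ pid: process.pid }));
    fs.closeSync(fd);
    return { acquired: true };
  } catch (err) {
    if (err.code !== 'EEXIST') throw err;
    let holderPid = null;
    try {
      holderPid = JSON.parse(fs.readFileSync(lockPath, 'utf8')).pid;
    } catch {
      // Unreadable or partially written lockfile: holder stays unknown.
    }
    return { acquired: false, holderPid };
  }
}

// Poll until the lock is free or the timeout elapses.
async function acquireLock(lockPath, timeoutMs = 5000, pollMs = 100) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const result = tryAcquireLock(lockPath);
    if (result.acquired) return;
    const pid = result.holderPid;
    if (!(Number.isInteger(pid) && pid > 0 && isPidAlive(pid))) {
      // Assumed behaviour: a lock left by a dead process is removed and retried.
      fs.rmSync(lockPath, { force: true });
      continue;
    }
    if (Date.now() >= deadline) {
      throw new GatewayLockError(
        `gateway already running (pid ${pid}); lock timeout after ${timeoutMs}ms`
      );
    }
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
}
```

Note that the only liveness signal in this sketch is whether the pid exists. A holder that is alive but hung keeps every newcomer waiting until the timeout, which matches the behaviour described in this report.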

The Bug

After compaction succeeds, the old gateway process (pid 56615) remains alive but becomes unresponsive: it holds the lock but stops processing requests. The health monitor (server-runtime-services-D6xEJ-h2.js) sees the gateway as "startup failed", and systemd repeatedly tries to start new instances, each of which:

  1. Attempts to acquire the lock and fails with EEXIST
  2. Checks pid 56615, which is still alive (the process exists, but it is unresponsive)
  3. Waits up to 5000ms
  4. Times out → GatewayLockError → exits with code 78
  5. Systemd restarts immediately → repeat

The old process never voluntarily releases the lock because it does not detect that it needs to restart. It simply stops responding during the retry-prompt phase.

Key Observation: load_config Loop

Between 11:24:09 (compaction succeeded) and 11:25:19 (webchat disconnected), the gateway logs "loading configuration…" twice (11:24:57 and 11:25:08), and once more at 11:26:02. These appear to come from an inner restart loop inside the same process, not from systemd-initiated restarts, which suggests the gateway runner itself is reinitializing in a loop.
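To check the cadence of these reloads against the ~9-10 second systemd restart interval, the gaps between matching log lines can be computed directly. The snippet below is a small helper written for this report (the function names are illustrative); the log lines are copied from the Phase 2 excerpt.

```javascript
// Log lines copied from the Phase 2 excerpt.
const logLines = [
  '11:24:57 - loading configuration…',
  '11:25:04 - memory-lancedb-pro: llm-client [extract-candidates] request failed: Request timed out.',
  '11:25:08 - loading configuration…',
  '11:25:19 - webchat disconnected code=1001',
  '11:26:02 - loading configuration…',
];

// Convert "HH:MM:SS" to seconds since midnight.
function toSeconds(hms) {
  const [h, m, s] = hms.split(':').map(Number);
  return h * 3600 + m * 60 + s;
}

// Gaps, in seconds, between consecutive lines containing `needle`.
function gapsBetween(lines, needle) {
  const times = lines
    .filter((line) => line.includes(needle))
    .map((line) => toSeconds(line.slice(0, 8)));
  return times.slice(1).map((t, i) => t - times[i]);
}

console.log(gapsBetween(logLines, 'loading configuration')); // [ 11, 54 ]
```

Running the same helper over a longer journal export would show whether the reload spacing stays irregular or settles into the systemd cadence.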

Source Files Implicated

  • gateway-lock-ARBtYsKu.js: lock acquisition with 5000ms timeout
  • pi-embedded-Bcz04p2i.js: compaction retry loop with continue (lines 415, 437, 540, 554, 560, 564, 568)
  • model-context-tokens-UxSPMMtB.js: runPostCompactionSideEffects() triggers emitSessionTranscriptUpdate + syncPostCompactionSessionMemory
  • compact-BqITSh1q.js: compaction execution
  • server-runtime-services-D6xEJ-h2.js: health monitor / startup-grace logic

Key Code Path (pi-embedded-Bcz04p2i.js)

 compaction succeeds → adoptCompactionTranscript() 
   → runOwnsCompactionAfterHook() 
   → runPostCompactionSideEffects() 
   → emitSessionTranscriptUpdate() 
   → syncPostCompactionSessionMemory() 
   → continue (retry prompt)

The continue re-runs the agent turn loop. If the model API call during retry hangs or the post-compaction hooks take too long, the process becomes unresponsive but doesn't crash.
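If a hung model call is the cause, one possible mitigation is to bound the retried call with a deadline so that it fails loudly instead of stalling. The following is a generic sketch, not code from pi-embedded: `withTimeout` and `hungModelCall` are hypothetical names, and the 50ms figure is only for demonstration.

```javascript
// Reject if `work` does not settle within `ms`. `work` receives an AbortSignal
// so that a cooperative callee can cancel its own request.
async function withTimeout(work, ms, label = 'operation') {
  const controller = new AbortController();
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new Error(`${label} timed out after ${ms}ms`));
    }, ms);
  });
  try {
    return await Promise.race([work(controller.signal), timeout]);
  } finally {
    clearTimeout(timer); // never leave a stray timer behind
  }
}

// Hypothetical stand-in for the retried model call: it never settles.
const hungModelCall = () => new Promise(() => {});

withTimeout(hungModelCall, 50, 'retry prompt').catch((err) => {
  console.error(err.message); // retry prompt timed out after 50ms
});
```

A timer-based deadline only helps while the event loop is still running. If the process is blocked in synchronous code, the timer cannot fire, so this would not cover that failure mode.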

Health Monitor Startup Grace

The health monitor has a startup-grace: 60s window during which a gateway is allowed to be slow to respond. After compaction (~40s), the retrying-prompt phase might exceed this grace period, causing the health monitor to mark the gateway as "failed to start."
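To make the timing concern concrete, here is a sketch of a grace window that can be paused while compaction runs. It is purely illustrative: `GraceWindow` is a hypothetical class, the report does not confirm when the real grace clock starts, and the 30-second retry duration is an assumed figure. Only the 60s grace and the 40s compaction time come from the logs.

```javascript
// Hypothetical grace tracker: time spent paused (for example during
// compaction) does not count against the grace budget.
class GraceWindow {
  constructor(graceMs, now = Date.now) {
    this.graceMs = graceMs;
    this.now = now;
    this.startedAt = now();
    this.pausedAt = null;
    this.pausedTotalMs = 0;
  }

  pause() {
    if (this.pausedAt === null) this.pausedAt = this.now();
  }

  resume() {
    if (this.pausedAt !== null) {
      this.pausedTotalMs += this.now() - this.pausedAt;
      this.pausedAt = null;
    }
  }

  // Elapsed time excluding paused intervals.
  elapsedMs() {
    const pausedNow = this.pausedAt === null ? 0 : this.now() - this.pausedAt;
    return this.now() - this.startedAt - this.pausedTotalMs - pausedNow;
  }

  expired() {
    return this.elapsedMs() > this.graceMs;
  }
}

// Worked example with a fake clock.
let fakeNow = 0;
const grace = new GraceWindow(60_000, () => fakeNow);
grace.pause();      // compaction starts
fakeNow += 40_000;  // 11:23:29 -> 11:24:09, from the log
grace.resume();     // compaction done, retry begins
fakeNow += 30_000;  // assumed retry duration
console.log(grace.elapsedMs(), grace.expired()); // 30000 false
```

Without the pause, the same 70 seconds would exceed a 60-second grace, which is the scenario suspected above.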


Minimal Reproduction Steps

  1. Run OpenClaw with an active session (context accumulates over time)
  2. Do enough work to trigger context overflow (~500+ messages in the session)
  3. Wait for auto-compaction to trigger
  4. Observe: the gateway becomes unreachable within ~1 minute after compaction succeeds
  5. systemctl --user status openclaw-gateway.service shows a high restart count

Expected Behavior

After auto-compaction succeeds, the gateway should:

  • Continue processing the original user request (retrying prompt)
  • Remain responsive to new connections
  • NOT enter a restart loop

Suggested Investigation Areas

  1. Post-compaction inner restart loop: The "loading configuration…" messages between 11:24:09 and 11:25:19 suggest the agent runner is reinitializing in a tight loop. Is the continue statement in pi-embedded running without proper guard conditions?

  2. Lock release race condition: When the gateway process becomes unresponsive but doesn't exit, the lock is never released. Consider adding a lock heartbeat or forcing process exit when health monitor declares failure.

  3. isPidAlive false positive: If the old process is alive but not responding to health checks, it's still considered "running" by the lock mechanism. The lock should detect a truly dead/unresponsive process.

  4. Health monitor startup-grace timing: The 60-second startup grace may not account for compaction+retry duration (~40s compaction + model API call for retry). Consider extending grace or pausing the health monitor during compaction.

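  5. Systemd RestartSec: Currently ~9-10 seconds between restarts. If the old process is confirmed to be hung, a KillMode=process with RestartForceExitStatus=78 or similar may be needed. Note that RestartForceExitStatus=78 forces a restart on exit code 78, whereas RestartPreventExitStatus=78 is the directive that stops systemd from restarting on that code.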
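Items 2 and 3 could be approached with a heartbeat stored in the lockfile, so that a newcomer can tell "pid exists" apart from "pid is still making progress". The sketch below assumes a lockfile record of the form `{ pid, heartbeatAt }`, which is not the current format as far as this report can tell; `classifyLockHolder` and `startHeartbeat` are hypothetical names.

```javascript
const fs = require('node:fs');

// Classify the lock holder. `lock` is a hypothetical record
// { pid, heartbeatAt } with heartbeatAt in epoch milliseconds.
function classifyLockHolder(lock, { now, maxHeartbeatAgeMs, pidAlive }) {
  if (!pidAlive(lock.pid)) return 'dead';
  if (now - lock.heartbeatAt > maxHeartbeatAgeMs) return 'unresponsive';
  return 'healthy';
}

// Holder side: refresh the heartbeat periodically. Returns a stop function.
function startHeartbeat(lockPath, intervalMs = 1000) {
  const beat = () => {
    const tmp = `${lockPath}.tmp`;
    fs.writeFileSync(tmp, JSON.stringify({ pid: process.pid, heartbeatAt: Date.now() }));
    fs.renameSync(tmp, lockPath); // atomic replace on POSIX filesystems
  };
  beat();
  const timer = setInterval(beat, intervalMs);
  timer.unref(); // do not keep the process alive just for the heartbeat
  return () => clearInterval(timer);
}

// Example: a holder whose pid exists but whose heartbeat is 30 s old.
const verdict = classifyLockHolder(
  { pid: 56615, heartbeatAt: 0 },
  { now: 30_000, maxHeartbeatAgeMs: 5_000, pidAlive: () => true }
);
console.log(verdict); // unresponsive
```

One caveat: a heartbeat driven by setInterval only shows that the event loop is turning. If the gateway hangs while awaiting a call that never settles, the timer keeps firing, so a stronger signal would be to refresh the heartbeat only after the process has served a real health request.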

