openclaw - 💡(How to fix) Fix 🐛 Bug: Auto-compaction triggers gateway crash loop — old process zombie-like after successful compaction [3 pull requests]

openclaw · 2026-05-08 03:40:36

When context overflow triggers auto-compaction, the gateway enters a crash-restart loop after compaction succeeds ("auto-compaction succeeded; retrying prompt"). The old gateway process is left in a zombie-like state: it does not exit, but it does not respond to new connections either. (It is hung rather than a true defunct zombie, since it still holds the lock and the port.) Systemd sees "gateway startup failed" and restarts the service repeatedly; each new process fails to acquire the lock within 5000ms (EEXIST plus a pid-alive check) and exits with code 78, and the cycle repeats. No connection survives more than a few seconds during this loop.

This is reproducible: it has occurred multiple times on this system (May 6, May 7, and May 8).

Fix Action

Fixed

Fixed by PR: fix(gateway): hard-timeout post-compaction retry to prevent indefinite hang (https://github.com/njuboy11/openclaw-fix-compaction/pull/1)
Fixed by PR: fix(gateway): hard-timeout post-compaction retry to prevent indefinite hang (https://github.com/openclaw/openclaw/pull/79237)
Fixed by PR: fix(gateway): treat EADDRINUSE as lock-recovery to break crash loop (https://github.com/openclaw/openclaw/pull/79265)

Environment

OpenClaw version: 2026.5.7 (also reproduced on earlier 5.5/5.6 builds)
Node.js: v22.22.2
OS: Debian with systemd
Launcher: systemd (openclaw-gateway.service)
Start timestamp: 11:23:29 — compaction triggered
Compaction succeeded: 11:24:09 (~40 seconds later)
Gateway became unresponsive: within seconds after 11:24:09
Systemd restart counter: reached 2972+ before manual intervention

Symptoms (from logs)

Phase 1: Normal compaction (OK)

11:23:29 - context overflow detected (attempt 1/3); attempting auto-compaction
11:23:29 - [context-overflow-diag] sessionKey=agent:main:main provider=minimax/MiniMax-M2.7-highspeed
11:24:09 - [compaction] rotated active transcript after compaction
11:24:09 - auto-compaction succeeded for minimax/MiniMax-M2.7-highspeed; retrying prompt
11:24:09 - post-compaction guard armed for 3 attempts
11:24:09 - memory-lancedb-pro@1.1.0-beta.10: plugin registered
11:24:09 - memory-lancedb-pro: diagnostic build tag loaded
11:24:10 - memory-lancedb-pro: injecting 2 memories into context for agent main

Phase 2: Gateway enters crash loop (BUG)

11:24:57 - loading configuration…
11:25:04 - memory-lancedb-pro: llm-client [extract-candidates] request failed: Request timed out.
11:25:04 - memory-pro: smart-extractor: no memories extracted
11:25:04 - memory-lancedb-pro: regex fallback found 22 capturable text(s)
11:25:08 - loading configuration…
11:25:19 - webchat disconnected code=1001
11:25:19 - webchat connected (reconnect attempt)
11:26:02 - loading configuration…
11:26:18 - webchat disconnected code=1001
11:27:07 - webchat disconnected code=1001
11:27:37 - SIGTERM received; restarting  ← user manually restarted

Then the systemd crash loop begins:

Gateway failed to start: gateway already running under systemd; existing gateway is healthy, exiting with code 78 to prevent a systemd Restart=always loop | gateway already running (pid 56615); lock timeout after 5000ms
Port 18789 is already in use.
- pid 56615 root: /usr/bin/node /usr/lib/node_modules/openclaw/dist/index.js gateway --port 18789

This repeats every ~9-10 seconds with systemd restart counter incrementing.

Root Cause Analysis (Code Level)

Lock Mechanism

Gateway uses a file-based lock (gateway-lock-ARBtYsKu.js) to prevent multiple instances:

Attempts to acquire lock via open(path, 'wx')
On EEXIST, reads existing lockfile, checks if pid is alive via isPidAlive()
If alive and same pid: waits up to 5000ms for old process to release
If timeout: throws GatewayLockError with code 78 → systemd sees exit code 78
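The steps above can be sketched in Node.js as follows. This is a minimal illustration, not the code in gateway-lock-ARBtYsKu.js: only the names GatewayLockError and isPidAlive, the 'wx' open flag, the 5000ms timeout and exit code 78 come from this report. The lockfile format (a bare pid), the polling interval, the stale-lock removal and the helper sleepSync are assumptions made for the example.

```javascript
const fs = require("node:fs");

// Name taken from the report; the shape of the class is an assumption.
class GatewayLockError extends Error {
  constructor(message) {
    super(message);
    this.name = "GatewayLockError";
    this.exitCode = 78; // exit code the report says systemd observes
  }
}

// Signal 0 performs an existence/permission check without delivering a signal.
// Note that this only proves the pid exists, not that the process is responsive.
function isPidAlive(pid) {
  if (!Number.isInteger(pid) || pid <= 0) return false;
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    return err.code === "EPERM"; // exists, but owned by another user
  }
}

// Hypothetical helper: blocks the current thread for roughly `ms` milliseconds.
function sleepSync(ms) {
  Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
}

function acquireLock(lockPath, { timeoutMs = 5000, pollMs = 100 } = {}) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    try {
      // 'wx' creates the file and fails with EEXIST if it already exists.
      const fd = fs.openSync(lockPath, "wx");
      fs.writeSync(fd, String(process.pid));
      fs.closeSync(fd);
      return;
    } catch (err) {
      if (err.code !== "EEXIST") throw err;
    }

    let ownerPid;
    try {
      ownerPid = Number(fs.readFileSync(lockPath, "utf8").trim());
    } catch (err) {
      if (err.code === "ENOENT") continue; // holder released it; try again
      throw err;
    }

    if (!isPidAlive(ownerPid)) {
      // An empty or unparsable file is also treated as stale here; a real
      // lock would need to handle the create-then-write race more carefully.
      fs.rmSync(lockPath, { force: true });
      continue;
    }
    if (Date.now() >= deadline) {
      throw new GatewayLockError(
        `gateway already running (pid ${ownerPid}); lock timeout after ${timeoutMs}ms`
      );
    }
    sleepSync(pollMs);
  }
}
```

With this shape, a holder that is alive but hung is indistinguishable from a healthy one, so every new instance waits out the full timeout and then fails, which matches the loop described in this report.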

The Bug

After compaction succeeds, the old gateway process (pid 56615) remains alive but becomes unresponsive — it holds the lock but stops processing requests. The health monitor (server-runtime-services-D6xEJ-h2.js) sees the gateway as "startup failed" and systemd repeatedly tries to start new instances, each of which:

Attempts to acquire the lock and fails with EEXIST
Checks pid 56615 and finds it still alive (it is, but hung)
Waits up to 5000ms
Times out → GatewayLockError → exits with code 78
Systemd restarts immediately → repeat

The old process never voluntarily releases the lock because it doesn't detect it needs to restart — it just stops responding during the retry-prompt phase.

Key Observation: `load_config` Loop

After compaction succeeded at 11:24:09, the gateway repeatedly logs "loading configuration…": at 11:24:57 and 11:25:08, before the first webchat disconnect at 11:25:19, and again at 11:26:02. These appear to come from an inner restart loop inside the same process, not from systemd-initiated restarts, which suggests the gateway runner itself is reinitializing repeatedly.

Source Files Implicated

gateway-lock-ARBtYsKu.js — lock acquisition with 5000ms timeout
pi-embedded-Bcz04p2i.js — compaction retry loop with continue (lines 415, 437, 540, 554, 560, 564, 568)
model-context-tokens-UxSPMMtB.js — runPostCompactionSideEffects() triggers emitSessionTranscriptUpdate + syncPostCompactionSessionMemory
compact-BqITSh1q.js — compaction execution
server-runtime-services-D6xEJ-h2.js — health monitor / startup-grace logic

Key Code Path (pi-embedded-Bcz04p2i.js)

 compaction succeeds → adoptCompactionTranscript() 
   → runOwnsCompactionAfterHook() 
   → runPostCompactionSideEffects() 
   → emitSessionTranscriptUpdate() 
   → syncPostCompactionSessionMemory() 
   → continue (retry prompt)

The continue re-runs the agent turn loop. If the model API call during retry hangs or the post-compaction hooks take too long, the process becomes unresponsive but doesn't crash.
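One way to bound that retry is a hard timeout around the awaited call, which is the direction the PR titles listed under Fix Action point to. The sketch below is not the code from those PRs. The names withHardTimeout and RetryTimeoutError are hypothetical, and it assumes the retried operation accepts an AbortSignal.

```javascript
// Hypothetical error type used to distinguish a timeout from other failures.
class RetryTimeoutError extends Error {
  constructor(label, timeoutMs) {
    super(`${label} exceeded hard timeout of ${timeoutMs}ms`);
    this.name = "RetryTimeoutError";
  }
}

// Runs task(signal) and rejects with RetryTimeoutError if it has not settled
// within timeoutMs. The signal is aborted on timeout so a cooperative task
// can cancel its in-flight work (for example an HTTP request).
function withHardTimeout(task, timeoutMs, label = "task") {
  const controller = new AbortController();
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new RetryTimeoutError(label, timeoutMs));
    }, timeoutMs);
  });
  // Promise.resolve().then(...) turns a synchronous throw inside task into a
  // rejection, so the timer is always cleared in finally().
  const run = Promise.resolve().then(() => task(controller.signal));
  return Promise.race([run, timeout]).finally(() => clearTimeout(timer));
}
```

A caller in the retry loop could then catch RetryTimeoutError and fail the turn instead of waiting indefinitely. The timeout only stops the caller from waiting; a task that ignores the signal keeps running in the background.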

Health Monitor Startup Grace

The health monitor has a startup-grace: 60s window during which a gateway is allowed to be slow to respond. After compaction (~40s), the retrying-prompt phase might exceed this grace period, causing the health monitor to mark the gateway as "failed to start."
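To make the timing concrete, the intervals can be computed from the log timestamps quoted in this report. This is plain arithmetic on those timestamps; the helper names are hypothetical, and when the 60s grace window actually starts is not known from the logs, so the comparison is only indicative.

```javascript
// Convert an "HH:MM:SS" log timestamp to seconds since midnight.
function toSeconds(hms) {
  const [h, m, s] = hms.split(":").map(Number);
  return h * 3600 + m * 60 + s;
}

// Seconds elapsed between two same-day timestamps.
function elapsed(from, to) {
  return toSeconds(to) - toSeconds(from);
}

const STARTUP_GRACE_S = 60; // grace window named in the report

const compactionTriggered = "11:23:29";
const compactionSucceeded = "11:24:09";
const firstConfigReload = "11:24:57"; // first "loading configuration…" line

const timeline = {
  compactionSeconds: elapsed(compactionTriggered, compactionSucceeded),
  retryToReloadSeconds: elapsed(compactionSucceeded, firstConfigReload),
  triggerToReloadSeconds: elapsed(compactionTriggered, firstConfigReload),
};

console.log(timeline);
console.log(
  "exceeds grace if measured from compaction trigger:",
  timeline.triggerToReloadSeconds > STARTUP_GRACE_S
);
```

Compaction took 40 seconds and the first reload followed 48 seconds later, so a 60s window measured from the compaction trigger would already have elapsed by then.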

Minimal Reproduction Steps

Run OpenClaw with an active session (context accumulates over time)
Do enough work to trigger context overflow (~500+ messages in the session)
Wait for auto-compaction to trigger
Observe: the gateway becomes unreachable within ~1 minute after compaction succeeds
Check systemctl --user status openclaw-gateway.service, which shows a high restart count

Expected Behavior

After auto-compaction succeeds, the gateway should:

Continue processing the original user request (retrying prompt)
Remain responsive to new connections
NOT enter a restart loop

Suggested Investigation Areas

Post-compaction inner restart loop: The "loading configuration…" messages between 11:24:09 and 11:25:19 suggest the agent runner is reinitializing in a tight loop. Is the continue statement in pi-embedded running without proper guard conditions?
Lock release race condition: When the gateway process becomes unresponsive but doesn't exit, the lock is never released. Consider adding a lock heartbeat or forcing process exit when health monitor declares failure.
isPidAlive false positive: If the old process is alive but not responding to health checks, it's still considered "running" by the lock mechanism. The lock should detect a truly dead/unresponsive process.
Health monitor startup-grace timing: The 60-second startup grace may not account for compaction+retry duration (~40s compaction + model API call for retry). Consider extending grace or pausing the health monitor during compaction.
Systemd RestartSec: Currently ~9-10 seconds between restarts. If the old process is confirmed zombie, a KillMode=process with RestartForceExitStatus=78 or similar may be needed.
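The lock heartbeat idea from the list above could look roughly like this. It is a sketch under assumptions, not OpenClaw code: the JSON lockfile layout, the function names and the 5s/15s intervals are all hypothetical.

```javascript
const fs = require("node:fs");

// Holder side: record the pid and the time of the latest heartbeat.
function writeHeartbeat(lockPath, now = Date.now()) {
  fs.writeFileSync(lockPath, JSON.stringify({ pid: process.pid, heartbeatAt: now }));
}

// Holder side: refresh the heartbeat on a timer. Returns a stop function.
function startHeartbeat(lockPath, intervalMs = 5000) {
  writeHeartbeat(lockPath);
  const timer = setInterval(() => writeHeartbeat(lockPath), intervalMs);
  timer.unref(); // do not keep the process alive just for the heartbeat
  return () => clearInterval(timer);
}

// Acquirer side: a lock is stale if it is missing, unreadable, or its last
// heartbeat is older than maxAgeMs, regardless of whether the pid exists.
function isLockStale(lockPath, { maxAgeMs = 15000, now = Date.now() } = {}) {
  let info;
  try {
    info = JSON.parse(fs.readFileSync(lockPath, "utf8"));
  } catch {
    return true;
  }
  if (!info || typeof info.heartbeatAt !== "number") return true;
  return now - info.heartbeatAt > maxAgeMs;
}
```

One caveat on the design: a timer-driven heartbeat only stops when the event loop itself is blocked. A process that is stuck awaiting a model API call still runs its timers, so it would keep looking healthy. To catch that case, the heartbeat would have to be written from the request-handling path, or combined with a hard timeout on the retry.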
