openclaw - ✅ (Solved) Gateway task registry maintenance can hot-loop on stale runs.sqlite [2 pull requests, 1 participant]

openclaw · 2026-04-28 11:15:40

openclaw/openclaw#73517•Fetched 2026-04-29 06:18:55

Author

Lightningxxl

On a long-running OpenClaw installation, openclaw gateway can enter a sustained high-CPU state during task registry maintenance when ~/.openclaw/tasks/runs.sqlite* contains stale task/session entries.

In my case the gateway was technically live, but it consumed most of one CPU core for many minutes and made the machine feel very sluggish. Rebuilding only the task registry database fixed the issue.

PR fix notes

PR #415: fix(openclaw): roll back to 2026.4.22 fat — escape v4.25 NFS+SQLite hang

Repository: Isol8AI/isol8
Author: prez2307
State: closed | merged: True
Link: https://github.com/Isol8AI/isol8/pull/415

Description (problem / solution / changelog)

Summary

Pin openclaw-version.json to alpine/openclaw:2026.4.22 (fat variant). Drops -slim.
Add three schema-required fields to agents.defaults (embeddedHarness, contextLimits, heartbeat) — required by v2026.4.22 zod schema.
Re-enable channel plugins (telegram, discord, slack → enabled: true); the disable was a v4.25-specific defensive patch.

Why

v2026.4.25-slim deterministically wedges every container start. From a live ECS-exec probe on the running gateway:

PID 52 (openclaw-gateway):
  State: D (uninterruptible disk sleep)
  wchan: rpc_wait_bit_killable
  fds:   /home/node/.openclaw/tasks/runs.sqlite{,-wal,-shm}
  mount: 127.0.0.1:/  /home/node/.openclaw  nfs4  hard,port=21005

Matches upstream issue #73517 — task-registry hot-loop on stale runs.sqlite (reported against the same commit aa36ee6). Loopback NFS server inside the openclaw container deadlocks the gateway's main JS thread.

Forward path is blocked too: v2026.4.26 has an unfixed acpx-EPERM regression on remote filesystems (#73333, fix PR #73341 closed but not merged).

So the only safe move is back. We previously ran on 2026.4.22 fat in #406 with no hang. Fat variant has all bundled plugin runtime deps prebaked → no 90s install penalty on first boot. Has CODEX_HOME (added 4.7) so ChatGPT OAuth works.

Test plan

CI builds extended image off 2026.4.22 upstream
Dev redeploys cleanly
Provision a fresh container, watch CloudWatch logs reach [gateway] ready and stay healthy past starting channels and sidecars... (the wedge point)
Backend gateway connection pool establishes WS handshake (no more [ws] closed before connect code=1006)
ChatGPT-OAuth signup path completes end-to-end

Risk

New tag 2026.4.22-bootstrap won't resolve until the extended-image CI workflow runs once and pushes a per-commit tag. First deploy after merge will fail; subsequent deploys after the image build are fine.

🤖 Generated with Claude Code

Changed files

apps/backend/core/containers/config.py (modified, +17/-25)
openclaw-version.json (modified, +5/-5)

Environment

OpenClaw: 2026.4.25 (aa36ee6)
Install method: npm global install
Runtime: Node v25.9.0, npm 11.12.1
OS: Arch Linux systemd user service
Gateway service: openclaw-gateway.service
Model configured: openai-codex/gpt-5.5

Actual behavior

After starting the gateway, the process became live but continued burning CPU:

openclaw-gateway.service active (running)
MainPID=openclaw-gateway
CPU: ~90-105% of one core for several minutes
RSS: commonly ~700-1000 MB during the bad state
/health: sometimes live, but CLI/gateway calls could time out while the box was saturated

Representative samples before the fix:

pid_cpu_pct_30s=100.37 rss_kb=768xxx
pid_cpu_pct_25s=100.40 rss_kb=758xxx
pid_cpu_pct_25s=64.72  rss_kb=710xxx
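
The issue does not say how these samples were taken. One plausible method, sketched here as an assumption, is to read the process's cumulative CPU ticks twice (utime + stime from /proc/<pid>/stat on Linux) and divide the difference by the elapsed wall time. The function name and parameters are hypothetical.

```typescript
// Hypothetical helper: percent of one core used between two samples.
// ticksBefore / ticksAfter: cumulative utime + stime in clock ticks
// (fields 14 and 15 of /proc/<pid>/stat on Linux).
// ticksPerSecond: the value reported by `getconf CLK_TCK`, commonly 100.
function cpuPercentOfOneCore(
  ticksBefore: number,
  ticksAfter: number,
  ticksPerSecond: number,
  elapsedSeconds: number,
): number {
  if (elapsedSeconds <= 0 || ticksPerSecond <= 0) {
    throw new RangeError("elapsedSeconds and ticksPerSecond must be positive");
  }
  // Convert the tick delta to CPU seconds, then to a share of wall time.
  const cpuSeconds = (ticksAfter - ticksBefore) / ticksPerSecond;
  return (cpuSeconds / elapsedSeconds) * 100;
}

// 3000 ticks at 100 ticks/s over a 30 s window is 30 CPU seconds.
console.log(cpuPercentOfOneCore(0, 3000, 100, 30));
```

With this definition, a reading slightly above 100 means the process used more than one core's worth of time in the window, for example when helper threads run alongside a saturated main thread.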

The process log around startup looked normal enough, so this was not obvious from logs alone:

[gateway] ready (1 plugin: memory-core; ...)
[gateway] starting channels and sidecars...
[agents/auth-profiles] kept local oauth over external cli bootstrap-only provider
[codex/catalog] codex model discovery failed; using fallback catalog
[model-pricing] ... timeout
[heartbeat] started
[plugins] [hooks] running gateway_start (1 handlers)

CPU profile evidence

A Node CPU profile pointed to task registry maintenance repeatedly loading/cloning session state while checking stale tasks:

runTaskRegistryMaintenance
  -> shouldMarkLost
  -> hasBackingSession
  -> loadSessionStore
  -> readSessionStoreCache
  -> structuredClone

Another hot stack from the same profile:

shouldMarkLost
  -> hasBackingSession
  -> deriveSessionChatType
  -> iterateBootstrapChannelPlugins
  -> getBootstrapChannelPlugin
  -> resolveActiveBootstrapPlugins
  -> resolveBundledChannelRootScope
  -> resolveBundledPluginsDir
  -> resolveOpenClawPackageRootSync
  -> findPackageRootSync
  -> readPackageNameSync
  -> parsePackageName

Top self samples included:

16.2% structuredClone node:internal/worker/js_transferable
14.1% parsePackageName .../openclaw-root-*.js
 2.5% readFileSync node:fs
 1.5% readFileUtf8
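
The second stack suggests the package root is re-resolved (reading and parsing a package file) for every task checked. If that result cannot change while the process runs, which is an assumption, it could be memoized. A minimal sketch: `memoize` and the stand-in `resolveRoot` are hypothetical and are not OpenClaw APIs.

```typescript
// Wrap a pure, expensive lookup so each distinct key is computed only once.
function memoize<K, V>(compute: (key: K) => V): (key: K) => V {
  const cache = new Map<K, V>();
  return (key: K): V => {
    if (cache.has(key)) {
      return cache.get(key) as V;
    }
    const value = compute(key);
    cache.set(key, value);
    return value;
  };
}

// Stand-in for an expensive synchronous resolution; it counts invocations.
let resolveCalls = 0;
const resolveRoot = (startDir: string): string => {
  resolveCalls += 1;
  // Illustrative only: treat the first two path segments as the "root".
  return startDir.split("/").slice(0, 3).join("/");
};

const resolveRootCached = memoize(resolveRoot);
for (let i = 0; i < 1000; i += 1) {
  resolveRootCached("/opt/openclaw/dist/plugins");
}
console.log(resolveCalls); // 1
```

The trade-off is staleness: a memoized result would not notice a package being moved at runtime, so this only fits lookups that are fixed for the life of the process.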

The relevant task DB files before the workaround were:

~/.openclaw/tasks/runs.sqlite      1740800 bytes
~/.openclaw/tasks/runs.sqlite-wal  4152992 bytes
~/.openclaw/tasks/runs.sqlite-shm    32768 bytes

Things I ruled out

I tested these independently and they were not the root cause:

models.json cache regeneration
Feishu / Lark channel startup
custom model provider definitions and fallbacks
multiple openai-codex auth profiles
the production workspace directory
extension discovery warnings from a separate claworld plugin manifest
~/.openclaw/plugins/installs.json
internal hooks config

For example, a temp HOME using the same OpenClaw version, same Codex credentials, same OpenClaw auth profiles, and even the same production workspace idled normally after startup. The high CPU reproduced only with the production task registry DB in place.

Workaround that fixed it

Stopping the gateway, moving the task registry DB aside, and letting OpenClaw recreate it fixed the high CPU:

systemctl --user stop openclaw-gateway.service

# Use one timestamp so all three files share the same suffix.
ts=$(date +%Y%m%d-%H%M%S)
mv ~/.openclaw/tasks/runs.sqlite     ~/.openclaw/tasks/runs.sqlite.disabled-$ts
mv ~/.openclaw/tasks/runs.sqlite-shm ~/.openclaw/tasks/runs.sqlite-shm.disabled-$ts
mv ~/.openclaw/tasks/runs.sqlite-wal ~/.openclaw/tasks/runs.sqlite-wal.disabled-$ts

systemctl --user restart openclaw-gateway.service

After OpenClaw rebuilt the DB:

/health: {"ok":true,"status":"live"}
gateway inference with openai-codex/gpt-5.5: returned OK
idle CPU sample after warmup: 0.77% over 30s

With the full original config restored, keeping only the regenerated task registry DB, the gateway also idled correctly:

pid_cpu_pct_25s=0.00 rss_kb=524804

Expected behavior

Task registry maintenance should not be able to keep the gateway in a sustained hot loop because of stale/corrupt/large historical run data.

Possible mitigations:

avoid reloading or structuredClone-ing the full session store for each task being checked
cache session-store lookups during a maintenance pass
batch/limit maintenance work per tick
detect pathological or corrupt task registries and quarantine/rebuild them automatically
log a clear warning when maintenance is skipping or quarantining stale task registry data
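
The first three mitigations can be sketched together: load the session store once per maintenance tick and check only a bounded slice of tasks each time. This is a sketch under assumptions, not OpenClaw's implementation. The `Task` and `SessionStore` types, `loadStore`, and `maxPerTick` are all hypothetical, and `maxPerTick` is assumed to be at least 1.

```typescript
type Task = { id: string; sessionKey: string };
type SessionStore = ReadonlyMap<string, unknown>;

// One bounded maintenance tick. The store is loaded once and reused for
// every task in the slice, instead of being loaded once per task.
function maintenanceTick(
  tasks: readonly Task[],
  cursor: number,
  maxPerTick: number,
  loadStore: () => SessionStore,
): { lostTaskIds: string[]; nextCursor: number } {
  const store = loadStore(); // single load (and single clone, if any)
  const end = Math.min(cursor + maxPerTick, tasks.length);
  const lostTaskIds: string[] = [];
  for (let i = cursor; i < end; i += 1) {
    const task = tasks[i];
    if (task !== undefined && !store.has(task.sessionKey)) {
      lostTaskIds.push(task.id); // no backing session: candidate for "lost"
    }
  }
  // Wrap to 0 at the end of the list so the next pass starts over.
  return { lostTaskIds, nextCursor: end >= tasks.length ? 0 : end };
}

// Usage: 5 tasks, 2 per tick; only session "s1" still exists.
let loads = 0;
const loadStore = (): SessionStore => {
  loads += 1;
  return new Map<string, unknown>([["s1", {}]]);
};
const tasks: Task[] = ["s1", "s2", "s1", "s3", "s1"].map((sessionKey, i) => ({
  id: `t${i}`,
  sessionKey,
}));
const first = maintenanceTick(tasks, 0, 2, loadStore);
console.log(first.lostTaskIds, first.nextCursor, loads);
```

Bounding the slice caps the time spent per tick regardless of how many stale rows the registry holds, at the cost of taking several ticks to finish a full pass.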

I did not attach the old runs.sqlite because it may contain private run/session metadata, but I can provide schema/counts or more sanitized profile details if useful.

extent analysis

TL;DR

Rebuilding the task registry database by stopping the gateway, moving the existing database files aside, and letting OpenClaw recreate it can resolve the high CPU issue caused by stale task/session entries.

Guidance

Identify and isolate the issue: Confirm that the high CPU usage is due to the task registry maintenance by analyzing CPU profiles and logs.
Rebuild the task registry database: Stop the gateway, move the existing runs.sqlite, runs.sqlite-shm, and runs.sqlite-wal files to a safe location, and restart the gateway to allow OpenClaw to recreate the database.
Monitor and verify: After rebuilding the database, check the gateway's CPU usage and ensure it returns to normal levels, and verify the gateway's health and functionality.
Consider long-term mitigations: Explore implementing caching for session-store lookups, batching maintenance work, or detecting and quarantining corrupt task registries to prevent similar issues in the future.
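
As an illustration of the quarantine idea, the manual move could be planned in code. The sketch below only computes source and destination paths; the function names are hypothetical, the timestamp uses UTC (the shell workaround uses local time), and performing the renames with the database closed, plus logging a warning, is left to the caller.

```typescript
// Format a Date as YYYYMMDD-HHMMSS in UTC.
function timestamp(now: Date): string {
  const pad = (n: number): string => String(n).padStart(2, "0");
  return (
    `${now.getUTCFullYear()}${pad(now.getUTCMonth() + 1)}${pad(now.getUTCDate())}` +
    `-${pad(now.getUTCHours())}${pad(now.getUTCMinutes())}${pad(now.getUTCSeconds())}`
  );
}

// Hypothetical planner: where each task registry file would be moved.
// The timestamp is computed once so all files land in the same directory.
function planQuarantine(
  tasksDir: string,
  quarantineRoot: string,
  now: Date,
): Array<{ from: string; to: string }> {
  const targetDir = `${quarantineRoot}/${timestamp(now)}`;
  return ["runs.sqlite", "runs.sqlite-wal", "runs.sqlite-shm"].map((name) => ({
    from: `${tasksDir}/${name}`,
    to: `${targetDir}/${name}`,
  }));
}

const plan = planQuarantine(
  "/home/u/.openclaw/tasks",
  "/home/u/.openclaw/_disabled_task_registry",
  new Date(Date.UTC(2026, 3, 28, 11, 15, 40)),
);
console.log(plan);
```

Moving the -wal and -shm files together with the main database keeps the quarantined copy self-consistent if it needs to be inspected later.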

Example

The provided workaround script can be used as a starting point:

systemctl --user stop openclaw-gateway.service
# Compute the backup directory once so mkdir and mv agree on the timestamp.
backup_dir="$HOME/.openclaw/_disabled_task_registry/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$backup_dir"
mv ~/.openclaw/tasks/runs.sqlite* "$backup_dir"/
systemctl --user restart openclaw-gateway.service

Notes

The CPU profile points at task registry maintenance as the source: while checking stale tasks it repeatedly loads and clones the session store (loadSessionStore -> structuredClone) and re-resolves the package root (parsePackageName). Rebuilding the database removes the stale task/session entries that trigger this work, but it does not change the maintenance code, so the problem could recur once enough stale entries accumulate again unless long-term mitigations are implemented.

Recommendation

Apply the workaround by rebuilding the task registry database; it resolved the high CPU usage in the reporter's case. This gives immediate relief while the maintenance code is analyzed further and potential code changes are considered.
