openclaw - ✅ (Solved) Fix: Gateway task registry maintenance can hot-loop on stale runs.sqlite [2 pull requests, 1 participant]

openclaw/openclaw#73517 · Fetched 2026-04-29 06:18:55

PR fix notes

PR #415: fix(openclaw): roll back to 2026.4.22 fat — escape v4.25 NFS+SQLite hang

Description (problem / solution / changelog)

Summary

  • Pin openclaw-version.json to alpine/openclaw:2026.4.22 (fat variant). Drops -slim.
  • Add three schema-required fields to agents.defaults (embeddedHarness, contextLimits, heartbeat) — required by v2026.4.22 zod schema.
  • Re-enable channel plugins (telegram, discord, slack set to enabled: true); the disable was a v4.25-specific defensive patch.

Why

v2026.4.25-slim deterministically wedges every container start. From a live ECS-exec probe on the running gateway:

PID 52 (openclaw-gateway):
  State: D (uninterruptible disk sleep)
  wchan: rpc_wait_bit_killable
  fds:   /home/node/.openclaw/tasks/runs.sqlite{,-wal,-shm}
  mount: 127.0.0.1:/  /home/node/.openclaw  nfs4  hard,port=21005

Matches upstream issue #73517 — task-registry hot-loop on stale runs.sqlite (reported against the same commit aa36ee6). Loopback NFS server inside the openclaw container deadlocks the gateway's main JS thread.
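
A probe like the one above can be collected from /proc on Linux. The following is a minimal sketch, not the commands the PR author ran; probe_pid is a hypothetical helper name, and the wait channel may read as 0 or empty without sufficient privileges.

```shell
# probe_pid PID: print scheduler state, wait channel, and open SQLite files.
# Hypothetical helper; assumes a Linux /proc filesystem.
probe_pid() {
  local pid=$1
  # "State:" line from /proc/<pid>/status, e.g. "State: D (disk sleep)"
  grep '^State:' "/proc/$pid/status"
  # Kernel function the task is currently sleeping in, if readable
  printf 'wchan: %s\n' "$(cat "/proc/$pid/wchan" 2>/dev/null)"
  # Open file descriptors whose targets mention sqlite (none is not an error)
  ls -l "/proc/$pid/fd" 2>/dev/null | grep -F 'sqlite' || true
}
```

Calling probe_pid with the gateway PID prints the State and wchan lines followed by any open runs.sqlite* descriptors. The mount type can be checked separately in /proc/mounts.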

Forward path is blocked too: v2026.4.26 has an unfixed acpx-EPERM regression on remote filesystems (#73333, fix PR #73341 closed but not merged).

So the only safe move is back. We previously ran on 2026.4.22 fat in #406 with no hang. The fat variant has all bundled plugin runtime deps prebaked, so there is no 90s install penalty on first boot. It also has CODEX_HOME (added 4.7), so ChatGPT OAuth works.

Test plan

  • CI builds extended image off 2026.4.22 upstream
  • Dev redeploys cleanly
  • Provision a fresh container, watch CloudWatch logs reach [gateway] ready and stay healthy past starting channels and sidecars... (the wedge point)
  • Backend gateway connection pool establishes WS handshake (no more [ws] closed before connect code=1006)
  • ChatGPT-OAuth signup path completes end-to-end

Risk

  • New tag 2026.4.22-bootstrap won't resolve until the extended-image CI workflow runs once and pushes a per-commit tag. First deploy after merge will fail; subsequent deploys after the image build are fine.

🤖 Generated with Claude Code

Changed files

  • apps/backend/core/containers/config.py (modified, +17/-25)
  • openclaw-version.json (modified, +5/-5)

Summary

On a long-running OpenClaw installation, openclaw gateway can enter a sustained high-CPU state during task registry maintenance when ~/.openclaw/tasks/runs.sqlite* contains stale task/session entries.

In my case the gateway was technically live, but it consumed most of one CPU core for many minutes and made the machine feel very sluggish. Rebuilding only the task registry database fixed the issue.

Environment

  • OpenClaw: 2026.4.25 (aa36ee6)
  • Install method: npm global install
  • Runtime: Node v25.9.0, npm 11.12.1
  • OS: Arch Linux, systemd user service
  • Gateway service: openclaw-gateway.service
  • Model configured: openai-codex/gpt-5.5

Actual behavior

After starting the gateway, the process became live but continued burning CPU:

openclaw-gateway.service active (running)
MainPID=openclaw-gateway
CPU: ~90-105% of one core for several minutes
RSS: commonly ~700-1000 MB during the bad state
/health: sometimes live, but CLI/gateway calls could time out while the box was saturated

Representative samples before the fix:

pid_cpu_pct_30s=100.37 rss_kb=768xxx
pid_cpu_pct_25s=100.40 rss_kb=758xxx
pid_cpu_pct_25s=64.72  rss_kb=710xxx
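
Samples in this shape can be gathered with a short script. The following is a minimal sketch assuming Linux /proc; cpu_ticks and sample_cpu_pct are hypothetical names, and this is not necessarily how the figures above were produced.

```shell
# cpu_ticks PID: total user+system CPU ticks consumed so far by PID.
cpu_ticks() {
  # Strip everything up to the last ") " so a process name containing
  # spaces cannot shift the fields; utime and stime are then fields 12 and 13.
  sed 's/^.*) //' "/proc/$1/stat" | awk '{ print $12 + $13 }'
}

# sample_cpu_pct PID SECONDS: average CPU use over the window,
# expressed as a percentage of one core. SECONDS must be a whole number.
sample_cpu_pct() {
  local pid=$1 secs=$2 hz t0 t1
  hz=$(getconf CLK_TCK)          # clock ticks per second
  t0=$(cpu_ticks "$pid")
  sleep "$secs"
  t1=$(cpu_ticks "$pid")
  awk -v d="$((t1 - t0))" -v hz="$hz" -v s="$secs" \
    'BEGIN { printf "%.2f\n", 100 * d / (hz * s) }'
}

# Example (not run here): sample the gateway for 30 seconds.
# sample_cpu_pct "$(systemctl --user show -p MainPID --value openclaw-gateway.service)" 30
```

A value near 100 means the process kept one core busy for the whole window, which matches the bad state described above.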

The process log around startup looked normal enough, so this was not obvious from logs alone:

[gateway] ready (1 plugin: memory-core; ...)
[gateway] starting channels and sidecars...
[agents/auth-profiles] kept local oauth over external cli bootstrap-only provider
[codex/catalog] codex model discovery failed; using fallback catalog
[model-pricing] ... timeout
[heartbeat] started
[plugins] [hooks] running gateway_start (1 handlers)

CPU profile evidence

A Node CPU profile pointed to task registry maintenance repeatedly loading/cloning session state while checking stale tasks:

runTaskRegistryMaintenance
  -> shouldMarkLost
  -> hasBackingSession
  -> loadSessionStore
  -> readSessionStoreCache
  -> structuredClone

Another hot stack from the same profile:

shouldMarkLost
  -> hasBackingSession
  -> deriveSessionChatType
  -> iterateBootstrapChannelPlugins
  -> getBootstrapChannelPlugin
  -> resolveActiveBootstrapPlugins
  -> resolveBundledChannelRootScope
  -> resolveBundledPluginsDir
  -> resolveOpenClawPackageRootSync
  -> findPackageRootSync
  -> readPackageNameSync
  -> parsePackageName

Top self samples included:

16.2% structuredClone node:internal/worker/js_transferable
14.1% parsePackageName .../openclaw-root-*.js
 2.5% readFileSync node:fs
 1.5% readFileUtf8

The relevant task DB files before the workaround were:

~/.openclaw/tasks/runs.sqlite      1740800 bytes
~/.openclaw/tasks/runs.sqlite-wal  4152992 bytes
~/.openclaw/tasks/runs.sqlite-shm    32768 bytes

Things I ruled out

I tested these independently and they were not the root cause:

  • models.json cache regeneration
  • Feishu / Lark channel startup
  • custom model provider definitions and fallbacks
  • multiple openai-codex auth profiles
  • the production workspace directory
  • extension discovery warnings from a separate claworld plugin manifest
  • ~/.openclaw/plugins/installs.json
  • internal hooks config

For example, a temp HOME using the same OpenClaw version, same Codex credentials, same OpenClaw auth profiles, and even the same production workspace idled normally after startup. The high CPU reproduced only with the production task registry DB in place.
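
One way to approximate this kind of isolation test is to build a scratch HOME that carries over everything except the task registry. The sketch below is an assumption about how such a setup could look, not the reporter's exact procedure; make_scratch_home is a hypothetical helper, and how the gateway is then launched under that HOME is left to the reader.

```shell
# make_scratch_home SRC_DIR: copy an existing ~/.openclaw tree into a fresh
# temporary HOME, leaving out tasks/ so the registry is recreated from scratch.
# Prints the path of the new HOME. Hypothetical helper.
make_scratch_home() {
  local src=$1 scratch
  scratch=$(mktemp -d)
  mkdir -p "$scratch/.openclaw"
  # Copy the directory contents, including dotfiles
  cp -R "$src"/. "$scratch/.openclaw"/
  # Drop the task registry so only the other state is carried over
  rm -rf "$scratch/.openclaw/tasks"
  echo "$scratch"
}
```

If the gateway idles normally under the scratch HOME and spins again once the original tasks/ directory is copied in, the registry is the likely trigger.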

Workaround that fixed it

Stopping the gateway, moving the task registry DB aside, and letting OpenClaw recreate it fixed the high CPU:

systemctl --user stop openclaw-gateway.service

# Use one timestamp for the backup directory and the renamed files
ts=$(date +%Y%m%d-%H%M%S)
mkdir -p ~/.openclaw/_disabled_task_registry/"$ts"
mv ~/.openclaw/tasks/runs.sqlite     ~/.openclaw/tasks/runs.sqlite.disabled-"$ts"
mv ~/.openclaw/tasks/runs.sqlite-shm ~/.openclaw/tasks/runs.sqlite-shm.disabled-"$ts"
mv ~/.openclaw/tasks/runs.sqlite-wal ~/.openclaw/tasks/runs.sqlite-wal.disabled-"$ts"

systemctl --user restart openclaw-gateway.service

After OpenClaw rebuilt the DB:

/health: {"ok":true,"status":"live"}
gateway inference with openai-codex/gpt-5.5: returned OK
idle CPU sample after warmup: 0.77% over 30s

With the full original config restored, keeping only the regenerated task registry DB, the gateway also idled correctly:

pid_cpu_pct_25s=0.00 rss_kb=524804

Expected behavior

Task registry maintenance should not be able to keep the gateway in a sustained hot loop because of stale/corrupt/large historical run data.

Possible mitigations:

  • avoid reloading or calling structuredClone on the full session store for each task being checked
  • cache session-store lookups during a maintenance pass
  • batch/limit maintenance work per tick
  • detect pathological or corrupt task registries and quarantine/rebuild them automatically
  • log a clear warning when maintenance is skipping or quarantining stale task registry data
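
Until something like the quarantine mitigation exists in OpenClaw itself, it can be approximated from outside the process. The sketch below is a rough illustration under stated assumptions: quarantine_registry is a hypothetical helper, combined file size is used only as a crude proxy for a pathological registry (the issue points at stale entries, which size alone does not prove), and the gateway must already be stopped before it runs.

```shell
# quarantine_registry TASKS_DIR DEST_ROOT MAX_BYTES
# If the combined size of runs.sqlite* exceeds MAX_BYTES, move the files into
# DEST_ROOT/<timestamp>, warn on stderr, and print the destination directory.
# Prints nothing when the registry is within the limit. Hypothetical helper.
quarantine_registry() {
  local tasks_dir=$1 dest_root=$2 max_bytes=$3 total=0 f dest
  for f in "$tasks_dir"/runs.sqlite*; do
    [ -f "$f" ] || continue
    total=$((total + $(wc -c < "$f")))
  done
  if [ "$total" -le "$max_bytes" ]; then
    return 0
  fi
  dest="$dest_root/$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$dest"
  echo "warning: task registry is $total bytes (limit $max_bytes); moving it to $dest" >&2
  mv "$tasks_dir"/runs.sqlite* "$dest"/
  echo "$dest"
}
```

The threshold is a judgment call. The files listed earlier in this report totalled roughly 5.9 MB, so a limit in that range would have caught this case, but other installations may need a different value.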

I did not attach the old runs.sqlite because it may contain private run/session metadata, but I can provide schema/counts or more sanitized profile details if useful.

extent analysis

TL;DR

Rebuilding the task registry database by stopping the gateway, moving the existing database files aside, and letting OpenClaw recreate it can resolve the high CPU issue caused by stale task/session entries.

Guidance

  • Identify and isolate the issue: Confirm that the high CPU usage is due to the task registry maintenance by analyzing CPU profiles and logs.
  • Rebuild the task registry database: Stop the gateway, move the existing runs.sqlite, runs.sqlite-shm, and runs.sqlite-wal files to a safe location, and restart the gateway to allow OpenClaw to recreate the database.
  • Monitor and verify: After rebuilding the database, check the gateway's CPU usage and ensure it returns to normal levels, and verify the gateway's health and functionality.
  • Consider long-term mitigations: Explore implementing caching for session-store lookups, batching maintenance work, or detecting and quarantining corrupt task registries to prevent similar issues in the future.

Example

The provided workaround script can be used as a starting point:

systemctl --user stop openclaw-gateway.service
# Compute the destination once so mkdir and mv agree on the directory name
dest=~/.openclaw/_disabled_task_registry/$(date +%Y%m%d-%H%M%S)
mkdir -p "$dest"
mv ~/.openclaw/tasks/runs.sqlite* "$dest"/
systemctl --user restart openclaw-gateway.service

Notes

The root cause of the issue appears to be related to the task registry maintenance mechanism and the presence of stale task/session entries in the database. The provided workaround resolves the issue by rebuilding the database, but further investigation and implementation of long-term mitigations may be necessary to prevent similar issues.

Recommendation

Apply the workaround by rebuilding the task registry database, as it has been proven to resolve the high CPU issue in this specific case. This approach allows for immediate relief while further analysis and potential code changes are pursued.
