openclaw - ✅(Solved) Fix Session management performance degrades severely with subagent usage (100%+ CPU at ~400 sessions) [3 pull requests, 1 participant]

lucca-alma · 2026-03-31T19:10:39Z



Issue: openclaw/openclaw#58534 (fetched 2026-04-08 02:01:28)


PR fix notes

PR #58550: feat(sessions): SQLite-backed two-tier session store — fixes 140%+ CPU at scale

Repository: openclaw/openclaw
Author: alexdeg92
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/58550

Description (problem / solution / changelog)

SQLite-backed Session Store

Problem

The flat sessions.json file causes severe performance issues at scale:

- File grows to 42MB+ with 1000+ sessions
- Every session operation requires reading/writing the entire file
- Results in 140%+ CPU usage and 6+ second response times
- JSON parsing/serialization becomes the bottleneck

Related issues: #58534 (perf), #57497 (Postgres request)

Solution: Two-tier Architecture

Hot Index (SQLite)

A lightweight SQLite database replaces sessions.json for metadata:

~/.openclaw/state/agents/{agentId}/sessions/sessions.sqlite

Schema columns:

- `session_key` (PRIMARY KEY) - session identifier
- `session_id` - UUID
- `updated_at`, `created_at` - timestamps (indexed)
- `channel`, `last_channel`, `last_to`, `last_account_id`, `last_thread_id` - routing
- `label`, `display_name`, `status` - display info
- `model`, `model_provider`, `total_tokens`, `input_tokens`, `output_tokens` - model state
- `message_count`, `archived` - metadata
- `entry_json` - full SessionEntry blob for complex fields

Benefits:

- O(1) session lookups instead of O(n) JSON parsing
- Incremental updates (no full file rewrites)
- Proper indexing for common query patterns
- WAL mode for concurrent read/write
- ~10x faster at 1000+ sessions
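To make the hot-index idea concrete, here is a minimal sketch of what such a table and a single-row upsert could look like. The column names come from the schema list above, but only a subset is shown, and the SQL types, constraints, index names, and the `toUpsertParams` helper are assumptions for illustration, not the PR's actual code.

```typescript
// Sketch only: column names follow the PR description; the SQL types,
// NOT NULL constraints, and index names below are assumptions.
const CREATE_SESSIONS_SQL = `
CREATE TABLE IF NOT EXISTS sessions (
  session_key TEXT PRIMARY KEY,
  session_id  TEXT,
  updated_at  INTEGER,
  created_at  INTEGER,
  status      TEXT,
  archived    INTEGER DEFAULT 0,
  entry_json  TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_sessions_updated_at ON sessions (updated_at);
CREATE INDEX IF NOT EXISTS idx_sessions_created_at ON sessions (created_at);
`;

// Updating one session touches one row instead of rewriting the whole file.
const UPSERT_SESSION_SQL = `
INSERT INTO sessions (session_key, session_id, updated_at, created_at, entry_json)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT(session_key) DO UPDATE SET
  session_id = excluded.session_id,
  updated_at = excluded.updated_at,
  entry_json = excluded.entry_json;
`;

// Hypothetical shape of a session entry; the real SessionEntry has more fields.
interface SessionEntrySketch {
  sessionId: string;
  updatedAt: number;
  createdAt: number;
  [extra: string]: unknown;
}

// Maps an entry to the positional parameters of UPSERT_SESSION_SQL.
// The full entry is kept as a JSON blob so complex fields survive unchanged.
function toUpsertParams(sessionKey: string, entry: SessionEntrySketch): (string | number)[] {
  return [sessionKey, entry.sessionId, entry.updatedAt, entry.createdAt, JSON.stringify(entry)];
}
```

The statements are plain strings so they can be handed to whichever SQLite binding is in use; the PR states that it relies on the built-in `node:sqlite` module.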

Cold Storage (unchanged)

Existing .jsonl transcript files stay as-is:

Per-session files, already efficient
Only loaded on explicit sessions_history calls
Never in the hot path

Configuration

Add to openclaw.json:

{
  "session": {
    "storeType": "sqlite"  // "json" (default) or "sqlite"
  }
}

Migration

Automatic (on first access)

When storeType: "sqlite" is set, existing sessions.json is automatically migrated to SQLite on first load.

Manual (CLI)

# Preview migration
openclaw sessions migrate --dry-run

# Migrate default agent
openclaw sessions migrate

# Migrate all agents
openclaw sessions migrate --all-agents

# Check store info
openclaw sessions store-info

Fallback Behavior

- If SQLite unavailable (Node < 22.5), falls back to JSON automatically
- If SQLite operations fail, falls back to JSON for that operation
- `sessions.json` is preserved during migration (not deleted)
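The per-operation fallback described above can be sketched as a thin wrapper around two backends. This is an illustration under assumptions: `SessionBackend`, `FallbackSessionStore`, and the `fallbacks` log are hypothetical names, and the PR's `store-facade.ts` may be structured quite differently.

```typescript
type SessionMap = Record<string, { sessionId: string; updatedAt: number }>;

// Hypothetical minimal backend contract shared by the SQLite and JSON stores.
interface SessionBackend {
  readonly name: string;
  load(): SessionMap;
  save(sessions: SessionMap): void;
}

class FallbackSessionStore {
  private readonly primary: SessionBackend | null;
  private readonly fallback: SessionBackend;
  // Records each time the primary backend failed, for diagnostics.
  readonly fallbacks: string[] = [];

  // Pass `null` as primary when SQLite is unavailable (e.g. an older Node).
  constructor(primary: SessionBackend | null, fallback: SessionBackend) {
    this.primary = primary;
    this.fallback = fallback;
  }

  load(): SessionMap {
    return this.attempt("load", (backend) => backend.load());
  }

  save(sessions: SessionMap): void {
    this.attempt("save", (backend) => backend.save(sessions));
  }

  // Try the primary backend; on failure use the fallback for this call only.
  private attempt<T>(label: string, op: (backend: SessionBackend) => T): T {
    if (this.primary !== null) {
      try {
        return op(this.primary);
      } catch (err) {
        this.fallbacks.push(`${label}: ${String(err)}`);
      }
    }
    return op(this.fallback);
  }
}
```

One consequence of falling back per operation is that a write can land in the JSON store while later reads succeed against SQLite, so a real implementation would likely need some way to reconcile the two.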

Files Changed

New Files

src/config/sessions/store-sqlite.ts - SQLite storage implementation
src/config/sessions/store-facade.ts - Backend abstraction layer
src/commands/sessions-migrate.ts - Migration command

Modified Files

src/config/types.base.ts - Added SessionStoreType and storeType config
src/config/sessions/store.ts - Integrated facade for load/save
src/cli/program/register.status-health-sessions.ts - CLI commands

Performance Expectations

| Metric | JSON (1000 sessions) | SQLite |
|--------|----------------------|--------|
| Load time | ~800ms | ~15ms |
| Single update | ~800ms | ~5ms |
| List all | ~800ms | ~20ms |
| Memory | 42MB parsed | ~2MB |
| CPU (save) | 100%+ | <5% |

Testing

# Run session store tests
pnpm test -- src/config/sessions/

# Type check
pnpm tsgo

# Lint
pnpm check

Backward Compatibility

Default is storeType: "json" for backward compatibility
Existing sessions.json files continue to work
Migration is opt-in via config or CLI command
SQLite requires Node 22.5+ (built-in node:sqlite)

Closes #58534 Related #57497

Changed files

PR_DESCRIPTION.md (added, +123/-0)
src/cli/program/register.status-health-sessions.ts (modified, +73/-0)
src/commands/sessions-migrate.ts (added, +256/-0)
src/config/schema.base.generated.ts (modified, +17/-0)
src/config/schema.help.ts (modified, +2/-0)
src/config/schema.labels.ts (modified, +1/-0)
src/config/sessions/store-facade.ts (added, +200/-0)
src/config/sessions/store-sqlite.test.ts (added, +288/-0)
src/config/sessions/store-sqlite.ts (added, +614/-0)
src/config/sessions/store.ts (modified, +70/-44)
src/config/types.base.ts (modified, +11/-0)
src/config/zod-schema.session.ts (modified, +1/-0)

PR #59436: perf(gateway): apply limit before building session rows and index child lookups

Repository: openclaw/openclaw
Author: coderredlab
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/59436

Description (problem / solution / changelog)

Summary

Three complementary optimisations for listSessionsFromStore() to reduce sessions.list latency, especially on resource-constrained hosts (Android/proot):

- Early limit application: Pre-sort entries by `entry.updatedAt` and apply `limit` + `activeMinutes` filters before calling `buildGatewaySessionRow()`, avoiding expensive per-row disk I/O (transcript reads, file stats) for sessions that would be discarded.
- Child session index: Replace the O(N²) `resolveChildSessionKeys()` store scan with a one-time `buildChildSessionIndex()` reverse map: O(N) build + O(1) per-row lookup.
- Search pre-filtering: When text search is active, derive all five search-target fields (key, sessionId, label, subject, displayName) from the store entry alone (no disk I/O) and discard non-matching entries before building rows. This allows the limit to be applied early in all code paths.
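The first two optimisations can be sketched in a few lines. This is a simplified illustration, not the code from `session-utils.ts`: the `spawnedBy` field, the `SessionStoreSketch` type, and `listSessionsSketch` are hypothetical, and the `activeMinutes` and search filters are left out.

```typescript
// Hypothetical store entry; `spawnedBy` stands in for however a child
// session records its parent in the real store.
interface StoreEntrySketch {
  updatedAt: number;
  spawnedBy?: string;
}
type SessionStoreSketch = Record<string, StoreEntrySketch>;

// One O(N) pass builds a parent -> children reverse map, so each row
// lookup is a Map.get instead of another scan over the whole store.
function buildChildSessionIndex(store: SessionStoreSketch): Map<string, string[]> {
  const index = new Map<string, string[]>();
  for (const [key, entry] of Object.entries(store)) {
    if (!entry.spawnedBy) continue;
    const children = index.get(entry.spawnedBy) ?? [];
    children.push(key);
    index.set(entry.spawnedBy, children);
  }
  return index;
}

// Sort and slice using store metadata only, then run the expensive
// row builder for at most `limit` entries.
function listSessionsSketch<Row>(
  store: SessionStoreSketch,
  limit: number,
  buildRow: (key: string, entry: StoreEntrySketch, children: string[]) => Row,
): Row[] {
  const childIndex = buildChildSessionIndex(store);
  return Object.entries(store)
    .sort(([, a], [, b]) => b.updatedAt - a.updatedAt)
    .slice(0, limit)
    .map(([key, entry]) => buildRow(key, entry, childIndex.get(key) ?? []));
}
```

With a store of 447 entries and `limit` set to 10, `buildRow` runs at most 10 times, which matches the call counts the PR reports.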

Before / After

With 447 sessions and limit=10 (typical dashboard call):

| Phase | Before | After |
|-------|--------|-------|
| `buildGatewaySessionRow()` calls | 447 | ≤10 (with or without search) |
| `resolveChildSessionKeys` store iterations | 447 × 447 ≈ 200K | 1 index build (447) + 10 lookups |
| Transcript file reads | up to 447 | up to 10 |

Motivation

On resource-constrained hosts (Android/proot where every syscall goes through ptrace) the per-row disk I/O in buildGatewaySessionRow — transcript reads, realpathSync, existsSync, fstatSync — dominates sessions.list latency. With ~450 accumulated sessions the dashboard becomes unresponsive (6.5 s+, CPU 100%).

Refs: #57715, #58534

Test plan

pnpm exec vitest run src/gateway/session-utils.test.ts — 79 tests pass
pnpm exec vitest run src/gateway/sessions-resolve.test.ts — 2 tests pass
Pre-commit hooks (tsgo, lint, conflict markers) pass

Changed files

src/gateway/session-utils.ts (modified, +115/-42)

PR #59464: perf(gateway): start channels without blocking sidecar init

Repository: openclaw/openclaw
Author: coderredlab
State: closed | merged: False
Link: https://github.com/openclaw/openclaw/pull/59464

Description (problem / solution / changelog)

Summary

Fire startChannels() without await in startGatewaySidecars() so channel plugin loading no longer blocks the event loop.

Problem

Channel plugins — particularly WhatsApp (@whiskeysockets/baileys 9.1 MB + jimp 3.9 MB) — pull in large dependency trees. Their dynamic import() chains monopolise the Node.js event loop during module resolution:

[hooks] loaded 5 internal hook handlers      ← event loop blocked from here
                                              ← 95 seconds of silence
[whatsapp] [default] starting provider       ← event loop released

During this window the gateway cannot serve HTTP or WebSocket requests, so the dashboard UI is completely unresponsive even though the server is already listening.

On constrained hosts (Android/proot where every stat/read syscall goes through ptrace) this takes ~95 seconds. Even on macOS it can take 5-8 seconds.

Fix

Wrap the prewarmConfiguredPrimaryModel + startChannels sequence in a void IIFE so it runs in the background. The gateway can start serving requests (dashboard, health checks, pairing) immediately after hooks are loaded. Errors are still caught and logged.

Nothing downstream in startGatewaySidecars depends on channel startup completing — pluginServices, hooks, ACP reconciliation, and memory backend all proceed independently.
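The fire-and-forget pattern described in the fix can be sketched as follows. The function names and the `events` log are stand-ins invented for this example, and the slow channel startup is simulated with a timer; the real change lives in `src/gateway/server-startup.ts`.

```typescript
// Ordered record of what happened, used here instead of real logging.
const events: string[] = [];

// Stand-in for slow channel plugin loading (hypothetical; simulated delay).
async function startChannelsSketch(): Promise<void> {
  await new Promise<void>((resolve) => setTimeout(resolve, 10));
  events.push("channels started");
}

// Simplified sidecar startup: channel loading is kicked off in the
// background so the caller is not held up waiting for it.
function startSidecarsSketch(): void {
  events.push("hooks loaded");

  // `void` marks the promise as intentionally not awaited. Errors are
  // still caught inside the IIFE so they are recorded instead of
  // surfacing as unhandled rejections.
  void (async () => {
    try {
      await startChannelsSketch();
    } catch (err) {
      events.push(`channel startup failed: ${String(err)}`);
    }
  })();

  events.push("sidecars ready");
}
```

Calling `startSidecarsSketch()` returns with "sidecars ready" already recorded, while "channels started" only appears once the simulated load finishes.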

Refs: #58534

Test plan

pnpm exec vitest run src/gateway/server-startup.test.ts — 2 tests pass
Pre-commit hooks (tsgo, lint, conflict markers) pass

Changed files

src/gateway/server-close.test.ts (modified, +1/-1)
src/gateway/server-close.ts (modified, +11/-3)
src/gateway/server-startup.ts (modified, +43/-19)
src/gateway/server.impl.ts (modified, +9/-4)

Issue description (openclaw/openclaw#58534)

Problem

OpenClaw natively supports and encourages subagent spawning for task parallelism, but session accumulation severely impacts host performance, forcing users to sacrifice data retention for operational stability.

Real-World Metrics (from our deployment)

Before (447 sessions accumulated over ~3 days)

| Metric | Value |
|--------|-------|
| `sessions.list` response time | 6.5 seconds |
| Gateway CPU | 100-115% sustained |
| Gateway RAM | 1+ GB |
| Dashboard usability | Effectively frozen |
| Gateway restart | Did not help (sessions persist on disk) |
| Host reboot | Did not help |

After (aggressive manual pruning to 23 sessions)

| Metric | Value |
|--------|-------|
| `sessions.list` response time | 760ms |
| Gateway CPU | 1% |
| Gateway RAM | ~400 MB |
| Dashboard usability | Responsive |

Session Breakdown (typical)

2-4 long-lived main/channel sessions (webchat, Slack)
~400+ ephemeral sessions (subagents, spawn-dispatch, cron jobs)

With dispatcher running every 10 minutes spawning 2 sessions each cycle, accumulation is ~12 sessions/hour × 24h = ~288 sessions/day.
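The accumulation estimate is simple enough to check with a small helper. The function name and parameters are made up for this example, and it assumes a constant spawn rate with nothing archived along the way.

```typescript
// Sessions created per day by a dispatcher that runs on a fixed interval.
// Assumes a constant rate and no archiving.
function sessionsPerDay(intervalMinutes: number, sessionsPerCycle: number): number {
  const cyclesPerHour = 60 / intervalMinutes;
  const sessionsPerHour = cyclesPerHour * sessionsPerCycle;
  return sessionsPerHour * 24;
}

// Every 10 minutes, 2 sessions per cycle: 6 cycles/hour × 2 = 12/hour.
const perDay = sessionsPerDay(10, 2); // 288
```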

Root Cause

- sessions.list serializes all sessions on every call: O(n) with expensive per-session work
- Control UI/Dashboard polls sessions.list frequently (~every 7 seconds), even when not viewing Sessions tab
- archiveAfterMinutes (default 60) doesn't keep pace with spawn rate
- No pagination, caching, or incremental sync

Current Workaround

We had to:

- Manually archive old .jsonl transcripts
- Run openclaw sessions cleanup --enforce --fix-missing
- Set archiveAfterMinutes: 120 (still aggressive)
- Accept losing troubleshooting history to maintain stability

This is a poor tradeoff — we lose the ability to debug issues from hours ago.

Suggested Improvements

- Pagination for sessions.list: don't serialize 400+ sessions per request
- Reduce Control UI poll frequency, or skip sessions.list when not on Sessions view
- Incremental sync: send deltas instead of full list
- Tiered retention config, with separate policies for:
  - Main/channel sessions (keep days/weeks)
  - Subagent sessions (keep hours)
- Hard session count cap with LRU eviction: auto-archive oldest ephemeral sessions when limit reached
- Background proactive cleanup: don't wait for user to hit performance cliff
- Lazy loading: defer full session metadata until requested
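As one possible reading of the session cap suggestion, the sketch below picks which ephemeral sessions to archive once a limit is exceeded, oldest first, and never touches long-lived sessions. `SessionMetaSketch`, the `ephemeral` flag, and `selectForArchive` are hypothetical; OpenClaw has no such API according to the issue.

```typescript
// Hypothetical metadata needed to decide on eviction.
interface SessionMetaSketch {
  key: string;
  updatedAt: number;   // last activity timestamp
  ephemeral: boolean;  // true for subagent / spawn / cron sessions
}

// Returns the keys of the least recently updated ephemeral sessions
// that exceed `maxEphemeral`. Main/channel sessions are never selected.
function selectForArchive(sessions: SessionMetaSketch[], maxEphemeral: number): string[] {
  const ephemeral = sessions
    .filter((s) => s.ephemeral)
    .sort((a, b) => a.updatedAt - b.updatedAt); // oldest first
  const excess = Math.max(0, ephemeral.length - maxEphemeral);
  return ephemeral.slice(0, excess).map((s) => s.key);
}
```

Running a check like this in the background after each spawn would cover the cap and the proactive cleanup ideas together; time-based tiers would need a separate retention setting per session kind.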

Environment

OpenClaw v2026.3.28
macOS 15.7.2 (arm64), Apple Silicon
Workload: Automated dispatcher spawning subagents for issue processing

Impact

This is a significant reliability issue for anyone running automated subagent workflows. The system encourages subagent spawning but doesn't scale session management to match, creating a hidden performance cliff that's hard to diagnose and forces users to choose between history retention and system stability.

extent analysis

TL;DR

Implement pagination for sessions.list and reduce Control UI poll frequency to mitigate session accumulation and improve performance.

Guidance

Implement pagination for sessions.list to reduce the number of sessions serialized per request, improving response time and reducing CPU usage.
Reduce the frequency of Control UI polls or skip sessions.list when not on the Sessions view to decrease the load on the system.
Consider implementing a hard session count cap with LRU eviction to auto-archive oldest ephemeral sessions when the limit is reached, preventing excessive session accumulation.
Review and adjust the archiveAfterMinutes setting to balance history retention with system stability, potentially using a tiered retention config.

Example

No code snippet is provided as the issue does not specify a particular codebase or implementation details.

Notes

The suggested improvements require changes to the OpenClaw implementation, and the effectiveness of these changes may vary depending on the specific use case and workload.

Recommendation

Apply the suggested improvements, starting with pagination for sessions.list and reducing Control UI poll frequency, to mitigate the performance issues caused by session accumulation. This approach addresses the root causes of the problem and provides a more scalable solution than the current workaround.
