openclaw - ✅(Solved) Fix Session management performance degrades severely with subagent usage (100%+ CPU at ~400 sessions) [3 pull requests, 1 participant]

GitHub stats
openclaw/openclaw#58534 (fetched 2026-04-08 02:01:28)

  • Comments: 0
  • Participants: 1
  • Timeline: 10 (referenced ×6, cross-referenced ×4)
  • Reactions: 0

PR fix notes

PR #58550: feat(sessions): SQLite-backed two-tier session store — fixes 140%+ CPU at scale

Description (problem / solution / changelog)

SQLite-backed Session Store

Problem

The flat sessions.json file causes severe performance issues at scale:

  • File grows to 42MB+ with 1000+ sessions
  • Every session operation requires reading/writing the entire file
  • Results in 140%+ CPU usage and 6+ second response times
  • JSON parsing/serialization becomes the bottleneck

Related issues: #58534 (perf), #57497 (Postgres request)

Solution: Two-tier Architecture

Hot Index (SQLite)

A lightweight SQLite database replaces sessions.json for metadata:

~/.openclaw/state/agents/{agentId}/sessions/sessions.sqlite

Schema columns:

  • session_key (PRIMARY KEY) - session identifier
  • session_id - UUID
  • updated_at, created_at - timestamps (indexed)
  • channel, last_channel, last_to, last_account_id, last_thread_id - routing
  • label, display_name, status - display info
  • model, model_provider, total_tokens, input_tokens, output_tokens - model state
  • message_count, archived - metadata
  • entry_json - full SessionEntry blob for complex fields
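As a rough illustration of how the columns above might relate to a stored entry, the sketch below flattens a session entry into an index row and keeps the complete entry in entry_json. The SessionEntryLike type and the toIndexRow / fromIndexRow helpers are hypothetical names, not taken from the PR, and only a subset of the listed columns is shown.

```typescript
// Hypothetical shape; the real SessionEntry in OpenClaw has more fields.
interface SessionEntryLike {
  sessionId: string;
  updatedAt: number; // assumed to be epoch milliseconds
  createdAt?: number;
  channel?: string;
  label?: string;
  displayName?: string;
  model?: string;
  totalTokens?: number;
  archived?: boolean;
  [extra: string]: unknown; // complex fields survive only inside entry_json
}

// Subset of the schema columns listed above.
interface SessionIndexRow {
  session_key: string;
  session_id: string;
  updated_at: number;
  created_at: number | null;
  channel: string | null;
  label: string | null;
  display_name: string | null;
  model: string | null;
  total_tokens: number;
  archived: 0 | 1; // SQLite has no boolean type, so store 0/1
  entry_json: string;
}

// Copy the queryable fields into columns and serialize the whole entry once.
function toIndexRow(sessionKey: string, entry: SessionEntryLike): SessionIndexRow {
  return {
    session_key: sessionKey,
    session_id: entry.sessionId,
    updated_at: entry.updatedAt,
    created_at: entry.createdAt ?? null,
    channel: entry.channel ?? null,
    label: entry.label ?? null,
    display_name: entry.displayName ?? null,
    model: entry.model ?? null,
    total_tokens: entry.totalTokens ?? 0,
    archived: entry.archived ? 1 : 0,
    entry_json: JSON.stringify(entry),
  };
}

// The full entry is recovered from the blob, not from the individual columns.
function fromIndexRow(row: SessionIndexRow): SessionEntryLike {
  return JSON.parse(row.entry_json) as SessionEntryLike;
}
```

Under this layout, listing and filtering can read the flat columns only, while entry_json is parsed just for the rows that are actually returned.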

Benefits:

  • O(1) session lookups instead of O(n) JSON parsing
  • Incremental updates (no full file rewrites)
  • Proper indexing for common query patterns
  • WAL mode for concurrent read/write
  • ~10x faster at 1000+ sessions

Cold Storage (unchanged)

Existing .jsonl transcript files stay as-is:

  • Per-session files, already efficient
  • Only loaded on explicit sessions_history calls
  • Never in the hot path

Configuration

Add to openclaw.json:

{
  "session": {
    "storeType": "sqlite"  // "json" (default) or "sqlite"
  }
}

Migration

Automatic (on first access)

When storeType: "sqlite" is set, existing sessions.json is automatically migrated to SQLite on first load.

Manual (CLI)

# Preview migration
openclaw sessions migrate --dry-run

# Migrate default agent
openclaw sessions migrate

# Migrate all agents
openclaw sessions migrate --all-agents

# Check store info
openclaw sessions store-info

Fallback Behavior

  • If SQLite is unavailable (Node < 22.5), the store falls back to JSON automatically
  • If a SQLite operation fails, the store falls back to JSON for that operation
  • sessions.json is preserved during migration (not deleted)
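The fallback rules above can be sketched as two small helpers. This is a minimal sketch under assumptions: SessionStoreType is the type name mentioned in the PR, but resolveStoreType, supportsNodeSqlite, and withJsonFallback are hypothetical names, and the real implementation may detect SQLite availability differently (for example by attempting to load node:sqlite).

```typescript
type SessionStoreType = "json" | "sqlite";

// True when a Node version string such as "v22.5.1" is at least 22.5,
// the minimum the PR states for the built-in node:sqlite module.
function supportsNodeSqlite(nodeVersion: string): boolean {
  const parts = nodeVersion.replace(/^v/, "").split(".").map(Number);
  const major = parts[0] ?? 0;
  const minor = parts[1] ?? 0;
  return major > 22 || (major === 22 && minor >= 5);
}

// "json" stays the default; "sqlite" is honoured only when the runtime supports it.
function resolveStoreType(
  configured: SessionStoreType | undefined,
  nodeVersion: string,
): SessionStoreType {
  const requested = configured ?? "json";
  if (requested === "sqlite" && !supportsNodeSqlite(nodeVersion)) {
    return "json";
  }
  return requested;
}

// Per-operation fallback: try the SQLite path, use the JSON path if it throws.
function withJsonFallback<T>(sqliteOp: () => T, jsonOp: () => T): T {
  try {
    return sqliteOp();
  } catch {
    return jsonOp();
  }
}
```

For example, resolveStoreType("sqlite", "v22.4.1") yields "json", while the same call with "v22.5.0" yields "sqlite".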

Files Changed

New Files

  • src/config/sessions/store-sqlite.ts - SQLite storage implementation
  • src/config/sessions/store-facade.ts - Backend abstraction layer
  • src/commands/sessions-migrate.ts - Migration command

Modified Files

  • src/config/types.base.ts - Added SessionStoreType and storeType config
  • src/config/sessions/store.ts - Integrated facade for load/save
  • src/cli/program/register.status-health-sessions.ts - CLI commands

Performance Expectations

| Metric | JSON (1000 sessions) | SQLite |
| --- | --- | --- |
| Load time | ~800ms | ~15ms |
| Single update | ~800ms | ~5ms |
| List all | ~800ms | ~20ms |
| Memory | 42MB parsed | ~2MB |
| CPU (save) | 100%+ | <5% |

Testing

# Run session store tests
pnpm test -- src/config/sessions/

# Type check
pnpm tsgo

# Lint
pnpm check

Backward Compatibility

  • Default is storeType: "json" for backward compatibility
  • Existing sessions.json files continue to work
  • Migration is opt-in via config or CLI command
  • SQLite requires Node 22.5+ (built-in node:sqlite)

Closes #58534. Related: #57497

Changed files

  • PR_DESCRIPTION.md (added, +123/-0)
  • src/cli/program/register.status-health-sessions.ts (modified, +73/-0)
  • src/commands/sessions-migrate.ts (added, +256/-0)
  • src/config/schema.base.generated.ts (modified, +17/-0)
  • src/config/schema.help.ts (modified, +2/-0)
  • src/config/schema.labels.ts (modified, +1/-0)
  • src/config/sessions/store-facade.ts (added, +200/-0)
  • src/config/sessions/store-sqlite.test.ts (added, +288/-0)
  • src/config/sessions/store-sqlite.ts (added, +614/-0)
  • src/config/sessions/store.ts (modified, +70/-44)
  • src/config/types.base.ts (modified, +11/-0)
  • src/config/zod-schema.session.ts (modified, +1/-0)

PR #59436: perf(gateway): apply limit before building session rows and index child lookups

Description (problem / solution / changelog)

Summary

Three complementary optimisations for listSessionsFromStore() to reduce sessions.list latency, especially on resource-constrained hosts (Android/proot):

  • Early limit application: Pre-sort entries by entry.updatedAt and apply limit + activeMinutes filters before calling buildGatewaySessionRow(), avoiding expensive per-row disk I/O (transcript reads, file stats) for sessions that would be discarded.
  • Child session index: Replace the O(N²) resolveChildSessionKeys() store scan with a one-time buildChildSessionIndex() reverse map — O(N) build + O(1) per-row lookup.
  • Search pre-filtering: When text search is active, derive all five search-target fields (key, sessionId, label, subject, displayName) from the store entry alone — no disk I/O — and discard non-matching entries before building rows. This allows the limit to be applied early in all code paths.
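The early-limit idea in the first bullet can be sketched as follows. This is an illustration under assumptions, not the PR's code: StoreEntry, ListOptions, selectEntriesForRows, and buildRow are hypothetical stand-ins, with buildRow playing the role of the expensive buildGatewaySessionRow() and a counter showing how often it runs.

```typescript
// Hypothetical minimal store entry; only updatedAt matters for ordering.
interface StoreEntry {
  updatedAt?: number; // assumed epoch milliseconds
}

interface ListOptions {
  limit?: number;
  activeMinutes?: number;
  now: number; // passed in so the function stays deterministic
}

// Cheap phase: filter, sort, and cut using store metadata only (no disk I/O).
function selectEntriesForRows(
  store: Record<string, StoreEntry>,
  opts: ListOptions,
): Array<[string, StoreEntry]> {
  let entries = Object.entries(store);
  if (opts.activeMinutes !== undefined) {
    const cutoff = opts.now - opts.activeMinutes * 60000;
    entries = entries.filter(([, entry]) => (entry.updatedAt ?? 0) >= cutoff);
  }
  entries.sort(([, a], [, b]) => (b.updatedAt ?? 0) - (a.updatedAt ?? 0));
  return opts.limit !== undefined ? entries.slice(0, opts.limit) : entries;
}

// Stand-in for the expensive per-row builder; counts its invocations.
let rowBuilds = 0;
function buildRow(key: string, entry: StoreEntry): { key: string; updatedAt: number } {
  rowBuilds += 1;
  return { key, updatedAt: entry.updatedAt ?? 0 };
}

// Expensive phase runs only for the entries that survive the cut.
function listSessions(
  store: Record<string, StoreEntry>,
  opts: ListOptions,
): Array<{ key: string; updatedAt: number }> {
  return selectEntriesForRows(store, opts).map(([key, entry]) => buildRow(key, entry));
}
```

With a store of 447 entries and limit 10, buildRow runs 10 times in this sketch rather than 447, which is the effect the PR describes.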

Before / After

With 447 sessions and limit=10 (typical dashboard call):

| Phase | Before | After |
| --- | --- | --- |
| buildGatewaySessionRow() calls | 447 | ≤10 (with or without search) |
| resolveChildSessionKeys store iterations | 447 × 447 ≈ 200K | 1 index build (447) + 10 lookups |
| Transcript file reads | up to 447 | up to 10 |
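The iteration counts in the table follow from replacing a per-row scan with a reverse map that is built once. The sketch below shows that technique in isolation. The function name buildChildSessionIndex comes from the PR, but the body, the EntryWithParent type, and the parentKey field are assumptions made for illustration; the real entry may record its parent under a different field.

```typescript
// Hypothetical entry shape: parentKey names the session that spawned this one.
interface EntryWithParent {
  parentKey?: string;
}

// Naive approach: one full store scan per parent, so N rows cost N × N iterations.
function scanChildKeys(store: Record<string, EntryWithParent>, parent: string): string[] {
  return Object.entries(store)
    .filter(([, entry]) => entry.parentKey === parent)
    .map(([key]) => key);
}

// Indexed approach: a single O(N) pass builds parent -> children.
function buildChildSessionIndex(
  store: Record<string, EntryWithParent>,
): Map<string, string[]> {
  const index = new Map<string, string[]>();
  for (const [key, entry] of Object.entries(store)) {
    if (!entry.parentKey) continue;
    const children = index.get(entry.parentKey);
    if (children) {
      children.push(key);
    } else {
      index.set(entry.parentKey, [key]);
    }
  }
  return index;
}

// Each row then needs only an O(1) map lookup.
function childKeysFromIndex(index: Map<string, string[]>, parent: string): string[] {
  return index.get(parent) ?? [];
}

// Iteration counts for the table's scenario: 447 sessions, 10 rows returned.
const naiveIterations = 447 * 447; // 199809, roughly 200K
const indexedIterations = 447 + 10; // one build pass plus ten lookups
```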

Motivation

On resource-constrained hosts (Android/proot where every syscall goes through ptrace) the per-row disk I/O in buildGatewaySessionRow — transcript reads, realpathSync, existsSync, fstatSync — dominates sessions.list latency. With ~450 accumulated sessions the dashboard becomes unresponsive (6.5 s+, CPU 100%).

Refs: #57715, #58534

Test plan

  • pnpm exec vitest run src/gateway/session-utils.test.ts — 79 tests pass
  • pnpm exec vitest run src/gateway/sessions-resolve.test.ts — 2 tests pass
  • Pre-commit hooks (tsgo, lint, conflict markers) pass

Changed files

  • src/gateway/session-utils.ts (modified, +115/-42)

PR #59464: perf(gateway): start channels without blocking sidecar init

Description (problem / solution / changelog)

Summary

  • Fire startChannels() without await in startGatewaySidecars() so channel plugin loading no longer blocks the event loop.

Problem

Channel plugins — particularly WhatsApp (@whiskeysockets/baileys 9.1 MB + jimp 3.9 MB) — pull in large dependency trees. Their dynamic import() chains monopolise the Node.js event loop during module resolution:

[hooks] loaded 5 internal hook handlers      ← event loop blocked from here
                                              ← 95 seconds of silence
[whatsapp] [default] starting provider       ← event loop released

During this window the gateway cannot serve HTTP or WebSocket requests, so the dashboard UI is completely unresponsive even though the server is already listening.

On constrained hosts (Android/proot where every stat/read syscall goes through ptrace) this takes ~95 seconds. Even on macOS it can take 5-8 seconds.

Fix

Wrap the prewarmConfiguredPrimaryModel + startChannels sequence in a void IIFE so it runs in the background. The gateway can start serving requests (dashboard, health checks, pairing) immediately after hooks are loaded. Errors are still caught and logged.

Nothing downstream in startGatewaySidecars depends on channel startup completing — pluginServices, hooks, ACP reconciliation, and memory backend all proceed independently.
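The void IIFE pattern described above can be sketched like this. The two async functions are trivial stand-ins that share names with the functions mentioned in the PR; their bodies, the events array, and the exact ordering of surrounding steps are assumptions for illustration only.

```typescript
const events: string[] = [];

// Stand-in: the real function warms up the configured model.
async function prewarmConfiguredPrimaryModel(): Promise<void> {
  await Promise.resolve();
  events.push("model prewarmed");
}

// Stand-in: the real function loads and starts channel plugins.
async function startChannels(): Promise<void> {
  await Promise.resolve();
  events.push("channels started");
}

function startGatewaySidecars(): void {
  events.push("hooks loaded");

  // Fire and forget: `void` marks the promise as intentionally not awaited.
  void (async () => {
    try {
      await prewarmConfiguredPrimaryModel();
      await startChannels();
    } catch (err) {
      // Errors are still caught and recorded instead of becoming unhandled rejections.
      events.push("channel startup failed: " + String(err));
    }
  })();

  // Reached before the background work has finished.
  events.push("sidecars ready");
}

startGatewaySidecars();
```

In this sketch "sidecars ready" is recorded immediately after "hooks loaded", and the two background steps complete on later ticks. The pattern defers only the awaited portions: any long synchronous stretch inside the background task would still occupy the single JavaScript thread while it runs.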

Refs: #58534

Test plan

  • pnpm exec vitest run src/gateway/server-startup.test.ts — 2 tests pass
  • Pre-commit hooks (tsgo, lint, conflict markers) pass

Changed files

  • src/gateway/server-close.test.ts (modified, +1/-1)
  • src/gateway/server-close.ts (modified, +11/-3)
  • src/gateway/server-startup.ts (modified, +43/-19)
  • src/gateway/server.impl.ts (modified, +9/-4)
Original issue (#58534)

Problem

OpenClaw natively supports and encourages subagent spawning for task parallelism, but session accumulation severely impacts host performance, forcing users to sacrifice data retention for operational stability.

Real-World Metrics (from our deployment)

Before (447 sessions accumulated over ~3 days)

| Metric | Value |
| --- | --- |
| sessions.list response time | 6.5 seconds |
| Gateway CPU | 100-115% sustained |
| Gateway RAM | 1+ GB |
| Dashboard usability | Effectively frozen |
| Gateway restart | Did not help (sessions persist on disk) |
| Host reboot | Did not help |

After (aggressive manual pruning to 23 sessions)

| Metric | Value |
| --- | --- |
| sessions.list response time | 760ms |
| Gateway CPU | 1% |
| Gateway RAM | ~400 MB |
| Dashboard usability | Responsive |

Session Breakdown (typical)

  • 2-4 long-lived main/channel sessions (webchat, Slack)
  • ~400+ ephemeral sessions (subagents, spawn-dispatch, cron jobs)

With the dispatcher running every 10 minutes and spawning 2 sessions per cycle, accumulation is 6 cycles/hour × 2 = ~12 sessions/hour, and ~12 sessions/hour × 24h = ~288 sessions/day.
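The accumulation estimate can be reproduced for other dispatcher settings with a one-line calculation. The helper name sessionsPerDay is hypothetical, and the figure ignores any sessions removed by archiving in the meantime.

```typescript
// Sessions created per day by a dispatcher that fires at a fixed interval.
function sessionsPerDay(intervalMinutes: number, sessionsPerCycle: number): number {
  const cyclesPerHour = 60 / intervalMinutes;
  return cyclesPerHour * sessionsPerCycle * 24;
}

// The deployment described here: every 10 minutes, 2 sessions per cycle.
const reportedRate = sessionsPerDay(10, 2); // 6 cycles/hour × 2 × 24 = 288
```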

Root Cause

  1. sessions.list serializes all sessions on every call — O(n) with expensive per-session work
  2. Control UI/Dashboard polls sessions.list frequently (~every 7 seconds), even when not viewing Sessions tab
  3. archiveAfterMinutes (default 60) doesn't keep pace with spawn rate
  4. No pagination, caching, or incremental sync
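Points 1 and 2 compound: when a single sessions.list call takes nearly as long as the poll interval, a request is in flight almost all the time. The back-of-the-envelope sketch below uses the reported 6.5 s and 760 ms response times and the ~7 s poll interval. It assumes one polling client and non-overlapping requests, and pollDutyCycle is a hypothetical helper; the result is the share of wall-clock time with a request in flight, not a CPU measurement.

```typescript
// Fraction of each poll interval during which a sessions.list call is in flight.
function pollDutyCycle(responseSeconds: number, pollIntervalSeconds: number): number {
  return Math.min(1, responseSeconds / pollIntervalSeconds);
}

const dutyAt447Sessions = pollDutyCycle(6.5, 7); // about 0.93
const dutyAt23Sessions = pollDutyCycle(0.76, 7); // about 0.11
```

A duty cycle near 1 is consistent with the sustained load reported at 447 sessions, and each additional dashboard client would add its own polling on top.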

Current Workaround

We had to:

  1. Manually archive old .jsonl transcripts
  2. Run openclaw sessions cleanup --enforce --fix-missing
  3. Set archiveAfterMinutes: 120 (still aggressive)
  4. Accept losing troubleshooting history to maintain stability

This is a poor tradeoff — we lose the ability to debug issues from hours ago.

Suggested Improvements

  1. Pagination for sessions.list — don't serialize 400+ sessions per request
  2. Reduce Control UI poll frequency — or skip sessions.list when not on Sessions view
  3. Incremental sync — send deltas instead of full list
  4. Tiered retention config — separate policies for:
    • Main/channel sessions (keep days/weeks)
    • Subagent sessions (keep hours)
  5. Hard session count cap with LRU eviction — auto-archive oldest ephemeral sessions when limit reached
  6. Background proactive cleanup — don't wait for user to hit performance cliff
  7. Lazy loading — defer full session metadata until requested
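Suggestions 4 and 5 could be combined into a single selection step: first apply a per-tier maximum age, then enforce a cap on the remaining ephemeral sessions by evicting the least recently used. Everything in the sketch below (SessionKind, SessionInfo, RetentionPolicy, selectSessionsToArchive) is hypothetical; OpenClaw does not expose these names, and the thresholds are illustrative.

```typescript
type SessionKind = "main" | "subagent";

interface SessionInfo {
  key: string;
  kind: SessionKind;
  lastUsedAt: number; // epoch milliseconds
}

interface RetentionPolicy {
  maxAgeMs: Record<SessionKind, number>; // tiered retention per session kind
  maxEphemeral: number; // hard cap on live subagent sessions
}

function selectSessionsToArchive(
  sessions: SessionInfo[],
  policy: RetentionPolicy,
  now: number,
): string[] {
  const toArchive = new Set<string>();

  // Tier check: anything older than its kind's maximum age is archived.
  for (const session of sessions) {
    if (now - session.lastUsedAt > policy.maxAgeMs[session.kind]) {
      toArchive.add(session.key);
    }
  }

  // Cap check: among surviving subagent sessions, evict the least recently used first.
  const survivors = sessions
    .filter((s) => s.kind === "subagent" && !toArchive.has(s.key))
    .sort((a, b) => a.lastUsedAt - b.lastUsedAt);
  const excess = Math.max(0, survivors.length - policy.maxEphemeral);
  for (const session of survivors.slice(0, excess)) {
    toArchive.add(session.key);
  }

  return Array.from(toArchive);
}
```

Main and channel sessions are never touched by the cap in this sketch, so long-lived history is governed only by its own age limit.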

Environment

  • OpenClaw v2026.3.28
  • macOS 15.7.2 (arm64), Apple Silicon
  • Workload: Automated dispatcher spawning subagents for issue processing

Impact

This is a significant reliability issue for anyone running automated subagent workflows. The system encourages subagent spawning but doesn't scale session management to match, creating a hidden performance cliff that's hard to diagnose and forces users to choose between history retention and system stability.

Extended analysis

TL;DR

Implement pagination for sessions.list and reduce Control UI poll frequency to mitigate session accumulation and improve performance.

Guidance

  • Implement pagination for sessions.list to reduce the number of sessions serialized per request, improving response time and reducing CPU usage.
  • Reduce the frequency of Control UI polls or skip sessions.list when not on the Sessions view to decrease the load on the system.
  • Consider implementing a hard session count cap with LRU eviction to auto-archive oldest ephemeral sessions when the limit is reached, preventing excessive session accumulation.
  • Review and adjust the archiveAfterMinutes setting to balance history retention with system stability, potentially using a tiered retention config.
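The pagination point can be illustrated with a cursor-based sketch: the caller passes the key of the last item it received and gets the next slice plus a cursor for the following one. SessionSummary, SessionPage, and listSessionsPage are hypothetical names; sessions.list does not necessarily accept these parameters, and this version still sorts the full set of lightweight summaries on every call.

```typescript
interface SessionSummary {
  key: string;
  updatedAt: number;
}

interface SessionPage {
  items: SessionSummary[];
  nextCursor: string | null; // key of the last item, or null when exhausted
}

function listSessionsPage(
  all: SessionSummary[],
  limit: number,
  cursor: string | null,
): SessionPage {
  // Newest first; ties broken by key so the order is stable across calls.
  const sorted = all.slice().sort((a, b) => {
    if (b.updatedAt !== a.updatedAt) return b.updatedAt - a.updatedAt;
    return a.key < b.key ? -1 : a.key > b.key ? 1 : 0;
  });

  // An unknown cursor yields index -1, so the listing restarts from the top.
  const start = cursor === null ? 0 : sorted.findIndex((s) => s.key === cursor) + 1;
  const items = sorted.slice(start, start + limit);
  const hasMore = start + items.length < sorted.length;
  const last = items[items.length - 1];

  return { items, nextCursor: hasMore && last ? last.key : null };
}
```

A client would call listSessionsPage(sessions, 50, null) first and then keep passing the returned nextCursor until it is null, so no single response has to serialize all 400+ sessions.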

Example

The issue itself includes no code snippet; the implementation details are in the linked PRs (#58550, #59436, #59464) described above.

Notes

The suggested improvements require changes to the OpenClaw implementation, and the effectiveness of these changes may vary depending on the specific use case and workload.

Recommendation

Apply the suggested improvements, starting with pagination for sessions.list and reducing Control UI poll frequency, to mitigate the performance issues caused by session accumulation. This approach addresses the root causes of the problem and provides a more scalable solution than the current workaround.
