openclaw - QMD per-agent SQLite caches cause extreme disk I/O on multi-agent deployments [1 comment, 1 participant]

openclaw/openclaw#57996 (fetched 2026-04-08 01:55:07)

Environment

  • macOS 26.3, MacBook Air (M4), 460GB SSD
  • OpenClaw latest stable
  • 19 persistent agents running concurrently

Problem

Each agent gets its own QMD SQLite cache at ~/.openclaw/agents/<name>/qmd/xdg-cache/qmd/index.sqlite. With 19 agents, the combined cache totals 2.5GB on disk (main agent alone is 1.2GB). Each SQLite instance independently maintains WAL journaling, checkpointing, and fsync operations.
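To reproduce the per-agent and combined footprint figures, a minimal Python sketch follows. It assumes the directory layout quoted above and counts the -wal and -shm files that SQLite keeps beside a WAL-mode database; the function name qmd_cache_sizes is illustrative and not part of OpenClaw.

```python
from pathlib import Path

# SQLite keeps the main file plus a write-ahead log (-wal) and a
# shared-memory index (-shm) next to each WAL-mode database.
SQLITE_SUFFIXES = ("", "-wal", "-shm")


def qmd_cache_sizes(agents_root: Path) -> dict[str, int]:
    """Return {agent_name: total bytes} for each agent's QMD index files."""
    sizes = {}
    for db in sorted(agents_root.glob("*/qmd/xdg-cache/qmd/index.sqlite")):
        agent = db.relative_to(agents_root).parts[0]
        total = 0
        for suffix in SQLITE_SUFFIXES:
            candidate = db.with_name(db.name + suffix)
            if candidate.exists():
                total += candidate.stat().st_size
        sizes[agent] = total
    return sizes


def report(agents_root: Path) -> None:
    """Print agents from largest to smallest, then the combined total."""
    sizes = qmd_cache_sizes(agents_root)
    for agent, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
        print(f"{agent:20s} {size / 1e9:8.3f} GB")
    print(f"{'total':20s} {sum(sizes.values()) / 1e9:8.3f} GB")
```

Calling report(Path.home() / ".openclaw" / "agents") prints one line per agent and a combined total.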

macOS diagnostic reports show the gateway node process wrote 34.36GB to disk over 20 hours (~481KB/sec sustained), triggering a disk writes diagnostic event. The machine also generated 8 separate node crash reports in a single day, and the sustained I/O pressure caused macOS to force a user-session restart — killing all running apps including the gateway.

Root cause

SQLite WAL write amplification, multiplied across 19 independent database instances. A single 1.2GB SQLite database with an active WAL can plausibly generate 10-20x its size in cumulative disk writes through journaling, checkpointing, and page rewrites. Applied to the 2.5GB combined footprint of all 19 agents, that estimate brackets the 34GB observed.
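As a quick check of that arithmetic, the short Python sketch below uses only the figures reported in this issue. The 10-20x range is the reporter's estimate, not a measured SQLite constant.

```python
observed_writes_gb = 34.36  # gateway node process writes, per the macOS diagnostic report
combined_cache_gb = 2.5     # combined on-disk size of the 19 agent caches

# Amplification implied by the observation, relative to the combined cache size.
amplification = observed_writes_gb / combined_cache_gb

# Writes predicted by the reporter's 10-20x estimate.
low_gb, high_gb = 10 * combined_cache_gb, 20 * combined_cache_gb

print(f"implied amplification: {amplification:.1f}x")
print(f"10-20x estimate: {low_gb:.0f}-{high_gb:.0f} GB")
```

The implied factor falls inside the estimated range, so the reported numbers are at least mutually consistent.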

Impact

  • macOS forced user-session logout, killing all agents mid-task
  • Gateway required full watchdog recovery cycle post-restart
  • SSD endurance concern for always-on multi-agent deployments

Suggested improvements

  1. Shared QMD cache — single SQLite instance with agent-namespaced tables instead of N separate databases
  2. Configurable WAL checkpoint interval — allow tuning for deployments with many agents
  3. Read-only or reduced-write mode for agents that don't need heavy QMD caching
  4. Cache size limits or LRU eviction per agent to prevent unbounded growth (1.2GB for a single agent's inference cache seems excessive)
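As an illustration of suggestion 3, Python's standard sqlite3 module can open a database read-only through a URI with mode=ro. This is a sketch of the idea only, not OpenClaw's implementation, and the helper name open_read_only is hypothetical. One caveat to verify before relying on it: a read-only connection to a WAL-mode database may still need the -wal and -shm files to exist or be creatable.

```python
import sqlite3
from pathlib import Path


def open_read_only(db_path: str) -> sqlite3.Connection:
    """Open an existing SQLite database so that this connection cannot write."""
    # as_uri() percent-encodes the path; mode=ro requests read-only access.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

Any INSERT, UPDATE, or DELETE on such a connection raises sqlite3.OperationalError, so an agent opened this way cannot add write load to the cache.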

Workaround

None currently without disabling agents. Running VACUUM on individual SQLite files provides temporary relief but doesn't address the write amplification from concurrent WAL operations.

extent analysis

Fix Plan

To address the SQLite WAL write amplification issue, the following steps are proposed:

  • Shared QMD cache: Create a single SQLite instance with agent-namespaced tables.
  • Configurable WAL checkpoint interval: Introduce a configuration option to tune the WAL checkpoint interval.
  • Read-only or reduced-write mode: Implement a read-only or reduced-write mode for agents that don't require heavy QMD caching.
  • Cache size limits or LRU eviction: Enforce cache size limits or LRU eviction per agent to prevent unbounded growth.
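The last step could be sketched in Python as below. The schema is hypothetical: it assumes a shared qmd_data table with an agent_name column and a last_used timestamp that the application updates on every read. QMD's real schema is not shown in this issue.

```python
import sqlite3

# Hypothetical schema: one shared table, rows namespaced by agent.
SCHEMA = """
CREATE TABLE IF NOT EXISTS qmd_data (
    id INTEGER PRIMARY KEY,
    agent_name TEXT NOT NULL,
    data BLOB,
    last_used REAL NOT NULL
)
"""


def evict_lru(conn: sqlite3.Connection, agent_name: str, max_bytes: int) -> int:
    """Delete an agent's least recently used rows until it fits max_bytes.

    Returns the number of rows deleted.
    """
    rows = conn.execute(
        "SELECT id, COALESCE(length(data), 0) FROM qmd_data "
        "WHERE agent_name = ? ORDER BY last_used DESC, id DESC",
        (agent_name,),
    ).fetchall()
    running_total = 0
    doomed = []
    for row_id, size in rows:  # newest first
        running_total += size
        # Once the budget is exceeded, this row and every older row is evicted.
        if running_total > max_bytes:
            doomed.append((row_id,))
    with conn:  # commits on success, rolls back on error
        conn.executemany("DELETE FROM qmd_data WHERE id = ?", doomed)
    return len(doomed)
```

Deleting rows does not shrink the file by itself, since SQLite reuses the freed pages for later inserts; the benefit is that the database stops growing once each agent reaches its budget.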

Example Code

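To create a shared QMD cache, the database initialization could open a single SQLite instance and namespace rows by agent. The code below is illustrative Python, not OpenClaw's actual code; the path, table, and column names are placeholders:

import sqlite3

# Open one shared QMD cache database (placeholder path)
conn = sqlite3.connect('/path/to/shared/qmd/cache.db')
# A single WAL for all agents instead of one per agent
conn.execute('PRAGMA journal_mode=WAL')

# One table with an agent_name column stands in for per-agent tables
conn.execute('''
    CREATE TABLE IF NOT EXISTS qmd_data (
        id INTEGER PRIMARY KEY,
        agent_name TEXT NOT NULL,
        data BLOB
    )
''')
# Index so per-agent lookups do not scan other agents' rows
conn.execute('CREATE INDEX IF NOT EXISTS idx_qmd_agent ON qmd_data (agent_name)')

# Insert data into the shared cache
def insert_data(agent_name, data):
    # "with conn" commits on success and rolls back on error
    with conn:
        conn.execute('INSERT INTO qmd_data (agent_name, data) VALUES (?, ?)', (agent_name, data))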

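To implement a configurable WAL checkpoint interval, you can expose SQLite's wal_autocheckpoint setting as a configuration option. The value is a number of WAL pages, not seconds:

import sqlite3

# Configuration option: checkpoint once the WAL reaches this many pages.
# SQLite's default is 1000; a larger value means fewer checkpoints but a larger WAL file.
wal_autocheckpoint_pages = 4000  # illustrative value, tune per deployment

# Create a connection to the SQLite database
conn = sqlite3.connect('/path/to/shared/qmd/cache.db')
conn.execute('PRAGMA journal_mode=WAL')

# PRAGMA statements do not accept bound parameters, so coerce to int and format.
# The setting applies to this connection only and must be set on each new connection.
conn.execute(f'PRAGMA wal_autocheckpoint = {int(wal_autocheckpoint_pages)}')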

Verification

To verify that the fix worked, monitor disk writes and check for new disk-writes diagnostic events or node crash reports. On macOS, the Disk tab in Activity Monitor shows bytes written per process, and sudo fs_usage -f diskio traces disk I/O as it happens. iotop and sysdig serve the same purpose on Linux.

Extra Tips

  • Run VACUUM on the shared QMD cache database sparingly: it reclaims free space, but it rewrites the entire database file, so frequent runs add to the write load this fix is meant to reduce.
  • Consider using a more efficient caching mechanism, such as an in-memory cache or a caching layer with a fixed size limit.
  • Monitor the SSD endurance and consider using a more robust storage solution for always-on multi-agent deployments.
