openclaw - 💡(How to fix) Fix QMD per-agent SQLite caches cause extreme disk I/O on multi-agent deployments [1 comments, 1 participants]

openclaw • 2026-03-30 23:47:21


openclaw/openclaw#57996•Fetched 2026-04-08 01:55:07

Author

Orionation

Participants

Orionation

Assignees

vincentkoc

Environment

macOS 26.3, MacBook Air (M4), 460GB SSD
OpenClaw latest stable
19 persistent agents running concurrently

Problem

Each agent gets its own QMD SQLite cache at ~/.openclaw/agents/<name>/qmd/xdg-cache/qmd/index.sqlite. With 19 agents, the combined cache totals 2.5GB on disk (main agent alone is 1.2GB). Each SQLite instance independently maintains WAL journaling, checkpointing, and fsync operations.
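To see how the cache footprint is distributed, the per-agent files can be tallied with a short script. This is a minimal sketch that assumes the directory layout quoted above and that each database may have SQLite's usual `-wal` and `-shm` sidecar files next to it; the function name `qmd_cache_sizes` is made up for illustration.

```python
from pathlib import Path


def qmd_cache_sizes(agents_root: Path) -> dict[str, int]:
    """Return bytes used per agent by index.sqlite plus its -wal and -shm files."""
    sizes: dict[str, int] = {}
    for db in agents_root.glob("*/qmd/xdg-cache/qmd/index.sqlite"):
        # The first path component under agents_root is the agent name.
        agent = db.relative_to(agents_root).parts[0]
        total = 0
        for suffix in ("", "-wal", "-shm"):
            f = db.with_name(db.name + suffix)
            if f.exists():
                total += f.stat().st_size
        sizes[agent] = total
    return sizes


if __name__ == "__main__":
    # Path taken from the issue text; adjust if your install differs.
    root = Path.home() / ".openclaw" / "agents"
    sizes = qmd_cache_sizes(root)
    for agent, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
        print(f"{agent:30s} {size / 1e6:10.1f} MB")
    print(f"total: {sum(sizes.values()) / 1e9:.2f} GB across {len(sizes)} agents")
```

A large `-wal` file relative to its `index.sqlite` would suggest that checkpoints are falling behind for that agent.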

macOS diagnostic reports show the gateway node process wrote 34.36GB to disk over 20 hours (~481KB/sec sustained), triggering a disk writes diagnostic event. The machine also generated 8 separate node crash reports in a single day, and the sustained I/O pressure caused macOS to force a user-session restart — killing all running apps including the gateway.

Root cause

SQLite WAL write amplification multiplied across 19 independent database instances. A single 1.2GB SQLite database with active WAL can easily generate 10-20x its size in cumulative disk writes through journaling, checkpointing, and page rewrites. Applied to the 2.5GB combined cache across 19 agents, that factor predicts roughly 25-50GB of writes, which brackets the 34GB observed.
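The reported figures can be cross-checked with a few lines of arithmetic. This sketch only uses numbers from the report and assumes decimal units (1 GB = 1e9 bytes), which the issue does not state.

```python
# Figures reported in the issue (decimal units are an assumption).
written_gb = 34.36   # bytes written by the gateway node process, in GB
hours = 20           # observation window
cache_gb = 2.5       # combined on-disk cache across 19 agents

rate_kb_s = written_gb * 1e9 / (hours * 3600) / 1e3
amplification = written_gb / cache_gb

print(f"average write rate: {rate_kb_s:.0f} KB/s")
print(f"writes relative to cache size: {amplification:.1f}x")
```

Under these assumptions the average rate comes out near 477 KB/s, close to the ~481KB/sec quoted, and total writes are about 13.7 times the combined cache size, inside the 10-20x range cited above.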

Impact

macOS forced user-session logout, killing all agents mid-task
Gateway required full watchdog recovery cycle post-restart
SSD endurance concern for always-on multi-agent deployments

Suggested improvements

Shared QMD cache — single SQLite instance with agent-namespaced tables instead of N separate databases
Configurable WAL checkpoint interval — allow tuning for deployments with many agents
Read-only or reduced-write mode for agents that don't need heavy QMD caching
Cache size limits or LRU eviction per agent to prevent unbounded growth (1.2GB for a single agent's inference cache seems excessive)

Workaround

None currently without disabling agents. Running VACUUM on individual SQLite files provides temporary relief but doesn't address the write amplification from concurrent WAL operations.

extent analysis

Fix Plan

To address the SQLite WAL write amplification issue, we will implement the following steps:

Shared QMD cache: Create a single SQLite instance with agent-namespaced tables.
Configurable WAL checkpoint interval: Introduce a configuration option to tune the WAL checkpoint interval.
Read-only or reduced-write mode: Implement a read-only or reduced-write mode for agents that don't require heavy QMD caching.
Cache size limits or LRU eviction: Enforce cache size limits or LRU eviction per agent to prevent unbounded growth.

Example Code

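To create a shared QMD cache, you can modify the database initialization code to use a single SQLite instance. The sketch below namespaces rows with an agent_name column in one table, which is one way to keep per-agent data separate; the path and table name are placeholders, not OpenClaw's actual schema:

import sqlite3

# Open one shared QMD cache database for all agents (placeholder path)
conn = sqlite3.connect('/path/to/shared/qmd/cache.db')

# Namespace rows by agent and index the column for per-agent lookups
conn.execute('''
    CREATE TABLE IF NOT EXISTS qmd_data (
        id INTEGER PRIMARY KEY,
        agent_name TEXT NOT NULL,
        data BLOB
    )
''')
conn.execute('CREATE INDEX IF NOT EXISTS idx_qmd_data_agent ON qmd_data (agent_name)')

# Insert data into the shared cache; "with conn" commits on success and rolls back on error
def insert_data(agent_name, data):
    with conn:
        conn.execute('INSERT INTO qmd_data (agent_name, data) VALUES (?, ?)', (agent_name, data))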

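To implement a configurable WAL checkpoint interval, you can add a configuration option to your application. Note that SQLite's wal_autocheckpoint is measured in WAL pages, not seconds (the default is 1000 pages), and that PRAGMA statements do not accept bound parameters:

import sqlite3

# Configuration option: checkpoint after this many WAL pages (SQLite's default is 1000).
# A larger value means fewer checkpoints but a larger WAL file between them.
wal_autocheckpoint_pages = 4000

# Create a connection to the SQLite database (placeholder path)
conn = sqlite3.connect('/path/to/shared/qmd/cache.db')

# The setting only matters in WAL mode, so enable it explicitly
conn.execute('PRAGMA journal_mode = WAL')

# PRAGMA values cannot be bound with "?", so format the validated integer into the statement
conn.execute(f'PRAGMA wal_autocheckpoint = {int(wal_autocheckpoint_pages)}')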
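The read-only mode from the fix plan could be approximated by opening the cache through SQLite's URI syntax with `mode=ro`. This is a sketch, not OpenClaw's implementation: the helper name `open_cache_readonly` is hypothetical, and whether QMD tolerates a cache it cannot write to is an assumption that would need checking.

```python
import sqlite3
from pathlib import Path


def open_cache_readonly(db_path: Path) -> sqlite3.Connection:
    """Open an existing SQLite cache so that this connection cannot modify it."""
    # as_uri() yields a file:// URI; mode=ro asks SQLite for a read-only open.
    uri = db_path.resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


# Usage (hypothetical path):
#   conn = open_cache_readonly(Path("/path/to/shared/qmd/cache.db"))
#   conn.execute("SELECT count(*) FROM qmd_data").fetchone()
# Any INSERT, UPDATE, or DELETE on this connection raises sqlite3.OperationalError.
```

A read-only connection to a WAL-mode database may still depend on the `-wal` and `-shm` files being present or creatable, so this should be tested against real cache files before relying on it.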

Verification

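To verify that the fix worked, monitor disk writes from the gateway node process and confirm that no new disk writes diagnostic events or node crash reports appear. On macOS, Activity Monitor's Disk tab shows bytes written per process and fs_usage can trace file system activity; on Linux, tools like iotop or sysdig serve the same purpose.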

Extra Tips

Run VACUUM on the shared QMD cache database only occasionally: it reclaims free pages, but it rewrites the entire database file, so frequent runs add to the write load rather than reduce it.
Consider using a more efficient caching mechanism, such as an in-memory cache or a caching layer with a fixed size limit.
Monitor the SSD endurance and consider using a more robust storage solution for always-on multi-agent deployments.
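The per-agent size limit with LRU eviction mentioned in the fix plan could look roughly like the following. This is a sketch under stated assumptions: the `qmd_data` table, its `last_used` column, and the `put` and `evict_lru` helpers are all hypothetical and do not reflect QMD's real schema.

```python
import sqlite3
import time

# Hypothetical schema: one row per cached item, stamped with its last access time.
SCHEMA = """
CREATE TABLE IF NOT EXISTS qmd_data (
    id INTEGER PRIMARY KEY,
    agent_name TEXT NOT NULL,
    data BLOB NOT NULL,
    last_used REAL NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_qmd_lru ON qmd_data (agent_name, last_used);
"""


def put(conn, agent_name, data, now=None):
    """Store one cache entry, recording when it was last used."""
    now = time.time() if now is None else now
    with conn:
        conn.execute(
            "INSERT INTO qmd_data (agent_name, data, last_used) VALUES (?, ?, ?)",
            (agent_name, data, now),
        )


def evict_lru(conn, agent_name, max_bytes):
    """Delete one agent's least recently used rows until its payload fits in max_bytes.

    Returns the number of rows deleted.
    """
    deleted = 0
    with conn:  # one transaction, so eviction is a single commit
        total = conn.execute(
            "SELECT COALESCE(SUM(LENGTH(data)), 0) FROM qmd_data WHERE agent_name = ?",
            (agent_name,),
        ).fetchone()[0]
        # Fetch candidates up front so rows are not deleted under an open cursor.
        rows = conn.execute(
            "SELECT id, LENGTH(data) FROM qmd_data "
            "WHERE agent_name = ? ORDER BY last_used ASC, id ASC",
            (agent_name,),
        ).fetchall()
        for row_id, size in rows:
            if total <= max_bytes:
                break
            conn.execute("DELETE FROM qmd_data WHERE id = ?", (row_id,))
            total -= size
            deleted += 1
    return deleted
```

Deleting rows frees pages for reuse inside the database file rather than shrinking it on disk, so this bounds growth going forward but does not by itself reduce an already large cache.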
