openclaw: agents.files.set hangs for >30s on a specific agentId after N compactions; co-resident agents on the same VM are fine

openclaw · 2026-05-18 01:12:05


Sharing a reproducible production issue we hit on a fork tracking 2026.5.4 (carrying only the 5-line allowlist patch from #80329).

Symptom

After ~10 compaction cycles on the same per-user agent, all subsequent agents.files.set calls to that specific agentId block until our client-side timeout fires (~30s). Other agents on the same VM are responsive at <100ms during the same window. The VM itself is healthy: CPU <1%, draining=false, low agentCount.

Reproducible signature in our setup:

Gateway: one e2-medium chat-pool VM running OpenClaw 2026.5.4 fork.
Stuck agent: sessionEpoch=10 (10 compactions over its lifetime).
Co-resident agent on the same VM: sessionEpoch=1. agents.files.set succeeds at 55–60ms per file on the same heartbeat tick.

Minimal probe (Node, via our ClawClient over WSS):

// Same gateway, same TLS pin, same WSS connection pool.
await client.setAgentFile(amironovAgentId, "HEARTBEAT.md", "probe\n")
// → 60ms, ok

await client.setAgentFile(stuckAgentId, "HEARTBEAT.md", "probe\n")
// → throws "OpenClaw connection timeout" at exactly 31508ms (our requestTimeout)
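To make the comparison repeatable, the probe can record latency and outcome per agent. This is a minimal sketch, assuming a client whose setAgentFile has the shape used above. The FileClient interface, probeAgent, and probeAll are hypothetical names for illustration and are not part of ClawClient or OpenClaw.

```typescript
// Hypothetical minimal shape of the client used in the probe above (an assumption).
interface FileClient {
  setAgentFile(agentId: string, name: string, content: string): Promise<unknown>;
}

interface ProbeResult {
  agentId: string;
  ok: boolean;
  elapsedMs: number;
  error?: string;
}

// Writes one small file to the agent and records how long the call took.
async function probeAgent(client: FileClient, agentId: string): Promise<ProbeResult> {
  const start = Date.now();
  try {
    await client.setAgentFile(agentId, "HEARTBEAT.md", "probe\n");
    return { agentId, ok: true, elapsedMs: Date.now() - start };
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err);
    return { agentId, ok: false, elapsedMs: Date.now() - start, error };
  }
}

// Probes agents one at a time so each measurement covers a single in-flight call.
async function probeAll(client: FileClient, agentIds: string[]): Promise<ProbeResult[]> {
  const results: ProbeResult[] = [];
  for (const id of agentIds) {
    results.push(await probeAgent(client, id));
  }
  return results;
}
```

Running probeAll against the healthy and the stuck agentId on the same heartbeat tick gives a side-by-side record that can be attached to the report.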

What "compaction" looks like in our app

On each compaction we re-render and re-write all five PROVISIONED_FILES (SOUL.md, IDENTITY.md, USER.md, MEMORY.md, TOOLS.md) plus our credentials.json. So a compaction event is a burst of ~6 agents.files.set calls back-to-back. sessionEpoch is our app-side counter.
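The burst described above can be sketched roughly as follows. The file names come from the description. The FileClient shape, the Render callback, and the choice to await each write before starting the next are assumptions for illustration, not our actual code.

```typescript
// File names taken from the description above; contents come from the caller.
const PROVISIONED_FILES = ["SOUL.md", "IDENTITY.md", "USER.md", "MEMORY.md", "TOOLS.md"] as const;

// Hypothetical client shape (an assumption).
interface FileClient {
  setAgentFile(agentId: string, name: string, content: string): Promise<unknown>;
}

// Hypothetical renderer: returns the fresh content for a given file name.
type Render = (fileName: string) => string;

// Builds the ordered list of writes for one compaction:
// the five provisioned files plus credentials.json.
function compactionWrites(render: Render): Array<[string, string]> {
  const names: string[] = [...PROVISIONED_FILES, "credentials.json"];
  return names.map((name): [string, string] => [name, render(name)]);
}

// Issues the writes back-to-back, awaiting each one before starting the next.
// Returns the number of agents.files.set calls made.
async function runCompaction(client: FileClient, agentId: string, render: Render): Promise<number> {
  const writes = compactionWrites(render);
  for (const [name, content] of writes) {
    await client.setAgentFile(agentId, name, content);
  }
  return writes.length;
}
```

Ten compactions at six writes each put the stuck agent at roughly 60 agents.files.set calls over its lifetime, which is the number a gateway-side repro would need to approach.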

We also bump the OpenAI session key with an :e<epoch> suffix so the gateway treats the next chat as a new conversation. Presumably that creates fresh session state on your side per epoch.
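For reference, the key bump amounts to something like the sketch below. Only the :e<epoch> suffix comes from the description; the function name, the base key format, and the validation are hypothetical.

```typescript
// Appends the epoch suffix described above to a base session key.
// The base key format is an assumption; only the ":e<epoch>" suffix is from the report.
function sessionKeyForEpoch(baseKey: string, epoch: number): string {
  if (!Number.isInteger(epoch) || epoch < 0) {
    throw new RangeError(`invalid epoch: ${epoch}`);
  }
  return `${baseKey}:e${epoch}`;
}
```

Each compaction therefore presents the gateway with a session key it has not seen before for the same agentId.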

Hypothesis

A per-agent op queue (or some other per-agent state) on the gateway accumulates across epochs and eventually wedges that specific agent. Co-resident agents keep working, so it isn't a per-VM cap. CPU is idle, so it isn't contention-driven. It looks like a memory leak or deadlock localized to the agent record.
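To illustrate the kind of failure we have in mind, here is a toy model of a per-agent serial queue. It is purely illustrative and is not OpenClaw's implementation; PerAgentQueue, withTimeout, and demo are hypothetical names. It shows how a single op that never settles would block every later op for that agent while other agents stay responsive and the CPU stays idle.

```typescript
// Illustrative model only: NOT OpenClaw's implementation.
// Each agentId gets its own promise chain; ops for one agent run strictly in order.
class PerAgentQueue {
  private tails = new Map<string, Promise<void>>();

  enqueue<T>(agentId: string, op: () => Promise<T>): Promise<T> {
    const tail = this.tails.get(agentId) ?? Promise.resolve();
    const result = tail.then(() => op());
    // The next op waits for this one to settle, whether it succeeds or fails.
    this.tails.set(agentId, result.then(() => undefined, () => undefined));
    return result;
  }
}

// Resolves to "timeout" if p has not settled within ms milliseconds.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T | "timeout"> {
  const timer = new Promise<"timeout">((resolve) => setTimeout(() => resolve("timeout"), ms));
  return Promise.race([p, timer]);
}

async function demo(): Promise<[string, string]> {
  const queue = new PerAgentQueue();
  // One op on the "stuck" agent never settles, e.g. a lost completion callback.
  void queue.enqueue("stuck", () => new Promise<void>(() => {}));
  // Later ops on the same agent queue behind it forever; other agents are unaffected.
  const stuck = await withTimeout(queue.enqueue("stuck", async () => "ok"), 50);
  const other = await withTimeout(queue.enqueue("other", async () => "ok"), 50);
  return [stuck, other];
}
```

In this model demo() resolves to ["timeout", "ok"], which matches the shape of what we observe: the wedged agent times out on the client side while its neighbour answers immediately.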

What we tried

Restarting our API process — no effect, gateway state persists.
Comparing to a co-resident agent — confirms it's per-agent, not per-VM.
We already serialize writes per gateway on our side (per #73683 follow-up); doesn't help here.

Recovery

The only mitigation we have is agents.delete + fresh agents.create for the stuck agentId, which loses gateway-side conversation memory.

What would help

A pointer at the right log channel / env var on the gateway side to capture what agents.files.set is blocked on for the wedged agent. Happy to repro with verbose logging and attach a tcpdump / strace.
Whether there is a per-agent op queue or lock on the gateway that could leak state under our high-compaction-cadence pattern.

Thanks!
