openclaw: agents.files.set hangs for >30s on a specific agentId after N compactions; co-resident agents on the same VM are fine

Sharing a reproducible production issue we hit on a fork tracking 2026.5.4 (carrying only the 5-line allowlist patch from #80329).

Symptom

After ~10 compaction cycles on the same per-user agent, all subsequent agents.files.set calls to that specific agentId block until our client-side timeout fires (~30s). Other agents on the same VM are responsive at <100ms during the same window. The VM itself is healthy: CPU <1%, draining=false, low agentCount.

Reproducible signature in our setup:

  • Gateway: one e2-medium chat-pool VM running OpenClaw 2026.5.4 fork.
  • Stuck agent: sessionEpoch=10 (10 compactions over its lifetime).
  • Co-resident agent on the same VM: sessionEpoch=1. agents.files.set succeeds at 55–60ms per file on the same heartbeat tick.

Minimal probe (Node, via our ClawClient over WSS):

// Same gateway, same TLS pin, same WSS connection pool.
await client.setAgentFile(amironovAgentId, "HEARTBEAT.md", "probe\n")
// → 60ms, ok

await client.setAgentFile(stuckAgentId, "HEARTBEAT.md", "probe\n")
// → throws "OpenClaw connection timeout" at exactly 31508ms (our requestTimeout)
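
To make the healthy-versus-stuck comparison repeatable, the probe can be wrapped in a small timing helper. This is a minimal sketch: `timedProbe` and `probeAgents` are hypothetical names, and `client.setAgentFile` is assumed to behave as in the snippet above (resolves on success, rejects when the request timeout fires).

```javascript
// Hypothetical helper: times one async call and records latency or failure.
async function timedProbe(label, call) {
  const start = Date.now();
  try {
    await call();
    return { label, ok: true, ms: Date.now() - start };
  } catch (err) {
    const message = String((err && err.message) || err);
    return { label, ok: false, ms: Date.now() - start, error: message };
  }
}

// Probes each agent in turn with the same HEARTBEAT.md write used above.
// Sequential on purpose, so only one write is in flight per gateway.
async function probeAgents(client, agentIds) {
  const results = [];
  for (const id of agentIds) {
    results.push(
      await timedProbe(id, () => client.setAgentFile(id, "HEARTBEAT.md", "probe\n"))
    );
  }
  return results;
}
```

Running `probeAgents(client, [amironovAgentId, stuckAgentId])` on each heartbeat tick would give one latency record per agent, which makes the per-agent pattern easy to log over time.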

What "compaction" looks like in our app

On each compaction we re-render and re-write all five PROVISIONED_FILES (SOUL.md, IDENTITY.md, USER.md, MEMORY.md, TOOLS.md) plus our credentials.json. So a compaction event is a burst of ~6 agents.files.set calls back-to-back. sessionEpoch is our app-side counter.
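
The write burst described above can be sketched roughly as follows. The file names come from the description. The `compact` function, the `render` callback, and the point at which the counter is bumped are assumptions made for illustration, not the app's actual code.

```javascript
// File names as listed in the issue text.
const PROVISIONED_FILES = ["SOUL.md", "IDENTITY.md", "USER.md", "MEMORY.md", "TOOLS.md"];

// Hypothetical: `render(name)` returns the freshly rendered body for a file.
// Returns the next sessionEpoch (assumed here to be bumped after the writes).
async function compact(client, agentId, render, sessionEpoch) {
  const names = [...PROVISIONED_FILES, "credentials.json"];
  for (const name of names) {
    // One agents.files.set at a time, back-to-back: six calls per compaction.
    await client.setAgentFile(agentId, name, render(name));
  }
  return sessionEpoch + 1;
}
```

Under these assumptions, an agent at sessionEpoch=10 has received about 60 compaction writes on top of its initial provisioning.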

We also bump the OpenAI session key with an :e<epoch> suffix so the gateway treats the next chat as a new conversation. Presumably that creates fresh session state on your side per epoch.
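
For concreteness, the key derivation amounts to something like the sketch below. `sessionKeyForEpoch` is a hypothetical name, and the format of the base key is an assumption. Only the `:e<epoch>` suffix comes from the description above.

```javascript
// Hypothetical helper: appends the :e<epoch> suffix to a base session key.
function sessionKeyForEpoch(baseKey, epoch) {
  if (!Number.isInteger(epoch) || epoch < 0) {
    throw new RangeError("epoch must be a non-negative integer");
  }
  return `${baseKey}:e${epoch}`;
}
```

With a placeholder base key, `sessionKeyForEpoch("user-123", 10)` yields `"user-123:e10"`, so every compaction produces a session key the gateway has not seen before.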

Hypothesis

A per-agent op queue (or some other per-agent state) on the gateway accumulates across epochs and eventually wedges that one agent. Co-resident agents keep working, so it isn't a per-VM cap. CPU is idle, so it isn't contention-driven. It looks like a memory leak or deadlock localized to the agent record.
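
To illustrate the kind of failure this hypothesis describes, here is a toy per-agent queue in which a single operation that never settles blocks every later operation for that agent and leaves other agents untouched. It is purely illustrative and is not based on the gateway's actual code. `PerAgentQueue`, `withTimeout`, and `demo` are made-up names.

```javascript
// Illustrative only: NOT the gateway's implementation.
class PerAgentQueue {
  constructor() {
    this.tails = new Map(); // agentId -> promise for the last queued op
  }

  enqueue(agentId, op) {
    const prev = this.tails.get(agentId) || Promise.resolve();
    // Start `op` only after the previous op for this agent has settled.
    const next = prev.then(() => op());
    // Store a never-rejecting tail so one failure does not break the chain.
    this.tails.set(agentId, next.catch(() => {}));
    return next;
  }
}

// Resolves to "timeout" if `p` has not settled within `ms` milliseconds.
function withTimeout(p, ms) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve("timeout"), ms);
  });
  return Promise.race([p, timeout]).finally(() => clearTimeout(timer));
}

async function demo() {
  const q = new PerAgentQueue();
  q.enqueue("stuck", () => new Promise(() => {})); // an op that never settles
  const stuck = await withTimeout(q.enqueue("stuck", async () => "ok"), 50);
  const healthy = await withTimeout(q.enqueue("healthy", async () => "ok"), 50);
  return { stuck, healthy };
}
```

In this toy model the second write to "stuck" never starts, so the caller only ever sees its own timeout, while "healthy" completes immediately. That matches the shape of the symptom (idle CPU, one agent wedged, neighbors fine) but does not prove the gateway behaves this way.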

What we tried

  • Restarting our API process: no effect; the gateway state persists.
  • Comparing to a co-resident agent: confirms it's per-agent, not per-VM.
  • We already serialize writes per gateway on our side (per the #73683 follow-up); that doesn't help here.

Recovery

The only mitigation we have is agents.delete + fresh agents.create for the stuck agentId, which loses gateway-side conversation memory.
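
The recovery sequence looks roughly like the sketch below. `deleteAgent` and `createAgent` are hypothetical client methods standing in for the agents.delete and agents.create RPCs, and their real signatures may differ. Reusing the same agentId and the `reprovision` callback are also assumptions.

```javascript
// Hypothetical wrapper around the delete + create mitigation.
async function recreateAgent(client, stuckAgentId, reprovision) {
  // Deleting the agent drops the gateway-side conversation memory.
  await client.deleteAgent(stuckAgentId);
  // Assumption: the agent is recreated under the same id.
  const freshAgentId = await client.createAgent(stuckAgentId);
  // Rewrite the provisioned files on the fresh agent.
  await reprovision(freshAgentId);
  return freshAgentId;
}
```

Because the delete step discards conversation memory, this is a last resort rather than something to run automatically on the first timeout.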

What would help

  • A pointer to the right log channel or env var on the gateway side to capture what agents.files.set is blocked on for the wedged agent. Happy to repro with verbose logging and attach a tcpdump / strace.
  • Whether you know of a per-agent op queue or lock that could leak state under our high-compaction-cadence pattern.

Thanks!
