openclaw - ✅ (Solved) Fix [Bug]: 2026.4.29 performance regression: +43% request latency, stuck sessions, event loop blocking [1 pull request, 11 comments, 11 participants]

GitHub stats
openclaw/openclaw#76123 (fetched 2026-05-03 04:42:07)
Comments: 11
Participants: 11
Timeline: 19
Reactions: 8
Timeline (top): commented ×11, cross-referenced ×2, labeled ×2, subscribed ×2

After upgrading from 2026.4.26 to 2026.4.29 (Pi framework 0.70.6), average request latency increased by about 43% (~75s → ~106s), with 124 liveness warnings (P99 peak 48s), stuck sessions lasting 10+ minutes, 36 failover events, and 5 MCP connection timeouts. The regression was confirmed on the same hardware, same config, and same agents.

Error Message

[14:07:32] WARN: stuck session: sessionId=main state=processing age=985s queueDepth=1
[14:07:32] WARN: stuck session recovery skipped: reason=active_embedded_run
[14:08:02] WARN: stuck session: age=1015s queueDepth=1
[14:52:14] WARN: failed to start server "dingtalk-knowledge" Error: MCP server connection timed out after 30000ms
[14:52:47] WARN: failed to start server "dingtalk-document" Error: MCP server connection timed out after 30000ms
[14:55:24] embedded run failover decision (network error, 269s)
... (36 total: quota exceeded, network error, billing error)
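When triaging a log like this, it can help to pull the `age` and `queueDepth` values out of the stuck-session warnings so their growth over time is easy to see. The following is a minimal sketch, assuming the line layout shown above; `StuckSessionWarning` and `parseStuckSession` are hypothetical names, not part of OpenClaw.

```typescript
// Hypothetical helper for log triage; field names mirror the key=value pairs
// in the warnings quoted above. Not part of OpenClaw.
interface StuckSessionWarning {
  time: string;                  // timestamp from the log prefix, e.g. "14:07:32"
  ageSeconds: number;            // value of age=<n>s
  queueDepth: number;            // value of queueDepth=<n>
  sessionId: string | undefined; // only set when the line carries sessionId=
}

function parseStuckSession(line: string): StuckSessionWarning | null {
  // Match only "stuck session:" warnings, not "stuck session recovery skipped:".
  const head = /^\[(\d{2}:\d{2}:\d{2})\] WARN: stuck session: (.*)$/.exec(line.trim());
  if (!head) return null;
  const time = head[1] ?? "";
  const rest = head[2] ?? "";

  // Both age and queueDepth must be present for the line to count.
  const age = /(?:^|\s)age=(\d+)s(?:\s|$)/.exec(rest);
  const depth = /(?:^|\s)queueDepth=(\d+)(?:\s|$)/.exec(rest);
  if (!age || !depth) return null;

  const session = /(?:^|\s)sessionId=(\S+)/.exec(rest);
  return {
    time,
    ageSeconds: Number(age[1]),
    queueDepth: Number(depth[1]),
    sessionId: session ? session[1] : undefined,
  };
}
```

Applied to the first and third lines above, this yields ages of 985 and 1015 seconds, i.e. the same session reported again 30 seconds later without progress.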

Root Cause

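The issue report records symptoms rather than a root cause: the regression appeared after upgrading from 2026.4.26 to 2026.4.29 (Pi framework 0.70.6) on the same hardware, config, and agents. The linked fix, PR #76219, points to the transcript session-key hot path: transcript update broadcasts call resolveSessionKeyForTranscriptFile, which loaded the combined session stores before checking TRANSCRIPT_SESSION_KEY_CACHE, so tool-heavy runs repeatedly paid for full store loads and scans.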

Fix Action

Temporary workaround: downgrade to OpenClaw 2026.4.26

PR fix notes

PR #76219: fix(gateway): improve transcript session-key cache hot path

Description (problem / solution / changelog)

Problem

Transcript update broadcasts call resolveSessionKeyForTranscriptFile, which loaded combined session stores before checking TRANSCRIPT_SESSION_KEY_CACHE. Tool-heavy runs can broadcast transcript updates many times, so the same transcript repeatedly paid for full combined session-store loads and scans.

Fix

This stores transcript session-key cache entries with a cheap session-store fingerprint. When the fingerprint matches, resolveSessionKeyForTranscriptFile returns from the hot cache hit before loading the full combined store. When the fingerprint changes or cannot be resolved, the resolver falls back to the existing full validation path.
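The cache behavior described above can be sketched as follows. This is an illustration only, not the code from `src/gateway/session-transcript-key.ts`: `createTranscriptKeyResolver`, `fullResolve`, and `currentFingerprint` are hypothetical names, and the real resolver's signature is not shown in the PR description.

```typescript
// Illustrative sketch of a fingerprint-gated cache. All names are hypothetical.
interface CachedSessionKey {
  sessionKey: string;
  fingerprint: string; // store fingerprint at the time the entry was written
}

function createTranscriptKeyResolver(
  fullResolve: (transcriptPath: string) => string | undefined, // expensive: load and scan stores
  currentFingerprint: () => string | undefined,                // cheap: undefined if unresolvable
): (transcriptPath: string) => string | undefined {
  const cache: Record<string, CachedSessionKey> = Object.create(null);

  return function resolveKey(transcriptPath: string): string | undefined {
    const fingerprint = currentFingerprint();
    const hit = cache[transcriptPath];

    // Hot path: fingerprint unchanged, so return without touching the stores.
    if (fingerprint !== undefined && hit !== undefined && hit.fingerprint === fingerprint) {
      return hit.sessionKey;
    }

    // Fallback: fingerprint changed or could not be resolved, so run full validation.
    const sessionKey = fullResolve(transcriptPath);
    if (sessionKey !== undefined && fingerprint !== undefined) {
      cache[transcriptPath] = { sessionKey, fingerprint };
    } else {
      delete cache[transcriptPath];
    }
    return sessionKey;
  };
}
```

With this shape, repeated broadcasts for the same transcript cost one fingerprint check each, and the full store load runs only when the fingerprint moves.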

The fingerprint includes the relevant session-store config shape plus sorted store file stats, so config-only changes such as session.mainKey, session.store, session scope, default agent id, or configured agent ids invalidate cached mappings without parsing the stores.
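A fingerprint of this kind might be built roughly as shown below. The field names, the choice of `size` and `mtimeMs` as the file stats, and the JSON encoding are all assumptions made for illustration; the PR description does not specify them, and `sessionStoreFingerprint` is a hypothetical name.

```typescript
// Hypothetical shapes; the real config and stat fields may differ.
interface StoreFileStat {
  path: string;
  size: number;
  mtimeMs: number;
}

interface SessionConfigShape {
  mainKey: string;
  store: string;
  scope: string;
  defaultAgentId: string;
  agentIds: string[];
}

function sessionStoreFingerprint(config: SessionConfigShape, stats: StoreFileStat[]): string {
  // Sort by path so the result does not depend on directory listing order.
  const sortedStats = stats
    .slice()
    .sort((a, b) => (a.path < b.path ? -1 : a.path > b.path ? 1 : 0))
    .map((s) => [s.path, s.size, s.mtimeMs]);

  // Config-only changes alter this part even when no store file changed.
  const shape = [
    config.mainKey,
    config.store,
    config.scope,
    config.defaultAgentId,
    config.agentIds.slice().sort(),
  ];

  // Only metadata is encoded; store contents are never read or parsed.
  return JSON.stringify([shape, sortedStats]);
}
```

The design point is that both inputs are cheap to obtain: config is already in memory and file stats avoid reading store contents, so the comparison stays far cheaper than a full load.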

Quality

This does not disable tools, plugins, or memory. It preserves stale mapping recovery and keeps the duplicate/freshest selection logic on the existing full-resolution fallback path.

Tests run

  • corepack pnpm test src/gateway/session-transcript-key.test.ts
  • corepack pnpm test src/gateway/session-utils.test.ts -t resolveGatewaySessionStoreFingerprint
  • corepack pnpm exec oxfmt --check --threads=1 src/gateway/session-transcript-key.ts src/gateway/session-transcript-key.test.ts src/gateway/session-utils.ts src/gateway/session-utils.test.ts
  • git diff --check

Fixes #76123.

Thanks @kaka4413.

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/gateway/session-transcript-key.test.ts (modified, +9/-1)
  • src/gateway/session-transcript-key.ts (modified, +36/-4)
  • src/gateway/session-utils.test.ts (modified, +35/-0)
  • src/gateway/session-utils.ts (modified, +45/-0)


Bug type

Regression (worked before, now fails)

Beta release blocker

No

Steps to reproduce

  1. Start OpenClaw 2026.4.29 on a 2-core 1.8GB Linux server (Intel Xeon Platinum, Ubuntu/Alibaba Cloud Linux).
  2. Configure 7 agents (1 main + 6 platform agents) with 2 channels (DingTalk with 6 accounts + WeCom).
  3. Use model config: primary=dashscope-coding/qwen3.6-plus, fallback=dashscope-coding/qwen3.5-plus.
  4. Send a message to any DingTalk bot and measure request completion time.
  5. Observe: each request takes ~106s (vs ~75s on 2026.4.26), with periodic stuck session warnings every 30s.
  6. Check gateway logs for liveness warnings, MCP connection timeouts, and failover events.

Expected behavior

On OpenClaw 2026.4.26 with the identical hardware, config, and agent setup: average request time ~75s (startup ~19.5s + prep ~55s), no stuck sessions, no liveness warnings, no MCP timeouts. Measured from 90 timed requests in the gateway logs.

Actual behavior

On OpenClaw 2026.4.29 (Pi 0.70.6):

  • Average request time: ~106s (+43%), with startup ~30.7s (+57%) and prep ~75.7s (+38%).
  • model-resolution stage: ~870ms → ~3,657ms (+320%)
  • session-resource-loader: ~460ms → ~2,138ms (+365%)
  • 124 liveness warnings over 6 hours (P99 peak: 48,318ms, utilization 100% on 4 occasions)
  • Stuck sessions detected continuously for 10+ minutes (age 985s–2,694s, every 30s), recovery skipped due to active_embedded_run
  • 36 failover events (quota exceeded, network errors, billing errors)
  • 5 MCP server connection timeouts (30s each, dingtalk-knowledge and dingtalk-document)
  • 3 extreme requests: 150s, 165s, 169s (bundle-tools stage 81s–94s)
  • 1 gateway health check timeout on local loopback (10s)
  • 22 quota exceeded errors (total wait 7,397s / 2.1h)
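The percentage figures in this report follow from a plain relative-change calculation on the before and after timings. A minimal sketch (`percentChange` is a hypothetical helper name):

```typescript
// Relative change from a baseline timing to a new timing, rounded to a whole percent.
function percentChange(beforeMs: number, afterMs: number): number {
  return Math.round(((afterMs - beforeMs) / beforeMs) * 100);
}
```

Using the reported averages, `percentChange(74500, 106300)` gives 43 for the grand total, `percentChange(19500, 30691)` gives 57 for startup, and `percentChange(870, 3657)` gives 320 for model-resolution.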

OpenClaw version

2026.4.29 (a448042)

Operating system

Linux 5.10.134-19.2.al8.x86_64 x64 (Alibaba Cloud Linux)

Install method

npm global

Model

dashscope-coding/qwen3.6-plus (primary), dashscope-coding/qwen3.5-plus (fallback)

Provider / routing chain

openclaw → dashscope (Alibaba Cloud DashScope API, OpenAI-compatible endpoint)

Additional provider/model setup details

Model config identical before and after upgrade:

{
  "primary": "dashscope-coding/qwen3.6-plus",
  "fallbacks": ["dashscope-coding/qwen3.5-plus"]
}

Pi framework updated from ~0.6x to 0.70.6:

    @mariozechner/pi-agent-core: ~0.6x → 0.70.6
    @mariozechner/pi-ai: ~0.6x → 0.70.6
    @mariozechner/pi-coding-agent: ~0.6x → 0.70.6

Channels: DingTalk (@dingtalk-real-ai/dingtalk-connector v0.8.20, 6 accounts) + WeCom (custom plugin).
7 agents total (main + 6 platform agents).
Server: 2-core Intel Xeon Platinum, 1.8GB RAM, 40GB disk (84% used).

### Logs, screenshots, and evidence

Performance Timing Data (90 timed requests from gateway logs)
Before upgrade (2026.4.26) — baseline

startup: total=~19,500ms auth=~8,900ms model-resolution=~870ms attempt-dispatch=~9,300ms
prep: total=~55,000ms system-prompt=~18,759ms stream-setup=~18,849ms bundle-tools=~8,561ms
TOTAL: ~74,500ms

After upgrade (2026.4.29): degraded stable state (average of 44 requests)

startup: total=30,691ms auth=12,730ms model-resolution=3,657ms attempt-dispatch=14,275ms
prep: total=75,672ms system-prompt=26,938ms stream-setup=26,296ms bundle-tools=10,964ms
TOTAL: ~106,300ms

Regression by stage:
Stage 	Before 	After 	Change
auth 	~8,900ms 	12,730ms 	+43%
model-resolution 	~870ms 	3,657ms 	+320%
attempt-dispatch 	~9,300ms 	14,275ms 	+53%
Startup TOTAL 	~19,500ms 	30,691ms 	+57%
system-prompt 	~18,759ms 	26,938ms 	+44%
stream-setup 	~18,849ms 	26,296ms 	+40%
bundle-tools 	~8,561ms 	10,964ms 	+28%
session-resource-loader 	~460ms 	2,138ms 	+365%
Prep TOTAL 	~55,000ms 	75,672ms 	+38%
GRAND TOTAL 	~74,500ms 	~106,300ms 	+43%

Liveness Warnings (124 total over 6 hours):

    eventLoopDelayP99: avg 7,452ms, max 48,318ms, median 65ms
    eventLoopUtilization: avg 0.501, max 1.000
    8 warnings with P99 > 10s, 4 with utilization 100%
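Summary figures of this kind (average, maximum, median, and a count above a threshold) can be derived from the per-warning P99 values. The sketch below is a generic helper under that assumption; `summarizeDelays` is a hypothetical name and the raw samples from this report are not available here.

```typescript
interface DelaySummary {
  avg: number;
  max: number;
  median: number;
  overThreshold: number; // number of samples strictly above the threshold
}

// Summarize event-loop delay samples given in milliseconds.
function summarizeDelays(samplesMs: number[], thresholdMs: number): DelaySummary | null {
  if (samplesMs.length === 0) return null;

  // Sort a copy so the caller's array is left untouched.
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median = sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;

  let sum = 0;
  let overThreshold = 0;
  for (const value of sorted) {
    sum += value;
    if (value > thresholdMs) overThreshold++;
  }

  return { avg: sum / sorted.length, max: sorted[sorted.length - 1], median, overThreshold };
}
```

An average far above the median, as in the figures above (7,452ms against 65ms), indicates that a small number of very long stalls dominate the mean while most intervals are healthy.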

Stuck Sessions:

[14:07:32] WARN: stuck session: sessionId=main state=processing age=985s queueDepth=1
  reason=processing_with_queued_work recovery=checking
[14:07:32] WARN: stuck session recovery skipped: reason=active_embedded_run
  action=observe_only sessionId=main ... age=985s queueDepth=1
[14:08:02] WARN: stuck session: age=1015s queueDepth=1
... (every 30s for 10+ minutes, age reaching 2,694s)

MCP Timeouts (5 total):

[14:52:14] WARN: failed to start server "dingtalk-knowledge"
  Error: MCP server connection timed out after 30000ms
[14:52:47] WARN: failed to start server "dingtalk-document"
  Error: MCP server connection timed out after 30000ms
... (5 × 30s = 150s total)

Failover Events (36 total):

[14:29:06] embedded run failover decision (quota exceeded, 2,379s)
[14:55:24] embedded run failover decision (network error, 269s)
... (36 total: quota exceeded, network error, billing error)

Extreme Requests:
Time 	Stage 	Value 	Normal 	Multiplier 	Total
14:53 	session-loader 	42,284ms 	~2s 	21x 	169,650ms
14:54 	bundle-tools 	81,077ms 	~11s 	7.4x 	169,650ms
18:09 	bundle-tools 	94,411ms 	~11s 	8.6x 	165,273ms

Config diff (only change between versions):

- "lastTouchedVersion": "2026.4.26"
+ "lastTouchedVersion": "2026.4.29"

Log file: /tmp/openclaw/openclaw-2026-05-02.log

Impact and severity

Affected: All users on OpenClaw 2026.4.29 with 2+ agents and a multi-channel setup
Severity: High. 43% latency increase across all requests, plus system instability (stuck sessions, event loop blocking up to 48s, frequent failovers)
Frequency: 100%. Every request is affected; liveness warnings occur every 3-5 minutes
Consequence:

  • Requests take ~30s longer on average, which exhausts API quota faster
  • Stuck sessions block concurrent work for 10+ minutes
  • Event loop blocking (48s P99) makes the system intermittently unresponsive
  • 36 failover events mean 36 failed or timed-out requests in 6 hours
  • Gateway health check timeout indicates severe event loop starvation

Additional information

Last known good version: 2026.4.26
First known bad version: 2026.4.29

Excluded factors (all tested, no improvement):

  • Disk I/O: cleaned 3GB (94%→84% usage), no change
  • Memory: 62% used, only 164MB swap, not the cause
  • Config: diff confirms identical except version tag
  • memory-wiki plugin: disabled for testing, no change
  • Agent count: 7, unchanged
  • Network latency: MCP/model API 100-170ms (normal)
  • Concurrent requests: maxConcurrent=4, no queuing observed
  • Compaction: mode=safeguard, no issues
  • Plugin dependency loading: 4-7ms, no delay
  • Workspace files: 1,900 files, 112MB, loads normally
  • Skills: 12 SKILL.md files, unchanged
  • SQLite state store: not enabled
  • Model API quota: same API key; quota exhaustion is a result, not a cause

Key changes in 2026.4.29:

  • Steer queue mode (replaced old queue, 500ms debounce added)
  • Commitments system (new feature)
  • SQLite plugin state store (new infrastructure)
  • Event-loop readiness diagnostics (new monitoring)
  • Pi framework updated to 0.70.6

Temporary workaround: downgrade to OpenClaw 2026.4.26

Extended analysis

TL;DR

Downgrade to OpenClaw 2026.4.26 to immediately mitigate the 43% latency increase and system instability introduced in version 2026.4.29.

Guidance

  1. Verify the issue: Confirm that the latency increase and system instability are consistent across multiple requests and agents.
  2. Check event loop diagnostics: Review the event-loop readiness diagnostics introduced in 2026.4.29 to understand the cause of event loop blocking.
  3. Investigate Steer queue mode: Examine the impact of the new Steer queue mode and 500ms debounce on request latency.
  4. Monitor Pi framework updates: Research potential issues with the Pi framework update to 0.70.6 and its effects on the system.
  5. Test individual components: Isolate and test each new feature (e.g., commitments system, SQLite plugin state store) to identify the root cause of the regression.

Example

The issue does not include a code-level reproduction: the regression was observed after a version upgrade with no configuration changes other than the version tag.

Notes

The issue report itself does not pin down the exact cause. Downgrading to 2026.4.26 is a temporary workaround. PR #76219, which is marked as fixing this issue, targets repeated full session-store loads during transcript update broadcasts.

Recommendation

Apply the temporary workaround: downgrade to OpenClaw 2026.4.26. This will immediately mitigate the latency increase and system instability, allowing for further investigation and debugging of the root cause.
