openclaw - ✅(Solved) Fix [Bug]: Stuck Session Recovery fails in both cases + Session preprocessing takes too long on every message [1 pull request, 2 comments, 3 participants]

GitHub stats
openclaw/openclaw#76038 · Fetched 2026-05-03 04:43:05
Comments: 2
Participants: 3
Timeline: 6
Reactions: 3
Timeline (top): commented ×2, cross-referenced ×2, subscribed ×1, unsubscribed ×1

Fix Action

Fixed

PR fix notes

PR #76115: fix(diagnostics): experiment with stuck session aborts

Description (problem / solution / changelog)

Summary

  • Adds experimental diagnostics.stuckSessionAbortMs for a second-stage stalled/stuck recovery threshold.
  • Keeps warn-threshold recovery observe-only for active work, but lets the longer experimental threshold abort active embedded reply work or release an active unregistered lane.
  • Emits structured session.recovery diagnostics and clears stale bookkeeping after successful no-active-work recovery so repeated source-live no_active_work action=none logs stop.
  • Documents the experimental knob, config schema/help, reload planning, stability records, and changelog.

Experimental behavior

This is intentionally marked experimental. The abort-capable path only applies to session.stalled and session.stuck; session.long_running work that is still making progress is not aborted.
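
The two-stage behavior described in this PR summary can be sketched as a pure decision function. This is a simplified illustration under stated assumptions, not the code from PR #76115: the function name `decideRecovery`, the `warnMs` field, and the `RecoveryAction` values are hypothetical. Only `stuckSessionAbortMs` and the `session.stalled`, `session.stuck`, and `session.long_running` classifications come from the PR description.

```typescript
// Hypothetical sketch of a two-stage recovery decision; names are illustrative.
type SessionClass = "session.stalled" | "session.stuck" | "session.long_running";

type RecoveryAction =
  | "none" // below the warn threshold, nothing to do
  | "observe" // warn threshold crossed, active work is only observed
  | "release_lane" // no active work: release the lane and clear stale bookkeeping
  | "abort"; // abort threshold crossed on stalled or stuck work

interface Thresholds {
  warnMs: number; // assumed name for the first-stage (warn) threshold
  stuckSessionAbortMs?: number; // experimental second-stage threshold, optional
}

function decideRecovery(
  cls: SessionClass,
  ageMs: number,
  hasActiveWork: boolean,
  t: Thresholds,
): RecoveryAction {
  if (ageMs < t.warnMs) return "none";
  if (!hasActiveWork) return "release_lane";
  // Work that is still making progress is never aborted.
  if (cls === "session.long_running") return "observe";
  if (t.stuckSessionAbortMs !== undefined && ageMs >= t.stuckSessionAbortMs) {
    return "abort";
  }
  return "observe";
}
```

With `warnMs` set to 120 s and `stuckSessionAbortMs` set to 300 s (both values chosen only for illustration), a stuck session at age 289 s would still be observed, while the same session at age 385 s would be aborted. Leaving `stuckSessionAbortMs` unset keeps the observe-only behavior for active work.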

Refs #76038 and #71127.

Validation

  • pnpm test src/logging/diagnostic.test.ts src/logging/diagnostic-stuck-session-recovery.runtime.test.ts src/logging/diagnostic-stuck-session-recovery.integration.test.ts src/logging/diagnostic-stability.test.ts src/gateway/config-reload.test.ts src/config/schema.help.quality.test.ts
  • pnpm config:schema:check && pnpm config:docs:check
  • Testbox pnpm check:changed on tbx_01kqmjaexhs8y7b5jab36n269c (passed; lanes: core, coreTests, docs)

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • docs/.generated/config-baseline.sha256 (modified, +2/-2)
  • docs/concepts/agent-loop.md (modified, +1/-1)
  • docs/concepts/queue.md (modified, +1/-1)
  • docs/gateway/configuration-reference.md (modified, +2/-0)
  • docs/gateway/opentelemetry.md (modified, +6/-0)
  • src/config/schema.base.generated.ts (modified, +21/-0)
  • src/config/schema.help.ts (modified, +2/-0)
  • src/config/schema.labels.ts (modified, +1/-0)
  • src/config/types.base.ts (modified, +2/-0)
  • src/config/zod-schema.ts (modified, +1/-0)
  • src/gateway/config-reload-plan.ts (modified, +2/-1)
  • src/gateway/config-reload.test.ts (modified, +6/-0)
  • src/infra/diagnostic-events.ts (modified, +28/-0)
  • src/logging/diagnostic-session-state.ts (modified, +1/-0)
  • src/logging/diagnostic-stability.ts (modified, +13/-0)
  • src/logging/diagnostic-stuck-session-recovery.runtime.test.ts (modified, +59/-3)
  • src/logging/diagnostic-stuck-session-recovery.runtime.ts (modified, +100/-20)
  • src/logging/diagnostic.test.ts (modified, +141/-0)
  • src/logging/diagnostic.ts (modified, +203/-14)

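Bug report

🔴 Core bug: Stuck Session Recovery fails in both cases

Problem: A session stays in the `processing` state for a long time and the event loop is completely blocked, so the Gateway stops responding and is eventually force-killed by systemd on timeout.

Neither case recovers (root cause)

Case A (active work present):
```
stuck session recovery skipped: reason=active_reply_work action=keep_lane age=289s, 385s...
continues for many minutes; the system chooses to "keep the lane occupied" and takes no action
```

Case B (no active work):
```
stuck session recovery no-op: reason=no_active_work action=none lane=session:agent:main:main
the system does nothing (action=none)
```

Result: In both cases the system performs no real recovery. The session stays stuck until systemd force-kills the process on timeout.

Frequency: multiple SIGKILLs per day; the longest hang lasted 385 seconds.

Note: Our case was already reported in #73581; this is a more detailed analysis of the behavior.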


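🔴 Session preprocessing is extremely slow every time (176-307 seconds)

Problem: Every time the user sends a message, session preprocessing takes 3-5 minutes, which makes for a very poor experience.

Measured data (multiple recordings)

| Time | Total prep time |
|-------|-----------------|
| 10:02 | 183 s |
| 10:09 | 201 s |
| 10:19 | 307 s (longest) |
| 10:23 | 295 s |
| 10:31 | 281 s |
| 11:30 | 178 s |
| 12:03 | 178 s |

Breakdown of the fixed cost on each message

| Stage | Duration |
|-------|----------|
| system-prompt loading | 58-69 s |
| stream-setup | 57-102 s |
| core-plugin-tools | 30-40 s |
| bundle-tools | 26-33 s |

Core problem: These stages are re-run on every message with no caching, so the user waits 3-5 minutes for the first reply each time.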
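
The caching idea raised here can be sketched as a small in-memory cache with a time-to-live. This is a minimal illustration, not OpenClaw's API: the class name `PrepCache`, the key names, and the injected clock are all assumptions, and the issue does not say which stage results are actually safe to reuse between messages.

```typescript
// Hypothetical in-memory TTL cache for expensive prep stages; not OpenClaw's API.
interface Entry<V> {
  value: V;
  expiresAt: number; // timestamp (ms) after which the entry is stale
}

class PrepCache<V> {
  private entries = new Map<string, Entry<V>>();
  private ttlMs: number;
  private now: () => number;

  // The clock is injectable so the cache can be tested without real waiting.
  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Return the cached value if still fresh, otherwise run the loader and store the result.
  getOrLoad(key: string, load: () => V): V {
    const t = this.now();
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAt > t) return hit.value;
    const value = load();
    this.entries.set(key, { value, expiresAt: t + this.ttlMs });
    return value;
  }

  // Drop one key, for example after a config reload changes the system prompt.
  invalidate(key: string): void {
    this.entries.delete(key);
  }
}
```

If `V` is a `Promise`, concurrent callers share the same in-flight load instead of starting the stage again. In that case a rejected promise would stay cached until it expires, so a caller would likely want to call `invalidate` when the load fails.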


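🔴 Issue 3: Upgrading to 4.29 causes configuration loss

Trigger: After upgrading from 4.27 to 4.29, the systemd override.conf configuration was broken.

Symptoms

  • `FEISHU_APP_SECRET is missing or empty`
  • The Gateway failed to start several times in a row

Suggestion: Preserve user-defined configuration during upgrades and avoid overwriting the systemd override file.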
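
One way to surface this kind of loss early is a preflight check that lists required environment variables that are unset or empty, before the Gateway attempts to start. The sketch below is an assumption for illustration only: the function name `missingEnvVars` is hypothetical and nothing in the issue indicates that OpenClaw performs its check this way.

```typescript
// Hypothetical preflight check; the function name and usage are illustrative.
function missingEnvVars(
  required: readonly string[],
  env: Record<string, string | undefined>,
): string[] {
  // A variable counts as missing when it is unset or contains only whitespace.
  return required.filter((name) => (env[name] ?? "").trim() === "");
}
```

Calling `missingEnvVars(["FEISHU_APP_SECRET"], process.env)` before and after an upgrade would show whether the override file still supplies the secret.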


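Suggested fix directions

  1. Stuck Session: a forced timeout circuit breaker is needed. With no active work, force-release the lane; with active work, an upper time limit should still apply.
  2. Session preprocessing: preprocessing results should be cacheable and reusable instead of being reloaded every time.
  3. Upgrade process: preserve the user's systemd override configuration.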
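
The forced timeout in item 1 can be sketched as lane bookkeeping with a hard hold limit. This is a minimal sketch under assumptions, not OpenClaw's implementation: `LaneRegistry`, `maxHoldMs`, and the method names are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```typescript
// Hypothetical lane bookkeeping with a hard hold limit; not OpenClaw's implementation.
interface Lease {
  acquiredAt: number; // timestamp (ms) when the lane was taken
  controller: AbortController; // lets the sweeper cancel in-flight work
}

class LaneRegistry {
  private leases = new Map<string, Lease>();
  private maxHoldMs: number;

  constructor(maxHoldMs: number) {
    this.maxHoldMs = maxHoldMs;
  }

  // Take the lane if it is free; returns an AbortSignal for the work, or null if busy.
  acquire(lane: string, now: number): AbortSignal | null {
    if (this.leases.has(lane)) return null;
    const controller = new AbortController();
    this.leases.set(lane, { acquiredAt: now, controller });
    return controller.signal;
  }

  // Normal completion path: the work finished and gives the lane back.
  release(lane: string): void {
    this.leases.delete(lane);
  }

  // Abort and release every lane held for at least maxHoldMs; returns the freed lanes.
  sweep(now: number): string[] {
    const freed: string[] = [];
    this.leases.forEach((lease, lane) => {
      if (now - lease.acquiredAt >= this.maxHoldMs) {
        lease.controller.abort();
        this.leases.delete(lane);
        freed.push(lane);
      }
    });
    return freed;
  }
}
```

An abort signal only asks the work to stop. Code that blocks the event loop synchronously cannot observe the signal, and a timer-driven sweeper cannot run while the loop is blocked, so a cap like this would probably need to be paired with an external watchdog for the fully blocked case the issue describes.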

Environment

  • OpenClaw version: v2026.4.27 (4.29 has the same problem)
  • System: Linux 6.17, node 22
  • Channel: Feishu WebSocket
  • Model: bailian/qwen3.6-plus

extent analysis

TL;DR

Implement a forced timeout mechanism for stuck sessions and cache session preprocessing results to improve system responsiveness.

Guidance

  • For stuck sessions, consider introducing a timeout mechanism that releases the lane after a certain period, even if there's active work, to prevent indefinite blocking.
  • To improve session preprocessing performance, implement a caching mechanism to store and reuse preprocessing results, reducing the need for repeated computations.
  • When upgrading, ensure that user-defined configurations, such as systemd override files, are preserved to avoid configuration loss and startup failures.
  • Review the system's event loop and lane management to identify potential bottlenecks and areas for optimization.

Example

The issue does not describe the implementation in enough detail for a project-specific code example. In general terms, session preprocessing could be cached with a cache library or a simple in-memory cache.

Notes

The issue description lacks specific technical details about the system's architecture and implementation, making it challenging to provide a precise solution. The suggested guidance is based on the problem description and may require adjustments according to the actual system implementation.

Recommendation

Apply the suggested workarounds, focusing on implementing a forced timeout mechanism for stuck sessions and caching session preprocessing results, as these seem to be the most critical issues affecting system performance and responsiveness.
