openclaw - ✅(Solved) Fix v2026.4.21 upgrade regression: missing bundled runtime deps break doctor --fix, and failed recovery can rewrite openclaw.json into an invalid minimal config [1 pull request, 5 comments, 5 participants]

GitHub stats
openclaw/openclaw#70096 · Fetched 2026-04-23 07:29:22
Comments: 5
Participants: 5
Timeline: 7
Reactions: 1
Timeline (top): commented ×5, closed ×1, cross-referenced ×1

Error Message

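This incident was not caused by a single error, but by two stacked failures:

  • openclaw doctor --fix failed with: Cannot find module '@larksuiteoapi/node-sdk'
  • openclaw gateway restart then failed with: Gateway start blocked: existing config is missing gateway.mode

Root Cause

Two problems stacked on top of each other:

  1. missing bundled / extension runtime dependencies after the upgrade
  2. openclaw doctor --fix rewriting the active valid config into an incomplete minimal config on a failure path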

Fix Action

Fixed

PR fix notes

PR #70138: fix(plugins): eagerly install bundled runtime deps

Description (problem / solution / changelog)

Summary

  • Problem: packaged installs could ship bundled channel/plugin entrypoints without their runtime dependencies installed, so bootstrap paths failed with Cannot find module ... before openclaw doctor --fix could help.
  • Why it matters: fresh installs and updates could look successful while openclaw status, onboarding, or bundled channel entry loading immediately crashed.
  • What changed: packaged postinstall now eagerly installs bundled plugin runtime dependencies by default, and packaged installs fail fast if that nested install fails.
  • What did NOT change (scope boundary): source checkouts still skip eager bundled-runtime installation, and this PR does not change the config-recovery half of #70096.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #70008
  • Closes #70093
  • Closes #70099
  • Related #70096
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: packaged installs excluded dist/extensions/*/node_modules/**, and normal postinstall runs skipped bundled runtime dependency installation unless an eager-install env var was set.
  • Missing detection / guardrail: user installs did not fail when the nested install path failed, so packaged installs could silently continue with missing bundled runtime deps.
  • Contributing context (if known): release-check already forces eager install under a dedicated env flag, which masked the mismatch between release validation and normal end-user installs.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file:
    • test/scripts/postinstall-bundled-plugins.test.ts
    • src/plugins/stage-bundled-plugin-runtime-deps.test.ts
  • Scenario the test should lock in:
    • packaged installs eagerly install bundled runtime deps by default
    • packaged installs fail fast when that install step fails
    • explicit opt-out via OPENCLAW_EAGER_BUNDLED_PLUGIN_DEPS=0 still works
  • Why this is the smallest reliable guardrail:
    • these tests exercise the actual postinstall decision boundary without needing a full publish/release lane.
  • Existing test that already covers this (if any):
    • the packed install smoke and bundled channel entry smoke cover the installed layout verification path.
  • If no new test is added, why not:
    • N/A

User-visible / Behavior Changes

  • Fresh packaged installs and updates now install bundled plugin runtime deps by default.
  • If that nested install fails, the packaged install now fails immediately with a clear postinstall error instead of warning and continuing into a broken runtime state.

Diagram (if applicable)

Before:
install package -> skip bundled runtime dep install by default -> bootstrap loads bundled entry -> Cannot find module

After:
install package -> install bundled runtime deps -> bundled entry loads successfully
               \-> install fails -> postinstall throws immediately
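
The "After" flow can be sketched as a small decision function. This is a minimal illustration, not the contents of scripts/postinstall-bundled-plugins.mjs: only the env var name and its 0|false|off opt-out values come from the PR text, while shouldEagerInstall, runPostinstall, and the installer callback are hypothetical names, and the sketch ignores any opt-in path for source checkouts.

```typescript
// Hypothetical sketch of the postinstall decision boundary described above.
// Assumption: `installer` stands in for the nested npm install step.
type Env = Record<string, string | undefined>;

const OPT_OUT_VALUES = new Set(["0", "false", "off"]);

/** Packaged installs are eager by default; source checkouts skip the step. */
function shouldEagerInstall(env: Env, isPackagedInstall: boolean): boolean {
  if (!isPackagedInstall) return false;
  const raw = env["OPENCLAW_EAGER_BUNDLED_PLUGIN_DEPS"];
  if (raw === undefined) return true;
  return !OPT_OUT_VALUES.has(raw.trim().toLowerCase());
}

/** Runs the nested install and rethrows with context instead of warning and continuing. */
function runPostinstall(
  env: Env,
  isPackagedInstall: boolean,
  installer: () => void,
): "installed" | "skipped" {
  if (!shouldEagerInstall(env, isPackagedInstall)) return "skipped";
  try {
    installer();
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    throw new Error(`postinstall: bundled runtime dependency install failed: ${reason}`);
  }
  return "installed";
}
```

Keeping the decision in a pure function is what lets a unit test lock in the default, the opt-out, and the fail-fast path without a full publish lane.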

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (Yes)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation:
    • Packaged install already runs postinstall. This change makes the bundled runtime dependency install path the default packaged-install behavior, using the existing npm runner and failing fast on errors instead of silently continuing.

Repro + Verification

Environment

  • OS: macOS host verifying packaged tarball install
  • Runtime/container: Node/npm packaged install smoke
  • Model/provider: N/A
  • Integration/channel (if any): bundled channel entry smoke (Feishu, Slack, Nostr, WhatsApp, Telegram, bundled diffs)
  • Relevant config (redacted): none

Steps

  1. pnpm test test/scripts/postinstall-bundled-plugins.test.ts src/plugins/stage-bundled-plugin-runtime-deps.test.ts
  2. npm pack --ignore-scripts then npm install the tarball into a temp prefix
  3. Verify bundled runtime dep sentinels exist and run node scripts/test-built-bundled-channel-entry-smoke.mjs --package-root <installed package root>

Expected

  • packaged installs eagerly install bundled runtime deps
  • packaged installs fail on nested bundled-runtime install failure
  • bundled channel entry smoke passes from the installed layout

Actual

  • targeted tests pass
  • real packed-install smoke passes
  • bundled channel entry smoke passes from the installed package root

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios:
    • pnpm test test/scripts/postinstall-bundled-plugins.test.ts src/plugins/stage-bundled-plugin-runtime-deps.test.ts
    • packed tarball install now materializes bundled runtime dep sentinels for Feishu/Slack/Nostr/WhatsApp/Telegram/diffs deps
    • OPENCLAW_DISABLE_BUNDLED_ENTRY_SOURCE_FALLBACK=1 node scripts/test-built-bundled-channel-entry-smoke.mjs --package-root <installed package root> passes
  • Edge cases checked:
    • explicit opt-out via OPENCLAW_EAGER_BUNDLED_PLUGIN_DEPS=0
    • packaged install failure path now throws with context instead of warning and continuing
  • What you did not verify:
    • full pnpm check:changed is still blocked on unrelated existing tsgo:core:test failures in src/agents/pi-embedded-runner/*

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (Yes)
  • Migration needed? (No)
  • If yes, exact upgrade steps:
    • OPENCLAW_EAGER_BUNDLED_PLUGIN_DEPS=0|false|off now explicitly disables the default eager packaged-install behavior.

Risks and Mitigations

  • Risk: offline/air-gapped installs that cannot reach the registry now fail during packaged postinstall instead of appearing successful.
    • Mitigation: this is intentional fail-fast behavior because bundled entrypoints are expected to work immediately after install; silent success left users in a broken runtime state.

Changed files

  • scripts/postinstall-bundled-plugins.mjs (modified, +103/-52)
  • src/plugins/stage-bundled-plugin-runtime-deps.test.ts (modified, +130/-0)
  • test/scripts/postinstall-bundled-plugins.test.ts (modified, +86/-3)


Summary

After upgrading OpenClaw from 2026.4.15 to 2026.4.21 on macOS using:

openclaw update

the upgrade did not complete cleanly and left the local instance unusable.

This incident was not caused by a single error, but by two stacked problems:

  1. missing bundled / extension runtime dependencies
  2. openclaw doctor --fix rewriting the active valid config into an incomplete minimal config on a failure path

Initial failure

After upgrading to 2026.4.21, the Feishu extension runtime attempted to load:

@larksuiteoapi/node-sdk

but that dependency was not actually present in the installed global/runtime environment.

Running:

openclaw doctor --fix

failed with:

Cannot find module '@larksuiteoapi/node-sdk'

This indicates the first issue: the post-upgrade runtime environment was incomplete, at minimum missing @larksuiteoapi/node-sdk, which is required by the Feishu extension.


What happened next

After doctor --fix failed, I found that the active config file:

~/.openclaw/openclaw.json

had been reduced to a very small “minimal config” containing only a few fields, such as gateway auth and meta.

Many important parts of the original full config were no longer present in the active file, including:

  • gateway.mode
  • Telegram configuration
  • Feishu configuration
  • iMessage configuration
  • agents / other existing runtime configuration

After that, running:

openclaw gateway restart

did not recover the service. Instead, Gateway refused to start with:

Gateway start blocked: existing config is missing gateway.mode

So the second issue was not “slow startup” or a “port conflict”. The active config had been rewritten into a damaged state missing required fields, and OpenClaw correctly refused to start with it.


Root cause

Based on the recovery process, the root cause appears to be a combination of:

  1. the 2026.4.21 upgrade not leaving all required bundled plugin / extension runtime dependencies installed
  2. a failed recovery path rewriting ~/.openclaw/openclaw.json into an incomplete config, which then blocked Gateway startup

More specifically:

  • first layer: missing Feishu extension runtime dependency
  • second layer: the failure path around doctor --fix did not behave transactionally (“either fix successfully, or leave the original config untouched”)

Recovery steps

I recovered the environment as follows:

  1. Manually installed the missing dependency:

npm install -g @larksuiteoapi/node-sdk

  2. Confirmed the backup config still existed:

~/.openclaw/openclaw.json.bak

  3. Verified that the backup still contained the original full config, including:

    • gateway.mode=local
    • Telegram config
    • iMessage config
    • Feishu config
    • agents config

  4. Restored the full backup into:

~/.openclaw/openclaw.json

while preserving the current gateway auth settings, so token/auth state would remain consistent.

  5. Re-ran:

openclaw doctor --fix

This time it succeeded and automatically installed the missing bundled plugin runtime dependencies, including:

  • @larksuiteoapi/node-sdk
  • grammy
  • @grammyjs/runner
  • @grammyjs/transformer-throttler
  • @pierre/diffs
  • @pierre/theme

  6. Reinstalled and restarted Gateway:

openclaw gateway install --force
openclaw gateway restart
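
The restore step above (full backup restored while keeping the current gateway auth settings) could be scripted roughly as follows. This is a sketch under assumptions: the issue only says "gateway auth", so the gateway.auth key path is a guess, and mergeBackupKeepingAuth and restoreFromBackup are hypothetical helper names, not OpenClaw APIs.

```typescript
import * as fs from "node:fs";

// Assumption: the config is plain JSON and the auth block lives at gateway.auth.
type Json = { [key: string]: unknown };

function isObject(value: unknown): value is Json {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

/** Returns the backup config with the current gateway.auth block carried over. */
function mergeBackupKeepingAuth(backup: Json, current: Json): Json {
  const merged: Json = { ...backup };
  const currentGateway = current["gateway"];
  if (isObject(currentGateway) && "auth" in currentGateway) {
    const backupGatewayRaw = backup["gateway"];
    const backupGateway = isObject(backupGatewayRaw) ? backupGatewayRaw : {};
    merged["gateway"] = { ...backupGateway, auth: currentGateway["auth"] };
  }
  return merged;
}

/** Reads both files, merges them, and writes the result over the active config. */
function restoreFromBackup(activePath: string, backupPath: string): Json {
  const backup = JSON.parse(fs.readFileSync(backupPath, "utf8")) as Json;
  const current = JSON.parse(fs.readFileSync(activePath, "utf8")) as Json;
  const merged = mergeBackupKeepingAuth(backup, current);
  fs.writeFileSync(activePath, JSON.stringify(merged, null, 2) + "\n");
  return merged;
}
```

Everything except the auth block comes from the backup, so fields such as gateway.mode return exactly as they were before the failed repair.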

Validation after recovery

Finally, I ran:

openclaw gateway status --deep

and got:

  • Connectivity probe: ok
  • Capability: admin-capable
  • Listening: *:18789

This shows that after installing the missing dependencies, restoring the full config, and reinstalling/restarting Gateway, the instance returned to a healthy state.


Additional note about current state

After recovery, I intentionally ran:

openclaw gateway stop

and confirmed with:

lsof -nP -iTCP:18789 -sTCP:LISTEN

that nothing was listening on port 18789.

So the fact that Gateway is currently stopped was intentional after recovery, and is not part of the incident itself.


Expected behavior

I believe the expected behavior should be:

  1. openclaw update should not leave bundled extension/plugin runtime dependencies missing
  2. openclaw doctor --fix should never rewrite a valid active config into an invalid minimal config on a failure path
  3. if recovery fails, the previous valid config should remain untouched, rather than leaving behind an active config with missing fields
  4. required fields such as gateway.mode should be explicitly validated before write
  5. config repair should be transactional and rollback-safe

Suggested fixes

1. Packaging and install validation

Suggested: add validation in release / install / update flows to ensure all bundled plugin runtime dependencies are actually present.

For example:

  • smoke tests for Feishu extension runtime deps
  • smoke tests for Telegram-related runtime deps
  • dependency completeness checks after upgrade
  • real require/import validation for bundled plugin entrypoints
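
A dependency completeness check along these lines could be sketched with Node's createRequire. This is an illustrative helper, not part of OpenClaw: findMissingDeps is a hypothetical name, and the plugin directory and dependency list passed to it are assumptions supplied by the caller. It only verifies that each name resolves; it does not load the module, so it is a lighter check than a real import of the entrypoint.

```typescript
import { createRequire } from "node:module";
import * as path from "node:path";

/**
 * Returns the dependency names that cannot be resolved from `pluginDir`.
 * Resolution follows CommonJS rules, walking up parent node_modules directories.
 */
function findMissingDeps(pluginDir: string, deps: readonly string[]): string[] {
  // createRequire resolves relative to the given file path; the file need not exist.
  const requireFromPlugin = createRequire(path.join(path.resolve(pluginDir), "index.js"));
  return deps.filter((dep) => {
    try {
      requireFromPlugin.resolve(dep);
      return false;
    } catch {
      return true;
    }
  });
}
```

A post-upgrade check could call this with each bundled plugin's directory and its declared runtime dependencies, and fail the update when the returned list is non-empty. Because resolution uses CommonJS conditions, a package that only exposes an ESM entry may be reported as missing, so an actual import remains the stronger test.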

2. Safer doctor --fix behavior

Suggested: make config rewrite during doctor --fix transactional:

  1. generate candidate config
  2. fully validate candidate config
  3. replace the active config only if validation passes
  4. if validation fails, keep the original file untouched and surface a clear error
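
The four steps above can be sketched as a validate-then-rename write. This is a minimal illustration, not OpenClaw's implementation: writeConfigTransactionally and the Validator type are hypothetical, and the validator stands in for whatever schema check the real tool would use.

```typescript
import * as fs from "node:fs";

// Hypothetical validator contract: returns a list of problems, empty when valid.
type Validator = (config: unknown) => string[];

function writeConfigTransactionally(
  activePath: string,
  candidate: unknown,
  validate: Validator,
): void {
  // Step 2: validate the candidate before touching the disk.
  const problems = validate(candidate);
  if (problems.length > 0) {
    // Step 4: the original file is untouched; surface a clear error.
    throw new Error(`config not written: ${problems.join("; ")}`);
  }
  // Step 3: write a sibling temp file, then rename it over the active config.
  const tempPath = `${activePath}.tmp-${process.pid}`;
  fs.writeFileSync(tempPath, JSON.stringify(candidate, null, 2) + "\n");
  fs.renameSync(tempPath, activePath);
}
```

Writing to a temp file in the same directory and renaming it means a crash mid-write leaves either the old config or the new one on disk, never a truncated file.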

It would also help to explicitly protect critical fields such as:

  • gateway.mode
  • gateway.bind
  • channels
  • agents
  • auth / session-related required blocks
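
A pre-write guard for these fields might look like the sketch below. The path list mirrors the suggestion above, but the exact config shape is an assumption, the auth / session blocks are left out because the issue does not name their keys, and missingRequiredPaths is a hypothetical helper.

```typescript
// Assumption: these dotted paths match the real config layout.
const REQUIRED_PATHS = ["gateway.mode", "gateway.bind", "channels", "agents"] as const;

/** Walks a dotted path through nested objects; returns undefined if any hop is missing. */
function getPath(root: unknown, dottedPath: string): unknown {
  let node: unknown = root;
  for (const key of dottedPath.split(".")) {
    if (typeof node !== "object" || node === null) return undefined;
    node = (node as Record<string, unknown>)[key];
  }
  return node;
}

/** Returns the required paths that are absent (undefined or null) in the candidate. */
function missingRequiredPaths(
  candidate: unknown,
  required: readonly string[] = REQUIRED_PATHS,
): string[] {
  return required.filter((p) => {
    const value = getPath(candidate, p);
    return value === undefined || value === null;
  });
}
```

Run against a minimal config that holds only gateway auth and meta, this reports every listed path as missing, which is the signal a repair flow would need to refuse the write.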

3. Backup and recovery UX

If .bak exists, the CLI should explicitly tell the user it can be used for recovery.
If the tool is about to overwrite an existing full config with a “minimal config”, it should issue a very strong warning, or ideally refuse by default.
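
One way to detect a "full config about to become minimal" overwrite is to compare key paths before writing. The sketch below is a hypothetical guard, not an existing OpenClaw check: keyPaths, droppedPaths, and assertNoSilentShrink are illustrative names, and the allowShrink flag is an assumed escape hatch.

```typescript
/** Collects dotted key paths for every nested object key (arrays are treated as leaves). */
function keyPaths(value: unknown, prefix = ""): string[] {
  if (typeof value !== "object" || value === null || Array.isArray(value)) return [];
  const paths: string[] = [];
  for (const [key, child] of Object.entries(value)) {
    const here = prefix ? `${prefix}.${key}` : key;
    paths.push(here, ...keyPaths(child, here));
  }
  return paths;
}

/** Key paths present in the existing config but absent from the candidate. */
function droppedPaths(existing: unknown, candidate: unknown): string[] {
  const kept = new Set(keyPaths(candidate));
  return keyPaths(existing).filter((p) => !kept.has(p));
}

/** Throws when the candidate drops keys, unless the caller explicitly allows it. */
function assertNoSilentShrink(existing: unknown, candidate: unknown, allowShrink = false): void {
  const dropped = droppedPaths(existing, candidate);
  if (dropped.length > 0 && !allowShrink) {
    throw new Error(`refusing to overwrite config: candidate drops ${dropped.join(", ")}`);
  }
}
```

The error message names the dropped paths, so the user can see what would be lost and decide whether to restore from .bak instead.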

4. Better error separation

In this incident, the user-visible failures were layered:

  • missing dependency
  • doctor --fix failure
  • config being collapsed into an incomplete state
  • Gateway refusing to start because gateway.mode was missing

It would help if CLI output separated these phases clearly, instead of making it look like “Gateway just failed to start”.
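
Phase-separated output could be produced by a runner that names each stage and stops at the first failure. This is only a sketch of the idea: the Phase interface, runPhases, and the phase names a caller would pass are hypothetical, not OpenClaw CLI behavior.

```typescript
interface Phase {
  name: string;
  run: () => void;
}

/** Runs phases in order; after the first failure, later phases are reported as skipped. */
function runPhases(phases: readonly Phase[]): string[] {
  const report: string[] = [];
  let failed = false;
  for (const phase of phases) {
    if (failed) {
      report.push(`${phase.name}: skipped`);
      continue;
    }
    try {
      phase.run();
      report.push(`${phase.name}: ok`);
    } catch (err) {
      failed = true;
      const reason = err instanceof Error ? err.message : String(err);
      report.push(`${phase.name}: FAILED (${reason})`);
    }
  }
  return report;
}
```

With a report like this, a missing-module failure in the dependency phase is shown as the first broken stage, and later stages are marked skipped rather than surfacing as an unrelated Gateway error.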


Environment

  • Upgrade path: 2026.4.15 -> 2026.4.21
  • Update command: openclaw update
  • OS: macOS
  • Affected area:
    • bundled plugin runtime dependencies
    • doctor --fix recovery path
    • config safety / config rewrite
    • gateway startup validation

If helpful

If maintainers need them, I can also provide additional redacted materials:

  • before/after structure of openclaw.json when it was collapsed into a minimal config
  • config field differences after restoring from .bak
  • the full error logs
  • the dependency-install output from the successful doctor --fix

extent analysis

TL;DR

To fix the issue, manually install the missing dependency @larksuiteoapi/node-sdk and restore the original configuration from the backup file ~/.openclaw/openclaw.json.bak to ensure all required fields are present.

Guidance

  1. Manually install missing dependencies: Run npm install -g @larksuiteoapi/node-sdk to install the dependency that doctor --fix could not load.
  2. Restore original configuration: Restore ~/.openclaw/openclaw.json from ~/.openclaw/openclaw.json.bak to bring back the full configuration, including critical fields like gateway.mode, while keeping the current gateway auth settings so token/auth state stays consistent.
  3. Re-run doctor --fix: After restoring the configuration, re-run openclaw doctor --fix. In the original report it then succeeded and installed the remaining missing bundled plugin runtime dependencies.
  4. Verify Gateway startup: Run openclaw gateway install --force and openclaw gateway restart, then confirm health with openclaw gateway status --deep.

Example

No separate code example is needed for the workaround: it consists of the command-line operations and the config file restore listed under Guidance.

Notes

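  • The issue was reported on the upgrade from 2026.4.15 to 2026.4.21 on macOS. PR #70138 describes the missing-dependency half as affecting packaged fresh installs and updates in general, and it does not change the config-recovery half of #70096.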
  • The doctor --fix command's behavior of rewriting the configuration into a minimal state on failure is identified as a contributing factor to the issue.
  • Restoring the original configuration from the backup file is crucial to recovering from the issue.

Recommendation

Apply the workaround by manually installing missing dependencies and restoring the original configuration from the backup file. This approach directly addresses the identified root causes of the issue.
