openclaw - ✅(Solved) Fix Skills snapshot not invalidated on /restart or gateway restart [3 pull requests, 1 participant]

GitHub stats
openclaw/openclaw#54938 (fetched 2026-04-08 01:34:18)
Comments: 0 · Participants: 1 · Timeline events: 3 · Reactions: 0
Timeline (top): cross-referenced ×2, referenced ×1

Fix Action

Workaround

Manually delete the skillsSnapshot field from the affected session in <agentDir>/sessions/sessions.json and restart the gateway.

PR fix notes

PR #54969: fix: invalidate stale skillsSnapshot on gateway restart

Description (problem / solution / changelog)

Summary

The skillsSnapshot cached in sessions.json was never invalidated on gateway restart because the staleness check only looked at whether a snapshot existed, not whether it was outdated.

Changes

  • Moved skillsSnapshotVersion computation before the needsSkillsSnapshot check
  • Compare the cached snapshotVersion against the current getSkillsSnapshotVersion() to detect stale snapshots and rebuild them
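The two changes above amount to a version comparison in place of a bare existence check. A minimal sketch of that idea in Python (the project itself is TypeScript), assuming the cached snapshot records the version it was built at; `needs_skills_snapshot` and the `version` field name are hypothetical and are not taken from `agent-command.ts`:

```python
from typing import Optional


def needs_skills_snapshot(entry: Optional[dict], current_version: int) -> bool:
    """Return True when the session has no snapshot or its snapshot is stale."""
    snapshot = (entry or {}).get("skillsSnapshot")
    if snapshot is None:
        # Nothing cached yet: the only case the old existence check caught.
        return True
    # Hypothetical field: the skills version recorded when the snapshot was built.
    return snapshot.get("version") != current_version


cached = {"skillsSnapshot": {"version": 3, "skills": ["existing-skill"]}}
print(needs_skills_snapshot(cached, 3))  # False: snapshot is current, reuse it
print(needs_skills_snapshot(cached, 4))  # True: version moved on, rebuild
```

With a check of this shape, anything that bumps the current version (a restart, a filesystem change) forces a rebuild on the next turn without touching sessions.json directly.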

Testing

  • Adding a new skill and restarting now correctly picks up the new skill in existing sessions.

Fixes openclaw/openclaw#54938

Changed files

  • src/agents/agent-command.ts (modified, +3/-1)

PR #55021: fix: invalidate skills snapshot on gateway restart

Description (problem / solution / changelog)

Fixes #54938

Skills snapshot cached in sessions was never refreshed on /restart or gateway restart. Now: (1) bump the global skills snapshot version before SIGUSR1, and (2) compare snapshot version in agent-command.ts so stale snapshots get rebuilt.

Changed files

  • src/agents/agent-command.ts (modified, +5/-1)
  • src/infra/restart.ts (modified, +2/-0)

PR #67401: fix(stability): session skills snapshot, tool-loop guard, TUI watchdog, LM Studio preload backoff

Description (problem / solution / changelog)

Summary

Four stability fixes for issues hit during a single self-hosted OpenClaw + LM Studio debugging session. All are reproducible, low-surface, and include unit tests.

  • Problem: disabling a bundled skill in config still left the model calling it, producing infinite `Tool X not found` loops until the embedded-run timeout. Root cause: the `skillsSnapshot` persisted in `sessions.json` was never invalidated when `skills.*` config changed.
  • Problem: the `unknownToolThreshold` stream guard was gated behind `tools.loopDetection.enabled`, which defaults to `false`. The protection against hallucinated / removed tool calls was effectively off in the stock config.
  • Problem: the TUI `streaming · Xm Ys` indicator never reset when the gateway's `state: "final"` event was lost (WS reconnect, gateway restart, etc.), leaving the TUI stuck indefinitely until killed.
  • Problem: LM Studio's memory guardrail rejecting `POST /v1/models/load` caused OpenClaw to re-hit the endpoint on every chat request (~every 2s), producing hundreds of WARN log lines per hour without useful retry semantics.
  • Why it matters: each of these failure modes amplifies any other local-model hiccup into a session-long stuck state that users have to recover manually.
  • What changed:
    • `src/gateway/config-reload.ts`: bump `skillsSnapshotVersion` when a config diff touches `skills.*`, via a new `shouldInvalidateSkillsSnapshotForPaths` helper wired into the single `applySnapshot` code path (covers both watcher writes and in-process `config.apply`).
    • `src/agents/pi-embedded-runner/run/attempt.ts`: make `resolveUnknownToolGuardThreshold` always return a positive threshold (default 10) regardless of `tools.loopDetection.enabled`. The guard is a pure safety net with no false-positive surface.
    • `src/tui/tui-event-handlers.ts`: add a 30s delta-silence watchdog that resets `activityStatus` to `idle` on timeout and surfaces a short system-log note; exposes `dispose()` + configurable `streamingWatchdogMs` context option.
    • `extensions/lmstudio/src/stream.ts`: add per-`(baseUrl, modelKey, contextLength)` cooldown (5s → 10s → 20s → … → 5min cap) after preload failures; during cooldown the wrapper skips preload entirely and runs the inference stream directly (the model is often already loaded via LM Studio's UI). The log line now carries consecutive-failure count and remaining cooldown.
  • What did NOT change (scope boundary): no schema additions, no new config keys (except the TUI `streamingWatchdogMs` developer option), no changes to the public Plugin SDK, no touches to session persistence format, no changes to the `detectToolCallLoop` / `before-tool-call` dispatcher path.
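The `skills.*` path check from the first "What changed" item could look roughly like the following. The PR names the helper `shouldInvalidateSkillsSnapshotForPaths` but the source does not show its body, so this Python sketch is a guess at its behavior, assuming the config diff is reported as a list of dotted paths:

```python
from typing import Iterable


def should_invalidate_skills_snapshot_for_paths(changed_paths: Iterable[str]) -> bool:
    """Return True if any changed config path is `skills` or lies under `skills.`."""
    return any(
        path == "skills" or path.startswith("skills.")
        for path in changed_paths
    )


# Only diffs that touch the skills subtree should bump the snapshot version.
print(should_invalidate_skills_snapshot_for_paths(["skills.allowBundled"]))          # True
print(should_invalidate_skills_snapshot_for_paths(["tools.loopDetection.enabled"]))  # False
```

Matching on `skills.` with the trailing dot, rather than on the bare prefix `skills`, avoids false hits on unrelated keys that merely start with the same letters.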

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #54938
  • Closes #62971
  • Related to #49059, #65346, #56075, #41972, #51154
  • This PR fixes a bug or regression

Root Cause (if applicable)

  1. Skills snapshot staleness. `getSkillsSnapshotVersion` is only bumped on filesystem changes to the skills tree (watched via chokidar in `src/agents/skills/refresh.ts`). Config changes to `skills.allowBundled` / `skills.entries.*` / `skills.profile` go through `config-reload` and `config.apply` without ever bumping the version. Sessions keep reading the cached `skillsSnapshot` from `sessions.json`, advertising skills that no longer exist in the active config.

  2. Unknown-tool guard gated behind opt-in flag. A historical commit introduced `tools.loopDetection.enabled` as a coarse toggle for repetition detectors (genericRepeat / pingPong / pollNoProgress). The unknown-tool guard is a different concern (it only triggers when the model is calling a tool that is objectively not registered in this run), but it was wired through the same gate. A default install effectively had no protection against `Tool X not found` loops.

  3. TUI streaming state. `tui-event-handlers.ts` transitions `activityStatus` to `streaming` on `evt.state === "delta"` and back to `idle` / `error` only on the corresponding `final` / `aborted` / `error` event. The gateway does emit a `final` event, but delivery over the WebSocket is not atomic with the run lifecycle — a connection flap or gateway restart between completion and delivery leaves the indicator pinned. There was no client-side fallback.

  4. LM Studio preload. The wrapper dedupes concurrent preloads via `preloadInFlight`, but on failure it immediately drops the entry in `.finally()`, so the next chat request creates a brand-new preload. Combined with no circuit breaker on the failure, a guardrail rejection (memory pressure, model not loadable) produces one failed preload per chat request for as long as the user stays in the session.
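Root cause 4 comes down to a missing backoff. The cooldown schedule the PR describes (5s → 10s → 20s → … → 5min cap, keyed per `(baseUrl, modelKey, contextLength)`) can be sketched as below. This is an illustrative Python model, not the code in `extensions/lmstudio/src/stream.ts`; the class and method names are hypothetical, and the clock is passed in explicitly to keep the example deterministic:

```python
from typing import Dict, Tuple

BASE_COOLDOWN_S = 5.0
MAX_COOLDOWN_S = 300.0  # 5 minute cap

Key = Tuple[str, str, int]  # (base_url, model_key, context_length)


def cooldown_seconds(consecutive_failures: int) -> float:
    """5s, 10s, 20s, ... doubling per consecutive failure, capped at 5 minutes."""
    if consecutive_failures <= 0:
        return 0.0
    # Clamp the exponent so very large failure counts cannot overflow.
    exponent = min(consecutive_failures - 1, 10)
    return min(BASE_COOLDOWN_S * (2 ** exponent), MAX_COOLDOWN_S)


class PreloadCooldown:
    """Tracks, per key, how many preloads failed in a row and when to retry."""

    def __init__(self) -> None:
        self._state: Dict[Key, Tuple[int, float]] = {}  # key -> (failures, retry_at)

    def should_skip(self, key: Key, now: float) -> bool:
        """True while the key is cooling down: skip preload, stream directly."""
        failures, retry_at = self._state.get(key, (0, 0.0))
        return failures > 0 and now < retry_at

    def record_failure(self, key: Key, now: float) -> float:
        """Register a failed preload and return the cooldown applied."""
        failures = self._state.get(key, (0, 0.0))[0] + 1
        delay = cooldown_seconds(failures)
        self._state[key] = (failures, now + delay)
        return delay

    def record_success(self, key: Key) -> None:
        """Any successful preload resets the backoff for that key."""
        self._state.pop(key, None)
```

Keying the state by the full tuple means a failure for one model or context length does not suppress preloads for another.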

Test Plan

  • New unit coverage in each affected area (`config-reload.test.ts`, `attempt.test.ts`, `tui-event-handlers.test.ts`, `extensions/lmstudio/src/stream.test.ts`).
  • `pnpm test <touched files>` → 189 / 189 passing.
  • `pnpm tsgo` clean.
  • `pnpm check` clean (0 warnings, 0 errors across oxlint / import-cycle / madge / custom webhook / pairing-scope guards).

Scoped tests cover: skills-version bump wiring + path-prefix helper, threshold override / default / invalid-input / fractional-floor for the unknown-tool guard, watchdog arm / disarm / rearm / dispose / zero-disabled cases, preload cooldown skip + expiry-based retry.
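The threshold cases listed above (override, default, invalid input, fractional floor) suggest a resolver of roughly this shape. The sketch below is an assumption about the behavior of `resolveUnknownToolGuardThreshold`, written in Python for illustration; in particular, falling back to the default of 10 on invalid or non-positive input is a guess, since the source only states that the result is always positive:

```python
import math
from typing import Any

DEFAULT_UNKNOWN_TOOL_THRESHOLD = 10  # default stated in the PR description


def resolve_unknown_tool_guard_threshold(configured: Any = None) -> int:
    """Return a positive integer threshold, independent of any enable flag."""
    # Reject non-numbers (and booleans, which Python treats as ints).
    if isinstance(configured, bool) or not isinstance(configured, (int, float)):
        return DEFAULT_UNKNOWN_TOOL_THRESHOLD
    # Reject NaN and infinities.
    if isinstance(configured, float) and not math.isfinite(configured):
        return DEFAULT_UNKNOWN_TOOL_THRESHOLD
    floored = math.floor(configured)  # fractional values round down
    return floored if floored >= 1 else DEFAULT_UNKNOWN_TOOL_THRESHOLD


print(resolve_unknown_tool_guard_threshold())     # 10
print(resolve_unknown_tool_guard_threshold(3.7))  # 3
print(resolve_unknown_tool_guard_threshold(-2))   # 10
```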

Reviewer notes

  • `bumpSkillsSnapshotVersion` invalidates globally rather than per-session; that matches existing callers from `src/infra/skills-remote.ts` and is the desired semantic here. A config change is intended to affect every agent session under the workspace.
  • I considered adding a dispatcher-side guard in `tool-loop-detection.ts` too, but the observed loops go through the stream-wrapper path before a tool is ever dispatched (the model's output is not a call to a real tool), so that would not have helped.
  • The LM Studio cooldown resets on any successful preload, so healthy sessions are unaffected.
  • The TUI watchdog uses `.unref()` on its timer to avoid blocking process exit, and `dispose()` is exposed for callers that want to cancel it deterministically.
  • No changes to `AGENTS.md` / `CLAUDE.md` / `docs/` since the behavior changes are implementation-level and covered by the existing skills / streaming / LM Studio docs. Happy to add notes if reviewers prefer.
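The delta-silence watchdog mentioned in these notes can be modeled as a deadline that is pushed forward on every delta and cleared on any terminal event. The Python sketch below is illustrative only: the real code in `tui-event-handlers.ts` uses a timer with `.unref()`, whereas this version is polled with an explicit clock so its behavior is deterministic, and all names except `dispose` are hypothetical:

```python
from typing import Callable, Optional


class StreamingWatchdog:
    """Fires a callback when no delta has arrived within the timeout."""

    def __init__(self, timeout_s: float, on_timeout: Callable[[], None]) -> None:
        self._timeout_s = timeout_s
        self._on_timeout = on_timeout
        self._deadline: Optional[float] = None

    def on_delta(self, now: float) -> None:
        """Arm or re-arm the watchdog; a timeout of 0 disables it."""
        if self._timeout_s > 0:
            self._deadline = now + self._timeout_s

    def on_terminal_event(self) -> None:
        """A final / aborted / error event arrived, so disarm."""
        self._deadline = None

    def poll(self, now: float) -> bool:
        """Fire at most once per arming; return True if the callback ran."""
        if self._deadline is None or now < self._deadline:
            return False
        self._deadline = None
        self._on_timeout()  # e.g. reset activityStatus to idle and log a note
        return True

    def dispose(self) -> None:
        """Cancel any pending timeout deterministically."""
        self._deadline = None


status = {"activity": "streaming"}
watchdog = StreamingWatchdog(30.0, lambda: status.update(activity="idle"))
watchdog.on_delta(now=0.0)
watchdog.poll(now=31.0)    # no delta for more than 30s, so the callback runs
print(status["activity"])  # idle
```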

Changed files

  • CHANGELOG.md (modified, +4/-0)
  • extensions/lmstudio/src/stream.test.ts (modified, +117/-2)
  • extensions/lmstudio/src/stream.ts (modified, +119/-19)
  • src/agents/pi-embedded-runner/run/attempt.test.ts (modified, +21/-7)
  • src/agents/pi-embedded-runner/run/attempt.ts (modified, +14/-4)
  • src/agents/skills/refresh-state.ts (modified, +1/-1)
  • src/gateway/config-reload.test.ts (modified, +100/-0)
  • src/gateway/config-reload.ts (modified, +34/-0)
  • src/tui/tui-event-handlers.test.ts (modified, +221/-1)
  • src/tui/tui-event-handlers.ts (modified, +65/-1)

Bug

When a new skill is added (either via config allowlist or by placing it in a workspace/managed skills directory), the skillsSnapshot cached in sessions.json is not invalidated by /restart or gateway restart. The stale snapshot persists and the new skill never appears in the session prompt.

Steps to Reproduce

  1. Have an existing session with a cached skillsSnapshot in sessions.json
  2. Add a new skill to the workspace skills directory (e.g. <workspace>/skills/my-skill/SKILL.md)
  3. Add the skill name to the agent's skills allowlist in config (if applicable)
  4. Restart the gateway (SIGUSR1) or run the /restart command
  5. Send a message in the session

Expected: The skill appears in the session's available skills list
Actual: The skill does not appear. The stale skillsSnapshot from sessions.json is reused.

Evidence

  • Gateway logs confirm the skill is loaded (Discord slash command count increased from 133 to 134)
  • The skillsSnapshot field in sessions.json for the session does not contain the new skill
  • Multiple gateway restarts and /restart commands did not clear the snapshot
  • Manually deleting the skillsSnapshot key from sessions.json and restarting fixed the issue

Environment

  • OpenClaw 2026.3.24
  • macOS (arm64)
  • Multi-agent setup with per-agent skills allowlist

Suggested Fix

/restart (and gateway restart) should invalidate the skillsSnapshot so it rebuilds on the next turn. Alternatively, compare a hash of the current eligible skills against the cached snapshot and refresh if stale.
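The hash-comparison alternative could be sketched as follows. This is an illustrative Python example, not project code: the function names are hypothetical, and it fingerprints skill names only, so an edit to an existing skill's SKILL.md would need the file contents (or modification times) added to the hashed input:

```python
import hashlib
import json
from typing import Iterable, Optional


def skills_fingerprint(skill_names: Iterable[str]) -> str:
    """Order-independent SHA-256 fingerprint of the eligible skill names."""
    canonical = json.dumps(sorted(set(skill_names)), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def snapshot_is_stale(cached_fingerprint: Optional[str], eligible: Iterable[str]) -> bool:
    """True when the cached fingerprint is missing or no longer matches."""
    return cached_fingerprint != skills_fingerprint(eligible)


cached = skills_fingerprint(["search", "calendar"])
print(snapshot_is_stale(cached, ["calendar", "search"]))              # False
print(snapshot_is_stale(cached, ["calendar", "search", "my-skill"]))  # True
```

Unlike a restart hook, a comparison of this kind also catches skills added while the gateway keeps running.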

extent analysis

Fix Plan

To fix the issue, we need to invalidate the skillsSnapshot when a new skill is added or when the gateway restarts. Here are the steps:

  • Modify the /restart command to clear the skillsSnapshot field from the sessions.json file.
  • Update the gateway restart logic to also clear the skillsSnapshot field.
  • Alternatively, implement a hash comparison to refresh the skillsSnapshot if it's stale.

Example Code

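import json
import os

def invalidate_skills_snapshot(session_id, agent_dir):
    """Drop the cached skillsSnapshot for one session; return True if removed."""
    sessions_file = os.path.join(agent_dir, 'sessions', 'sessions.json')
    with open(sessions_file, 'r', encoding='utf-8') as f:
        sessions = json.load(f)
    # Assumes sessions.json is a JSON object keyed by session id.
    entry = sessions.get(session_id)
    if not isinstance(entry, dict) or 'skillsSnapshot' not in entry:
        return False
    del entry['skillsSnapshot']
    # Write to a temp file, then rename, so a crash cannot leave a truncated file.
    tmp_file = sessions_file + '.tmp'
    with open(tmp_file, 'w', encoding='utf-8') as f:
        json.dump(sessions, f, indent=2)
    os.replace(tmp_file, sessions_file)
    return True

# Call this function when the /restart command is executed or when the gateway restarts.
# The arguments below are placeholders; substitute a real session id and agent directory.
if __name__ == '__main__':
    invalidate_skills_snapshot('session_id', '/path/to/agent/dir')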

Verification

To verify that the fix worked, follow these steps:

  • Add a new skill to the workspace skills directory.
  • Restart the gateway or execute the /restart command.
  • Send a message in the session and check if the new skill appears in the available skills list.
  • Verify that the skillsSnapshot field in sessions.json has been updated or cleared.

Extra Tips

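  • Handle errors when reading or writing sessions.json, such as a missing file, invalid JSON, or a session id that is not present.
  • Consider keying the cached snapshot to a version or hash of the eligible skills, so that changes to the skills directory or to the skills config invalidate it automatically.
  • Test the fix against both triggers named in the issue (/restart and a gateway restart) and against both ways of adding a skill (config allowlist and skills directory).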
