openclaw: [Bug]: in-process gateway restart silently drops user config (stale startupConfigSnapshotRead reuse)


Bug type

Behavior bug — in-process gateway restart silently drops user-applied config.

Summary

When the gateway uses the in-process restart path (the default in container environments without OPENCLAW_SYSTEMD_UNIT), every SIGUSR1 restart boots the new gateway from a stale startupConfigSnapshotRead that was captured before the restart loop began. During normal startup the new gateway then overwrites the on-disk openclaw.json with that stale snapshot, dropping every key the user wrote between the original boot and the restart.

The user's symptom is "the bot disappeared after I configured it" — bot token reverts to empty, agent default model reverts to upstream's openai/gpt-5.5, channels become disabled, auth profiles vanish.

Reproduction (verified on 2026.5.7)

  1. Build a tenant container with OPENCLAW_SYSTEMD_UNIT unset (or run a fresh tenant on Docker without supervisor markers).
  2. Wait for first gateway ready.
  3. Write a configureBot-shaped batch (sets auth.profiles.<rotation-id>, agents.defaults.model.primary, channels.telegram.botToken, channels.telegram.enabled, plugins.entries.telegram.enabled, env.<provider>_API_KEY for env-driven providers).
  4. The batch contains restart-required paths (auth.profiles[*] and env.* are unmatched in the reload-plan and default to restart).
  5. Watch the gateway log "received SIGUSR1; restarting" → "restart mode: in-process restart (container: use in-process restart to keep PID 1 alive)" → "gateway ready".
  6. Read the on-disk openclaw.json: every user-written key is gone, replaced by what the gateway snapshot held before the user's writes.
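Step 4 relies on a "default to restart" rule: a changed path that no reload rule matches forces a full restart. A minimal sketch of that rule follows. The prefix table, the function names, and the sample paths are assumptions made for illustration; the real reload-plan is not shown in this report.

```javascript
// Hypothetical table of path prefixes that could be hot-reloaded.
// The real reload-plan rules are not part of this report.
const HOT_RELOAD_PREFIXES = ["agents.", "channels.", "plugins.entries."];

// Classify one changed config path; anything unmatched falls back to "restart".
function classifyPath(path) {
    const matched = HOT_RELOAD_PREFIXES.some((prefix) => path.startsWith(prefix));
    return matched ? "hot-reload" : "restart";
}

// A batch needs a restart as soon as one of its paths does.
function planBatch(changedPaths) {
    const restartPaths = changedPaths.filter((p) => classifyPath(p) === "restart");
    return { action: restartPaths.length > 0 ? "restart" : "hot-reload", restartPaths };
}

// Sample batch shaped like the one in step 3 (identifiers are placeholders).
const plan = planBatch([
    "auth.profiles.example-rotation",
    "agents.defaults.model.primary",
    "channels.telegram.botToken",
    "env.EXAMPLE_API_KEY",
]);
console.log(plan.action, plan.restartPaths);
```

Under these assumed rules, the auth.profiles and env paths are the ones that push the whole batch onto the restart path, which is what triggers the SIGUSR1 in step 5.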

Empirical reproduction artifacts:

  • Pre-write disk SHA: f960c1e3…
  • Post-restart disk SHA: 21876867… (the file is smaller than before the write, because the gateway wrote back its stale projection)
  • Post-restart agents.defaults.model.primary: missing (the user-supplied kimi/kimi-code was dropped)
  • Post-restart channels.telegram.enabled: false (user wrote true)
  • Post-restart channels.telegram.botToken: removed entirely
  • Gateway log post-restart: agent model: openai/gpt-5.5 (upstream default, not the user's kimi/kimi-code)
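A changed SHA only shows that the file differs. To see which keys were lost, the written batch can be compared with the post-restart file. The sketch below is one way to do that; flattenPaths and diffConfig are helper names invented here, and the sample objects only mirror the shape of the artifacts listed above.

```javascript
// Flatten a config object into dotted leaf paths, e.g. "channels.telegram.enabled".
function flattenPaths(value, prefix = "", out = new Map()) {
    if (value !== null && typeof value === "object" && !Array.isArray(value)) {
        for (const [key, child] of Object.entries(value)) {
            flattenPaths(child, prefix ? `${prefix}.${key}` : key, out);
        }
    } else {
        out.set(prefix, value);
    }
    return out;
}

// Report which written leaf paths are missing or different after the restart.
function diffConfig(written, afterRestart) {
    const before = flattenPaths(written);
    const after = flattenPaths(afterRestart);
    const dropped = [];
    const changed = [];
    for (const [path, value] of before) {
        if (!after.has(path)) dropped.push(path);
        else if (JSON.stringify(after.get(path)) !== JSON.stringify(value)) changed.push(path);
    }
    return { dropped, changed };
}

// Illustrative data only; the token value is a placeholder.
const written = {
    agents: { defaults: { model: { primary: "kimi/kimi-code" } } },
    channels: { telegram: { enabled: true, botToken: "<redacted>" } },
};
const afterRestart = {
    agents: { defaults: {} },
    channels: { telegram: { enabled: false } },
};
const report = diffConfig(written, afterRestart);
console.log(report);
```

In a real check, both objects would come from parsing openclaw.json before and after the restart, assuming the file parses as plain JSON.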

The supervised-respawn path (OPENCLAW_SYSTEMD_UNIT set, gateway exits, Docker restart_policy: unless-stopped revives) is not affected — the new container reads openclaw.json from disk fresh.

Root cause

In src/cli/gateway-cli/run.ts (in 2026.5.7 the relevant compiled file is dist/run-DVqWLkV9.js):

// dist/run-DVqWLkV9.js around line 685
const { cfg, snapshot, startupConfigSnapshotRead } = await readGatewayStartupConfig({ startupTrace });
// ... outside the for-loop ...

// dist/run-DVqWLkV9.js around line 857-868
const startLoop = async () => await runGatewayLoop({
    runtime: defaultRuntime,
    lockPort: port,
    healthHost,
    start: async ({ startupStartedAt } = {}) => await startGatewayServer(port, {
        bind,
        auth: authOverride,
        tailscale: tailscaleOverride,
        startupStartedAt,
        ...startupConfigSnapshotRead ? { startupConfigSnapshotRead } : {}    // ← captured ONCE
    })
});

startupConfigSnapshotRead is captured ONCE before the restart loop. Every iteration of runGatewayLoop's for(;;) calls params.start() with the SAME stale snapshot.
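The capture-once behavior can be reproduced without any gateway code. The sketch below uses an in-memory object as a stand-in for openclaw.json; every name in it is illustrative and none of it is taken from the OpenClaw source.

```javascript
// In-memory stand-in for openclaw.json (illustrative only).
let disk = { model: "default" };
const readFromDisk = () => ({ ...disk });

// Captured once, before the loop, like startupConfigSnapshotRead.
const capturedSnapshot = readFromDisk();

// Stand-in for params.start(): the closure only ever sees the captured value.
const start = () => capturedSnapshot;

const seen = [];
for (let iteration = 1; iteration <= 3; iteration++) {
    seen.push(start().model);
    // A user config write lands on disk between restarts.
    disk = { model: `user-choice-${iteration}` };
}
console.log(seen); // [ 'default', 'default', 'default' ]
```

Each iteration sees the value from before the loop even though the stand-in disk changes every time, which matches the stale reuse described above.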

In src/gateway/server-startup-config.ts (dist/server-startup-config-*.js):

async function loadGatewayStartupConfigSnapshot(params) {
    const measure = params.measure ?? (async (_name, run) => await run());
    let snapshotRead = params.initialSnapshotRead ?? await measure(
        "config.snapshot.read",
        () => readConfigFileSnapshotWithPluginMetadata({ measure })
    );
    // ...
}

When initialSnapshotRead is provided (which it always is on iterations >= 2 of the in-process restart loop), loadGatewayStartupConfigSnapshot honors it and never re-reads from disk. The gateway then proceeds to apply this stale snapshot via setRuntimeConfigSnapshot, and downstream config writes (auth bootstrap, plugin auto-enable persistence, control-ui seed, etc.) overwrite openclaw.json with the stale projection.
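The short-circuit and the later write-back can be modeled in a few lines. This is a simplified sketch, not the real loader: loadSnapshot, simulateStaleRestart, and the seeded key (standing in for any downstream startup write) are names assumed for illustration.

```javascript
// Simplified stand-in for loadGatewayStartupConfigSnapshot: when an
// initialSnapshotRead is provided, `??` returns it and readDisk is never called.
function loadSnapshot({ initialSnapshotRead, readDisk }) {
    return initialSnapshotRead ?? readDisk();
}

function simulateStaleRestart() {
    const disk = { current: { botToken: "" } };
    let diskReads = 0;
    const readDisk = () => {
        diskReads += 1;
        return { ...disk.current };
    };

    // First boot: the snapshot is read once, before the restart loop.
    const initialSnapshotRead = readDisk();

    // The user writes config while the gateway is running.
    disk.current = { botToken: "user-token", enabled: true };

    // Restart iteration: the stale snapshot is passed in again.
    const runtimeConfig = loadSnapshot({ initialSnapshotRead, readDisk });

    // A downstream startup write persists the runtime projection.
    disk.current = { ...runtimeConfig, seeded: true };

    return { diskReads, onDisk: disk.current };
}

const result = simulateStaleRestart();
console.log(result); // { diskReads: 1, onDisk: { botToken: '', seeded: true } }
```

The disk is read only once, and the user's botToken and enabled values are gone after the write-back.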

Proposed fix

Drop startupConfigSnapshotRead after the first params.start() call so subsequent restart-loop iterations re-read the file from disk:

let isFirstStart = true;
const startLoop = async () => await runGatewayLoop({
    runtime: defaultRuntime,
    lockPort: port,
    healthHost,
    start: async ({ startupStartedAt } = {}) => {
        const opts = { bind, auth: authOverride, tailscale: tailscaleOverride, startupStartedAt };
        if (isFirstStart && startupConfigSnapshotRead) {
            opts.startupConfigSnapshotRead = startupConfigSnapshotRead;
        }
        isFirstStart = false;
        return await startGatewayServer(port, opts);
    }
});

Cost on the happy path: one extra disk read per restart (negligible — a few hundred microseconds for readConfigFileSnapshotWithPluginMetadata). The current first-start optimization is preserved.
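The proposed guard can be exercised in isolation before touching the gateway. The sketch below pulls the isFirstStart logic into a hypothetical makeStart helper and runs it against an in-memory stand-in for the config file; startServer here is a stub that returns the model it would boot with, not the real startGatewayServer.

```javascript
// Build a start() callback that passes the captured snapshot only on the first call.
function makeStart({ startupConfigSnapshotRead, startServer }) {
    let isFirstStart = true;
    return () => {
        const opts = {};
        if (isFirstStart && startupConfigSnapshotRead) {
            opts.startupConfigSnapshotRead = startupConfigSnapshotRead;
        }
        isFirstStart = false;
        return startServer(opts);
    };
}

// In-memory stand-in for openclaw.json and a stub server (illustrative only).
const disk = { model: "default" };
const readDisk = () => ({ ...disk });
const startServer = (opts) => (opts.startupConfigSnapshotRead ?? readDisk()).model;

const start = makeStart({ startupConfigSnapshotRead: readDisk(), startServer });
const first = start();          // uses the captured snapshot
disk.model = "kimi/kimi-code";  // user write between restarts
const second = start();         // re-reads the stand-in disk
console.log(first, second);     // default kimi/kimi-code
```

The first start still benefits from the pre-read snapshot, while the second picks up the user's write.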

Alternatively, loadGatewayStartupConfigSnapshot could detect "called from a restart iteration" and force re-read regardless of initialSnapshotRead. The above is more local.
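For comparison, the alternative could look roughly like the sketch below. The isRestart parameter and the function name are hypothetical; nothing in this report shows how the loader would learn that it is running in a restart iteration.

```javascript
// Hypothetical loader variant: `isRestart` is NOT a real parameter in the source.
// On a restart iteration the disk is always re-read, even if a snapshot was passed.
function loadSnapshotForStart({ initialSnapshotRead, isRestart, readDisk }) {
    if (isRestart || initialSnapshotRead == null) {
        return readDisk();
    }
    return initialSnapshotRead;
}

const stale = { model: "default" };
const fresh = { model: "kimi/kimi-code" };
const readDisk = () => fresh;

const firstBoot = loadSnapshotForStart({ initialSnapshotRead: stale, isRestart: false, readDisk });
const afterRestart = loadSnapshotForStart({ initialSnapshotRead: stale, isRestart: true, readDisk });
console.log(firstBoot.model, afterRestart.model); // default kimi/kimi-code
```

This variant protects every caller of the loader, but it needs a new signal threaded through the call chain, which is why the caller-side change is the more local of the two.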

Impact

  • Container deployments without supervisor markers: all SIGUSR1 / config-set-triggered restarts wipe user config. Affects every BYOK reconfigure, every commands.ownerAllowFrom bootstrap from pairing approve, every plugin install/enable that touches restart-required paths.
  • Workaround in use: force the supervised-respawn path by setting a fake systemd marker (OPENCLAW_SYSTEMD_UNIT=openclaw-tenant.service). The gateway then exits cleanly and a Docker-native restart_policy: unless-stopped revives the container with a fresh runtime snapshot. Cost: ~13–17s per restart (vs ~1–2s for the in-process path) plus the user-perceived "bot disappeared" gap.
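The workaround depends on the gateway choosing its restart mode from the presence of OPENCLAW_SYSTEMD_UNIT. The sketch below states that decision as inferred from the behavior described in this report; chooseRestartMode and the mode labels are assumptions, not the gateway's actual code.

```javascript
// Hypothetical mode selection, inferred from the behavior described in this report.
// The real gateway logic may consult more signals than this single variable.
function chooseRestartMode(env) {
    const unit = (env.OPENCLAW_SYSTEMD_UNIT ?? "").trim();
    return unit.length > 0 ? "supervised-respawn" : "in-process";
}

console.log(chooseRestartMode({}));
console.log(chooseRestartMode({ OPENCLAW_SYSTEMD_UNIT: "openclaw-tenant.service" }));
```

A container entrypoint could run a check like this against process.env and log a warning when the affected in-process mode would be selected.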

Environment

  • OpenClaw version: 2026.5.7
  • Image: alpine 3, node:24-alpine, OpenClaw installed via npm install -g openclaw@2026.5.7
  • Process tree: tini -> /usr/local/bin/entrypoint.sh -> openclaw gateway run
  • Container runtime: Docker (with restart_policy: unless-stopped)
  • OPENCLAW_SYSTEMD_UNIT deliberately unset for repro

Related issues

This issue is narrower than #78136 (drain-state stuck after restart) and #79738 (wrapper-script rewrites config). Both are different bugs in adjacent code paths.

Why this is structural rather than a workaround

The current behavior is at odds with prepareGatewayStartupConfig's contract — that function is intended to run on every gateway start, including after a SIGUSR1 restart, and to incorporate the latest on-disk state. By short-circuiting it with a stale initialSnapshotRead, the gateway loses the ability to honor user config writes between restarts even though the replaceConfigFile watcher correctly fires the reload signal.

The cost of dropping the optimization on iteration >=2 is one extra disk read (readConfigFileSnapshotWithPluginMetadata reads ~10–50 KB of JSON). The benefit is that container deployments — where in-process restart is the documented default — no longer silently corrupt user config.
