openclaw - 💡(How to fix) Fix [Bug] Overnight runtime split: gateway reachable while service inactive, then token mismatch (restart/new session doesn't recover) [3 comments, 3 participants]

GitHub stats
openclaw/openclaw#49297 (fetched 2026-04-08 00:56:47)
Comments: 3 · Participants: 3 · Timeline events: 13 (cross-referenced ×10, commented ×3) · Reactions: 0


Bug type

Regression / reliability gap (runtime state split)

Summary

After an overnight period, OpenClaw became effectively non-responsive in Telegram despite cron activity and manual recovery attempts (gateway restart, new session).

The key failure pattern is a runtime state split:

  • Gateway endpoint can appear reachable
  • while Gateway service reports stopped/inactive
  • and later CLI/gateway handshake can fail with unauthorized: gateway token mismatch

This can create a "looks alive but no reliable replies" state.

Environment

  • OpenClaw versions involved: historically 2026.3.1 runtime path; reproduced symptoms still present after upgrade to 2026.3.13
  • OS: Linux (WSL2)
  • Channel: Telegram
  • Deployment: systemd user service (openclaw-gateway.service)

Incident timeline (Sydney time)

  • ~00:00: system behavior was normal before sleep.
  • 00:21:37: first hard evidence of abnormal state in cron audit session.
    • openclaw status output in-session shows:
      • Gateway ... reachable
      • Gateway service ... stopped (state inactive)
  • 00:22:00: cron run summary explicitly reports gateway inactive/stopped blocker.
  • ~09:14–09:18: after backup/rebootstrap flow, config-audit.jsonl shows setup/onboard rewrites.
  • Post-rebootstrap: local probe/status shows unauthorized: gateway token mismatch.
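
The split recorded at 00:21:37 can be detected mechanically from status text. The following is a minimal sketch, assuming the status output contains one line beginning with "Gateway service" and another beginning with "Gateway", as in the excerpt above. The function names parse_status and is_split are hypothetical, and the real openclaw status layout may differ.

```python
import re

def parse_status(text):
    """Return (gateway_reachable, service_active) from status-like text.

    The line prefixes matched here are assumptions taken from the excerpt
    quoted in the issue; either value stays None if its line is missing.
    """
    reachable = None
    service_active = None
    for raw in text.splitlines():
        line = raw.strip().lstrip("•").strip()
        if line.startswith("Gateway service"):
            # "stopped" or "inactive" both count as not running.
            service_active = not re.search(r"\b(stopped|inactive)\b", line)
        elif line.startswith("Gateway"):
            if re.search(r"\b(unreachable|not reachable)\b", line):
                reachable = False
            elif re.search(r"\breachable\b", line):
                reachable = True
    return reachable, service_active

def is_split(reachable, service_active):
    # A split exists only when both signals are known and they disagree.
    return None not in (reachable, service_active) and reachable != service_active

sample = """Gateway ... reachable
Gateway service ... stopped (state inactive)"""
print(is_split(*parse_status(sample)))
```

Returning None for a missing line keeps "unknown" distinct from "inactive", so an incomplete status dump is not reported as a split.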

What we verified

  1. Cron scheduler kept producing finished events during the window (not a full scheduler halt).
  2. A significant portion of runs were configured with delivery.mode=none (silent by design), which masked failure visibility.
  3. Current mismatch evidence:
    • The gateway auth token in ~/.openclaw/openclaw.json differs from OPENCLAW_GATEWAY_TOKEN in the systemd environment.
    • As a result, the local loopback probe is rejected as unauthorized.
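
The token comparison in item 3 can be scripted. This sketch assumes the unit environment is obtained from `systemctl --user show openclaw-gateway.service -p Environment` and contains simple unquoted KEY=VALUE pairs. The key_path default is hypothetical, because the issue does not state where the token sits inside openclaw.json.

```python
import json

def parse_unit_environment(show_output):
    """Parse `systemctl show -p Environment` output into a dict.

    Assumes unquoted KEY=VALUE pairs separated by spaces.
    """
    env = {}
    for line in show_output.splitlines():
        if not line.startswith("Environment="):
            continue
        for pair in line[len("Environment="):].split():
            key, sep, value = pair.partition("=")
            if sep:
                env[key] = value
    return env

def config_token(config_text, key_path=("gateway_token",)):
    # key_path is a placeholder; adjust it to the real config schema.
    node = json.loads(config_text)
    for key in key_path:
        node = node[key]
    return node

def tokens_match(config_text, show_output, key_path=("gateway_token",)):
    unit_token = parse_unit_environment(show_output).get("OPENCLAW_GATEWAY_TOKEN")
    # A missing unit token counts as a mismatch rather than a match.
    return unit_token is not None and unit_token == config_token(config_text, key_path)
```

Passing both inputs as text keeps the comparison separate from how the data is collected, so it can be run against redacted copies of the config and unit output.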

Why this matters

This combination is hard to diagnose because users can see mixed signals:

  • Cron appears active
  • Some sessions still produce text
  • Yet delivery/reply reliability is broken
  • Restart/new session may not recover if token/config/service state diverge

Expected behavior

  • No split-brain state between gateway reachable and gateway service inactive for normal local setups.
  • Token source-of-truth should stay consistent across config writes + systemd-managed runtime.
  • If mismatch is detected, recovery path should be explicit and self-healing.

Actual behavior

  • Reachability and service state can diverge.
  • Token mismatch can persist across restart attempts.
  • User-facing symptom: "worked before sleep, broken after wake" with low observability.

Suggested fixes

  1. Token consistency guard at startup
    • Compare runtime token source and config token before accepting connections.
    • Refuse/repair inconsistent startup.
  2. Single source of truth for gateway token under systemd
    • Avoid stale token pinned in unit env during config re-init flows.
  3. Split-state detector
    • If Gateway reachable but service inactive (or vice versa), emit critical warning + self-heal action.
  4. Post-setup/onboard validation
    • Run immediate probe sanity check after config rewrite and surface actionable fix.

Related issues likely in same cluster

  • isolated cron / announce delivery truthfulness regressions
  • telegram polling stall + restart recovery
  • dmPolicy/allowlist drift in setup/doctor flows

If helpful, I can provide redacted run/session IDs and exact minute-level evidence snippets from local logs.

extent analysis

Fix Plan

To address the runtime state split issue in OpenClaw, we will implement the following fixes:

  • Token Consistency Guard: Ensure the gateway token is consistent across the configuration and systemd environment.
  • Single Source of Truth: Use the configuration file as the single source of truth for the gateway token.
  • Split-State Detector: Detect and self-heal when the gateway is reachable but the service is inactive, or vice versa.

Step-by-Step Solution

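  1. Update Token Loading: Load the gateway token from the configuration file (~/.openclaw/openclaw.json) instead of the systemd environment variable.

```python
import json
import os

def load_gateway_token(config_file):
    # Expand "~" so a path such as ~/.openclaw/openclaw.json resolves.
    with open(os.path.expanduser(config_file), 'r') as f:
        config = json.load(f)
    # 'gateway_token' is an illustrative key; adjust it to the real config schema.
    return config['gateway_token']
```

  2. Implement Token Consistency Guard: Compare the loaded token with the systemd environment variable and refuse to start if they are inconsistent.

```python
import os

def check_token_consistency(token, env_token=None):
    # Fall back to the systemd-provided environment variable when no value is passed.
    if env_token is None:
        env_token = os.environ.get("OPENCLAW_GATEWAY_TOKEN")
    if token != env_token:
        raise ValueError("Gateway token mismatch")
```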
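  3. Single Source of Truth: Update the systemd service file so the gateway token is derived from the configuration file. systemd does not evaluate shell command substitution in Environment=, so a $(jq ...) expression there would not produce the token. One option is an EnvironmentFile= that is regenerated whenever the config is rewritten, for example with jq -r '"OPENCLAW_GATEWAY_TOKEN=" + .gateway_token' ~/.openclaw/openclaw.json > ~/.openclaw/gateway.env (the gateway.env path and the .gateway_token key are illustrative).

```ini
# systemd does not expand $(...) in Environment=; read a generated env file instead.
EnvironmentFile=%h/.openclaw/gateway.env
```

  4. Split-State Detector: Implement a detector that checks the gateway reachability and service state, and emits a critical warning with a self-heal action if they are inconsistent.

```python
import logging

def detect_split_state(gateway_reachable, service_active):
    # Returns True when reachability and service state disagree.
    if gateway_reachable and not service_active:
        logging.critical("Gateway reachable but service inactive")
        # Self-heal action goes here (not specified in the issue)
        return True
    if not gateway_reachable and service_active:
        logging.critical("Gateway not reachable but service active")
        # Self-heal action goes here (not specified in the issue)
        return True
    return False
```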

Verification

To verify the fix, restart the OpenClaw service and check the logs for any critical warnings or errors. Also, test the gateway reachability and service state using the openclaw status command.
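
The verification pass can be scripted as a fixed command sequence. This is a sketch under assumptions: the unit name comes from the Environment section, openclaw status is the command quoted in the issue, the journalctl line count is arbitrary, and run_verification and has_token_mismatch are hypothetical helpers. The runner is injectable so the logic can be exercised without touching a live service.

```python
import subprocess

UNIT = "openclaw-gateway.service"

VERIFY_COMMANDS = [
    ["systemctl", "--user", "restart", UNIT],
    ["systemctl", "--user", "is-active", UNIT],
    ["journalctl", "--user", "-u", UNIT, "-n", "50", "--no-pager"],
    ["openclaw", "status"],
]

def run_verification(run=subprocess.run):
    """Run each command and collect (command, returncode, combined output)."""
    results = []
    for cmd in VERIFY_COMMANDS:
        try:
            proc = run(cmd, capture_output=True, text=True, check=False)
            results.append((cmd, proc.returncode, proc.stdout + proc.stderr))
        except FileNotFoundError:
            # The binary is not installed on this machine.
            results.append((cmd, None, ""))
    return results

def has_token_mismatch(results):
    # The message text is the one quoted in the issue.
    return any("gateway token mismatch" in out for _, _, out in results)
```

If has_token_mismatch still returns True after the restart, the stale token is likely pinned somewhere the restart does not touch, such as the unit environment.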

Extra Tips

  • Regularly review the configuration file and systemd environment variables to ensure consistency.
  • Consider implementing a periodic task to check for token consistency and split-state detection.
  • Monitor the OpenClaw service logs for any errors or warnings related to token mismatch or split-state detection.
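
The periodic task suggested above could be as small as a loop over named checks. A minimal sketch, assuming each check is a zero-argument callable that returns a truthy value when healthy; run_periodic is a hypothetical name, and a production setup would more likely be driven by an external scheduler than by an in-process loop.

```python
import time

def run_periodic(checks, interval_seconds, iterations, sleep=time.sleep):
    """Run every named check `iterations` times; return failures as (round, name)."""
    failures = []
    for round_no in range(iterations):
        for name, check in checks.items():
            try:
                ok = bool(check())
            except Exception:
                # A check that raises is treated as a failed check.
                ok = False
            if not ok:
                failures.append((round_no, name))
        if round_no < iterations - 1:
            sleep(interval_seconds)
    return failures
```

The checks dictionary would hold a token-consistency check and a split-state check. Recording the round number makes it possible to see when a failure first appeared, which matters for overnight drift.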
