openclaw - 💡(How to fix) Fix Heartbeat scheduler chain permanently broken when requests-in-flight returned [1 comments, 1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openclaw/openclaw#63232Fetched 2026-04-09 07:56:34
View on GitHub
Comments
1
Participants
1
Timeline
1
Reactions
0
Author
Participants
Timeline (top)
commented ×1

When runOnce() returns { status: "skipped", reason: "requests-in-flight" }, the heartbeat scheduler's scheduleNext() is never called again, permanently breaking the heartbeat chain for all agents until the gateway is restarted.

Error Message

scheduleNext() is never called again. The heartbeat scheduler goes permanently silent. No error is logged. The only recovery is a full gateway restart.

  • No error logged — extremely difficult to diagnose without reading source

Root Cause

In the heartbeat runner's run() function (line ~989-1060):

let requestsInFlight = false;
try {
    for (const agent of state.agents.values()) {
        // ...
        if (res.status === "skipped" && res.reason === "requests-in-flight") {
            requestsInFlight = true;
            return res;  // ← exits the for-loop AND the try block
        }
        // ...
    }
} finally {
    if (!requestsInFlight) scheduleNext();  // ← skipped when requestsInFlight = true
}

The logic intends to avoid scheduling the next heartbeat while requests are still in flight. However, nothing ever re-invokes scheduleNext() after the in-flight requests complete. The chain breaks permanently.

Fix Action

Workaround

openclaw gateway restart

Filed from a 14-agent OpenClaw deployment. Reproduction confirmed across 4 incidents in 72 hours.

Code Example

let requestsInFlight = false;
try {
    for (const agent of state.agents.values()) {
        // ...
        if (res.status === "skipped" && res.reason === "requests-in-flight") {
            requestsInFlight = true;
            return res;  // ← exits the for-loop AND the try block
        }
        // ...
    }
} finally {
    if (!requestsInFlight) scheduleNext();  // ← skipped when requestsInFlight = true
}

---

} finally {
    if (!requestsInFlight) {
        scheduleNext();
    } else {
        setTimeout(() => {
            scheduleNext();
        }, 30_000); // 30s — enough for most in-flight requests to complete
    }
}

---

} finally {
    scheduleNext(); // Always schedule. If requests are in-flight, next tick will check again.
}

---

openclaw gateway restart
RAW_BUFFERClick to expand / collapse

Summary

When runOnce() returns { status: "skipped", reason: "requests-in-flight" }, the heartbeat scheduler's scheduleNext() is never called again, permanently breaking the heartbeat chain for all agents until the gateway is restarted.

Root Cause

In the heartbeat runner's run() function (line ~989-1060):

let requestsInFlight = false;
try {
    for (const agent of state.agents.values()) {
        // ...
        if (res.status === "skipped" && res.reason === "requests-in-flight") {
            requestsInFlight = true;
            return res;  // ← exits the for-loop AND the try block
        }
        // ...
    }
} finally {
    if (!requestsInFlight) scheduleNext();  // ← skipped when requestsInFlight = true
}

The logic intends to avoid scheduling the next heartbeat while requests are still in flight. However, nothing ever re-invokes scheduleNext() after the in-flight requests complete. The chain breaks permanently.

Expected behavior

When requests are in-flight, the scheduler should defer scheduleNext() but ensure it fires once the in-flight requests complete (or after a safety timeout).

Actual behavior

scheduleNext() is never called again. The heartbeat scheduler goes permanently silent. No error is logged. The only recovery is a full gateway restart.

Reproduction

  1. Configure multiple agents with heartbeats (at least 3-4)
  2. Have an agent's heartbeat trigger a long-running session (e.g., enriched heartbeat with file reads and graph operations)
  3. While that session is running, trigger another heartbeat interval tick
  4. The second tick sees requests-in-flight, sets requestsInFlight = true, and returns early
  5. scheduleNext() is skipped in the finally block
  6. No future heartbeats fire for any agent

Environment: OpenClaw 2026.4.5 (3e72c03), 14-agent mesh with mixed heartbeat intervals (1h, 2h, 4h). Observed 4 times in 72 hours. More likely when multiple enriched heartbeats have overlapping windows.

Suggested Fix

Option A: Safety-net timeout (minimal change)

} finally {
    if (!requestsInFlight) {
        scheduleNext();
    } else {
        setTimeout(() => {
            scheduleNext();
        }, 30_000); // 30s — enough for most in-flight requests to complete
    }
}

Option B: Always schedule (if scheduleNext is idempotent)

} finally {
    scheduleNext(); // Always schedule. If requests are in-flight, next tick will check again.
}

Impact

  • Silent heartbeat failure across entire mesh
  • No error logged — extremely difficult to diagnose without reading source
  • Only recovery is gateway restart
  • Affects any deployment with multiple agents and overlapping heartbeat windows
  • More likely to trigger under load (exactly when heartbeats are most needed)

Workaround

openclaw gateway restart

Filed from a 14-agent OpenClaw deployment. Reproduction confirmed across 4 incidents in 72 hours.

extent analysis

TL;DR

Implement a safety-net timeout or ensure scheduleNext() is always called to prevent the heartbeat chain from breaking when requests are in-flight.

Guidance

  • Identify the run() function in the heartbeat runner and modify the finally block to include a safety-net timeout or always schedule the next heartbeat.
  • Verify that scheduleNext() is idempotent before always scheduling it, to avoid potential issues.
  • Test the changes under load to ensure the heartbeat chain remains intact even when requests are in-flight.
  • Consider logging a warning or error when requestsInFlight is true to aid in diagnosis.

Example

} finally {
    if (!requestsInFlight) {
        scheduleNext();
    } else {
        setTimeout(() => {
            scheduleNext();
        }, 30_000); // 30s — enough for most in-flight requests to complete
    }
}

Notes

The provided fix options assume that scheduleNext() is either idempotent or can be safely called with a timeout. Additional testing and verification are necessary to ensure the chosen solution works correctly in all scenarios.

Recommendation

Apply the safety-net timeout workaround, as it provides a minimal change with a clear timeout, allowing for easier testing and verification. This approach also avoids potential issues with always scheduling the next heartbeat, in case scheduleNext() is not idempotent.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

FAQ

Expected behavior

When requests are in-flight, the scheduler should defer scheduleNext() but ensure it fires once the in-flight requests complete (or after a safety timeout).

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

openclaw - 💡(How to fix) Fix Heartbeat scheduler chain permanently broken when requests-in-flight returned [1 comments, 1 participants]