openclaw - 💡(How to fix) Fix Heartbeat scheduler chain permanently broken when requests-in-flight returned [1 comments, 1 participants]

Q: Expected behavior

When requests are in-flight, the scheduler should defer `scheduleNext()` but ensure it fires once the in-flight requests complete (or after a safety timeout).

openclaw2026-04-08 16:03:19

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

openclaw/openclaw#63232•Fetched 2026-04-09 07:56:34

View on GitHub

Comments

Participants

Timeline

Reactions

Author

mar-zh

Participants

mar-zh

Timeline (top)

commented ×1

When runOnce() returns { status: "skipped", reason: "requests-in-flight" }, the heartbeat scheduler's scheduleNext() is never called again, permanently breaking the heartbeat chain for all agents until the gateway is restarted.

Error Message

scheduleNext() is never called again. The heartbeat scheduler goes permanently silent. No error is logged. The only recovery is a full gateway restart.

No error logged — extremely difficult to diagnose without reading source

Root Cause

In the heartbeat runner's run() function (line ~989-1060):

let requestsInFlight = false;
try {
    for (const agent of state.agents.values()) {
        // ...
        if (res.status === "skipped" && res.reason === "requests-in-flight") {
            requestsInFlight = true;
            return res;  // ← exits the for-loop AND the try block
        }
        // ...
    }
} finally {
    if (!requestsInFlight) scheduleNext();  // ← skipped when requestsInFlight = true
}

The logic intends to avoid scheduling the next heartbeat while requests are still in flight. However, nothing ever re-invokes scheduleNext() after the in-flight requests complete. The chain breaks permanently.

Fix Action

Workaround

openclaw gateway restart

Filed from a 14-agent OpenClaw deployment. Reproduction confirmed across 4 incidents in 72 hours.

Code Example

let requestsInFlight = false;
try {
    for (const agent of state.agents.values()) {
        // ...
        if (res.status === "skipped" && res.reason === "requests-in-flight") {
            requestsInFlight = true;
            return res;  // ← exits the for-loop AND the try block
        }
        // ...
    }
} finally {
    if (!requestsInFlight) scheduleNext();  // ← skipped when requestsInFlight = true
}

---

} finally {
    if (!requestsInFlight) {
        scheduleNext();
    } else {
        setTimeout(() => {
            scheduleNext();
        }, 30_000); // 30s — enough for most in-flight requests to complete
    }
}

---

} finally {
    scheduleNext(); // Always schedule. If requests are in-flight, next tick will check again.
}

---

openclaw gateway restart

RAW_BUFFERClick to expand / collapse

Summary

Root Cause

In the heartbeat runner's run() function (line ~989-1060):

let requestsInFlight = false;
try {
    for (const agent of state.agents.values()) {
        // ...
        if (res.status === "skipped" && res.reason === "requests-in-flight") {
            requestsInFlight = true;
            return res;  // ← exits the for-loop AND the try block
        }
        // ...
    }
} finally {
    if (!requestsInFlight) scheduleNext();  // ← skipped when requestsInFlight = true
}

Expected behavior

When requests are in-flight, the scheduler should defer scheduleNext() but ensure it fires once the in-flight requests complete (or after a safety timeout).

Actual behavior

scheduleNext() is never called again. The heartbeat scheduler goes permanently silent. No error is logged. The only recovery is a full gateway restart.

Reproduction

Configure multiple agents with heartbeats (at least 3-4)
Have an agent's heartbeat trigger a long-running session (e.g., enriched heartbeat with file reads and graph operations)
While that session is running, trigger another heartbeat interval tick
The second tick sees requests-in-flight, sets requestsInFlight = true, and returns early
scheduleNext() is skipped in the finally block
No future heartbeats fire for any agent

Environment: OpenClaw 2026.4.5 (3e72c03), 14-agent mesh with mixed heartbeat intervals (1h, 2h, 4h). Observed 4 times in 72 hours. More likely when multiple enriched heartbeats have overlapping windows.

Suggested Fix

Option A: Safety-net timeout (minimal change)

} finally {
    if (!requestsInFlight) {
        scheduleNext();
    } else {
        setTimeout(() => {
            scheduleNext();
        }, 30_000); // 30s — enough for most in-flight requests to complete
    }
}

Option B: Always schedule (if scheduleNext is idempotent)

} finally {
    scheduleNext(); // Always schedule. If requests are in-flight, next tick will check again.
}

Impact

Silent heartbeat failure across entire mesh
No error logged — extremely difficult to diagnose without reading source
Only recovery is gateway restart
Affects any deployment with multiple agents and overlapping heartbeat windows
More likely to trigger under load (exactly when heartbeats are most needed)

Workaround

openclaw gateway restart

Filed from a 14-agent OpenClaw deployment. Reproduction confirmed across 4 incidents in 72 hours.

extent analysis

TL;DR

Implement a safety-net timeout or ensure scheduleNext() is always called to prevent the heartbeat chain from breaking when requests are in-flight.

Guidance

Identify the run() function in the heartbeat runner and modify the finally block to include a safety-net timeout or always schedule the next heartbeat.
Verify that scheduleNext() is idempotent before always scheduling it, to avoid potential issues.
Test the changes under load to ensure the heartbeat chain remains intact even when requests are in-flight.
Consider logging a warning or error when requestsInFlight is true to aid in diagnosis.

Example

} finally {
    if (!requestsInFlight) {
        scheduleNext();
    } else {
        setTimeout(() => {
            scheduleNext();
        }, 30_000); // 30s — enough for most in-flight requests to complete
    }
}

Notes

The provided fix options assume that scheduleNext() is either idempotent or can be safely called with a timeout. Additional testing and verification are necessary to ensure the chosen solution works correctly in all scenarios.

Recommendation

Apply the safety-net timeout workaround, as it provides a minimal change with a clear timeout, allowing for easier testing and verification. This approach also avoids potential issues with always scheduling the next heartbeat, in case scheduleNext() is not idempotent.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

FAQ

Expected behavior

When requests are in-flight, the scheduler should defer scheduleNext() but ensure it fires once the in-flight requests complete (or after a safety timeout).

#pipeline error #runtime error #dependency conflict #environment setup #docker error

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

openclaw - 💡(How to fix) Fix Heartbeat scheduler chain permanently broken when requests-in-flight returned [1 comments, 1 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Workaround

Code Example

Summary

Root Cause

Expected behavior

Actual behavior

Reproduction

Suggested Fix

Option A: Safety-net timeout (minimal change)

Option B: Always schedule (if scheduleNext is idempotent)

Impact

Workaround

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

FAQ

Expected behavior

Still need to ship something?

TRENDING

openclaw - 💡(How to fix) Fix Heartbeat scheduler chain permanently broken when requests-in-flight returned [1 comments, 1 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Workaround

Code Example

Summary

Root Cause

Expected behavior

Actual behavior

Reproduction

Suggested Fix

Option A: Safety-net timeout (minimal change)

Option B: Always schedule (if scheduleNext is idempotent)

Impact

Workaround

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

FAQ

Expected behavior

Still need to ship something?

RELATED_DISCOVERY

TRENDING