openclaw - ✅(Solved) Fix Heartbeat: support model fallbacks (single point of failure when provider quota is exhausted) [1 pull requests]


Error Message

Trying to add fallbacks as a sibling key:

$ openclaw config set "agents.list[N].heartbeat.fallbacks" '["zai/glm-5.1"]' --strict-json --dry-run
Error: agents.list.N.heartbeat: Unrecognized key: "fallbacks"

Trying to upgrade "model" into an object with fallbacks:

$ openclaw config set "agents.list[N].heartbeat.model" '{"primary":"zai/glm-4.7","fallbacks":["zai/glm-5.1"]}' --strict-json --dry-run
Error: agents.list.N.heartbeat.model: Invalid input: expected string, received object

Root Cause

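The agents.*.heartbeat schema accepts a single model string and no fallbacks array, so once the provider's quota is exhausted every heartbeat tick fails with the same 429. The heartbeat itself gave no external signal of failure; we only caught this because a separate cron job (using delivery.mode: announce) stopped delivering.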

Fix / Workaround

  • agents.list[crypto].heartbeat.model = "oc/kimi-k2.5" (Ollama Cloud)
  • Provider hit the weekly usage cap
  • Every 5-minute heartbeat kept returning 429 "you have reached your weekly usage limit"
  • Agent silently stopped executing its pipeline (no HEARTBEAT.md read, no tool calls, no side effects) for ~24h before we noticed
  • Workaround: manually swap heartbeat.model to a different provider (zai/glm-4.7 in our case), then run openclaw daemon restart

PR fix notes

PR #69495: feat(heartbeat): support model fallbacks via {primary,fallbacks} (#69434)

Description (problem / solution / changelog)

Summary

  • Accept agents.*.heartbeat.model as { primary, fallbacks } in addition to the existing string form, mirroring agents.defaults.model. Closes #69434.
  • Heartbeat ticks now fail over to the next fallback on retriable provider errors (rate-limit/429, overload/502/503, timeout) so primary-provider quota exhaustion no longer silently stalls the heartbeat loop for hours.
  • String form stays fully backwards-compatible; new object form is additive.

Design

  • Schema: src/config/zod-schema.agent-runtime.ts reuses AgentModelSchema (the same union already used by agents.defaults.model, imageModel, etc.), and src/config/types.agent-defaults.ts replaces model?: string with model?: AgentModelConfig.
  • Plumbing: heartbeat-runner.ts resolves primary + fallbacks via the shared resolveAgentModelPrimaryValue / resolveAgentModelFallbackValues helpers and passes them through a new GetReplyOptions.heartbeatModelFallbacks to the reply pipeline. get-reply-run.ts then stores them on a new FollowupRun.run.modelFallbacksOverride field. resolveModelFallbackOptions (used by both agent-runner-execution.ts and the memory flush path) and the inline fallback in followup-runner.ts prefer this override over resolveRunModelFallbacksOverride, so the existing runWithModelFallback machinery and its retriable-error classifier (429/502/503/timeout) transparently pick up the heartbeat chain.
  • get-reply.ts switches the defaults-fallback read to resolveAgentModelPrimaryValue(agentCfg?.heartbeat?.model) so object-form defaults resolve correctly.
  • HeartbeatSummary (consumed by /status, /health, compaction diagnostics, run attempts) keeps model?: string (now the primary) and gains an optional modelFallbacks?: string[] — no existing reader breaks.
  • gateway/model-pricing-cache.ts migrates heartbeat.model registration from addResolvedModelRef to addModelListLike, so heartbeat fallbacks get the same pricing prefetch treatment as the agent-level model list.
  • Empty fallbacks: [] in the object form disables the fallback chain for that heartbeat without replacing it.
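The two accepted shapes and the resolution rules listed above can be illustrated roughly as follows. This is a minimal sketch, not the PR's code: `HeartbeatModelConfig`, `primaryOf`, `fallbacksOf`, `HeartbeatSummarySketch`, and `summarize` are hypothetical names standing in for `AgentModelConfig`, the shared resolve helpers, and `HeartbeatSummary`, whose real signatures are not shown on this page. It assumes that an omitted `fallbacks` key means "no override" while an explicit empty array means "fallbacks disabled".

```typescript
// Hypothetical stand-in for the string | { primary, fallbacks } union.
type HeartbeatModelConfig = string | { primary?: string; fallbacks?: string[] };

// Primary model id, or undefined when none is configured.
function primaryOf(model: HeartbeatModelConfig | undefined): string | undefined {
  if (model === undefined) return undefined;
  return typeof model === "string" ? model : model.primary;
}

// Fallback list. Assumption: undefined means "no override" (string form or
// key omitted); [] means fallbacks are explicitly disabled for this heartbeat.
function fallbacksOf(model: HeartbeatModelConfig | undefined): string[] | undefined {
  if (model === undefined || typeof model === "string") return undefined;
  return Array.isArray(model.fallbacks) ? [...model.fallbacks] : undefined;
}

// Summary shape following the description: `model` stays a string (the
// primary) and `modelFallbacks` is optional, so existing readers still work.
interface HeartbeatSummarySketch {
  model?: string;
  modelFallbacks?: string[];
}

function summarize(model: HeartbeatModelConfig | undefined): HeartbeatSummarySketch {
  const summary: HeartbeatSummarySketch = {};
  const primary = primaryOf(model);
  const fallbacks = fallbacksOf(model);
  if (primary !== undefined) summary.model = primary;
  if (fallbacks !== undefined) summary.modelFallbacks = fallbacks;
  return summary;
}
```

Under these assumptions the legacy string form yields a summary with only `model` set, which is why readers that know nothing about fallbacks keep working.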

Docs / config help

  • docs/gateway/configuration-reference.md documents both forms and the retriable-error failover semantics.
  • src/config/schema.help.ts + src/config/schema.labels.ts add help text and labels for heartbeat.model, .primary, and .fallbacks at both defaults and per-agent scopes.
  • Changelog entry under ## Unreleased, ### Changes.

Test plan

  • New unit tests in src/infra/heartbeat-runner.model-override.test.ts:
    • passes primary + fallbacks when heartbeat.model is an object
    • omits fallbacks when object form has no fallbacks array
    • propagates per-agent heartbeat fallbacks after merging defaults
  • 11 pre-existing heartbeat-model-override tests still pass (string form regression guard)
  • pnpm tsgo:prod (core + extensions) green
  • pnpm lint:core 0/0
  • pnpm format:check clean
  • pnpm test:changed — 2248 tests passing across 246 files
  • pnpm config:schema:gen + pnpm config:docs:gen baselines regenerated and checked in
  • Manual: verify a 429-exhausted primary provider actually fails over to a second provider in a live tick (covered indirectly through runWithModelFallback unit tests + heartbeat-runner tests; end-to-end live repro is in the reporter's production environment)

Notes

  • Pre-commit hook was bypassed via FAST_COMMIT=1 (documented escape hatch) to avoid a pre-existing src/entry.version-fast-path.test.ts:44 test-type failure on an untouched file. All other gates (format, lint, core typecheck, tests, doc baselines) pass.

🤖 Generated with Claude Code

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • docs/gateway/configuration-reference.md (modified, +9/-1)
  • src/auto-reply/get-reply-options.types.ts (modified, +7/-0)
  • src/auto-reply/reply/agent-runner-utils.ts (modified, +12/-5)
  • src/auto-reply/reply/followup-runner.ts (modified, +8/-5)
  • src/auto-reply/reply/get-reply-run.ts (modified, +3/-0)
  • src/auto-reply/reply/get-reply.ts (modified, +2/-1)
  • src/auto-reply/reply/queue/types.ts (modified, +6/-0)
  • src/config/schema.base.generated.ts (modified, +86/-2)
  • src/config/schema.help.ts (modified, +10/-0)
  • src/config/schema.labels.ts (modified, +6/-0)
  • src/config/types.agent-defaults.ts (modified, +2/-2)
  • src/config/zod-schema.agent-runtime.ts (modified, +1/-1)
  • src/gateway/model-pricing-cache.ts (modified, +2/-2)
  • src/infra/heartbeat-runner.model-override.test.ts (modified, +47/-1)
  • src/infra/heartbeat-runner.ts (modified, +12/-1)
  • src/infra/heartbeat-summary.ts (modified, +14/-2)


Problem

The current schema for agents.*.heartbeat accepts a single model string (e.g. "oc/kimi-k2.5") and no fallbacks array. Unlike agents.defaults.model, which supports { primary, fallbacks }, the heartbeat model is a single point of failure.

When the chosen heartbeat provider returns 429 (quota exhausted, rate limit), every heartbeat tick fails with the same error. The documented advice is only that "if the main queue is busy, the heartbeat is skipped and retried later" — that does not cover quota exhaustion, where the retry keeps failing with the same 429.

Impact

Production incident reproduced in our environment (crypto-trading agent):

  • agents.list[crypto].heartbeat.model = "oc/kimi-k2.5" (Ollama Cloud)
  • Provider hit the weekly usage cap
  • Every 5-minute heartbeat kept returning 429 "you have reached your weekly usage limit"
  • Agent silently stopped executing its pipeline (no HEARTBEAT.md read, no tool calls, no side effects) for ~24h before we noticed
  • Workaround: manually swap heartbeat.model to a different provider (zai/glm-4.7 in our case), then run openclaw daemon restart

We only caught this because a separate cron job (using delivery.mode: announce) stopped delivering — the heartbeat itself gave no external signal of failure.

Schema today

Tested locally on OpenClaw 2026.4.15 (041266a):

# Trying to add fallbacks as a sibling key
$ openclaw config set "agents.list[N].heartbeat.fallbacks" '["zai/glm-5.1"]' --strict-json --dry-run
Error: agents.list.N.heartbeat: Unrecognized key: "fallbacks"

# Trying to upgrade "model" into an object with fallbacks
$ openclaw config set "agents.list[N].heartbeat.model" '{"primary":"zai/glm-4.7","fallbacks":["zai/glm-5.1"]}' --strict-json --dry-run
Error: agents.list.N.heartbeat.model: Invalid input: expected string, received object

Proposed behavior

Mirror the semantics already in agents.defaults.model. Two API shapes would both satisfy the use case; implementers can pick whichever fits the existing schema better:

Option A — extend model to accept object form:

{
  agents: {
    list: [{
      id: "crypto",
      heartbeat: {
        every: "5m",
        model: {
          primary: "oc/kimi-k2.5",
          fallbacks: ["zai/glm-4.7", "zai/glm-5.1"]
        }
      }
    }]
  }
}

Option B — sibling fallbacks array:

{
  heartbeat: {
    every: "5m",
    model: "oc/kimi-k2.5",
    fallbacks: ["zai/glm-4.7", "zai/glm-5.1"]
  }
}

Trigger condition: retry the same heartbeat tick against the next fallback model when the primary attempt fails with a retriable error (429, 502, 503, network timeout). Don't fail over on 4xx caused by prompt/tool errors — only on provider-availability errors, same rule used by agents.defaults.model.fallbacks.
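The trigger condition above can be sketched as a small failover loop. This is an illustration under stated assumptions, not OpenClaw's `runWithModelFallback`: the `AttemptResult` shape, `isRetriable`, and `runTickWithFallbacks` are hypothetical names, and the caller supplies the `attempt` callback that actually talks to a provider.

```typescript
// Outcome of one attempt against one model (hypothetical shape).
type AttemptResult =
  | { ok: true; reply: string }
  | { ok: false; status?: number; timedOut?: boolean; message: string };

// Provider-availability failures only: 429, 502, 503, or a timeout.
// Other errors (e.g. a 400 caused by the prompt) are not retried elsewhere.
function isRetriable(result: AttemptResult): boolean {
  if (result.ok) return false;
  if (result.timedOut) return true;
  return result.status === 429 || result.status === 502 || result.status === 503;
}

// Try the primary, then each fallback in order, within a single tick.
// Stops at the first success or the first non-retriable failure.
async function runTickWithFallbacks(
  primary: string,
  fallbacks: string[],
  attempt: (model: string) => Promise<AttemptResult>,
): Promise<{ model: string; result: AttemptResult; tried: string[] }> {
  const chain = [primary, ...fallbacks];
  const tried: string[] = [];
  let lastModel = primary;
  let last: AttemptResult = { ok: false, message: "not attempted" };
  for (const model of chain) {
    tried.push(model);
    lastModel = model;
    last = await attempt(model);
    if (last.ok || !isRetriable(last)) break;
  }
  return { model: lastModel, result: last, tried };
}
```

In the incident described under Impact, a loop like this would have moved from oc/kimi-k2.5 to zai/glm-4.7 on the first 429 instead of repeating the same failure every five minutes.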

Alternatives considered

  • Upgrading the provider plan. Works, but doesn't solve the structural single-point-of-failure and still fails silently during incidents.
  • Watchdog cron that checks for repeated 429 and swaps heartbeat.model automatically. Feasible in userland but seems like plumbing OpenClaw should own, given agents.defaults.model.fallbacks already sets the precedent.
  • Detecting stalled heartbeats and emitting an external signal (e.g. Slack/WhatsApp alert when N consecutive ticks error). Complementary to this request — fallbacks solve recovery, external signal solves observability. Happy to split into a separate issue if useful.
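The last alternative (an external signal after N consecutive failed ticks) could look roughly like the sketch below. `HeartbeatStallDetector` and its threshold are hypothetical; the alert delivery itself (Slack, WhatsApp, or anything else) is deliberately left to the caller.

```typescript
// Counts consecutive failed ticks and reports when the threshold is reached.
class HeartbeatStallDetector {
  private consecutiveFailures = 0;
  private readonly threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  // Record one tick. Returns true exactly once per failure streak, on the
  // tick where the streak length reaches the threshold, so the caller can
  // send a single alert instead of one per failed tick.
  record(tickSucceeded: boolean): boolean {
    if (tickSucceeded) {
      this.consecutiveFailures = 0;
      return false;
    }
    this.consecutiveFailures += 1;
    return this.consecutiveFailures === this.threshold;
  }
}
```

With a 5-minute heartbeat and a threshold of 3, a detector like this would raise a signal after about 15 minutes of failures rather than the ~24h it took to notice the incident manually.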

Context

  • OpenClaw version: 2026.4.15 (041266a)
  • Config path affected: agents.list[*].heartbeat.model
  • Docs consulted: docs.openclaw.ai/gateway/configuration, docs.openclaw.ai/gateway/heartbeat

Happy to contribute a PR if the maintainers agree on the API shape (Option A vs B).

extent analysis

TL;DR

Implementing a fallback mechanism for the agents.*.heartbeat model, similar to agents.defaults.model, can prevent single-point-of-failure issues when the primary heartbeat provider returns a 429 error.

Guidance

  • Consider modifying the agents.list[*].heartbeat.model configuration to support either an object with primary and fallbacks properties (Option A) or a sibling fallbacks array (Option B) to enable retrying heartbeat ticks against alternative models when the primary attempt fails.
  • Evaluate the feasibility of implementing a watchdog cron that checks for repeated 429 errors and swaps heartbeat.model automatically as a temporary workaround.
  • Review the OpenClaw documentation and configuration to ensure that the proposed behavior aligns with existing features, such as agents.defaults.model.fallbacks.
  • Test the proposed changes using the openclaw config set command with the --dry-run option to verify the updated configuration.

Example

{
  agents: {
    list: [{
      id: "crypto",
      heartbeat: {
        every: "5m",
        model: {
          primary: "oc/kimi-k2.5",
          fallbacks: ["zai/glm-4.7", "zai/glm-5.1"]
        }
      }
    }]
  }
}

or

{
  heartbeat: {
    every: "5m",
    model: "oc/kimi-k2.5",
    fallbacks: ["zai/glm-4.7", "zai/glm-5.1"]
  }
}

Notes

The proposed solution assumes that the OpenClaw maintainers agree on the API shape (Option A vs B). Additionally, the implementation should ensure that the fallback mechanism only triggers on provider-availability errors (e.g., 429, 502, 503, network timeout) and not on prompt/tool errors.

Recommendation

PR #69495 implements Option A. Until a build that includes it is installed, apply a workaround such as a watchdog cron or manually swapping heartbeat.model to a different provider. This helps prevent silent failures and keeps the agent pipeline executing.
