openclaw - [Bug]: Watchdog fallback chain silently drops primary-failure root cause when fallback also fails; only terminal error surfaces in runtime_events

GitHub stats
openclaw/openclaw#71744 · Fetched 2026-04-26 05:08:56
Comments: 0 · Participants: 1 · Timeline: 2 · Reactions: 0
Timeline (top): mentioned ×1, subscribed ×1

When the watchdog fallback chain rotates a primary model to a fallback (e.g. Anthropic → local Ollama) and the fallback also fails (e.g. RAM exhaustion on the fallback host), the root cause of the original primary failure is silently dropped. The operator-visible message is the fallback's failure ("All models failed" / "model requires N GiB RAM"), while the actual reason the primary went down is available only in the Claude CLI session's stderr, never in runtime_events.jsonl.



Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No


Steps to reproduce

  1. Configure an agent with primary = Anthropic (claude-cli/claude-opus-4-7) and a local Ollama fallback (ollama/<large-model> whose RAM requirement is close to host headroom).
  2. Trigger any condition that makes the primary fail (high host RAM pressure, slow gateway response, plugin hang — anything that flips the watchdog).
  3. Watchdog rotates to Ollama. Ollama returns 500: model requires X GiB RAM, only Y GiB available.
  4. Operator sees only "All models failed" / fallback's error message in the operator channel.
  5. Inspect runtime_events.jsonl: there is no event describing why the primary was rotated, only (at best) the fallback's terminal error. The original primary failure, which is the actionable signal (e.g. "host swap exhausted, watchdog escalated at t=…"), is in the Claude CLI session stderr, which most operators never read.

Expected behavior

Each step of the fallback chain should emit a runtime_event with at least:

  • type: fallback_step
  • from_model: e.g. claude-cli/claude-opus-4-7
  • to_model: e.g. ollama/<model>
  • from_failure_reason: short classifier (timeout, http_5xx, ram_pressure, plugin_error, signal, etc.)
  • from_failure_detail: the actual error string the operator would need (truncated if huge)
  • chain_position: integer
  • final_outcome: succeeded / next_fallback / chain_exhausted

This way runtime_events.jsonl becomes the single source of truth for "what happened" — operators don't have to guess between "the primary failed for reason A" vs "the fallback failed for reason B" vs both. Today only the terminal failure surfaces, which is often the least informative one (e.g. "Ollama OOM" tells you nothing about why Anthropic was rotated in the first place).
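
The field list above could be captured as a small typed record. The following is a minimal sketch, not OpenClaw's actual schema: the class name FallbackStepEvent, the 2000-character cap on from_failure_detail, and allowing to_model to be None when no further fallback exists are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

# Assumed cap: the issue only says the detail is "truncated if huge".
MAX_DETAIL_CHARS = 2000
TRUNCATION_MARKER = "...[truncated]"
OUTCOMES = {"succeeded", "next_fallback", "chain_exhausted"}


@dataclass(frozen=True)
class FallbackStepEvent:
    """Hypothetical record for one step of the watchdog fallback chain."""

    from_model: str
    to_model: Optional[str]  # assumption: None when the chain is exhausted
    from_failure_reason: str
    from_failure_detail: str
    chain_position: int
    final_outcome: str
    type: str = "fallback_step"

    def __post_init__(self) -> None:
        if self.final_outcome not in OUTCOMES:
            raise ValueError(f"unknown final_outcome: {self.final_outcome!r}")
        if len(self.from_failure_detail) > MAX_DETAIL_CHARS:
            shortened = self.from_failure_detail[:MAX_DETAIL_CHARS] + TRUNCATION_MARKER
            # The dataclass is frozen, so assign through object.__setattr__.
            object.__setattr__(self, "from_failure_detail", shortened)

    def to_jsonl(self) -> str:
        """Serialize as a single JSON line suitable for a .jsonl file."""
        return json.dumps(asdict(self), ensure_ascii=False)
```

Validating final_outcome at construction time keeps malformed events out of the log, and truncating in one place means every emitter applies the same limit.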

Actual behavior

Real incident on this deployment, 2026-04-17 ~17:34 UTC:

  • Forced Proactivity Probe sent to Opus 4-7 (claude-cli/claude-opus-4-7 primary).
  • Watchdog rotated to local Ollama fallback (ollama/gemma4).
  • Ollama returned: 500: model requires 13.7 GiB RAM, only 11.9 GiB available.
  • Operator received All models failed in Telegram with no context.
  • The actual root cause — host swap had been exhausted by 14 hung openclaw-agent processes (separate issue, see #71710) and the watchdog rotated because the primary timeout budget was being consumed by host I/O wait — was not in runtime_events.jsonl. It was only in Claude CLI session stderr, which required searching across multiple session files to reconstruct.

This made the incident much harder to diagnose than it needed to be: the fallback's "OOM" message pointed at a real but downstream symptom, while the actionable signal (host swap saturation → primary watchdog) lived nowhere structured.
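
If fallback_step events existed, an incident like this one could be read straight from runtime_events.jsonl. The sketch below assumes the proposed event shape is implemented as described under "Expected behavior"; the function name summarize_fallback_chain and the sample events are illustrative only and are not real log output.

```python
import json
from typing import Iterable, Optional


def summarize_fallback_chain(lines: Iterable[str]) -> Optional[dict]:
    """Return the first (root-cause) and last (terminal) failure of a chain.

    Lines that are not valid JSON or are not fallback_step events are skipped.
    """
    steps = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate partially written or foreign lines
        if isinstance(event, dict) and event.get("type") == "fallback_step":
            steps.append(event)
    if not steps:
        return None
    steps.sort(key=lambda e: e.get("chain_position", 0))
    first, last = steps[0], steps[-1]

    def describe(e: dict) -> str:
        return f'{e.get("from_model")}: {e.get("from_failure_reason")} ({e.get("from_failure_detail")})'

    return {
        "root_cause": describe(first),
        "terminal": describe(last),
        "outcome": last.get("final_outcome"),
    }


if __name__ == "__main__":
    # Illustrative events modelled on the incident described above.
    sample = [
        json.dumps({"type": "fallback_step", "from_model": "claude-cli/claude-opus-4-7",
                    "to_model": "ollama/gemma4", "from_failure_reason": "timeout",
                    "from_failure_detail": "primary timeout budget consumed by host I/O wait",
                    "chain_position": 1, "final_outcome": "next_fallback"}),
        json.dumps({"type": "fallback_step", "from_model": "ollama/gemma4",
                    "to_model": None, "from_failure_reason": "ram_pressure",
                    "from_failure_detail": "500: model requires 13.7 GiB RAM, only 11.9 GiB available",
                    "chain_position": 2, "final_outcome": "chain_exhausted"}),
    ]
    print(summarize_fallback_chain(sample))
```

The point of the summary is that the root cause and the terminal error are reported side by side, so the downstream OOM no longer hides the upstream timeout.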

OpenClaw version

2026.4.23 (incident reproduced on 2026.4.17 codebase, runtime_events.jsonl schema unchanged through 2026.4.23 per writer-script audit)

Operating system

Ubuntu 24.04.4 LTS (kernel 6.8.0-107-generic), systemd 255

Install method

npm global (/home/ubuntu/.npm-global/lib/node_modules/openclaw), Node v22.22.1

Model

Primary: claude-cli/claude-opus-4-7. Fallback chain (then-active): ollama/gemma4.

Provider / routing chain

openclaw -> auth-profile anthropic:claude-cli -> claude (CLI) -> anthropic.com
└── watchdog rotate → openclaw -> ollama (localhost:11434) -> gguf model

Additional provider/model setup details

Operator-side runtime_events writer is ~/scripts/write_decision.py (custom). OpenClaw itself does not currently emit fallback-step events; we reconstruct them post hoc from Claude CLI stderr, which is unreliable and requires correlating timestamps across multiple log surfaces.

Suggested minimal change: in the existing watchdog fallback path, just before invoking the next provider, emit a runtime_event of type fallback_step with the fields listed under "Expected behavior". No retroactive replay needed — the events become available from the next firing.
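
The suggested change can be sketched as a fallback loop that records each failure before moving on to the next provider. This is a sketch under stated assumptions, not OpenClaw's watchdog code: run_with_fallbacks, invoke, emit, and classify are hypothetical names, and the default classifier (the exception's class name) is a placeholder for a real mapping to values such as timeout or ram_pressure.

```python
from typing import Callable, Optional, Sequence


def run_with_fallbacks(
    chain: Sequence[str],
    invoke: Callable[[str], str],
    emit: Callable[[dict], None],
    classify: Callable[[Exception], str] = lambda exc: type(exc).__name__,
) -> str:
    """Try each model in order, emitting a fallback_step event per failure."""
    last_exc: Optional[Exception] = None
    for position, model in enumerate(chain, start=1):
        try:
            return invoke(model)
        except Exception as exc:
            next_model = chain[position] if position < len(chain) else None
            # Emit before the next provider is invoked, so the reason for
            # this rotation is recorded even if the next provider also fails.
            emit({
                "type": "fallback_step",
                "from_model": model,
                "to_model": next_model,
                "from_failure_reason": classify(exc),
                "from_failure_detail": str(exc)[:2000],  # assumed cap
                "chain_position": position,
                "final_outcome": "next_fallback" if next_model else "chain_exhausted",
            })
            last_exc = exc
    raise RuntimeError("All models failed") from last_exc
```

Because emit is injected, the same loop can write to runtime_events.jsonl in production and to an in-memory list in tests. This sketch only records failures; emitting a succeeded event would be a separate addition.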

Related (different angles on the same observability gap):

  • #62684 (closed) — Local Ollama agent pipeline times out with no provider logs. Same family of "fallback failure with no upstream context" symptom, a different root manifestation (Ollama timeout vs Ollama OOM).

Reported by @nikolaykazakovvs-ux via Cognitor (claude-opus-4-7 substrate).

extent analysis

TL;DR

Emit a runtime_event of type fallback_step with detailed information about the primary failure reason before invoking the next provider in the watchdog fallback path.

Guidance

  • Identify the point in the watchdog fallback path where the primary model failure reason is available and add code to emit a runtime_event with the required fields (type, from_model, to_model, from_failure_reason, from_failure_detail, chain_position, final_outcome).
  • Ensure the runtime_event is written to runtime_events.jsonl in a way that is consistent with the existing event writing mechanism.
  • Review the custom write_decision.py script to ensure it can handle the new fallback_step events correctly.
  • Test the updated watchdog fallback path to verify that the runtime_event is emitted correctly and contains the expected information.
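
For the review and test steps above, a small validator can check that each emitted line carries the expected fields. This is a hypothetical helper, not part of OpenClaw or of write_decision.py (whose internals are not shown in the issue); the function name and the problem messages are assumptions.

```python
import json
from typing import List

REQUIRED_FIELDS = (
    "type", "from_model", "to_model", "from_failure_reason",
    "from_failure_detail", "chain_position", "final_outcome",
)
ALLOWED_OUTCOMES = ("succeeded", "next_fallback", "chain_exhausted")


def validate_fallback_step(line: str) -> List[str]:
    """Return a list of problems with one JSONL line; empty means it looks valid."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(event, dict):
        return ["not a JSON object"]
    if event.get("type") != "fallback_step":
        return ["type is not fallback_step"]

    problems = []
    missing = [name for name in REQUIRED_FIELDS if name not in event]
    if missing:
        problems.append("missing fields: " + ", ".join(missing))
    if "chain_position" in event:
        position = event["chain_position"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(position, bool) or not isinstance(position, int):
            problems.append("chain_position is not an integer")
    if "final_outcome" in event and event["final_outcome"] not in ALLOWED_OUTCOMES:
        problems.append("final_outcome is not one of: " + ", ".join(ALLOWED_OUTCOMES))
    return problems
```

A test for the fallback path could run the watchdog against a stubbed failing provider and assert that every line it writes passes this check.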

Example

# Sketch of emitting a fallback_step runtime_event (illustrative; not OpenClaw's actual writer)
import json


def emit_runtime_event(event_type, from_model, to_model, failure_reason,
                       failure_detail, chain_position, final_outcome,
                       path='runtime_events.jsonl'):
    event = {
        'type': event_type,
        'from_model': from_model,
        'to_model': to_model,
        'from_failure_reason': failure_reason,
        'from_failure_detail': failure_detail,
        'chain_position': chain_position,  # position of the failed model in the chain
        'final_outcome': final_outcome,    # succeeded / next_fallback / chain_exhausted
    }
    # Append one JSON object per line (JSONL)
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(event) + '\n')


# Example usage
emit_runtime_event('fallback_step', 'claude-cli/claude-opus-4-7', 'ollama/gemma4',
                   'timeout', 'Host swap exhausted',
                   chain_position=1, final_outcome='next_fallback')

Notes

The suggested change only addresses the observability gap for fallback chain failures and does not fix the underlying issues that cause the primary model to fail.
