Each step of the fallback chain should emit a `runtime_event` with at least: - `type`: `fallback_step` - `from_model`: e.g. `claude-cli/claude-opus-4-7` - `to_model`: e.g. `ollama/ ` - `from_failure_reason`: short classifier (timeout, http_5xx, ram_pressure, plugin_error, signal, etc.) - `from_failure_detail`: the actual error string the operator would need (truncated if huge) - `chain_position`: integer - `final_outcome`: `succeeded` / `next_fallback` / `chain_exhausted` This way `runtime_events.jsonl` becomes the single source of truth for "what happened" — operators don't have to guess between "the primary failed for reason A" vs "the fallback failed for reason B" vs both. Today only the terminal failure surfaces, which is often the least informative one (e.g. "Ollama OOM" tells you nothing about *why* Anthropic was rotated in the first place).

openclaw - 💡(How to fix) Fix [Bug]: Watchdog fallback chain silently drops primary-failure root cause when fallback also fails — only terminal error surfaces in runtime_events [1 participants]

openclaw2026-04-25 20:13:32

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

openclaw/openclaw#71744•Fetched 2026-04-26 05:08:56

View on GitHub

Comments

Participants

Timeline

Reactions

Author

nikolaykazakovvs-ux

Participants

nikolaykazakovvs-ux

Timeline (top)

mentioned ×1subscribed ×1

When the watchdog fallback chain rotates a primary model to a fallback (e.g. Anthropic → local Ollama) and the fallback also fails (e.g. RAM exhaustion at the fallback host), the root cause of the original primary failure is silently dropped — the operator-visible message is the fallback's failure ("All models failed" / "model requires N GiB RAM") and the actual reason the primary went down is only available in the Claude CLI session's stderr, never in runtime_events.jsonl.

Error Message

Operator sees only "All models failed" / fallback's error message in the operator channel.
Inspect runtime_events.jsonl — there is no event describing why the primary was rotated, only (at best) the fallback's terminal error. The original primary failure (which is the actually actionable signal — e.g. "host swap exhausted, watchdog escalated at t=…") is in the Claude CLI session stderr, which most operators never read.

from_failure_detail: the actual error string the operator would need (truncated if huge)

Root Cause

Summary

RAW_BUFFERClick to expand / collapse

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

Summary

Steps to reproduce

Configure an agent with primary = Anthropic (claude-cli/claude-opus-4-7) and a local Ollama fallback (ollama/<large-model> whose RAM requirement is close to host headroom).
Trigger any condition that makes the primary fail (high host RAM pressure, slow gateway response, plugin hang — anything that flips the watchdog).
Watchdog rotates to Ollama. Ollama returns 500: model requires X GiB RAM, only Y GiB available.
Operator sees only "All models failed" / fallback's error message in the operator channel.
Inspect runtime_events.jsonl — there is no event describing why the primary was rotated, only (at best) the fallback's terminal error. The original primary failure (which is the actually actionable signal — e.g. "host swap exhausted, watchdog escalated at t=…") is in the Claude CLI session stderr, which most operators never read.

Expected behavior

Each step of the fallback chain should emit a runtime_event with at least:

type: fallback_step
from_model: e.g. claude-cli/claude-opus-4-7
to_model: e.g. ollama/<model>
from_failure_reason: short classifier (timeout, http_5xx, ram_pressure, plugin_error, signal, etc.)
from_failure_detail: the actual error string the operator would need (truncated if huge)
chain_position: integer
final_outcome: succeeded / next_fallback / chain_exhausted

This way runtime_events.jsonl becomes the single source of truth for "what happened" — operators don't have to guess between "the primary failed for reason A" vs "the fallback failed for reason B" vs both. Today only the terminal failure surfaces, which is often the least informative one (e.g. "Ollama OOM" tells you nothing about why Anthropic was rotated in the first place).

Actual behavior

Real incident on this deployment, 2026-04-17 ~17:34 UTC:

Forced Proactivity Probe sent to Opus 4-7 (claude-cli/claude-opus-4-7 primary).
Watchdog rotated to local Ollama fallback (ollama/gemma4).
Ollama returned: 500: model requires 13.7 GiB RAM, only 11.9 GiB available.
Operator received All models failed in Telegram with no context.
The actual root cause — host swap had been exhausted by 14 hung openclaw-agent processes (separate issue, see #71710) and the watchdog rotated because the primary timeout budget was being consumed by host I/O wait — was not in runtime_events.jsonl. It was only in Claude CLI session stderr, which required searching across multiple session files to reconstruct.

This made the incident much harder to diagnose than it needed to be: the fallback's "OOM" message pointed at a real but downstream symptom, while the actionable signal (host swap saturation → primary watchdog) lived nowhere structured.

OpenClaw version

2026.4.23 (incident reproduced on 2026.4.17 codebase, runtime_events.jsonl schema unchanged through 2026.4.23 per writer-script audit)

Operating system

Ubuntu 24.04.4 LTS (kernel 6.8.0-107-generic), systemd 255

Install method

npm global (/home/ubuntu/.npm-global/lib/node_modules/openclaw), Node v22.22.1

Model

Primary: claude-cli/claude-opus-4-7. Fallback chain (then-active): ollama/gemma4.

Provider / routing chain

openclaw -> auth-profile anthropic:claude-cli -> claude (CLI) -> anthropic.com
└── watchdog rotate → openclaw -> ollama (localhost:11434) -> gguf model

Additional provider/model setup details

Operator-side runtime_events writer is ~/scripts/write_decision.py (custom). The OpenClaw side that would emit fallback-step events doesn't currently; we capture from Claude CLI stderr post-hoc, which is unreliable and requires correlating timestamps across multiple log surfaces.

Suggested minimal change: in the existing watchdog fallback path, just before invoking the next provider, emit a runtime_event of type fallback_step with the fields listed under "Expected behavior". No retroactive replay needed — the events become available from the next firing.

Related (different angles on the same observability gap):

#62684 (closed) — Local Ollama agent pipeline times out with no provider logs. Same family of "fallback failure with no upstream context" symptom, a different root manifestation (Ollama timeout vs Ollama OOM).

Reported by @nikolaykazakovvs-ux via Cognitor (claude-opus-4-7 substrate).

extent analysis

TL;DR

Emit a runtime_event of type fallback_step with detailed information about the primary failure reason before invoking the next provider in the watchdog fallback path.

Guidance

Identify the point in the watchdog fallback path where the primary model failure reason is available and add code to emit a runtime_event with the required fields (type, from_model, to_model, from_failure_reason, from_failure_detail, chain_position, final_outcome).
Ensure the runtime_event is written to runtime_events.jsonl in a way that is consistent with the existing event writing mechanism.
Review the custom write_decision.py script to ensure it can handle the new fallback_step events correctly.
Test the updated watchdog fallback path to verify that the runtime_event is emitted correctly and contains the expected information.

Example

# Pseudo-code example of emitting a runtime_event
def emit_runtime_event(event_type, from_model, to_model, failure_reason, failure_detail):
    event = {
        'type': event_type,
        'from_model': from_model,
        'to_model': to_model,
        'from_failure_reason': failure_reason,
        'from_failure_detail': failure_detail,
        'chain_position': 1,  # Example chain position
        'final_outcome': 'next_fallback'  # Example final outcome
    }
    # Write the event to runtime_events.jsonl
    with open('runtime_events.jsonl', 'a') as f:
        f.write(json.dumps(event) + '\n')

# Example usage
emit_runtime_event('fallback_step', 'claude-cli/claude-opus-4-7', 'ollama/gemma4', 'timeout', 'Host swap exhausted')

Notes

The suggested change only addresses the observability gap for fallback chain failures and does not fix the underlying issues that cause the primary model

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

FAQ

Expected behavior

Each step of the fallback chain should emit a runtime_event with at least:

type: fallback_step
from_model: e.g. claude-cli/claude-opus-4-7
to_model: e.g. ollama/<model>
from_failure_reason: short classifier (timeout, http_5xx, ram_pressure, plugin_error, signal, etc.)
from_failure_detail: the actual error string the operator would need (truncated if huge)
chain_position: integer
final_outcome: succeeded / next_fallback / chain_exhausted

#container setup #orchestration issue #cache issue #memory leak #API versioning

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

openclaw - 💡(How to fix) Fix [Bug]: Watchdog fallback chain silently drops primary-failure root cause when fallback also fails — only terminal error surfaces in runtime_events [1 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Summary

Bug type

Beta release blocker

Summary

Steps to reproduce

Expected behavior

Actual behavior

OpenClaw version

Operating system

Install method

Model

Provider / routing chain

Additional provider/model setup details

extent analysis

TL;DR

Guidance

Example

Notes

FAQ

Expected behavior

Still need to ship something?

TRENDING

openclaw - 💡(How to fix) Fix [Bug]: Watchdog fallback chain silently drops primary-failure root cause when fallback also fails — only terminal error surfaces in runtime_events [1 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Summary

Bug type

Beta release blocker

Summary

Steps to reproduce

Expected behavior

Actual behavior

OpenClaw version

Operating system

Install method

Model

Provider / routing chain

Additional provider/model setup details

extent analysis

TL;DR

Guidance

Example

Notes

FAQ

Expected behavior

Still need to ship something?

RELATED_DISCOVERY

TRENDING