openclaw - ✅(Solved) Fix Agent run timeout during tool execution misclassified as LLM timeout, triggers unnecessary model fallback [1 pull requests, 3 comments, 3 participants]

GitHub stats
openclaw/openclaw#52147, fetched 2026-04-08 01:15:05
Comments: 3
Participants: 3
Timeline: 8
Reactions: 0
Timeline (top): commented ×3, cross-referenced ×3, mentioned ×1, subscribed ×1

Agent run timeout during long tool execution (e.g. process(poll)) is misclassified as "LLM request timed out", triggering unnecessary model fallback — even though the primary model responded correctly.

Error Message

WARN  embedded run timeout: runId=<redacted> sessionId=<redacted> timeoutMs=600000
DEBUG run cleanup: runId=<redacted> sessionId=<redacted> aborted=true timedOut=true
WARN  embedded_run_failover_decision: stage=assistant decision=fallback_model failoverReason=timeout provider=anthropic model=claude-opus-4-6 timedOut=true status=408
ERROR lane task error: lane=main durationMs=630119 error="FailoverError: LLM request timed out."

Root Cause

Failover condition (line 111112):

if (!aborted && failoverFailure || timedOut && !timedOutDuringCompaction) {
    // → enters fallback branch even when timeout was caused by tool execution
}

PR fix notes

PR #320: feat: OpenClaw adapter refactor + board UI redesign

Description (problem / solution / changelog)

Summary

Two-part PR combining OpenClaw adapter improvements with a board UI redesign inspired by Composio Agent Orchestrator and OpenAI Symphony.

Commit 1: OpenClaw adapter refactor

  • Restructured Rust payload types (ChatPayload → AgentPayload, simplified stream handling)
  • Switched from chat.send to agent method for gateway communication
  • Added timeout handling (timeout_ms, wait_timeout_ms) to send requests
  • Updated dispatcher routes for new adapter shape
  • Frontend: suppress model/reasoning selection when agent is "openclaw" (backend manages config)
  • Hide model controls in DispatcherPreferenceChips for OpenClaw agent
  • Added test for OpenClaw model selection suppression

Commit 2: Board UI redesign

  • New CIBadge component: standalone CI status badge with individual check list
  • New DiffSizeBadge component: compact diff size pill (+/- with XS-XL label)
  • Complete SessionCard rewrite: attention-aware styling, alert pills, quick reply, done variant, dynamic borders
  • WorkspaceOverview: replaced simple session buttons with rich SessionCard components
  • Symphony-style stat card captions ("Agent sessions in progress", "Cleared to land", etc.)
  • Kanban column captions ("Agents working", "Human needed", "Ready to land")
  • Added additions/deletions fields to DashboardPR type

User-Facing Release Notes

  • Session cards now show CI status, diff size, alert pills, and inline agent messaging directly on the card
  • Dashboard stat cards include descriptive captions explaining each metric
  • Kanban board columns show contextual subtitles explaining their purpose
  • Done sessions render as compact cards with expandable detail panels
  • Cards visually indicate priority: merge-ready sessions get a green border glow

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Plugin addition / modification
  • Documentation update
  • Refactor / chore

Checklist

  • pnpm build passes with no errors
  • pnpm typecheck passes with no errors
  • cargo clippy --workspace -- -D warnings passes
  • cargo test --workspace passes
  • No any types introduced without justification
  • No secrets or credentials committed
  • User-facing release notes are filled in plain English
  • PR title follows conventional commits

Testing

# Rust checks
cargo clippy --workspace -- -D warnings
cargo test --workspace

# Frontend checks
bun run --cwd packages/core build
bun run --cwd packages/cli build
bun run --cwd packages/web build
bun run --cwd packages/web typecheck

Screenshots / Demo

Board column captions and new session cards are visible in the dashboard at http://localhost:3000. Cards show:

  • Activity dot with pulse animation for active agents
  • CI status badges and diff size pills
  • Color-coded alert pills for issues needing attention
  • Inline quick-reply for messaging agents
  • Done card variant with expandable detail panel

Summary by CodeRabbit

Release Notes

  • New Features

    • Added CI status badge component with dynamic check counts and links to pull request checks
    • Added diff size badge displaying additions and deletions
  • Improvements

    • Refactored session cards with new interactive selection model and consolidated badge displays
    • Enhanced workspace dashboard with captions for statistics and improved session list layout
    • Workspace board now displays role captions beneath column headers
    • Optimized model and reasoning control visibility for better UI flow
    • Updated OpenClaw installation guidance to reference backend runtime settings

Changed files

  • crates/conductor-executors/src/agents/openclaw.rs (modified, +299/-209)
  • crates/conductor-server/src/routes/agents.rs (modified, +1/-1)
  • crates/conductor-server/src/routes/dispatcher.rs (modified, +43/-4)
  • packages/web/src/components/CIBadge.tsx (added, +140/-0)
  • packages/web/src/components/DiffSizeBadge.tsx (added, +30/-0)
  • packages/web/src/components/SessionCard.tsx (modified, +438/-277)
  • packages/web/src/components/board/WorkspaceKanban.tsx (modified, +17/-0)
  • packages/web/src/components/dispatcher/DispatcherPreferenceChips.tsx (modified, +58/-55)
  • packages/web/src/features/dashboard/components/WorkspaceOverview.tsx (modified, +14/-108)
  • packages/web/src/lib/agentModelSelection.test.ts (modified, +17/-0)
  • packages/web/src/lib/agentModelSelection.ts (modified, +28/-0)
  • packages/web/src/lib/knownAgents.ts (modified, +1/-1)
  • packages/web/src/lib/types.ts (modified, +2/-0)


Bug type

Agent behavior

Summary

Agent run timeout during long tool execution (e.g. process(poll)) is misclassified as "LLM request timed out", triggering unnecessary model fallback — even though the primary model responded correctly.

Steps to reproduce

  1. Configure an agent with claude-opus-4-6 as primary model and a fallback model chain
  2. In a session, have the agent spawn a background process via exec (background: true)
  3. Have the agent monitor the process using process(poll) with 2-minute poll intervals
  4. Wait for the total run time to exceed DEFAULT_AGENT_TIMEOUT_SECONDS (600s / 10 minutes)
  5. Observe: the run is aborted with "FailoverError: LLM request timed out." and falls back to the next model in the chain

Alternatively: any tool call that takes a long time (browser automation, long exec, repeated process polling) can trigger the same behavior.

Expected behavior

Tool execution time should not count toward the LLM request timeout. The primary model had already responded successfully and the agent was executing tool calls — no LLM request was in flight when the timeout fired.

Precedent: PR #46889 added timedOutDuringCompaction to exempt compaction operations from triggering failover. Long-running tool execution should receive the same treatment.

Actual behavior

The run-level timer fires after 600s regardless of what the agent is doing. When it fires during tool execution:

  1. abortRun(true) sets timedOut = true
  2. The failover decision checks timedOut && !timedOutDuringCompaction → enters fallback branch
  3. A FailoverError: LLM request timed out. is thrown
  4. The system fails over to the next configured model
  5. The fallback model receives the full context (~61K tokens) but has no useful work to do (the original task was already being handled)
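The sequence above can be reproduced in miniature. The sketch below is not OpenClaw's code: `decideFailover` and its argument shape are hypothetical, and only the check inside it mirrors the condition described in step 2. It shows that a timeout during a tool poll and a timeout during an LLM request are indistinguishable to that check.

```javascript
// Hypothetical model of the run state at the moment the run-level timer fires.
// Field names mirror the flags named in this report; the function is illustrative.
function decideFailover({ timedOut, timedOutDuringCompaction }) {
  // Same shape as the check described in step 2: only compaction is exempt.
  if (timedOut && !timedOutDuringCompaction) {
    return { decision: "fallback_model", failoverReason: "timeout" };
  }
  return { decision: "none", failoverReason: null };
}

// The timer fired while process(poll) was running; no LLM request was pending.
const duringToolPoll = decideFailover({
  timedOut: true,
  timedOutDuringCompaction: false,
});

// The timer fired during compaction, which is already exempt.
const duringCompaction = decideFailover({
  timedOut: true,
  timedOutDuringCompaction: true,
});

console.log(duringToolPoll.decision);   // fallback_model
console.log(duringCompaction.decision); // none
```

Because the state carries no record of what the run was doing when the timer fired, the tool-poll case falls through to the fallback branch.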

Gateway log sequence (redacted):

WARN  embedded run timeout: runId=<redacted> sessionId=<redacted> timeoutMs=600000
DEBUG run cleanup: runId=<redacted> sessionId=<redacted> aborted=true timedOut=true
WARN  embedded_run_failover_decision: stage=assistant decision=fallback_model failoverReason=timeout provider=anthropic model=claude-opus-4-6 timedOut=true status=408
ERROR lane task error: lane=main durationMs=630119 error="FailoverError: LLM request timed out."

The primary model (Opus) had completed its response. The agent was in a process(poll) loop monitoring a background worker:

  • Poll 1: 2 min wait → got output
  • Poll 2: 2 min wait → got output
  • Poll 3: 2 min wait → got output
  • Poll 4: 2 min wait → got output
  • Poll 5: started, aborted at ~17s by run timeout

Total tool execution time: ~10 minutes. No LLM request was pending.

OpenClaw version

2026.3.13 (61d171a)

Operating system

Ubuntu 24.04.4 LTS, Linux 6.17.0-19-generic x86_64

Install method

npm (global)

Model

claude-opus-4-6 (Anthropic) — primary model that was "timed out"

Provider / routing chain

anthropic/claude-opus-4-6 → openrouter/google/gemini-3.1-pro-preview (fallback #1) → openrouter/openai/gpt-5.4 (fallback #2) → ollama/qwen3.5:9b (fallback #3)

Additional provider/model setup details

Default agent timeout is the built-in DEFAULT_AGENT_TIMEOUT_SECONDS = 600. No custom agents.defaults.timeoutSeconds override was set.

The fallback model (gemini-3.1-pro-preview via OpenRouter) consumed ~61K input tokens at $0.145 cost but produced 0 completion tokens — confirming it had no useful work to do.

Logs, screenshots, and evidence

Source code analysis (minified bundle auth-profiles-DDVivXkv.js)

Timeout declaration and abort:

// Line 109601-109602
let timedOut = false;
let timedOutDuringCompaction = false;

Failover condition (line 111112):

if (!aborted && failoverFailure || timedOut && !timedOutDuringCompaction) {
    // → enters fallback branch even when timeout was caused by tool execution
}

User-facing error (line 111176):

if (timedOut && !timedOutDuringCompaction && payloads.length === 0) return {
    payloads: [{ text: "Request timed out before a response was generated...", isError: true }],
    // ...
};

Note: timedOutDuringCompaction is the only exemption. There is no timedOutDuringToolExecution or equivalent.
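It may help to spell out how the quoted failover condition groups. In JavaScript, `&&` binds more tightly than `||`, so the `!aborted` guard applies only to the `failoverFailure` clause and not to the timeout clause. The sketch below uses hypothetical function names, and the `failoverFailure` value for the logged run is an assumption (it is not shown in the logs).

```javascript
// Illustrative restatement of the quoted condition; function names are hypothetical.
function asWritten(aborted, failoverFailure, timedOut, timedOutDuringCompaction) {
  return !aborted && failoverFailure || timedOut && !timedOutDuringCompaction;
}

// The same condition with the implicit grouping made explicit.
function grouped(aborted, failoverFailure, timedOut, timedOutDuringCompaction) {
  return (!aborted && failoverFailure) || (timedOut && !timedOutDuringCompaction);
}

// Compare both forms over all 16 boolean combinations.
const bools = [false, true];
let mismatches = 0;
for (const aborted of bools)
  for (const failoverFailure of bools)
    for (const timedOut of bools)
      for (const duringCompaction of bools) {
        const a = asWritten(aborted, failoverFailure, timedOut, duringCompaction);
        const b = grouped(aborted, failoverFailure, timedOut, duringCompaction);
        if (a !== b) mismatches += 1;
      }

// Values from the redacted log: aborted=true, timedOut=true, no compaction.
// failoverFailure=false is assumed; the result is the same either way.
const fromLog = asWritten(true, false, true, false);

console.log(mismatches); // 0
console.log(fromLog);    // true
```

So even though the run was aborted, the timeout clause alone is enough to enter the fallback branch.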

Compaction precedent (PR #46889)

The timedOutDuringCompaction mechanism proves the design intent: certain non-LLM operations should not trigger failover. Tool execution is a missing case.

Impact and severity

  • Affected: Any agent using long-running tools (process polling, browser automation, multi-minute exec tasks) with a fallback model chain configured
  • Severity: Blocks workflow + unnecessary cost — the fallback model is invoked with the full conversation context but produces no useful output
  • Frequency: Deterministic — any tool execution exceeding DEFAULT_AGENT_TIMEOUT_SECONDS (600s) will trigger this
  • Consequence: (1) Wasted API cost on the fallback model, (2) original task is interrupted mid-execution, (3) misleading "LLM request timed out" error when the LLM was not involved

Additional information

Suggested fix

Add a timedOutDuringToolExecution flag (or refactor to a general timeoutCause enum) so tool execution time is exempt from the failover path, consistent with the existing compaction exemption:

// Current: only compaction is exempt
if (timedOut && !timedOutDuringCompaction) { /* failover */ }

// Proposed: tool execution also exempt
if (timedOut && !timedOutDuringCompaction && !timedOutDuringToolExecution) { /* failover */ }

// Or better: general timeout cause
if (timedOut && timeoutCause === 'llm_request') { /* failover */ }

Alternatively, the run deadline could be extended while tool execution is actively in flight (the same approach PR #46889 uses for compaction).
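One way to picture the deadline-extension alternative is a budget that is only charged while no tool call is running. This is a minimal sketch under stated assumptions: the `RunDeadline` class, its methods, and the 20 seconds of model time are all hypothetical and are not taken from OpenClaw or from PR #46889.

```javascript
// Hypothetical run deadline that does not count time spent inside tool calls.
class RunDeadline {
  constructor(budgetMs, now) {
    this.budgetMs = budgetMs; // time allowed outside tool execution
    this.usedMs = 0;          // budget consumed so far
    this.lastTick = now;      // last time the budget was updated
    this.toolsInFlight = 0;   // number of tool calls currently running
  }

  // Charge elapsed time to the budget only when no tool call is in flight.
  advance(now) {
    if (this.toolsInFlight === 0) this.usedMs += now - this.lastTick;
    this.lastTick = now;
  }

  beginTool(now) { this.advance(now); this.toolsInFlight += 1; }
  endTool(now) { this.advance(now); this.toolsInFlight -= 1; }

  isExpired(now) {
    this.advance(now);
    return this.usedMs >= this.budgetMs;
  }
}

// Simulated clock in milliseconds, loosely replaying the poll pattern.
const deadline = new RunDeadline(600000, 0);
let t = 20000; // assumed 20 s of model time before the first poll
for (let poll = 0; poll < 5; poll++) {
  deadline.beginTool(t);
  t += 120000; // each poll waits two minutes
  deadline.endTool(t);
}

console.log(t);                     // 620000
console.log(deadline.isExpired(t)); // false
console.log(deadline.usedMs);       // 20000
```

A design caveat: with this approach a tool call that never returns would never trip the run deadline, so a separate per-tool limit would probably still be needed.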

extent analysis

Fix Plan

To resolve the issue, we need to add a mechanism to exempt tool execution time from triggering the failover path. We can achieve this by introducing a timedOutDuringToolExecution flag or a more general timeoutCause enum.

Step-by-Step Solution:

  1. Add a timedOutDuringToolExecution flag: Introduce a new flag to track whether the timeout occurred during tool execution.
  2. Set the timedOutDuringToolExecution flag: When the run timeout fires while a tool call is in flight, set the flag before the failover decision is made.
  3. Update the failover condition: Modify the failover condition to check both timedOutDuringCompaction and timedOutDuringToolExecution.

Example Code:

// Add a timedOutDuringToolExecution flag
let timedOutDuringToolExecution = false;

// Set the flag in the run-timeout handler, before the failover check runs.
// toolExecutionInProgress is a hypothetical flag maintained by the tool runner.
if (timedOut && toolExecutionInProgress) {
    timedOutDuringToolExecution = true;
}

// Update the failover condition: compaction and tool execution are both exempt
if (timedOut && !timedOutDuringCompaction && !timedOutDuringToolExecution) {
    // Enter fallback branch
}

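Alternatively, you can use a timeoutCause enum, which extends more easily if further exemptions are needed:

// Define a timeoutCause enum
const TimeoutCause = Object.freeze({
    LLM_REQUEST: 'llm_request',
    TOOL_EXECUTION: 'tool_execution',
    COMPACTION: 'compaction'
});

// Set the timeoutCause when the run timeout fires, before the failover check.
// toolExecutionInProgress and compactionInProgress are hypothetical flags.
let timeoutCause = null;
if (timedOut) {
    timeoutCause = toolExecutionInProgress ? TimeoutCause.TOOL_EXECUTION
        : compactionInProgress ? TimeoutCause.COMPACTION
        : TimeoutCause.LLM_REQUEST;
}

// Update the failover condition: only a genuine LLM request timeout fails over
if (timedOut && timeoutCause === TimeoutCause.LLM_REQUEST) {
    // Enter fallback branch
}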

Verification

To verify that the fix worked, you can test the following scenarios:

  • Run a tool execution that exceeds the DEFAULT_AGENT_TIMEOUT_SECONDS (600s) and verify that the failover path is not triggered.
  • Run a tool execution that completes within the DEFAULT_AGENT_TIMEOUT_SECONDS (600s) and verify that the failover path is not triggered.
  • Run an LLM request that times out and verify that the failover path is triggered.
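The three scenarios above can be expressed as checks against the decision rule in isolation. This is only a sketch: `shouldFailover` is a hypothetical predicate implementing the proposed timeoutCause rule, and passing these checks says nothing about the real gateway, which would still need an end-to-end run with a long tool call.

```javascript
// Hypothetical predicate implementing the proposed rule: fail over only when
// the timeout was caused by an in-flight LLM request.
function shouldFailover({ timedOut, timeoutCause }) {
  return timedOut && timeoutCause === "llm_request";
}

// Minimal assertion helper so the example has no dependencies.
function expectEqual(actual, expected, label) {
  if (actual !== expected) {
    throw new Error(`${label}: expected ${expected}, got ${actual}`);
  }
  return label;
}

const results = [
  // Scenario 1: tool execution exceeds the run timeout, no failover expected.
  expectEqual(
    shouldFailover({ timedOut: true, timeoutCause: "tool_execution" }),
    false,
    "tool run exceeds timeout"
  ),
  // Scenario 2: tool execution finishes in time, no failover expected.
  expectEqual(
    shouldFailover({ timedOut: false, timeoutCause: null }),
    false,
    "tool run within timeout"
  ),
  // Scenario 3: the LLM request itself times out, failover expected.
  expectEqual(
    shouldFailover({ timedOut: true, timeoutCause: "llm_request" }),
    true,
    "LLM request times out"
  ),
];

console.log(`${results.length} scenarios passed`); // 3 scenarios passed
```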

Extra Tips

  • Make sure to update the documentation to reflect the changes made to the failover condition.
  • Consider adding logging to track the timedOutDuringToolExecution flag or timeoutCause enum to help with debugging.
  • Review the code to ensure that the timedOutDuringToolExecution flag or timeoutCause enum is properly reset after each tool execution or LLM request.
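The logging and reset tips can be combined in one small state holder. The sketch below is an assumption-laden illustration: the `RunTimeoutState` class, its phase names, and the key=value line it prints are hypothetical and do not reflect OpenClaw's internals or its actual log format.

```javascript
// Hypothetical per-run state holder; none of these names come from OpenClaw.
class RunTimeoutState {
  constructor() {
    this.reset();
  }

  // Clear everything so a previous run's cause cannot leak into the next one.
  reset() {
    this.phase = "idle"; // what the run is doing right now
    this.timedOut = false;
    this.timeoutCause = null;
  }

  // Record the current activity, e.g. "llm_request" or "tool_execution".
  enter(phase) {
    this.phase = phase;
  }

  // Called by the run-level timer: capture the phase active at that instant.
  onTimeout() {
    this.timedOut = true;
    this.timeoutCause = this.phase;
  }

  // Illustrative key=value line, not an actual OpenClaw log format.
  toLogLine() {
    return `timedOut=${this.timedOut} timeoutCause=${this.timeoutCause}`;
  }
}

const state = new RunTimeoutState();
state.enter("tool_execution");
state.onTimeout();
console.log(state.toLogLine()); // timedOut=true timeoutCause=tool_execution

state.reset();
console.log(state.toLogLine()); // timedOut=false timeoutCause=null
```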
