openclaw - ✅(Solved) Fix [Bug]: Active Memory `timeoutMs` not respected — failover chain blocks replies for 40-125 seconds [1 pull requests, 1 participants]

GitHub stats
openclaw/openclaw#64867 · Fetched 2026-04-12 13:26:28
Comments: 0 · Participants: 1 · Timeline: 6 · Reactions: 0
Timeline (top): labeled ×2, assigned ×1, closed ×1, cross-referenced ×1

The Active Memory plugin's config.timeoutMs setting (set to 15000ms) is not enforced as a hard cap on total execution time. When the initial LLM request times out, the embedded run failover chain (rotate auth profile → fallback model) adds additional retry attempts that are not bounded by timeoutMs. This results in Active Memory blocking the main reply for 40-125+ seconds instead of the configured 15 seconds.

This is a blocking issue — Active Memory runs synchronously before every reply in eligible sessions, so each timeout directly delays the user-visible response.

Error Message

Log Evidence

Timeout #2 (125.2 seconds, Haiku)

07:33:01 [plugins] active-memory: agent=main session=agent:main:main start timeoutMs=15000 queryChars=1391
07:33:49 [agent] embedded run failover decision: runId=active-memory-mnufpo6t-6e996d78 stage=assistant decision=rotate_profile reason=timeout provider=anthropic/claude-haiku-4-5 profile=sha256:154a23a3efe6
07:34:58 [agent] embedded run failover decision: runId=active-memory-mnufpo6t-6e996d78 stage=assistant decision=fallback_model reason=timeout provider=anthropic/claude-haiku-4-5 profile=sha256:ab919eb4413a
07:34:58 [diagnostic] lane task error: lane=session:agent:main:main:active-memory:735c698d8eb1 durationMs=117120 error="FailoverError: LLM request timed out."
07:35:06 [plugins] active-memory: agent=main session=agent:main:main done status=timeout elapsedMs=125243 summaryChars=0

Timeout #4 (123.8 seconds, Haiku)

07:57:29 [plugins] active-memory: agent=main session=agent:main:main start timeoutMs=15000 queryChars=1666
07:58:17 [agent] embedded run failover decision: runId=active-memory-mnugl5b0-1a2d5b36 stage=assistant decision=rotate_profile reason=timeout provider=anthropic/claude-haiku-4-5 profile=sha256:154a23a3efe6
07:59:33 [agent] embedded run failover decision: runId=active-memory-mnugl5b0-1a2d5b36 stage=assistant decision=fallback_model reason=timeout provider=anthropic/claude-haiku-4-5 profile=sha256:ab919eb4413a
07:59:33 [diagnostic] lane task error: lane=session:agent:main:main:active-memory:f3cb0a1f008b durationMs=123816 error="FailoverError: LLM request timed out."
07:59:33 [plugins] active-memory: agent=main session=agent:main:main done status=timeout elapsedMs=123821 summaryChars=0

Successful run immediately after timeout #4 (same session, 1.9 seconds)

08:01:40 [plugins] active-memory: agent=main session=agent:main:main start timeoutMs=15000 queryChars=1666
08:01:42 [plugins] active-memory: agent=main session=agent:main:main done status=ok elapsedMs=1913 summaryChars=204
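
To make the overrun concrete, the timestamps in the Timeout #2 excerpt can be turned into per-stage gaps. This is a minimal TypeScript sketch: the helper names (`toSeconds`, `field`, `summarize`) are illustrative and not part of OpenClaw, and the log lines are abbreviated copies of the ones above, keeping only the timestamp and the fields used here.

```typescript
// Abbreviated copies of the Timeout #2 lines (unused fields dropped).
const timeout2: string[] = [
  "07:33:01 [plugins] active-memory: start timeoutMs=15000",
  "07:33:49 [agent] embedded run failover decision: decision=rotate_profile reason=timeout",
  "07:34:58 [agent] embedded run failover decision: decision=fallback_model reason=timeout",
  "07:35:06 [plugins] active-memory: done status=timeout elapsedMs=125243",
];

// Convert a leading "HH:MM:SS" stamp into seconds since midnight.
function toSeconds(hms: string): number {
  const [h, m, s] = hms.split(":").map(Number);
  return h * 3600 + m * 60 + s;
}

// Read the first numeric `name=<digits>` field from the log text.
function field(text: string, name: string): number {
  const match = new RegExp(`${name}=(\\d+)`).exec(text);
  if (!match) throw new Error(`missing field: ${name}`);
  return Number(match[1]);
}

// Gaps between consecutive lines, plus how far the run exceeded timeoutMs.
function summarize(lines: string[]) {
  const stamps = lines.map((line) => toSeconds(line.slice(0, 8)));
  const gapsSec = stamps.slice(1).map((t, i) => t - stamps[i]);
  const text = lines.join("\n");
  const timeoutMs = field(text, "timeoutMs");
  const elapsedMs = field(text, "elapsedMs");
  return { gapsSec, timeoutMs, elapsedMs, overBudgetMs: elapsedMs - timeoutMs };
}

console.log(summarize(timeout2));
```

For Timeout #2 this yields gaps of 48 s, 69 s and 8 s, and an overrun of 110243 ms beyond the configured 15000 ms.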

Root Cause

The failover chain operates independently of timeoutMs:

  1. Initial LLM request starts
  2. Request times out (per some internal timeout, not timeoutMs)
  3. Failover step 1: rotate_profile — tries the same model with a different auth profile
  4. That also times out
  5. Failover step 2: fallback_model — tries a different model entirely
  6. That also times out (or the overall lane errors out)
  7. Active Memory finally returns status=timeout

Each step in the failover chain appears to get its own timeout window (~30-60s each), and timeoutMs does not abort the chain early. The timeoutMs value appears to only apply to the initial request, not the total embedded run lifecycle.
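
The arithmetic behind this can be shown with a toy model. It is only a sketch under stated assumptions: the stage names follow the `decision=` values in the logs, the window lengths are the gaps observed in Timeout #2 (48 s, 69 s, 8 s), and the functions are hypothetical, not OpenClaw code.

```typescript
interface Stage {
  name: string;
  windowMs: number; // time spent before this stage gave up
}

// Gaps taken from the Timeout #2 log excerpt.
const observedChain: Stage[] = [
  { name: "initial request", windowMs: 48_000 },
  { name: "after rotate_profile", windowMs: 69_000 },
  { name: "after fallback_model", windowMs: 8_000 },
];

// Without a cap on the whole chain, the caller waits for the sum of all windows.
function elapsedWithoutCap(chain: Stage[]): number {
  return chain.reduce((total, stage) => total + stage.windowMs, 0);
}

// With a hard cap, the wait can never exceed the cap, whatever the chain does.
function elapsedWithCap(chain: Stage[], capMs: number): number {
  return Math.min(elapsedWithoutCap(chain), capMs);
}

console.log(elapsedWithoutCap(observedChain)); // 125000
console.log(elapsedWithCap(observedChain, 15_000)); // 15000
```

The uncapped total of 125000 ms matches the logged elapsedMs=125243 to within rounding of the second-resolution timestamps.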

Fix Action

Workaround

Disabled Active Memory (config.enabled: false) until the timeout enforcement is fixed. The feature works well when the LLM responds promptly (1-2 seconds), but the unbounded failover chain makes it unusable in production.

PR fix notes

PR #65046: fix(active-memory): stop caller timeouts from continuing failover

Description (problem / solution / changelog)

Summary

  • stop embedded active-memory runs from treating caller-owned aborts as retry/failover candidates
  • ignore late active-memory payloads that arrive after the plugin timeout has already fired
  • add focused regression coverage for external-abort failover decisions and late timeout payloads
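
The PR diff is not reproduced on this page, so the following is only a sketch of the first two behaviors the summary describes, not the actual change. All names here (`FailureContext`, `abortedByCaller`, `decideFailover`, `RecallSlot`, and the `surface_error` decision) are hypothetical; only `rotate_profile` and `fallback_model` come from the logs.

```typescript
type Decision = "rotate_profile" | "fallback_model" | "surface_error";

interface FailureContext {
  abortedByCaller: boolean; // assumed flag: the caller's own timeout cancelled the run
  unusedProfiles: number; // auth profiles not yet tried
  fallbackModelAvailable: boolean;
}

// A caller-owned abort ends the run; only other failures may fail over.
function decideFailover(ctx: FailureContext): Decision {
  if (ctx.abortedByCaller) return "surface_error";
  if (ctx.unusedProfiles > 0) return "rotate_profile";
  if (ctx.fallbackModelAvailable) return "fallback_model";
  return "surface_error";
}

// Holds at most one result; anything offered after the timeout is dropped.
class RecallSlot {
  private settled = false;
  private summary = "";

  markTimedOut(): void {
    this.settled = true;
  }

  // Returns true if the payload was accepted, false if it arrived too late.
  offer(payload: string): boolean {
    if (this.settled) return false;
    this.settled = true;
    this.summary = payload;
    return true;
  }

  result(): string {
    return this.summary;
  }
}

const slot = new RecallSlot();
slot.markTimedOut();
console.log(slot.offer("late summary")); // false: ignored after timeout
console.log(
  decideFailover({ abortedByCaller: true, unusedProfiles: 1, fallbackModelAvailable: true }),
); // "surface_error"
```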

Testing

  • pnpm vitest run src/agents/pi-embedded-runner/run/failover-policy.test.ts extensions/active-memory/index.test.ts

Fixes #64867

Changed files

  • extensions/active-memory/index.test.ts (modified, +32/-0)
  • extensions/active-memory/index.ts (modified, +12/-0)
  • extensions/codex/src/app-server/event-projector.ts (modified, +1/-0)
  • src/agents/harness/selection.test.ts (modified, +1/-0)
  • src/agents/pi-embedded-runner.run-embedded-pi-agent.auth-profile-rotation.e2e.test.ts (modified, +1/-0)
  • src/agents/pi-embedded-runner/run.overflow-compaction.fixture.ts (modified, +1/-0)
  • src/agents/pi-embedded-runner/run.ts (modified, +5/-0)
  • src/agents/pi-embedded-runner/run/assistant-failover.ts (modified, +3/-1)
  • src/agents/pi-embedded-runner/run/attempt.ts (modified, +3/-0)
  • src/agents/pi-embedded-runner/run/failover-policy.test.ts (modified, +41/-0)
  • src/agents/pi-embedded-runner/run/failover-policy.ts (modified, +14/-0)
  • src/agents/pi-embedded-runner/run/types.ts (modified, +2/-0)
  • src/agents/pi-embedded-runner/usage-reporting.test.ts (modified, +1/-0)
  • src/agents/test-helpers/pi-embedded-runner-e2e-fixtures.ts (modified, +1/-0)


Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Steps to reproduce

Environment

  • OpenClaw version: 2026.4.10 (44e5b62)
  • Platform: Linux (Ubuntu 24.04), x64, Node v25.8.2
  • Channel: WhatsApp (direct messages)
  • Auth: Anthropic OAuth token (two profiles configured)
  • Memory backend: QMD
  • Context engine: lossless-claw 0.8.0

Configuration

{
  "plugins": {
    "entries": {
      "active-memory": {
        "enabled": true,
        "config": {
          "enabled": true,
          "agents": ["main"],
          "allowedChatTypes": ["direct"],
          "modelFallbackPolicy": "default-remote",
          "queryMode": "recent",
          "promptStyle": "balanced",
          "timeoutMs": 15000,
          "maxSummaryChars": 220,
          "persistTranscripts": false,
          "logging": true,
          "model": "anthropic/claude-haiku-4-5"
        }
      }
    }
  }
}

Expected behavior

config.timeoutMs: 15000 should be a hard ceiling on total Active Memory execution time, including any failover/retry attempts. If the initial request doesn't complete within 15 seconds, Active Memory should return NONE/empty and let the main reply proceed.
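
One way to express such a ceiling is to race the whole embedded run against a single timer and cancel the run when the timer wins. This is a minimal sketch, assuming the run accepts a standard `AbortSignal`; `recallWithCeiling`, `RecallResult` and the `run` callback are illustrative names, not the plugin's API.

```typescript
// Unique marker so a legitimate summary string is never mistaken for a timeout.
const TIMED_OUT = Symbol("timed out");

interface RecallResult {
  status: "ok" | "timeout";
  summary: string;
}

async function recallWithCeiling(
  run: (signal: AbortSignal) => Promise<string>, // whole run, failover included
  timeoutMs: number,
): Promise<RecallResult> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const ceiling = new Promise<typeof TIMED_OUT>((resolve) => {
    timer = setTimeout(() => {
      // Resolve first, then abort: if the run rejects synchronously on abort,
      // the ceiling has already settled and therefore wins the race.
      resolve(TIMED_OUT);
      controller.abort();
    }, timeoutMs);
  });

  try {
    const outcome = await Promise.race([run(controller.signal), ceiling]);
    if (typeof outcome === "string") return { status: "ok", summary: outcome };
    return { status: "timeout", summary: "" }; // empty result, main reply proceeds
  } finally {
    clearTimeout(timer);
  }
}
```

Because the timer covers the entire `run` call, any retries or fallbacks inside it share the same budget instead of each getting a fresh window.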

Actual behavior

Observed Behavior

Out of 14 Active Memory invocations in a ~40 minute window, 4 timed out with durations far exceeding the configured 15s timeout:

| # | Time | Model | Duration | Status | Failover Chain |
|---|------|-------|----------|--------|----------------|
| 1 | 07:28:59 | claude-opus-4-6* | 45.4s | timeout | rotate_profile → fallback_model |
| 2 | 07:33:01 | claude-haiku-4-5 | 125.2s | timeout | rotate_profile → fallback_model |
| 3 | 07:42:07 | claude-haiku-4-5 | 41.6s | timeout | rotate_profile → fallback_model |
| 4 | 07:57:29 | claude-haiku-4-5 | 123.8s | timeout | rotate_profile → fallback_model |

*Run #1 used Opus because the config.model patch had not yet taken effect (no restart had happened at that point).

The remaining 10 invocations completed successfully in 1.2-2.2 seconds with status=empty or status=ok.

OpenClaw version

2026.4.10

Operating system

Ubuntu 24.04

Install method

npm global

Model

anthropic/opus-4-6

Provider / routing chain

Openclaw -> anthropic -> opus

Impact and severity

  • Every timeout blocks the user-visible reply by 40-125 seconds
  • Active Memory is a blocking pre-reply step — there is no async fallback
  • The pattern is intermittent (4/14 = ~29% failure rate) but severe when it hits
  • WhatsApp disconnects (status 408) correlate with the long blocking periods, suggesting the connection idles out while waiting

Additional information

Suggested Fix

  1. Hard timeout enforcement: timeoutMs should wrap the entire embedded run lifecycle (initial request + all failover attempts). If the total elapsed time exceeds timeoutMs, abort immediately and return empty/NONE.

  2. Or: disable failover for Active Memory runs. The Active Memory plugin is latency-sensitive by design. The failover chain (rotate profile → fallback model) is appropriate for main replies but counterproductive for a blocking pre-reply sub-agent where speed matters more than resilience.

  3. Or: add a separate maxTotalTimeoutMs config that caps the entire chain, independent of per-attempt timeouts.
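
Option 3 amounts to handing each attempt only what is left of a shared budget. The sketch below illustrates that idea under stated assumptions: `maxTotalTimeoutMs` is the name proposed in the list above, while `perAttemptTimeoutMs`, `nextAttemptBudgetMs` and `planChain` are hypothetical names used only for illustration.

```typescript
interface TimeoutPolicy {
  perAttemptTimeoutMs: number; // assumed per-attempt limit
  maxTotalTimeoutMs: number; // cap on the whole chain, as proposed above
}

// Budget for the next attempt: the per-attempt limit or whatever is left, if less.
function nextAttemptBudgetMs(policy: TimeoutPolicy, elapsedMs: number): number {
  const remaining = policy.maxTotalTimeoutMs - elapsedMs;
  return Math.max(0, Math.min(policy.perAttemptTimeoutMs, remaining));
}

// Worst-case plan: every attempt uses its full budget before timing out.
function planChain(policy: TimeoutPolicy, stages: string[]) {
  const plan: { stage: string; budgetMs: number }[] = [];
  let elapsedMs = 0;
  for (const stage of stages) {
    const budgetMs = nextAttemptBudgetMs(policy, elapsedMs);
    if (budgetMs === 0) break; // budget exhausted: skip the remaining stages
    plan.push({ stage, budgetMs });
    elapsedMs += budgetMs;
  }
  return plan;
}

console.log(
  planChain(
    { perAttemptTimeoutMs: 10_000, maxTotalTimeoutMs: 15_000 },
    ["initial", "rotate_profile", "fallback_model"],
  ),
);
```

With a 10 s per-attempt limit and a 15 s total, the initial request gets 10 s, rotate_profile gets the remaining 5 s, and fallback_model is never started.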

Additional Context

  • Two Anthropic auth profiles are configured (OAuth tokens). The failover chain rotates between them before falling back to a different model.
  • The main session LLM requests (Opus) are working fine during the same time window — this appears specific to the embedded Active Memory sub-agent runner.
  • modelFallbackPolicy: "default-remote" was set per the docs' recommended setup. Switching to "resolved-only" might reduce the chain length but wouldn't fix the core issue of timeoutMs not being enforced.

extent analysis

TL;DR

Implement a hard timeout enforcement for the entire embedded run lifecycle of the Active Memory plugin, ensuring that the timeoutMs setting is respected across all failover attempts.

Guidance

  • Review the Active Memory plugin's configuration to understand how timeoutMs is currently being applied and identify areas where the failover chain is not respecting this timeout.
  • Consider disabling the failover chain for Active Memory runs or implementing a separate maxTotalTimeoutMs config to cap the entire chain, as suggested in the issue.
  • Verify that the main session LLM requests are working as expected and that the issue is specific to the embedded Active Memory sub-agent runner.
  • Test the workaround of disabling Active Memory until the timeout enforcement is fixed to ensure that the feature works well when the LLM responds promptly.

Example

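The issue itself does not prescribe a specific code change; it frames the problem as a design question about how the Active Memory plugin enforces timeoutMs. The linked fix is PR #65046 (see PR fix notes), which changes the embedded runner's failover policy and the plugin's handling of late payloads.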

Notes

The issue highlights the importance of respecting the timeoutMs setting across all failover attempts in the Active Memory plugin to prevent blocking issues. The suggested fixes and workarounds aim to address this specific problem.

Recommendation

Apply the workaround: disable Active Memory (config.enabled: false) on builds that do not yet include the fix from PR #65046. This prevents the blocking delays caused by the unbounded failover chain and is a temporary measure until the fix is in place.
