openclaw - ✅(Solved) Fix active-memory embedded sub-agent WebSocket handshake timeout (all models affected) [1 pull requests, 3 comments, 3 participants]

hfgwq2dgx8-tech · 2026-04-29T11:10:38Z

GitHub stats

openclaw/openclaw#74292 • Fetched 2026-04-30 06:26:03

Timeline (top): cross-referenced ×5, commented ×3, closed ×1, mentioned ×1

Error Message

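The gateway logs show each active-memory recall ending with status=timeout and summaryChars=0 (for example elapsedMs=27059), followed by repeated "handshake timeout" and "closed before connect" WebSocket entries with code=1006 and reason=n/a.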

Root Cause

The active-memory plugin's embedded sub-agent consistently times out (~17-45s) regardless of which model is used. The reporter's hypothesis is a WebSocket handshake failure when the embedded sub-agent tries to connect back to the Gateway to execute the memory_search / memory_get tools. The related PR #74480 addresses a different cause: the embedded recall runner did not receive the setup-grace timeout budget.

Fix Action

Fix / Workaround

Workaround: Disabling active-memory eliminates the error. Main agent functionality works normally without it.

PR fix notes

PR #74480: fix(active-memory): preserve setup grace for embedded recall

Repository: openclaw/openclaw
Author: volcano303
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/74480

Description (problem / solution / changelog)

Summary

Describe the problem and fix in 2–5 bullets:

- Problem: Active Memory could time out at the raw configured recall timeout because the embedded recall runner did not receive the setup-grace timeout budget.
- Why it matters: This made provider-agnostic Active Memory runs fail around 15s with empty output even though the plugin wrapper already intended to allow setup grace.
- What changed: Active Memory now uses one effective embedded recall timeout equal to `timeoutMs + setupGraceTimeoutMs` for the embedded runner, outer watchdog, and hook timeout.
- What did NOT change (scope boundary): This does not change memory search behavior, provider routing, circuit breaker semantics, or the configured `timeoutMs` clamp itself.

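Change Type (select all)

- [x] Bug fix
- [ ] Feature
- [ ] Refactor required for the fix
- [ ] Docs
- [ ] Security hardening
- [ ] Chore/infra

Scope (select all touched areas)

- [ ] Gateway / orchestration
- [x] Skills / tool execution
- [ ] Auth / tokens
- [x] Memory / storage
- [ ] Integrations
- [ ] API / contracts
- [ ] UI / DX
- [ ] CI/CD / infra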

Linked Issue/PR

- Closes #73306
- Related #74292
- [x] This PR fixes a bug or regression

Root Cause (if applicable)

- Root cause: Active Memory computed a setup-grace-aware timeout for the outer watchdog and hook, but passed only `config.timeoutMs` into `runEmbeddedPiAgent`, allowing the embedded runner to self-timeout before the grace window could help.
- Missing detection / guardrail: The setup-grace regression test did not assert that the embedded runner itself received the grace-extended timeout.
- Contributing context (if known): Recent fixes added `memory_recall`, circuit breaker behavior, and setup-grace handling, but the timeout budget was not applied consistently across the embedded runner boundary.
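
To make the root cause concrete, here is a minimal sketch of the timeout handoff before and after the fix. It is not the plugin's actual code: the `TimeoutConfig` and `TimeoutPlan` shapes, the function names, and the 5000 ms grace value are assumptions for illustration. Only `timeoutMs` and `setupGraceTimeoutMs` are names taken from the PR description.

```typescript
// Hypothetical shapes for illustration; only timeoutMs and setupGraceTimeoutMs
// are names taken from the PR description.
interface TimeoutConfig {
  timeoutMs: number;
  setupGraceTimeoutMs: number;
}

interface TimeoutPlan {
  watchdogMs: number; // outer watchdog around the recall
  hookMs: number; // hook timeout
  runnerMs: number; // value handed to the embedded runner
}

// Before the fix: watchdog and hook are grace-aware, the runner gets the raw timeout.
function planBeforeFix(config: TimeoutConfig): TimeoutPlan {
  const graceAware = config.timeoutMs + config.setupGraceTimeoutMs;
  return { watchdogMs: graceAware, hookMs: graceAware, runnerMs: config.timeoutMs };
}

// After the fix: one effective timeout is used on all three paths.
function planAfterFix(config: TimeoutConfig): TimeoutPlan {
  const effective = config.timeoutMs + config.setupGraceTimeoutMs;
  return { watchdogMs: effective, hookMs: effective, runnerMs: effective };
}

// Whichever deadline fires first decides when the recall is cut off.
function firstDeadlineMs(plan: TimeoutPlan): number {
  return Math.min(plan.watchdogMs, plan.hookMs, plan.runnerMs);
}

// setupGraceTimeoutMs = 5000 is an assumed value, not taken from the source.
const exampleConfig: TimeoutConfig = { timeoutMs: 15000, setupGraceTimeoutMs: 5000 };
console.log(firstDeadlineMs(planBeforeFix(exampleConfig))); // 15000
console.log(firstDeadlineMs(planAfterFix(exampleConfig))); // 20000
```

In the "before" plan the runner's own deadline is the smallest of the three, so the grace window on the watchdog and hook never comes into play.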

Regression Test Plan (if applicable)

- Coverage level that should have caught this:
  - [x] Unit test
  - [ ] Seam / integration test
  - [ ] End-to-end test
  - [ ] Existing coverage already sufficient
- Target test or file: `extensions/active-memory/index.test.ts`
- Scenario the test should lock in: Active Memory setup delay longer than raw `timeoutMs` but within setup grace should still succeed, and `runEmbeddedPiAgent` should receive `timeoutMs + setupGraceTimeoutMs`.
- Why this is the smallest reliable guardrail: The bug is at the plugin-to-embedded-runner timeout handoff, so a focused plugin unit test catches it without requiring a live provider.
- Existing test that already covers this (if any): Existing setup-grace test covered the outer behavior but not the embedded runner timeout argument.
- If no new test is added, why not: N/A
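
The guardrail described above can be sketched as a spy that records what the runner receives. This is an illustrative stand-in, not the contents of `extensions/active-memory/index.test.ts`: `runRecall`, `Runner`, and `RunnerOptions` are hypothetical names, and the real signature of `runEmbeddedPiAgent` is not shown in the source.

```typescript
// Hypothetical runner contract; the real runEmbeddedPiAgent signature is not
// given in the source, so a minimal stand-in is used here.
type RunnerOptions = { timeoutMs: number };
type Runner = (options: RunnerOptions) => string;

interface RecallConfig {
  timeoutMs: number;
  setupGraceTimeoutMs: number;
}

// Hypothetical plugin entry point with an injectable runner, so a unit test
// can observe the timeout argument without a live provider.
function runRecall(config: RecallConfig, runner: Runner): string {
  return runner({ timeoutMs: config.timeoutMs + config.setupGraceTimeoutMs });
}

// Spy runner: records every timeout it is called with.
const receivedTimeouts: number[] = [];
const spyRunner: Runner = (options) => {
  receivedTimeouts.push(options.timeoutMs);
  return "summary";
};

const recallResult = runRecall({ timeoutMs: 100, setupGraceTimeoutMs: 50 }, spyRunner);

// The assertion the original test was missing: check the runner's argument,
// not only the outer behavior.
if (receivedTimeouts.length !== 1 || receivedTimeouts[0] !== 150) {
  throw new Error(`runner received ${String(receivedTimeouts[0])}, expected 150`);
}
console.log(recallResult, receivedTimeouts);
```

Asserting on the argument itself is what distinguishes this from the existing setup-grace test, which, per the PR, covered only the outer behavior.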

User-visible / Behavior Changes

Active Memory recall gets the intended setup-grace budget before timing out, reducing false 15s timeout failures during embedded runner startup/setup.

Diagram (if applicable)

Before:
Active Memory timeoutMs=15000 -> embedded runner timeoutMs=15000 -> setup can consume budget -> timeout

After:
Active Memory timeoutMs=15000 + setup grace -> embedded runner gets effective timeout -> recall can complete within grace window

Security Impact (required)

New permissions/capabilities? (No)
Secrets/tokens handling changed? (No)
New/changed network calls? (No)
Command/tool execution surface changed? (No)
Data access scope changed? (No)
If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

OS: Linux
Runtime/container: Node 22.14.0, pnpm
Model/provider: Not required for unit verification
Integration/channel (if any): Active Memory plugin
Relevant config (redacted): active-memory.timeoutMs with setup grace enabled

Steps

1. Configure Active Memory with a small `timeoutMs`.
2. Simulate embedded recall setup taking longer than raw `timeoutMs` but less than `timeoutMs + setupGraceTimeoutMs`.
3. Run the Active Memory plugin tests.
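
The scenario in these steps can be modelled without timers. The sketch below is a deliberate simplification: it assumes the runner's deadline covers setup plus recall work, and every name and number in it is hypothetical.

```typescript
type RecallOutcome = "ok" | "timeout";

// Simplified model (an assumption): the run succeeds only if setup plus recall
// work fits inside the timeout handed to the embedded runner.
function simulateRecall(setupMs: number, recallWorkMs: number, runnerTimeoutMs: number): RecallOutcome {
  return setupMs + recallWorkMs <= runnerTimeoutMs ? "ok" : "timeout";
}

// Illustrative numbers: setup alone (150 ms) exceeds the raw timeout (100 ms)
// but fits inside timeoutMs + setupGraceTimeoutMs (300 ms).
const rawTimeoutMs = 100;
const graceMs = 200;
const setupDelayMs = 150;
const recallWorkMs = 20;

const outcomeBeforeFix = simulateRecall(setupDelayMs, recallWorkMs, rawTimeoutMs);
const outcomeAfterFix = simulateRecall(setupDelayMs, recallWorkMs, rawTimeoutMs + graceMs);

console.log(outcomeBeforeFix); // "timeout"
console.log(outcomeAfterFix); // "ok"
```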

Expected

Embedded recall receives the setup-grace-extended timeout.
Recall can succeed when setup delay fits inside the grace window.

Actual

Before this PR, the embedded runner received only raw timeoutMs.
After this PR, the embedded runner receives timeoutMs + setupGraceTimeoutMs.

Evidence

Attach at least one:

- [x] Failing test/log before + passing after
- [ ] Trace/log snippets
- [ ] Screenshot/recording
- [ ] Perf numbers (if relevant)

Compatibility / Migration

Backward compatible? (Yes)
Config/env changes? (No)
Migration needed? (No)
If yes, exact upgrade steps: N/A

Risks and Mitigations

Risk: Active Memory embedded recall may run longer than operators expect if they interpreted timeoutMs as a hard total wall-clock limit.
Mitigation: This matches the existing setup-grace design already used by the outer watchdog and hook timeout; the configured timeout clamp and circuit breaker remain in place.


Verification commands to mention if needed:

```bash
pnpm test:fast extensions/active-memory/index.test.ts
pnpm test:fast extensions/active-memory/config.test.ts
pnpm exec oxlint extensions/active-memory/index.ts extensions/active-memory/index.test.ts
git diff --check
```

## Changed files

- `extensions/active-memory/index.test.ts` (modified, +13/-7)
- `extensions/active-memory/index.ts` (modified, +9/-3)

RAW_BUFFER

Environment:

OpenClaw: 2026.4.26 (be8c246)
OS: macOS 15.2 (arm64, Mac mini M2)
Node: v24.14.0
Config: 6 agents, active-memory enabled for all, models tested: GLM-5.1 / GLM-5-Turbo / DeepSeek V4 Flash

Description:

Evidence from gateway logs:

active-memory: agent=duoduo session=agent:duoduo:feishu:direct:ou_... activeProvider=deepseek activeModel=deepseek-v4-flash done status=timeout elapsedMs=27059 summaryChars=0

handshake timeout + closed before connect (repeated)
conn=... peer=127.0.0.1:...->127.0.0.1:18789 ... code=1006 reason=n/a
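
When scanning many such entries, it can help to pull the `key=value` fields out of the active-memory line. The following is a small sketch that assumes the whitespace-separated format shown above; `parseKeyValues` and `isEmptyTimeout` are hypothetical helpers, not part of OpenClaw.

```typescript
// Split a log line into key=value fields; tokens without "=" (such as "done")
// are ignored. Only the first "=" in a token separates key from value.
function parseKeyValues(line: string): Record<string, string> {
  const fields: Record<string, string> = {};
  for (const token of line.split(/\s+/)) {
    const eq = token.indexOf("=");
    if (eq > 0) {
      fields[token.slice(0, eq)] = token.slice(eq + 1);
    }
  }
  return fields;
}

// True for the failure signature in this issue: a timeout with no summary.
function isEmptyTimeout(line: string): boolean {
  const fields = parseKeyValues(line);
  return fields["status"] === "timeout" && Number(fields["summaryChars"]) === 0;
}

// Sample line copied from the issue (elided values left as-is).
const sampleLine =
  "active-memory: agent=duoduo session=agent:duoduo:feishu:direct:ou_... " +
  "activeProvider=deepseek activeModel=deepseek-v4-flash done " +
  "status=timeout elapsedMs=27059 summaryChars=0";

const parsedFields = parseKeyValues(sampleLine);
console.log(parsedFields["status"], Number(parsedFields["elapsedMs"]), isEmptyTimeout(sampleLine));
```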

Key observations:

- Direct API calls to the same models respond in <1s (tested with curl, DeepSeek V4 Flash: 0.6s)
- All 3 tested models (GLM-5.1, GLM-5-Turbo, DeepSeek V4 Flash) produce the same timeout
- `summaryChars=0` on every attempt: the sub-agent never completes its tool calls
- Embedding index is healthy: 110 files, 940 chunks, and `openclaw memory status --deep` shows no errors
- The WS `closed before connect` / `code=1006` entries appear immediately after each active-memory timeout
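
For reading the `code=1006` entries: in RFC 6455, 1006 is a reserved code meaning the connection ended without a close frame, so it indicates an abnormal drop but not its cause. The sketch below extracts the code from a gateway log line and maps a few well-known codes. The helper names are hypothetical, and the table is a subset, not the full registry.

```typescript
// A few close codes from RFC 6455 section 7.4.1 (subset for illustration).
const CLOSE_CODE_MEANINGS: Record<number, string> = {
  1000: "normal closure",
  1001: "going away",
  1002: "protocol error",
  1006: "abnormal closure (no close frame received)",
  1011: "server encountered an unexpected condition",
};

function describeCloseCode(code: number): string {
  return CLOSE_CODE_MEANINGS[code] ?? "unlisted or application-specific code";
}

// Pull the numeric close code out of a gateway log line, if present.
function extractCloseCode(line: string): number | undefined {
  const match = /\bcode=(\d+)\b/.exec(line);
  return match ? Number(match[1]) : undefined;
}

// Sample line copied from the issue (elided values left as-is).
const gatewayLine = "conn=... peer=127.0.0.1:...->127.0.0.1:18789 ... code=1006 reason=n/a";
const closeCode = extractCloseCode(gatewayLine);
console.log(closeCode, closeCode === undefined ? "no code" : describeCloseCode(closeCode));
```

Because 1006 is reported whenever the peer goes away without a close handshake, it would also appear if the client side simply abandoned the connection when its own timeout fired.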

Config:

{
  "active-memory": {
    "enabled": true,
    "config": {
      "agents": ["duoduo", "miumiu", "jelly", "sugarbaby", "der", "zhumi"],
      "model": "deepseek/deepseek-v4-flash",
      "modelFallback": "glm/GLM-5-Turbo",
      "queryMode": "recent",
      "promptStyle": "balanced",
      "allowedChatTypes": ["direct"],
      "timeoutMs": 15000,
      "maxSummaryChars": 220,
      "logging": true
    }
  }
}

Expected behavior: Active memory sub-agent connects to Gateway via WebSocket, executes memory_search / memory_get, and returns a summary within the timeout window.

Actual behavior: WebSocket handshake fails consistently, causing the sub-agent to timeout without producing any summary.

Workaround: Disabling active-memory eliminates the error. Main agent functionality works normally without it.

Additional context:

Gateway bind: 127.0.0.1:18789
LaunchAgent: gui/501/ai.openclaw.gateway
Embedding provider: GLM embedding-3 (remote, working)
Memory index: SQLite backend, all files indexed

extent analysis

TL;DR

The WebSocket handshake failure between the active-memory sub-agent and the Gateway is likely causing the timeouts, and adjusting the timeoutMs configuration or checking the Gateway's WebSocket connection settings may help.

Guidance

- Verify the Gateway's WebSocket connection settings to ensure it is properly configured to handle connections from the active-memory sub-agent.
- Consider increasing the `timeoutMs` value in the active-memory configuration to allow more time for the WebSocket handshake to complete.
- Check the system's firewall or network settings to ensure that the connection between the sub-agent and the Gateway is not being blocked.
- Test the WebSocket connection independently using a tool like `wscat` to isolate the issue.

Example

No code snippet is provided as the issue seems to be related to configuration and network settings rather than code.

Notes

The issue may be related to the specific network environment or the Gateway's configuration, so further investigation and testing may be needed to determine the root cause.

Recommendation

Apply workaround: Adjust the timeoutMs value or check the Gateway's WebSocket connection settings, as this may help resolve the issue without requiring a full fix.

#api #integration issue #index setup #retrieval issue #search optimization
