openclaw - 💡(How to fix) Fix [Bug]: ReplyRunAlreadyActiveError still fires on 5.4 for discrete sequential chat.send (follow-up to #77485) [1 comments, 2 participants]

openclaw • 2026-05-05 17:16:44

GitHub stats

openclaw/openclaw#77960•Fetched 2026-05-06 06:18:44

Author

bws14email

Participants

bws14email

clawsweeper[bot]

Timeline (top)

labeled ×2, commented ×1

Issue #77485 is listed as fixed in the 2026.5.4 release notes ("clear the active reply-run guard before draining queued same-session follow-up turns"), and the bundled binary is the new one (file hashes changed). Even so, the same alternating ReplyRunAlreadyActiveError pattern, with 50% of chats failing, still reproduces on our discrete sequential chat.send path through the gateway WebSocket.

Fix / Workaround

Affected version: 2026.5.4 (commit 325df3e). This is the same regression reported against 5.3 in #77485, which the release notes say Steipete fixed in commit a9817a5, but the fix bundled in 5.4 does not cover this path. Last known good: 2026.4.26 (commit be8c246).

Bug type

Regression (worked before, now fails)

Beta release blocker

Summary

Steps to reproduce

1. Install OpenClaw 2026.5.4 (verified that the bundled binary differs from 5.3: dist/run-state-Bg5KVIP6.js vs 5.3's run-state-B5YH0TzQ.js, sha256 3cdea3a69fe7… vs 3ac08e0e7c7c…).
2. Switch the active version to 2026.5.4 and start the gateway in embedded mode.
3. Send 10 sequential chat requests through the gateway WebSocket path with a 1-second pause between each (failed calls return in ~300ms, so the 1s gap is well past the prior call's wall-clock completion):

       for i in 1 2 3 4 5 6 7 8 9 10; do
         curl -s -X POST http://127.0.0.1:18789/chat.send \
           -H "Content-Type: application/json" \
           -d "{\"sessionKey\":\"agent:test:main\",\"message\":\"Reply containing the literal text: ok-$i-$(date +%s)\"}"
         sleep 1
       done

4. Inspect the gateway error log: ReplyRunAlreadyActiveError fires 15+ times across the 10-call probe.
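The shell loop above does not record how long each call takes, which is the signal that separates fast fails from real replies. The following is a minimal sketch of the same probe that captures per-call wall-clock time. The `runProbe`, `Send`, and `Sleep` names are hypothetical, and the transport is injected rather than hard-coded, so nothing here assumes a particular OpenClaw client API.

```typescript
// Hypothetical probe harness; not part of OpenClaw.
interface ProbeResult {
  call: number;
  elapsedMs: number;
  reply: string;
}

// Injected transport: in a real run this would POST to the gateway
// (e.g. the http://127.0.0.1:18789/chat.send endpoint used by the curl loop).
type Send = (sessionKey: string, message: string) => Promise<string>;
// Injected delay, e.g. a setTimeout-based sleep.
type Sleep = (ms: number) => Promise<void>;

async function runProbe(
  send: Send,
  sleep: Sleep,
  calls: number = 10,
  gapMs: number = 1000,
  sessionKey: string = "agent:test:main",
): Promise<ProbeResult[]> {
  const results: ProbeResult[] = [];
  for (let call = 1; call <= calls; call++) {
    const started = Date.now();
    // Same message shape as the shell loop: ok-<index>-<unix seconds>.
    const message = `Reply containing the literal text: ok-${call}-${Math.floor(started / 1000)}`;
    const reply = await send(sessionKey, message);
    results.push({ call, elapsedMs: Date.now() - started, reply });
    // Pause between calls only; no trailing sleep after the last one.
    if (call < calls) await sleep(gapMs);
  }
  return results;
}
```

Each call is awaited before the next one starts, so the requests are strictly sequential, matching the probe described in the steps.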

Expected behavior

After the fix shipped in 2026.5.4 (release notes citing fix for #77485), all 10 sequential chat requests should reach the LLM and return real replies, matching the 2026.4.26 behavior where the same workload produces ~1.2-1.7s warm replies on every call with zero ReplyRunAlreadyActiveError events in the gateway log.

Actual behavior

Same alternating fast-fail / pass pattern as #77485 reported on 5.3. 10-call probe wall-clock timings on 2026.5.4:

  call 1: 317ms FAIL (empty/canned reply)
  call 2: 1689ms pass (real reply)
  call 3: 302ms FAIL
  call 4: 1876ms pass
  call 5: 299ms FAIL
  call 6: 1592ms pass
  call 7: 303ms FAIL
  call 8: 1778ms pass
  call 9: 315ms FAIL
  call 10: 1604ms pass

Gateway error log shows 16 occurrences of: followup queue drain failed for agent:test:main: ReplyRunAlreadyActiveError: Reply run already active for agent:test:main

Failed calls return in ~300ms (below provider RTT) — gateway throws before any LLM dispatch. Identical symptom shape to #77485 on 5.3.
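The alternating pattern in the timings above can be checked mechanically. This is a small sketch that classifies each call as a fast fail or a pass from its wall-clock time alone. The 1000 ms cutoff is an assumption chosen to sit between the ~300ms failures and the 1.6-1.9s passes; it is not a value from OpenClaw.

```typescript
// Wall-clock timings from the 10-call probe on 2026.5.4, in call order (ms).
const probeTimingsMs: number[] = [317, 1689, 302, 1876, 299, 1592, 303, 1778, 315, 1604];

// Assumption: anything faster than this never reached the LLM provider.
const FAST_FAIL_CUTOFF_MS = 1000;

interface ProbeSummary {
  fastFails: number;
  failureRate: number;
  strictlyAlternating: boolean;
  meanFastFailMs: number;
  meanPassMs: number;
}

function mean(values: number[]): number {
  return values.length === 0 ? NaN : values.reduce((sum, v) => sum + v, 0) / values.length;
}

function summarizeProbe(timingsMs: number[], cutoffMs: number): ProbeSummary {
  const isFastFail = timingsMs.map((ms) => ms < cutoffMs);
  const fast = timingsMs.filter((ms) => ms < cutoffMs);
  const slow = timingsMs.filter((ms) => ms >= cutoffMs);
  return {
    fastFails: fast.length,
    failureRate: fast.length / timingsMs.length,
    // True when every call flips outcome relative to the previous call.
    strictlyAlternating: isFastFail.every((flag, i) => i === 0 || flag !== isFastFail[i - 1]),
    meanFastFailMs: mean(fast),
    meanPassMs: mean(slow),
  };
}

const summary = summarizeProbe(probeTimingsMs, FAST_FAIL_CUTOFF_MS);
```

For the reported timings this gives 5 fast fails out of 10, a strictly alternating sequence, and means of about 307 ms for failures and 1708 ms for passes.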

OpenClaw version

Operating system

Ubuntu 24.04 noble (Linux x64), kernel 6.8

Install method

Custom side-by-side installer running: npm install --production --legacy-peer-deps --ignore-scripts; node_modules/openclaw -> ../ self-symlink applied post-install (same as #77485 setup)

Model


Provider / routing chain


Additional provider/model setup details

Bundled-binary verification (proves we have the 5.4 fix code, not stale 5.3):

- dist/run-state-B5YH0TzQ.js (5.3): sha256 3ac08e0e7c7c201740d1910290ab16b5f0da6bc27a37d5c28f3c0c9b85656e65
- dist/run-state-Bg5KVIP6.js (5.4): sha256 3cdea3a69fe7be00ccf0a77279c51fbe9e977cfc13868063f09259f6305538dd
- dist/agent-runner.runtime-c5kFvGrx.js (5.3) → dist/agent-runner.runtime-BwDd4yvB.js (5.4)

File hashes confirm the new code is loaded, so this is not a stale-binary issue.

Hypothesis: the 5.4 fix description ("clear the active reply-run guard before draining queued same-session follow-up turns") describes a narrower code path than what we hit. Our relay sends discrete sequential chat.send requests via WebSocket — each is a fresh user-initiated turn, NOT a queued auto-follow-up of a prior turn. If the new finally-block drain only runs in the auto-follow-up path (e.g. when an agent decides to chain a turn after tool execution), the fast-arriving sequential chat.send from the relay/WS path still races the active-run guard.

Probe configuration:

- 10 sequential chat.send calls
- 1-second sleep between each
- Failed calls return in ~300ms (well below the 1s gap), so this is NOT a same-millisecond race: the guard is still occupied a full second after the prior call returned and dispatched.

Plugin manifest declares contracts.tools (50 tools) per the new 5.3+ schema. Plugin loads cleanly, no plugin-side warnings. plugins.entries.codex.enabled = false legacy entry was removed by sandbox doctor --fix; our config is clean. Three per-tenant agents in agents.list, each with tools.allow allowlist of all 50 tool names. Gateway runs in embedded mode under PM2, single process bound to 127.0.0.1:18789.

Logs, screenshots, and evidence

Gateway error log (clean log, freshly cycled by PM2 right after switch — all 16 lines are from this single 10-call probe):

  followup queue drain failed for agent:test:main: ReplyRunAlreadyActiveError: Reply run already active for agent:test:main
  followup queue drain failed for agent:test:main: ReplyRunAlreadyActiveError: Reply run already active for agent:test:main
  ... (14 more identical lines) ...

Suspected residual source — same files as #77485:
- dist/run-state-Bg5KVIP6.js: replyRunState.activeRunsByKey.has(sessionKey) → ReplyRunAlreadyActiveError
- dist/agent-runner.runtime-BwDd4yvB.js: catches ReplyRunAlreadyActiveError, returns canned "Previous run is still shutting down" fallback (which our relay wraps as "I had a brief hiccup processing that. Could you try again?")
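To make the two behaviors listed above concrete, here is a minimal model of a per-session guard and a runner that maps the guard error to the canned fallback. The identifiers `activeRunsByKey` and `ReplyRunAlreadyActiveError` and the fallback string come from the report; `beginReplyRun`, `endReplyRun`, and `runTurn` are hypothetical names, and none of this is OpenClaw's actual source.

```typescript
// Hypothetical model of the guard and fallback; not OpenClaw's code.
class ReplyRunAlreadyActiveError extends Error {
  constructor(sessionKey: string) {
    super(`Reply run already active for ${sessionKey}`);
    this.name = "ReplyRunAlreadyActiveError";
  }
}

// One entry per session that currently has a reply run in flight.
const activeRunsByKey = new Map<string, { startedAt: number }>();

function beginReplyRun(sessionKey: string): void {
  if (activeRunsByKey.has(sessionKey)) {
    throw new ReplyRunAlreadyActiveError(sessionKey);
  }
  activeRunsByKey.set(sessionKey, { startedAt: Date.now() });
}

function endReplyRun(sessionKey: string): void {
  activeRunsByKey.delete(sessionKey);
}

const FALLBACK_REPLY = "Previous run is still shutting down";

function runTurn(sessionKey: string, produceReply: () => string): string {
  try {
    beginReplyRun(sessionKey);
  } catch (err) {
    // Guard occupied: answer immediately without producing a reply,
    // which would explain a fast fail well below provider round-trip time.
    if (err instanceof ReplyRunAlreadyActiveError) return FALLBACK_REPLY;
    throw err;
  }
  try {
    return produceReply();
  } finally {
    // Release on every exit path so the next sequential turn can start.
    endReplyRun(sessionKey);
  }
}
```

In this model, sequential turns succeed as long as every completion path releases the guard. A single path that leaves a stale entry behind is enough to make the next turn return the fallback.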

Comparison run on 2026.4.26 with identical config: same 10-call probe completes with all replies present, average warm latency 1.2-1.7s, zero ReplyRunAlreadyActiveError events. So 2026.4.26 is still the only known-good version for this path.

Happy to attach: full gateway log around the affected window, sanitized openclaw.json, plugin manifest, run-id sequence per WS frame.

Impact and severity

- Affected: any operator using the gateway WebSocket chat.send path with sequential per-session requests on 2026.5.4.
- Severity: High. Blocks the chat workflow (50% reply failure).
- Frequency: Always. The deterministic alternating pattern reproduced 4/4 times across separate gateway restarts on 2026.5.4.
- Consequence: 5.4 is still unusable for production WebSocket-based agents using discrete sequential chat.send (vs auto-follow-up turns). Forces continued rollback to 2026.4.26.

Additional information

Direct follow-up to #77485, which was closed by commit a9817a5. The 5.4 release-note line "clear the active reply-run guard before draining queued same-session follow-up turns, so sequential chat.send calls no longer trip ReplyRunAlreadyActiveError" implies the discrete chat.send path should be covered, but the live 10-call probe still reproduces the same symptom shape with the new binary.

Possible coverage gap: the new finally-block drain may only fire in the agent-runner's queued-follow-up path (when the agent itself queues another turn, e.g. after tool execution), not in the gateway's incoming-WS chat.send dispatcher path. The two paths share the activeRunsByKey guard but may have different completion-clearing ordering.
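The ordering concern in the paragraph above can be illustrated with a toy model. It compares a completion step that clears the guard before draining queued turns with one that drains first. `completeTurn` and the `Order` type are hypothetical, and whether the gateway's chat.send path actually uses the second ordering is unverified.

```typescript
// Toy model of completion ordering; not OpenClaw's implementation.
type Order = "clear-then-drain" | "drain-then-clear";

function completeTurn(order: Order, sessionKey: string, queued: string[]): string[] {
  // The turn that is finishing still holds the guard at this point.
  const active = new Set<string>([sessionKey]);
  const log: string[] = [];

  const drain = (): void => {
    for (const turn of queued) {
      if (active.has(sessionKey)) {
        // Same shape as the gateway log line quoted in the report.
        log.push(
          `followup queue drain failed for ${sessionKey}: ` +
            `ReplyRunAlreadyActiveError: Reply run already active for ${sessionKey}`,
        );
        continue;
      }
      active.add(sessionKey);
      log.push(`ran ${turn}`);
      active.delete(sessionKey);
    }
  };

  if (order === "clear-then-drain") {
    active.delete(sessionKey); // release first, as the 5.4 release note describes
    drain();
  } else {
    drain(); // queued turns hit the guard that is still held
    active.delete(sessionKey);
  }
  return log;
}
```

With two queued turns, the first ordering runs both, while the second logs two drain failures. If one code path was fixed and another was not, only the unfixed path would keep producing the log line.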

Last known good: 2026.4.26 (commit be8c246). First known bad: 2026.5.3. Tested as bad: 2026.5.4 (this report). Did not test 2026.4.27 / 2026.4.29 — installed side-by-side but never switched.

Pre-switch sandbox checks all passed (catalog smoke, doctor read-only, doctor --fix dry-run with diff). Manifest patched with contracts.tools per 5.3+ schema. Postflight relay smoke and plugin provenance gates all pass — only the sustained-latency probe gate (10 sequential chat.send calls) catches the issue.

Repro is reproducible on demand. Happy to capture additional traces or test a fix candidate from a branch — same deterministic 30-second test loop.

extent analysis

TL;DR

The reporter's hypothesis is that the 5.4 fix covers only the agent-runner's queued-follow-up path. If that is correct, the gateway's incoming-WS chat.send dispatcher path would also need to clear the active reply-run guard. Rolling back to 2026.4.26 is the only confirmed workaround.

Guidance

- Review the dist/run-state-Bg5KVIP6.js and dist/agent-runner.runtime-BwDd4yvB.js files to understand how the active reply-run guard is handled in the gateway's incoming-WS chat.send dispatcher path.
- Investigate adding a finally-block drain to the gateway's incoming-WS chat.send dispatcher path to clear the active reply-run guard, similar to the fix applied to the agent-runner's queued-follow-up path.
- Test the modified code with the 10-call probe to verify that ReplyRunAlreadyActiveError is no longer triggered.
- Consider creating a new issue or pull request to address the coverage gap in the 5.4 release.

Example

No code snippet is provided as the issue requires a deeper understanding of the OpenClaw codebase and the specific changes made in the 5.4 release.

Notes

The regression was first observed in 2026.5.3 (2026.4.26 is the last known good release) and persists in 5.4. The fix applied to the agent-runner's queued-follow-up path may not cover the gateway's incoming-WS chat.send dispatcher path. This coverage gap is the reporter's hypothesis, not a confirmed root cause, so further investigation and testing are required.

Recommendation

Until a fix is confirmed, stay on 2026.4.26, the only known-good version for this path. A likely fix direction is for the gateway's incoming-WS chat.send dispatcher path to clear the active reply-run guard on completion, but this is unverified and would require changes to the OpenClaw codebase to close the coverage gap in the 5.4 release.
