[Bug]: ReplyRunAlreadyActiveError still reproduces on 2026.5.7 with fast-model RTT (likely race in async cleanup vs sequential chat.send; follow-up to #77485 and #77960)


Maintainer @tmimmanuel could not reproduce the prior bug (#77960) against main 1c1136902b using Cerebras Qwen 235B (10/10 pass). I retried with 2026.5.7 stable using Gemini 2.5 Flash and the same alternating-fail pattern still reproduces (5/10 pass, 18 fresh ReplyRunAlreadyActiveError). Failed calls return in 272-295ms — faster than the prior call's LLM RTT (1340-2598ms warm) — strongly suggesting an async cleanup vs. incoming-request race that is masked when the model is slow.

Error Message

Gateway error log (fresh post-switch, 18 occurrences across one 10-call probe — full identical-line pattern, only sessionKey changes between maintainer's repro and ours):

followup queue drain failed for agent:tenant003:main: ReplyRunAlreadyActiveError: Reply run already active for agent:tenant003:main
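The log line above has a regular shape, so occurrences can be counted per sessionKey when comparing probes. The following is a minimal sketch; the regular expression is an assumption inferred from the single sample line quoted here, not from OpenClaw's logging code, and `count_drain_failures` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern inferred from the one log line quoted in this report (assumption).
# The back-reference requires the same sessionKey to appear in both places.
DRAIN_FAILURE = re.compile(
    r"followup queue drain failed for (?P<key>\S+): "
    r"ReplyRunAlreadyActiveError: Reply run already active for (?P=key)"
)


def count_drain_failures(lines):
    """Return a Counter mapping sessionKey -> number of matching log lines."""
    counts = Counter()
    for line in lines:
        match = DRAIN_FAILURE.search(line)
        if match:
            counts[match.group("key")] += 1
    return counts
```

Feeding it the gateway error log from one probe run gives a per-session count that can be compared against the number of fast-fail calls.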

Same error class as #77485 and #77960. Same alternating timing signature. Same canned-reply user-facing symptom ("Previous run is still shutting down" wrapped by our relay as "I had a brief hiccup processing that").

Suspected source — same files as prior issues (now renamed by the rebuild):

  • dist/run-state-COZ3YHhO.js (5.7): replyRunState.activeRunsByKey.has(sessionKey) check → throw ReplyRunAlreadyActiveError
  • dist/agent-runner.runtime-DQsCsHUA.js (5.7): catches ReplyRunAlreadyActiveError, returns canned "Previous run is still shutting down" fallback
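To make the suspected interaction concrete, here is a toy model of the two behaviours described in the bullets above: a registry check that throws, and a caller that converts the throw into the canned reply. It is written in Python purely as an illustration; `ReplyRunState`, `begin`, `clear` and `run_reply` are hypothetical names, and none of this is taken from the bundled files.

```python
class ReplyRunAlreadyActiveError(Exception):
    """Stand-in for the error class named in the report."""


class ReplyRunState:
    """Toy registry keyed by sessionKey (hypothetical, not OpenClaw's code)."""

    def __init__(self):
        self.active_runs_by_key = {}

    def begin(self, session_key, run_id):
        # Guard: refuse a second run while the first is still registered.
        if session_key in self.active_runs_by_key:
            raise ReplyRunAlreadyActiveError(
                f"Reply run already active for {session_key}"
            )
        self.active_runs_by_key[session_key] = run_id

    def clear(self, session_key):
        self.active_runs_by_key.pop(session_key, None)


CANNED_FALLBACK = "Previous run is still shutting down"


def run_reply(state, session_key, run_id, generate):
    """Return the model reply, or the canned fallback if the guard is held."""
    try:
        state.begin(session_key, run_id)
    except ReplyRunAlreadyActiveError:
        return CANNED_FALLBACK  # no LLM dispatch happens on this path
    # Cleanup is deliberately not done here: the report assumes it finishes
    # asynchronously, some time after the reply has been acknowledged.
    return generate()
```

In this model a second call for the same sessionKey gets the canned text until something calls `clear`, which is the behaviour the fast-fail timings suggest.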

Comparison: identical config on 2026.4.26 produces all 10 real replies with zero ReplyRunAlreadyActiveError events. So this is exclusively a 5.3+ regression that the 5.4 (#77485) and 5.7 (#77960) fix attempts have NOT closed for our model-RTT range.

Happy to attach: full gateway log with timestamped frame boundaries, agent trajectory JSON, run-id sequence per WS frame, openclaw.json (sanitized).


Fix Action

Fix / Workaround

The 272-295ms fast-fail timing is below provider RTT — the gateway throws ReplyRunAlreadyActiveError before any LLM dispatch. The PRIOR call took 1340-2598ms (normal warm Gemini Flash), so the cleanup window the new code expects (drain runs in finally after replyOperation.complete) is shorter than the gap between the prior call's WS-ack return and the next incoming chat.send.

This is a follow-up to:

  • #77485 (closed) — original report on 5.3. Fixed by commit a9817a5 in 5.4 but only covered queued same-session follow-up turns.
  • #77960 (closed by maintainer's "can't reproduce" cross-check) — follow-up on 5.4. Maintainer @tmimmanuel tested main 1c1136902b with cerebras/qwen-3-235b and got 10/10 pass. We retried 5.7 with the centralized 0909df1a4f drain-lifecycle refactor included and still see the alternating-fail pattern.


Bug type

Regression (worked before, now fails)

Beta release blocker

No

Summary

Maintainer @tmimmanuel could not reproduce the prior bug (#77960) against main 1c1136902b using Cerebras Qwen 235B (10/10 pass). I retried with 2026.5.7 stable using Gemini 2.5 Flash and the same alternating-fail pattern still reproduces (5/10 pass, 18 fresh ReplyRunAlreadyActiveError). Failed calls return in 272-295ms — faster than the prior call's LLM RTT (1340-2598ms warm) — strongly suggesting an async cleanup vs. incoming-request race that is masked when the model is slow.

Steps to reproduce

  1. Install OpenClaw 2026.5.7 (commit eeef486). Confirm bundled files differ from 5.4: dist/run-state-COZ3YHhO.js (sha256 e0a9c447f55c…) vs 5.4's dist/run-state-Bg5KVIP6.js (sha256 3cdea3a69fe7…) — so the 0909df1a4f drain-lifecycle refactor IS in the running binary.

  2. Configure a fast-RTT provider. Specifically gemini/gemini-2.5-flash via openai-completions (warm replies typically 1.2-1.7s).

  3. Send 10 sequential chat.send requests through the gateway WebSocket path, each waiting for the prior to return, with a 1-second sleep between:

    for i in 1 2 3 4 5 6 7 8 9 10; do
      curl -s -X POST http://127.0.0.1:18789/chat.send \
        -H "Content-Type: application/json" \
        -d "{\"sessionKey\":\"agent:test:main\",\"message\":\"Reply containing the literal text: ok-$i-$(date +%s)\"}"
      sleep 1
    done

  4. Observe alternating pattern: odd calls return 272-295ms with empty/canned reply, even calls return 1340-2598ms with real reply.

  5. Gateway error log shows ReplyRunAlreadyActiveError firing every other call (18 occurrences across 10 calls in our run).
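The steps above can also be driven from a small timing harness that records per-call latency, which makes the alternating pattern easier to capture than reading curl output. This is a sketch under assumptions: it posts the same JSON body as the loop in step 3 to the same URL, the response is treated as opaque text, and `run_probe` and `http_send` are hypothetical helper names.

```python
import json
import time
import urllib.request

GATEWAY_URL = "http://127.0.0.1:18789/chat.send"  # endpoint used in step 3


def http_send(session_key, message, url=GATEWAY_URL, timeout=30):
    """POST one chat.send request and return the raw response body."""
    payload = json.dumps({"sessionKey": session_key, "message": message}).encode()
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def run_probe(send=http_send, calls=10, pause_s=1.0, session_key="agent:test:main"):
    """Send `calls` sequential requests; return [(index, latency_ms, body), ...]."""
    results = []
    for i in range(1, calls + 1):
        message = f"Reply containing the literal text: ok-{i}-{int(time.time())}"
        started = time.perf_counter()
        body = send(session_key, message)  # blocks until the call returns
        latency_ms = (time.perf_counter() - started) * 1000
        results.append((i, round(latency_ms), body))
        if i < calls:
            time.sleep(pause_s)  # same 1-second gap as the shell loop
    return results
```

Passing a different `send` callable allows the same loop to be pointed at a relay or a test double without touching the timing logic.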

Expected behavior

Per #77485 and #77960 fix expectations: all 10 sequential chat.send calls should reach the LLM and return real replies, regardless of provider RTT. The centralized reply-followup-drain-lifecycle refactor in commit 0909df1a4f should clear the active-run guard before the next request can race it.

Actual behavior

Alternating fast-fail / pass pattern on 2026.5.7:

  call 1: 295ms FAIL (empty/canned reply)
  call 2: 1340ms pass (real reply)
  call 3: 279ms FAIL
  call 4: 1384ms pass
  call 5: 273ms FAIL
  call 6: 2598ms pass
  call 7: 274ms FAIL
  call 8: 1511ms pass
  call 9: 272ms FAIL
  call 10: 2047ms pass
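The timings above can be reduced to a pass/fail pattern string of the kind used elsewhere in this report. A minimal sketch follows; the 1000 ms cutoff is an assumption chosen only because it sits between the observed fast-fail band (272-295ms) and the observed warm replies (1340-2598ms).

```python
# Latencies (ms) copied from the 10-call probe above, in call order.
LATENCIES_MS = [295, 1340, 279, 1384, 273, 2598, 274, 1511, 272, 2047]

# Assumed cutoff: a call faster than this cannot have completed a warm
# provider round trip at the latencies reported here.
FAST_FAIL_CUTOFF_MS = 1000


def classify(latencies_ms, cutoff_ms=FAST_FAIL_CUTOFF_MS):
    """Return a pattern string: 'f' for fast-fail, 'p' for a full round trip."""
    return "".join("f" if ms < cutoff_ms else "p" for ms in latencies_ms)


pattern = classify(LATENCIES_MS)
print(pattern)             # fpfpfpfpfp
print(pattern.count("p"))  # 5
```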

Gateway error log:

  followup queue drain failed for agent:test:main: ReplyRunAlreadyActiveError: Reply run already active for agent:test:main
  (... 17 more identical lines across the 10-call probe ...)

The 272-295ms fast-fail timing is below provider RTT — the gateway throws ReplyRunAlreadyActiveError before any LLM dispatch. The PRIOR call took 1340-2598ms (normal warm Gemini Flash), so the cleanup window the new code expects (drain runs in finally after replyOperation.complete) is shorter than the gap between the prior call's WS-ack return and the next incoming chat.send.

OpenClaw version

2026.5.7 (commit eeef486) — verified by bundled file hashes that this build includes commit 0909df1a4f (centralize reply followup drain lifecycle). Last known good for this path: 2026.4.26 (commit be8c246).

Operating system

Ubuntu 24.04 noble (Linux x64), kernel 6.8

Install method

Custom side-by-side installer: npm install --production --legacy-peer-deps --ignore-scripts; node_modules/openclaw -> ../ self-symlink applied post-install. Same install pattern used in #77485 and #77960 — not a stale-binary or install-corruption issue (sandbox doctor + catalog smoke pass cleanly).

Model

gemini/gemini-2.5-flash (warm RTT 1.2-1.7s; this is the key variable vs the maintainer's Cerebras Qwen 235B test which got 10/10 pass)

Provider / routing chain

custom relay -> gateway WebSocket chat.send -> openclaw -> openai-completions -> local Gemini proxy (127.0.0.1:19990) -> Google Gemini API

Additional provider/model setup details

Bundled-binary verification proves the 0909df1a4f refactor IS in the running code:

  • dist/run-state-Bg5KVIP6.js (5.4, sha256 3cdea3a69fe7be00ccf0a77279c51fbe9e977cfc13868063f09259f6305538dd)
  • dist/run-state-COZ3YHhO.js (5.7, sha256 e0a9c447f55c73994f55ba8eef7ba5ce86870add45354edb2588191ece25958a)
  • dist/agent-runner.runtime-BwDd4yvB.js (5.4) → dist/agent-runner.runtime-DQsCsHUA.js (5.7)

File hashes differ, so this is the latest fixed binary, not a stale build.
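Anyone cross-checking their own install against the digests listed above could do so with a few lines of Python. This is a sketch: the two digests are the ones quoted in this report, `identify_bundle` is a hypothetical helper, and the install root layout (a `dist/` directory directly under it) is an assumption.

```python
import hashlib
from pathlib import Path

# Digests quoted in this report for the two run-state bundles.
EXPECTED_SHA256 = {
    "dist/run-state-Bg5KVIP6.js":  # 5.4
        "3cdea3a69fe7be00ccf0a77279c51fbe9e977cfc13868063f09259f6305538dd",
    "dist/run-state-COZ3YHhO.js":  # 5.7
        "e0a9c447f55c73994f55ba8eef7ba5ce86870add45354edb2588191ece25958a",
}


def sha256_of(path):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def identify_bundle(install_root):
    """Return the listed bundle paths whose on-disk digest matches the report."""
    root = Path(install_root)
    return [
        rel
        for rel, expected in EXPECTED_SHA256.items()
        if (root / rel).is_file() and sha256_of(root / rel) == expected
    ]
```

A result containing only the `COZ3YHhO` path would indicate the 5.7 bundle described here is the one on disk.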

Plugin manifest declares contracts.tools (50 tools) per the 5.3+ schema. Plugin loads cleanly, no plugin-side warnings. Three per-tenant agents in agents.list, each with tools.allow allowlist of all 50 plugin tool names. Tested via tenant003. Gateway runs in embedded mode under PM2, single process bound to 127.0.0.1:18789. plugins.entries.codex.enabled = false was removed by sandbox doctor --fix; config is clean.

HYPOTHESIS — model-RTT-dependent race window:

@tmimmanuel's cross-check on main 1c1136902b used cerebras/qwen-3-235b-a22b-instruct-2507 and got pattern: pppppppppp (10/10 pass). Our setup uses gemini/gemini-2.5-flash and gets 5/10 pass with the alternating pattern.

The race appears to be:

  T+0ms: prior chat.send completes → WS frame returns to client
  T+~Xms: reply-run cleanup completes (drain finally block in 0909df1a4f); X is non-zero, possibly microtask-deferred or async I/O
  T+Yms: next sequential chat.send arrives at gateway, checks activeRunsByKey

  If Y < X (cleanup didn't finish in time): guard still held → ReplyRunAlreadyActiveError
  If Y > X (cleanup done): clean path

With slow models (Cerebras Qwen 235B, ~2-4s RTT), the GAP between prior call's WS-ack and next call's arrival is large enough that X < Y reliably → cleanup wins → pass.

With fast models (Gemini Flash, ~1-2s warm), the gap is smaller and X > Y on every other call → throw.

This explains the maintainer's "couldn't reproduce" + our reliable repro.
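The hypothesis can be expressed as a small discrete-time model, which shows how a cleanup delay slightly longer than the inter-call gap yields a strictly alternating pattern. Everything numeric here is an assumption for illustration (the 1100 ms cleanup delay in particular is invented; only the 1-second gap and the rough latencies come from this report), and `simulate_probe` is a hypothetical name. The `guard_held_at_start` flag models a guard left over from earlier traffic, which is one way the very first call could fail.

```python
def simulate_probe(
    n_calls,
    cleanup_delay_ms,       # X: time from ack until the guard is released
    gap_ms,                 # Y: client-side pause before the next chat.send
    llm_rtt_ms=1500,        # assumed duration of a successful call
    fail_ms=280,            # assumed duration of a fast-fail call
    guard_held_at_start=False,
):
    """Return a pattern string ('p' pass, 'f' fast-fail) for sequential calls."""
    now = 0
    # Time at which the guard left by the most recent successful run clears.
    guard_release_at = cleanup_delay_ms if guard_held_at_start else 0
    pattern = []
    for _ in range(n_calls):
        if now < guard_release_at:
            # Guard still held: rejected before any LLM dispatch.
            pattern.append("f")
            now += fail_ms
        else:
            # Guard free: the call runs, acks, then cleanup starts counting.
            pattern.append("p")
            now += llm_rtt_ms
            guard_release_at = now + cleanup_delay_ms
        now += gap_ms
    return "".join(pattern)


print(simulate_probe(10, cleanup_delay_ms=1100, gap_ms=1000, guard_held_at_start=True))
# fpfpfpfpfp
print(simulate_probe(10, cleanup_delay_ms=50, gap_ms=1000))
# pppppppppp
```

In this model a failed call never registers a new run, so the following call always finds the guard free. That is why the pattern alternates rather than failing continuously.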


Impact and severity

Affected: any operator using the gateway WebSocket chat.send path with a fast-RTT provider (Gemini Flash, Groq, Cerebras small models, etc.) and sequential per-session requests on 2026.5.3 through 2026.5.7. May also affect prereleases (5.9-beta, 5.10-beta); not tested.

Severity: High. Blocks chat workflow with 50% reply failure for every operator using a fast provider.

Frequency: Always. Deterministic alternating pattern reproduced 4/4 attempts across separate gateway restarts on 2026.5.4 and 2026.5.7.

Consequence: Production WebSocket-based agents on a fast LLM provider are unusable on any 5.x version. Forces continued rollback to 2026.4.26. Slow-provider operators won't notice, which is why the maintainer test passed and the bug shipped masked through 2 fix attempts.

Additional information

This is a follow-up to:

  • #77485 (closed) — original report on 5.3. Fixed by commit a9817a5 in 5.4 but only covered queued same-session follow-up turns.
  • #77960 (closed by maintainer's "can't reproduce" cross-check) — follow-up on 5.4. Maintainer @tmimmanuel tested main 1c1136902b with cerebras/qwen-3-235b and got 10/10 pass. We retried 5.7 with the centralized 0909df1a4f drain-lifecycle refactor included and still see the alternating-fail pattern.

The maintainer's cross-check was a great signal — it surfaced the model-RTT-dependent race window that the closed issues didn't have data for. This new report is the model-RTT-race scenario, not a re-litigation of either prior issue.

WHAT WOULD CONFIRM THE HYPOTHESIS quickly:

@tmimmanuel — could you reproduce your 10-call probe on main with a FAST-RTT provider? Specifically:

  1. Groq (any model, sub-300ms RTT typical)
  2. Cerebras with their SMALL models (cerebras/qwen-3-32b or similar, ~300-500ms)
  3. Or insert a synthetic ~200-400ms ack delay in your test fixture to simulate fast cleanup vs fast incoming-request timing

If a fast-RTT run reproduces pppppppppp, my hypothesis is wrong and we need to look elsewhere (probe harness, relay layer, session-key shape). If it reproduces the alternating fpfpfpfpfp pattern, the fix needs to be:

  • queue the incoming request when the guard is held (release on prior cleanup, then process), OR
  • synchronously clear the registry before the WS response acks the prior call (the documented {runId, status:"started"} non-blocking ack contract decouples request boundary from cleanup completion — that's where the race lives)
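The first option above (queue the incoming request while the guard is held) can be sketched with a per-session lock that is released only once cleanup has finished. This is an asyncio illustration of the idea, not a patch: `SessionQueue`, `submit` and `drain` are hypothetical names, and the 50 ms sleep merely stands in for the deferred drain.

```python
import asyncio
from collections import defaultdict


class SessionQueue:
    """Serialize turns per sessionKey instead of rejecting overlapping ones."""

    def __init__(self):
        self._locks = defaultdict(asyncio.Lock)
        self._pending = set()  # keeps background cleanup tasks alive

    async def submit(self, session_key, generate, cleanup):
        lock = self._locks[session_key]
        await lock.acquire()  # a later turn waits here, in FIFO order
        try:
            reply = await generate()
        except BaseException:
            lock.release()
            raise

        async def finish():
            try:
                await cleanup()
            finally:
                lock.release()  # guard clears only after cleanup is done

        task = asyncio.create_task(finish())
        self._pending.add(task)
        task.add_done_callback(self._pending.discard)
        return reply  # the reply is acked without waiting for cleanup

    async def drain(self):
        await asyncio.gather(*self._pending)


async def demo():
    queue = SessionQueue()
    events = []

    async def generate(tag):
        events.append(f"{tag}:reply")
        return f"ok-{tag}"

    async def cleanup(tag):
        await asyncio.sleep(0.05)  # stands in for the deferred drain
        events.append(f"{tag}:cleaned")

    key = "agent:test:main"
    first = await queue.submit(key, lambda: generate("1"), lambda: cleanup("1"))
    # Arrives while cleanup of turn 1 is still running; it waits, not throws.
    second = await queue.submit(key, lambda: generate("2"), lambda: cleanup("2"))
    await queue.drain()
    return [first, second], events


replies, events = asyncio.run(demo())
print(replies)  # ['ok-1', 'ok-2']
print(events)   # ['1:reply', '1:cleaned', '2:reply', '2:cleaned']
```

The trade-off is added latency on the second turn equal to the remaining cleanup time, in exchange for never returning the canned fallback on a sequential request.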

Last known good: 2026.4.26. First known bad: 2026.5.3. Tested as bad: 2026.5.4 (#77960) and 2026.5.7 (this report). Did not test betas (5.9-beta, 5.10-beta) — would test if maintainer wants signal on whether the bug shape changes in prerelease.

Reproduction is deterministic and reproducible on demand. Happy to capture additional traces, test a fix candidate from a branch, or run any specific instrumentation. The 30-second test loop has caught this on 4/4 attempts across 3 different versions.
