openclaw - ✅(Solved) Fix gateway/chat.send: attachment branch race — chatAbortControllers check/set separated by image+media I/O spawns duplicate agent runs [1 pull requests, 1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openclaw/openclaw#70139Fetched 2026-04-23 07:28:48
View on GitHub
Comments
0
Participants
1
Timeline
1
Reactions
0
Author
Participants
Timeline (top)
cross-referenced ×1

chat.send's attachment branch holds a race window between the pre-parse chatAbortControllers.get guard (~L1920) and the post-parse .set (~L1960). Two real I/O awaits — resolveGatewayModelSupportsImages and parseMessageWithAttachments — sit in between. A same-idempotencyKey retry that arrives during that window passes the initial guard and falls through to the .set, overwriting the first caller's AbortController entry. The real run keeps executing under the first controller while the map now points at the orphan second controller — a subsequent user-initiated chat.abort aborts the orphan and fails to stop the run.

The downstream claimInboundDedupe (src/auto-reply/reply/inbound-dedupe.ts:94-112, entered via src/auto-reply/reply/dispatch-from-config.ts:269-273) already blocks duplicate agent-pipeline invocation (the LLM is only called once for the winner). So the harms are gateway-layer only but still user-visible:

  1. 🔴 User abort misdirectionchat.abort hits the orphan controller while the real run keeps going (reliability regression).
  2. Duplicate {status: "started"} ack to the client.
  3. Duplicate persistChatSendImages / saveMediaBuffer writes (offloadedRefs duplicated).
  4. context.dedupe double-write on .then / .catch.

This also violates the documented same-idempotencyKey contract — docs/web/control-ui.md#L131-132 states: "Re-sending with the same idempotencyKey returns {status: "in_flight"} while running". That contract is held on the no-attachment branch (which has no await between get and set and is atomic under single-thread semantics) but broken on the attachment branch during parsing.

Root Cause

chat.send's attachment branch holds a race window between the pre-parse chatAbortControllers.get guard (~L1920) and the post-parse .set (~L1960). Two real I/O awaits — resolveGatewayModelSupportsImages and parseMessageWithAttachments — sit in between. A same-idempotencyKey retry that arrives during that window passes the initial guard and falls through to the .set, overwriting the first caller's AbortController entry. The real run keeps executing under the first controller while the map now points at the orphan second controller — a subsequent user-initiated chat.abort aborts the orphan and fails to stop the run.

The downstream claimInboundDedupe (src/auto-reply/reply/inbound-dedupe.ts:94-112, entered via src/auto-reply/reply/dispatch-from-config.ts:269-273) already blocks duplicate agent-pipeline invocation (the LLM is only called once for the winner). So the harms are gateway-layer only but still user-visible:

  1. 🔴 User abort misdirectionchat.abort hits the orphan controller while the real run keeps going (reliability regression).
  2. Duplicate {status: "started"} ack to the client.
  3. Duplicate persistChatSendImages / saveMediaBuffer writes (offloadedRefs duplicated).
  4. context.dedupe double-write on .then / .catch.

This also violates the documented same-idempotencyKey contract — docs/web/control-ui.md#L131-132 states: "Re-sending with the same idempotencyKey returns {status: "in_flight"} while running". That contract is held on the no-attachment branch (which has no await between get and set and is atomic under single-thread semantics) but broken on the attachment branch during parsing.

Fix Action

Fix / Workaround

The downstream claimInboundDedupe (src/auto-reply/reply/inbound-dedupe.ts:94-112, entered via src/auto-reply/reply/dispatch-from-config.ts:269-273) already blocks duplicate agent-pipeline invocation (the LLM is only called once for the winner). So the harms are gateway-layer only but still user-visible:

  • src/gateway/server-methods/chat.ts:1920 — pre-parse chatAbortControllers.get(clientRunId) guard.
  • src/gateway/server-methods/chat.ts:1930await resolveGatewayModelSupportsImages(...).
  • src/gateway/server-methods/chat.ts:1936await parseMessageWithAttachments(...).
  • src/gateway/server-methods/chat.ts:1960chatAbortControllers.set(clientRunId, ...) (race window ends here).
  • src/gateway/server-methods/agent-wait-dedupe.ts:206-220context.dedupe is terminal-only; it does not guard the in-flight window.
  • src/auto-reply/reply/inbound-dedupe.ts:94-112 + src/auto-reply/reply/dispatch-from-config.ts:269-273 — downstream guard that blocks duplicate pipeline invocation but runs after the gateway-layer overwrite.

PR fix notes

PR #70142: fix(gateway): re-check chatAbortControllers after attachment parse (protect user abort pathway)

Description (problem / solution / changelog)

Summary

  • The attachment branch of chat.send has two real I/O awaits (resolveGatewayModelSupportsImages, parseMessageWithAttachments) between the pre-parse chatAbortControllers.get guard and the post-parse .set. A same-idempotencyKey retry that arrives during that window passes the initial guard and overwrites the original caller's AbortController entry — subsequent user-initiated chat.abort hits the orphan controller and fails to stop the real run.
  • Add a post-parse re-check so the documented same-idempotencyKey contract (docs/web/control-ui.md#L131-L132, docs/concepts/architecture.md#L95-L96) holds on the attachment branch as it already does on the no-attachment branch (which has no await between get and set and is atomic under single-thread semantics).
  • Regression test included: two concurrent same-idempotencyKey attachment sends must produce exactly one started ack and one in_flight ack, and the map entry must belong to the winner.

Change Type

  • Bug fix
  • New feature
  • Refactor required for the fix
  • Chore / tooling only

Scope

  • src/gateway/server-methods/chat.ts (+20 lines, attachment branch only)
  • src/gateway/server-methods/chat.directive-tags.test.ts (+150 lines — a parseMessageWithAttachments wrapper mock and a new race describe block)

The new test lives in chat.directive-tags.test.ts to reuse the existing mock harness. Happy to split into chat.attachment-race.test.ts in a follow-up if preferred.

Linked Issue

Closes #70139

Root Cause

  1. The chat.send handler in chat.ts passes the pre-parse chatAbortControllers.get(clientRunId) guard.
  2. When normalizedAttachments.length > 0, it awaits resolveGatewayModelSupportsImages and then parseMessageWithAttachments — this is the race window.
  3. A second caller with the same idempotencyKey arriving in that window also passes the guard.
  4. Both callers reach chatAbortControllers.set (~L1960) and the second one overwrites the first caller's AbortController entry.
  5. The real run keeps executing under the first controller, but the map now points at the orphan second controller — a later chat.abort RPC aborts the orphan and the real run keeps running.

claimInboundDedupe (auto-reply/reply/inbound-dedupe.ts:94-112) already blocks the downstream agent-pipeline duplicate, so the LLM is only called once. The remaining harms are gateway-layer:

  • 🔴 User abort misdirection — user-visible reliability regression
  • Duplicate {status: "started"} ack sent to the client
  • Duplicate persistChatSendImages / saveMediaBuffer writes (offloadedRefs duplicated)
  • context.dedupe double-write on .then/.catch

Regression Test Plan

A new describe block chat.send attachment branch — chatAbortControllers atomicity in chat.directive-tags.test.ts:

  • parseMessageWithAttachments is wrapped via a hoisted attachmentParseHooks.pending deferred-promise hook (non-race tests are unaffected — when pending === null the wrapper delegates to the real function via vi.importActual).
  • Fire two concurrent same-idempotencyKey attachment sends with chatHandlers["chat.send"] against a shared context.
  • Wait until callCount >= 2 so both callers have entered the race window.
  • Release the deferred parse and await both settled.
  • Assertions: startedAcks.length === 1 and inFlightAcks.length === 1 and chatAbortControllers.has(idem) === true.

Baseline proof:

  • Before fix (revert chat.ts only, keep the test): both callers receive startedexpect(startedAcks).toHaveLength(1) fails.
  • After fix: passes.

Security Impact

  • No new permissions, secrets, or network calls.
  • No data-scope changes.
  • No-attachment branch is unchanged (still atomic).
  • When the re-check fires, the second caller receives the exact same payload / cached: true metadata as the pre-parse guard's {status: "in_flight"} response.

Repro + Verification

Environment

  • openclaw upstream/main at 45ca6566 (rebased)
  • Node 22.18.0, pnpm 10.33.0

Steps

  1. git checkout fix/chat-send-attachment-race-recheck
  2. pnpm install --frozen-lockfile
  3. pnpm exec vitest run --config test/vitest/vitest.gateway.config.ts src/gateway/server-methods/chat.directive-tags.test.ts

Expected

  • After fix: 54/54 pass.
  • Before fix (revert only chat.ts): 1 failed in the attachment race describe.

Actual

54 passed (54).

Evidence

Failing test before (chat.ts reverted, test kept)

× chat.send attachment branch — chatAbortControllers atomicity > returns in_flight for a same-idempotencyKey retry that arrives during attachment parsing
  AssertionError: expected startedAcks length 1 but got 2

Passing test after

 Test Files  1 passed (1)
      Tests  54 passed (54)
   Duration  945ms

Full gateway config

pnpm exec vitest run --config test/vitest/vitest.gateway.config.ts → 2577/2579 passed. The two pre-existing failures are in server.plugin-http-auth.test.ts (Mattermost slash callback routes) and reproduce on upstream/main without this PR — unrelated.

Build + Check

  • pnpm build
  • pnpm check

Human Verification

  1. Worktree-level before/after bisection confirmed (1 failed vs 54 passed).
  2. No-attachment branch structure unchanged (no await between get and set; still atomic).
  3. Rationale for Option B (post-parse re-check): stays XS, keeps the no-attachment sibling unchanged, touches neither server-maintenance.ts nor chat-abort.ts.

Review Conversations

  • Ready for Greptile / Codex / maintainer review.
  • Different axis from PR #69208 (duplicate transcript/replay umbrella) — this PR is a pre-spawn gateway-layer guard; the umbrella is post-persist.

Compatibility / Migration

  • No API contract changes — if anything, this PR makes the documented {status: "in_flight"} contract hold on the attachment branch too.
  • No wire-format changes.
  • No migration required.
  • No performance impact on the no-attachment chat.send path (that branch is not modified).

Risks and Mitigations

RiskMitigation
The re-check itself introduces a new race windowNo await between the re-check and .set — JS single-thread atomic, same reason the no-attachment sibling is safe today.
Duplicate parsing cost on the losing callerOnly on the attachment + same-idempotencyKey retry edge. Non-race callers pay a single Map.get() (tens of ns), unmeasurable overhead.
Test file name mismatchAcknowledged in the commit message and this PR body. Happy to split into chat.attachment-race.test.ts in a follow-up.

Related

  • Related: #69208 (duplicate transcript/replay umbrella — different axis)
  • Complementary to: #68801 (agentRunStarts TTL sweep — downstream safety belt)
  • Adjacent: #69747 (surface chat.send lifecycle errors — adjacent lines, semantically orthogonal; trivial rebase if either lands first)

[AI-assisted] — tested via the regression harness above; reviewed through the openclaw-audit pre-PR 5-agent cross-review protocol.

Changed files

  • src/gateway/server-methods/chat.directive-tags.test.ts (modified, +178/-1)
  • src/gateway/server-methods/chat.ts (modified, +44/-2)
RAW_BUFFERClick to expand / collapse

Summary

chat.send's attachment branch holds a race window between the pre-parse chatAbortControllers.get guard (~L1920) and the post-parse .set (~L1960). Two real I/O awaits — resolveGatewayModelSupportsImages and parseMessageWithAttachments — sit in between. A same-idempotencyKey retry that arrives during that window passes the initial guard and falls through to the .set, overwriting the first caller's AbortController entry. The real run keeps executing under the first controller while the map now points at the orphan second controller — a subsequent user-initiated chat.abort aborts the orphan and fails to stop the run.

The downstream claimInboundDedupe (src/auto-reply/reply/inbound-dedupe.ts:94-112, entered via src/auto-reply/reply/dispatch-from-config.ts:269-273) already blocks duplicate agent-pipeline invocation (the LLM is only called once for the winner). So the harms are gateway-layer only but still user-visible:

  1. 🔴 User abort misdirectionchat.abort hits the orphan controller while the real run keeps going (reliability regression).
  2. Duplicate {status: "started"} ack to the client.
  3. Duplicate persistChatSendImages / saveMediaBuffer writes (offloadedRefs duplicated).
  4. context.dedupe double-write on .then / .catch.

This also violates the documented same-idempotencyKey contract — docs/web/control-ui.md#L131-132 states: "Re-sending with the same idempotencyKey returns {status: "in_flight"} while running". That contract is held on the no-attachment branch (which has no await between get and set and is atomic under single-thread semantics) but broken on the attachment branch during parsing.

Evidence

  • src/gateway/server-methods/chat.ts:1920 — pre-parse chatAbortControllers.get(clientRunId) guard.
  • src/gateway/server-methods/chat.ts:1930await resolveGatewayModelSupportsImages(...).
  • src/gateway/server-methods/chat.ts:1936await parseMessageWithAttachments(...).
  • src/gateway/server-methods/chat.ts:1960chatAbortControllers.set(clientRunId, ...) (race window ends here).
  • src/gateway/server-methods/agent-wait-dedupe.ts:206-220context.dedupe is terminal-only; it does not guard the in-flight window.
  • src/auto-reply/reply/inbound-dedupe.ts:94-112 + src/auto-reply/reply/dispatch-from-config.ts:269-273 — downstream guard that blocks duplicate pipeline invocation but runs after the gateway-layer overwrite.

Reproduction

  1. Start a chat.send with at least one attachment and idempotencyKey = X.
  2. Block or slow the first caller inside parseMessageWithAttachments (e.g., a large image path or an intentionally slowed fs write).
  3. While the first call is awaiting parsing, fire a second chat.send with the same idempotencyKey = X.
  4. Observe: both callers receive {status: "started"}. chatAbortControllers.get(X) returns the second caller's controller. The first caller's run is executing but its abort controller is no longer reachable from the map.
  5. Issue chat.abort for the run — it succeeds on the orphan but the real run keeps running.

Trigger surface in-tree:

  • apps/macos/Sources/OpenClaw/GatewayConnection.swift:186-247 automatically retries the same chat.send up to 6 times on URLError with identical params (including idempotencyKey).
  • Control UI retry storms observed in issue #56485.
  • Webchat duplicate symptoms in issue #46005.

The no-attachment branch never exhibits this behavior — L1920 and L1960 are separated by no await there, so the handler is atomic from get to set under single-thread semantics.

Scope

  • Single file: src/gateway/server-methods/chat.ts, attachment branch only.
  • Suggested fix: a post-parse re-check that mirrors the pre-parse guard; return {status: "in_flight"} if another caller already claimed the entry during parsing.

Related

  • Umbrella: #69208 (duplicate transcript/replay umbrella) — different axis (post-persist).
  • Complementary: #68801 (agentRunStarts TTL sweep) — downstream safety belt; does not prevent the gateway-layer overwrite.
  • Adjacent: #69747 (surface chat.send lifecycle errors) — touches adjacent lines but semantically orthogonal.
<!-- openclaw-audit: CAND-023 -->

extent analysis

TL;DR

To fix the race condition in the chat.send attachment branch, add a post-parse re-check that mirrors the pre-parse guard, returning {status: "in_flight"} if another caller already claimed the entry during parsing.

Guidance

  • Identify the specific lines of code involved in the race condition, namely L1920 and L1960 in src/gateway/server-methods/chat.ts.
  • Implement a post-parse re-check after parseMessageWithAttachments to verify if another caller has claimed the idempotencyKey entry.
  • If the entry is already claimed, return {status: "in_flight"} to prevent overwriting the existing AbortController.
  • Consider reviewing the claimInboundDedupe function in src/auto-reply/reply/inbound-dedupe.ts to ensure it correctly blocks duplicate agent-pipeline invocations.

Example

// Pseudo-code example of the proposed fix
if (chatAbortControllers.get(clientRunId)) {
  // Pre-parse guard
  return { status: "in_flight" };
}
await resolveGatewayModelSupportsImages(...);
await parseMessageWithAttachments(...);
if (chatAbortControllers.get(clientRunId)) {
  // Post-parse re-check
  return { status: "in_flight" };
}
chatAbortControllers.set(clientRunId, ...);

Notes

The proposed fix only addresses the attachment branch of the chat.send method and does not affect the no-attachment branch. The claimInboundDedupe function already blocks duplicate agent-pipeline invocations, but the gateway-layer overwrite still needs to be prevented.

Recommendation

Apply the proposed workaround by adding a post-parse re-check to prevent the race condition and ensure correct behavior for chat.send with attachments. This fix is necessary to maintain the documented same-idempotencyKey contract and prevent user-visible issues.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

openclaw - ✅(Solved) Fix gateway/chat.send: attachment branch race — chatAbortControllers check/set separated by image+media I/O spawns duplicate agent runs [1 pull requests, 1 participants]