openclaw - ✅(Solved) Fix [Bug]: enqueueSystemEvent not deduplicated by runId/contextKey — agents cascade duplicate exec approval prompts under new IDs, locking ecosystem [1 pull requests]

Under load, enqueueSystemEvent does not deduplicate queued exec approval requests by runId or contextKey. When a heartbeat run times out and the gateway fails over, the replacement attempt re-queues the same exec call with a fresh approval ID. Each retry surfaces a new Telegram approval prompt for the identical command, cascading until the operator kills the gateway. Left alone, it saturates the approval channel fast enough to risk system-level memory pressure.

Reproduced repeatably on a multi-agent install. Filing now so it can be fixed before users with directPolicy: "allow" + high-frequency heartbeats discover it the hard way.

PR fix notes

PR #3: fix(approvals): accept allow-always source metadata

Description (problem / solution / changelog)

Summary

  • Problem: Telegram/native allow-always approvals already persist allowlist entries with source: "allow-always", but the gateway exec.approvals.set schema rejected that field as unexpected.
  • Why it matters: local approvals files and the gateway push path disagree on the allowlist contract, so gateway sync can fail even when the on-disk approvals file is valid for the runtime.
  • What changed: src/gateway/protocol/schema/exec-approvals.ts now accepts optional source: "allow-always" on allowlist entries, matching the existing persisted approvals type.
  • What did NOT change (scope boundary): no approval decision logic, persistence behavior, or Telegram/native reply handling changed in this PR; it only aligns the gateway protocol schema with existing stored data.
  • AI-assisted: Yes. I implemented the protocol fix and verified it locally with targeted tests and changed-scope checks.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #69482
  • Related #69478
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: the durable approvals store type and normalization path already preserved source: "allow-always", but the gateway protocol schema for exec.approvals.set and exec.approvals.node.set did not allow that field.
  • Missing detection / guardrail: there was no validator regression test covering a real allowlist payload that included source: "allow-always".
  • Contributing context (if known): the issue report shows this surfaced through Telegram "Allow always", but the mismatch is fundamentally between persisted approvals data and the gateway wire contract.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/gateway/protocol/index.test.ts
  • Scenario the test should lock in: validateExecApprovalsSetParams and validateExecApprovalsNodeSetParams accept allowlist entries that include source: "allow-always".
  • Why this is the smallest reliable guardrail: the bug is a protocol/schema rejection, so validator tests hit the exact contract boundary that failed without needing a live Telegram or gateway sync flow.
  • Existing test that already covers this (if any): none.
  • If no new test is added, why not: N/A.

User-visible / Behavior Changes

  • Gateway approval set payloads may now include persisted allow-always source metadata without being rejected as invalid.

Diagram (if applicable)

Before:
[allowlist entry with source="allow-always"] -> [exec.approvals.set validator] -> [reject unexpected property]

After:
[allowlist entry with source="allow-always"] -> [exec.approvals.set validator] -> [accept payload]
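
The before/after behavior in the diagram can be illustrated with a small self-contained validator. This is only a sketch: `AllowlistEntry`, `validateEntryStrict`, and the two allowed-key sets are hypothetical stand-ins, not the real ExecApprovalsAllowlistEntrySchema or the gateway validators. The only fields assumed are the `pattern` and `source` values quoted in this PR.

```typescript
// Hypothetical stand-in for an allowlist entry; only `pattern` and `source`
// are taken from the PR text.
type AllowlistEntry = {
  pattern: string;
  source?: "allow-always";
};

type ValidationResult = { ok: true } | { ok: false; error: string };

// Strict validation: any key outside `allowedKeys` is rejected, mirroring a
// schema that disallows unexpected properties.
function validateEntryStrict(
  entry: Record<string, unknown>,
  allowedKeys: ReadonlySet<string>,
): ValidationResult {
  for (const key of Object.keys(entry)) {
    if (!allowedKeys.has(key)) {
      return { ok: false, error: `unexpected property: ${key}` };
    }
  }
  if (typeof entry["pattern"] !== "string") {
    return { ok: false, error: "pattern must be a string" };
  }
  const source = entry["source"];
  if (source !== undefined && source !== "allow-always") {
    return { ok: false, error: "source must be 'allow-always' when present" };
  }
  return { ok: true };
}

// Assumed key sets: the schema before the fix did not know about `source`.
const keysBeforeFix: ReadonlySet<string> = new Set(["pattern"]);
const keysAfterFix: ReadonlySet<string> = new Set(["pattern", "source"]);

const persisted: AllowlistEntry = { pattern: "/usr/bin/curl", source: "allow-always" };

const before = validateEntryStrict(persisted, keysBeforeFix);
const after = validateEntryStrict(persisted, keysAfterFix);
console.log(before.ok, after.ok); // false true
```

Making `source` optional rather than required keeps older entries without the field valid, which matches the backward-compatibility claim in this PR.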

Security Impact (required)

  • New permissions/capabilities? (Yes/No) No
  • Secrets/tokens handling changed? (Yes/No) No
  • New/changed network calls? (Yes/No) No
  • Command/tool execution surface changed? (Yes/No) No
  • Data access scope changed? (Yes/No) No
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: Ubuntu Linux in Cursor Cloud
  • Runtime/container: Node 22 + pnpm workspace
  • Model/provider: N/A
  • Integration/channel (if any): Gateway exec approvals protocol validation
  • Relevant config (redacted): approvals payload with allowlist entry { pattern: "/usr/bin/curl", source: "allow-always" }

Steps

  1. Construct an approvals file payload containing an allowlist entry with source: "allow-always".
  2. Validate it through the gateway protocol path for exec.approvals.set or exec.approvals.node.set.
  3. Observe whether the validator accepts or rejects the payload.

Expected

  • The gateway protocol should accept allowlist entries that match the already-persisted approvals type, including source: "allow-always".

Actual

  • Before this fix, the schema rejected the payload because source was not part of ExecApprovalsAllowlistEntrySchema.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

  • Verified scenarios: ran pnpm test src/gateway/protocol/index.test.ts; ran pnpm check:changed.
  • Edge cases checked: both gateway set and node set validators accept allow-always source metadata; no unrelated schema fields were broadened.
  • What you did not verify: live Telegram approval flow or a real openclaw approvals set --gateway round-trip against a running gateway.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes/No) Yes
  • Config/env changes? (Yes/No) No
  • Migration needed? (Yes/No) No
  • If yes, exact upgrade steps:

Risks and Mitigations

  • Risk: this only fixes the schema mismatch slice of the issue; if another runtime path strips or mishandles source, that would need follow-up.
    • Mitigation: this PR is intentionally scoped to the concrete validator rejection, and the new tests lock the contract in place.

Changed files

  • src/gateway/protocol/index.test.ts (modified, +45/-1)
  • src/gateway/protocol/schema/exec-approvals.ts (modified, +1/-0)

Bug type

Regression (worked before, now fails)

Beta release blocker

No

Steps to reproduce

What the exec call is

Routine health-check probe issued from Maelcum's heartbeat:

ps aux | grep -E "contextstored|vllm|openclaw" | grep -v grep | awk '{print $11}' | sort -n | tail -5

The command is not on the current allowlist, so with ask: "on-miss" an approval prompt is expected on first encounter. The bug is that the prompt fires again and again, each time under a new approval ID, for the same run intent.

Not a duplicate of

I looked for upstream issues that might cover this and found three that are adjacent but distinct:

  • #66487: heartbeat prompt drops completion body (peek-not-consume on a different queue event path, not the approval queue)
  • #14191: heartbeat routes to wrong session queue (routing bug, not a dedup bug)
  • #36325: deliver:false hooks still inject via enqueueSystemEvent (delivery flag bypass, not retry dedup)

None of these address the approval-event retry path or the (runId, contextKey) dedup gap.

Workaround in place

  • All 11 agent heartbeats set to every: "999h" (circuit breaker)
  • No agent work resumes on a normal schedule until this is fixed or a dedup workaround exists at the exec-approvals layer

Related bug (filing separately)

Telegram /approve allow-always writes a source field into the approvals allowlist entry that openclaw approvals set --file then rejects as unexpected on push. Will cross-reference the issue once filed.

Expected behavior

Either:

  1. enqueueSystemEvent deduplicates queued exec approval events by (agentId, contextKey) or (runId, contextKey), coalescing retries into the already-pending prompt; or
  2. When a run fails over, any exec approval events it queued are cancelled before the replacement run is allowed to enqueue new ones.

Today, neither happens.
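
Option 2 above (cancel on failover) could look roughly like the sketch below. Every name in it (`PendingApproval`, `ApprovalQueue`, `cancelForRun`, the `exec:health-check` context key) is hypothetical and chosen for illustration; nothing here reflects the actual OpenClaw queue implementation.

```typescript
// Hypothetical shape of a queued exec approval.
interface PendingApproval {
  approvalId: string;
  runId: string;
  contextKey: string;
  command: string;
}

class ApprovalQueue {
  // Keyed by approvalId, as a queue with no dedup key would be.
  private pending = new Map<string, PendingApproval>();

  enqueue(approval: PendingApproval): void {
    this.pending.set(approval.approvalId, approval);
  }

  // Intended to run when a run fails over, before the replacement run is
  // allowed to enqueue. Returns the cancelled approvals so the caller can
  // retract their prompts.
  cancelForRun(runId: string): PendingApproval[] {
    const cancelled: PendingApproval[] = [];
    this.pending.forEach((approval, id) => {
      if (approval.runId === runId) {
        this.pending.delete(id);
        cancelled.push(approval);
      }
    });
    return cancelled;
  }

  size(): number {
    return this.pending.size;
  }
}

const queue = new ApprovalQueue();
queue.enqueue({ approvalId: "a1", runId: "run-1", contextKey: "exec:health-check", command: "ps aux" });

// run-1 times out: cancel its approvals, then let run-2 enqueue its own.
const cancelled = queue.cancelForRun("run-1");
queue.enqueue({ approvalId: "a2", runId: "run-2", contextKey: "exec:health-check", command: "ps aux" });

console.log(cancelled.length, queue.size()); // 1 1
```

Compared with coalescing, this approach still sends one new prompt per failover, so it bounds the number of pending approvals but not the number of messages delivered.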

bug-30-log-excerpt-clean.txt

<img width="581" height="889" alt="Image" src="https://github.com/user-attachments/assets/4a74dc6c-8c52-4e82-80dd-9b65032a46d7" />

Actual behavior

Observed behavior

A continuous stream of approval prompts arrives in Telegram seconds apart, each for the identical command but under a different ID:

  • befadc79-10bd-4e78-b1a4-9e2f546fd3c5
  • 871d7305-c1cc-412c-9393-d538e99e4ae1
  • etc.

Screenshot attached.

Gateway log (/tmp/openclaw/openclaw-2026-04-18.log) shows the cascade signature (excerpt attached):

  • stuck session: sessionId=maelcum sessionId=<uuid> sessionKey=agent:maelcum:telegram:direct:<user_id> — age ticking up by ~30s per line, crossing 462s before intervention
  • embedded_run_failover_decision failoverReason=timeout; cycling through the provider chain: vllm-fast -> vllm-brain -> openrouter/z-ai/glm-5
  • Heartbeat re-firing and regenerating the run under fresh runIds while the prior attempt is still pending approval

Each failover attempt re-enters enqueueSystemEvent carrying the same exec call, but the event queue has no compound key covering the (runId, contextKey) pair — so the prior queued approval does not cancel or collapse, and a new one is enqueued instead.
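
To make the missing compound key concrete, here is a minimal sketch of a coalescing queue. It keys on (agentId, contextKey) rather than runId because the log above suggests each failover attempt gets a fresh runId, which would defeat a runId-based key. The event shape, the class name, and the `exec:health-check` context key are all assumptions made for illustration.

```typescript
// Hypothetical event shape; field names follow the issue text where possible.
interface ExecApprovalEvent {
  approvalId: string;
  agentId: string;
  runId: string;
  contextKey: string;
}

// At most one pending approval per (agentId, contextKey).
class CoalescingApprovalQueue {
  private pending = new Map<string, ExecApprovalEvent>();

  private static keyOf(agentId: string, contextKey: string): string {
    // JSON encoding avoids collisions when either part contains a separator.
    return JSON.stringify([agentId, contextKey]);
  }

  // Returns the event whose prompt should be live: the already-pending one
  // for a retry, otherwise the newly queued one.
  enqueue(event: ExecApprovalEvent): { event: ExecApprovalEvent; isNew: boolean } {
    const key = CoalescingApprovalQueue.keyOf(event.agentId, event.contextKey);
    const existing = this.pending.get(key);
    if (existing !== undefined) {
      return { event: existing, isNew: false };
    }
    this.pending.set(key, event);
    return { event, isNew: true };
  }

  // Called once the operator approves or denies, so later runs can prompt again.
  resolve(agentId: string, contextKey: string): boolean {
    return this.pending.delete(CoalescingApprovalQueue.keyOf(agentId, contextKey));
  }
}

// Three failover attempts: fresh runId and approvalId each time, same intent.
const q = new CoalescingApprovalQueue();
const promptIds = ["run-1", "run-2", "run-3"]
  .map((runId, i) =>
    q.enqueue({
      approvalId: `approval-${i + 1}`,
      agentId: "maelcum",
      runId,
      contextKey: "exec:health-check",
    }),
  )
  .filter((result) => result.isNew)
  .map((result) => result.event.approvalId);

console.log(promptIds); // [ 'approval-1' ]
```

Only the first attempt produces a prompt; the two retries resolve to the pending event, so the approval ID the operator already saw stays valid.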

OpenClaw version

2026.4.14 (323493f)

Operating system

macOS 26.4.1

Install method

npm global, latest stable as of filing

Model

mlx-community/Qwen3.5-9B-OptiQ-4bit (local, via rapid-mlx 0.3.12)

Provider / routing chain

openclaw -> vllm-fast (localhost:8001, rapid-mlx 0.3.12) -> Qwen3.5-9B-OptiQ-4bit

Additional provider/model setup details

Environment

  • Host: macOS, Mac Mini M4 Pro, 48 GB unified memory
  • Gateway: launchd-supervised, loopback bind, port 18789
  • Heartbeat: every: "3h", directPolicy: "allow", target: "telegram", lightContext: true
  • Exec approval policy: defaults.security: "allowlist", ask: "on-miss", askFallback: "deny"; maelcum uses host defaults

Logs, screenshots, and evidence

## Attached evidence

1. Screenshot of two consecutive approval prompts with different IDs for the same command
2. `bug-30-log-excerpt.txt`: 60 lines of the cascade from the gateway log

Impact and severity

Impact

  • Saturates the approval channel — every cascade cycle produces a new Telegram prompt
  • Fast enough to outrun manual intervention; forcing a gateway restart (openclaw gateway restart) is the only reliable stop
  • On installs with many agents sharing a channel, one stuck agent can drown all approval prompts for every other agent
  • Forced me to set all 11 heartbeats to every: "999h" as a circuit breaker while the bug is unresolved — effectively disabling the ecosystem's scheduled work layer

Additional information

No response

extent analysis

TL;DR

The most likely fix involves modifying the enqueueSystemEvent function to deduplicate queued exec approval events by (runId, contextKey) or (agentId, contextKey).

Guidance

  • Review the enqueueSystemEvent function to understand how it handles event queuing and deduplication.
  • Consider implementing a compound key for the event queue that includes (runId, contextKey) to prevent duplicate approval events.
  • Investigate the on-miss allowlist policy and its interaction with the enqueueSystemEvent function to ensure that it does not contribute to the duplication issue.
  • Evaluate the feasibility of cancelling pending approval events when a run fails over, as an alternative solution.

Example

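// Sketch of a potential fix. The Map-based queue and the event fields
// (runId, contextKey, approvalId) are assumptions, not the real implementation.
const queue = new Map();

function enqueueSystemEvent(event) {
  // If failover assigns a fresh runId to each attempt, key on agentId instead.
  const key = `${event.runId}:${event.contextKey}`;
  const pending = queue.get(key);
  if (pending !== undefined) {
    // Coalesce the retry into the pending event. Its approvalId is kept so the
    // prompt already shown to the operator stays valid.
    return pending;
  }
  queue.set(key, event);
  return event;
}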

Notes

The provided information suggests that the issue is specific to the enqueueSystemEvent function and its handling of event deduplication. However, without access to the full codebase, it is difficult to provide a definitive solution. The suggested fix may require modifications to the event queue data structure and the enqueueSystemEvent function.

Recommendation

Modify enqueueSystemEvent to deduplicate queued exec approval events; this is the most direct way to address the issue. The change should prevent saturation of the approval channel and limit the bug's impact until a more comprehensive solution is implemented.
