openclaw - 💡(How to fix) Fix [Bug]: delivery-queue: send-retry creates fresh UUIDs (not idempotent); recovery should fail-permanent on Telegram 400 'too long'; send-message should pre-split at 4096 chars [1 comments, 2 participants]

openclaw/openclaw#75131 (fetched 2026-05-01 05:37:50)

Three related defects in the Telegram delivery path that compound into a persistent queue jam and measurable gateway-wide event-loop saturation:

  1. Send-retry is not idempotent at the queue layer. Each retry creates a fresh queue UUID instead of incrementing retryCount on the existing item.
  2. delivery-recovery treats Telegram's 400 "message is too long" error as transient. Doomed items survive forever and re-fire on every gateway restart.
  3. The send-message tool does not pre-split at Telegram's 4096-char limit. Subagent responses regularly exceed it; every long reply triggers (1) and (2).

When all three fire together, the gateway's main-thread event loop saturates. Local repro: archiving 3 stuck items dropped eventLoopUtilization from 0.996 to 0.264 and process CPU from 101% to 39% within one liveness-warning interval (30s).

Evidence (2026-04-30 incident)

(a) Non-idempotent send-retry

A single logical agent message produced three queue items in about 4 seconds, each with a different UUID but with payloads that hash to the same SHA1:

09:44:10.677  enqueue a1509a7f-166b-499c-af6d-51554adb4f1a  →
              09:44:11.050  [telegram] message failed: 400: Bad Request: message is too long
09:44:12.327  enqueue 7230e3d1-7893-436f-940f-f0800215abdd  →
              09:44:12.447  [telegram] message failed: 400: Bad Request: message is too long
09:44:14.691  enqueue ee96678f-e79c-4189-abe8-f00a936567e2  →
              09:44:14.837  [telegram] message failed: 400: Bad Request: message is too long

All three queue files: accountId: default, session.key: agent:main:telegram:direct:<chatId>, payloads[0].text length 5606, identical SHA1 (cbff4861443adbbdfeaa835384040422825ceb67), retryCount: 2 each.

Verified the producing agent emitted only one logical send:

  • The text was produced once by a state-manager subagent and returned to the parent agent via the sessions_spawn toolResult envelope.
  • The parent's session jsonl (post-envelope) has zero message / telegram_send tool_use blocks — the only outbound toolCall was sessions_yield.
  • The triplication therefore happened entirely below the agent layer, inside the send/retry path.
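The duplicate check described above can be reproduced with a short script. This is a minimal sketch, assuming each queue item is stored as its own `.json` file with the `payloads[0].text` field shown in the issue; the file extension and the `find_duplicate_payloads` helper are hypothetical, not part of openclaw.

```python
import hashlib
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def find_duplicate_payloads(queue_dir: Path) -> Dict[str, List[str]]:
    """Group queue files by the SHA1 of payloads[0].text.

    Returns only the groups that contain more than one file, i.e. the
    candidates for "same logical send, different queue UUID".
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    # Assumption: one JSON file per queue item, directly inside queue_dir.
    for path in sorted(queue_dir.glob("*.json")):
        entry = json.loads(path.read_text(encoding="utf-8"))
        text = entry["payloads"][0]["text"]
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        groups[digest].append(path.name)
    return {digest: names for digest, names in groups.items() if len(names) > 1}
```

Running it against a copy of delivery-queue/ lists each SHA1 that appears in more than one queue file, which is how the three items in this incident would show up as a single group.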

(b) delivery-recovery retries 400 forever

On every gateway restart the recovery loop re-attempts these doomed items:

[delivery-recovery] Found 3 pending delivery entries — starting recovery
...
[delivery-recovery] Delivery recovery complete: 0 recovered, 3 failed,
                    0 skipped (max retries), 0 deferred (backoff)

Each restart triggers 3+ Telegram API calls that are guaranteed to fail. With 12+ guardian-triggered restarts in a 14-hour window, the load adds up. The items survive in delivery-queue/ indefinitely until manually moved to archived-discarded-*/.

(c) No pre-split

Telegram's hard limit is 4096 chars. The state-manager response in this incident was 5606 chars — a normal-length subagent review. Long replies are routine; the send-message tool / channel adapter has no split or truncate path.

Impact: measurable gateway saturation

After the three poisoned items had been retrying for hours:

2026-04-30T11:14:23.204  [diagnostic] liveness warning:
  reasons=event_loop_delay,event_loop_utilization,cpu interval=30s
  eventLoopDelayP99Ms=109 eventLoopDelayMaxMs=9303
  eventLoopUtilization=0.996 cpuCoreRatio=1.142 active=3 waiting=0 queued=0

That is a 9.3-second event-loop stall within a 30s window. P99 delay ranged from 99 to 256ms across multiple windows, with ELU pinned at 0.99+.

After archiving the 3 items locally (no other change):

2026-04-30T11:18:53.956  [diagnostic] liveness warning:
  reasons=event_loop_delay interval=30s
  eventLoopDelayP99Ms=45.9 eventLoopDelayMaxMs=2569
  eventLoopUtilization=0.264 cpuCoreRatio=0.281 active=1 waiting=0 queued=0

ELU 0.996 → 0.264, CPU core ratio 1.142 → 0.281, P99 109ms → 45.9ms — within one 30s interval. Process-level CPU dropped 101% → 39% over the same period.

Fix shape

  • (a) Idempotent enqueue. Generate the queue UUID once per logical send. On send failure, mutate retryCount and lastError on the existing file rather than enqueuing a new one. Atomic write (write-to-temp + rename) so the queue never sees a half-updated file.
  • (b) Permanent-fail classification. delivery-recovery should classify the following Telegram 400 and 403 errors as permanent and move the item to delivery-queue/failed/ rather than retrying:
    • 400 message is too long
    • 400 chat not found
    • 403 bot was blocked by the user
    • 403 Forbidden: bot can't send messages to bots
    • 403 user is deactivated
  • (c) Pre-split at the channel adapter. Either auto-split text payloads into ≤4096-char chunks (preserving Markdown fenced blocks and URLs) or auto-truncate with a "see full output at <artifact-path>" pointer when the agent has already written the full text to disk. Splitting is preferable — it preserves the user-visible content.
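Item (a) can be illustrated with a small file-backed queue. This is a minimal sketch, assuming one JSON file per queue item named after its UUID; the `enqueue` and `record_failure` helpers and the `id` field are hypothetical, while `retryCount`, `lastError` and `payloads` follow the field names in the issue.

```python
import json
import os
import tempfile
import uuid
from pathlib import Path


def _atomic_write_json(path: Path, data: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(data, fh)
        fh.flush()
        os.fsync(fh.fileno())
    # os.replace is atomic when source and target are on the same filesystem,
    # so readers see either the old file or the new one, never a partial write.
    os.replace(tmp_name, path)


def enqueue(queue_dir: Path, payload: dict) -> str:
    """Create the queue item once per logical send and return its UUID."""
    item_id = str(uuid.uuid4())
    entry = {"id": item_id, "payloads": [payload], "retryCount": 0, "lastError": None}
    _atomic_write_json(queue_dir / f"{item_id}.json", entry)
    return item_id


def record_failure(queue_dir: Path, item_id: str, error: str) -> int:
    """Update the existing item on failure instead of enqueuing a new one."""
    path = queue_dir / f"{item_id}.json"
    entry = json.loads(path.read_text(encoding="utf-8"))
    entry["retryCount"] += 1
    entry["lastError"] = error
    _atomic_write_json(path, entry)
    return entry["retryCount"]
```

The key design point is that the retry path only ever receives the `item_id` returned by `enqueue`, so it has no way to mint a second UUID for the same logical send.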

Each part stands alone but they compound; fixing only (b) without (a) still leaves orphan duplicates, and fixing only (a) without (c) still produces guaranteed-fail messages on long replies.
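The classification in item (b) could look roughly like the following. This is a sketch under assumptions: the `is_permanent_failure` function and the way the status code and description reach it are hypothetical, and the patterns are simply the error texts listed in the issue, matched as substrings of Telegram's error description.

```python
import re

# Error descriptions the issue proposes treating as permanent.
PERMANENT_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (
        r"message is too long",
        r"chat not found",
        r"bot was blocked by the user",
        r"bot can't send messages to bots",
        r"user is deactivated",
    )
]


def is_permanent_failure(status: int, description: str) -> bool:
    """Return True when retrying the same payload cannot succeed.

    Only 400 and 403 responses are considered; anything else (for example
    429 rate limiting or 5xx errors) stays retryable in this sketch.
    """
    if status not in (400, 403):
        return False
    return any(pattern.search(description) for pattern in PERMANENT_PATTERNS)
```

A recovery loop would call this before scheduling another attempt and move the item to delivery-queue/failed/ when it returns True. Unrecognised 400 responses stay retryable here, which is a conservative choice rather than something the issue specifies.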

extent analysis

TL;DR

Implement idempotent enqueue, permanent-fail classification, and pre-split at the channel adapter to resolve the Telegram delivery path issues.

Guidance

  • Implement idempotent enqueue by generating a queue UUID once per logical send and mutating retryCount and lastError on the existing file upon send failure.
  • Classify non-retryable Telegram errors (such as 400 message is too long) as permanent failures in delivery-recovery and move the item to delivery-queue/failed/ instead of retrying.
  • Pre-split text payloads into ≤4096-char chunks at the channel adapter, preserving Markdown fenced blocks and URLs, or auto-truncate with a pointer to the full output.

Example

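def split_text_payload(text, limit=4096):
    # Split text into chunks of at most `limit` chars, breaking on line boundaries.
    # Sketch only: it does not keep Markdown fenced blocks intact, and URLs
    # stay whole only while the line that contains them fits within the limit.
    chunks = []
    current_chunk = ""
    for line in text.splitlines():
        while len(line) > limit:  # hard-split a single over-long line
            if current_chunk:
                chunks.append(current_chunk)
                current_chunk = ""
            chunks.append(line[:limit])
            line = line[limit:]
        candidate = current_chunk + "\n" + line if current_chunk else line
        if len(candidate) > limit:
            chunks.append(current_chunk)
            current_chunk = line
        else:
            current_chunk = candidate
    if current_chunk:
        chunks.append(current_chunk)
    return chunks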
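
The line-based splitter above does not track Markdown fences, so a chunk boundary inside a code block would leave one chunk with an unclosed fence. A possible variant, sketched here under the assumption that every line is short enough to fit in a chunk alongside a fence opener and closer, closes the fence at the end of a chunk and reopens it at the start of the next. The function name `split_preserving_fences` is hypothetical.

```python
from typing import List

FENCE = "```"


def split_preserving_fences(text: str, limit: int = 4096) -> List[str]:
    """Split on line boundaries, closing and reopening ``` fences across chunks.

    Assumes no single line is close to `limit` in length; a production
    version would also hard-split over-long lines.
    """
    chunks: List[str] = []
    lines: List[str] = []  # lines of the chunk being built
    size = 0               # length of "\n".join(lines)
    open_fence = ""        # opener line (e.g. "```python") while inside a fence

    def flush() -> None:
        nonlocal lines, size
        body = lines + [FENCE] if open_fence else lines
        chunks.append("\n".join(body))
        # Reopen the fence at the top of the next chunk.
        lines = [open_fence] if open_fence else []
        size = len(open_fence)

    for line in text.splitlines():
        # Keep room for a closing fence while inside (or about to enter) a block.
        reserve = len(FENCE) + 1 if open_fence or line.startswith(FENCE) else 0
        added = len(line) + (1 if lines else 0)
        if lines and size + added + reserve > limit:
            flush()
            added = len(line) + (1 if lines else 0)
        lines.append(line)
        size += added
        if line.startswith(FENCE):
            open_fence = "" if open_fence else line
    if lines:
        chunks.append("\n".join(lines))
    return chunks
```

Each chunk produced this way contains balanced fences, at the cost of a few reserved characters per chunk. Splitting can leave an empty code block when a boundary falls just before a closing fence, which a fuller implementation would want to tidy up.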

Notes

The example above covers only the pre-split step. Idempotent enqueue and permanent-fail classification require changes in the queue and delivery-recovery code, and integrating any of the three into the existing system may need further modifications.

Recommendation

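Implement all three changes: idempotent enqueue, permanent-fail classification, and pre-split at the channel adapter. Together they address the Telegram delivery path defects and should prevent the gateway saturation seen in this incident. Until they land, manually moving poisoned items out of delivery-queue/ (as in the local repro) is a stopgap.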
