openclaw - ✅(Solved) Fix [Bug]: Delivery-recovery retries indefinitely on permanent HTTP 400 errors (message too long, auth, not-found) — should classify and halt on non-transient failures [1 pull requests, 2 comments, 3 participants]

GitHub stats
openclaw/openclaw#74321 · Fetched 2026-04-30 06:25:32
Comments: 2 · Participants: 3 · Timeline: 3 (commented ×2, cross-referenced ×1) · Reactions: 2

The delivery-recovery system retries failed delivery queue entries indefinitely, regardless of whether the failure is transient or permanent. A Telegram delivery failed with 400: Bad Request: message is too long — a payload-size error that cannot resolve on retry. The entry accumulated 5 retries over ~15 hours, continued backing off, and eventually delivered the stale message fragments to the user during an unrelated gateway restart, causing a confusing phantom-output incident.

The system has no classification of error permanence. Every failure is treated as a transient retry candidate.

Error Message

Call to 'sendMessage' failed! (400: Bad Request: message is too long)

Root Cause

Delivery-recovery applies uniform retry-with-backoff to all HTTP errors. There is no distinction between:

  • Transient (should retry): 429 rate limit, 5xx server errors, network timeouts, ETIMEDOUT
  • Permanent (should not retry): 400 message too long, 401 unauthorized, 403 forbidden, 404 not found

Fix Action

Fixed

PR fix notes

PR #74656: fix(outbound): classify permanent delivery failures and halt retries (#74321)

Description (problem / solution / changelog)

Summary

Fixes #74321.

The delivery-queue recovery path (src/infra/outbound/delivery-queue-recovery.ts) only classified identity/membership errors (chat not found, bot blocked) as permanent — all other 400s, including message is too long, content_too_large, payload too large, and auth failures, were retried indefinitely with the same oversized or invalid payload.

Root Cause

classifyDeliveryError in delivery-queue-recovery.ts checked isPermanentDeliveryError only for membership/routing failures. Size and auth errors fell through to the transient retry path, resulting in an infinite backoff loop with no chance of success.

Fix

Extended isPermanentDeliveryError to return true for:

  • Payload-size errors: message is too long, content_too_large, text must be no longer than N, payload too large — retrying with the same content can never succeed
  • Auth failures: unauthorized: bot token, forbidden: not enough rights, forbidden: bot was banned
  • Channel routing failures: channel not found, thread not found (expanding the existing not-found category)

Entries matching these patterns now move immediately to failed/ without consuming retry budget.
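
The pattern extension described above can be sketched as a list of regular expressions checked against the provider's error text. This is a minimal illustration, not the code from the PR: the function names come from the PR notes, but the signatures, the regex list, and the use of `\d+` for the "N" in "text must be no longer than N" are assumptions.

```typescript
// Hypothetical sketch: the real isPermanentDeliveryError in
// delivery-queue-recovery.ts may be structured differently.
const PERMANENT_ERROR_PATTERNS: RegExp[] = [
  // Payload-size errors: retrying the same content cannot succeed.
  /message is too long/i,
  /content_too_large/i,
  /text must be no longer than \d+/i, // assumption: "N" is a number
  /payload too large/i,
  // Auth failures.
  /unauthorized: bot token/i,
  /forbidden: not enough rights/i,
  /forbidden: bot was banned/i,
  // Routing failures (the pre-existing category plus the new entries).
  /chat not found/i,
  /channel not found/i,
  /thread not found/i,
];

type DeliveryErrorClass = "permanent" | "transient";

// True when the error text matches any known permanent pattern.
function isPermanentDeliveryError(message: string): boolean {
  return PERMANENT_ERROR_PATTERNS.some((pattern) => pattern.test(message));
}

// Anything that is not recognised as permanent stays on the retry path.
function classifyDeliveryError(message: string): DeliveryErrorClass {
  return isPermanentDeliveryError(message) ? "permanent" : "transient";
}
```

With this shape, the `lastError` string from the reported queue entry classifies as permanent, while a bare `ETIMEDOUT` stays transient.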

Tests

src/infra/outbound/delivery-queue.recovery.test.ts — 3 new regression tests:

  • message is too long → permanent
  • content_too_large → permanent
  • unauthorized: bot token → permanent
Tests  14 passed (14)

Audit

  • Audit A (existing helper): isPermanentDeliveryError already exists; this PR extends its pattern set — no new helper needed
  • Audit B (shared callers): isPermanentDeliveryError — 1 caller (classifyDeliveryError). Pattern expansion is additive; existing permanent patterns unchanged
  • Audit C (rival): No rival PR for #74321 found

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/infra/outbound/delivery-queue-recovery.ts (modified, +13/-0)
  • src/infra/outbound/delivery-queue.recovery.test.ts (modified, +34/-0)


Environment

  • OpenClaw: 2026.4.25 (aa36ee6)
  • Channel: Telegram
  • Delivery entry: 8520a19c-4047-4e5c-ae57-561fd8acc78f (evidence below)


Evidence

Delivery queue entry (paraphrased from delivery-queue/8520a19c-*.json):

{
  "enqueuedAt": 1777420328680,
  "channel": "telegram",
  "retryCount": 5,
  "lastError": "Call to 'sendMessage' failed! (400: Bad Request: message is too long)",
  "lastAttemptAt": 1777463156624,
  "payloads": [
    { "text": "Timed out again — the files are too large..." },
    { "text": "Now I'll read the remaining four interviews in parallel." },
    { "text": "Now I have everything I need. Writing the audit report:" },
    { "text": "Audit complete. Here's the full report:\n\n# Hallucination Audit..." }
  ]
}

The 4th payload (full audit report) exceeded Telegram's message length limit. The first three shorter payloads had been retrying alongside it. On gateway restart, backoff flushed and all four delivered — three successfully, one still failing. The user received 3 stale strings from a completed task 15 hours later during active unrelated work.
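
The two timestamps in the entry give a quick check on the retry window. The arithmetic below assumes both fields are Unix epoch milliseconds, which the issue does not state explicitly but which the 13-digit values suggest.

```typescript
// Values copied from the delivery queue entry above.
const enqueuedAt = 1777420328680;
const lastAttemptAt = 1777463156624;

const MS_PER_HOUR = 60 * 60 * 1000;

// Time between enqueue and the last recorded attempt.
const elapsedMs = lastAttemptAt - enqueuedAt;            // 42827944
const elapsedHours = elapsedMs / MS_PER_HOUR;            // about 11.9
const elapsedLabel = `${elapsedHours.toFixed(1)} hours`; // "11.9 hours"
```

Under that assumption, the fifth attempt happened roughly 11.9 hours after enqueue, which is consistent with an entry that was still pending when the gateway restart flushed it at about the 15-hour mark.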

Root Cause

Delivery-recovery applies uniform retry-with-backoff to all HTTP errors. There is no distinction between:

  • Transient (should retry): 429 rate limit, 5xx server errors, network timeouts, ETIMEDOUT
  • Permanent (should not retry): 400 message too long, 401 unauthorized, 403 forbidden, 404 not found

Proposed Fix

Classify errors before queuing for retry:

| Error class | Examples | Behavior |
| --- | --- | --- |
| Transient | 429, 500, 502, 503, 504, ETIMEDOUT, ECONNRESET | Retry with backoff (current behavior) |
| Permanent | 400 message too long, 401, 403, 404 | Mark failed immediately, alert user, do not retry |
| Ambiguous 400 | 400 with body indicating rate limit or temporary state | Retry (provider-specific classification) |
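
The classification table could be expressed as a small status-based classifier. This is a sketch under stated assumptions: the `DeliveryFailure` shape, the hint patterns for "ambiguous" 400s, and the defaults for unknown statuses are all hypothetical and would need provider-specific tuning.

```typescript
type ErrorClass = "transient" | "permanent" | "ambiguous";

// Hypothetical shape for a failed delivery attempt.
interface DeliveryFailure {
  status?: number;      // HTTP status, if the provider responded
  code?: string;        // network error code such as "ETIMEDOUT"
  description?: string; // provider error text
}

const TRANSIENT_STATUSES = new Set<number>([429, 500, 502, 503, 504]);
const TRANSIENT_NETWORK_CODES = new Set<string>(["ETIMEDOUT", "ECONNRESET"]);
const PERMANENT_STATUSES = new Set<number>([401, 403, 404]);

// Assumption: 400 bodies containing these hints describe a temporary state.
const TEMPORARY_400_HINTS: RegExp[] = [/retry after/i, /too many requests/i, /temporar/i];

function classifyFailure(f: DeliveryFailure): ErrorClass {
  if (f.code !== undefined && TRANSIENT_NETWORK_CODES.has(f.code)) return "transient";
  // Assumption: no HTTP response at all means a network-level failure.
  if (f.status === undefined) return "transient";
  if (TRANSIENT_STATUSES.has(f.status)) return "transient";
  if (PERMANENT_STATUSES.has(f.status)) return "permanent";
  if (f.status === 400) {
    const text = f.description ?? "";
    // Assumption: a 400 without a temporary-state hint is treated as permanent.
    return TEMPORARY_400_HINTS.some((hint) => hint.test(text)) ? "ambiguous" : "permanent";
  }
  // Assumption: unknown statuses keep the current retry behavior.
  return "transient";
}

// Per the table, only permanent failures skip the retry path.
function shouldRetry(errorClass: ErrorClass): boolean {
  return errorClass !== "permanent";
}
```

Keeping "ambiguous" as its own class, rather than folding it into "transient", leaves room for a provider-specific rule to override it later.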

For permanent failures, the recommended behavior:

  1. Move entry to delivery-queue/failed/
  2. Log with full error and payload summary
  3. Send a short alert to the operator channel: "Delivery failed (permanent): [channel] [error]. Payload discarded."
  4. Do not retry.
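
The four steps above can be sketched as a single handler with its side effects injected. The `QueueEntry` and `RecoveryDeps` interfaces are hypothetical; the actual recovery module may expose different types and helpers.

```typescript
// Hypothetical entry shape, loosely following the queue JSON in the evidence.
interface QueueEntry {
  id: string;
  channel: string;
  lastError: string;
  payloads: { text: string }[];
}

// Side effects are injected so the flow can be exercised without a filesystem.
interface RecoveryDeps {
  moveToFailed(entry: QueueEntry): void; // e.g. move into delivery-queue/failed/
  log(line: string): void;
  alertOperator(text: string): void;
}

// Short description of the payloads, avoiding full message bodies in logs.
function summarizePayloads(entry: QueueEntry): string {
  const totalChars = entry.payloads.reduce((sum, p) => sum + p.text.length, 0);
  return `${entry.payloads.length} payload(s), ${totalChars} chars`;
}

function handlePermanentFailure(entry: QueueEntry, deps: RecoveryDeps): void {
  // Step 1: take the entry out of the live queue.
  deps.moveToFailed(entry);
  // Step 2: log the full error with a payload summary.
  deps.log(
    `permanent delivery failure id=${entry.id} error=${entry.lastError} ` +
      `payloads=${summarizePayloads(entry)}`,
  );
  // Step 3: short operator alert, using the wording proposed in the issue.
  deps.alertOperator(
    `Delivery failed (permanent): ${entry.channel} ${entry.lastError}. Payload discarded.`,
  );
  // Step 4: nothing is re-enqueued, so the entry is never retried.
}
```

Injecting the side effects keeps the ordering of the steps testable in isolation from the queue directory layout.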

Telegram-specific: message is too long should also trigger automatic message splitting (chunk at ~4000 chars) before failing, as a best-effort recovery. If splitting itself fails, then apply the permanent-failure path.
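
A best-effort splitter along these lines might look like the following sketch. The 4000-character chunk size comes from the issue text (Telegram's documented cap for message text is 4096 characters); the function name and the preference for breaking on newlines or spaces are assumptions, and the sketch ignores formatting entities entirely.

```typescript
// Chunk size taken from the issue's "~4000 chars" suggestion.
const CHUNK_LIMIT = 4000;

// Splits text into chunks of at most `limit` UTF-16 code units.
// Note: a hard cut may split a surrogate pair; this sketch does not guard against that.
function splitMessage(text: string, limit: number = CHUNK_LIMIT): string[] {
  if (limit <= 0) throw new RangeError("limit must be positive");
  const chunks: string[] = [];
  let rest = text;
  while (rest.length > limit) {
    // Prefer the last newline inside the window, then the last space.
    let cut = rest.lastIndexOf("\n", limit);
    if (cut <= 0) cut = rest.lastIndexOf(" ", limit);
    const hardCut = cut <= 0;
    if (hardCut) cut = limit; // no natural break point: cut at the limit
    chunks.push(rest.slice(0, cut));
    // Drop the separator we broke on; keep every character on a hard cut.
    rest = rest.slice(hardCut ? cut : cut + 1);
  }
  if (rest.length > 0) chunks.push(rest);
  return chunks;
}
```

If any chunk still fails with a size error after splitting, the entry would fall through to the permanent-failure path described above.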

Impact

  • Users receive stale, out-of-context messages hours or days after the original task
  • Gateway restarts flush backoff and deliver a burst of old messages
  • No visibility into failed entries without inspecting the filesystem directly
  • Operator cannot distinguish current output from replayed history

Related

  • #74239 — same class of silent failure accumulation; both bugs share the pattern of permanent errors being treated as transient
  • #74283 — delivery-recovery post-approval behavior

extent analysis

TL;DR

Classify errors before queuing for retry to distinguish between transient and permanent failures, and implement a retry policy based on this classification.

Guidance

  • Introduce error classification for HTTP errors to determine whether to retry or mark as failed immediately.
  • Implement a retry policy with backoff for transient errors (e.g., 429, 500, 502, 503, 504, ETIMEDOUT, ECONNRESET).
  • For permanent failures (e.g., 400 message too long, 401, 403, 404), move the entry to a failed queue, log the error, and send an alert to the operator channel without retrying.
  • Consider implementing automatic message splitting for Telegram messages that exceed the character limit as a best-effort recovery before marking as a permanent failure.
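
For the transient path, a capped exponential backoff with a finite retry budget might look like the sketch below. The policy values are placeholders chosen for illustration; the issue does not document the delays or budget that delivery-recovery actually uses.

```typescript
// Hypothetical policy; the real recovery settings are not given in the issue.
interface BackoffPolicy {
  baseDelayMs: number;
  maxDelayMs: number;
  maxRetries: number;
}

const DEFAULT_POLICY: BackoffPolicy = {
  baseDelayMs: 5000,
  maxDelayMs: 600000,
  maxRetries: 5,
};

// Returns the delay before the next attempt, or null once the budget is spent.
function nextRetryDelayMs(
  retryCount: number,
  policy: BackoffPolicy = DEFAULT_POLICY,
): number | null {
  if (retryCount >= policy.maxRetries) return null;
  // Double the delay on each retry, capped at maxDelayMs.
  const delay = policy.baseDelayMs * Math.pow(2, retryCount);
  return Math.min(delay, policy.maxDelayMs);
}
```

Returning `null` when the budget is exhausted gives the caller an explicit signal to move the entry to the failed queue instead of scheduling another attempt.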

Example

A classified error entry might look like this:
{
  "errorClass": "Permanent",
  "errorCode": 400,
  "errorMessage": "message is too long",
  "retryCount": 0,
  "payloads": [
    { "text": "Audit complete. Here's the full report:\n\n# Hallucination Audit..." }
  ]
}

Notes

  • The proposed fix requires updates to the delivery-recovery system to classify errors and apply the appropriate retry policy.
  • The introduction of error classification and a retry policy may require additional logging and monitoring to ensure correct behavior.

Recommendation

Apply the proposed fix by introducing error classification and a retry policy to handle transient and permanent failures differently, as this will prevent stale messages from being delivered and provide better visibility into failed entries.
