openclaw - ✅(Solved) Fix [Bug]: Delivery-recovery retries indefinitely on permanent HTTP 400 errors (message too long, auth, not-found) — should classify and halt on non-transient failures [1 pull requests, 2 comments, 3 participants]

axonrelaybot · 2026-04-29T12:02:35Z

[openclaw] The delivery-recovery system retries failed delivery queue entries indefinitely, regardless of whether the failure is transient or permanent. A Telegram delivery failed with `400: Bad Request: message is too long`, a payload-size error that cannot resolve on retry. The entry accumulated 5 retries over ~15 hours, continued backing off, and eventually delivered the stale message fragments to the user during an unrelated gateway restart, causing a confusing phantom-output incident. The system has no classification of error permanence. Every failure is treated as a transient retry candidate.


Issue: openclaw/openclaw#74321 (fetched 2026-04-30 06:25:32)



Error Message

The queue entry's lastError field recorded: Call to 'sendMessage' failed! (400: Bad Request: message is too long). This is a payload-size error that cannot resolve on retry, yet the entry accumulated 5 retries over ~15 hours and kept backing off. The stale message fragments were eventually delivered to the user during an unrelated gateway restart.

Root Cause

Delivery-recovery applies uniform retry-with-backoff to all HTTP errors. There is no distinction between:

Transient (should retry): 429 rate limit, 5xx server errors, network timeouts, ETIMEDOUT
Permanent (should not retry): 400 message too long, 401 unauthorized, 403 forbidden, 404 not found

Fix Action

Fixed

Fixed by PR: fix(outbound): classify permanent delivery failures and halt retries (#74321) (https://github.com/openclaw/openclaw/pull/74656)

PR fix notes

PR #74656: fix(outbound): classify permanent delivery failures and halt retries (#74321)

Repository: openclaw/openclaw
Author: hclsys
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/74656

Description (problem / solution / changelog)

Summary

Fixes #74321.

The delivery-queue recovery path (`src/infra/outbound/delivery-queue-recovery.ts`) classified only identity/membership errors (chat not found, bot blocked) as permanent. All other 400s, including `message is too long`, `content_too_large`, `payload too large`, and auth failures, were retried indefinitely with the same oversized or invalid payload.

Root Cause

`classifyDeliveryError` in `delivery-queue-recovery.ts` checked `isPermanentDeliveryError` only for membership/routing failures. Size and auth errors fell through to the transient retry path, resulting in an infinite backoff loop with no chance of success.

Fix

Extended `isPermanentDeliveryError` to return `true` for:

Payload-size errors: `message is too long`, `content_too_large`, `text must be no longer than N`, `payload too large` (retrying with the same content can never succeed)
Auth failures: `unauthorized: bot token`, `forbidden: not enough rights`, `forbidden: bot was banned`
Channel routing failures: `channel not found`, `thread not found` (expanding the existing not-found category)

Entries matching these patterns now move immediately to `failed/` without consuming retry budget.
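A pattern-based check of this kind could look roughly like the sketch below. It is not the code from `delivery-queue-recovery.ts`: the names `PERMANENT_ERROR_PATTERNS` and `matchesPermanentPattern` are hypothetical, and the use of case-insensitive regular expressions (including `\d+` standing in for `N`) is an assumption. Only the pattern strings come from the PR description.

```typescript
// Hypothetical sketch; the real isPermanentDeliveryError may be structured differently.
// Pattern strings are taken from the PR description above.
const PERMANENT_ERROR_PATTERNS: readonly RegExp[] = [
  // Payload-size errors: the same content can never succeed on retry
  /message is too long/i,
  /content_too_large/i,
  /text must be no longer than \d+/i, // assumption: "N" is a decimal number
  /payload too large/i,
  // Auth failures
  /unauthorized: bot token/i,
  /forbidden: not enough rights/i,
  /forbidden: bot was banned/i,
  // Routing failures (existing and newly added not-found cases)
  /chat not found/i,
  /channel not found/i,
  /thread not found/i,
];

// Returns true when the error text matches any known permanent pattern.
function matchesPermanentPattern(errorMessage: string): boolean {
  return PERMANENT_ERROR_PATTERNS.some((pattern) => pattern.test(errorMessage));
}

const lastError =
  "Call to 'sendMessage' failed! (400: Bad Request: message is too long)";
console.log(matchesPermanentPattern(lastError)); // true
console.log(matchesPermanentPattern("connect ETIMEDOUT")); // false
```

Substring matching keeps the check independent of how each provider wraps its error text, at the cost of needing a new pattern whenever a provider changes its wording.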

Tests

`src/infra/outbound/delivery-queue.recovery.test.ts` adds 3 new regression tests:

`message is too long` → permanent
`content_too_large` → permanent
`unauthorized: bot token` → permanent

Tests  14 passed (14)

Audit

Audit A (existing helper): `isPermanentDeliveryError` already exists; this PR extends its pattern set, so no new helper is needed
Audit B (shared callers): `isPermanentDeliveryError` has 1 caller (`classifyDeliveryError`). Pattern expansion is additive; existing permanent patterns unchanged
Audit C (rival): No rival PR for #74321 found

Changed files

CHANGELOG.md (modified, +1/-0)
src/infra/outbound/delivery-queue-recovery.ts (modified, +13/-0)
src/infra/outbound/delivery-queue.recovery.test.ts (modified, +34/-0)



Environment

OpenClaw: 2026.4.25 (aa36ee6)
Channel: Telegram
Delivery entry: 8520a19c-4047-4e5c-ae57-561fd8acc78f (evidence below)

Summary

The system has no classification of error permanence. Every failure is treated as a transient retry candidate.

Evidence

Delivery queue entry (paraphrased from delivery-queue/8520a19c-*.json):

{
  "enqueuedAt": 1777420328680,
  "channel": "telegram",
  "retryCount": 5,
  "lastError": "Call to 'sendMessage' failed! (400: Bad Request: message is too long)",
  "lastAttemptAt": 1777463156624,
  "payloads": [
    { "text": "Timed out again — the files are too large..." },
    { "text": "Now I'll read the remaining four interviews in parallel." },
    { "text": "Now I have everything I need. Writing the audit report:" },
    { "text": "Audit complete. Here's the full report:\n\n# Hallucination Audit..." }
  ]
}

The 4th payload (full audit report) exceeded Telegram's message length limit. The first three shorter payloads had been retrying alongside it. On gateway restart, backoff flushed and all four delivered — three successfully, one still failing. The user received 3 stale strings from a completed task 15 hours later during active unrelated work.


Proposed Fix

Classify errors before queuing for retry:

| Error class | Examples | Behavior |
| --- | --- | --- |
| Transient | 429, 500, 502, 503, 504, ETIMEDOUT, ECONNRESET | Retry with backoff (current behavior) |
| Permanent | 400 message too long, 401, 403, 404 | Mark failed immediately, alert user, do not retry |
| Ambiguous 400 | 400 with body indicating rate limit or temporary state | Retry (provider-specific classification) |
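The table can be read as a small decision function. The sketch below is one possible encoding, not the project's implementation: the `DeliveryFailure` shape, the `classifyFailure` name, the hint regex for "ambiguous" 400s, and the choice to default unknown cases to transient are all assumptions made for illustration.

```typescript
type ErrorClass = "transient" | "permanent";

// Hypothetical failure shape; real queue entries may carry different fields.
interface DeliveryFailure {
  status?: number; // HTTP status, if the provider responded
  code?: string; // network error code such as "ETIMEDOUT"
  body?: string; // provider error description
}

const TRANSIENT_CODES = new Set(["ETIMEDOUT", "ECONNRESET"]);
const PERMANENT_STATUSES = new Set([401, 403, 404]);

// Assumption: wording that marks a 400 as temporary rather than permanent.
const TEMPORARY_400_HINTS = /rate limit|too many requests|try again|temporar/i;

function classifyFailure(failure: DeliveryFailure): ErrorClass {
  // Network-level failures never reached the provider: retry.
  if (failure.code !== undefined && TRANSIENT_CODES.has(failure.code)) {
    return "transient";
  }
  const status = failure.status;
  if (status === undefined) return "transient";
  // Rate limits and server errors are expected to clear on their own.
  if (status === 429 || status >= 500) return "transient";
  // Ambiguous 400: retry only when the body hints at a temporary state.
  if (status === 400) {
    return TEMPORARY_400_HINTS.test(failure.body ?? "") ? "transient" : "permanent";
  }
  if (PERMANENT_STATUSES.has(status)) return "permanent";
  // Assumption: anything unrecognized keeps the current retry behavior.
  return "transient";
}

console.log(classifyFailure({ status: 400, body: "Bad Request: message is too long" })); // permanent
console.log(classifyFailure({ status: 503 })); // transient
```

Defaulting unknown cases to transient preserves existing behavior for anything the table does not cover; a stricter deployment might prefer the opposite default.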

For permanent failures, the recommended behavior:

1. Move entry to delivery-queue/failed/
2. Log with full error and payload summary
3. Send a short alert to the operator channel: "Delivery failed (permanent): [channel] [error]. Payload discarded."
4. Do not retry.
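The four steps above can be sketched as a single handler. Everything here is illustrative: `QueueEntry`, `FailureSinks`, and `handlePermanentFailure` are hypothetical names, and the side effects (moving the file, logging, alerting) are injected as callbacks so the sketch does not assume how the gateway performs them.

```typescript
// Hypothetical entry shape, loosely following the fields shown in the evidence JSON.
interface QueueEntry {
  id: string;
  channel: string;
  lastError: string;
  payloads: { text: string }[];
}

// Hypothetical side-effect hooks; the real system would supply its own.
interface FailureSinks {
  moveToFailed(entryId: string): void; // e.g. move the file under delivery-queue/failed/
  log(record: object): void;
  alertOperator(message: string): void;
}

function handlePermanentFailure(entry: QueueEntry, sinks: FailureSinks): void {
  // 1. Take the entry out of the retry queue.
  sinks.moveToFailed(entry.id);
  // 2. Log the full error plus a payload summary (count and sizes, not full text).
  sinks.log({
    id: entry.id,
    channel: entry.channel,
    error: entry.lastError,
    payloadCount: entry.payloads.length,
    payloadLengths: entry.payloads.map((p) => p.text.length),
  });
  // 3. Alert the operator using the message format proposed in the issue.
  sinks.alertOperator(
    `Delivery failed (permanent): ${entry.channel} ${entry.lastError}. Payload discarded.`
  );
  // 4. No retry: the function returns without re-enqueueing anything.
}
```

Summarizing payloads by length rather than content keeps the log useful for diagnosing size errors without copying the full message text into it.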

Telegram-specific: a "message is too long" error should also trigger automatic message splitting (chunk at ~4000 chars) before failing, as a best-effort recovery. If splitting itself fails, then apply the permanent-failure path.
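A minimal sketch of the splitting step, assuming plain text with no formatting entities. The `splitMessage` name and the prefer-newline heuristic are illustrative choices, not part of the proposal; the 4000 default follows the issue's "~4000 chars", which sits just under the 4096-character limit documented for Telegram's `sendMessage`.

```typescript
const DEFAULT_CHUNK_SIZE = 4000; // from the issue's "~4000 chars"

// Splits text into chunks of at most maxLen UTF-16 code units,
// preferring to break just after a newline when one is available.
function splitMessage(text: string, maxLen: number = DEFAULT_CHUNK_SIZE): string[] {
  if (maxLen <= 0) throw new RangeError("maxLen must be positive");
  const chunks: string[] = [];
  let rest = text;
  while (rest.length > maxLen) {
    const head = rest.slice(0, maxLen);
    const lastNewline = head.lastIndexOf("\n");
    // Break after the last newline in the window, else cut hard at maxLen.
    const cut = lastNewline > 0 ? lastNewline + 1 : maxLen;
    chunks.push(rest.slice(0, cut));
    rest = rest.slice(cut);
  }
  if (rest.length > 0) chunks.push(rest);
  return chunks;
}

const report = "a".repeat(9000);
console.log(splitMessage(report).map((chunk) => chunk.length)); // [ 4000, 4000, 1000 ]
```

A hard cut can land in the middle of a surrogate pair or a formatting entity, so a production version would likely need to be aware of both before sending each chunk.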

Impact

Users receive stale, out-of-context messages hours or days after the original task
Gateway restarts flush backoff and deliver a burst of old messages
No visibility into failed entries without inspecting the filesystem directly
Operator cannot distinguish current output from replayed history

Related

#74239: same class of silent failure accumulation; both bugs share the pattern of permanent errors being treated as transient
#74283: delivery-recovery post-approval behavior

extent analysis

TL;DR

Classify errors before queuing for retry to distinguish between transient and permanent failures, and implement a retry policy based on this classification.

Guidance

Introduce error classification for HTTP errors to determine whether to retry or mark as failed immediately.
Implement a retry policy with backoff for transient errors (e.g., 429, 500, 502, 503, 504, ETIMEDOUT, ECONNRESET).
For permanent failures (e.g., 400 message too long, 401, 403, 404), move the entry to a failed queue, log the error, and send an alert to the operator channel without retrying.
Consider implementing automatic message splitting for Telegram messages that exceed the character limit as a best-effort recovery before marking as a permanent failure.
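To make the retry policy in this guidance concrete, the sketch below combines an error class with a retry budget and exponential backoff. The constants (`BASE_DELAY_MS`, `MAX_DELAY_MS`, `MAX_RETRIES`) and the `decideRetry` name are assumptions chosen for illustration; the source does not state the actual backoff schedule or budget.

```typescript
type ErrorClass = "transient" | "permanent";

interface RetryDecision {
  action: "retry" | "fail";
  delayMs?: number; // present only when action is "retry"
}

// All three values are assumptions, not taken from the project.
const BASE_DELAY_MS = 5000;
const MAX_DELAY_MS = 3600000; // cap each wait at one hour
const MAX_RETRIES = 5;

function decideRetry(errorClass: ErrorClass, retryCount: number): RetryDecision {
  // Permanent failures stop immediately and consume no retry budget.
  if (errorClass === "permanent") return { action: "fail" };
  // Transient failures stop once the budget is exhausted.
  if (retryCount >= MAX_RETRIES) return { action: "fail" };
  // Exponential backoff: base * 2^retryCount, capped at MAX_DELAY_MS.
  const delayMs = Math.min(BASE_DELAY_MS * Math.pow(2, retryCount), MAX_DELAY_MS);
  return { action: "retry", delayMs };
}

console.log(decideRetry("permanent", 0)); // { action: 'fail' }
console.log(decideRetry("transient", 3)); // { action: 'retry', delayMs: 40000 }
```

Bounding the budget for transient errors as well means no entry can linger indefinitely, which addresses the stale-delivery symptom even when a failure is misclassified.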

Example

Illustrative shape of a classified error record (field names are hypothetical, not taken from the repository):
{
  "errorClass": "Permanent",
  "errorCode": 400,
  "errorMessage": "message is too long",
  "retryCount": 0,
  "payloads": [
    { "text": "Audit complete. Here's the full report:\n\n# Hallucination Audit..." }
  ]
}

Notes

The proposed fix requires updates to the delivery-recovery system to classify errors and apply the appropriate retry policy.
The introduction of error classification and a retry policy may require additional logging and monitoring to ensure correct behavior.

Recommendation

Apply the proposed fix by introducing error classification and a retry policy to handle transient and permanent failures differently, as this will prevent stale messages from being delivered and provide better visibility into failed entries.
