hermes: WeChat (iLink) rate-limit retry storm causes gateway OOM and SIGKILL, no circuit breaker [1 pull request]

When iLink/WeChat enters a persistent rate-limiting state (errcode -2), the _send_text_chunk retry loop in gateway/platforms/weixin.py retries each individual chunk with backoff, but there is no circuit breaker across invocations. When the agent has multiple messages queued (e.g. long multi-part replies, tool results, or cron-initiated push messages arriving during a rate-limit window), every message spawns its own retry loop. These stack up concurrently, creating a retry storm.

Error Message

  • Multiple WARNING logs: [Weixin] rate limited for <user>; backing off 3.0s before retry
  • After 30+ retries across many messages in ~30 seconds, gateway memory spikes to 3.4GB
  • Gateway becomes unresponsive, eventually hits systemd's 90s timeout
  • systemd sends SIGTERM → gateway can't drain in time → SIGKILL

Fix Action

Fixed

Sample logs

May 16 07:35:13 WARNING [Weixin] rate limited for o9cq800p; backing off 3.0s
May 16 07:35:16 WARNING [Weixin] rate limited for o9cq800p; backing off 3.0s
... (30+ times) ...
May 16 07:35:44 WARNING Shutdown context: signal=SIGTERM loadavg_1m=3.48
May 16 07:37:14 systemd: Stopped with SIGKILL. Consumed 40min CPU, 3.4G memory peak.

Relevant code

The per-chunk rate-limit backoff loop is in gateway/platforms/weixin.py, lines 1622–1643. It backs off within a single send, but it has no cross-invocation throttling or circuit breaker.
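
To make the failure mode concrete, here is a minimal simulation of independent per-chunk retry loops running concurrently. It is not the code from weixin.py: fake_ilink_send, MAX_RETRIES, and the shortened backoff are assumptions made for illustration, and it assumes the sends run as asyncio tasks.

```python
import asyncio

RATE_LIMITED = -2        # errcode the issue reports for iLink rate limiting
MAX_RETRIES = 5          # assumed per-chunk retry budget (not from the source)
BACKOFF_S = 0.01         # stand-in for the 3.0s backoff seen in the logs

attempts = 0             # total requests sent to the simulated platform


async def fake_ilink_send(chunk: str) -> int:
    """Hypothetical stand-in for the iLink call; always rate limited."""
    global attempts
    attempts += 1
    return RATE_LIMITED


async def send_text_chunk(chunk: str) -> bool:
    """Per-chunk retry loop with backoff and no shared state."""
    for _ in range(MAX_RETRIES):
        if await fake_ilink_send(chunk) != RATE_LIMITED:
            return True
        await asyncio.sleep(BACKOFF_S)
    return False


async def main(queued_messages: int) -> int:
    # Every queued message starts its own loop; none of them learns
    # from the rejections the others are receiving.
    await asyncio.gather(
        *(send_text_chunk(f"msg-{i}") for i in range(queued_messages))
    )
    return attempts


total = asyncio.run(main(queued_messages=10))
print(total)  # 10 messages x 5 retries = 50 requests against a limited API
```

The request count grows with the number of queued messages, so the backoff inside each loop does nothing to reduce the overall pressure on an already rate-limited endpoint.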

Expected behavior

After N consecutive rate-limit rejections within a time window, the gateway should stop retrying and raise the error (at least for that platform, ideally for that specific chat/account). A circuit breaker pattern would do this:

  1. Increment a counter on each rate-limit event (per chat or global)
  2. When counter exceeds threshold in a time window, stop all pending sends and surface the error
  3. Reset the breaker after a cooldown period (e.g. 30-60s without a rate limit)
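
The three steps above can be sketched as a small sliding-window breaker keyed by chat or account. This is one possible shape, not the fix that was merged: the class name RateLimitBreaker, its method names, and the threshold, window, and cooldown values are all hypothetical. It opens when the count reaches the threshold, and it measures the cooldown from the most recent rejection recorded while at or above the threshold.

```python
import time
from collections import deque


class RateLimitBreaker:
    """Sliding-window circuit breaker for rate-limit rejections (sketch)."""

    def __init__(self, threshold=5, window_s=30.0, cooldown_s=60.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._events = {}      # key -> deque of rejection timestamps
        self._opened_at = {}   # key -> time the breaker (re)opened

    def record_rate_limit(self, key):
        """Steps 1 and 2: count the rejection, open at the threshold."""
        now = self._clock()
        events = self._events.setdefault(key, deque())
        events.append(now)
        # Drop rejections that have aged out of the window.
        while events and now - events[0] > self.window_s:
            events.popleft()
        if len(events) >= self.threshold:
            self._opened_at[key] = now

    def allow(self, key):
        """Step 3: refuse sends while open; close after the cooldown."""
        opened = self._opened_at.get(key)
        if opened is None:
            return True
        if self._clock() - opened >= self.cooldown_s:
            del self._opened_at[key]
            self._events.pop(key, None)
            return True
        return False


# Usage with a fake clock so the example is deterministic.
now = [0.0]
breaker = RateLimitBreaker(threshold=3, window_s=30.0, cooldown_s=60.0,
                           clock=lambda: now[0])
for _ in range(3):
    breaker.record_rate_limit("chat-a")   # three rejections, 3s apart
    now[0] += 3.0
blocked = breaker.allow("chat-a")         # False: breaker is open
other = breaker.allow("chat-b")           # True: other chats unaffected
now[0] += 60.0
recovered = breaker.allow("chat-a")       # True: cooldown has elapsed
```

A send path would call allow() before each attempt and surface the error immediately when it returns False, instead of entering another backoff loop. Keying by chat or account keeps one limited conversation from blocking the rest; a single global key would give the platform-wide variant.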

Environment

  • Hermes commit: latest main as of 2026-05-16
  • Platform: WeChat (iLink)
  • The retry logic exists but lacks cross-invocation throttling
  • Related: #21011, #21061 (cover rate limiting but not the retry storm OOM failure mode)
