hermes - ✅(Solved) Fix weixin: ret=-2 with empty errmsg also indicates stale context_token (follow-up to #17228) [1 pull request, 1 participant]

GitHub stats
NousResearch/hermes-agent#18100 (fetched 2026-05-01 05:53:51)
Comments: 0
Participants: 1
Timeline: 4 (labeled ×3, cross-referenced ×1)
Reactions: 0

Error Message

#17228 fixed the stale-context_token case where iLink returns ret=-2 with errmsg="unknown error". In the wild I'm hitting a closely related variant that the current check misses: ret=-2 with errmsg=None (empty). The adapter falls through to the rate-limit branch and burns all retries against the dead token, same net symptom as the original bug.

ERROR gateway.platforms.weixin: [Weixin] send failed to=o9cq80_r: iLink sendmessage rate limited: ret=-2 errcode=None errmsg=None
ERROR gateway.platforms.base: [Weixin] Fallback send also failed: iLink sendmessage rate limited: ret=-2 errcode=None errmsg=None

Same recovery pattern as #17228: after a user-initiated inbound message refreshed the session, the next cron push went through immediately with no code change. So this is a session-expired signal, not a frequency limit; iLink is just returning an empty errmsg instead of the string "unknown error".

Root Cause

gateway/platforms/weixin.py::_is_stale_session_ret on current main:

def _is_stale_session_ret(
    ret: "Optional[int]", errcode: "Optional[int]", errmsg: "Optional[str]",
) -> bool:
    """True when iLink returns ret=-2 / errcode=-2 with 'unknown error',
    which is a stale-session signal (same as errcode=-14) rather than
    a genuine rate limit."""
    if ret != RATE_LIMIT_ERRCODE and errcode != RATE_LIMIT_ERRCODE:
        return False
    return (errmsg or "").lower() == "unknown error"

(errmsg or "").lower() == "unknown error" returns False when errmsg is None or "", so the tokenless-retry branch never fires for the empty-errmsg variant.
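The miss can be reproduced by evaluating that expression on its own. The following is a minimal sketch; `old_check` is a hypothetical name used only for illustration, wrapping the last line of the helper quoted above.

```python
from typing import Optional


def old_check(errmsg: Optional[str]) -> bool:
    # Same expression as the final line of the helper on current main.
    return (errmsg or "").lower() == "unknown error"


# None and "" both normalize to "", which never equals "unknown error",
# so the empty-errmsg variant is classified as "not stale".
for value in (None, "", "unknown error", "Unknown Error"):
    print(repr(value), old_check(value))
```

Only the two populated "unknown error" spellings pass the check, which is why the empty variant ends up on the rate-limit path.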

Fix Action

Fix / Workaround

I've applied this patch locally and the cron-initiated pushes recover on the first retry instead of dropping — same behaviour the #17228 fix produces for the "unknown error" variant.

PR fix notes

PR #18105: fix(weixin): treat empty rate-limit message as stale session

Description (problem / solution / changelog)

Summary

  • Treat iLink ret=-2 / errcode=-2 with an empty or missing errmsg as a stale Weixin session signal.
  • Keep populated rate-limit messages on the existing rate-limit/backoff path.
  • Extend the Weixin stale-session helper regression tests for empty/None errmsg on both ret and errcode variants.

Root cause

gateway/platforms/weixin.py::_is_stale_session_ret only recognized "unknown error" as the stale-session flavor of iLink -2 responses. In practice, iLink can return the same stale-context-token signal with errmsg=None / empty, so Hermes misclassified it as a genuine rate limit and retried against the dead context token.

Fix

Normalize the message with strip().lower() and treat both "unknown error" and an empty normalized message as stale-session signals. Non-empty rate-limit text such as "freq limit" remains excluded.

Regression coverage

  • Updated TestIsStaleSessionRet to assert ret=-2 with None / "" errmsg is stale.
  • Added matching coverage for errcode=-2 with None / "" errmsg.
  • Existing genuine rate-limit and success-code guards continue to verify we do not broaden unrelated cases.

Testing

  • scripts/run_tests.sh tests/gateway/test_weixin.py::TestIsStaleSessionRet -q
  • scripts/run_tests.sh tests/gateway/test_weixin.py -q

Closes #18100

Changed files

  • gateway/platforms/weixin.py (modified, +9/-4)
  • tests/gateway/test_weixin.py (modified, +7/-3)

Code Example

WARNING gateway.platforms.weixin: [Weixin] rate limited for o9cq80_r; backing off 3.0s before retry
WARNING gateway.platforms.weixin: [Weixin] rate limited for o9cq80_r; backing off 3.0s before retry
WARNING gateway.platforms.weixin: [Weixin] rate limited for o9cq80_r; backing off 3.0s before retry
WARNING gateway.platforms.weixin: [Weixin] rate limited for o9cq80_r; backing off 3.0s before retry
ERROR   gateway.platforms.weixin: [Weixin] send failed to=o9cq80_r: iLink sendmessage rate limited: ret=-2 errcode=None errmsg=None
WARNING gateway.platforms.base:   [Weixin] Send failed: iLink sendmessage rate limited: ret=-2 errcode=None errmsg=None — trying plain-text fallback
...
ERROR   gateway.platforms.base:   [Weixin] Fallback send also failed: iLink sendmessage rate limited: ret=-2 errcode=None errmsg=None

---

def _is_stale_session_ret(
    ret: "Optional[int]", errcode: "Optional[int]", errmsg: "Optional[str]",
) -> bool:
    """True when iLink returns ret=-2 / errcode=-2 with 'unknown error',
    which is a stale-session signal (same as errcode=-14) rather than
    a genuine rate limit."""
    if ret != RATE_LIMIT_ERRCODE and errcode != RATE_LIMIT_ERRCODE:
        return False
    return (errmsg or "").lower() == "unknown error"

---

def _is_stale_session_ret(
    ret: "Optional[int]", errcode: "Optional[int]", errmsg: "Optional[str]",
) -> bool:
    """True when iLink returns ret=-2 / errcode=-2 that is likely a stale
    context_token rather than a genuine rate limit.

    Empirically iLink signals these two scenarios weakly:
    - stale session:      ret=-2, errmsg="unknown error" OR errmsg empty/None
    - genuine rate limit: ret=-2 with a populated errmsg such as
      "frequency limit" / "too frequently" / similar

    Treating "unknown error" and empty/None errmsg as stale-session signals
    lets the caller attempt one tokenless retry. A true rate limit still
    falls through to the existing rate-limit backoff path if the tokenless
    attempt also fails.
    """
    if ret != RATE_LIMIT_ERRCODE and errcode != RATE_LIMIT_ERRCODE:
        return False
    msg = (errmsg or "").strip().lower()
    if not msg:
        return True
    return msg == "unknown error"

Follow-up to #17228

#17228 fixed the stale-context_token case where iLink returns ret=-2 with errmsg="unknown error". In the wild I'm hitting a closely related variant that the current check misses: ret=-2 with errmsg=None (empty). The adapter falls through to the rate-limit branch and burns all retries against the dead token, same net symptom as the original bug.

Repro / evidence

~/.hermes/logs/gateway.log (WSL2 / iLink personal WeChat bot, current main):

WARNING gateway.platforms.weixin: [Weixin] rate limited for o9cq80_r; backing off 3.0s before retry
WARNING gateway.platforms.weixin: [Weixin] rate limited for o9cq80_r; backing off 3.0s before retry
WARNING gateway.platforms.weixin: [Weixin] rate limited for o9cq80_r; backing off 3.0s before retry
WARNING gateway.platforms.weixin: [Weixin] rate limited for o9cq80_r; backing off 3.0s before retry
ERROR   gateway.platforms.weixin: [Weixin] send failed to=o9cq80_r: iLink sendmessage rate limited: ret=-2 errcode=None errmsg=None
WARNING gateway.platforms.base:   [Weixin] Send failed: iLink sendmessage rate limited: ret=-2 errcode=None errmsg=None — trying plain-text fallback
...
ERROR   gateway.platforms.base:   [Weixin] Fallback send also failed: iLink sendmessage rate limited: ret=-2 errcode=None errmsg=None

Same recovery pattern as #17228: after a user-initiated inbound message refreshed the session, the next cron push went through immediately with no code change. So this is a session-expired signal, not a frequency limit — iLink is just returning an empty errmsg instead of the string "unknown error".

Root cause

gateway/platforms/weixin.py::_is_stale_session_ret on current main:

def _is_stale_session_ret(
    ret: "Optional[int]", errcode: "Optional[int]", errmsg: "Optional[str]",
) -> bool:
    """True when iLink returns ret=-2 / errcode=-2 with 'unknown error',
    which is a stale-session signal (same as errcode=-14) rather than
    a genuine rate limit."""
    if ret != RATE_LIMIT_ERRCODE and errcode != RATE_LIMIT_ERRCODE:
        return False
    return (errmsg or "").lower() == "unknown error"

(errmsg or "").lower() == "unknown error" returns False when errmsg is None or "", so the tokenless-retry branch never fires for the empty-errmsg variant.

Proposed fix (Option B from #17228, minimally widened)

Treat both "unknown error" and empty/None errmsg as stale-session signals. A genuine rate limit from iLink carries a populated errmsg ("frequency limit" / "too frequently" / similar), so we don't mask real throttling — and even if a rate limit did slip through, the worst case is one extra tokenless attempt before the normal rate-limit backoff kicks in.

def _is_stale_session_ret(
    ret: "Optional[int]", errcode: "Optional[int]", errmsg: "Optional[str]",
) -> bool:
    """True when iLink returns ret=-2 / errcode=-2 that is likely a stale
    context_token rather than a genuine rate limit.

    Empirically iLink signals these two scenarios weakly:
    - stale session:      ret=-2, errmsg="unknown error" OR errmsg empty/None
    - genuine rate limit: ret=-2 with a populated errmsg such as
      "frequency limit" / "too frequently" / similar

    Treating "unknown error" and empty/None errmsg as stale-session signals
    lets the caller attempt one tokenless retry. A true rate limit still
    falls through to the existing rate-limit backoff path if the tokenless
    attempt also fails.
    """
    if ret != RATE_LIMIT_ERRCODE and errcode != RATE_LIMIT_ERRCODE:
        return False
    msg = (errmsg or "").strip().lower()
    if not msg:
        return True
    return msg == "unknown error"
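The "one extra tokenless attempt" worst case can be illustrated with a small caller sketch. The real send path in gateway/platforms/weixin.py is not shown in this issue, so everything here is an assumption: `send_with_recovery`, the `send` callback, the dict-shaped response, `ret == 0` as the success signal, and the retry count are all hypothetical. Only the 3.0s backoff value is taken from the log above.

```python
import time
from typing import Callable, Optional

RATE_LIMIT_ERRCODE = -2  # assumption: implied by ret=-2 / errcode=-2


def _is_stale_session_ret(ret, errcode, errmsg) -> bool:
    # Patched helper from above, condensed.
    if ret != RATE_LIMIT_ERRCODE and errcode != RATE_LIMIT_ERRCODE:
        return False
    msg = (errmsg or "").strip().lower()
    return not msg or msg == "unknown error"


def send_with_recovery(
    send: Callable[[Optional[str]], dict],
    context_token: Optional[str],
    max_retries: int = 3,
    backoff_s: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Hypothetical caller: one tokenless retry on a stale session,
    otherwise a fixed backoff between attempts."""
    token = context_token
    tried_tokenless = False
    for attempt in range(max_retries + 1):
        resp = send(token)
        if resp.get("ret") == 0:  # assumption: ret == 0 means success
            return True
        stale = _is_stale_session_ret(
            resp.get("ret"), resp.get("errcode"), resp.get("errmsg")
        )
        if stale and not tried_tokenless:
            # Drop the dead context_token and retry at once, without backoff.
            token, tried_tokenless = None, True
            continue
        if attempt < max_retries:
            sleep(backoff_s)
    return False
```

In this sketch a response with a populated rate-limit message never triggers the tokenless branch, and a misclassified empty response costs exactly one extra attempt before the backoff loop resumes.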

Verification

I've applied this patch locally and the cron-initiated pushes recover on the first retry instead of dropping — same behaviour the #17228 fix produces for the "unknown error" variant.

Question

Happy to send this as a PR (with a regression test covering errmsg=None / "" / "unknown error" / "frequency limit"). Want me to open one, or would you prefer to fold it into the existing helper yourself? I don't want to step on an in-flight branch.

— reported from a zh-CN / WSL2 iLink setup, in case locale turns out to matter for which errmsg iLink returns.

Extended analysis

TL;DR

Update the _is_stale_session_ret function to treat both "unknown error" and empty/None errmsg as stale-session signals.

Guidance

  • Review the proposed fix in the issue, which widens the condition to check for empty/None errmsg in addition to "unknown error".
  • Verify that the updated function behaves correctly for different errmsg values, including None, "", "unknown error", and "frequency limit".
  • Consider adding regression tests to cover these scenarios.
  • Before opening a PR, check if there are any in-flight branches that may conflict with this change.

Example

def _is_stale_session_ret(
    ret: "Optional[int]", errcode: "Optional[int]", errmsg: "Optional[str]",
) -> bool:
    # ...
    msg = (errmsg or "").strip().lower()
    if not msg:
        return True
    return msg == "unknown error"

Notes

The fix assumes that a genuine rate limit from iLink always carries a populated errmsg. If this assumption is incorrect, the updated function may mask real throttling.

Recommendation

Apply the proposed workaround by updating the _is_stale_session_ret function as described, and add regression tests to ensure the function behaves correctly in different scenarios.
