hermes - ✅ (Solved) Fix [Bug]: WeCom AI Bot WebSocket reconnects cause group message delivery to stop silently [3 pull requests, 1 comment, 1 participant]

GitHub stats
NousResearch/hermes-agent#11554 · Fetched 2026-04-18 06:00:17
Comments: 1 · Participants: 1 · Timeline: 8 · Reactions: 0
Timeline (top): cross-referenced ×3, referenced ×3, closed ×1, commented ×1

When the WeCom AI Bot WebSocket connection drops and reconnects (which happens every ~5 minutes in my environment), the new connection appears healthy (heartbeats/ping work fine), but inbound group messages (aibot_msg_callback) are no longer delivered.

The gateway process itself never crashes, and there are no errors in the logs after reconnect — the messages simply stop arriving.

Error Message

[Wecom] WebSocket error: WeCom websocket closed
[Wecom] Reconnected
[Wecom] Received websocket payload: cmd=None req_id=ping-...
[Wecom] Ignoring websocket payload: {'headers': {'req_id': 'ping-...'}, 'errcode': 0, 'errmsg': 'ok'}
# ... repeated pings forever, zero aibot_msg_callback

Root Cause

Suspected Root Cause

After _open_connection() re-subscribes via aibot_subscribe, the WeCom server may not re-associate the new WebSocket connection with the existing group chat subscriptions. The reconnect logic in _listen_loop() only re-opens the connection and re-sends aibot_subscribe, but there may be a missing step to re-register or re-sync group chat message routing.

Fix Action

Fixed

PR fix notes

PR #11572: fix(wecom): resolve WebSocket zombie sessions and group chat 600039 errors #11554

Description (problem / solution / changelog)

What does this PR do?

This PR addresses two critical issues with the WeCom AI Bot WebSocket gateway integration, specifically resolving #11554.

  1. Silent Message Drops on Reconnect: When the WebSocket reconnects, the WeCom server keeps the previous session alive as a "zombie", routing inbound group messages to the dead connection instead of the new one.
  2. WeCom errcode 600039 in Group Chats: WeCom AI Bots are restricted from sending proactive messages (APP_CMD_SEND) in group chats and must reply to an existing req_id (APP_CMD_RESPONSE). Additionally, the stream message type is unsupported on many WeCom mobile clients, triggering the same 600039 error.

Changes Made

  • Added device_id to aibot_subscribe: Generating and persisting a device_id per adapter instance ensures the WeCom server correctly overrides the old zombie session and re-routes messages upon reconnection.
  • Replaced stream with markdown: Removed _send_reply_stream in favor of _send_reply_markdown to ensure message formatting compatibility across all WeCom clients.
  • Implemented req_id caching for group fallbacks: The adapter now tracks the last seen req_id for each chat_id (_last_chat_req_ids). If the bot attempts to send a message to a group without a specific reply_to context, it gracefully falls back to sending a reply using the cached req_id, bypassing the proactive send restriction.
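The first change can be illustrated with a minimal sketch. It assumes the subscribe frame uses the cmd / headers.req_id layout visible in the gateway logs plus a body object; the SubscribeBuilder class and the bot_id and secret keys are hypothetical and are not taken from gateway/platforms/wecom.py. Only the idea of one device_id per adapter instance, reused on every reconnect, comes from the PR notes.

```python
import uuid


class SubscribeBuilder:
    """Hypothetical helper; not the adapter's real class."""

    def __init__(self, bot_id: str, secret: str) -> None:
        self.bot_id = bot_id
        self.secret = secret
        # Generated once per instance so every reconnect reuses the same value.
        self.device_id = uuid.uuid4().hex

    def build_subscribe(self) -> dict:
        # Frame layout and the bot_id / secret keys are assumptions.
        return {
            "cmd": "aibot_subscribe",
            "headers": {"req_id": f"sub-{uuid.uuid4().hex}"},
            "body": {
                "bot_id": self.bot_id,
                "secret": self.secret,
                "device_id": self.device_id,
            },
        }
```

Calling build_subscribe() again after a reconnect yields a fresh req_id but the same device_id, which is the property the PR relies on to displace the zombie session.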

Related Issue

Fixes #11554

Testing

  • Verified WebSocket reconnects no longer result in silent message drops.
  • Verified bot can successfully send messages to group chats without hitting the 600039: device type not support error.

Changed files

  • gateway/platforms/wecom.py (modified, +22/-11)

PR #11656: fix(wecom): watchdog + reset _last_msg_at on reconnect to fix silent group message loss

Description (problem / solution / changelog)

Problem

WeCom AI Bot WebSocket connections experience silent message delivery failures after reconnects:

  1. The WebSocket drops and reconnects every ~5 minutes.
  2. After reconnect, the connection appears healthy (pings flow), but inbound group messages (aibot_msg_callback) stop arriving.
  3. The auto-reconnect did not reset _last_msg_at, leaving it stale. This caused the watchdog (if present) to miscalculate idle time.

Changes

  • Add a watchdog: Monitors inbound message flow. If no aibot_msg_callback arrives within the idle timeout (default 300s) after the connection has been up for 2 minutes, it force-closes the socket so the listen loop reconnects and re-subscribes.
  • Track the last-message time (_last_msg_at): Updated on every real message callback.
  • Reset _last_msg_at on reconnect: Fixes the bug where the watchdog used the previous connection's last message time.
  • Disable reply-stream for groups entirely: Proactive sending works reliably for both DMs and groups, while reply streams silently fail in groups with errcode 600039.
  • Improve logging: Added inbound/outbound payload logging for easier debugging.
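The watchdog logic described in this list might look roughly like the sketch below. The InboundWatchdog class, its method names, and the exact interpretation of the 2-minute grace period are assumptions made for illustration; the PR's real implementation is not shown in this report. The 300s and 120s defaults mirror the numbers quoted above.

```python
import time
from typing import Optional


class InboundWatchdog:
    """Hypothetical sketch: decides when a connection should be force-closed."""

    def __init__(self, idle_timeout: float = 300.0, grace: float = 120.0) -> None:
        self.idle_timeout = idle_timeout
        self.grace = grace
        self.connected_at: Optional[float] = None
        self.last_msg_at: Optional[float] = None

    def on_connected(self, now: Optional[float] = None) -> None:
        self.connected_at = time.monotonic() if now is None else now
        # Reset so idle time is never measured against the previous connection.
        self.last_msg_at = None

    def on_message(self, now: Optional[float] = None) -> None:
        # Call only for real message callbacks, not for ping acknowledgements.
        self.last_msg_at = time.monotonic() if now is None else now

    def is_stalled(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        if self.connected_at is None:
            return False
        if now - self.connected_at < self.grace:
            return False
        # With no message yet on this connection, measure from connect time.
        reference = self.last_msg_at if self.last_msg_at is not None else self.connected_at
        return now - reference >= self.idle_timeout
```

A caller would poll is_stalled() periodically and close the socket when it returns True, leaving the existing reconnect path to re-open and re-subscribe.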

Testing

  • Watchdog successfully detected a stalled connection (~59 min without group messages) and forced a reconnect.
  • After reconnect, group messages (aibot_msg_callback) resumed immediately.
  • Reply sending now consistently succeeds.

Closes #11554

Changed files

  • gateway/platforms/wecom.py (modified, +88/-5)

PR #11897: fix(wecom): resolve WebSocket zombie sessions and group chat 600039 errors (#11554)

Description (problem / solution / changelog)

Summary

WeCom AI Bot WebSocket reconnects no longer silently drop inbound group messages, and group replies no longer fail with errcode 600039: device type not support. Fixes #11554.

Root cause

Three interacting WeCom protocol quirks:

  1. aibot_subscribe didn't include a device_id, so on reconnect the WeCom server kept routing inbound callbacks to the old (zombie) session.
  2. _send_reply_stream used msgtype: stream, which is unsupported on many WeCom mobile clients and triggers errcode 600039.
  3. WeCom AI Bots can't initiate APP_CMD_SEND in group chats at all — proactive sends to groups must piggyback on a prior inbound req_id via APP_CMD_RESPONSE.

Approach

Salvaged from #11572 (devorun):

  • Generate a stable device_id per adapter instance and include it in the subscribe body so WeCom takes over the zombie session on reconnect.
  • Replace _send_reply_stream with _send_reply_markdown (msgtype markdown) so replies render on all WeCom clients.
  • Cache the most recent inbound req_id per chat, and fall back to it for proactive sends when no explicit reply_to is available — required for group sends.
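The fallback in the last bullet can be sketched as a small decision function. This is an illustration under assumptions: choose_command is a hypothetical name, the two constants are placeholders whose real values are not given in this report, and the real adapter may apply the fallback only to group chats rather than to every chat.

```python
from typing import Dict, Optional, Tuple

# Placeholder values; the real command strings are not shown in the source.
APP_CMD_SEND = "APP_CMD_SEND"
APP_CMD_RESPONSE = "APP_CMD_RESPONSE"


def choose_command(
    chat_id: str,
    reply_to: Optional[str],
    last_chat_req_ids: Dict[str, str],
) -> Tuple[str, Optional[str]]:
    """Return (command, req_id) for an outbound message (hypothetical helper)."""
    # Prefer the explicit reply context, then the cached inbound req_id.
    req_id = reply_to or last_chat_req_ids.get(chat_id)
    if req_id:
        return APP_CMD_RESPONSE, req_id
    # Nothing to piggyback on: only a proactive send remains.
    return APP_CMD_SEND, None
```

For example, choose_command("group-1", None, {"group-1": "req-9"}) selects the response command with req-9, so the message is sent as a reply rather than as a proactive group send.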

Follow-up fixes applied on top during salvage:

  • Bounded the req_id cache. Extracted _remember_chat_req_id() helper and capped it at DEDUP_MAX_SIZE like the existing _reply_req_ids eviction — otherwise a long-running gateway with many chats leaks memory forever.
  • Moved cache write to after the policy check. We don't cache req_ids from blocked senders.
  • Reverted the undocumented is_group change. The original PR flipped chattype == "group" to bool(chatid) without mentioning it — that weakens the signal since chattype is the explicit hint. Kept the original check.
  • Dropped defensive getattr(self, '_last_chat_req_ids', {}) reads at both send sites — the attribute is initialized in __init__.
  • Added tests. New TestWeComZombieSessionFix class covering: device_id presence in subscribe, distinct device_ids per instance, per-chat req_id caching, blocked-sender cache exclusion, cache bounding, and the group APP_CMD_RESPONSE fallback. Also updated the existing test_send_uses_passive_reply_stream_when_reply_context_exists_markdown_... to match the new msgtype.
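A bounded per-chat cache of the kind described in the first follow-up fix could be sketched as follows. The ReqIdCache class and its oldest-first eviction are assumptions; the report only states that the cache is capped at DEDUP_MAX_SIZE (1000 according to the validation notes) and that empty values are ignored, not how _remember_chat_req_id() evicts entries.

```python
from collections import OrderedDict
from typing import Optional

DEDUP_MAX_SIZE = 1000  # value quoted in the PR's validation notes


class ReqIdCache:
    """Hypothetical bounded map of chat_id -> most recent inbound req_id."""

    def __init__(self, max_size: int = DEDUP_MAX_SIZE) -> None:
        self._max_size = max_size
        self._items: "OrderedDict[str, str]" = OrderedDict()

    def remember(self, chat_id: str, req_id: str) -> None:
        if not chat_id or not req_id:
            return  # ignore empty values
        self._items[chat_id] = req_id
        self._items.move_to_end(chat_id)  # mark as most recently updated
        while len(self._items) > self._max_size:
            self._items.popitem(last=False)  # evict the oldest entry (assumed policy)

    def get(self, chat_id: str) -> Optional[str]:
        return self._items.get(chat_id)

    def __len__(self) -> int:
        return len(self._items)
```

Inserting 1100 distinct chats into this sketch leaves 1000 entries, with the 100 oldest chats evicted, which matches the shape of the stress check listed under Validation.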

Why this over #11656 (competing PR)

#11656 (olbwmly-png) tried to solve the same bug by adding a 5-minute watchdog that force-reconnects when no callbacks arrive, plus a fallback-on-600039 path. That's a defensive workaround — if device_id really is the missing piece, each watchdog-driven reconnect would fail the same way. #11572's targeted protocol fix (device_id + markdown + req_id fallback) addresses the root cause. The watchdog idea is valid as defense-in-depth and a follow-up PR adding it on top would be welcome. Closing #11656 with credit.

Changes

| File | Change |
| --- | --- |
| gateway/platforms/wecom.py | device_id in __init__ + subscribe body; _last_chat_req_ids cache; _remember_chat_req_id() helper (bounded); _send_reply_stream → _send_reply_markdown; proactive send falls back to cached req_id |
| tests/gateway/test_wecom.py | Update reply_stream test → reply_markdown; new TestWeComZombieSessionFix class (9 tests) |

Validation

Result

  • tests/gateway/test_wecom.py: 40/40 passing (was 31, +9 new)
  • tests/gateway/test_wecom.py + test_wecom_callback.py + test_text_batching.py: 70/70 passing
  • E2E: distinct device_id per adapter, stable within instance
  • E2E: _last_chat_req_ids bounded at DEDUP_MAX_SIZE=1000 under 1100-insert stress
  • E2E: empty values ignored, _send_reply_stream fully removed in favor of _send_reply_markdown

Merge method

Please rebase-merge (gh pr merge --rebase) — the first commit is authored by Devorun and must keep that authorship. Squash would collapse both commits under the merger.

Closes #11554. Supersedes and will close #11572 (devorun, cherry-picked with authorship preserved) and #11656 (olbwmly-png, watchdog approach — welcome as a follow-up).

Changed files

  • gateway/platforms/wecom.py (modified, +41/-10)
  • tests/gateway/test_wecom.py (modified, +195/-4)


Description

When the WeCom AI Bot WebSocket connection drops and reconnects (which happens every ~5 minutes in my environment), the new connection appears healthy (heartbeats/ping work fine), but inbound group messages (aibot_msg_callback) are no longer delivered.

The gateway process itself never crashes, and there are no errors in the logs after reconnect — the messages simply stop arriving.

Reproduction Steps

  1. Start hermes gateway with WeCom AI Bot (WebSocket) enabled.
  2. Add the bot to a WeCom group chat.
  3. Send a message in the group → bot receives it (aibot_msg_callback).
  4. Wait for the WebSocket to disconnect (happens every ~5 min in my logs: WeCom websocket closed → Reconnected).
  5. Send another message in the same group → bot never receives it. No aibot_msg_callback is logged.

Expected vs Actual Behavior

  • Expected: After WebSocket reconnects, group messages should continue to be delivered via aibot_msg_callback.
  • Actual: After reconnect, only ping payloads are received. Group messages are silently dropped by the WeCom server side (or the subscription is lost).

Environment

  • Hermes version: latest main (as of 2026-04-17)
  • Platform: macOS (Apple Silicon)
  • Python: 3.11
  • WeCom mode: AI Bot (WebSocket) — gateway/platforms/wecom.py
  • Config: group_policy: open, no allowlist restrictions

Logs

Before disconnect (messages arrive normally)

[Wecom] Received websocket payload: cmd=aibot_msg_callback req_id=...

After reconnect (only pings, no messages)

[Wecom] WebSocket error: WeCom websocket closed
[Wecom] Reconnected
[Wecom] Received websocket payload: cmd=None req_id=ping-...
[Wecom] Ignoring websocket payload: {'headers': {'req_id': 'ping-...'}, 'errcode': 0, 'errmsg': 'ok'}
# ... repeated pings forever, zero aibot_msg_callback

Additional symptom: 600039 on group replies

When the bot does manage to process a group message (before the disconnect), sending a reply back to the group fails with:

WeCom errcode 600039: device type not support

This may or may not be related, but it only happens for group chats — proactive sends to groups via WeComAdapter.send_text() also hit this error.

Suspected Root Cause

After _open_connection() re-subscribes via aibot_subscribe, the WeCom server may not re-associate the new WebSocket connection with the existing group chat subscriptions. The reconnect logic in _listen_loop() only re-opens the connection and re-sends aibot_subscribe, but there may be a missing step to re-register or re-sync group chat message routing.

Possible Fix Ideas

  1. After reconnect, explicitly trigger a re-subscription or re-sync with WeCom server.
  2. Add a health-check that verifies aibot_msg_callback is actually being received (not just pings), and force a harder reconnect if no messages arrive within a timeout window.
  3. Log a warning when reconnect happens but no non-ping messages have been seen for N minutes.
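Ideas 2 and 3 both depend on telling ping acknowledgements apart from real message callbacks. A minimal sketch of that classification, based only on the payload shapes visible in the logs above (cmd=None with a req_id starting with ping-, versus cmd=aibot_msg_callback), is shown here. The function names are hypothetical, and the ping- prefix check is an assumption drawn from the log output rather than from a documented protocol rule.

```python
def is_ping_ack(payload: dict) -> bool:
    """True for acknowledgements like the 'Ignoring websocket payload' log lines."""
    headers = payload.get("headers") or {}
    req_id = str(headers.get("req_id", ""))
    return payload.get("cmd") is None and req_id.startswith("ping-")


def is_message_callback(payload: dict) -> bool:
    """True for inbound messages that should refresh a health-check timer."""
    return payload.get("cmd") == "aibot_msg_callback"
```

A health check would update its last-message timestamp only when is_message_callback() returns True, so a connection that receives nothing but ping acknowledgements still counts as idle.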

Related

  • gateway/platforms/wecom.py
  • RECONNECT_BACKOFF logic in _listen_loop()

extent analysis

TL;DR

The WeCom AI Bot may need to re-register or re-sync group chat message routing after a WebSocket reconnect to receive inbound group messages.

Guidance

  • Verify that the aibot_subscribe request is successfully sent after reconnecting and that the WeCom server responds with a success message.
  • Investigate adding a health-check to ensure aibot_msg_callback messages are being received after a reconnect, and trigger a harder reconnect if no messages arrive within a timeout window.
  • Review the gateway/platforms/wecom.py code to see if there's a missing step in the reconnect logic to re-associate the new WebSocket connection with existing group chat subscriptions.
  • Consider logging a warning when a reconnect happens but no non-ping messages have been seen for a certain amount of time to help diagnose the issue.
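The first guidance point, confirming that the server acknowledged aibot_subscribe, could be checked with a helper like the one below. It assumes the subscribe acknowledgement has the same shape as the ping acknowledgement in the logs (headers.req_id plus errcode and errmsg); that shape is not confirmed for subscribe responses in this report, and is_success_ack is a hypothetical name.

```python
def is_success_ack(payload: dict, expected_req_id: str) -> bool:
    """Check that a payload acknowledges the given request with errcode 0."""
    headers = payload.get("headers") or {}
    # Assumed ack shape: {'headers': {'req_id': ...}, 'errcode': 0, 'errmsg': 'ok'}
    return headers.get("req_id") == expected_req_id and payload.get("errcode") == 0
```

Note that in the reported bug the reconnect looked healthy at this level, so a successful acknowledgement alone would not have revealed the zombie session.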

Example

No code example is provided as the issue requires further investigation into the WeCom API and the gateway/platforms/wecom.py code.

Notes

The issue may be related to the WeCom server not re-associating the new WebSocket connection with existing group chat subscriptions after a reconnect. The errcode 600039 error when sending replies to groups may be a separate issue, but it could be related to the same underlying problem.

Recommendation

Apply a workaround by adding a health-check to verify aibot_msg_callback messages are being received after a reconnect, and trigger a harder reconnect if no messages arrive within a timeout window. This may help mitigate the issue until a more permanent fix can be found.
