hermes - ✅(Solved) Fix [Bug]: WeCom AI Bot WebSocket reconnects cause group message delivery to stop silently [3 pull requests, 1 comment, 1 participant]

chengoak · 2026-04-17T11:07:41Z


hermes · 2026-04-17 11:07:41


GitHub stats

NousResearch/hermes-agent#11554 • Fetched 2026-04-18 06:00:17


Author

chengoak

Participants

chengoak

Timeline (top)

cross-referenced ×3, referenced ×3, closed ×1, commented ×1

When the WeCom AI Bot WebSocket connection drops and reconnects (which happens every ~5 minutes in my environment), the new connection appears healthy (heartbeats/ping work fine), but inbound group messages (aibot_msg_callback) are no longer delivered.

The gateway process itself never crashes, and there are no errors in the logs after reconnect — the messages simply stop arriving.

Error Message

[Wecom] WebSocket error: WeCom websocket closed
[Wecom] Reconnected
[Wecom] Received websocket payload: cmd=None req_id=ping-...
[Wecom] Ignoring websocket payload: {'headers': {'req_id': 'ping-...'}, 'errcode': 0, 'errmsg': 'ok'}
# ... repeated pings forever, zero aibot_msg_callback

Root Cause

Suspected Root Cause

After _open_connection() re-subscribes via aibot_subscribe, the WeCom server may not re-associate the new WebSocket connection with the existing group chat subscriptions. The reconnect logic in _listen_loop() only re-opens the connection and re-sends aibot_subscribe, but there may be a missing step to re-register or re-sync group chat message routing.

Fix Action

Fixed

Fixed by PR: fix(wecom): resolve WebSocket zombie sessions and group chat 600039 e… (https://github.com/NousResearch/hermes-agent/pull/11572)
Fixed by PR: fix(wecom): watchdog + reset _last_msg_at on reconnect to fix silent group message loss (https://github.com/NousResearch/hermes-agent/pull/11656)
Fixed by PR: fix(wecom): resolve WebSocket zombie sessions and group chat 600039 errors (#11554) (https://github.com/NousResearch/hermes-agent/pull/11897)

PR fix notes

PR #11572: fix(wecom): resolve WebSocket zombie sessions and group chat 600039 e…

Repository: NousResearch/hermes-agent
Author: devorun
State: closed | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/11572

Description (problem / solution / changelog)

…rrors #11554

What does this PR do?

This PR addresses two critical issues with the WeCom AI Bot WebSocket gateway integration, specifically resolving #11554.

1. Silent Message Drops on Reconnect: When the WebSocket reconnects, the WeCom server keeps the previous session alive as a "zombie", routing inbound group messages to the dead connection instead of the new one.
2. WeCom errcode 600039 in Group Chats: WeCom AI Bots are restricted from sending proactive messages (APP_CMD_SEND) in group chats and must reply to an existing req_id (APP_CMD_RESPONSE). Additionally, the stream message type is unsupported on many WeCom mobile clients, triggering the same 600039 error.

Changes Made

Added device_id to aibot_subscribe: Generating and persisting a device_id per adapter instance ensures the WeCom server correctly overrides the old zombie session and re-routes messages upon reconnection.
Replaced stream with markdown: Removed _send_reply_stream in favor of _send_reply_markdown to ensure message formatting compatibility across all WeCom clients.
Implemented req_id caching for group fallbacks: The adapter now tracks the last seen req_id for each chat_id (_last_chat_req_ids). If the bot attempts to send a message to a group without a specific reply_to context, it gracefully falls back to sending a reply using the cached req_id, bypassing the proactive send restriction.
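
The device_id change can be pictured with a minimal sketch. This is not the adapter's code: the class name `SubscribeFrameBuilder` and the `bot_id` and `secret` body fields are assumptions made for illustration, and only `aibot_subscribe`, `device_id`, and the `headers.req_id` shape come from the text and logs on this page.

```python
import uuid


class SubscribeFrameBuilder:
    """Hypothetical helper: builds aibot_subscribe frames for one adapter instance."""

    def __init__(self, bot_id: str, secret: str) -> None:
        self.bot_id = bot_id
        self.secret = secret
        # Generated once per instance and reused on every (re)subscribe, so the
        # server can recognise a reconnect as the same device.
        self.device_id = uuid.uuid4().hex

    def build(self) -> dict:
        return {
            "cmd": "aibot_subscribe",
            # A fresh req_id per frame; the "subscribe-" prefix is illustrative.
            "headers": {"req_id": f"subscribe-{uuid.uuid4().hex}"},
            "body": {
                "bot_id": self.bot_id,    # assumed field name
                "secret": self.secret,    # assumed field name
                "device_id": self.device_id,
            },
        }
```

Calling `build()` again after a reconnect yields a new `req_id` but the same `device_id`, which is the property the PR relies on.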

Related Issue

Fixes #11554

Testing

Verified WebSocket reconnects no longer result in silent message drops.
Verified bot can successfully send messages to group chats without hitting the 600039: device type not support error.


Changed files

gateway/platforms/wecom.py (modified, +22/-11)

PR #11656: fix(wecom): watchdog + reset _last_msg_at on reconnect to fix silent group message loss

Repository: NousResearch/hermes-agent
Author: olbwmly-png
State: closed | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/11656

Description (problem / solution / changelog)

Problem

WeCom AI Bot WebSocket connections experience silent message delivery failures after reconnects:

The WebSocket drops and reconnects every ~5 minutes.
After reconnect, the connection appears healthy (pings flow), but inbound group messages (aibot_msg_callback) stop arriving.
The auto-reconnect only reset […], leaving _last_msg_at stale. This caused the watchdog (if present) to miscalculate idle time.

Changes

Add a watchdog […]: Monitors inbound message flow. If no aibot_msg_callback arrives within […] (default 300s) after the connection has been up for 2 minutes, it force-closes the socket so […] reconnects and re-subscribes.
Track […] and […]: Updated on every real message callback.
Reset _last_msg_at on reconnect: Fixes the bug where the watchdog used the previous connection's last message time.
Disable reply-stream for groups entirely: Proactive […] works reliably for both DMs and groups, while reply streams silently fail in groups with errcode 600039.
Improve logging: Added inbound/outbound payload logging for easier debugging.
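
A rough sketch of the watchdog decision described above, written as a pure class so it can be exercised without a socket. Only `_last_msg_at`, the 300 s default, and the 2-minute warm-up come from the PR text. The class and method names are hypothetical, and measuring idle time from the connect time when no callback has arrived yet is one possible reading, not necessarily what the PR does.

```python
import time
from typing import Callable, Optional


class InboundWatchdog:
    """Hypothetical: decides when a connection that still answers pings is stalled."""

    def __init__(
        self,
        idle_timeout: float = 300.0,   # default quoted in the PR description
        grace_period: float = 120.0,   # "connection has been up for 2 minutes"
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.idle_timeout = idle_timeout
        self.grace_period = grace_period
        self._clock = clock
        self._connected_at: Optional[float] = None
        self._last_msg_at: Optional[float] = None

    def on_connected(self) -> None:
        # Reset both timestamps so the previous connection's activity
        # is never counted against (or for) the new one.
        self._connected_at = self._clock()
        self._last_msg_at = None

    def on_message_callback(self) -> None:
        # Call only for real aibot_msg_callback payloads, not ping acks.
        self._last_msg_at = self._clock()

    def should_force_reconnect(self) -> bool:
        if self._connected_at is None:
            return False
        now = self._clock()
        if now - self._connected_at < self.grace_period:
            return False
        reference = self._last_msg_at if self._last_msg_at is not None else self._connected_at
        return now - reference >= self.idle_timeout
```

In a listen loop, a periodic task could poll `should_force_reconnect()` and close the socket when it returns `True`, leaving the existing reconnect path to re-subscribe.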

Testing

Watchdog successfully detected a stalled connection (~59 min without group messages) and forced a reconnect.
After reconnect, group messages (aibot_msg_callback) resumed immediately.
Reply sending now consistently returns […].

Closes #11554

Changed files

gateway/platforms/wecom.py (modified, +88/-5)

PR #11897: fix(wecom): resolve WebSocket zombie sessions and group chat 600039 errors (#11554)

Repository: NousResearch/hermes-agent
Author: teknium1
State: closed | merged: True
Link: https://github.com/NousResearch/hermes-agent/pull/11897

Description (problem / solution / changelog)

Summary

WeCom AI Bot WebSocket reconnects no longer silently drop inbound group messages, and group replies no longer fail with errcode 600039: device type not support. Fixes #11554.

Root cause

Three interacting WeCom protocol quirks:

aibot_subscribe didn't include a device_id, so on reconnect the WeCom server kept routing inbound callbacks to the old (zombie) session.
_send_reply_stream used msgtype: stream, which is unsupported on many WeCom mobile clients and triggers errcode 600039.
WeCom AI Bots can't initiate APP_CMD_SEND in group chats at all — proactive sends to groups must piggyback on a prior inbound req_id via APP_CMD_RESPONSE.

Approach

Salvaged from #11572 (devorun):

Generate a stable device_id per adapter instance and include it in the subscribe body so WeCom takes over the zombie session on reconnect.
Replace _send_reply_stream with _send_reply_markdown (msgtype markdown) so replies render on all WeCom clients.
Cache the most recent inbound req_id per chat, and fall back to it for proactive sends when no explicit reply_to is available — required for group sends.
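
The send-path fallback in the third point might look roughly like the sketch below. The function name and the string constants are placeholders, and it treats `reply_req_id` as an already-resolved req_id, which is a simplification. It also applies the fallback to any chat that has a cached req_id, an assumption the PR text does not confirm.

```python
from typing import Dict, Optional, Tuple

# Placeholder values: the real constants in wecom.py are not shown on this page.
APP_CMD_SEND = "APP_CMD_SEND"
APP_CMD_RESPONSE = "APP_CMD_RESPONSE"


def choose_send_command(
    chat_id: str,
    reply_req_id: Optional[str],
    last_chat_req_ids: Dict[str, str],
) -> Tuple[str, Optional[str]]:
    """Pick the command and req_id to use for an outbound message."""
    # Prefer the explicit reply context, then the last req_id seen for this chat.
    req_id = reply_req_id or last_chat_req_ids.get(chat_id)
    if req_id:
        return APP_CMD_RESPONSE, req_id
    # Nothing to piggyback on: fall through to a proactive send.
    return APP_CMD_SEND, None
```

With a cache of `{"group-1": "req-42"}`, a send to `group-1` without reply context would go out as a response to `req-42` rather than as a proactive send.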

Follow-up fixes applied on top during salvage:

Bounded the req_id cache. Extracted _remember_chat_req_id() helper and capped it at DEDUP_MAX_SIZE like the existing _reply_req_ids eviction — otherwise a long-running gateway with many chats leaks memory forever.
Moved cache write to after the policy check. We don't cache req_ids from blocked senders.
Reverted the undocumented is_group change. The original PR flipped chattype == "group" to bool(chatid) without mentioning it — that weakens the signal since chattype is the explicit hint. Kept the original check.
Dropped defensive getattr(self, '_last_chat_req_ids', {}) reads at both send sites — the attribute is initialized in __init__.
Added tests. New TestWeComZombieSessionFix class covering: device_id presence in subscribe, distinct device_ids per instance, per-chat req_id caching, blocked-sender cache exclusion, cache bounding, and the group APP_CMD_RESPONSE fallback. Also updated the existing test_send_uses_passive_reply_stream_when_reply_context_exists → _markdown_... to match the new msgtype.
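
The first two follow-up fixes (a bounded cache, and writing to it only after the policy check) can be sketched as follows. `ChatReqIdCache` and `handle_inbound` are hypothetical names standing in for `_last_chat_req_ids` and `_remember_chat_req_id()`. Oldest-first eviction is an assumption, since the PR only says the cap mirrors the existing `_reply_req_ids` eviction.

```python
from collections import OrderedDict
from typing import Optional

DEDUP_MAX_SIZE = 1000  # value quoted in the PR's validation notes


class ChatReqIdCache:
    """Hypothetical bounded map of chat_id -> most recent inbound req_id."""

    def __init__(self, max_size: int = DEDUP_MAX_SIZE) -> None:
        self._max_size = max_size
        self._last_chat_req_ids: "OrderedDict[str, str]" = OrderedDict()

    def remember(self, chat_id: str, req_id: str) -> None:
        if not chat_id or not req_id:
            return  # empty values are ignored
        self._last_chat_req_ids[chat_id] = req_id
        self._last_chat_req_ids.move_to_end(chat_id)
        # Evict the oldest entries once the cap is exceeded (assumed policy).
        while len(self._last_chat_req_ids) > self._max_size:
            self._last_chat_req_ids.popitem(last=False)

    def get(self, chat_id: str) -> Optional[str]:
        return self._last_chat_req_ids.get(chat_id)

    def __len__(self) -> int:
        return len(self._last_chat_req_ids)


def handle_inbound(cache: ChatReqIdCache, chat_id: str, req_id: str, sender_allowed: bool) -> bool:
    """Cache the req_id only after the sender has passed the policy check."""
    if not sender_allowed:
        return False
    cache.remember(chat_id, req_id)
    return True
```

Inserting 1100 distinct chats into this sketch leaves 1000 entries, matching the shape of the stress check listed under Validation.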

Why this over #11656 (competing PR)

#11656 (olbwmly-png) tried to solve the same bug by adding a 5-minute watchdog that force-reconnects when no callbacks arrive, plus a fallback-on-600039 path. That's a defensive workaround — if device_id really is the missing piece, each watchdog-driven reconnect would fail the same way. #11572's targeted protocol fix (device_id + markdown + req_id fallback) addresses the root cause. The watchdog idea is valid as defense-in-depth and a follow-up PR adding it on top would be welcome. Closing #11656 with credit.

Changes

| File | Change |
|---|---|
| `gateway/platforms/wecom.py` | `device_id` in `__init__` + subscribe body; `_last_chat_req_ids` cache; `_remember_chat_req_id()` helper (bounded); `_send_reply_stream` → `_send_reply_markdown`; proactive send falls back to cached req_id |
| `tests/gateway/test_wecom.py` | Update `reply_stream` test → `reply_markdown`; new `TestWeComZombieSessionFix` class (9 tests) |

Validation

| Check | Result |
|---|---|
| `tests/gateway/test_wecom.py` | 40/40 passing (was 31, +9 new) |
| `tests/gateway/test_wecom.py + test_wecom_callback.py + test_text_batching.py` | 70/70 passing |
| E2E: distinct `device_id` per adapter, stable within instance | ✓ |
| E2E: `_last_chat_req_ids` bounded at DEDUP_MAX_SIZE=1000 under 1100-insert stress | ✓ |
| E2E: empty values ignored, `_send_reply_stream` fully removed in favor of `_send_reply_markdown` | ✓ |

Merge method

Please rebase-merge (gh pr merge --rebase) — the first commit is authored by Devorun and must keep that authorship. Squash would collapse both commits under the merger.

Closes #11554. Supersedes and will close #11572 (devorun, cherry-picked with authorship preserved) and #11656 (olbwmly-png, watchdog approach — welcome as a follow-up).

Changed files

gateway/platforms/wecom.py (modified, +41/-10)
tests/gateway/test_wecom.py (modified, +195/-4)


Description

The gateway process itself never crashes, and there are no errors in the logs after reconnect — the messages simply stop arriving.

Reproduction Steps

Start hermes gateway with WeCom AI Bot (WebSocket) enabled.
Add the bot to a WeCom group chat.
Send a message in the group → bot receives it (aibot_msg_callback).
Wait for the WebSocket to disconnect (happens every ~5 min in my logs: WeCom websocket closed → Reconnected).
Send another message in the same group → bot never receives it. No aibot_msg_callback is logged.

Expected vs Actual Behavior

Expected: After WebSocket reconnects, group messages should continue to be delivered via aibot_msg_callback.
Actual: After reconnect, only ping payloads are received. Group messages are silently dropped by the WeCom server side (or the subscription is lost).

Environment

Hermes version: latest main (as of 2026-04-17)
Platform: macOS (Apple Silicon)
Python: 3.11
WeCom mode: AI Bot (WebSocket) — gateway/platforms/wecom.py
Config: group_policy: open, no allowlist restrictions

Logs

Before disconnect (messages arrive normally)

[Wecom] Received websocket payload: cmd=aibot_msg_callback req_id=...

After reconnect (only pings, no messages)

[Wecom] WebSocket error: WeCom websocket closed
[Wecom] Reconnected
[Wecom] Received websocket payload: cmd=None req_id=ping-...
[Wecom] Ignoring websocket payload: {'headers': {'req_id': 'ping-...'}, 'errcode': 0, 'errmsg': 'ok'}
# ... repeated pings forever, zero aibot_msg_callback

Additional symptom: 600039 on group replies

When the bot does manage to process a group message (before the disconnect), sending a reply back to the group fails with:

WeCom errcode 600039: device type not support

This may or may not be related, but it only happens for group chats — proactive sends to groups via WeComAdapter.send_text() also hit this error.

Suspected Root Cause

Possible Fix Ideas

After reconnect, explicitly trigger a re-subscription or re-sync with WeCom server.
Add a health-check that verifies aibot_msg_callback is actually being received (not just pings), and force a harder reconnect if no messages arrive within a timeout window.
Log a warning when reconnect happens but no non-ping messages have been seen for N minutes.
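
The second and third ideas depend on telling ping acknowledgements apart from real callbacks. A minimal sketch, based only on the payload shapes visible in the logs above: the function names, the logger name, and the 300-second threshold are assumptions, and treating "no aibot_msg_callback" as the stall signal (rather than "no non-ping payload") is a choice made here.

```python
import logging
from typing import Any, Dict, Iterable

logger = logging.getLogger("wecom.sketch")  # hypothetical logger name


def classify_payload(payload: Dict[str, Any]) -> str:
    """Return "callback", "ping", or "other" for a decoded websocket payload."""
    if payload.get("cmd") == "aibot_msg_callback":
        return "callback"
    req_id = str((payload.get("headers") or {}).get("req_id") or "")
    # Ping acks in the logs carry no cmd and a req_id starting with "ping-".
    if payload.get("cmd") is None and req_id.startswith("ping-"):
        return "ping"
    return "other"


def should_warn(
    kinds_since_reconnect: Iterable[str],
    seconds_since_reconnect: float,
    threshold_seconds: float = 300.0,  # illustrative threshold
) -> bool:
    """Warn when a reconnect is old enough and no callback has arrived since."""
    if seconds_since_reconnect < threshold_seconds:
        return False
    if "callback" in set(kinds_since_reconnect):
        return False
    logger.warning(
        "No aibot_msg_callback for %.0fs since reconnect; only pings or other frames seen",
        seconds_since_reconnect,
    )
    return True
```

The same classification could feed a harder reconnect instead of a warning if the stall persists.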

Related

gateway/platforms/wecom.py
RECONNECT_BACKOFF logic in _listen_loop()

extent analysis

TL;DR

The WeCom AI Bot may need to re-register or re-sync group chat message routing after a WebSocket reconnect to receive inbound group messages.

Guidance

Verify that the aibot_subscribe request is successfully sent after reconnecting and that the WeCom server responds with a success message.
Investigate adding a health-check to ensure aibot_msg_callback messages are being received after a reconnect, and trigger a harder reconnect if no messages arrive within a timeout window.
Review the gateway/platforms/wecom.py code to see if there's a missing step in the reconnect logic to re-associate the new WebSocket connection with existing group chat subscriptions.
Consider logging a warning when a reconnect happens but no non-ping messages have been seen for a certain amount of time to help diagnose the issue.

Example

No code example is provided as the issue requires further investigation into the WeCom API and the gateway/platforms/wecom.py code.

Notes

The issue may be related to the WeCom server not re-associating the new WebSocket connection with existing group chat subscriptions after a reconnect. The errcode 600039 error when sending replies to groups may be a separate issue, but it could be related to the same underlying problem.

Recommendation

Apply a workaround by adding a health-check to verify aibot_msg_callback messages are being received after a reconnect, and trigger a harder reconnect if no messages arrive within a timeout window. This may help mitigate the issue until a more permanent fix can be found.
