hermes - ✅(Solved) Fix DingTalk Stream Mode: reconnection storm causes gateway to hang [4 pull requests, 1 participant]

GitHub stats
NousResearch/hermes-agent#24851 (fetched 2026-05-14 03:51:15)
Comments: 0
Participants: 1
Timeline: 8
Reactions: 0
Timeline (top): cross-referenced ×4, labeled ×4

When the MiniMax stream connection drops (ReadTimeout), the DingTalk Stream Mode connection also disconnects. Hermes enters a reconnection loop, but dingtalk_stream.DingTalkStreamClient.start() throws:

TypeError: 'coroutine' object does not support the asynchronous context manager protocol

This causes a reconnection storm — the gateway logs 315MB of repeated errors without ever recovering, and the DingTalk adapter never successfully reconnects.

Error Message

TypeError: 'coroutine' object does not support the asynchronous context manager protocol

Fix Action

Fixed

PR fix notes

PR #24866: fix(dingtalk): add circuit breaker to prevent infinite reconnect storm

Description (problem / solution / changelog)

Summary

Add a circuit breaker to the DingTalk Stream Mode reconnection loop that stops retries after 5 consecutive failures, preventing infinite error log explosion and CPU waste.

Root Cause

When dingtalk_stream.DingTalkStreamClient.start() throws a persistent error (e.g. TypeError: 'coroutine' object does not support the asynchronous context manager protocol from an incompatible SDK version), the _run_stream() method retries indefinitely every 60 seconds. Each retry logs the same warning, generating hundreds of MB of repeated error logs (315 MB reported) without ever recovering.

Fix

  • Track consecutive failures in _run_stream()
  • After _MAX_RECONNECT_FAILURES (5) consecutive errors, log a CRITICAL message and exit the reconnect loop
  • Mark the adapter as disconnected so the gateway knows the platform is down
  • Reset the failure counter on successful connection (handles transient failures correctly)
  • Requires manual gateway restart to recover from circuit breaker trip
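The bullet list above can be sketched as a small retry loop. This is a minimal illustration, not the code from the PR: `StreamRunner`, its `start` callable, and the `tripped` flag are hypothetical names, and the retry delay is set to 0 so the demo finishes quickly (the PR text mentions 60 seconds).

```python
import asyncio
import logging

logger = logging.getLogger("dingtalk_breaker_sketch")

MAX_RECONNECT_FAILURES = 5  # threshold stated in the PR description


class StreamRunner:
    """Hypothetical stand-in for the adapter, not the real DingTalk class."""

    def __init__(self, start, retry_delay=0.0):
        self._start = start              # async callable that opens the stream
        self._retry_delay = retry_delay  # assumed knob; 0 keeps the demo fast
        self.running = True
        self.tripped = False
        self.attempts = 0

    async def run_stream(self):
        failures = 0
        while self.running:
            self.attempts += 1
            try:
                await self._start()
                failures = 0  # a successful session clears the failure streak
            except asyncio.CancelledError:
                raise  # shutdown is not counted as a failure
            except Exception as exc:
                failures += 1
                logger.warning("stream client error: %s", exc)
                if failures >= MAX_RECONNECT_FAILURES:
                    logger.critical("giving up after %d consecutive failures", failures)
                    self.tripped = True
                    self.running = False  # stays down until a manual restart
                    return
            await asyncio.sleep(self._retry_delay)


async def always_fail():
    # Simulates a persistent SDK error on every start() call.
    raise TypeError("persistent startup error")


runner = StreamRunner(always_fail)
asyncio.run(runner.run_stream())
print(runner.attempts, runner.tripped)  # 5 True
```

Because the counter resets on any clean return from `start`, a few transient errors followed by a success do not trip the breaker in this sketch.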

Regression Coverage

3 new tests in tests/gateway/test_dingtalk.py::TestReconnectCircuitBreaker:

  1. test_circuit_breaker_stops_after_max_failures — verifies loop exits after 5 consecutive TypeError exceptions
  2. test_circuit_breaker_resets_on_success — verifies transient failures (3 errors + 1 success) don't trip the breaker
  3. test_circuit_breaker_does_not_trigger_below_threshold — verifies 4 failures + manual stop doesn't trip the breaker

Testing

scripts/run_tests.sh tests/gateway/test_dingtalk.py::TestReconnectCircuitBreaker -xvs
# 3 passed

9 pre-existing failures in TestCardLifecycle and TestIncomingHandlerProcess (DingTalk SDK mock issues) are unchanged.

Fixes #24851

Changed files

  • gateway/platforms/dingtalk.py (modified, +27/-2)
  • tests/gateway/test_dingtalk.py (modified, +95/-1)

PR #24868: fix(dingtalk): trip reconnect loop after 5 consecutive startup failures (#24851)

Description (problem / solution / changelog)

Summary

  • Adds a consecutive-failure circuit breaker to DingTalkAdapter._run_stream so a permanently-broken SDK/credential/auth doesn't produce a multi-hundred-MB error log via an infinite reconnect storm (#24851).
  • Healthy stream sessions reset the counter, so the breaker only catches the case where start() fails immediately every iteration — transient mid-session disconnects keep reconnecting indefinitely as before.

The bug

#24851 (DingTalk Stream Mode: reconnection storm causes gateway to hang) reports that when dingtalk_stream.DingTalkStreamClient.start() throws synchronously every retry, the existing reconnect loop spins forever:

WARNING gateway.platforms.dingtalk: [Dingtalk] Stream client error: 'coroutine' object does not support the asynchronous context manager protocol
--- (repeats ~6000 times, 315MB total)

The user's gateway accumulated 315 MB of repeated stack traces before they manually restarted. The reporter explicitly asked for a circuit breaker after N consecutive failures.

The existing _run_stream at gateway/platforms/dingtalk.py:277 only sleeps between retries; there's no ceiling on the number of consecutive failures, so any persistent breakage (SDK incompatibility, revoked credentials, network firewall) loops forever.

The fix

  1. New constants MAX_CONSECUTIVE_RECONNECT_FAILURES = 5 and RECONNECT_HEALTHY_THRESHOLD_S = 30.0.
  2. Track time.monotonic() before each start() call. After an exception, compare elapsed time:
    • Elapsed ≥ 30 s → the stream actually ran for a while (genuine mid-session disconnect). Reset both consecutive_failures and backoff_idx so subsequent reconnects start fresh.
    • Elapsed < 30 s → counts toward the failure budget. Once the counter hits MAX_CONSECUTIVE_RECONNECT_FAILURES, log CRITICAL with the last error, set _running = False, call _mark_disconnected(), and exit the loop.
  3. asyncio.CancelledError still exits cleanly without engaging the breaker (existing shutdown semantics preserved).
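The elapsed-time rule in the steps above can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the function signature, the injectable `clock` and `sleep` parameters, the `max_loops` bound, the string return values, and the backoff schedule are all hypothetical. Only the two constants and the 30-second comparison come from the description.

```python
import asyncio
import time

MAX_CONSECUTIVE_RECONNECT_FAILURES = 5   # from the PR description
RECONNECT_HEALTHY_THRESHOLD_S = 30.0     # from the PR description
BACKOFF_S = (2, 5, 10, 30, 60)           # assumed schedule, illustrative only


async def run_stream(start, clock=time.monotonic, sleep=asyncio.sleep, max_loops=None):
    """Retry start() until the breaker trips; returns a status string."""
    consecutive_failures = 0
    backoff_idx = 0
    loops = 0
    while max_loops is None or loops < max_loops:
        loops += 1
        started_at = clock()
        try:
            await start()
        except asyncio.CancelledError:
            return "cancelled"  # shutdown path, breaker not engaged
        except Exception:
            if clock() - started_at >= RECONNECT_HEALTHY_THRESHOLD_S:
                # The stream ran for a while: treat as a mid-session drop.
                consecutive_failures = 0
                backoff_idx = 0
            else:
                consecutive_failures += 1
                if consecutive_failures >= MAX_CONSECUTIVE_RECONNECT_FAILURES:
                    return "tripped"
        await sleep(BACKOFF_S[min(backoff_idx, len(BACKOFF_S) - 1)])
        backoff_idx += 1
    return "stopped"


class FakeClock:
    """Manually advanced clock so the demo does not wait in real time."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


async def no_sleep(_delay):
    return None


clock = FakeClock()


async def fails_immediately():
    raise TypeError("startup failure")


async def fails_after_a_minute():
    clock.now += 60.0  # pretend the session lived for 60 s
    raise ConnectionError("mid-session drop")


print(asyncio.run(run_stream(fails_immediately, clock, no_sleep)))                    # tripped
print(asyncio.run(run_stream(fails_after_a_minute, clock, no_sleep, max_loops=20)))  # stopped
```

Injecting the clock is a design choice for the sketch only; it lets the long-lived case be exercised without waiting 30 real seconds.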

This matches the existing convention in the codebase — see hermes_cli/goals.py (DEFAULT_MAX_CONSECUTIVE_PARSE_FAILURES) and gateway/stream_consumer.py (_MAX_FLOOD_STRIKES) for consecutive-failure breakers with permanent-disable semantics.

Test plan

  • Focused regression test (tests/gateway/test_dingtalk.py::TestReconnectCircuitBreaker, 3 cases):
    • test_trips_after_max_consecutive_startup_failures — every start() raises immediately, breaker trips after exactly MAX_CONSECUTIVE_RECONNECT_FAILURES attempts and _running flips to False.
    • test_long_lived_failure_resets_counter — every start() runs for 2× RECONNECT_HEALTHY_THRESHOLD_S before failing. Loops past the trip threshold without the breaker firing.
    • test_cancelled_error_exits_without_tripping_breaker: CancelledError exits cleanly; _running left untouched.
  • Adjacent suite: full tests/gateway/test_dingtalk.py: 68 passed locally with uv run --with pytest --with pytest-xdist --with pytest-asyncio --with 'alibabacloud-dingtalk>=2.0.0' --with 'dingtalk-stream>=0.20,<1' python3 -m pytest tests/gateway/test_dingtalk.py -v.
  • Regression guard: removing the consecutive_failures >= MAX_CONSECUTIVE_RECONNECT_FAILURES branch would cause test_trips_after_max_consecutive_startup_failures to hang indefinitely — asyncio.sleep is patched to a no-op so the loop has nothing to bound it without the breaker.

Related

  • Fixes #24851

Changed files

  • gateway/platforms/dingtalk.py (modified, +33/-0)
  • tests/gateway/test_dingtalk.py (modified, +132/-0)

PR #24881: fix(gateway): add circuit breaker to DingTalk reconnect loop to prevent log storm (#24851)

Description (problem / solution / changelog)

Summary

Fixes #24851 — DingTalk Stream Mode reconnection loop generates 315 MB of logs and never recovers.

Root Cause

When dingtalk_stream.DingTalkStreamClient.start() raises the same error on every call (in this case TypeError: 'coroutine' object does not support the asynchronous context manager protocol from dingtalk-stream 0.24.3), _run_stream had:

  • No limit on repeated identical WARNING log lines
  • No mechanism to detect a persistent failure beyond the 60 s maximum backoff

This produced ~6,000 identical log entries totaling 315 MB.

Fix

Added a circuit breaker to _run_stream:

  1. Track consecutive_same_error and last_error_type across loop iterations.
  2. After _RECONNECT_CIRCUIT_BREAKER_TRIPS (5) consecutive identical errors:
    • Emit a single ERROR log explaining the circuit-breaker pause.
    • Sleep _RECONNECT_CIRCUIT_BREAKER_DELAY (300 s) instead of 60 s.
    • Suppress further WARNING lines for the same error type until after the pause.
  3. Reset the counter after the pause so the next attempt gets a fresh log line.
  4. Reset backoff_idx after a clean start() return so a recovered connection restarts at backoff[0] (2 s), not backoff[max] (60 s).
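A simplified sketch of the pause-and-resume behaviour listed above, assuming a hypothetical `run_stream` signature with injectable `sleep` and `log` callables and a bounded `max_attempts` so the demo terminates. The 2 s and 60 s backoff endpoints, the trip count, and the 300 s delay come from the description; the intermediate backoff values are assumptions. Because the pause starts as soon as the breaker trips, this version has no separate suppression window.

```python
import asyncio

RECONNECT_CIRCUIT_BREAKER_TRIPS = 5      # from the PR description
RECONNECT_CIRCUIT_BREAKER_DELAY = 300.0  # from the PR description
BACKOFF_S = (2, 5, 10, 30, 60)           # middle values are assumed


async def run_stream(start, sleep, log, max_attempts):
    backoff_idx = 0
    consecutive_same_error = 0
    last_error_type = None
    for _ in range(max_attempts):
        try:
            await start()
        except asyncio.CancelledError:
            return  # shutdown exits immediately
        except Exception as exc:
            if type(exc) is last_error_type:
                consecutive_same_error += 1
            else:
                last_error_type = type(exc)
                consecutive_same_error = 1
            if consecutive_same_error > RECONNECT_CIRCUIT_BREAKER_TRIPS:
                # One ERROR line replaces the stream of identical warnings.
                log("ERROR", f"error repeated {consecutive_same_error} times, pausing")
                await sleep(RECONNECT_CIRCUIT_BREAKER_DELAY)
                consecutive_same_error = 0  # next attempt logs afresh
                last_error_type = None
                continue
            log("WARNING", str(exc))
            await sleep(BACKOFF_S[min(backoff_idx, len(BACKOFF_S) - 1)])
            backoff_idx += 1
        else:
            # Clean return: restart from the shortest backoff.
            backoff_idx = 0
            consecutive_same_error = 0
            last_error_type = None
            await sleep(BACKOFF_S[0])


levels, sleeps = [], []


async def record_sleep(delay):
    sleeps.append(delay)  # record instead of waiting


def record_log(level, message):
    levels.append(level)


async def always_fail():
    raise TypeError("persistent startup error")


asyncio.run(run_stream(always_fail, record_sleep, record_log, max_attempts=6))
print(levels)  # five WARNING entries, then one ERROR
print(sleeps)  # [2, 5, 10, 30, 60, 300.0]
```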

Before / After

Before (315 MB log storm):

WARNING gateway.platforms.dingtalk: [Dingtalk] Stream client error: 'coroutine' object...
WARNING gateway.platforms.dingtalk: [Dingtalk] Stream client error: 'coroutine' object...
... (×6000)

After (5 warnings + 1 circuit-breaker notice + 300 s pause):

WARNING  [Dingtalk] Stream client error: 'coroutine' object...  (×5)
ERROR    [Dingtalk] Stream client error repeated 6 times: ... — entering circuit-breaker pause (300s)
# 300 s silence
WARNING  [Dingtalk] Stream client error: ...  # fresh attempt after pause

Testing

4 new tests in tests/gateway/test_dingtalk_reconnect_circuit_breaker.py:

  • test_circuit_breaker_kicks_in_after_repeated_failures — verifies >=300 s sleep after TRIPS errors
  • test_first_few_errors_are_logged — verifies the WARNING storm is suppressed (at most TRIPS+1 lines)
  • test_backoff_resets_after_clean_connection — verifies backoff_idx resets after clean start()
  • test_cancelled_error_exits_cleanly — regression guard: CancelledError must exit immediately

All 4 pass.

Changed files

  • gateway/platforms/dingtalk.py (modified, +49/-6)
  • tests/gateway/test_dingtalk_reconnect_circuit_breaker.py (added, +166/-0)

PR #25024: fix(dingtalk): monkey-patch DingTalkStreamClient.start for websockets >= 11

Description (problem / solution / changelog)

Summary

dingtalk-stream (<=0.24.3) uses async with websockets.connect(uri) which breaks with websockets >= 11 because websockets.connect() became a coroutine function and must be awaited first.

This PR adds a monkey-patch in gateway/platforms/dingtalk.py that replaces DingTalkStreamClient.start with a fixed version using async with await websockets.connect(uri) — the only change from upstream.

Why not pin websockets?

dingtalk-stream requires websockets >= 11.0.2, but all versions >= 11 have connect() as a coroutine, so pinning alone cannot fix this. A runtime monkey-patch is the only self-contained fix.

How it works

The patch is applied at module import time in the DingTalk adapter. It can be removed cleanly once dingtalk-stream releases a fix upstream.

Testing

  • ✅ Gateway starts without the coroutine TypeError
  • ✅ DingTalk connects and stays connected (tested 40+ min after fix)
  • ✅ Other platforms (Weixin, Feishu, Yuanbao, Webhook) unaffected

Changed files

  • gateway/platforms/dingtalk.py (modified, +207/-997)


DingTalk Stream Mode: reconnection storm causes gateway to hang

Description

When the MiniMax stream connection drops (ReadTimeout), the DingTalk Stream Mode connection also disconnects. Hermes enters a reconnection loop, but dingtalk_stream.DingTalkStreamClient.start() throws:

TypeError: 'coroutine' object does not support the asynchronous context manager protocol

This causes a reconnection storm — the gateway logs 315MB of repeated errors without ever recovering, and the DingTalk adapter never successfully reconnects.

Environment

  • Hermes: 0.13 (upgraded 2026-05-11)
  • dingtalk-stream: 0.24.3
  • Profile: butler
  • Platform: macOS

Steps to Reproduce

  1. Have DingTalk Stream Mode connected and working
  2. Network glitch causes MiniMax stream ReadTimeout (~904s)
  3. DingTalk Stream Mode disconnects
  4. Gateway enters reconnect loop, every attempt fails with the coroutine error
  5. Gateway appears alive but DingTalk is completely unresponsive

Expected Behavior

The reconnect loop should:

  • Either successfully reconnect (current behavior: never succeeds)
  • Or detect repeated failures and stop retrying to avoid log explosion

A circuit breaker after N consecutive failures would prevent the 315MB error log.

Relevant Logs

WARNING gateway.platforms.dingtalk: [Dingtalk] Stream client error: 'coroutine' object does not support the asynchronous context manager protocol
--- (repeats ~6000 times, 315MB total)

Gateway.error.log: 315MB of repeated errors

Potential Fix

In gateway/platforms/dingtalk.py _run_stream():

  • Add a circuit breaker: after N consecutive failures (e.g., 5), stop retrying and mark the adapter as permanently failed until manual restart
  • Log a CRITICAL message when circuit breaker trips
  • Consider adding exponential backoff with a cap that resets on success
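The last suggestion, capped exponential backoff that resets on success, could look like the sketch below. The `Backoff` class and its parameter names are hypothetical; the 2 s base and 60 s cap echo values mentioned in the linked PRs, and the doubling factor is an assumption.

```python
class Backoff:
    """Capped exponential backoff; call reset() after a successful connect."""

    def __init__(self, base=2.0, cap=60.0, factor=2.0):
        self.base = base
        self.cap = cap
        self.factor = factor
        self._attempt = 0

    def next_delay(self):
        # Delay grows geometrically with each attempt but never exceeds the cap.
        delay = min(self.cap, self.base * self.factor ** self._attempt)
        self._attempt += 1
        return delay

    def reset(self):
        self._attempt = 0


backoff = Backoff()
delays = [backoff.next_delay() for _ in range(7)]
print(delays)  # [2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]

backoff.reset()  # e.g. after a healthy session
after_reset = backoff.next_delay()
print(after_reset)  # 2.0
```

On its own, a capped backoff only slows the retries; it still needs a breaker like the ones proposed in the PRs to stop a permanently failing loop.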
