hermes - fix(gateway): Feishu session cancellation orphans session guard, permanently blocking messages


title: "fix(gateway): Feishu session cancellation can orphan session guard, permanently blocking subsequent messages"
labels: ["bug", "gateway", "feishu"]

Description

When a Feishu DM session is cancelled (e.g., by a /stop or /new command from another platform), cancel_session_processing() in gateway/platforms/base.py waits up to 5s for the old task to exit. If the task doesn't exit within 5s (asyncio.TimeoutError):

  1. The task is removed from _session_tasks but continues running (orphaned)
  2. When it eventually finishes, its finally block checks self._session_tasks.get(session_key) is current_task. That check is False because the task was already removed, so _release_session_guard() is never called
  3. _active_sessions[session_key] is never released
  4. All subsequent messages for that session go into _pending_messages but nobody ever processes them
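
The four steps above can be reproduced in isolation with a small asyncio model. This is only a sketch: ToyAdapter, start(), cancel() and the "dm:alice" key are hypothetical stand-ins, and only the attribute names _session_tasks and _active_sessions are taken from the issue. It assumes the task is dropped from _session_tasks when the wait times out.

```python
import asyncio


class ToyAdapter:
    """Hypothetical stand-in for the adapter's session bookkeeping."""

    def __init__(self):
        self._session_tasks = {}    # session_key -> running asyncio.Task
        self._active_sessions = {}  # session_key -> guard marker

    async def _process(self, session_key, unwind_delay):
        try:
            await asyncio.sleep(3600)          # stands in for a long agent turn
        except asyncio.CancelledError:
            await asyncio.sleep(unwind_delay)  # slow unwind after cancellation
        finally:
            # Same identity check as described in step 2: release the guard
            # only if this task is still the registered one.
            if self._session_tasks.get(session_key) is asyncio.current_task():
                self._session_tasks.pop(session_key, None)
                self._active_sessions.pop(session_key, None)

    def start(self, session_key, unwind_delay):
        self._active_sessions[session_key] = True
        task = asyncio.create_task(self._process(session_key, unwind_delay))
        self._session_tasks[session_key] = task
        return task

    async def cancel(self, session_key, timeout):
        task = self._session_tasks.get(session_key)
        if task is None:
            return
        task.cancel()
        try:
            # shield() keeps the timeout from cancelling the task a second time.
            await asyncio.wait_for(asyncio.shield(task), timeout)
        except asyncio.TimeoutError:
            # Unpatched behaviour: forget the task, leave the guard in place.
            self._session_tasks.pop(session_key, None)


async def demo():
    adapter = ToyAdapter()
    task = adapter.start("dm:alice", unwind_delay=0.2)
    await asyncio.sleep(0)                          # let the task start running
    await adapter.cancel("dm:alice", timeout=0.05)  # times out before the unwind ends
    await task                                      # the orphan finishes later
    return "dm:alice" in adapter._active_sessions


guard_leaked = asyncio.run(demo())
print(guard_leaked)
```

In this model the guard is still present after the orphaned task has finished, which matches step 3: nothing is left that would ever clear it.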

Impact

  • Feishu DM stops responding after a /stop or /new from CLI/another platform
  • Session shows input_tokens: 0, output_tokens: 0, total_tokens: 0 (empty shell)
  • Affected user must wait for gateway restart or session timeout to recover
  • Particularly common when multiple platforms access the same session

Reproduction

  1. Send a message to Feishu that starts a long-running task
  2. While it's running, send /stop or /new from CLI
  3. If the old task takes >5s to cancel, the session guard is orphaned
  4. Send a new Feishu message — it's queued but never processed
  5. Check gateway.log: "Cancelled task for ... did not exit within 5s; unblocking dispatch and letting the task unwind in the background"
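
Step 5 can be automated with a short log scan. The sketch below matches only the warning text quoted above; the surrounding log-line layout and the session key format ("dm:alice") are assumptions, and orphaned_sessions() is a hypothetical helper, not part of Hermes.

```python
import re

# Matches the warning emitted by the TimeoutError handler and captures the
# session key that sits between the two fixed phrases.
WARNING_RE = re.compile(
    r"Cancelled task for (?P<session>.+?) did not exit within 5s"
)


def orphaned_sessions(log_lines):
    """Return session keys that hit the 5s cancellation timeout, in first-seen order."""
    found = []
    for line in log_lines:
        match = WARNING_RE.search(line)
        if match and match.group("session") not in found:
            found.append(match.group("session"))
    return found


# Hypothetical log excerpt; real lines will carry whatever prefix the
# gateway's logging configuration adds.
sample = [
    "WARNING [feishu] Cancelled task for dm:alice did not exit within 5s; "
    "unblocking dispatch and letting the task unwind in the background",
    "INFO [feishu] message received",
    "WARNING [feishu] Cancelled task for dm:alice did not exit within 5s; "
    "unblocking dispatch and letting the task unwind in the background",
]
sessions = orphaned_sessions(sample)
print(sessions)
```

To run it against a real file, pass an open file handle, for example orphaned_sessions(open("gateway.log", encoding="utf-8")).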

Root Cause

https://github.com/NousResearch/hermes-agent/blob/v2026.5.28/gateway/platforms/base.py#L3209-L3237

except asyncio.TimeoutError:
    logger.warning(
        "[%s] Cancelled task for %s did not exit within 5s; "
        "unblocking dispatch and letting the task unwind in the background",
        self.name, session_key,
    )
# ← NO cleanup of _active_sessions or _expected_cancelled_tasks here

# ...

if release_guard:
    self._release_session_guard(session_key)
    # ← NO orphan message respawn logic

Two missing pieces:

  1. Cleanup on timeout: _expected_cancelled_tasks.discard(task) + _active_sessions.pop(session_key, None) in the TimeoutError handler
  2. Respawn on release: Check _pending_messages after release_guard and respawn if orphaned messages are queued

Proposed Fix

Part 1: Clean up orphaned references on timeout

except asyncio.TimeoutError:
    logger.warning(
        "[%s] Cancelled task for %s did not exit within 5s; "
        "forcing session release and respawning pending messages",
        self.name, session_key,
    )
    self._expected_cancelled_tasks.discard(task)
    self._active_sessions.pop(session_key, None)

Part 2: Respawn queued messages after guard release

if release_guard:
    # Release the stale guard first so the respawned task can take it.
    self._release_session_guard(session_key)
    # With the guard gone, restart processing for any message that was
    # queued while the session was blocked.
    if session_key in self._pending_messages and session_key not in self._active_sessions:
        pending = self._pending_messages.pop(session_key, None)
        if pending is not None:
            logger.info(
                "[%s] Respawning pending message for orphaned session %s",
                self.name, session_key,
            )
            self._start_session_processing(pending, session_key)
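
The combined effect of Part 1 and Part 2 can be checked in a toy asyncio model. This is a sketch under stated assumptions, not the Hermes implementation: ToyAdapter, dispatch(), the message dict shape and the "dm:alice" key are hypothetical, and only the attribute and method names that appear in the snippets above are borrowed.

```python
import asyncio


class ToyAdapter:
    """Hypothetical model; only the attribute names come from the issue."""

    def __init__(self):
        self._session_tasks = {}
        self._active_sessions = {}
        self._pending_messages = {}
        self.handled = []

    async def _process(self, message, session_key):
        try:
            await asyncio.sleep(message["work"])
            self.handled.append(message["text"])
        except asyncio.CancelledError:
            await asyncio.sleep(message["unwind"])  # slow unwind after cancel
        finally:
            # Only the currently registered task may release the guard.
            if self._session_tasks.get(session_key) is asyncio.current_task():
                self._session_tasks.pop(session_key, None)
                self._active_sessions.pop(session_key, None)

    def _start_session_processing(self, message, session_key):
        self._active_sessions[session_key] = True
        task = asyncio.create_task(self._process(message, session_key))
        self._session_tasks[session_key] = task
        return task

    def dispatch(self, message, session_key):
        if session_key in self._active_sessions:
            self._pending_messages[session_key] = message  # queue behind the guard
            return None
        return self._start_session_processing(message, session_key)

    async def cancel_session_processing(self, session_key, timeout):
        task = self._session_tasks.get(session_key)
        if task is None:
            return None
        task.cancel()
        try:
            await asyncio.wait_for(asyncio.shield(task), timeout)
        except asyncio.TimeoutError:
            # Part 1: forget the orphan and drop its guard immediately.
            self._session_tasks.pop(session_key, None)
            self._active_sessions.pop(session_key, None)
        # Part 2: with the guard gone, restart any queued message.
        if session_key not in self._active_sessions:
            pending = self._pending_messages.pop(session_key, None)
            if pending is not None:
                return self._start_session_processing(pending, session_key)
        return None


async def demo():
    adapter = ToyAdapter()
    key = "dm:alice"
    old = adapter.dispatch({"text": "long job", "work": 3600, "unwind": 0.2}, key)
    await asyncio.sleep(0)  # let the long job start running
    adapter.dispatch({"text": "hello again", "work": 0, "unwind": 0}, key)
    respawned = await adapter.cancel_session_processing(key, timeout=0.05)
    await respawned  # the queued message is processed
    await old        # the orphan unwinds without touching the new guard
    return adapter.handled, key in adapter._active_sessions


handled, guard_held = asyncio.run(demo())
print(handled, guard_held)
```

In this model the queued message is handled even though the old task overruns the timeout, and the orphan's late finally block leaves the new session's state alone because the identity check no longer matches it.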

Related Issues

Issue 2a: Feishu approval card unauthorized in DM

In gateway/platforms/feishu.py, approval card button clicks (Always Approve / Deny) are checked via _allow_group_message():

if not self._allow_group_message(sender_id, state.get("chat_id", ""), is_bot=False):
    logger.warning("[Feishu] Unauthorized approval click by %s", open_id)
    return

This function is designed for group chat access control. In DM context, when FEISHU_GROUP_POLICY defaults to "allowlist" and FEISHU_ALLOWED_USERS is unset, every user is blocked from approving. The fix is to use a DM-aware authorization check that defaults to permissive in DM:

def _is_interactive_operator_authorized(self, open_id: str) -> bool:
    admins = self.config.get("admins", [])
    allowed = self.config.get("allowed_users", [])
    if not admins and not allowed:
        return True  # No restrictions configured → allow everyone
    return open_id in admins or open_id in allowed
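
The decision table of that check is easy to verify as a free function. The sketch below assumes the config is a plain dict with optional "admins" and "allowed_users" lists, as the method above suggests; the open_id values are made up for illustration.

```python
def is_interactive_operator_authorized(config: dict, open_id: str) -> bool:
    """Standalone version of the proposed check, for illustration only."""
    admins = config.get("admins") or []
    allowed = config.get("allowed_users") or []
    if not admins and not allowed:
        return True  # nothing configured, so nobody is excluded
    return open_id in admins or open_id in allowed


unrestricted = {}
restricted = {"admins": ["ou_admin"], "allowed_users": ["ou_alice"]}

results = [
    is_interactive_operator_authorized(unrestricted, "ou_anyone"),  # no lists set
    is_interactive_operator_authorized(restricted, "ou_admin"),     # listed admin
    is_interactive_operator_authorized(restricted, "ou_alice"),     # listed user
    is_interactive_operator_authorized(restricted, "ou_mallory"),   # not listed
]
print(results)
```

The permissive default only applies when both lists are empty; as soon as either is configured, unlisted users are rejected.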

Issue 2b: Feishu WebSocket lacks ping/heartbeat

The Feishu platform adapter does not configure WebSocket ping intervals. On unstable networks, the WebSocket connection can enter a zombie state — TCP shows ESTABLISHED but data stops flowing. Messages sent by the user are silently lost. The fix is to add ws_ping_interval and ws_ping_timeout as configurable env vars:

# In gateway/config.py _apply_env_overrides(), add to Feishu extra dict:
config.platforms[Platform.FEISHU].extra["ws_ping_interval"] = int(os.getenv("FEISHU_WS_PING_INTERVAL", "0")) or None
config.platforms[Platform.FEISHU].extra["ws_ping_timeout"] = int(os.getenv("FEISHU_WS_PING_TIMEOUT", "0")) or None

Together with FEISHU_WS_PING_INTERVAL=30 and FEISHU_WS_PING_TIMEOUT=10 in .env, this keeps the WebSocket alive.
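
The int(os.getenv(...)) or None idiom maps an unset variable or 0 to None, but raises ValueError on a non-numeric value. A slightly more defensive parse could look like the sketch below; optional_positive_int() is a hypothetical helper, and treating malformed or negative values as unset is an assumption, not something the issue specifies.

```python
import os
from typing import Mapping, Optional


def optional_positive_int(name: str, env: Mapping[str, str] = os.environ) -> Optional[int]:
    """Return the variable as a positive int, or None if unset, empty, invalid, or <= 0."""
    raw = env.get(name, "").strip()
    if not raw:
        return None
    try:
        value = int(raw)
    except ValueError:
        return None  # assumption: a malformed value disables the setting
    return value if value > 0 else None


# A plain dict stands in for the process environment here.
fake_env = {
    "FEISHU_WS_PING_INTERVAL": "30",
    "FEISHU_WS_PING_TIMEOUT": "abc",
}
interval = optional_positive_int("FEISHU_WS_PING_INTERVAL", fake_env)
timeout = optional_positive_int("FEISHU_WS_PING_TIMEOUT", fake_env)
print(interval, timeout)
```

Whether a bad value should be ignored or should fail loudly at startup is a design choice; failing loudly makes misconfiguration easier to spot.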

Environment

  • Hermes Agent v2026.5.28 (present in all versions)
  • Feishu/Lark platform with WebSocket connection mode
  • Multi-platform usage (Feishu + CLI)
