hermes - ✅(Solved) Fix [Bug]: Slack gateway can remain 'running' while Socket Mode is dead, requiring manual restart [1 pull requests, 2 comments, 2 participants]

GitHub stats
NousResearch/hermes-agent#14326
Fetched 2026-04-24 06:17:47
Comments: 2
Participants: 2
Timeline: 7
Reactions: 0
Timeline (top): labeled ×4, commented ×2, cross-referenced ×1

Error Message

2026-04-23 04:26:45,181 ERROR slack_bolt.AsyncApp: Failed to connect (error: ); Retrying...
...
TimeoutError
2026-04-23 04:26:55,222 INFO slack_bolt.AsyncApp: The old session (s_282243233) has been abandoned
2026-04-23 04:26:55,729 INFO slack_bolt.AsyncApp: A new session (s_289885077) has been established

Root Cause

From debugging logs and the current code, the likely root cause is that SlackAdapter.connect() marks Slack as connected before Socket Mode has actually established a live WebSocket session, and failures in the background Socket Mode task are not propagated back into the gateway's fatal-error / reconnect logic.

Fix Action

Fixed

PR fix notes

PR #14377: fix(slack): surface Socket Mode task failures

Description (problem / solution / changelog)

Root cause

Slack connect() marked the adapter running immediately after spawning the Socket Mode background task. If that task failed or exited later, Hermes did not promote the failure into the gateway fatal/reconnect path, leaving Slack dead while the gateway process still looked healthy.

Closes #14326.

Fix

  • Attach a done-callback to the Socket Mode background task.
  • If the task exits while Slack is still marked running, set a retryable fatal error and notify the gateway runner.
  • Mark Slack not running before normal disconnect cleanup so expected task completion/cancellation does not create a false fatal state.
  • Cancel and clear the Socket Mode task during disconnect.
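
The four steps above can be sketched with plain asyncio. This is a minimal illustration, not the code from PR #14377: `SketchAdapter`, its `fatal_error` attribute, and the `start_coro` argument are hypothetical stand-ins for the real adapter, the gateway's fatal-error hooks, and `self._handler.start_async()`.

```python
import asyncio


class SketchAdapter:
    """Hypothetical stand-in for the Slack adapter (not the Hermes class)."""

    def __init__(self):
        self._running = False
        self._socket_mode_task = None
        self.fatal_error = None  # hypothetical: (message, retryable)

    async def connect(self, start_coro):
        # Spawn the background task and watch it with a done-callback.
        self._socket_mode_task = asyncio.create_task(start_coro)
        self._socket_mode_task.add_done_callback(self._on_task_done)
        self._running = True
        return True

    def _on_task_done(self, task):
        # After disconnect() the adapter is no longer running, so an
        # expected cancellation or completion is ignored here.
        if not self._running:
            return
        if task.cancelled():
            reason = "was cancelled"
        elif task.exception() is not None:
            reason = f"failed: {task.exception()!r}"
        else:
            reason = "exited unexpectedly"
        self._running = False
        # Stand-in for setting a retryable fatal error and notifying the runner.
        self.fatal_error = (f"Socket Mode task {reason}", True)

    async def disconnect(self):
        # Mark not running first so the callback does not raise a false alarm.
        self._running = False
        task, self._socket_mode_task = self._socket_mode_task, None
        if task is not None and not task.done():
            task.cancel()
            try:
                await task
            except asyncio.CancelledError:
                pass


async def demo_failure():
    async def failing_start():
        raise TimeoutError("simulated connect timeout")

    adapter = SketchAdapter()
    await adapter.connect(failing_start())
    await asyncio.sleep(0.01)  # let the task fail and the callback run
    return adapter


async def demo_clean_disconnect():
    async def long_running_start():
        await asyncio.sleep(3600)

    adapter = SketchAdapter()
    await adapter.connect(long_running_start())
    await adapter.disconnect()
    await asyncio.sleep(0.01)  # let the callback run
    return adapter
```

In `demo_failure()` the callback records a retryable error because the adapter was still marked running. In `demo_clean_disconnect()` the same callback fires but does nothing, which is the ordering the third bullet describes.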

Tests

  • uv run --frozen --python 3.11 --extra dev pytest -o addopts= tests/gateway/test_slack.py::TestAppMentionHandler::test_connect_monitors_socket_mode_task tests/gateway/test_slack.py::TestAppMentionHandler::test_socket_mode_task_failure_marks_retryable_fatal tests/gateway/test_slack.py::TestAppMentionHandler::test_socket_mode_task_completion_marks_retryable_fatal tests/gateway/test_slack.py::TestAppMentionHandler::test_socket_mode_task_done_ignored_after_disconnect -q -> 4 passed
  • uv run --frozen --python 3.11 --extra dev pytest -o addopts= tests/gateway/test_slack.py tests/gateway/test_platform_reconnect.py -q -> 149 passed
  • git diff --check

Changed files

  • gateway/platforms/slack.py (modified, +30/-1)
  • tests/gateway/test_slack.py (modified, +94/-0)


Bug Description

Slack gateway can become completely unresponsive while the Hermes gateway process still appears healthy/running. In this state, Slack messages receive no response until hermes gateway restart is performed.

From debugging logs and the current code, the likely root cause is that SlackAdapter.connect() marks Slack as connected before Socket Mode has actually established a live WebSocket session, and failures in the background Socket Mode task are not propagated back into the gateway's fatal-error / reconnect logic.

This leaves a bad state where:

  • process is still running
  • gateway status still looks healthy
  • Slack Socket Mode is dead / stuck reconnecting internally
  • inbound Slack messages are silently ignored
  • manual restart recovers it

Symptoms Observed

  • Slack messages sent to Hermes got no response in any DM/thread window
  • hermes gateway status still reported the gateway service as running
  • restarting the gateway restored Slack responsiveness

Relevant Logs / Traceback

The most relevant evidence for the unresponsive window was around 2026-04-23 04:24 KST.

Observed in ~/.hermes/logs/agent.log:

2026-04-23 04:20:57,213 INFO slack_bolt.AsyncApp: The session (s_289228377) seems to be stale. Reconnecting... reason: disconnected for 51+ seconds)
2026-04-23 04:20:57,814 INFO slack_bolt.AsyncApp: The old session (s_289228377) has been abandoned
2026-04-23 04:20:58,284 INFO slack_bolt.AsyncApp: A new session (s_282243233) has been established
2026-04-23 04:21:38,291 INFO slack_bolt.AsyncApp: The session (s_282243233) seems to be already closed. Reconnecting...
2026-04-23 04:21:45,153 INFO slack_bolt.AsyncApp: The old session (s_282243233) has been abandoned

Then there is a gap with no successful new session establishment while Slack was observed to be unresponsive. The next relevant lines are:

2026-04-23 04:26:45,181 ERROR slack_bolt.AsyncApp: Failed to connect (error: ); Retrying...
...
TimeoutError
2026-04-23 04:26:55,222 INFO slack_bolt.AsyncApp: The old session (s_282243233) has been abandoned
2026-04-23 04:26:55,729 INFO slack_bolt.AsyncApp: A new session (s_289885077) has been established

So there appears to have been a real Slack Socket Mode disruption from roughly 04:21:45 until 04:26:55, and the user-observed unresponsive time (04:24 KST) fell inside that gap.
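
The length of that gap can be checked directly from the timestamps quoted above. The helper name `parse_log_time` is illustrative only; the format string matches the `YYYY-MM-DD HH:MM:SS,mmm` prefix seen in the log lines.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_time(line):
    """Parse the leading 23-character timestamp of a log line."""
    return datetime.strptime(line[:23], LOG_TIME_FORMAT)


# Last line before the gap and first successful session after it.
abandoned = parse_log_time(
    "2026-04-23 04:21:45,153 INFO slack_bolt.AsyncApp: The old session "
    "(s_282243233) has been abandoned"
)
established = parse_log_time(
    "2026-04-23 04:26:55,729 INFO slack_bolt.AsyncApp: A new session "
    "(s_289885077) has been established"
)

gap_seconds = (established - abandoned).total_seconds()

# The user-observed unresponsive moment, at minute precision.
observed = datetime(2026, 4, 23, 4, 24)
observed_inside_gap = abandoned < observed < established
```

Here `gap_seconds` comes out at about 310.6 seconds, a little over five minutes, and `observed_inside_gap` is true.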

This issue is not about a previously observed invalid_auth event. That was from a separate later incident caused by intentionally incorrect credentials and should not be treated as the direct cause of this report.

Current Code Path

Current gateway/platforms/slack.py still does this:

# Start Socket Mode handler in background
self._handler = AsyncSocketModeHandler(self._app, app_token)
self._socket_mode_task = asyncio.create_task(self._handler.start_async())

self._running = True
logger.info("[Slack] Socket Mode connected (%d workspace(s))", len(self._team_clients))
return True

This means the adapter returns success immediately after spawning start_async(), before actual Socket Mode establishment has been confirmed.

Why This Is Problematic

start_async() eventually calls Slack SDK socket-mode connect logic, but that work happens in a background task. If that task fails, retries for a while, or ends up in a stale/closed/reconnect loop, Hermes has already:

  • marked the adapter as connected
  • returned success from connect()
  • allowed the gateway to continue as if Slack were healthy

I could not find code that:

  • waits for Socket Mode establishment before setting _running = True
  • attaches a done-callback to _socket_mode_task
  • converts background Socket Mode task failure into _set_fatal_error(...) / _notify_fatal_error()
  • downgrades platform state when the Slack background task dies after startup

So the gateway can enter a "running but Slack is dead" zombie state.
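
The failure mode can be reproduced without Slack at all. In this sketch `failing_start` stands in for a `start_async()` call that fails after startup, and the `state` dict stands in for the adapter's attributes; both are hypothetical.

```python
import asyncio


async def failing_start():
    # Simulates a background Socket Mode task that dies shortly after spawn.
    await asyncio.sleep(0.01)
    raise TimeoutError("simulated Socket Mode failure")


async def naive_connect(state):
    # Mirrors the pattern in the issue: spawn, mark running, return success.
    state["task"] = asyncio.create_task(failing_start())
    state["running"] = True
    return True


async def demo():
    state = {}
    ok = await naive_connect(state)
    await asyncio.sleep(0.05)  # the background task has failed by now
    task = state["task"]
    # Nothing updated state["running"]; the exception only lives on the task.
    return ok, state["running"], task.done(), type(task.exception()).__name__
```

`asyncio.run(demo())` returns `(True, True, True, "TimeoutError")`: `connect` reported success, the running flag is still set, and the task has already finished with an exception that nothing inspected.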

Expected Behavior

One of the following should happen:

  1. SlackAdapter.connect() should only return success after Socket Mode is actually established, or
  2. background Socket Mode task failure should be promoted into the gateway fatal-error / reconnect path

In either case, Hermes should not remain in a state where the process looks healthy while Slack is functionally dead.
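
Option 1 might look roughly like the following. It assumes some coroutine that returns once the session is live; `establish` is a hypothetical parameter, and the 10-second default timeout is an arbitrary choice rather than a value from Hermes or the Slack SDK.

```python
import asyncio


async def connect_with_confirmation(establish, timeout=10.0):
    """Return (True, None) only after `establish()` completes in time.

    `establish` is a hypothetical coroutine function that returns once the
    Socket Mode session is live and raises (or hangs) otherwise.
    """
    try:
        await asyncio.wait_for(establish(), timeout=timeout)
    except (asyncio.TimeoutError, OSError) as exc:
        return False, f"Socket Mode not established: {exc!r}"
    return True, None


async def demo():
    async def quick():
        await asyncio.sleep(0.01)

    async def stuck():
        await asyncio.sleep(3600)

    fast = await connect_with_confirmation(quick, timeout=1.0)
    slow = await connect_with_confirmation(stuck, timeout=0.05)
    return fast, slow
```

This covers only the initial connection. A session that is established and later dies still needs option 2, which is the path the linked PR took.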

Actual Behavior

  • Slack becomes unresponsive
  • gateway process remains running
  • gateway status still looks fine
  • no automatic self-recovery occurs
  • manual hermes gateway restart is needed

Affected Component

  • Gateway (Telegram/Discord/Slack/WhatsApp)

Messaging Platform (if gateway-related)

  • Slack

Operating System

macOS

Python Version

3.11.11

Hermes Version

Hermes Agent v0.10.0 (2026.4.16)

Root Cause Analysis (optional)

Likely root cause:

  • gateway/platforms/slack.py treats asyncio.create_task(self._handler.start_async()) as a successful connection
  • Slack SDK connection/auth/websocket failures happen later inside the background task
  • those failures are not surfaced into Hermes gateway runtime health / reconnect handling

Proposed Fix (optional)

Possible fixes:

  1. Do not mark Slack connected until actual Socket Mode establishment is confirmed
  2. Add background task monitoring for _socket_mode_task
  3. If _socket_mode_task fails/exits unexpectedly, call _set_fatal_error(...) + _notify_fatal_error() so gateway reconnect / restart logic can take over
  4. Optionally add a Slack adapter health check based on the underlying Socket Mode client session/task state
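
For item 4, a task-state check could be as small as the sketch below. The function name `socket_mode_health` and its return strings are hypothetical. It inspects only the asyncio task, so a task that is alive but stuck reconnecting would still report healthy; a session-level signal from the Socket Mode client would be needed to catch that case, and this sketch does not assume one.

```python
import asyncio


def socket_mode_health(task, running):
    """Classify adapter health from the background task's state."""
    if not running:
        return "stopped"
    if task is None:
        return "dead: no task"
    if task.done():
        if task.cancelled():
            return "dead: task cancelled"
        exc = task.exception()
        return f"dead: {exc!r}" if exc is not None else "dead: task exited"
    return "healthy"


async def demo():
    async def forever():
        await asyncio.sleep(3600)

    async def boom():
        raise ConnectionError("simulated")

    live = asyncio.create_task(forever())
    dead = asyncio.create_task(boom())
    await asyncio.sleep(0.01)  # let the failing task finish

    results = (
        socket_mode_health(live, True),
        socket_mode_health(dead, True),
        socket_mode_health(live, False),
        socket_mode_health(None, True),
    )

    # Clean up the long-running task before the loop closes.
    live.cancel()
    try:
        await live
    except asyncio.CancelledError:
        pass
    return results
```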

Related Issues

Potentially related, but not duplicates:

  • #5499 — Feishu gateway intermittently drops inbound messages, delays reconnects, and may hang on shutdown
  • #11163 — Gateway silently drops messages when WebSocket platform adapter is briefly disconnected

Are you willing to submit a PR for this?

  • I'd like to fix this myself and submit a PR

extent analysis

TL;DR

The most likely fix is to modify the SlackAdapter.connect() method to wait for the Socket Mode establishment before returning success.

Guidance

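  • Review the gateway/platforms/slack.py file and update the connect() method to wait for Socket Mode establishment before setting _running = True, for example by awaiting the handler's connect_async() rather than start_async(), which does not return while the connection is alive.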
  • Consider adding a try-except block to catch any exceptions raised by the start_async() task and handle them accordingly.
  • Implement a mechanism to monitor the _socket_mode_task and trigger a reconnect or fatal error handling if it fails or exits unexpectedly.
  • Add logging to track the state of the Socket Mode connection and any errors that occur during the connection process.

Example

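async def connect(self):
    # Fragment of the adapter method; app_token is assumed to be resolved
    # earlier in connect() (elided here).
    self._handler = AsyncSocketModeHandler(self._app, app_token)
    try:
        # start_async() connects and then sleeps indefinitely, so awaiting it
        # here would block connect(). connect_async() returns once the
        # Socket Mode session has been established.
        await self._handler.connect_async()
    except Exception as e:
        # Handle exception and trigger reconnect or fatal error handling
        logger.error("[Slack] Error connecting to Socket Mode: %s", e)
        self._set_fatal_error(...)
        self._notify_fatal_error()
        return False
    # Only mark the adapter running after the connection is confirmed.
    # Failures after this point still need separate task monitoring.
    self._running = True
    logger.info("[Slack] Socket Mode connected (%d workspace(s))", len(self._team_clients))
    return True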

Notes

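Note that start_async() returns an awaitable that does not complete while the connection is alive: in slack_bolt it connects and then sleeps indefinitely. Awaiting it directly inside connect() would therefore block, so waiting for establishment needs a call that returns once the session is up, such as the handler's connect_async(), or a background task combined with monitoring as in PR #14377.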

Recommendation

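Modify SlackAdapter.connect() to wait for Socket Mode establishment before returning success, so that a failed initial connection is reported immediately. This alone does not cover a session that dies after startup, which is the case the task monitoring in PR #14377 addresses; both are needed to keep the gateway out of the "running but Slack is dead" zombie state.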
