hermes - ✅(Solved) Fix gateway restart: race condition causes Weixin token conflict [1 pull request, 1 comment, 2 participants]

GitHub stats

NousResearch/hermes-agent#17198 (fetched 2026-04-29 06:36:47)

  • Comments: 1
  • Participants: 2
  • Timeline: 6
  • Reactions: 0
  • Timeline (top): labeled ×4, commented ×1, cross-referenced ×1

Error Message

Error Log

2026-04-29 08:26:44,359 ERROR gateway.platforms.base: [Weixin] Weixin bot token already in use (PID 30033). Stop the other gateway first.
2026-04-29 08:26:44,360 ERROR gateway.run: Gateway hit a non-retryable startup conflict: weixin: Weixin bot token already in use (PID 30033). Stop the other gateway first.
2026-04-29 08:26:44,361 ERROR gateway.run: Gateway exiting cleanly

Root Cause

The restart command kills the old process via SIGTERM and immediately starts a new one, without waiting for the old process to fully exit. The old process takes ~18 seconds to release the Weixin token after receiving SIGTERM, creating a race condition window.

Fix Action

Workaround

hermes gateway stop
sleep 3
hermes gateway start

PR fix notes

PR #17292: fix: use configured drain timeout for gateway restart wait (#17198)

Description (problem / solution / changelog)

Problem

The hermes gateway restart command hardcoded a 10-second timeout for _wait_for_gateway_exit(), but platform adapters like Weixin can take 18+ seconds to release their tokens during graceful shutdown. This causes a race condition where the new gateway fails to start because the old process's platform lock is still held.

Fix

Now uses the configured restart_drain_timeout (default 60s) with a minimum of 20s for the wait timeout, and 50% of the drain timeout (capped at 10s) for the force-kill threshold. This matches the existing drain timeout mechanism used by systemd/launchd restarts.

Before:

_wait_for_gateway_exit(timeout=10.0, force_after=5.0)

After:

_drain = _get_restart_drain_timeout()
_wait_for_gateway_exit(timeout=max(_drain, 20.0), force_after=min(_drain * 0.5, 10.0))

Before vs After

| Scenario | Before | After |
| --- | --- | --- |
| Weixin disconnect takes 18s | Timeout at 10s, new gateway fails | Waits up to 60s (configurable), new gateway succeeds |
| Quick disconnect (2s) | Works (10s > 2s) | Works (20s > 2s) |
| Process hung (needs force-kill) | SIGKILL at 5s | SIGKILL at 10s (or 50% of drain timeout) |
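
The timeout arithmetic in the fix can be checked with a small standalone sketch. `restart_wait_params` is a hypothetical helper name used only for illustration (it is not part of hermes_cli); the two expressions are copied from the PR's "After" snippet, and the sample drain values are assumptions.

```python
def restart_wait_params(drain_timeout: float) -> tuple[float, float]:
    """Return (timeout, force_after) as computed in the PR's "After" snippet.

    Hypothetical helper for illustration only; not part of hermes_cli.
    """
    timeout = max(drain_timeout, 20.0)            # wait at least 20s for the old gateway
    force_after = min(drain_timeout * 0.5, 10.0)  # force-kill threshold, capped at 10s
    return timeout, force_after


if __name__ == "__main__":
    # Sample drain timeouts (assumed values): the 60s default plus two shorter ones.
    for drain in (60.0, 30.0, 10.0):
        timeout, force_after = restart_wait_params(drain)
        print(f"drain={drain}s -> timeout={timeout}s, force_after={force_after}s")
```

With the default 60s drain timeout this gives a 60s wait and a 10s force-kill threshold. A 10s drain timeout gives a 20s wait and a 5s threshold, so the "50% of drain timeout" case only applies when the configured value is below 20s.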

Fixes #17198

Changed files

  • hermes_cli/gateway.py (modified, +4/-2)

Bug Description

hermes gateway restart fails when the old gateway process still holds the Weixin (WeChat) bot token while the new process tries to claim it.

Error Log

2026-04-29 08:26:25,601 INFO gateway.run: Received SIGTERM/SIGINT — initiating shutdown
2026-04-29 08:26:25,665 WARNING gateway.run: Shutdown diagnostic — other hermes processes running
2026-04-29 08:26:44,134 INFO gateway.run: Starting Hermes Gateway...
2026-04-29 08:26:44,359 ERROR gateway.platforms.base: [Weixin] Weixin bot token already in use (PID 30033). Stop the other gateway first.
2026-04-29 08:26:44,360 ERROR gateway.run: Gateway hit a non-retryable startup conflict: weixin: Weixin bot token already in use (PID 30033). Stop the other gateway first.
2026-04-29 08:26:44,361 ERROR gateway.run: Gateway exiting cleanly
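
The size of the race window can be read directly off the timestamps in this log. The sketch below is only a convenience for doing that subtraction; `seconds_between` is a hypothetical helper, and the format string assumes the `YYYY-MM-DD HH:MM:SS,mmm` layout seen in the log lines above.

```python
from datetime import datetime

# Timestamp layout used in the log excerpt (comma before the milliseconds).
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def seconds_between(start: str, end: str) -> float:
    """Return the elapsed seconds between two log timestamps."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return (t1 - t0).total_seconds()


if __name__ == "__main__":
    sigterm_at = "2026-04-29 08:26:25,601"   # old gateway logs "Received SIGTERM/SIGINT"
    conflict_at = "2026-04-29 08:26:44,359"  # new gateway logs "token already in use"
    print(f"{seconds_between(sigterm_at, conflict_at):.3f}s")
```

For this log the gap is 18.758 seconds, meaning the old process (PID 30033) still held the Weixin token almost 19 seconds after it received SIGTERM.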

Steps to Reproduce

  1. Start Hermes gateway with Weixin platform enabled
  2. Run hermes gateway restart
  3. Observe: new gateway fails to start because old PID (30033) hasn't released the Weixin token yet
  4. Manual workaround: hermes gateway stop && sleep 3 && hermes gateway start (succeeds)

Environment

  • OS: macOS (Apple Silicon)
  • Hermes version: latest (as of 2026-04-29)
  • Platform: Weixin (WeChat) via iLink Bot API
  • Architecture: gateway running as foreground process (not systemd service)

Suggested Fix

Add a wait / waitpid step between killing the old gateway and starting the new one. The restart should:

  1. Send SIGTERM to old process
  2. waitpid() for old process to exit (with timeout)
  3. Only then start the new gateway process

This would eliminate the race condition entirely.
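
The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code merged in the PR (which calls `_wait_for_gateway_exit`): `pid_alive` and `terminate_and_wait` are hypothetical names, and the sketch polls with signal 0 rather than calling `os.waitpid()`, because `waitpid` only works on the caller's own child processes and a restart command usually did not spawn the gateway it is stopping. POSIX only.

```python
import os
import signal
import time


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check without delivering a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def terminate_and_wait(pid: int, timeout: float, poll_interval: float = 0.2) -> bool:
    """Send SIGTERM, then poll until the process exits or `timeout` seconds elapse.

    Returns True if the process is gone, False if it is still running.
    """
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return True  # already exited
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not pid_alive(pid):
            return True
        time.sleep(poll_interval)
    return not pid_alive(pid)
```

A caller would start the new gateway only when `terminate_and_wait(old_pid, timeout)` returns True. One caveat: if the old process is an unreaped child of the caller, it remains visible as a zombie and `pid_alive` keeps returning True until it is reaped.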

Extended analysis

TL;DR

Implement a wait or waitpid step in the hermes gateway restart command to ensure the old process fully exits before starting a new one.

Guidance

  • Modify the hermes gateway restart command to send SIGTERM to the old process and then wait for it to exit using waitpid() with a suitable timeout.
  • Verify the fix by checking the gateway logs for successful restarts without Weixin token conflicts.
  • Consider adding a retry mechanism to handle cases where the old process takes longer than expected to exit.
  • Review the waitpid() timeout value to balance between waiting long enough for the old process to exit and not introducing unnecessary delays.
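
The retry idea in the guidance above could look like the sketch below. It is an assumption-laden illustration: `start_with_retry` and the `start` callable are hypothetical, and since the gateway itself logs the token conflict as non-retryable, any such retry would have to live in the calling command rather than inside the gateway.

```python
import time
from typing import Callable


def start_with_retry(start: Callable[[], bool], attempts: int = 5, delay: float = 2.0) -> bool:
    """Call `start` until it returns True or `attempts` run out.

    `start` is a hypothetical callable that returns True when the gateway
    started successfully and False when it hit a token conflict.
    """
    for attempt in range(1, attempts + 1):
        if start():
            return True
        if attempt < attempts:
            time.sleep(delay)  # give the old process time to release the token
    return False
```

Retrying treats the symptom rather than the cause, so it fits best as a fallback behind the wait step, not as a replacement for it.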

Example

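# Sketch of the modified restart command (bash; OLD_PID and TIMEOUT are set by the caller)
hermes_gateway_restart() {
  # Send SIGTERM to the old process
  kill -TERM "$OLD_PID"

  # Poll until the old process exits or TIMEOUT seconds elapse
  # (the shell's `wait` builtin only works for child processes)
  local waited=0
  while kill -0 "$OLD_PID" 2>/dev/null; do
    if [ "$waited" -ge "$TIMEOUT" ]; then
      echo "gateway (PID $OLD_PID) still running after ${TIMEOUT}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done

  # Start the new gateway process only after the old one is gone
  hermes gateway start
}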

Notes

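The suggested fix assumes that waitpid() is available and suitable for this use case. Note that waitpid() can only wait on child processes of the caller and has no built-in timeout, so if the restart command did not spawn the old gateway, it would need to poll for the process's exit instead. The timeout value may also require experimentation to balance reliability against restart latency.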

Recommendation

Apply the suggested fix by modifying the hermes gateway restart command to include a wait or waitpid step, as this directly addresses the root cause of the issue and eliminates the race condition.
