hermes - [Bug]: A platform connection failure sends the Gateway into a restart loop, and all cron jobs silently stop firing

Bug Description

When a messaging platform (e.g. Discord) fails to connect because its token is invalid or missing, the Gateway crashes repeatedly: systemd restarts it, it crashes again, and the result is a restart loop. Each restart kills the cron ticker thread, so no cron job ever fires.


Steps to Reproduce

  1. Configure a messaging platform (e.g. Discord) with a valid token.
  2. Start the gateway and let it run.
  3. Delete that platform's token from .env.
  4. Watch the gateway log: the platform connection fails, the gateway crashes, and systemd restarts it automatically.

# Environment: LXC container, systemd service, --run-as-user root

# After the Discord token was deleted
15:22  Gateway stopped → systemd restarts (counter=1)
15:38  Gateway stopped → discord: 401 Unauthorized → systemd restarts (counter=2)
15:47  Gateway stopped → discord: 401 → systemd restarts (counter=3)
15:52  Gateway stopped → discord: 401 → systemd restarts
15:59  Gateway stopped → systemd restarts (counter=4)
16:14  Gateway stopped → systemd restarts
       ↑ crashed 6+ times within 1 hour!

Expected Behavior

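  • A connection failure on a single platform should not crash the whole gateway process.
  • The failed platform should degrade gracefully (be marked as disconnected) without affecting the other platforms or the cron scheduler.
  • Failing that, systemd restarts should at least back off quickly (exponential backoff) instead of retrying 4 times in about 30 minutes (15:23 to 15:55 in the journalctl output).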
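
The graceful degradation described above could look roughly like the following. This is a minimal sketch, not Hermes code: `connect_all`, the `Connector` type, and the two demo connectors are hypothetical names, and the real adapter interface is not shown in this issue.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# Hypothetical type: a coroutine function that connects one platform.
Connector = Callable[[], Awaitable[None]]


async def connect_all(connectors: Dict[str, Connector],
                      timeout: float = 30.0) -> Dict[str, str]:
    """Connect every platform; return a name -> status map instead of raising."""

    async def connect_one(name: str, connect: Connector) -> str:
        try:
            await asyncio.wait_for(connect(), timeout=timeout)
            return "connected"
        except Exception as exc:  # one bad token must not take down the process
            print(f"{name}: marked disconnected ({type(exc).__name__}: {exc})")
            return "disconnected"

    names = list(connectors)
    results = await asyncio.gather(
        *(connect_one(name, connectors[name]) for name in names)
    )
    return dict(zip(names, results))


async def _ok() -> None:
    """Stand-in for a platform that connects successfully."""
    await asyncio.sleep(0)


async def _bad_token() -> None:
    """Stand-in for a platform whose token was removed."""
    raise RuntimeError("Improper token has been passed.")


def demo() -> Dict[str, str]:
    return asyncio.run(connect_all({"feishu": _ok, "discord": _bad_token}))
```

With this shape, the caller decides what to do with a "disconnected" entry (retry later, notify the user), while the other platforms and anything else running in the process keep going.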

Actual Behavior

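  1. The Gateway process crashes: when a platform connection raises an error, the whole process exits.
  2. systemd restarts it indefinitely: Restart=always keeps the restart counter climbing.
  3. The cron ticker is repeatedly killed and restarted: each tick period is 60 seconds, but the gateway crashes every 4 to 15 minutes, so the cron ticker never completes a full cycle.
  4. All cron jobs fail silently: last_run_at stays null and no error notification is sent.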

Impact

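  • 2+ scheduled jobs stopped working entirely (market-close analysis, gateway maintenance).
  • The user has no idea the jobs did not run and has to trigger them manually.
  • The problem is hard to spot: next_run_at looks normal and state: scheduled looks normal, yet the jobs never fire.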
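
Because the failure is silent, a small watchdog over the job records is one way to surface it. The sketch below is an illustration only: `find_silent_jobs` is a hypothetical helper, and the `created_at` field is an assumption (the issue shows only job_id, state, last_run_at, schedule, and next_run_at).

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Mapping, Optional


def find_silent_jobs(jobs: Iterable[Mapping[str, Optional[str]]],
                     now: datetime,
                     max_silence: timedelta) -> List[str]:
    """Return job_ids that are scheduled but have not run within max_silence."""
    silent = []
    for job in jobs:
        if job.get("state") != "scheduled":
            continue
        # Fall back to the (assumed) created_at field if the job never ran.
        last = job.get("last_run_at") or job.get("created_at")
        if last is None or now - datetime.fromisoformat(last) > max_silence:
            silent.append(str(job["job_id"]))
    return silent
```

For a daily schedule such as "0 3 * * *", a max_silence slightly above 24 hours would flag a job like bfb62214764d once it has gone a full day without running.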

Root Cause Analysis

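Judging from the source, the scheduling logic in cron/scheduler.py :: tick() and cron/jobs.py :: get_due_jobs() is itself correct. The problems are:

  1. The Gateway does not handle a single-platform failure gracefully: a failed platform adapter connection makes the process exit.
  2. The cron ticker depends on the gateway process staying alive: the ticker is a thread/timer inside the gateway process, so it dies with the gateway.
  3. Restarts are too frequent: systemd Restart=always with no backoff produced 4 restarts in about 30 minutes (15:23 to 15:55 in the journalctl output).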
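
The third point is about missing backoff. The same idea can also be applied inside the process, to retries of a single failed adapter, so that a bad token is retried less and less often instead of at a fixed pace. A minimal sketch follows; `backoff_delays` and its default values are hypothetical and not taken from Hermes.

```python
import random
from typing import Iterator


def backoff_delays(base: float = 5.0, factor: float = 2.0,
                   cap: float = 300.0, jitter: float = 0.0) -> Iterator[float]:
    """Yield retry delays in seconds: base, base*factor, ... capped at cap."""
    delay = base
    while True:
        # Optional jitter spreads retries; 0.0 keeps the schedule deterministic.
        yield min(delay, cap) * (1.0 + random.uniform(0.0, jitter))
        delay = min(delay * factor, cap)
```

With the defaults, the delays grow as 5, 10, 20, 40, 80, 160 seconds and then stay at the 300-second cap.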

Proposed Solutions

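  1. Graceful degradation: when a platform connection fails, mark it as disconnected instead of crashing the whole gateway.
  2. Standalone cron daemon: split the cron scheduler out of the gateway process into its own systemd service, so that gateway restarts do not affect cron.
  3. systemd backoff: add RestartSec=60s, or a more aggressive backoff, to the service unit.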
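
For the second proposal, the entry point of a standalone cron service could be as small as the loop below. This is a sketch under assumptions: `run_ticker` is a hypothetical name, and the real signature of cron/scheduler.py :: tick() is not shown in this issue, so the tick function is passed in as a plain zero-argument callable.

```python
import time
from typing import Callable, Optional


def run_ticker(tick: Callable[[], None],
               interval: float = 60.0,
               max_ticks: Optional[int] = None,
               sleep: Callable[[float], None] = time.sleep) -> int:
    """Call tick() every `interval` seconds; return the number of ticks attempted."""
    count = 0
    while max_ticks is None or count < max_ticks:
        try:
            tick()
        except Exception as exc:  # a failing tick must not stop the scheduler
            print(f"tick failed: {exc!r}")
        count += 1
        sleep(interval)
    return count
```

Run as its own service, this loop keeps ticking regardless of what the gateway does. The max_ticks and sleep parameters exist only so the loop can be exercised in tests without waiting.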

Environment

  • Hermes Agent v0.13.0 (commit faa13e49)
  • Install method: curl installer
  • OS: Ubuntu 24.04 LTS (LXC container)
  • Gateway: systemd service, --run-as-user root
  • Platforms: feishu (websocket), weixin (iLink), discord (removed)
  • The Discord bot token has been deleted from .env, but the platform config is still present

Logs

# gateway.log
2026-05-08 15:22:48 Gateway stopped
2026-05-08 15:23:53 Cron ticker started (interval=60s)
2026-05-08 15:38:35 Gateway drain timed out after 60.0s with 1 active agent(s)
2026-05-08 15:38:36 Gateway stopped
2026-05-08 15:40:36 ERROR gateway.run: ✗ discord error: discord connect timed out after 30s
2026-05-08 15:40:36 discord.errors.LoginFailure: Improper token has been passed.
2026-05-08 15:47:12 Gateway stopped
2026-05-08 15:49:42 discord.errors.PrivilegedIntentsRequired: Shard ID None is requesting privileged intents...
2026-05-08 15:55:15 systemd: restart counter is at 4

# journalctl
May 08 15:23:48 systemd[1]: Scheduled restart job, restart counter is at 1.
May 08 15:40:00 systemd[1]: Scheduled restart job, restart counter is at 2.
May 08 15:49:07 systemd[1]: Scheduled restart job, restart counter is at 3.
May 08 15:55:15 systemd[1]: Scheduled restart job, restart counter is at 4.

# Cron job status (never fired)
job_id: bfb62214764d
state: scheduled
last_run_at: null
schedule: "0 3 * * *"

Workaround Used

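Manually deleting the Discord config and manually restarting the gateway restored normal operation. Critical scheduled jobs are now run by hand via terminal(background=true) and no longer rely on cron.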
