hermes - 💡(How to fix) Fix [Bug]: Platform connection failure sends the Gateway into a restart loop, and all cron jobs silently stop firing

Error Message

# gateway.log
2026-05-08 15:22:48 Gateway stopped
2026-05-08 15:23:53 Cron ticker started (interval=60s)
2026-05-08 15:38:35 Gateway drain timed out after 60.0s with 1 active agent(s)
2026-05-08 15:38:36 Gateway stopped
2026-05-08 15:40:36 ERROR gateway.run: ✗ discord error: discord connect timed out after 30s
2026-05-08 15:40:36 discord.errors.LoginFailure: Improper token has been passed.
2026-05-08 15:47:12 Gateway stopped
2026-05-08 15:49:42 discord.errors.PrivilegedIntentsRequired: Shard ID None is requesting privileged intents...
2026-05-08 15:55:15 systemd: restart counter is at 4

# journalctl
May 08 15:23:48 systemd[1]: Scheduled restart job, restart counter is at 1.
May 08 15:40:00 systemd[1]: Scheduled restart job, restart counter is at 2.
May 08 15:49:07 systemd[1]: Scheduled restart job, restart counter is at 3.
May 08 15:55:15 systemd[1]: Scheduled restart job, restart counter is at 4.

# Cron job status (never fired)
job_id: bfb62214764d
state: scheduled
last_run_at: null
schedule: "0 3 * * *"
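
The restart density can be measured directly from the journalctl excerpt above. The following is a minimal sketch, assuming the lines keep the `Mon DD HH:MM:SS` prefix shown and that the year is 2026 (taken from gateway.log); the helper names are illustrative and not part of Hermes.

```python
from datetime import datetime

# The four journalctl lines quoted in the report.
JOURNAL_LINES = [
    "May 08 15:23:48 systemd[1]: Scheduled restart job, restart counter is at 1.",
    "May 08 15:40:00 systemd[1]: Scheduled restart job, restart counter is at 2.",
    "May 08 15:49:07 systemd[1]: Scheduled restart job, restart counter is at 3.",
    "May 08 15:55:15 systemd[1]: Scheduled restart job, restart counter is at 4.",
]


def restart_times(lines, year=2026):
    """Return the timestamp of every 'Scheduled restart job' line."""
    times = []
    for line in lines:
        if "Scheduled restart job" not in line:
            continue
        stamp = " ".join(line.split()[:3])  # e.g. "May 08 15:23:48"
        # %b is locale-dependent; this assumes English month abbreviations.
        times.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return times


def restart_gaps_seconds(times):
    """Seconds between each pair of consecutive restarts."""
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


times = restart_times(JOURNAL_LINES)
gaps = restart_gaps_seconds(times)
span = times[-1] - times[0]
print(gaps, span)
```

For the four lines quoted, the gaps are about 16, 9, and 6 minutes, and the whole sequence spans a little over 31 minutes.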

Root Cause

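Judging from the source, the scheduling logic in cron/scheduler.py :: tick() and cron/jobs.py :: get_due_jobs() is itself correct. The problems are:

The Gateway does not handle a single-platform failure gracefully: a failed platform adapter connection makes the whole process exit.
The cron ticker depends on the gateway process staying alive: the ticker is a thread/timer inside the gateway process, so it dies whenever the gateway does.
Restarts are too dense: systemd Restart=always with no backoff produced 4 restarts in roughly half an hour (15:23 to 15:55 in the journalctl excerpt).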
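
To make "backoff" concrete, here is a minimal sketch of an exponential backoff schedule and a retry loop that uses it. It is generic Python, not Hermes code: `backoff_delays`, `retry_with_backoff`, and the default values (5 s base, 300 s cap) are assumptions chosen for illustration. The same idea could apply to reconnecting a single platform inside the process rather than restarting the whole service.

```python
import time


def backoff_delays(base=5.0, factor=2.0, cap=300.0, attempts=6):
    """Return `attempts` delays that grow by `factor` and never exceed `cap`."""
    delays = []
    delay = base
    for _ in range(attempts):
        delays.append(min(delay, cap))
        delay *= factor
    return delays


def retry_with_backoff(connect, delays, sleep=time.sleep):
    """Call connect() until it succeeds, sleeping one delay after each failure.

    Returns the number of the attempt that succeeded, or re-raises the last
    error once the delays are exhausted.
    """
    last_error = None
    for attempt, delay in enumerate(delays, start=1):
        try:
            connect()
            return attempt
        except Exception as exc:  # hypothetical connect() may raise anything
            last_error = exc
            sleep(delay)
    if last_error is None:
        raise ValueError("delays must not be empty")
    raise last_error


print(backoff_delays(attempts=8))
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.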

Code Example

# Environment: LXC container, systemd service, --run-as-user root

# After the Discord token was deleted
15:22  Gateway stopped → systemd restarts (counter=1)
15:38  Gateway stopped → discord: 401 Unauthorized → systemd restarts (counter=2)
15:47  Gateway stopped → discord: 401 → systemd restarts (counter=3)
15:52  Gateway stopped → discord: 401 → systemd restarts
15:59  Gateway stopped → systemd restarts (counter=4)
16:14  Gateway stopped → systemd restarts
       ↑ 6+ crashes within 1 hour!

Bug Description

When a messaging platform (e.g. Discord) fails to connect because its token is invalid or missing, the Gateway crashes, systemd restarts it, and it crashes again, producing a restart loop. Each restart kills the cron ticker thread, so no cron job ever fires.

Steps to Reproduce

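Configure a messaging platform (e.g. Discord) with a valid token.
Start the gateway.
Delete that platform's token from .env.
Watch the gateway log: the platform connection fails, the gateway crashes, and systemd restarts it automatically.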

Expected Behavior

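A connection failure on a single platform should not crash the whole gateway process.
The failed platform should be degraded gracefully (marked as disconnected) without affecting the other platforms or the cron scheduler.
Failing that, systemd restarts should at least back off quickly (exponential backoff) rather than firing 4 times in about half an hour.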
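
The graceful degradation described above can be sketched as follows. This is a minimal illustration, not Hermes' actual adapter API: `FakeAdapter`, `connect_all`, and the status strings are hypothetical, and the 30-second timeout merely mirrors the "connect timed out after 30s" log line.

```python
import asyncio


class FakeAdapter:
    """Hypothetical stand-in for a platform adapter (not a Hermes class)."""

    def __init__(self, name, fail_with=None):
        self.name = name
        self.fail_with = fail_with

    async def connect(self):
        if self.fail_with is not None:
            raise self.fail_with


async def connect_all(adapters, timeout=30.0):
    """Connect every adapter independently and return {name: status}."""

    async def connect_one(adapter):
        try:
            await asyncio.wait_for(adapter.connect(), timeout=timeout)
            return adapter.name, "connected"
        except Exception as exc:
            # One failing platform is recorded, not propagated, so the
            # process (and anything else running in it) stays alive.
            return adapter.name, f"disconnected: {exc}"

    results = await asyncio.gather(*(connect_one(a) for a in adapters))
    return dict(results)


adapters = [
    FakeAdapter("feishu"),
    FakeAdapter("weixin"),
    FakeAdapter("discord", fail_with=RuntimeError("Improper token has been passed.")),
]
statuses = asyncio.run(connect_all(adapters))
print(statuses)
```

Because each adapter's exception is caught inside its own task, the failing discord adapter ends up marked as disconnected while feishu and weixin connect normally.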

Actual Behavior

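The Gateway process crashes: a platform connection error makes the whole process exit.
systemd restarts it indefinitely: Restart=always keeps the restart counter climbing.
The cron ticker is repeatedly killed and restarted: the tick interval is 60 seconds, but the gateway crashes every 4 to 15 minutes, so the ticker never gets a stable run.
All cron jobs fail silently: last_run_at stays null and no error notification is sent.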

Impact

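At least 2 scheduled jobs stopped working entirely (market-close analysis, gateway maintenance).
The user has no indication that the jobs did not run and can only trigger them manually.
The problem is hard to spot: next_run_at and state: scheduled both look normal, yet the jobs never fire.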
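
Because the failure is silent, a small watchdog could flag jobs whose most recent due time has passed without a run. The sketch below is an assumption-laden illustration: it only understands daily `M H * * *` schedules like the one in the report, the job is a plain dict using the field names from the status dump, and `previous_daily_due` and `is_overdue` are hypothetical helpers. A job created after its last due time would be flagged incorrectly, so a real check would also need a creation timestamp.

```python
from datetime import datetime, timedelta


def previous_daily_due(schedule, now):
    """Most recent due time at or before `now` for a daily 'M H * * *' schedule."""
    minute, hour, dom, month, dow = schedule.split()
    if (dom, month, dow) != ("*", "*", "*"):
        raise ValueError("this sketch only handles daily 'M H * * *' schedules")
    due = now.replace(hour=int(hour), minute=int(minute), second=0, microsecond=0)
    if due > now:
        due -= timedelta(days=1)
    return due


def is_overdue(job, now, grace=timedelta(minutes=5)):
    """True if a scheduled job missed its most recent due time by more than `grace`."""
    if job["state"] != "scheduled":
        return False
    due = previous_daily_due(job["schedule"], now)
    if now - due < grace:
        return False  # still inside the grace window
    last_run = job["last_run_at"]
    return last_run is None or last_run < due


job = {
    "job_id": "bfb62214764d",
    "state": "scheduled",
    "last_run_at": None,
    "schedule": "0 3 * * *",
}
print(is_overdue(job, now=datetime(2026, 5, 9, 9, 0)))
```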

Proposed Solutions

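Graceful degradation: when a platform connection fails, mark it as disconnected instead of crashing the whole gateway.
Standalone cron daemon: split the cron scheduler out of the gateway process into its own systemd service, so gateway restarts do not affect cron.
systemd backoff: add RestartSec=60s, or a more aggressive backoff, to the service unit.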
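
The standalone cron daemon idea can be sketched as a small loop that runs in its own process. This is a rough outline under stated assumptions: `run_ticker` is a hypothetical name, `tick` stands in for whatever callable the project exposes (the real signature of cron/scheduler.py :: tick() is not verified here), and `max_ticks` exists only to make the loop easy to exercise.

```python
import logging
import threading

log = logging.getLogger("cron-daemon")


def run_ticker(tick, interval=60.0, stop=None, max_ticks=None):
    """Call tick() every `interval` seconds until `stop` is set.

    A failing tick is logged and skipped so one bad job cannot end the loop.
    Returns the number of ticks attempted.
    """
    stop = stop or threading.Event()
    ticks = 0
    while not stop.is_set():
        try:
            tick()
        except Exception:
            log.exception("tick failed; continuing with the next interval")
        ticks += 1
        if max_ticks is not None and ticks >= max_ticks:
            break
        stop.wait(interval)  # sleeps, but wakes early if stop is set
    return ticks


# Demo with a stand-in tick that fails once; interval=0 avoids real waiting.
calls = []


def demo_tick():
    calls.append(len(calls) + 1)
    if len(calls) == 2:
        raise RuntimeError("simulated job failure")


attempted = run_ticker(demo_tick, interval=0.0, max_ticks=3)
print(attempted, calls)
```

Run under its own service unit, a loop like this would keep ticking regardless of how often the gateway process restarts.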

Environment

Hermes Agent v0.13.0 (commit faa13e49)
Install method: curl installer
OS: Ubuntu 24.04 LTS (LXC container)
Gateway: systemd service, --run-as-user root
Platforms: feishu (websocket), weixin (iLink), discord (removed)
Discord bot token was removed from .env, but the platform config is still present

Workaround Used

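Manually deleting the Discord config and restarting the gateway restored normal operation. Critical scheduled jobs are now run by hand with terminal(background=true) instead of relying on cron.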
