hermes - ✅(Solved) Fix _safe_sync_slash_commands times out under post-prune rate-limit pressure (30s budget too tight) [1 pull requests, 1 participants]

davidbordenwi · 2026-04-27T21:21:10Z


GitHub stats

NousResearch/hermes-agent#16713•Fetched 2026-04-28 06:51:20

Author

davidbordenwi

Participants

davidbordenwi

Timeline (top)

labeled ×4, cross-referenced ×1, referenced ×1

_safe_sync_slash_commands (gateway/platforms/discord.py:937-1009) issues one Discord HTTP call per desired command plus one DELETE per orphan inside a tight for loop, then the whole thing runs inside an outer asyncio.wait_for(..., 30) at _run_post_connect_initialization (gateway/platforms/discord.py:801-832). After a mass prune, the post-orphan-cleanup upsert burst pushes against Discord's per-app command-management rate-limit window (~5 writes / 20 s). The 30 s outer budget then blows before the reconciler finishes, the post-connect handler returns a timeout, and the gateway retries on its next connect — only succeeding ~60 min later once the bucket has fully recovered.

Fix / Workaround

Three patch options (priority order)

Most user-visible commands live in COMMAND_REGISTRY plus the single /skill autocomplete dispatcher. Sync those inside the 30 s budget; once they land the gateway is operational. Move orphan cleanup at line 998 into a detached background task with a separate, much wider budget. Operators don't notice deferred deletes if /skill autocomplete works.

Option 1 as the structural fix; option 2 as a follow-up if you prefer incremental patches. Option 3 alone hides the burst rather than handling it.

PR fix notes

PR #16739: fix(discord): scale slash sync timeout to actual write count (#16713)

Repository: NousResearch/hermes-agent
Author: Tranquil-Flow
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/16739

Description (problem / solution / changelog)

What does this PR do?

_run_post_connect_initialization in gateway/platforms/discord.py wraps the entire _safe_sync_slash_commands call in asyncio.wait_for(..., 30). Discord's per-app command-management bucket allows roughly 5 writes / 20-second window, so a mass-prune-plus-upsert reconcile reliably blows the 30 s budget under back-pressure. The reported case had 77 orphans + 30 desired = 107 writes; two consecutive 30 s timeouts, then a clean 22 s sync only after the bucket fully recovered ~60 minutes later.

This PR splits the read-only diff from the writes and sizes the execute-phase timeout to the actual workload, so a heavy reconcile gets enough budget to finish under bucket pressure but pathological loads still get a hard cap.
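The two-phase shape described above can be sketched as follows. This is a minimal illustration, not the PR's code: `plan_sync`, `execute_sync`, and `post_connect_sync` are hypothetical stand-ins, and the only values taken from the description are the 30 s plan budget and the `30 + 5 × write_count` formula capped at 600 s.

```python
import asyncio
from typing import Optional

PLAN_BUDGET_SECONDS = 30.0  # plan-phase budget quoted in the PR description


def execute_budget(write_count: int) -> float:
    # Formula quoted in the PR description: 30 + 5 * write_count, capped at 600 s.
    return float(min(30 + 5 * write_count, 600))


async def plan_sync() -> int:
    """Hypothetical stand-in for the read-only diff; returns the planned write count."""
    await asyncio.sleep(0)
    return 107  # 77 orphans + 30 desired, the case reported in the issue


async def execute_sync(write_count: int) -> dict:
    """Hypothetical stand-in for the write phase; returns a summary dict."""
    await asyncio.sleep(0)
    return {"writes": write_count}


async def post_connect_sync() -> Optional[dict]:
    # Phase 1: the read-only plan keeps the short, fixed budget.
    try:
        write_count = await asyncio.wait_for(plan_sync(), PLAN_BUDGET_SECONDS)
    except asyncio.TimeoutError:
        return None
    # Phase 2: the write phase gets a budget sized to the planned workload.
    try:
        return await asyncio.wait_for(
            execute_sync(write_count), execute_budget(write_count)
        )
    except asyncio.TimeoutError:
        return None


result = asyncio.run(post_connect_sync())
```

With the reported workload of 107 writes, the execute phase would be allowed 565 s under this formula rather than the flat 30 s.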

Related Issue

Fixes #16713

Type of Change

[x] 🐛 Bug fix (non-breaking change that fixes an issue)
[ ] ✨ New feature (non-breaking change that adds functionality)
[ ] 🔒 Security fix
[ ] 📝 Documentation update
[ ] ✅ Tests (adding or improving test coverage)
[ ] ♻️ Refactor (no behavior change)
[ ] 🎯 New skill (bundled or hub)

Changes Made

- `gateway/platforms/discord.py` — extracted two helpers from `_safe_sync_slash_commands`:
  - `_plan_slash_command_sync()` — read-only; diffs desired vs existing global commands, returns a list of typed actions (`create` / `update` / `recreate`) plus the orphan deletes and a `write_count`.
  - `_execute_slash_command_sync(plan)` — applies the plan, returns the existing summary dict.
  - `_safe_sync_slash_commands` is now a thin wrapper that calls plan + execute, preserving the single-shot signature for callers and existing tests.
- `gateway/platforms/discord.py:_run_post_connect_initialization` — now plans first under a 30 s budget, then computes a write-count-aware execute budget (`30 + 5 × write_count`, capped at 600 s) before running the plan. `recreate` actions count as two writes (delete + upsert) so the budget covers Discord's actual rate-limit math. Timeout log message updated to point at saturated rate-limit bucket as the cause instead of a flat "after 30 s".
- Three new module-level constants on the adapter (`_SLASH_SYNC_BASE_BUDGET_SECONDS`, `_SLASH_SYNC_PER_WRITE_SECONDS`, `_SLASH_SYNC_MAX_BUDGET_SECONDS`) and a static `_estimate_slash_sync_budget(write_count)` so the budget formula is testable and trivially tunable.
- `tests/gateway/test_discord_connect.py` — added five tests:
  - `_estimate_slash_sync_budget` scales with write count, is monotonic, caps at the maximum.
  - `_plan_slash_command_sync` counts a `recreate` as two writes plus an orphan delete (matches Discord's bucket math).
  - Plan + execute produces the same summary dict as the existing `_safe_sync_slash_commands` single-shot path (so callers/tests that read the summary stay correct).
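The write counting and budget formula listed above can be illustrated with a short sketch. The constant names come from the PR description; the values 30, 5, and 600 are inferred from the quoted formula, and the `SyncPlan` dataclass and its field names are assumptions made for illustration, not the PR's actual types.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Names from the PR description; values inferred from "30 + 5 x write_count, capped at 600 s".
_SLASH_SYNC_BASE_BUDGET_SECONDS = 30
_SLASH_SYNC_PER_WRITE_SECONDS = 5
_SLASH_SYNC_MAX_BUDGET_SECONDS = 600

# A recreate is a delete followed by an upsert, so it costs two writes.
_WRITES_PER_ACTION = {"create": 1, "update": 1, "recreate": 2}


@dataclass
class SyncPlan:
    """Hypothetical plan shape: typed actions plus orphan deletes."""

    actions: List[Tuple[str, str]] = field(default_factory=list)  # (kind, command name)
    orphan_deletes: List[str] = field(default_factory=list)

    @property
    def write_count(self) -> int:
        action_writes = sum(_WRITES_PER_ACTION[kind] for kind, _ in self.actions)
        return action_writes + len(self.orphan_deletes)


def estimate_slash_sync_budget(write_count: int) -> int:
    """Base budget plus a per-write allowance, clamped to the hard cap."""
    scaled = _SLASH_SYNC_BASE_BUDGET_SECONDS + _SLASH_SYNC_PER_WRITE_SECONDS * write_count
    return min(scaled, _SLASH_SYNC_MAX_BUDGET_SECONDS)


# One recreate (2 writes) plus one orphan delete (1 write) gives 3 writes.
plan = SyncPlan(actions=[("recreate", "ask")], orphan_deletes=["old-cmd"])
budget = estimate_slash_sync_budget(plan.write_count)
```

Under these assumed values the cap takes effect at 114 writes, since 30 + 5 × 114 = 600.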

How to Test

Reproduction (matches the issue body): trigger a session that produces ≥ ~30 orphan slash commands (e.g., a command-registry refactor that drops or renames commands), restart the gateway, and watch the post-connect initialization. Before this fix, wait_for(..., 30) raised asyncio.TimeoutError and subsequent reconnects within the rate-limit cooldown also timed out. After this fix, the budget scales with the planned write count and the reconcile completes under bucket pressure.

Automated:

pytest tests/gateway/test_discord_connect.py -q

Result on macOS 15.6.1 / Python 3.14.2: 16 passed (11 pre-existing + 5 new). All five new tests fail on main without the production change.

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
I searched for existing PRs to make sure this isn't a duplicate
My PR contains only changes related to this fix/feature (no unrelated commits)
I've run pytest tests/ -q and all tests pass
I've added tests for my changes (required for bug fixes, strongly encouraged for features)
I've tested on my platform: macOS 15.6.1 (Python 3.14.2)

Documentation & Housekeeping

I've updated relevant documentation (README, docs/, docstrings) — or N/A (new helpers carry inline docstrings explaining the bucket math)
I've updated cli-config.yaml.example if I added/changed config keys — or N/A (N/A — budget tuning lives on adapter class constants; not user-facing yaml)
I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A (N/A)
I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A (N/A — pure asyncio + discord.py HTTP calls, no platform-specific syscalls)
I've updated tool descriptions/schemas if I changed tool behavior — or N/A (N/A — gateway adapter, not a tool)

Screenshots / Logs

$ pytest tests/gateway/test_discord_connect.py -q
................                                                         [100%]
16 passed, 3 warnings in 3.17s

Changed files

gateway/platforms/discord.py (modified, +112/-26)
tests/gateway/test_discord_connect.py (modified, +194/-0)

Summary

Repro

1. Run a session that produces ≥ ~30 orphan slash commands (e.g., command-registry refactor that drops or renames commands).
2. Restart the gateway. The first reconcile runs the orphan delete loop (line 998) followed by upserts under back-pressure.
3. wait_for(... , 30) raises asyncio.TimeoutError inside _run_post_connect_initialization.
4. Subsequent reconnects within the rate-limit cooldown also time out. After ~60 min the bucket recovers and the next reconnect's reconcile finishes in ~22 s.

Observed in production with 77 orphans pruned. Two consecutive 30 s timeouts, then a clean 22 s sync.

Why the current code times out

Inner loop ignores X-RateLimit-Remaining / X-RateLimit-Reset and relies on discord.py's default backoff. Steady-state this is fine; under burst the serialized waits stack to 40-60 s.
Outer wait_for(... , 30) is the same budget for both fast (steady-state) and slow (post-prune) paths. There's no signal to tell the outer handler that the inner work is progressing, just slowly.

Three patch options (priority order)

1. Split sync into a fast critical path + deferred sweep (recommended)

fast = {k: v for k, v in desired_by_key.items() if _is_critical(v)}
slow = {k: v for k, v in desired_by_key.items() if k not in fast}

# inside the wait_for(..., 30) budget:
await self._reconcile_subset(fast)

# detached, no outer timeout, rate-limit-aware retries:
self._spawn_deferred_sync(slow, orphans=existing_by_key)

Trade-off: heartbeat card briefly reports "sync in progress" until the deferred sweep lands. Net win — cmd-sync timeouts cascade into 30032 errors today.
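A runnable sketch of the fast-path / deferred-sweep split might look like the following. Everything here is illustrative: `is_critical`, `reconcile_subset`, and `post_connect` are hypothetical stand-ins for the helpers named in the fragment above, command names replace the real command objects, and orphan deletes are left out for brevity.

```python
import asyncio
from typing import List, Set

background_tasks: Set[asyncio.Task] = set()  # strong refs so detached tasks are not garbage-collected
synced: List[str] = []  # records the order in which commands were reconciled


def is_critical(name: str) -> bool:
    # Hypothetical predicate; a real one would check COMMAND_REGISTRY membership.
    return name in {"skill", "help"}


async def reconcile_subset(names: List[str]) -> None:
    # Stand-in for one Discord HTTP write per command.
    for name in names:
        await asyncio.sleep(0)
        synced.append(name)


async def post_connect(desired: List[str]) -> asyncio.Task:
    fast = [n for n in desired if is_critical(n)]
    slow = [n for n in desired if not is_critical(n)]

    # Critical commands stay inside the 30 s budget.
    await asyncio.wait_for(reconcile_subset(fast), timeout=30)

    # Everything else runs detached, with no outer timeout.
    task = asyncio.create_task(reconcile_subset(slow))
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return task


async def main() -> List[str]:
    task = await post_connect(["skill", "misc-a", "help", "misc-b"])
    after_fast = list(synced)  # snapshot taken before the deferred sweep has run
    await task
    return after_fast


after_fast = asyncio.run(main())
```

Holding the task in a module-level set follows the asyncio documentation's advice to keep a reference to tasks created with `create_task`, since the event loop only keeps weak references to them.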

2. Pace upserts off `X-RateLimit-Remaining` / `X-RateLimit-Reset` headers

discord.py's HTTPClient exposes the headers via the rate-limit handler, but _safe_sync_slash_commands issues calls in a tight loop without inspecting them. A wrapper that sleeps until X-RateLimit-Reset when X-RateLimit-Remaining <= 1 smooths the post-prune burst without changing the contract.

Trade-off: still serial and still bounded by the outer 30 s — buys ~10-15 s of headroom, not unlimited.
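The pacing rule in option 2 can be sketched as a pure function over the response headers. This is an assumption-laden illustration: `pacing_delay` is a hypothetical helper, it treats `X-RateLimit-Reset` as epoch seconds (as Discord's rate-limit documentation describes), and it does not show how the headers would be obtained from discord.py.

```python
import time
from typing import Mapping, Optional


def pacing_delay(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Seconds to sleep before the next write, given the last response's headers.

    Hypothetical helper: sleeps until the reset time once at most one request
    remains in the current window, otherwise does not wait.
    """
    remaining = headers.get("X-RateLimit-Remaining")
    reset = headers.get("X-RateLimit-Reset")
    if remaining is None or reset is None:
        return 0.0  # no rate-limit information; do not add a delay
    if int(remaining) > 1:
        return 0.0  # still room in the bucket
    current = time.time() if now is None else now
    return max(0.0, float(reset) - current)


# Example: one request left, window resets 12.5 s from "now".
delay = pacing_delay(
    {"X-RateLimit-Remaining": "1", "X-RateLimit-Reset": "1012.5"}, now=1000.0
)
```

A caller would `await asyncio.sleep(delay)` between writes; as the trade-off above notes, this smooths the burst but the total is still bounded by whatever outer timeout applies.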

3. Widen / drop the outer `wait_for` for the cleanup branch

Cheapest: bump the timeout at line 816 to 120 s, or split into a fast-stage 30 s and a slow-stage None. Fixes the symptom without addressing the burst.

Trade-off: a real failure now hangs the post-connect handler for 2 min instead of 30 s. Acceptable, but feels like papering over the burst.

Recommendation

Option 1 as the structural fix; option 2 as a follow-up if you prefer incremental patches. Option 3 alone hides the burst rather than handling it.

References

gateway/platforms/discord.py:801-832 — _run_post_connect_initialization outer wait_for(..., 30).
gateway/platforms/discord.py:937-1009 — _safe_sync_slash_commands reconciler.
Autocomplete refactor commit d7fb435e — the test surface is rooted here.
Discord global command-management rate-limit: ~5 writes / app / 20 s, headers X-RateLimit-Remaining, X-RateLimit-Reset.

extent analysis

TL;DR

Split the sync into a fast critical path and a deferred sweep to avoid timing out due to Discord's rate limit.

Guidance

Identify critical commands that need to be synced within the 30-second budget, such as those in COMMAND_REGISTRY and the /skill autocomplete dispatcher.
Move orphan cleanup into a detached background task with a separate, wider budget to avoid blocking the main sync process.
Consider pacing upserts based on X-RateLimit-Remaining and X-RateLimit-Reset headers to smooth out the post-prune burst.
Review the trade-offs of each patch option, including the impact on user visibility and error handling.

Example

fast = {k: v for k, v in desired_by_key.items() if _is_critical(v)}
slow = {k: v for k, v in desired_by_key.items() if k not in fast}

await self._reconcile_subset(fast)
self._spawn_deferred_sync(slow, orphans=existing_by_key)

Notes

The provided patch options have different trade-offs, and the choice of solution depends on the specific requirements and constraints of the system. Option 1 is recommended as the structural fix, while option 2 can be considered as a follow-up incremental patch.

Recommendation

Apply patch option 1, splitting the sync into a fast critical path and a deferred sweep, as it addresses the root cause of the issue and provides a net win by avoiding cmd-sync timeouts.
