hermes - ✅(Solved) Fix _safe_sync_slash_commands times out under post-prune rate-limit pressure (30s budget too tight) [1 pull requests, 1 participants]

GitHub stats
NousResearch/hermes-agent#16713 (fetched 2026-04-28 06:51:20)
Comments: 0
Participants: 1
Timeline: 6
Reactions: 0
Timeline (top): labeled ×4, cross-referenced ×1, referenced ×1

_safe_sync_slash_commands (gateway/platforms/discord.py:937-1009) issues one Discord HTTP call per desired command plus one DELETE per orphan inside a tight for loop, then the whole thing runs inside an outer asyncio.wait_for(..., 30) at _run_post_connect_initialization (gateway/platforms/discord.py:801-832). After a mass prune, the post-orphan-cleanup upsert burst pushes against Discord's per-app command-management rate-limit window (~5 writes / 20 s). The 30 s outer budget then blows before the reconciler finishes, the post-connect handler returns a timeout, and the gateway retries on its next connect — only succeeding ~60 min later once the bucket has fully recovered.

Fix Action

Fix / Workaround

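Three patch options (priority order)

The issue lists three options: (1) split the sync into a fast critical path plus a deferred sweep, (2) pace upserts off the X-RateLimit-Remaining / X-RateLimit-Reset headers, and (3) widen or drop the outer wait_for for the cleanup branch. Option 1 is the recommended one:

Most user-visible commands live in COMMAND_REGISTRY plus the single /skill autocomplete dispatcher. Sync those inside the 30 s budget; once they land the gateway is operational. Move orphan cleanup at line 998 into a detached background task with a separate, much wider budget. Operators don't notice deferred deletes if /skill autocomplete works.

Option 1 is the structural fix; option 2 is a follow-up if you prefer incremental patches. Option 3 alone hides the burst rather than handling it.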

PR fix notes

PR #16739: fix(discord): scale slash sync timeout to actual write count (#16713)

Description (problem / solution / changelog)

What does this PR do?

_run_post_connect_initialization in gateway/platforms/discord.py wraps the entire _safe_sync_slash_commands call in asyncio.wait_for(..., 30). Discord's per-app command-management bucket allows roughly 5 writes / 20-second window, so a mass-prune-plus-upsert reconcile reliably blows the 30 s budget under back-pressure. The reported case had 77 orphans + 30 desired = 107 writes; two consecutive 30 s timeouts, then a clean 22 s sync only after the bucket fully recovered ~60 minutes later.

This PR splits the read-only diff from the writes and sizes the execute-phase timeout to the actual workload, so a heavy reconcile gets enough budget to finish under bucket pressure but pathological loads still get a hard cap.
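
As a rough check on those numbers, the sketch below estimates the minimum wall-clock time for a burst of writes under a fixed-window limit. The fixed-window model and the helper name `min_seconds_for_writes` are assumptions made for illustration; the ~5 writes / 20 s figure is the approximation quoted above, and real behavior depends on which calls actually count against the bucket.

```python
import math

def min_seconds_for_writes(write_count: int, per_window: int = 5,
                           window_s: float = 20.0) -> float:
    """Lower bound on the time to push write_count writes through a fixed window.

    Assumes the first window's writes go out immediately and every later
    window opens window_s seconds after the previous one.
    """
    if write_count <= 0:
        return 0.0
    windows = math.ceil(write_count / per_window)
    return (windows - 1) * window_s

# Reported case: 77 orphan deletes + 30 desired upserts = 107 writes.
reported = min_seconds_for_writes(77 + 30)
print(f"107 writes need at least {reported:.0f} s; the outer budget was 30 s")
```

Under this model the reported workload needs several minutes, which is the same order of magnitude as the scaled budget the PR assigns to it.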

Related Issue

Fixes #16713

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • gateway/platforms/discord.py — extracted two helpers from _safe_sync_slash_commands:
    • _plan_slash_command_sync() — read-only; diffs desired vs existing global commands, returns a list of typed actions (create / update / recreate) plus the orphan deletes and a write_count.
    • _execute_slash_command_sync(plan) — applies the plan, returns the existing summary dict.
    • _safe_sync_slash_commands is now a thin wrapper that calls plan + execute, preserving the single-shot signature for callers and existing tests.
  • gateway/platforms/discord.py:_run_post_connect_initialization — now plans first under a 30 s budget, then computes a write-count-aware execute budget (30 + 5 × write_count, capped at 600 s) before running the plan. recreate actions count as two writes (delete + upsert) so the budget covers Discord's actual rate-limit math. Timeout log message updated to point at saturated rate-limit bucket as the cause instead of a flat "after 30 s".
  • Three new module-level constants on the adapter (_SLASH_SYNC_BASE_BUDGET_SECONDS, _SLASH_SYNC_PER_WRITE_SECONDS, _SLASH_SYNC_MAX_BUDGET_SECONDS) and a static _estimate_slash_sync_budget(write_count) so the budget formula is testable and trivially tunable.
  • tests/gateway/test_discord_connect.py — added five tests:
    • _estimate_slash_sync_budget scales with write count, is monotonic, caps at the maximum.
    • _plan_slash_command_sync counts a recreate as two writes plus an orphan delete (matches Discord's bucket math).
    • Plan + execute produces the same summary dict as the existing _safe_sync_slash_commands single-shot path (so callers/tests that read the summary stay correct).
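
A minimal sketch of the write-count and budget rules described in the list above. The constant names and the 30 / 5 / 600 values come from the PR description; the `SlashSyncPlan` dataclass, the `_WRITES_PER_ACTION` table, and the free-function form of the estimator are hypothetical shapes chosen for illustration, since the PR text does not show the real types.

```python
from dataclasses import dataclass, field

# Values quoted in the PR description.
_SLASH_SYNC_BASE_BUDGET_SECONDS = 30
_SLASH_SYNC_PER_WRITE_SECONDS = 5
_SLASH_SYNC_MAX_BUDGET_SECONDS = 600

# Hypothetical cost table: a recreate is a delete followed by an upsert.
_WRITES_PER_ACTION = {"create": 1, "update": 1, "recreate": 2}

@dataclass
class SlashSyncPlan:
    """Illustrative plan shape, not the adapter's actual return type."""
    actions: list[tuple[str, str]] = field(default_factory=list)  # (kind, name)
    orphan_deletes: list[str] = field(default_factory=list)       # command names

    @property
    def write_count(self) -> int:
        action_writes = sum(_WRITES_PER_ACTION[kind] for kind, _ in self.actions)
        return action_writes + len(self.orphan_deletes)

def estimate_slash_sync_budget(write_count: int) -> int:
    """30 + 5 * write_count, capped at 600 seconds."""
    scaled = (_SLASH_SYNC_BASE_BUDGET_SECONDS
              + _SLASH_SYNC_PER_WRITE_SECONDS * write_count)
    return min(scaled, _SLASH_SYNC_MAX_BUDGET_SECONDS)

plan = SlashSyncPlan(actions=[("recreate", "skill")], orphan_deletes=["old-cmd"])
print(plan.write_count, estimate_slash_sync_budget(plan.write_count))
print(estimate_slash_sync_budget(107))  # the reported 107-write case
```

With these values the reported 107-write reconcile gets a 565 s budget, and anything above 114 writes hits the 600 s cap.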

How to Test

Reproduction (matches the issue body): trigger a session that produces ≥ ~30 orphan slash commands (e.g., a command-registry refactor that drops or renames commands), restart the gateway, and watch the post-connect initialization. Before this fix, wait_for(..., 30) raised asyncio.TimeoutError and subsequent reconnects within the rate-limit cooldown also timed out. After this fix, the budget scales with the planned write count and the reconcile completes under bucket pressure.
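
The two-phase flow (plan under a flat budget, then execute under a budget sized to the plan) can be sketched as follows. The function names, the dict-shaped plan, and the stand-in coroutines are hypothetical; only the 30 s plan budget and the 30 + 5 × write_count formula capped at 600 s come from the PR description.

```python
import asyncio

PLAN_BUDGET_SECONDS = 30  # the plan phase keeps the original flat budget

def execute_budget(write_count: int) -> int:
    # Formula quoted in the PR description, capped at 600 s.
    return min(30 + 5 * write_count, 600)

async def sync_slash_commands(plan_fn, execute_fn):
    """plan_fn() returns a dict with 'write_count'; execute_fn(plan) returns a summary."""
    plan = await asyncio.wait_for(plan_fn(), PLAN_BUDGET_SECONDS)
    budget = execute_budget(plan["write_count"])
    summary = await asyncio.wait_for(execute_fn(plan), budget)
    return budget, summary

# Stand-in coroutines so the sketch runs without Discord.
async def fake_plan():
    return {"write_count": 107}

async def fake_execute(plan):
    await asyncio.sleep(0)  # real code would issue the HTTP writes here
    return {"applied": plan["write_count"]}

budget, summary = asyncio.run(sync_slash_commands(fake_plan, fake_execute))
print(budget, summary)
```

Keeping the read-only diff in its own short phase means a slow or failing fetch is still caught quickly, while the write phase is the only part that gets the wider window.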

Automated:

pytest tests/gateway/test_discord_connect.py -q

Result on macOS 15.6.1 / Python 3.14.2: 16 passed (11 pre-existing + 5 new). All five new tests fail on main without the production change.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS 15.6.1 (Python 3.14.2)

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A (new helpers carry inline docstrings explaining the bucket math)
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A (N/A — budget tuning lives on adapter class constants; not user-facing yaml)
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A (N/A)
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A (N/A — pure asyncio + discord.py HTTP calls, no platform-specific syscalls)
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A (N/A — gateway adapter, not a tool)

Screenshots / Logs

$ pytest tests/gateway/test_discord_connect.py -q
................                                                         [100%]
16 passed, 3 warnings in 3.17s

Changed files

  • gateway/platforms/discord.py (modified, +112/-26)
  • tests/gateway/test_discord_connect.py (modified, +194/-0)


RAW_BUFFER

Summary

_safe_sync_slash_commands (gateway/platforms/discord.py:937-1009) issues one Discord HTTP call per desired command plus one DELETE per orphan inside a tight for loop, then the whole thing runs inside an outer asyncio.wait_for(..., 30) at _run_post_connect_initialization (gateway/platforms/discord.py:801-832). After a mass prune, the post-orphan-cleanup upsert burst pushes against Discord's per-app command-management rate-limit window (~5 writes / 20 s). The 30 s outer budget then blows before the reconciler finishes, the post-connect handler returns a timeout, and the gateway retries on its next connect — only succeeding ~60 min later once the bucket has fully recovered.

Repro

  1. Run a session that produces ≥ ~30 orphan slash commands (e.g., command-registry refactor that drops or renames commands).
  2. Restart the gateway. The first reconcile runs the orphan delete loop (line 998) followed by upserts under back-pressure.
  3. wait_for(... , 30) raises asyncio.TimeoutError inside _run_post_connect_initialization.
  4. Subsequent reconnects within the rate-limit cooldown also time out. After ~60 min the bucket recovers and the next reconnect's reconcile finishes in ~22 s.

Observed in production with 77 orphans pruned. Two consecutive 30 s timeouts, then a clean 22 s sync.

Why the current code times out

  • The inner loop ignores X-RateLimit-Remaining / X-RateLimit-Reset and relies on discord.py's default backoff. In steady state this is fine; under a burst, the serialized waits stack up to 40-60 s.
  • The outer wait_for(..., 30) applies the same budget to both the fast (steady-state) and slow (post-prune) paths. Nothing tells the outer handler that the inner work is progressing, just slowly.
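
The second point can be reproduced in isolation: asyncio.wait_for cancels the wrapped coroutine when the deadline passes, regardless of how much progress it has made. The sketch below uses scaled-down timings and a hypothetical `slow_reconcile` stand-in, not the adapter's code.

```python
import asyncio

async def slow_reconcile(total: int, step_s: float, progress: list) -> None:
    # Each iteration stands in for one rate-limited HTTP write.
    for i in range(total):
        await asyncio.sleep(step_s)
        progress.append(i)

async def demo() -> tuple:
    progress = []
    timed_out = False
    try:
        # Scaled down: 10 writes at 0.05 s each against a 0.12 s budget.
        await asyncio.wait_for(slow_reconcile(10, 0.05, progress), timeout=0.12)
    except asyncio.TimeoutError:
        timed_out = True
    return timed_out, len(progress)

timed_out, done = asyncio.run(demo())
print(f"timed out: {timed_out}, writes completed before cancel: {done} of 10")
```

The completed writes are not rolled back, so a retry after the timeout starts from a partially reconciled state while the rate-limit bucket is still drained.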

Three patch options (priority order)

1. Split sync into a fast critical path + deferred sweep (recommended)

Most user-visible commands live in COMMAND_REGISTRY plus the single /skill autocomplete dispatcher. Sync those inside the 30 s budget; once they land the gateway is operational. Move orphan cleanup at line 998 into a detached background task with a separate, much wider budget. Operators don't notice deferred deletes if /skill autocomplete works.

fast = {k: v for k, v in desired_by_key.items() if _is_critical(v)}
slow = {k: v for k, v in desired_by_key.items() if k not in fast}

# inside the wait_for(..., 30) budget:
await self._reconcile_subset(fast)

# detached, no outer timeout, rate-limit-aware retries:
self._spawn_deferred_sync(slow, orphans=existing_by_key)

Trade-off: heartbeat card briefly reports "sync in progress" until the deferred sweep lands. Net win — cmd-sync timeouts cascade into 30032 errors today.
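
A self-contained sketch of how option 1 could be wired with asyncio.create_task. The class, the `critical` set, and the in-memory log are hypothetical stand-ins; `_reconcile_subset` and `_spawn_deferred_sync` reuse the names from the snippet above but their bodies are assumptions. Unlike that snippet, the sketch computes the orphan list up front, and it omits the rate-limit-aware retries the deferred sweep would need.

```python
import asyncio

class SlashSyncSketch:
    """Hypothetical stand-in for the adapter; no Discord calls are made."""

    def __init__(self):
        self.log = []             # records (operation, command_name)
        self._background = set()  # strong refs so pending tasks are not collected

    async def _write(self, op, name):
        await asyncio.sleep(0)    # placeholder for one HTTP call
        self.log.append((op, name))

    async def _reconcile_subset(self, commands):
        for name in commands:
            await self._write("upsert", name)

    async def _deferred_sweep(self, slow, orphans):
        for name in slow:
            await self._write("upsert", name)
        for name in orphans:
            await self._write("delete", name)

    def _spawn_deferred_sync(self, slow, orphans):
        task = asyncio.create_task(self._deferred_sweep(slow, orphans))
        self._background.add(task)
        task.add_done_callback(self._background.discard)
        return task

    async def post_connect(self, desired, existing, critical):
        fast = {k: v for k, v in desired.items() if k in critical}
        slow = {k: v for k, v in desired.items() if k not in fast}
        orphans = [k for k in existing if k not in desired]
        # Only the critical subset runs inside the 30 s budget.
        await asyncio.wait_for(self._reconcile_subset(fast), 30)
        return self._spawn_deferred_sync(slow, orphans)

async def demo():
    gw = SlashSyncSketch()
    desired = {"skill": {}, "help": {}, "stats": {}}
    existing = {"skill": {}, "legacy": {}}
    task = await gw.post_connect(desired, existing, critical={"skill", "help"})
    after_fast = list(gw.log)  # snapshot before the sweep has had a chance to run
    await task
    return after_fast, gw.log

after_fast, final = asyncio.run(demo())
print(after_fast)
print(final)
```

Holding a reference to the task until it finishes matters here: the event loop keeps only weak references to tasks, so a fire-and-forget sweep could otherwise be garbage-collected mid-run.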

2. Pace upserts off X-RateLimit-Remaining / X-RateLimit-Reset headers

discord.py's HTTPClient exposes the headers via the rate-limit handler, but _safe_sync_slash_commands issues calls in a tight loop without inspecting them. A wrapper that sleeps until X-RateLimit-Reset when X-RateLimit-Remaining <= 1 smooths the post-prune burst without changing the contract.

Trade-off: still serial and still bounded by the outer 30 s — buys ~10-15 s of headroom, not unlimited.
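
The pacing rule itself is small enough to sketch as a pure function. How the headers are obtained from discord.py is not shown in the issue, so the hypothetical `seconds_to_wait` helper below takes a plain mapping of the last response's headers and assumes X-RateLimit-Reset is an epoch timestamp in seconds.

```python
import time

def seconds_to_wait(headers, now=None):
    """Return how long to sleep before the next write.

    Reads X-RateLimit-Remaining and X-RateLimit-Reset from the previous
    response. Missing or malformed headers mean no pacing.
    """
    now = time.time() if now is None else now
    try:
        remaining = int(headers["X-RateLimit-Remaining"])
        reset_at = float(headers["X-RateLimit-Reset"])
    except (KeyError, ValueError):
        return 0.0
    if remaining > 1:
        return 0.0  # plenty of room left in the current window
    return max(0.0, reset_at - now)

near_limit = {"X-RateLimit-Remaining": "1", "X-RateLimit-Reset": "1007.5"}
plenty_left = {"X-RateLimit-Remaining": "4", "X-RateLimit-Reset": "1007.5"}
print(seconds_to_wait(near_limit, now=1000.0))
print(seconds_to_wait(plenty_left, now=1000.0))
```

A caller would await asyncio.sleep on the returned value between writes; the function does not change the fact that the total still has to fit inside the outer budget.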

3. Widen / drop the outer wait_for for the cleanup branch

Cheapest: bump the timeout at line 816 to 120 s, or split into a fast-stage 30 s and a slow-stage None. Fixes the symptom without addressing the burst.

Trade-off: a real failure now hangs the post-connect handler for 2 min instead of 30 s. Acceptable, but feels like papering over the burst.

Recommendation

Option 1 as the structural fix; option 2 as a follow-up if you prefer incremental patches. Option 3 alone hides the burst rather than handling it.

References

  • gateway/platforms/discord.py:801-832: _run_post_connect_initialization outer wait_for(..., 30).
  • gateway/platforms/discord.py:937-1009: _safe_sync_slash_commands reconciler.
  • Autocomplete refactor commit d7fb435e — the test surface is rooted here.
  • Discord global command-management rate-limit: ~5 writes / app / 20 s, headers X-RateLimit-Remaining, X-RateLimit-Reset.

extent analysis

TL;DR

Split the sync into a fast critical path and a deferred sweep to avoid timing out due to Discord's rate limit.

Guidance

  • Identify critical commands that need to be synced within the 30-second budget, such as those in COMMAND_REGISTRY and the /skill autocomplete dispatcher.
  • Move orphan cleanup into a detached background task with a separate, wider budget to avoid blocking the main sync process.
  • Consider pacing upserts based on X-RateLimit-Remaining and X-RateLimit-Reset headers to smooth out the post-prune burst.
  • Review the trade-offs of each patch option, including the impact on user visibility and error handling.

Example

fast = {k: v for k, v in desired_by_key.items() if _is_critical(v)}
slow = {k: v for k, v in desired_by_key.items() if k not in fast}

await self._reconcile_subset(fast)
self._spawn_deferred_sync(slow, orphans=existing_by_key)

Notes

The provided patch options have different trade-offs, and the choice of solution depends on the specific requirements and constraints of the system. Option 1 is recommended as the structural fix, while option 2 can be considered as a follow-up incremental patch.

Recommendation

Apply patch option 1, splitting the sync into a fast critical path and a deferred sweep, as it addresses the root cause of the issue and provides a net win by avoiding cmd-sync timeouts.
