hermes - 💡(How to fix) Fix gateway: Telegram clarify-button resolver loses state on SIGTERM restart and 600s timeout [2 pull requests]


Fix Action

Fixed

Summary

tools.clarify_gateway stores pending clarify requests in module-level dicts (_entries, _session_index). Two independent paths cause resolve_gateway_clarify(clarify_id, response) to return False silently, dropping the user's tapped-button reply:

Path 1 — gateway restart between ask and reply

State is in-memory only. When the gateway receives SIGTERM (e.g. launchd watchdog under high load on macOS) and restarts, the new process starts with empty _entries. Any clarify-button callback that arrives after the restart hits _entries.get(clarify_id) is None → return False.

Path 2 — wait_for_response evicts on timeout

Default agent.clarify_timeout: 600 (10 min). If the user takes longer than 600s — meeting, AFK, slow to read — wait_for_response returns None and removes the entry from _entries in its cleanup block. When the user finally taps the inline button, resolve_gateway_clarify returns False. The Telegram button shows as tapped but the agent never receives the choice.

Reproduction

  1. Default agent.clarify_timeout: 600 in config.yaml.
  2. Trigger any agent task that ends with clarify.
  3. Either:
    • Wait > 600s before tapping a button, OR
    • Force restart the gateway (launchctl kickstart -k gui/501/ai.hermes.gateway on macOS) between the prompt being delivered and the user tapping.
  4. Tap a choice button on Telegram.
  5. Observe in logs: WARNING gateway.platforms.telegram: Telegram clarify button: resolve_gateway_clarify returned False (id=<clarify_id>). Agent iteration counter freezes; agent reports running: clarify indefinitely.

User-visible symptom

Agent shows "running: clarify" forever. From the user's perspective they answered, but the agent ignored it. Eventually the agent hits max_iterations and the run aborts with no useful output.

Code references

  • tools/clarify_gateway.py:144-148 (wait_for_response): removes the entry from _entries after the wait loop exits (resolved OR timeout).
  • tools/clarify_gateway.py:150-159 (resolve_gateway_clarify): returns False for unknown IDs, with no notification to the user.
  • gateway/platforms/telegram.py:3365 — call site where the False return is logged but not surfaced to the user.

Suggested fixes (any combination)

  1. Persist clarify state to disk (SQLite/JSON) so it survives a SIGTERM restart. Restore on gateway boot.
  2. Notify the user on expired tap — when resolve_gateway_clarify returns False, edit the original message via the Telegram callback to say "⚠️ This question expired or the session reset — please /retry." Today the button silently does nothing.
  3. Raise default clarify_timeout to 3600s and document the tradeoff against the running-agent guard.
  4. Decouple timeout-eviction from _entries — let wait_for_response return None on timeout but keep the entry around for a grace window, so late taps can still re-deliver the response.
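
Fixes 1 and 4 could be combined roughly as follows. This is a minimal sketch, assuming a SQLite file as the backing store. ClarifyStore, GRACE_SECONDS, the table layout and the "ok" / "late" / "expired" return values are all hypothetical; how a late response is re-delivered to the agent session is left out.

```python
import sqlite3
import time

GRACE_SECONDS = 300  # hypothetical grace window after the clarify timeout


class ClarifyStore:
    """Pending clarify requests kept in SQLite instead of a module-level dict."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS clarify ("
            "id TEXT PRIMARY KEY, session TEXT, expires_at REAL, response TEXT)"
        )
        self.db.commit()

    def add(self, clarify_id, session, timeout, now=None):
        """Record a new pending request that times out `timeout` seconds from now."""
        now = time.time() if now is None else now
        self.db.execute(
            "INSERT OR REPLACE INTO clarify VALUES (?, ?, ?, NULL)",
            (clarify_id, session, now + timeout),
        )
        self.db.commit()

    def resolve(self, clarify_id, response, now=None):
        """Record a tap. Returns 'ok', 'late' (inside the grace window) or 'expired'."""
        now = time.time() if now is None else now
        row = self.db.execute(
            "SELECT expires_at FROM clarify WHERE id = ?", (clarify_id,)
        ).fetchone()
        if row is None or now > row[0] + GRACE_SECONDS:
            return "expired"
        self.db.execute(
            "UPDATE clarify SET response = ? WHERE id = ?", (response, clarify_id)
        )
        self.db.commit()
        return "ok" if now <= row[0] else "late"

    def purge(self, now=None):
        """Drop rows whose grace window has passed; call on boot and periodically."""
        now = time.time() if now is None else now
        self.db.execute(
            "DELETE FROM clarify WHERE expires_at + ? < ?", (GRACE_SECONDS, now)
        )
        self.db.commit()
```

Because the rows live on disk, a gateway that reopens the same file after a restart can still look up a clarify_id, and the timeout no longer deletes the record the moment the wait ends.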

Happy to send a PR for fix #2 (smallest scope, biggest UX win) if there is interest.
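
For reference, fix #2 might look something like the sketch below. handle_clarify_tap and EXPIRED_TEXT are hypothetical names, and edit_message stands in for whatever call the Telegram platform layer uses to edit the original message; none of this is taken from gateway/platforms/telegram.py.

```python
from typing import Callable

# Illustrative wording only; the real message text is up to the maintainers.
EXPIRED_TEXT = "⚠️ This question expired or the session reset. Please /retry."


def handle_clarify_tap(
    clarify_id: str,
    response: str,
    resolve: Callable[[str, str], bool],
    edit_message: Callable[[str], None],
) -> bool:
    """Forward a button tap; tell the user when it can no longer be delivered."""
    delivered = resolve(clarify_id, response)
    if not delivered:
        # Today this branch only logs a warning; surfacing it is the proposed change.
        edit_message(EXPIRED_TEXT)
    return delivered


# Usage with stand-in callbacks: an unknown ID produces a visible notice.
sent = []
handle_clarify_tap("gone", "yes", lambda _id, _resp: False, sent.append)
```

Passing resolve and edit_message in as callables keeps the sketch independent of any particular Telegram client library and makes the expired-tap branch easy to exercise in a unit test.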
