hermes - ✅(Solved) Fix gateway run --replace race condition: multiple instances run simultaneously [2 pull requests, 1 participant]

GitHub stats

Issue: NousResearch/hermes-agent#11718
Fetched: 2026-04-18 05:59:13
Comments: 0
Participants: 1
Timeline events: 6 (referenced ×4, cross-referenced ×2)
Reactions: 0

When starting the gateway with --replace, a race condition can leave multiple gateway instances running simultaneously. This triggers Telegram (and likely other platform) polling conflicts and causes the bot to become unresponsive.

Error Message

WARNING gateway.platforms.telegram: [Telegram] Telegram polling conflict (1/3), will retry in 10s. Error: Conflict: terminated by other getUpdates request; make sure that only one bot instance is running

Root Cause

In start_gateway() (gateway/run.py), the new process writes its PID to the PID file before the old process has exited. A racing second --replace invocation then reads its own PID from the file (instead of the old process PID), so it skips the termination step and both instances run.

Fix Action

Fixed

PR fix notes

PR #11720: fix(gateway): claim PID file immediately after old process exits on --replace

Description (problem / solution / changelog)

Fixes #11718

Problem

When two --replace invocations overlap (e.g. launchd auto-restart racing a manual restart), the second process reads an empty PID file after the first has called remove_pid_file(). It finds no existing PID, skips termination, and both instances end up running simultaneously — causing repeated Telegram polling conflicts:

Conflict: terminated by other getUpdates request; make sure that only one bot instance is running

Fix

Write the new process PID to the file right after remove_pid_file(), before continuing initialization. Any concurrent --replace invocation will then see the current process PID and correctly terminate it instead of finding an empty file and proceeding unchecked.

remove_pid_file()
# Claim the PID file immediately so any concurrent --replace
# invocation sees this process rather than an empty file and
# skips the termination step, preventing duplicate instances.
from gateway.status import write_pid_file as _write_pid_now
_write_pid_now()

Changed files

  • gateway/run.py (modified, +5/-0)

PR #11734: fix(gateway): prevent --replace race condition causing multiple instances

Description (problem / solution / changelog)

Summary

Fixes #11718.

When starting the gateway with --replace, concurrent invocations could leave multiple instances running simultaneously. This triggered Telegram (and other platform) polling conflicts, causing the bot to become unresponsive.

Root Cause

write_pid_file() in gateway/status.py used a plain file overwrite. Two racing --replace processes could both proceed through the termination-wait logic, then the second process would silently overwrite the first process's PID record, leaving both instances alive.

Changes

  • gateway/status.py: write_pid_file() now uses atomic O_CREAT | O_EXCL file creation. If the PID file already exists, it raises FileExistsError, ensuring exactly one process wins the race.
  • gateway/run.py: Before writing the PID file, performs a defensive re-check via get_running_pid(). Also catches FileExistsError from write_pid_file(). In both cases, stops the runner and returns False so the process exits cleanly (non-zero exit code for systemd auto-restart compatibility).
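The exclusive-create behavior described for write_pid_file() can be sketched as follows. This is a minimal illustration, not the code from the PR: the helper name write_pid_file_atomic and its pid_path parameter are assumptions made for the example, and only the O_CREAT | O_EXCL technique comes from the PR description.

```python
import os
import tempfile
from pathlib import Path


def write_pid_file_atomic(pid_path: Path) -> None:
    """Create pid_path exclusively and record this process's PID in it.

    Hypothetical helper for illustration. Raises FileExistsError if the
    file already exists, so only one of several racing callers succeeds.
    """
    # O_CREAT | O_EXCL makes the open fail if the path already exists;
    # the existence check and the creation happen as one atomic step.
    fd = os.open(pid_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))


if __name__ == "__main__":
    demo_path = Path(tempfile.mkdtemp()) / "gateway.pid"
    write_pid_file_atomic(demo_path)        # first claim succeeds
    try:
        write_pid_file_atomic(demo_path)    # second claim is rejected
    except FileExistsError:
        print("PID file already claimed")
```

One consequence of this design is that a stale file left behind by a crashed process also blocks creation, so the caller has to clear or validate the old file first. How the PR handles that case is not shown in the description above.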

Reproduction

Run two gateway --replace invocations concurrently (e.g. via systemd restart overlap or manual double-start):

# Terminal 1
hermes gateway run --replace &

# Terminal 2 (within ~1s)
hermes gateway run --replace &

Before the fix: both processes write to gateway.pid and start polling Telegram, causing:

Conflict: terminated by other getUpdates request; make sure that only one bot instance is running

After the fix: the second process detects the race, logs an error, and exits before starting any platform adapters.
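The "detect the race, log, and exit" flow might look roughly like the sketch below. The function claim_or_abort and its three callable parameters are hypothetical stand-ins for get_running_pid(), write_pid_file(), and the runner shutdown mentioned in the PR description; the actual code in gateway/run.py is not reproduced here.

```python
import logging
import os
from typing import Callable, Optional

logger = logging.getLogger("gateway.sketch")


def claim_or_abort(
    get_running_pid: Callable[[], Optional[int]],
    write_pid_file: Callable[[], None],
    stop_runner: Callable[[], None],
) -> bool:
    """Return True if startup may continue, False if this process lost the race."""
    # Defensive re-check: another instance may have claimed the PID file
    # while this process was waiting for the old one to exit.
    other = get_running_pid()
    if other is not None and other != os.getpid():
        logger.error("Another gateway instance (PID %s) is running; aborting", other)
        stop_runner()
        return False
    try:
        write_pid_file()  # assumed to raise FileExistsError when the file exists
    except FileExistsError:
        logger.error("Lost the PID file race; aborting")
        stop_runner()
        return False
    return True
```

A caller would translate a False return into a non-zero exit code, which matches the behavior the PR describes for systemd auto-restart compatibility.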

Test Plan

Existing tests

pytest tests/gateway/test_status.py -q                    # 14 passed
pytest tests/gateway/test_runner_startup_failures.py -q   # included in 75 passed
pytest tests/hermes_cli/test_update_gateway_restart.py -q # 34 passed
pytest tests/gateway/test_telegram_conflict.py -q         # 6 passed

Concurrent race simulation

A standalone script was used to verify the atomicity under real OS-level concurrency:

import multiprocessing
import os
import tempfile
from pathlib import Path
from gateway import status

def racer(pid_file, queue):
    # Point the status module at the shared temporary PID file
    status._get_pid_path = lambda: Path(pid_file)
    try:
        status.write_pid_file()
        queue.put(("ok", os.getpid()))
    except FileExistsError:
        queue.put(("race_lost", os.getpid()))

if __name__ == "__main__":  # required under the "spawn" start method (the macOS default)
    pid_file = Path(tempfile.mkdtemp()) / "gateway.pid"
    q = multiprocessing.Queue()
    p1 = multiprocessing.Process(target=racer, args=(str(pid_file), q))
    p2 = multiprocessing.Process(target=racer, args=(str(pid_file), q))
    p1.start(); p2.start()
    # Drain the queue before joining the workers
    results = [q.get() for _ in range(2)]
    p1.join(); p2.join()

    assert len([r for r in results if r[0] == "ok"]) == 1
    assert len([r for r in results if r[0] == "race_lost"]) == 1

Result: exactly 1 process succeeds, the other receives FileExistsError.

Backwards Compatibility

  • write_pid_file() signature unchanged; callers that don't handle FileExistsError will see the exception propagate (which is the desired behavior for race detection).
  • remove_pid_file() already guards against deleting another process's PID file, so old-process atexit handlers won't clobber the new PID record.
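The ownership guard attributed to remove_pid_file() can be illustrated with a small sketch. The helper name remove_pid_file_if_owned and its pid_path parameter are assumptions for this example; the real implementation in gateway/status.py may differ.

```python
import os
import tempfile
from pathlib import Path


def remove_pid_file_if_owned(pid_path: Path) -> bool:
    """Delete pid_path only if it records this process's PID.

    Hypothetical helper for illustration. Returns True if the file was deleted.
    """
    try:
        recorded = int(pid_path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return False  # missing or unparsable file: nothing safe to remove
    if recorded != os.getpid():
        return False  # the file belongs to another process; leave it alone
    pid_path.unlink(missing_ok=True)
    return True


if __name__ == "__main__":
    demo_path = Path(tempfile.mkdtemp()) / "gateway.pid"
    demo_path.write_text(str(os.getpid() + 1))   # pretend another process owns it
    print(remove_pid_file_if_owned(demo_path))   # the file is left in place
```

The read and the unlink are two separate steps, so a small window remains between them. A guard like this reduces the chance of an old atexit handler clobbering a new record but does not replace atomic creation.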

Changed files

  • gateway/run.py (modified, +19/-2)
  • gateway/status.py (modified, +22/-2)
  • scripts/release.py (modified, +1/-0)

Description

When starting the gateway with --replace, a race condition can leave multiple gateway instances running simultaneously. This triggers Telegram (and likely other platform) polling conflicts and causes the bot to become unresponsive.

Steps to Reproduce

  1. Start the gateway normally (e.g. via launchd/systemd)
  2. A second instance starts with --replace (e.g. manual restart or service restart overlap)
  3. Both processes remain alive simultaneously

Actual Behavior

Multiple processes run at once (observed PIDs 548, 4101, and 4188 all alive simultaneously). Repeated errors in logs:

WARNING gateway.platforms.telegram: [Telegram] Telegram polling conflict (1/3), will retry in 10s.
Error: Conflict: terminated by other getUpdates request; make sure that only one bot instance is running

Expected Behavior

The old process should be fully terminated before the new one starts polling.

Root Cause

In start_gateway() (gateway/run.py), the new process writes its PID to the PID file before the old process has exited. A racing second --replace invocation then reads its own PID from the file (instead of the old process PID), so it skips the termination step and both instances run.

Environment

  • Platform: macOS (darwin)
  • Triggered by launchd auto-restart overlapping with a manual gateway run --replace

Suggested Fix

Write the new PID to the PID file only after the old process has been confirmed dead, or use a separate lock file that is held for the duration of the transition.
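The lock-file alternative could be sketched with an advisory lock, as below. This is an assumption-laden illustration, not something either PR implements: the transition_lock name and the lock path are hypothetical, and fcntl.flock is available on POSIX systems (including macOS and Linux) but not on Windows.

```python
import contextlib
import fcntl
import os
import tempfile
from pathlib import Path


@contextlib.contextmanager
def transition_lock(lock_path: Path):
    """Hold an exclusive advisory lock on lock_path for the duration of the block."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # Blocks until no other open file description holds the lock.
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


if __name__ == "__main__":
    lock_path = Path(tempfile.mkdtemp()) / "gateway.lock"
    with transition_lock(lock_path):
        # Inside the lock: terminate the old process, wait for it to exit,
        # then write the new PID. Concurrent --replace invocations block here.
        print("holding the transition lock")
```

Because the whole terminate-wait-write sequence runs under one lock, a second --replace invocation cannot observe the PID file in an intermediate state; it simply waits and then sees the new owner.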

Extent Analysis

TL;DR

The issue can be fixed by modifying the start_gateway() function to write the new PID to the PID file only after the old process has been confirmed dead.

Guidance

  • Modify the start_gateway() function in gateway/run.py to wait for the old process to exit before writing the new PID to the PID file.
  • Consider using a separate lock file to synchronize the transition between the old and new processes.
  • Verify that the old process has exited by polling its PID, with a bounded timeout, before proceeding.
  • Test the modified start_gateway() function to ensure that it correctly handles the --replace flag and prevents multiple instances from running simultaneously.

Example

import os
import time

def start_gateway():
    # ...
    old_pid = read_pid_from_file()
    if old_pid:
        # Wait for the old process to exit
        while is_process_running(old_pid):
            time.sleep(0.1)
    # Write the new PID to the PID file
    write_pid_to_file(os.getpid())
    # ...

Notes

The suggested fix assumes that the start_gateway() function has access to the PID file and can read and write to it. Additionally, the is_process_running() function is not defined in the issue, so its implementation is left to the developer.
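One possible shape for the missing liveness check is sketched here, together with a bounded wait so that startup cannot hang forever on a process that refuses to exit. Both functions are illustrative assumptions rather than code from the repository, and the os.kill(pid, 0) probe is POSIX-only; it should not be used this way on Windows.

```python
import os
import time


def is_process_running(pid: int) -> bool:
    """Best-effort POSIX check for whether a process with this PID exists."""
    try:
        os.kill(pid, 0)   # signal 0 performs error checking only; nothing is delivered
    except ProcessLookupError:
        return False      # no such process
    except PermissionError:
        return True       # the process exists but belongs to another user
    return True


def wait_for_exit(pid: int, timeout: float = 10.0, interval: float = 0.1) -> bool:
    """Poll until the process exits; return False if it is still alive at the deadline."""
    deadline = time.monotonic() + timeout
    while is_process_running(pid):
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True
```

A caller could use wait_for_exit(old_pid) in place of the unbounded loop and escalate (for example, with a stronger signal) when it returns False. Note that a zombie process that has not been reaped by its parent still counts as running under this check.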

Recommendation

Modify the start_gateway() function to wait for the old process to exit before writing the new PID to the PID file. This prevents multiple instances from running simultaneously and resolves the polling conflicts.
