hermes - 💡(How to fix) Fix [Bug]: Hermes Gateway PID file race condition and fix [1 participants]

GitHub stats
NousResearch/hermes-agent#13511 (fetched 2026-04-22 08:06:03)
Comments: 0
Participants: 1
Timeline: 1
Reactions: 0
Timeline (top): labeled ×1

Bug Description

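Hermes Gateway PID file race condition and fix

Symptoms

Hermes Gateway repeatedly fails at startup with:

ERROR gateway.run: PID file race lost to another gateway instance. Exiting.

The log shows the Gateway attempting to start several times and exiting each time because it lost the PID file race, even though no Gateway process is actually running.

Cause

Root cause: stale PID files are never cleaned up automatically

Hermes Gateway uses a PID file (~/.hermes/gateway.pid) to detect an already-running Gateway instance and to prevent multiple instances from running at once. The PID file is JSON and records the process PID, start time (start_time), command-line arguments, and other metadata.

In the normal lifecycle:

  • The Gateway writes the PID file on startup
  • On a clean exit, the Gateway deletes the PID file via remove_pid_file()
  • On an abnormal exit (e.g. SIGKILL or a crash), the PID file is not deleted

Failure scenario

  1. The Gateway exits abnormally (kill -9, terminal disconnect, OOM, etc.) and the PID file is left behind
  2. A new Gateway starts and calls write_pid_file(), which tries to create the PID file atomically with O_CREAT | O_EXCL
  3. Because of O_EXCL, the call raises FileExistsError when the file already exists
  4. The original write_pid_file() never checks whether the process recorded in the PID file is still alive; it simply propagates FileExistsError
  5. Result: the new Gateway exits because of the stale PID file, even though the old process died long ago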
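
The failure scenario can be reproduced in isolation. The following is a minimal sketch, not the Hermes code: `try_create_pid_file` and the temporary `gateway.pid` path are hypothetical stand-ins, and the record contains only a `pid` field. It shows that a file left behind by a "crashed" first start makes the second exclusive create fail, whether or not the first process is still alive.

```python
import json
import os
import tempfile
from pathlib import Path


def try_create_pid_file(path: Path, pid: int) -> bool:
    """Create the PID file exclusively; return False if it already exists."""
    try:
        # O_CREAT | O_EXCL fails with FileExistsError if the path exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(json.dumps({"pid": pid}))
    return True


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for ~/.hermes/gateway.pid
    pid_path = Path(tmp) / "gateway.pid"
    first_start = try_create_pid_file(pid_path, os.getpid())
    # Simulate an abnormal exit: nothing removes the file, so it is now stale.
    second_start = try_create_pid_file(pid_path, os.getpid())
    print(first_start, second_start)
```

The second call fails purely because the path exists; O_EXCL has no notion of whether the file's owner is still running, which is why a liveness check has to be added separately.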

Original write_pid_file() code

def write_pid_file() -> None:
    """Write the current process PID and metadata to the gateway PID file.

    Uses atomic O_CREAT | O_EXCL creation so that concurrent --replace
    invocations race: exactly one process wins and the rest get
    FileExistsError.
    """
    path = _get_pid_path()
    path.parent.mkdir(parents=True, exist_ok=True)
    record = json.dumps(_build_pid_record())
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise  # Let caller decide: another gateway is racing us
    ...

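Fix

In write_pid_file(), add stale PID file detection and cleanup before the O_EXCL write:

  1. Check whether the PID file exists
  2. Read the PID recorded in the PID file
  3. Probe the process with os.kill(pid, 0)
  4. If the process is dead (ProcessLookupError), delete the stale PID file
  5. Then attempt the O_EXCL write as usual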
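
Step 3 is the core of the check. Below is a minimal sketch of a liveness probe built on `os.kill(pid, 0)`; the helper name `pid_is_alive` is hypothetical and does not come from the Hermes codebase. It is POSIX-only: on Windows, `os.kill` with a signal value of 0 terminates the target instead of probing it. The sketch assumes that PermissionError means "the process exists but belongs to another user" and therefore reports it as alive, which is one possible reading; whether such a PID file should count as stale is a policy decision.

```python
import os
import subprocess
import sys


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    if pid <= 0:
        # 0 and negative values address process groups, not a single process.
        return False
    try:
        os.kill(pid, 0)  # signal 0: only error checking, nothing is delivered
    except ProcessLookupError:
        return False  # ESRCH: no such process
    except PermissionError:
        return True  # EPERM: the process exists but we may not signal it
    return True


# Start a short-lived child and reap it, so its PID no longer names a process.
child = subprocess.Popen([sys.executable, "-c", "pass"])
child.wait()
print(pid_is_alive(os.getpid()), pid_is_alive(child.pid))
```

A probe like this only says that some process holds the PID. It cannot tell a live Gateway from an unrelated process that was later assigned the same number.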

Fixed code

def write_pid_file() -> None:
    """Write the current process PID and metadata to the gateway PID file.

    Uses atomic O_CREAT | O_EXCL creation so that concurrent --replace
    invocations race: exactly one process wins and the rest get
    FileExistsError.

    Before attempting O_EXCL, cleans up stale PID files where the recorded
    process has exited (ProcessLookupError) - this prevents a stale PID file
    from blocking a new gateway from starting.
    """
    path = _get_pid_path()
    path.parent.mkdir(parents=True, exist_ok=True)
    record = json.dumps(_build_pid_record())

    # Pre-check: if PID file exists but the recorded process is dead,
    # remove it first so the O_EXCL write won't fail on a stale file.
    if path.exists():
        existing = _read_pid_record(path)
        if existing:
            try:
                pid = int(existing["pid"])
                os.kill(pid, 0)
            except (ProcessLookupError, PermissionError, ValueError, TypeError, KeyError):
                # No such process, a process we may not signal, or a malformed
                # record (e.g. missing "pid") - treat the PID file as stale
                try:
                    path.unlink(missing_ok=True)
                except OSError:
                    pass

    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise  # Let caller decide: another gateway is racing us
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(record)
    except Exception:
        try:
            path.unlink(missing_ok=True)
        except OSError:
            pass
        raise

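Modified files

  • hermes-agent/gateway/status.py - the write_pid_file() function

Effect of the fix

  • After an abnormal Gateway exit, the next startup cleans up the stale PID file automatically and no longer fails
  • The race behavior of concurrent gateway run --replace invocations is unchanged (genuinely competing processes are still arbitrated by O_EXCL)
  • get_running_pid() already has full liveness detection (PID + start_time + cmdline); the fix does not change existing behavior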
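
The last point matters because a bare PID check can be fooled when the operating system reuses a PID. The sketch below illustrates the general idea of pairing a PID with a start time. It is not the actual get_running_pid() implementation: the helper names are hypothetical, it assumes Linux (it reads field 22, starttime, from /proc/<pid>/stat), and it assumes the record stores start_time in the same unit, which the issue does not specify.

```python
import os
from pathlib import Path
from typing import Optional


def read_start_ticks(pid: int) -> Optional[int]:
    """Return the process start time in clock ticks since boot, or None.

    Linux only: parses field 22 (starttime) of /proc/<pid>/stat.
    """
    try:
        stat = Path(f"/proc/{pid}/stat").read_text(
            encoding="utf-8", errors="replace"
        )
    except OSError:
        return None  # no such process, or /proc is unavailable
    # The command name (field 2) is parenthesised and may contain spaces,
    # so split on the last ')' before tokenising the remaining fields.
    rest = stat.rsplit(")", 1)[-1].split()
    try:
        return int(rest[19])  # rest[0] is field 3, so field 22 is rest[19]
    except (IndexError, ValueError):
        return None


def same_process(record: dict, pid: int) -> bool:
    """True only if the recorded PID and start time both match a live process."""
    live_start = read_start_ticks(pid)
    return (
        live_start is not None
        and record.get("pid") == pid
        and record.get("start_time") == live_start
    )


me = os.getpid()
record = {"pid": me, "start_time": read_start_ticks(me)}
reused = {"pid": me, "start_time": -1}  # same PID, different start time
print(same_process(record, me), same_process(reused, me))
```

With a check of this kind, a PID file whose PID has since been handed to an unrelated process is still recognized as stale, which a plain os.kill(pid, 0) probe would miss.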

Steps to Reproduce

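Have Hermes update itself automatically and restart the gateway. At that point Hermes ran a kill command directly. After the gateway was started again it could not receive messages, and restarting the gateway again made no difference.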

Expected Behavior

The PID is occupied

Actual Behavior

...

Affected Component

Gateway (Telegram/Discord/Slack/WhatsApp)

Messaging Platform (if gateway-related)

No response

Debug Report

ERROR gateway.run: PID file race lost to another gateway instance. Exiting.

Operating System

debian

Python Version

No response

Hermes Version

v0.10.0

Additional Logs / Traceback (optional)

Root Cause Analysis (optional)

No response

Proposed Fix (optional)

Same as the fix plan and fixed code given under Bug Description.

Are you willing to submit a PR for this?

  • I'd like to fix this myself and submit a PR

extent analysis

TL;DR

The issue can be fixed by modifying the write_pid_file() function to detect and clean up stale PID files before attempting to write a new one.

Guidance

  • Check if the PID file exists and read the recorded PID.
  • Use os.kill(pid, 0) to detect if the process is still alive.
  • If the process is dead, remove the stale PID file before attempting to write a new one.
  • Modify the write_pid_file() function to include this logic, as shown in the provided fix.

Example

The modified write_pid_file() function is provided in the issue:

def write_pid_file() -> None:
    # ...
    if path.exists():
        existing = _read_pid_record(path)
        if existing:
            try:
                pid = int(existing["pid"])
                os.kill(pid, 0)
            except (ProcessLookupError, PermissionError, ValueError, TypeError, KeyError):
                # No such process, a process we may not signal, or a malformed
                # record (e.g. missing "pid") - treat the PID file as stale
                try:
                    path.unlink(missing_ok=True)
                except OSError:
                    pass
    # ...

Notes

The fix only modifies the write_pid_file() function in the hermes-agent/gateway/status.py file and does not affect other parts of the code.

Recommendation

Apply the proposed fix: modify write_pid_file() so that it detects and removes a stale PID file before the O_EXCL write. The change is confined to that one function, so it should resolve the issue without significant changes to the existing codebase.
