hermes - 💡(How to fix) Fix [Bug]: skill_usage._mutate() has read-modify-write race across concurrent processes [2 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
NousResearch/hermes-agent#18795Fetched 2026-05-03 04:54:14
View on GitHub
Comments
2
Participants
2
Timeline
5
Reactions
0
Timeline (top)
labeled ×3commented ×2

Error Message

def _mutate(skill_name, mutator): if not skill_name: return try: if not is_agent_created(skill_name): return with _file_lock(_usage_file()): # ← add lock data = load_usage() rec = data.get(skill_name) if not isinstance(rec, dict): rec = _empty_record() mutator(rec) data[skill_name] = rec save_usage(data) except Exception as e: logger.debug(...)

Root Cause

Impact is currently low because usage counters are explicitly best-effort telemetry for the skill curator (see tools/skill_usage.py line 13: "failures log at DEBUG and return silently"). However, if usage data gains more importance in the future (e.g., curator auto-archive decisions), this race becomes a correctness issue. The fix is trivial (~10 lines) since the locking pattern already exists in the codebase.

Fix Action

Fix / Workaround

  • tools/skill_usage.py:
    • _mutate() (line 286) — backing all bump_view/bump_use/bump_patch/set_state/set_pinned calls
    • forget() (line 359) — direct load→modify→save
  • Downstream consumers:
    • agent/curator.py apply_automatic_transitions() — depends on last_activity_at for stale/archive decisions
    • hermes_cli/curator.py — depends on use_count/view_count/patch_count for status display

Code Example

# tools/skill_usage.py:286-306
def _mutate(skill_name, mutator):
    data = load_usage()       # ① read file → dict
    rec = data.get(skill_name)
    mutator(rec)              # ② in-memory mutate
    data[skill_name] = rec
    save_usage(data)          # ③ atomic os.replace() write-back

---

Process A: load_usage() → sees use_count=5
Process B: load_usage() → sees use_count=5  (same state)
Process A: mutates count to 6, save_usage() → writes {"x": {"use_count": 6}}
Process B: mutates count to 6, save_usage() → writes {"x": {"use_count": 6}}A's +1 lost

---

@contextmanager
def _file_lock(path: Path):
    """Acquire an exclusive file lock for read-modify-write safety."""
    lock_path = path.with_suffix(path.suffix + ".lock")
    fd = open(lock_path, "a+")
    fcntl.flock(fd, fcntl.LOCK_EX)
    yield
    fcntl.flock(fd, fcntl.LOCK_UN)

---

def _mutate(skill_name, mutator):
    if not skill_name:
        return
    try:
        if not is_agent_created(skill_name):
            return
        with _file_lock(_usage_file()):          # ← add lock
            data = load_usage()
            rec = data.get(skill_name)
            if not isinstance(rec, dict):
                rec = _empty_record()
            mutator(rec)
            data[skill_name] = rec
            save_usage(data)
    except Exception as e:
        logger.debug(...)
RAW_BUFFERClick to expand / collapse

[Bug]: skill_usage._mutate() has read-modify-write race across concurrent processes

Bug Description

tools/skill_usage.py's _mutate() (line 286-307) uses a classic read-modify-write pattern to update .usage.json with no inter-process synchronization:

# tools/skill_usage.py:286-306
def _mutate(skill_name, mutator):
    data = load_usage()       # ① read file → dict
    rec = data.get(skill_name)
    mutator(rec)              # ② in-memory mutate
    data[skill_name] = rec
    save_usage(data)          # ③ atomic os.replace() write-back

save_usage() uses atomic rename (os.replace) which only guarantees the write itself is atomic (no torn files), but the full read→mutate→write cycle is not. Concurrent processes cause lost updates:

Process A: load_usage() → sees use_count=5
Process B: load_usage() → sees use_count=5  (same state)
Process A: mutates count to 6, save_usage() → writes {"x": {"use_count": 6}}
Process B: mutates count to 6, save_usage() → writes {"x": {"use_count": 6}}  ← A's +1 lost

Steps to Reproduce

Any multi-process concurrency can trigger this:

  1. Two hermes CLI sessions running simultaneously
  2. Gateway handling messages from multiple platforms (Telegram + Discord) concurrently
  3. batch_runner parallel workers
  4. Cron job overlapping with an interactive session
  5. delegate_task sub-agents running in parallel

Expected Behavior

_mutate() and forget() should acquire an inter-process lock on .usage.json before the read-modify-write cycle.

Actual Behavior

Every write succeeds (os.replace itself never fails), but concurrent writers silently overwrite each other's updates, losing counter increments and timestamp bumps.

Affected Component

  • tools/skill_usage.py:
    • _mutate() (line 286) — backing all bump_view/bump_use/bump_patch/set_state/set_pinned calls
    • forget() (line 359) — direct load→modify→save
  • Downstream consumers:
    • agent/curator.py apply_automatic_transitions() — depends on last_activity_at for stale/archive decisions
    • hermes_cli/curator.py — depends on use_count/view_count/patch_count for status display

Proposed Fix

The same repo already handles this correctly in tools/memory_tool.py (line 146-179) using fcntl.flock on a lock file to protect read-modify-write:

@contextmanager
def _file_lock(path: Path):
    """Acquire an exclusive file lock for read-modify-write safety."""
    lock_path = path.with_suffix(path.suffix + ".lock")
    fd = open(lock_path, "a+")
    fcntl.flock(fd, fcntl.LOCK_EX)
    yield
    fcntl.flock(fd, fcntl.LOCK_UN)

_mutate() and forget() can reuse the same pattern:

def _mutate(skill_name, mutator):
    if not skill_name:
        return
    try:
        if not is_agent_created(skill_name):
            return
        with _file_lock(_usage_file()):          # ← add lock
            data = load_usage()
            rec = data.get(skill_name)
            if not isinstance(rec, dict):
                rec = _empty_record()
            mutator(rec)
            data[skill_name] = rec
            save_usage(data)
    except Exception as e:
        logger.debug(...)

The same lock should wrap forget() as well.

Debug Report

N/A — no crash or stack trace. Silent lost-update bug.

Impact is currently low because usage counters are explicitly best-effort telemetry for the skill curator (see tools/skill_usage.py line 13: "failures log at DEBUG and return silently"). However, if usage data gains more importance in the future (e.g., curator auto-archive decisions), this race becomes a correctness issue. The fix is trivial (~10 lines) since the locking pattern already exists in the codebase.

Operating System

macOS (race is OS-independent — applies to all platforms)

extent analysis

TL;DR

The most likely fix for the read-modify-write race in skill_usage._mutate() is to acquire an inter-process lock on .usage.json using fcntl.flock before the read-modify-write cycle.

Guidance

  • Identify the critical section of code in _mutate() and forget() where the lock should be acquired to prevent concurrent access.
  • Reuse the existing _file_lock pattern from tools/memory_tool.py to implement the lock.
  • Apply the lock to both _mutate() and forget() to ensure atomicity of the read-modify-write cycle.
  • Verify the fix by testing concurrent access scenarios, such as running multiple hermes CLI sessions simultaneously.

Example

def _mutate(skill_name, mutator):
    with _file_lock(_usage_file()):  # acquire lock
        data = load_usage()
        rec = data.get(skill_name)
        mutator(rec)
        data[skill_name] = rec
        save_usage(data)

Notes

The proposed fix assumes that the _file_lock pattern is correctly implemented and works as intended. Additionally, the fix may introduce performance overhead due to the locking mechanism, which should be considered in high-concurrency scenarios.

Recommendation

Apply the workaround by acquiring an inter-process lock on .usage.json using fcntl.flock to prevent the read-modify-write race in _mutate() and forget(). This fix is relatively simple and reuses existing code, making it a low-risk solution.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING