hermes - ✅(Solved) Fix bug: _last_poll_timestamps read/write in _poll_once missing lock protection [2 pull requests, 1 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
NousResearch/hermes-agent#23096Fetched 2026-05-11 03:31:14
View on GitHub
Comments
1
Participants
2
Timeline
10
Reactions
0
Author
Participants
Timeline (top)
referenced ×4labeled ×3cross-referenced ×2commented ×1

Fix Action

Fixed

PR fix notes

PR #23174: fix(tools): protect _last_poll_timestamps with self._lock in EventBridge

Description (problem / solution / changelog)

Acquire self._lock around the read and write of _last_poll_timestamps in EventBridge._poll_once so it matches the locking pattern used everywhere else in the class.

What changed and why

  • mcp_serve.py: wrap the self._last_poll_timestamps.get(...) read at the top of the per-session loop and the self._last_poll_timestamps[session_key] = latest write at the end with with self._lock:. Every other access to shared state on EventBridge (_queue, _cursor, _pending_approvals) is already taken under self._lock; this dict was the one outlier the bug report flagged.
  • The issue suggested wrapping the whole _poll_once loop body in with self._lock:. That would deadlock: self._lock is a non-reentrant threading.Lock, and the loop body calls self._enqueue(...) whenever new messages arrive, which itself acquires self._lock. Holding the lock around just the dict accesses gives the consistency the report asked for without introducing a deadlock or a long critical section that would block poll_events / wait_for_event callers during DB reads.
  • tests/test_mcp_serve.py: add test_poll_once_holds_lock_around_last_poll_timestamps. It swaps in a wrapper around bridge._lock that tracks held depth and a dict subclass that records every read/write attempted while the lock is not held, then runs _poll_once against an in-memory test SQLite DB and asserts the unlocked-access list is empty.

How to test

  • pytest tests/test_mcp_serve.py -q — 86 tests pass, including the new regression test.
  • pytest tests/test_mcp_serve.py tests/tools/test_mcp_stability.py -q — 102 tests pass; no other MCP suites regressed.

What platforms tested on

  • macOS on darwin-arm64 (local)

Fixes #23096

<!-- autocontrib:worker-id=issue-new-6612ee04 kind=pr-open -->

Changed files

  • mcp_serve.py (modified, +4/-2)
  • tests/test_mcp_serve.py (modified, +95/-0)

PR #23219: Fix/mcp poll timestamps lock

Description (problem / solution / changelog)

What does this PR do?

Fixes a real data race in EventBridge._poll_once (mcp_serve.py): it was reading and writing _last_poll_timestamps from the background poll thread without acquiring self._lock, while every other bridge method (_enqueue, poll_events, wait_for_event, list_pending_approvals, respond_to_approval) consistently holds the lock around its access to shared state.

The race produces stale/torn reads of the per-session timestamp on the GIL (the dict access is not atomic with respect to foreground iteration once you account for the upcoming free-threaded CPython builds, and even on the GIL the hand-off can lose updates).

The PR is split into four commits in test-first order so each step is independently reviewable:

  1. test(mcp): add regression tests for EventBridge._poll_once lock invariant — pins the contract by replacing _last_poll_timestamps with a wrapper that fails fast when self._lock is not held by the current thread on access. Three scenarios: single poll cycle, repeated polls, and concurrent _poll_once calls hammered alongside foreground bridge API calls. All three fail on main.
  2. fix(mcp): protect _last_poll_timestamps with self._lock in _poll_once — wraps the two touch points (read at the top of the per-session loop, write at the end) in with self._lock:. The lock is intentionally released across the blocking db.get_messages call and the in-Python filtering loop so foreground methods are not starved. The regression tests from commit 1 now pass.
  3. refactor(mcp): factor lock-safe helpers for _last_poll_timestamps — replaces the inline with self._lock: ... blocks with two narrow helpers, _read_last_poll_timestamp and _record_last_poll_timestamp, so the lock discipline is not something each future caller has to remember. No behaviour change.
  4. docs(mcp): document EventBridge lock invariant in class docstring — spells out which EventBridge attributes are shared between the background poll thread and foreground callers, and which are poll-thread-private. Adds a pointed comment on _last_poll_timestamps itself reminding contributors to go through the _read_/_record_ helpers rather than touching the dict directly. This is the readme that would have prevented the original bug.

Related Issue

Fixes #23096

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • mcp_serve.py (+62, −6):
    • Added _read_last_poll_timestamp(session_key) and _record_last_poll_timestamp(session_key, ts) helpers — both acquire self._lock for the single dict op.
    • Rewrote _poll_once to go through the helpers; the rest of the loop body still runs outside the lock so blocking I/O on db.get_messages cannot starve foreground bridge methods.
    • Expanded the EventBridge class docstring to document the lock invariant: which attributes are shared, which are poll-thread-private, and that any access to _last_poll_timestamps MUST go through the new helpers.
  • tests/test_mcp_serve.py (+177): three regression tests covering the read at line 386, the write at line 443, and concurrent foreground/background access.

No other files modified.

How to Test

  1. Set up .venv: python3 -m venv .venv && source .venv/bin/activate && pip install -e ".[all,dev]"
  2. Run the new MCP tests: scripts/run_tests.sh tests/test_mcp_serve.py

Expected: 88 passed (≈1.6s). 3. Confirm the regression cases fail on main before this PR: git stash && git checkout upstream/main -- mcp_serve.py scripts/run_tests.sh tests/test_mcp_serve.py -k "lock_invariant or poll_once" git checkout HEAD -- mcp_serve.py && git stash pop Expected: 3 failures with _last_poll_timestamps get happened without holding bridge._lock.

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (test(mcp):, fix(mcp):, refactor(mcp):, docs(mcp):)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix (4 commits, 2 files)
  • I've run scripts/run_tests.sh tests/test_mcp_serve.py and all tests pass
  • I've added tests for my changes (commit 1 — written test-first so the regression bites on main)
  • I've tested on my platform: macOS 15.6 (Darwin 24.6.0), Python 3.12

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — class docstring + inline _last_poll_timestamps comment in mcp_serve.py (commit 4)
  • I've updated cli-config.yaml.example if I added/changed config keys — N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guidethreading.RLock is stdlib, no platform-specific calls; tests use only threading + unittest.mock
  • I've updated tool descriptions/schemas if I changed tool behavior — N/A (no public-facing tool changes; _poll_once remains internal)

Screenshots / Logs

============================== 88 passed in 1.65s ==============================

Changed files

  • mcp_serve.py (modified, +62/-6)
  • tests/test_mcp_serve.py (modified, +177/-0)
RAW_BUFFERClick to expand / collapse

Body:

Bug Description

In mcp_serve.py, the _last_poll_timestamps dictionary is read and written in _poll_once() without holding self._lock, while the rest of the
class consistently uses with self._lock: to protect shared state.

Affected Lines

  • Line 386: last_seen = self._last_poll_timestamps.get(session_key, 0.0) — read without lock
  • Line 443: self._last_poll_timestamps[session_key] = latest — write without lock

Why This Is a Bug

_poll_loop runs in a background thread. Other methods (poll_events, wait_for_event, _enqueue) all acquire self._lock before accessing shared data. The missing lock on _last_poll_timestamps creates a data race that could cause stale reads or inconsistent updates under concurrent access.

Suggested Fix

Wrap the loop body in _poll_once() with with self._lock: to match the existing locking pattern used throughout the class.

Environment

  • Component: mcp_serve.py_SessionPollBroker._poll_once()
  • Type: Thread safety / data race

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING