PR #23174: fix(tools): protect _last_poll_timestamps with self._lock in EventBridge

Repository: NousResearch/hermes-agent
Author: konsisumer
State: closed | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/23174

Description (problem / solution / changelog)

Acquire self._lock around the read and write of _last_poll_timestamps in EventBridge._poll_once so it matches the locking pattern used everywhere else in the class.

What changed and why

mcp_serve.py: wrap the self._last_poll_timestamps.get(...) read at the top of the per-session loop and the self._last_poll_timestamps[session_key] = latest write at the end with with self._lock:. Every other access to shared state on EventBridge (_queue, _cursor, _pending_approvals) is already taken under self._lock; this dict was the one outlier the bug report flagged.
The issue suggested wrapping the whole _poll_once loop body in with self._lock:. That would deadlock: self._lock is a non-reentrant threading.Lock, and the loop body calls self._enqueue(...) whenever new messages arrive, which itself acquires self._lock. Holding the lock around just the dict accesses gives the consistency the report asked for without introducing a deadlock or a long critical section that would block poll_events / wait_for_event callers during DB reads.
tests/test_mcp_serve.py: add test_poll_once_holds_lock_around_last_poll_timestamps. It swaps in a wrapper around bridge._lock that tracks held depth and a dict subclass that records every read/write attempted while the lock is not held, then runs _poll_once against an in-memory test SQLite DB and asserts the unlocked-access list is empty.

How to test

pytest tests/test_mcp_serve.py -q — 86 tests pass, including the new regression test.
pytest tests/test_mcp_serve.py tests/tools/test_mcp_stability.py -q — 102 tests pass; no other MCP suites regressed.

What platforms tested on

macOS on darwin-arm64 (local)

Fixes #23096

Changed files

mcp_serve.py (modified, +4/-2)
tests/test_mcp_serve.py (modified, +95/-0)

PR #23219: Fix/mcp poll timestamps lock

Repository: NousResearch/hermes-agent
Author: xxxigm
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/23219

Description (problem / solution / changelog)

What does this PR do?

Fixes a real data race in EventBridge._poll_once (mcp_serve.py): it was reading and writing _last_poll_timestamps from the background poll thread without acquiring self._lock, while every other bridge method (_enqueue, poll_events, wait_for_event, list_pending_approvals, respond_to_approval) consistently holds the lock around its access to shared state.

The race produces stale/torn reads of the per-session timestamp on the GIL (the dict access is not atomic with respect to foreground iteration once you account for the upcoming free-threaded CPython builds, and even on the GIL the hand-off can lose updates).

The PR is split into four commits in test-first order so each step is independently reviewable:

test(mcp): add regression tests for EventBridge._poll_once lock invariant — pins the contract by replacing _last_poll_timestamps with a wrapper that fails fast when self._lock is not held by the current thread on access. Three scenarios: single poll cycle, repeated polls, and concurrent _poll_once calls hammered alongside foreground bridge API calls. All three fail on main.
fix(mcp): protect _last_poll_timestamps with self._lock in _poll_once — wraps the two touch points (read at the top of the per-session loop, write at the end) in with self._lock:. The lock is intentionally released across the blocking db.get_messages call and the in-Python filtering loop so foreground methods are not starved. The regression tests from commit 1 now pass.
refactor(mcp): factor lock-safe helpers for _last_poll_timestamps — replaces the inline with self._lock: ... blocks with two narrow helpers, _read_last_poll_timestamp and _record_last_poll_timestamp, so the lock discipline is not something each future caller has to remember. No behaviour change.
docs(mcp): document EventBridge lock invariant in class docstring — spells out which EventBridge attributes are shared between the background poll thread and foreground callers, and which are poll-thread-private. Adds a pointed comment on _last_poll_timestamps itself reminding contributors to go through the _read_/_record_ helpers rather than touching the dict directly. This is the readme that would have prevented the original bug.

Related Issue

Fixes #23096

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
🔒 Security fix
📝 Documentation update
✅ Tests (adding or improving test coverage)
♻️ Refactor (no behavior change)
🎯 New skill (bundled or hub)

Changes Made

mcp_serve.py (+62, −6):
- Added _read_last_poll_timestamp(session_key) and _record_last_poll_timestamp(session_key, ts) helpers — both acquire self._lock for the single dict op.
- Rewrote _poll_once to go through the helpers; the rest of the loop body still runs outside the lock so blocking I/O on db.get_messages cannot starve foreground bridge methods.
- Expanded the EventBridge class docstring to document the lock invariant: which attributes are shared, which are poll-thread-private, and that any access to _last_poll_timestamps MUST go through the new helpers.
tests/test_mcp_serve.py (+177): three regression tests covering the read at line 386, the write at line 443, and concurrent foreground/background access.

No other files modified.

How to Test

Set up .venv: python3 -m venv .venv && source .venv/bin/activate && pip install -e ".[all,dev]"
Run the new MCP tests: scripts/run_tests.sh tests/test_mcp_serve.py

Expected: 88 passed (≈1.6s). 3. Confirm the regression cases fail on main before this PR: git stash && git checkout upstream/main -- mcp_serve.py scripts/run_tests.sh tests/test_mcp_serve.py -k "lock_invariant or poll_once" git checkout HEAD -- mcp_serve.py && git stash pop Expected: 3 failures with _last_poll_timestamps get happened without holding bridge._lock.

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits (test(mcp):, fix(mcp):, refactor(mcp):, docs(mcp):)
I searched for existing PRs to make sure this isn't a duplicate
My PR contains only changes related to this fix (4 commits, 2 files)
I've run scripts/run_tests.sh tests/test_mcp_serve.py and all tests pass
I've added tests for my changes (commit 1 — written test-first so the regression bites on main)
I've tested on my platform: macOS 15.6 (Darwin 24.6.0), Python 3.12

Documentation & Housekeeping

I've updated relevant documentation (README, docs/, docstrings) — class docstring + inline _last_poll_timestamps comment in mcp_serve.py (commit 4)
I've updated cli-config.yaml.example if I added/changed config keys — N/A
I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
I've considered cross-platform impact (Windows, macOS) per the compatibility guide — threading.RLock is stdlib, no platform-specific calls; tests use only threading + unittest.mock
I've updated tool descriptions/schemas if I changed tool behavior — N/A (no public-facing tool changes; _poll_once remains internal)

Screenshots / Logs

============================== 88 passed in 1.65s ==============================

Changed files

mcp_serve.py (modified, +62/-6)
tests/test_mcp_serve.py (modified, +177/-0)

Body：

Bug Description

In mcp_serve.py, the _last_poll_timestamps dictionary is read and written in _poll_once() without holding self._lock, while the rest of the
class consistently uses with self._lock: to protect shared state.

Affected Lines

Line 386: last_seen = self._last_poll_timestamps.get(session_key, 0.0) — read without lock
Line 443: self._last_poll_timestamps[session_key] = latest — write without lock

Why This Is a Bug

_poll_loop runs in a background thread. Other methods (poll_events, wait_for_event, _enqueue) all acquire self._lock before accessing shared data. The missing lock on _last_poll_timestamps creates a data race that could cause stale reads or inconsistent updates under concurrent access.

Suggested Fix

Wrap the loop body in _poll_once() with with self._lock: to match the existing locking pattern used throughout the class.

Environment

Component: mcp_serve.py — _SessionPollBroker._poll_once()
Type: Thread safety / data race

hermes - ✅(Solved) Fix bug: _last_poll_timestamps read/write in _poll_once missing lock protection [2 pull requests, 1 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Fix Action

Fixed

PR fix notes

PR #23174: fix(tools): protect _last_poll_timestamps with self._lock in EventBridge

Description (problem / solution / changelog)

What changed and why

How to test

What platforms tested on

Changed files

PR #23219: Fix/mcp poll timestamps lock

Description (problem / solution / changelog)

What does this PR do?

Related Issue

Type of Change

Changes Made

How to Test

Code

Documentation & Housekeeping

Screenshots / Logs

Changed files

Bug Description

Affected Lines

Why This Is a Bug

Suggested Fix

Environment

Still need to ship something?

RELATED_DISCOVERY

TRENDING