hermes - [Bug]: gateway progress_callback has unguarded shared state (last_progress_msg / repeat_count) called from concurrent ThreadPoolExecutor workers

progress_callback in gateway/run.py (~line 14248) reads and writes two shared closure variables — last_progress_msg[0] and repeat_count[0] — without any synchronization. The callback is passed as tool_progress_callback to the agent, which calls it from concurrent ThreadPoolExecutor worker threads inside _execute_tool_calls_concurrent (run_agent.py:10951).

Root Cause

# gateway/run.py ~line 14223
last_progress_msg = [None]   # shared mutable state
repeat_count = [0]           # shared mutable state

def progress_callback(event_type, tool_name=None, preview=None, ...):
    ...
    # Called from multiple concurrent worker threads — NO lock
    if msg == last_progress_msg[0]:       # ← RACE: read
        repeat_count[0] += 1              # ← RACE: non-atomic read-modify-write
        progress_queue.put(("__dedup__", msg, repeat_count[0]))
        return
    last_progress_msg[0] = msg            # ← RACE: write
    repeat_count[0] = 0                   # ← RACE: write
    progress_queue.put(msg)
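The `repeat_count[0] += 1` line is flagged as a non-atomic read-modify-write because CPython runs it as a separate load, add, and store, and a thread switch can land between them. The minimal sketch below writes those steps out explicitly and uses a `threading.Barrier` (an illustration device, not part of the gateway code) to force the bad interleaving deterministically.

```python
import threading

repeat_count = [0]

# Barrier placed between the load and the store of "repeat_count[0] += 1",
# written out by hand so the lost-update interleaving happens every run.
_between_load_and_store = threading.Barrier(2)

def unsafe_increment():
    current = repeat_count[0]          # load: both threads read 0
    _between_load_and_store.wait()     # wait until both have loaded
    repeat_count[0] = current + 1      # store: both threads write 1

threads = [threading.Thread(target=unsafe_increment) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(repeat_count[0])  # 1, not 2: one increment was lost
```

Without the barrier the same outcome is possible but only intermittent, which matches the flaky (×N) counter described in the issue.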

Two workers executing concurrently can:

  1. Both read the same stale last_progress_msg[0] (e.g. None) before either writes → identical messages are both emitted as "new" and the duplicate escapes deduplication
  2. Interleave on repeat_count[0] += 1 → lost increments (the (×N) counter is wrong)
  3. Thread A sets last_progress_msg[0] = msg_A, then Thread B overwrites it with msg_B before A's next call checks it → A's repeat of msg_A is emitted as "new" instead of being collapsed
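Case 1 can be reproduced in isolation with a reduced model of the closure. This is a sketch, not the real gateway/run.py code: the callback takes only `msg`, the message text `"terminal: ls"` is made up, and the barrier exists only to force both workers to finish the equality check before either one writes.

```python
import queue
import threading

# Reduced stand-ins for the gateway closure state (names mirror the issue).
last_progress_msg = [None]
repeat_count = [0]
progress_queue = queue.Queue()

# Forces the worst case: both threads complete the read before either writes.
_after_read = threading.Barrier(2)

def racy_callback(msg):
    is_repeat = msg == last_progress_msg[0]   # read (both see None)
    _after_read.wait()
    if is_repeat:
        repeat_count[0] += 1
        progress_queue.put(("__dedup__", msg, repeat_count[0]))
        return
    last_progress_msg[0] = msg                # write (both threads get here)
    repeat_count[0] = 0
    progress_queue.put(msg)

workers = [threading.Thread(target=racy_callback, args=("terminal: ls",)) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()

emitted = []
while not progress_queue.empty():
    emitted.append(progress_queue.get())
print(emitted)  # ['terminal: ls', 'terminal: ls']: no "__dedup__" entry
```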

last_tool[0] (line 14306, used in "new" mode) has the same issue.
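The issue does not show the "new"-mode code, so the following is only a sketch under an assumption: that "new" mode announces a tool when its name differs from the previously announced one. `on_tool_started` and the `emitted` list are hypothetical stand-ins; the point is that the same check-then-write on `last_tool[0]` can be guarded by the `_dedup_lock` proposed in the Fix section.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

last_tool = [None]
_dedup_lock = threading.Lock()   # one lock for all dedup state keeps things simple
emitted = []                     # stand-in for progress_queue

def on_tool_started(tool_name):
    # Hypothetical "new"-mode rule: announce a tool only when it differs from
    # the previously announced one. Check and update happen under one lock.
    with _dedup_lock:
        if tool_name == last_tool[0]:
            return
        last_tool[0] = tool_name
        emitted.append(tool_name)

# 50 concurrent calls for the same tool: exactly one line is announced.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(on_tool_started, ["terminal"] * 50))
print(emitted)  # ['terminal']

# A different tool afterwards is announced again.
on_tool_started("read_file")
print(emitted)  # ['terminal', 'read_file']
```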

Fix Action

Add a threading.Lock() created alongside last_progress_msg and repeat_count, and hold it for the entire dedup check-and-update block:

last_progress_msg = [None]
repeat_count = [0]
_dedup_lock = threading.Lock()

def progress_callback(...):
    ...
    with _dedup_lock:
        if msg == last_progress_msg[0]:
            repeat_count[0] += 1
            progress_queue.put(("__dedup__", msg, repeat_count[0]))
            return
        last_progress_msg[0] = msg
        repeat_count[0] = 0
    progress_queue.put(msg)

The lock covers the entire read-check-write sequence, making the dedup atomic regardless of how many workers call it simultaneously.
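A self-contained way to exercise the locked dedup is to rebuild it inside a hypothetical factory (`make_progress_callback` is not a gateway function) and fire identical events from a `ThreadPoolExecutor`, as parallel tool calls would. One deliberate difference from the snippet in the issue: this sketch also enqueues the "new" message while holding the lock. With the put outside the lock, another worker could enqueue `("__dedup__", msg, 1)` before the original `msg` reaches the queue. Holding the lock across `put` assumes `progress_queue` is an unbounded `queue.Queue`, whose `put` never blocks.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

def make_progress_callback(progress_queue):
    """Hypothetical factory reproducing the locked dedup closure in isolation."""
    last_progress_msg = [None]
    repeat_count = [0]
    dedup_lock = threading.Lock()

    def progress_callback(msg):
        with dedup_lock:
            if msg == last_progress_msg[0]:
                repeat_count[0] += 1
                progress_queue.put(("__dedup__", msg, repeat_count[0]))
                return
            last_progress_msg[0] = msg
            repeat_count[0] = 0
            # Enqueued while still holding the lock, so queue order cannot
            # diverge from the order in which the dedup state was updated.
            progress_queue.put(msg)

    return progress_callback

q = queue.Queue()
callback = make_progress_callback(q)

# 32 identical progress events fired concurrently from a worker pool.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(lambda _: callback("terminal: ls"), range(32)))

items = [q.get_nowait() for _ in range(q.qsize())]
print(items[0])    # terminal: ls
print(items[-1])   # ('__dedup__', 'terminal: ls', 31)
```

However the workers interleave, the result is one "new" line followed by dedup counts 1 through 31 in order.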

Affected paths

  • gateway/run.py ~line 14223–14356 (progress_callback closure)
  • run_agent.py ~line 10951 (_execute_tool_calls_concurrent with ThreadPoolExecutor)

Steps to reproduce

Run a request that triggers parallel tool execution (e.g., parallel bash calls) and observe the (×N) counter showing incorrect values or duplicate "new" progress lines that should have been collapsed.

Environment

  • hermes-agent main branch
  • Any multi-tool parallel execution via gateway (Slack, Telegram, Discord, etc.)
  • tool_progress enabled in config
