hermes - Bug: Race condition in OpenAI client replacement can close active connections [2 pull requests]

_ensure_primary_openai_client has a TOCTOU (time-of-check to time-of-use) race condition. The lock is released between detecting a closed client and replacing it, so two threads can both decide to replace the client, and the second thread then closes the client the first thread just created.

Root Cause

In run_agent.py, lines 2615-2629:

def _ensure_primary_openai_client(self, *, reason: str) -> Any:
    with self._openai_client_lock():
        client = getattr(self, "client", None)
        if client is not None and not self._is_openai_client_closed(client):
            return client
    # ← Lock released here. Thread B can enter and also see closed client.

    # Both Thread A and Thread B reach here:
    if not self._replace_primary_openai_client(reason=f"recreate_closed:{reason}"):
        raise RuntimeError("Failed to recreate closed OpenAI client")
    with self._openai_client_lock():
        return self.client

Compounded by _replace_primary_openai_client (lines 2598-2613) closing the old client outside the lock:

def _replace_primary_openai_client(self, *, reason: str) -> bool:
    with self._openai_client_lock():
        old_client = getattr(self, "client", None)
        ...
        self.client = new_client
    # ← Lock released, then old client closed:
    self._close_openai_client(old_client, reason=f"replace:{reason}", shared=True)
    return True

Fix Action

Fixed

Race Scenario

  1. Thread A acquires lock, sees closed client, releases lock
  2. Thread B acquires lock, sees same closed client, releases lock
  3. Thread A calls _replace_primary_openai_client → creates client_A, sets self.client = client_A
  4. Thread B calls _replace_primary_openai_client → sees client_A as old_client, creates client_B, sets self.client = client_B
  5. Thread B closes client_A (the "old" client) — but Thread A may still have in-flight requests on it
  6. Thread A's requests fail with closed-client errors
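
The interleaving above can be reproduced deterministically with a toy model. This is a minimal sketch, not the project's code: FakeClient, RacyHolder, and run_race are hypothetical names, and a threading.Barrier stands in for the unlucky scheduling that lets both threads pass the check before either one replaces the client.

```python
import threading


class FakeClient:
    """Hypothetical stand-in for an OpenAI client; it only tracks open/closed."""

    def __init__(self, name):
        self.name = name
        self.closed = False

    def close(self):
        self.closed = True


class RacyHolder:
    """Mirrors the check-then-replace shape described above (illustrative only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.client = FakeClient("initial")
        self.client.close()  # start with a closed client, as in step 1
        self.history = []  # every replacement client, in creation order
        # Forces both threads past the check before either one replaces.
        self._both_checked = threading.Barrier(2)

    def _replace(self):
        with self._lock:
            old = self.client
            new = FakeClient(f"client_{len(self.history) + 1}")
            self.history.append(new)
            self.client = new
        old.close()  # closed after the lock is released

    def ensure(self):
        with self._lock:
            client = self.client
            if not client.closed:
                return client
        # Lock released: wait until the other thread has also seen the closed client.
        self._both_checked.wait(timeout=5)
        self._replace()
        with self._lock:
            return self.client


def run_race():
    holder = RacyHolder()
    threads = [threading.Thread(target=holder.ensure) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return holder.history


if __name__ == "__main__":
    first, second = run_race()
    print(first.name, "closed:", first.closed)
    print(second.name, "closed:", second.closed)
```

In this model two clients are created for a single recovery, and the first one ends up closed by the other thread, which corresponds to steps 3 to 5.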

Impact

Under concurrent load in gateway mode, API calls fail intermittently when the shared OpenAI client gets recycled.

Suggested Fix

Make the check-and-replace atomic:

def _ensure_primary_openai_client(self, *, reason: str) -> Any:
    with self._openai_client_lock():
        client = getattr(self, "client", None)
        if client is not None and not self._is_openai_client_closed(client):
            return client
        # Replace INSIDE the lock:
        old_client = client
        try:
            new_client = self._create_openai_client(self._client_kwargs, reason=reason, shared=True)
        except Exception as exc:
            # Chain the original error so the underlying cause is not lost.
            raise RuntimeError("Failed to recreate closed OpenAI client") from exc
        self.client = new_client
    # Close old client outside lock (safe now: no thread can re-select it)
    if old_client is not None:
        self._close_openai_client(old_client, reason=f"replace:{reason}", shared=True)
    return new_client
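
One way to sanity-check an atomic check-and-replace of this shape is to hit a toy version with several threads at once and confirm that only one replacement is created. The sketch below is an assumption-laden illustration rather than the project's test: FakeClient, AtomicHolder, and run_atomic are hypothetical names, and client creation is modeled as instantaneous.

```python
import threading


class FakeClient:
    """Hypothetical stand-in for an OpenAI client; it only tracks open/closed."""

    def __init__(self, name):
        self.name = name
        self.closed = False

    def close(self):
        self.closed = True


class AtomicHolder:
    """Check and replace happen under one lock acquisition (illustrative only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.client = FakeClient("initial")
        self.client.close()  # start with a closed client
        self.created = []  # every replacement client, in creation order

    def ensure(self):
        with self._lock:
            client = self.client
            if client is not None and not client.closed:
                return client
            # Still holding the lock: no other thread can observe the closed client.
            old = client
            new = FakeClient(f"client_{len(self.created) + 1}")
            self.created.append(new)
            self.client = new
        # The old client is already unreachable through self.client here.
        if old is not None:
            old.close()
        return new


def run_atomic(n_threads=8):
    holder = AtomicHolder()
    start = threading.Barrier(n_threads)  # release all workers together
    results = []
    results_lock = threading.Lock()

    def worker():
        start.wait(timeout=5)
        client = holder.ensure()
        with results_lock:
            results.append(client)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return holder, results


if __name__ == "__main__":
    holder, results = run_atomic()
    print("clients created:", len(holder.created))
    print("all threads share one client:", len({id(c) for c in results}) == 1)
```

The trade-off in this design is that client creation runs while the lock is held, so other callers wait for it. If creating the real client is slow, that wait may be worth measuring.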

Files

run_agent.py, lines 2598-2629
