hermes - Fix Anthropic streaming: stale/retry paths call _replace_primary_openai_client, causing 15-min hang on stuck streams


Summary

When `api_mode == "anthropic_messages"`, three streaming-cleanup paths in `agent/chat_completion_helpers.py` unconditionally rebuild the OpenAI primary client via `agent._replace_primary_openai_client(...)`. For Anthropic-native users this is wrong on two counts:

  1. The OpenAI rebuild fails with `Missing credentials. Please pass an api_key, ... or set the OPENAI_API_KEY environment variable.` because `OPENAI_API_KEY` is unset on Anthropic-only configurations.
  2. The in-flight Anthropic httpx stream is never closed, so the worker thread iterating `messages.stream(...)` keeps blocking on the dead socket until the 900s httpx read-timeout fires.

User-visible symptom (running Telegram → claude-opus-4-7): the agent appears to hang for ~15 minutes before any retry or fallback engages.

Affected lines (current agent/chat_completion_helpers.py)

  • ~L1775 — reason="stream_mid_tool_retry_pool_cleanup"
  • ~L1833 — reason="stream_retry_pool_cleanup"
  • ~L1977 — reason="stale_stream_pool_cleanup"

All three sites call the OpenAI cleanup regardless of `agent.api_mode`. The `_interrupt_requested` branch in the same function already handles Anthropic correctly: it calls `agent._anthropic_client.close()` followed by `agent._rebuild_anthropic_client()`. The three cleanup sites only need to mirror that pattern.

Evidence from a real `errors.log` (timestamps redacted)

```
WARNING run_agent: Stream stale for 180s (threshold 180s) — no chunks received. model=claude-opus-4-7 context=~13,191 tokens. Killing connection.
WARNING run_agent: Failed to rebuild shared OpenAI client (stale_stream_pool_cleanup) thread=asyncio_1:... provider=anthropic base_url=https://api.anthropic.com model=claude-opus-4-7 error=Missing credentials. Please pass an `api_key`, ... or set the `OPENAI_API_KEY` or `OPENAI_ADMIN_KEY` environment variable.
```

And the eventual unblock (only when the 900s httpx read-timeout finally fires):

```
WARNING run_agent: Stream drop on attempt 2/3 — retrying. ... provider=anthropic base_url=https://api.anthropic.com error_type=ReadTimeout error=The read operation timed out chain=ReadTimeout(...) <- TimeoutError(...) http_status=200 bytes=0 chunks=0 elapsed=930.56s ttfb=- upstream=[cf-ray=... cf-cache-status=DYNAMIC server=cloudflare]
```

`bytes=0 chunks=0 elapsed=930.56s` is the smoking gun — the connection was held open for ~15 minutes with zero data flow because nothing was closing it.
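To check whether your own `errors.log` shows the same signature, a small scan along these lines may help. It is only a sketch: it assumes the `key=value` layout of the `Stream drop` line quoted above, and the function names and the 600-second threshold are illustrative choices, not part of hermes.

```python
import re

# Matches key=value tokens such as "bytes=0" or "elapsed=930.56s".
_FIELD_RE = re.compile(r"(\w+)=(\S+)")


def parse_fields(line: str) -> dict[str, str]:
    """Return the key=value tokens found in one log line (last value wins)."""
    return dict(_FIELD_RE.findall(line))


def looks_like_stuck_stream(line: str, min_elapsed_s: float = 600.0) -> bool:
    """True when a line reports zero bytes and zero chunks after a long wait.

    The 600 s default is an arbitrary illustrative threshold.
    """
    fields = parse_fields(line)
    if fields.get("bytes") != "0" or fields.get("chunks") != "0":
        return False
    try:
        # "930.56s" -> 930.56; a missing or malformed value is treated as no match.
        elapsed = float(fields.get("elapsed", "").rstrip("s"))
    except ValueError:
        return False
    return elapsed >= min_elapsed_s


# Abbreviated copy of the log line quoted above.
sample = (
    "WARNING run_agent: Stream drop on attempt 2/3 ... "
    "provider=anthropic error_type=ReadTimeout http_status=200 "
    "bytes=0 chunks=0 elapsed=930.56s ttfb=-"
)
print(looks_like_stuck_stream(sample))
```

Running it over each line of the log and printing the matches should surface any request that sat idle until the read timeout.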

Regression test currently encodes the bug

`tests/run_agent/test_streaming.py::test_anthropic_stream_parser_valueerror_retries_before_delivery` (and possibly siblings) currently asserts `mock_replace.call_count == 1` for the Anthropic path — i.e. the test passes precisely because the buggy OpenAI rebuild is invoked. This means the upstream test suite is green while the bug is live in production. Worth re-pointing this test at the Anthropic close+rebuild path as part of the fix.
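A re-pointed test would assert the opposite of what the current one does: on the Anthropic path the Anthropic client is closed and rebuilt, and the OpenAI rebuild is never called. The sketch below is not the hermes test. It uses a `MagicMock` stand-in for the agent and a hypothetical `cleanup_after_stream_failure` function in place of the real cleanup site, and `"some_other_mode"` is a placeholder for whatever non-Anthropic `api_mode` values exist. Only the shape of the assertions is meant to carry over.

```python
from unittest.mock import MagicMock


def cleanup_after_stream_failure(agent, reason: str) -> None:
    """Hypothetical stand-in for one cleanup site with the api_mode branch applied."""
    if agent.api_mode == "anthropic_messages":
        agent._anthropic_client.close()
        agent._rebuild_anthropic_client()
    else:
        agent._replace_primary_openai_client(reason=reason)


def test_anthropic_cleanup_skips_openai_rebuild():
    agent = MagicMock()
    agent.api_mode = "anthropic_messages"
    # Keep a handle on the client mock, since a real rebuild would swap the attribute.
    client = agent._anthropic_client

    cleanup_after_stream_failure(agent, "stream_retry_pool_cleanup")

    client.close.assert_called_once_with()
    agent._rebuild_anthropic_client.assert_called_once_with()
    agent._replace_primary_openai_client.assert_not_called()


def test_other_modes_still_rebuild_openai_client():
    agent = MagicMock()
    agent.api_mode = "some_other_mode"  # placeholder, not a real hermes mode name

    cleanup_after_stream_failure(agent, "stream_retry_pool_cleanup")

    agent._replace_primary_openai_client.assert_called_once_with(
        reason="stream_retry_pool_cleanup"
    )
    agent._anthropic_client.close.assert_not_called()
    agent._rebuild_anthropic_client.assert_not_called()
```

With assertions like these, the suite would fail if the Anthropic path ever went back to calling the OpenAI rebuild.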

Proposed fix (drop-in)

At each of the three sites in `agent/chat_completion_helpers.py`, branch on `api_mode`:

```python
if agent.api_mode == "anthropic_messages":
    try:
        agent._anthropic_client.close()
    except Exception:
        pass
    try:
        agent._rebuild_anthropic_client()
    except Exception:
        pass
else:
    agent._replace_primary_openai_client(
        reason="stale_stream_pool_cleanup"  # or stream_retry_pool_cleanup / stream_mid_tool_retry_pool_cleanup
    )
```

This mirrors the existing `_interrupt_requested` branch verbatim.

Local verification

Applied the fix at all three sites + corrected the regression test; full streaming-related test suite (72 tests across `test_anthropic_error_handling`, `test_interrupt_propagation`, `test_openai_client_lifecycle`, `test_streaming`, `test_stream_drop_logging`, `test_stream_interrupt_retry`) passes. Telegram + `claude-opus-4-7` agent no longer hangs after the fix.

Happy to open a PR if useful — let me know.
