hermes: Interrupted OpenAI/httpx request thread survives across turns and writes TLS record bytes to unrelated file descriptors on delayed close


Error Message

2026-05-19 21:37:46,334 ERROR [session-redacted] gateway.run: kanban dispatcher: board default database /opt/hermes-pepper/state/kanban.db is not a valid SQLite database; disabling dispatch for this board until the file changes or the gateway restarts. Move or restore the file, then run hermes kanban init if you need a fresh board.




Executive summary

A Telegram gateway session running Hermes Agent v0.14.0 appears to have crossed network/runtime bytes into persistent on-disk state after an interrupted openai-codex request. A request thread created during a turn was interrupted after ~3 seconds, but the original thread survived for ~6 more minutes across subsequent turns. When it finally logged request_complete close, kanban.db was modified within ~1 to 2 ms and became invalid SQLite. Forensic reconstruction shows a surgical 24-byte overwrite of SQLite header bytes 5..28, shaped exactly like one TLS 1.2 application-data record. This caused data loss during recovery: the previous kanban DB still contained two tasks after local header repair, while the reinitialized live DB contained one smoke task. Severity is high because the suspected bug crossed socket/runtime teardown into unrelated persistent state, not merely a leaked thread, leaked socket, or CLOSE-WAIT cleanup issue.

Environment

Hermes Agent v0.14.0 (2026.5.16)
Project: /opt/hermes-pepper/state/hermes-agent
Python: 3.11.15
OpenAI SDK: 2.24.0
Provider: openai-codex
Base URL: https://chatgpt.com/backend-api/codex
Model: gpt-5.5
Platform: Telegram gateway, long-running systemd service
Persistent state affected: kanban SQLite DB

Reproduction conditions observed

This was observed in production, not yet reduced to a deterministic test case.

Conditions present:

  • long-running Hermes gateway process
  • Telegram conversation with multiple turns in one session
  • openai-codex provider through chatgpt.com/backend-api/codex
  • request interrupted mid-call
  • replacement turn started and completed while original request thread remained alive
  • delayed original request thread close occurred ~6 minutes after interruption
  • kanban dispatcher maintained SQLite connections in the same gateway process

The suspicious lifecycle, as reconstructed in the forensic report (listed in chronological order):

21:30:51.962 Thread-1616 OpenAI client created
21:30:54.947 interrupt_abort close logged from asyncio_0
21:30:55.003 turn ended as interrupted_during_api_call
21:32:16.733 replacement turn completed normally
21:37:39.791 corrupt DB mtime
21:37:39.793 original Thread-1616 request_complete close logged
21:37:46.334 dispatcher detects invalid SQLite

Concrete evidence

File mtime and file type

/opt/hermes-pepper/state/kanban.db|106496 bytes|hermes-pepper:hermes-pepper|600|2026-05-20 12:36:35.988369068 -0500
/opt/hermes-pepper/state/kanban.db.corrupt-20260520-123532|106496 bytes|hermes-pepper:hermes-pepper|644|2026-05-19 21:37:39.791967076 -0500
/opt/hermes-pepper/state/kanban.db:                         SQLite 3.x database, last written using SQLite version 3045001, writer version 2, read version 2, file counter 2, database pages 26, cookie 0x13, schema 4, UTF-8, version-valid-for 2
/opt/hermes-pepper/state/kanban.db.corrupt-20260520-123532: data

Corrupt header byte pattern

00000000  53 51 4c 69 74 17 03 03  00 13 3c 33 da e9 65 21  |SQLit.....<3..e!|
00000010  0c 3f b9 e4 e4 31 60 36  f3 90 0d d5 78 00 00 1a  |.?...1`6....x...|
00000020  00 00 00 00 00 00 00 00  00 00 00 13 00 00 00 04  |................|
00000030  00 00 00 00 00 00 00 00  00 00 00 01 00 00 00 00  |................|
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 02  |................|

Healthy header comparison:

00000000  53 51 4c 69 74 65 20 66  6f 72 6d 61 74 20 33 00  |SQLite format 3.|
00000010  10 00 02 02 00 40 20 20  00 00 00 02 00 00 00 1a  |.....@  ........|
00000020  00 00 00 00 00 00 00 00  00 00 00 13 00 00 00 04  |................|
00000030  00 00 00 00 00 00 00 00  00 00 00 01 00 00 00 00  |................|
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

Forensic interpretation:

- Expected SQLite header bytes `0x00..0x0f`: `SQLite format 3\0`.
- Corrupt bytes `0x00..0x04`: `SQLit`, still correct.
- Corrupt bytes `0x05..0x1c`: overwritten by 24 bytes.
- Bytes at `0x05..0x09`: `17 03 03 00 13`.
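- `0x17` is the TLS application-data content type, `0x0303` is the TLS 1.2 record version, and `0x0013` is a 19-byte record payload length.
- TLS 1.3 puts the same `17 03 03` prefix on every encrypted record (`0x0303` is its legacy record version), and 19 bytes is also the size of a TLS 1.3 encrypted alert with a 16-byte AEAD tag (2-byte alert + 1-byte inner content type + 16-byte tag). The record may therefore be a `close_notify` sent at connection teardown rather than TLS 1.2 application data, which would be consistent with the write coinciding with the client close.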
- Total TLS-shaped record: 5-byte TLS header + 19-byte encrypted payload = 24 bytes.
- Bytes from `0x1d` onward resume matching SQLite header/page structure.
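
The interpretation above can be checked mechanically. The following is a minimal sketch, not the script used in the forensic report: it takes the first 32 bytes from the hexdump, counts how many leading bytes still match the SQLite magic string, and decodes the 5-byte TLS record header that follows. The helper names `intact_prefix_len` and `parse_tls_record_header` are illustrative.

```python
import struct

SQLITE_MAGIC = b"SQLite format 3\x00"

# First 32 bytes of the corrupt file, copied from the hexdump above.
corrupt_first32 = bytes.fromhex(
    "53 51 4c 69 74 17 03 03 00 13 3c 33 da e9 65 21 "
    "0c 3f b9 e4 e4 31 60 36 f3 90 0d d5 78 00 00 1a"
)


def intact_prefix_len(header: bytes) -> int:
    """Count the leading bytes that still match the SQLite magic string."""
    count = 0
    for got, want in zip(header, SQLITE_MAGIC):
        if got != want:
            break
        count += 1
    return count


def parse_tls_record_header(buf: bytes, offset: int):
    """Decode a TLS record header: content type (1), version (2), length (2)."""
    return struct.unpack_from(">BHH", buf, offset)


prefix = intact_prefix_len(corrupt_first32)
content_type, version, length = parse_tls_record_header(corrupt_first32, prefix)
record_end = prefix + 5 + length  # exclusive end offset of the TLS-shaped record

print(prefix, hex(content_type), hex(version), length, record_end)
```

For these bytes the intact prefix is 5, the record header decodes to type `0x17`, version `0x0303`, length 19, and the record ends at offset 29 (`0x1d`), which agrees with the `(5, 19, 29, True)` tuple in the read-only analysis output.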

Programmatic read-only analysis:

size 106496
first_80_hex 53 51 4c 69 74 17 03 03 00 13 3c 33 da e9 65 21 0c 3f b9 e4 e4 31 60 36 f3 90 0d d5 78 00 00 1a 00 00 00 00 00 00 00 00 00 00 00 13 00 00 00 04 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
tls_17_03_03_count 1
tls_17_03_03_offsets_first40 [5]
sqlite_header_prefix_len 5
plausible_tls_records_scanned 1
tls_records_first20 [(5, 19, 29, True)]
plausible_tls_covered_bytes 24

Local reconstruction proof

corrupt_size 106496 sha256 2e50004191a896ca52e7bae5dc6b88588233bb93ac3479006c585203abcd8f04
current_size 106496 sha256 9dba0ad2af3f54eb1b663d128ada7b4e9af914b7dda33f3aeb115e400af47137
corrupt_first32 53 51 4c 69 74 17 03 03 00 13 3c 33 da e9 65 21 0c 3f b9 e4 e4 31 60 36 f3 90 0d d5 78 00 00 1a
current_first32 53 51 4c 69 74 65 20 66 6f 72 6d 61 74 20 33 00 10 00 02 02 00 40 20 20 00 00 00 02 00 00 00 1a
corrupt DatabaseError file is not a database
patched_5_28_from_current integrity ok tables ['kanban_notify_subs', 'sqlite_sequence', 'task_comments', 'task_events', 'task_links', 'task_runs', 'tasks'] tasks_count 2
current integrity ok tables ['kanban_notify_subs', 'sqlite_sequence', 'task_comments', 'task_events', 'task_links', 'task_runs', 'tasks'] tasks_count 1
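
The reconstruction step can be reproduced on a throwaway database. This is a sketch under stated assumptions, not the procedure run on the real `kanban.db`: it builds a small SQLite file with a hypothetical `tasks` table, overwrites bytes 5..28 of a copy with the TLS-shaped bytes from the report, and then restores those 24 bytes from the healthy file.

```python
import os
import shutil
import sqlite3
import tempfile

# The 24 bytes found at offsets 5..28 of the corrupt file in the report.
TLS_SHAPED = bytes.fromhex(
    "17 03 03 00 13 3c 33 da e9 65 21 0c 3f b9 e4 e4 "
    "31 60 36 f3 90 0d d5 78"
)


def integrity(path: str) -> str:
    """Return the integrity_check verdict, or the error text if SQLite rejects the file."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute("PRAGMA integrity_check").fetchone()[0]
    except sqlite3.DatabaseError as exc:
        return f"DatabaseError {exc}"
    finally:
        conn.close()


def overwrite(path: str, offset: int, data: bytes) -> None:
    """Overwrite len(data) bytes in place, leaving the rest of the file untouched."""
    with open(path, "r+b") as fh:
        fh.seek(offset)
        fh.write(data)


workdir = tempfile.mkdtemp()
healthy = os.path.join(workdir, "healthy.db")
damaged = os.path.join(workdir, "damaged.db")

# Hypothetical schema: only the table name is borrowed from the report.
conn = sqlite3.connect(healthy)
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO tasks (title) VALUES (?)", [("a",), ("b",)])
conn.commit()
conn.close()

# Simulate the damage on a copy, then confirm SQLite rejects it.
shutil.copyfile(healthy, damaged)
overwrite(damaged, 5, TLS_SHAPED)
before = integrity(damaged)

# Restore header bytes 5..28 from the healthy file and re-check.
with open(healthy, "rb") as fh:
    fh.seek(5)
    donor = fh.read(24)
overwrite(damaged, 5, donor)
after = integrity(damaged)

conn = sqlite3.connect(damaged)
tasks_count = conn.execute("SELECT COUNT(*) FROM tasks").fetchone()[0]
conn.close()
print(before, "|", after, "|", tasks_count)
```

In the report the donor bytes came from the reinitialized database rather than from an untouched copy. Offsets 24..27 of the SQLite header hold the file change counter, so a header patched that way can carry a different counter than the original file did. That is acceptable for read-only recovery, but the patch should be applied to a copy, never to the preserved corrupt file.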

Timeline log excerpts

2026-05-19 21:30:34,249 INFO run_agent: OpenAI client created (codex_stream_request, shared=False) thread=Thread-1614 (_call):131772180313792 provider=openai-codex base_url=https://chatgpt.com/backend-api/codex model=gpt-5.5
2026-05-19 21:30:51,404 INFO run_agent: OpenAI client closed (request_complete, shared=False, tcp_force_closed=0) thread=Thread-1614 (_call):131772180313792 provider=openai-codex base_url=https://chatgpt.com/backend-api/codex model=gpt-5.5
2026-05-19 21:30:51,962 INFO run_agent: OpenAI client created (codex_stream_request, shared=False) thread=Thread-1616 (_call):131772180313792 provider=openai-codex base_url=https://chatgpt.com/backend-api/codex model=gpt-5.5
2026-05-19 21:30:54,947 INFO [session-redacted] run_agent: OpenAI client closed (interrupt_abort, shared=False, tcp_force_closed=0) thread=asyncio_0:131772900132544 provider=openai-codex base_url=https://chatgpt.com/backend-api/codex model=gpt-5.5
2026-05-19 21:30:55,003 INFO [session-redacted] agent.conversation_loop: Turn ended: reason=interrupted_during_api_call model=gpt-5.5 api_calls=7/90 budget=7/90 tool_turns=48 last_msg_role=tool response_len=65 session=[session-redacted]
2026-05-19 21:32:16,733 INFO [session-redacted] agent.conversation_loop: Turn ended: reason=text_response(finish_reason=stop) model=gpt-5.5 api_calls=8/90 budget=8/90 tool_turns=55 last_msg_role=assistant response_len=2391 session=[session-redacted]
2026-05-19 21:37:39,793 INFO run_agent: OpenAI client closed (request_complete, shared=False, tcp_force_closed=0) thread=Thread-1616 (_call):131772180313792 provider=openai-codex base_url=https://chatgpt.com/backend-api/codex model=gpt-5.5
2026-05-19 21:37:46,334 ERROR [session-redacted] gateway.run: kanban dispatcher: board default database /opt/hermes-pepper/state/kanban.db is not a valid SQLite database; disabling dispatch for this board until the file changes or the gateway restarts. Move or restore the file, then run `hermes kanban init` if you need a fresh board.

Key timing:

- Corrupt file mtime: `21:37:39.791967076`.
- Delayed Thread-1616 client close: `21:37:39,793`.
- Dispatcher detection: `21:37:46,334`.
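
The gaps between these three timestamps can be computed exactly with integer nanoseconds. The sketch below assumes the file mtime and the log timestamps come from the same host clock; the helper name `to_ns` is illustrative.

```python
def to_ns(hms: str) -> int:
    """Convert 'HH:MM:SS.fraction' (or ',fraction') to nanoseconds since midnight."""
    clock, _, frac = hms.replace(",", ".").partition(".")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    nanos = int(frac.ljust(9, "0")[:9]) if frac else 0
    return ((hours * 60 + minutes) * 60 + seconds) * 1_000_000_000 + nanos


mtime = to_ns("21:37:39.791967076")   # corrupt file mtime
close_logged = to_ns("21:37:39,793")  # delayed Thread-1616 client close
detected = to_ns("21:37:46,334")      # dispatcher detection

close_gap_ms = (close_logged - mtime) / 1e6
detect_gap_s = (detected - mtime) / 1e9
print(f"close logged {close_gap_ms:.3f} ms after mtime")
print(f"dispatcher detected {detect_gap_s:.3f} s after mtime")
```

The log timestamps have millisecond resolution, so the true gap between the file write and the close log line lies between roughly 1.03 ms and 2.03 ms. The write also precedes the log line, which fits the excerpted close path where `logger.info` runs only after `client.close()` returns.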

Code locations involved

According to the forensic report, the relevant close path is:

def _close_openai_client(self, client: Any, *, reason: str, shared: bool) -> None:
    if client is None:
        return
    # Force-close TCP sockets first to prevent CLOSE-WAIT accumulation,
    # then do the graceful SDK-level close.
    force_closed = self._force_close_tcp_sockets(client)
    try:
        client.close()
        logger.info(
            "OpenAI client closed (%s, shared=%s, tcp_force_closed=%d) %s",
            reason,
            shared,
            force_closed,
            self._client_log_context(),
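        )
    # Excerpt ends here: the report does not include the except/finally clause that closes this try block.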

And:

def force_close_tcp_sockets(client: Any) -> int:
    ...
    sock.shutdown(_socket.SHUT_RDWR)
    sock.close()

Local code search places the affected paths at:

  • run_agent.py::_close_openai_client around line 2431
  • agent/agent_runtime_helpers.py::force_close_tcp_sockets around line 2075
  • agent/chat_completion_helpers.py interrupt paths around lines 221 to 223 and 1993 to 1995

The forensic report also identifies these instrumentation points:

run_agent.py::_close_openai_client
agent.agent_runtime_helpers::force_close_tcp_sockets
agent.chat_completion_helpers interrupt_abort paths
gateway.run kanban dispatcher tick

Severity argument

This should be treated as high severity even without a deterministic repro yet:

  • the impact crossed network/runtime bytes into persistent application state
  • a valid SQLite database became unreadable
  • re-init recovery lost the previous board state: the old DB held two tasks after header repair, while the reinitialized DB holds one smoke task
  • the damaged bytes match a TLS record, not a SQLite write pattern
  • the original interrupted thread survived across turns, so the issue can outlive the user-visible interruption path
  • D-021 isolates each Hermes role as its own runtime, so each future Hermes instance with gateway + kanban + OpenAI/httpx interruption paths could independently carry this risk

Suggested mitigations

From the forensic report:

  1. Strict join/kill path on interrupted requests.
    • Do not allow an interrupted request worker to survive for six minutes after the replacement turn completes.
    • Consider a stricter join/kill path or a transport cancellation primitive for Codex/OpenAI requests.
  2. Transport cancellation primitive.
    • Make the interrupt path cancel the active httpx/OpenAI transport deterministically rather than relying on a delayed SDK close.
  3. Header-validation guard on opening the kanban DB.
    • Before dispatcher ticks or any kanban write path, verify that the file starts with the 16-byte SQLite magic string `SQLite format 3\0`.
    • If the header is invalid, stop writes, preserve the file, alert the user, and do not auto-init over it.
  4. Writer-attribution monitoring.
    • Add optional debug instrumentation or operator guidance to capture PID, fd, syscall, offset, and byte count on future DB writes.
    • Around interrupt_abort and delayed request_complete, log the request thread id, socket fd numbers, pool connection fds, and a redacted /proc/$PID/fd snapshot.
  5. Isolate kanban DB access from long-lived gateway network clients.
    • A small kanban sidecar process or CLI-only write boundary would reduce the blast radius of LLM/httpx/socket lifecycle bugs in the gateway.
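
Mitigation 3 is small enough to sketch. The following is a minimal illustration, not Hermes code: `has_sqlite_magic` and `guard_before_write` are hypothetical names, and how the dispatcher would call them is an assumption.

```python
import os

SQLITE_MAGIC = b"SQLite format 3\x00"


def has_sqlite_magic(path: str) -> bool:
    """True if the file begins with the 16-byte SQLite magic string."""
    with open(path, "rb") as fh:
        return fh.read(len(SQLITE_MAGIC)) == SQLITE_MAGIC


def guard_before_write(path: str) -> None:
    """Raise instead of writing (or re-initializing) when the header is damaged."""
    if os.path.getsize(path) == 0:
        # SQLite treats a zero-length file as a fresh, empty database.
        return
    if not has_sqlite_magic(path):
        raise RuntimeError(
            f"{path}: SQLite header is damaged; preserving file and disabling writes"
        )
```

A caller would run `guard_before_write(db_path)` before each dispatcher tick or write and treat the exception as a signal to alert the operator. The check reads only 16 bytes, so it is cheap, but it detects only damage to the magic string. Corruption elsewhere in the file still needs `PRAGMA integrity_check`.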

Requested maintainer help

Please advise whether Hermes currently has a known race or fd-aliasing risk in interrupted OpenAI/httpx teardown, especially around force-closing httpcore sockets. A reduced repro would likely need to simulate a long-running streaming or codex request interrupted mid-call, a replacement turn completing while the original worker survives, delayed close of that original worker, and another persistent file descriptor open in the same process.
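
For readers unfamiliar with the fd-aliasing hazard class mentioned above, the sketch below shows it in isolation. It is an illustration only and makes no claim about what Hermes, httpx, or httpcore actually do: a pipe end stands in for a socket, and the stale write lands at offset 0, so it does not explain the offset-5 write seen in the incident.

```python
import os
import tempfile

# A persistent file that some other subsystem will open later.
path = os.path.join(tempfile.mkdtemp(), "state.db")
with open(path, "wb") as fh:
    fh.write(b"SQLite format 3\x00")

# A worker remembers the descriptor number of its "socket" (a pipe end here).
read_end, stale_fd = os.pipe()
os.close(stale_fd)  # another code path closes it; the worker is not told

# POSIX hands out the lowest free descriptor number, which in this
# single-threaded sketch is the number that was just closed.
file_fd = os.open(path, os.O_RDWR)
aliased = file_fd == stale_fd

# The worker finally writes its closing record to the number it remembered.
if aliased:
    os.write(stale_fd, bytes.fromhex("17 03 03 00 13"))

os.close(file_fd)
os.close(read_end)
with open(path, "rb") as fh:
    clobbered = fh.read(16)
print(aliased, clobbered)
```

When the number is reused, the first five bytes of the file are replaced by the TLS-shaped header while the rest of the magic string survives. A reduced repro for the real issue would need to show the same effect through the actual interrupt and delayed-close paths.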

This draft can be turned into a GitHub issue after Tyler review.


Filed by Pepper, AI Build Lab's Chief of Staff agent. Forensic investigation by Codex (OpenAI Codex CLI agent). Cross-thread coordination by Claude Code (orchestrator agent). Original draft and supporting artifacts are in 8Dvibes/ai-build-lab-founders-lounge at commits 7dfdc4e0b (forensic) and 43dfe6c26 (draft). Happy to file additional repro evidence on request.
