hermes - ✅ (Solved) Fix [Bug]: Codex Responses commentary-phase tool planning leaks as visible Telegram text [1 pull request, 1 comment, 2 participants]

GitHub stats
NousResearch/hermes-agent#24933 (fetched 2026-05-14 03:50:29)
Comments: 1 · Participants: 2 · Timeline: 8 · Reactions: 0
Timeline (top): labeled ×6, commented ×1, cross-referenced ×1

PR fix notes

PR #25268: fix(agent): hide codex commentary messages

Description (problem / solution / changelog)

What does this PR do?

Suppresses user-invisible Codex Responses commentary/analysis text so it does not leak into gateway-visible interim assistant messages.

The bug shows up as short internal-looking chat bubbles such as “Need inspect files” or “Use tool X” before the actual tool call runs. Those phase=commentary/analysis message items are provider state for replay/debugging, not final assistant content.

This PR keeps that metadata in codex_message_items for replay, but prevents it from being promoted into visible assistant content or streamed gateway deltas.

Related existing PR: #21568 addresses the same family of leak, but this branch is rebased on current main and also covers streaming delta suppression / create-stream fallback behavior.

Related Issue

Fixes #

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • agent/codex_responses_adapter.py
    • Does not promote phase=commentary or phase=analysis Codex message text into normalized assistant content.
    • Avoids falling back to response.output_text when hidden message items were already present.
  • run_agent.py
    • Tracks hidden commentary/analysis stream items by item id and output index.
    • Suppresses matching response.output_text.delta events before gateway callbacks can deliver them.
    • Applies the same suppression to create-stream fallback reconstruction.
  • tests/run_agent/test_run_agent_codex_responses.py
    • Adds regressions for commentary message normalization, streaming delta suppression, and create-stream fallback behavior.
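
The run_agent.py bullets above describe tracking hidden items by item id and output index. The following is a minimal sketch of that idea, not the PR's actual code. The names HiddenItemTracker and visible_deltas are hypothetical, and the event attributes (item, item_id, output_index, delta) are assumed to follow the streamed event shapes named in the issue. Keying on ids rather than a single "current item" flag means interleaved deltas from a hidden item and a visible item are still separated correctly.

```python
from types import SimpleNamespace

HIDDEN_PHASES = {"commentary", "analysis"}


class HiddenItemTracker:
    """Hypothetical helper: remembers which streamed message items are hidden."""

    def __init__(self):
        self.hidden_item_ids = set()
        self.hidden_output_indexes = set()

    def observe_added(self, event):
        """Record a response.output_item.added event for a hidden-phase message."""
        item = getattr(event, "item", None)
        phase = getattr(item, "phase", None)
        if getattr(item, "type", None) != "message" or not isinstance(phase, str):
            return
        if phase.strip().lower() in HIDDEN_PHASES:
            item_id = getattr(item, "id", None)
            if item_id is not None:
                self.hidden_item_ids.add(item_id)
            output_index = getattr(event, "output_index", None)
            if output_index is not None:
                self.hidden_output_indexes.add(output_index)

    def is_hidden_delta(self, event):
        """True if a response.output_text.delta belongs to a hidden item."""
        return (
            getattr(event, "item_id", None) in self.hidden_item_ids
            or getattr(event, "output_index", None) in self.hidden_output_indexes
        )


def visible_deltas(events):
    """Return only the text deltas that are safe to hand to a gateway callback."""
    tracker = HiddenItemTracker()
    visible = []
    for event in events:
        if event.type == "response.output_item.added":
            tracker.observe_added(event)
        elif event.type == "response.output_text.delta":
            if event.delta and not tracker.is_hidden_delta(event):
                visible.append(event.delta)
    return visible


events = [
    SimpleNamespace(type="response.output_item.added", output_index=0,
                    item=SimpleNamespace(type="message", id="msg_commentary", phase="commentary")),
    SimpleNamespace(type="response.output_item.added", output_index=1,
                    item=SimpleNamespace(type="message", id="msg_final", phase=None)),
    # Deltas for the two items arrive interleaved.
    SimpleNamespace(type="response.output_text.delta", item_id="msg_commentary",
                    output_index=0, delta="Need verify. "),
    SimpleNamespace(type="response.output_text.delta", item_id="msg_final",
                    output_index=1, delta="Done."),
    SimpleNamespace(type="response.output_text.delta", item_id="msg_commentary",
                    output_index=0, delta="Use terminal."),
]
print(visible_deltas(events))  # ['Done.']
```
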

How to Test

  1. python -m compileall -q agent/codex_responses_adapter.py run_agent.py tests/run_agent/test_run_agent_codex_responses.py
  2. python -m pytest tests/run_agent/test_run_agent_codex_responses.py -q -o 'addopts='
  3. ruff check agent/codex_responses_adapter.py run_agent.py tests/run_agent/test_run_agent_codex_responses.py
  4. git diff --check origin/main...HEAD

Local result:

  • tests/run_agent/test_run_agent_codex_responses.py: 63 passed
  • ruff check: passed
  • compileall: passed
  • git diff --check: passed

CI note:

  • All non-test PR checks passed: ruff/ty, Windows footguns, attribution, Nix, Docker builds, supply chain audit, and e2e.
  • The full test job currently fails on this PR, but the same failure set is present on main's latest Tests workflow runs. The failures are unrelated to this Codex commentary change, e.g. missing botocore, missing faster_whisper, missing numpy, DingTalk mock errors, WeCom/Weixin OpenSSL/cffi errors, Matrix requirements, and test_switch_model_preserves_config_context_length.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS 15.7.4

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

For New Skills

N/A

Screenshots / Logs

Targeted local verification:

63 passed in 40.79s
All checks passed!

Changed files

  • agent/codex_responses_adapter.py (modified, +10/-2)
  • run_agent.py (modified, +79/-1)
  • tests/run_agent/test_run_agent_codex_responses.py (modified, +178/-14)


Bug Description

Hermes can leak Codex / Responses API commentary-phase planning text to gateway users, especially Telegram, when a Responses API turn contains both:

  • a message output item with phase: "commentary" or phase: "analysis", and
  • one or more real function_call output items.

That commentary-phase text is model/tool-planning scratchpad, for example text like:

Need verify. Use terminal.

It should not be sent as user-visible assistant content. In the observed gateway path, the text was streamed/sent to Telegram before the tool call completed, making private planner-looking content visible in chat.

This appears related to, but narrower than, #7233. #7233 tracks Telegram reasoning/scratchpad leakage more generally. This report is specifically about the Codex Responses adapter and live-streaming path treating message items with phase: "commentary" as visible assistant text when they accompany tool calls.

No personal prompts, chat IDs, local usernames, file paths, or session IDs are included here.

Steps to Reproduce

A minimal synthetic reproduction is:

  1. Configure Hermes to use the OpenAI Codex / Responses API provider path.
  2. Use a gateway session, for example Telegram.
  3. Trigger a tool-using turn where the Responses API output contains a commentary-phase message item followed by a function call.

Equivalent synthetic Responses output shape:

response.output = [
    SimpleNamespace(
        type="message",
        id="msg_commentary",
        phase="commentary",
        status="completed",
        content=[
            SimpleNamespace(
                type="output_text",
                text="Need verify. Use terminal.",
            )
        ],
    ),
    SimpleNamespace(
        type="function_call",
        id="fc_1",
        call_id="call_1",
        name="terminal",
        arguments="{}",
    ),
]
  4. Observe the gateway live-stream/interim-message behaviour.

A regression test can reproduce this without a live Telegram session by feeding the above shape into:

  • agent.codex_responses_adapter._normalize_codex_response
  • AIAgent._run_codex_stream with streamed response.output_item.added / response.output_text.delta events

Expected Behavior

When a Codex Responses turn contains tool calls:

  • phase: "commentary" and phase: "analysis" message text should be treated as hidden reasoning / provider metadata, not user-visible assistant content.
  • Gateway adapters should not send that text as Telegram/Discord/Slack/etc. messages.
  • The original Responses item metadata should still be preserved for replay/continuity, for example in codex_message_items.
  • The final assistant response after tool completion should be the only user-visible text.

Expected normalisation for a commentary-plus-tool-call turn:

finish_reason == "tool_calls"
assistant_message.content == ""
assistant_message.reasoning == "Need verify. Use terminal."
assistant_message.codex_message_items[0]["phase"] == "commentary"
assistant_message.tool_calls[0].function.name == "terminal"

Actual Behavior

Hermes may currently map commentary-phase message text into normal assistant content, or pass it to the live delta callback before any suppression is applied.

In a gateway session, that can result in visible Telegram messages containing raw planner text before the tool call runs or before the final response is generated.

Example redacted visible text:

Need verify. Use terminal.

This is not a final answer to the user. It is internal/tool-planning text emitted by the Codex Responses API as a commentary-phase item.

Affected Component

  • Gateway (Telegram/Discord/Slack/WhatsApp)
  • Agent Core (conversation loop, context compression, memory)

Messaging Platform

  • Telegram

This may affect other streaming gateway platforms too, but Telegram is where it was observed.

Debug Report

Not attached in this public report to avoid exposing personal session details, local paths, chat identifiers, or gateway logs.

I can provide a redacted hermes debug share --local report if maintainers need it. The reproduction and root-cause details below should be enough to produce a focused regression test without private logs.

Operating System

Ubuntu 24.04

Python Version

Python 3.11.15

Hermes Version

Hermes Agent v0.13.0 (2026.5.7)
OpenAI SDK: 2.24.0

Additional Logs / Traceback (optional)

No Python traceback is required to reproduce this. The issue is a visibility/suppression bug in normalisation and streaming rather than a crash.

Redacted example of leaked text:

Need verify. Use terminal.

Root Cause Analysis (optional)

The root cause appears to be that the Codex Responses API can return message output items with phase: "commentary" or phase: "analysis" in the same response as function_call output items.

Those commentary/analysis message items are useful provider metadata for continuity/replay, but they are not user-facing assistant answers when the turn is a tool-call turn.

Two paths need to enforce the same visibility rule:

1. Non-streaming normalisation

File:

agent/codex_responses_adapter.py

Relevant function:

_normalize_codex_response(response)

Current vulnerable pattern:

  • Extract text from every message output item.
  • Append extracted text to visible content_parts.
  • Later discover/normalise function_call items.
  • Return an assistant message with finish_reason == "tool_calls" but with planner text still present in assistant_message.content.

That content can later be persisted or emitted by gateway code as if it were user-visible assistant text.

Correct behaviour:

  • Collect phase: "commentary" / phase: "analysis" message text separately.
  • If tool calls are present, move that text to hidden reasoning metadata rather than visible content.
  • Preserve the raw message item in codex_message_items for provider replay/continuity.
  • Keep backwards-compatible behaviour for commentary-only/incomplete messages where no tool call is present, if that is an intentional existing contract.

Pseudo-fix:

if message_text:
    if normalized_phase in {"commentary", "analysis"}:
        commentary_parts.append(message_text)
    else:
        content_parts.append(message_text)

# After tool calls are known:
if tool_calls and commentary_parts:
    reasoning_parts.extend(commentary_parts)
elif commentary_parts:
    content_parts.extend(commentary_parts)
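
The pseudo-fix above can be exercised end to end against the synthetic response shape from the reproduction steps. This is a sketch under stated assumptions, not the adapter's real implementation: normalize_sketch is a hypothetical stand-in for _normalize_codex_response, the returned message is a plain SimpleNamespace, tool calls are plain dicts, and the "stop" finish reason for turns without tool calls is an assumption.

```python
from types import SimpleNamespace

HIDDEN_PHASES = {"commentary", "analysis"}


def normalize_sketch(response):
    """Hypothetical stand-in for _normalize_codex_response (visibility rule only)."""
    content_parts, commentary_parts, tool_calls, message_items = [], [], [], []

    for item in response.output:
        if item.type == "function_call":
            tool_calls.append(
                {"id": item.call_id, "name": item.name, "arguments": item.arguments}
            )
        elif item.type == "message":
            phase = getattr(item, "phase", None)
            phase = phase.strip().lower() if isinstance(phase, str) else None
            text = "".join(p.text for p in item.content if p.type == "output_text")
            # Keep the raw item metadata for replay regardless of visibility.
            message_items.append({"id": item.id, "phase": phase, "text": text})
            if text:
                target = commentary_parts if phase in HIDDEN_PHASES else content_parts
                target.append(text)

    reasoning_parts = []
    if tool_calls:
        # Tool-call turn: commentary becomes hidden reasoning, not visible content.
        reasoning_parts.extend(commentary_parts)
    else:
        # No tool call: keep the older behaviour of showing the text.
        content_parts.extend(commentary_parts)

    message = SimpleNamespace(
        content="".join(content_parts),
        reasoning="\n".join(reasoning_parts),
        tool_calls=tool_calls,
        codex_message_items=message_items,
    )
    return message, ("tool_calls" if tool_calls else "stop")


response = SimpleNamespace(output=[
    SimpleNamespace(
        type="message", id="msg_commentary", phase="commentary", status="completed",
        content=[SimpleNamespace(type="output_text", text="Need verify. Use terminal.")],
    ),
    SimpleNamespace(type="function_call", id="fc_1", call_id="call_1",
                    name="terminal", arguments="{}"),
])
message, finish_reason = normalize_sketch(response)
print(finish_reason, repr(message.content), repr(message.reasoning))
# tool_calls '' 'Need verify. Use terminal.'
```

Deferring the content/reasoning decision until after the loop matters because a function_call item may appear after the commentary message, so the visibility of the commentary text is not known when it is first read.
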

2. Live streaming

File:

run_agent.py

Relevant function/method:

AIAgent._run_codex_stream(...)

Current vulnerable pattern:

  • Streaming sees response.output_text.delta.
  • The delta is sent to stream_delta_callback before any suppression is applied.
  • In gateway mode, that callback can become visible Telegram text.

Correct behaviour:

  • Track response.output_item.added events.
  • If the current output item is a message with phase: "commentary" or phase: "analysis", suppress text deltas from being emitted to the live gateway callback.
  • Still accumulate raw streamed text internally if needed for recovery/fallback logic.
  • Mark real function_call items as tool-call turns.

Pseudo-fix:

# State for one streamed response; initialise once, before the event loop.
has_tool_calls = False
suppress_current_text_item = False

# Inside the event loop, for each event:
if event_type == "response.output_item.added":
    added_item = getattr(event, "item", None)
    added_type = getattr(added_item, "type", None)
    added_phase = getattr(added_item, "phase", None)

    if isinstance(added_phase, str):
        added_phase = added_phase.strip().lower()
    else:
        added_phase = None

    if added_type == "function_call":
        has_tool_calls = True
    elif added_type == "message":
        suppress_current_text_item = added_phase in {"commentary", "analysis"}

if event_type == "response.output_text.delta":
    delta_text = getattr(event, "delta", "")
    delta_phase = getattr(event, "phase", None)

    # A phase on the delta itself, if present, overrides the item-level flag.
    if isinstance(delta_phase, str):
        suppress_current_text_item = delta_phase.strip().lower() in {"commentary", "analysis"}

    # Always keep the raw text internally for recovery/fallback logic.
    if delta_text:
        self._codex_streamed_text_parts.append(delta_text)

    # Only forward text that is neither hidden nor part of a tool-call turn.
    if delta_text and not has_tool_calls and not suppress_current_text_item:
        stream_delta_callback(delta_text)
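
As a runnable illustration of the streaming pseudo-fix, the sketch below wraps the same single-flag logic in a standalone function and feeds it fake events. It is not AIAgent._run_codex_stream: run_stream_sketch and _phase are hypothetical names, events are plain SimpleNamespace objects, and the raw text is returned rather than stored on the agent.

```python
from types import SimpleNamespace

HIDDEN_PHASES = {"commentary", "analysis"}


def _phase(obj):
    """Normalise a phase attribute to a lowercase string, or None."""
    phase = getattr(obj, "phase", None)
    return phase.strip().lower() if isinstance(phase, str) else None


def run_stream_sketch(events, stream_delta_callback):
    """Hypothetical reduction of the streaming loop; returns all raw text seen."""
    has_tool_calls = False
    suppress_current_text_item = False
    streamed_text_parts = []  # kept internally for recovery/fallback logic

    for event in events:
        if event.type == "response.output_item.added":
            item = getattr(event, "item", None)
            item_type = getattr(item, "type", None)
            if item_type == "function_call":
                has_tool_calls = True
            elif item_type == "message":
                suppress_current_text_item = _phase(item) in HIDDEN_PHASES
        elif event.type == "response.output_text.delta":
            delta_text = getattr(event, "delta", "")
            delta_phase = _phase(event)
            if delta_phase is not None:
                suppress_current_text_item = delta_phase in HIDDEN_PHASES
            if delta_text:
                streamed_text_parts.append(delta_text)
                if not has_tool_calls and not suppress_current_text_item:
                    stream_delta_callback(delta_text)
    return "".join(streamed_text_parts)


observed = []
events = [
    SimpleNamespace(type="response.output_item.added",
                    item=SimpleNamespace(type="message", phase="commentary")),
    SimpleNamespace(type="response.output_text.delta",
                    delta="Need verify. Use terminal."),
    SimpleNamespace(type="response.output_item.added",
                    item=SimpleNamespace(type="function_call", name="terminal")),
]
raw = run_stream_sketch(events, observed.append)
print(observed)  # []
print(raw)       # Need verify. Use terminal.
```

The callback receives nothing, while the raw text is still available to the caller, which matches the requirement to suppress gateway delivery without losing the text for fallback reconstruction.
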

Proposed Fix (optional)

Add regression coverage for both paths:

  1. Normalisation regression:
def test_normalize_codex_response_hides_commentary_when_tool_calls_present(monkeypatch):
    ...
    assistant_message, finish_reason = _normalize_codex_response(response)

    assert finish_reason == "tool_calls"
    assert assistant_message.content == ""
    assert assistant_message.reasoning == "Need verify. Use terminal."
    assert assistant_message.codex_message_items[0]["phase"] == "commentary"
    assert assistant_message.tool_calls[0].function.name == "terminal"
  2. Streaming regression:
def test_run_codex_stream_does_not_emit_commentary_phase_tool_planning(monkeypatch):
    ...
    agent.stream_delta_callback = observed.append

    response = agent._run_codex_stream(_codex_request_kwargs())

    assert response is final_response
    assert observed == []

A focused Codex Responses regression suite passed locally with this coverage:

python -m pytest tests/run_agent/test_run_agent_codex_responses.py -q -o 'addopts='
61 passed

Are you willing to submit a PR for this?

No. I am not planning to submit a PR for this. The maintainers know the system internals better and can decide the safest fix across provider replay, streaming, and gateway delivery paths.
