hermes - ✅ (Solved) Fix: Gateway emits run.completed for runs that failed with a non-retryable client error (no run.failed) [2 pull requests, 1 comment, 2 participants]

GitHub stats
NousResearch/hermes-agent#15561 (fetched 2026-04-26 05:26:40)
Comments: 1
Participants: 2
Timeline: 6
Reactions: 0
Timeline (top): labeled ×4, commented ×1, cross-referenced ×1

When Agent.run_conversation encounters a non-retryable client error (401 invalid API key, 400 unsupported model on the configured provider, etc.), the gateway's _run_and_close still emits an SSE frame with event: run.completed, output: null, and usage: {0, 0, 0}. No run.failed event is emitted, so any downstream consumer that distinguishes run.completed from run.failed will misclassify these runs as successful.

Error Message

Non-retryable client error: Error code: 401 - {'error': {...'Authentication Fails, Your api key: ****xxxx is invalid'}}

Root Cause

run_agent.py returns a dict with failed: True instead of raising:

https://github.com/NousResearch/hermes-agent/blob/main/run_agent.py#L11462-L11469

return {
    "final_response": None,
    "messages": messages,
    "api_calls": api_call_count,
    "completed": False,
    "failed": True,
    "error": str(api_error),
}

gateway/platforms/api_server.py only branches on exceptions and never inspects the failed flag:

https://github.com/NousResearch/hermes-agent/blob/main/gateway/platforms/api_server.py#L2444-L2476

def _run_sync():
    r = agent.run_conversation(...)
    ...
    return r, u

result, usage = await asyncio.get_running_loop().run_in_executor(None, _run_sync)
final_response = result.get("final_response", "") if isinstance(result, dict) else ""
q.put_nowait({
    "event": "run.completed",   # always completed even when result["failed"] is True
    ...
    "output": final_response,
    "usage": usage,
})

The except Exception block right below this would emit run.failed, but it never fires because run_conversation returns a value rather than raising.
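
To make the control flow concrete, here is a minimal, self-contained sketch of why the exception handler is bypassed. The function names run_conversation_stub and emit_terminal_event_buggy are hypothetical stand-ins rather than code from hermes-agent, and the event dicts are trimmed to the fields that matter here.

```python
def run_conversation_stub(fail: bool) -> dict:
    """Hypothetical stand-in for Agent.run_conversation."""
    if fail:
        # Mirrors the structured failure return quoted above: nothing is raised.
        return {
            "final_response": None,
            "completed": False,
            "failed": True,
            "error": "Error code: 401",
        }
    return {"final_response": "hello", "completed": True, "failed": False}


def emit_terminal_event_buggy(fail: bool) -> dict:
    """Reproduces the pre-fix shape: only an exception leads to run.failed."""
    try:
        result = run_conversation_stub(fail)
        final_response = result.get("final_response", "") if isinstance(result, dict) else ""
        return {"event": "run.completed", "output": final_response}
    except Exception as exc:
        # Not reached for structured failures, because the stub returns normally.
        return {"event": "run.failed", "error": str(exc)}


print(emit_terminal_event_buggy(fail=True))
```

Because the key final_response is present with the value None, the .get default of "" is not used, which is consistent with the output: null seen in the SSE frame.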

Fix / Workaround

This affects any client that drives the gateway via SSE. For example, hermes-web-ui treats run.completed as success and so the UI stays silent on bad-key runs — users see their message hanging with no error indication. I just landed a defensive client-side workaround there (PR #206) that detects "no assistant text + no tool activity + empty output" and synthesises a system message, but the proper fix is upstream so any other gateway client benefits.

PR fix notes

PR #206: fix(chat): surface silently-swallowed run errors with a system message

Description (problem / solution / changelog)

Problem

When the upstream hermes-agent swallows an LLM error (e.g. invalid API key, model not supported by the configured provider), the gateway still emits run.completed — but with output: "" and usage: {0,0,0}. The web UI treats this as a successful run and shows nothing: no error toast, no system message, no indication that anything went wrong. The user just sees their own message hanging there.

Repro: configure a provider with an invalid API key (or a model the provider doesn't support), send a message. Chat looks frozen. ~/.hermes/logs/agent.log has the 401, but the UI is silent.

The root cause is upstream (agent layer catches the exception instead of letting run_conversation raise so gateway can emit run.failed). This PR adds a defensive workaround on the web-ui side so users at least get a visible hint until the upstream fix lands.

Changes

packages/client/src/stores/hermes/chat.ts

  • Add two per-run flags inside send():
    • runProducedAssistantText — flipped on reasoning.delta / thinking.delta / message.delta
    • runHadToolActivity — flipped on tool.started / tool.completed
  • On run.completed:
    1. Fallback rendering: if no assistant text was streamed but evt.output is non-empty, render it as an assistant message. Defends against providers that may deliver the final reply only via run.completed.output.
    2. Swallowed-error detection: if !runProducedAssistantText && !runHadToolActivity && output is empty, append a system message asking the user to check the hermes-agent logs. usage.total_tokens === 0 is not part of the condition because some providers / local models legitimately omit usage.

packages/client/src/api/hermes/chat.ts

  • Add output?: string | null to RunEvent interface (was already sent by the gateway, just not typed).
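
The run.completed handling described for packages/client/src/stores/hermes/chat.ts boils down to a small decision rule. The sketch below restates it in Python (the language used for the other snippets on this page) purely as an illustration; the actual change is TypeScript, and the function name, the return labels, and the treatment of both null and "" as empty are assumptions.

```python
from typing import Optional


def classify_run_completed(
    produced_assistant_text: bool,
    had_tool_activity: bool,
    output: Optional[str],
) -> str:
    """Return a hypothetical label for how a client might treat run.completed."""
    has_output = bool(output)  # assumption: None and "" both count as empty

    # Fallback rendering: nothing was streamed, but the final frame has text.
    if not produced_assistant_text and has_output:
        return "render_output"

    # Swallowed-error detection: no text, no tool activity, no output.
    # Usage totals are deliberately not consulted, as the PR notes explain.
    if not produced_assistant_text and not had_tool_activity and not has_output:
        return "swallowed_error"

    return "ok"


print(classify_run_completed(False, False, None))
print(classify_run_completed(False, True, None))
```

The second call models a tool-only run, which the rule leaves alone because tool activity counts as evidence that the run did something.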

Test plan

  • Configure deepseek with an invalid API key, send a message → system message appears with the hint, the spinner stops, the input is re-enabled.
  • Normal session with a working provider → no false positive, assistant reply renders normally.
  • Session with only tool calls and no final assistant text → no false positive (tool activity counts).
  • vue-tsc -b passes.

Notes

  • This is a workaround. The proper fix lives in hermes-agent (gateway/platforms/api_server.py), where run_conversation should propagate / surface upstream LLM errors instead of returning final_response="". I'll open a separate issue/PR there.
  • Thanks to the rubber-duck reviewer who caught two important issues with the first version: tool-only runs were being misclassified as errors, and usage===0 was too strong a condition.

Co-authored-by: Copilot

Changed files

  • packages/client/src/api/hermes/chat.ts (modified, +3/-0)
  • packages/client/src/stores/hermes/chat.ts (modified, +51/-0)

PR #15564: fix(api_server): emit run.failed when run_conversation returns failed=True

Description (problem / solution / changelog)

Problem

When run_conversation encounters a non-retryable client error (401, 400, etc.), it returns {failed: True, error: ...} instead of raising. The gateway _run_and_close only branched on exceptions — so it always emitted run.completed even for failed runs. Clients could not distinguish success from failure.

Fix

Inspect result.get("failed") before deciding which event to emit. If True, emit run.failed with the error message; otherwise emit run.completed as before. The existing except Exception path is unchanged for genuine programming errors.

Fixes #15561

Changed files

  • gateway/platforms/api_server.py (modified, +19/-8)



Repro

  1. Configure any provider with an invalid API key.
  2. POST /v1/runs to start a run, then GET /v1/runs/{run_id}/events.
  3. Observe in ~/.hermes/logs/agent.log:
    Non-retryable client error: Error code: 401 - {'error': {...'Authentication Fails, Your api key: ****xxxx is invalid'}}
  4. Observe the SSE stream:
    {"event":"run.completed","run_id":"...","output":null,"usage":{"input_tokens":0,"output_tokens":0,"total_tokens":0}}
    No run.failed is ever emitted.
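
When checking step 4 by script rather than by eye, something like the following could pick out the terminal event from captured stream lines. This is a sketch under assumptions: it supposes each frame carries its JSON payload on a data: line and that the payload has an "event" key as in the frame shown above. The helper name first_terminal_event and the run_id value "r1" are hypothetical.

```python
import json
from typing import Iterable, Optional

TERMINAL_EVENTS = {"run.completed", "run.failed"}


def first_terminal_event(lines: Iterable[str]) -> Optional[dict]:
    """Return the first run.completed / run.failed payload, or None."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        try:
            payload = json.loads(line[len("data:"):].strip())
        except json.JSONDecodeError:
            continue  # ignore frames that are not JSON
        if isinstance(payload, dict) and payload.get("event") in TERMINAL_EVENTS:
            return payload
    return None


# Sample captured from a bad-key run, using the frame shape shown above.
sample = [
    'data: {"event":"run.completed","run_id":"r1","output":null,'
    '"usage":{"input_tokens":0,"output_tokens":0,"total_tokens":0}}',
    "",
]
event = first_terminal_event(sample)
print(event["event"], event["output"])
```

Before the fix, a bad-key run yields run.completed with a null output here; after the fix, the same check would be expected to report run.failed.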


Suggested fix

Inspect the result dict before deciding which event to emit:

result, usage = await asyncio.get_running_loop().run_in_executor(None, _run_sync)
if isinstance(result, dict) and result.get("failed"):
    q.put_nowait({
        "event": "run.failed",
        "run_id": run_id,
        "timestamp": time.time(),
        "error": result.get("error") or "agent run failed",
    })
else:
    final_response = result.get("final_response", "") if isinstance(result, dict) else ""
    q.put_nowait({
        "event": "run.completed",
        "run_id": run_id,
        "timestamp": time.time(),
        "output": final_response,
        "usage": usage,
    })

This keeps the existing except Exception path for genuine programming errors while also surfacing the structured "failed" returns.
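
The snippet above depends on names from the surrounding gateway code (q, run_id, _run_sync). For experimenting outside the gateway, a self-contained approximation might look like this. The names build_terminal_event, run_and_close, and failing_run are hypothetical, and the shape of the exception-path event is an assumption, since the real except block is not quoted in this issue.

```python
import asyncio
import time


def build_terminal_event(result, usage, run_id):
    """Pick run.failed or run.completed from a run_conversation-style result."""
    if isinstance(result, dict) and result.get("failed"):
        return {
            "event": "run.failed",
            "run_id": run_id,
            "timestamp": time.time(),
            "error": result.get("error") or "agent run failed",
        }
    final_response = result.get("final_response", "") if isinstance(result, dict) else ""
    return {
        "event": "run.completed",
        "run_id": run_id,
        "timestamp": time.time(),
        "output": final_response,
        "usage": usage,
    }


async def run_and_close(run_sync, q, run_id):
    """Simplified stand-in for the gateway coroutine (assumed structure)."""
    try:
        result, usage = await asyncio.get_running_loop().run_in_executor(None, run_sync)
        q.put_nowait(build_terminal_event(result, usage, run_id))
    except Exception as exc:  # genuine programming errors still land here
        q.put_nowait({"event": "run.failed", "run_id": run_id,
                      "timestamp": time.time(), "error": str(exc)})


def failing_run():
    # Structured failure, as returned for a non-retryable client error.
    return {"final_response": None, "failed": True, "error": "Error code: 401"}, {"total_tokens": 0}


async def demo():
    q = asyncio.Queue()
    await run_and_close(failing_run, q, "run-1")
    return q.get_nowait()


event = asyncio.run(demo())
print(event["event"], event["error"])
```

Pulling the decision into a pure function is one possible design choice; it lets the branch be unit tested without an event loop or a queue.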


Environment

  • hermes-agent: main (HEAD 6407b3d5)
  • Reproduced with deepseek and dashscope providers; any non-retryable HTTP 4xx will trigger the same path.

extent analysis

TL;DR

Inspect the result dictionary from agent.run_conversation and emit a run.failed event when the failed flag is True.

Guidance

  • Check the result dictionary for a failed key and emit a run.failed event if it's True.
  • Update _run_and_close in gateway/platforms/api_server.py, at the point where it consumes the result returned by _run_sync, to handle the failed case.
  • Verify that the fix works by testing with an invalid API key and checking the SSE stream for a run.failed event.
  • Consider adding error handling for other potential failure cases.

Example

result, usage = await asyncio.get_running_loop().run_in_executor(None, _run_sync)
if isinstance(result, dict) and result.get("failed"):
    q.put_nowait({
        "event": "run.failed",
        "run_id": run_id,
        "timestamp": time.time(),
        "error": result.get("error") or "agent run failed",
    })

Notes

This fix assumes that the failed flag is correctly set in the run_agent.py module. If the flag is not set correctly, additional debugging may be necessary.

Recommendation

Apply the suggested fix to inspect the result dictionary and emit a run.failed event when the failed flag is True. This will ensure that clients driving the gateway via SSE receive accurate event notifications.
