hermes - ✅(Solved) Fix Cron: API failure incorrectly reported as last_status=ok, no error notification delivered [3 pull requests, 2 comments, 2 participants]

Kim-Huang-JunKai · 2026-04-30T09:07:04Z

[hermes] When a cron job's LLM API call fails e.g. timeout, retries exhausted , the job's last status is incorrectly set to "ok" and no error notification is d… When a cron job's LLM API call fails (e.g. timeout, retries exhausted), the job's `last_status` is incorrectly set to `"ok"` and no error notification is delivered to the user. The job appears to have succeeded even though the agent produced no useful output. # PR #17859: fix(cron): surface agent run_conversation failure flags as job failure - Repository: NousResearch/hermes-agent - Author: briandevans - State: closed | merged: True - Link: https://github.com/NousResearch/hermes-agent/pull/17859 ## Description (problem / solution / changelog) ## Summary - Cron jobs were silently marked `last_status=\"ok\"` when the agent's API call exhausted retries — the failure was never delivered to the user even though the agent reported it. - `run_job()` now consults the `failed` / `completed` flags that `agent.run_conversation` populates on those paths and raises so the existing except handler emits the proper failure tuple. - Adds a parametrized regression covering API exhaustion, mid-run interrupts, and partial-reply failure shapes; plus a success-path guard. ## The bug `agent.run_conversation()` returns `{\"final_response\": \"API call failed after 3 retries: Request timed out.\", \"failed\": True, \"completed\": False, \"error\": \"...\"}` on hard failure paths in `run_agent.py` (multiple paths around lines 10811–12292 set this shape). `cron/scheduler.py:run_job` only read `final_response` and returned `True, output, final_response, None` regardless. The downstream `_process_job` empty-response soft-fail at line 1342 only triggers when `final_response == \"\"` — but in this scenario the error text *is* the final_response, so: 1. `should_deliver = bool(\"API call failed...\")` → `True`, so `_deliver_result` ships the error text to the user as the agent's reply. 2. `success` stays `True`, so `mark_job_run` records `last_status=\"ok\"` with `last_error=null`. Production repro from the issue: a job pointed at a slow self-hosted endpoint timed out after 3 retries; output file showed the error, status showed \"ok\", no notification fired. ## The fix Right after the existing dict-shape guard in `run_job`, check `result.get(\"failed\") is True or result.get(\"completed\") is False` and raise `RuntimeError` with the agent's `error` (falling back to the trimmed `final_response`, then a generic string). The pre-existing except block already builds the FAILED output template and returns `False, output, \"\", error_msg`, so `mark_job_run` sees the failure and `_process_job` builds the user-visible error notification (`\"⚠️ Cron job '…' failed: …\"`). The check uses `is True` / `is False` so a result with neither flag (older or simpler success paths) keeps its current success behavior; only explicit failure markers trip it. ## Test plan - [x] Focused regression: `tests/cron/test_scheduler.py::test_run_job_treats_agent_failure_flag_as_failure` (parametrized — 4 cases: API-retry exhaustion with error text in final_response; failed+completed=False with no final_response; completed=False without an explicit failed flag; partial-reply + failed=True). - [x] Success-path guard: `test_run_job_completed_true_without_failed_flag_succeeds` confirms a normal `{completed: True, final_response: ...}` result still succeeds. - [x] Adjacent suite: full `tests/cron/test_scheduler.py` (103 passed). No prior tests stub `failed=True`, so no existing assertion changes. - [x] Regression guard: with the production fix reverted, all 4 parametrized cases fail with `assert True is False` (success was wrongly True); restoring the fix passes all 5. ## Related Fixes #17855 ## Changed files - `cron/scheduler.py` (modified, +15/-0) - `tests/cron/test_scheduler.py` (modified, +114/-0) --- # PR #17882: fix(cron): report agent API failure correctly, never hide error notif… - Repository: NousResearch/hermes-agent - Author: ambition0802 - State: closed | merged: False - Link: https://github.com/NousResearch/hermes-agent/pull/17882 ## Description (problem / solution / changelog) fix(cron): report agent API failure correctly, never hide error notifications (#17855) ## What does this PR do? Two long-unnoticed bugs were causing cron job LLM API failures (timeout, exhausted retries, internal agent execution exceptions) to be incorrectly recorded as `last_status=ok` with zero user-facing error notification, making silent cron failures completely undetectable to end users. This fix resolves both root causes while fully preserving existing correct behavior for normal successful jobs and explicitly marked [SILENT] jobs. ## Related Issue Fixes #17855 ## Type of Change - [x] 🐛 Bug fix (non-breaking change that fixes an issue) - [ ] ✨ New feature (non-breaking change that adds functionality) - [ ] 🔒 Security fix - [ ] 📝 Documentation update - [ ] ✅ Tests (adding or improving test coverage) - [ ] ♻️ Refactor (no beh

hermes2026-04-30 09:07:04

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

NousResearch/hermes-agent#17855•Fetched 2026-05-01 05:55:31

View on GitHub

Comments

Participants

Timeline

Reactions

Author

Kim-Huang-JunKai

Participants

alt-glitch

Kim-Huang-JunKai

Timeline (top)

cross-referenced ×3labeled ×3referenced ×3commented ×2

When a cron job's LLM API call fails (e.g. timeout, retries exhausted), the job's last_status is incorrectly set to "ok" and no error notification is delivered to the user. The job appears to have succeeded even though the agent produced no useful output.

Error Message

Job: "搞钱路子调研 - 早间 9:00" (ID: 5e91f26431f2) Provider: custom (hrs.kstu.vip:10070) Model: qwen3.6-27b Last Run: 2026-04-30T01:05:55 Session: only 1 message (user prompt), no assistant reply Output: "API call failed after 3 retries: Request timed out." last_status: "ok" ← wrong last_error: null ← should contain the timeout error last_delivery_error: null

Root Cause

Two issues in cron/scheduler.py:

Code Example

final_response = result.get("final_response", "") or ""
# ... only uses final_response, never checks result.get("failed")
logger.info("Job '%s' completed successfully", job_name)
return True, output, final_response, None  # Always returns success=True

---

_final_response = f"API call failed after {max_retries} retries: {_final_summary}"
return {
    "final_response": _final_response,
    "failed": True,        # <-- ignored by run_job
    "completed": False,    # <-- ignored by run_job
    "error": _final_summary,
}

---

# Delivery happens here (line ~1269) — already decided based on original success
if should_deliver:
    delivery_error = _deliver_result(...)

# Empty response check happens AFTER delivery (line ~1277)
if success and not final_response:
    success = False
    error = "Agent completed but produced empty response..."

---

Job: "搞钱路子调研 - 早间 9:00" (ID: 5e91f26431f2)
Provider: custom (hrs.kstu.vip:10070)
Model: qwen3.6-27b
Last Run: 2026-04-30T01:05:55
Session: only 1 message (user prompt), no assistant reply
Output: "API call failed after 3 retries: Request timed out."
last_status: "ok"   ← wrong
last_error: null     ← should contain the timeout error
last_delivery_error: null

---

_agent_failed = result.get("failed", False)
_agent_error = result.get("error") or ""
_agent_completed = result.get("completed", True)

# ... build output doc ...

if _agent_failed or not _agent_completed:
    error = _agent_error or final_response or "Agent failed to produce a response"
    return False, output, "", error

return True, output, final_response, None

---

# 1. Detect failures FIRST
if success and not final_response:
    success = False
    error = "Agent completed but produced empty response ..."

# 2. Then deliver (failed jobs get error notification)
deliver_content = final_response if success else f"⚠️ Cron job failed:\n{error}"

RAW_BUFFERClick to expand / collapse

Summary

Root Cause

Two issues in cron/scheduler.py:

1. `run_job()` ignores the agent's `failed` flag (primary bug)

agent.run_conversation() returns a result dict with "failed": True, "completed": False, "error": "..." when the LLM API call fails internally (e.g. all retries exhausted). However, run_job() only reads result.get("final_response") and ignores the failed / completed / error fields:

cron/scheduler.py lines ~1099-1129:

final_response = result.get("final_response", "") or ""
# ... only uses final_response, never checks result.get("failed")
logger.info("Job '%s' completed successfully", job_name)
return True, output, final_response, None  # Always returns success=True

The agent's run_agent.py line ~11507 generates:

_final_response = f"API call failed after {max_retries} retries: {_final_summary}"
return {
    "final_response": _final_response,
    "failed": True,        # <-- ignored by run_job
    "completed": False,    # <-- ignored by run_job
    "error": _final_summary,
}

Since final_response is non-empty (contains the error text), _process_job's empty-response check at line ~1277 also doesn't trigger. Result: last_status="ok" with no error notification.

2. `_process_job()`: empty-response check runs AFTER delivery logic

The soft-failure detection for empty responses (line ~1277) happens after the delivery attempt (line ~1269). When success=True and final_response="", delivery is skipped because should_deliver = bool("") == False, then success is corrected to False — but by then the delivery window has passed:

# Delivery happens here (line ~1269) — already decided based on original success
if should_deliver:
    delivery_error = _deliver_result(...)

# Empty response check happens AFTER delivery (line ~1277)
if success and not final_response:
    success = False
    error = "Agent completed but produced empty response..."

Reproduction

Configure a cron job with a custom provider that is slow/unreachable (e.g. a self-hosted endpoint)
Let the API call time out after all retries
Observe last_status: "ok", last_error: null, last_delivery_error: null
No notification is delivered to the user
The output file contains the error text but is treated as a successful run

Observed in Production

Job: "搞钱路子调研 - 早间 9:00" (ID: 5e91f26431f2)
Provider: custom (hrs.kstu.vip:10070)
Model: qwen3.6-27b
Last Run: 2026-04-30T01:05:55
Session: only 1 message (user prompt), no assistant reply
Output: "API call failed after 3 retries: Request timed out."
last_status: "ok"   ← wrong
last_error: null     ← should contain the timeout error
last_delivery_error: null

Suggested Fix

In run_job() — check the agent's failed flag before returning success:

_agent_failed = result.get("failed", False)
_agent_error = result.get("error") or ""
_agent_completed = result.get("completed", True)

# ... build output doc ...

if _agent_failed or not _agent_completed:
    error = _agent_error or final_response or "Agent failed to produce a response"
    return False, output, "", error

return True, output, final_response, None

In _process_job() — move failure detection before delivery so error notifications are sent:

# 1. Detect failures FIRST
if success and not final_response:
    success = False
    error = "Agent completed but produced empty response ..."

# 2. Then deliver (failed jobs get error notification)
deliver_content = final_response if success else f"⚠️ Cron job failed:\n{error}"

Optionally, also attempt delivery in the outer except Exception block of _process_job so unexpected crashes also notify the user.

Impact

Users are not notified when cron jobs fail silently
Failed jobs appear as successful in hermes cron list, making debugging difficult
No way to detect the failure without manually checking output files

🤖 Generated with Claude Code

extent analysis

TL;DR

The cron job's last_status is incorrectly set to "ok" when the LLM API call fails due to ignoring the agent's failed flag and incorrect ordering of failure detection and delivery logic.

Guidance

Check the agent's failed flag in run_job() before returning success to correctly handle API call failures.
Move failure detection before delivery in _process_job() to ensure error notifications are sent for failed jobs.
Consider adding delivery logic in the outer except Exception block of _process_job() to handle unexpected crashes.
Verify the fix by configuring a cron job with a custom provider that is slow/unreachable and checking the last_status and error notifications.

Example

_agent_failed = result.get("failed", False)
if _agent_failed:
    error = result.get("error") or ""
    return False, output, "", error

Notes

The suggested fix assumes that the failed flag is correctly set by the agent when the LLM API call fails. Additional logging or debugging may be necessary to ensure the fix is working as expected.

Recommendation

Apply the suggested fix to run_job() and _process_job() to correctly handle API call failures and ensure error notifications are sent to users. This will improve the reliability and debugging capabilities of the cron job system.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #ssr #memory leak #API versioning #request timeout

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

hermes - ✅(Solved) Fix Cron: API failure incorrectly reported as last_status=ok, no error notification delivered [3 pull requests, 2 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fixed

PR fix notes

PR #17859: fix(cron): surface agent run_conversation failure flags as job failure

Description (problem / solution / changelog)

Summary

The bug

The fix

Test plan

Related

Changed files

PR #17882: fix(cron): report agent API failure correctly, never hide error notif…

Description (problem / solution / changelog)

What does this PR do?

Related Issue

Type of Change

Changes Made

How to Test

Checklist

Code

Documentation & Housekeeping

For New Skills

Screenshots / Logs

Changed files

PR #17886: Fix/nameerror prompt defined 17787

Description (problem / solution / changelog)

What does this PR do?

Related Issue

Type of Change

Changes Made

How to Test

Checklist

Code

Documentation & Housekeeping

For New Skills

Screenshots / Logs

Changed files

Code Example

Summary

Root Cause

1. run_job() ignores the agent's failed flag (primary bug)

2. _process_job(): empty-response check runs AFTER delivery logic

Reproduction

Observed in Production

Suggested Fix

Impact

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING

1. `run_job()` ignores the agent's `failed` flag (primary bug)

2. `_process_job()`: empty-response check runs AFTER delivery logic