hermes - ✅(Solved) Fix Fix file mutation verifier false positive for diagnostic-bearing writes [3 pull requests, 1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
NousResearch/hermes-agent#24927Fetched 2026-05-14 03:50:31
View on GitHub
Comments
0
Participants
1
Timeline
8
Reactions
0
Author
Participants
Timeline (top)
labeled ×4cross-referenced ×3closed ×1

Successful file mutations can be reported as failed when their tool result includes nested diagnostics that contain an error status or error text.

Error Message

Successful file mutations can be reported as failed when their tool result includes nested diagnostics that contain an error status or error text. The file mutation verifier and related tool failure classification currently look for generic error markers in the whole tool result. A successful write_file result with bytes_written plus lint diagnostics, or a successful patch result with LSP diagnostics, can therefore be treated as a failed mutation even though the file was changed on disk.

Root Cause

Successful file mutations can be reported as failed when their tool result includes nested diagnostics that contain an error status or error text.

Fix Action

Fix / Workaround

The file mutation verifier and related tool failure classification currently look for generic error markers in the whole tool result. A successful write_file result with bytes_written plus lint diagnostics, or a successful patch result with LSP diagnostics, can therefore be treated as a failed mutation even though the file was changed on disk.

  • write_file with bytes_written should count as landed even when lint diagnostics are present.
  • patch with success true should count as landed even when LSP diagnostics are present.
  • Genuine top level mutation failures should still produce the verifier footer.

Add regression coverage for write_file, patch, display failure tags, and tool guardrail classification.

PR fix notes

PR #24929: fix: classify landed file mutations with diagnostics

Description (problem / solution / changelog)

What does this PR do?

Fixes a false positive in file mutation result handling where successful write_file and patch calls could be treated as failed when their JSON result contained nested diagnostic text such as lint errors or LSP diagnostics. The change centralizes landed mutation classification so verifier footers, CLI tool status, and tool guardrails agree on the same success contract.

Related Issue

Fixes #24927

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Security fix
  • Documentation update
  • Tests (adding or improving test coverage)
  • Refactor (no behavior change)
  • New skill (bundled or hub)

Changes Made

  • Added shared file mutation result classification in agent/tool_result_classification.py.
  • Updated the turn-end file mutation verifier to clear stale failures when a diagnostic-bearing mutation result proves the write landed.
  • Updated CLI display and tool guardrail failure classification to avoid marking landed file mutations as failed because of nested diagnostics.
  • Added regression coverage for verifier recovery, display status, guardrail classification, and the shared classifier helper.

How to Test

  1. uv run --extra dev python -m pytest -q tests/run_agent/test_file_mutation_verifier.py tests/agent/test_display.py::TestCuteToolMessagePreviewLength tests/agent/test_tool_guardrails.py tests/agent/test_tool_result_classification.py
  2. uv run --extra dev python -m ruff check agent/tool_result_classification.py agent/display.py agent/tool_guardrails.py run_agent.py tests/agent/test_tool_result_classification.py tests/agent/test_display.py tests/agent/test_tool_guardrails.py tests/run_agent/test_file_mutation_verifier.py
  3. git diff --check

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes
  • I've tested on my platform: Windows 11

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior or N/A

For New Skills

N/A

Screenshots / Logs

Targeted verifier, display, guardrail, and classifier tests passed: 55 passed. Ruff passed on changed Python files. Whitespace check passed with git diff --check.

Changed files

  • agent/display.py (modified, +3/-0)
  • agent/tool_guardrails.py (modified, +3/-0)
  • agent/tool_result_classification.py (added, +26/-0)
  • run_agent.py (modified, +3/-1)
  • tests/agent/test_display.py (modified, +22/-0)
  • tests/agent/test_tool_guardrails.py (modified, +16/-0)
  • tests/agent/test_tool_result_classification.py (added, +30/-0)
  • tests/run_agent/test_file_mutation_verifier.py (modified, +50/-0)

PR #24960: fix(agent): prevent mutation verifier false positives from lint diagnostics

Description (problem / solution / changelog)

Summary

Fixes #24927

Successful file mutations (write_file, patch) were incorrectly reported as failures when their tool results included nested lint or LSP diagnostics containing "error" or "failed" keys.

Problem

The generic heuristic in _detect_tool_failure() and classify_tool_failure() searches the entire result string for '"error"' or '"failed"' substrings. This causes false positives when:

  • write_file returns {"bytes_written": 100, "lint": {"errors": [{"message": "unused variable", "severity": "error"}]}} — the nested "error" in lint diagnostics triggers the heuristic
  • patch returns {"success": true, "diagnostics": {"error": "type mismatch"}} — the nested "error" in LSP diagnostics triggers the heuristic

Root Cause

Both _detect_tool_failure() in agent/display.py and its mirror classify_tool_failure() in agent/tool_guardrails.py apply a generic substring heuristic that doesn't account for file-mutation tools returning structured diagnostic payloads alongside success markers.

Fix

Added an early-exit check for write_file and patch tools before the generic heuristic:

  • write_file: if bytes_written is present and no top-level error → return success
  • patch: if success is True → return success regardless of diagnostics

Genuine failures (top-level error key, or no success markers) still fall through to the heuristic correctly.

Testing

Added 12 regression tests across both files:

  • TestDetectToolFailure in test_display.py (8 tests):

    • write_file with bytes_written → success
    • write_file with lint diagnostics → success
    • write_file with top-level error → failure
    • patch with success=true → success
    • patch without success → failure
    • None result → success
    • terminal non-zero exit → failure
    • generic tools with error → failure
  • TestClassifyToolFailureMutationFalsePositive in test_tool_guardrails.py (4 tests):

    • Mirror tests for the guardrail classifier

All 53 tests pass (41 existing + 12 new), confirming zero regression.

Files Changed

  • agent/display.py — early-exit for mutation tools in _detect_tool_failure()
  • agent/tool_guardrails.py — matching early-exit in classify_tool_failure()
  • tests/agent/test_display.py — 8 new tests
  • tests/agent/test_tool_guardrails.py — 4 new tests

Changed files

  • agent/display.py (modified, +13/-0)
  • agent/tool_guardrails.py (modified, +11/-0)
  • tests/agent/test_display.py (modified, +87/-0)
  • tests/agent/test_tool_guardrails.py (modified, +45/-0)

PR #25011: salvage: batch 2 LSP-related fixes (#24738 lsp gating + #24929 mutation diagnostics false positive)

Description (problem / solution / changelog)

Salvage of two independent LSP-related fixes

Cherry-picks two small, unrelated bug fixes from open PRs onto current main. Both authors are preserved as commit authors and are already in AUTHOR_MAP.

Original PRAuthorSubsystemFix
#24738@briandevansCLI startup gatingAdds lsp to _BUILTIN_SUBCOMMANDS so hermes lsp … skips the ~500-650ms plugin-discovery pass.
#24929@GodsBoyTool result classificationStops misclassifying successful write_file/patch results as failures when their JSON payload contains nested lint or LSP diagnostics with "error" keys (issue #24927).

Two independent fixes are batched here because each is small, the diffs touch disjoint files, and neither has interactions with the other. Each retains its original contributor as commit author.

#24738 — lsp missing from _BUILTIN_SUBCOMMANDS

The parity guard added in #22120 (tests/hermes_cli/test_startup_plugin_gating.py::test_builtin_set_covers_every_registered_subcommand) has been failing on main since lsp shipped via #24168. Verified empirically:

$ bash scripts/run_tests.sh tests/hermes_cli/test_startup_plugin_gating.py
FAILED test_builtin_set_covers_every_registered_subcommand
  AssertionError: _BUILTIN_SUBCOMMANDS is missing these live subcommands: ['lsp'].

After the fix: 37 passed.

#24929 — File mutation false positive on diagnostic-bearing results (#24927)

_detect_tool_failure() and classify_tool_failure() substring-match '"error"' and '"failed"' over the full result string. After PR #24168 wired post-write LSP diagnostics into write_file and patch, any successful mutation whose result includes nested lint/LSP errors gets a [error] failure tag in the CLI display, plus a turn-end verifier footer claiming the write didn't land — even though it did.

This PR introduces a shared agent/tool_result_classification.py::file_mutation_result_landed() helper and patches three callsites:

  • agent/display.py::_detect_tool_failure() — CLI failure tag
  • agent/tool_guardrails.py::classify_tool_failure() — guardrail mirror
  • run_agent.py::_record_file_mutation_result() — turn-end verifier upstream

A duplicate PR (#24960) was filed against the same issue. It has a real bug (data.get("bytes_written") truthiness check, which misclassifies an empty 0-byte write with lint diagnostics as a failure). #24929 uses "bytes_written" in data (presence check) which is correct. We are closing #24960 in favor of this salvage.

Test plan

bash scripts/run_tests.sh \
  tests/hermes_cli/test_startup_plugin_gating.py \
  tests/agent/test_display.py \
  tests/agent/test_tool_guardrails.py \
  tests/agent/test_tool_result_classification.py \
  tests/run_agent/test_file_mutation_verifier.py

Result: 117 passed.

E2E sanity:

  • 'lsp' in _BUILTIN_SUBCOMMANDS → True
  • _detect_tool_failure("write_file", '{"bytes_written": 0, "lint": {"errors": [{"severity": "error"}]}}')(False, "") (would be (True, " [error]") without the fix; this is the case PR #24960 mishandles)
  • _detect_tool_failure("write_file", '{"error": "Permission denied"}')(True, " [error]") (genuine failures still detected)
  • _detect_tool_failure("patch", '{"success": true, "lsp_diagnostics": "<diagnostics>ERROR foo</diagnostics>"}')(False, "")

Closes #24927 (via #24929). Supersedes #24738, #24929, #24960.

cc @briandevans @GodsBoy — your commits are preserved as-is in this salvage; original authorship intact in git log.

Changed files

  • agent/display.py (modified, +3/-0)
  • agent/tool_guardrails.py (modified, +3/-0)
  • agent/tool_result_classification.py (added, +26/-0)
  • hermes_cli/main.py (modified, +4/-4)
  • run_agent.py (modified, +7/-2)
  • tests/agent/test_display.py (modified, +22/-0)
  • tests/agent/test_tool_guardrails.py (modified, +16/-0)
  • tests/agent/test_tool_result_classification.py (added, +30/-0)
  • tests/run_agent/test_file_mutation_verifier.py (modified, +50/-0)
RAW_BUFFERClick to expand / collapse

Summary

Successful file mutations can be reported as failed when their tool result includes nested diagnostics that contain an error status or error text.

Problem

The file mutation verifier and related tool failure classification currently look for generic error markers in the whole tool result. A successful write_file result with bytes_written plus lint diagnostics, or a successful patch result with LSP diagnostics, can therefore be treated as a failed mutation even though the file was changed on disk.

Expected behavior

  • write_file with bytes_written should count as landed even when lint diagnostics are present.
  • patch with success true should count as landed even when LSP diagnostics are present.
  • Genuine top level mutation failures should still produce the verifier footer.

Verification target

Add regression coverage for write_file, patch, display failure tags, and tool guardrail classification.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

FAQ

Expected behavior

  • write_file with bytes_written should count as landed even when lint diagnostics are present.
  • patch with success true should count as landed even when LSP diagnostics are present.
  • Genuine top level mutation failures should still produce the verifier footer.

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

hermes - ✅(Solved) Fix Fix file mutation verifier false positive for diagnostic-bearing writes [3 pull requests, 1 participants]