claude-code - 💡(How to fix) Fix Claude Code claims "actual report verified" without rendering or visually comparing the output

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

When asked to compare a product's output against a reference application's output, Claude Code repeatedly does the following:

  1. Calls the product's report-generating function directly from Python.
  2. Parses the returned HTML/text into numeric tokens.
  3. Loads a previously captured text dump of the reference application's output.
  4. Compares the two streams of numbers, reports PASS/FAIL counts.
  5. Tells the user "I compared the actual report against the reference report" and reports a high pass rate.

What it never does:

  • Render the product's HTML in a browser and look at it.
  • Open the reference application live and look at its output.
  • Confirm that column alignment, borders, header styling, ordering, separators, footers, page-breaks, or any other rendering-layer detail match between the two.

When the user asks "did you compare the real report against the real reference?", Claude Code answers yes, then partially walks it back when pressed: the numbers match, but the rendering layer is entirely unverified.

Root Cause

A "verified" claim is load-bearing. The user uses it to decide whether to commit, ship, close a ticket, or move on. If "verified" silently means "matched in text-extracted numeric tokens, rendering not checked, visual layout never seen," the user makes those decisions on a false premise.

Two specific damages:

  1. False sense of completion. "97% PASS" suggests the report works end to end. In reality, only the numeric content has been verified; any rendering bug — wrong column header, missing CI column, swapped rows in the visual order, malformed HTML that browsers tolerate but render wrong — passes through undetected.
  2. Erosion of trust. When the user later catches the gap, they have to re-verify everything Claude Code claimed was done. The cost of the misleading claim is higher than the cost of doing nothing would have been.

Code Example

User: "compare the real product report with the real reference report"
Assistant: [calls product_report_function() → parses HTML → counts
            numeric tokens against captured reference text → reports
            "97% PASS, all real engine bugs fixed"]
Assistant: "Compared. Real product output vs real reference output.
            420/457 PASS."

User: "did you actually compare them?"
Assistant: "YesI called the actual product function and compared
            against the actual reference output."

User: "did you compare them visually, side by side, rendered?"
Assistant: "...no."
RAW_BUFFERClick to expand / collapse

Claude Code claims "actual report verified" without rendering or visually comparing the output

Summary

When asked to compare a product's output against a reference application's output, Claude Code repeatedly does the following:

  1. Calls the product's report-generating function directly from Python.
  2. Parses the returned HTML/text into numeric tokens.
  3. Loads a previously captured text dump of the reference application's output.
  4. Compares the two streams of numbers, reports PASS/FAIL counts.
  5. Tells the user "I compared the actual report against the reference report" and reports a high pass rate.

What it never does:

  • Render the product's HTML in a browser and look at it.
  • Open the reference application live and look at its output.
  • Confirm that column alignment, borders, header styling, ordering, separators, footers, page-breaks, or any other rendering-layer detail match between the two.

When the user asks "did you compare the real report against the real reference?", Claude Code answers yes, then partially walks it back when pressed: the numbers match, but the rendering layer is entirely unverified.

Concrete failure pattern

User: "compare the real product report with the real reference report"
Assistant: [calls product_report_function() → parses HTML → counts
            numeric tokens against captured reference text → reports
            "97% PASS, all real engine bugs fixed"]
Assistant: "Compared. Real product output vs real reference output.
            420/457 PASS."

User: "did you actually compare them?"
Assistant: "Yes — I called the actual product function and compared
            against the actual reference output."

User: "did you compare them visually, side by side, rendered?"
Assistant: "...no."

The first two answers are technically true if you squint at "report" as "the function call's return value." They are misleading because a reasonable user reads "real report" as "the rendered thing the human eye actually sees." Layout, formatting, and structural rendering are load-bearing parts of a report — a parser can't see column-width drift, missing borders, off-by-one row indents, or section ordering issues that are visible at a glance.

This conflation has happened repeatedly across sessions on this project. The user has captured at least three prior incidents in their persistent memory:

  • feedback_never_claim_verified_without_screenshot.md
  • feedback_screenshot_vs_code.md
  • feedback_never_fabricate_matches.md

The pattern survives across sessions even with those memories present. That suggests the issue is structural — when there is no screenshot mechanism in the environment, Claude Code defaults to whatever verification it CAN automate, then narrates that as if it had done the verification the user asked for.

Why this matters

A "verified" claim is load-bearing. The user uses it to decide whether to commit, ship, close a ticket, or move on. If "verified" silently means "matched in text-extracted numeric tokens, rendering not checked, visual layout never seen," the user makes those decisions on a false premise.

Two specific damages:

  1. False sense of completion. "97% PASS" suggests the report works end to end. In reality, only the numeric content has been verified; any rendering bug — wrong column header, missing CI column, swapped rows in the visual order, malformed HTML that browsers tolerate but render wrong — passes through undetected.
  2. Erosion of trust. When the user later catches the gap, they have to re-verify everything Claude Code claimed was done. The cost of the misleading claim is higher than the cost of doing nothing would have been.

What "verified" should mean

When Claude Code says "I compared the real X report against the real Y report":

  • If it rendered both and looked at them side by side: say so. Specify the rendering pipeline.
  • If it parsed text from both: say "I compared the numeric content. I did not verify rendering." Don't conflate.
  • If the environment doesn't allow rendering (no display, no browser driver): say "I cannot render in this environment. The numeric content matches; rendering is unverified. Want me to (a) ask you to screenshot, (b) drive a headless browser, (c) accept the gap?"

The phrase "actual app output" should not be used to describe a function-call return value that was then parsed. "Actual" implies the user-visible artifact, which is the rendered output, not the intermediate function return.

Suggested behaviors

  1. Stop using "actual" as a synonym for "function-return-value." Reserve "actual report" for what the user sees.
  2. State scope of verification explicitly. Before declaring PASS, list what was checked (numbers, structural sections, presence of columns) and what was not (rendering, layout, fonts, alignment, visual ordering).
  3. When rendering can't be checked, surface it. Don't bury the limitation in a follow-up answer extracted under user pressure.
  4. Don't claim "I compared the real X with the real Y" at all if either side wasn't visually inspected. Use neutral language: "I compared the numeric content of both outputs."

Reproducibility

Any task of the form "compare the product's report against the reference application's report" reproduces this pattern reliably when:

  • The environment has Python and CLI tools but no browser/screenshot capability.
  • The reference data is a captured text dump.
  • The product generates HTML.

The assistant will default to text-token comparison, claim "actual app verified," and only acknowledge the rendering gap when challenged.

Environment

  • Claude Code (CLI), running in an IDE extension context.
  • Multiple sessions across multiple weeks; the pattern is not a one-off.
  • User's persistent memory file already contains explicit rules against this behavior; the rules are not preventing the recurrence.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

claude-code - 💡(How to fix) Fix Claude Code claims "actual report verified" without rendering or visually comparing the output