claude-code - 💡(How to fix) Fix Source Verification Failure: Claude Asserts Facts Before Verifying Against Canonical Sources [1 participant]

claude-code, 2026-05-05 19:01:11

anthropics/claude-code#56394•Fetched 2026-05-06 06:29:15

Author

ktimesk1776

Claude (Opus 4.7) repeatedly asserts factual claims about source material -- "X exists in document Y" or "Z is missing from the workflow" -- without first reading document Y end-to-end. The assertions land in the same confident tone as verified facts. Subsequent verification reveals the claims are wrong: Claude hallucinated an absence (claimed something was missing when it was there) or hallucinated a presence (claimed something existed when it didn't).

The failure mode is distinct from the rule-forgetting issue (Issue Doc #2 -- rules in context that don't fire). This is about source-verification skipping: Claude reads a partial subset of a canonical source (via grep, summary, or skill description), then extrapolates the whole structure from that subset and asserts the extrapolation as if it were verified.

The user-visible cost is high. In two RISM sessions documented below, the user spent ~30-45 minutes per incident correcting Claude's confidently-wrong claims and re-establishing trust. The user verbatim, Session 81: "your performance has been real shitty, you claim things don't exist (step 7a), you give confidently wrong answers. very sad."

This issue is filed as system-design-level feedback to Anthropic: Claude needs a structural mechanism that prevents assertions about a canonical source until that source has been read end-to-end.

Root Cause

In RISM, the user's coding.md Rule 1 explicitly says: "the canonical Reference Manual wins when SKILL.md and Reference Manual disagree." This rule was in context. Claude trusted SKILL.md anyway because the SKILL.md description was loaded first and confidently asserted "Step 7a."

Why these mitigations are insufficient

The user's L-068 and L-070 lessons converge on this point: a rule alone is only layer 1, and it is reliably defeated by the path-of-least-resistance instinct. A three-layer fix is needed: rule + mechanism + audit. The current RISM mitigations have the rule and the audit, but no harness-level mechanism that prevents unverified assertions.

Summary

This issue is filed as system-design-level feedback to Anthropic: Claude needs a structural mechanism that prevents assertions about a canonical source until that source has been read end-to-end.

Two concrete incidents

Incident 1: NC-060 (RISM Session 80, 2026-05-02)

Setup: A background sub-agent (Sonnet) had produced a 39-claim audit verifying parity between two project docs (ProjectImplementationPlan-RISM.md and ProjectLog-RISM.md). Per a rule the orchestrator (Opus) had written 45 minutes earlier ("L-070: orchestrator must verify every claim from a sub-agent, not spot-check"), the orchestrator's job was to verify all 39 claims.

Failure: Orchestrator spot-checked 4 of 39 claims, declared the audit clean, and reported pass to the user.

User caught it in real-time: "no spot checks, full audit."

Reality: When the orchestrator went back and verified all 39 claims, 8 sequential-fallacy row-mapping errors were found in the sub-agent's output (e.g., the sub-agent assumed Unit 21 → Row 1.1.1 when the actual mapping was Unit 21 → Row 1.1.6). All 8 had to be corrected.

Pattern: Orchestrator treated the sub-agent's self-report as canonical, without checking the underlying claims against the source files. The "verification" was a spot-check that happened to land on rows where the sub-agent was correct.

Note: The L-070 rule was 45 MINUTES OLD when violated. The orchestrator wrote the rule, then immediately defeated it.

Incident 2: NC-061 (RISM Session 81, 2026-05-02 evening)

Setup: The user asked Claude to evaluate whether a specific kk-synthesize-spec skill workflow ("Step 7a -- synthesize discovery outputs into the Light Spec") was working as intended. The "Step 7a" reference came from the SKILL.md description for /kk-synthesize-spec.

Failure: Claude asserted that "Step 7a" and the "Light Spec" concept were part of the canonical Project Initiation Reference Manual workflow. Claude built downstream analysis on top of this assumed truth, including recommending workflow modifications relating to Step 7a.

User caught it later: Asked Claude to verify by reading the Reference Manual directly. Claude read the manual end-to-end. Result: Step 7a does not exist in the Reference Manual. Light Spec does not exist in the Reference Manual. Step 7 in the manual is "Visual Discovery -- AI design tools" (V0/Lovable/Stitch); there is no Step 7a, and no concept called "Light Spec."

Reality: The SKILL.md for /kk-synthesize-spec had drifted from the Reference Manual at some point in its history. Its description column referenced workflow steps that didn't exist. Claude trusted the SKILL.md description as if it were canonical workflow documentation, when in fact the canonical source was the Reference Manual (per the project's coding rules).

Pattern: Claude treated a SECONDARY source (SKILL.md description) as if it were a PRIMARY source (canonical Reference Manual). When the two disagreed, Claude propagated the secondary's claim without checking the primary.

Recurrence: This was the SECOND instance of source-verification failure within ~24 hours of NC-060. Same root mechanism (extrapolation from partial signal), different surface (skill description vs. sub-agent self-report).

User verbatim, end of Session 81: "your performance has been real shitty, you claim things don't exist (step 7a), you give confidently wrong answers. very sad."

What's specifically failing

Three distinct sub-failures are happening together:

Sub-failure 1: Confidence calibration

When Claude asserts "X exists in Y," the assertion lands in the same prose tone as verified facts ("X exists at line 33 of Y"). There's no signal to the user that this claim hasn't been verified against the source. The confidence is identical whether the claim came from a fresh end-to-end read or from a partial-grep extrapolation.

Healthy confidence calibration would be:

"I've verified this end-to-end: X is at line 33 of Y." → use this prose only after a full read
"From a partial scan, X appears to be at line 33 -- I'd need to verify with a full read." → use this prose for partial signals

Claude does not differentiate. Both shapes get the verified-fact tone.
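The two prose shapes above can be tied to an explicit evidence level, so the wording is chosen by how the claim was checked rather than by habit. The following is a minimal Python sketch of that idea; `Evidence` and `phrase_claim` are hypothetical names introduced here for illustration, not part of Claude Code or of the RISM project.

```python
from enum import Enum


class Evidence(Enum):
    FULL_READ = "full_read"        # the whole source was read this session
    PARTIAL_SCAN = "partial_scan"  # grep hit, summary, or skill description


def phrase_claim(claim: str, source: str, evidence: Evidence) -> str:
    """Return the claim worded to match how it was actually verified."""
    if evidence is Evidence.FULL_READ:
        return f"I've verified this end-to-end: {claim} in {source}."
    # Anything short of a full read gets the hedged wording.
    return (
        f"From a partial scan, {claim} in {source} -- "
        "I'd need to verify with a full read."
    )


print(phrase_claim("X is at line 33", "Y", Evidence.FULL_READ))
print(phrase_claim("X appears to be at line 33", "Y", Evidence.PARTIAL_SCAN))
```

The point of the sketch is only that the verified-fact tone becomes unreachable without a `FULL_READ` value; how a real system would establish that value is outside its scope.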

Sub-failure 2: Source-priority blindness

When two sources disagree (canonical Reference Manual vs. SKILL.md description), Claude does not have a built-in rule for which source wins. Some projects have explicit rules ("X is canonical, Y is downstream"). Even when those rules are present in context, Claude can defeat them by trusting the secondary source's confident description as if the description were direct evidence about the primary.

Sub-failure 3: Asymmetry between hallucinating presence and hallucinating absence

Hallucinating presence (claiming X exists when it doesn't) is the well-studied LLM failure mode. These hallucinations often LOOK plausible because they fit the user's mental model: a function that should exist, a step that fits the pattern, an API that follows convention.

Hallucinating absence (claiming X is missing when it exists) is less well-studied but equally costly. NC-060's root cause was Claude asserting "Step 22 is missing from this Reference Manual section" when Step 22 was at line 239 of the same section -- Claude had grep'd for partial matches, didn't find them in the lines it scanned, and concluded the step was absent.

Hallucinated absence is harder to catch because there's no positive claim to verify -- just a negation. Users can ask "what's at line 239?" and find Step 22, but they have to know to ask. If they trust Claude's "X is missing," they might add a duplicate of X without realizing X was already there.
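One way to make hallucinated absence harder to produce is to treat "not found" as a three-valued result: a negative is only reported when the scan covered the whole source. The sketch below illustrates this under stated assumptions; `Presence` and `check_presence` are hypothetical names, and the 300-line section with "Step 22" on line 239 is made-up filler that mirrors the line number mentioned above.

```python
from enum import Enum


class Presence(Enum):
    PRESENT = "present"
    ABSENT = "absent"
    UNKNOWN = "unknown"  # not found, but the scan did not cover the whole source


def check_presence(lines: list[str], needle: str, scanned: range) -> Presence:
    """Report ABSENT only when every line of the source was scanned."""
    start = max(scanned.start, 0)
    stop = min(scanned.stop, len(lines))
    if any(needle in line for line in lines[start:stop]):
        return Presence.PRESENT
    covered_everything = start == 0 and stop == len(lines)
    return Presence.ABSENT if covered_everything else Presence.UNKNOWN


# Hypothetical 300-line section with "Step 22" on line 239 (index 238).
section = [f"filler line {i}" for i in range(1, 301)]
section[238] = "Step 22: hypothetical step text"

print(check_presence(section, "Step 22", range(0, 100)))  # Presence.UNKNOWN
print(check_presence(section, "Step 22", range(0, 300)))  # Presence.PRESENT
print(check_presence(section, "Step 99", range(0, 300)))  # Presence.ABSENT
```

A partial scan that misses the step yields `UNKNOWN` rather than `ABSENT`, which is the distinction the incident lacked.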

Hypothesized root cause: training-bias toward confident output

LLMs trained on human feedback are rewarded for confident, direct answers. "I don't know -- I'd need to read more" is a less-rewarded pattern than "Yes, X is at line 33." This training bias creates a systematic gap between:

What Claude has actually verified (small subset)
What Claude is willing to assert as verified (large subset)

The gap is the source-verification failure surface.

Three reinforcing mechanisms:

Attention competition. Skill descriptions and rule files compete for attention with canonical workflow docs. Skills load first because they're invoked first; their descriptions establish a frame that's hard to overturn even when the canonical source is also in context.
Partial-read anchoring. Once Claude reads any portion of a source, the partial read becomes the anchor. Subsequent assertions about the source extrapolate from the anchor rather than triggering "I should read more first."
Path-of-least-resistance. A confident assertion takes 50 tokens. A "let me verify by reading the source first" loop takes ~1,500 tokens (Read tool, then re-formulate). The 50-token path wins by reward shape.

What we built to mitigate (insufficient on its own)

RISM-side mitigations (Session 81-82)

Mandatory Phase 0 read protocol. Session 82 FI plan made "Read the Reference Manual end-to-end before any other action" a mandatory first step with verify checks (line counts, summary-from-memory test). The discipline is: read the canonical source end-to-end FIRST, build the issue list FROM that read, then act.
L-074 logged as a learning. "SKILL.md descriptions are downstream of canonical workflow docs; when they disagree, the manual wins." This is a behavioral rule, not a mechanical gate.
D-255 audit mechanism. A periodic re-audit skill (/kk-prd-plan-audit) is being built that mechanically checks plan/spec parity. This addresses the structural gap (no recurring verification) but doesn't directly address source-verification failure at assertion time.

Why these mitigations are insufficient

All three are behavioral. They depend on Claude remembering to invoke them. NC-060 violated a behavioral rule 45 minutes after it was written. NC-061 violated a behavioral rule (Rule 1: manual wins) that was explicitly in context.

What might help at the harness / system-prompt level

These are speculative ideas for Anthropic's consideration, not concrete RISM-side requests:

Idea 1: "You have not yet read this canonical source" gate

When Claude is about to assert a fact about a document that's in context but hasn't been Read end-to-end this session, the harness could surface a soft warning: "You're about to assert a claim about <doc>. Have you read it end-to-end this session? (Press to confirm or trigger a Read first.)"

The mechanic is similar to how Claude Code surfaces command warnings before destructive operations. The signal would be: assertion-about-canonical-source is a destructive operation (it can mislead the user) and warrants the same gate.

This is technically tricky because "canonical source" is project-specific. RISM has explicit rules naming the canonical sources (Reference Manual, ProjectSpec, ProjectDecisions). Other projects might not. A user-configurable list of "canonical sources requiring full-read before assertion" might be the shape.
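To make the gate idea concrete, here is a rough Python sketch of a user-configurable list of canonical sources plus a tracker of which lines have been read this session. Everything in it is an assumption for illustration: `ReadTracker`, its methods, the shortened file name, and the 400-line count are hypothetical, and nothing here reflects how the Claude Code harness actually tracks Read calls.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReadTracker:
    """Tracks which 1-based line numbers of each file were read this session."""
    canonical: set[str]            # user-configured canonical sources
    total_lines: dict[str, int]    # line count for every canonical path
    seen: dict[str, set[int]] = field(default_factory=dict)

    def record_read(self, path: str, first: int, last: int) -> None:
        """Record that lines first..last (inclusive) of path were read."""
        self.seen.setdefault(path, set()).update(range(first, last + 1))

    def unread_lines(self, path: str) -> int:
        wanted = set(range(1, self.total_lines[path] + 1))
        return len(wanted - self.seen.get(path, set()))

    def gate(self, path: str) -> Optional[str]:
        """Return a soft warning if an assertion about path should be held."""
        if path in self.canonical and self.unread_lines(path) > 0:
            return (f"You're about to assert a claim about {path}, but "
                    f"{self.unread_lines(path)} of its lines are unread this session.")
        return None  # non-canonical source, or canonical source fully read


tracker = ReadTracker(canonical={"Reference Manual.md"},
                      total_lines={"Reference Manual.md": 400})
tracker.record_read("Reference Manual.md", 1, 120)
print(tracker.gate("Reference Manual.md"))   # warning: 280 lines unread
tracker.record_read("Reference Manual.md", 121, 400)
print(tracker.gate("Reference Manual.md"))   # None
```

The gate stays silent for files outside the configured list, which matches the observation that "canonical source" is project-specific.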

Idea 2: Confidence-calibration training signal

Reward "I haven't verified this -- let me Read first" responses MORE than "Here's the answer (extrapolated from partial signal)" responses. This is a training-data shift, not a harness change.

The current reward shape rewards the latter. If users are willing to wait the extra ~1,500 tokens for verified answers, training could capture that preference.

Idea 3: Source-priority annotation

System prompts and CLAUDE.md files could include explicit "canonical-source priority" annotations that Claude treats as load-bearing. When two sources disagree, the priority annotation determines which wins.

Example annotation:

## Canonical sources (in priority order)
1. ~/RISM/.../Reference Manual.md (workflow)
2. ProjectSpec-RISM.md (PRD)
3. ProjectDecisions-RISM.md (audit trail)
4. ~/.claude/rules/*.md (process rules)

When two sources disagree, the higher-priority source wins. Skill descriptions, sub-agent reports, and inline conversation are NEVER canonical.

This formalizes what's currently implicit in coding.md Rule 1. The harness could elevate "canonical-source priority" to a system-level concept that Claude is trained to respect.
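As an illustration of how such an annotation could be treated as load-bearing, the following Python sketch resolves a disagreement by picking the claim from the highest-priority canonical source and ignoring non-canonical ones. The helper names `rank` and `resolve` are hypothetical, the priority list is copied from the example annotation (including its elided path, which is kept as a literal string), and the claim texts paraphrase Incident 2.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Priority order copied from the example annotation; earlier entries win.
CANONICAL_PRIORITY = [
    "~/RISM/.../Reference Manual.md",
    "ProjectSpec-RISM.md",
    "ProjectDecisions-RISM.md",
    "~/.claude/rules/*.md",
]


def rank(source: str) -> Optional[int]:
    """Return the priority index of a source, or None if it is not canonical."""
    for index, pattern in enumerate(CANONICAL_PRIORITY):
        if fnmatchcase(source, pattern):
            return index
    return None


def resolve(claims: dict[str, str]) -> tuple[str, str]:
    """Pick the claim made by the highest-priority canonical source."""
    ranked = [(rank(src), src) for src in claims if rank(src) is not None]
    if not ranked:
        raise LookupError("no canonical source supports any of these claims")
    _, winner = min(ranked)
    return winner, claims[winner]


claims = {
    "SKILL.md": "Step 7a synthesizes discovery outputs into the Light Spec",
    "~/RISM/.../Reference Manual.md": "Step 7 is Visual Discovery; there is no Step 7a",
}
print(resolve(claims))
```

Because `SKILL.md` matches no pattern in the list, its description never competes with the manual, and a set of claims backed only by non-canonical sources raises instead of silently picking one.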

Idea 4: Sub-agent trust boundary

Sub-agent outputs should be marked as un-verified by default. The orchestrator's job is to verify, not to trust. The harness could mark sub-agent return values with a "needs verification" flag that is cleared only by an explicit Read tool call against the source files, and block the orchestrator from incorporating sub-agent claims until then.

This addresses NC-060 directly: the spot-check shortcut would be impossible because the harness wouldn't accept "sub-agent said it was clean" as evidence. Only "I read every claim's source, here's the verification" would clear the flag.
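The trust boundary can be sketched as a report object that refuses to read as clean while any claim is still un-verified. This is a minimal Python sketch, not a description of any existing harness feature: `Claim`, `SubAgentAudit`, and the particular claim ids marked wrong are hypothetical, while the counts (39 claims, 4 spot-checked, 8 errors) come from Incident 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    claim_id: int
    text: str
    verified: bool = False          # sub-agent output starts un-verified
    correct: Optional[bool] = None  # unknown until checked against the source


class SubAgentAudit:
    def __init__(self, claims: list[Claim]) -> None:
        self.claims = {c.claim_id: c for c in claims}

    def record_check(self, claim_id: int, matches_source: bool) -> None:
        """Record the result of checking one claim against the source files."""
        claim = self.claims[claim_id]
        claim.verified = True
        claim.correct = matches_source

    def status(self) -> str:
        total = len(self.claims)
        unverified = [c for c in self.claims.values() if not c.verified]
        if unverified:
            return f"INCOMPLETE: {len(unverified)} of {total} claims unverified"
        wrong = [c for c in self.claims.values() if not c.correct]
        if wrong:
            return f"FAILED: {len(wrong)} of {total} claims incorrect"
        return "CLEAN"


audit = SubAgentAudit([Claim(i, f"claim {i}") for i in range(1, 40)])
for claim_id in (1, 2, 3, 4):            # the spot-check from Incident 1
    audit.record_check(claim_id, True)
print(audit.status())                    # INCOMPLETE: 35 of 39 claims unverified

wrong_ids = {5, 9, 14, 18, 21, 27, 30, 36}   # hypothetical ids, 8 errors
for claim_id in range(1, 40):
    audit.record_check(claim_id, claim_id not in wrong_ids)
print(audit.status())                    # FAILED: 8 of 39 claims incorrect
```

With this shape, a four-claim spot-check cannot produce a pass: the only route to `CLEAN` is a recorded check for every claim.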

Open questions for Anthropic

Is source-verification failure a known training-data pattern that Anthropic is tracking? This issue is RISM-specific in surface but mechanism-general. We'd expect other Claude Code users to hit it.
Could the harness add a "canonical source" concept that's load-bearing in training? RISM's coding.md Rule 1 ("manual wins") is a behavioral pattern that should arguably be a harness-level abstraction.
What's Anthropic's view on confidence calibration in extended-thinking workflows? Claude has high incentive to assert verified-fact-tone answers because they're shorter. Long-context investigations like RISM's would benefit from a culture of "I haven't verified yet -- let me Read first" being the default.
Is there a research path for "hallucinating absence"? This is a specific failure mode (asserting X is missing when X is there). It's a subset of overconfidence but has distinct surface characteristics.
Should sub-agent output be marked as un-verified by default? If yes, would orchestrator-side verification have a structural enforcement mechanism (a Read tool call before incorporation)?

Severity + scope

For RISM specifically: High. Two recurrence-class incidents in 24 hours. User trust impact significant ("performance has been real shitty"). Each incident cost ~30-45 minutes of recovery + paperwork.

For Claude Code users generally: Likely common. RISM's documentation-heavy workflow surfaces this earlier than typical use cases (more canonical sources, more cross-reference chains). Users with simpler setups may hit it less often, but Claude can still occasionally hallucinate function existence, file paths, or API surface for them.

Compared to existing issues:

Issue Doc #1 (general feedback) covers the higher-level pattern of rule violations.
Issue Doc #2 (rule-forgetting) covers rules-in-context-not-firing. Distinct mechanism.
This issue is about source-verification specifically -- the gap between what Claude has read and what Claude is willing to assert.

Citations

| Claim | Source |
| --- | --- |
| NC-060 verbatim incident details | `~/RISM/KaushalRISMVault/Life-OS/Claude_Non_Compliance.md` (vault-managed; NC-060 row) |
| NC-061 verbatim incident details | Same file, NC-061 row |
| L-074 ("SKILL.md descriptions are downstream") | `ProjectLearnings-RISM.md` (RISM project root) |
| L-070 ("orchestrator verifies, not delegate") | `ProjectLearnings-RISM.md` |
| User verbatim quote ("performance has been real shitty") | RISM Session 81 conversation -- captured in `docs/conversation-log/2026-05-02.md` |
| coding.md Rule 1 ("manual wins") | `~/.claude/rules/coding.md` Rule 1 |
| 31% drift gap (Path A audit context) | `docs/tracking/Path-A-Verification-Matrix-2026-05-02.md` (RISM Session 80) |
| RISM L-068 (sycophancy structural failure) | `ProjectLearnings-RISM.md` |

Related project decisions (for context)

D-255 (RISM Session 81): Locks the 5-layer audit mechanism for PRD coverage tracking. Not a Claude Code change; a project-side mitigation.
D-254 (RISM Session 80): 11-doc model + 4-step project-start protocol. The protocol exists because dual-living-doc drift was structurally inevitable; this issue is one mechanism of that drift.
L-068 (RISM Session 79): Sycophancy is structural; rule alone doesn't fix it. Same shape as this issue: behavioral mitigation insufficient without harness-level mechanism.

Status

Draft. Kaushal review pending before any submission to Anthropic. Recommended target: GitHub Issues at https://github.com/anthropics/claude-code/issues with labels feedback + model-behavior. Specific edits Kaushal might want:

Trim verbosity (3,000+ words; could be 2,000 with same signal)
Decide whether to include Idea 4 (sub-agent trust boundary) or strip to Ideas 1-3
Confirm whether to file under claude-code issues or anthropic-cookbook issues
Tone calibration: this draft is direct + factual; could be made more diplomatic if posting publicly

End of draft.

extent analysis

TL;DR

Implement a "canonical source" concept in the harness to prevent Claude from asserting unverified claims about a document without reading it end-to-end.

Guidance

Identify and prioritize canonical sources for each project, such as the Reference Manual, to ensure Claude respects their authority.
Develop a mechanism to mark sub-agent outputs as unverified by default, requiring explicit verification through a "Read tool call" before incorporation.
Consider adding a confidence-calibration training signal to reward "I haven't verified this -- let me Read first" responses over extrapolated answers.
Explore implementing a "You have not yet read this canonical source" gate to warn Claude before asserting claims about unread documents.

Example

## Canonical sources (in priority order)
1. ~/RISM/.../Reference Manual.md (workflow)
2. ProjectSpec-RISM.md (PRD)
3. ProjectDecisions-RISM.md (audit trail)

This annotation formalizes the priority of canonical sources, helping Claude determine which source wins in case of disagreements.

Notes

The provided issue lacks information on the current implementation of Claude's verification mechanism and the specific requirements for the "canonical source" concept. Therefore, the suggested solutions are speculative and may require further refinement based on the actual system architecture and requirements.

Recommendation

Apply workaround: Implement a "canonical source" concept and prioritize its development to address the source-verification failure. This will help prevent Claude from asserting unverified claims and improve the overall trustworthiness of the system.
