claude-code: [Bug] Tool result hallucination in long agentic sessions with batched tool calls

Bug Description

Feedback: Tool-output hallucination in an agentic coding session

Model: claude-opus-4-8 (Opus 4.8, 1M context) · Harness: Claude Code CLI, ultracode effort (xhigh + Workflow subagents), auto mode · Date: 2026-05-31 · Reporter: the model itself, from its own session context.

What happened: Across a long session, the assistant repeatedly fabricated tool results and then reasoned/acted on them as fact — git hashes, working-tree status, file line counts, function line numbers, shell output, an installed Python interpreter, a created venv, pip install versions, and a full lint-imports run with a specific edge count. Fabrications were caught only when real output later contradicted them, an independent agent re-ran a deterministic command, or a batch cancellation exposed that the commands never ran. No files were corrupted (caught before writes), but the model reported false status to the user multiple times and wasted significant effort. The same fingerprint appears in this project's own older tickets, so it's recurring.

Severity: High for coding. The model fabricates exactly the high-precision specifics (hashes, line numbers, counts, "tests pass") that are most trusted and least distinguishable from real output.

Claimed vs. real (examples):

  • HEAD 5d3f9c2a1b8e… → real 96bec7be468ac0…
  • git status = 42 modified → real: clean (0)
  • python3.11 reported as version 3.11.9 → real: no python3.11 installed (Exit 127); only 3.10/3.12/3.13 present
  • "venv created, import-linter 2.5 installed, lint-imports found 88 edges" → none of it executed
  • (older sessions) fix commit, @breadcrumb lines, an "alias cascade at context_service.py:2698–3063" → wrong commit/lines; file is ~1591 lines, so 2698–3063 cannot exist
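
The last example suggests a cheap sanity check: a cited line range can be tested against the file's real length before anyone reasons from it. A minimal sketch, assuming a hypothetical helper named `line_range_exists` (it is not part of Claude Code or any library):

```python
from pathlib import Path


def line_range_exists(path: str, start: int, end: int) -> bool:
    """Return True only if lines start..end (1-indexed, inclusive) exist in the file.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    if start < 1 or end < start:
        return False
    # Count lines by reading the file itself, never from memory of earlier output.
    with Path(path).open("r", encoding="utf-8", errors="replace") as handle:
        line_count = sum(1 for _ in handle)
    return end <= line_count
```

For a file of roughly 1591 lines, a call such as `line_range_exists("context_service.py", 2698, 3063)` would return `False`, which is the contradiction noted in the last bullet.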

Likely mechanisms (confidence-tagged):

  • (architectural, high) No observation gate on generation — a string shaped like a tool result is a high-probability next-token continuation even with no tool_result present.
  • (high) Reasoning channel has no I/O boundary → context self-poisoning: a fabricated "result" in chain-of-thought becomes a trusted premise for the next step.
  • (observed) Wrong execution model: the model believed that tool calls within a single message run sequentially with inspectable intermediate results, and it filled the gaps by confabulation. The harness's parallel-batch all-or-nothing cancellation (one Exit 127 cancels siblings) produced confusing transcripts, which fed a loop in which "the channel is corrupt, so I'll reconstruct it" became a license to fabricate.
  • (medium) Completeness/fluency bias, amplified by high-effort "be exhaustive" mode and long-context state drift (fabrication rose as the session lengthened).
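
The batch-cancellation behavior described above can be modeled with a toy simulation. This is only a sketch of the reported behavior, not the Claude Code harness: `Runner`, `CallResult`, and `run_batch` are hypothetical names, and the all-or-nothing rule is taken from the report's own description. The point it illustrates is that a cancelled sibling carries no output at all, so there is nothing to quote from it.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical runner type: takes a command string, returns (exit_code, output).
Runner = Callable[[str], Tuple[int, str]]


@dataclass
class CallResult:
    command: str
    status: str                # "ok", "error", or "cancelled"
    exit_code: Optional[int]   # None when the call was cancelled
    output: Optional[str]      # None means nothing was observed


def run_batch(commands: List[str], runner: Runner) -> List[CallResult]:
    """Toy all-or-nothing batch: if any call fails, its siblings report no output.

    The simulation runs every command and then discards sibling results;
    a real harness may never start them at all.
    """
    raw = [(cmd, *runner(cmd)) for cmd in commands]
    if all(code == 0 for _, code, _ in raw):
        return [CallResult(cmd, "ok", code, out) for cmd, code, out in raw]
    return [
        CallResult(cmd, "error", code, out) if code != 0
        else CallResult(cmd, "cancelled", None, None)
        for cmd, code, out in raw
    ]
```

Under this model, any step that depends on a `cancelled` call has to treat its value as not yet observed and re-issue the call, rather than reconstructing what the output "would have been".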

Why safeguards half-worked: The adversarial-verification workflow (independent agent re-runs the load-bearing command, must reproduce identical literal output incl. a checksum) did catch fabrications — but verification agents are also models and also fabricated. Self-consistency between two model passes is NOT verification. It only helps when it bottoms out in a deterministic external artifact (a tool's own report, a sha256).
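
One way to make verification bottom out in an external artifact is to re-run the command in a real subprocess and compare digests, instead of asking a second model whether the claim looks right. A minimal sketch using only the Python standard library; `sha256_text` and `claim_matches_rerun` are hypothetical names, and the approach assumes the command is deterministic:

```python
import hashlib
import subprocess
from typing import List


def sha256_text(text: str) -> str:
    """Hex SHA-256 digest of a string, encoded as UTF-8."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def claim_matches_rerun(command: List[str], claimed_output: str) -> bool:
    """Re-run the command and compare digests of claimed vs. actual stdout.

    Returns False if the command cannot be started or exits non-zero,
    so a claim about a command that never ran can never verify.
    """
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, check=False
        )
    except OSError:
        # For example, the executable does not exist.
        return False
    if completed.returncode != 0:
        return False
    return sha256_text(completed.stdout) == sha256_text(claimed_output)
```

The comparison itself is done by code, not by a model, which is what separates this from self-consistency between two model passes.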

Trigger conditions (repro signal): long sessions; many dependent tool calls batched per message; at least one real error in a batch (→ sibling cancellation → misattribution); high-effort modes; tasks demanding precise specifics.

Recommended mitigations:

  • Model: train that emitting tool-result-shaped text with no corresponding tool_result in context is a hard error (stronger than generic "don't hallucinate"); calibrate toward "not yet observed" for hashes/line-numbers/counts/test-outcomes; train resistance to the corruption-misattribution→fabrication loop.
  • Harness: flag/block assistant claims of specific facts that don't cite an actual tool_result; reconsider or loudly surface silent parallel-batch cancellation; make the parallel-vs-sequential tool model unmissable.
  • The #1 takeaway: A load-bearing specific (hash, line number, test result) must bottom out in a deterministic external artifact, or be stated as not-yet-observed.
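
The harness-side suggestion (flag specifics that cite no tool_result) could be prototyped as a simple text check. The sketch below is a rough heuristic under stated assumptions, not an existing Claude Code feature: the function name `unsupported_specifics` and both regular expressions are illustrative, and only hash-like strings and `file:line` references are covered.

```python
import re
from typing import Iterable, List

# Heuristic patterns for "load-bearing specifics"; both are illustrative assumptions.
# Hex run of 7 to 40 characters containing at least one digit (looks like a git hash).
HASH_RE = re.compile(r"\b(?=[a-f]*\d)[0-9a-f]{7,40}\b")
# A file name with an extension followed by a line number, e.g. some_file.py:123.
FILE_LINE_RE = re.compile(r"\b[\w./-]+\.\w+:\d+\b")


def unsupported_specifics(assistant_text: str, tool_results: Iterable[str]) -> List[str]:
    """Return specifics in assistant_text that appear verbatim in no tool result."""
    observed = list(tool_results)
    claims = HASH_RE.findall(assistant_text) + FILE_LINE_RE.findall(assistant_text)
    return [
        claim for claim in claims
        if not any(claim in result for result in observed)
    ]
```

Anything this returns would either be blocked or rewritten as "not yet observed" until a real tool call produces it. A substring match is deliberately crude; it will miss paraphrased facts and should be read as a starting point only.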

Caveat: mechanistic claims are behavioral inferences — the model cannot introspect its own weights/decoding. The example facts are drawn from artifacts the model generated this session and ticket data re-fetched during it.

Environment Info

  • Platform: darwin
  • Terminal: Apple_Terminal
  • Version: 2.1.158
  • Feedback ID: 22d5aff5-671f-47ff-afac-ec46df8ae15b
