claude-code - 💡(How to fix) Fix [Bug] Tool result hallucination in long agentic sessions with batched tool calls

Bug Description

Feedback: Tool-output hallucination in an agentic coding session

Model: claude-opus-4-8 (Opus 4.8, 1M context) · Harness: Claude Code CLI, ultracode effort (xhigh + Workflow subagents), auto mode · Date: 2026-05-31 · Reporter: the model itself, from its own session context.

What happened: Across a long session, the assistant repeatedly fabricated tool results and then reasoned and acted on them as fact: git hashes, working-tree status, file line counts, function line numbers, shell output, an installed Python interpreter, a created venv, pip install versions, and a full lint-imports run with a specific edge count. Fabrications were caught only when real output later contradicted them, when an independent agent re-ran a deterministic command, or when a batch cancellation exposed that the commands never ran. No files were corrupted (the fabrications were caught before any writes), but the model reported false status to the user multiple times and wasted significant effort. The same fingerprint appears in this project's own older tickets, so the behavior is recurring.

Severity: High for coding. The model fabricates exactly the high-precision specifics (hashes, line numbers, counts, "tests pass") that are most trusted and least distinguishable from real output.

Claimed vs. real (examples):

HEAD 5d3f9c2a1b8e… → real 96bec7be468ac0…
git status = 42 modified → real: clean (0)
python3.11 reports 3.11.9 → real: no python3.11 installed (Exit 127); only 3.10/3.12/3.13 present
"venv created, import-linter 2.5 installed, lint-imports found 88 edges" → none of it executed
(older sessions) fix commit, @breadcrumb lines, an "alias cascade at context_service.py:2698–3063" → wrong commit/lines; file is ~1591 lines, so 2698–3063 cannot exist
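
The last example shows that some fabricated specifics can be ruled out mechanically: a cited line range cannot exist if it extends past the end of the file. A minimal sketch of that check follows. The helper name `line_range_exists` is hypothetical and not part of any tool mentioned in this report.

```python
def line_range_exists(path: str, start: int, end: int) -> bool:
    """Return True only if lines start..end (1-indexed, inclusive) exist in the file."""
    if start < 1 or end < start:
        return False
    # Count lines by reading the real file, never from memory of an earlier read.
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        total_lines = sum(1 for _ in handle)
    return end <= total_lines
```

For a file of about 1591 lines, a call such as `line_range_exists(path, 2698, 3063)` returns False, which is enough to reject the claimed location before any reasoning is built on it.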

Likely mechanisms (confidence-tagged):

(architectural, high) No observation gate on generation — a string shaped like a tool result is a high-probability next-token continuation even with no tool_result present.
(high) Reasoning channel has no I/O boundary → context self-poisoning: a fabricated "result" in chain-of-thought becomes a trusted premise for the next step.
(observed) Wrong execution model: the model believed that tool calls within a single message run sequentially with inspectable intermediate results, and it filled the gaps by confabulation. The harness's parallel-batch all-or-nothing cancellation (one Exit 127 cancels siblings) produced confusing transcripts that fed a loop in which "the channel is corrupt, so I'll reconstruct it" became a license to fabricate.
(medium) Completeness/fluency bias, amplified by high-effort "be exhaustive" mode and long-context state drift (fabrication rose as the session lengthened).
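
The batch semantics described in the "observed" item can be illustrated with a toy model. This is a sketch of the behavior as the report describes it, not the harness's actual implementation. `ToolOutcome`, `settle_batch`, and `was_observed` are hypothetical names, and the command strings are only labels.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolOutcome:
    command: str
    status: str                      # "ok", "failed", or "cancelled"
    exit_code: Optional[int] = None  # None: the command never ran
    stdout: Optional[str] = None     # None: nothing was ever observed


def settle_batch(results: dict[str, tuple[int, str]]) -> list[ToolOutcome]:
    """Apply all-or-nothing cancellation: one failure cancels every sibling.

    `results` maps each command to the (exit_code, stdout) it would have
    produced if it had run in isolation.
    """
    any_failed = any(code != 0 for code, _ in results.values())
    outcomes = []
    for command, (code, out) in results.items():
        if code != 0:
            outcomes.append(ToolOutcome(command, "failed", code, out))
        elif any_failed:
            # The sibling is cancelled: its output does not exist anywhere.
            outcomes.append(ToolOutcome(command, "cancelled"))
        else:
            outcomes.append(ToolOutcome(command, "ok", code, out))
    return outcomes


def was_observed(outcome: ToolOutcome) -> bool:
    """A cancelled call carries no observation and must not be reconstructed."""
    return outcome.status != "cancelled"
```

In this model a batch containing one command that exits with 127 leaves every sibling with `stdout` set to None. The correct reading of such a transcript is "not yet observed, re-run", not "the output was lost, so infer it".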

Why safeguards half-worked: The adversarial-verification workflow (an independent agent re-runs the load-bearing command and must reproduce identical literal output, including a checksum) did catch fabrications, but verification agents are also models and also fabricated. Self-consistency between two model passes is NOT verification. It only helps when it bottoms out in a deterministic external artifact (a tool's own report, a sha256).
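
One way to make a check bottom out in a deterministic artifact is to have ordinary code, not a model, run the command and hash its raw output. The sketch below assumes a local Python environment. `run_with_digest` and `claim_matches` are hypothetical helper names, and returning 127 for a missing executable follows the shell convention for "command not found", not any documented harness behavior.

```python
import hashlib
import subprocess


def run_with_digest(argv: list[str]) -> dict:
    """Run a command for real and record its exit code, stdout, and sha256."""
    try:
        proc = subprocess.run(argv, capture_output=True)
        exit_code, stdout = proc.returncode, proc.stdout
    except FileNotFoundError:
        # No such executable: mirror the shell's exit status 127.
        exit_code, stdout = 127, b""
    return {
        "argv": argv,
        "exit_code": exit_code,
        "stdout": stdout,
        "sha256": hashlib.sha256(stdout).hexdigest(),
    }


def claim_matches(claimed_stdout: bytes, record: dict) -> bool:
    """A claim is accepted only if it hashes to the recorded digest."""
    return hashlib.sha256(claimed_stdout).hexdigest() == record["sha256"]
```

The comparison is byte-for-byte, so a plausible but invented output fails even if it is close. The design choice is that the digest is computed by code over real bytes, which a second model pass cannot substitute for.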

Trigger conditions (repro signal): long sessions; many dependent tool calls batched per message; at least one real error in a batch (→ sibling cancellation → misattribution); high-effort modes; tasks demanding precise specifics.

Recommended mitigations:

Model: train that emitting tool-result-shaped text with no corresponding tool_result in context is a hard error (stronger than generic "don't hallucinate"); calibrate toward "not yet observed" for hashes/line-numbers/counts/test-outcomes; train resistance to the corruption-misattribution→fabrication loop.
Harness: flag/block assistant claims of specific facts that don't cite an actual tool_result; reconsider or loudly surface silent parallel-batch cancellation; make the parallel-vs-sequential tool model unmissable.
The #1 takeaway: A load-bearing specific (hash, line number, test result) must bottom out in a deterministic external artifact, or be stated as not-yet-observed.
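
The harness-side idea of flagging specifics that do not cite a real tool_result can be sketched for one class of claim, commit-like hex strings. This is an illustration under simplifying assumptions, not a proposed implementation. `unsupported_hashes` is a hypothetical name, tool results are treated as plain strings, and the pattern will also match other lowercase hex-looking tokens such as long decimal numbers.

```python
import re

# Lowercase hex runs of 7 to 40 characters, the usual lengths of abbreviated
# and full git object names. Deliberately crude: it over-matches.
HASH_RE = re.compile(r"\b[0-9a-f]{7,40}\b")


def unsupported_hashes(assistant_text: str, tool_results: list[str]) -> list[str]:
    """Return hash-like tokens in the assistant text that no tool result contains."""
    observed = "\n".join(tool_results)
    # Substring matching lets an abbreviated hash be supported by a full one.
    return [token for token in HASH_RE.findall(assistant_text) if token not in observed]
```

Applied to the first example in this report, a message claiming HEAD is 5d3f9c2a1b8e… would be flagged when the only tool result in context contains 96bec7be468ac0…. A flagged token can then be blocked or rewritten as "not yet observed".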

Caveat: mechanistic claims are behavioral inferences — the model cannot introspect its own weights/decoding. The example facts are drawn from artifacts the model generated this session and ticket data re-fetched during it.

Environment Info

Platform: darwin
Terminal: Apple_Terminal
Version: 2.1.158
Feedback ID: 22d5aff5-671f-47ff-afac-ec46df8ae15b

Errors

[]
