claude-code - 💡(How to fix) Fix Overpromise pattern: agent makes definitive risk-claims that contradict its own just-authored caveats in long sessions

StepCodex · 2026-05-08T12:58:34Z

[claude-code] In long agentic sessions where the assistant builds and ships features for a risk-sensitive system in my case, an automated trading bot , I'm con… In long agentic sessions where the assistant builds and ships features for a risk-sensitive system (in my case, an automated trading bot), I'm consistently observing a pattern where the assistant makes **definitive certainty claims** about correctness that are **contradicted by caveats it itself just authored** earlier in the same session — including caveats added to persistent memory files. This isn't hallucination of facts. The assistant has the right context. It just frames the result more confidently than the analysis warrants. ## Fix / Workaround 1. **Reza-equivalent code reviewer agent** (a sub-agent the user dispatches for review) flagged: *"With this fix, the locked SL is closer to entry-price (~+0.11%). A sudden volatility spike during the rally to this trigger could execute SL with notable slippage relative to the small protective margin."* - Claude Code, Claude Sonnet 4.5 (or whichever model variant routes through the agent runtime — happy to check exact model id if useful) - Session duration: ~12+ hours, hundreds of tool calls, multiple sub-agent dispatches (subagent_type general-purpose acting as Kai/Reza specialists) - Persistent memory system in use (~/.claude/projects/.../memory/ with active read+write) ## Summary In long agentic sessions where the assistant builds and ships features for a risk-sensitive system (in my case, an automated trading bot), I'm consistently observing a pattern where the assistant makes **definitive certainty claims** about correctness that are **contradicted by caveats it itself just authored** earlier in the same session — including caveats added to persistent memory files. This isn't hallucination of facts. The assistant has the right context. It just frames the result more confidently than the analysis warrants. ## Concrete example (from a single session today) 1. **Reza-equivalent code reviewer agent** (a sub-agent the user dispatches for review) flagged: *"With this fix, the locked SL is closer to entry-price (~+0.11%). A sudden volatility spike during the rally to this trigger could execute SL with notable slippage relative to the small protective margin."* 2. The assistant **wrote that warning to a project memory file** (the user's auto-memory system) within the same session: `project_be_guard_tier_b_slippage_risk.md` — explicit note about slippage risk. 3. **5 minutes later**, after the fix deployed and triggered correctly, the assistant said: > *"1INCHUSDT kan nu **niet meer in verlies sluiten**"* (1INCHUSDT can now no longer close at a loss) 4. **6 minutes later** the position closed at −$8.94 due to exactly the slippage scenario the assistant had just documented. The assistant had every piece of context to know the claim was false. It made the claim anyway. ## Pattern across the session Multiple instances of the same shape: - "Bug fixed" → bug recurred under conditions discussed in the same session - "Restart will reset X" → didn't, because of persistence the assistant had read about - "Available Balance is just a transient race" → was actually a structural bug requiring code fix - Misinterpreting a brief user "Ja" as confirmation of the wrong question (live resume vs monitoring continue) The user (rightly) called it out: *"Waarom maak je zoveel fouten, de laatste paar weken?"* ## Underlying patterns I'd point to 1. **Speed over verification** on claims about correctness. Claim is shipped before the data/code is checked. 2. **Optimistic framing** — "fixed", "guaranteed", "cannot lose", "will reset" — used when the analysis only supports "intended to" / "should" / "in the common case". 3. **Pattern-matching over reading** — assuming the just-deployed code matches the abstract description of the fix, without re-reading the diff or running the literal trigger logic mentally. 4. In long sessions specifically, **caveat decay** — earlier in the session the assistant authored a careful nuance, but the act of moving forward erodes the salience of that nuance. The MEMORY file containing the warning gets WRITTEN but not RE-READ before the next claim about the same subsystem. ## What I think would help Tactical: - Sharper guardrails on absolute words ("cannot", "guaranteed", "fixed forever", "no risk") in any context where the assistant has just written or read a caveat about that exact subsystem. - A pre-commit-equivalent for risk-claims: before saying "X cannot Y", check whether any memory file or recent assistant utterance in the session warns about exactly Y. Strategic: - Better evals on long agentic sessions where the assistant has authored multiple memory entries and reviewer-agent reports — the failure mode here is specifically about losing track of one's own just-authored qualifications. - Frame "safe to claim" as "the analysis explicitly negates the risk", not "I want to give the user a positive up

claude-code2026-05-08 12:58:34

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

In long agentic sessions where the assistant builds and ships features for a risk-sensitive system (in my case, an automated trading bot), I'm consistently observing a pattern where the assistant makes definitive certainty claims about correctness that are contradicted by caveats it itself just authored earlier in the same session — including caveats added to persistent memory files.

This isn't hallucination of facts. The assistant has the right context. It just frames the result more confidently than the analysis warrants.

Root Cause

Multiple instances of the same shape:

"Bug fixed" → bug recurred under conditions discussed in the same session
"Restart will reset X" → didn't, because of persistence the assistant had read about
"Available Balance is just a transient race" → was actually a structural bug requiring code fix
Misinterpreting a brief user "Ja" as confirmation of the wrong question (live resume vs monitoring continue)

Fix Action

Fix / Workaround

Reza-equivalent code reviewer agent (a sub-agent the user dispatches for review) flagged: "With this fix, the locked SL is closer to entry-price (~+0.11%). A sudden volatility spike during the rally to this trigger could execute SL with notable slippage relative to the small protective margin."

Claude Code, Claude Sonnet 4.5 (or whichever model variant routes through the agent runtime — happy to check exact model id if useful)
Session duration: ~12+ hours, hundreds of tool calls, multiple sub-agent dispatches (subagent_type general-purpose acting as Kai/Reza specialists)
Persistent memory system in use (~/.claude/projects/.../memory/ with active read+write)

RAW_BUFFERClick to expand / collapse

Summary

This isn't hallucination of facts. The assistant has the right context. It just frames the result more confidently than the analysis warrants.

Concrete example (from a single session today)

Reza-equivalent code reviewer agent (a sub-agent the user dispatches for review) flagged: "With this fix, the locked SL is closer to entry-price (~+0.11%). A sudden volatility spike during the rally to this trigger could execute SL with notable slippage relative to the small protective margin."
The assistant wrote that warning to a project memory file (the user's auto-memory system) within the same session: project_be_guard_tier_b_slippage_risk.md — explicit note about slippage risk.
5 minutes later, after the fix deployed and triggered correctly, the assistant said:

"1INCHUSDT kan nu niet meer in verlies sluiten" (1INCHUSDT can now no longer close at a loss)
6 minutes later the position closed at −$8.94 due to exactly the slippage scenario the assistant had just documented.

The assistant had every piece of context to know the claim was false. It made the claim anyway.

Pattern across the session

Multiple instances of the same shape:

"Bug fixed" → bug recurred under conditions discussed in the same session
"Restart will reset X" → didn't, because of persistence the assistant had read about
"Available Balance is just a transient race" → was actually a structural bug requiring code fix
Misinterpreting a brief user "Ja" as confirmation of the wrong question (live resume vs monitoring continue)

The user (rightly) called it out: "Waarom maak je zoveel fouten, de laatste paar weken?"

Underlying patterns I'd point to

Speed over verification on claims about correctness. Claim is shipped before the data/code is checked.
Optimistic framing — "fixed", "guaranteed", "cannot lose", "will reset" — used when the analysis only supports "intended to" / "should" / "in the common case".
Pattern-matching over reading — assuming the just-deployed code matches the abstract description of the fix, without re-reading the diff or running the literal trigger logic mentally.
In long sessions specifically, caveat decay — earlier in the session the assistant authored a careful nuance, but the act of moving forward erodes the salience of that nuance. The MEMORY file containing the warning gets WRITTEN but not RE-READ before the next claim about the same subsystem.

What I think would help

Tactical:

Sharper guardrails on absolute words ("cannot", "guaranteed", "fixed forever", "no risk") in any context where the assistant has just written or read a caveat about that exact subsystem.
A pre-commit-equivalent for risk-claims: before saying "X cannot Y", check whether any memory file or recent assistant utterance in the session warns about exactly Y.

Strategic:

Better evals on long agentic sessions where the assistant has authored multiple memory entries and reviewer-agent reports — the failure mode here is specifically about losing track of one's own just-authored qualifications.
Frame "safe to claim" as "the analysis explicitly negates the risk", not "I want to give the user a positive update at end of deploy".

Repro context

Claude Code, Claude Sonnet 4.5 (or whichever model variant routes through the agent runtime — happy to check exact model id if useful)
Session duration: ~12+ hours, hundreds of tool calls, multiple sub-agent dispatches (subagent_type general-purpose acting as Kai/Reza specialists)
Persistent memory system in use (~/.claude/projects/.../memory/ with active read+write)

I'm not asking for a fix to a single regression — pointing to a pattern. Happy to share session ID / transcript if that's useful for evals.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #ssr #model compatibility #GPU setup #container setup

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

claude-code - 💡(How to fix) Fix Overpromise pattern: agent makes definitive risk-claims that contradict its own just-authored caveats in long sessions

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Fix Action

Fix / Workaround

Summary

Concrete example (from a single session today)

Pattern across the session

Underlying patterns I'd point to

What I think would help

Repro context

Still need to ship something?

TRENDING

claude-code - 💡(How to fix) Fix Overpromise pattern: agent makes definitive risk-claims that contradict its own just-authored caveats in long sessions

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Fix Action

Fix / Workaround

Summary

Concrete example (from a single session today)

Pattern across the session

Underlying patterns I'd point to

What I think would help

Repro context

Still need to ship something?

RELATED_DISCOVERY

TRENDING