claude-code - 💡(How to fix) Fix [Bug] Malformed tool calls in long sessions due to in-context few-shot poisoning

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

In a long Claude Code session, tool calls (Bash) began emitting malformed XML — a bare <invoke name="Bash">...</invoke> with no enclosing function_calls wrapper. The harness reported "malformed and could not be parsed." Critically, once one malformed tool call entered the conversation history, every subsequent tool call reproduced the same broken format, and the error could not be recovered within the session. Only /clear (a fresh session with no poisoned history) restored correct behavior.

Environment

  • Model: Claude Opus 4.7 (1M context), max effort
  • Surface: Claude Code CLI
  • Platform: macOS (darwin 25.5.0), zsh
  • Session shape: long-running — many sequential Bash calls, one codex consult, several large files read into context, multiple skills loaded at startup (superpowers: using-superpowers / test-driven-development / systematic-debugging) plus a gstack /codex skill invoked mid-session.

Symptom

Expected: a tool call wrapped in the normal function_calls structure.

Actual: the model emitted bare text like:

  <invoke name="Bash">
  <parameter name="command">...</parameter>
  </invoke>

with no enclosing wrapper, so the parser rejected it as malformed. The command never ran; the user saw the tool call "break."

Key characteristics (the important part)

  1. Self-perpetuating / not self-recoverable. After the first malformed tool call landed in history, later same-type calls copied the broken format. When the model "retried," it was anchoring on the broken example from the prior turn, so retries failed deterministically — 4 consecutive retries all failed identically.

  2. State-dependent, not random. Failures recurred stably in specific contexts (consecutive Bash calls, retry-after-failure) while other calls (a turn's first call, calls following a block of prose) succeeded. Random attention slips would not produce this structure.

  3. Model self-correction did not help. "Be more careful" and "send one call at a time" did not fix it, consistent with an in-context anchoring cause rather than an attention cause.

  4. /clear was the only fix. A fresh session with no poisoned history was required. For in-progress long work this is a costly interruption.

Root-cause hypothesis (model's own analysis, via elimination)

  • Ruled out skill-file contamination: grep of the loaded /codex SKILL.md (1405 lines) found zero literal invoke / parameter / function_calls strings.
  • Ruled out CLAUDE.md as a format influence.
  • Remaining hypothesis consistent with the evidence: few-shot self-poisoning — once a malformed tool call is in context, autoregressive generation keeps reproducing it, which is why retrying is the worst action (it copies the broken template).

Possible trigger context (for investigation — correlation, not confirmed cause)

  • Failures clustered after loading a very long skill file (gstack /codex SKILL.md, 1405 lines) dense with XML-ish markup: <result> blocks, bash heredocs, structured AskUserQuestion examples.
  • That file contains no literal tool-call syntax, but the high density of "markup-looking" text may have diluted the model's control over its own tool-call format. The initial trigger is not confirmed; the self-perpetuation mechanism after the first failure is clear.

Impact

  • Long, stateful sessions (the exact case where context is most valuable on a 1M-context model) are the ones that hit this and cannot recover in place.
  • Forces /clear and a handoff/rehydration, losing the warm session.

Suggested investigation directions

  1. Harness-level tolerant parsing / auto-repair. When a bare <invoke> without the enclosing wrapper is detected, repair or re-prompt instead of emitting "malformed" and letting the broken text persist in history. This is the most direct way to break the self-perpetuation chain.
  2. Sanitize malformed tool calls in history. If a rejected tool-call string were marked/scrubbed rather than left verbatim in context, the few-shot
  3. Test long markup-dense skills as a format-poisoning source. Measure whether tool-call malformation rate rises after loading large skill files heavy with <result>/heredoc/structured-example text.

What I cannot provide

I don't have a guaranteed minimal repro — the initial trigger was not isolated. But the failure→retry→re-failure loop reproduced every time within the affected session, and /clear reliably resolved it.

Environment Info

  • Platfor… Note: Content was truncated.

Error Message

subsequent tool call reproduced the same broken format**, and the error could

Root Cause

In a long Claude Code session, tool calls (Bash) began emitting malformed XML — a bare <invoke name="Bash">...</invoke> with no enclosing function_calls wrapper. The harness reported "malformed and could not be parsed." Critically, once one malformed tool call entered the conversation history, every subsequent tool call reproduced the same broken format, and the error could not be recovered within the session. Only /clear (a fresh session with no poisoned history) restored correct behavior.

Environment

  • Model: Claude Opus 4.7 (1M context), max effort
  • Surface: Claude Code CLI
  • Platform: macOS (darwin 25.5.0), zsh
  • Session shape: long-running — many sequential Bash calls, one codex consult, several large files read into context, multiple skills loaded at startup (superpowers: using-superpowers / test-driven-development / systematic-debugging) plus a gstack /codex skill invoked mid-session.

Symptom

Expected: a tool call wrapped in the normal function_calls structure.

Actual: the model emitted bare text like:

  <invoke name="Bash">
  <parameter name="command">...</parameter>
  </invoke>

with no enclosing wrapper, so the parser rejected it as malformed. The command never ran; the user saw the tool call "break."

Key characteristics (the important part)

  1. Self-perpetuating / not self-recoverable. After the first malformed tool call landed in history, later same-type calls copied the broken format. When the model "retried," it was anchoring on the broken example from the prior turn, so retries failed deterministically — 4 consecutive retries all failed identically.

  2. State-dependent, not random. Failures recurred stably in specific contexts (consecutive Bash calls, retry-after-failure) while other calls (a turn's first call, calls following a block of prose) succeeded. Random attention slips would not produce this structure.

  3. Model self-correction did not help. "Be more careful" and "send one call at a time" did not fix it, consistent with an in-context anchoring cause rather than an attention cause.

  4. /clear was the only fix. A fresh session with no poisoned history was required. For in-progress long work this is a costly interruption.

Root-cause hypothesis (model's own analysis, via elimination)

  • Ruled out skill-file contamination: grep of the loaded /codex SKILL.md (1405 lines) found zero literal invoke / parameter / function_calls strings.
  • Ruled out CLAUDE.md as a format influence.
  • Remaining hypothesis consistent with the evidence: few-shot self-poisoning — once a malformed tool call is in context, autoregressive generation keeps reproducing it, which is why retrying is the worst action (it copies the broken template).

Possible trigger context (for investigation — correlation, not confirmed cause)

  • Failures clustered after loading a very long skill file (gstack /codex SKILL.md, 1405 lines) dense with XML-ish markup: <result> blocks, bash heredocs, structured AskUserQuestion examples.
  • That file contains no literal tool-call syntax, but the high density of "markup-looking" text may have diluted the model's control over its own tool-call format. The initial trigger is not confirmed; the self-perpetuation mechanism after the first failure is clear.

Impact

  • Long, stateful sessions (the exact case where context is most valuable on a 1M-context model) are the ones that hit this and cannot recover in place.
  • Forces /clear and a handoff/rehydration, losing the warm session.

Suggested investigation directions

  1. Harness-level tolerant parsing / auto-repair. When a bare <invoke> without the enclosing wrapper is detected, repair or re-prompt instead of emitting "malformed" and letting the broken text persist in history. This is the most direct way to break the self-perpetuation chain.
  2. Sanitize malformed tool calls in history. If a rejected tool-call string were marked/scrubbed rather than left verbatim in context, the few-shot
  3. Test long markup-dense skills as a format-poisoning source. Measure whether tool-call malformation rate rises after loading large skill files heavy with <result>/heredoc/structured-example text.

What I cannot provide

I don't have a guaranteed minimal repro — the initial trigger was not isolated. But the failure→retry→re-failure loop reproduced every time within the affected session, and /clear reliably resolved it.

Environment Info

  • Platfor… Note: Content was truncated.
RAW_BUFFERClick to expand / collapse

Bug Description

Summary

In a long Claude Code session, tool calls (Bash) began emitting malformed XML — a bare <invoke name="Bash">...</invoke> with no enclosing function_calls wrapper. The harness reported "malformed and could not be parsed." Critically, once one malformed tool call entered the conversation history, every subsequent tool call reproduced the same broken format, and the error could not be recovered within the session. Only /clear (a fresh session with no poisoned history) restored correct behavior.

Environment

  • Model: Claude Opus 4.7 (1M context), max effort
  • Surface: Claude Code CLI
  • Platform: macOS (darwin 25.5.0), zsh
  • Session shape: long-running — many sequential Bash calls, one codex consult, several large files read into context, multiple skills loaded at startup (superpowers: using-superpowers / test-driven-development / systematic-debugging) plus a gstack /codex skill invoked mid-session.

Symptom

Expected: a tool call wrapped in the normal function_calls structure.

Actual: the model emitted bare text like:

  <invoke name="Bash">
  <parameter name="command">...</parameter>
  </invoke>

with no enclosing wrapper, so the parser rejected it as malformed. The command never ran; the user saw the tool call "break."

Key characteristics (the important part)

  1. Self-perpetuating / not self-recoverable. After the first malformed tool call landed in history, later same-type calls copied the broken format. When the model "retried," it was anchoring on the broken example from the prior turn, so retries failed deterministically — 4 consecutive retries all failed identically.

  2. State-dependent, not random. Failures recurred stably in specific contexts (consecutive Bash calls, retry-after-failure) while other calls (a turn's first call, calls following a block of prose) succeeded. Random attention slips would not produce this structure.

  3. Model self-correction did not help. "Be more careful" and "send one call at a time" did not fix it, consistent with an in-context anchoring cause rather than an attention cause.

  4. /clear was the only fix. A fresh session with no poisoned history was required. For in-progress long work this is a costly interruption.

Root-cause hypothesis (model's own analysis, via elimination)

  • Ruled out skill-file contamination: grep of the loaded /codex SKILL.md (1405 lines) found zero literal invoke / parameter / function_calls strings.
  • Ruled out CLAUDE.md as a format influence.
  • Remaining hypothesis consistent with the evidence: few-shot self-poisoning — once a malformed tool call is in context, autoregressive generation keeps reproducing it, which is why retrying is the worst action (it copies the broken template).

Possible trigger context (for investigation — correlation, not confirmed cause)

  • Failures clustered after loading a very long skill file (gstack /codex SKILL.md, 1405 lines) dense with XML-ish markup: <result> blocks, bash heredocs, structured AskUserQuestion examples.
  • That file contains no literal tool-call syntax, but the high density of "markup-looking" text may have diluted the model's control over its own tool-call format. The initial trigger is not confirmed; the self-perpetuation mechanism after the first failure is clear.

Impact

  • Long, stateful sessions (the exact case where context is most valuable on a 1M-context model) are the ones that hit this and cannot recover in place.
  • Forces /clear and a handoff/rehydration, losing the warm session.

Suggested investigation directions

  1. Harness-level tolerant parsing / auto-repair. When a bare <invoke> without the enclosing wrapper is detected, repair or re-prompt instead of emitting "malformed" and letting the broken text persist in history. This is the most direct way to break the self-perpetuation chain.
  2. Sanitize malformed tool calls in history. If a rejected tool-call string were marked/scrubbed rather than left verbatim in context, the few-shot
  3. Test long markup-dense skills as a format-poisoning source. Measure whether tool-call malformation rate rises after loading large skill files heavy with <result>/heredoc/structured-example text.

What I cannot provide

I don't have a guaranteed minimal repro — the initial trigger was not isolated. But the failure→retry→re-failure loop reproduced every time within the affected session, and /clear reliably resolved it.

Environment Info

  • Platfor… Note: Content was truncated.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING