claude-code - 💡(How to fix) Fix Agent tool: add context isolation mode to prevent evaluator bias in multi-agent setups

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Root Cause

Multi-agent Claude Code setups are increasingly using agents-as-evaluators — for quality gates, deliberation, adversarial review, and synthesis. Without context isolation, these evaluators are architecturally biased toward the generator's reasoning, which defeats their core purpose.

The problem scales with harness complexity: the more reasoning the orchestrator accumulates, the stronger the evaluator's inherited bias.


Fix Action

Fix / Workaround

This would make evaluator sub-agents structurally independent — solving the self-evaluation bias at the platform level rather than requiring harness-level workarounds.

Code Example

Orchestrator generates Proposal A
Orchestrator spawns Agent("evaluate Proposal A")
Agent receives prompt + implicit parent context (including how Proposal A was reasoned)
Agent's "evaluation" skews toward validating Proposal A
Quality gate fails its purpose

---

Innovator (Agent)Devil-Advocate (Agent)Mediator

---

# Before: biased Mediator
[Orchestrator context = full session + Innovator + Devil]
Orchestrator performs synthesis ← same instance, sees all prior reasoning

# After: isolated Mediator
Agent(
  prompt="Synthesize the following without prior context",
  input=only(innovator_output, devil_output)  # explicit context boundary
)

---

# Illustrative API
Agent(
  prompt="...",
  context_isolation="strict",   # spawned agent sees ONLY what's in `prompt`
                                # no parent reasoning chain inherited
)
RAW_BUFFERClick to expand / collapse

Problem

When Claude Code's Agent tool spawns a sub-agent for evaluation roles (critic, mediator, quality-gate), the spawned agent implicitly inherits the parent's reasoning trajectory. Even with a neutral system prompt, the evaluator "sees" the generator's context and biases its judgment toward agreement.

This is the Cost of Consensus problem — empirically measured: a 32.3 percentage point performance drop when the same model evaluates outputs generated within a shared context (arXiv 2605.00914).

Concrete failure mode:

Orchestrator generates Proposal A
  → Orchestrator spawns Agent("evaluate Proposal A")
  → Agent receives prompt + implicit parent context (including how Proposal A was reasoned)
  → Agent's "evaluation" skews toward validating Proposal A
  → Quality gate fails its purpose

This affects any multi-agent pattern that uses agents-as-evaluators: deliberation, adversarial review, automated critique loops, and multi-persona synthesis.


Real-World Evidence (forge-harness)

We run forge-harness (github.com/chrono-code/forge-harness), a Claude Code multi-agent harness with 23 skills. One skill, deliberation, implements a 3-layer pipeline:

Innovator (Agent) → Devil-Advocate (Agent) → Mediator

Before redesign: Mediator step was performed by the orchestrating skill itself (same Claude instance, same accumulated context). Observed behavior: Mediator consistently favored whichever persona had appeared last — not genuine synthesis.

After redesign (based on Cost of Consensus findings): Mediator is now spawned as a separate isolated Agent call receiving only the Innovator and Devil outputs — no parent reasoning chain:

# Before: biased Mediator
[Orchestrator context = full session + Innovator + Devil]
Orchestrator performs synthesis ← same instance, sees all prior reasoning

# After: isolated Mediator
Agent(
  prompt="Synthesize the following without prior context",
  input=only(innovator_output, devil_output)  # explicit context boundary
)

Synthesis quality improved measurably after isolation. The evaluator no longer "knew" which side the orchestrator had been leaning toward.


Feature Request

Option A — Documentation (minimal)

Publish an official pattern for context-isolated evaluation in multi-agent setups. Clarify which parts of parent context are inherited when using the Agent tool, and how practitioners can minimize implicit context leakage.

Option B — Explicit isolation parameter (ideal)

Add a context isolation option to the Agent tool invocation so developers can declare a hard context boundary:

# Illustrative API
Agent(
  prompt="...",
  context_isolation="strict",   # spawned agent sees ONLY what's in `prompt`
                                # no parent reasoning chain inherited
)

This would make evaluator sub-agents structurally independent — solving the self-evaluation bias at the platform level rather than requiring harness-level workarounds.

Option C — Worktree guidance

If strict isolation already exists via worktrees or other mechanisms, document the recommended pattern for evaluation sub-agents specifically (vs. task execution sub-agents).


Why This Matters

Multi-agent Claude Code setups are increasingly using agents-as-evaluators — for quality gates, deliberation, adversarial review, and synthesis. Without context isolation, these evaluators are architecturally biased toward the generator's reasoning, which defeats their core purpose.

The problem scales with harness complexity: the more reasoning the orchestrator accumulates, the stronger the evaluator's inherited bias.


References

  • arXiv 2605.00914 — "Cost of Consensus: The Hidden Price of Multi-Agent Concordance" (32.3pp degradation measured)
  • forge-harness deliberation skill — real-world implementation and redesign: https://github.com/chrono-code/forge-harness
  • Anthropic internal experiment: $9 single-agent vs $200 multi-agent harness quality gap (cited in official materials)

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING