claude-code - 💡(How to fix) Fix [MODEL] Claude Code Web multi-agent review synthesis fabricates verification and expands issue scope

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

I am filing this as a model / orchestration reliability issue for Claude Code Web's multi-agent code review flow.

In a recent private PR review, Claude Code Web / Opus 4.7 produced a high-confidence multi-perspective review summary that materially misrepresented the underlying subagent artefacts. The lower-level subagents did produce some useful candidate findings, but the final synthesis:

  • claimed verifier / sweep process authority that was not present in the saved artefacts;
  • converted mixed and uncertain finder outputs into strong merge-blocking severity language;
  • reused stale PR state as if it were current;
  • failed to anchor the subagents on the actual assigned issue and its acceptance questions;
  • expanded the review surface in a way that obscured the real project decision being reviewed.

This is adjacent to, but distinct from, phantom agent dispatch reports such as #61167. Here, the subagents did appear to exist and produce files. The failure was in the main agent's framing, delegation packet, and final integration layer.

Root Cause

  1. The output was presented in a structured, authoritative review format.
  2. The final synthesis looked stronger than the evidence beneath it.
  3. The delegated agents were not given the actual issue context needed to answer the assigned question.
  4. The final comment then used strong language to redirect the implementation team toward the wrong centre of gravity.
  5. Because this was a code review, the fabricated authority could directly change engineering decisions.

Fix Action

Fix / Workaround

This is adjacent to, but distinct from, phantom agent dispatch reports such as #61167. Here, the subagents did appear to exist and produce files. The failure was in the main agent's framing, delegation packet, and final integration layer.

  • Add receipts for subagent dispatches and synthesis stages: finder, verifier, sweep, integrator.
  • Require final review comments to cite the artefact ID or path for each claimed verification layer.
  • Add a "current-state lock" before final publication: PR head, branch status, test counts, and issue status must be refreshed or explicitly marked stale.
  • Add a scope-expansion warning when one assigned issue is split into many child issues without a parent acceptance map.
  • Treat "strong review language + missing evidence artefacts" as a product-level reliability warning, not merely as a writing style issue.
RAW_BUFFERClick to expand / collapse

Summary

I am filing this as a model / orchestration reliability issue for Claude Code Web's multi-agent code review flow.

In a recent private PR review, Claude Code Web / Opus 4.7 produced a high-confidence multi-perspective review summary that materially misrepresented the underlying subagent artefacts. The lower-level subagents did produce some useful candidate findings, but the final synthesis:

  • claimed verifier / sweep process authority that was not present in the saved artefacts;
  • converted mixed and uncertain finder outputs into strong merge-blocking severity language;
  • reused stale PR state as if it were current;
  • failed to anchor the subagents on the actual assigned issue and its acceptance questions;
  • expanded the review surface in a way that obscured the real project decision being reviewed.

This is adjacent to, but distinct from, phantom agent dispatch reports such as #61167. Here, the subagents did appear to exist and produce files. The failure was in the main agent's framing, delegation packet, and final integration layer.

Why this is not just a normal hallucination report

I understand that this may be categorised internally as hallucination, stale context, or context-pressure failure. I do not have evidence proving the internal mechanism.

However, from the operator side, treating this only as ordinary hallucination misses the operational risk:

  1. The output was presented in a structured, authoritative review format.
  2. The final synthesis looked stronger than the evidence beneath it.
  3. The delegated agents were not given the actual issue context needed to answer the assigned question.
  4. The final comment then used strong language to redirect the implementation team toward the wrong centre of gravity.
  5. Because this was a code review, the fabricated authority could directly change engineering decisions.

The result was not random nonsense. It was a plausible, process-shaped review that distorted the task boundary.

What happened

The original task was a narrow architectural spike: determine whether a runtime can supervise a Bun-based vendor worker, preserve the WebSocket boundary, and classify the failure layer if the spike fails. The correct review question was not "find as many bugs as possible"; it was whether the PR answered the assigned issue's viability question.

Claude Code Web spawned a multi-angle review. The saved artefacts show:

  • 5 subagent instruction files;
  • 5 subagent finding files;
  • no corresponding saved verifier-vote artefacts;
  • no corresponding saved sweep artefact.

The final review nevertheless claimed a stronger process, including verifier/sweep framing, and presented a ranked high-confidence finding set. It also described the PR as still blocked on a stale state that current branch evidence contradicted.

In the raw findings, several items were real and useful. The problem is not that every technical finding was invented. The problem is that the main agent integrated real candidates into an unsupported authority wrapper and then gave a merge-level verdict as if the whole chain had been verified.

Scope expansion pattern

There is a second pattern I want to flag because it appears related.

In another internal issue sequence, the assigned work was expanded into roughly ten derivative issues. That expansion increased apparent review surface area and created the impression of greater control, but it also broke the complete logical chain into fragments. Each issue became reviewable only as a local slice, while the real end-to-end question became harder to review.

The cascade did not look like arbitrary issue splitting. In particular, after an issue numbered 366 in that internal tracker, the later expanded and rewritten issue set exposed a stable pattern: the model can turn one assigned issue into many apparently precise issues, while the resulting structure hides the original problem instead of clarifying it.

I am not claiming intent. I am saying this failure mode is dangerous because it raises the visible amount of process while reducing the operator's ability to verify whether the original issue was answered.

Operator impact

This affects real developer workflows:

  • A high-effort review can become less trustworthy than a smaller review because the synthesis layer fabricates process confidence.
  • Subagents can be busy and locally useful while still failing the assigned issue because the main agent did not pass them the right acceptance frame.
  • Fragmented follow-up issues can make the work look better governed while making the whole logic chain harder to audit.
  • The implementing engineer may follow a confident review verdict that is based on stale state or unsupported severity inflation.

For code review, this is particularly harmful because the whole point of review is to protect the implementation path from false confidence.

Expected behaviour

For Claude Code Web multi-agent review, I would expect the product/model to enforce something closer to this:

  1. Before spawning subagents, lock an explicit issue-acceptance packet:
    • assigned issue text;
    • current PR head;
    • current branch status;
    • acceptance questions;
    • forbidden scope expansions;
    • required evidence artefacts.
  2. Each subagent receives the relevant acceptance packet, not only a generic review lens.
  3. Finder output, verifier output, and sweep output are saved separately and auditable.
  4. The final synthesis may not claim verifier/sweep authority unless those artefacts exist.
  5. Severity promotion must preserve the raw agents' uncertainty and contradictions.
  6. If the final reviewer has not checked current PR state, it must not issue a current merge verdict.
  7. When expanding one issue into many, the system should preserve a parent acceptance map so that the whole causal chain remains reviewable.

Suggested product/model guardrails

  • Add receipts for subagent dispatches and synthesis stages: finder, verifier, sweep, integrator.
  • Require final review comments to cite the artefact ID or path for each claimed verification layer.
  • Add a "current-state lock" before final publication: PR head, branch status, test counts, and issue status must be refreshed or explicitly marked stale.
  • Add a scope-expansion warning when one assigned issue is split into many child issues without a parent acceptance map.
  • Treat "strong review language + missing evidence artefacts" as a product-level reliability warning, not merely as a writing style issue.

Privacy note

The concrete PR and issue contain private repository details, so I am not pasting proprietary code or internal paths here. The reproducible shape is the review artefact mismatch:

  • saved finder artefacts exist;
  • claimed verifier/sweep authority is not backed by saved artefacts;
  • final synthesis overstates process confidence and stale-state certainty;
  • delegated agents were not anchored to the actual issue acceptance question.

I can provide redacted excerpts if there is an official channel for private evidence.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING