claude-code - 💡(How to fix) Fix [FEATURE] Replace text-similarity dup-detection bot with model-based semantic triage [1 comments, 1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
anthropics/claude-code#55006Fetched 2026-05-01 05:48:42
View on GitHub
Comments
1
Participants
1
Timeline
3
Reactions
0
Author
Participants
Timeline (top)
closed ×1commented ×1labeled ×1

Root Cause

The bot's behavior was correct given a similarity heuristic. The issue is that the heuristic is wrong for the domain — it conflates "thematically related" with "same root cause."

Fix Action

Fix / Workaround

  1. Tune the similarity threshold higher. Doesn't fix the structural problem — moves it. Higher threshold → more missed dups; lower threshold → more false positives. There's no setting that gets you semantic comparison out of a similarity score.
  2. Have the bot just retrieve and post candidates without claiming "duplicate." Reduces the auto-closure threat but loses the dup-detection benefit entirely. The bot becomes "here are some related issues" — useful but redundant with GitHub's own related-issue UI.
  3. Manual maintainer triage with no bot. Doesn't scale.
  4. Filer-side mitigation only (👎 on the bot comment, written rebuttal). What I had to do for #54990. Works but shifts work to the filer and only works for filers who realize they need to push back — which selects against the new and the time-constrained.
RAW_BUFFERClick to expand / collapse

Preflight Checklist

  • I have searched existing requests and this feature hasn't been requested yet
  • This is a single feature request (not multiple features)

Problem Statement

The github-actions auto-triage bot that runs on new issues uses text-similarity for duplicate detection. This systematically false-positives on careful, well-cross-referenced regression reports — the exact reports the maintainer team most needs.

Concrete instance: when I filed #54990 (an Opus 4.7 / 2.1.121 regression report with cross-version cohort evidence), the bot flagged three "possible duplicates": #50507, #44317, #53988. None were duplicates. They were thematically adjacent reports in the same failure-pattern family ("insufficient grounding / false verification / fabrication from priors") with different specific claims, different evidence, and different time periods. My report cross-referenced #50507 verbatim and named the failure pattern by its common term, which is exactly the input that text-similarity rewards as "matches existing issue."

The bot's behavior was correct given a similarity heuristic. The issue is that the heuristic is wrong for the domain — it conflates "thematically related" with "same root cause."

The structural problem this creates:

  1. Selection against effort. Vague "model felt off" reports are less likely to trip the dup-detector than careful regression reports that include cross-references, prior-issue context, and use the standard vocabulary for the failure pattern. The current heuristic actively penalizes the most useful filings.
  2. Auto-closure threat. The bot announces it will auto-close as a duplicate in 3 days unless the filer responds. A first-time filer who doesn't realize they need to push back will lose their report.
  3. Maintainer time tax. Even when filers do push back (with careful written rebuttals), a maintainer still has to read both bodies and the rebuttal to confirm. The bot has shifted work from itself to the maintainer + filer, while contributing little signal.
  4. Particularly ironic source. Anthropic's core product is an LLM well-suited to exactly this kind of semantic comparison. Using 1990s text-similarity for triage on the issue tracker for that product is on-its-face odd.

Proposed Solution

Replace the text-similarity dup-detection step with a model-based semantic triage. Concretely:

When a new issue is opened, the bot:

  1. Retrieves the top-N similarity-matched candidate issues (the existing step is fine as a retrieval stage)
  2. Calls Claude (Haiku-grade is plenty) with the new issue body + the candidate bodies, and a prompt like:

    "You are a triage assistant. The new issue is below. The N candidate issues that were retrieved as possibly similar are below it. For each candidate, decide whether the new issue is a duplicate (same root claim, different example) or adjacent (related failure pattern, distinct contribution). Output a one-paragraph reasoning per candidate, and a final verdict (duplicate of #N / not a duplicate, here's why)."

  3. Posts the model's reasoning as the bot comment, with the verdict and citations to the differentiating claims.
  4. Only marks the issue for auto-closure if the model is confident it's a duplicate, with reasoning visible. Otherwise, posts as informational ("here are related reports — maintainer review").

Cost estimate: at Haiku pricing (~$0.25 input / $1.25 output per 1M tokens as of late 2025), an average new-issue triage is probably 10-20K input tokens (new body + 3-5 candidate bodies) and 500-1000 output tokens. That's ~$0.005-$0.01 per issue. For a repo receiving ~50-100 new issues/day, that's $0.50-$1.00/day. Even at Sonnet pricing it's a few dollars/day for the entire repo.

The cost of one false-positive auto-closure that drives away a careful filer almost certainly exceeds a year of Haiku-grade triage at this volume.

Alternative Solutions

I considered:

  1. Tune the similarity threshold higher. Doesn't fix the structural problem — moves it. Higher threshold → more missed dups; lower threshold → more false positives. There's no setting that gets you semantic comparison out of a similarity score.
  2. Have the bot just retrieve and post candidates without claiming "duplicate." Reduces the auto-closure threat but loses the dup-detection benefit entirely. The bot becomes "here are some related issues" — useful but redundant with GitHub's own related-issue UI.
  3. Manual maintainer triage with no bot. Doesn't scale.
  4. Filer-side mitigation only (👎 on the bot comment, written rebuttal). What I had to do for #54990. Works but shifts work to the filer and only works for filers who realize they need to push back — which selects against the new and the time-constrained.

The model-based approach is the only one I see that addresses the structural perverse incentive while still doing real triage work.

Priority

Low - Nice to have

Feature Category

Other

Use Case Example

Worked example from 2026-04-30:

  1. I file #54990 — an Opus 4.7 regression report tied to CLI version 2.1.121, with cross-version cohort evidence (0/9 sessions on 2.1.111 had the pattern, 0/N on 2.1.119, 3/3 on 2.1.121), inlined memory files, and explicit cross-references to three related-but-distinct prior issues.
  2. github-actions[bot] flags three "possible duplicates": #50507 (Opus 4.7 false-verification report from a different user, single visual-design example, no version data), #44317 (plan-following failure on a frontend project, filed 3+ weeks earlier on a different CLI version), #53988 (identifier hallucination in a 30-session web-app migration; steady-state body-of-evidence with no time correlation). None are duplicates of #54990.
  3. I write a rebuttal comment walking through each flagged issue and the differentiator. 👎 the bot's auto-dup comment to prevent 3-day auto-closure.
  4. A maintainer eventually reads the issue + the rebuttal + (possibly) the three flagged candidates to confirm.

With the proposed feature, step 2 would have been:

"These three candidates are all in the same failure-pattern family as the new issue ('insufficient grounding / fabrication from priors'). However, the new issue's distinct contribution is a cross-version cohort comparison with a clean cutoff (CLI 2.1.111 → 2.1.119 → 2.1.121, with 0/9 → 0/N → 3/3 affected sessions across same user / same project / same model), and a same-passage relapse instance ('twice in one reflection-about-confidently-asserting-without-checking turn'). #50507, #44317, and #53988 each establish that the failure pattern exists; #54990 is making a regression claim tied to a specific release. Verdict: not a duplicate. These are adjacent reports."

That's a 30-second read for a maintainer instead of a several-minute reconciliation, the filer doesn't have to write a rebuttal, and the auto-closure threat doesn't apply because the bot itself reached the right conclusion.

Additional Context

Prefetch — anticipating Fin's flagging on this very issue:

Since this is a feature request about Fin's text-similarity dup-detection, Fin will almost certainly flag this issue itself based on the same heuristic the issue critiques. That's a small live demonstration of the problem. Pre-empting the most likely flags:

  • #50898 ("Duplicate detection bot is auto-closing valid issues at scale — 15,240 closures and counting", open since 2026-04-19) — closest prior art, not a duplicate. That issue's central proposal is remove auto-close behavior (advisory-only mode, or human-approval-required). This issue's central proposal is replace the heuristic with model-based semantic triage. Both could be acted on independently — auto-close behavior is a separate dial from the underlying detection mechanism. #50898's proposal item #2 ("if auto-close must stay, implement semantic similarity thresholds rather than keyword matching") brushes the same direction my proposal goes, but as a fallback, not the primary recommendation. The fact that #50898 has been open and unactioned for 11 days at scale of 15,240 closures is itself a strong supporting argument for this issue.
  • #35923 ("AI-driven issue triage is systematically destroying valid bug reports", closed) — adjacent prior critique focused on circular duplicate chains and the auto-close+auto-lock interaction. Cites its own prior attempts (#19267, #26434), suggesting this is at least the third or fourth time this complaint has been raised. Different specific failure mode than this issue (chain effects vs. heuristic class), but same underlying frustration.
  • #40842 ("Duplicate detection closes issues without verifying the original is resolved") — different specific bug (closes against unresolved originals).
  • #46666 ("Duplicate bot is silently suppressing unresolved claude-in-chrome issues") — claude-in-chrome-specific instance of the broader problem.
  • #45867 ("closeIssueAsDuplicate strips all triage labels") — tactical bug in the closure handler, not the same thing.

Why pre-listing these matters: if Fin flags exactly these issues, that's a small live demo of the predictability of the heuristic — I produced the same list it will produce, with no special tooling beyond the GitHub search API and a few minutes' reading. If Fin flags issues not on this list, that's useful diagnostic information about what its heuristic actually weights. Either way, it's data on the case. (And if a maintainer reads only this section: #50898 is the closest prior art, this issue is making a distinct proposal complementary to it, please don't merge them.)

Implementation notes:

  • The Anthropic API has prompt caching, which fits this use case naturally — the candidate-issue bodies don't change between triage calls within a short window, so the candidate-retrieval-and-formatting can be cached.
  • Output should be public (a comment, like the current bot's). Reasoning is more useful than verdict alone, both for filers and for downstream maintainers.
  • The bot should err conservative on the "it's a dup" side — false-flag-as-possibly-related (maintainer can ignore) is much cheaper than false-flag-as-duplicate (auto-closes a real report).
  • Consider a similar pass for labeling (model picks area:model / area:cost / platform:macos etc.) — same infrastructure, similar low cost.

Saved context for this filing: I had to push back on the bot's flagging on #54990, and it took a written rebuttal to keep the issue from being auto-closed. The structural argument and worked example were captured at the time. The fact that Anthropic uses text-similarity for this on the issue tracker of Claude Code is a small but real piece of the broader observation that AI-tool quality is dominated by interactional context rather than raw model capability — which is itself the kind of thing the maintainer team probably already knows, but is worth a concrete case on.

I held off filing this on 2026-04-30 because it wasn't a high-priority irritation. Reconsidered: the case is fresh, the worked example exists, and the structural argument doesn't get cheaper to make later. Filing.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING