claude-code - 💡(How to fix) Fix [FEATURE] Static-analysis RAG primitive: pre-prompt repo graph injection cuts first-turn tokens 40.9% on live A/B [1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
anthropics/claude-code#53224Fetched 2026-04-26 05:21:11
View on GitHub
Comments
0
Participants
1
Timeline
3
Reactions
0
Author
Participants
Timeline (top)
labeled ×3

Root Cause

  • Embeddings. Cold-start, cost, binary deps. On top-1 precision (the metric that matters because Claude only acts on the top result), well-tuned BM25 + path heuristics already get MRR 0.984 on synthetic. Embeddings as an optional reranker on top of the lexical layer is a reasonable v3, not v1.
  • MCP server. Adds a network hop and a separate process to manage. An in-process hook (or Rust-port primitive) is cheaper.
  • Tree-sitter symbol extraction. Higher recall but a native dependency. Regex extraction at ~95% recall + the ranker recovering via path-substring matching is good enough for v1.
RAW_BUFFERClick to expand / collapse

Problem

Claude Code's first-turn pattern on a "where is X" prompt is reliably Glob → Grep → Read → Read → Read before the agent does anything useful. On a 6-prompt fixture instrumented with claude --print --stream-json, the control arm averaged 51,000 tokens per prompt, of which roughly 35k was first-turn exploration. The model has no map of the codebase going into turn 1, so it grep-walks one.

This is a structural cost — every Claude Code user pays it on every fresh session, regardless of model or pricing tier.

Proposed solution

A static-analysis RAG primitive shipped inside Claude Code itself:

  1. claude context build — walks the source tree, extracts symbols (function/class/type defs), import edges, git-hot files (90-day change frequency). Writes .context-os/repo-graph.json (or equivalent). Runs in <1s on a 5k-file repo, <10s on 50k.

  2. UserPromptSubmit-time hook — parses the prompt, queries the graph (IDF-weighted symbol/path matches + import traversal + hot-file boost + test/hub-file penalties), prepends a compact ranked candidate list as <context-os:autocontext>. Hook latency ~50ms typical, p99 118ms at 10k files.

  3. Stale-graph rebuild — detect via git log --since=last-build and rebuild in the background. User's next prompt uses the fresh index.

No embeddings. No server. No model call. The whole primitive is ~400 lines of stdlib Python in the reference implementation.

Evidence (from a prototype I built and shipped MIT-licensed)

The prototype lives at https://github.com/sravan27/context-os and has been measured exhaustively. Every number is reproducible from one command, all reports CI-regenerated on every PR:

Live A/B on 36 real claude --print calls (6 prompts × 3 runs × 2 arms):

  • −40.9% aggregate tokens [bootstrap CI 32.7%, 48.9%]
  • 6/6 prompt-level wins
  • paired t-test p = 5.06e-07
  • Cohen's d = 1.84 (large effect)
  • wall-clock −35.3% (11.80s → 7.64s mean per prompt)

Cross-repo generalization (36 hand-labeled prompts × 3 unseen OSS repos):

  • axios/axios (JS, 214 files): MRR 0.382 vs best baseline 0.252 (+0.130)
  • BurntSushi/ripgrep (Rust, 100 files): MRR 0.503 vs 0.459 (+0.044)
  • psf/requests (Py, 36 files): MRR 0.750 vs bm25-symbols 0.875 (lexical-ceiling regime)
  • Weighted aggregate: auto_context 0.545 vs best lexical baseline 0.461. +18.2%.

Operational:

  • Hook p99 latency 118ms @ 10k files, 589ms @ 50k files (1.7× under 1s SLA)
  • 18/18 adversarial robustness cases pass (unicode, regex bombs, path injection)
  • 9 CI-enforced regression gates prevent quality drift on every PR
  • 8-signal leave-one-out ablation confirms no dead weight

Why upstream (vs. third-party plugin)

Discovery is the problem. Users who would benefit most (new to Claude Code) won't find a GitHub plugin. The savings only accrue to Anthropic's hosted users if the primitive is on by default.

Three integration paths, ordered by depth (full detail in docs/PROPOSAL.md of the prototype repo):

  • (A) Bundle the hook. Ship the reference implementation as claude init-hooks --context. Zero Anthropic-side work. Opt-in.
  • (B) First-class primitive. claude context build and claude context search as CLI verbs. ~1 engineer-week to port to Rust.
  • (C) Default-on with telemetry-driven rebuild. Anthropic-scale measurement loop catches regressions.

Anti-patterns to flag

  • Embeddings. Cold-start, cost, binary deps. On top-1 precision (the metric that matters because Claude only acts on the top result), well-tuned BM25 + path heuristics already get MRR 0.984 on synthetic. Embeddings as an optional reranker on top of the lexical layer is a reasonable v3, not v1.
  • MCP server. Adds a network hop and a separate process to manage. An in-process hook (or Rust-port primitive) is cheaper.
  • Tree-sitter symbol extraction. Higher recall but a native dependency. Regex extraction at ~95% recall + the ranker recovering via path-substring matching is good enough for v1.

Honest scope notes

  • The prototype loses on psf/requests in the cross-repo eval. Prompts there use exact class names (PreparedRequest, HTTPError) — the lexical-retrieval ceiling regime where bm25-symbols caps. The aggregate win is real, the per-repo win is not universal.
  • The live A/B is 36 calls on 6 prompts. Statistically significant (p<0.001) but Anthropic-scale telemetry would dwarf this.
  • Tested on Py/TS/Rust/Go. Other languages fall back to path-only ranking until a per-language symbol extractor is added.

What I'm offering

  • The reference implementation under MIT, no strings.
  • Methodology, evals, regression gates — all in the repo.
  • Whatever consultation is useful for the port (1 engineer-week estimate). I do not need credit, equity, or attribution.

What would help in return

  • A stable UserPromptSubmit hook payload schema with a version header. Today it shifts in minor releases and every community hook has to keep up.
  • A claude --token-report flag emitting per-turn usage as JSON. We hack it with --stream-json today; native support would unlock proper community benchmarking.

Pitch doc: https://github.com/sravan27/context-os/blob/main/docs/PITCH.md Reviewer walkthrough (20 min): https://github.com/sravan27/context-os/blob/main/docs/REVIEW-CHECKLIST.md Multi-repo eval report: https://github.com/sravan27/context-os/blob/main/python/evals/reports/multi-repo-eval.md Repo: https://github.com/sravan27/context-os

extent analysis

TL;DR

Implement a static-analysis RAG primitive in Claude Code to reduce the structural cost of first-turn exploration by providing a compact ranked candidate list as context.

Guidance

  • Integrate the proposed claude context build and claude context search features to reduce the number of tokens used in first-turn exploration.
  • Use the provided reference implementation as a starting point and consider porting it to Rust for better performance.
  • Implement a UserPromptSubmit hook to prepend the ranked candidate list to the user's prompt, reducing the need for grep-walking.
  • Consider adding a claude --token-report flag to emit per-turn usage as JSON for better community benchmarking.

Example

No code example is provided as the issue does not contain specific code snippets that can be used as a reference.

Notes

The proposed solution has been tested and shows promising results, but it may not work universally across all repositories and languages. Additional testing and evaluation may be necessary to ensure the solution works as expected.

Recommendation

Apply the proposed workaround by integrating the static-analysis RAG primitive into Claude Code, as it has shown significant reductions in tokens used during first-turn exploration and has a large effect size according to the provided evaluation results.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

claude-code - 💡(How to fix) Fix [FEATURE] Static-analysis RAG primitive: pre-prompt repo graph injection cuts first-turn tokens 40.9% on live A/B [1 participants]