AI Engineer · June 3, 2026 · 16m

Benchmarking semantic code retrieval on Claude Code — Kuba Rogut, Turbopuffer

TL;DR

Cursor's production results are strong: Kuba cites Cursor's blog showing a 24% relative improvement in answer accuracy for Composer, plus a 2.6% increase in code retention and a 2.2% drop in dissatisfied requests on large codebases.
Embeddings act like cached compute: Instead of re-grepping the same codebase every session, semantic indexing pays an upfront chunk-embed-index cost so multiple agents can query stored meaning later with fewer tokens and less repeated work.
Turbo Grep raised precision sharply on Claude Code: In Kuba's 50-task ContextBench-style evaluation, file precision improved from a baseline of about 65% to 87% with windowed grep plus semantic search, meaning wasted file reads fell from roughly 1 in 3 to 1 in 8.
Semantic search did not automatically improve recall: Raw Claude Code still led on file recall because it aggressively explores many files, while semantic search and windowed grep ended up with similar recall despite better targeting.
The tool choice depends on the task shape: Semantic search won when files were behaviorally related but did not share obvious keywords, such as logic spread across multiple ORMs and libraries, while grep won when the task was basically tracing imports from an early keyword hit.
Inline comments make semantic retrieval better: Kuba says messy code is harder, and repositories with strong inline documentation performed noticeably better because the embedding model could infer the chunk's meaning more accurately.
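To make the "embeddings as cached compute" point concrete, here is a minimal sketch of the chunk-embed-index flow. Everything in it is an assumption for illustration: `toy_embed`, `chunk`, and `CodeIndex` are hypothetical names, not Turbopuffer's or Cursor's API, and the token-count vector only stands in for a real embedding model, which would capture meaning rather than word overlap. The part that carries over is the shape: the embedding cost is paid once at index time, and every later query reuses the stored vectors.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a normalised token-count vector."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def chunk(source, lines_per_chunk=4):
    """Split a file into fixed-size line windows (an assumed, simplistic chunker)."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + lines_per_chunk])
            for i in range(0, len(lines), lines_per_chunk)]


class CodeIndex:
    """Embeds every chunk once, then answers any number of queries from stored vectors."""

    def __init__(self):
        self.entries = []  # list of (path, chunk_text, vector)

    def add_file(self, path, source):
        # Upfront cost: chunk and embed at index time, not at query time.
        for piece in chunk(source):
            if piece.strip():
                self.entries.append((path, piece, toy_embed(piece)))

    def query(self, question, top_k=3):
        # Cosine similarity, since both vectors are already L2-normalised.
        q = toy_embed(question)
        scored = []
        for path, piece, vec in self.entries:
            score = sum(w * vec.get(tok, 0.0) for tok, w in q.items())
            scored.append((score, path, piece))
        scored.sort(key=lambda item: item[0], reverse=True)
        return [(path, round(score, 3)) for score, path, _ in scored[:top_k]]


index = CodeIndex()
index.add_file("billing.py", "def refund_payment(order):\n    return order.amount\n")
index.add_file("views.py", "def render_page(template):\n    return template.format()\n")

# Two separate "agents" query the same index; no file is re-read or re-embedded.
first = index.query("where is the refund payment logic", top_k=1)
second = index.query("template rendering", top_k=1)
print(first[0][0], second[0][0])
```

Because this stand-in matches on shared tokens, it behaves more like grep than like true semantic search. Swapping `toy_embed` for a real embedding model is what would let the index surface behaviorally related files that share no keywords.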

The Breakdown

Cursor saw a 24% relative accuracy lift and measurable user gains from semantic code search, but when Kuba Rogut bolted a similar approach onto Claude Code, the biggest win was precision, not recall. His benchmark shows semantic search is great at finding behaviorally related files that grep misses, while plain grep still wins on straightforward import tracing.
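The precision and recall numbers are easier to read with the arithmetic spelled out. The sketch below assumes the usual definitions, which the talk summary does not state explicitly: file precision is the share of files the agent read that were actually needed, and file recall is the share of needed files the agent read. The function names and the sample file lists are hypothetical, made up purely for illustration.

```python
import math


def file_precision_recall(files_read, gold_files):
    """Return (precision, recall) for one task, under the assumed definitions."""
    read, gold = set(files_read), set(gold_files)
    hits = read & gold
    precision = len(hits) / len(read) if read else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall


def wasted_read_ratio(precision):
    """Return N such that roughly 1 in N file reads was unnecessary."""
    if precision >= 1.0:
        return math.inf  # no wasted reads at all
    return 1.0 / (1.0 - precision)


# Made-up example: an agent that explores broadly versus one that targets its reads.
gold = ["orm/query.py", "orm/session.py", "lib/cache.py"]
broad = gold + ["README.md", "setup.py", "tests/test_misc.py"]
targeted = ["orm/query.py", "orm/session.py", "docs/index.md"]

print(file_precision_recall(broad, gold))     # high recall, lower precision
print(file_precision_recall(targeted, gold))  # higher precision, lower recall

# The reported precision figures, restated as wasted reads.
print(round(wasted_read_ratio(0.65)))  # about 1 in 3
print(round(wasted_read_ratio(0.87)))  # about 1 in 8
```

The made-up lists show the same trade-off the benchmark reports. Reading many files pushes recall up while dragging precision down, which is how raw Claude Code can lead on recall even though the semantic setup wastes far fewer reads.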