Why Your AI Coding Agent Spends Most of Its Tokens Just Looking Around

If you’ve spent any real time with Cursor or Claude Code, you’ve felt it: the agent is slow, and the bill is bigger than you expected. Look closely at where those tokens went, and a surprising answer emerges. Most of them weren’t spent writing code. They were spent finding it.

A new tool called Semble is making the rounds on Hacker News with a bold pitch: 98% fewer tokens than grep for the same code search work. The number is eye-catching. The reasoning behind it is more interesting.

Grep is free for humans. It is expensive for agents.

Grep has been the universal text search tool for about 50 years. Fast, free, unkillable. The instant you wire it into an LLM agent, the economics flip.

Picture an agent asking, “where is authenticate used?” Grep dutifully returns every matching line — possibly thousands of them in a real codebase. All of that text gets piped straight into the model’s context window. A single search can burn tens of thousands of tokens before the agent has written a single character of code.
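To make that cost concrete, here is a minimal sketch that mimics grep over an in-memory toy repo and estimates how many tokens the raw output would occupy. The repo contents, the helper names, and the four-characters-per-token rule of thumb are all assumptions for illustration; a real tokenizer would give different counts.

```python
import re

CHARS_PER_TOKEN = 4  # assumption: rough rule of thumb, not a real tokenizer


def grep_lines(files: dict[str, str], pattern: str) -> list[str]:
    """Return grep-style 'path:lineno:text' rows for every matching line."""
    regex = re.compile(pattern)
    rows = []
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                rows.append(f"{path}:{lineno}:{line}")
    return rows


def estimate_tokens(text: str) -> int:
    """Crude estimate: character count divided by CHARS_PER_TOKEN, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


# Hypothetical toy repo: 500 files that each mention authenticate twice.
repo = {
    f"src/module_{i}.py": (
        "user = authenticate(request)\n"
        "if not authenticate(token):\n"
        "    raise PermissionError\n"
    )
    for i in range(500)
}

rows = grep_lines(repo, r"authenticate")
raw_output = "\n".join(rows)
print(len(rows), "matching lines,", estimate_tokens(raw_output), "estimated tokens")
```

Every one of those rows would land in the context window, whether or not the agent needed it.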

And agents don’t grep once. They grep dozens of times per task. Skim a result. Open a related file. Grep again. The actual “thinking” phase ends up dwarfed by the “wandering around looking for context” phase.

Semble’s pitch: let the index do the understanding

Semble’s idea is almost embarrassingly simple. Stop dumping raw text into the model.

Instead, it pre-indexes the codebase semantically — functions, classes, call graphs, type signatures, all stored as structured metadata. When an agent searches, Semble returns something like “this function is defined here, called from these three call sites, with this signature” rather than a wall of matching lines.
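As a rough illustration of the idea (not Semble’s actual implementation or API, which the article does not describe), the sketch below uses Python’s standard `ast` module to build a tiny index of function definitions and call sites, then answers a lookup with a compact structured record. The function names `index_source` and `lookup` and the record layout are hypothetical.

```python
import ast


def index_source(path: str, source: str) -> dict:
    """Collect function definitions and direct call sites from one Python file."""
    tree = ast.parse(source)
    definitions, calls = {}, {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            params = [arg.arg for arg in node.args.args]
            definitions[node.name] = {"path": path, "line": node.lineno, "params": params}
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # Only plain-name calls such as authenticate(...) are tracked here.
            calls.setdefault(node.func.id, []).append({"path": path, "line": node.lineno})
    return {"definitions": definitions, "calls": calls}


def lookup(index: dict, name: str) -> dict:
    """Answer 'where is this defined and who calls it' as one small record."""
    return {
        "name": name,
        "defined_at": index["definitions"].get(name),
        "call_sites": index["calls"].get(name, []),
    }


SOURCE = '''def authenticate(user, password):
    return check(user, password)

def login(request):
    return authenticate(request.user, request.password)

def refresh(request):
    return authenticate(request.user, request.token)
'''

index = index_source("auth.py", SOURCE)
result = lookup(index, "authenticate")
print(result)
```

The answer is a handful of fields rather than every matching line, which is the shape change the pitch relies on. A production index would also need to handle methods, imports, and multiple languages.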

The project’s headline benchmark: a query that costs grep around 50,000 tokens costs Semble roughly 1,000. Roughly the same information, in a different shape. That’s where the 98% comes from.
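The arithmetic behind the headline is easy to check. The sketch below uses the two per-query figures quoted above; the 30 searches per task is an assumption standing in for “dozens of times per task”, not a number from the benchmark.

```python
def token_reduction(baseline_tokens: int, reduced_tokens: int) -> float:
    """Fraction of tokens saved relative to the baseline."""
    return 1 - reduced_tokens / baseline_tokens


GREP_TOKENS = 50_000    # per-query figure quoted in the article
SEMBLE_TOKENS = 1_000   # per-query figure quoted in the article
SEARCHES_PER_TASK = 30  # assumption: stands in for "dozens of times per task"

print(f"per-query reduction: {token_reduction(GREP_TOKENS, SEMBLE_TOKENS):.0%}")
print("grep tokens per task:", GREP_TOKENS * SEARCHES_PER_TASK)
print("index tokens per task:", SEMBLE_TOKENS * SEARCHES_PER_TASK)
```

Under that assumption, a single task reads about 1.5 million tokens of search output with grep versus about 30,000 with the index.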

The real bottleneck was never the model

This reframes a debate the AI coding world has been having for the wrong reasons. For two years, the conversation has been about model quality and response speed. Smarter Claude. Faster GPT. Bigger context windows.

But context windows hit 1M tokens and the problem didn’t go away. The bottleneck turned out to be what you put inside the window, not how big it is. Stuff 10,000 lines of grep output in there, and the one snippet that actually mattered drowns.

And tokens are money. Pull up the API bill for an agent-heavy workload and you’ll likely notice the same thing: input tokens dwarf output tokens. Your agent isn’t expensive because it thinks too hard. It’s expensive because it reads too much.

The “just use grep” pushback

Not everyone is sold. The skeptics on HN and Reddit have three reasonable objections.

Index maintenance. Semble has to pre-analyze the codebase and keep that index fresh. In a fast-moving repo, that’s non-trivial overhead. Grep needs nothing.

Lossy summaries. Pre-digested results are convenient, but they can hide the exact detail the agent needed. Raw grep output is ugly, but it’s the ground truth.

Lock-in. Bet your workflow on a specific search tool and you’re stuck if it disappears, gets acquired, or jacks up its pricing. Grep is on every machine and will outlive us all.
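The index-maintenance objection is easier to weigh with a picture of what “keeping the index fresh” involves. One common approach, sketched here as an assumption rather than a description of how Semble works, is to store a content hash per file and re-index only the files whose hash has changed. The function names and the sample files are hypothetical.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Content hash used to detect whether a file changed since it was indexed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def stale_paths(indexed: dict[str, str], current_files: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored fingerprints with current contents and report what needs work."""
    changed = sorted(
        path for path, text in current_files.items()
        if path in indexed and indexed[path] != fingerprint(text)
    )
    added = sorted(path for path in current_files if path not in indexed)
    removed = sorted(path for path in indexed if path not in current_files)
    return {"changed": changed, "added": added, "removed": removed}


# Hypothetical snapshot taken at index time.
old_files = {"auth.py": "def authenticate(): ...", "db.py": "def connect(): ..."}
indexed = {path: fingerprint(text) for path, text in old_files.items()}

# Hypothetical working tree after a few edits.
new_files = {"auth.py": "def authenticate(user): ...", "api.py": "def handler(): ..."}

print(stale_paths(indexed, new_files))
```

Even this simple version has to read and hash every file on each check, which is the overhead grep never pays.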

The real story is a new category of infrastructure

Zoom out and Semble isn’t really a product story. It’s a market story. A whole layer of agent-supporting infrastructure is opening up.

For the last few years, AI coding competition has been “whose model is smartest.” Now that the top models are roughly comparable, the battleground is shifting underneath. The new question: who can deliver better context to that model for less money?

It’s not just code search. Memory management, tool-call routing, context compression — startups are crowding into all of these. If the AI agent is the worker, this new crop of companies is selling the clean workbench, the labeled drawers, and the sharpened tools. Boring infrastructure. Big margins.

The takeaway

Don’t get hung up on the 98% figure. The signal is that we’ve been thinking about AI coding costs wrong. Before reaching for a more expensive model, it might be time to ask whether your agent needs a smarter index.

Honest question for the road: when you watch your AI assistant work, does it spend more time hunting for context or actually producing code? Your answer probably decides which tools matter for you over the next year.
