Why 200 Tokens Per Second Still Feels Slow

Every AI model launch comes with a shiny number. 200 tokens per second. 500 tokens per second. Sounds fast. Then you actually use the thing and spend three seconds staring at a blinking cursor. That gap between the spec sheet and the experience isn’t a fluke — it’s baked into how the industry measures speed.

A token isn’t a unit you can trust

Start here: a token is not a word, not a character, not anything consistent. Each model uses its own tokenizer, and the same sentence can fragment differently depending on which one you’re asking.

“Hello” might be one token. A Korean greeting might be four. Non-English languages get punished hard: English averages roughly four characters per token, while Korean, Japanese, and Chinese text can cost more than one token per character with some tokenizers. The same model rated at 200 tokens/second might churn out a page of English in eight seconds and need thirty for the equivalent Korean output.
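
Here is a rough sketch of that arithmetic. Only the four-characters-per-token figure for English comes from the paragraph above; the page lengths and the Korean ratio are illustrative assumptions, not measurements from any real tokenizer.

```python
def seconds_to_generate(num_chars: int, chars_per_token: float,
                        tokens_per_second: float) -> float:
    """Estimate how long a model needs to emit text of a given length."""
    tokens = num_chars / chars_per_token
    return tokens / tokens_per_second


TPS = 200.0  # the advertised decode rate

# Assumed page of English: ~6,400 characters at ~4 characters per token.
english = seconds_to_generate(6400, chars_per_token=4.0, tokens_per_second=TPS)

# Assumed Korean equivalent: fewer characters (3,000), but a hypothetical
# ratio of 0.5 characters per token, i.e. about two tokens per character.
korean = seconds_to_generate(3000, chars_per_token=0.5, tokens_per_second=TPS)

print(f"English: {english:.1f} s, Korean: {korean:.1f} s")
```

Under those assumptions the same 200 tokens/second works out to 8 seconds for the English page and 30 for the Korean one. Swap in token counts from the tokenizer you actually use to get real figures.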

The number on the slide is the same. The user experience isn’t.

TTFT is what actually frustrates you

There’s a metric benchmark pages rarely lead with: Time To First Token, or TTFT. The delay between hitting enter and seeing the first character appear.

Think about when you feel a model is “slow.” It’s not usually the trickle of output — it’s that silent pause before anything happens. A model could be screaming along at 300 tokens per second once it gets going, but if TTFT is three seconds, you’ve already lost the user.
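
If you want to see the two numbers separately, one way is to time any token stream yourself. The sketch below assumes nothing about a specific vendor API: `measure_stream` takes any Python iterable of tokens, and `fake_stream` is a hypothetical stand-in for a real streaming response.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(tokens: Iterable[str]) -> Tuple[float, float, int]:
    """Return (ttft_seconds, decode_tokens_per_second, token_count)."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in tokens:
        if first is None:
            first = time.perf_counter()  # moment the first token arrived
        count += 1
    end = time.perf_counter()
    if first is None:
        raise ValueError("stream produced no tokens")
    ttft = first - start
    decode_time = end - first
    # Decode rate covers only the tokens that came after the first one.
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return ttft, rate, count


def fake_stream(ttft: float, n: int, gap: float) -> Iterator[str]:
    """Hypothetical stream: one long pause, then n evenly spaced tokens."""
    time.sleep(ttft)
    for i in range(n):
        if i:
            time.sleep(gap)
        yield f"tok{i}"


ttft, rate, count = measure_stream(fake_stream(ttft=0.2, n=20, gap=0.005))
print(f"TTFT {ttft:.2f} s, then about {rate:.0f} tokens/s over {count} tokens")
```

Reporting the pause and the decode rate as two numbers, rather than one blended average, is the point: a single figure hides exactly the part users notice.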

Long prompts make this worse. The model has to chew through your entire input before producing a single output token. In RAG setups, where prompts balloon with retrieved documents, TTFT often matters more than raw throughput. Yet vendors keep advertising average tokens per second (TPS) like it tells the whole story.
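
A back-of-the-envelope model makes the effect concrete. This is a deliberate simplification: it assumes prompt processing runs at a constant rate (`prefill_tps`, a hypothetical 5,000 tokens/second here) and ignores queueing, network time, and the fact that real prompt processing cost may not scale linearly with length.

```python
from typing import Tuple


def response_time(prompt_tokens: int, output_tokens: int,
                  prefill_tps: float, decode_tps: float) -> Tuple[float, float]:
    """Return (ttft_seconds, total_seconds) under a simple linear model."""
    ttft = prompt_tokens / prefill_tps          # time spent reading the prompt
    total = ttft + output_tokens / decode_tps   # plus time spent writing
    return ttft, total


# Same 200-token answer at the same 200 tokens/s decode rate.
# Prompt sizes and prefill rate are assumed values for illustration.
chat_ttft, chat_total = response_time(500, 200, prefill_tps=5000, decode_tps=200)
rag_ttft, rag_total = response_time(15000, 200, prefill_tps=5000, decode_tps=200)

print(f"short chat prompt: TTFT {chat_ttft:.1f} s, total {chat_total:.1f} s")
print(f"RAG prompt:        TTFT {rag_ttft:.1f} s, total {rag_total:.1f} s")
```

With these assumed numbers the advertised decode rate is identical in both cases, yet the RAG request waits 3 seconds before the first token while the short prompt waits 0.1.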

Batch throughput is a different game

The throughput numbers vendors love? Those are usually measured under batch conditions — dozens or hundreds of concurrent requests packed onto a GPU at once. Pile up the workload and the per-GPU token count looks spectacular.

Now picture you, alone, asking ChatGPT one question. Single-request throughput can be half of what’s advertised, or worse. The company’s “tokens per second per GPU” and your “how fast does this answer my question” are essentially unrelated metrics.

NVIDIA’s datacenter marketing leans hard on this. A claim that one H-series card pushes tens of thousands of tokens per second is technically true — under maxed-out batch conditions that bear no resemblance to a single user’s session.
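
To see how far apart the two numbers can sit, divide the aggregate figure by the number of requests sharing the card. The values below are hypothetical, and the even split is an assumption; real schedulers do not divide throughput perfectly evenly.

```python
def per_request_tps(aggregate_tps: float, concurrent_requests: int) -> float:
    """Tokens per second each request sees if throughput is shared evenly."""
    if concurrent_requests < 1:
        raise ValueError("need at least one request")
    return aggregate_tps / concurrent_requests


# Hypothetical headline: 20,000 tokens/s per GPU, measured with 256
# concurrent requests in the batch.
headline = 20_000.0
each = per_request_tps(headline, 256)

print(f"headline: {headline:,.0f} tokens/s per GPU")
print(f"one user's stream: about {each:.0f} tokens/s")
```

In this made-up case the slide says 20,000 and each user sees roughly 78. Both numbers are honest; they just answer different questions.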

Reasoning models burn tokens you never see

Then there’s the reasoning model wave — o-series, DeepSeek-R1, Claude with extended thinking, and everything chasing them. These models generate thousands of internal “thinking” tokens before producing the answer you actually read.

So two models can both hit 100 tokens per second, but one spends 80 of every 100 tokens talking to itself before emitting a single visible word. From your seat, perceived speed drops by a factor of five. The benchmark number didn’t move. The wait did.
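
The same point as arithmetic, in a minimal sketch. The token counts are made up for illustration, and the model assumes all hidden reasoning tokens are generated before any visible ones.

```python
from typing import Tuple


def visible_experience(decode_tps: float, hidden_tokens: int,
                       visible_tokens: int) -> Tuple[float, float, float]:
    """Return (wait_before_first_visible_s, total_s, effective_visible_tps)."""
    wait = hidden_tokens / decode_tps
    total = (hidden_tokens + visible_tokens) / decode_tps
    effective = visible_tokens / total
    return wait, total, effective


# Both models decode at 100 tokens/s and show the user a 200-token answer.
plain = visible_experience(100.0, hidden_tokens=0, visible_tokens=200)
reasoning = visible_experience(100.0, hidden_tokens=800, visible_tokens=200)

for name, (wait, total, eff) in [("plain", plain), ("reasoning", reasoning)]:
    print(f"{name}: {wait:.0f} s of silence, {total:.0f} s total, "
          f"{eff:.0f} visible tokens/s")
```

With 800 hidden tokens out of 1,000, the reasoning model sits silent for 8 seconds and delivers 20 visible tokens per second overall, against 100 for the plain model.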

The infrastructure tax

Last piece: the speed at which a GPU produces tokens and the speed at which those tokens hit your screen are not the same thing. Network latency, streaming implementation, and client-side rendering all sit between the model and you.

A service that buffers tokens server-side and ships them in chunks feels noticeably worse than one streaming token-by-token, even running the exact same model. Anyone who’s compared the same backend through different frontends has felt this. Fast model, slow plumbing, and the plumbing wins.
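
A small simulation shows why chunking hurts even when nothing else changes. It is a sketch under simple assumptions: tokens are generated at a steady rate, the server flushes whenever `chunk_size` tokens have accumulated (or the stream ends), and network and rendering delays are ignored entirely.

```python
from typing import List


def arrival_times(num_tokens: int, decode_tps: float,
                  chunk_size: int) -> List[float]:
    """Time at which each token becomes visible to the user, in seconds."""
    times = []
    for i in range(num_tokens):
        # Token i (0-based) is shipped once the last token of its chunk
        # has been generated, or when the stream ends.
        last_in_chunk = min((i // chunk_size + 1) * chunk_size, num_tokens)
        times.append(last_in_chunk / decode_tps)
    return times


# Same model, same 100 tokens at 100 tokens/s; only the flushing differs.
streamed = arrival_times(100, decode_tps=100.0, chunk_size=1)
chunked = arrival_times(100, decode_tps=100.0, chunk_size=50)

print(f"token-by-token: first text at {streamed[0]:.2f} s, done at {streamed[-1]:.2f} s")
print(f"50-token chunks: first text at {chunked[0]:.2f} s, done at {chunked[-1]:.2f} s")
```

Both versions finish at the same moment, but the chunked one shows nothing for half a second and then dumps text in two bursts. Total time is unchanged; the experience is not.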

Read the conditions, not the headline

Tokens per second is the top speed on a car’s spec sheet. The catalog says 250 km/h. Your morning commute averages 30. Useful for marketing, mostly useless for the question you actually care about.

Next time a launch event flashes a TPS number, ask the boring follow-ups. What language? What context length? Single request or batched? Does that count include reasoning tokens? The model that wins on the slide is rarely the one that feels fast in your hands.
