Gemma 4's Multi-Token Trick: How Google Tripled Inference Speed
The frontier of the LLM race quietly shifted. It’s no longer about who has the biggest model — it’s about who serves tokens fastest and cheapest. Google’s Gemma 4 just turned heads with a built-in multi-token prediction (MTP) drafter that early testers say delivers roughly 3x faster inference. Here’s why that matters more than another benchmark bump.
Why LLMs Feel Slow: The One-Token-at-a-Time Problem
Large language models generate text the way a typewriter prints: one token at a time. To produce a five-token response, the model runs five full forward passes. Each pass streams every weight in the network out of GPU memory, even though the arithmetic for a single token takes only a small fraction of that time.
This is the classic memory-bound bottleneck. During decoding, modern GPUs aren’t compute-starved; they’re starved for memory bandwidth. Loading tens of billions of parameters from HBM for every single token is the tax that makes inference expensive. If you’ve ever wondered why your H100 cluster bills look the way they do, this is a large part of the answer.
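To put a rough number on that tax, here is a back-of-the-envelope sketch. It assumes every weight is read from memory exactly once per generated token and ignores KV-cache traffic, batching, and compute time. The model size, precision, and bandwidth figures are hypothetical round numbers, not measurements of Gemma 4 or any specific GPU.

```python
def min_decode_latency_ms(params_billion: float,
                          bytes_per_param: float,
                          bandwidth_gb_s: float) -> float:
    """Lower bound on per-token latency if all weights are read once per token."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    seconds = weight_bytes / (bandwidth_gb_s * 1e9)
    return seconds * 1e3


# Hypothetical setup: 30B parameters, 16-bit weights, 3000 GB/s memory bandwidth.
latency = min_decode_latency_ms(params_billion=30, bytes_per_param=2, bandwidth_gb_s=3000)
print(f"{latency:.1f} ms per token, so at most {1000 / latency:.0f} tokens/s per sequence")
```

Under those assumptions the ceiling is about 50 tokens per second for a single sequence, no matter how fast the arithmetic units are. Getting more than one token out of each weight read is the only way past it.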
Speculative Decoding: Guess First, Verify Once
The standard workaround is speculative decoding, popularized over the last two years across OpenAI, Anthropic, and the open-weight community. The setup is elegant. A small, fast “draft model” guesses the next 5–10 tokens. The big model then checks all of those guesses in a single forward pass. It keeps the longest run of guesses it agrees with and substitutes its own token at the first mismatch, so a good draft yields multiple tokens for the price of one pass and a bad draft still yields at least one.
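The catch: you have to train, ship, and host that second draft model. It eats VRAM. Worse, if the draft and target were trained on different distributions, the target rejects more of the draft’s guesses and the speedup shrinks; if they use different tokenizers, the draft’s tokens can’t be checked directly at all. It’s a clever hack that always felt like duct tape.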
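The draft-then-verify loop is easier to see in code. What follows is a minimal sketch using greedy acceptance and two toy "models" that are just Python functions over integer tokens. The names `speculative_step`, `target`, and `draft` are made up for illustration and do not correspond to any library's API. A real system would run the verification as one batched forward pass rather than a Python loop.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """One draft-then-verify round with greedy acceptance; returns the new tokens."""
    # 1. The cheap draft model proposes k tokens, one after another.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target checks each proposal (one batched pass in a real system).
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(tok)

    # 3. Every guess matched, so the target's pass also yields one bonus token.
    accepted.append(target(context + accepted))
    return accepted


def target(ctx: List[int]) -> int:
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft(ctx: List[int]) -> int:
    """Toy draft model: agrees with the target except right after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


print(speculative_step(target, draft, [0], k=4))  # all four guesses accepted plus a bonus
print(speculative_step(target, draft, [2], k=4))  # third guess rejected and corrected
```

Either way, the tokens that come out are exactly what the target would have produced on its own. A weak draft costs speed, not output quality.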
Gemma 4’s Move: Bake the Drafter Into the Model
Gemma 4 sidesteps the duct tape by absorbing the drafter into the main network. Instead of running a second model, it bolts multiple prediction heads onto Gemma 4 itself. A single forward pass spits out several next-token candidates simultaneously, and the same model verifies them on the next step.
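Here is a toy sketch of that self-drafting idea. It is an illustration of the general multi-head pattern, not Gemma 4's actual architecture: `forward` and `mtp_generate` are hypothetical names, the "model" is a counting rule over integers, and the extra heads are made deliberately imperfect so that a rejection shows up. In a real implementation, verifying the previous drafts and producing new ones would happen inside a single forward pass; the toy calls `forward` several times per pass only to keep the logic readable.

```python
from typing import List, Tuple


def forward(context: List[int], num_heads: int = 3) -> List[int]:
    """Toy model. Head 0 is the ordinary next-token head and is always right.
    Heads 1.. guess further ahead and are wrong whenever the true token is 0."""
    last = context[-1]
    preds = [(last + 1) % 10]
    for i in range(1, num_heads):
        true_tok = (last + 1 + i) % 10
        preds.append(9 if true_tok == 0 else true_tok)
    return preds


def mtp_generate(prompt: List[int], n_new: int,
                 num_heads: int = 3) -> Tuple[List[int], int]:
    """Generate n_new tokens; returns (tokens, number of passes used)."""
    out = list(prompt)
    drafts: List[int] = []  # unverified guesses carried over from the last pass
    passes = 0
    while len(out) < len(prompt) + n_new:
        passes += 1
        # Verify carried-over drafts left to right against head 0.
        accepted: List[int] = []
        for guess in drafts:
            if forward(out + accepted, num_heads)[0] != guess:
                break
            accepted.append(guess)
        out += accepted
        # The same pass yields one trusted token plus fresh drafts.
        heads = forward(out, num_heads)
        out.append(heads[0])
        drafts = heads[1:]
    return out[:len(prompt) + n_new], passes


tokens, passes = mtp_generate([0], n_new=9)
print(tokens, "in", passes, "passes instead of 9")
```

In this toy run, nine new tokens arrive in four passes. How close a real model gets to that depends entirely on how often its extra heads guess right.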
No tokenizer mismatch. No extra model to host. No alignment drift between drafter and target. The Compile Future YouTube channel published a hands-on test on May 6 plugging the MTP drafter into a Claude Code workflow and clocked roughly 3x speedups on real coding tasks. The video crossed 1,400 views within a day, with commenters specifically calling out the difference on local inference rigs.
The Bigger Picture: Parallel Decoding Is the New Default
MTP didn’t appear out of nowhere. A late-April rundown from the Byte Goose AI channel — “Parallel Decoding: New Standard for Fast LLM Inference” — traces the lineage clearly: Jacobi iteration, Lookahead decoding, Medusa, EAGLE, and now MTP. Meta added an MTP loss term during Llama training last year. DeepSeek shipped its own variant. Google adopting it natively in Gemma 4 effectively makes parallel decoding table stakes for any serious open-weight release going forward.
The line from researchers is blunt: sequential one-token decoding is a legacy artifact. Transformers already score every position of a sequence in parallel during training; only generation was stuck going one token at a time, and we just hadn’t figured out how to break free.
Why Developers Should Care
Faster inference isn’t just snappier responses. It’s a structural cost shift. The same GPU serves more concurrent users. Cost-per-token drops. And for agentic workloads, where a model gets called dozens of times in a single session, the savings add up fast. Shave 500 ms per call, and a 50-step agent loop runs 25 seconds faster.
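The arithmetic is simple enough to check in a few lines. The sketch below reproduces the agent-loop figure and adds a cost-per-token estimate. The GPU price and throughput numbers are hypothetical placeholders chosen to show the shape of the calculation, not quotes from any provider.

```python
def seconds_saved(steps: int, saved_ms_per_call: float) -> float:
    """Total wall-clock time saved across a sequential agent loop."""
    return steps * saved_ms_per_call / 1000.0


def cost_per_million_tokens(gpu_dollars_per_hour: float,
                            tokens_per_second: float) -> float:
    """Serving cost if one GPU sustains the given aggregate throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


print(seconds_saved(steps=50, saved_ms_per_call=500), "seconds saved")

# Hypothetical: a $2/hour GPU at 50 tokens/s, then the same GPU at 3x throughput.
baseline = cost_per_million_tokens(2.0, 50)
faster = cost_per_million_tokens(2.0, 150)
print(f"${baseline:.2f} -> ${faster:.2f} per million tokens")
```

With the hardware bill fixed, cost per token falls in direct proportion to throughput, which is why a 3x speedup reads as a pricing story as much as a latency one.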
Code generation is where MTP shines hardest. Code is repetitive, predictable, pattern-heavy — exactly the conditions under which speculative drafts hit their highest acceptance rates. It’s no coincidence the first viral demos are landing inside coding agents like Claude Code.
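Acceptance rate translates into speed in a fairly direct way. Under the simplifying assumption that each drafted token is accepted independently with the same probability, the expected number of tokens produced per verification pass is a geometric sum. The sketch below computes it; the acceptance rates shown are illustrative values, not measurements from Gemma 4 or any coding benchmark.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens per verify pass: 1 + a + a^2 + ... + a^draft_len.

    Assumes each drafted token is accepted independently with probability
    accept_rate, and that the verifier always contributes one token itself.
    """
    if accept_rate >= 1.0:
        return float(draft_len + 1)
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)


for rate in (0.6, 0.8, 0.9):
    tokens = expected_tokens_per_pass(rate, draft_len=4)
    print(f"acceptance {rate:.0%}: {tokens:.2f} tokens per pass")
```

The curve is steep at the high end: moving from 60% to 90% acceptance takes a four-token draft from roughly 2.3 to roughly 4.1 tokens per pass. Predictable text like boilerplate code sits at the favorable end of that range.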
The Closing Thought
The competitive axis of LLMs is rotating from “how smart” to “how cheap and fast at scale.” Given a choice between a slightly smarter model that’s slow and a same-quality model that’s 3x faster, anyone shipping real-time products or agents already knows the answer. Gemma 4 is the loudest signal yet that inference architecture, not parameter count, is the next moat.