Anthropic's Claude Code Postmortem: It Wasn't the Model, It Was Everything Else
If you’ve been on Hacker News, r/LocalLLaMA, or dev Twitter in April, you’ve seen the complaints: Claude Code isn’t what it used to be. Responses felt inconsistent. Long-context recall degraded. Simple edit instructions ballooned into unrequested refactors across unrelated files. Anthropic has now published a postmortem — and the story is stranger than the conspiracy theories.
The headline: the “they secretly quantized the model” crowd was wrong. The “they’re routing us to a cheaper model” crowd was also wrong. The real answer is messier, and it’s a useful case study in just how fragile modern LLM serving infrastructure actually is.
What users were actually seeing
From early August through early April, the symptom pattern was remarkably consistent across reports. Identical prompts producing wildly variable quality. Models forgetting the start of long contexts. Instructions like “fix this function” triggering edits in three other files nobody asked about.
Anthropic kept insisting the weights hadn’t changed. That was technically true, and completely unsatisfying — because the lived experience of paying customers was unambiguous. On HN threads that hit the front page multiple times, the top comments all said variations of the same thing: “I’m not imagining this.”
They weren’t.
Three bugs, three root causes, zero overlap
The postmortem’s most uncomfortable finding: this wasn’t one incident. It was three independent bugs, deployed at different times, triggering only on specific hardware configurations and request patterns. Each one alone would have been a bad week. Stacked, they produced months of diffuse, hard-to-reproduce degradation.
Bug 1: Context window routing. A misconfigured routing layer was sending a subset of requests to server pools provisioned for shorter contexts. Users working on large codebases were getting their prompts silently truncated on the serving side. That “it forgot what I said 50 messages ago” feeling? Real. The model literally never saw it.
Bug 2: Numerical drift between hardware backends. Anthropic runs inference across a mix of TPUs and GPUs. Certain operations handle floating-point differently across platforms, and on specific code paths, the same weights produced subtly different outputs. No single response looks “broken,” but aggregate behavior — tone, reasoning chains, edge-case handling — drifts. This is the one that fueled the “did you change the model?” threads. They didn’t. The math just rounds differently on different silicon.
Bug 3: Tokenizer edge case. A specific class of Unicode sequences was being tokenized incorrectly, producing weird splits on code containing emoji, certain symbols, or mixed-script identifiers. If you ever wondered why Claude Code got oddly worse at JavaScript files with emoji in string literals, here it is.
The evaluation system didn’t catch any of it
The most damning part of the postmortem isn’t the bugs — it’s the admission that internal evals missed all three. Anthropic’s benchmark suites mostly ran at standard context lengths, on a single hardware pool, with sanitized input. Real users hit every corner of the distribution Anthropic wasn’t measuring.
The result: aggregate benchmark scores stayed flat while specific user cohorts experienced severe regressions. The dashboards looked green. The subreddits looked red. Both were telling the truth about different slices of reality.
This is the failure mode every serving platform should be studying. If your eval harness doesn’t match production’s hardware mix, routing behavior, and input diversity, your evals are measuring a system that doesn’t exist.
The lesson the whole industry should be writing down
Two things stand out. First, LLM product quality is an infrastructure problem, not just a model problem. Weights are the smallest part of the surface area. Routing, tokenization, hardware heterogeneity, and quantization paths all silently shape what users actually feel. A model vendor saying “the weights haven’t changed” is, at this point, an almost useless statement.
Second, users noticed before the telemetry did. The collective developer intuition on HN and Reddit was directionally correct for weeks while Anthropic’s own monitoring registered nothing wrong. That’s a governance question as much as a technical one: how seriously should vendors treat aggregated qualitative user signal when it contradicts quantitative internal evals?
If you’ve had a “wait, did this model get worse?” moment recently with any coding assistant, the honest answer is: maybe, and maybe nobody on the vendor side knows yet. That’s the real takeaway from this postmortem — not that Anthropic messed up, but that everyone shipping LLMs is one routing config away from the same story.
Comments
Loading comments...