Anthropic Wants to Mathematically Eliminate Software Bugs. Here's Why That's Hard

A bug in your to-do app is annoying. A bug in flight control software kills people. That distinction is why formal verification exists — and why Anthropic just announced Project Glasswing, an effort to use AI to mathematically prove that critical software is free of defects. It’s an audacious goal. It might also be the most important AI safety project no one is talking about.

What Formal Verification Actually Is

Most software testing works by example. You feed inputs in, check outputs, and hope you’ve covered enough cases. The fundamental problem: the space of possible inputs is effectively infinite. Testing can prove bugs exist. It cannot prove they don’t.

Formal verification takes a different approach entirely. Instead of testing samples, it constructs a mathematical proof that code satisfies its specification in all cases. Not most. Not statistically likely. All.
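To make the contrast concrete, here is a minimal Lean 4 sketch. The function `double` and the theorem name are invented for illustration, and the `omega` tactic assumes a reasonably recent Lean 4 toolchain. The three `example` lines behave like unit tests: each checks one input. The theorem covers every natural number at once.

```lean
-- A toy function (hypothetical, for illustration only).
def double (n : Nat) : Nat := n + n

-- Testing by example: three inputs checked, nothing said about the rest.
example : double 0 = 0 := rfl
example : double 3 = 6 := rfl
example : double 21 = 42 := rfl

-- Verification: one statement that holds for all n, checked by the proof kernel.
theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
  unfold double   -- goal becomes: n + n = 2 * n
  omega           -- linear-arithmetic decision procedure closes it
```

If the definition of `double` were changed so that the property failed for even one input, the theorem would stop compiling, while the three examples might keep passing.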

This isn’t theoretical. The aerospace industry already accepts formal methods as certification evidence under DO-178C, through its DO-333 supplement. Intel adopted them aggressively after the infamous Pentium FDIV bug, discovered in 1994, cost the company $475 million. Chip design today relies heavily on formal proofs.

So why doesn’t all software get this treatment? Because it’s brutally expensive.

The Bottleneck Glasswing Is Targeting

Writing formal proofs requires specialists fluent in proof assistants like Coq, Isabelle, or Lean — tools that are powerful but notoriously unforgiving. The commonly cited figure: verified code costs 10 to 30 times more per line than conventional code.

The poster child is seL4, an Australian-built microkernel that is fully formally verified. The result is remarkable: an OS kernel whose C implementation is machine-checked against its formal specification. The cost was staggering: roughly 8,700 lines of C code took on the order of 20 person-years of proof effort, spread over several years.

Project Glasswing wants AI to break this bottleneck. The approach has three layers. First, AI analyzes existing code and automatically generates formal specifications. Second, AI searches for and writes the mathematical proofs against those specs. Third, human experts shift from writing proofs to reviewing and approving AI-generated ones.

In short: turn verification specialists from authors into editors.

Why This Makes Sense for Anthropic

An AI company diving into formal verification might seem like a detour. For Anthropic, it’s a straight line.

The company has built its identity around AI safety since day one — Constitutional AI, improved RLHF, model evaluation frameworks. Glasswing extends that philosophy beyond model behavior to the systems AI will be embedded in. If you’re going to put AI in medical devices and autonomous vehicles, the software around that AI had better be airtight.

The timing also works. Large language models have gotten significantly better at mathematical reasoning. DeepMind’s AlphaProof showed in 2024 that an AI system could produce machine-checked formal proofs for International Mathematical Olympiad problems. Research teams have been using AI to suggest proof steps in Lean 4 for years now. Anthropic is betting it can take these academic results and apply them where the stakes are real.

The Hard Parts Nobody Has Solved

The vision is seductive. The obstacles are severe.

Specification is the real problem. Before you can prove code correct, you need a mathematical definition of “correct.” In practice, software requirements are vague, implicit, and often contradictory. Even if AI can extract a spec from code, someone still has to verify that the spec captures the actual intent. Get the spec wrong and you get a flawless proof of the wrong thing.
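A small Lean 4 sketch shows how this failure mode looks in practice. Everything here is hypothetical: `WeakMaxSpec` is a deliberately incomplete specification for "maximum of two numbers", and `notMax` is a function that is obviously not a maximum. The spec says the result must be at least as large as both inputs but forgets to say it must equal one of them.

```lean
-- Hypothetical, deliberately incomplete spec: the result bounds both inputs.
-- Missing clause: the result must be either `a` or `b`.
def WeakMaxSpec (f : Nat → Nat → Nat) : Prop :=
  ∀ a b, a ≤ f a b ∧ b ≤ f a b

-- Clearly not a maximum function.
def notMax (a b : Nat) : Nat := a + b + 1

-- The proof checker accepts this: the proof is valid, the spec is wrong.
theorem notMax_meets_spec : WeakMaxSpec notMax := by
  intro a b                -- fix arbitrary inputs
  unfold notMax            -- goal: a ≤ a + b + 1 ∧ b ≤ a + b + 1
  constructor <;> omega    -- both halves are simple linear arithmetic
```

The proof is mechanically sound, and the function is still wrong for its intended purpose. No proof checker can catch that gap, because the gap lives in the spec, which is exactly the part a human still has to review.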

Scale is brutal. seL4 is elegant and small. Real-world industrial software runs to millions of lines, tangled with legacy code, third-party libraries, and hardware-specific dependencies. Deciding what to verify — let alone verifying it — is an open problem at that scale.

Absence of proof isn’t proof of absence. When the AI produces a valid proof, great — formal proofs are mechanically checkable. But when the AI fails to find a proof, you’re stuck. Is there a bug? Or did the AI just hit its limits? That ambiguity undermines the core promise.

The Direction Matters More Than the Destination

Here’s the thing: none of these obstacles make the project pointless.

Right now, quality assurance for critical software relies on massive test suites and code review. NASA targets roughly 0.1 defects per 1,000 lines of code — impressive, but not zero. Even partial formal verification can catch categories of bugs that testing fundamentally cannot reach.

The real prize isn’t proving every line of every program correct. It’s lowering the barrier enough that formal verification stops being a luxury reserved for microkernel research teams and becomes a practical tool for any organization writing safety-critical code. If Glasswing can make verification accessible to a hospital IT department or a mid-size avionics contractor, that shifts the baseline for the entire industry.

Will Project Glasswing eliminate all software bugs tomorrow? No. But “democratize formal verification with AI” is a direction this field has been waiting decades for someone to seriously pursue. The question that matters now is concrete: what scale of software, and what level of assurance, can it actually deliver? The answer will determine whether you’d board a plane whose control software was proven correct by an AI.
