A Fields Medalist Tried GPT-5.5 Pro on Real Math. His Verdict Surprised Everyone.
“AI wins gold at the Math Olympiad” headlines have become routine. Real mathematicians, mostly, have shrugged. Benchmark scores and actual research math live in different universes. So when Timothy Gowers, a Fields Medalist who has spent years skeptically poking at AI math systems on his blog, said GPT-5.5 Pro felt like the first model worth talking to, the math community noticed.
Why Gowers’ Verdict Carries Weight
Gowers won the Fields Medal in 1998 for work in functional analysis and combinatorics. More relevant here: he has been one of the most consistent, articulate skeptics of AI mathematics. His standing line for years was that benchmark wins are not math, and that LLMs can’t do the thing research mathematicians actually do — invent new concepts.
So this isn’t a hype-cycle endorsement from someone with a stake in OpenAI. It’s a hard grader changing his grade. That alone is news.
The Problems He Threw At It
Gowers didn’t reach for textbook exercises or olympiad puzzles. He used the kind of non-standard questions he poses to graduate students — open-ended, no clean answer in any reference.
The most striking exchange involved a lemma in functional analysis. The standard proof is well known; Gowers asked whether a different route existed. GPT-5.5 Pro got it wrong on the first try. But when he flagged a counterexample, the model didn’t just apologize and reroll. It diagnosed why its earlier reasoning had failed and proposed a structurally different attempt. That diagnostic move — naming the flaw, not just accepting the correction — is what Gowers said he had never seen before.
“It Felt Like Talking to a Colleague”
The shift Gowers kept circling back to wasn’t raw capability. It was conversational texture. Older models swung between two failure modes: confidently wrong, or instantly capitulating the moment you pushed back. Neither is what a working mathematician does.
GPT-5.5 Pro, he said, pushed back when it had grounds and conceded when it actually was wrong. That calibration, knowing the difference between “I might be missing something” and “no, you’re missing something,” is what he meant by conversational texture.
The ceiling is still real, though. Gowers was clear: the model doesn’t invent genuinely new concepts. It’s good — sometimes startlingly good — at recombining tools scattered across the existing literature. It is not yet proposing the new definition or new lens that opens a field. His shorthand: a smart mid-PhD student, not a peer.
The Reception, and What’s Still Missing
Reactions on math Twitter and Hacker News have split along predictable lines. One camp: “Mid-PhD-level math partner is enormous, stop downplaying it.” The other: “It’s one anecdote from one user, and we’ve been here before.” A recurring critique is reproducibility — same model, same prompt, wildly different outputs across runs.
Worth noting: Gowers himself stressed that this is a subjective impression, not a benchmark. The real test comes when AI is listed as a co-author on a serious paper, or moves the needle on an open problem. Neither has happened. Yet.
The Question Underneath
One mathematician’s blog post, sure. But that mathematician is Gowers, and his prior was firmly on the skeptic side. Something has shifted from “answer machine” to mathematical interlocutor, and the threshold may be closer than the field assumed even a year ago.
Which surfaces a harder question nobody has answered. If a competent PhD-level math collaborator is now a chat window away, what does the next decade of math education and research actually look like? “Use AI as a tool” is the easy answer. It’s also not nearly enough.