Zuckerberg Personally Signed Off on Pirated Training Data, Lawsuit Claims
The question of where AI training data comes from has been simmering for years. This week it boiled over. Court filings in the Sarah Silverman-led class action against Meta now allege that Mark Zuckerberg himself signed off on training Llama with a dataset everyone in the room knew was pirated. That’s not a corporate liability story anymore. That’s a CEO liability story.
The dataset at the center of it
The pirated trove in question is LibGen — short for Library Genesis, a “shadow library” that hosts millions of copyrighted books and academic papers without permission. It’s been a fixture of the academic piracy underground for over a decade, and it’s exactly the kind of source a Fortune 500 legal department would normally run screaming from.
According to internal documents surfaced in discovery, Meta engineers flagged LibGen internally as “a dataset we know to be pirated.” They knew. They wrote it down. And they used it anyway.
Why the CEO angle changes everything
Plaintiffs aren’t just saying Meta used pirated books. They’re saying Zuckerberg personally approved it after being briefed on the legal risk. That’s the explosive claim, and it’s the kind of allegation that could strip away the insulation that normally shields tech CEOs from individual liability.
Standard playbook in copyright cases: the company pays a settlement, executives stay personally untouched, everyone moves on. But “the CEO knew it was illegal and greenlit it anyway” is a different category of fact pattern. It maps closer to willful infringement, which carries statutory damages of up to $150,000 per work in the US. Multiply that by a library’s worth of books and the math gets uncomfortable fast.
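To put rough numbers on that multiplication, here is a minimal back-of-the-envelope sketch. The $150,000 figure is the per-work ceiling for willful infringement cited above. The work counts are hypothetical round numbers chosen for illustration, not figures from the filings, and the function name is made up for this example. The result is a theoretical upper bound, not a prediction of what a court would award.

```python
# Back-of-the-envelope ceiling on statutory damages for willful infringement.
# Assumption: every work counted qualifies for the maximum award.

WILLFUL_MAX_PER_WORK = 150_000  # USD, the per-work ceiling cited in the article


def max_statutory_exposure(num_works: int, per_work_cap: int = WILLFUL_MAX_PER_WORK) -> int:
    """Return the theoretical maximum exposure in USD for num_works works."""
    if num_works < 0:
        raise ValueError("num_works must be non-negative")
    return num_works * per_work_cap


if __name__ == "__main__":
    # Hypothetical work counts, for illustration only.
    for n in (1_000, 100_000, 1_000_000):
        print(f"{n:>9,} works -> ${max_statutory_exposure(n):,}")
```

Even at the smallest hypothetical count, the ceiling is in the hundreds of millions of dollars, which is why the willfulness question matters so much more than the underlying infringement claim.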
Fair use was Meta’s shield. It just got thinner.
Meta’s defense has leaned on fair use — specifically the argument that training a model is a transformative use, the same logic that helped Google win the Books case back in 2015. A California federal judge bought parts of that argument last year, dismissing several claims against Meta.
But fair use analysis weighs the character of the source material, not just the use. Training on books you legally licensed is one argument. Training on books you knowingly pulled from a piracy site is a much harder sell to a jury. The transformative-use defense doesn’t really have an answer for “you stole the inputs.”
Every other AI lab is watching
OpenAI, Anthropic, Google, and Microsoft are all defending versions of the same lawsuit. The New York Times case against OpenAI, the Authors Guild suits, the Getty Images claim against Stability — they’re all wrestling with the same core question: what counts as a legitimate training corpus?
If the Meta case establishes that knowingly using pirated sources strips fair use protection — and worse, exposes executives personally — every general counsel in Silicon Valley is going to be on the phone Monday morning asking the ML team where, exactly, that 15-terabyte dataset came from. “We didn’t ask” stops being a viable answer.
The era of plausible deniability is closing
AI copyright law is still mostly gray area, and it’ll stay that way for a while — courts move slower than model releases. But this case is drawing one line clearly: willful blindness isn’t a defense anymore. The internal Slack messages exist. The discovery process will surface them. And the next time a frontier lab decides to scrape something it shouldn’t, somebody is going to remember what happened to Meta.