AI ethics 4 min read

Is AI Just Plagiarism at Industrial Scale? The Question Nobody Wants to Answer

“AI is a giant plagiarism machine.” That single line, posted in a recent essay, pulled in 571 upvotes and reignited a fight the industry keeps hoping will go away. ChatGPT has been mainstream for over three years now, and we still cannot agree on a basic question: is this creation, or is it laundering? Time to look the accusation in the eye.

The Argument That Refuses to Die

The case against AI training is brutally simple. Billions of texts, images, and code snippets were used to train these models. How many of those creators actually said yes?

Almost none. If it was on the open web, it got scraped, ingested, and turned into raw material. Critics have a name for this: industrialized infringement. When one person copies a book, it is plagiarism. When a company hoovers up millions of books to train a commercial product, somehow it becomes “innovation.”

A line that has been circulating on Hacker News and tech-ethics YouTube captures the mood perfectly: “theft that no one can prosecute.” Too big, too distributed, too technically novel for the law to catch up.

“But Humans Learn the Same Way” — Does That Hold Up?

The standard rebuttal from the AI camp goes like this. Painters study other painters. Writers read thousands of books before they write one. Why is a model doing the same thing any different?

It sounds reasonable until you push on it. The differences are scale, speed, and intent.

A human reads maybe a few thousand books in a lifetime. A frontier model ingests hundreds of millions in days. Human learning runs through forgetting, interpretation, and lived experience. Models retain statistical patterns of their training data with startlingly low loss — researchers have repeatedly shown that the right prompt can regurgitate near-verbatim passages or near-identical images from training material.

And here is the part the analogy quietly skips: a student painter is not flooding the same market their teachers sell into, undercutting them on price and infinitely scaling output. AI does exactly that, in the exact market the original creators are trying to make a living in.

The “Transformative Use” Escape Hatch

US copyright law has a doctrine called transformative use. Take something, change it meaningfully, and you may qualify as fair use. AI labs lean hard on this — training, they argue, is the textbook definition of transformation.

There is a problem. Transformative use was originally tied to specific purposes like criticism, parody, or research. Whether vacuuming up the entire internet to train a commercial product fits cleanly inside that doctrine is a question that no court has fully settled. The New York Times suit against OpenAI, Getty Images versus Stability AI, the Authors Guild cases — these are going to set the precedent over the next few years.

Until then, the entire industry is running on a legal gray zone, and so is everyone using these tools.

The Question Nobody Wants to Ask

Here is the part that actually gets uncomfortable. The real question is not “is AI plagiarism.” It is whether we ever agreed to this deal in the first place.

The person who wrote a Medium post in 2015. The developer who pushed code to GitHub in 2018. The illustrator who built an Instagram following in 2020. Did any of them imagine their work would become raw material for a model worth tens of billions of dollars? Almost certainly not. “Public” and “freely transferable” are not the same thing, but the AI industry has quietly conflated them.

OpenAI, Google, Meta, Anthropic — each one has gotten progressively vaguer about what is in their training corpora. The less transparency, the louder the criticism. The louder the criticism, the more they close up. It is a feedback loop that ends badly for trust.

What’s Owed, and to Whom

Whether AI is “plagiarism” is ultimately a question for courts and legislatures. But before that verdict lands, the industry has a question it should be asking itself: what, exactly, are we giving back to the people whose work made this possible?

Licensing deals with publishers are a start. So are the early experiments with creator compensation pools. Neither is at the scale of the value being extracted. Until that gap closes, the “plagiarism machine” framing is going to keep finding 571 upvotes — and then 5,710, and then more. Do you think the difference between human learning and model training is a difference in kind, or just a difference in scale? Because the answer to that question is quietly deciding the next decade of this industry.

AI ethics copyright generative AI fair use training data

Comments

    Loading comments...