Wednesday, December 3, 2025

Is it plagiarism or aLLMost plagiarism?

From a discussion on a LinkedIn post by Aron Brand:

The question arose as to whether LLMs store "representations" of their training data and, if so, why it isn't plagiarism when they use those representations to respond to users' prompts.

I think that's a very nuanced and insightful question that comes down, I suppose, to the definition of "stored representations."

At first blush, it seems obvious: Let's say an LLM ingests the post you're reading at this moment. For the sake of argument, assume it's not here on a blog, but in a book that I've published and copyrighted. Of course I've included the standard notice that "no part may be stored, transmitted, reproduced, etc. without written permission." The owner of the LLM has bought and owns a copy of my book.

Later, you ask it about my opinions on LLMs and plagiarism, and it summarizes what I've written. I allege that it has "stolen" the content and used it unlawfully, without my permission.

Has the LLM "stored" my content?

Well, yes and no. If we're asking whether we can search an LLM's memory and find a snippet of its training data, the answer is no. That's not how LLMs work. If you think of an LLM as a vast system of billions or trillions of interconnected pipes and valves, then the training data simply adjusts the valves (which we call "parameters") to bring whatever comes out into better compliance with what's desired. The actual words (or images, or sounds) exist nowhere inside it; their only vestige is the minuscule changes they've made to some of the valves. And it's generally impossible to reverse-engineer those changes: we can't determine the real-world meaning of any valve's function, or why the LLM adjusted it the way it did.
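If it helps to make the "valves" metaphor concrete, here is a toy sketch in Python. This is my own illustration, not how any real LLM is implemented: a single made-up weight gets nudged by one training step, and afterward the only trace of the training example is that slightly different number.

    # Toy illustration only -- a one-"valve" model, not a real LLM.
    def train_step(weight, token_in, token_target, lr=0.01):
        """One gradient step on squared error; returns the adjusted weight."""
        prediction = weight * token_in          # what the "model" guesses
        error = prediction - token_target       # how far off the guess is
        gradient = 2 * error * token_in         # d(error^2)/d(weight)
        return weight - lr * gradient           # nudge the valve

    weight = 0.5
    token_in, token_target = 3.0, 7.0           # pretend these numbers encode two words
    weight = train_step(weight, token_in, token_target)
    print(round(weight, 2))                     # roughly 0.83 -- the words themselves are gone

A real model has billions of these numbers, and every training passage shifts many of them by tiny amounts, which is why you can't point at any one parameter and recover the text that moved it.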

But, clearly, at least some of the information associated with the training data exists encoded within the LLM; it's somehow involved in the process of generating the output. The LLM will take that information into account when it responds to your prompt, and it should provide a summary that reflects the opinions I've written here.

Does that constitute plagiarism?

If I plan to sue the owner of the LLM, then that's a question that falls under the purview of lawyers, judges, and juries—a group of which I am thankfully not a member, beyond a few stints as a juror. It sure feels like plagiarism, doesn't it?

But suppose you study a set of textbooks about the inner workings of LLMs in all of their technical detail. You learn how to build them, train them, and prompt them. The books teach you not only the basics, but the insider tips and tricks to wring the best performance out of them. Then you use the information from memory to create an online college course.

How is that different from plagiarism?

  • You haven't copied any of the original text or images, but neither has the LLM.
  • You've based your course solely on the content of the books, and so has the LLM.
  • "Storing" the textbooks' content in your brain—even if you've memorized it verbatim—doesn't violate copyright law ... but the LLM hasn't even done that; it's just adjusted its parameters.
  • The owner of the LLM has acquired the content legally, just as you have.

We expect humans to use reference material in creating their original content; we've been doing that for centuries. And, yet, when an LLM does it, we consider it plagiarism.

So, what's the answer?

For now, at least, for me, it's a disappointing "I don't know."

And a lot of far-reaching consequences will depend on the ultimate answer. Intellectual property rights aside, today's AI models are deluging the Internet with "LLM slop," unreliable "information" that's often misleading, poorly written, and, unfortunately, assumed to be legitimate by far too many of us. "Deep fakes" abound, and each new model exacerbates the situation by imitating genuine content ever more closely. LLMs continue to surprise us with unexpected behaviors that can be dangerous, and that will always be the case. Why? Remember that the underlying mechanism of every AI system is a set of countless parameters that have been adjusted based on their training data, and that it's impossible to associate any given parameter with its real-world effect. And that means nobody can predict exactly how an LLM will respond to a prompt, which threatens accuracy, security, and (in the case of, e.g., self-driving cars) physical safety. We continue to improve LLMs, but we still don't know what we don't know ... and given their increasing prevalence, that's downright dangerous.

Whichever way things go, AI will have at least as much impact on us as the Industrial Revolution, the use of electronics and inexpensive microcomputers, and eventually nuclear fusion.

Indeed, we live in interesting times.
