“Ethical AI, Unethical Inputs”: The Kleiner v. Adobe SlimLM Lawsuit and What It Signals for AI Training Liability
by ChatGPT-5.2
This particular complaint (filed February 9, 2026, in the Northern District of California) is a proposed class action by author Arthur Kleiner against Adobe Inc., alleging that Adobe trained (and continues to benefit from) small language models using large-scale, unlicensed copies of copyrighted books—specifically including Kleiner’s own registered book.
What the grievances are (what Adobe is accused of doing)
At its core, the lawsuit advances a “dataset supply chain” theory of infringement:
- Adobe allegedly trained “SlimLM” on SlimPajama-627B, a dataset the complaint claims is a cleaned and deduplicated derivative of RedPajama.
- RedPajama allegedly incorporated “Books3,” a corpus widely associated with pirated books sourced from shadow-library infrastructure (the complaint names Bibliotik and references other piracy ecosystems).
- Kleiner’s specific work (The Age of Heretics) is asserted to be in that chain, and he attaches a U.S. copyright registration record as Exhibit A.
- The complaint alleges Adobe downloaded, copied, stored, and repeatedly reproduced the dataset internally across the training pipeline (including iterations and experiments), and that Adobe’s continued possession and use of those copies constitutes ongoing infringement.
- The suit also leans into “ethical AI” positioning by contrasting Adobe’s marketing around responsible AI with alleged reliance on datasets “known to contain” unlicensed text.
Legal claims and remedies sought: The complaint pleads direct copyright infringement under 17 U.S.C. § 501 on behalf of a nationwide class of copyright owners with registered works, and it asks for damages, attorneys’ fees, injunctive relief, and (notably) destruction or disposition of infringing copies under 17 U.S.C. § 503(b).
Is the evidence good quality?
It’s a strong pleading-stage narrative, but the evidentiary strength splits into “what’s fairly well-supported publicly” versus “what will require discovery.”
What’s relatively strong (or at least well-anchored)
- Adobe’s SlimLM training stack is tied to SlimPajama in public research artifacts describing pretraining on SlimPajama-627B and fine-tuning on DocAssist, which makes the complaint’s starting point easy to substantiate.
- SlimPajama’s relationship to RedPajama is well-documented publicly, including by Cerebras Systems and in dataset documentation.
- The “Books3 problem” is a known controversy in the open-dataset ecosystem, including public notes about access being restricted due to copyright infringement reports for the “book” configuration.
What’s weaker or “to be proven”
- Dataset membership and provenance for Kleiner’s book will likely be contested, because the complaint’s chain-of-custody allegations still need forensic proof that the work was present in the dataset used by Adobe after filtering, deduplication, and any exclusions (a sketch of what such a membership check could look like follows this list).
- Allegations about how many internal copies existed and how they were distributed across infrastructure are typically “on information and belief” until discovery produces logs, snapshots, and experiment tracking.
- The “surreptitious” framing is rhetorically potent but legally secondary, because copyright liability turns primarily on unauthorized copying and defenses like fair use, not on whether the copying was hidden.
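To make that forensic question concrete, dataset-membership analysis usually means searching a dataset snapshot for passages distinctive enough to identify the work. Below is a minimal, hypothetical Python sketch of that kind of check using word n-gram overlap; the file path, JSONL layout, and n-gram length are illustrative assumptions, not details from the complaint or from SlimPajama’s actual tooling.

```python
import json
import re

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so formatting differences
    between editions don't hide a match."""
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower()).strip()

def shingles(text: str, n: int = 12) -> set[str]:
    """Word n-grams long enough to be distinctive of one specific book."""
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def membership_score(book_text: str, dataset_path: str, n: int = 12) -> float:
    """Fraction of the book's distinctive n-grams found in a JSONL dataset
    shard. A high fraction suggests the work (or a near-copy) survived
    cleaning, filtering, and deduplication."""
    book_grams = shingles(book_text, n)
    found: set[str] = set()
    with open(dataset_path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line).get("text", "")
            found |= book_grams & shingles(doc, n)
    return len(found) / max(len(book_grams), 1)

# Hypothetical usage; paths are placeholders.
# score = membership_score(open("age_of_heretics.txt").read(),
#                          "slimpajama_shard_000.jsonl")
# print(f"{score:.1%} of distinctive 12-grams matched")
```

Real forensic work would run against the actual dataset snapshots, use fuzzier matching to survive aggressive cleaning, and report which passages matched rather than a single score.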
Bottom line: as a complaint, it’s coherent and anchored to an identifiable SlimLM → SlimPajama → RedPajama lineage, but it will stand or fall on proof about the specific works used or retained and on how the court treats fair use in the presence of piracy-tainted sources.
How this case compares to other AI copyright cases
This suit follows a familiar “books in training data” playbook in one sense, and is strategically distinct in another.
The familiar pattern
It mirrors the now-common author/publisher theory used against model developers: models were trained on copyrighted books without permission, the resulting capability has commercial value, and class treatment is used to aggregate many rightsholders into one action.
The distinct angle
- Adobe is not being sued as “a frontier chatbot company,” but as a dominant creative and document software platform integrating language models into productivity workflows, which can shift how market harm and unjust benefit are argued.
- The complaint emphasizes small language models for on-device document assistance rather than a general-purpose LLM, which could cut both ways in court (narrower purpose and less expressive overlap for Adobe, but clearer productization and monetization for plaintiffs).
- The case also leans into an “ethics mismatch” theme by contrasting public “responsible AI” messaging with alleged reliance on datasets plaintiffs describe as contaminated by piracy.
Relevant context: defendants increasingly cite recent fair use reasoning from cases involving Anthropic, while plaintiffs highlight that acquisition through piracy and retention of pirated libraries can remain legally and factually toxic even where “training” is argued to be transformative.
Predicting the potential outcome (with realistic branches)
No outcome is guaranteed, but the most likely trajectory looks like this.
1) Motion to dismiss: partial survival is plausible
Adobe will likely move to dismiss, arguing fair use and that the complaint does not plausibly allege copying of Kleiner’s specific work in sufficient detail.
However, fair use is fact-intensive, and when claims hinge on the provenance and retention of allegedly pirated libraries, courts often want discovery before ruling conclusively.
2) Discovery becomes the real battlefield
If the case proceeds, the decisive evidence will be dataset snapshots and hashes, internal dataset governance records, experiment tracking, and logs showing what was ingested, filtered, retained, and used across runs and checkpoints.
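To illustrate what such governance records often reduce to in practice, here is a minimal, hypothetical sketch of a content-hash manifest tying dataset shards to a training run; the directory layout, shard naming, and run identifier are assumptions for illustration, not Adobe’s actual pipeline.

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large shards never need to fit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(shard_dir: str, run_id: str) -> dict:
    """Map each dataset shard to its content hash for a given training run.
    Comparing manifests across runs shows what was ingested, filtered out,
    or retained over time, which is exactly the provenance question
    discovery would press on."""
    return {
        "run_id": run_id,
        "shards": {p.name: sha256_file(p)
                   for p in sorted(Path(shard_dir).glob("*.jsonl"))},
    }

# Hypothetical usage; directory and run id are placeholders.
# manifest = build_manifest("slimpajama_shards/", run_id="slimlm-pretrain-001")
```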
This is also where plaintiffs may lean on memorization and extraction research to argue that protected expression can persist in models in ways that matter for liability, even if outputs are not intended to reproduce books.
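For a sense of how such extraction probes work, here is a simplified sketch in the style of published memorization studies: feed the model a prefix from the book and measure how much of the true continuation it reproduces under greedy decoding. The model id is a placeholder, and real studies sample many prefixes and control for tokenization and decoding strategy.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def extraction_overlap(model, tokenizer, prefix: str, true_continuation: str,
                       max_new_tokens: int = 50) -> float:
    """Greedy-decode a continuation of `prefix` and return the fraction of
    leading tokens that exactly reproduce the book's true continuation.
    High overlap across many distinctive passages is the kind of signal
    memorization studies report."""
    inputs = tokenizer(prefix, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens,
                            do_sample=False)
    generated = output[0][inputs["input_ids"].shape[1]:].tolist()
    target = tokenizer(true_continuation,
                       add_special_tokens=False)["input_ids"][:max_new_tokens]
    matches = 0
    for g, t in zip(generated, target):
        if g != t:
            break
        matches += 1
    return matches / max(len(target), 1)

# Hypothetical usage; "some-org/slimlm" is a placeholder model id.
# tok = AutoTokenizer.from_pretrained("some-org/slimlm")
# model = AutoModelForCausalLM.from_pretrained("some-org/slimlm")
# print(extraction_overlap(model, tok, prefix="...", true_continuation="..."))
```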
3) Settlement pressure is real; “destruction” remedies are harder
Even if Adobe believes it can win a fair use ruling on “training,” piracy-provenance and internal retention issues can raise litigation and reputational risk, which often increases settlement pressure.
That said, the specific remedy of destruction or disposition of models and datasets is typically difficult in practice once systems are integrated into products, so endgames more commonly look like a confidential settlement plus money and operational commitments around future data sourcing and governance.
My best forecast
- Most likely: the case survives the first gate at least in part, becomes discovery-driven, and carries meaningful settlement gravity if plaintiffs can concretely prove the piracy-tainted chain and specific-work inclusion.
- Less likely but possible: Adobe obtains an early or mid-stage win if it can undermine the “tainted inputs” factual premise, show robust filtering, or convince the court that fair use applies cleanly on the developed record.
- High-impact plaintiff win scenario: plaintiffs prove large-scale use and retention of pirate-sourced libraries, producing either a major settlement or a precedent-setting ruling emphasizing provenance and retention as liability multipliers.
What this case signals (why it matters beyond Adobe)
Kleiner v. Adobe is another step in a pattern: the legal system is increasingly treating “training data governance” as a compliance domain, not a research footnote. The complaint is essentially saying: you don’t get to outsource your risk to the open dataset supply chain. If SlimPajama inherits tainted inputs, and you commercialize the resulting model, you may inherit the liability too.
