Carreyrou v. the AI Industry — From “Fair Use Theater” to Evidence-Driven Copyright Liability
1. What this case is really about
At first glance, Carreyrou et al. v. Anthropic, Google, OpenAI, Meta, xAI, and Perplexity looks like yet another book-author AI lawsuit. In substance, however, it is one of the most aggressive attempts so far to collapse the AI industry’s fair-use narrative by reframing model training as large-scale, willful piracy rather than abstract “learning.”
The complaint alleges a deliberate, repeated, and knowing acquisition of copyrighted books from shadow libraries (LibGen, Z-Library, Bibliotik, Books3, PiLiMi) followed by systematic copying during ingestion, preprocessing, deduplication, training, fine-tuning, and in some cases retrieval-augmented generation (RAG). The plaintiffs emphasize that infringement did not occur once, but hundreds or thousands of times per work across the model lifecycle.
Crucially, the plaintiffs reject class-action settlement logic outright. They argue that class actions structurally underprice infringement and shield defendants from the Copyright Act’s statutory-damages regime, which allows up to $150,000 per work per defendant for willful infringement, assessed by a jury. Instead, they deliberately pursue individualized jury trials, explicitly attacking recent AI settlements as “pennies on the dollar” outcomes that benefit platforms, not creators.
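To give a sense of the arithmetic (the figures here are purely hypothetical): if a jury found willful infringement of just 1,000 works by a single defendant, the statutory ceiling alone would be 1,000 × $150,000 = $150 million, before attorney's fees and before the same works are counted against any other defendant.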
This is not just litigation—it is a counter-strategy to how the AI industry has been managing legal risk.
2. Most surprising, controversial, and valuable statements
A. Most surprising
“The infringement occurred at least twice for every work.”
The complaint insists that infringement is multiplicative: it occurs once when pirated books are downloaded, and then again and again during training passes and optimization. This framing matters because it recasts “model training” from a single act into an industrial copying process.
Explicit reliance on torrent mechanics.
Few prior complaints have so clearly described how the BitTorrent protocol inherently involves redistribution: peers upload pieces to others even while still downloading (“leeching”), and keep uploading once they hold a complete copy (“seeding”), creating additional acts of infringement beyond the initial download. This is legally significant because it undermines any claim of passive acquisition or accidental exposure.
B. Most controversial
The claim that model parameters “embed” copyrighted expression.
This remains legally unsettled. Courts have not yet squarely ruled that trained weights constitute infringing copies. Plaintiffs here assert it forcefully, but this is where defendants will fight hardest, arguing functional transformation rather than expressive fixation.
Naming nearly the entire U.S. AI industry in one case.
This is unprecedented in scale and coordination. It invites judicial skepticism about overreach, but also pressures defendants to fragment their defenses—strategically advantageous to plaintiffs.
C. Most valuable (strategically)
Willfulness is the backbone of the case.
The complaint repeatedly cites internal warnings, industry knowledge, USTR “Notorious Markets” reports, prior injunctions, and even later licensing deals (e.g., HarperCollins/Microsoft) to argue defendants knew licenses were required but chose piracy anyway.
This is the key move. Without willfulness, the enhanced $150,000 ceiling falls away and statutory damages top out at $30,000 per work. With it, exposure becomes existential.
3. Comparison to other AI copyright litigation
Compared to earlier cases—Authors Guild v. Google, Getty v. Stability, Andersen v. Stability, NYT v. OpenAI/Microsoft, Bartz v. Anthropic—this complaint is distinctive in four ways:
Evidence density over theory
Instead of speculative similarity or output-based arguments, this case anchors itself in documented datasets (Books3, LibGen, PiLiMi) and publicly acknowledged training practices.
Anti-settlement posture
Where most AI cases quietly aim for licensing settlements, this one openly attacks the legitimacy of those settlements and positions jury trials as the corrective.
Lifecycle infringement theory
The complaint treats preprocessing, deduplication, and gradient descent as legally relevant copying events, far more granular than most prior pleadings.
Industry-wide framing without class-action dilution
This hybrid approach (many defendants, no class) maximizes pressure while preserving statutory leverage.
In contrast, class actions like Bartz have already shown how easily AI defendants can cap exposure through aggregate settlements, a point the plaintiffs explicitly weaponize here.
4. Likely outcomes and predictions
Short term (12–24 months)
Motions to dismiss will narrow claims, especially around “embedding” and RAG-based liability.
Discovery battles will be ferocious, particularly over training data provenance and internal risk assessments.
Some defendants (likely Perplexity or xAI) may seek early settlement to avoid discovery asymmetry.
Medium term
Courts are likely to reject blanket fair-use defenses at the pleading stage where piracy sources are plausibly alleged.
Willfulness claims are likely to survive, at least partially, given public knowledge of LibGen/Z-Library.
Long term
The most probable outcome is selective, defendant-specific settlements, not an industry-wide verdict.
However, even one adverse jury verdict on willful infringement could reset licensing norms overnight.
This case is less about winning outright than about making infringement economically non-viable.
5. Value of this effort for rights owners and publishers
For publishers and authors, this litigation is strategically valuable even if plaintiffs never reach trial:
It raises the floor price of AI licensing by making “free training” legally radioactive.
It undermines the narrative that “everyone did it” as a defense.
It shifts negotiations from abstract ethics to quantified statutory risk.
It validates publisher positions that shadow-library ingestion is not a gray area but a red line.
That said, this approach is resource-intensive and favors well-resourced plaintiffs. Smaller authors and publishers will still struggle unless collective licensing or regulatory solutions emerge.
6. How AI developers can prevent this from happening again
From a compliance and governance perspective, the lessons are stark:
Hard provenance requirements
If you cannot trace training data to lawful sources, do not ingest it. Full stop.
Dataset quarantining and attestations
Treat datasets like regulated supply chains, with internal audits and third-party verification (a minimal illustrative sketch of such a provenance gate follows this list).
Licensing over litigation arbitrage
The HarperCollins deal cited in the complaint proves licenses are cheaper than jury trials.
Model-lifecycle risk accounting
Stop pretending infringement happens only at “ingestion.” Courts are being asked to scrutinize the entire pipeline.
Stop relying on class actions as liability sinks
This case shows that sophisticated rights holders will route around that strategy.
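To make the first two recommendations concrete, here is a minimal, purely illustrative sketch of a provenance gate. The field names, the allow-list of acquisition routes, and the quarantine behavior are all assumptions chosen for illustration, not any company's actual schema or pipeline.

```python
# Hypothetical illustration only: a minimal "provenance gate" that refuses to
# ingest any dataset entry lacking a documented lawful source. Field names
# (title, source_url, license, acquired_via) are assumed, not a real schema.

ALLOWED_ACQUISITION = {"publisher_license", "public_domain", "author_permission"}

def provenance_gate(manifest: list[dict]) -> list[dict]:
    """Return only entries whose lawful provenance is documented; quarantine the rest."""
    approved, rejected = [], []
    for entry in manifest:
        has_source = bool(entry.get("source_url"))
        lawful = entry.get("acquired_via") in ALLOWED_ACQUISITION
        licensed = bool(entry.get("license"))
        if has_source and lawful and licensed:
            approved.append(entry)
        else:
            rejected.append(entry)
    if rejected:
        # Quarantine rather than silently drop, so internal audits can review the rejects.
        print(f"Quarantined {len(rejected)} entries with undocumented provenance.")
    return approved

# Example: a shadow-library download fails the gate even though the file is obtainable.
manifest = [
    {"title": "Book A", "source_url": "https://publisher.example/book-a",
     "acquired_via": "publisher_license", "license": "2025 training license"},
    {"title": "Book B", "source_url": "", "acquired_via": "torrent", "license": ""},
]
print([e["title"] for e in provenance_gate(manifest)])  # -> ['Book A']
```

The same idea extends across the lifecycle: each downstream stage (preprocessing, deduplication, fine-tuning, RAG indexing) would consume only the approved manifest, which is what model-lifecycle risk accounting implies in practice.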
Conclusion
Carreyrou v. the AI industry is not merely another copyright suit. It is a direct challenge to the economic foundations of generative AI as currently practiced, aimed at collapsing the assumption that unauthorized training is a tolerable, transitional phase.
Whether the plaintiffs ultimately prevail is almost secondary. The real achievement of this case is that it forces AI developers, publishers, and courts to confront a simple question they have long avoided:
If books are the “gold-standard” input for AI systems worth hundreds of billions, why were they treated as valueless at the moment of acquisition?
That question will not go away—no matter how this case ends.
