A group of 28 authors and rights holders, including Angie Cruz, Dave Eggers, Vendela Vida and others, opted out of the Bartz settlement and filed their own individual action against Anthropic.
The plaintiffs allege that Anthropic built and improved Claude using copyrighted books taken from shadow libraries and pirate datasets, including Books3, LibGen and PiLiMi/Z-Library mirrors.
Summary: Cruz v. Anthropic is an opt-out lawsuit by authors who rejected the Bartz settlement and want a jury to award statutory damages for Anthropic’s alleged use of pirated books to build Claude.
Its importance is that it reframes AI copyright litigation away from abstract “training is fair use” arguments and toward piracy, willfulness, torrenting, metadata stripping, retained datasets, and market harm.
For future litigants, the lesson is to build forensic evidence around source data, licensing markets, CMI removal, and concrete substitution harms rather than relying only on broad claims that AI models trained on copyrighted works.
Cruz v. Anthropic: The Opt-Out Lawsuit That Tries to Turn AI Copyright Litigation from “Fair Use” into “Piracy, Willfulness, and Jury Damages”
by ChatGPT-5.5
The new Angie Cruz v. Anthropic complaint is a tactical answer to the proposed class settlement in Bartz v. Anthropic. A group of 28 authors and rights holders, including Angie Cruz, Dave Eggers, Vendela Vida and others, opted out of the Bartz settlement and filed their own individual action against Anthropic in the Northern District of California. Filed on the eve of the Bartz final-approval hearing, the complaint is framed as a deliberate bid to put statutory damages before a jury rather than accept class-settlement economics.
The complaint’s basic story is stark. The plaintiffs allege that Anthropic built and improved Claude using copyrighted books taken from shadow libraries and pirate datasets, including Books3, LibGen and PiLiMi/Z-Library mirrors. They say Anthropic did not merely “train on data” in some abstract technical sense; it allegedly downloaded, torrented, reproduced, distributed, stripped metadata from, scanned, retained and commercially exploited copyrighted works to gain an advantage in the generative-AI arms race.
The most important strategic choice is that the plaintiffs do not bring a class action. They expressly say they want to retain control of their claims and avoid having their rights “diluted” through broad class settlements that allegedly resolve high-value infringement claims for “pennies on the dollar.” They rely on the statutory-damages regime of the Copyright Act and the right to have a jury evaluate willfulness and damages.
That matters because the case is aimed less at winning a clean doctrinal declaration that all AI training is infringement and more at creating a damages-and-willfulness vehicle. The plaintiffs appear to be saying: even if courts are tempted to treat some AI training as transformative, Anthropic’s conduct here should not be sanitized as “learning.” The issue is unlawful acquisition, piracy-based sourcing, torrent distribution, deliberate metadata stripping, permanent retention and the use of copyrighted works as the unlicensed infrastructure of a very valuable commercial product.
The situation
The complaint sits in the shadow of Judge Alsup’s 2025 Bartz ruling. That ruling gave Anthropic a major fair-use victory on certain training uses, but it also drew a crucial distinction: the court held that the copies used to train specific LLMs were fair use, while the pirated copies downloaded to build a permanent central library were not; “every factor,” the court said, pointed against fair use for those library copies.
Cruz tries to exploit that fracture. The case does not simply ask whether “AI training” is fair use. It breaks Anthropic’s conduct into separable acts: acquiring pirate books, torrenting them, distributing them through peer-to-peer protocols, scanning books, stripping copyright management information, retaining a central library, embedding near-verbatim material in model weights, and creating market-substituting outputs. That is a smarter litigation architecture than treating the dispute as one giant yes/no question about whether machine learning is fair use.
The complaint alleges that Anthropic downloaded Books3 in 2021, downloaded at least five million books from LibGen, and downloaded at least two million books from PiLiMi, while knowing or having reason to know that these were repositories of unauthorized copyrighted works. It also alleges that Anthropic used BitTorrent, which matters because torrenting can involve not only downloading but also uploading or “seeding” pieces of files to others. That gives plaintiffs a distribution and contributory-infringement theory, not merely a training-copy theory.
The complaint also makes a copyright-management-information claim under the DMCA, 17 U.S.C. §1202. It alleges that Anthropic processed the works in ways that removed or altered author information, copyright notices and other CMI, and that it did so intentionally, to create high-quality training data while concealing the true copyrighted sources of the training corpus.
Finally, the plaintiffs seek a permanent injunction, disgorgement, restitution, statutory damages, actual damages under the DMCA provisions, attorneys’ fees, and a jury trial.
Why the case is important
The case is important because it moves the battlefield from abstract technology policy to concrete evidence of sourcing conduct. AI companies prefer the broad frame: models learn from text in a transformative way, outputs are not copies, and imposing licensing would slow innovation. Rights holders prefer the narrower and more morally powerful frame: the AI company copied pirate libraries at industrial scale because clean licensing was slower, more expensive and less convenient.
That distinction is everything. Courts may be sympathetic to AI training where the defendant has lawful access, where the use is tightly connected to non-substitutive analysis, or where the outputs do not reproduce protected expression. But the optics change when the record is about shadow libraries, internal knowledge, torrenting, metadata stripping, permanent retention and the deliberate bypassing of licensing markets.
The Copyright Office’s 2025 report strengthens the plaintiffs’ narrative in one important respect. It stated that AI training can threaten markets not only where outputs are substantially similar to specific works, but also where outputs dilute markets for similar works; it also said that where licensing options exist or are likely feasible, that weighs against fair use under the fourth factor. Most pointedly, it concluded that copying expressive works from pirate sources to generate unrestricted content competing in the marketplace, when licensing is reasonably available, is unlikely to qualify as fair use.
That does not mean Cruz is an easy win. The “market dilution” theory remains legally immature and factually demanding. Judge Chhabria’s Meta decision shows the danger: he ruled for Meta because the authors had not developed enough evidence that Meta’s AI would dilute the market for their work, while also warning that unauthorized AI training could be unlawful in many circumstances and that the plaintiffs had made the wrong arguments on the wrong record.
So Cruz is important precisely because it is a second-generation complaint. It tries to absorb lessons from Bartz and Kadrey: do not rely only on “they trained on my book”; do not assume courts will infer market harm; do not treat output similarity as the only harm; and do not let a class settlement become the default price of past infringement.
What it means for litigation in this space
The larger implication is that AI copyright litigation is fragmenting into several different categories.
The first category is lawful-access training cases, where defendants argue that copying for model training is transformative and comparable to intermediate copying in earlier technology cases. These cases are difficult for plaintiffs unless they can show output substitution, memorization, licensing-market harm or concrete competitive damage.
The second category is pirate-source acquisition cases. These are much more dangerous for AI companies. Here the legal and factual question becomes: even if some training uses might be fair, can a company first steal or torrent the library and then ask the court to bless the downstream use? Bartz already suggests that the answer may be no, at least for a retained central library of pirated works. Cruz tries to turn that opening into statutory damages, willfulness and injunctive relief.
The third category is CMI and provenance cases. These could become very important because metadata stripping is easier for courts and juries to understand than model-weight memorization. If plaintiffs can show that AI developers intentionally removed author names, copyright notices, publisher identifiers or source markers to optimize training and obscure provenance, the dispute becomes less about innovation and more about concealment.
The fourth category is market-harm cases. This is where plaintiffs need the most discipline. It will not be enough to say that AI can write “in the style of” authors or flood the market with generic works. Plaintiffs need evidence: comparable licensing deals, substitution studies, expert market analysis, prompts showing competitive outputs, memorization/extraction tests, platform data, sales impact, price effects, discoverability harms, and proof that the defendant’s product is being sold into markets where the plaintiffs’ works have commercial value.
The fifth category is settlement-opt-out litigation. Cruz signals that class settlements may not end the AI copyright wars. If statutory damages for registered works are available, sophisticated authors, publishers or estates may decide that individual or coordinated opt-out claims are more valuable than class recovery. This matters especially where the defendant’s conduct appears willful and where the works are numerous, registered and traceable to known datasets.
The strengths and vulnerabilities of the Cruz strategy
The strength of Cruz is its moral and evidentiary framing. “Anthropic used my book for training” may sound technical and contestable. “Anthropic knowingly torrented millions of pirated books, stripped rights information, retained a permanent pirate library and built a commercial model from it” is a much harder story for a defendant to neutralize.
The complaint also benefits from prior proceedings. It leans heavily on facts surfaced in Bartz, including alleged internal knowledge about LibGen and PiLiMi. That makes the pleading feel less speculative than early AI copyright complaints, which often had to infer training use from dataset membership or model outputs.
But the vulnerabilities are real. First, Anthropic will almost certainly argue that Bartz already protects training uses as fair use, and that Cruz is trying to repackage settled training arguments as damages claims. Second, injunctions against deployed models are hard. Courts may hesitate to order model shutdowns or model-weight remedies unless plaintiffs can show ongoing infringement with precision. Third, market dilution is still underdeveloped. The more plaintiffs frame the harm as “Claude can produce similar genre works,” the more they risk drifting into protection for style, genre, ideas or competition itself—areas copyright law traditionally does not protect. Fourth, §1202 CMI claims have scienter and causation hurdles; plaintiffs must show not merely that metadata disappeared, but that removal was connected to infringement and done with the required knowledge or reasonable grounds.
Recommendations for other litigants
Other litigants against Anthropic or other AI makers should treat Cruz as a roadmap, but not as a template to copy blindly.
First, separate the defendant’s acts. Do not litigate “AI training” as one undifferentiated event. Plead acquisition, copying, cleaning, deduplication, metadata removal, storage, training, fine-tuning, retrieval, output generation, retention and commercial deployment as distinct uses. Fair use is use-specific; the same copy may be treated differently depending on why it was made and how it was used.
Second, build the piracy record before filing. The strongest cases will be those that can connect specific registered works to specific datasets, hashes, pirate-library records, torrent archives, internal training manifests, public dataset documentation, model cards, leaked training references or discovery from related cases. “My book might have been used” is weak. “My registered work appears on the defendant-linked Works List, in LibGen/PiLiMi/Books3, and in a dataset the defendant admits downloading” is much stronger.
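To make that concrete, here is a minimal sketch of what hash-based dataset matching can look like. It assumes a hypothetical manifest CSV with “md5” and “title” columns; shadow-library indexes commonly publish per-file MD5 hashes, but the schema, column names and file paths here are illustrative placeholders, not any real dataset’s published format.

```python
import csv
import hashlib
from pathlib import Path

def md5_of(path: Path) -> str:
    """Compute the MD5 digest of a file, reading in 1 MB chunks."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def match_against_manifest(files_dir: str, manifest_csv: str) -> list[tuple[str, str]]:
    """Return (filename, manifest title) pairs whose file hashes appear
    in the manifest. The 'md5' and 'title' columns are an assumed,
    illustrative schema for this sketch."""
    manifest: dict[str, str] = {}
    with open(manifest_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            manifest[row["md5"].strip().lower()] = row.get("title", "")
    hits = []
    for path in sorted(Path(files_dir).iterdir()):
        if not path.is_file():
            continue
        digest = md5_of(path)
        if digest in manifest:
            hits.append((path.name, manifest[digest]))
    return hits

if __name__ == "__main__":
    # Placeholder paths: a folder of the author's own edition files and
    # a manifest obtained in discovery or from public dataset records.
    for filename, title in match_against_manifest("my_editions", "dataset_manifest.csv"):
        print(f"{filename} matches manifest entry: {title}")
```

The point of exact-hash matching is evidentiary precision: a hash hit ties a specific file, not merely a title, to the dataset, which is far stronger proof of inclusion than inference from a book list.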
Third, do not over-rely on output similarity. Output evidence matters, especially memorization and near-verbatim extraction. But many courts will not require substantial similarity at output if the claim is about unauthorized reproduction during training or retention. Conversely, if the theory is market substitution or dilution, plaintiffs need economic and empirical evidence, not just alarming examples.
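As one small illustration of what extraction testing measures, here is a sketch of a crude near-verbatim overlap score using difflib from the Python standard library. The seed passage and the “model continuation” are placeholders (the Dickens lines are public domain); actually obtaining a model’s continuation, through whatever interface, is outside the sketch.

```python
import difflib

def verbatim_overlap(original: str, model_output: str) -> float:
    """Fraction of the original passage covered by the single longest
    verbatim match in the model output -- a crude near-verbatim signal."""
    matcher = difflib.SequenceMatcher(None, original, model_output, autojunk=False)
    match = matcher.find_longest_match(0, len(original), 0, len(model_output))
    return match.size / max(len(original), 1)

# Illustrative use: seed the model with the opening of a work, then score
# its continuation against the real text of the passage.
seed = "It was the best of times, it was the worst of times,"
true_continuation = "it was the age of wisdom, it was the age of foolishness"
model_continuation = "it was the age of wisdom, it was the age of credulity"  # hypothetical output

score = verbatim_overlap(true_continuation, model_continuation)
print(f"After seed {seed!r}, the longest verbatim run covers {score:.0%} of the passage")
```

Real extraction testing is far more involved (many seeds, sampling regimes, statistical thresholds), but even a simple longest-match score distinguishes paraphrase from near-verbatim reproduction, which is the distinction courts will care about.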
Fourth, make licensing markets concrete. Plaintiffs should collect evidence of existing AI licensing deals, negotiations, pricing benchmarks, internal defendant licensing discussions, opt-out practices, and the commercial value of clean, high-quality content. A fair-use fourth-factor argument improves dramatically when the market is not hypothetical.
Fifth, use CMI and provenance carefully. Metadata stripping can be powerful, especially for publishers and professional authors whose works contain rich rights information. But plaintiffs should be ready to prove what information existed, where it was removed, by what tool or process, why that removal mattered, and how it concealed or facilitated infringement.
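For ebooks specifically, much of that CMI lives in package metadata and is straightforward to document. The sketch below (with a placeholder file name) reads the Dublin Core fields an EPUB carries: an EPUB is a ZIP archive whose META-INF/container.xml points to an OPF package file containing title, author, publisher and rights statements, exactly the kind of information a §1202 claim turns on.

```python
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "cont": "urn:oasis:names:tc:opendocument:xmlns:container",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def extract_cmi(epub_path: str) -> dict[str, list[str]]:
    """List the Dublin Core metadata an EPUB carries. The container.xml
    file names the OPF package; the OPF's <metadata> block holds the
    author, title, publisher and rights statements."""
    with zipfile.ZipFile(epub_path) as z:
        container = ET.fromstring(z.read("META-INF/container.xml"))
        opf_path = container.find(".//cont:rootfile", NS).attrib["full-path"]
        opf = ET.fromstring(z.read(opf_path))
        cmi: dict[str, list[str]] = {}
        for field in ("title", "creator", "publisher", "rights", "identifier"):
            values = [el.text for el in opf.findall(f".//dc:{field}", NS) if el.text]
            if values:
                cmi[field] = values
        return cmi

if __name__ == "__main__":
    # "my_edition.epub" is a placeholder path for illustration.
    for field, values in extract_cmi("my_edition.epub").items():
        print(f"{field}: {'; '.join(values)}")
```

Running this kind of before-and-after comparison on a plaintiff’s own edition versus a copy recovered from a training dataset is one concrete way to show what rights information existed and what disappeared.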
Sixth, be realistic about injunctions. A demand to “shut down Claude” may be rhetorically satisfying but judicially difficult. More practical remedies might include deletion or quarantine of retained datasets, certified destruction of unauthorized copies, independent audits, training-data provenance logs, restrictions on further training or fine-tuning, output controls for memorized text, licensing obligations, and reporting requirements.
Seventh, coordinate to avoid bad precedent. The greatest risk for rights holders is not losing one case; it is creating broad adverse precedent through a weak record. Plaintiffs should coordinate expert evidence, market-harm theories, dataset forensics and remedy design. The Meta decision shows that courts may be sympathetic but still rule for AI defendants if plaintiffs fail to build the record.
Eighth, choose plaintiffs strategically. Registered works, clean ownership chains, strong commercial markets, evidence of dataset inclusion, identifiable licensing value and credible testimony about harm matter. Symbolic plaintiffs help with public narrative, but damages cases need evidentiary discipline.
The core lesson is simple: the next phase of AI copyright litigation will be won less by abstract arguments about whether machines “learn like humans” and more by forensic proof of how companies obtained, processed, retained and monetized copyrighted works. Cruz v. Anthropic matters because it tries to force that shift. It tells courts: do not let the glamour of AI obscure the supply chain. The legal question is not only what Claude can do now, but what Anthropic allegedly took, copied, stripped, stored and commercialized to make Claude possible.
