Verdict: Training LLMs on copyrighted books can be lawful—but only when done with care, legal acquisition, and respect for the limits of fair use.

For creators and AI developers, the message is clear: legal clarity demands documentation, licensing, and transparency—not just ambition. Future court cases will likely hinge on similar distinctions.

The June 2025 Alsup Order on Fair Use and Copyright Infringement in Bartz et al. v. Anthropic

by ChatGPT-4o

I. Introduction

On June 23, 2025, Judge William Alsup of the U.S. District Court for the Northern District of California issued a consequential ruling in Bartz et al. v. Anthropic PBC. The order granted summary judgment in part and denied it in part on the issue of fair use in the context of AI training on copyrighted books. The case, involving allegations that Anthropic copied millions of books—both pirated and purchased—for use in training its Claude large language model (LLM), is the first major U.S. court decision on AI training and fair use since Warhol and Google v. Oracle. This essay analyzes the most surprising, controversial, and valuable findings, assesses implications for similar litigation, and offers strategic guidance to creators, rights owners, and AI developers.

II. Surprising Findings

  1. Clear Legality of LLM Training on Books (Under Certain Conditions)
The most surprising finding is that training an LLM on copyrighted books was found to be “quintessentially transformative” and a fair use—so long as the training was on legally acquired copies and did not result in outputs reproducing the originals. Judge Alsup analogized LLMs to humans reading, memorizing, and then emulating styles, ruling that this act alone does not infringe copyright.

  2. Format Shifting Print Books Is Fair Use
Digitizing legally purchased print books (destructive scanning) for internal use was ruled a narrow but valid fair use. This echoes and extends prior rulings such as Google Books and Sony Betamax, applying their logic to AI data ingestion without requiring public distribution.

  3. Retention of Pirated Copies Not Excused by Transformative End Use
Although AI training itself was fair use, retaining pirated copies, whether never used for training or used but never deleted, was not a fair use. The order firmly separated the legality of end uses from the illegality of acquiring source materials—rejecting any “blessing by intention” logic.

III. Controversial Rulings

  1. Memorization of Full Works by LLMs Not Infringing—If Not Output
Alsup accepted plaintiffs’ claim that the Claude LLMs “memorized” entire books almost verbatim, yet ruled that as long as no output reproduces the content or any substantial part of it, the training is lawful. This sharply diverges from earlier fears of LLMs acting as de facto copying machines.

  2. Dismissal of Derivative Work Arguments for Tokenized Copies
    The court did not reach the question of whether intermediate tokenized copies were derivative works, due to a plaintiff concession. However, Alsup hinted that if a digital copy merely replaces a print version without new expressive content, it might not qualify as a derivative work at all.

  3. Fair Use Not Voided by Commercial Intent
    Despite Claude generating over $1 billion in revenue, the court emphasized that commercial benefit alone does not weigh heavily against fair use if the usage is transformative and doesn’t displace the original work’s market.

IV. Valuable Findings

  1. Use-Based Fair Use Analysis
    The ruling reaffirmed that each use of a work must be analyzed independently (e.g., digitization vs. training vs. output). This provides a blueprint for parsing complex AI workflows in future cases.

  2. No Copyright in Style or Writing Quality
    Emulating the style or quality of writing—what LLMs often aim for—is not copyright infringement. This limits claims based on "vibes," tone, or expressive style unless specific text is reproduced.

  3. Bad Faith Acquisition Matters for Liability, Not Fair Use
The court applied an objective use test per Warhol but clarified that bad faith remains relevant to willfulness and damages, not to the fair use analysis itself.

V. Implications for Other Court Cases

  • Encouragement for Plaintiffs to Differentiate Between Source and Output
    Creators seeking damages must now disaggregate allegations: were the copies used in training legally obtained? Were outputs infringing? Vague or bundled claims are likely to fail.

  • LLMs May Lawfully Train on Purchased Texts
    Provided companies buy books (even for destructive scanning), courts may protect that as fair use, reducing legal risk for developers who avoid pirated data.

  • AI Companies Cannot Ignore Chain of Title
Pirated training materials expose developers to liability even if they are never used in training. Courts will view their retention as unjustified infringement.

  • Class Certification May Hinge on Source Type
    The pending class motion—differentiating between pirated and purchased books—may set a template for future collective actions.

VI. Strategic Use of the Verdict for Creators and Rights Owners

  1. Demand Proof of Source and Usage Chains
    This ruling empowers rights holders to demand transparency on how data was sourced, whether books were used in training, and how long they were retained.

  2. Push for Discovery on Retention and Repurposing
    Alsup noted Anthropic withheld a spreadsheet detailing what was used in training—highlighting how discovery on internal libraries can be key to litigation strategy.

  3. Challenge Long-Term Library Retention Practices
    Rights holders can focus on library-building ambitions (e.g., “everything forever”) as distinct from training-related fair use to argue infringement and damages.

  4. Negotiate Licensing Terms Based on Use, Format, and Duration
    Content owners may accept destructive scanning under license for limited, controlled use—but insist on deletion, non-reuse, or payment for long-term storage or redistribution.

VII. Recommendations for AI Developers to Minimize Risk and Liability

  1. Avoid Pirated Datasets Entirely
Even when ultimately used for transformative training, pirated sources are indefensible under this ruling. Pay for access or obtain public domain/Creative Commons texts.

  2. Implement Data Lifecycle Controls
    Set expiration, deletion, or usage boundaries for each book. Avoid “forever” libraries unless backed by licenses.

  3. Maintain Detailed Provenance Logs
    Keep granular logs tracing every source, transformation, and use of copyrighted works—including books never used in training. A minimal illustrative sketch of such a log follows this list.

  4. Enable Output Safeguards Against Memorization
    Filtering and guardrails that prevent memorized output bolster the fair use argument—especially when paired with audit logs. A sketch of one such filter also appears after this list.

  5. Use Transformative Purpose Statements and Limit External Distribution
    Internal-only copies for transformative use—like digitizing books for model testing or data curation—should be tightly scoped and well-documented.
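
To make recommendations 2 and 3 concrete, here is a minimal Python sketch of a per-work provenance record with a lifecycle deadline. It is purely illustrative: the ProvenanceRecord name, its fields, and the one-year retention policy are assumptions chosen for demonstration, not anything prescribed by the order.

```python
# Illustrative only: field names and the retention policy are assumptions
# for demonstration, not requirements stated in the Alsup order.
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional

@dataclass
class ProvenanceRecord:
    """One per-work entry in a training-data provenance log."""
    work_id: str                      # internal identifier for the book
    title: str
    source: str                       # e.g., "purchased_print_scan", "licensed_ebook"
    license_terms: str                # e.g., "internal use only; delete after 1 year"
    acquired_on: date
    used_in_training: bool = False
    training_runs: list[str] = field(default_factory=list)  # IDs of runs that ingested it
    delete_by: Optional[date] = None  # lifecycle boundary; None means no deadline set

    def is_overdue_for_deletion(self, today: date) -> bool:
        """Flag copies retained past their licensed lifecycle."""
        return self.delete_by is not None and today > self.delete_by

# Example: a destructively scanned purchased book, licensed for one year of
# internal use and ingested by a single (hypothetical) training run.
record = ProvenanceRecord(
    work_id="bk-00421",
    title="Example Title",
    source="purchased_print_scan",
    license_terms="internal use only; delete 1 year after acquisition",
    acquired_on=date(2025, 1, 15),
    used_in_training=True,
    training_runs=["run-2025-03-a"],
    delete_by=date(2025, 1, 15) + timedelta(days=365),
)
print(record.is_overdue_for_deletion(date(2026, 6, 1)))  # True: held past the deadline
```

Records like this, exported regularly, are exactly the kind of documentation the ruling rewards: they answer how a work was sourced, whether it was used in training, and how long it was retained.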
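
Recommendation 4 can be sketched the same way. One common guardrail (not mandated by the court) is an n-gram overlap check: before returning a generation, compare its long word n-grams against an index built from the training corpus and block responses that reproduce a substantial verbatim span. The 8-word window and 20% threshold below are assumed values for illustration.

```python
# Illustrative n-gram overlap filter. The 8-word window and the blocking
# threshold are assumptions for demonstration, not legal standards.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Split text into lowercase word n-grams."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_corpus_index(documents: list[str], n: int = 8) -> set[tuple[str, ...]]:
    """Index every n-gram seen in the training corpus (toy in-memory version)."""
    index: set[tuple[str, ...]] = set()
    for doc in documents:
        index |= ngrams(doc, n)
    return index

def looks_memorized(output: str, corpus_index: set[tuple[str, ...]],
                    n: int = 8, max_overlap: float = 0.2) -> bool:
    """Flag output whose verbatim n-gram overlap with the corpus is too high."""
    out_grams = ngrams(output, n)
    if not out_grams:
        return False
    return len(out_grams & corpus_index) / len(out_grams) > max_overlap

# Toy example with a one-document "corpus".
corpus = ["it was the best of times it was the worst of times it was the age of wisdom"]
index = build_corpus_index(corpus)
print(looks_memorized("it was the best of times it was the worst of times", index))   # True
print(looks_memorized("a wholly original sentence about guardrails and audit logs", index))  # False
```

In production the index would live in a scalable store (e.g., a Bloom filter) rather than an in-memory set, but the audit principle is the same: log what was checked and what was blocked, so the safeguard itself becomes evidence.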

VIII. Conclusion

Judge Alsup’s ruling is both a roadmap and a warning. It affirms that training LLMs on copyrighted books can be lawful—but only when done with care, legal acquisition, and respect for the limits of fair use. It rejects opportunistic data hoarding and piracy masquerading as innovation. For creators and AI developers alike, the message is clear: legal clarity demands documentation, licensing, and transparency—not just ambition. Future court cases will likely hinge on similar distinctions, and this decision is now a foundational precedent in the evolving terrain of generative AI and copyright law.