Pascal's Chatbot Q&As
GPT-4o: Yes, the case of Hachette Book Group, Inc., et al. v. Internet Archive is highly relevant when considering the use of copyrighted works without consent in training Large Language Models (LLMs)

This ruling could have significant implications for how copyrighted material is used in AI training, particularly if it doesn't involve obtaining consent or paying for usage.

Pascal Hetzscholdt
September 05, 2024

Asking ChatGPT-4o and Perplexity: please read the “HACHETTE BOOK GROUP, INC., HARPERCOLLINS PUBLISHERS L.L.C., JOHN WILEY & SONS, INC., PENGUIN RANDOM HOUSE LLC v. INTERNET ARCHIVE” decision as well as this LinkedIn post and answer the question: “Do you think this may have relevance wrt using works without consent in training LLMs?”

ChatGPT-4o’s answer:
Yes, the case of Hachette Book Group, Inc., et al. v. Internet Archive is highly relevant when considering the use of copyrighted works without consent in training Large Language Models (LLMs). The court's decision centers around whether copying and distributing copyrighted works without permission constitutes fair use under the Copyright Act. In this case, the court ruled that Internet Archive's practice of scanning books and offering them for free distribution, without compensating the copyright holders, was not fair use. The key factor was that the works were reproduced in full, serving the same purpose as the originals, without adding transformative value (23-1260_opn).

This decision can be compared to the use of copyrighted materials in training LLMs. When copyrighted works are used without consent to train LLMs, similar issues arise: the LLMs are utilizing and potentially distributing those works in some form (e.g., summaries or generated content based on them), raising concerns about whether this use is transformative or infringes on the rights of the original creators.

The LinkedIn post echoes this sentiment, emphasizing that LLMs may not reproduce full works but still rely on them for training, potentially undermining creators' rights. Given the court's emphasis on protecting the copyright holders' market and rights to control how their works are used, this ruling could have significant implications for how copyrighted material is used in AI training, particularly if it doesn't involve obtaining consent or paying for usage.

The key considerations from the Hachette Book Group, Inc., et al. v. Internet Archive decision that are most relevant to AI developments, particularly in the context of using copyrighted materials to train AI models, include the following:

Transformative Use: One of the central issues in the decision was whether Internet Archive's use of copyrighted works was transformative. The court concluded that Internet Archive's digital library was not transformative because it did not add new meaning or purpose to the original works—it simply reproduced them in a different format (digital). For AI, this consideration is highly relevant because the use of copyrighted works in training models could be seen as non-transformative if the AI is essentially using the original works without significantly altering or adding new meaning to them.
Commercial Nature: The court considered whether the use of copyrighted works had a commercial nature, which weighs against fair use. Although Internet Archive is a nonprofit, the court still found that it exploited the works by offering a service that could replace the original market (e.g., eBooks). For AI, the potential commercial gain from models trained on copyrighted materials without permission—whether or not the AI-generated outputs directly compete with the original works—raises similar concerns.
Market Impact: A critical factor was the effect on the market for the original works. The court ruled that Internet Archive’s practice of offering free digital copies undermined the market for eBooks and print copies, which directly affected the copyright holders' ability to monetize their works. In the context of AI, using copyrighted works without consent in training models could similarly impact the market for the original works, particularly if the AI outputs replace or reduce the demand for the original content.
One-to-One Controlled Digital Lending (CDL) Comparison: The court rejected Internet Archive’s Controlled Digital Lending (CDL) model as a defense, noting that even with a one-to-one ratio between physical and digital copies, the digital reproduction was still not authorized and was harmful to the market. For AI, this relates to the issue of "scraping" or using large datasets without permission, even if the content isn’t directly replicated or distributed. The key point is that any use of copyrighted material that bypasses the creators' right to control its distribution is likely problematic.
Fair Use Defense: The court rejected Internet Archive's fair use defense under Section 107 of the Copyright Act because the use was neither transformative nor non-commercial, and it substantially affected the market for the original works. For AI models trained on copyrighted materials, developers would need to prove that the use is transformative, non-commercial, or falls within a legally protected category of fair use, which could be challenging depending on the specifics of how the data is used.
Reproduction of Works in Full: The decision emphasized that Internet Archive reproduced entire books, which weighed heavily against a fair use finding. AI models typically use large datasets that may include entire works, raising similar concerns about whether reproducing entire copyrighted works for training purposes violates copyright law.

These considerations highlight the legal complexities AI developers face when using copyrighted materials without permission, particularly regarding whether the use can be considered transformative, the potential commercial impact, and the market harm to copyright holders.

Perplexity’s answer:
The relevance of the Hachette v. Internet Archive decision to the use of copyrighted works without consent in training large language models (LLMs) is a complex issue that touches on several key aspects of copyright law and fair use.

Key Points from the Decision

The court ruled that Internet Archive's practice of scanning and distributing digital copies of copyrighted books without permission from publishers was not fair use.

This decision hinged on several factors:

Purpose and Character of Use: While Internet Archive is a non-profit organization, its use was not considered transformative enough to justify copying entire books.
Nature of Copyrighted Work: The books in question were creative works, which receive stronger copyright protection.
Amount and Substantiality: Internet Archive copied entire books, which weighed against fair use.
Effect on Potential Market: The court found that Internet Archive's practices could negatively impact the market for e-books.

Relevance to LLM Training

This decision may have significant implications for the use of copyrighted works in training LLMs:

Similarities:

Like Internet Archive, LLM developers often use copyrighted materials without explicit permission.
The scale of copying is extensive in both cases.

Potential Differences:

Transformative Use: LLM training might be considered more transformative than simply digitizing books, as the end product (an AI model) is quite different from the original works.
Market Impact: The impact on the market for original works may be less direct with LLMs compared to distributing digital copies of books.
Nature of Use: LLMs don't typically reproduce entire works verbatim, which could be seen as less infringing than Internet Archive's approach.

Potential Implications

This ruling could prompt increased scrutiny of the use of copyrighted materials in AI training.
It may lead to legal challenges against AI companies, arguing that training on copyrighted works without permission infringes on copyright.
The decision might encourage AI companies to seek licensing agreements or develop alternative data sourcing methods.
It could spark debates about whether current copyright laws are suitable for the AI era, potentially leading to new legislation or legal interpretations.

While the Hachette v. Internet Archive case doesn't directly address LLM training, it does highlight the ongoing tension between technological innovation and copyright protection. As AI continues to advance, these legal questions will likely become increasingly important and may require further clarification from courts or legislators.

Based on the court decision in Hachette v. Internet Archive, several key considerations are particularly relevant to AI developments, especially regarding the use of copyrighted works for training large language models:

Purpose and Character of Use: The court examined whether Internet Archive's use of copyrighted works was transformative. This is highly relevant for AI, as companies may argue that using copyrighted works to train AI models is transformative.
Effect on Potential Market: The court considered how Internet Archive's practices impacted the market for e-books. For AI, this raises questions about how training models on copyrighted works might affect the market for those works.
Amount and Substantiality of Portion Used: Internet Archive copied entire books, which weighed against fair use. AI training often involves ingesting entire works, which could be a similar concern.
Nature of Copyrighted Work: The court noted that creative works receive stronger copyright protection. This could be relevant for AI training on various types of copyrighted material.
Commercial vs. Non-profit Use: While Internet Archive is non-profit, this alone did not justify its practices. This distinction could be important for both non-profit and commercial AI developers.
Authorization and Licensing: The court emphasized that Internet Archive acted without authorization from copyright holders. This raises questions about the need for licensing or permission for AI training data.

These considerations provide a framework for evaluating the potential legal challenges that AI companies might face when using copyrighted works for training purposes without explicit permission.