
GPT-4o: Lawyers representing the authors might be less interested in seeing tokenized training data and more focused on how closely the model’s generated text aligns with their clients’ works.

The real legal and ethical intrigue revolves around how these models use the training data to generate outputs (...) The training phase is abstract, but the outputs could pose clear legal challenges.

Question 1 of 3 for ChatGPT-4o: Please read the news article "OpenAI to reveal secret training data in copyright case – for lawyers' eyes only" and tell me what it says

ChatGPT-4o’s answer:
The article from The Register titled "OpenAI to reveal secret training data in copyright case – for lawyers' eyes only" discusses how OpenAI has agreed to allow attorneys representing several authors suing the company to view the data used to train its AI models. These authors, including well-known names like Sarah Silverman and Michael Chabon, claim that OpenAI's models were trained on their books in violation of copyright law.

A US judge issued an order allowing the plaintiffs' legal team to access the training data under strict conditions. The data will be made available in a secure room with no internet access or devices permitted. OpenAI's legal team will also review any notes made during the process. This high level of security is likely due to concerns that revealing the training data could lead to further lawsuits if it’s shown that a large volume of copyrighted content was used without permission.

OpenAI maintains that its use of copyrighted material falls under "transformative fair use" since the models are designed to generate new, original content rather than reproducing the training data verbatim. However, the lawsuit questions whether the AI systems infringe by creating content too similar to the original works, despite OpenAI's stance that its models rely on extracting general patterns and statistical data.

There are ongoing debates in the legal community regarding the application of copyright law to AI, and several proposed laws in both the US and Europe aim to increase transparency around the data used to train these models.


Question 2 of 3 for ChatGPT-4o: Does the judge's approach make sense to you? What will the legal team be able to actually see and determine?

ChatGPT-4o’s answer:
The judge's approach in this case makes sense given the balance it strikes between protecting OpenAI's proprietary interests and allowing the plaintiffs' legal team access to critical information relevant to their copyright claims. By providing the data in a highly controlled environment—without internet access, personal devices, or the ability to copy data—the court is safeguarding OpenAI's intellectual property, which is considered highly valuable and sensitive, while also addressing the need for transparency in determining if copyrighted material was improperly used.

However, the restrictive nature of this access raises questions about how much the legal team can realistically glean from the training data. The team will likely only see a subset of the training data or summaries of how specific books were used in the AI training process, without being able to extract any useful documents or lists. This may hinder their ability to conduct a thorough analysis or fully understand the scope of potential copyright infringement.

That said, even in this limited capacity, the legal team could still accomplish several things:

  1. Identify Specific Sources: They may be able to determine whether particular books or passages were included in the training data, which would be essential to their copyright claims (a minimal sketch of such a check follows this list).

  2. Assess Extent of Use: They might get an idea of how extensively certain copyrighted materials were used during the training process (i.e., was the entire book used or just excerpts?).

  3. Examine Methods of Use: The legal team can observe the methods used to process the data. For example, if OpenAI processed copyrighted materials in ways that fundamentally change them (transformative use), it may bolster OpenAI's "fair use" defense. Alternatively, if the system appears to retain more of the original structure or style, it could support the authors' claims.

  4. Test Fair Use Defense: While OpenAI argues that its use of copyrighted works is transformative, the legal team will scrutinize whether the AI’s outputs are so similar to the training data that they might be considered derivative works, rather than new content.
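
To make the first point concrete, here is a minimal sketch of the kind of membership check reviewers might run in the secure room: does a known passage from a plaintiff's book appear verbatim in a corpus? The directory layout, file format, and sample passage are hypothetical assumptions; a real review would also have to handle encoding differences, reformatting, and near-duplicates.

```python
# Hypothetical sketch: scan a corpus directory for a verbatim passage.
# Paths, file format, and the passage are illustrative assumptions.
from pathlib import Path

def passage_in_corpus(passage: str, corpus_dir: str) -> list[str]:
    """Return the corpus files that contain the passage verbatim."""
    normalized = " ".join(passage.split()).lower()  # collapse whitespace, ignore case
    hits = []
    for path in Path(corpus_dir).rglob("*.txt"):
        text = " ".join(path.read_text(errors="ignore").split()).lower()
        if normalized in text:
            hits.append(str(path))
    return hits

# Example (hypothetical paths):
# hits = passage_in_corpus("A distinctive sentence from the book.", "/secure/corpus")
```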

Ultimately, what the legal team will be able to see is enough to make some basic determinations about whether specific copyrighted works were used in a manner that could give rise to an infringement claim. However, they may not get a complete or fully contextualized view, and the inability to record or copy information could limit the depth of their analysis.

This approach seems cautious but appropriate given the legal complexities and stakes involved in protecting proprietary algorithms while addressing concerns of copyright holders.


Question 3 of 3 for ChatGPT-4o: Would you agree that a) the data has already been collected through web scraping and other means, b) it has already been tokenized, aka crushed and fragmented, and c) the training data in that state might not be that interesting, but the subsequent model output and the LLM's capability to create derivatives or even verbatim portions of any copyrighted training data is a lot more interesting and possibly relevant as well?

ChatGPT-4o’s answer:
Yes, I would agree with your points, and here's a breakdown of why:

a) The data has already been collected through web scraping and other means

This is a widely accepted fact in the AI industry. Large language models (LLMs) like OpenAI’s GPT models are trained on vast datasets that often include content scraped from the web, books, and other publicly available or copyrighted sources. OpenAI has even acknowledged that publicly accessible data is often included in these datasets. The controversy arises because many of these datasets are scraped without explicit consent from content creators, raising legal and ethical questions, especially in the context of copyright.
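
For context, the collection step described here typically looks something like the following generic sketch, using the common requests and BeautifulSoup libraries; the URL is a placeholder, and nothing here reflects OpenAI's actual pipeline.

```python
# A generic sketch of web-scraped text collection, assuming the
# requests and beautifulsoup4 packages. The URL is a placeholder;
# this does not represent any specific model's data pipeline.
import requests
from bs4 import BeautifulSoup

def fetch_page_text(url: str) -> str:
    """Download a page and reduce it to plain text, as a crawler might."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    for tag in soup(["script", "style"]):  # drop non-prose markup
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)

# text = fetch_page_text("https://example.com/some-article")  # placeholder URL
```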

b) The data has already been tokenized (crushed and fragmented)

When LLMs are trained, they don’t retain full, continuous representations of the data. Instead, the content is tokenized—broken down into small fragments, such as individual words, subwords, or characters, and transformed into numerical representations. This tokenization abstracts the original works into patterns that the model uses for language understanding, rather than memorizing entire passages in their original form.
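
A minimal illustration of this step, using OpenAI's open-source tiktoken tokenizer; the sample sentence is an arbitrary example, not drawn from any training set:

```python
# Tokenization demo with OpenAI's open-source tiktoken library
# (pip install tiktoken). The sentence is an arbitrary example.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models

text = "It was a bright cold day in April."
token_ids = enc.encode(text)

# Each integer ID maps back to a subword fragment, showing how prose is
# fragmented into pieces rather than stored as continuous passages.
for tid in token_ids:
    print(tid, repr(enc.decode([tid])))
```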

This is important because tokenization itself is usually seen as a transformative process. In theory, the model isn't "storing" complete works but learning relationships between words and concepts, which supports the "fair use" defense. However, while this may legally protect the training phase, it doesn’t completely eliminate concerns over potential reproduction of the original copyrighted material.

c) The training data in that state might not be that interesting, but the model output and the ability to create derivatives (or even verbatim portions) are more relevant

You’re exactly right here. The tokenized, fragmented training data may not be of much interest by itself because it’s so far removed from the original works. In fact, the training data likely won’t offer any obvious instances of copyright infringement on its own—it's too abstracted and transformed.

The real concern lies in the model's outputs, which can, at times, reproduce portions of the original content—whether verbatim or close approximations—and generate derivative works that are stylistically similar to copyrighted material. This is where the legal gray area deepens:

  • Verbatim reproduction: If the model outputs large sections of text that are clearly identifiable as coming from a specific book or copyrighted work, this could be direct copyright infringement (see the detection sketch after this list).

  • Derivative content: Even when the output is not a word-for-word copy, the model’s ability to produce content that mimics the style, structure, or thematic elements of copyrighted works (especially when prompted) could still raise legal concerns.
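
As a rough illustration of how verbatim overlap can be flagged, here is a sketch based on shared word n-grams; the n-gram length and any thresholds are illustrative assumptions, not a legal test.

```python
# Sketch: flag verbatim overlap via shared word n-grams. The n-gram
# length is an illustrative assumption, not a legal standard.

def shared_ngrams(output: str, reference: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the word n-grams that appear in both texts."""
    def ngrams(text: str) -> set[tuple[str, ...]]:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    return ngrams(output) & ngrams(reference)

# Long shared runs (e.g. eight or more consecutive words) are a common
# heuristic signal of copying; short overlaps are usually coincidental.
# overlaps = shared_ngrams(model_output, book_text)
```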

The Bigger Picture:

The ability of LLMs to generate content that closely mirrors copyrighted materials makes the output stage of the AI far more interesting and potentially problematic. The legal question is whether the creation of "new" content is sufficiently transformative to avoid copyright infringement. The concern isn't just the training data itself, but how the model processes and outputs content in ways that could reproduce or closely imitate copyrighted works.

In copyright law, the focus is often on the outputs rather than the mechanisms used to produce those outputs. This is why lawyers representing the authors in this case might be less interested in seeing tokenized training data and more focused on how closely the model’s generated text aligns with their clients’ works.
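
One crude way to quantify that alignment, sketched here with Python's standard-library difflib; the ratio is a similarity heuristic, not a legal standard for infringement:

```python
# Sketch: score how closely generated text aligns with a source work,
# using the standard-library difflib. A heuristic, not a legal test.
from difflib import SequenceMatcher

def alignment_score(generated: str, original: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, generated.lower(), original.lower()).ratio()

# score = alignment_score(model_output, book_excerpt)
# Values near 1.0 suggest near-verbatim overlap; mid-range values may
# indicate paraphrase or stylistic imitation worth closer review.
```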

So, yes, the real legal and ethical intrigue revolves around how these models use the training data to generate outputs, and whether those outputs cross the line into reproducing or creating derivatives of copyrighted materials. The training phase is abstract, but the outputs could pose clear legal challenges.