- Pascal's Chatbot Q&As
2024 Stanford LLM Lecture Analysis
Source: YouTube (27th August 2024)
by ChatGPT-5.2
Below are the claims / statements / “facts” in the transcript that could be relevant to rights owners litigating against AI companies—especially on issues like training data sourcing, copying at scale, knowledge of copyright risk, opacity, and operational feasibility of compliance. I’m phrasing these as “the speaker asserts” (i.e., potential evidentiary leads), not as independently verified truths.
A) Explicit copyright-liability / concealment admissions
“There’s a lot of secrecy… this is the key of most of the pretraining… companies don’t talk about how they do the data collection.”
Litigation relevance: argues intentional opacity around training inputs and data pipelines.
“And also there’s a copyright liability issue. They definitely don’t want to tell you that they’ve trained on books even though they did because if not [people] can sue them.”
Litigation relevance: direct statement implying knowledge of infringement risk and deliberate non-disclosure; also asserts books are included in training (at least in practice/industry) and that companies avoid admitting it for legal reasons.
B) “Train on all of internet” described as standard practice + scale of crawling
Pretraining described as “train your language model to essentially model all of internet.”
Relevance: frames training as broad ingestion of web content.
“Download all of internet… use web crawlers… go on every web page… or every web page that is on Google.”
Relevance: suggests extremely broad acquisition, consistent with mass copying.
“That is around 250 billion pages right now” and “around 1 petabyte of data.”
Relevance: quantifies scale; supports arguments that permissioning is non-trivial and that use is massive.
“Common Crawl is one [crawler]… every month adds all the new websites… found by Google… put it in a big dataset.”
Relevance: identifies a commonly used source, supports discovery requests around Common Crawl usage, pipelines, filters, provenance logs.
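The two scale figures quoted above can be cross-checked with back-of-envelope arithmetic (a sketch only; the 250-billion-page and 1-petabyte figures are the speaker's claims, not independently verified):

```python
# Sanity check on the speaker's crawl-scale figures.
pages = 250e9          # "around 250 billion pages"
raw_bytes = 1e15       # "around 1 petabyte of data"

bytes_per_page = raw_bytes / pages
print(bytes_per_page)  # -> 4000.0, i.e. ~4 KB of raw data per page on average

# At a rough ~4 bytes of text per token, 1 PB is an upper bound of
# ~250 trillion tokens before any filtering -- consistent with the
# heavy filtering ratios the speaker describes later.
upper_bound_tokens = raw_bytes / 4
print(upper_bound_tokens)
```

The point for litigation framing: the figures are internally consistent, and a petabyte at ~4 KB per page really does imply hundreds of billions of individual documents copied.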
C) Data pipeline steps that imply copying, retention, transformation, filtering choices
These are the “mechanics” that matter in litigation because they imply copying into local corpora, processing, selection, and often retention/derivative datasets.
Extraction: “First… extract the text from the HTML… looking at tags.”
Math extraction difficulty: “Extracting math is complicated but important.”
Boilerplate handling: “Headers/footers/menus… you don’t want to repeat all of this in your data.”
Filtering undesirable content: “NSFW, harmful content, PII.”
Blacklists: “Usually every company has basically a blacklist of websites that they don’t want to train their models on… very long.”
Relevance: implies companies can exclude sources by domain/URL lists; supports “feasibility of opt-out / exclusion” arguments.
Model-based filtering for PII: “Train a small model for classifying what is PII… removing these things.”
Relevance: indicates active classification/removal systems exist.
De-duplication: remove repeated content and “paragraphs that come from common books… duplicated 1,000 times or 10,000 times on internet.”
Relevance: (1) acknowledges books are present online at scale, (2) indicates they detect/handle repeated copyrighted passages, (3) suggests they can do similarity/dedup at scale.
Heuristic filtering to remove “low-quality documents” (token distribution outliers, extremely short/long pages, etc.).
Relevance: shows deliberate quality gates rather than “purely incidental” capture.
Wikipedia-link “quality” classifier: “Take all of Wikipedia… look at all links referenced… train a classifier to predict if a doc comes from Wikipedia references vs random web… want more of the Wikipedia-referenced things.”
Relevance: shows purposeful curation of “higher quality” sources; implies use of Wikipedia outbound links as a proxy whitelist.
Domain classification + reweighting: classify into “entertainment, books, code…” then “up or down weight some domains.”
Relevance: again shows curation decisions; also explicitly calls out books as a category that is often upweighted.
End-of-training “overfit on high quality data”: “Usually… overfit on Wikipedia… and human data that was collected.”
Relevance: reinforces that specific datasets are emphasized late in training for model behavior/quality.
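The pipeline steps above (extraction, blacklist filtering, PII removal, de-duplication) can be sketched in a few lines. This is an illustrative toy, not any company's actual pipeline; the blacklist domain and PII pattern are invented for the example:

```python
import hashlib
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Step 1: pull visible text out of HTML, skipping boilerplate tags."""
    SKIP = {"script", "style", "header", "footer", "nav"}
    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting level inside skipped tags
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1
    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1
    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

BLACKLIST = {"example-blocked.com"}            # hypothetical domain blacklist
PII_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # toy PII pattern (SSN-like)

def clean_corpus(pages, seen_hashes):
    """pages: iterable of (domain, html). Yields deduplicated clean text."""
    for domain, html in pages:
        if domain in BLACKLIST:                # step: blacklist filtering
            continue
        parser = TextExtractor()
        parser.feed(html)
        text = " ".join(parser.chunks)
        if PII_RE.search(text):                # step: PII filtering
            continue
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen_hashes:              # step: exact de-duplication
            continue
        seen_hashes.add(digest)
        yield text
```

Real pipelines use model-based classifiers and fuzzy (MinHash-style) de-duplication rather than exact hashes, but the control points are the same: every step is a deliberate inclusion/exclusion decision, which is what makes the "feasibility of exclusion" argument tractable.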
D) Statements implying inclusion of books / GitHub / PubMed etc. in training mixtures
Books explicitly treated as a domain to upweight: “Books is usually also another one that people usually upweight.”
Relevance: points to intentional inclusion and importance of book corpora.
Discussion of common benchmark mixture (“The Pile”) including:
“arXiv, PubMed Central… Wikipedia… Stack Exchange… GitHub… and some books…”
Relevance: even if framed as an academic benchmark, it normalizes the categories and can guide discovery.
Code tokenization change alleged as a “big change that GPT-4 did”: “One of the big changes that GPT-4 did is changing the way that they tokenize code… model couldn’t really understand code [before].”
Relevance: supports arguments that engineering decisions materially affect capability; less direct on rights, but relevant if code corpora are disputed.
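The tokenization claim is about whitespace handling: older tokenizers spent one token per space, while newer ones merge runs of whitespace, making indented code far cheaper and easier to model. A toy comparison (assumed behavior for illustration, not OpenAI's actual tokenizer):

```python
import re

def per_space_tokenize(code):
    """Old-style: every space is its own token (plus non-space runs)."""
    return re.findall(r" |\S+", code)

def merged_space_tokenize(code):
    """Newer-style: a run of spaces collapses into a single token."""
    return re.findall(r" +|\S+", code)

snippet = "def f(x):\n" + " " * 8 + "return x + 1"
old = per_space_tokenize(snippet)
new = merged_space_tokenize(snippet)
print(len(old), len(new))   # the 8-space indent costs 8 tokens vs 1
```

The gap widens with nesting depth, which is why the change mattered so much for code capability.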
E) Statements about “not enough data” and synthetic data as a response
“We don’t have enough data on the internet.”
“Synthetic data generation… big one right now… because we don’t have enough data on the internet.”
Relevance: may be used to argue that marginal value of additional copyrighted corpora is material; also may support defenses/claims about why they seek books and paywalled corpora.
F) Quantitative claims about training scale, tokens, and specific models
These can be useful for damages narratives, willfulness, and implied copying volume.
Training scale trend: “Started from around 150 billion tokens (~800 GB)… now around 15 trillion tokens… best models probably trained on that amount.”
Llama 2: “trained on 2 trillion tokens.”
Llama 3: “15 trillion tokens.”
GPT-4: “we don’t really [know], but… probably same order… probably around [~13T] from leaks, if leaks are true.”
Relevance: potentially supports arguments that defendants possess (or should possess) concrete accounting, even if public claims are vague.
Common Crawl filtering magnitude: says final usable corpora reflect “100 to 1,000 times filtering of the Common Crawl” (speaker’s back-of-envelope).
Relevance: suggests massive reduction pipelines exist; also implies they can precisely control inclusion/exclusion.
G) Train-test contamination / memorization-adjacent relevance
“Train test contamination… for companies… maybe not that important because they know what they trained on. For us, we have no idea.”
Relevance: implies companies have internal knowledge of training set composition; can support discovery demands for logs, manifests, or provenance.
Describes a technique to detect contamination: if examples are “more likely” when generated in original order vs shuffled order, that suggests presence in training.
Relevance: points to feasible forensic methods plaintiffs might deploy or ask experts to consider.
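The order-likelihood test described above can be sketched directly. Here `logprob(seq)` is a stand-in for a real model's scoring API (an assumption, since the transcript names no implementation), and the toy "model" at the bottom deterministically favors the canonical ordering to demonstrate the detector:

```python
import random

def contamination_score(examples, logprob, n_shuffles=100, seed=0):
    """Compare the likelihood of a benchmark's examples in their published
    order against random reorderings. If the model ingested the benchmark
    file during training, the original order tends to score higher.
    Returns the fraction of shuffles scoring strictly below the original;
    values near 1.0 suggest the ordering was memorized."""
    rng = random.Random(seed)
    original = logprob(examples)
    below = 0
    for _ in range(n_shuffles):
        perm = examples[:]
        rng.shuffle(perm)
        below += logprob(perm) < original
    return below / n_shuffles

# Toy demonstration: a "model" whose score counts ascending adjacent
# pairs, so the canonical sorted ordering scores highest.
canonical = list(range(10))
def toy_logprob(seq):
    return sum(a < b for a, b in zip(seq, seq[1:]))

score = contamination_score(canonical, toy_logprob)
print(score)
```

With a real model, `logprob` would sum token log-probabilities over the concatenated examples; a score clustered near 1.0 across benchmarks is the kind of forensic signal a plaintiff's expert could present.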
H) Economic/compute claims that can support “industrial-scale copying” framing
“Data is bigger than the modeling aspect” (re: staffing/effort).
“In Llama’s team… 70-ish people… maybe 15 work on data.”
Relevance: suggests substantial, dedicated data engineering for acquisition/filtering—useful for “this wasn’t accidental” narratives.
Llama 3 “400B / 405B params” training cost estimate and compute details:
“Trained on 15.6 [trillion] tokens” (speaker states “15.6 tokens” but context indicates trillion).
“405 billion parameters.”
FLOPs estimate: “3.8e25 flops.”
Hardware: “trained on 16,000 H100s.”
Duration: “~70 days / 26 million GPU hours” (speaker) and “they said 30 million GPU hours.”
Renting cost estimate: “H100… ~$2/hour… ~$52M” compute rental equivalent.
People cost estimate: “50 employees… $25M” (at $500k/yr assumption).
Total estimate: “~$75M.”
Relevance: helps show massive commercial scale and investment, relevant to unjust enrichment/damages and “commercial exploitation” arguments.
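The speaker's Llama 3 figures are mutually consistent under the standard ≈6·N·D training-FLOPs rule of thumb, which strengthens their evidentiary plausibility. A quick check (all inputs are the speaker's figures, not verified):

```python
N = 405e9                  # parameters (speaker's figure)
D = 15.6e12                # training tokens (speaker's figure)
flops = 6 * N * D          # standard ~6*N*D rule of thumb for training FLOPs
print(f"{flops:.2e}")      # -> 3.79e+25, matching the quoted 3.8e25

gpus, days = 16_000, 70
gpu_hours = gpus * days * 24
print(round(gpu_hours / 1e6, 1))   # -> 26.9 (million), vs the quoted ~26M

rental = gpu_hours * 2             # ~$2/hour per rented H100
people = 50 * 500_000              # 50 employees at ~$500k/year
total = rental + people
print(round(total / 1e6))          # -> 79 (million $), near the quoted ~$75M
```

That three independent quoted numbers (parameters, tokens, FLOPs) reconcile to within rounding suggests the speaker is relaying real engineering figures rather than loose guesses.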
Executive Order / scrutiny threshold claim:
“Executive order from Biden… once you have 1e26 flops… special scrutiny… they went 2x less… right below this to not have special scrutiny.”
Relevance: suggests regulatory gaming / threshold-aware behavior; may matter if transparency/notification duties are litigated.
I) Statements about post-training that may matter for output risks and factuality (indirectly rights-related)
These aren’t “training-data infringement” admissions, but can matter to claims about regurgitation, hallucination, and product behavior, depending on the lawsuit theory.
Supervised fine-tuning (SFT) does not “teach” new knowledge: “Knowledge is already in the pretrained LLM… you just specialize to one type of user.”
Relevance: supports argument that pretraining is where the copyrighted “knowledge” enters.
Hallucination hypothesis linked to SFT: If humans provide an answer/reference the model never saw in pretraining, SFT may cause it to “make up plausible sounding reference” rather than true citation.
Relevance: can support claims about reliability and attribution failures (and why provenance matters).
RLHF causes longer outputs: “If you’ve been annoyed at ChatGPT answering super long… this is because of RLHF.”
Relevance: can relate to substitutability and market harm (longer, more complete answers reduce need to click through).
Humans agree only ~66% on binary preference tasks (and authors themselves ~67–68%); models can reach higher agreement with “mode of humans” because “models have no variance.”
Relevance: can be used to argue that alignment and safety claims depend on noisy human labeling; less directly IP-related but could matter in consumer protection / reliability angles.
J) Practical feasibility statements relevant to “opt-out,” “filtering,” and “control” arguments
Companies maintain long blacklists of sites they won’t train on.
They use classification and heuristic pipelines to remove categories (PII, NSFW, “undesirable” content).
They do de-duplication at scale including for “common books” repeated online.
They can domain-classify and reweight (upweight books/code; downweight entertainment).
Relevance: collectively supports the argument that selective exclusion of rightsholders (or licensed-only pipelines) is technically feasible, even if costly.
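Mechanically, the exclusion and curation steps listed above reduce to set lookups and reweighted sampling. A toy sketch (domain names and weights invented for illustration; real systems operate at vastly larger scale but on the same primitives):

```python
import random

BLACKLIST = {"rightsholder-optout.example"}   # hypothetical opt-out list
DOMAIN_WEIGHTS = {"books": 2.0, "code": 1.5, "entertainment": 0.5}

def build_mixture(docs, seed=0):
    """docs: dicts with 'url_domain', 'category', 'text'. Drops blacklisted
    domains, then resamples by category weight -- the 'upweight books,
    downweight entertainment' step described in the lecture."""
    rng = random.Random(seed)
    mixture = []
    for d in docs:
        if d["url_domain"] in BLACKLIST:          # exclusion is a set lookup
            continue
        w = DOMAIN_WEIGHTS.get(d["category"], 1.0)
        copies = int(w) + (rng.random() < w % 1)  # stochastic rounding of w
        mixture.extend([d["text"]] * copies)
    return mixture
```

The simplicity is the point: excluding a rightsholder's domains is a one-line membership test in a pipeline companies already run, which undercuts "technical infeasibility" defenses even if licensed-only pipelines would be costly to assemble.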