- Pascal's Chatbot Q&As
2024 Stanford LLM Lecture Analysis
Source: YouTube (27th August 2024)
by ChatGPT-5.2
Below are the claims / statements / “facts” in the transcript that could be relevant to rights owners litigating against AI companies—especially on issues like training data sourcing, copying at scale, knowledge of copyright risk, opacity, and operational feasibility of compliance. I’m phrasing these as “the speaker asserts” (i.e., potential evidentiary leads), not as independently verified truths.
A) Explicit copyright-liability / concealment admissions
“There’s a lot of secrecy… this is the key of most of the pretraining… companies don’t talk about how they do the data collection.”
Litigation relevance: argues intentional opacity around training inputs and data pipelines.
“And also there’s a copyright liability issue. They definitely don’t want to tell you that they’ve trained on books even though they did because if not [people] can sue them.”
Litigation relevance: direct statement implying knowledge of infringement risk and deliberate non-disclosure; also asserts books are included in training (at least in practice/industry) and that companies avoid admitting it for legal reasons.
B) “Train on all of internet” described as standard practice + scale of crawling
Pretraining described as “train your language model to essentially model all of internet.”
Relevance: frames training as broad ingestion of web content.
“Download all of internet… use web crawlers… go on every web page… or every web page that is on Google.”
Relevance: suggests extremely broad acquisition, consistent with mass copying.
“That is around 250 billion pages right now” and “around 1 petabyte of data.”
Relevance: quantifies scale; supports arguments that permissioning is non-trivial and that use is massive.
“Common Crawl is one [crawler]… every month adds all the new websites… found by Google… put it in a big dataset.”
Relevance: identifies a commonly used source, supports discovery requests around Common Crawl usage, pipelines, filters, provenance logs.
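The two scale figures quoted above can be cross-checked with back-of-envelope arithmetic (a sketch only; the 250-billion-page and 1-petabyte figures are the speaker's claims, not independently verified):

```python
# Sanity check on the speaker's crawl-scale figures.
pages = 250e9          # "around 250 billion pages"
raw_bytes = 1e15       # "around 1 petabyte of data"

bytes_per_page = raw_bytes / pages
print(bytes_per_page)  # -> 4000.0, i.e. ~4 KB of raw data per page on average

# At a rough ~4 bytes of text per token, 1 PB is an upper bound of
# ~250 trillion tokens before any filtering -- consistent with the
# heavy filtering ratios the speaker describes later.
upper_bound_tokens = raw_bytes / 4
print(upper_bound_tokens)
```

The point for litigation framing: the figures are internally consistent, and a petabyte at ~4 KB per page really does imply hundreds of billions of individual documents copied.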
C) Data pipeline steps that imply copying, retention, transformation, filtering choices
These are the “mechanics” that matter in litigation because they imply copying into local corpora, processing, selection, and often retention/derivative datasets.
Extraction: “First… extract the text from the HTML… looking at tags.”
Math extraction difficulty: “Extracting math is complicated but important.”
Boilerplate handling: “Headers/footers/menus… you don’t want to repeat all of this in your data.”
Filtering undesirable content: “NSFW, harmful content, PII.”
Blacklists: “Usually every company has basically a blacklist of websites that they don’t want to train their models on… very long.”
Relevance: implies companies can exclude sources by domain/URL lists; supports “feasibility of opt-out / exclusion” arguments.
Model-based filtering for PII: “Train a small model for classifying what is PII… removing these things.”
Relevance: indicates active classification/removal systems exist.
De-duplication: remove repeated content and “paragraphs that come from common books… duplicated 1,000 times or 10,000 times on internet.”
Relevance: (1) acknowledges books are present online at scale, (2) indicates they detect/handle repeated copyrighted passages, (3) suggests they can do similarity/dedup at scale.
Heuristic filtering to remove “low-quality documents” (token distribution outliers, extremely short/long pages, etc.).
Relevance: shows deliberate quality gates rather than “purely incidental” capture.
Wikipedia-link “quality” classifier: “Take all of Wikipedia… look at all links referenced… train a classifier to predict if a doc comes from Wikipedia references vs random web… want more of the Wikipedia-referenced things.”
Relevance: shows purposeful curation of “higher quality” sources; implies use of Wikipedia outbound links as a proxy whitelist.
Domain classification + reweighting: classify into “entertainment, books, code…” then “up or down weight some domains.”
Relevance: again shows curation decisions; also explicitly calls out books as a category that is often upweighted.
End-of-training “overfit on high quality data”: “Usually… overfit on Wikipedia… and human data that was collected.”
Relevance: reinforces that specific datasets are emphasized late in training for model behavior/quality.
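The pipeline steps above (extraction, blacklist filtering, PII removal, de-duplication) can be sketched in a few lines. This is an illustrative toy, not any company's actual pipeline; the blacklist domain and PII pattern are invented for the example:

```python
import hashlib
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Step 1: pull visible text out of HTML, skipping boilerplate tags."""
    SKIP = {"script", "style", "header", "footer", "nav"}
    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting level inside skipped tags
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1
    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1
    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

BLACKLIST = {"example-blocked.com"}            # hypothetical domain blacklist
PII_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # toy PII pattern (SSN-like)

def clean_corpus(pages, seen_hashes):
    """pages: iterable of (domain, html). Yields deduplicated clean text."""
    for domain, html in pages:
        if domain in BLACKLIST:                # step: blacklist filtering
            continue
        parser = TextExtractor()
        parser.feed(html)
        text = " ".join(parser.chunks)
        if PII_RE.search(text):                # step: PII filtering
            continue
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen_hashes:              # step: exact de-duplication
            continue
        seen_hashes.add(digest)
        yield text
```

Real pipelines use model-based classifiers and fuzzy (MinHash-style) de-duplication rather than exact hashes, but the control points are the same: every step is a deliberate inclusion/exclusion decision, which is what makes the "feasibility of exclusion" argument tractable.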
D) Statements implying inclusion of books / GitHub / PubMed etc. in training mixtures
Books explicitly treated as a domain to upweight: “Books is usually also another one that people usually upweight.”
Relevance: points to intentional inclusion and importance of book corpora.
Discussion of common benchmark mixture (“The Pile”) including:
“arXiv, PubMed Central… Wikipedia… Stack Exchange… GitHub… and some books…”
Relevance: even if framed as an academic benchmark, it normalizes the categories and can guide discovery.
Code tokenization change alleged as a “big change that GPT-4 did”: “One of the big changes that GPT-4 did is changing the way that they tokenize code… model couldn’t really understand code [before].”
Relevance: supports arguments that engineering decisions materially affect capability; less direct on rights, but relevant if code corpora are disputed.
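The tokenization claim is about whitespace handling: older tokenizers spent one token per space, while newer ones merge runs of whitespace, making indented code far cheaper and easier to model. A toy comparison (assumed behavior for illustration, not OpenAI's actual tokenizer):

```python
import re

def per_space_tokenize(code):
    """Old-style: every space is its own token (plus non-space runs)."""
    return re.findall(r" |\S+", code)

def merged_space_tokenize(code):
    """Newer-style: a run of spaces collapses into a single token."""
    return re.findall(r" +|\S+", code)

snippet = "def f(x):\n" + " " * 8 + "return x + 1"
old = per_space_tokenize(snippet)
new = merged_space_tokenize(snippet)
print(len(old), len(new))   # the 8-space indent costs 8 tokens vs 1
```

The gap widens with nesting depth, which is why the change mattered so much for code capability.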
E) Statements about “not enough data” and synthetic data as a response
“We don’t have enough data on the internet.”
“Synthetic data generation… big one right now… because we don’t have enough data on the internet.”
Relevance: may be used to argue that marginal value of additional copyrighted corpora is material; also may support defenses/claims about why they seek books and paywalled corpora.
F) Quantitative claims about training scale, tokens, and specific models
These can be useful for damages narratives, willfulness, and implied copying volume.
Training scale trend: “Started from around 150 billion tokens (~800 GB)… now around 15 trillion tokens… best models probably trained on that amount.”
Llama 2: “trained on 2 trillion tokens.”
Llama 3: “15 trillion tokens.”
GPT-4: “we don’t really [know], but… probably same order… probably around [~13T] from leaks, if leaks are true.”
Relevance: potentially supports arguments that defendants possess (or should possess) concrete accounting, even if public claims are vague.
Common Crawl filtering magnitude: says final usable corpora reflect “100 to 1,000 times filtering of the Common Crawl” (speaker’s back-of-envelope).
Relevance: suggests massive reduction pipelines exist; also implies they can precisely control inclusion/exclusion.
G) Train-test contamination / memorization-adjacent relevance
“Train test contamination… for companies… maybe not that important because they know what they trained on. For us, we have no idea.”
Relevance: implies companies have internal knowledge of training set composition; can support discovery demands for logs, manifests, or provenance.
Describes a technique to detect contamination: if examples are “more likely” when generated in original order vs shuffled order, that suggests presence in training.
Relevance: points to feasible forensic methods plaintiffs might deploy or ask experts to consider.
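The order-likelihood test described above can be sketched directly. Here `logprob(seq)` is a stand-in for a real model's scoring API (an assumption, since the transcript names no implementation), and the toy "model" at the bottom deterministically favors the canonical ordering to demonstrate the detector:

```python
import random

def contamination_score(examples, logprob, n_shuffles=100, seed=0):
    """Compare the likelihood of a benchmark's examples in their published
    order against random reorderings. If the model ingested the benchmark
    file during training, the original order tends to score higher.
    Returns the fraction of shuffles scoring strictly below the original;
    values near 1.0 suggest the ordering was memorized."""
    rng = random.Random(seed)
    original = logprob(examples)
    below = 0
    for _ in range(n_shuffles):
        perm = examples[:]
        rng.shuffle(perm)
        below += logprob(perm) < original
    return below / n_shuffles

# Toy demonstration: a "model" whose score counts ascending adjacent
# pairs, so the canonical sorted ordering scores highest.
canonical = list(range(10))
def toy_logprob(seq):
    return sum(a < b for a, b in zip(seq, seq[1:]))

score = contamination_score(canonical, toy_logprob)
print(score)
```

With a real model, `logprob` would sum token log-probabilities over the concatenated examples; a score clustered near 1.0 across benchmarks is the kind of forensic signal a plaintiff's expert could present.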
H) Economic/compute claims that can support “industrial-scale copying” framing
“Data is bigger than the modeling aspect” (re: staffing/effort).
“In Llama’s team… 70-ish people… maybe 15 work on data.”
Relevance: suggests substantial, dedicated data engineering for acquisition/filtering—useful for “this wasn’t accidental” narratives.
Llama 3 “400B / 405B params” training cost estimate and compute details:
“Trained on 15.6 [trillion] tokens” (speaker states “15.6 tokens” but context indicates trillion).
“405 billion parameters.”
FLOPs estimate: “3.8e25 flops.”
Hardware: “trained on 16,000 H100s.”
Duration: “~70 days / 26 million GPU hours” (speaker) and “they said 30 million GPU hours.”
Renting cost estimate: “H100… ~$2/hour… ~$52M” compute rental equivalent.
People cost estimate: “50 employees… $25M” (at $500k/yr assumption).
Total estimate: “~$75M.”
Relevance: helps show massive commercial scale and investment, relevant to unjust enrichment/damages and “commercial exploitation” arguments.
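The speaker's Llama 3 figures are mutually consistent under the standard ≈6·N·D training-FLOPs rule of thumb, which strengthens their evidentiary plausibility. A quick check (all inputs are the speaker's figures, not verified):

```python
N = 405e9                  # parameters (speaker's figure)
D = 15.6e12                # training tokens (speaker's figure)
flops = 6 * N * D          # standard ~6*N*D rule of thumb for training FLOPs
print(f"{flops:.2e}")      # -> 3.79e+25, matching the quoted 3.8e25

gpus, days = 16_000, 70
gpu_hours = gpus * days * 24
print(round(gpu_hours / 1e6, 1))   # -> 26.9 (million), vs the quoted ~26M

rental = gpu_hours * 2             # ~$2/hour per rented H100
people = 50 * 500_000              # 50 employees at ~$500k/year
total = rental + people
print(round(total / 1e6))          # -> 79 (million $), near the quoted ~$75M
```

That three independent quoted numbers (parameters, tokens, FLOPs) reconcile to within rounding suggests the speaker is relaying real engineering figures rather than loose guesses.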
Executive Order / scrutiny threshold claim:
“Executive order from Biden… once you have 1e26 flops… special scrutiny… they went 2x less… right below this to not have special scrutiny.”
Relevance: suggests regulatory gaming / threshold-aware behavior; may matter if transparency/notification duties are litigated.
I) Statements about post-training that may matter for output risks and factuality (indirectly rights-related)
These aren’t “training-data infringement” admissions, but can matter to claims about regurgitation, hallucination, and product behavior, depending on the lawsuit theory.
Supervised fine-tuning (SFT) does not “teach” new knowledge: “Knowledge is already in the pretrained LLM… you just specialize to one type of user.”
Relevance: supports argument that pretraining is where the copyrighted “knowledge” enters.
Hallucination hypothesis linked to SFT: If humans provide an answer/reference the model never saw in pretraining, SFT may cause it to “make up plausible sounding reference” rather than true citation.
Relevance: can support claims about reliability and attribution failures (and why provenance matters).
RLHF causes longer outputs: “If you’ve been annoyed at ChatGPT answering super long… this is because of RLHF.”
Relevance: can relate to substitutability and market harm (longer, more complete answers reduce need to click through).
Humans agree only ~66% on binary preference tasks (and authors themselves ~67–68%); models can reach higher agreement with “mode of humans” because “models have no variance.”
Relevance: can be used to argue that alignment and safety claims depend on noisy human labeling; less directly IP-related but could matter in consumer protection / reliability angles.
J) Practical feasibility statements relevant to “opt-out,” “filtering,” and “control” arguments
Companies maintain long blacklists of sites they won’t train on.
They use classification and heuristic pipelines to remove categories (PII, NSFW, “undesirable” content).
They do de-duplication at scale including for “common books” repeated online.
They can domain-classify and reweight (upweight books/code; downweight entertainment).
Relevance: collectively supports the argument that selective exclusion of rightsholders (or licensed-only pipelines) is technically feasible, even if costly.
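Mechanically, the exclusion and curation steps listed above reduce to set lookups and reweighted sampling. A toy sketch (domain names and weights invented for illustration; real systems operate at vastly larger scale but on the same primitives):

```python
import random

BLACKLIST = {"rightsholder-optout.example"}   # hypothetical opt-out list
DOMAIN_WEIGHTS = {"books": 2.0, "code": 1.5, "entertainment": 0.5}

def build_mixture(docs, seed=0):
    """docs: dicts with 'url_domain', 'category', 'text'. Drops blacklisted
    domains, then resamples by category weight -- the 'upweight books,
    downweight entertainment' step described in the lecture."""
    rng = random.Random(seed)
    mixture = []
    for d in docs:
        if d["url_domain"] in BLACKLIST:          # exclusion is a set lookup
            continue
        w = DOMAIN_WEIGHTS.get(d["category"], 1.0)
        copies = int(w) + (rng.random() < w % 1)  # stochastic rounding of w
        mixture.extend([d["text"]] * copies)
    return mixture
```

The simplicity is the point: excluding a rightsholder's domains is a one-line membership test in a pipeline companies already run, which undercuts "technical infeasibility" defenses even if licensed-only pipelines would be costly to assemble.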