- Pascal's Chatbot Q&As

Doe 1 v. GitHub / Microsoft / OpenAI at the United States Court of Appeals for the Ninth Circuit: what the oral argument is really about
by ChatGPT-5.2
The transcript captures a Ninth Circuit oral argument that is, on the surface, about a technical DMCA question—but underneath it’s a fight over whether copyright-adjacent “provenance” duties can be enforced against AI companies without first proving classic copyright infringement in the familiar way.
The core legal dispute: “removal” of CMI vs. a backdoor “attribution right”
The plaintiffs’ theory (as it came through in the argument) is: if defendants ingest code that contains copyright management information (CMI)—think license headers, author lines, copyright notices—and strip it during preprocessing/training, and then the system later outputs code snippets without that CMI, that can violate 17 U.S.C. § 1202(b) (DMCA CMI provisions). They argue the statute does not require a literally identical copy; “substantial similarity” concepts and circumstantial evidence frameworks can do the work where direct proof isn’t available.
The defendants’ pushback has two main layers:
Standing (and plausibility): You can’t sue just because a technology could cause harm to someone; the named plaintiffs must plausibly allege they are (or are likely to be) harmed in the specific way §1202(b) targets—here, their code being output without CMI, or their CMI being “removed” in a way that creates a concrete, non-speculative injury. Defendants hammer the idea that the complaint doesn’t say enough about what the plaintiffs’ code is, how common it is, whether it’s realistically prompted, and so on.
Merits boundary: removal vs. omission: Even if strict identicality is wrong, defendants argue “removal” means tampering with existing CMI on an existing copy, not merely generating an output that lacks attribution. In other words: don’t let §1202(b) become a general-purpose attribution regime for AI outputs. They repeatedly warn about “floodgates” and statutory damages being massively punitive if courts treat “no attribution in output” as “removal.”
Why the judges’ questions matter
The panel’s questioning exposes what future courts are likely to do in these cases:
They are allergic to abstract injury. The “1% regurgitation” idea gets probed: 1% of what, over what scale, with what distribution across “billions of lines of code,” and are plaintiffs alleging facts that make their harm plausible?
They are trying to draw a workable line between (a) obvious DMCA misconduct (e.g., “rip off the cover page / delete the license header”) and (b) mere non-attribution in an output or derivative work. The hypotheticals (book reviews, fan fiction, “Twilight from memory,” museum art student copying without signature) are basically the court stress-testing whether either side’s rule collapses into absurdity.
They care about what exactly is “in the case.” A recurring thread is whether “training/input-stage stripping” is truly pled and preserved, or whether the only live relief sought is effectively “make outputs include CMI,” which looks like an attribution claim.
A meta-point: this is a power fight over evidence control
One theme that plaintiffs lean on (implicitly and explicitly) is that much of the proof—prompt logs, output frequencies, memorization testing, preprocessing pipelines, “cleaning” steps—is uniquely in defendants’ possession, and if courts require plaintiffs to plead those internal details before discovery, many DMCA/AI claims will die at the gate. The defense, of course, wants exactly that gate: “show me it’s likely your work was impacted, with realistic facts, or you don’t get discovery.”
Recommendations for future litigants suing AI makers (drawn from what the argument signals courts will demand)
I’m phrasing these as practical drafting + litigation moves—because that’s what the argument is telegraphing.
A. Plead standing like you mean it (don’t assume the court will “let you into discovery”)
Allege concrete harm to the named plaintiffs, not just a generalized industry harm. Make the court’s “why you?” question easy.
Quantify plausibility: if you cite a “regurgitation rate,” pair it with scale facts (queries/day, user base, or other plausible volume proxies) and explain why that creates a realistic likelihood your work is affected.
Show why your work is likely to appear (frequency, popularity, “well-traveled” repos, inclusion in common tutorials, forks/stars/downloads, dependency chains). Defendants explicitly attacked the absence of this.
Use realistic prompts: include examples of prompts that ordinary users would actually use—not contrived “needle” prompts—and show resulting outputs. Defendants attacked “wildly unrealistic” examples.
If you rely on “input-stage stripping” as injury, plead it with specificity (see section B). Courts may treat that as the cleanest path to standing if properly alleged.
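To see why pairing a rate with scale facts matters, here is a back-of-the-envelope sketch of the plausibility math the panel probed. Every number below is a hypothetical placeholder for illustration, not a fact from the case or the transcript.

```python
# Hypothetical plausibility arithmetic: a per-output regurgitation rate means
# little until multiplied by volume and by how much of the corpus is yours.
regurgitation_rate = 0.01        # assumed: 1% of outputs regurgitate training code
daily_completions = 10_000_000   # assumed: completions served per day
corpus_share = 1e-5              # assumed: fraction of corpus that is plaintiff's code

expected_daily_hits = regurgitation_rate * daily_completions * corpus_share
print(f"Expected regurgitations of plaintiff's code per day: {expected_daily_hits:.1f}")
# -> 1.0 per day under these assumptions, i.e. hundreds per year: small
# per-query odds compound at scale, which is why a complaint should plead
# the scale facts alongside the rate rather than the rate alone.
```

The point of the exercise is rhetorical as much as mathematical: a court asking “1% of what?” is asking for exactly these multiplicands.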
B. If your theory is “CMI stripped during ingestion/training,” don’t hand-wave it
Describe the mechanism: what preprocessing steps plausibly remove CMI? (e.g., license-header deletion, comment stripping, deduplication that drops metadata, format normalization, “cleaning” scripts). The defense argument was: “stripping is an act; plead how it happens.”
Tie mechanism to intent (double scienter): §1202(b) fights often turn on intent/knowledge. Plead facts supporting that the defendant intentionally removed/altered CMI and knew it would “induce, enable, facilitate, or conceal” infringement. The argument repeatedly circled this “double scienter” gate.
Don’t let the case collapse into “you owe me attribution.” If your complaint reads like “make outputs include CMI,” defendants will hammer “this is just an attribution right.” Draft the theory as tampering/removal (if that’s your best evidence) rather than as “outputs should always cite.”
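To make the “stripping is an act; plead how it happens” point concrete, here is a minimal, entirely hypothetical sketch of the kind of preprocessing step a complaint might describe. The regex, function name, and sample code are illustrative assumptions, not anything attributed to the defendants’ actual pipelines.

```python
import re

# Hypothetical "cleaning" step: drop a leading block comment before code
# enters a training corpus. Leading block comments are a typical home for
# license headers, author lines, and copyright notices -- i.e., CMI.
LICENSE_HEADER = re.compile(r"\A\s*/\*.*?\*/\s*", re.DOTALL)

def strip_header(source: str) -> str:
    """Remove a leading /* ... */ block comment, if present."""
    return LICENSE_HEADER.sub("", source, count=1)

code = """/*
 * Copyright (c) 2021 Jane Doe
 * Licensed under the MIT License.
 */
int add(int a, int b) { return a + b; }
"""

cleaned = strip_header(code)
print(cleaned)
# The copyright notice and license line are gone while the functional code
# survives intact -- the discrete, describable "act" of removal that a
# §1202(b) complaint can plead with specificity.
```

A pleading that names a step like this (header deletion, comment stripping, deduplication that drops metadata) is far harder to dismiss as hand-waving than one that says only “CMI was removed somewhere.”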
C. Anchor the DMCA claim to infringement risk, not vibes
Plead the infringement pathway the CMI-removal allegedly facilitates or conceals (license noncompliance, downstream copying, commercial substitution, etc.). Courts are wary of DMCA becoming a free-standing moral right.
If you also have a copyright or license claim, plead it (or at least plead the facts that would support it). One rhetorical advantage defendants used was: “there’s no allegation of infringement, yet you want DMCA statutory damages.”
D. Build your “output” allegations like a product-liability engineer, not a poet
Document outputs that match your work (verbatim or near-verbatim) and identify the missing CMI. Plaintiffs pointed to testing and an “admission” about regurgitation; the court pressed on what’s actually pled.
Explain why outputs are not merely “new text from statistical patterns.” Defendants leaned hard on “pattern recognizer / not connected to training data.” Plaintiffs must counter with facts showing memorization, overfitting pockets, or reproduction phenomena that courts can understand without becoming ML experts.
Differentiate direct vs. circumstantial evidence. Plaintiffs emphasized that copy-comparison is often used only because direct proof of stripping is missing; if you can allege direct removal facts, say so and explain why that should change the analysis.
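The documentation discipline section D urges can be reduced to a simple paired check: how closely does the output match the original, and did the original’s CMI survive? The snippet below is a minimal sketch of that check; the sample code, header text, and threshold are illustrative assumptions.

```python
import difflib

# Hypothetical original: plaintiff's file, with a copyright notice (CMI).
original = """# Copyright (c) 2020 Jane Doe -- MIT License
def is_prime(n):
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))
"""

# Hypothetical model output: the same function, notice absent.
model_output = """def is_prime(n):
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))
"""

# Record both facts together: degree of match, and CMI presence.
similarity = difflib.SequenceMatcher(None, original, model_output).ratio()
cmi_present = "Copyright" in model_output

print(f"similarity={similarity:.2f}, CMI present in output={cmi_present}")
# A near-verbatim match with the notice missing is the paired fact pattern
# -- matching expression plus absent CMI -- that the panel pressed on.
```

Logging both numbers for each test prompt, with the prompt itself, is what turns “we saw regurgitation” into pleadable, exhibit-ready allegations.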
E. Preserve theories aggressively (courts will treat ambiguity as waiver risk)
If the district court suggests you didn’t plead a theory (e.g., training/input-stage removal), correct it immediately in briefing and hearings. The argument shows how easily a case gets funneled into the narrowest “output-only” view.
Make your requested relief match your theories. If you care about training-stage removal, don’t ask only for output attribution. Draft relief that tracks the misconduct you allege (injunctions can be tailored later, but your prayer for relief sets the tone).
F. Draft remedies that look like governance, not theater
Ask for operationally meaningful relief (data-provenance controls, retention/deletion commitments, auditability, evaluations for memorization/regurgitation, logging, and safe harbors that don’t require disclosing trade secrets publicly). Courts are more likely to tolerate forward-looking governance relief than a sweeping “always attribute everything” mandate. This aligns with how defendants framed the “attribution right” concern.
Be careful with statutory-damages optics. Defendants repeatedly signaled that “bankrupt-inducing” DMCA damages make judges cautious. Calibrate damages theories and show proportionality.
G. Use the asymmetry of evidence as an explicit litigation strategy
Plead what you can observe externally and explain why the rest requires discovery (prompt logs, training corpora handling, cleaning scripts, internal metrics). Plaintiffs argued the key numbers are uniquely held by defendants; courts may accept that—but only if you’ve already done serious external testing.
Transcript of the Ninth Circuit oral argument
Where I, ChatGPT, disagree (or at least, where I think one side overstated the case)
A couple of claims in the argument struck me as either overconfident or strategically convenient in ways that could mislead future litigants if taken at face value:
“If we allow this, DMCA becomes an attribution right and teachers become liable.”
That’s rhetorically powerful, but it smuggles in worst-case assumptions. The DMCA §1202(b) claim is fenced by intent/knowledge requirements and by the need to connect removal to infringement facilitation/concealment. Treating every unattributed excerpt as a DMCA violation is not the only possible doctrinal outcome, and courts are capable of drawing narrower rules. The “teacher in a classroom” flourish feels like advocacy designed to panic the court, not a balanced forecast.
The clean-separation claim: “Generative AI is not connected to training data; it’s just statistical patterns.”
As a general description of model architecture, it’s fine. As a legal conclusion to shut down “removal” theories, it’s too neat. Courts are being asked to rule on observable behavior (near-verbatim regurgitation, systematic omission of CMI-bearing headers, etc.). If the system can reproduce long stretches, “it’s only statistics” doesn’t resolve whether CMI was intentionally stripped in preprocessing or whether outputs are functionally substitutive copies. The argument shows judges are not fully satisfied with analogies that treat models like purely human memory.
On the plaintiffs’ side: “The stripping itself is automatically injury-in-fact.”
I get the intuition—CMI exists to keep provenance attached. But standing doctrine can be less romantic. Some courts will demand a clearer showing of downstream risk or concrete impairment (market substitution, licensing interference, or realistic likelihood of distribution without CMI) rather than accepting “metadata was removed somewhere” as categorically sufficient. Plaintiffs should plead both: (a) the stripping event and (b) why that event creates concrete, non-speculative harm.
