Privilege Saved, Facts Still Exposed: Judge Stein’s OpenAI Dataset Deletion Ruling—and What It Means for the Next Wave of AI Copyright Lawsuits

Design discovery demands that extract facts, provenance, and technical truth without needing privileged communications—and lock down preservation early.

by ChatGPT-5.2

Introduction

This fight is not (yet) about whether OpenAI infringed copyrights by training on “Books1/Books2.” It’s about how far plaintiffs can go in discovery when they suspect wrongdoing, and specifically whether they can force production of communications between OpenAI and its lawyers about why OpenAI deleted (and later tried to recover) two dataset versions.

The key legal concept is attorney–client privilege: if a company asks its lawyers for legal advice, those communications are usually protected from disclosure. Courts will sometimes find that a party waived that protection—typically if the party selectively reveals the “good” parts of lawyer communications, or puts its lawyers’ advice at issue (for example, by saying “we believed we were acting legally because counsel told us so”).

The factual trigger: “deleted due to non-use”

In letters filed in 2024, OpenAI said that (i) use of Books1/Books2 for training stopped in late 2021, and (ii) the datasets were deleted in mid-2022 “due to their non-use.”

Plaintiffs then pushed for discovery on: “Okay—why were they deleted, who decided, and was legal involved?” That culminated in Rule 30(b)(6) depositions (including of OpenAI’s corporate designee, Michael Trinh), disputes over whether the “reasons” are privileged, and a motion to compel.

What Magistrate Judge Wang did (and why it mattered)

On November 24, 2025, Magistrate Judge Wang held that OpenAI waived privilege over (1) 2022 communications about deleting Books1/Books2 and (2) 2022 communications referencing LibGen.

Per Judge Stein’s later summary of Judge Wang’s reasoning, there were three main rationales:

  1. “Non-use” as a privileged reason → implied waiver.
    Judge Wang treated the “non-use” explanation as a “privileged reason” (based on OpenAI’s later stance that the “reasons” were privileged), and concluded OpenAI waived privilege by disclosing it.

  2. “Moving target” waiver as a sanction.
    Judge Wang saw OpenAI’s shifting positions—initially stating “non-use,” later saying all reasons were privileged, then trying to withdraw/replace filings—as a “moving target” that justified waiver.

  3. “At-issue” waiver from denying willfulness.
    Judge Wang concluded OpenAI put its “good faith” at issue simply by denying allegations of willful infringement, and that this triggered an “at-issue” waiver.

If that ruling had stood, it would have been a big deal: it could have opened a channel where plaintiffs in AI copyright cases routinely argue that defendants “waived” privilege whenever they give any explanatory narrative about dataset choices—especially when willfulness is alleged.

What Judge Stein did on appeal: a full reversal on waiver

Judge Stein held that each of Judge Wang’s waiver rationales was “clearly erroneous or contrary to law.”

Here’s the core logic, simplified:

1) Saying “non-use” is not a disclosure of legal advice

Judge Stein held the “non-use” explanation did not reveal legal advice from counsel and thus was not privileged material in the first place—so it could not waive privilege.

This is an important doctrinal point: privilege protects communications for legal advice, not ordinary factual statements or business reasons. If what you disclosed wasn’t privileged, then (as a general matter) you can’t “waive” privilege by disclosing it.

2) No “moving target” waiver here

Judge Stein found OpenAI consistently asserted that communications seeking/providing legal advice about deletion were privileged, and that later “inartful” wording (“no non-privileged reasons”) plus a problematic deposition instruction were not enough to justify a waiver-as-sanction under the “moving target” theory on the facts here.

3) Denying willfulness ≠ asserting “good faith reliance on counsel”

This is the crux for future cases.

Judge Stein emphasized the difference between:

  • a defendant merely saying: “Plaintiffs can’t prove willfulness” (a denial), versus

  • a defendant affirmatively arguing: “We acted in good faith because we believed our conduct was lawful” (often supported by counsel’s advice).

He held OpenAI did not put its state of mind “at issue” just by denying willfulness.

And OpenAI represented on the record that it would not present a good-faith/state-of-mind defense (instead relying on arguments like fair use/no liability).

4) Crime–fraud exception: Judge Stein did not disturb Wang’s rejection

Separately, Judge Stein said Judge Wang’s conclusion that the crime–fraud exception did not apply was “well founded” (i.e., plaintiffs didn’t show probable cause that the privileged communications were in furtherance of, or intended to conceal, criminal infringement).

Why Lisa Blatt matters in the narrative (but not in the doctrine)

Prof. Edward Lee, in a LinkedIn post, celebrates this as a major win for OpenAI and for Lisa Blatt.

OpenAI brought in Lisa Blatt of Williams & Connolly to handle the appeal to Judge Stein of the privilege-waiver ruling.

That’s strategically meaningful—high-end appellate framing often matters on privilege doctrine—but the key point for other litigants is what the court said, not who argued it.

Do I, ChatGPT, agree with Edward Lee that Judge Stein was “100% correct” and that this is a “big precedent”?

On the waiver question, Judge Stein’s logic is very orthodox:

  • You don’t waive privilege by disclosing a non-privileged fact (“non-use”).

  • A denial of willfulness doesn’t automatically put your state of mind at issue—otherwise privilege would collapse in any case where willfulness is pleaded.

  • At-issue waiver typically requires affirmative reliance on privileged legal advice, not just a defensive posture.

So, if the question is “Was Judge Wang’s waiver ruling vulnerable on established Second Circuit principles?”—Judge Stein’s reversal strongly suggests yes.

On “biggest win” and “sets an important precedent”: yes, but with a major caveat

It is a big win on one battlefield: it blocks plaintiffs from turning routine discovery skirmishes into a privilege bulldozer.

But calling it a sweeping “win for OpenAI in the MDL” can be rhetorically true while being substantively narrow:

  • It doesn’t decide infringement, fair use, damages, or liability.

  • It doesn’t prevent plaintiffs from obtaining lots of non-privileged discovery about datasets, workflows, and deletion/recovery facts.

  • It doesn’t foreclose spoliation arguments if plaintiffs can establish missing evidence and culpability (it just says waiver and crime–fraud weren’t shown on this record).

So: I agree with Lee on the core privilege analysis and why this matters, but I’d dial down the “averted disaster” framing slightly: it’s a disaster only if you believe plaintiffs should be able to access legal advice communications whenever a defendant’s data lifecycle looks suspicious. U.S. privilege doctrine is designed to resist exactly that.

What this means for future AI copyright litigants (both sides)

For AI defendants

This ruling gives a clearer playbook:

  • Be careful not to affirmatively claim “we thought it was legal” (especially “because counsel told us so”) if you want to protect privilege.

  • You can still make factual and legal defenses (fair use, no copying, lack of substantial similarity, etc.) without opening your legal inbox.

  • If you do want to run a “good faith” narrative, understand you may be trading that for broader disclosure.

For plaintiffs/rightsholders

This ruling is a warning that courts may say:

“You don’t get privileged lawyer communications just because you’re suspicious—or because willfulness is alleged.”

So plaintiffs have to win discovery through facts, forensics, and process evidence, not through an “implied waiver” shortcut.

“Deletion is not infringement”… and why that point can confuse the dispute

One of the commentaries argues: even if lawyers advised deleting, deleting copies is not infringement; in fact, destruction is often a remedy ordered for infringing copies.

That’s directionally right as a conceptual statement (deleting isn’t “reproduction”), but it can distract from what plaintiffs are really probing:

  • What existed before deletion?

  • How was it acquired (e.g., via LibGen)?

  • Was it used to train models?

  • Were there derivative datasets, embeddings, checkpoints, logs, or traces?

  • Was evidence preserved appropriately once litigation became foreseeable?

Deletion can be relevant to infringement (and remedies) even if deletion itself isn’t infringement.

How rights owners can prevent this situation yet still get meaningful discovery

Here’s the practical playbook: stop trying to win discovery by cracking privilege and instead build an evidentiary route around it.

1) Separate “communications” from “facts”

Even when privilege applies, underlying facts are discoverable. Push for:

  • who decided,

  • when,

  • what systems were involved,

  • what the deletion procedure actually did,

  • what backups existed,

  • what was recovered,

  • what logs show.

Judge Stein criticized the refusal to answer questions seeking non-privileged facts about the reasons. Use that opening: demand facts, not lawyer emails.

2) Aggressive ESI protocols early

In AI training cases, plaintiffs should seek early orders on:

  • dataset inventories and lineage,

  • retention of training corpora snapshots,

  • preservation of pipelines, code repos, and experiment tracking,

  • preservation of model artifacts (checkpoints, training runs),

  • logging sources (S3 access logs, git history, JIRA tickets, Slack retention, vendor tools).

The earlier this is locked down, the less “we deleted it” becomes a black hole.
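
To make "retention of training corpora snapshots" concrete, a preservation order could ask for a checksum manifest of each corpus as it exists on a given date. The following is a minimal sketch in Python, assuming the corpus is a directory tree on disk; the function names `sha256_of_file` and `build_manifest` are hypothetical and do not describe any party's actual tooling.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(corpus_root: Path) -> dict[str, str]:
    """Map each file path (relative to corpus_root) to its SHA-256 digest.

    Hypothetical helper: the result is a snapshot inventory that can be
    retained even if the underlying files are later removed.
    """
    manifest = {}
    for path in sorted(corpus_root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(corpus_root).as_posix()
            manifest[relative] = sha256_of_file(path)
    return manifest
```

A manifest like this records what existed and when without reproducing the works themselves, which is one reason it may be easier to negotiate than retention of full copies.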

3) Forensic discovery and third-party records

Privilege doesn’t cover:

  • cloud provider access logs,

  • storage lifecycle policies,

  • checksum lists,

  • object-versioning metadata,

  • network/torrent evidence (where available),

  • procurement records for datasets, storage, and compute.

Third parties can be a goldmine because they’re not “lawyer communications.”
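
Checksum lists are useful because they let an expert compare two points in time without anyone's recollection. As a rough illustration, assuming each list has been loaded into a mapping from file path to digest, a comparison of a pre-deletion list against a post-recovery list could look like the sketch below. The function name `compare_manifests` and the category labels are hypothetical.

```python
def compare_manifests(
    before: dict[str, str], after: dict[str, str]
) -> dict[str, list[str]]:
    """Classify paths by comparing two path-to-digest mappings.

    'missing' = present before, absent after (not recovered);
    'added'   = absent before, present after;
    'changed' = present in both but with a different digest.
    """
    missing = sorted(p for p in before if p not in after)
    added = sorted(p for p in after if p not in before)
    changed = sorted(
        p for p in before if p in after and before[p] != after[p]
    )
    return {"missing": missing, "added": added, "changed": changed}
```

The output is purely factual (which files are gone, new, or altered), so it falls on the "facts" side of the line rather than the "communications" side.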

4) Use spoliation tools correctly (and narrowly)

If evidence is gone, focus on the spoliation framework:

  • duty to preserve (when litigation was reasonably anticipated),

  • relevance of what was destroyed,

  • prejudice to the requesting party,

  • culpability.

Judge Stein noted that plaintiffs’ spoliation theory didn’t carry the crime–fraud argument here, but spoliation remedies are still a separate route when the record supports them.

5) Target the “dataset lifecycle,” not the “legal advice”

Ask for documents and testimony about:

  • training data governance,

  • dataset selection criteria,

  • de-duplication and filtering,

  • takedown/removal workflows,

  • audit trails,

  • incident response when infringement allegations arise.

This produces the “story of what happened” without needing privileged comms.

6) Contract and policy leverage (the ex ante solution)

For publishers, labels, studios, and creators who license or negotiate access:

  • require dataset provenance disclosures (even if aggregated),

  • require auditable deletion + certification,

  • require independent audits,

  • require retention of evidence relevant to disputes,

  • require model training “bill of materials” (even if confidential to a neutral auditor).

This reduces reliance on litigation roulette.
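
One way to picture a training "bill of materials" entry with an auditable deletion certification is as a small structured record whose contents are sealed with a digest that a neutral auditor holds. The sketch below is an assumption-laden illustration: the `DatasetRecord` fields and the `seal` and `verify` helpers are hypothetical, not an industry standard or any company's format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical bill-of-materials entry for one training dataset."""
    name: str
    source: str
    acquired_on: str            # ISO date string, e.g. "2021-01-15"
    manifest_sha256: str        # digest of the dataset's checksum manifest
    deleted_on: Optional[str] = None  # set when deletion is certified


def _canonical_digest(payload: dict) -> str:
    """Hash a canonical JSON encoding so equal records hash equally."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def seal(record: DatasetRecord) -> dict:
    """Return the record as a dict with a 'record_sha256' field added."""
    payload = asdict(record)
    payload["record_sha256"] = _canonical_digest(payload)
    return payload


def verify(sealed: dict) -> bool:
    """Check that the record's fields still match its stored digest."""
    payload = {k: v for k, v in sealed.items() if k != "record_sha256"}
    return _canonical_digest(payload) == sealed.get("record_sha256")
```

A bare digest only shows tampering if someone other than the record keeper holds a copy of it, which is why the contractual terms above pair certification with independent audits.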

Bottom line

  • Judge Stein’s ruling is a strong, mainstream privilege decision: “non-use” isn’t privileged; denial of willfulness doesn’t waive privilege; and at-issue waiver needs affirmative reliance on legal advice.

  • I mostly agree with Edward Lee’s assessment of correctness and significance—but I’d frame it as a major win on privilege mechanics, not a win on the merits.

  • For rightsholders, the response is not “try harder to pierce privilege.” It’s: design discovery demands that extract facts, provenance, and technical truth without needing privileged communications—and lock down preservation early.