The Entrepreneur Media lawsuit is a highly sophisticated, well-supported complaint that draws on Meta’s own research papers, public admissions, court records, and the known illicit origins of Books3.

It frames Meta’s alleged conduct not as technological overreach, but as conscious, repeated, and commercially motivated infringement.

“The Books That Trained the Machine” — Entrepreneur Media v. Meta and the Future of AI Copyright Law

by ChatGPT-5

The lawsuit brought by Entrepreneur Media, LLC (“EM”) against Meta Platforms, Inc. is one of the clearest and most aggressive copyright claims yet filed against a major AI developer. It alleges not only unauthorized copying of copyrighted material, but also intentional removal of copyright management information, systematic acquisition of pirated books, and internal decision-making that—if proven—would suggest a deliberate corporate policy of infringement rather than accidental overreach.

The Reuters report on the filing sets the tone of the dispute, stating that Meta “copied [EM’s] business strategy books, professional development guides and other instructional materials to train its LLaMA models” without licensing. The much more detailed 52-page complaint fills in the gaps with concrete allegations—citations to Meta’s public papers, references to known sources of pirated text such as Books3, and claims that Meta intentionally removed copyright metadata from illegally obtained copies of books and magazine issues.

What follows is an analysis of the grievance quality, the strength of the evidence as pleaded, the likely litigation outcomes, and the preventative measures the AI industry must adopt to avoid this and future conflicts.

1. Quality of the Grievances

1.1. Direct Copyright Infringement

EM’s core grievance—that Meta copied its books and magazine articles without permission—is straightforward and legally coherent. The complaint alleges:

  • Meta downloaded works from “notorious shadow libraries” such as LibGen and Bibliotik (e.g., via Books3).

  • Meta made multiple reproductions of the text during acquisition, preprocessing, storage, and training.

  • Meta distributed infringing copies through torrent participation, because BitTorrent clients, by default, seed (upload) pieces to other peers while downloading.

If these facts hold, this is classic §106 infringement.

Quality: Very strong.
Copyright infringement does not require intent, and EM alleges multiple acts of unauthorized reproduction. The complaint also ties Meta’s alleged copying directly to market harm—falling digital book sales and substitution effects from LLaMA outputs.

1.2. Contributory & Induced Infringement

By alleging use of torrent protocols—which inherently “upload” while downloading—EM adds a second layer of liability. If Meta employees used torrenting software in default mode to pull books from LibGen, they would technically be redistributing copyrighted files.

Quality: Medium-to-strong.
While compelling, this depends heavily on discovery: who downloaded what, on what machines, in what configurations, and under what instructions.

1.3. DMCA §1202: Removal of Copyright Management Information (CMI)

One of the most potent allegations concerns removal of copyright metadata:

  • EM asserts Meta systematically stripped EPUB/PDF metadata and front-matter copyright pages during preprocessing, citing internal descriptions from the Kadrey v. Meta record: “filtering copyright lines” and removing CMI identifiers.

If proven intentional, DMCA §1202 violations can carry stiff statutory damages per work.

Quality: Very strong.
Courts take CMI removal seriously; this is the same statute used in photography infringement cases, often to devastating effect for defendants.

1.4. Market Harm & Loss of Licensing Opportunity

EM alleges:

  • A 50% decline in digital book sales coinciding with LLaMA’s availability.

  • Lost licensing revenue due to Meta’s alleged refusal to engage in the legitimate licensing market (which EM notes is now standard practice, with deals involving OpenAI, Google, Anthropic, UMG, and others).

  • Displacement of purchases because LLaMA can generate functionally equivalent business guides on command.

Quality: High, though causation will be debated.
Courts increasingly acknowledge that LLM outputs may substitute for copyrighted works—Kadrey already recognized this risk in the gardening-book example cited in the complaint.

2. Strength of the Evidence as Pleaded

Because this is a complaint, not a ruling, everything is an allegation. But the evidence categories that EM references fall into four classes:

A. Public Admissions and Papers by Meta

EM cites Meta’s LLaMA 1 paper, which openly lists Books3 as part of its dataset. Books3 is well-known to consist largely of pirated books, confirmed by EleutherAI itself.

Strength: Extremely strong.
This is not speculative—Meta listed the dataset in an academic paper.

B. External records (LibGen metadata & The Atlantic’s LibGen database)

EM identifies specific EM books and magazine issues inside the LibGen corpus, using the searchable database that The Atlantic made publicly available.

Strength: Strong.
While EM must still prove Meta actually ingested the specific works, showing that the works appear in those corpora is trivial.

C. Internal Meta actions revealed in Kadrey

EM relies on public filings describing Meta engineers filtering copyright lines. These filings will be admissible.

Strength: Very strong.
This creates an evidentiary bridge:
Books3 → Meta preprocessing → stripping copyright notices → training data.

D. Damages evidence

The 50% decline in digital sales will be strongly contested. Correlation does not equal causation.

Strength: Medium.

3. Likely Litigation Outcomes

Several paths emerge:

Outcome 1: Meta settles (most likely)

Meta (and the AI industry more broadly) has already shown a pattern:

  • Quick resolution of NYT v. OpenAI (OpenAI side).

  • Broad licensing deals (UMG, Shutterstock).

  • Strong desire to avoid discovery about training sets.

This case sits in the Northern District of California, where Kadrey is already paving legal ground. EM’s complaint cites Kadrey extensively. If discovery proceeds, Meta might be forced to reveal detailed training data lineage—an outcome Meta has consistently sought to avoid.

Probability: ~65–75%.
A settlement might include:

  • A retroactive training-data license.

  • A forward-looking content licensing agreement.

  • No admission of wrongdoing.

Outcome 2: Meta wins on “Fair Use” (unlikely given current climate)

Meta will argue:

  • Training is transformative (vectorization).

  • Uses were non-expressive.

  • Outputs are not substantially similar.

But the trend in 2024–2025 US jurisprudence is skeptical of broad AI fair use, especially where:

  • The training set includes entire books.

  • The outputs compete with the originals (market substitutability).

  • The defendant had access to licensing markets.

Here, EM alleges Meta intentionally bypassed the licensing market—this directly undermines factor 4 (market harm).

Probability: ~10–15%.

Outcome 3: A narrow plaintiff win with injunction + damages (plausible)

Given the DMCA CMI claims and evidence of Books3 ingestion, Meta could lose narrowly:

  • A finding that ingestion was infringing.

  • An order to purge EM works from future checkpoints.

  • Modest statutory damages but no massive payout.

Probability: ~25%.

Outcome 4: Game-changing precedent (less likely but possible)

If this reaches final judgment with damning discovery (e.g., emails approving piracy), it could:

  • Declare that training on copyrighted books without a license is not fair use.

  • Require provenance and auditability for all future LLM training.

But major AI defendants almost always settle before a high-risk precedent can be set.

Probability: <10%.

4. What AI Makers Should Do to Prevent These Lawsuits

The case is a roadmap of what not to do. A preventative framework emerges:

1. Adopt a “No Shadow Libraries, Ever” rule

Books3, LibGen, Bibliotik, and Z-Library are radioactive sources.
AI makers must publicly commit to the following (a minimal ingestion gate is sketched after the list):

  • Zero use of shadow-library corpora.

  • Full provenance documentation.

  • Retroactive dataset purges.
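
As a hedged illustration only, here is what such a gate might look like in Python. The blocklist, function name, and fields are assumptions made for the sketch, not a description of Meta’s pipeline or of any real system: every candidate source is checked against known shadow-library names and must carry a documented rights basis before anything is downloaded or ingested.

```python
# Hypothetical pre-ingestion gate. The blocklist and field names are
# illustrative assumptions, not a description of any real pipeline.

BLOCKED_SOURCES = {"books3", "libgen", "bibliotik", "z-library", "zlibrary"}


def admit_source(source_name: str, source_url: str, license_ref: str | None) -> bool:
    """Admit a data source only if it is not a known shadow library and
    carries a documented license or public-domain justification."""
    haystack = f"{source_name} {source_url}".lower()
    if any(blocked in haystack for blocked in BLOCKED_SOURCES):
        return False  # shadow-library material is never admitted
    return license_ref is not None  # no documented rights basis, no ingestion


# Example: both of these would be rejected before any download occurs.
assert admit_source("Books3 (The Pile)", "https://example.org/books3", None) is False
assert admit_source("Unknown crawl", "https://example.org/dump", license_ref=None) is False
```

A team could run a check like this in CI against every new dataset manifest, so that a shadow-library source fails the build rather than reaching training.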

2. Build Internal Governance: AI Data Compliance Teams

Just as companies have:

  • Privacy teams

  • Security teams

  • Compliance teams

They now need training-data governance teams responsible for the following (an illustrative provenance record is sketched after the list):

  • Source audits

  • Provenance logs

  • Data deletion protocols

  • CMI preservation
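
A minimal sketch of the kind of record such a team might keep for every ingested document. The field names are assumptions for illustration, not an existing standard:

```python
# Illustrative provenance record a training-data governance team might log
# for every ingested document; the field names are assumptions.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ProvenanceRecord:
    document_id: str        # stable hash of the raw file
    source_name: str        # e.g. a licensed publisher feed
    license_ref: str        # pointer to the governing license or audit ticket
    copyright_notice: str   # CMI preserved verbatim, never stripped
    ingested_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    deleted: bool = False   # flipped by the data-deletion protocol
```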

3. Maintain “Right to Query” and “Right to Remove” Systems

Provide a mechanism (sketched below) for rights holders to:

  • Identify whether their works were used.

  • Request removal.

  • Request compensation.

Without this, lawsuits become the only mechanism.
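
A hedged sketch of what the query and removal lookups could look like over a provenance index. The identifiers and data structures are assumptions, not a reference to any deployed rights-holder portal:

```python
# Sketch of "right to query" and "right to remove" lookups over a provenance
# index keyed by ISBN/ISSN. The structures and identifiers are assumptions.

provenance_index: dict[str, str] = {}   # e.g. ISBN -> license / source reference
removed_works: set[str] = set()         # works flagged for exclusion from future runs


def query_work(isbn: str) -> bool:
    """Let a rights holder check whether a specific work was ingested."""
    return isbn in provenance_index


def request_removal(isbn: str) -> bool:
    """Flag a work for exclusion from future training runs; True if it was found."""
    if isbn not in provenance_index:
        return False
    removed_works.add(isbn)  # downstream training pipelines must honor this flag
    return True
```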

4. Use Licensing Markets (which now exist)

The EM complaint notes Meta bypassed licensing despite billions spent by competitors.
Future-proof AI development requires routine licensing of:

  • Books

  • Magazines

  • News archives

  • Image libraries

  • Music catalogues

5. Implement CMI Preservation Throughout the Preprocessing Pipeline

Removing metadata is increasingly viewed as evidence of willfulness.
This is an easily solvable engineering problem, as the sketch after the list illustrates:

  • Do not strip copyright notices.

  • Preserve metadata in shard-level provenance files.

  • Log transformations.
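
A minimal sketch of CMI preservation during preprocessing, assuming EPUB inputs and an illustrative shard-level provenance file; the paths and field names are assumptions, not drawn from the complaint or from any vendor’s tooling. Instead of discarding the rights metadata, it is copied verbatim into a provenance record that travels with the text shard:

```python
# Sketch of CMI preservation during preprocessing, assuming EPUB inputs and an
# illustrative shard-level provenance file. Paths and field names are assumptions.
import json
import xml.etree.ElementTree as ET
import zipfile

DC = "{http://purl.org/dc/elements/1.1/}"
CONTAINER_NS = "{urn:oasis:names:tc:opendocument:xmlns:container}"


def extract_cmi(epub_path: str) -> dict:
    """Pull the Dublin Core title/creator/rights fields from an EPUB's package
    file instead of discarding them during text extraction."""
    with zipfile.ZipFile(epub_path) as z:
        container = ET.fromstring(z.read("META-INF/container.xml"))
        opf_path = container.find(f".//{CONTAINER_NS}rootfile").attrib["full-path"]
        opf = ET.fromstring(z.read(opf_path))
    return {
        "title": opf.findtext(f".//{DC}title"),
        "creator": opf.findtext(f".//{DC}creator"),
        "rights": opf.findtext(f".//{DC}rights"),  # the copyright notice (CMI)
        "source_file": epub_path,
    }


def write_shard_provenance(epub_paths: list[str], provenance_path: str) -> None:
    """Store the preserved CMI for every book in a shard next to the extracted text."""
    records = [extract_cmi(p) for p in epub_paths]
    with open(provenance_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
```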

6. Use Synthetic and Public-Domain Data to the Maximum Extent

With synthetic augmentation, RLHF feedback loops, and retrieval-augmented generation, dependence on copyrighted corpora can be reduced substantially.

7. Cooperate with Regulators and the Copyright Office

The complaint quotes the Copyright Office’s warnings—this signals regulators are watching.
Collaboration signals good faith; silence signals risk.

Conclusion

The Entrepreneur Media lawsuit is a highly sophisticated, well-supported complaint that draws on Meta’s own research papers, public admissions, court records, and the known illicit origins of Books3. It frames Meta’s alleged conduct not as technological overreach, but as conscious, repeated, and commercially motivated infringement.

The case is more dangerous for Meta than many past lawsuits because:

  • It names specific copyrighted works.

  • It alleges intentional CMI removal.

  • It draws a direct line to market harm.

  • It invokes contributory infringement via torrenting.

The most likely outcome is settlement—Meta will seek to avoid discovery that could expose the full extent of its dataset practices. But even if settled, this lawsuit continues the tightening legal noose around unlicensed AI training.

The era of “take now, litigate later” is ending.
AI developers who fail to adopt provenance, licensing, and transparent governance will face the same wave of litigation—until courts, regulators, or legislation impose the discipline the industry has so far refused to adopt voluntarily.