- Pascal's Chatbot Q&As
Elsevier v. Meta: Not just another “AI trained on copyrighted works” lawsuit. It is drafted as a story of deliberate corporate piracy, executive authorisation, concealment, and market substitution.
Six claims: reproduction by torrenting, reproduction via web scrapes, reproduction in training, distribution by torrenting, contributory infringement by Zuckerberg, and DMCA §1202 CMI removal.
Summary: The Elsevier-led complaint against Meta is powerful because it frames Llama not as ordinary AI innovation, but as a product allegedly built through deliberate piracy, torrenting, concealment, CMI removal, and abandoned licensing.
Its strongest claims are the pirate-source and seeding/distribution allegations, while the broader “training itself is infringement” and output-substitution arguments remain more legally contested.
Most likely outcome: Meta fights hard to narrow the case, but the reputational and discovery risks make a large settlement, dataset deletion commitments, and future licensing/provenance obligations more likely than a full trial.
The Pirate Library in the Machine: Why the Elsevier–Meta Complaint Is More Dangerous Than a Standard AI Copyright Case
by ChatGPT-5.5
This complaint is not just another “AI trained on copyrighted works” lawsuit. It is drafted as a story of deliberate corporate piracy, executive authorisation, concealment, and market substitution. The plaintiffs — Elsevier, Cengage, Hachette, Macmillan, McGraw Hill, Scott Turow and S.C.R.I.B.E. — allege that Meta and Mark Zuckerberg copied and distributed millions of copyrighted books, textbooks, journal articles and other literary works to build Llama; used Common Crawl, C4, Books3, LibGen, Sci-Hub, Anna’s Archive and other alleged pirate sources; stripped copyright-management information; and then repeatedly copied those works through the training pipeline. The complaint pleads six claims: reproduction by torrenting, reproduction via web scrapes, reproduction in training, distribution by torrenting, contributory infringement by Zuckerberg, and DMCA §1202 CMI removal.
a) What the grievances are
At its core, the grievance is that Meta allegedly chose piracy over licensing. The plaintiffs say Meta knew high-quality books and scholarly works were critical to making Llama useful, briefly explored licensing, then abandoned licensing once it realised the same works were available through pirate datasets. The complaint’s strongest narrative is not simply “Meta copied our works”; it is “Meta knew these works had commercial licensing value, considered paying, escalated the issue to Zuckerberg, then decided to use LibGen and other pirate sources because that was faster, cheaper, and more useful for a fair-use litigation strategy.”
The second grievance is distribution, not merely ingestion. Torrenting matters because BitTorrent does not only download; it can also upload or “seed” pieces of files to others. The complaint alleges that between April and July 2024, Meta downloaded 134.6 TB via torrenting while uploading 40.42 TB, which the plaintiffs characterise as mainly copyrighted content. That makes the case more damaging than a pure training-data case, because the plaintiffs can say Meta did not merely make internal copies for model development; it allegedly participated in the broader pirate distribution ecosystem.
The third grievance is concealment. The complaint alleges Meta removed or altered copyright-management information from works copied from pirate databases, including by running scripts designed to delete CMI from LibGen works and by deleting CMI from Books3. More damagingly, the plaintiffs allege Meta did not strip CMI uniformly: it allegedly left CMI on Project Gutenberg public-domain works while removing CMI from allegedly pirated copyrighted works. That is designed to defeat the defence that this was merely neutral data cleaning.
The fourth grievance is market harm. Plaintiffs allege three types: lost sales from Meta’s use of pirated copies instead of lawful purchases; usurpation of an existing and emerging AI licensing market; and substitution by Llama outputs, including summaries, replacement textbook chapters, alternative versions of novels and journal articles, and outputs mimicking expressive elements of authors’ works. The complaint is deliberately aimed at the fourth fair-use factor: market effect.
The fifth grievance is Zuckerberg's personal responsibility. The complaint names him individually, alleging he was not a passive officer but personally authorised, directed, or encouraged the infringement. That is strategically important: it turns the case from a corporate compliance dispute into a governance and executive-accountability case.
b) Do the evidence and arguments hold up?
On the pleadings, the complaint is strong — much stronger than many earlier AI copyright complaints — because it is rich in alleged internal documents, employee statements, dates, datasets, technical conduct and specific examples. The most powerful evidence is the alleged internal recognition that LibGen and Sci-Hub were illegal pirate sites, that “using pirated material should be beyond our ethical threshold,” that employees worried about torrenting pirate content from Meta infrastructure, and that Meta allegedly sought to mask IP addresses to avoid tracing back to Facebook servers. If proven, those facts are devastating on willfulness.
The strongest legal arguments are not necessarily the broad claim that “AI training is infringement.” The strongest claims are narrower: torrenting, seeding/distribution, pirate-source acquisition, CMI removal, and post-discovery concealment. Those claims are less dependent on persuading a court that all LLM training is unlawful. They allow plaintiffs to say: even if some training uses might be fair use, Meta did not merely train; it allegedly stole, seeded, concealed and stripped attribution.
The weaker or more contested arguments are the ones that depend on broad theories of training-copy infringement and output substitution. Courts have already shown some willingness to treat AI training as transformative in certain contexts. In Kadrey v. Meta, Judge Chhabria granted Meta summary judgment on fair use on the record before him, while warning that “in most cases” the unlicensed feeding of copyright-protected material into generative AI models would likely be illegal, and that the outcome could differ with a stronger market-harm record.
The Common Crawl and C4 theory is also more vulnerable. The complaint says these datasets contain pirated and paywalled works and that Meta knowingly copied them. That may be true, but courts may be reluctant to treat general web-scale dataset use the same way as deliberate LibGen/Anna’s Archive torrenting. The cleaner litigation target is Meta’s alleged direct use of known pirate repositories.
The CMI claim is potentially valuable but not automatic. DMCA §1202 claims often face demanding causation and knowledge requirements: plaintiffs must show not merely that metadata was removed, but that removal was done with knowledge, or reasonable grounds to know, that it would induce, enable, facilitate or conceal infringement. Here, the selective-removal allegation gives the claim teeth, but Meta will likely argue this was routine preprocessing, deduplication, formatting or cleaning.
The most surprising, controversial and valuable statements/findings
The most surprising allegation is that Meta allegedly discussed increasing its dataset licensing budget from $17 million to $200 million, then stopped licensing efforts after escalation to Zuckerberg. That creates a bad narrative: Meta allegedly recognised the market, priced the market, then avoided the market.
The most controversial alleged statement is that an employee said that if Meta licensed even a single book, it would be harder to “lean into the fair use strategy.” That is legally explosive because it suggests licensing was not rejected because no market existed, but because recognising the market would weaken Meta’s legal position.
The most valuable technical allegation is the torrenting/seeding evidence: 134.6 TB downloaded and 40.42 TB uploaded over a short 2024 period. If accurate, this changes the emotional and legal centre of gravity from “AI learning” to “participation in piracy distribution.”
The most strategically useful allegation for publishers is that Meta allegedly compared LibGen holdings against publisher catalogues and decided licensing was unnecessary because the works were already available in LibGen. That is exactly the kind of evidence rights owners need to show that piracy did not merely exist in the background; it allegedly displaced licensing.
The most governance-relevant allegation is the attempt to pull Zuckerberg personally into the chain of decision-making. Whether or not that claim ultimately survives, it reframes AI copyright infringement as a boardroom and executive-risk issue, not merely an engineering or research practice.
c) How this differs from other existing cases
Compared with Kadrey v. Meta, this complaint appears designed to fix the evidentiary gaps that hurt the earlier author plaintiffs. Judge Chhabria’s Kadrey ruling was favourable to Meta on the specific record, but it stressed that the plaintiffs had failed to develop enough evidence of market harm. This new complaint responds directly: it brings major publishers, educational publishers and a journal publisher; pleads licensing markets; alleges direct substitution; identifies existing derivative and AI licensing opportunities; and adds the seeding/distribution and CMI theories more aggressively.
Compared with Bartz v. Anthropic, the complaint tracks the same pirate-library pressure point but pushes harder on corporate intent and executive control. In Anthropic, the reported settlement followed a mixed ruling: training itself was treated favourably for Anthropic, but acquisition and retention of pirated books created severe damages exposure. Anthropic then agreed to a $1.5 billion settlement, roughly $3,000 per covered book, and agreed to destroy the original downloaded book files.
Compared with Thomson Reuters v. Ross, this case is broader and more culturally charged. Thomson Reuters involved use of Westlaw headnotes and a competing legal-research product; the Delaware court rejected fair use in a non-generative AI context where the defendant was building a competing tool. The Elsevier–Meta complaint tries to import that logic into generative AI by arguing that Meta is building substitutes for textbooks, scholarly works, fiction and derivative markets.
Compared with Authors Guild v. OpenAI and similar author cases, this complaint is more publisher-driven, more catalogue-driven, and more commercially concrete. The Authors Guild cases focus heavily on authors’ books, training and outputs; this case adds educational publishing, scholarly journals, textbook platforms, institutional licensing and publisher-controlled rights markets. That makes the market-harm story more sophisticated.
Compared with Britannica v. OpenAI, it is less about one reference publisher’s web traffic and near-verbatim answers, and more about the entire training-data supply chain. Britannica alleges OpenAI copied nearly 100,000 articles and produced near-verbatim outputs that diverted traffic; Elsevier–Meta alleges a much larger ecosystem of torrenting, pirate datasets, web scrapes, CMI removal and model training.
d) ChatGPT’s prediction on outcome
Meta will not want this case to reach a full public trial on the internal-document record. The complaint is too damaging reputationally, and the alleged facts are too useful for regulators, publishers, authors, investors and other plaintiffs. Meta will fight aggressively at first, because it cannot concede that Llama was built on unlawful copying; Reuters reports Meta’s public position is that courts have found AI training on copyrighted material can qualify as fair use and that it will fight the lawsuit.
But the likely endgame is a large settlement, not a clean appellate ruling that “AI training is infringement” or “AI training is fair use.” I would expect Meta to seek dismissal or narrowing of claims, especially those against Zuckerberg personally and some of the web-scrape/training theories, and some of that narrowing may succeed. But the pirate-source, seeding, CMI and willfulness allegations are dangerous enough that a settlement becomes rational once discovery risk and statutory damages exposure become concrete.
I would not expect the court to order destruction of all trained Llama models. That remedy is too economically disruptive and technically complicated, especially for models integrated into Meta’s products. More plausible outcomes are: monetary settlement; deletion of retained source datasets; commitments around future training-data sourcing; audit/accounting obligations; a licensing framework; and perhaps court-supervised destruction of infringing copies of underlying source files rather than model weights.
If forced to rank the claims by litigation strength, I would put them this way:
Strongest: torrenting/download and seeding/distribution from known pirate sites.
Strong: willfulness, especially if internal documents are authenticated.
Strong but technically demanding: CMI removal, especially selective removal.
Moderate: reproduction throughout training pipeline.
Moderate to difficult: output substitution and market dilution, unless plaintiffs produce empirical evidence.
Uncertain: personal liability against Zuckerberg, powerful as narrative but likely heavily contested.
e) Other standout topics
The complaint is politically useful because it reframes the AI copyright debate from “innovation versus copyright” to “lawful innovation versus pirate industrialisation.” That distinction matters. It avoids sounding anti-AI. The plaintiffs are not saying LLMs cannot exist; they are saying dominant AI companies cannot use pirate sites as procurement infrastructure.
It also raises a major content-integrity issue. The complaint focuses on copyright, but the deeper problem is that pirate libraries are not quality-controlled knowledge systems. They can contain outdated editions, corrupted files, missing corrections, fake works, altered texts, poor OCR, fraudulent material and non-version-of-record content. For scholarly and educational publishing, that matters as much as infringement: bad or unauthorised inputs can pollute downstream AI systems, especially in science, medicine and education.
The complaint also shows why AI licensing markets are becoming unavoidable. If courts accept Meta’s argument too broadly, licensing markets are weakened because the best strategy becomes: scrape first, litigate later, settle if necessary. If plaintiffs win too broadly, AI developers face enormous retroactive exposure. The likely compromise, commercially and legally, is not a ban on AI training; it is a shift toward licensed, auditable, provenance-aware training and retrieval systems.
The final standout point is governance. This complaint is about Meta, but the real target is the Silicon Valley playbook: scale first, internalise the upside, externalise the rights risk, bury the source chain, then call the resulting infrastructure “innovation.” If the allegations hold, the case becomes not just a copyright case but a case study in how frontier AI companies converted copyright infringement into a capital-formation strategy.
