A class-action lawsuit alleging that Salesforce infringed copyrights by using a dataset of pirated books to train its LLMs, including the CodeGen, XGen, xGen-Sales, and xGen-Small series.

The plaintiffs claim their books were included in the Books3 corpus without permission, compensation, or consent, and that Salesforce willfully used these materials for commercial gain.

Analysis of Tanzer et al. v. Salesforce — A Landmark Copyright Case in the AI Era

by ChatGPT-4o

1. Overview and Grievances

In Tanzer and Gilmore v. Salesforce, authors E. Molly Tanzer and Jennifer Gilmore filed a class-action lawsuit alleging that Salesforce infringed their copyrights by using a dataset of pirated books—specifically the Books3 corpus found within The Pile and RedPajama datasets—to train its large language models (LLMs), including the CodeGen, XGen, xGen-Sales, and xGen-Small series.

The plaintiffs claim their books were included in the Books3 corpus without permission, compensation, or consent, and that Salesforce willfully used these materials for commercial gain. The suit accuses Salesforce of direct copyright infringement under 17 U.S.C. § 501 and seeks damages, restitution, destruction of infringing materials, and class certification on behalf of all authors whose works were similarly misused.

What distinguishes this case is its focus on a major enterprise software company—Salesforce—rather than a consumer-facing AI startup. The plaintiffs also cite CEO Marc Benioff’s own 2024 public remarks acknowledging that AI companies “ripped off” copyrighted works during model training, calling this a “mistake” that must be corrected.

2. Quality of the Evidence

The complaint is detailed and well-documented, demonstrating careful investigative work and legal framing. Key strengths in the evidence include:

  • Direct Salesforce admissions: Early technical blogs and GitHub posts explicitly acknowledged that Salesforce’s XGen models were trained on RedPajama-Books, which contains Books3. A Salesforce engineer provided links to The Pile and RedPajama datasets, confirming their use in training.

  • Timing of dataset use: The plaintiffs argue that Salesforce downloaded and used these datasets before they were removed from sites like Hugging Face due to copyright complaints in late 2023. This adds weight to the claim that Salesforce knew or should have known the datasets were legally compromised.

  • Apparent concealment: Salesforce later edited public documentation to remove references to Books3, RedPajama, and The Pile, instead labeling training data vaguely as “legally compliant” and from “public sources.” Plaintiffs frame this as an intentional obfuscation to hide prior infringement.

  • Benioff’s public admissions: Statements by Salesforce’s CEO acknowledging widespread industry misconduct around training data are likely to be damaging in court and to undermine any claims of ignorance or fair use.

Together, these components present a compelling case of willful infringement, potentially justifying statutory damages and injunctive relief.

3. Value for Other Ongoing Cases and Rights Owners

This lawsuit is the 53rd copyright case filed against an AI company in the U.S. and the 11th by the Joseph Saveri Law Firm, which has emerged as a key player in AI-related copyright litigation. It holds unique value in several ways:

  • Corporate accountability: This is the first high-profile suit targeting a B2B enterprise software giant, broadening the litigation field beyond OpenAI, Meta, or Stability AI.

  • Model disclosure: The plaintiffs’ use of Salesforce’s own technical documentation (before it was edited) shows the strategic importance of model transparency for plaintiffs, regulators, and the press. Other rights holders should prioritize monitoring changelogs, papers, and GitHub commit histories for similar admissions; a simple monitoring sketch follows this list.

  • Class-action framework: By seeking class certification, the plaintiffs aim to aggregate the claims of potentially thousands of authors—escalating pressure on Salesforce and creating a blueprint for future large-scale litigation against similar corporate actors.

  • Reputational pressure: Salesforce markets AI tools such as Agentforce to enterprise clients who value brand integrity and ethical sourcing. The lawsuit jeopardizes that positioning and signals that rights owners can target the entire AI value chain—model builders, cloud providers, and integrators alike.

  • Corporate contradictions: The inclusion of Benioff’s critical AI comments and Salesforce’s public stance on copyright sets a precedent for using executives’ public statements to challenge corporate behavior.
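Because edits to public documentation can themselves become evidence, rights holders may want to routinely check a model vendor’s public repositories for dataset references that were later scrubbed. The sketch below is a minimal, hypothetical example assuming Python 3.9+ and a local git client; the repository URL is a placeholder and the dataset terms are simply the names discussed above, not a pointer to any actual Salesforce repository.

```python
# Hypothetical sketch: scan a public model repository's git history for commits
# that added or removed mentions of contested datasets. The repository URL is a
# placeholder; the dataset terms are illustrative.
import subprocess

DATASET_TERMS = ["Books3", "RedPajama", "The Pile"]

def clone(repo_url: str, dest: str = "repo") -> str:
    """Clone the repository whose documentation history we want to inspect."""
    subprocess.run(["git", "clone", "--quiet", repo_url, dest], check=True)
    return dest

def find_mentions(repo_dir: str, term: str) -> list[str]:
    """Use git's pickaxe search (-S) to list commits that added or removed `term`."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "log", "-S", term, "--oneline", "--", "."],
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line.strip()]

if __name__ == "__main__":
    repo = clone("https://github.com/example-org/example-model")  # placeholder URL
    for term in DATASET_TERMS:
        commits = find_mentions(repo, term)
        if commits:
            print(f"Commits mentioning '{term}':")
            for commit in commits:
                print("  ", commit)
```

Commits surfaced this way are only a starting point; the actual text of each change still needs to be reviewed (for example with git show) before drawing any conclusions.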

This case thus becomes an important legal, strategic, and symbolic milestone in the battle over AI and copyrighted content.

4. Influence of the Anthropic Settlement

While confidential, the recent Anthropic settlement—reportedly involving payments to publishers—may influence Salesforce’s calculus:

  • Precedent of liability: The Anthropic case signals that even well-funded AI companies can be forced to settle over training data, making settlements more likely across the board.

  • Pressure from insurers and investors: As insurance companies reportedly warn that training on pirated datasets invalidates coverage, and as VCs fear long-term model contamination, companies like Salesforce may lean toward resolution rather than litigation.

  • Strategic differentiation: If Anthropic settles and Salesforce doesn’t, it risks being perceived as the lone holdout, inviting sustained legal and reputational attacks.

  • Settlement momentum: Anthropic’s capitulation adds weight to the legitimacy of these claims. Judges and juries may now be more likely to see this as systemic wrongdoing, not a novel legal theory.

So while not binding, the Anthropic settlement amplifies legal and public pressure on Salesforce.

5. What Salesforce Should or Could Have Done

This lawsuit exemplifies failures in compliance, governance, and foresight. Salesforce could have avoided or mitigated the situation by:

  • Licensing high-quality datasets: Salesforce had the financial capacity to license copyrighted materials from publishers, content platforms, or rights aggregators. Opting for pirate-linked datasets like Books3 was reckless and unnecessary.

  • Due diligence and audit trails: The company should have implemented clear model provenance frameworks, logging every dataset, its source, and its legal basis; a minimal sketch of such a record follows this list. The absence of a due diligence audit trail may severely weaken Salesforce’s legal defense.

  • Transparency and early corrections: Once Books3’s copyright issues became widely known in late 2023, Salesforce could have conducted a dataset audit, disclosed findings, and re-trained or fine-tuned models using cleared datasets.

  • Backing Benioff’s remarks with policy: If Benioff believed content creators should be compensated, Salesforce could have preemptively joined licensing coalitions (e.g., STM, CCC, or the Coalition for Content Provenance and Authenticity) and partnered with publishers or creator societies.

  • Offering opt-outs or royalty pools: Salesforce could have followed the lead of companies exploring opt-outs or content-sharing agreements and even funded compensation pools for authors whose works were used.
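To make the audit-trail point concrete, here is a minimal sketch of the kind of dataset provenance record such a framework might maintain: what each dataset is, where it came from, the legal basis for using it, and a content hash so the exact files can be re-verified later. The field names and manifest format are hypothetical illustrations, not a description of any system Salesforce actually operates.

```python
# Hypothetical dataset provenance record and append-only manifest.
# Field names and file paths are illustrative assumptions.
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import date
from pathlib import Path

@dataclass
class DatasetRecord:
    name: str          # e.g. "licensed-fiction-corpus-v1" (hypothetical)
    source_url: str    # where the data was obtained
    legal_basis: str   # e.g. "license agreement", "public domain", "open license"
    acquired_on: str   # ISO date the data was downloaded
    sha256: str        # hash of the archive actually used for training

def hash_file(path: Path) -> str:
    """Compute the SHA-256 of the dataset archive so it can be re-verified later."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def append_to_manifest(record: DatasetRecord, manifest: Path) -> None:
    """Append the record to a JSON-lines manifest kept under version control."""
    with manifest.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")

if __name__ == "__main__":
    archive = Path("corpus.tar.gz")  # placeholder path to a licensed dataset archive
    record = DatasetRecord(
        name="licensed-fiction-corpus-v1",
        source_url="https://example.com/licensed-corpus",
        legal_basis="license agreement with publisher (hypothetical)",
        acquired_on=date.today().isoformat(),
        sha256=hash_file(archive),
    )
    append_to_manifest(record, Path("training_data_manifest.jsonl"))
```

A manifest like this, kept under version control alongside training code, is the kind of audit trail whose absence, as noted above, can weaken a legal defense.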

Salesforce instead followed a path of convenience over caution—leaving it vulnerable to litigation, reputational damage, and regulatory scrutiny.

Conclusion

Tanzer et al. v. Salesforce could become a landmark case in AI copyright litigation. It squarely addresses the use of pirated books in model training and holds a major enterprise software company accountable for the choices it made in developing commercial AI products. The evidence presented—particularly Salesforce’s own documents and statements—demonstrates a troubling pattern of use, denial, and cover-up. The case adds fuel to the growing legal momentum demanding fair compensation for creators whose work underpins generative AI systems.

As more lawsuits emerge and high-profile settlements like Anthropic’s shape the legal terrain, rights holders are emboldened, AI companies are put on notice, and the AI sector faces a reckoning. Salesforce, and others like it, must now decide whether to resist, settle—or rethink how they build the future of AI.