
This leak pulls back the curtain on one of the AI industry's most opaque layers: the human-directed “clean-up” phase of training, where models are fine-tuned using curated (and excluded) sources.

It confirms that platforms like Claude are shaped not just by math and compute, but by deliberate editorial choices—sometimes outsourced, often hidden.


The Claude AI Leak and Its Implications for Rights Holders and AI Accountability

by ChatGPT-4o

Introduction

In July 2025, a leak of internal documentation from Surge AI—an Anthropic contractor—revealed detailed lists of websites that were approved or banned for use during the fine-tuning phase of Claude, Anthropic’s large language model. The lists, exposed through publicly accessible Google Drive folders, offer rare insight into how major AI models are shaped during human-supervised reinforcement learning. This leak matters not only for transparency and ethics in AI but also for rights holders—including scholarly publishers—who now have a clearer map of how their content might be treated during model development.

Overview of the Leak

According to reports from Tom’s Guide and Business Insider, the leaked spreadsheet categorized websites into:

  • Allowed (Whitelisted) Sources: Considered reliable or safe for use during Claude’s Reinforcement Learning from Human Feedback (RLHF) phase.

  • Disallowed (Blacklisted) Sources: Explicitly banned, possibly due to licensing, copyright, or reputational concerns.

Anthropic distanced itself from the document, claiming it had no prior knowledge of the spreadsheet, which was developed by Surge AI. Nonetheless, the leak has triggered scrutiny over how contractors shape AI outputs—often with little oversight from the AI vendors themselves.

Full List of Known Allowed and Banned Sources

✅ Approved Sources (Examples from “teaching-ai-example-sites-you-can-use.pdf” and press reporting)

These 120+ sites span academia, government, finance, medicine, and law. Key entries include:

Academic / University Websites

  • Harvard University

  • MIT

  • Princeton

  • Yale

  • University of Chicago

Finance / Business

  • Bloomberg

  • Crunchbase

Medical / Health

  • Mayo Clinic

  • New England Journal of Medicine

  • Johns Hopkins Medicine

  • WHO

Law / Government

  • Legal Information Institute

  • Justia

  • National Archives

  • GovInfo

STEM & Software

  • IEEE Xplore

  • Papers With Code

  • GitHub

❌ Disallowed Sources (from “teaching-ai-not-approved.pdf”, press coverage, and leaks)

The "banned" list includes over 50 sites, with strong representation from journalism, academic publishing, and online platforms known for user-generated content.

News Outlets

  • The New York Times

  • The Wall Street Journal

  • Reuters

  • Financial Times

  • The Economist

  • BBC

User Platforms / Miscellaneous

  • Reddit

  • Wikipedia

  • Quora

  • Yahoo

Academic / Research Publishers

  • Wiley

  • PLOS

  • Stanford University

  • Harvard Business Review

  • BioRxiv

Government / Medical Sources

  • FDA

  • Library of Congress

  • Department of Education

Analysis: Why Certain Sites Were Disallowed

The rationale behind banning sites likely reflects a combination of:

  1. Copyright and Licensing Risk: Many disallowed entities—like Wiley, NYT, and Reddit—have already taken legal or policy steps to restrict AI training on their content.

  2. Reputational Risk: Reddit and Wikipedia, while massive repositories of user-generated content, are seen as unreliable or unmoderated for factual training.

  3. Legal Compliance: Disallowing sites with robots.txt exclusions or formal take-down requests signals an effort to appear compliant with copyright and scraping norms, even if retroactively (see the brief robots.txt sketch below).
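
For readers unfamiliar with the mechanism, the sketch below shows, in Python, how a curation pipeline might consult a site's robots.txt before fetching a page. It is a minimal illustration only, not a description of Surge AI's or Anthropic's actual tooling; the user-agent string and URLs are hypothetical.

```python
# Minimal, hypothetical sketch of a robots.txt compliance check.
# The user-agent string and URLs are invented for illustration only.
from urllib import robotparser

def may_fetch(page_url: str, robots_url: str, user_agent: str = "ExampleTrainingBot") -> bool:
    """Return True if the site's robots.txt allows user_agent to fetch page_url."""
    parser = robotparser.RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # download and parse robots.txt
    return parser.can_fetch(user_agent, page_url)

if __name__ == "__main__":
    # A curation pipeline could skip any source whose robots.txt disallows crawling.
    if not may_fetch("https://example.com/article", "https://example.com/robots.txt"):
        print("Skipping source: disallowed by robots.txt")
```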

Anthropic told Business Insider it had no knowledge of the Surge AI list, but this distancing doesn’t eliminate its accountability under copyright law, as RLHF and pretraining may both fall under scrutiny in future litigation.

Legal scholars cited in the coverage argue that courts may not meaningfully distinguish between:

  • Pretraining (ingesting vast quantities of data for model initialization), and

  • Fine-tuning (especially RLHF, where gig workers use third-party content to craft and rank model responses).

Both may be seen as substantial use of protected content, whether or not that content is directly ingested into the model. Therefore, even “teaching” a model with copyrighted PDFs or website content during RLHF may still trigger copyright liability.

This mirrors legal positions taken by:

  • The New York Times (suing OpenAI and Microsoft),

  • Reddit (suing Anthropic),

  • Dow Jones (suing Perplexity), and

  • Authors, developers, and visual artists (in dozens of copyright class actions worldwide).

Implications for Rights Holders and Plaintiffs

🧩 1. Evidence of Intentional Avoidance or Use

This leak shows that AI vendors (or their contractors) actively distinguish between approved and banned content. For plaintiffs, this undermines any “innocent infringement” defense a vendor might raise.

🧠 2. Proof of Access and Selective Use

Disallowed sites, including several involved in ongoing lawsuits, are specifically listed, suggesting that Surge AI (and by extension Anthropic) knew of or anticipated rights-based restrictions. This strengthens claims that AI companies exercised control over what content was used.

⚖️ 3. Support for Licensing Demands

Publishers such as Wiley and PLOS can now assert that exclusion from RLHF processes reduces their visibility in AI outputs, a potential commercial harm and a bargaining chip in licensing negotiations.

📚 4. New Discovery Material

Litigants may subpoena similar training spreadsheets or contractor instructions from other AI companies, especially where Surge AI or Scale AI were involved. These documents can serve as:

  • Evidence of willful copying

  • Evidence of internal inconsistencies (e.g., allowing Harvard.edu while banning Stanford.edu)

  • Leverage in settlement discussions

🔍 5. Policy Advocacy and Regulatory Reform

Scholarly publishers and academic societies can use the leak to push for:

  • AI transparency rules mandating disclosure of training and tuning sources

  • Greater regulation of third-party contractors

  • “Opt-in by default” licensing frameworks for academic and scientific content

Conclusion

This leak pulls back the curtain on one of the AI industry's most opaque layers: the human-directed “clean-up” phase of training, where models are fine-tuned using curated (and excluded) sources. It confirms that platforms like Claude are shaped not just by math and compute, but by deliberate editorial choices—sometimes outsourced, often hidden.

For rights holders, especially scholarly publishers and plaintiffs in ongoing lawsuits, the leak presents a rare opportunity:

  • To demonstrate that vendors do make judgments about whose content to include or exclude,

  • To advocate for transparent licensing regimes,

  • And to reinforce the legal and ethical necessity of respecting intellectual property in all phases of model development.

Sources Referenced: