Pascal's Chatbot Q&As
Asking ChatGPT-4o: Please read the “Report on Pirated Content Used in the Training of Generative AI” and the associated LinkedIn post and tell me what the key messages are and how AI makers should be responding to these issues.
ChatGPT-4o’s answer:
Key Messages from the "Report on Pirated Content Used in the Training of Generative AI" and LinkedIn Post
1. AI Models Rely on Pirated Content
The report highlights that major AI companies (e.g., Meta, OpenAI, Microsoft, Apple, and others) have used pirated content to train their generative AI models.
Sources of pirated content include classic pirate sites such as LibGen, Z-library, Anna’s Archive, Watchseries, OpenSubtitles, and shadow libraries.
Meta, for example, has used Books3, a dataset containing illegally shared books, and is now sourcing datasets from Z-library, LibGen, and Anna’s Archive for its upcoming Llama 4 model.
2. Training Data Defines AI Output
The AI industry is shifting from traditional Large Language Models (LLMs) to agentic AI and reasoning models.
AI’s output is fundamentally limited by its training data—models lack true creativity and primarily interpolate existing content.
Demis Hassabis (CEO of Google DeepMind) has categorized AI creativity as “interpolation”, where models merely blend what they’ve been trained on.
Andrej Karpathy (co-founder of OpenAI) has described LLMs as “internet document simulators”, compressing and reproducing online content.
3. AI Companies Continue Risky Practices Despite Legal Challenges
Despite lawsuits, AI companies continue using infringing datasets:
Meta: Acknowledged using Books3 for Llama 1-3 in court and is now sourcing pirate datasets for Llama 4.
OpenAI: Used LibGen data to train GPT models and was revealed to have transcribed YouTube videos with its Whisper tool.
Runway AI: Scraped YouTube videos for training, violating copyright protections.
Suno AI: Allegedly trained on millions of music recordings, likely sourced from piracy or stream-ripping platforms.
4. AI Dataset Distribution Is Difficult to Regulate
AI training datasets are widely shared via Hugging Face, Kaggle, torrents, and private online forums.
AI companies often do not acquire datasets through official agreements but download them from public repositories or peer-sharing platforms.
Common Crawl, while not a pirate source, collects copyright-protected content without permission and is widely used in training.
5. Potential Legal and Ethical Implications
The widespread use of pirated books, movies, and music in training data raises significant copyright concerns.
AI companies could face increasing lawsuits from authors, publishers, music labels, and studios.
Legal and regulatory scrutiny is likely to intensify, potentially forcing AI companies either to retrain models on licensed content or to accept financial and reputational risks.
How AI Makers Should Respond
To mitigate risks and build trust, AI developers should take the following actions:
Stop Using Infringing Content
AI companies must cease using pirate datasets such as Books3, OpenSubtitles, and other shadow libraries.
This means auditing existing training data and removing content obtained from unauthorized sources.
Obtain Proper Licensing
AI companies should negotiate licensing agreements with publishers, studios, and record labels.
Google and OpenAI have already made deals with some publishers—other companies should follow suit.
Increase Transparency in Training Data
Publicly disclose what datasets are used for AI model training.
Implement clear policies on content sourcing to avoid legal disputes.
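One way to make such a disclosure concrete is a machine-readable manifest published alongside the model. The sketch below is purely illustrative: the model name, dataset names, and field layout are assumptions, not an established disclosure standard.

```python
import json

# Hypothetical training-data disclosure: a machine-readable manifest
# listing each dataset, where it came from, and its licensing basis.
# All names and values are illustrative assumptions.
disclosure = {
    "model": "example-model-v1",
    "training_datasets": [
        {
            "name": "licensed-news-corpus",
            "origin": "direct publisher agreement",
            "license": "commercial-training-license",
        },
        {
            "name": "public-domain-books",
            "origin": "digitized public-domain works",
            "license": "public-domain",
        },
    ],
}

# Serialize for publication alongside the model card.
print(json.dumps(disclosure, indent=2))
```

Publishing the manifest in a structured format rather than prose makes it auditable: rights holders can search it, and regulators can diff it between model versions.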
Develop AI-Safe Data Practices
AI developers should use legally sourced, high-quality data rather than relying on scraped or pirated material.
Build AI datasets through partnerships with rights holders instead of leveraging unauthorized sources.
Adopt Technical Safeguards
AI companies should implement watermarking, attribution tracking, and opt-out mechanisms for content creators.
Respect copyright law and comply with emerging AI regulations worldwide.
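As a minimal sketch of the opt-out point above, a training-data crawler can honor robots.txt (the Robots Exclusion Protocol) before collecting a page, using Python's standard-library parser. The user-agent string is an assumption; real systems would also honor emerging signals such as "noai" directives.

```python
from urllib import robotparser

def may_fetch(robots_txt: str, url: str, agent: str = "ExampleAIBot") -> bool:
    """Check a robots.txt body to see if `agent` may fetch `url`.

    `agent` is a hypothetical crawler name used for illustration.
    """
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

# A site opting this crawler out of all paths:
rules = """User-agent: ExampleAIBot
Disallow: /
"""
print(may_fetch(rules, "https://example.com/article"))  # blocked for this agent
```

This only covers the crawling side; attribution tracking and watermarking of model output would require separate machinery at training and inference time.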
Engage with Policymakers and Rights Holders
Work with governments, regulatory bodies, and industry organizations to define ethical AI practices.
Develop AI governance frameworks that balance innovation with intellectual property rights.
Prepare for Regulatory Action
Governments are increasingly scrutinizing AI training practices—AI makers must prepare for compliance.
Future AI models may need strict auditing and dataset verification to meet legal standards.
Conclusion
The report underscores the AI industry’s systemic reliance on pirated content and the urgent need for responsible data practices. AI companies must shift toward licensed, transparent, and ethical AI training or risk escalating legal and reputational consequences.
