- Pascal's Chatbot Q&As
GPT-4o: Meta’s actions—torrenting massive amounts of pirated books, concealing its tracks, and seeding copyrighted works—are ethically and legally troubling.
Courts should allow full discovery, enforce injunctions, and consider criminal referrals. AI companies must improve compliance and use licensed datasets.
Asking AI Services: Please read the articles "“Torrenting from a corporate laptop doesn’t feel right”: Meta emails unsealed" and "‘Meta Torrented over 81 TB of Data Through Anna’s Archive, Despite Few Seeders’" and explain what happened, whether this is bad behaviour or not and what a) the judge should do and b) what the AI maker(s) should do to prevent this from happening and remedy this. Do you think that there is anything US or EU regulators should do?
ChatGPT-4o’s answer:
Analysis of the Meta Torrenting Case
What Happened?
Meta Downloaded Massive Amounts of Pirated Books
Court documents revealed that Meta downloaded at least 81.7 terabytes of pirated books from shadow libraries such as LibGen and Z-Library via BitTorrent through the site Anna’s Archive.
The books were allegedly used for training its AI models.
Meta Knew This Could Be Legally Problematic
Internal emails show that Meta employees expressed concerns about torrenting copyrighted content.
A researcher wrote: "Torrenting from a corporate laptop doesn’t feel right", highlighting concerns about using Meta IP addresses to download pirated books.
Another email mentioned "seeding", which means Meta’s systems were also distributing pirated content to others.
Meta Took Steps to Conceal Its Actions
Emails show that Meta avoided using Facebook servers so the downloads could not be traced back to the company.
Meta modified settings to minimize seeding and limit further distribution, but still contributed to the spread of the books.
Legal and Copyright Implications
The plaintiffs (authors) argue that downloading and seeding these books amounts to massive copyright infringement.
They claim that even small-scale piracy has resulted in criminal referrals, making Meta’s actions particularly serious.
Meta defends itself by arguing that using these datasets for AI training is fair use under U.S. copyright law.
Is This Bad Behavior?
Yes. Several ethical and legal concerns arise from this:
Legal Violation: Downloading copyrighted content without a license, and seeding (distributing) it, potentially violates copyright laws.
Corporate Ethics: Meta ignored internal warnings from employees about the risks.
Deceptive Practices: Attempting to conceal the downloads and seeding shows awareness of wrongdoing.
Fair Use Defense is Weak: U.S. courts have not yet accepted the argument that large-scale AI training on pirated books is fair use. Even if AI training were considered fair use, acquiring data through piracy remains illegal.
What Should Be Done?
(a) What Should the Judge Do?
Deny Meta’s Motion to Dismiss
The court should reject Meta’s argument that there’s no evidence of third-party downloads.
Seeding, even in minimal amounts, legally constitutes distribution, which strengthens the authors’ copyright claim.
Allow Discovery of Torrent Logs
Plaintiffs should be granted access to Meta’s torrenting logs to confirm the extent of the seeding.
Impose Preliminary Injunctions
The judge could order Meta to stop using any AI models trained on these datasets while the case proceeds.
Refer for Criminal Investigation
Given that small-scale piracy cases have led to criminal referrals, the judge could refer the case to the U.S. Department of Justice (DOJ) for further investigation.
(b) What Should AI Makers (Including Meta) Do?
Stop Using Pirated Data
AI companies must only use licensed datasets or public domain materials.
Implement Stronger Compliance Measures
AI firms should have clear policies that prevent employees from downloading copyrighted data from illegal sources.
Develop Alternative Data Sources
AI companies should collaborate with publishers and authors to create legal training datasets.
Audit AI Training Pipelines
Companies should implement regular legal reviews of all datasets used for AI training.
Remedy Past Harm
Meta should delete models trained on pirated books and offer compensation to affected authors.
Regulatory Recommendations
(c) What Should U.S. and EU Regulators Do?
Strengthen AI Transparency Laws
Require AI developers to disclose all datasets used for training.
Implement audits of AI training data.
Introduce Copyright Safeguards for AI Training
Establish a legal framework ensuring AI training data must be obtained legally.
Require licensing agreements between AI companies and content owners.
Increase Criminal Liability for Large-Scale Piracy
If a company knowingly downloads copyrighted works at this scale, regulators should consider criminal penalties.
Require AI Firms to Pay Royalties to Authors
Implement a system where AI companies pay fees to copyright holders if their works are used in AI training.
Conclusion
Meta’s actions—torrenting massive amounts of pirated books, concealing its tracks, and seeding copyrighted works—are ethically and legally troubling. Courts should allow full discovery, enforce injunctions, and consider criminal referrals. AI companies must improve compliance and use licensed datasets. Regulators in the U.S. and EU should tighten AI copyright rules and enforce greater transparency in training data usage.
Claude’s answer:
Based on the articles, Meta has been accused of downloading at least 81.7 terabytes of copyrighted books from shadow libraries through torrenting, specifically using Anna's Archive as a source. The company used this data to train its AI models. Several concerning elements emerged from unsealed emails:
Internal Awareness of Issues: Meta employees expressed concerns about the legality and ethics of these actions. A Meta research engineer, Nikolay Bashlykov, noted that "torrenting from a corporate laptop doesn't feel right" and raised concerns about using Meta IP addresses to download pirated content. Another employee stated that "using pirated material should be beyond our ethical threshold."
Attempts to Conceal: Meta allegedly tried to hide its torrenting activities by avoiding Facebook servers to prevent tracing the downloads back to them. They also modified settings to minimize seeding (sharing) of the downloaded content.
Regarding your specific questions:
What the judge should do: The judge should carefully evaluate both the direct copyright infringement claims and the broader implications for AI training. While Meta argues this falls under "fair use," the scale of the downloading (81.7TB) and the apparent attempts to conceal the activity suggest potentially willful infringement. The judge should consider:
Whether this constitutes systematic copyright infringement
If Meta's attempts to hide the activity demonstrate awareness of wrongdoing
The implications of using BitTorrent, which inherently involves redistributing content
Whether "fair use" can apply to such large-scale, systematic copying of copyrighted works
What AI makers should do:
Establish clear ethical guidelines for training data acquisition
Develop legitimate licensing frameworks with publishers and content creators
Create transparent documentation of training data sources
Implement internal compliance mechanisms to prevent unauthorized data collection
Invest in creating or acquiring properly licensed datasets
Work with content creators to establish fair compensation models
Regulatory Response: Both US and EU regulators should consider:
US Regulators:
Clarify fair use doctrine as it applies to AI training
Establish clear guidelines for permissible data collection practices
Create frameworks for compensating copyright holders
Implement mandatory transparency requirements for AI training data
EU Regulators:
Strengthen the AI Act to address training data acquisition
Develop specific provisions for copyright in AI training
Create mechanisms for auditing AI training data sources
Establish clear liability frameworks for copyright infringement in AI training
This case highlights a critical tension in AI development: the need for vast amounts of training data versus copyright protection and fair compensation for creators. The solution likely requires a new regulatory framework that balances innovation with creators' rights, possibly including:
Compulsory licensing schemes for AI training
Clear guidelines on permissible data collection methods
Transparency requirements for training data sources
Fair compensation mechanisms for content creators
This is problematic behavior because:
It potentially undermines copyright law
It shows disregard for creators' rights
The attempted concealment suggests awareness of wrongdoing
It sets a concerning precedent for AI development
It creates unfair competition with companies that acquire data legally
The broader implications suggest we need a comprehensive framework for AI training data acquisition that respects intellectual property rights while enabling AI development. This might require new legislative approaches that go beyond traditional copyright law.
Perplexity’s answer:
Okay, I've reviewed the articles "'Torrenting from a corporate laptop doesn't feel right': Meta emails unsealed" from Ars Technica and "'Meta Torrented over 81 TB of Data Through Anna's Archive, Despite Few Seeders'" from TorrentFreak. Here's a breakdown of what happened, an assessment of the behavior, and potential actions:
What Happened?
The articles detail allegations that Meta illegally torrented vast quantities of copyrighted books to train its AI models. Key points:
Torrenting Activity: Meta allegedly torrented at least 81.7 terabytes of data from shadow libraries like LibGen and Z-Library through Anna's Archive. This followed a previous torrenting of 80.6 terabytes of data from LibGen.
Internal Concerns: Meta employees, including a research engineer named Nikolay Bashlykov, expressed concerns about the legality and ethics of torrenting copyrighted material on company resources. Bashlykov specifically mentioned the risk of using Meta IP addresses to download pirated content and the legal implications of "seeding" (sharing) the files.
Concealment Attempts: There are allegations that Meta attempted to hide its torrenting activities, including avoiding the use of Facebook servers to download the data and modifying settings to minimize seeding.
Copyright Infringement Lawsuit: Authors have filed a copyright infringement lawsuit against Meta, claiming that the company used their works without permission to train its AI models. The torrenting activity is seen as further evidence of copyright infringement, particularly regarding the distribution of copyrighted material.
Meta's Defense: Meta argues that its use of the copyrighted material falls under "fair use" for AI training purposes. They also claim that the plaintiffs haven't shown any instances where a third party actually downloaded copyrighted material from Meta via torrent.
Is This Bad Behavior?
Yes, based on the information presented, this appears to be questionable behavior for the following reasons:
Copyright Infringement: Torrenting and distributing copyrighted material without permission is a violation of copyright law.
Ethical Concerns: Even if Meta believes its actions fall under "fair use," the internal concerns raised by employees suggest an awareness of the ethical implications of using pirated material.
Concealment: Attempts to hide the torrenting activity indicate an understanding that the behavior was problematic.
Scale: The sheer volume of data allegedly torrented (over 81 terabytes) raises concerns about the impact on copyright holders.
Recommendations
a) What the Judge Should Do:
Thoroughly Evaluate the "Fair Use" Claim: The judge needs to carefully consider whether Meta's use of the copyrighted material for AI training meets the criteria for fair use. This will likely involve examining the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use on the potential market for or value of the copyrighted work.
Consider the Distribution Aspect: The judge should pay close attention to the allegations of seeding, which involves distributing copyrighted material to others. This could strengthen the copyright infringement claim.
Assess the Credibility of Evidence: The judge needs to evaluate the unsealed emails and other evidence to determine the extent of Meta's knowledge and involvement in the torrenting activities.
Consider Potential Remedies: If Meta is found liable for copyright infringement, the judge should consider appropriate remedies, such as monetary damages and injunctive relief (e.g., requiring Meta to stop using the copyrighted material).
Referral for Criminal Investigation: The authors' filing notes that "vastly smaller acts of data piracy" have resulted in referrals to US Attorneys for criminal investigation. The judge should consider whether the scale and nature of Meta's alleged actions warrant a similar referral.
b) What the AI Maker(s) (Meta) Should Do to Prevent This and Remedy It:
Cease Torrenting Copyrighted Material: Immediately stop torrenting copyrighted material without permission.
Conduct a Copyright Audit: Conduct a thorough audit of the data used to train its AI models to identify any other instances of copyright infringement.
Obtain Licenses or Permissions: Seek licenses or permissions from copyright holders to use their works for AI training.
Develop Alternative Data Sources: Explore alternative data sources for AI training, such as public domain works, licensed datasets, or data generated with permission.
Implement Copyright Compliance Policies: Implement clear copyright compliance policies and training programs for employees.
Be Transparent: Be transparent with copyright holders about the data used to train AI models.
Compensate Copyright Holders: Consider establishing a fund to compensate copyright holders whose works were used without permission.
Role of US/EU Regulators
US and EU regulators could play a role in addressing this type of issue:
Clarify Copyright Law for AI Training: Regulators could provide clearer guidance on the application of copyright law to AI training, including the scope of fair use or similar exceptions. This would help AI developers understand their obligations and avoid infringing copyright.
Promote Licensing Solutions: Regulators could encourage the development of licensing solutions that make it easier for AI developers to obtain permission to use copyrighted material.
Increase Enforcement: Regulators could increase enforcement of copyright law in the context of AI, sending a message that copyright infringement will not be tolerated.
International Cooperation: Given the global nature of AI development and copyright, international cooperation is essential. US and EU regulators could work together to develop consistent approaches to copyright and AI.
Consider Data Scraping Regulation: Regulators could explore regulations around data scraping and usage for AI training, balancing the need for innovation with the rights of content creators.
In summary, the situation presents serious legal and ethical questions about the use of copyrighted material for AI training. A thorough investigation and appropriate remedies are needed to address the alleged wrongdoing and prevent similar incidents in the future.
Grok’s answer:
What Happened?
From the Articles:
Meta's Alleged Torrenting: Meta, the parent company of platforms like Facebook and Instagram, was found to have downloaded over 81 terabytes of data from shadow libraries like Anna's Archive, Z-Library, and LibGen. This data included copyrighted material, primarily books, using BitTorrent protocols.
Internal Concerns: Internal communications revealed that Meta employees were aware of the legal and ethical implications of downloading pirated content. There were discussions about the risks of seeding (sharing) this content, with one employee specifically noting discomfort about torrenting from a corporate laptop.
Legal Actions: Authors, including Richard Kadrey, Sarah Silverman, and Christopher Golden, have filed a class-action lawsuit against Meta, alleging copyright infringement due to this behavior. The lawsuit highlights that Meta's use of BitTorrent could facilitate further infringement by acting as a distribution point for pirated content.
Is This Bad Behavior?
Ethically and Legally Questionable: The actions are certainly ethically questionable, as they involve using pirated materials, which could undermine the rights of authors and creators. Legally, downloading and distributing copyrighted works without permission is a clear infringement unless it falls under fair use or other legal exceptions.
Corporate Responsibility: From a corporate governance perspective, Meta's actions suggest a lack of oversight or ethical consideration in data acquisition practices for AI training, potentially risking the company's reputation and legal standing.
What Should Be Done?
a) What the Judge Should Do:
Evidence Review: The judge should thoroughly examine all evidence, including the unsealed emails and torrent logs, to establish the facts of the case.
Legal Precedents: Consider previous rulings on copyright infringement and fair use, especially in the context of AI training data.
Ruling on Infringement: Determine if Meta's actions constitute copyright infringement or if they qualify under fair use doctrine, considering the transformative nature of using the data for AI training.
Remedies: If infringement is found, appropriate remedies could include:
Cease and desist orders to stop the use of pirated materials.
Financial compensation for the authors whose works were used without permission.
Potentially, a referral for criminal investigation if the scale and intent of the infringement are deemed severe.
b) What AI Makers Should Do:
Ethical Data Practices: Implement stringent policies for data sourcing, ensuring that all data used for AI training is legally obtained, licensed, or unambiguously covered by fair use.
Alternative Data Sources: Explore or develop legal datasets or partnerships with content creators to access data without infringement issues.
Transparency: Increase transparency about data usage practices, particularly how copyrighted material is used or transformed in AI models.
Compliance Teams: Strengthen internal compliance teams to monitor and guide data acquisition activities, ensuring they align with legal and ethical standards.
Remedy: If infringement is confirmed, AI makers should:
Cease using the infringing data.
Compensate or negotiate with copyright holders.
Retrain models with legally obtained data.
Regulatory Actions:
US and EU Regulators:
Investigate: Both US and EU regulators could investigate Meta and similar tech firms for compliance with copyright laws, focusing on how data for AI is sourced.
Guidelines and Regulations: Develop or enforce clearer guidelines on what constitutes fair use in the context of AI training, particularly with copyrighted material.
Privacy and Data Use: Ensure that data practices respect privacy laws like GDPR in Europe or CCPA in California, which could be tangential to copyright issues but are relevant in data handling.
Market Fairness: Consider regulations to prevent unfair competition advantages gained through unethical or illegal data acquisition practices.
In summary, Meta's actions raise significant legal and ethical questions about data use in AI development. Both judicial and regulatory responses should aim to clarify and enforce standards that protect intellectual property while supporting innovation in technology.
