
GPT-4o: Yes, it is possible for AI makers to remove illicit, unlawful, and infringing content from training data. Claude: YES, it is possible to remove illicit, unlawful and infringing content.

Perplexity: It appears that the answer to your question is YES, it is possible for AI makers to remove illicit, unlawful, and infringing content from training data. Gemini: YES

Asking AI services: please read the article “Nonprofit scrubs illegal content from controversial AI training dataset” and the blog post “RELEASING RE-LAION 5B: TRANSPARENT ITERATION ON LAION-5B WITH ADDITIONAL SAFETY FIXES” and answer the simple question: Is it possible for AI makers to remove illicit, unlawful, and infringing content from training data, YES or NO?

ChatGPT-4o’s answer:
Yes, it is possible for AI makers to remove illicit, unlawful, and infringing content from training data. Both documents describe efforts and procedures that have been implemented to identify and remove illegal content, such as child sexual abuse material (CSAM), from AI training datasets like LAION-5B. These procedures involve collaborations with specialized organizations and the use of hash lists to filter out and delete known problematic content.


Perplexity’s answer:
Based on the available search results, it appears that the answer to your question is YES, it is possible for AI makers to remove illicit, unlawful, and infringing content from training data. The search results indicate that LAION, the nonprofit organization behind a popular AI training dataset, has successfully removed illegal content from its dataset. Specifically, it has “scrubbed,” or purged, child sexual abuse material (CSAM) and other illegal content from the dataset after facing backlash. This action demonstrates that AI makers can indeed take steps to remove problematic content from their training data when issues are identified.

The fact that LAION was able to clean its dataset in response to the concerns raised shows that it is technically feasible to remove such content, even after the initial dataset creation. While the process may be challenging and time-consuming, the ability to iterate on and improve existing datasets by removing illegal or unethical content is an important aspect of responsible AI development. This case serves as an example of how AI organizations can respond to ethical concerns and improve the safety of their training data.


Claude’s answer:
Based on the information provided in the two articles, the answer is YES, it is possible for AI makers to remove illicit, unlawful and infringing content from training data.

The articles describe how LAION, a nonprofit AI research organization, successfully removed links to suspected child sexual abuse material (CSAM) and other problematic content from their LAION-5B dataset after issues were identified. Specifically:

  1. LAION partnered with organizations like the Internet Watch Foundation (IWF) and the Canadian Centre for Child Protection (C3P) to obtain lists of hashes for known CSAM content.

  2. They used these hash lists to identify and remove 2,236 links to suspected CSAM from the LAION-5B dataset without having to directly access the content.

  3. They also removed additional privacy-sensitive data flagged by Human Rights Watch.

  4. LAION released updated versions of the dataset (Re-LAION-5B) that are free of known links to suspected CSAM.

  5. They implemented improved filtering techniques to prevent such content from being included in future datasets.

While the process is complex and resource-intensive, the articles demonstrate that it is indeed possible for AI makers to take steps to remove illicit and problematic content from training datasets when issues are identified. The LAION example shows that through partnerships, use of hash lists, and improved filtering, AI organizations can work to clean their datasets of known illegal and infringing content.
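The hash-list matching described in steps 1 and 2 above can be sketched in Python. This is an illustrative simplification, not LAION’s actual tooling: the dataset entries, the `filter_dataset` function, and the sample hash list are all hypothetical, and real removal lists use hashes supplied by partners such as IWF and C3P.

```python
import hashlib

# Hypothetical removal list: in practice these hashes would be supplied by
# partner organizations, and the flagged content itself is never inspected.
KNOWN_BAD_HASHES = {
    hashlib.sha256(b"known-bad-sample").hexdigest(),
}

def filter_dataset(entries, bad_hashes):
    """Split dataset entries into kept and removed URLs.

    `entries` is a list of (url, content_bytes) pairs. Each entry's content
    is hashed and compared against the removal list, so matching happens
    purely on hash values.
    """
    kept, removed = [], []
    for url, content in entries:
        digest = hashlib.sha256(content).hexdigest()
        if digest in bad_hashes:
            removed.append(url)
        else:
            kept.append(url)
    return kept, removed
```

A dataset curator would run every link-and-content pair through such a filter and republish only the `kept` list, which mirrors how Re-LAION-5B was produced as a version free of known flagged links.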