Pascal's Chatbot Q&As
GPT-4: There needs to be a balance that allows for the continued growth and innovation in AI while respecting and protecting the rights of creators

GPT-4: Yes, the issues of data provenance and licensing are significant for both AI developers (makers) and rights owners (creators). Here's why...

November 09, 2023

Question 1 of 3 for ChatGPT-4: What does this article say? Public AI Training Datasets Are Rife With Licensing Errors - An audit of popular datasets suggests developers face legal and ethical risks https://spectrum.ieee.org/data-ai

ChatGPT-4’s answer:

The article from IEEE Spectrum discusses the legal and ethical issues surrounding the use of large datasets for training AI models, particularly natural language processing (NLP) models. Here are the key points:

Legal Status of Training Sets: Many datasets used for training AI are of dubious legal status. The Data Provenance Initiative, led by researchers from MIT and Cohere for AI, audited over 1,800 datasets and found that more than 70% had no data licenses, and that many of those that did carried licenses that were incorrect or too permissive.
Data Provenance Concerns: The initiative highlights the uncertainty in data provenance, which is problematic for practitioners who aim to use data responsibly. The audit used both manual and automated methods, including GPT-4, to analyze the datasets.
Issues with Licensing: The researchers found that a significant portion of datasets on platforms like GitHub, Hugging Face, and Papers With Code had unspecified licensing information. Even when licenses were specified, they often did not align with the original creators' intentions.
Complications with Data Compilations: Many publicly available datasets are compilations of smaller sets, each with its own licensing requirements. This can lead to conflicting licenses and the inclusion of datasets that should not have been used.
Cultural and Legal Ambiguity: There is a pervasive culture within the AI community that overlooks the importance of data provenance and licensing. Additionally, there is ambiguity in how copyright law applies to AI training data, and it varies by country.
Data Provenance Explorer Tool: The initiative has created a free online tool called the Data Provenance Explorer, which allows developers to trace the provenance of datasets and filter for different licensing conditions.
Challenges for Developers: Establishing data provenance is difficult and time-consuming for developers. There is some progress in standardized datasheets for datasets, but adoption is not widespread, and data-hosting sites have little incentive to require them.

The article also includes perspectives from other experts who agree on the importance of the issue but note that the term "data provenance" is perhaps not used accurately in this context, as it usually refers to the entire history of a dataset's changes and transformations, not just licensing.

The article underscores the complexity and the critical nature of data licensing and provenance in the AI field, pointing out the need for better practices and tools to ensure responsible use of data.
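The point about data compilations can be made concrete with a small sketch. The Python below is illustrative only: the `LicenseTerms` fields, the `KNOWN_TERMS` table, and the dataset names are hypothetical simplifications introduced here, not anything taken from the article or from the Data Provenance Explorer, and real license texts carry more conditions than three flags. It shows how a compilation ends up bound by the most restrictive terms among its components, and how components with no recognizable license remain unresolved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LicenseTerms:
    """Hypothetical, heavily simplified summary of a license's conditions."""
    commercial_use: bool
    attribution_required: bool
    share_alike: bool


# Illustrative table keyed by SPDX-style identifiers. This is an assumption
# for the sketch, not legal advice and not a complete description of any license.
KNOWN_TERMS = {
    "CC0-1.0": LicenseTerms(commercial_use=True, attribution_required=False, share_alike=False),
    "CC-BY-4.0": LicenseTerms(commercial_use=True, attribution_required=True, share_alike=False),
    "CC-BY-NC-4.0": LicenseTerms(commercial_use=False, attribution_required=True, share_alike=False),
    "CC-BY-SA-4.0": LicenseTerms(commercial_use=True, attribution_required=True, share_alike=True),
}


def combined_terms(component_licenses: dict[str, Optional[str]]):
    """Return (terms, unresolved) for a compilation of datasets.

    terms is the most restrictive combination over components whose license
    is in KNOWN_TERMS, or None if no component has a known license.
    unresolved lists components whose license is missing or unrecognized.
    """
    unresolved = sorted(
        name for name, lic in component_licenses.items() if lic not in KNOWN_TERMS
    )
    known = [KNOWN_TERMS[lic] for lic in component_licenses.values() if lic in KNOWN_TERMS]
    if not known:
        return None, unresolved
    terms = LicenseTerms(
        # A permission survives only if every component grants it.
        commercial_use=all(t.commercial_use for t in known),
        # An obligation applies if any component imposes it.
        attribution_required=any(t.attribution_required for t in known),
        share_alike=any(t.share_alike for t in known),
    )
    return terms, unresolved


# Hypothetical compilation: one component is non-commercial, one has no license.
compilation = {
    "news_qa": "CC-BY-4.0",
    "forum_dialogues": "CC-BY-NC-4.0",
    "scraped_reviews": None,
}
terms, unresolved = combined_terms(compilation)
print(terms)
print("Unresolved components:", unresolved)
```

In this sketch a single non-commercial component removes commercial use for the whole compilation, and the unlicensed component cannot be cleared at all without further investigation, which mirrors the conflict the article describes.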

Question 2 of 3 for ChatGPT-4: Do you think this is an issue for AI makers and for rights owners and creators? Why is that?

ChatGPT-4’s answer:

Yes, the issues of data provenance and licensing are significant for both AI developers (makers) and rights owners (creators). Here's why:

For AI Makers:

Legal Risks: Using datasets without clear licensing can lead to legal challenges for AI developers. If a dataset includes copyrighted material without proper authorization, developers could face lawsuits or be required to cease using their models.
Ethical Considerations: There is an ethical obligation to respect the intellectual property rights of creators. Using data without permission can undermine trust in the AI community and the technology itself.
Model Integrity: The quality and reliability of an AI model depend on the data it was trained on. If the provenance of the data is unclear, it can be difficult to assess the model's biases, limitations, and overall integrity.
Reputation: Companies and developers could suffer reputational damage if found to be using data irresponsibly, which could lead to a loss of user trust and potential business.
Barrier to Innovation: Uncertainty around data licensing can stifle innovation, as developers may be hesitant to use datasets with unclear or risky legal statuses.

For Rights Owners and Creators:

Copyright Infringement: Creators may find their copyrighted works used without permission, compensation, or credit, which infringes on their legal rights and can lead to financial losses.
Loss of Control: Creators lose control over how their work is used, which can lead to it being associated with projects or causes they do not endorse.
Economic Impact: If AI companies use copyrighted material without licensing it properly, it undermines the market for the creators' work. This can have a broader economic impact on industries that rely on copyright for revenue.
Moral Rights: In some jurisdictions, authors have moral rights in their work, which include the right to attribution and the right to object to derogatory treatment of the work. Improper use of their work in AI training sets can violate these rights.

The intersection of AI development and copyright law is complex and still evolving. As AI continues to advance and consume larger datasets, the importance of resolving these issues grows. There needs to be a balance that allows for the continued growth and innovation in AI while respecting and protecting the rights of creators. This balance is crucial for the sustainable and ethical development of AI technologies.

Question 3 of 3 for ChatGPT-4: What should AI makers be doing to address this and what can others do?

ChatGPT-4’s answer:

AI makers and the broader community, including data providers, legislators, and rights owners, can take several steps to address the issues of data provenance and licensing:

For AI Makers:

Due Diligence: Perform thorough due diligence on datasets before using them. This includes checking for clear licensing information and ensuring that the data can be legally used for the intended purpose.
Use Licensed Datasets: Whenever possible, use datasets that have clear and appropriate licenses. If a dataset lacks a license, AI makers should seek permission from the rights holders or avoid using the dataset altogether.
Contribute to Open Data: Consider contributing to or using open datasets that have been created with proper licensing and provenance. This can help create a culture of sharing and transparency.
Develop Clear Policies: Establish clear internal policies for data governance, including the use of third-party data, and ensure that all employees are trained on these policies.
Support Data Provenance Tools: Utilize and support the development of tools like the Data Provenance Explorer to trace and verify the origins and licensing of datasets.
Advocate for Standardization: Work with industry groups to advocate for standardized metadata and datasheets for datasets, which include licensing and provenance information.
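As a rough illustration of the due-diligence and licensed-dataset steps above, the sketch below triages candidate datasets before training. Everything in it is an assumption made for illustration: the `DatasetRecord` fields, the status labels, and the `allowed` set, which stands in for whatever allow-list a team's own policy or legal review produces. It is not a statement about which licenses actually permit a given use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetRecord:
    """Hypothetical metadata a team might keep for each candidate dataset."""
    name: str
    license_id: Optional[str]   # SPDX-style identifier, or None if unspecified
    source_url: Optional[str]   # where the data originally came from, if known


def review(record: DatasetRecord, allowed: set[str]) -> str:
    """Classify a dataset for the intended use under an internal allow-list."""
    if record.license_id is None:
        # No license: seek permission from the rights holders or avoid the dataset.
        return "needs-permission"
    if record.source_url is None:
        # A license with no traceable origin cannot be verified.
        return "needs-provenance"
    if record.license_id not in allowed:
        return "excluded"
    return "approved"


# Hypothetical allow-list and candidates, for illustration only.
allowed = {"CC0-1.0", "CC-BY-4.0"}
candidates = [
    DatasetRecord("open_corpus", "CC-BY-4.0", "https://example.org/open_corpus"),
    DatasetRecord("forum_dump", None, "https://example.org/forum_dump"),
    DatasetRecord("mirror_copy", "CC0-1.0", None),
    DatasetRecord("nc_reviews", "CC-BY-NC-4.0", "https://example.org/nc_reviews"),
]
for record in candidates:
    print(f"{record.name}: {review(record, allowed)}")
```

Keeping the allow-list as an explicit input, rather than hard-coding it, reflects the "clear internal policies" step: the policy decision stays with the organization, and the code only applies it consistently.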

For Data Providers and Platforms:

Enforce Licensing Standards: Platforms that host datasets should enforce standards for licensing information, making it a requirement for dataset uploads.
Educate Users: Provide guidance and education to users on the importance of data provenance and licensing, and how to comply with legal requirements.
Implement Verification Processes: Develop and implement processes to verify the provenance and licensing of datasets before they are made available.
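On the platform side, enforcing a licensing standard at upload time could look something like the following sketch. The metadata keys (`license`, `source`), the `RECOGNIZED_LICENSES` set, and the error messages are hypothetical; no real hosting platform's API or upload schema is being described here.

```python
# Hypothetical set of identifiers a platform chooses to accept.
RECOGNIZED_LICENSES = {"CC0-1.0", "CC-BY-4.0", "CC-BY-NC-4.0", "CC-BY-SA-4.0", "Apache-2.0"}


def validate_upload(metadata: dict, recognized: set[str] = RECOGNIZED_LICENSES) -> list[str]:
    """Return a list of problems with an upload's metadata; empty means accepted."""
    errors = []
    license_id = metadata.get("license")
    if not license_id:
        errors.append("license is required")
    elif license_id not in recognized:
        errors.append(f"unrecognized license: {license_id}")
    if not metadata.get("source"):
        # Without a stated origin, the license claim cannot be checked later.
        errors.append("source is required")
    return errors


# Hypothetical uploads, for illustration only.
print(validate_upload({"name": "open_corpus", "license": "CC-BY-4.0", "source": "https://example.org/open_corpus"}))
print(validate_upload({"name": "mystery_set"}))
```

A check like this only confirms that licensing information is present and well-formed. Verifying that the stated license matches the original creators' intentions would still need the kind of manual or assisted audit the article describes.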

For Legislators and Policy Makers:

Clarify Copyright Laws: Work to clarify how copyright laws apply to datasets and AI training, providing clear guidelines for AI developers and rights owners.
Promote Data Sharing Legislation: Consider legislation that promotes the sharing of data in ways that protect the rights of creators while fostering innovation.

For Rights Owners and Creators:

Monitor Use: Use tools to monitor how their work is being used and ensure it is not included in AI training sets without permission.
Assert Rights: Be proactive in asserting their rights and seek legal advice if their work is used without proper licensing.
Engage in Dialogue: Engage in dialogue with AI developers and platforms to find mutually beneficial ways to use content while respecting copyright.

For the AI Community as a Whole:

Promote Ethical Standards: Develop and adhere to ethical standards for AI development, including the responsible sourcing of training data.
Collaborate on Solutions: Work collaboratively across sectors to develop solutions that address data provenance and licensing issues.

By taking these steps, AI makers and the community can help ensure that the development of AI technologies is both legally compliant and ethically responsible, which is essential for the long-term success and acceptance of AI.