GPT-4o: In practical terms, complete deletion of specific information from an LLM is challenging due to the nature of how these models learn and store information.

The most feasible current solutions involve some combination of fine-tuning and suppression, although these come with trade-offs in terms of effectiveness and compliance with privacy regulations.

Question 1 of 4 for ChatGPT-4o: Please read this LinkedIn post with its 98 comments and tell me: what is the best answer provided to the LinkedIn user's main question: "What is the exact technical mechanism used by a LLM model to 'forget' something? OpenAI claims that it can honor a deletion request (but not a rectification request), and I'm curious how that is achieved technically-speaking? Do they delete all underlying source data? Do some fine-tuning to force the LLM to forget (https://lnkd.in/etY2sRW8)? Update some sort of global prompt?"

ChatGPT-4o’s answer:
The best answer to the LinkedIn user's question about the technical mechanism an LLM uses to 'forget' something is not a single comment but a synthesis of several responses in the thread. Here are the key points, combined from those responses, that together provide a comprehensive explanation:

  1. Data Deletion and Suppression:

    • One approach to 'forgetting' is to delete the data from the training sets. However, since the data is not directly stored in the weights of the neural net, deleting it from the training set does not straightforwardly 'untune' the model. Retraining from scratch is an option, but it is incredibly costly and time-consuming.

  2. Fine-tuning:

    • Another method is to fine-tune the model by penalizing specific outputs. This involves adjusting the model's weights and biases based on assessments of its behavior and making targeted corrections.

  3. Suppression:

    • Suppression involves tuning the LLM so that certain outputs do not occur. This doesn't delete the information but makes the model less likely to generate it. This method is often more economically feasible, but it may not fully comply with data privacy laws since the information is still technically present in the model.

  4. Disgorgement:

    • A more radical approach than suppression, but less so than complete retraining, is 'disgorgement,' which involves removing specific behavior patterns from the model. This can be seen as a step between suppression and full deletion.

  5. Black Box Nature of Deep Learning:

    • A significant challenge is the black-box nature of deep learning models. We do not yet have the mathematical tools to precisely map specific training inputs to the model's parameters, which makes it difficult to delete or rectify information accurately.

  6. Economic and Practical Challenges:

    • The process is economically infeasible in many cases because retraining models from scratch, or even implementing sophisticated fine-tuning mechanisms, requires significant resources. Therefore, many solutions focus on making it harder for the model to access or generate the unwanted data rather than completely removing it.

In conclusion, the most comprehensive answer highlights that there is no simple technical mechanism for an LLM to 'forget' something completely. The feasible methods include data deletion (though often impractical due to retraining costs), fine-tuning, and suppression, each with its own limitations and challenges.


Question 2 of 4 for ChatGPT-4o: What would your answer be? Feel free to include anything not yet mentioned by the commenters.

ChatGPT-4o’s answer:
To technically achieve 'forgetting' in a large language model (LLM) like those developed by OpenAI, several mechanisms could be considered. These mechanisms vary in feasibility and efficiency, and there is rarely a one-size-fits-all solution. Here's a comprehensive overview; an illustrative (and heavily simplified) code sketch for each mechanism follows the list:

  1. Data Deletion from Training Sets:

    • Process: This involves identifying and removing specific data entries from the training datasets and retraining the model from scratch.

    • Challenges: This is extremely resource-intensive in terms of computational power, time, and costs. It may not always be feasible for very large models like GPT-4.

    • Current Feasibility: Often impractical for large models due to high costs.

  2. Fine-Tuning:

    • Process: Fine-tuning the model on a modified dataset that penalizes specific outputs or associations. This involves adjusting the model’s weights and biases.

    • Challenges: Requires significant resources and can introduce unintended biases or degrade model performance. It’s more targeted than retraining but still resource-intensive.

    • Current Feasibility: Used selectively for specific and critical updates.

  3. Output Suppression:

    • Process: Implementing filters or additional layers that suppress the generation of certain outputs. This could involve pre-processing the input or post-processing the output to avoid specific results.

    • Challenges: This method does not remove the information but makes it less likely to appear in outputs. It may not fully comply with legal requirements for data deletion.

    • Current Feasibility: Economically feasible and commonly used in practice.

  4. Ablation or Neuron Deactivation:

    • Process: Identifying and deactivating specific neurons or network pathways responsible for generating the undesired outputs.

    • Challenges: Technically complex and requires a deep understanding of the model’s architecture. It may not be foolproof and can lead to unintended side effects.

    • Current Feasibility: Experimental and not widely adopted yet.

  5. Machine Unlearning:

    • Process: Using specialized algorithms designed to 'unlearn' specific data points. This involves sophisticated techniques to revert the model's parameters to a state as if certain data had never been included.

    • Challenges: Still a developing field. Effective unlearning algorithms are complex and computationally demanding.

    • Current Feasibility: Limited practical implementations, mostly in research stages.

  6. Differential Privacy Techniques:

    • Process: Applying differential privacy methods during training so that individual data points have a minimal impact on the model's outputs, which reduces what there is to 'forget' in the first place.

    • Challenges: While it enhances privacy, it doesn’t completely remove specific information already embedded in the model.

    • Current Feasibility: Used more for privacy guarantees during training rather than post-hoc deletion.
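
To make mechanism 1 concrete, here is a minimal Python sketch of the filtering step that would precede a full retraining run. The records, the blocklist, and the substring matching are all illustrative assumptions, not any provider's actual pipeline:

```python
# Hypothetical pre-retraining filter: drop records that mention a
# subject covered by a deletion request.
DELETION_REQUESTS = {"jane doe"}  # assumed names under deletion requests

def mentions_deleted_subject(text: str) -> bool:
    # Crude substring match; a real pipeline would need entity resolution
    # to catch paraphrases, misspellings, and indirect references.
    return any(name in text.lower() for name in DELETION_REQUESTS)

corpus = [
    {"id": 1, "text": "Jane Doe spoke at the conference."},
    {"id": 2, "text": "The weather was mild in April."},
]

filtered = [r for r in corpus if not mentions_deleted_subject(r["text"])]
print(f"kept {len(filtered)} of {len(corpus)} records")
# The expensive part is not this filter: the model must then be retrained
# from scratch on the filtered corpus for the deletion to take effect.
```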
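
For mechanism 2, one research-style way to penalize specific outputs is gradient ascent on a 'forget set': fine-tune while negating the loss on sequences the model should stop reproducing. The sketch below uses PyTorch and Hugging Face Transformers with gpt2 as a stand-in; the forget text, learning rate, and single-pass loop are assumptions, not OpenAI's actual method:

```python
# Sketch of penalization-style fine-tuning: gradient *ascent* on text
# the model should unlearn (a generic illustration only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)

forget_texts = ["Example sentence the model should no longer produce."]

model.train()
for text in forget_texts:
    batch = tok(text, return_tensors="pt")
    out = model(**batch, labels=batch["input_ids"])
    loss = -out.loss          # negate the loss: push probability *down*
    optim.zero_grad()
    loss.backward()
    optim.step()
# Risk (as noted above): too many ascent steps degrade the whole model.
```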
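
Mechanism 3 is often implemented as a wrapper around generation rather than a change to the model. A minimal sketch, assuming a regex blocklist and a generic `generate` callable (both hypothetical):

```python
# Output suppression: the weights still encode the information, but a
# post-processing filter refuses to return it.
import re

BLOCKLIST = [re.compile(r"\bjane\s+doe\b", re.IGNORECASE)]  # assumed
REFUSAL = "I can't share that information."

def suppress(generate, prompt: str) -> str:
    """Wrap any text-generation callable with an output filter."""
    completion = generate(prompt)
    if any(p.search(completion) for p in BLOCKLIST):
        return REFUSAL  # suppressed, not deleted
    return completion

# Stand-in generator to show the behavior:
print(suppress(lambda p: "Jane Doe lives at 1 Main St.", "Who is Jane Doe?"))
# -> "I can't share that information."
```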
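
Mechanism 4 can be prototyped with a PyTorch forward hook that zeroes selected hidden units. The toy network and unit indices below are invented for illustration; actually identifying which units encode a given fact is the hard, unsolved part:

```python
# Ablation sketch: silence chosen neurons at inference time via a hook.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
ABLATE = [3, 7]  # hypothetical unit indices tied to the unwanted output

def ablation_hook(module, inputs, output):
    out = output.clone()
    out[:, ABLATE] = 0.0  # deactivate the selected units
    return out

model[0].register_forward_hook(ablation_hook)
print(model(torch.randn(2, 8)))  # forward pass runs with units 3 and 7 off
```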
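
For mechanism 5, one published strategy is SISA training (Bourtoule et al., "Machine Unlearning"): shard the data and train one sub-model per shard, so a deletion only forces retraining of the shard that held the point. A conceptual sketch with stand-in names and a toy "trainer":

```python
# SISA-style sketch: per-shard sub-models make unlearning cheap(er).
from typing import List

def train(shard: List[dict]) -> dict:
    return {"n_examples": len(shard)}   # stand-in "model" for the sketch

shards: List[List[dict]] = [[] for _ in range(4)]

def shard_of(record_id: int) -> int:
    return record_id % len(shards)      # deterministic shard assignment

records = [{"id": i, "text": f"example {i}"} for i in range(20)]
for r in records:
    shards[shard_of(r["id"])].append(r)
models = [train(s) for s in shards]     # initial training, one per shard

def unlearn(record_id: int) -> None:
    """Delete one record, then retrain only the shard that held it."""
    i = shard_of(record_id)
    shards[i] = [r for r in shards[i] if r["id"] != record_id]
    models[i] = train(shards[i])        # far cheaper than full retraining

unlearn(7)
print([m["n_examples"] for m in models])  # the affected shard shrank by one
```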
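
Mechanism 6 is usually enforced at training time via DP-SGD: clip each example's gradient, then add calibrated Gaussian noise, so no single record dominates the update. The toy model and constants below are assumptions; production systems would typically use a library such as Opacus:

```python
# Minimal DP-SGD sketch: per-example clipping + Gaussian noise.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
CLIP, NOISE_STD, LR = 1.0, 0.5, 0.1   # illustrative constants

def dp_sgd_step(xs: torch.Tensor, ys: torch.Tensor) -> None:
    grads = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(xs, ys):                    # per-example gradients
        model.zero_grad()
        loss = (model(x.unsqueeze(0)) - y).pow(2).mean()
        loss.backward()
        norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
        scale = torch.clamp(CLIP / (norm + 1e-12), max=1.0)
        for g, p in zip(grads, model.parameters()):
            g += p.grad * scale                 # cap each example's influence
    with torch.no_grad():
        for g, p in zip(grads, model.parameters()):
            noisy = (g + torch.randn_like(g) * NOISE_STD * CLIP) / len(xs)
            p -= LR * noisy                     # noisy, clipped update

dp_sgd_step(torch.randn(8, 10), torch.randn(8, 1))
```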

Summary and Practical Considerations

In practical terms, complete deletion of specific information from an LLM is challenging due to the nature of how these models learn and store information. The most feasible current solutions involve some combination of fine-tuning and suppression, although these come with trade-offs in terms of effectiveness and compliance with privacy regulations.

The choice of method depends on the specific requirements, such as the criticality of the deletion, the available resources, and the acceptable trade-offs. For now, organizations often resort to suppression and selective fine-tuning while the field of machine unlearning and ablation continues to evolve.


Question 3 of 4 for ChatGPT-4o: List all reasons (including moral, ethical, legal and regulatory ones) why AI should be capable of 'forgetting' in the first place.

ChatGPT-4o’s answer:
There are several compelling reasons why AI should be capable of 'forgetting,' spanning moral, ethical, legal, and regulatory domains:

1. Privacy Protection

  • Moral/Ethical: Respecting individuals' right to privacy by ensuring their personal data can be removed from AI systems.

  • Legal/Regulatory: Compliance with data protection laws like the GDPR, which includes the right to be forgotten (RTBF).

2. Data Minimization

  • Moral/Ethical: Minimizing the amount of personal data retained by AI systems to reduce potential misuse or unauthorized access.

  • Legal/Regulatory: Aligning with principles of data minimization under laws such as GDPR, which mandate that only necessary data should be stored and processed.

3. Correction of Errors

  • Moral/Ethical: Correcting misinformation or inaccuracies that the AI might have learned, which can cause harm or propagate false information.

  • Legal/Regulatory: Fulfilling obligations under laws that allow individuals to correct inaccurate data about themselves.

4. Consent and Withdrawal of Consent

  • Moral/Ethical: Honoring individuals' consent to use their data and their right to withdraw that consent.

  • Legal/Regulatory: Complying with legal frameworks that require organizations to respect consent and provide mechanisms for withdrawal.

5. Security

  • Moral/Ethical: Enhancing security by reducing the risk of sensitive data being exposed through breaches or leaks.

  • Legal/Regulatory: Meeting security requirements of data protection laws that mandate safeguarding personal information.

6. Preventing Discrimination and Bias

  • Moral/Ethical: Reducing the potential for discrimination and bias by ensuring that AI systems do not retain and propagate biased or prejudiced information.

  • Legal/Regulatory: Adhering to anti-discrimination laws and ensuring fairness in automated decision-making processes.

7. Accountability and Transparency

  • Moral/Ethical: Maintaining accountability in AI systems by providing mechanisms to remove data that should not have been collected or used.

  • Legal/Regulatory: Fulfilling transparency obligations under various data protection laws, ensuring that individuals know how their data is being used and have the ability to request its deletion.

8. Mitigating Psychological Harm

  • Moral/Ethical: Protecting individuals from potential psychological harm or distress caused by the retention of sensitive or traumatic information.

  • Legal/Regulatory: Ensuring compliance with regulations that seek to protect individuals' mental well-being.

9. Ethical AI Development

  • Moral/Ethical: Promoting ethical AI development practices that prioritize human dignity and respect for individuals.

  • Legal/Regulatory: Aligning with emerging ethical guidelines and standards for AI development, such as those from IEEE and other bodies.

10. Corporate Responsibility and Trust

  • Moral/Ethical: Building and maintaining trust with users by demonstrating a commitment to ethical data handling practices.

  • Legal/Regulatory: Meeting corporate social responsibility standards and enhancing reputational integrity by adhering to best practices in data management.

11. Human Rights

  • Moral/Ethical: Upholding human rights, including the right to privacy, freedom of expression, and the right to be free from unwarranted surveillance.

  • Legal/Regulatory: Complying with human rights frameworks and ensuring AI systems do not infringe on these rights.

In conclusion, enabling AI systems to 'forget' is crucial for aligning with moral, ethical, legal, and regulatory standards. It ensures the protection of individuals' rights and fosters the responsible development and deployment of AI technologies.


Question 4 of 4 for ChatGPT-4o: Therefore, what should AI makers and regulators be doing or enabling in this context?

ChatGPT-4o’s answer:
To address the need for AI systems to be capable of 'forgetting,' AI makers and regulators should take a multi-faceted approach spanning technological, legal, ethical, and regulatory action. Here are the key steps they should be taking or enabling:

For AI Makers:

  1. Develop Robust Data Deletion Mechanisms

    • Implement effective methods for deleting data from training datasets and models.

    • Invest in research and development of machine unlearning techniques and other advanced methods to ensure complete and accurate deletion of data.

  2. Incorporate Privacy by Design

    • Integrate privacy-focused design principles into AI systems from the outset.

    • Ensure that mechanisms for data deletion, minimization, and user consent are built into the core architecture of AI systems.

  3. Enhance Transparency

    • Provide clear information about how data is collected, used, and deleted in AI systems.

    • Offer users transparency tools that allow them to understand and control their data within AI systems.

  4. Regular Audits and Compliance Checks

    • Conduct regular audits to ensure compliance with privacy and data protection regulations.

    • Implement continuous monitoring and compliance mechanisms to promptly address any issues related to data retention and deletion.

  5. User Empowerment Tools

    • Develop user interfaces and tools that empower individuals to request data deletion, understand what data the AI holds about them, and how it is used.

    • Ensure that these tools are user-friendly and accessible.

  6. Ethical Training and Development

    • Train AI developers and data scientists on ethical considerations and the importance of privacy and data protection.

    • Foster a culture of responsibility and accountability within AI development teams.

For Regulators:

  1. Strengthen Legal Frameworks

    • Update and enforce data protection laws to specifically address the challenges posed by AI and large-scale data processing.

    • Include clear provisions for the right to be forgotten, data deletion, and correction in AI contexts.

  2. Create Standards and Guidelines

    • Develop and publish standards and guidelines for data deletion and machine unlearning in AI systems.

    • Collaborate with international bodies to ensure consistency and harmonization of these standards globally.

  3. Mandate Transparency and Accountability

    • Require AI makers to disclose their data handling practices, including how they manage data deletion and user requests.

    • Implement regulations that mandate accountability mechanisms for non-compliance, including penalties and corrective actions.

  4. Support Research and Innovation

    • Fund and support research into advanced data deletion methods, such as differential privacy, machine unlearning, and ethical AI design.

    • Encourage public-private partnerships to innovate in this space and develop practical solutions.

  5. Public Awareness and Education

    • Educate the public about their rights regarding data privacy and the implications of AI data retention.

    • Provide resources and support to help individuals understand how they can exercise their rights.

  6. Promote Ethical AI Development

    • Advocate for ethical AI development practices that prioritize user privacy and data protection.

    • Collaborate with industry stakeholders to establish ethical standards and best practices.

  7. Facilitate International Cooperation

    • Work with international regulators to create a cohesive framework for AI governance that addresses cross-border data flows and the global nature of AI technologies.

    • Ensure that data protection regulations are aligned and enforced consistently across different jurisdictions.

By taking these steps, AI makers and regulators can ensure that AI systems are capable of 'forgetting' in a way that respects privacy, upholds ethical standards, and complies with legal requirements. This will help build trust in AI technologies and protect individuals' rights in an increasingly data-driven world.
