

User Privacy and Large Language Models — An Alarming Gap Between AI Development and Data Protection Laws

by ChatGPT-4o

Introduction

In the paper User Privacy and Large Language Models: An Analysis of Frontier Developers’ Privacy Policies, Stanford researchers King, Klyman, and colleagues conduct a forensic investigation into the privacy policies of six major U.S.-based AI developers: Amazon, Anthropic, Google, Meta, Microsoft, and OpenAI. These companies, operating at the frontier of generative AI, serve hundreds of millions of users through LLM-powered chatbots. The findings are deeply troubling: all six train their models on user-generated chat data, often without meaningful consent, with inadequate de-identification, unclear retention policies, and practices that may violate existing data privacy legislation globally.

All the Problematic Issues Identified

The researchers identify a litany of issues with how these AI developers handle user data. These include:

1. Chat Data Used for Model Training by Default

All six companies train their models on user inputs and outputs by default, including sensitive data shared in chatbot conversations. Opt-outs, if they exist, are buried deep in user settings or scattered across multiple policies.

2. Indefinite Data Retention

Amazon, Meta, and OpenAI retain user data indefinitely. Meta’s justification is vague: “as long as we need it to provide our Products, comply with legal obligations or protect our or others’ interests.”

3. Inclusion of Sensitive Data

Sensitive personal data—such as health, biometric, and sexual orientation data—is often included in training data. Only Microsoft explicitly states efforts to strip this out.

4. Training on Children’s Data

At least four companies allow children aged 13–18 to use their chatbots, and likely train on their data. Google even opened Gemini to children under 13. This raises serious consent and legal issues, especially under laws like COPPA and GDPR.

5. Unclear or Inaccessible Opt-Out Mechanisms

Some companies (e.g., Google and Meta) do not offer a clear opt-out. Others (e.g., OpenAI, Microsoft) require complex user journeys to find and execute opt-outs. Defaults are deliberately exploitative.

6. Human Review of Chat Logs

Google and OpenAI use humans to review chats for quality and safety, which introduces risks of re-identification and privacy breaches. Reporters found contract workers at Meta had access to personally identifiable chat content.

7. Lack of Transparency in Privacy Policies

The policies are often distributed across a “web of documents.” Main privacy policies omit critical data processing details, which are buried in obscure FAQs or product-specific sub-policies.

8. Platform Integration and Blurring of Boundaries

Companies collect data from across their product ecosystems (e.g., Google Docs, Meta posts) and merge it into chatbot training data, undermining purpose limitation and data minimization principles.

9. Absence of Effective De-identification

Only three of the six developers mention de-identification efforts, and even then the approaches are vague. Given the contextual richness of chat data, true anonymization is nearly impossible; a short sketch after this list illustrates why pattern-based scrubbing falls short.

10. Opaque Data Sources

Developers fail to clarify whether they train on uploaded documents, images, audio, or other files. Only five clearly admit to using web-scraped data, and two mention licensing proprietary content—but only in vague terms.

11. Use of Non-Users’ Data

LLMs are trained on public web data that includes the personal information of non-users, who had no chance to consent or opt-out, exacerbating the privacy violations.

12. Manipulative Framing of Data Sharing

OpenAI, for example, frames data sharing as a social good (“Improve the model for everyone”), nudging users toward accepting data use rather than meaningfully informing them of the risks.

13. Dual-Class Privacy System

Enterprise clients are automatically opted out of training, while regular consumers are opted in by default. This two-tier system privileges paying customers with stronger privacy.

14. Exploitation of Children

Children’s data is collected and used, often without age verification or clear parental consent mechanisms, despite increasing concerns about children forming parasocial and even sexualized relationships with bots.

15. Lack of Deletion Clarity

It is unclear if and when data is actually deleted—even when users request it—raising compliance questions under GDPR’s right to erasure and CCPA deletion rights.

16. Privacy Violations by Design

AI memory features (e.g., personalization in OpenAI, Google, and Microsoft products) store user preferences and details indefinitely, sometimes without the ability to delete or correct them—undermining user agency.
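
To make issue 9 concrete, here is a minimal, hypothetical sketch (in Python, not drawn from any developer’s actual pipeline) of the kind of pattern-based scrubbing that is sometimes presented as de-identification. The patterns and example text are assumptions for illustration only.

```python
import re

# Hypothetical pattern-based "de-identification" filter.
# These patterns are illustrative, not any developer's documented approach.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub(text: str) -> str:
    """Replace pattern-matched identifiers with placeholder tokens."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

chat = ("I'm Jane, the only pediatric oncologist in Fargo. "
        "Email me at jane.doe@example.org or call 555-867-5309.")
print(scrub(chat))
# -> I'm Jane, the only pediatric oncologist in Fargo. Email me at [EMAIL] or call [PHONE].
```

The direct identifiers are masked, but the quasi-identifiers that remain (a first name, a rare profession, a small city) can still single out the author, which is why true anonymization of conversational data is nearly impossible.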

How the AI Makers Are Behaving and Why

The AI developers in question are behaving in ways that prioritize maximum data extraction, loophole exploitation, and rhetorical obfuscation. Despite decades of evidence that privacy policies are unreadable and ineffective, these developers deliberately structure their policies across fragmented documents, minimizing disclosure while maximizing data ingestion.

This behavior appears driven by:

  • Data scarcity pressures: The era of abundant training data is ending. Developers are desperate to squeeze every bit of information from users to keep improving models.

  • Race-to-the-bottom economics: In the arms race for smarter, more commercially viable LLMs, ethical restraint becomes a competitive disadvantage.

  • Regulatory gaps: U.S. federal law offers no baseline privacy protection; developers exploit this vacuum by adopting practices that would be illegal elsewhere (e.g., in the EU).

  • Weak enforcement: Regulators have so far imposed few meaningful penalties on frontier LLM developers for how they use personal data in training.

Global Privacy Laws and Potential Violations

The current practices starkly conflict with global data privacy legislation, including:

General Data Protection Regulation (GDPR) – EU

  • Consent violation: Where consent is the legal basis for processing, GDPR requires it to be informed, specific, and affirmative. Default opt-ins and manipulative framing fall short of that standard.

  • Purpose limitation: Merging data across services violates GDPR’s requirement to use data only for the purposes for which it was collected.

  • Data minimization: Indefinite and excessive retention breaches the principle that only necessary data be collected and stored.

  • Children’s data: Training on data from children without verifiable parental consent breaches Articles 6 and 8 of GDPR.

  • Data subject rights: Failure to clearly offer deletion, correction, or access violates GDPR’s Articles 12–17.

California Consumer Privacy Act (CCPA) – U.S.

  • Inadequate disclosure: Omitting key training practices from main privacy policies violates the CCPA’s notice and disclosure requirements.

  • Lack of opt-out clarity: Obscuring opt-out pathways violates CCPA’s emphasis on user choice and transparency.

  • Sensitive data mishandling: The CCPA grants special protections to biometric, health, and geolocation data—yet developers routinely ingest these by default.

Children’s Online Privacy Protection Act (COPPA) – U.S.

  • If children under 13 are using chatbots and their data is being retained and used for training, companies may be in direct violation of COPPA requirements for verifiable parental consent.

Other Jurisdictions

  • Brazil’s LGPD, India’s DPDP Act, Canada’s PIPEDA, and other regimes contain many of the same principles as GDPR. The practices documented would likely be considered illegal in many of these jurisdictions as well.

Long-Term Consequences if Regulators Do Not Act

If regulators fail to enforce existing laws or pass new ones to curb these practices, the consequences could be dire:

1. Normalization of Data Exploitation

Unchecked behavior by industry leaders will become the norm, spreading exploitative data practices across the tech sector.

2. Erosion of Privacy as a Right

The public may gradually lose any expectation of private digital spaces, leading to a chilling effect on free expression and self-exploration online.

3. Expansion of Surveillance Capitalism

Chat data, rich in emotional, social, and behavioral cues, becomes the next frontier for profiling, targeted manipulation, and behavioral prediction.

4. Increased Risk of Data Breaches and Misuse

Indefinite data retention and lax access controls (including human reviewers) make leaks inevitable—especially as adversaries target AI companies.

5. Digital Inequality

A two-tiered privacy system (enterprise vs. consumer) entrenches inequality, where only those who pay or know how to opt out can protect their data.

6. Global Regulatory Fragmentation

Without unified enforcement, we risk a balkanized internet where U.S.-based services become non-compliant with international norms and are blocked or penalized abroad.

7. Collapse of Trust in AI

Users who feel surveilled or manipulated by their chatbots may turn against AI entirely, stifling adoption and stoking political backlash.

Conclusion

This study exposes a stark contradiction between the public-facing promises of AI developers and their quiet, systematic erosion of user privacy. The exploitation of user chat data by default, opaque policies, and the inclusion of children’s and sensitive personal data for training create a situation where the societal costs far outweigh the technological gains—unless regulators act decisively.

To preserve both privacy and public trust in AI, developers must abandon “data maximalism” and build models that respect individual rights. In the absence of enforcement, however, these practices will persist—and we risk sacrificing our digital autonomy at the altar of machine intelligence.