The CNIL’s new guidance on legitimate interest and AI development represents a mature and balanced attempt to reconcile AI innovation with European data protection values.

It acknowledges the practical necessity of large-scale data access—especially in a competitive global AI race—while reaffirming the GDPR’s protective core.


CNIL’s Latest AI Guidelines and the Use of ‘Legitimate Interest’ in Data Processing for AI Systems

by ChatGPT-4o

Introduction

In June 2025, the French data protection authority CNIL (Commission Nationale de l’Informatique et des Libertés) released its latest recommendations on the development of artificial intelligence systems, focusing on when and how “legitimate interest” can serve as a legal basis for processing personal data, particularly in AI training contexts such as web scraping. These guidelines were shaped by an extensive public consultation and build upon the broader GDPR framework. They mark a significant step in defining lawful AI development while balancing innovation and the protection of fundamental rights.

This essay explores the CNIL’s new approach, highlights the key legal and technical developments, and explains how this impacts AI developers, businesses, and European digital policy more broadly.

The CNIL’s Evolving AI Framework

The recommendations are part of CNIL’s broader AI action plan launched in 2023, which aims to:

  • Clarify the applicability of GDPR to AI development;

  • Promote transparency, accountability, and data minimization;

  • Support legal certainty for AI developers and data controllers;

  • Harmonize approaches within the EU, especially with the incoming EU AI Act (known in French by the acronym RIA).

At the heart of this new guidance is the question: when can developers lawfully rely on "legitimate interest" as a basis for collecting and processing personal data, including through web scraping?

Key Developments in the Latest CNIL Guidance

1. Legitimate Interest as a Legal Basis

CNIL confirms that legitimate interest can be used for AI development under certain conditions, including:

  • Clear articulation of the interest pursued (commercial, scientific, societal);

  • Evidence that the interest is "real and present";

  • Compatibility with the original data collection context;

  • A balancing test showing the interest does not override individual rights and freedoms.

Importantly, CNIL emphasizes that legitimate interest is not subordinate to consent—it is an autonomous legal basis, not a fallback.

2. Web Scraping (Moissonnage)

The most controversial area is web scraping, which many developers rely on to access large datasets.

CNIL now accepts that:

  • Web scraping is not illegal per se;

  • Legitimate interest can justify scraping publicly available data—but only under strict safeguards;

  • Sensitive data must be avoided or deleted upon detection;

  • The use of robots.txt and other technical signals to disallow scraping must be respected.

CNIL encourages developers to implement data minimization, anonymization, deduplication, and transparency mechanisms as core mitigations.
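By way of illustration, the minimal sketch below (in Python, using only the standard library) shows one way a crawler could honor a site’s robots.txt before fetching a page. The user-agent string "example-ai-crawler" is a hypothetical placeholder, and a real pipeline would also need to handle ai.txt-style signals and other opt-out mechanisms; this is a sketch of the principle, not a CNIL-prescribed implementation.

from urllib import robotparser
from urllib.parse import urlparse

# Hypothetical user-agent string for an AI training crawler.
CRAWLER_NAME = "example-ai-crawler"

def is_fetch_allowed(url: str, user_agent: str = CRAWLER_NAME) -> bool:
    """Return True only if the site's robots.txt allows this user agent to fetch the URL."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    parser = robotparser.RobotFileParser()
    parser.set_url(robots_url)
    try:
        parser.read()  # fetches and parses the site's robots.txt
    except OSError:
        # If robots.txt cannot be retrieved at all, err on the side of caution.
        return False

    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    url = "https://example.com/articles/some-page.html"
    if is_fetch_allowed(url):
        print(f"robots.txt permits fetching {url}")
    else:
        print(f"robots.txt disallows fetching {url}; skipping")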

3. Public Consultation Insights

Based on 62 contributions from companies, academics, NGOs, and individuals, CNIL addressed concerns about:

  • The lack of clarity in previous drafts;

  • The tension between commercial and public interests;

  • The risks of mass data processing (e.g., discrimination, privacy loss);

  • The expectation (or lack thereof) by individuals that their data could be scraped for AI.

The authority responded by including more detailed examples and use cases, and by promising future clarification on the dissemination of open-source models and the status of trained models under the GDPR.

Controversial and Valuable Statements

Controversial:

  • CNIL does not require consent by default for AI model training, even when data is scraped online.

  • It allows “discretionary rights of objection” instead of stronger user consent mandates.

  • Web scraping can proceed if data is “freely accessible” and users can be “reasonably expected” to know their data may be processed.

Valuable:

  • The guidance explicitly permits commercial interests as legitimate if proportional and necessary.

  • CNIL supports synthetic data and technical measures to mitigate memorization/regurgitation risks.

  • Developers are encouraged to coordinate with data hosts to increase transparency.

  • The authority supports the creation of a future Europe-wide opt-out mechanism akin to “Do Not Track.”

Surprising:

  • CNIL suspended its plan to create a public registry of scraping entities due to pushback from developers and privacy concerns.

  • It confirms that GDPR-compliant processing may include data from sites that do not technically prohibit scraping, but not data from sites that signal their refusal through robots.txt or ai.txt.

Implications for Businesses and AI Developers

For French and EU-based Businesses:

  • The CNIL’s position provides more room to innovate under GDPR without needing blanket consent.

  • However, legal risk remains high unless developers implement strong safeguards and perform thorough balancing tests.

  • Web scraping must now be assessed not just against the GDPR but also against copyright law, platform terms of use, and sectoral regulations.

For AI Developers:

  • The onus is on developers to justify necessity, apply data minimization principles, and document their legitimate interest thoroughly.

  • Developers must use safeguards such as the following (a minimal filtering and pseudonymization sketch follows this list):

    • Data exclusion filters (e.g., blocklisting pornographic sites or personal data aggregators);

    • Anonymization/pseudonymization;

    • Right-to-object mechanisms;

    • Transparency about model training data sources.
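As a rough illustration of the first two safeguards, the sketch below assumes a simple record format and combines a hypothetical domain blocklist with salted hashing of e-mail addresses so that direct identifiers never enter the training corpus. The domains, salt, and regular expression are illustrative assumptions rather than CNIL-mandated values, and production pseudonymization would be considerably more thorough.

import hashlib
import re

# Hypothetical exclusion list: domains whose content should never be ingested
# (e.g., adult sites or personal data aggregators).
EXCLUDED_DOMAINS = {"adult-site.example", "people-aggregator.example"}

# Illustrative salt; in practice this would be a secret managed outside the code.
SALT = b"replace-with-a-secret-salt"

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def pseudonymize_emails(text: str) -> str:
    """Replace each e-mail address with a salted SHA-256 token."""
    def _token(match: re.Match) -> str:
        digest = hashlib.sha256(SALT + match.group(0).encode("utf-8")).hexdigest()
        return f"<email:{digest[:12]}>"
    return EMAIL_RE.sub(_token, text)

def filter_and_pseudonymize(records):
    """Yield records not originating from excluded domains, with e-mails pseudonymized."""
    for record in records:  # each record: {"domain": str, "text": str}
        if record["domain"] in EXCLUDED_DOMAINS:
            continue  # data exclusion filter
        yield {**record, "text": pseudonymize_emails(record["text"])}

if __name__ == "__main__":
    sample = [
        {"domain": "news-site.example", "text": "Contact jane.doe@mail.example for details."},
        {"domain": "people-aggregator.example", "text": "Full profile of John Doe..."},
    ]
    for cleaned in filter_and_pseudonymize(sample):
        print(cleaned)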

Recommendations for Businesses

  1. Legal Readiness:

    • Conduct DPIAs and balancing tests early in the model development process.

    • Document the rationale for using legitimate interest versus consent or other bases.

  2. Technical Controls:

    • Respect signals like robots.txt, apply filters pre-scraping, and avoid sensitive data by design (a simple pre-ingestion screening sketch follows these recommendations).

    • Implement RLHF (Reinforcement Learning from Human Feedback) and post-training controls to limit regurgitation.

  3. User Transparency:

    • Publish clear, accessible notices about data use and model training practices—even on third-party websites where data was scraped.

  4. Regulatory Engagement:

    • Monitor evolving CNIL and EDPB guidelines, especially for open-source models and systemic risk classification under the EU AI Act.

    • Participate in standardization efforts (e.g., tagging data opt-out, provenance protocols).

  5. Strategic Positioning:

    • Emphasize societal benefits and public interest use cases in your AI projects to strengthen the legitimacy of data processing.

    • Consider hybrid datasets and synthetic data generation to reduce personal data dependency.
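To illustrate the pre-ingestion screening mentioned under Technical Controls, the sketch below applies a deliberately simplistic heuristic: it flags 15-digit identifiers shaped like French social security numbers (NIR) and a few health-related keywords, so that such records can be excluded or routed to review before training. The patterns are illustrative assumptions; a real system would need far stronger detection than a keyword list.

import re
from typing import Iterable, Iterator

# Illustrative patterns only: a 15-digit identifier shaped like a French social
# security number (NIR) and a few health-related keywords. Real pipelines would
# use far more robust detection (NER models, curated dictionaries, human review).
NIR_RE = re.compile(r"\b[12]\d{14}\b")
HEALTH_KEYWORDS = {"diagnosis", "hiv", "cancer", "prescription"}

def looks_sensitive(text: str) -> bool:
    """Heuristic screen for special-category or directly identifying data."""
    if NIR_RE.search(text):
        return True
    lowered = text.lower()
    return any(keyword in lowered for keyword in HEALTH_KEYWORDS)

def screen_records(records: Iterable[str]) -> Iterator[str]:
    """Yield only records that pass the sensitive-data screen."""
    for text in records:
        if looks_sensitive(text):
            continue  # exclude (or route to review/deletion) upon detection
        yield text

if __name__ == "__main__":
    docs = [
        "Public press release about a new museum exhibition.",
        "Patient record: diagnosis confirmed, NIR 187054321098765.",
    ]
    print(list(screen_records(docs)))  # only the first document survives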

Conclusion

The CNIL’s new guidance on legitimate interest and AI development represents a mature and balanced attempt to reconcile AI innovation with European data protection values. It acknowledges the practical necessity of large-scale data access—especially in a competitive global AI race—while reaffirming the GDPR’s protective core.

Yet, it also places increasing responsibility on developers to document, justify, and mitigate the risks of personal data use. For European businesses, this guidance opens doors to more agile and scalable AI experimentation—if done responsibly. For global AI actors, especially those scraping European data, CNIL’s approach will likely influence enforcement decisions and future litigation under both GDPR and the AI Act.

The future will test how well these safeguards function in practice—and whether “legitimate interest” can truly be a bridge between innovation and individual rights in the AI age.