The EU’s Mandatory Training Data Disclosure Template — Why It Matters and What AI Makers Must Do Now
by ChatGPT-4o
Introduction
On July 24, 2025, the European Commission released the final version of its mandatory Template and Explanatory Notice for disclosing training data used in general-purpose AI (GPAI) models, as required under Article 53(1)(d) of the AI Act (Regulation (EU) 2024/1689). This development signals a transformative shift in how AI providers must operate in the EU and likely beyond. As Luca Bertuzzi noted in his widely shared LinkedIn post, this regulation carries significant implications—not only for compliance but for intellectual property (IP) enforcement, market competitiveness, and global norms surrounding AI governance.
Why This Is Important
1. Mandatory Transparency and Accountability
For the first time under EU law, all providers of GPAI models—open source or proprietary—must publicly disclose a summary of the data used to train their models, following a standardized and legally binding format. This summary must be:
Sufficiently detailed
Made public upon market placement
Updated with further training or significant changes
Inclusive of modalities like text, image, video, audio, and synthetic data
This directly challenges the longstanding opacity of AI training processes and forces providers to explain where their data came from—without hiding behind trade secret claims.
2. IP and Copyright Enforcement
The Template’s most critical feature is its utility for rightsholders. By requiring the listing of top domain names scraped, a narrative description of datasets, and information on licensed and synthetic data, the Summary gives creators, publishers, and collecting societies the tools to trace unauthorized usage. This enables them to invoke EU copyright law (DSM Directive 2019/790) and IPR enforcement tools (Directive 2004/48/EC) to demand licensing, seek takedowns, or pursue legal remedies.
This also empowers foreign rights holders, who can use the disclosed data as a basis for lawsuits even outside the EU, turning this transparency obligation into a de facto global enforcement lever.
3. Consumer, Data Protection, and Anti-Discrimination Rights
Besides copyright, the Summary facilitates compliance with:
GDPR: Identifying personal data sources
Consumer rights law: Clarifying provenance of AI-generated outputs
Non-discrimination: Helping downstream developers assess dataset bias and cultural diversity
By requiring providers to disclose how user data was collected and processed (including from product interactions), the regulation strengthens data subject rights and oversight over model behavior and dataset composition.
What AI Makers Must Do Now
With enforcement beginning August 2, 2025, and supervision powers activating August 2, 2026, the clock is ticking. Fines for non-compliance may reach €15 million or 3% of global turnover, whichever is higher. To prepare, providers must immediately:
1. Compile a Detailed Inventory of Training Data
AI makers must start by:
Mapping all datasets used from pre-training through fine-tuning and alignment
Classifying data into: publicly available datasets, commercially licensed datasets, private non-licensed datasets, scraped web data, user data, synthetic data, and other sources
Tracking domain names scraped, especially those comprising the top 10% of domains by volume of content scraped (for SMEs, the top 5% or top 1,000 domains)
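To make the domain-volume accounting above concrete, here is a minimal Python sketch. The data shape and threshold logic are my own illustrative assumptions—the official Template and Explanatory Notice define the authoritative cutoffs:

```python
def top_domains_by_volume(domain_volume: dict[str, int], fraction: float = 0.10) -> list[str]:
    """Rank scraped domains by volume of content scraped and return the
    top `fraction` of domain names (e.g. 0.10 for the top 10%).
    Illustrative only: consult the official Template for exact rules."""
    ranked = sorted(domain_volume, key=domain_volume.get, reverse=True)
    cutoff = max(1, round(len(ranked) * fraction))
    return ranked[:cutoff]
```

An SME could call the same helper with `fraction=0.05` and additionally cap the result at 1,000 entries.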
2. Complete and Validate the EU Template
The Template contains three main sections:
General Information: Model and provider identification, training data volume and modality
List of Data Sources: Clear taxonomy of where the data came from
Data Processing Aspects: TDM opt-out compliance, illegal content filtering, copyright policies
Providers should work cross-functionally—legal, compliance, engineering, and ethics—to ensure the submission is complete and accurate. A good-faith effort is expected but insufficient detail or misleading omissions may trigger regulatory action.
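One way to keep that cross-functional work organized is to model the Template's three sections as an internal data structure. The sketch below uses Python dataclasses; every field name here is hypothetical—the official Template defines the authoritative fields:

```python
from dataclasses import dataclass

@dataclass
class GeneralInformation:
    provider: str
    model_name: str
    modalities: list[str]        # e.g. ["text", "image", "audio"]
    training_data_size: str      # order-of-magnitude description

@dataclass
class DataSources:
    public_datasets: list[str]
    licensed_datasets: list[str]
    top_scraped_domains: list[str]
    user_data_notes: str
    synthetic_data_notes: str

@dataclass
class DataProcessingAspects:
    tdm_opt_out_measures: str
    illegal_content_filtering: str
    copyright_policy: str

@dataclass
class TrainingDataSummary:
    general: GeneralInformation
    sources: DataSources
    processing: DataProcessingAspects
```

Holding the Summary in a structured form like this makes it easier for legal, compliance, and engineering teams to review and version the same document.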
3. Review and Enhance Copyright and TDM Opt-Out Compliance
Articles 3 and 4 of the DSM Directive allow rightsholders to opt out of text and data mining, requiring AI developers to respect these signals (e.g. robots.txt, meta tags, contractual reservations). The Template explicitly asks for the methods used to honor these signals, so providers must:
Document their opt-out honoring systems
Audit crawlers and data acquisition protocols
Join or align with the voluntary Code of Practice, if applicable
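As a sketch of what "document their opt-out honoring systems" can look like in code, the standard library's `urllib.robotparser` applies a site's robots.txt rules. Note that robots.txt is only one machine-readable signal—meta tags and contractual reservations need separate handling, and the crawler name used here is a made-up example:

```python
from urllib import robotparser

def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Apply a site's robots.txt rules to decide whether a given URL
    may be fetched by the named crawler. robots.txt is just one TDM
    opt-out signal among several; this is not a complete compliance check."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)
```

Running the check before every fetch, and logging the decision, creates the audit trail the Template asks providers to describe.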
4. Publish the Summary Accessibly and Maintain Updates
AI makers must:
Publish the Summary visibly on their official websites
Attach it to the model in all distribution channels
Update the Summary every six months if training continues or material changes occur
This demands a new level of documentation discipline, version control, and legal engagement.
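That documentation discipline can be backed by a simple automated reminder. The sketch below approximates the six-month cadence in calendar days—an assumption on my part, since the Notice speaks of "six months" rather than a day count:

```python
from datetime import date, timedelta

UPDATE_INTERVAL = timedelta(days=183)  # ~six months, approximated in days

def summary_update_due(last_published: date, today: date, training_ongoing: bool) -> bool:
    """Flag when the public Summary should be refreshed: training is
    still ongoing and roughly six months have passed since publication.
    Material changes would trigger an update regardless of this clock."""
    return training_ongoing and (today - last_published) >= UPDATE_INTERVAL
```

Wiring a check like this into a release pipeline keeps the Summary from silently going stale.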
5. Prepare for Stakeholder Requests and Legal Claims
The EU encourages a voluntary “upon request” disclosure mechanism: if a domain isn't in the Summary, rights holders may still ask whether their content hosted on specific domains was used. AI makers should:
Build a searchable internal index of scraped domains
Set up internal workflows to respond to rights holder inquiries
Train staff in EU IP enforcement frameworks to manage risk
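A searchable internal index of scraped domains can start very small. The sketch below is an in-memory illustration of the idea—a production system would sit on a database, and the normalization shown (lowercasing, stripping a "www." prefix) is a simplifying assumption:

```python
class ScrapedDomainIndex:
    """Minimal in-memory index of scraped domains, so an inquiry like
    'was content from publisher.example used?' can be answered quickly.
    Illustrative only; real deployments need persistent storage."""

    def __init__(self) -> None:
        self._domains: dict[str, int] = {}  # normalized domain -> item count

    @staticmethod
    def _normalize(domain: str) -> str:
        return domain.lower().removeprefix("www.")

    def record(self, domain: str, items: int = 1) -> None:
        key = self._normalize(domain)
        self._domains[key] = self._domains.get(key, 0) + items

    def was_used(self, domain: str) -> bool:
        return self._normalize(domain) in self._domains
```

With such an index in place, the "upon request" workflow becomes a lookup rather than a forensic exercise.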
Conclusion: Global Ripple Effects
The EU’s mandatory training data disclosure rule is a watershed moment in AI regulation. It moves from aspirational ethics to legally binding transparency, with teeth. Rights holders can now identify and challenge unauthorized uses, researchers can evaluate dataset bias, and consumers get greater assurance of lawful model behavior.
AI makers must act now to achieve compliance—not just for legal survival in the EU, but to preempt broader global demands for accountability, transparency, and fairness in AI development.
Final Recommendation to AI Makers
Treat this not as red tape but as a license to operate.
Build a training data governance architecture.
Work with compliance vendors, TDM opt-out platforms (e.g. Liccium, Netacea), and IP counsel.
Respect creative ecosystems by licensing responsibly.
Those who do will not only avoid penalties—they may earn the trust that sustains long-term success in the age of generative AI.
