A technologically sophisticated, ethically nuanced, and politically timely examination of how unbridled AI data collection may be undermining the very foundations of digital cultural heritage.

If society wishes to preserve open access to knowledge and history, urgent action—legal, technical, and philosophical—is required.

Are AI Bots Knocking Cultural Heritage Offline?

by ChatGPT-4o

Introduction

In “Are AI Bots Knocking Cultural Heritage Offline?” (GLAM-E Lab, June 2025), Michael Weinberg and collaborators present an urgent, technically detailed, and contextually rich analysis of how AI data harvesting bots are impacting the digital infrastructure of GLAM institutions—Galleries, Libraries, Archives, and Museums. These bots, deployed primarily to build training datasets for large AI models, are swarming open cultural collections online, often to the point of overwhelming and even crashing servers. The report is based on a targeted survey and interviews with 43 institutions across multiple continents and seeks to understand whether recent reports of bot-driven disruptions are isolated or symptomatic of a larger systemic issue. The findings point to a widespread, growing challenge—both technical and philosophical—for cultural institutions navigating openness in the age of AI.

New Findings and What They Mean for Higher Education and Cultural Institutions

The report presents a series of critical findings:

  1. Prevalence of Bot Activity
    A significant majority (39 of 43) of survey respondents had experienced increased traffic in recent years; 27 attributed this traffic to AI data scraping bots. Many institutions didn’t recognize the bot traffic until performance degraded or servers crashed, underscoring a lack of preparedness.

  2. Infrastructure Stress and Costs
    The bots often behave like distributed denial-of-service (DDoS) attacks—swarming sites from hundreds of IPs simultaneously, ignoring robots.txt files, and overwhelming bandwidth. This leads to escalating infrastructure costs for cultural institutions—many of which operate on tight budgets.

  3. Bots Do Not Respect Legal or Ethical Norms
    These bots frequently ignore site licenses, terms of service, and common courtesy conventions like robots.txt, revealing the inadequacy of current legal and technical safeguards (an illustrative robots.txt appears after this list).

  4. Lack of Standardized Protection
    Because online collection architectures are highly individualized, institutions are responding to bots with an ad hoc patchwork of countermeasures: updating firewalls, throttling by region or IP, using third-party bot blockers like Cloudflare, or moving collections behind login walls—each with trade-offs for accessibility.

  5. Negative Impacts on Open Access and Research
    Efforts to block bots (often indiscriminately) are inadvertently impacting legitimate non-commercial users, researchers, and open knowledge initiatives, suggesting a crisis for the open access movement.

  6. Evolution of Bot Behavior
    AI bots are becoming stealthier, frequently rotating user agent strings and IP addresses, which further complicates mitigation. Their traffic is often indistinguishable from malicious activity, raising alarms for cybersecurity as well as content protection.
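
For reference, the robots.txt convention that the report says these bots routinely ignore is nothing more than a plain-text request published at a site's root. A minimal example of the kind many collection sites now publish is sketched below; the tokens shown (GPTBot, CCBot, Google-Extended) are documented by their operators for AI crawling and training controls, but the file itself is illustrative rather than taken from the report, and compliance with it is entirely voluntary.

```text
# Illustrative robots.txt for a digital collections site (not from the report).
# GPTBot, CCBot, and Google-Extended are tokens documented by their operators
# for AI crawling and training controls; honoring them is voluntary, which is
# exactly the gap the surveyed institutions describe.

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Everything else (search indexing, human-facing tools) remains welcome.
User-agent: *
Allow: /
```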

Surprising, Controversial, and Valuable Statements

Most Surprising:

  • Bots Ignore Licensing Status: Bots make no distinction between open and closed collections, rendering licensing terms largely irrelevant unless they are backed by technical barriers or active enforcement.

  • Bots Create High-Cost “Expensive Traffic”: Wikimedia revealed that 65% of its most expensive traffic—pages served directly from uncached core infrastructure—came from bots, not humans.

Most Controversial:

  • Open Access Becomes an Unintended Subsidy for AI Companies: Cultural institutions, often public and non-profit, are subsidizing for-profit AI developers by shouldering the infrastructure burden of AI dataset harvesting without compensation or consent.

  • Some Institutions Avoid Blocking Bots to Maintain High Visitor Metrics: Some boards value raw visitor numbers (often inflated by bot traffic) for institutional reporting, potentially incentivizing inaction even when the bots degrade performance.

Most Valuable:

  • Two Streams of Concern—Philosophical vs. Practical: The report effectively separates the ethical debate over the meaning of “openness” from the urgent technical conversation about infrastructural sustainability. This framework enables clearer policy and operational planning.

  • Recommendations for Mitigation: Although piecemeal, shared tactics such as monitoring bot behavior patterns (burst traffic, IP ranges, user agent strings) and adopting scalable cloud infrastructure provide starting points for collective institutional action; a sketch of such log monitoring appears after this list.
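
To make the monitoring tactic above concrete, the sketch below scans a standard combined-format web server access log for the two signals the report associates with AI scraping: burst traffic from a narrow IP range and rapidly rotating user agent strings. The log format, the /24 grouping, and the numeric thresholds are illustrative assumptions, not recommendations from the report.

```python
"""Minimal sketch: flag burst traffic and rotating user agents in an access
log. The combined log format, the /24 grouping, and the thresholds below are
illustrative assumptions, not guidance from the GLAM-E Lab report."""
import re
from collections import defaultdict

# Combined log format: ip - - [timestamp] "request" status size "referer" "user-agent"
LINE = re.compile(r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" '
                  r'\S+ \S+ "(?P<ref>[^"]*)" "(?P<ua>[^"]*)"')

def subnet(ip: str) -> str:
    """Collapse an IPv4 address to its /24 prefix (crude, for illustration only)."""
    return ".".join(ip.split(".")[:3]) + ".0/24"

def scan(path: str, burst_threshold: int = 1000, ua_threshold: int = 20):
    hits = defaultdict(int)      # request count per /24
    agents = defaultdict(set)    # distinct user agent strings per /24
    with open(path, encoding="utf-8", errors="replace") as log:
        for line in log:
            m = LINE.match(line)
            if not m:
                continue
            net = subnet(m["ip"])
            hits[net] += 1
            agents[net].add(m["ua"])
    suspects = []
    for net, count in hits.items():
        # Heavy traffic from one subnet combined with many rotating user
        # agents is the pattern the surveyed institutions describe.
        if count > burst_threshold and len(agents[net]) > ua_threshold:
            suspects.append((net, count, len(agents[net])))
    return sorted(suspects, key=lambda s: s[1], reverse=True)

if __name__ == "__main__":
    for net, count, n_agents in scan("access.log"):
        print(f"{net}: {count} requests, {n_agents} distinct user agents")
```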

Implications for Society and the Future of Cultural Heritage

If the authors’ assessment holds true, the implications for the future are significant and worrisome:

  1. Risk of Digital Lockdown
    If bots continue unchecked, institutions may increasingly lock down public collections behind logins or firewalls. This would reverse decades of progress in digital democratization and accessibility of cultural heritage.

  2. Erosion of the Public Commons
    The unregulated harvesting of public knowledge by private AI firms commodifies the commons without accountability or reciprocity. This threatens the sustainability of shared cultural and educational infrastructure.

  3. Regulatory and Ethical Vacuum
    Current frameworks are inadequate. Technical controls like robots.txt are ignored, and legal recourse is elusive because bot identities are masked and jurisdictions differ. This vacuum enables exploitative behavior that undermines public trust in openness.

  4. Cascading Impact on Research and Education
    Collateral damage includes academic research, digital humanities projects, and citizen science initiatives, all of which often rely on the same scraping methods as bots, but at a smaller and more respectful scale. Institutions risk becoming overly cautious or defensive, which would curtail innovation and inquiry.

  5. Precedent for Other Sectors
    GLAM institutions are the canaries in the coal mine. As AI-driven scraping expands into education, journalism, and non-profit sectors, similar technical and ethical battles will emerge. What happens here will echo across digital society.

Conclusion: Are the Authors Correct?

Yes—the authors are correct in identifying a widespread and growing problem that threatens the infrastructure, philosophy, and viability of open cultural heritage online. The report is methodically compiled, empirically grounded, and measured in tone. While it acknowledges limitations in scope and data collection, the consistency of experiences across institutions suggests the phenomenon is neither isolated nor temporary.

What Should Happen Next?

  1. Legal Innovation and Policy Intervention
    Governments and international bodies must consider new regulatory mechanisms, such as mandatory bot identification and data scraping consent protocols. This might include strengthening digital provenance requirements and enabling legal remedies for scraping abuse (a sketch of how crawler identification already works in practice appears after this list).

  2. Ethical AI Development Standards
    AI developers must take responsibility for how data is collected. Models trained on scraped cultural data should embed reciprocity—either by supporting hosting costs, contributing to infrastructure, or offering transparent opt-outs.

  3. Investment in Defensive Infrastructure for GLAM
    National and philanthropic bodies should fund resilient digital infrastructure for vulnerable institutions, enabling them to defend against abusive scraping while maintaining openness.

  4. Reimagining “Open” in an AI Era
    The open movement must revisit its definitions. If openness becomes synonymous with exploitation, it risks losing its moral and practical appeal. New norms are needed to balance access with sustainability and ethical responsibility.
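
On the first recommendation, crawler identification is not purely hypothetical: a handful of large operators already publish the DNS domains their bots resolve to, so a site can confirm that a visitor claiming to be such a crawler really is one. The sketch below shows that reverse-then-forward DNS check; the domain suffixes listed are illustrative, and a real deployment would maintain them from operator documentation.

```python
"""Sketch: verify a self-identified crawler by a reverse-then-forward DNS check.
The domain suffixes below are illustrative; an actual deployment would keep
this list current from each operator's published documentation."""
import socket

# Hypothetical allow-list of operator domains, maintained from published docs.
VERIFIED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def verify_crawler_ip(ip: str) -> bool:
    """Return True if ip reverse-resolves to a known operator domain and that
    hostname forward-resolves back to the same ip."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)            # reverse lookup
    except socket.herror:
        return False
    if not hostname.endswith(VERIFIED_SUFFIXES):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]   # forward lookup
    except socket.gaierror:
        return False
    return ip in forward_ips

if __name__ == "__main__":
    # An address in a range Google publishes for its crawlers, as an example.
    print(verify_crawler_ip("66.249.66.1"))
```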

In sum, this report is a clarion call: a technologically sophisticated, ethically nuanced, and politically timely examination of how unbridled AI data collection may be undermining the very foundations of digital cultural heritage. If society wishes to preserve open access to knowledge and history, urgent action—legal, technical, and philosophical—is required.