The ASIS&T paper shows that the contamination of AI models is not an accident; it is the natural consequence of an industry built for scale, not care. Institutions that build these models were never designed to uphold scholarly integrity. Not because they are malicious, but because they are structurally, culturally, and economically misaligned with the norms of science.

When No One Guards the Gate — Why AI Makers Cannot Protect Scientific Integrity, and Why Society Must Wake Up

by ChatGPT-5

The paper Library Genesis to Llama 3: Navigating the Waters of Scientific Integrity, Ethics, and the Scholarly Record (ASIS&T 2025) exposes a troubling truth hiding in plain sight: AI developers are not stewards of scientific quality. They are not equipped, incentivized, trained, or structured to act as custodians of the scholarly record. Yet the world increasingly relies on their systems as if they were.

The paper highlights several categories of scientific-integrity risks—metadata failures, retraction blindness, epistemic contamination, normalization of bad science, and homogenized scientific mimicry—that together form a systemic threat to how society understands knowledge. But the most important insight is not in the problems themselves. It is in the realization that AI makers will not—because they cannot—fix them.

This is a wake-up call: If courts and regulators fail to intervene, society will be living in a world where scientific truth is no longer stable or recoverable.

1. The Activities Required to Protect Scientific Integrity — and Why AI Makers Will Not Do Them

a. Ensuring Metadata Integrity and Retraction Awareness

To protect scientific content, AI developers would need:

  • continuous ingestion of updated DOI metadata

  • connections to systems such as Crossref, the Retraction Watch Database (RWDB), GetFTR, and PubMed (see the sketch at the end of this subsection)

  • fine-grained tracking of retraction events

  • mechanisms to filter or reweight data based on reliability

This is complex scholarly infrastructure. It requires librarianship, disciplinary knowledge, and curation skills. AI labs do not employ scholarly metadata librarians; they employ software engineers and optimization scientists.

Result:
AI models ingest retracted, fraudulent, outdated, or contradictory research without awareness. No one cleans the pipes.
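
A minimal, illustrative sketch of what even the most basic Crossref connection listed above involves is shown below. It queries the public Crossref REST API for one DOI and reports any update-to links, the field Crossmark-aware records use to tie retraction and correction notices to the works they revise. The example DOI is a placeholder, and a real curation pipeline would also consult the Retraction Watch Database, PubMed, and publisher feeds.

```python
# Illustrative sketch only: look up a single DOI in the public Crossref REST
# API and report any "update-to" entries (retraction, correction, or erratum
# notices). A production pipeline would batch requests, respect rate limits,
# and cross-check the Retraction Watch Database and publisher Crossmark feeds.
import requests

CROSSREF_WORKS = "https://api.crossref.org/works/"


def fetch_crossref_record(doi):
    """Return the Crossref 'message' object for a DOI, or None if unavailable."""
    response = requests.get(CROSSREF_WORKS + doi, timeout=10)
    if response.status_code != 200:
        return None
    return response.json().get("message")


def update_notices(record):
    """Return the record's 'update-to' entries; a non-empty list means the record
    is itself a notice (such as a retraction) pointing at an earlier work."""
    return record.get("update-to", []) if record else []


if __name__ == "__main__":
    doi = "10.1234/placeholder-doi"  # hypothetical DOI, for illustration only
    record = fetch_crossref_record(doi)
    notices = update_notices(record)
    if notices:
        for notice in notices:
            print("Update notice:", notice.get("type"), "->", notice.get("DOI"))
    else:
        print("No Crossref record or no update links found for", doi)
```

Even this trivial check presupposes that every training document carries a resolvable DOI, which is precisely the provenance information the paper says developers do not keep.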

b. Normalizing the Use of “Bad Science”

The paper shows that AI training data contains:

  • outdated articles

  • retracted work

  • manipulated images

  • paper-mill output

  • fraudulent studies

  • unreviewed preprints

To prevent this, AI labs would need:

  • domain-specific curation

  • scientific quality grading

  • systematic data provenance audits (see the sketch below)

  • disciplinary experts for each field

This is work traditionally done by publishers, editors, peer reviewers, research-integrity teams, and metadata curators—not by engineers optimizing GPUs.
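
To make "systematic data provenance audits" concrete, the sketch below defines a hypothetical per-document provenance record and a toy audit over it. The field names are assumptions made for illustration, not a schema used by any AI lab or publisher; the point is that an audit has nothing to operate on unless this information is captured at ingestion time.

```python
# Hypothetical provenance record for one training document, plus a toy audit.
# The fields are illustrative assumptions, not an established standard.
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ProvenanceRecord:
    doi: str                 # persistent identifier of the source article ("" if unknown)
    source: str              # where the text was obtained (publisher, aggregator, shadow library)
    version: str             # e.g. "version-of-record", "accepted-manuscript", "preprint"
    license: str             # license or agreement under which the text was used
    retrieved_on: date       # when this copy was taken
    peer_reviewed: bool      # whether this version had passed peer review
    integrity_flags: list = field(default_factory=list)  # e.g. ["retracted", "expression-of-concern"]


def audit(records):
    """Toy audit: count documents that cannot be traced to a reviewed source
    and documents carrying known integrity flags."""
    untraceable = sum(1 for r in records if not r.doi or not r.peer_reviewed)
    flagged = sum(1 for r in records if r.integrity_flags)
    return {"total": len(records), "untraceable": untraceable, "flagged": flagged}


if __name__ == "__main__":
    records = [
        ProvenanceRecord("10.1234/example.1", "publisher", "version-of-record",
                         "licensed", date(2024, 1, 15), True),
        ProvenanceRecord("", "web crawl", "unknown", "unknown",
                         date(2024, 2, 2), False, ["possible paper-mill output"]),
    ]
    print(audit(records))
```

Without records like these attached to every document, the quality grading and curation listed above cannot even begin.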

AI makers will not do this because:

  • it requires expertise they do not have

  • it is slow

  • it constrains scale

  • it is expensive

  • it contradicts their “bigger is better” philosophy

c. Updating Models After Retractions

Retractions occur daily. To keep an AI model scientifically reliable, an AI lab would need:

  • constant monitoring of retraction databases

  • systematic identification of where those articles appeared in the training corpus (see the sketch at the end of this subsection)

  • re-training, fine-tuning, or re-weighting of the affected parts of the model

This is not feasible in practice:

  • current models cannot easily be surgically revised

  • model training costs millions

  • developers do not track or store training-data provenance

Thus, once bad science enters a model, it is immortalized. It becomes part of the AI’s epistemic DNA.
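
Below is a minimal sketch of the identification step referenced above. It assumes two things the paper notes are generally absent: DOI-level provenance for every training document, and a locally maintained retraction list (for instance an export of the Retraction Watch Database). The file name, column name, and record layout are assumptions for illustration only.

```python
# Illustrative sketch: given a DOI-indexed provenance list for a training
# corpus and a retraction list exported to CSV, report which documents came
# from retracted papers. File name, column name, and layout are assumptions.
import csv


def load_retracted_dois(csv_path, doi_column="OriginalPaperDOI"):
    """Read a retraction-list CSV and return a set of normalized DOIs.
    `doi_column` is an assumed column name; adjust it to the actual export."""
    retracted = set()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            doi = (row.get(doi_column) or "").strip().lower()
            if doi:
                retracted.add(doi)
    return retracted


def flag_retracted(corpus_index, retracted):
    """corpus_index: iterable of {'doi': ..., 'path': ...} provenance entries.
    Yield the entries whose DOI appears in the retraction set."""
    for entry in corpus_index:
        doi = (entry.get("doi") or "").strip().lower()
        if doi and doi in retracted:
            yield entry


if __name__ == "__main__":
    retracted = load_retracted_dois("retraction_list_export.csv")  # assumed file
    corpus_index = [
        {"doi": "10.1234/example.2020.001", "path": "shard_17/doc_00042.txt"},
    ]
    for hit in flag_retracted(corpus_index, retracted):
        print("Retracted source in training corpus:", hit["doi"], "->", hit["path"])
```

Identification is the easy half. The paper's point is that even this step is impossible without stored provenance, and that removing or down-weighting the affected material afterwards would still require costly retraining.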

d. Ignoring Established Quality-Control Systems

The scholarly community has spent decades building:

  • Crossref metadata integrity networks

  • Retraction Watch Database

  • publisher-led correction workflows

  • ethics committees

  • peer-review and post-publication review systems

AI developers do not integrate any of these into their pipelines.
Not out of malice — but because:

  • they do not know these systems exist

  • they do not understand them

  • they are not incentivized to adapt them

  • integrating them would slow down model development cycles

AI is moving at “ship fast” speed; scholarly quality control is slow, deliberate, and evidence-based. The two systems are structurally incompatible.

e. Contributing to Epistemic Pollution

Once contaminated data enters an AI model:

  • falsehoods are regenerated as plausible text

  • fraudulent studies become “facts”

  • manipulated images become reproducible “patterns”

  • outdated theories appear as authoritative

This is epistemic pollution — a spreading contamination of the knowledge environment.

The paper warns: AI is creating a world where scientific error becomes scientifically reproducible at scale.

AI makers do not have the institutional, ethical, or structural capacity to prevent this.

f. Choosing Scale Over Integrity

AI research culture is built around:

  • scaling laws

  • larger corpora

  • bigger models

  • faster releases

  • performance benchmarks

Scholarly integrity requires:

  • precision

  • provenance

  • expert review

  • correction cycles

  • repeatability

In the AI industry, quality curation is a cost; scale is a value.
Thus, AI makers will always pick more data over better data.

g. Creating Homogenized, Distorted, Self-Similar Scientific Outputs

The paper highlights a subtle but devastating consequence:
AI models begin to homogenize scientific writing and imagery — creating outputs that:

  • flatten nuance

  • mimic flawed training examples

  • blend fraudulent with legitimate content

  • produce “plausible but false” scientific artifacts

This is not simply a corruption of facts.
It is a corruption of form, style, epistemology, and method.

AI makes science look consistent, even when the underlying ideas are wrong. That is intellectually dangerous.

h. AI Makers Acknowledge the Risks — but Do Nothing

The paper notes that AI companies know about:

  • retraction risks

  • contaminated data

  • bad science in their corpora

  • the lack of provenance

  • hallucination risks

But they do not fix it because:

  • it is expensive

  • it slows them down

  • it creates liability

  • it requires deep partnerships they have not built

  • they do not see themselves as responsible for research integrity

Thus, the risks are acknowledged in principle but ignored in practice.

2. Why Only Agreements with Rights Owners Can Provide Proper Curation

AI developers cannot curate science.
Rights holders can.

Publishers, societies, editorial boards, and scholarly infrastructure providers possess:

  • validated metadata

  • authoritative versions of record

  • retraction pipelines

  • correction histories

  • subject-matter expertise

  • quality assurance protocols

  • ethics frameworks

Licensing agreements with rights owners are not just about usage rights; they are about:

  • access to high-integrity corpora

  • access to correction streams

  • access to continuous updates

  • access to provenance metadata

  • protection against corrupted, outdated, or fraudulent science

The only way to create trustworthy scientific AI is:

  1. using authoritative data, and

  2. maintaining ongoing connections to scholarly correction systems.

This cannot be done without rights-holder participation.
AI labs cannot do it alone — and will never prioritize it.

3. What Happens if Regulators and Courts Ignore This?

If governments treat AI training as a free-for-all and ignore data-quality obligations, society will face:

1. A collapse of scientific reliability in AI systems

Medical decisions, chemical synthesis, risk assessments, legal analyses — all can be corrupted by embedded scientific error.

2. Spread of fraudulent or retracted science

Retractions will have no effect on public understanding.
AI will reanimate discredited research and present it as truth.

3. An epistemic environment where truth is impossible to verify

When AI makes sources opaque and models homogenize outputs, distinguishing real science from “AI-fabricated science-like text” becomes impossible.

4. Erosion of trust in legitimate research

If AI produces confident misinformation, the public may lose faith in institutions that depend on accuracy.

5. Loss of accountability mechanisms

If courts do not require provenance and integrity protections, AI makers will continue treating scientific content as raw material rather than regulated knowledge.

6. The slow destruction of the scholarly record

AI-generated errors will drown out authentic scientific reasoning, subtly degrading how society understands expertise and evidence.

Conclusion: A Wake-Up Call for Society

We are entering a future where AI models will mediate most interactions with scientific knowledge—summaries, explanations, recommendations, citations, even images of experimental results. But the institutions that build these models were never designed to uphold scholarly integrity. They cannot—not because they are malicious, but because they are structurally, culturally, and economically misaligned with the norms of science.

The ASIS&T paper (https://asistdl.onlinelibrary.wiley.com/doi/epdf/10.1002/pra2.1340) shows that the contamination of AI models is not an accident; it is the natural consequence of an industry built for scale, not care.

If society does not act—through regulation, litigation, licensing frameworks, and accountability—then scientific truth itself becomes collateral damage in the race for ever-larger models.

This is the moment to wake up.
The future of knowledge depends on who controls, curates, and corrects the data on which AI learns.
