When No One Guards the Gate — Why AI Makers Cannot Protect Scientific Integrity, and Why Society Must Wake Up
by ChatGPT-5
The paper Library Genesis to Llama 3: Navigating the Waters of Scientific Integrity, Ethics, and the Scholarly Record (ASIS&T 2025) exposes a troubling truth hiding in plain sight: AI developers are not stewards of scientific quality. They are not equipped, incentivized, trained, or structured to act as custodians of the scholarly record. Yet the world increasingly relies on their systems as if they were.
The paper highlights several categories of scientific-integrity risks—metadata failures, retraction blindness, epistemic contamination, normalization of bad science, and homogenized scientific mimicry—that together form a systemic threat to how society understands knowledge. But the most important insight is not in the problems themselves. It is in the realization that AI makers will not—because they cannot—fix them.
This is a wake-up call: If courts and regulators fail to intervene, society will be living in a world where scientific truth is no longer stable or recoverable.
1. The Activities Required to Protect Scientific Integrity — and Why AI Makers Will Not Do Them
a. Ensuring Metadata Integrity and Retraction Awareness
To protect scientific content, AI developers would need:
continuous ingestion of updated DOI metadata
connections to systems such as Crossref, the Retraction Watch Database (RWDB), GetFTR, and PubMed
fine-grained tracking of retraction events
mechanisms to filter or reweight data based on reliability
This is complex scholarly infrastructure. It requires librarianship, disciplinary knowledge, and curation skills. AI labs do not employ scholarly metadata librarians; they employ software engineers and optimization scientists. A minimal sketch of what even one step, a basic retraction check, would involve appears at the end of this subsection.
Result:
AI models ingest retracted, fraudulent, outdated, or contradictory research without awareness. No one cleans the pipes.
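To make the gap concrete, here is a minimal Python sketch of a retraction check: it reads a locally downloaded export of the Retraction Watch Database as CSV and flags training documents whose DOI appears in it. The file path, the CSV column name, and the assumption that every document carries a "doi" field are illustrative assumptions for this sketch, not any lab's actual pipeline.

```python
# Minimal sketch: flag training documents whose DOI appears in a local copy
# of the Retraction Watch Database export. Column name and file path are
# assumptions for illustration.
import csv

def load_retracted_dois(rwdb_csv_path, doi_column="OriginalPaperDOI"):
    """Build a set of lower-cased DOIs for articles listed as retracted."""
    retracted = set()
    with open(rwdb_csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            doi = (row.get(doi_column) or "").strip().lower()
            if doi:
                retracted.add(doi)
    return retracted

def split_corpus(documents, retracted_dois):
    """Separate documents into (kept, flagged) by DOI lookup.

    `documents` is an iterable of dicts with an optional 'doi' key
    (an assumed structure for this sketch).
    """
    kept, flagged = [], []
    for doc in documents:
        doi = (doc.get("doi") or "").strip().lower()
        (flagged if doi and doi in retracted_dois else kept).append(doc)
    return kept, flagged

if __name__ == "__main__":
    retracted = load_retracted_dois("retraction_watch.csv")      # assumed local export
    corpus = [{"doi": "10.1234/example.5678", "text": "..."}]    # placeholder documents
    kept, flagged = split_corpus(corpus, retracted)
    print(f"kept {len(kept)} document(s), flagged {len(flagged)} as retracted")
```

Even this toy check presupposes per-document DOI metadata and a continuously refreshed local copy of the retraction database; scraped web corpora generally provide neither.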
b. Normalizing the Use of “Bad Science”
The paper shows that AI training data contains:
outdated articles
retracted work
manipulated images
paper-mill output
fraudulent studies
unreviewed preprints
To prevent this, AI labs would need:
domain-specific curation
scientific quality grading
systematic data provenance audits
disciplinary experts for each field
This is work traditionally done by publishers, editors, peer reviewers, research-integrity teams, and metadata curators—not by engineers optimizing GPUs.
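As a rough illustration of what "systematic data provenance audits" would presuppose, the sketch below defines a per-document provenance record and a trivial audit over it. The field names and values are assumptions made for this illustration, not a schema any lab actually uses.

```python
# Sketch of the per-document provenance record that systematic curation
# would require. Field names are illustrative assumptions, not a real schema.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ProvenanceRecord:
    doi: Optional[str]              # persistent identifier, if any
    source: str                     # e.g. "publisher feed", "web crawl"
    version_of_record: bool         # is this the authoritative published version?
    peer_reviewed: bool             # did the item pass peer review?
    retraction_status: str          # "none", "correction", "expression_of_concern", "retracted"
    license: Optional[str]          # terms under which the text may be used
    last_verified: Optional[date]   # when the metadata was last checked upstream

def audit(records):
    """Return records that cannot be traced to an authoritative, verified source."""
    return [r for r in records
            if r.doi is None or not r.version_of_record or r.last_verified is None]

if __name__ == "__main__":
    scraped = ProvenanceRecord(doi=None, source="web crawl", version_of_record=False,
                               peer_reviewed=False, retraction_status="none",
                               license=None, last_verified=None)
    print(len(audit([scraped])), "document(s) fail the provenance audit")
```

Maintaining records like this for billions of documents, across every discipline, is exactly the slow, expensive, expertise-heavy work the next list explains AI makers will not do.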
AI makers will not do this because:
it requires expertise they do not have
it is slow
it constrains scale
it is expensive
it contradicts their “bigger is better” philosophy
c. Updating Models After Retractions
Retractions occur daily. To keep an AI model scientifically reliable, an AI lab would need:
constant monitoring of retraction databases
systematic identification of where those articles appeared in the training corpus (see the sketch at the end of this subsection)
re-training, fine-tuning, or re-weighting the affected parts of the model
This is not feasible in practice:
current models cannot easily be revised surgically after training
model training costs millions
developers do not track or store training-data provenance
Thus, once bad science enters a model, it is immortalized. It becomes part of the AI’s epistemic DNA.
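To see why "systematic identification" is out of reach, the sketch below shows the kind of DOI-to-shard provenance index that surgical removal would require, and that would have to be built at ingestion time. Shard names, document structure, and the index format are hypothetical.

```python
# Sketch: the provenance index that post-hoc removal of retracted articles
# would require. Shard names and document structure are hypothetical.
from collections import defaultdict

def build_provenance_index(shards):
    """Map each DOI to the set of training shards that contain it.

    `shards` is an iterable of (shard_id, documents) pairs, where each
    document is a dict with an optional 'doi' key (assumed structure).
    """
    index = defaultdict(set)
    for shard_id, documents in shards:
        for doc in documents:
            if doc.get("doi"):
                index[doc["doi"].lower()].add(shard_id)
    return index

def shards_affected_by(retracted_dois, index):
    """Which shards would need re-processing after these retractions?"""
    affected = set()
    for doi in retracted_dois:
        affected |= index.get(doi.lower(), set())
    return affected

if __name__ == "__main__":
    shards = [("shard-0001", [{"doi": "10.1234/a"}, {"text": "no doi"}]),
              ("shard-0002", [{"doi": "10.1234/b"}])]
    index = build_provenance_index(shards)
    print(shards_affected_by(["10.1234/B"], index))   # {'shard-0002'}
```

Because no such index is kept, the only remedy after a retraction is to re-filter the corpus and retrain, which, as argued above, is not feasible at current training costs.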
d. Ignoring Established Quality-Control Systems
The scholarly community has spent decades building:
Crossref metadata integrity networks
Retraction Watch Database
publisher-led correction workflows
ethics committees
peer-review and post-publication review systems
AI developers do not integrate any of these into their pipelines.
Not out of malice — but because:
they do not know these systems exist
they do not understand them
they are not incentivized to adapt them
integrating them would slow down model development cycles
AI is moving at “ship fast” speed; scholarly quality control is slow, deliberate, and evidence-based. The two systems are structurally incompatible.
e. Contributing to Epistemic Pollution
Once contaminated data enters an AI model:
falsehoods are regenerated as plausible text
fraudulent studies become “facts”
manipulated images become reproducible “patterns”
outdated theories appear as authoritative
This is epistemic pollution — a spreading contamination of the knowledge environment.
The paper warns: AI is creating a world where scientific error becomes scientifically reproducible at scale.
AI makers do not have the institutional, ethical, or structural capacity to prevent this.
f. Choosing Scale Over Integrity
AI research culture is built around:
scaling laws
larger corpora
bigger models
faster releases
performance benchmarks
Scholarly integrity requires:
precision
provenance
expert review
correction cycles
repeatability
In the AI industry, quality curation is a cost; scale is a value.
Thus, AI makers will always pick more data over better data.
g. Creating Homogenized, Distorted, Self-Similar Scientific Outputs
The paper highlights a subtle but devastating consequence:
AI models begin to homogenize scientific writing and images, producing outputs that:
flatten nuance
mimic flawed training examples
blend fraudulent with legitimate content
produce “plausible but false” scientific artifacts
This is not simply a corruption of facts.
It is a corruption of form, style, epistemology, and method.
AI makes science look consistent, even when the underlying ideas are wrong. That is intellectually dangerous.
h. AI Makers Acknowledge the Risks — but Do Nothing
The paper notes that:
AI companies know about retraction risks
know about contaminated data
know about bad science in their corpora
know about lack of provenance
know about hallucination risks
But they do not fix it because:
it is expensive
it slows them down
it creates liability
it requires deep partnerships they have not built
they do not see themselves as responsible for research integrity
Thus, the risks are acknowledged in principle but ignored in practice.
2. Why Only Agreements with Rights Owners Can Provide Proper Curation
AI developers cannot curate science.
Rights holders can.
Publishers, societies, editorial boards, and scholarly infrastructure providers possess:
validated metadata
authoritative versions of record
retraction pipelines
correction histories
subject-matter expertise
quality assurance protocols
ethics frameworks
Licensing agreements with rights owners are not just about usage rights; they are about:
access to high-integrity corpora
access to correction streams
access to continuous updates
access to provenance metadata
protection against corrupted, outdated, or fraudulent science
The only way to create trustworthy scientific AI is:
using authoritative data, and
maintaining ongoing connections to scholarly correction systems.
This cannot be done without rights-holder participation.
AI labs cannot do it alone — and will never prioritize it.
3. What Happens if Regulators and Courts Ignore This?
If governments treat AI training as a free-for-all and ignore data-quality obligations, society will face:
1. A collapse of scientific reliability in AI systems
Medical decisions, chemical synthesis, risk assessments, legal analyses — all can be corrupted by embedded scientific error.
2. Spread of fraudulent or retracted science
Retractions will have no effect on public understanding.
AI will reanimate discredited research and present it as truth.
3. An epistemic environment where truth is impossible to verify
When AI makes sources opaque and models homogenize outputs, distinguishing real science from “AI-fabricated science-like text” becomes impossible.
4. Erosion of trust in legitimate research
If AI produces confident misinformation, the public may lose faith in institutions that depend on accuracy.
5. Loss of accountability mechanisms
If courts do not require provenance and integrity protections, AI makers will continue treating scientific content as raw material rather than regulated knowledge.
6. The slow destruction of the scholarly record
AI-generated errors will drown out authentic scientific reasoning, subtly degrading how society understands expertise and evidence.
Conclusion: A Wake-Up Call for Society
We are entering a future where AI models will mediate most interactions with scientific knowledge—summaries, explanations, recommendations, citations, even images of experimental results. But the institutions that build these models were never designed to uphold scholarly integrity. They cannot—not because they are malicious, but because they are structurally, culturally, and economically misaligned with the norms of science.
The ASIS&T paper (https://asistdl.onlinelibrary.wiley.com/doi/epdf/10.1002/pra2.1340) shows that the contamination of AI models is not an accident; it is the natural consequence of an industry built for scale, not care.
If society does not act—through regulation, litigation, licensing frameworks, and accountability—then scientific truth itself becomes collateral damage in the race for ever-larger models.
This is the moment to wake up.
The future of knowledge depends on who controls, curates, and corrects the data on which AI learns.
