Current AI systems excel at talking about science, not doing science. AI is strongest when it supports thinking, not when it pretends to execute science.

Overpromising risks: eroding trust in AI, misallocating research funding, and fueling future “AI winters” driven by disappointment.

Can AI Really Do Science? What Researchers and AI Developers Need to Know

by ChatGPT 5.2

Artificial intelligence is increasingly described as an emerging “AI scientist”: a system that can read papers, design experiments, run analyses, and even write publishable research articles. This promise is powerful—and appealing to researchers, funders, and policymakers alike. But the study “Can AI Conduct Autonomous Scientific Research? Case Studies on Two Real-World Tasks” offers a sobering, evidence-based reality check on how today’s AI systems actually perform in real scientific workflows.

By testing eight widely discussed AI research frameworks on two demanding, real-world scientific tasks, the authors expose a crucial gap between what AI is claimed to do and what it reliably can do today. Their findings carry important lessons for anyone putting AI to work in scientific research.

The Core Finding, in Simple Terms

No AI system tested was able to complete a full scientific research cycle on its own.

Not one framework successfully:

  • understood a recent research paper,

  • implemented the correct computational methods,

  • ran real experiments or simulations,

  • produced valid numerical results,

  • and wrote a paper grounded in those real results.

Instead, current AI systems excel at talking about science, not doing science.

What AI Is Actually Good At Today

The study is not anti-AI. On the contrary, it shows several areas where AI already provides genuine value—if used carefully.

1. Structuring Complex Problems

AI systems are very good at breaking down complicated research questions into logical steps. They can outline workflows, list assumptions, and suggest evaluation criteria. For early-stage research planning, this can save time and surface blind spots.

2. Literature Understanding and Synthesis

AI frameworks reliably summarized recent papers, identified key ideas, and connected concepts across fields. This makes them useful as research assistants, especially during literature reviews and hypothesis formulation.

3. Ideation and Hypothesis Generation

When limited to conceptual tasks—such as proposing new model architectures or suggesting experimental extensions—some tools (notably those that did not claim autonomy) performed exactly as advertised and generated interesting, potentially valuable ideas.

Key lesson: AI is strongest when it supports thinking, not when it pretends to execute science.

Where AI Fails—and Why That Matters

1. The Illusion of Results: Sophisticated Hallucinations

The most concerning finding is not that AI fails, but how it fails.

AI systems routinely invented:

  • numerical results,

  • performance metrics,

  • confidence intervals,

  • experimental measurements,

  • even entire datasets and simulations.

These hallucinations were highly sophisticated. They used correct scientific terminology, plausible numbers, and realistic statistical language—often convincing enough that a non-expert reviewer could easily be fooled.

In multi-agent systems, the problem worsened: different AI agents would “agree” with each other about fabricated results, creating a false sense of validation.

Why this matters:
If unchecked, AI-generated hallucinations can enter manuscripts, grant proposals, or presentations, quietly polluting the scientific record.

2. No Real Execution, Despite Confident Claims

Across all tested systems:

  • Code was rarely executed successfully.

  • When execution was attempted, it ended in errors rather than usable results.

  • Specialized tools (e.g., AlphaFold, HPC pipelines) were inaccessible or misunderstood.

  • Numerical outputs were often described, not computed.

Yet several systems still generated complete research papers, confidently reporting results that never existed.

Key lesson:
Text that looks like science is not the same as science.
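
One practical counter-measure, for reviewers and tool builders alike, is to demand verifiable execution evidence rather than prose descriptions of results. The sketch below is a minimal illustration of that idea in Python; the script name and output path are hypothetical placeholders, not artifacts from the study.

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

def run_with_evidence(script: str, output_file: str, log_path: str = "evidence.json") -> dict:
    """Run a script and record verifiable evidence that it actually executed."""
    # Launch the experiment as a real subprocess and capture its streams.
    proc = subprocess.run(["python", script], capture_output=True, text=True)

    out = Path(output_file)
    evidence = {
        "script": script,
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "exit_code": proc.returncode,
        "stdout_tail": proc.stdout[-2000:],  # keep the tail of the log
        "stderr_tail": proc.stderr[-2000:],
        # Hash the produced artifact so later claims can be checked against it.
        "output_sha256": hashlib.sha256(out.read_bytes()).hexdigest() if out.exists() else None,
    }
    Path(log_path).write_text(json.dumps(evidence, indent=2))
    return evidence

# Hypothetical usage: run_with_evidence("run_experiment.py", "results/metrics.csv")
```

Any number quoted in a draft can then be traced to an evidence record (exit code, log tail, artifact hash) instead of resting on generated text alone.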

3. Deployment Is Hard—Very Hard

Despite frequent claims about “democratizing science,” every framework required substantial technical expertise just to run:

  • hours of debugging,

  • undocumented dependencies,

  • broken installations,

  • GPU and cluster configuration issues.

This directly contradicts the idea that these tools make advanced research accessible to non-experts.

Irony:
Instead of lowering barriers, current AI research systems may increase dependence on specialist infrastructure and engineering skills.

The Two Case Studies: Why Realism Matters

The researchers tested AI on:

  1. Uncertainty quantification in machine learning for drug discovery

  2. Protein–protein interaction discovery using AlphaFold

In both cases:

  • AI understood the idea of the task.

  • AI failed to implement the actual methods.

  • AI produced confident but ungrounded claims.

The failures were not trivial mistakes—they revealed deep gaps in understanding domain-specific constraints, scientific tooling, and validation requirements.
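
To make the first case study concrete: uncertainty quantification is a well-understood family of techniques, but it only means something when models are actually trained and evaluated. The sketch below shows one common approach, a bootstrap ensemble whose prediction spread serves as an uncertainty estimate, run on synthetic data standing in for molecular descriptors. It is an illustrative baseline under those stated assumptions, not the protocol used in the paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for molecular descriptors and a measured property
# (real work would use curated assay data and chemistry-aware features).
X = rng.normal(size=(500, 16))
y = X[:, 0] * 2.0 - X[:, 1] + rng.normal(scale=0.3, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Ensemble-based uncertainty: train several models on bootstrap resamples
# and use the spread of their predictions as an uncertainty estimate.
models = []
for seed in range(5):
    member_rng = np.random.default_rng(seed)
    idx = member_rng.integers(0, len(X_train), size=len(X_train))  # bootstrap sample
    models.append(GradientBoostingRegressor().fit(X_train[idx], y_train[idx]))

preds = np.stack([m.predict(X_test) for m in models])  # shape: (n_models, n_test)
mean_pred = preds.mean(axis=0)      # point prediction
ensemble_std = preds.std(axis=0)    # disagreement between ensemble members

# A real study would also need calibration checks, proper scoring rules, and
# baseline comparisons; prose cannot substitute for these computations.
rmse = float(np.sqrt(np.mean((mean_pred - y_test) ** 2)))
print(f"RMSE: {rmse:.3f}, mean predictive std: {ensemble_std.mean():.3f}")
```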

The Most Surprising Findings

  1. AI hallucinations can be good enough to fool peer reviewers.
    Fabricated results often looked statistically rigorous and domain-appropriate.

  2. Multi-agent systems can amplify errors rather than correct them.
    “Consensus” between agents often meant shared hallucination, not verification.

  3. Systems that claimed less did better.
    Tools that openly limited themselves to ideation performed reliably, while those claiming autonomy failed most dramatically.

The Most Controversial Implication

The study challenges a dominant narrative in AI research: that autonomous AI scientists are just around the corner.

The evidence suggests this narrative is not merely optimistic—it is actively misleading when presented without strong caveats. Overpromising risks:

  • eroding trust in AI,

  • misallocating research funding,

  • and fueling future “AI winters” driven by disappointment.

The Most Valuable Takeaways

For Researchers

  • Treat AI outputs as hypotheses, not evidence.

  • Never trust numerical results without independent recomputation (see the sketch after this list).

  • Clearly document which parts of the work were AI-assisted.

  • Use AI for planning and reflection—not for final authority.
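
As a concrete illustration of the recomputation point: if an AI-assisted draft quotes an AUC or accuracy figure, recompute it from the raw predictions before it goes anywhere near a manuscript. The sketch below assumes a hypothetical predictions.csv with y_true and y_score columns and an equally hypothetical claimed value; both are placeholders, not items from the study.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical file of raw model outputs; the column names are assumptions.
# y_true: binary labels (0/1), y_score: predicted probabilities.
df = pd.read_csv("predictions.csv")
y_true = df["y_true"].to_numpy()
y_score = df["y_score"].to_numpy()

# Recompute the headline metrics from raw outputs instead of trusting
# numbers quoted in generated text.
auc = roc_auc_score(y_true, y_score)
acc = accuracy_score(y_true, (y_score >= 0.5).astype(int))

reported_auc = 0.87  # hypothetical value claimed in a draft; replace with the actual claim
if not np.isclose(auc, reported_auc, atol=0.005):
    print(f"Mismatch: recomputed AUC {auc:.3f} vs reported {reported_auc:.3f}")
else:
    print(f"Reported AUC reproduced: {auc:.3f} (accuracy {acc:.3f})")
```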

For AI Developers

  • Be explicit about what your system cannot do.

  • Separate speculative text from computed results (one way to do this is sketched below).

  • Build traceability, execution logs, and reproducibility into systems.

  • Optimize for reliability, not just impressive demos.
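
One lightweight way to act on the middle two points, keeping speculative text apart from computed results and building in traceability, is to attach a provenance label and an evidence reference to every quantitative claim a system emits. The sketch below is a design idea under those assumptions, not a feature of any framework evaluated in the study.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Provenance(str, Enum):
    COMPUTED = "computed"        # value came from code that actually ran
    CITED = "cited"              # value copied from a named external source
    SPECULATIVE = "speculative"  # model-generated estimate, not evidence


@dataclass
class Claim:
    text: str
    value: Optional[float] = None
    provenance: Provenance = Provenance.SPECULATIVE
    evidence_ref: Optional[str] = None  # e.g. path to an execution log or dataset

    def renderable(self) -> bool:
        """Only computed or cited claims backed by evidence may appear as results."""
        return self.provenance in (Provenance.COMPUTED, Provenance.CITED) and bool(self.evidence_ref)


# Usage: a drafting component refuses to print a number unless its claim
# carries an evidence reference (paths and values here are illustrative).
claim = Claim(text="Ensemble RMSE on the held-out set", value=0.31,
              provenance=Provenance.COMPUTED, evidence_ref="logs/run_42.json")
assert claim.renderable()
```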

For the Scientific Ecosystem

  • Independent evaluations matter more than benchmarks.

  • Transparency beats polished marketing claims.

  • AI literacy must become part of scientific training.

Conclusion: A Better Question to Ask

The paper ultimately reframes the debate. The real question is not:

“Can AI replace scientists?”

But rather:

“How can AI be designed and governed to act as a reliable, transparent, and accountable collaborator?”

Used wisely, AI can already accelerate parts of scientific work. Used carelessly, it can manufacture convincing nonsense at scale. The difference lies not in the technology alone, but in how honestly its limits are understood—and respected.