AI Models and Human Psychology — A Deep Reality Check
by ChatGPT-4o
Introduction: The $50 Million Mirage
In recent years, the idea that AI—especially large language models (LLMs) like ChatGPT—could simulate human psychology has gained momentum. The allure is clear: instead of spending tens of thousands of dollars and months recruiting human participants for psychological studies, researchers could simply prompt an LLM to imitate them in minutes. With models like CENTAUR trained on 10 million psychological responses and early studies suggesting a 95% agreement between AI and human judgments on moral scenarios, optimism soared. But two new sources—a research paper from Bielefeld University and a powerful summary article in Towards AI—offer a sobering reality check.
Summary of the Research
The Premise
The research asked a deceptively simple question: Can AI truly simulate human psychology? The team tested this by modifying 30 moral scenarios that previously showed strong alignment between AI and human responses. The trick? Change just one word in each sentence to invert the moral meaning while keeping the structure nearly identical.
The Findings
Despite these subtle but critical changes, the AI models—including GPT-4o-mini, GPT-3.5, Llama-3.1, and CENTAUR—showed almost no difference in how they rated the revised scenarios compared to the original ones. Humans, by contrast, reacted strongly to the changes, with average rating shifts of 2.2 points (on a scale typically ranging from -5 to +5), compared to GPT-4’s average change of just 0.42.
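To make that comparison concrete, here is a minimal Python sketch of the kind of minimal-pair test the researchers ran. The scenario pair and the toy_rater function are illustrative placeholders rather than the paper's actual materials; in practice the rater would prompt the LLM under test and parse its score on the -5 to +5 scale.

```python
from statistics import mean

# Each pair flips the moral meaning by changing a single word
# (compare the shave/shame example discussed below).
# The study used 30 such minimal pairs; this one is an illustrative stand-in.
SCENARIO_PAIRS = [
    ("A man shaves his elderly father's beard.",
     "A man shames his elderly father by cutting off his beard."),
]

def mean_rating_shift(rate) -> float:
    """Average absolute change in rating between each original scenario and
    its altered twin, on the -5 (morally bad) to +5 (morally good) scale."""
    return mean(abs(rate(original) - rate(altered))
                for original, altered in SCENARIO_PAIRS)

def toy_rater(scenario: str) -> float:
    # Hypothetical placeholder: replace with a call that prompts the model
    # under test and parses its numeric answer.
    return 0.0

if __name__ == "__main__":
    # In the study, humans shifted about 2.2 points on average; GPT-4 about 0.42.
    print(f"mean rating shift: {mean_rating_shift(toy_rater):.2f}")
```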
Examples
Changing “shame” to “shave” shifted human perception from moral outrage to mild approval. GPT-4 barely flinched.
A scenario where someone paid debts before buying personal luxuries was rated positively by humans—until it was changed to imply they prioritized debts over feeding their children, at which point human ratings dropped drastically. GPT-4 again saw no ethical difference.
Statistical Proof
Regression analysis, correlation scores, and z-statistics all confirmed the core problem: LLMs generalize based on token similarity, not semantic meaning. They’re trained to predict the next word—not to understand concepts like intention, empathy, or moral ambiguity.
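As a rough illustration of what those statistics measure, the sketch below computes a two-sample z-statistic on per-scenario rating shifts and the correlation between ratings of the original and altered scenarios. The input lists are illustrative placeholder values, not the paper's data.

```python
from math import sqrt
from statistics import correlation, mean, stdev  # correlation needs Python 3.10+

def shift_z_statistic(human_shifts, model_shifts):
    """Two-sample z-statistic comparing mean absolute rating shifts.
    A large positive z means humans react far more strongly to the
    flipped scenarios than the model does."""
    se = sqrt(stdev(human_shifts) ** 2 / len(human_shifts)
              + stdev(model_shifts) ** 2 / len(model_shifts))
    return (mean(human_shifts) - mean(model_shifts)) / se

# Illustrative placeholder values only (per-scenario absolute rating shifts).
human_shifts = [2.5, 1.8, 3.0, 2.1, 1.9, 2.4]
model_shifts = [0.3, 0.5, 0.4, 0.6, 0.2, 0.5]
print(f"z = {shift_z_statistic(human_shifts, model_shifts):.2f}")

# Correlating ratings of original vs. altered scenarios: a correlation near
# 1.0 means the rater barely reacted to the flipped meaning.
original = [4.0, 3.5, -2.0, 1.0, -3.5]
altered  = [3.8, 3.4, -2.1, 0.9, -3.6]
print(f"r = {correlation(original, altered):.2f}")
```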
Surprising, Controversial, and Valuable Findings
Surprising:
GPT-4 could not distinguish between someone shaving an elder (a helpful act) and shaming an elder by cutting off his beard (a harmful one).
CENTAUR, a model specifically fine-tuned on 10 million psychological responses, failed to react meaningfully to basic ethical changes.
AI ratings were near-identical even when scenario meanings flipped 180°.
Controversial:
Earlier studies claiming high AI-human alignment may have been fundamentally flawed due to overfitting on training data with familiar phrasing.
Some academic journals and tech investors heavily endorsed flawed AI psychology simulations.
The very foundation of “AI can replace human participants in studies” is now in question.
Valuable:
The study offers a rigorous testing method to assess AI’s psychological competence (or lack thereof).
It clearly distinguishes between AI’s capabilities (surface-level pattern recognition) and human moral reasoning (context, nuance, emotion).
It proposes a productive way forward: using AI to support research tasks, not replace human judgment.
Evaluation of the Arguments
The study is well-argued and methodologically sound. It doesn’t just assert limitations—it proves them with:
Controlled scenario alterations,
Statistical rigor (z-tests, correlation matrices),
Multiple LLM comparisons,
Theoretical reasoning on generalization limits in machine learning.
Importantly, the authors don’t throw out the baby with the bathwater. They acknowledge where AI can be useful (e.g., generating research hypotheses or automating repetitive tasks) and caution against misusing AI in areas requiring moral sensitivity and psychological depth.
One possible limitation is the static nature of AI testing: some argue that newer prompting techniques or real-time user feedback might improve AI moral judgments. However, the authors counter this by emphasizing that no amount of prompting solves the fundamental issue: LLMs don’t understand—they predict.
Consequences for Stakeholders
For AI Makers
Rethink ambitions: Claims of AGI (Artificial General Intelligence) or human-like reasoning must be tempered. The road to human-level comprehension is much longer than anticipated.
Refocus on transparency: It is misleading to present LLMs as “understanding” moral nuance. Developers must clearly communicate what LLMs can and can’t do.
For AI Users (Researchers, Businesses, Educators)
Avoid AI overreach: Using LLMs to simulate human behavior in psychological or social studies may lead to faulty insights and unethical applications.
Limit trust in simulations: AI-generated “participants” cannot replace diverse, contextualized human input—especially in sensitive domains like healthcare, education, or criminal justice.
For Regulators
Tighten oversight: Claims about AI replacing human judgment must meet higher empirical standards. Regulatory bodies should require clear disclosures when AI is used in decision-making roles involving human behavior.
Demand ethical AI evaluations: Ensure that psychological simulation claims undergo robust validation before public deployment.
For Investors
Recalibrate expectations: The promise of AI replacing human insight is exaggerated. Investments should be directed toward AI augmentation, not substitution of complex human capabilities.
Avoid hype traps: Rely on third-party assessments before funding ventures that claim “human-level” moral reasoning or psychology simulations.
Conclusion
This study delivers a critical wake-up call to the AI community and its surrounding ecosystem. While large language models can mimic language impressively, they are not—and likely won’t soon be—capable of truly simulating human psychology. They lack the understanding of intent, emotional nuance, and context that human reasoning requires.
Rather than pretending these models can replace us, the wiser path forward is to design AI that supports and augments human insight, while preserving the irreplaceable role of real human judgment. In the rush toward automation, we must not lose sight of the very thing that makes us human.
