28 May, 2025
Large language models exhibit a significant generalization bias and are far more likely than humans to overgeneralize scientific findings
A study by Uwe Peters and Benjamin Chin-Yee shows a significant generalization bias in large language models (LLMs) when summarizing scientific research: generative AI is nearly five times MORE likely than humans to overgeneralize research findings

In their study, Uwe Peters and Benjamin Chin-Yee show a significant generalization bias in large language models (LLMs) when summarizing scientific research: LLM-generated summaries were nearly five times more likely than human-written ones to overgeneralize findings. The researchers tested 10 prominent LLMs, including ChatGPT-4o, ChatGPT-4.5, DeepSeek, LLaMA 3.3 70B, and Claude 3.7 Sonnet, comparing nearly 5,000 AI-generated summaries to the original scientific texts. The study found that:
Overgeneralization Prevalence: Many LLMs tended to overgeneralize the conclusions of scientific research, even when explicitly prompted for accuracy. This occurred in 26–73% of cases for models like DeepSeek, ChatGPT-4o, and LLaMA 3.3 70B.
Comparison with Human Summaries: LLM-generated summaries were nearly five times more likely to overgeneralize compared to human-written summaries (odds ratio = 4.85, 95% CI [3.06, 7.70], p < 0.001).
Declining Performance in Newer Models: Surprisingly, newer versions of LLMs performed worse in generalization accuracy than earlier models.
Implications and Risks: This bias poses a risk of large-scale misinterpretation of scientific findings, potentially undermining public science literacy and scientific communication.
Mitigation Strategies: The authors suggest potential solutions, such as lowering LLM temperature settings (a parameter that reduces randomness in generated text; see the sketch below) and benchmarking LLMs for generalization accuracy, to address these issues.
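To illustrate the temperature-lowering suggestion, here is a minimal sketch of requesting a low-temperature summary via the OpenAI Python client. This is not the study's own setup: the model identifier, prompt wording, and client usage are assumptions for illustration only.

```python
# Minimal sketch: asking for a summary with temperature set to 0 to reduce
# sampling randomness. Assumes the `openai` package is installed, the
# OPENAI_API_KEY environment variable is set, and the "gpt-4o" model is
# available to the account. The prompt text is illustrative, not from the study.
from openai import OpenAI

client = OpenAI()

abstract = "..."  # the scientific abstract to be summarized

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.0,  # lower temperature = less randomness in the generated summary
    messages=[
        {
            "role": "user",
            "content": (
                "Summarize the following abstract. Preserve the authors' original "
                "scope and hedging; do not generalize beyond the studied sample.\n\n"
                + abstract
            ),
        },
    ],
)

print(response.choices[0].message.content)
```

Whether a low temperature alone is enough to prevent overgeneralization is an open question; the authors also recommend explicitly benchmarking models for generalization accuracy.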
The study underscores the need for stronger safeguards in AI-driven science summarization to prevent widespread misunderstandings of research findings.