
Large language models (LLMs) such as ChatGPT, DeepSeek, Claude, and LLaMA are not very good at summarising scientific studies, according to a study in Royal Society Open Science. The authors report that these models draw inaccurate, overly broad conclusions in up to 73% of cases.
For the study, Uwe Peters from Utrecht University and Benjamin Chin-Yee from the University of Cambridge tested ten prominent LLMs, examining almost 5,000 chatbot-generated summaries of abstracts and full-length papers from top journals, including Nature and Science.
The results showed that six of the ten models exaggerated claims made in the original text, for example by turning a qualified, study-specific statement such as "the treatment was effective in this study under these conditions" into the sweeping "the treatment is effective." The authors worry that such changes can mislead readers into believing that findings apply far more broadly than they actually do.
When the models were explicitly asked to be more accurate, they were twice as likely to produce overgeneralized conclusions as when given a simple summary request. "This effect is concerning," Peters said. "Students, researchers, and policymakers may assume that if they ask ChatGPT to avoid inaccuracies, they'll get a more reliable summary. Our findings prove the opposite."
Peters and Chin-Yee also directly compared chatbot-generated summaries with summaries of the same research written by humans. The chatbots were nearly five times more likely to produce broad generalizations than their human counterparts. "Worryingly," said Peters, "newer AI models, like ChatGPT-4o and DeepSeek, performed worse than older ones."
"Previous studies found that overgeneralizations are common in science writing, so it's not surprising that models trained on these texts reproduce that pattern," added Chin-Yee. The authors also note that human users tend to prefer responses that sound helpful and widely applicable, so LLMs may learn to favor fluency and generality over precision.
To increase accuracy, the researchers suggest setting chatbots to a lower "temperature" (the sampling parameter that controls how "creative" the output is) and using prompts that enforce indirect, past-tense reporting in science summaries. "If we want AI to support science literacy rather than undermine it," Peters said, "we need more vigilance and testing of these systems in science communication contexts."
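For readers who want to try these mitigations themselves, the sketch below shows what they might look like in practice: a low sampling temperature combined with a prompt that asks for indirect, past-tense, study-specific reporting. It uses the OpenAI Python SDK purely as an illustration; the model name, prompt wording, and placeholder abstract are assumptions, not the authors' exact setup.

```python
# Minimal sketch of the suggested mitigations: low temperature plus a prompt
# that asks for indirect, past-tense, study-specific reporting.
# Illustrative only; not the setup used in the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ABSTRACT = "..."  # placeholder: paste the abstract or paper text to summarize

SYSTEM_PROMPT = (
    "Summarize the study below for a general audience. Report the findings "
    "indirectly and in the past tense (e.g., 'the authors reported that the "
    "treatment was effective in this sample'), and do not generalize beyond "
    "the population, setting, and conditions actually studied."
)

response = client.chat.completions.create(
    model="gpt-4o",      # assumed model; any chat model can be substituted
    temperature=0.2,     # low temperature to reduce 'creative' rewording
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": ABSTRACT},
    ],
)

print(response.choices[0].message.content)
```

Whether such prompting actually reduces overgeneralization in a given model would still need to be checked against the original text, as the study itself cautions.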
Peters U, Chin-Yee B. Generalization bias in large language model summarization of scientific research. R Soc Open Sci. 2025 Apr 30;12(4):241776. doi: 10.1098/rsos.241776. PMID: 40309181; PMCID: PMC12042776.