Artificial intelligence can help us humans vet research papers

Have you heard that reading literature for 3 minutes makes people more empathetic or that holding a heavier clipboard makes a manager more likely to hire a job candidate? The popular press has had a love affair with such social science findings. But they might not be true. Attempts to replicate such results led to a shocking discovery in 2015 that fewer than 40% of papers in peer-reviewed psychology journals could be verified. Similarly dismal findings occurred in economics and some biomedical research, including cancer biology. Since then, researchers have been trying to find better ways to sort the treasure from the trash. A few years ago, one group of social scientists showed that prediction markets—asking people to make bets on a paper’s validity—worked far better than standard peer review. But that required increasing the number of people who vet a paper from three to 100. That’s not a scalable solution.

Now, machine learning programs seem to be getting equally good results, and more are coming thanks to investments by the US Defense Advanced Research Projects Agency (DARPA). Just as ChatGPT is trained on text, these paper-evaluating systems are trained on data gathered from painstaking attempts to replicate hundreds of studies. The systems are then tested on studies they haven’t seen. Early results suggest that the robots are just as good at filtering out the noise as 100 human experts, and apparently better than human editors at scientific journals and newspapers.

The more improbable studies garner the most press attention, says Brian Uzzi, a psychologist at Northwestern University who led a recent study published in the Proceedings of the National Academy of Sciences. Many attention-grabbing studies backed a view, now discarded, that people were being buffeted by seemingly irrelevant stimuli in a predictable way. Uzzi traces the replication problem back to 2011, when a prominent journal published a paper claiming that ordinary experimental subjects could see the future—that is, they had ESP.

The research had followed methods that were standard, which got at least a few people worried that something was wrong with those techniques. Some critics identified a flawed use of statistical methods—a form of data manipulation called p-hacking. But p-hacking wasn’t the problem in most irreproducible papers, said Uzzi. The deeper problem was that in fields involving human behaviour, it’s not obvious what makes a claim extraordinary. That makes it tougher to follow the mantra that extraordinary claims require extraordinary evidence. For a counter-example, consider the physical sciences. Any new finding that violates quantum mechanics or general relativity is almost always subject to extra scrutiny, as when physicists quickly put to rest a claim that particles could move faster than the speed of light. Human behaviour doesn’t fall into the same theoretical framework.

In 2018, experiments with prediction markets showed that when researchers asked 100 fellow social scientists to bet on whether an array of results would replicate, the group’s bets were right 75% of the time. That’s the same success rate Uzzi found with machine learning (ML), but the machine worked a lot faster. Prediction markets and the ML system both went beyond what normal peer review provides by rating confidence levels. And both systems were right 100% of the time for those papers that fell within the top 15% of the confidence range.

University of Virginia psychologist Brian Nosek said a larger effort funded by DARPA will produce several different evaluation systems, some of which will focus on external factors like track records of the authors and where a paper was cited.

Anna Dreber, a researcher at the Stockholm School of Economics, said she could have used help when she led a study that seemed to show a correlation between certain genes and financial risk taking. It couldn’t be replicated, and she regrets the time she spent on the project. Now she’s leading efforts to improve the reliability of published work in economics. If researchers themselves don’t recognize a problem, ML might be useful further up the chain. It could help editors, journalists and policymakers evaluate research.

And maybe, in the future, it can be used to counter scientific papers being spit out by large language models such as ChatGPT. Machines might be taught both to detect inaccuracies and to generate glib misinformation. “It’s sort of like the Matrix in the later parts of the series, there were the good machines, and the bad machines fighting against each other,” said the University of Virginia’s Nosek.

That would not be a bad outcome. If human beings aren’t yet quite capable of understanding how our minds work, at least we’re capable of inventing machines that can help us figure it out.

Faye Flam is a Bloomberg Opinion columnist covering science.

Updated: 20 Mar 2023, 10:15 PM IST