Almost three years ago, Nobel Prize-winning psychologist Daniel Kahneman penned an open letter to researchers working on “social priming,” the study of how thoughts and environmental cues can change later, mostly unrelated behaviors. After highlighting a series of embarrassing revelations, ranging from outright fraud to unreproducible results, he warned:
“For all these reasons, right or wrong, your field is now the poster child for doubts about the integrity of psychological research. Your problem is not with the few people who have actively challenged the validity of some priming results. It is with the much larger population of colleagues who in the past accepted your surprising results as facts when they were published. These people have now attached a question mark to the field, and it is your responsibility to remove it.”
At the time it was a bombshell. Now it seems almost delicate. Replication of psychology studies has become a hot topic, and on Thursday, Science published the results of a project that aimed to replicate 100 famous studies—and found that only about one-third of them held up. The others showed weaker effects, or failed to find the effect at all.
This is, to put it mildly, a problem. But it is not necessarily the problem that many people seem to assume, which is that psychology research standards are terrible, or that the teams that put out the papers are stupid. Sure, some researchers doubtless are stupid, and some psychological research standards could be tighter, because we live in a wide and varied universe where almost anything you can say is certain to be true about some part of it. But for me, the problem is not individual research papers, or even the field of psychology. It’s the way that academic culture filters papers, and the way that the larger society gets their results.
Why did so many studies fail to replicate? For starters, because in many cases, the sample sizes were larger. In general, the larger your sample, the weaker the effects you will find, because it’s harder for a few outliers to swamp the results. If you take the average height of three men, and one of them happens to be Shaquille O’Neal, you’re going to get a very skewed notion of the average height of the American male. If you take the average of ten people, and O’Neal is still standing there in his size 23 sneakers, you’re still going to be off by an inch and a half even if the rest of the group is broadly representative. By the time you’ve got 30 other people in the group, you’re down to a half-inch discrepancy, and by the time there are 100, you’re going to be within measurement error of the right number.
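The arithmetic behind that example is easy to check. The sketch below assumes a typical height of about 69 inches for American men and 85 inches (7 feet 1) for O’Neal; those figures are my assumptions, not numbers from the column, but they reproduce its inch-and-a-half, half-inch, and near-zero discrepancies as the sample grows:

```python
# How one extreme outlier biases a sample mean less and less as the
# sample grows. Heights in inches; 69 in (~5'9") for a typical American
# man and 85 in (7'1") for Shaquille O'Neal are assumed figures.
AVG = 69.0   # assumed average height of American men, inches
SHAQ = 85.0  # Shaquille O'Neal, roughly 7'1"

for n in (3, 10, 31, 100):
    # An n-person sample: n-1 perfectly typical men plus one outlier.
    sample_mean = ((n - 1) * AVG + SHAQ) / n
    bias = sample_mean - AVG
    print(f"n={n:3d}: sample mean {sample_mean:.2f} in, off by {bias:.2f} in")
```

With 10 people the bias is exactly 1.6 inches, with 31 about half an inch, and with 100 only 0.16 inches, matching the progression described above.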
Even when you have a larger sample, however, the groups are not going to match the average of the whole population every time; by blind luck, sometimes the group will be exceptionally tall, sometimes exceptionally short. Statisticians understand this. But journal editors and journalists do not necessarily exercise appropriate caution. That’s not because journal editors are dumb and don’t get statistics, but because scientific journals are looking for novel and interesting results, not “We did a study and look, we found exactly what you’d have expected before you’d plowed through our four pages of analysis.” This “publication bias” means that journals are basically selecting for outliers. In other words, they are in the business of publishing papers that, for no failure of method but simply from sheer dumb luck, happened to get an unusual sample. They are going to select for those papers more than they should, especially in fields that study humans, who are expensive and reluctant to sit still for your experiment, rather than something like bacteria, which can be studied in numbers ending in lots of zeroes.
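You can watch this selection effect in a toy simulation. The numbers below are purely illustrative assumptions (a small true effect of 0.1 standard deviations, 20 subjects per study, and a crude significance cutoff standing in for the journal's filter), but the pattern is general: the studies lucky enough to clear the filter report effects several times larger than the truth.

```python
# A minimal sketch of publication bias: many honest studies of a tiny
# true effect, but only the "significant" ones get published. The
# published effects are inflated. All parameters are illustrative.
import random
import statistics

random.seed(0)
TRUE_EFFECT = 0.1   # small true effect, in standard-deviation units
N = 20              # subjects per study

published = []      # effect estimates that clear the filter
all_effects = []    # effect estimates from every study run

for _ in range(10_000):
    sample = [random.gauss(TRUE_EFFECT, 1.0) for _ in range(N)]
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / N ** 0.5
    all_effects.append(mean)
    if mean / se > 1.96:        # crude "significant at p < .05" filter
        published.append(mean)  # only the lucky outliers see print

print(f"true effect:           {TRUE_EFFECT:.2f}")
print(f"mean of all studies:   {statistics.mean(all_effects):.2f}")
print(f"mean published effect: {statistics.mean(published):.2f}")
```

Averaged over every study run, the estimates cluster near the true 0.1; averaged over only the published ones, they come out several times larger, with no fraud and no bad methods anywhere in the pipeline.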
Journalists, who unfortunately often don’t understand even basic statistics, are even more in this business. They easily fall into the habit (and I’m sure an enterprising reader can come up with at least one example on my part) of treating studies not as a potentially interesting result from a single and usually small group of subjects, but as a True Fact About the World. Many bad articles get written using the words “studies show,” in which some speculative finding is blown up into an incontrovertible certainty. This is especially true in the case of psychology, because the results often suggest deliciously dark things about human nature and, not infrequently, about the political enemies of the writers.
Psychology studies also suffer from a certain limitation of the study population. Journalists who find themselves tempted to write “studies show that people ...” should try replacing that phrase with “studies show that small groups of affluent psychology majors ...” and see if they still want to write the article.
Why does this happen? Well, as I said in a column about an earlier fiasco concerning research findings on American attitudes about gay marriage, “We reward people not for digging into something interesting and emerging with great questions and fresh uncertainty, but for coming away from their investigation with an outlier—something really extraordinary and unusual. When we do that, we’re selecting for stories that are too frequently, well, incredible.” This is true of academics, who get rewarded with plum jobs not for building a well-designed study that offers a messy and hard-to-interpret result, but for generating interesting findings.
Likewise, journalists are not rewarded for writing stories that say “Gee, everything’s complicated, it’s hard to tell what’s true, and I don’t really have a clear narrative with heroes and villains.” Readers like a neat package with a clear villain and a hero, or at least clear science that can tell them what to do. How do you get that story? That’s right, by picking out the outliers. Effectively, academia selects for outliers, and then we select for the outliers among the outliers, and then everyone’s surprised that so many “facts” about diet and human psychology turn out to be overstated, or just plain wrong.
We need to spend a lot more time focusing on having a good process for finding knowledge, and a lot less time demanding interesting outcomes. Because a big part of learning is the null results, the “maybe but maybe not,” and the “Yeah, I’m not sure either, but this doesn’t look quite right.”
And ultimately, that’s why this latest study is a good sign. Because the researchers did exactly what you want, if you want to increase the sum of human knowledge: Instead of chasing a sexy, new, but possibly imaginary result to put on their resumes, they did the vital work of checking up on stuff we already thought we knew. We don’t have the process right yet, and we’ll probably never get it perfect. But academics are getting better at it. Perhaps readers and journalists will follow. BLOOMBERG