How to Make Psychology Studies More Reliable

A new way for the field to address its replication crisis.

The biggest debate in modern psychology concerns the quality of the field itself. Some psychologists contend that there is a replicability crisis, in which many published results—including several classic ones—aren’t true. Others assert that things are fine. The debate has become acrimonious, and predictably so. Scientist A fails to repeat B’s experiment and publishes the results, perhaps implying that the original study was p-hacked—that it relied on questionable methodological practices that throw up positive but unreliable results. B gets defensive and accuses A of incompetence, ill intent, or bullying. Twitter has a field day. Rinse and repeat.

It shouldn’t be like that: Science is a gradual and stuttering climb towards greater certainty, and tweaking the published record is part of the game. But science is done by people, and people tend to take it badly and personally when their work is called into question.

There’s another way, says Eric Luis Uhlmann from INSEAD: Get your own studies independently replicated before they are published. He is leading by example. In August 2014, he asked 25 independent teams to repeat all of his group’s unpublished experiments, before he submitted them to academic journals.

Many replication initiatives are about removing the weeds from the scientific record. Uhlmann’s effort—the Pipeline Project—aims to ensure that only flowers bloom in the first place. “The idea was to see if findings are robust before they find their way into the media and into everyone’s lectures,” he says.

Replicating studies before publication could also lead to fewer bruised egos and less drama. “It’s less sensitive when something fails before publication,” Uhlmann says. “No one’s even heard of the effect yet. My reputation isn’t riding on it. There’s less defensiveness.”

“Having researchers replicate each other’s findings before they are published is likely to be a critical step in fostering meta-science’s goal of turning the lens of science onto itself,” says Jonathan Schooler from the University of California, Santa Barbara.

At first, Uhlmann thought he’d do a study swap with just one other group. “I sent out 15 emails, thinking that one lab would say yes,” he says. “I got ten yeses.” He eventually recruited 25 teams across six countries, all of whom gave time and effort in exchange for minimal rewards. This unexpected altruism, he says, shows that “there’s a real enthusiasm for experimenting with better ways of doing science.”

The teams replicated ten of Uhlmann’s experiments many times over. All the studies focused on moral judgments and on our tendency to evaluate actions based on what they reveal about a person’s character, rather than on how bad they are in absolute terms. For example, Uhlmann found that a manager who mistreats ethnic-minority employees is seen as worse than one who mistreats everyone. A company that airbrushes a model to have perfect skin is seen as more dishonest than one that hires a model whose skin is already perfect. And an animal-rights activist who is caught hunting is seen as more immoral than a big-game hunter.

These findings all checked out, along with three others. But four of Uhlmann’s results faltered in at least one important way as they ran the replication gauntlet. “I couldn’t predict which ones would work and which ones would not,” he says.

For example, he found that a company that doesn’t respond to accusations of misconduct is judged just as harshly as one that’s found guilty—but his replicators found that the silent party is judged five times more harshly. He found that a person who tips in pennies is judged more negatively than one who leaves the same tip in bills—and although his American collaborators found the same thing, those from other countries didn’t.

Uhlmann specifically designed the Pipeline Project to pre-emptively counter criticisms that have been leveled at earlier replication initiatives. Critics have often claimed that replicators lack the competence to properly repeat the original experiments—but Uhlmann chose accomplished peers who worked in the same field. “It’s hard for me to argue that someone’s not a competent researcher because I picked them!” he says.

But Simine Vazire from the University of California, Davis, says that there’s not much evidence that incompetent replicators are the source of psychology’s problems. For example, in the Many Labs project, where 36 groups repeated 13 earlier studies, there were only small differences in the results from different labs. “It’s a problem if science only replicates when the original authors get to hand-pick the replicators,” she says. “I don’t think it’s a good precedent to set.”

A fair point, Uhlmann concedes. Then again, he asked the replicators to pre-register their studies. They detailed every aspect of the experiments and uploaded their plans for anyone to see. With such transparency, “it is difficult to see how a replicator could artificially bias the result in favor of the original finding without committing outright fraud,” he writes.

Absent such biases, the most likely explanations for his four irreproducible experiments are simple. The tipping effect appears to be specific to the U.S. and doesn’t generalize to other countries. The other three effects were probably false positives—the kind that often show up in small, underpowered studies. “Even when everything’s public and you have expert replicators, you’ll have an imperfect replicability rate, because that’s just science,” says Uhlmann.

He hopes that these pre-publication initiatives will become more common, from simple study swaps to complex daisy chains of replications, as suggested by Nobel laureate Daniel Kahneman. He is also trying to couple pipeline projects to university courses by turning undergraduate and graduate students into a network of replicators. “Any researcher can email me their study and ask to have it replicated by students at a number of different universities,” he says.

That would only work for cheap, simple studies—arguably the biggest limitation to the pipeline approach. “It’s a great example of what can be done for research that is economical to replicate,” says Brian Nosek from the Center for Open Science. “It would be great to extend such a process to more specialized procedures, intensive data collections, and difficult-to-sample populations.”

“It might be possible to create an online market where people can post studies, indicate the expertise that’s required, and match up to replicator pools,” says Uhlmann. “The trick there would be to find ways of incentivizing authors to submit their studies. Journals could establish a premium where they’re more likely to accept something if it’s pre-replicated.”

“I hope more authors will subject their work to this kind of test,” adds Vazire. “That would be a huge advance for psychology.”
