New publication on the reproducibility of scientific results

Researchers betting money in prediction markets were very accurate in predicting which findings would replicate and which would not. Social science is in the midst of a reformation to improve transparency and rigor.

The results of the study reinforced changes in the field in recent years, including:

The importance of open sharing of all materials of published studies
The value of preregistration in reducing bias and increasing rigor
The optimism that the shifting norms toward transparency and rigor improve the credibility of published research

Today, in Nature Human Behaviour, a collaborative team of five laboratories published the results of 21 high-powered replications of social science experiments published in Science and Nature, two of the most prestigious journals in science. The team tried to replicate one main finding from every eligible experimental social science paper published between 2010 and 2015. To extend and improve on prior replication efforts, the team obtained the original materials and received the review and endorsement of the protocols from almost all of the original authors before conducting the studies.

The studies were preregistered to publicly declare the design and analysis plan, and the study design was very high-powered so that the replications would be likely to detect support for the findings even if they were as little as half the size of the original result. “To ensure high statistical power, the average sample size of the replication studies was about five times larger than the average sample size of the original studies,” said Felix Holzmeister of the University of Innsbruck, one of the project leaders.
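To illustrate why detecting a smaller effect calls for a much larger sample, here is a minimal sketch using a textbook normal approximation for a two-group comparison. It is not the team's actual power analysis: the function name, the 90% power target, the 5% significance level, and the example effect size are all assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, power=0.90, alpha=0.05):
    """Approximate sample size per group for a two-sided, two-sample z-test.

    effect_size is a standardized mean difference (Cohen's d).
    power and alpha are illustrative defaults, not values from the study.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


original_d = 0.5  # hypothetical original effect size
n_full = n_per_group(original_d)      # powered for the full original effect
n_half = n_per_group(original_d / 2)  # powered for half the original effect

print(f"per-group n for d = {original_d}: {n_full}")
print(f"per-group n for d = {original_d / 2}: {n_half}")
print(f"ratio: {n_half / n_full:.2f}")
```

Under this approximation the required sample size scales with the inverse square of the effect size, so halving the detectable effect roughly quadruples the sample needed. That is in the same range as the roughly fivefold average increase described above.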

The team found that 13 of the 21 (62%) replications showed significant evidence consistent with the original hypothesis, and other methods of evaluating replication success indicated similar results (ranging from 57% to 67%). Also, on average, the replication studies showed effect sizes that were about 50% smaller than the original studies. Together this suggests that reproducibility is imperfect even among studies published in the most prestigious journals in science. “These results show that ‘statistically significant’ scientific findings need to be interpreted very cautiously until they have been replicated, even if published in the most prestigious journals,” said Magnus Johannesson of the Stockholm School of Economics, another of the project leaders.

Prior to conducting the replications, the team set up prediction markets for other researchers to bet and earn (or lose) money based on whether they thought each of the findings would replicate. The markets were highly accurate in predicting which studies would later succeed or fail to replicate. The prediction markets correctly predicted the replication outcomes for 18 of the 21 replications, and market beliefs about replication were highly correlated with replication effect sizes. Thomas Pfeiffer of the New Zealand Institute for Advanced Study, another of the project leaders, noted, “The findings of the prediction markets suggest that researchers have advance knowledge about the likelihood that some findings will replicate.” It is not yet clear what knowledge is critical, but two possibilities are the plausibility of the original finding and the strength of the original statistical evidence. The apparent robustness of this phenomenon suggests that prediction markets could be used to help prioritize replication efforts for those studies that have highly important findings, but relatively uncertain or weak likelihood of replication success. Anna Dreber of the Stockholm School of Economics, another project leader, added: “Using prediction markets could be another way for the scientific community to use resources more efficiently and accelerate discovery.”
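One simple way to score a prediction market against replication outcomes is to treat a closing price above 0.5 as a call that the finding will replicate, then count the matching calls. The sketch below assumes that scoring rule; the function name, the threshold, and the toy prices and outcomes are invented for illustration and are not the project's data.

```python
def market_hits(prices, replicated, threshold=0.5):
    """Count studies where the market's call matches the replication outcome.

    prices: closing market prices in [0, 1], read as the believed chance of replication.
    replicated: booleans, True if the replication succeeded.
    threshold: assumed cutoff above which the market "predicts" replication.
    """
    if len(prices) != len(replicated):
        raise ValueError("prices and replicated must have the same length")
    return sum((p > threshold) == r for p, r in zip(prices, replicated))


# Made-up toy values for five hypothetical studies (not the project's data).
toy_prices = [0.85, 0.30, 0.62, 0.45, 0.91]
toy_outcomes = [True, False, False, False, True]

hits = market_hits(toy_prices, toy_outcomes)
print(f"{hits} of {len(toy_prices)} toy calls correct")

# The figure reported above, expressed as a share: 18 of 21 replications.
print(f"reported accuracy: {18 / 21:.0%}")
```

A threshold rule like this only captures whether the market leaned the right way. A correlation between prices and replication effect sizes, as the release describes, also uses how confident the market was.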

This study provides additional evidence of the challenges in reproducing published results, and addresses some of the potential criticisms of prior replication attempts. For example, it is possible that higher-profile results would be more reproducible because of high standards and the prestige of the publication outlet. This study selected papers from the most prestigious journals in science. Likewise, a critique of the Reproducibility Project in Psychology suggested that higher-powered research designs and fidelity to the original studies would result in high reproducibility. This study had very high-powered tests, original materials for all but one study, and the endorsement of protocols for all but two studies, and still failed to replicate some findings and found substantially smaller effect sizes in the replications. “This shows that increasing power substantially is not sufficient to reproduce all published findings,” said Lily Hummer of the Center for Open Science, one of the co-authors.

That there were replication failures does not mean that those original findings are false. “It is possible that errors in the replication or differences between the original and replication studies are responsible for some failures to replicate, but the fact that the markets predicted replication success and failure accurately in advance reduces the plausibility of these explanations,” said Gideon Nave of the Wharton School of Business, another project lead. Nevertheless, some original authors provided commentaries with potential reasons for failures to replicate. These productive ideas are worth testing in future research to determine whether the original findings can be reproduced under some conditions.

These replications follow emerging best practices for improving the rigor and reproducibility of research. “In this project, we led by example, involving a global team of researchers. The team followed the highest standards of rigor and transparency to test the reproducibility and robustness of studies in our field,” said Teck-Hua Ho of the National University of Singapore, another project lead. All of the studies were preregistered on OSF (https://osf.io/pfdyw/) to eliminate reporting bias and pre-commit to the design and analysis plan. Also, all project data and materials are publicly accessible with the OSF registrations to facilitate the review and reproduction of the replication studies themselves.

Brian Nosek, executive director of the Center for Open Science, professor at the University of Virginia, and one of the co-authors, noted, “Someone observing these failures to replicate might conclude that science is going in the wrong direction. In fact, science’s greatest strength is its constant self-scrutiny to identify and correct problems and increase the pace of discovery.” This large-scale replication project is just one part of an ongoing reformation of research practices. Researchers, funders, journals, and societies are changing policies and practices to nudge the research culture toward greater openness, rigor, and reproducibility. For example:

A sample of 33 journals that publish social-personality psychology research went from 0 having policies promoting transparency in 2013 to 24 (73%) implementing some policy and 19 (58%) implementing relatively assertive to strong policies according to the TOP Guidelines (http://cos.io/top/). Across all sciences, most major publishers have become signatories to the TOP Guidelines and more than 850 journals have completed implementation of new transparency policies.
37 journals have adopted badges to acknowledge open research practices (http://cos.io/badges), an intervention that has demonstrated effectiveness for increasing the rates and quality of sharing research data and materials. Most adopting journals so far are in the social-behavioral sciences.
In the last 4 years, 125 journals have adopted Registered Reports (http://cos.io/rr/), a publishing model in which the journal reviews the research prior to knowing the results. This new model, adopted mostly by social-behavioral science journals, reduces publication bias and pressure for obtaining positive, tidy results.
More than 20,000 studies have been registered on the OSF (http://osf.io/), with the rate of registration doubling each year since its launch in 2012. A large majority of these registrations are from research teams in the social-behavioral sciences. The practice of preregistration, required by law for clinical trials but previously rare in basic research, is a means of reducing reporting bias and increasing the rigor of research. Preregistration is thought to be one of the most important new behaviors to improve credibility and reproducibility of research evidence.

Nosek concluded, “With these reforms, we should be able to increase the speed of finding cures, solutions, and new knowledge. Of course, like everything else in science, we have to test whether the reforms actually deliver on that promise. If they don’t, then science will try something else to keep improving.”

A comprehensive information page for the SSRP Project including contacts, relevant articles, links to the papers, and supplemental materials can be accessed here.

Colin F. Camerer, Anna Dreber, Felix Holzmeister, Teck-Hua Ho, Jürgen Huber, Magnus Johannesson, Michael Kirchler, Gideon Nave, Brian A. Nosek, Thomas Pfeiffer, Adam Altmejd, Nick Buttrick, Taizan Chan, Yiling Chen, Eskil Forsell, Anup Gampa, Emma Heikensten, Lily Hummer, Taisuke Imai, Siri Isaksson, Dylan Manfredi, Julia Rose, Eric-Jan Wagenmakers, Hang Wu (2018): Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015, Nature Human Behaviour, http://dx.doi.org/10.1038/s41562-018-0399-z.