When the Open Science Collaboration set out in 2015 to re-run a hundred published psychology experiments, only about four in ten findings clearly survived. A parallel effort in experimental economics later successfully replicated roughly 60% of the studies it re-examined. For the social sciences, the lesson was uncomfortable: a substantial share of what we take to be empirical knowledge rests on shakier ground than we like to admit.
Few researchers have done more to surface this problem, and to build the infrastructure to fix it, than Anna Dreber Almenberg, Professor of Economics at the Stockholm School of Economics. Over the past decade, her work on large-scale replications, prediction markets for scientific claims, and multi-analyst and many-designs studies has helped reshape how social scientists think about credible evidence. Felix Holzmeister spoke with her about why replicability matters, what happens when science asks the same question many times, and what has, and has not yet, improved.
PART I
How much of what we know is actually true?
Q If we were to select a random article published in an academic journal that presents a novel finding based on an experiment, what would be your best estimate of the likelihood that the study’s primary result holds up in an independent replication?
A Maybe 50%, at least for older publications, where “older” means more than five years old. Pooling the results from the replication projects that we and others have done gives an estimate of around 50%. I would expect replicability to have increased in recent years, so the number would likely be higher for studies published, say, in 2025.
Q You have also researched heterogeneity, showing that even seemingly minor variations in study design or data analysis can significantly affect conclusions. Why should alumni working in, say, management consulting, public health, or education policy care about the distinction between replicability and generalizability?
A We think of replications as testing the same hypothesis as a previous study, but with new data. Whether a result replicates can be considered a sanity check: if it does not, it is probably better to move on. If it replicates successfully, an interesting next step is to assess its generalizability: whether the finding holds beyond the specific conditions of the original study.
Q Much of your work has highlighted that a single study often cannot bear the weight that policy, business, or textbook writing later places on it. Why is that?
A We all know that the results from a single study should not carry too much weight until more evidence has accumulated. But results from single studies tend to be communicated as if they are more definitive than they really are. There is always a worry that the result is a false positive (we conclude that something is happening when in fact nothing is) or a false negative (we conclude that nothing is happening when in fact something is). Depending on how the result was produced, that worry can be substantial, yet it is typically not clear from the paper itself. And even for true positives, results often have limited generalizability: the effect might hold under specific circumstances, but it is not clear how well it transfers to other groups, contexts, or measures.
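To see why the false-positive worry can be substantial, it helps to run the standard back-of-the-envelope calculation. The short Python sketch below applies Bayes' rule to a hypothetical literature; the prior, significance threshold, and power values are illustrative assumptions, not estimates from any study discussed here.

```python
# Probability that a statistically significant result reflects a real
# effect, given the share of tested hypotheses that are true (prior),
# the significance threshold (alpha), and statistical power.
# All numbers here are illustrative assumptions.

def prob_significant_is_true(prior, alpha=0.05, power=0.80):
    """P(effect is real | result is significant), by Bayes' rule."""
    true_positives = prior * power          # real effects that get detected
    false_positives = (1 - prior) * alpha   # null effects flagged anyway
    return true_positives / (true_positives + false_positives)

# If only 1 in 10 tested hypotheses is true, even a well-powered study
# leaves a sizeable false-positive share, and an underpowered one is worse:
print(round(prob_significant_is_true(prior=0.10, power=0.80), 2))  # 0.64
print(round(prob_significant_is_true(prior=0.10, power=0.20), 2))  # 0.31
```

Under these assumed numbers, only a third to two-thirds of significant findings would reflect real effects, which is broadly in line with the replication rates mentioned above.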
ABOUT THE INTERVIEWEE
Anna Dreber Almenberg is the Johan Björkman Professor of Economics at the Stockholm School of Economics. She is a Wallenberg Scholar and a member of the Royal Swedish Academy of Sciences (KVA) and the Royal Swedish Academy of Engineering Sciences (IVA). She is a co-founder of Lab², a collaborator with the Institute for Replication (I4R), and, until April 2026, an Editor of the Journal of Political Economy Microeconomics. Her latest book, The Credibility Gap (with Magnus Johannesson), was published by Routledge in 2025.
PART II
From the lab to the boardroom.
Q Our alumni have gone on to work in banks, ministries, hospitals, schools, NGOs, and start-ups. Where, concretely, have you seen a non-replicable finding make its way into policy, management practice, or clinical guidelines before it was corrected — and what was the cost?
A A notorious example is the “signing-on-top” pledge study, which suggested that asking people to sign an honesty declaration at the top of a form rather than at the bottom would lead to more truthful reporting. It was picked up and implemented by tax authorities and insurers before the underlying data were found to have been fabricated.
Q If one of our alumni sits on a board, in a cabinet office, or in a leadership role and is handed a study to justify a decision, what three questions should they ask before acting on it?
A First: Are there other studies on the topic, and are their results in line with what this particular study shows? Second: Is the statistical evidence strong? Is the study based on a large sample, and does the analysis have sufficient power? Third: How likely is it that the result will generalize to other settings beyond what the study itself explored?
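On the second question, “sufficient power” has a concrete meaning: the probability that a study of a given size would detect an effect of a given magnitude. Here is a minimal Python sketch using a normal approximation to a two-group comparison; the effect size and sample sizes are hypothetical, chosen purely for illustration.

```python
# Approximate power of a two-sided, two-group comparison via a normal
# approximation (the small opposite-tail contribution is ignored).
# Effect size and group sizes below are hypothetical illustrations.
from scipy.stats import norm

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power for a standardized effect size (Cohen's d)."""
    se = (2 / n_per_group) ** 0.5        # std. error of the difference, in SD units
    z_crit = norm.ppf(1 - alpha / 2)     # two-sided critical value
    return 1 - norm.cdf(z_crit - effect_size / se)

# A "medium" effect (d = 0.5) with 30 people per group is roughly a coin
# flip; about 64 per group are needed for the conventional 80% target.
print(round(approx_power(0.5, 30), 2))   # 0.49
print(round(approx_power(0.5, 64), 2))   # 0.81
```

A study far below such thresholds for any plausible effect size is exactly the kind of evidence the first and third questions should make a decision-maker cautious about.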
“Results from single studies tend to be communicated
as if they are more definitive than they really are.”
— ANNA DREBER
Q There is a risk that drawing public attention to these problems feeds a broader distrust of science — particularly at a time when trust in expertise is already fragile. How can we discuss the credibility crisis without handing ammunition to those who would dismiss science altogether?
A Sometimes getting it wrong is part of how science works, and our work should not be interpreted as “science being broken.” The scientific method works, and I think our studies are examples of how science self-corrects. I also think the process can be faster, and that the work we and others have done contributes to speeding it up. I have worked on these topics for more than a decade now, and there have been significant improvements in real time.
PART III
The gear change.
Q Preregistration and Registered Reports, open access, open data and code, big-team science, adversarial collaborations, open peer review — which of these reforms has, in your view, produced the clearest improvement so far, and which are still more aspiration than reality?
A I am not so sure that open access has worked out the way we intended. At the end of the day, we are paying publishers huge fees to make papers accessible when we could have done so simply by posting preprints. I am a big fan of preregistration and Registered Reports and believe they are here to stay. I also think that big-team science should become the new normal — there are many examples of fantastic team-science projects, but we are not yet at the stage where they are the norm.
Q If you compare the empirical social sciences today with where they stood fifteen years ago, when the “replication crisis” first broke into public view, what has changed for the better, and what are you most proud of having helped push forward?
A Many of us engaged with statistical practices that led to unreliable results, without any bad intentions. Awareness of these practices has increased dramatically — partly because of our work — and this has contributed to the rise of preregistration and to taking sample sizes and statistical power much more seriously. That is something we should be proud of.
“I am very optimistic about the near future of science.”
— ANNA DREBER
Q And conversely: where are we still falling short? What would you most like to see our universities, funders, and journals change in the next five years?
A We need to think about the incentive system. How do we incentivize credible results to a greater extent and ensure that researchers don’t overclaim the generalizability of their findings? How do we incentivize team science if we believe — as I do — that this type of collaboration is more likely to generate interesting and credible results?
PART IV
A personal note.
Q You have built much of your career around pointing out uncomfortable truths about your own field. What first drew you to metascience rather than the “normal” research career path — and what keeps you optimistic today?
A Before metascience, I was working in behavioral economics, and it was not really an active decision to switch focus. I was interested in gender differences in risk-taking and thus fascinated by the now-infamous “power posing” study — a paper based on 42 participants that suggested that adopting a high-power pose would raise testosterone, lower cortisol, and increase both risk-taking and feelings of power. We set out to extend that study, curious whether the results would apply to other economic preferences, such as risk-taking in the loss domain and willingness to compete. With 200 participants, we failed to replicate the behavioral and hormonal results, which made me interested in replications.
I am optimistic because our work is receiving significant positive attention from other researchers, and it is inspiring to see real-time shifts in attitudes and practices. I am very optimistic about the near future of science.
ENGAGING FURTHER WITH THE TOPIC
KEY PAPERS (Anna’s recommendations)
- Simmons et al. (2011), False-Positive Psychology.
- Gelman & Loken (2014), The Statistical Crisis in Science.
- Camerer et al. (2018), Evaluating the Replicability of Social Science Experiments.
- Holzmeister et al. (2024), Heterogeneity in Effect Size Estimates.
- Holzmeister et al. (2025), Examining the Replicability of Online Experiments.
_______________________________________________________________________________________________________________________________
BOOK (Felix’s recommendation)
The Credibility Gap: Evaluating and Improving Empirical Research in the Social Sciences
Anna Dreber & Magnus Johannesson (Routledge, 2025)
________________________________________________________________________________________________________________________________
INITIATIVES & CONTACT
- Lab² (Metalab for Better Science) — labsquare.net
- Institute for Replication (I4R) — i4replication.org
- Open Science Summer School — September 2026, University of Innsbruck (UIBK)
- Get in touch with Anna — anna.dreber@hhs.se
- Contact Felix — felix.holzmeister@uibk.ac.at