What Can we Learn from the Many Labs Replication Project?

The first massive replication project in psychology has just reached completion (several others are to follow). A large group of researchers, which I will refer to as ManyLabs, has attempted to replicate 15 findings from the psychological literature in various labs across the world. The paper is posted on the Open Science Framework (along with the data) and Ed Yong has authored a very accessible write-up. [Update, May 20, 2014: the article is now out and is open access.]

What can we learn from the ManyLabs project? The results here show the effect sizes for the replication efforts (in green and grey) as well as the original studies (in blue). The 99% confidence intervals are for the meta-analysis of the effect size (the green dots); the studies are ordered by effect size.

Let’s first consider what we canNOT learn from these data. Of the 13 replication attempts (when the first four are taken together), 11 succeeded and 2 did not (in fact, at some point ManyLabs suggests that a third one, Imagined Contact, also doesn’t really replicate). We cannot learn from this that the vast majority of psychological findings will replicate, contrary to this Science headline, which states that these findings “offer reassurance” about the reproducibility of psychological findings. As Ed Yong (@edyong209) joked on Twitter, perhaps ManyLabs has stumbled on the only 12 or 13 psychological findings that replicate! Because the 15 experiments were not a random sample of all psychology findings and it’s a small sample anyway, the percentage is not informative, as ManyLabs duly notes.

But even if we had an accurate estimate of the percentage of findings that replicate, how useful would that be? Rather than trying to arrive at a more precise estimate, it might be more informative to follow up the ManyLabs project with projects that focus on a specific research area or topic, as I proposed in my first-ever post, as this might lead to theory advancement.

So what DO we learn from the ManyLabs project? We learn that for some experiments, the replications actually yield much larger effects than the original studies, a highly intriguing finding that warrants further analysis.

We also learn that the two social priming studies in the sample, dangling at the bottom of the list in the figure, were resoundingly nonreplicated. One study found that exposure to the United States flag increases conservatism among Americans; the other study found that exposure to money increases endorsement of the current social system. The replications show that there essentially is no effect whatsoever for either of these exposures.

It is striking how far the effect sizes of the original studies (indicated by an x) are from the rest of the experiments. There they are, by their lone selves at the bottom right of the figure. Given that all of the data from the replication studies have been posted online, it would be fascinating to get the data from the original studies. Comparisons of the various data sets might shed light on why these studies are such outliers.

We also learn that the online experiments in the project yielded results that are highly similar to those produced by lab experiments. This does not mean, of course, that any experiment can be transferred to an online environment, but it certainly inspires confidence in the utility of online experiments in replication research.

Most importantly, we learn that several labs working together yield data that have enormous evidentiary power. At the same time, it is clear that such large-scale replication projects will have diminishing returns (for example, the field cannot afford to devote countless massive replication efforts to not replicating all the social priming experiments that are out there). However, rather than using the ManyLabs approach retrospectively, we can also use it prospectively: to test novel hypotheses.

Here is how this might go.

(1) A group of researchers form a hypothesis (not by pulling it out of thin air but by deriving it from a theory, obviously).

(2) They design—perhaps via crowd sourcing—the best possible experiment.

(3) They preregister the experiment.

(4) They post the protocol online.

(5) They simultaneously carry out the experiment in multiple labs.

(6) They analyze and meta-analyze the data.

(7) They post the data online.

(8) They write a kick-ass paper.
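Step (6) can be sketched in a few lines. This is a minimal illustration, not the ManyLabs analysis itself: it assumes each lab reports a Cohen's d along with its two group sizes, uses the usual large-sample approximation for the variance of d, and pools the labs with fixed-effect inverse-variance weights. The function names and the lab numbers are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist


def cohens_d_variance(d, n1, n2):
    """Large-sample approximation of the sampling variance of Cohen's d."""
    return (n1 + n2) / (n1 * n2) + d * d / (2.0 * (n1 + n2))


def pool_fixed_effect(effects, variances, level=0.99):
    """Inverse-variance (fixed-effect) pooled estimate with a confidence interval.

    Returns (pooled_effect, lower_bound, upper_bound).
    """
    if not effects or len(effects) != len(variances):
        raise ValueError("need one variance per effect size")
    weights = [1.0 / v for v in variances]      # precise labs count more
    total = sum(weights)
    pooled = sum(w * d for w, d in zip(weights, effects)) / total
    se = math.sqrt(1.0 / total)                 # standard error of the pooled effect
    z = NormalDist().inv_cdf(0.5 + level / 2.0) # two-sided critical value
    return pooled, pooled - z * se, pooled + z * se


# Hypothetical labs: (Cohen's d, n in group 1, n in group 2). Not real data.
labs = [(0.30, 50, 50), (0.10, 80, 80), (0.25, 40, 40)]
effects = [d for d, _, _ in labs]
variances = [cohens_d_variance(d, n1, n2) for d, n1, n2 in labs]

pooled, lower, upper = pool_fixed_effect(effects, variances, level=0.99)
print(f"pooled d = {pooled:.3f}, 99% CI [{lower:.3f}, {upper:.3f}]")
```

If the true effect is expected to differ between labs, a random-effects model would be the more natural choice than the fixed-effect pooling assumed here.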

And so I agree with the ManyLabs authors when they conclude that a consortium of laboratories could provide mutual support for each other by conducting similar large-scale investigations on original research questions, not just replications. Among the many accomplishments of the ManyLabs project, showing us the feasibility of this approach might be its major one.

Comments

Steve Fiore, 28 November 2013 at 16:37
Thanks for the post on this Rolf. What I was wondering when I read about the replications in Nature (http://www.nature.com/news/psychologists-strike-a-blow-for-reproducibility-1.14232) was whether or not these were really replications. Doesn't the very fact that they "combined tests from earlier experiments into a single questionnaire — meant to take 15 minutes to complete" mean that they did not, technically, "replicate" the original studies? They essentially created a new study (survey instrument), that contained items from prior studies. That, then, created a set of new contextual factors surrounding these questionnaire items.

Anyway, since you've thought a lot more about this issue, I'd be interested in your interpretation.
Unknown, 28 November 2013 at 22:30
Lykken (1968) distinguished 3 types of replication:

1. LITERAL. Exact, only the subjects and time change (e.g., in-lab replication)
2. OPERATIONAL. Reproduce the methods as best as possible.
3. CONSTRUCTIVE. Replicate the theoretical construct.

The scientific credibility awarded to a theory is largest after a successful constructive replication, smaller after an operational one, and least impressive after a literal replication.

I think ManyLabs shows it is possible to conduct constructive type replications and therefore am not surprised to see variation.

However, I do wonder about the following: There were original studies that had a power of ~99% to detect the original effect, as well as the replicated effect... there's more to power than sample size!

By the way... The idea to use this for novel predictions... Where do I sign up? :)
J&K, 30 November 2013 at 18:07
At the risk of sounding like a broken record, why are the two failed priming studies described as "social priming" studies? What is social about priming money or a flag?

Joe
Rolf Zwaan, 30 November 2013 at 18:18
I guess that's what they are referred to (by others). It's true that it's not easy to label these kinds of studies: http://storify.com/rolfzwaan/conversation-with-wrayherbert-rolfzwaan-hpashler-p.
Dr. Fox, 27 February 2014 at 22:12
One point I haven’t seen discussed but I’m wondering about: how come the effect sizes for the original studies mostly fall within a relatively narrow range? Much narrower than the range of effect sizes from the ManyLabs replication, but centered on about the same grand mean. Is that just happenstance? Is there some obvious explanation I’m missing?