Sunday, January 12, 2014

Escaping from the Garden of Forking Paths

My previous post was prompted by a new paper by Andrew Gelman and Eric Loken (GL), but it did not discuss the paper’s main thrust because I had planned to defer that discussion to the present post. However, several comments on the previous post (by Chris Chambers and Andrew Gelman himself) leapt ahead of the game, so the comment section of the previous post already contains an entire discussion of the topic of our story here. But I’m putting the pedal to the metal to come out in front again.

Simply put, GL’s basic claim is that researchers often unknowingly create false positives. Or, in their words: “it is possible to have multiple potential comparisons, in the sense of a data analysis whose details are highly contingent on data, without the researcher performing any conscious procedure of fishing or examining multiple p-values.”

My copy of the Dutch Translation
Here is one way in which this might work. Suppose we have a hypothesis that two groups differ from each other and we have two dependent measures. What constitutes evidence for our hypothesis? If the hypothesis is not more specific than that, we could be tempted to interpret a main effect as evidence for the hypothesis. If we find an interaction with the two groups differing on only one of the two measures, then we would also count that as evidence. So now we actually had three bites at the apple but we’re working under the assumption that we only had one. And this is all because our hypothesis was rather unspecific.
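To make the three bites concrete, here is a minimal simulation sketch. It is not taken from GL's paper; the sample size, the number of simulated studies, and the use of a simple z-test with a known standard deviation are all assumptions chosen for illustration, and the function names are hypothetical. It simulates studies in which the two groups do not differ at all, and compares how often the single planned test (the main effect) is significant with how often at least one of the three patterns is.

```python
import random
from statistics import NormalDist, fmean


def z_test_p(x, y, sd):
    """Two-sided p-value for a difference in means, assuming a known SD."""
    se = sd * (1 / len(x) + 1 / len(y)) ** 0.5
    z = (fmean(x) - fmean(y)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(n_sims=5000, n=30, alpha=0.05, seed=1):
    """Return (rate for the one planned test, rate for 'any of three')."""
    rng = random.Random(seed)
    planned_hits = 0
    any_hits = 0
    for _ in range(n_sims):
        # Two groups, two independent measures, and no true difference at all.
        g1_a = [rng.gauss(0, 1) for _ in range(n)]
        g1_b = [rng.gauss(0, 1) for _ in range(n)]
        g2_a = [rng.gauss(0, 1) for _ in range(n)]
        g2_b = [rng.gauss(0, 1) for _ in range(n)]

        # Bite 1: main effect of group, tested on each person's mean score.
        g1_avg = [(a + b) / 2 for a, b in zip(g1_a, g1_b)]
        g2_avg = [(a + b) / 2 for a, b in zip(g2_a, g2_b)]
        p_main = z_test_p(g1_avg, g2_avg, sd=0.5 ** 0.5)

        # Bites 2 and 3: the groups differ on one particular measure.
        p_a = z_test_p(g1_a, g2_a, sd=1.0)
        p_b = z_test_p(g1_b, g2_b, sd=1.0)

        planned_hits += p_main < alpha
        any_hits += min(p_main, p_a, p_b) < alpha
    return planned_hits / n_sims, any_hits / n_sims


planned_rate, any_rate = simulate()
print(f"one planned test: {planned_rate:.3f}")
print(f"any of three:     {any_rate:.3f}")
```

With no true effect, the first rate should hover around the nominal 5%, while the second should come out noticeably higher, even though nobody in the simulation went fishing on purpose.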

GL characterize the problem succinctly: There is a one-to-many mapping from scientific to statistical hypotheses. I would venture to guess that this form of inadvertent p-hacking is extremely common in psychology, perhaps especially in applied areas, where the research is less theory-driven than in basic research. The researchers may not be deliberately p-hacking, but they’re increasing the incidence of false positives nonetheless.

In his comment on the previous post, Chris Chambers argues that this constitutes a form of HARKing (Hypothesizing After the Results are Known). This is true. However, it is a very subtle form of HARKing. The researcher isn’t thinking: “Well, I really didn’t predict this, but Daryl Bem has told me (pp. 2-3) that I need to go on a fishing expedition in the data, so I’ll make it look like I’d predicted this pattern all along.” The researcher is simply working with a hypothesis that is consistent with several potential patterns in the data.

GL noted that articles that they had previously characterized as the product of fishing expeditions might actually have a more innocuous explanation, namely inadvertent p-hacking. In the comments on my previous post, Chris Chambers took issue with this conclusion. He argued that GL looked at the study, and the field in general, through rose-tinted glasses.

The point of my previous post was that we often cannot reverse-engineer from the published results the processes that generated them on the basis of a single study. We cannot know for sure whether the authors of the studies initially accused by Gelman of having gone on a fishing expedition really cast out their nets or whether they arrived at their results in the innocuous way GL describe in their paper, although GL now assume it was the latter. Chris Chambers may be right when he says this picture is on the rosy side. My point, however, is that we cannot know given the information provided to us. There often simply aren’t enough constraints to make inferences about the procedures that have led to the results of a single study.

However, I take something different from the GL paper. Even though we cannot know for sure whether a particular set of published results was the product of deliberate or inadvertent p-hacking, it seems extremely likely that, overall, many researchers fall prey to inadvertent p-hacking. This is a source of false positives that we as researchers, reviewers, editors, and post-publication reviewers need to guard against. Even if researchers are on their best behavior, they still might produce false positives. GL suggest a remedy for the problem, namely pre-registration, but point out that this may not always be an option in applied research. It is, however, an option in experimental research.

GL have very aptly named their article after a story by the Argentinean writer Jorge Luis Borges (who happens to be one of my favorite authors): The Garden of Forking Paths. As is characteristic of Borges, the story contains the description of another story. The embedded story describes a world where an event does not lead to a single outcome; rather, all of its possible outcomes materialize at the same time. And then the events multiply at an alarming rate as each new event spawns a plethora of other ones.

I found myself in a kind of garden of forking paths when my previous post produced both responses to that post and responses I had anticipated after this post. I’m not sure it will be as easy for the field to escape from the garden as it was for me here, but we should definitely try.


  1. Rolf:

    Just to add to the mess, let me clarify that I don't like the terms "false positive" and "false negative." I think that many of the problems we describe arise because there is this idea that the purpose of a scientific study is to figure out whether a hypothesis is "true." Sometimes this works out; for example, most of us assume that whatever Daryl Bem is trying to study is actually false.

    But most of the time the true/false distinction does not really make sense. For example, consider that study that claimed to find a relation between men's arm circumference and an interaction between their socioeconomic status and their political attitudes. The correlation that the researchers found in their data: is it "real" in the sense of applying to the general population? Well, I don't think the true correlation is 0. What I do think is that they have a type M error (that is, their estimate from their sample is much higher than the correlation in the population) and that they are likely to have a type S error (that is, the sign of the association in the population is likely to be the opposite of what they found in the sample).

    But what of the researchers' more general hypothesis, that there is some relation between upper-body strength and political attitude, with some connection to evolution? Yes, I think this is true, in some sense it has to be true in that there is no way these possible relations are exactly zero. But that doesn't mean that anything useful came out of the published paper. It's not that they had a "false positive" or "false negative," that's not really the issue.

    1. Andrew, I had to think about this a little but it makes perfect sense. True/false presupposes the very knowledge that we seek. I guess "unsupported positive" would be a better term. I've always had this impression about social priming as well. The notion itself strikes me as plausible; it is just that the experiments provide no support for it.

  2. Since it was brought up in this post and the previous post by Rolf, Andrew, and Chris, I wanted to try out an argument against pre-registration. I should say upfront that I am not really opposed to pre-registration, but I think this argument suggests it is rather silly for many situations in experimental psychology.

    My concern is about what should be inferred when a researcher sticks to the plan. Does success for a pre-registered strategy lend some extra confidence in the results or in the theoretical conclusion? Does it increase belief in the process that produced the registered hypotheses? A consideration of two extremes suggests that it does not.

    Extreme case 1. Suppose a researcher generates a hypothesis by flipping a coin. It comes up "heads", so the researcher pre-registers the hypothesis that there will be a significant difference of means. The experiment is subsequently run and finds the predicted difference. Whether the observed difference is real or not, surely such an experimental outcome does not actually validate the process by which the hypothesis was generated. For the experiment to validate the prediction of the hypothesis (not just the hypothesis itself), there needs to be some justification for the prediction.

    Extreme case 2. Suppose a researcher generates a hypothesis by deriving an effect size from a quantitative theory that has previously been published in the literature. The researcher pre-registers this hypothesis and the subsequent experiment finds the predicted difference. Such an experimental finding may be strong validation of the hypothesis and of the quantitative theory, but it does not seem that pre-registration has anything to do with such validation. Since the theory has previously been published, other researchers could follow the steps of the original researcher and derive the very same predicted effect size. In a situation such as this it seems unnecessary to pre-register the hypothesis because it follows from existing ideas.

    Most research problems are neither of these extremes, but I still cannot see a situation where pre-registration helps. If the predicted hypotheses (and methods and measures) are clearly derived from existing theory, then pre-registration does not add much to the investigation. On the other hand, if the hypotheses (and methods and measures) are not clearly defined by existing theory, then pre-registration cannot change that situation.

    To put it another way, if a researcher is doing fully confirmatory work, then pre-registration is not necessary. If a researcher is doing fully exploratory work, then pre-registration should not be done at all. A problem we have in the field is that many people think only confirmatory work is proper and that exploratory work is non-scientific. To the contrary, both processes are essential to science.

    Moreover, it is not true that only confirmatory work can reject or validate theoretical predictions. The difference between confirmatory and exploratory work is mostly about the efficiency of the experimental design. Confirmatory work is focused on specific questions, so the design emphasises getting answers to those questions and is likely to give definitive answers. Exploratory work is less focused on specific questions, so the design is less likely to produce definitive answers to any questions (but it might, just by happenstance).

    For some of the specific cases where people have argued for pre-registration, the true problem was that the reported data did not provide a convincing argument for or against presented theoretical ideas. If researchers will just pay attention to the uncertainty in the measurements relative to the considered theoretical ideas, then it does not really matter whether the design is confirmatory or exploratory or whether the experiment was pre-registered or not.

    1. >To put it another way, if a researcher is doing fully
      >confirmatory work, then pre-registration is not necessary.
      >If a researcher is doing fully exploratory work, then
      >pre-registration should not be done at all.

      It's hard to argue with this. But I suspect that people are almost never actually doing fully confirmatory work in psychology, because (cf. Denny Borsboom's recent piece) there is so little theory, and what there is, is incomplete. We're generally about as far as you can get from the confirmatory paradigm as exemplified by, say, Eddington's eclipse experiment. Pre-registration reduces researcher degrees of freedom, which is needed not just because some people are dishonest or lazy, but because the theory you're typically working with doesn't even begin to predict what might happen with a whole pile of variables that you hadn't thought of.
      For example, consider the failure to reproduce Gailliot et al's work on ego-depletion - see This ought to be about as confirmatory as you can get, since the second group of researchers had all the materials, methods, and scripts from the original study. There are many possible explanations for their null result, but failure to account for a potentially large number of other (unimagined) variables seems like a plausible one. (Of course, that failure might have been on the part of the reproducing group. Over at there's a *successful* replication of Gailliot et al.'s original results. Who's to say who's right? Does a single null replication constitute falsification if you know that your theory isn't "complete" and never can be?)

      I suspect that people in disciplines that are considered "harder" sciences than psychology, but "softer" than physics (that's pretty well everything done in a lab, I guess!) might do well to study how the better psychologists design their experiments; if you can successfully eliminate most of the effects of hidden variables in psychology, it should be a snap in genomics or neuroscience.

    2. But if we are not doing confirmatory research and are instead performing exploratory work, then why would we want to artificially restrict ourselves to pre-registered hypotheses, methods, or analyses that are apparently made up by the researcher?

      I understand the desire to restrict researcher degrees of freedom and force researchers to generate real predictions; but if there is really no theoretical justification for the predictions, then the request is just a waste of time.

      The best outcome I can foresee from an emphasis on pre-registration is that researchers will sit down and take a good look at their theoretical ideas and carefully consider which ones generate predictions that are sufficiently precise to promote a well-designed experiment. For many cases, researchers will realise that they have no such ideas and thus will not pre-register anything. The realisation of such a situation is valuable, but I hardly think we can praise an approach whose main benefit is that it is not implemented. Researchers should think carefully about their ideas and experimental designs, but that can be done without pre-registration.

      Thanks for the link to Borsboom's piece. I liked it a lot, even though we come to different conclusions about the benefits of pre-registration.