Monday, January 20, 2014

Why Social-Behavioral Primers Might Want to be More Self-critical

During the investigation into the scientific conduct of Dirk Smeesters, I expressed my incredulity about some of his results to a priming expert. His response was: You don’t understand these experiments. You just have to run them a number of times before they work. I am convinced he was completely sincere.

What underlies this comment is what I’ll call the shy-animal mental model of experimentation. The effect is there; you just need to create the right circumstances to coax it out of its hiding place. But there is a more appropriate model: the 20-sided-die model (I admit, that’s pretty spherical for a die but bear with me).

A social-behavioral priming experiment is like rolling a 20-sided die, an icosahedron. If you roll the die enough times, a 20 will turn up at some point. Bingo! You have a significant effect. In fact, given what we now know about questionable and not-so-questionable research practices, it is fair to assume that the researchers are actually rolling a 20-sided die on which maybe as many as six sides show a 20. So the chances of rolling a 20 are quite high.
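To put rough numbers on the metaphor: a fair d20 gives a 1-in-20 chance per roll, matching the conventional α = .05, and six "20" faces raise that to .30 per roll, so over a handful of attempts a "significant" roll becomes very likely. A minimal simulation sketch (the six-faces figure and the five attempts are illustrative assumptions, not empirical estimates):

```python
import random

def chance_of_a_20(rolls_per_study, faces_showing_20, trials=100_000):
    """Estimate the probability that at least one roll in a 'study'
    comes up 20, on a d20 with the given number of faces showing 20."""
    hits = 0
    for _ in range(trials):
        if any(random.randrange(20) < faces_showing_20
               for _ in range(rolls_per_study)):
            hits += 1
    return hits / trials

# A fair die, one roll: about .05, the nominal false-positive rate.
print(chance_of_a_20(rolls_per_study=1, faces_showing_20=1))

# Six faces showing 20 (the questionable-research-practices die),
# five attempts: roughly 1 - 0.7**5, i.e. about .83.
print(chance_of_a_20(rolls_per_study=5, faces_showing_20=6))
```

The point of the sketch is only that repeated attempts with a biased die make a "20" the expected outcome, not a surprising one.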

[Image caption: I didn't know they existed, but a student who read this post brought this specimen to class; she uses it for Gatherer games.]
Once the researchers have rolled a 20, their next interpretive move is to treat the circumstances that happened to coincide with rolling the die as instrumental in producing the 20. The only problem is that they don't know what those circumstances were. Was it the physical condition of the roller? Was it the weather? Was it the time of day? Was it the color of the roller's sweater? Was it the type of microbrew he had the night before? Was it the bout of road rage he experienced that morning? Was it the cubicle in which the rolling experiment took place? Was it the fact that the roller was a 23-year-old male from Michigan? And so on.

Now suppose that someone else tries to faithfully recreate the circumstances that co-occurred with the rolling of the 20, from the information that was provided by the original rollers. They recruit a 23-year old male roller from Michigan, wait until the outside temperature is exactly 17 degrees Celsius, make the experimenter wear a green sweater, have him drink the same IPA on the night before, and so on.

Then comes the big moment. He rolls the die. Unfortunately, a different number comes up: a disappointing 11. Sadly, he did not replicate the original roll. He tells this to the first roller, who replies: Yes, you got a different number than we did, but that's because of all kinds of extraneous factors that we didn't tell you about because we don't know what they are. So it doesn't make sense for you to try to replicate our roll, because we don't know why we got the 20 in the first place! Nevertheless, our 20 stands and counts as an important scientific finding.

That is pretty much the tenor of some contributions in a recent issue of Perspectives on Psychological Science that downplay the replication crisis in social-behavioral priming. This kind of reasoning seems to motivate recent attempts by social-behavioral priming researchers to explain away an increasing number of non-replications of their experiments.

Joe Cesario, for example, claims that replications of social-behavioral priming experiments by other researchers are uninformative because any failed replication could result from moderation, although a theory of the moderators is lacking. Cesario argues that initially only the originating lab should try to replicate its findings. Self-replication is in and of itself a good idea (we have started doing it regularly in our own lab) but as Dan Simons rightfully remarks in his contribution to the special section: The idea that only the originating lab can meaningfully replicate an effect limits the scope of our findings to the point of being uninteresting and unfalsifiable.

[Image caption: Show-off! You're still a "false positive."]
Ap Dijksterhuis also mounts a defense of priming research, downplaying the number of non-replicated findings. He talks about the odd false positive, which sounds a little like saying that a penguin colony contains the odd flightless bird (I know, I know, I'm exaggerating here). Dijksterhuis claims that it is not surprising that social priming experiments yield larger effects than semantic priming experiments because the manipulations are bolder. But if this were true, wouldn’t we expect social priming effects to replicate more often? After all, semantic priming effects do; they are weatherproof, whereas the supposedly bold social-behavioral effects appear sensitive to such things as weather conditions (which Dijksterhuis lists as a moderator).

Andrew Gelman made an excellent point in response to my previous post: false positive is actually not a very appropriate term. He suggests an alternative phrasing: overestimating the effect size. This seems like a constructive perspective on social-behavioral priming, without any negative connotations. Earlier studies provided inflated estimates of the size of social-behavioral priming effects.
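Gelman's point can be illustrated with a small simulation (a sketch under assumed numbers: a true standardized effect of d = 0.2, 20 participants per group, and a normal-approximation test; none of these figures come from the post itself). When only significant results get reported, the reported effect sizes are, on average, much larger than the true effect:

```python
import math
import random
import statistics

def simulated_study(true_d=0.2, n=20):
    """Two-group study: return the observed standardized effect size and
    whether a two-sided test at alpha = .05 would call it significant
    (normal approximation, for simplicity)."""
    control = [random.gauss(0.0, 1.0) for _ in range(n)]
    treated = [random.gauss(true_d, 1.0) for _ in range(n)]
    sd = statistics.stdev(control + treated)
    observed_d = (statistics.mean(treated) - statistics.mean(control)) / sd
    z = observed_d / math.sqrt(2.0 / n)  # standard error of d is ~sqrt(2/n)
    return observed_d, abs(z) > 1.96

# Keep only the "publishable" (significant) studies.
published = [d for d, significant in (simulated_study() for _ in range(20_000))
             if significant]
print(statistics.mean(published))  # typically around 0.7: several times the true 0.2
```

The selection step does all the work: with small samples, only wildly overestimated effects clear the significance bar, so the published literature overstates the effect even when every individual study is honest.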

A less defensive and more constructive response by priming researchers might therefore be: “Yes, the critics have a point. Our earlier studies may have indeed overestimated the effect sizes. Nevertheless, the notion of social-behavioral priming is theoretically plausible, so we need to develop better experiments, pre-register our experiments, and perform cross-lab replications to convince ourselves and our critics of the viability of social-behavioral priming as a theoretical construct.”

In his description of Cargo Cult Science, Richard Feynman stresses the need for researchers to be self-critical: We've learned from experience that the truth will come out. Other experimenters will repeat your experiment and find out whether you were wrong or right. Nature's phenomena will agree or they'll disagree with your theory. And, although you may gain some temporary fame and excitement, you will not gain a good reputation as a scientist if you haven't tried to be very careful in this kind of work. And it's this type of integrity, this kind of care not to fool yourself, that is missing to a large extent in much of the research in Cargo Cult Science.

It is in the interest of the next generation of priming researchers (just to mention one important group) to be concerned about the many nonreplications (coupled with the large effect sizes and small samples that are characteristic of social-behavioral priming experiments). The lesson is that the existing paradigms are not going to yield further insight and ought to be abandoned. After all, they may have led to overestimated priming effects.

I’m reminded of the Smeesters case again. Smeesters had published a paper in which he had performed variations on the professor-prime effect, reporting large effects (the effects that prompted my incredulity). This paper has now been retracted. One of his graduate students had performed yet another variation on the professor-prime experiment; she found complete noise. When we examined her raw data, the pattern was nothing like the pattern Uri Simonsohn had uncovered in Smeesters’ own data. When confronted with the discrepancy between the two data sets, Smeesters gave the defense we see echoed in the social-behavioral priming defense discussed here: that experiment was completely different from my experiments (he did not specify how), so of course no effect was found.

There is reason to worry that defensive responses about replication failures will harm the next generation of social-behavioral priming researchers because these young researchers will be misled into placing much more confidence in a research paradigm than is warranted. Along the way they will probably waste a lot of valuable time, face lots of disappointments, and might even face the temptation of questionable research practices. They deserve better. 


  1. I like to think that all quantitative science is basically about working out how biased this one particular die we're given is. The problem in psychology is not just that we are throwing three or four other dice at the same time (which we know about and "control for", although we don't always know how many sides they have, let alone their bias), but more the fact that the table on which we are throwing them turns out to be made of dice, a lot of which are very loosely fixed and jump into the middle of the pile when the others are thrown.

    Re Feynman's comment, I wish I were as optimistic as he is. In fact, maybe he's right up to the point where he says the truth will come out (albeit that the original was in JPSP and the null replication has to make do with the admirable Journal of Articles in Support of the Null Hypothesis). But after that, I don't see many examples of people saying "Gosh, I was wrong, thanks for pointing that out." A rather more common reaction seems to be "Hold on a moment while I move these goalposts" - classic pseudoscience, in other words.

    1. I have seen several instances now (that I'm not personally involved in), where a journal is unwilling or reluctant to publish an unsuccessful replication of a study that was originally published there. This is clearly something that needs to change.

      I like the goalpost quote.

  2. Good stuff. I thought that some of the effect size arguments in the Dijksterhuis piece were somewhat hard to follow (see his page 73). As you note, the point of the section was to try to make a case for why effect sizes should be fairly large in priming studies versus other areas of psychology. But I was pretty much lost in that paragraph.

    1. I don't get the whole "behavioral priming experiments usually use much bolder primes than semantic priming experiments" claim. Is there some way to demonstrate the construct validity of a prime? To be frank, I think the only real evidence for the boldness of a given prime is the size of some of the published effect sizes. Thus, I worry that this kind of claim is empty – it might amount to saying that large effect sizes are large.

    2. Likewise, I did not follow the idea about how stimulus materials in behavioral priming studies are often more motivating for participants than the stimulus materials used by cognitive psychologists. I thought the point was that many behavioral primes were so impressive because they seem so subtle. What is so motivating about sentence scrambling tasks with words like Florida, old, grey, rigid, bitter? Moreover, the point in these studies is often to show that participants are unaware of the primes (again see the walking study). Is the argument then about the DVs in behavioral studies?

    3. Other passages in that piece seemed to point out that behavior is complicated and multiply determined. However, this is exactly the kind of situation that would suggest fairly modest effect sizes, not large effect sizes. If, as he writes, “many (social) psychological phenomena are affected by people’s mood, by atmospherics, time of day, fatigue, motivation, and even the weather” (see his page 73), then why would we ever expect large effect sizes with behavioral primes? Thus, I think this claim about social psychological phenomena is one reason why large effect sizes are often so implausible.

    1. Thanks Brent. I agree with all three of your points here. I also fail to comprehend these arguments. And you're right, what's so exciting about the bingo-unscrambling task?