Thursday, February 26, 2015

Can we Live without Inferential Statistics?

The journal Basic and Applied Social Psychology (BASP) has taken a resolute and bold step. A recent editorial announces that it has banned the reporting of inferential statistics. F-values, t-values, p-values and the like have all been declared personae non gratae. And so have confidence intervals. Bayes factors are not exactly banned but aren’t welcomed with open arms either; they are eyed with suspicion, like a mysterious traveler in a tavern.

There is a vigorous debate in the scientific literature and in the social media about the pros and cons of Null Hypothesis Significance Testing (NHST), confidence intervals, and Bayesian statistics (making researchers in some frontier towns quite nervous). The editors at BASP have seen enough of this debate and have decided to do away with inferential statistics altogether. Sure, you're allowed to submit a manuscript that’s loaded with p-values and statements about significance or the lack thereof, but they will be rigorously removed, like lice from a schoolchild’s head.

The question is whether we can live with what remains. Can we really conduct science without summary statements? Because what does the journal offer in their place? It requires strong descriptive statistics, distributional information, and more power. These are all good things, but we need a way to summarize our results: not just so we can comprehend and interpret them better ourselves, and not just because we need to communicate them, but also because we need to make decisions based on them as researchers, reviewers, editors, and users. Effect sizes are not banned and so will provide summary information that will be used to answer questions like:
--what will the next experiment be?
--do the findings support the hypothesis?
--has or hasn’t the finding been replicated?
--can I cite finding X as support for theory Y?*

As to that last question, you can hardly cite a result saying “This finding supports or does not support the hypothesis, but here are the descriptives.” The reader will want more in the way of a statistical argument or an intersubjective criterion to decide one way or the other. I have no idea how researchers, reviewers, and editors are going to cope with the new freedoms (from inferential statistics) and constraints (from not being able to use inferential statistics). But that’s actually what I like about BASP's ban. It gives rise to a very interesting real-world experiment in meta-science.

Sneaky Bayes
There are a lot of unknowns at this point. Can we really live without inferential statistics? Will Bayes sneak in through the half-open door and occupy the premises? Will no one dare to submit to the journal? Will authors balk at having their manuscripts shorn of inferential statistics? Will the interactions among authors, reviewers, and editors yield novel and promising ways of interpreting and communicating scientific results? Will the editors in a few years be BASPing in the glory of their radical decision?  And how will we measure the success of the ban on inferential statistics? The wrong way to go about this would be to see whether the policy will be adopted by other journals or whether or not the impact factor of the journal rises. So how will we determine whether the ban will improve our science?

Questions, questions. But this is why we conduct experiments and this is why BASP's brave decision should be given the benefit of the doubt.


I thank Samantha Bouwmeester and Anita Eerland for feedback on a previous version and Dermot Lynott for the Strider picture.

* Note that I’m not saying: “will the paper be accepted?” or “does the researcher deserve tenure?” 

Wednesday, January 28, 2015

The Dripping Stone Fallacy: Confirmation Bias in the Roman Empire and Beyond

What to do when the crops are failing because of a drought? Why, we persuade the Gods to send rain of course! I'll let the fourth Roman Emperor, Claudius, explain:

Derek Jacobi stuttering away as 
Claudius in the TV series I Claudius
There is a black stone called the Dripping Stone, captured originally from the Etruscans and stored in a temple of Mars outside the city. We go in solemn procession and fetch it within the walls, where we pour water on it, singing incantations and sacrificing. Rain always follows--unless there has been a slight mistake in the ritual, as is frequently the case.*
It sounds an awful lot as if Claudius is weighing in on the replication debate, coming down squarely on the side of replication critics, researchers who raise the specter of hidden moderators as soon as a non-replication materializes. Obviously, when a replication attempt omits a component that is integral to the original study (and was explicitly mentioned in the original paper), the omission borders on scientific malpractice. But hidden moderators are only invoked after the fact--they are "hidden" after all and so could by definition not have been omitted. Hidden moderators are slight mistakes or imperfections in the ritual that are only detected when the ritual does not produce the desired outcome. As Claudius would have us believe, if the ritual is performed correctly, then rain always follows. Similarly, if there are no hidden moderators, then the effect will always occur, so if the effect does not occur, there must have been a hidden moderator.**

And of course nobody bothers to look for small errors in the ritual when it is raining cats and dogs, or for hidden moderators when p < .05.

I call this the Dripping Stone Fallacy.

Reviewers (and readers) of scientific manuscripts fall prey to a mild(er) version of the Dripping Stone Fallacy. They scrutinize the method and results sections of a paper if they disagree with its conclusions and tend to give these same sections a more cursory treatment if they agree with the conclusions. Someone surely must have investigated this already. If not, it would be rather straightforward to design an experiment and test the hypothesis. One could measure the amount of time spent reading the method section and memory for it in subjects who are known to agree or disagree with the conclusions of an empirical study.

Even the greatest minds fall prey to the Dripping Stone Fallacy. As Raymond Nickerson describes: Louis Pasteur refused to accept or publish results of his experiments that seemed to tell against his position that life did not generate spontaneously, being sufficiently convinced of his hypothesis to consider any experiment that produced counterindicative evidence to be necessarily flawed.

Confirmation bias comes in many guises and the Dripping Stone Fallacy is one of them. It makes a frequent appearance in the replication debate. Granted, the Dripping Stone Fallacy didn't prevent the Romans from conquering half the world but it is likely to be more debilitating to the replication debate.


* Robert Graves, Claudius the God, Penguin Books, 2006, p. 172.
** This is an informal fallacy; the argument is formally valid (modus tollens) but is based on a false premise.

Sunday, January 18, 2015

When Replicating Stapel is not an Exercise in Futility

Over 50 of Diederik Stapel’s papers have been retracted because of fraud. This means that his “findings” have now ceased to exist in the literature. But what does this mean for his hypotheses?*

Does the fact that Stapel has committed fraud count as evidence against his hypotheses? Our first inclination is perhaps to think yes. In theory, it is possible that Stapel ran a number of studies, never obtained the predicted results, and then decided to take matters into his own hands and tweak a few numbers here and there. If there were evidence of a suppressed string of null results, then yes, this would certainly count as evidence against the hypothesis; it would probably be a waste of time and effort to try to “replicate” the “finding.” Because the finding is not a real finding, the replication is not a real replication. However, by all accounts (including Stapel’s own), once he got going, Stapel didn’t bother to run the actual experiment. He just made up all the data.

This means that Stapel’s fraud has no bearing on his hypotheses. We simply have no empirical data that we can use to evaluate his hypotheses. It is still possible that a hypothesis of his is supported in a proper experiment. Whether or not it makes sense to test that hypothesis is purely a matter of theoretical plausibility. And how do we evaluate replication attempts that were performed before the fraud had come to light? At the time, the findings were probably seen as genuine--they were published, after all.

Prior to the exposure of Stapel’s fraudulent activities, Dutch social psychologist Hans IJzerman and some of his colleagues had embarked on a cross-cultural project, involving Brazilian subjects, that built on one of Stapel's findings. They then found out that another researcher in the Netherlands, Nina Regenberg, had already tried—and failed—to replicate these same findings in 9 direct and conceptual replications. As IJzerman and colleagues wryly observe:

At the time, these disconfirmatory findings were seen as ‘failed studies’ that were not worthy of publication. In hindsight, it seems painfully clear that discarding null effects in this manner has hindered scientific progress.

Ironically, the field that made it possible for Stapel to publish his made-up findings also made it impossible to publish failed replications of his work that involved actual findings. 

But the times they are a-changin’. IJzerman and Regenberg joined forces and together with their colleagues Justin Saddlemyer and Sander Koole they have written a paper, currently in press in Acta Psychologica, that reports 12 replications of a—now retracted—series of experiments published by Diederik Stapel and Gün Semin. Semin, of course, was unaware of Stapel's deception.**

Here is the hypothesis that was advanced by Stapel and Semin: priming with abstract linguistic categories (adjectives) should lead to a more abstract perceptual focus, whereas priming with concrete linguistic categories (action verbs) should lead to a more concrete perceptual focus. This linguistic category priming hypothesis is based on the uncontroversial observation that specific linguistic terms are recurrently paired with specific situations. As a result, Stapel and Semin hypothesized, linguistic terms may form associative links with cognitive processes. Because these associative links are stored in memory, they may be activated or “primed” whenever people encounter the relevant linguistic terms.

Stapel and Semin further hypothesized that verbs are associated with actions at a more concrete level than adjectives. A verb like hit is used in a context like Harry is hitting Peter, whereas an adjective like aggressive is used in a more abstract description of the situation, as in Harry is being aggressive toward Peter. Because abstract information is more general, it may be associated with global perceptions, whereas concrete information may become associated with local perceptions. So far so good; I bet that many psychologists can follow this reasoning. Due to these associations, Stapel and Semin reason, priming verbs may elicit a focus on local details (i.e., the trees), while priming adjectives may elicit a focus on the global whole (i.e., the forest). This is a bit of a leap for me but let’s follow along.

Stapel and Semin reported four experiments in which they found evidence supporting their hypothesis. Priming with verbs led to more concrete processing than priming with adjectives. But of course these experiments were actually never performed and the findings were fabrications.

Let's look at some real data. Here is IJzerman et al.'s forest plot of the standardized mean difference scores between verb and adjective primes on global vs. local focus in twelve replications of the Stapel and Semin study.

Of the 12 studies, only one showed a significant effect (and it was not in the predicted direction). Overall, the standardized mean difference between the condition was practically zero. No shred of support for the linguistic category priming hypothesis, in other words.
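As an aside, the overall estimate beneath a forest plot like this is typically an inverse-variance-weighted average of the per-study standardized mean differences. Here is a minimal sketch of that computation; the effect sizes and variances below are invented for illustration and are not IJzerman et al.'s actual data:

```python
def pooled_smd(effects):
    """Fixed-effect meta-analytic estimate: inverse-variance-weighted
    average of (standardized mean difference, variance) pairs."""
    weights = [1.0 / var for _, var in effects]
    weighted_sum = sum(w * d for (d, _), w in zip(effects, weights))
    return weighted_sum / sum(weights)

# Hypothetical per-study (d, variance) pairs: small effects scattered
# around zero, as in a set of null replications.
studies = [(0.10, 0.04), (-0.12, 0.05), (0.03, 0.03),
           (-0.05, 0.06), (0.02, 0.04)]

print(round(pooled_smd(studies), 3))  # close to zero
```

Studies with smaller variances (larger samples) get more weight, which is why a pooled estimate is more informative than eyeballing the individual studies.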

Are these findings the death blow (to use the authors’ term) to the notion of linguistic category priming? IJzerman and his colleagues don’t think so. In perhaps a surprise twist, they conclude:

[I]t remains to be seen whether the effect we have investigated does not exist, or whether it depends on identifying the right contexts and measurements for the linguistic category priming effects among Western samples.

My own conclusions are the following.

  1. Replications of findings proven to be fraudulent are important. Without replications, the status of the hypotheses remains unclear. After all, the findings were previously deemed publishable by peer reviewers, presumably based in part on theoretical considerations. Without relevant empirical data, the area of research will remain tainted and researchers will steer clear of it. While this may not be bad in some cases, it might be bad in others.
  2. The Pottery Barn rule should hold in scientific publishing: you break it, you buy it. If you published fraudulent findings, you should also publish their nonreplications. Many journals do not adhere to this rule. Sander Koole informed me that the Journal of Personality and Social Psychology (JPSP) congratulated IJzerman and colleagues on their replication attempts but rejected their manuscript nonetheless, even though they had previously published the Stapel and Semin paper. It is a good thing the editors at Acta Psychologica have taken a more progressive stance on publishing failed replications.***
  3. It is a good sign that the climate for the publication of failed replications is improving somewhat. Dylan’s right, the times are a-changin'. I am glad that the authors persevered and that their work is seeing the light of day.

*     I thank Hans IJzerman and Sander Koole for feedback on a previous version of this post. 
**   Semin was the doctoral advisor of both IJzerman and Regenberg and was initially involved in the replication attempts but let his former students use the data.
*** Until January 2014 I was Editor-in-Chief at Acta Psychologica. I was not involved in the handling of the IJzerman et al. paper and am therefore not patting myself on the back.

Friday, October 24, 2014

ROCing the Boat: When Replication Hurts

“Though failure to replicate presents a serious problem, even highly-replicable results may be consistently and dramatically misinterpreted if dependent measures are not carefully chosen.” This sentence comes from a new paper by Caren Rotello, Evan Heit, and Chad Dubé, to be published in Psychonomic Bulletin & Review.

Replication hurts in such cases because it reinforces artifactual results. Rotello and colleagues marshal support for this claim from four disparate domains: eyewitness memory, deductive reasoning, social psychology, and studies of child welfare. In each of these domains researchers make the same mistake by using the same wrong dependent measure.

Common across these domains is that subjects have to make detection judgments: was something present or was it not present? For example, subjects in eyewitness memory experiments decide whether or not the suspect is in a lineup. There are four possibilities.
    Hit: The subject responds “yes” and the suspect is in the lineup.
    False alarm: The subject responds “yes” but the suspect is not in the lineup.
    Miss: The subject responds “no” but the suspect is in the lineup.
    Correct rejection: The subject responds “no” and the suspect is not in the lineup.

It is sufficient to only take the positive responses, hits and false alarms, into account if we want to determine decision accuracy (the negative responses are complementary to the positive ones). But the question is how we compute accuracy from hits and false alarms. And this is where Rotello and colleagues say that the literature has gone astray.

To see why, let’s continue with the lineup example. Lineups can be presented simultaneously (all faces at the same time) or sequentially (one face at a time). A meta-analysis involving data from 23 labs involving 13,143 participants concludes that sequential lineups are superior to simultaneous ones. Sequential lineups yield a 7.72 diagnosticity ratio and simultaneous ones only 5.78; in other words, sequential lineups are 1.34 (7.72/5.78) times more accurate than simultaneous ones. Rotello and colleagues mention that 32% of police precincts in the United States now use sequential lineups. They don’t state explicitly that this is because of the research but this is what they imply.

The diagnosticity ratio is computed by dividing the hit rate by the false-alarm rate. Therefore, the higher the ratio, the better the detection rate. So the notion of sequential superiority rides on the assumption that the diagnosticity ratio is an appropriate measure of diagnosticity. Well, you might think, it has the word diagnosticity in it, so that’s at least a start. But as Rotello and colleagues demonstrate, this may be all that it has going for it.

If you compute the ratio of hits and false alarms (or the difference between them, as is often done), you’re assuming a linear relation. The straight lines in Figure 1 connect all the hypothetical subjects who have the same diagnosticity ratio. So the lowest line here connects the subjects who are at chance performance, and thus have a diagnosticity ratio of 1 (# hits = # false alarms). The important point to note is that you get this ratio for a conservative responder with 5% hits and 5% false alarms but also for a liberal responder with 75% hits and 75% false alarms.
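To make the bias problem concrete, here is a minimal sketch in Python. The hit and false-alarm rates are invented for illustration; d' (sensitivity) from signal detection theory is shown as one standard bias-free alternative, computed with the stdlib NormalDist:

```python
from statistics import NormalDist

def diagnosticity_ratio(hit_rate, fa_rate):
    """Hit rate divided by false-alarm rate."""
    return hit_rate / fa_rate

def d_prime(hit_rate, fa_rate):
    """Signal-detection sensitivity: z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# Two chance-level responders: identical ratio (1.0) despite very
# different tendencies to say "yes."
print(diagnosticity_ratio(0.05, 0.05))  # 1.0 (conservative responder)
print(diagnosticity_ratio(0.75, 0.75))  # 1.0 (liberal responder)

# Conversely, two responders with the SAME sensitivity (d' = 1.0) but
# different response bias get very different diagnosticity ratios:
conservative = (0.50, 0.1587)   # (hit rate, false-alarm rate)
liberal = (0.8413, 0.50)
for hit, fa in (conservative, liberal):
    print(round(d_prime(hit, fa), 2), round(diagnosticity_ratio(hit, fa), 2))
```

The second pair of responders discriminates equally well, yet the conservative one looks almost twice as "diagnostic" by the ratio — exactly the confound Rotello and colleagues describe.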

The lines in the figure are called Receiver Operating Characteristics (ROC). (So now you know what that ROC is doing in the title of this post.) ROC is a concept that was developed by engineers in World War II who were trying to improve ways to detect enemy objects in battlefields and then was introduced to the field of psychophysics. 

Now let’s look at some real data. The triangles in the figure represent data from an actual experiment (by Laura Mickes, Heather Flowe, and John Wixted) comparing simultaneous (open triangles) and sequential (closed triangles) lineups. The lines that you can fit through these data points will be curved; every point on such a curve reflects the same accuracy but a different tendency to respond “yes.” Rotello and colleagues note that curved ROCs are consistent with the empirical reality, whereas the straight lines assumed by the diagnosticity ratio are not.

Several large-scale studies have used ROCs rather than diagnosticity and found no evidence whatsoever for a sequential superiority effect in lineups. In fact, all of these studies found the opposite pattern: simultaneous was superior to sequential. So what is the problem with the diagnosticity ratio? As you might have guessed by now, it is that it does not control for response bias. Witnesses presented with a sequential lineup are just less likely to respond “yes I recognize the suspect” than witnesses presented with a simultaneous lineup. ROCs based on empirical data disentangle accuracy from response bias and show a simultaneous superiority effect.

Rotello and colleagues demonstrate convincingly that this same problem bedevils the other areas of research I mentioned at the beginning of this post but the broader point is clear. As they put it: This problem – of dramatically and consistently 'getting it wrong' – is potentially a bigger problem for psychologists than the replication crisis, because the errors can easily go undetected for long periods of time. Unless we are using the proper dependent measure, replications are even going to aggravate the problem by enshrining artifactual findings in the literature (all the examples discussed in the article are “textbook effects”). To use another military reference: in such cases massive replications will produce what in polite company is called a Charlie Foxtrot.

Rotello and colleagues conclude by considering the consequences of their analysis for ongoing replication efforts such as the Reproducibility Project and the first Registered Replication Report on verbal overshadowing that we are all so proud of. They refer to a submitted paper that argues that the basic task in the verbal overshadowing experiment is flawed because it lacks a condition in which the perpetrator is not in the lineup. I haven’t read this study yet and so can’t say anything about it, but it sure will make for a great topic for a future post (although I’m already wondering whether I should start hiding under a ROC).

Rotello and colleagues have produced an illuminating analysis that invites us once more to consider how valid our replication attempts are. Last year, I had an enjoyable blog discussion about this very topic with Dan Simons; it even uses the verbal overshadowing project as an example. Here is a page with links to this diablog.

I thank Evan Heit for alerting me to the article and for feedback on a previous draft of this post.

The Diablog on Replication with Dan Simons

Last year, I had a very informative and enjoyable blog dialogue, or diablog, with Dan Simons about the reliability and validity of replication attempts. Unfortunately, there was never an easy way for anyone to access this diablog. It has only occurred to me today (!) that I could remedy this situation by creating a meta-post. Here it is.

In my first post on the topic, I argued that it is important to consider not only the reliability but also the validity of replication attempts, because it might be problematic if we try to replicate a flawed experiment.

Dan Simons responded to this, arguing that deviations from the original experiment, while interesting, would not allow us to determine the reliability of the original finding.

I then had some more thoughts.

To which Dan wrote another constructive response.

My final point was that direct replications should be augmented with systematic variations of the original experiment.

Thursday, September 18, 2014

Verbal Overshadowing: What Can we Learn from the First APS Registered Replication Report?

Suppose you witnessed a heinous crime being committed right before your eyes. Suppose further that a few hours later, you’re being interrogated by hard-nosed detectives Olivia Benson and Odafin Tutuola. They ask you to describe the perpetrator. The next day, they call you in to the police station and present you with a lineup. Suppose the suspect is in the lineup. Will you be able to pick him out? A classic study in psychology suggests Benson and Tutuola have made a mistake by first having you describe the perpetrator, because the very act of describing the perpetrator will make it more difficult for you to pick him out of the lineup.

This finding is known as the verbal overshadowing effect and was discovered by Jonathan Schooler. In the experiment that is of interest here, he and his co-author, Tonya Engstler-Schooler, found that verbally describing the perpetrator led to a 25% accuracy decrease in identifying him. This is a sizeable difference with practical implications. Based on these findings, we’d be right to tell Benson and Tutuola to lay off interviewing you until after the lineup identification.

Here is how the experiment worked.


Subjects first watched a 44-second video clip of a (staged) bank robbery. Then they performed a filler task for 20 minutes, after which they either wrote down a description of the robber (experimental condition) or listed names of US states and their capitals (control condition). After 5 minutes, they performed the lineup identification task.

How reliable is the verbal-overshadowing effect? That is the question that concerns us here. A 25% drop in accuracy seems considerable. Schooler himself observed that subsequent research yielded progressively smaller effects, something he referred to as “the decline effect.” This clever move created a win-win situation for him. If the original finding replicates, the verbal overshadowing hypothesis is supported. If it doesn’t, then the decline effect hypothesis is supported.

The verbal overshadowing effect is the target of the first massive Registered Replication Report under the direction of Dan Simons (Alex Holcombe is leading the charge on the second project) that was just published. Thirty-one labs were involved in direct replications of the verbal overshadowing experiment I just described. Our lab was one of the 31. Due to the large number of participating labs and the laws of the alphabet, my curriculum vitae now boasts an article on which I am 92nd author.

Due to an error in the protocol, the initial replication attempt had the description task and a filler task in the wrong order before the lineup task, which made the first set of replications, RRR1, a fairly direct replication of Schooler’s Experiment 4 rather than, as was the plan, his Experiment 1. A second set of experiments, RRR2, was performed to replicate Schooler’s Experiment 1. You see the alternative ordering here.

In Experiment 4, Schooler found that subjects in the verbal description condition were 22% less accurate than those in the control condition. A meta-analysis of the RRR1 experiments yielded a considerably smaller, but still significant, 4% deficit. Of note is that all the replication studies found a smaller effect than the original study but that study was also less precise due to having a smaller sample size.

Before I tell you about the results of the replication experiments I have a confession to make. I have always considered the concept of verbal overshadowing plausible, even though I might have a somewhat different explanation for it than Schooler (more about this maybe in a later post), but I thought the experiment we were going to replicate was rather weak. I had no confidence that we would find the effect. And indeed, in our lab, we did not obtain the effect. You might argue that this null effect was caused by the contagious skepticism I must have been oozing. But I did not run the experiment. In fact, I did not even interact about the experiment with the research assistant who ran it (no wonder I’m 92nd author on the paper!). So the experiment was well-insulated from my skepticism.

Let's get back on track. In Experiment 1, Schooler found a 25% deficit. The meta-analysis of RRR2 yielded a 16% deficit-- somewhat smaller but still in the same ballpark. Verbal overshadowing appears to be a robust effect. Also interesting is the finding that the position of the filler task in the sequence mattered. The verbal overshadowing effect is larger when the lineup identification immediately follows the description and when there is more time between the video and the description. In fact either of those or a combination of them could be responsible for this difference in effect sizes.

Here are the main points I take away from this massive replication effort.

1. Our intuitions about effects may not be as good as we think. My intuitions were wrong because a meta-analysis of all the experiments finds strong support for the effect. Maybe I’m just a particularly ill-calibrated individual or an overly pessimistic worrywart but I doubt it. For one, I was right about our own experiment, which didn’t find the effect. At the same time, I was clearly wrong about the overall effect. This brings me to the second point.

2. One experiment does not an effect make (or break).  This goes both for the original experiment, which did find a big effect, as for our replication attempt (and 30 others). One experiment that shows an effect doesn’t mean much, and neither does one unsuccessful replication. We already knew this, of course, but the RRR drives this point home nicely.

3. RRRs are very useful for estimating effect sizes without having to worry about publication bias. But it should be noted that they are very costly. Using 31 labs was probably overkill, although it was nice to see all the enthusiasm for a replication project.

4. More power is better. As the article notes about the smaller effect in RRR1: “In fact, all of the confidence intervals for the individual replications in RRR1 included 0. Had we simply tallied the number of studies providing clear evidence for an effect […], we would have concluded in favor of a robust failure to replicate—a misleading conclusion. Moreover, our understanding of the size of the effect would not have improved."

5. Replicating an effect against your expectations is a joyous experience.  This sounds kind of sappy but it’s an accurate description of my feelings when I was told by Dan Simons about the outcome of the meta-analyses. Maybe I was biased because I liked the notion of verbal overshadowing but it is rewarding to see an effect materialize in a meta-analysis. It's a nice example of “replicating up.”
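The danger of tallying significant studies, as opposed to pooling them, can be sketched numerically. The rates and sample sizes below are invented for illustration, not the actual RRR data: suppose ten labs each test 200 subjects per condition, and the true accuracy rates are 54% (control) versus 50% (verbal description) — a small 4% deficit.

```python
from math import sqrt

def risk_diff_ci(p1, n1, p2, n2, z=1.96):
    """Normal-approximation 95% confidence interval for a
    difference in proportions (p1 - p2)."""
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff - z * se, diff + z * se

# A single lab (200 per condition): the CI comfortably includes 0,
# so a tally of significant studies counts it as "no effect."
lo, hi = risk_diff_ci(0.54, 200, 0.50, 200)
print(lo < 0 < hi)  # True

# Pooling all ten labs (2000 per condition): the CI excludes 0,
# so the meta-analytic estimate detects the small deficit.
lo, hi = risk_diff_ci(0.54, 2000, 0.50, 2000)
print(lo > 0)  # True
```

Every individual study is "nonsignificant," yet the pooled estimate is not — which is precisely why vote counting across the RRR1 studies would have been misleading.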

Where do we go from here? Now that we have a handle on the effect, it would be useful to perform coordinated and preregistered conceptual replications (using different stimuli, different situations, different tasks). I'd be happy to think along with anyone interested in such a project.

Update September 24, 2014. The post is the topic of a discussion on Reddit.