Wednesday, March 11, 2015

The End-of-Semester Effect Fallacy: Some Thoughts on Many Labs 3

The Many Labs enterprise is on a roll. This week, a manuscript reporting Many Labs 3 materialized on the already invaluable Open Science Framework. The manuscript reports a large-scale investigation, involving 20 American and Canadian research teams, into the “end-of-semester effect.”

The lore among researchers is that subjects run at the end of the semester provide useless data. Effects that are found at the beginning of the semester somehow disappear or become smaller at the end. Often this is attributed to the notion that less-motivated/less-intelligent students procrastinate and postpone participation in experiments until the very last moment. Many Labs 3 notes that there is very little empirical evidence pertaining to the end-of-semester effect.

To address this shortcoming in the literature, Many Labs 3 set out to conduct 10 replications of known effects to examine the end-of-semester effect. Each experiment was performed twice by each of the 20 participating teams: once at the beginning of the semester and once at the end of the semester, each time with different subjects, of course.

It must have been a disappointment to the researchers involved that only 3 of the 10 effects replicated (maybe more about this in a later post) but Many Labs 3 remained undeterred and went ahead to examine the evidence for an end-of-semester effect. Long story short, there was none. Or in the words of the researchers:

It is possible that there are some conditions under which the time of semester impacts observed effects. However, it is unknown whether that impact is ever big enough to be meaningful

This made me wonder about the reasons for expecting an end-of-semester effect in the first place. Isn’t this just a fallacy born out of research practices that most of us now frown upon: running small samples, shelving studies with null effects, and optional stopping?

New projects are usually started at the beginning of a semester. Suppose the first (underpowered) study produces a significant effect. This can happen for several reasons (a small simulation after the list illustrates how reasons 2 and 4 play out):
(1) the effect is genuine;
(2) the researchers stopped collecting data as soon as the effect was significant;
(3) the researchers massaged the data such that the effect was significant;
(4) it was a lucky shot;
(5) any combination of the above.
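
To get a feel for how reasons (2) and (4) conspire against small studies, here is a minimal simulation sketch. The sample sizes, the peeking schedule, and the choice of a two-group t-test are my own assumptions for illustration; nothing here comes from the Many Labs 3 paper.

```python
# Illustration (my assumptions, not Many Labs 3): how often does a small
# two-group study come out "significant" when the true effect is zero,
# (a) with a fixed N per group, and (b) with optional stopping?
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def fixed_n(n=20):
    a, b = rng.standard_normal(n), rng.standard_normal(n)  # true effect = 0
    return stats.ttest_ind(a, b).pvalue < .05

def optional_stopping(start=20, step=10, max_n=50):
    # Reason (2): peek after every batch of subjects, stop as soon as p < .05.
    a, b = rng.standard_normal(start), rng.standard_normal(start)
    while True:
        if stats.ttest_ind(a, b).pvalue < .05:
            return True
        if len(a) >= max_n:
            return False
        a = np.concatenate([a, rng.standard_normal(step)])
        b = np.concatenate([b, rng.standard_normal(step)])

sims = 5000
print("fixed N, false positive rate:          ", np.mean([fixed_n() for _ in range(sims)]))
print("optional stopping, false positive rate:", np.mean([optional_stopping() for _ in range(sims)]))
```

With a fixed N the false positive rate hovers around the nominal .05 (reason 4, the lucky shot); with peeking it is noticeably higher, even before any data massaging (reason 3) enters the picture.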

How the end-of-semester effect might come about
With this shot in the arm, the researchers are motivated to conduct a second study, perhaps with the same N and exclusionary and outlier-removal criteria as the first study but with a somewhat different independent variable. Let’s call it a conceptual replication. If this study, for whatever reason, yields a significant effect, the researchers might congratulate themselves on a job well done and submit the manuscript.

But what if the first study does not produce a significant effect? The authors probably conclude that the idea is not worth pursuing after all, shelve the study, and move on to a new idea. If it’s still early in the semester, they could run a study to test the new idea and the process might repeat itself.

Now let’s assume the second study yields a null effect, certainly not a remote possibility. At this juncture, the authors are the proud owners of a Study 1 with an effect but are saddled with a Study 2 without an effect. How did they get this lemon? Well, of course because of those good-for-nothing numbskulled students who wait until the end of the semester before signing up for an experiment! And thus the “end-of-semester fallacy” is born.
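
And “not a remote possibility” can be made concrete with a back-of-the-envelope calculation. Assume, purely for illustration, that the effect is genuine and each study has 50% power (not unusual for small samples):

```python
# Hypothetical numbers: a real effect studied twice at 50% power per study.
power = 0.5
print("P(Study 2 null | Study 1 significant):", 1 - power)              # 0.5
print("P(significant Study 1 AND null Study 2):", power * (1 - power))  # 0.25
```

Even with a perfectly real effect and identical procedures, the researcher who gets a hit in Study 1 has a coin-flip chance of a null in Study 2. No end-of-semester moderator required.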




Thursday, February 26, 2015

Can We Live without Inferential Statistics?

The journal Basic and Applied Social Psychology (BASP) has taken a resolute and bold step. A recent editorial announces that it has banned the reporting of inferential statistics. F-values, t-values, p-values and the like have all been declared personae non gratae. And so have confidence intervals. Bayes factors are not exactly banned but aren’t welcomed with open arms either; they are eyed with suspicion, like a mysterious traveler in a tavern.

There is a vigorous debate in the scientific literature and in the social media about the pros and cons of Null Hypothesis Significance Testing (NHST), confidence intervals, and Bayesian statistics (making researchers in some frontier towns quite nervous). The editors at BASP have seen enough of this debate and have decided to do away with inferential statistics altogether. Sure, you're allowed to submit a manuscript that’s loaded with p-values and statements about significance or the lack thereof, but they will be rigorously removed, like lice from a schoolchild’s head.

The question is whether we can live with what remains. Can we really conduct science without summary statements? Because what does the journal offer in their place? It requires strong descriptive statistics, distributional information, and larger samples. These are all good things, but we still need a way to summarize our results: not just so we can comprehend and interpret them ourselves and communicate them to others, but also because we need to make decisions based on them as researchers, reviewers, editors, and users. Effect sizes are not banned and so will provide summary information that will be used to answer questions like:
--what will the next experiment be?
--do the findings support the hypothesis?
--has or hasn’t the finding been replicated?
--can I cite finding X as support for theory Y?*

As to that last question, you can hardly cite a result saying “this finding supports (or does not support) the hypothesis, but here are the descriptives.” The reader will want more in the way of a statistical argument or an intersubjective criterion to decide one way or the other. I have no idea how researchers, reviewers, and editors are going to cope with the new freedoms (from inferential statistics) and constraints (from not being able to use inferential statistics). But that’s actually what I like about BASP’s ban. It gives rise to a very interesting real-world experiment in meta-science.
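
What might a results section look like under the new policy? Here is a sketch, with made-up data and my own guess at what would satisfy the editors: descriptives, distributional information, and an effect size, but no t, no p, and no confidence interval.

```python
# Hypothetical BASP-style summary: descriptives per condition plus Cohen's d,
# with no inferential statistics. The data are invented for illustration.
import numpy as np

rng = np.random.default_rng(7)
control = rng.normal(5.0, 1.0, 80)
treatment = rng.normal(5.4, 1.0, 80)

def describe(x):
    return dict(n=len(x), mean=round(x.mean(), 2), sd=round(x.std(ddof=1), 2),
                median=round(float(np.median(x)), 2))

def cohens_d(a, b):
    pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) +
                         (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2))
    return (b.mean() - a.mean()) / pooled_sd

print("control:  ", describe(control))
print("treatment:", describe(treatment))
print("Cohen's d:", round(cohens_d(control, treatment), 2))
```

The reader is then left to decide, without a p-value or interval, whether a d of around 0.4 from 80 subjects per cell “supports the hypothesis”--which is exactly the question the list above raises.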

Sneaky Bayes
There are a lot of unknowns at this point. Can we really live without inferential statistics? Will Bayes sneak in through the half-open door and occupy the premises? Will no one dare to submit to the journal? Will authors balk at having their manuscripts shorn of inferential statistics? Will the interactions among authors, reviewers, and editors yield novel and promising ways of interpreting and communicating scientific results? Will the editors in a few years be BASPing in the glory of their radical decision?  And how will we measure the success of the ban on inferential statistics? The wrong way to go about this would be to see whether the policy will be adopted by other journals or whether or not the impact factor of the journal rises. So how will we determine whether the ban will improve our science?

Questions, questions. But this is why we conduct experiments and this is why BASP's brave decision should be given the benefit of the doubt.

--------
Footnotes

I thank Samantha Bouwmeester and Anita Eerland for feedback on a previous version and Dermot Lynott for the Strider picture.

* Note that I’m not saying: “will the paper be accepted?” or “does the researcher deserve tenure?” 






Wednesday, January 28, 2015

The Dripping Stone Fallacy: Confirmation Bias in the Roman Empire and Beyond



What to do when the crops are failing because of a drought? Why, we persuade the Gods to send rain of course! I'll let the fourth Roman Emperor, Claudius, explain:

[Image: Derek Jacobi stuttering away as Claudius in the TV series I, Claudius]
There is a black stone called the Dripping Stone, captured originally from the Etruscans and stored in a temple of Mars outside the city. We go in solemn procession and fetch it within the walls, where we pour water on it, singing incantations and sacrificing. Rain always follows--unless there has been a slight mistake in the ritual, as is frequently the case.*
                                                                
It sounds an awful lot as if Claudius is weighing in on the replication debate, coming down squarely on the side of replication critics, researchers who raise the specter of hidden moderators as soon as a non-replication materializes. Obviously, when a replication attempt omits a component that is integral to the original study (and was explicitly mentioned in the original paper), that omission borders on scientific malpractice. But hidden moderators are only invoked after the fact--they are "hidden" after all and so could by definition not have been omitted. Hidden moderators are slight mistakes or imperfections in the ritual that are only detected when the ritual does not produce the desired outcome. As Claudius would have us believe, if the ritual is performed correctly, then rain always follows. Similarly, if there are no hidden moderators, then the effect will always occur, so if the effect does not occur, there must have been a hidden moderator.**
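
Spelled out (and looking ahead to the second footnote), the argument is a textbook modus tollens built on a shaky first premise; a minimal formalization, with M standing for "hidden moderator" and E for "the effect occurs":

```latex
% The hidden-moderator (Dripping Stone) argument, formalized
\[
\begin{array}{ll}
\text{Premise 1:} & \neg M \rightarrow E \quad \text{(no hidden moderator} \Rightarrow \text{the effect occurs)}\\
\text{Premise 2:} & \neg E \quad \text{(the effect did not occur in the replication)}\\
\hline
\text{Conclusion:} & M \quad \text{(there must have been a hidden moderator)}
\end{array}
\]
```

The inference is valid, but Premise 1 is false for noisy, probabilistic effects, just as "a correctly performed ritual always brings rain" is false.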

And of course nobody bothers to look for small errors in the ritual when it is raining cats and dogs, or for hidden moderators when p < .05.

I call this the Dripping Stone Fallacy.

Reviewers (and readers) of scientific manuscripts fall prey to a mild(er) version of the Dripping Stone Fallacy. They scrutinize the method and results sections of a paper if they disagree with its conclusions and tend to give these same sections a more cursory treatment if they agree with the conclusions. Someone surely must have investigated this already. If not, it would be rather straightforward to design an experiment and test the hypothesis. One could measure the amount of time spent reading the method section and memory for it in subjects who are known to agree or disagree with the conclusions of an empirical study.

Even the greatest minds fall prey to the Dripping Stone Fallacy. As Raymond Nickerson describes: Louis Pasteur refused to accept or publish results of his experiments that seemed to tell against his position that life did not generate spontaneously, being sufficiently convinced of his hypothesis to consider any experiment that produced counterindicative evidence to be necessarily flawed.

Confirmation bias comes in many guises and the Dripping Stone Fallacy is one of them. It makes a frequent appearance in the replication debate. Granted, the Dripping Stone Fallacy didn't prevent the Romans from conquering half the world but it is likely to be more debilitating to the replication debate.


Footnotes

* Robert Graves, Claudius the God, Penguin Books, 2006, p. 172.
** This is an informal fallacy; it is formally valid (modus tollens) but is based on a false premise.





Sunday, January 18, 2015

When Replicating Stapel is not an Exercise in Futility

Over 50 of Diederik Stapel’s papers have been retracted because of fraud. This means that his “findings” have now ceased to exist in the literature. But what does this mean for his hypotheses?*

Does the fact that Stapel has committed fraud count as evidence against his hypotheses? Our first inclination is perhaps to think yes. In theory, it is possible that Stapel ran a number of studies, never obtained the predicted results, and then decided to take matters into his own hands and tweak a few numbers here and there. If there were evidence of a suppressed string of null results, then yes, this would certainly count as evidence against the hypothesis; it would probably be a waste of time and effort to try to “replicate” the “finding.” Because the finding is not a real finding, the replication is not a real replication. However, by all accounts (including Stapel’s own), once he got going, Stapel didn’t bother to run the actual experiment. He just made up all the data.

This means that Stapel’s fraud has no bearing on his hypotheses. We simply have no empirical data that we can use to evaluate his hypotheses. It is still possible that a hypothesis of his is supported in a proper experiment. Whether or not it makes sense to test that hypothesis is purely a matter of theoretical plausibility. And how do we evaluate replication attempts that were performed before the fraud had come to light? At the time the findings were probably seen as genuine--they were published, after all.

Prior to the exposure of Stapel’s fraudulent activities, Dutch social psychologist Hans IJzerman and some of his colleagues had embarked on a cross-cultural project, involving Brazilian subjects, that built on one of Stapel's findings. They then found out that another researcher in the Netherlands, Nina Regenberg, had already tried—and failed—to replicate these same findings in 9 direct and conceptual replications. As IJzerman and colleagues wryly observe:

At the time, these disconfirmatory findings were seen as ‘failed studies’ that were not worthy of publication. In hindsight, it seems painfully clear that discarding null effects in this manner has hindered scientific progress.

Ironically, the field that made it possible for Stapel to publish his made-up findings also made it impossible to publish failed replications of his work that involved actual findings. 

But the times they are a-changin’. IJzerman and Regenberg joined forces and together with their colleagues Justin Saddlemyer and Sander Koole they have written a paper, currently in press in Acta Psychologica, that reports 12 replications of a—now retracted—series of experiments published by Diederik Stapel and Gün Semin. Semin, of course, was unaware of Stapel's deception.**

Here is the hypothesis that was advanced by Stapel and Semin: priming with abstract linguistic categories (adjectives) should lead to a more abstract perceptual focus, whereas priming with concrete linguistic categories (action verbs) should lead to a more concrete perceptual focus. This linguistic category priming hypothesis is based on the uncontroversial observation that specific linguistic terms are recurrently paired with specific situations. As a result, Stapel and Semin hypothesized, linguistic terms may form associative links with cognitive processes. Because these associative links are stored in memory, they may be activated or “primed” whenever people encounter the relevant linguistic terms.

Stapel and Semin further hypothesized that verbs are associated with actions at a more concrete level than adjectives. A verb like hit is used in a context like Harry is hitting Peter whereas an adjective like aggressive is used in a more abstract description of the situation, as in Harry is being aggressive toward Peter. Because abstract information is more general, it may be associated with global perceptions, whereas concrete information may become associated with local perceptions. So far so good; I bet that many psychologists can follow this reasoning. Due to these associations, Stapel and Semin reason, priming verbs may elicit a focus on local details (i.e., the trees), while priming adjectives may elicit a focus on the global whole (i.e., the forest). This is a bit of a leap for me but let’s follow along.

Stapel and Semin reported four experiments in which they found evidence supporting their hypothesis. Priming with verbs led to more concrete processing than priming with adjectives. But of course these experiments were actually never performed and the findings were fabrications.

Let's look at some real data. Here is IJzerman et al.'s forest plot of the standardized mean difference scores between verb and adjective primes on global vs. local focus in twelve replications of the Stapel and Semin study.
[Figure: forest plot of the standardized mean differences (verb vs. adjective primes, global vs. local focus) for the twelve replication studies]
Of the 12 studies, only one showed a significant effect (and it was not in the predicted direction). Overall, the standardized mean difference between the conditions was practically zero. No shred of support for the linguistic category priming hypothesis, in other words.
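
For readers who want to see the machinery behind such a forest plot, here is a minimal sketch of how a standardized mean difference and an inverse-variance pooled estimate are computed. The numbers are made up; this is my illustration, not IJzerman et al.'s analysis.

```python
# Made-up per-study summaries: (mean, SD, n) for the verb-prime and
# adjective-prime conditions in three hypothetical studies.
import numpy as np

studies = [
    (0.10, 1.0, 40, 0.05, 1.0, 40),
    (-0.05, 0.9, 55, 0.02, 1.1, 50),
    (0.00, 1.2, 35, -0.03, 1.0, 38),
]

def smd_and_variance(m1, s1, n1, m2, s2, n2):
    # Cohen's d with a pooled SD, plus its approximate sampling variance.
    sp = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / sp
    v = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))
    return d, v

ds, vs = zip(*(smd_and_variance(*s) for s in studies))
weights = 1 / np.array(vs)                      # inverse-variance weights
pooled = float(np.sum(weights * np.array(ds)) / np.sum(weights))
print("per-study d:", [round(d, 2) for d in ds])
print("pooled d:   ", round(pooled, 2))
```

A forest plot shows each study’s d with its uncertainty and the pooled estimate at the bottom; in IJzerman et al.’s case that bottom line sits at practically zero.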

Are these findings the death blow (to use the authors’ term) to the notion of linguistic category priming? IJzerman and his colleagues don’t think so. In perhaps a surprise twist, they conclude:

[I]t remains to be seen whether the effect we have investigated does not exist, or whether it depends on identifying the right contexts and measurements for the linguistic category priming effects among Western samples.


My own conclusions are the following.

  1. Replications of findings proven to be fraudulent are important. Without replications, the status of the hypotheses remains unclear. After all, the findings were previously deemed publishable by peer reviewers, presumably based in part on theoretical considerations. Without relevant empirical data, the area of research will remain tainted and researchers will steer clear of it. While this may not be bad in some cases, it might be bad in others.
  2. The Pottery Barn rule should hold in scientific publishing: you break it, you buy it. If you published fraudulent findings, you should also publish their nonreplications. Many journals do not adhere to this rule. Sander Koole informed me that the Journal of Personality and Social Psychology (JPSP) congratulated IJzerman and colleagues on their replication attempts but rejected their manuscript nonetheless, even though they had previously published the Stapel and Semin paper. It is a good thing the editors at Acta Psychologica have taken a more progressive stance on publishing failed replications.***
  3. It is a good sign that the climate for the publication of failed replications is improving somewhat. Dylan’s right, the times are a-changin'. I am glad that the authors persevered and that their work is seeing the light of day.




*     I thank Hans IJzerman and Sander Koole for feedback on a previous version of this post. 
**   Semin was the doctoral advisor of both IJzerman and Regenberg and was initially involved in the replication attempts but let his former students use the data.
*** Until January 2014 I was Editor-in-Chief at Acta Psychologica. I was not involved in the handling of the IJzerman et al. paper and am therefore not patting myself on the back.


Friday, October 24, 2014

ROCing the Boat: When Replication Hurts

Though failure to replicate presents a serious problem, even highly-replicable results may be consistently and dramatically misinterpreted if dependent measures are not carefully chosen. This sentence comes from a new paper by Caren Rotello, Evan Heit, and Chad Dubé to be published in Psychonomic Bulletin & Review. 

Replication hurts in such cases because it reinforces artifactual results. Rotello and colleagues marshal support for this claim from four disparate domains: eyewitness memory, deductive reasoning, social psychology, and studies of child welfare. In each of these domains researchers make the same mistake by using the same wrong dependent measure.

Common across these domains is that subjects have to make detection judgments: was something present or was it not present? For example, subjects in eyewitness memory experiments decide whether or not the suspect is in a lineup. There are four possibilities.
             Hit: The subject responds “yes” and the suspect is in the lineup.
             False alarm: The subject responds “yes” but the suspect is not in the lineup.
             Miss: The subject responds “no” but the suspect is in the lineup.
             Correct rejection: The subject responds “no” and the suspect is not in the lineup.

It is sufficient to only take the positive responses, hits and false alarms, into account if we want to determine decision accuracy (the negative responses are complementary to the positive ones). But the question is how we compute accuracy from hits and false alarms. And this is where Rotello and colleagues say that the literature has gone astray.

To see why, let’s continue with the lineup example. Lineups can be presented simultaneously (all faces at the same time) or sequentially (one face at a time). A meta-analysis involving data from 23 labs and 13,143 participants concludes that sequential lineups are superior to simultaneous ones. Sequential lineups yield a diagnosticity ratio of 7.72 and simultaneous ones only 5.78; in other words, sequential lineups are 1.34 (7.72/5.78) times more accurate than simultaneous ones. Rotello and colleagues mention that 32% of police precincts in the United States now use sequential lineups. They don’t state explicitly that this is because of the research, but this is what they imply.

The diagnosticity ratio is computed by dividing the number of hits by the number of false alarms. Therefore, the higher the ratio, the better the detection rate. So the notion of sequential superiority rides on the assumption that the diagnosticity ratio is an appropriate measure of diagnosticity. Well, you might think, it has the word diagnosticity in it, so that’s at least a start. But as Rotello and colleagues demonstrate, this may be all it has going for it.

If you compute the ratio of hits and false alarms (or the difference between them, as is often done), you’re assuming a linear relation. The straight lines in Figure 1 connect all the hypothetical subjects who have the same diagnosticity ratio. So the lowest line here connects the subjects who are at chance performance, and thus have a diagnosticity ratio of 1 (# hits = # false alarms). The important point to note is that you get this ratio for a conservative responder with 5% hits and 5% false alarms but also for a liberal responder with 75% hits and 75% false alarms.

The lines in the figure are called Receiver Operating Characteristics (ROC). (So now you know what that ROC is doing in the title of this post.) ROC is a concept that was developed by engineers in World War II who were trying to improve ways to detect enemy objects in battlefields and then was introduced to the field of psychophysics. 

Now let’s look at some real data. The triangles in the figure represent data from an actual experiment (by Laura Mickes, Heather Flowe, and John Wixted) comparing simultaneous (open triangles) and sequential (closed triangles) lineups. Every point on these lines reflects the same accuracy but a different tendency to respond “yes.” The lines that you can fit through these data points will be curved. Rotello and colleagues note that curved ROCs are consistent with the empirical reality and the straight lines assumed by the diagnosticity ratio are not.

Several large-scale studies have used ROCs rather than diagnosticity and found no evidence whatsoever for a sequential superiority effect in lineups. In fact, all of these studies found the opposite pattern: simultaneous was superior to sequential. So what is the problem with the diagnosticity ratio? As you might have guessed by now, it is that it does not control for response bias. Witnesses presented with a sequential lineup are just less likely to respond “yes I recognize the suspect” than witnesses presented with a simultaneous lineup. ROCs based on empirical data unconfound accuracy with response bias and show a simultaneous superiority effect.
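
To see the confound in miniature, here is a small signal-detection sketch with numbers I made up (they are not Rotello et al.'s or Mickes et al.'s data): two witness populations with identical sensitivity (equal-variance d′) but different response criteria.

```python
# Same sensitivity (d'), different willingness to say "yes": the
# diagnosticity ratio differs, even though accuracy (d') is identical.
from scipy.stats import norm

def rates(d_prime, criterion):
    # Equal-variance Gaussian model: respond "yes" when evidence > criterion.
    hit_rate = norm.sf(criterion - d_prime)
    false_alarm_rate = norm.sf(criterion)
    return hit_rate, false_alarm_rate

for label, criterion in [("liberal responder     ", 0.5),
                         ("conservative responder", 1.5)]:
    h, f = rates(d_prime=1.0, criterion=criterion)
    print(f"{label}  hits={h:.2f}  FAs={f:.2f}  "
          f"diagnosticity={h / f:.2f}  d'={norm.ppf(h) - norm.ppf(f):.2f}")
```

With these invented numbers the conservative responder’s diagnosticity ratio is roughly twice the liberal responder’s, even though both have exactly the same d′. If sequential lineups simply make witnesses more conservative, a higher diagnosticity ratio is exactly what you would expect without any gain in accuracy.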

Rotello and colleagues demonstrate convincingly that this same problem bedevils the other areas of research I mentioned at the beginning of this post but the broader point is clear. As they put it: This problem – of dramatically and consistently 'getting it wrong' – is potentially a bigger problem for psychologists than the replication crisis, because the errors can easily go undetected for long periods of time. Unless we are using the proper dependent measure, replications are even going to aggravate the problem by enshrining artifactual findings in the literature (all the examples discussed in the article are “textbook effects”). To use another military reference: in such cases massive replications will produce what in polite company is called a Charlie Foxtrot.

Rotello and colleagues conclude by considering the consequences of their analysis for ongoing replication efforts such as the Reproducibility Project and the first Registered Replication Report on verbal overshadowing that we are all so proud of. They refer to a submitted paper that argues the basic task in the verbal overshadowing experiment is flawed because it lacks a condition in which the perpetrator is not in the lineup. I haven’t read this study yet and so can’t say anything about it, but it sure will make for a great topic for a future post (although I’m already wondering whether I should start hiding under a ROC).

Rotello and colleagues have produced an illuminating analysis that invites us once more to consider how valid our replication attempts are. Last year, I had an enjoyable blog discussion about this very topic with Dan Simons; it even uses the verbal overshadowing project as an example. Here is a page with links to this diablog.


I thank Evan Heit for alerting me to the article and for feedback on a previous draft of this post.

The Diablog on Replication with Dan Simons

Last year, I had a very informative and enjoyable blog dialogue, or diablog, with Dan Simons about the reliability and validity of replication attempts. Unfortunately, there was never an easy way for anyone to access this diablog. It has only occurred to me today (!) that I could remedy this situation by creating a meta-post. Here it is.

In my first post on the topic, I argued that it is important to consider not only the reliability but also the validity of replication attempts because it might be problematic if we try to replicate a flawed experiment.

Dan Simons responded to this, arguing that deviations from the original experiment, while interesting, would not allow us to determine the reliability of the original finding.

I then had some more thoughts.

To which Dan wrote another constructive response.

My final point was that direct replications should be augmented with systematic variations of the original experiment.