Saturday, March 4, 2017

The value of experience in criticizing research

It's becoming a trend: another guest blog post. This time, J.P. de Ruiter presents his view, which I happen to share, on the value of experience in criticizing research.

J.P. de Ruiter
Tufts University

One of the reasons that the scientific method was such a brilliant idea is that it has criticism built into the process. We don’t believe something on the basis of authority; we need to be convinced by relevant data and sound arguments, and if we think that either the data or the argument is flawed, we say so. Before a study is conducted, this criticism is usually provided by colleagues or, in the case of preregistration, reviewers. After a study is submitted, critical evaluations are performed by reviewers and editors. But even after publication, the criticism continues, in the form of discussions in follow-up articles, at conferences, and/or on social media. This self-corrective aspect of science is essential; hence criticism, even though at times it can be difficult to swallow (we are all human), is a very good thing.

We often think of criticism as pointing out flaws in the data collection, statistical analyses, and argumentation of a study. In methods education, we train our students to become aware of the pitfalls of research. We teach them about assumptions, significance, power, interpretation of data, experimenter expectancy effects, Bonferroni corrections, optional stopping, etc. This type of training leads young researchers to become very adept at finding flaws in studies, and that is a valuable skill to have.

While I appreciate that noticing and formulating the flaws and weaknesses in other people’s studies is a necessary skill for becoming a good critic (or reviewer), it is in my view not sufficient. It is very easy to find flaws in any study, no matter how well it is done. We can always point out alternative explanations for the findings, note that the data sample was not representative, or state that the study needs more power. Always. So pointing out why a study is not perfect is not enough: good criticism takes into account that research always involves a trade-off between validity and practicality. 

As a hypothetical example: if we review a study about a relatively rare type of aphasia, and notice that the authors have studied 7 patients, we could point out that a) in order to generalize their findings, they need inferential statistics, and b) in order to do that, given the estimated effect size at hand, they’d need at least 80 patients. We could, but we probably wouldn’t, because we would realize that it was probably hard enough to find 7 patients with this affliction to begin with, so finding 80 is probably impossible. So then we’d probably focus on other aspects of the study. We of course do keep in mind that we can’t generalize from the results of the study with the same level of confidence as in a lexical decision experiment with a within-subject design and 120 participants. But we are not going to say, “This study sucks because it had low power”. At least, I want to defend the opinion here that we shouldn’t say that.
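To make the trade-off concrete, here is a rough sketch of how quickly the required sample grows as the expected effect shrinks. It assumes a two-sided comparison of two independent groups and uses the textbook normal approximation; the function name `n_per_group` and the effect sizes are illustrative assumptions and are not meant to reproduce the 80 patients of the hypothetical example.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate participants per group for a two-sided, two-sample comparison.

    Normal approximation: n = 2 * ((z_(1 - alpha/2) + z_power) / d) ** 2.
    It slightly underestimates the exact t-test answer for small samples.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


# Hypothetical standardized effect sizes (Cohen's d), not taken from any study.
for d in (0.8, 0.5, 0.2):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
```

With a rare patient population, numbers like these are simply out of reach, which is the point of the example: the critic should know what the ideal would be and still judge the study against what was feasible.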

While this is a rather extreme example, I believe that this principle should be applied at all levels and aspects of criticism. I remember that as a grad student, a local statistics hero informed me that my statistical design was flawed, and proceeded to require an ANOVA that was way beyond the computational capabilities of even the most powerful supercomputers available at the time. We know that full linear mixed models (LMMs) with random slopes and intercepts often do not converge. We know that many Bayesian analyses are intractable. In experimental designs, one runs into practical constraints as well. Many independent variables simply can’t be studied in a within-subject design. Phenomena that only occur spontaneously (e.g. iconic gestures) cannot be fully controlled. In EEG studies, it is not feasible to control for artifacts due to muscle activity, hence studying speech production is not really possible with this paradigm.

My point is: good research is always a compromise between experimental rigor, practical feasibility, and ethical considerations. To be able to appreciate this as a critic, it really helps to have been actively involved in research projects. Not only because that gives us more appreciation of the trade-offs involved, but also, perhaps more importantly, of the experience of really wanting to discover, prove, or demonstrate something. It makes us experience first-hand how tempting it can be, in Feynman’s famous formulation, to fool ourselves. I do not mean to say that we should become less critical, but rather that we become better constructive critics if we are able to empathize with the researcher’s goals and constraints. Nor do I want to say that criticism by those who have not yet had positive research experience is to be taken less seriously. All I want to say here is that (and why) having been actively involved in the process of contributing new knowledge to science makes us better critics.

Thursday, March 2, 2017

Duplicating Data: The View Before Hindsight

Today a first in this blog: a guest post! In this post Alexa Tullett reflects on the consequences of Fox's data manipulation, which I described in the previous post, for her own research and that of her collaborator, Will Hart.

Alexa Tullett
University of Alabama

[Disclaimer: The opinions expressed in this post are my own and not the views of my employer]

When I read Rolf’s previous post about the verb aspect RRR I resonated with much of what he said. I have been in Rolf’s position before as an outside observer of scientific fraud, and I have a lot of admiration for his work in exposing what happened here. In this case, I’m not an outside observer. Although I was not involved with the RRR that Rolf describes in detail, I was a collaborator of Fox’s (I’ll keep up the pseudonym) and my name is on papers that have been, or are in the process of being, retracted. I also continue to be a collaborator of Will Hart’s, and hope to be for a long time to come. Rolf has been kind enough to allow me space here to provide my perspective on what I know of the RRR and the surrounding events. My account is colored by my personal relationships with the people involved, and while this unquestionably undermines my ability to be objective, perhaps it also offers a perspective that a completely detached account cannot.

I first became involved in these events after Rolf requested that Will re-examine the data from his commentary for the RRR. Will was of the mind that data speak louder than words, so when the RRR did not replicate his original study he asked Fox to coordinate data collection for an additional replication. Fox was not an author on the original paper, and was not told the purpose of the replication. Fox ran the replication, sent the results to Will, and Will sent those and his commentary to Rolf. Will told me that he had reacted defensively to Rolf’s concerns about these data, but eventually Will started to have his own doubts. These doubts deepened when Will asked Fox for the raw data and Fox said he had deleted the online studies from Qualtrics because of “confidentiality” issues. After a week or two of communicating with the people at Qualtrics Will was able to obtain the raw data, and at this point he asked me if I would be willing to compare this with the “cleaned” data he had sent to Perspectives.

I will try to be as transparent as possible in documenting my thought process at the time these events unfolded. It’s easy to forget – or never consider – this naïve perspective once fraud becomes uncontested. When I first started to look at the data, I was far from the point where I seriously entertained the possibility that Fox tampered with the data. I thought scientific fraud was extremely rare. Fox was, in my mind, a generally dependable and well-meaning graduate student. Maybe he had been careless with these data, but it seemed far-fetched to me that he had intentionally changed or manipulated them.

I started by looking for duplicates, because this was the concern that Will had passed along from Rolf. They weren’t immediately obvious to me, because the participant numbers (the only unique identifiers) had been deleted by Fox. But, when I sorted by free-response answers several duplicates became apparent, as one can see in Rolf’s screenshot. There were more duplicates as well, but they were harder to identify for participants who hadn’t given free-response answers. I had to find these duplicates based on patterns of Likert-scale answers. I considered how this might have happened, and thought that perhaps Fox had accidentally downloaded the same condition twice, rather than downloading the two conditions. As I looked at these data further I realized that there had also been deletions. I speculated that Fox had been sloppy when copying and pasting between datasets – maybe some combination of removing outliers without documenting them and accidentally repeatedly copying cases from the same dataset.

I only started to genuinely question Fox’s intentions when I ran the key analysis on the duplicated and deleted cases and tested the interaction. Sure enough, the effect was there in the duplicated cases, and absent in the deleted cases. This may seem like damning evidence, but to be honest I still hadn’t given up on the idea that this might have happened by accident. Concluding that this was fraud felt like buying into a conspiracy theory. I only became convinced when Fox eventually admitted that he had done this knowingly. And had done the same thing with many other datasets that were the foundation of several published papers—including some on which I am an author.

Fox confessed to doing this on his own, without the knowledge of Will, other graduate students, or collaborators. Since then, a full investigation by UA’s IRB has drawn the same conclusion. We were asked not to talk about these events until that investigation was complete.

Hindsight’s a bitch. My thinking prior to Fox’s confession seems as absurd to me as it probably does to you. How could I have been so naively reluctant to consider fraud? How could I have missed duplicates in datasets that I handled directly?  I think part of the answer is that when we get a dataset from a student or a collaborator, we assume that those data are genuine. Signs of fraud are more obvious when you are looking for them. I wish we had treated our data with the skepticism of someone who was trying to determine whether they were fabricated, but instead we looked at them with the uncritical eye of scientists whose hypotheses were supported.

Fox came to me to apologize after he admitted to the fabrication. He described how and why he started tampering with data. The first time it happened he had analyzed a dataset and the results were just shy of significance. Fox noticed that if he duplicated a couple of cases and deleted a couple of cases, he could shift the p-value to below .05. And so he did. Fox recognized that the system rewarded him, and his collaborators, not for interesting research questions, or sound methodology, but for significant results. When he showed his collaborators the findings they were happy with them—and happy with Fox.

The silver lining. I’d like to think I’ve learned something from this experience. For one thing, the temptation to manipulate and fake data, especially for junior researchers, has become much more visible to me. This has made me at once more understanding and more cynical. Fox convinced himself that his research was so trivial that faking data would be inconsequential, and so he allowed his degree and C.V. to take priority. Other researchers have told me it’s not hard to relate. Now that I have seen and can appreciate these pressures, I have become more cynical about the prevalence of fraud.

My disillusionment is at least partially curbed by the increased emphasis on replicability and transparency that has occurred in our field over the past 5 years. Things have changed in ways that make it much more difficult to get away with fabrication and fraud. Without policies requiring open data, this case and others like it would often go undiscovered. Even more encouragingly, things have changed in ways that begin to alter the incentive structures that made Fox’s behavior (temporarily) rewarding. More and more journals are adopting registered report formats where researchers can submit a study proposal for evaluation and know that, if they faithfully execute that study, it will get published regardless of outcome. In other words, they will have the freedom to be un-invested in how their study turns out.

Tuesday, February 21, 2017

Replicating Effects by Duplicating Data

RetractionWatch recently reported on the retraction of a paper by William Hart. Richard Morey blogged in more detail about this case. According to the RetractionWatch report:

From this description I can only conclude that I am that “scientist outside the lab.” 

I’m writing this post to provide some context for the Hart retraction. For one, “inconsistent” is rather a euphemism for what transpired in what I’m about to describe. Second, this case did indeed involve a graduate student, whom I shall refer to as "Fox."

Back to the beginning. I was a co-author on a registered replication report (RRR) involving one of Hart’s experiments. I described this project in a previous post. The bottom line is that none of the experiments replicated the original finding and that there was no meta-analytic effect. 

Part of the RRR procedure is that original authors are invited to write a commentary on the replication report. The original commentary that was shared with the replicators had three authors: the two original authors (Hart and Albarracín) and Fox, who was the first author. A noteworthy aspect of the commentary was that it contained experiments. This was surprising (to put it mildly), given that one does not expect experiments in a commentary on a registered replication report, especially when these experiments themselves are not preregistered, as was the case here. Moreover, these experiments deviated from the protocol that we had established with the original authors. A clear case of double standards, in other words.

Also noteworthy was that the authors were able to replicate their own effect. And not surprising was that the commentary painted us as replication bullies. But with fake data, as it turns out.

The authors were made to upload their data to the Open Science Framework. I decided to take a look to see if I could explain the discrepancies between the successful replications in the commentary and all the unsuccessful ones in the RRR. I first tried to reproduce the descriptive and inferential statistics.  

Immediately I discovered some discrepancies between what was reported in the commentary and what was in the data file, both in condition means and in p-values. What could explain these discrepancies?

I decided to delve deeper and suddenly noticed a sequence of numbers, representing a subject’s responses, that was identical to a sequence several rows below. A coincidence, perhaps? I scrolled to the right where there was a column with verbal responses provided by the subjects, describing their thoughts about the purpose of the experiment. Like the number sequences, the two verbal responses were identical.

I then sorted the file by verbal responses. Lots of duplications started popping up. Here is a sample.

In all, there were 73 duplicates in the set of 194 subjects. This seemed quite alarming. After all, the experiment was run in the lab and how does one come to think they ran 73 more subjects than they actually ran? In the lab no less. It's a bit like running 25k and then saying afterwards "How bout them apples, I actually ran a marathon!" Also, given that the number of subjects was written out, it was clear that the authors intended to communicate they had a sample of 194 and not 121 subjects. Also important was that the key effect was no longer significant when the duplicates were removed (p=.059).
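The duplicate check described above (sorting by responses and looking for identical rows) can be sketched in a few lines. The rows below are hypothetical toy data, not the actual file, and the function name `count_duplicates` is only illustrative; each row pairs a tuple of numeric responses with a free-text answer.

```python
from collections import Counter

# Hypothetical toy rows: (numeric responses, free-text answer). Not the real data.
rows = [
    ((4, 5, 2, 6), "about memory"),
    ((3, 3, 5, 1), ""),
    ((4, 5, 2, 6), "about memory"),
    ((2, 6, 6, 4), "no idea"),
    ((3, 3, 5, 1), ""),
    ((4, 5, 2, 6), "about memory"),
]


def count_duplicates(rows):
    """Return the number of surplus copies and the rows that occur more than once."""
    counts = Counter(rows)
    extra = sum(n - 1 for n in counts.values())  # copies beyond the first
    repeated = {row: n for row, n in counts.items() if n > 1}
    return extra, repeated


extra, repeated = count_duplicates(rows)
print(f"{len(rows)} rows, {extra} duplicates, {len(rows) - extra} unique")
for (responses, text), n in repeated.items():
    print(f"  {n}x {responses} {text!r}")
```

One caveat: with only a handful of rating-scale items, two genuine participants can produce the same numeric pattern by chance, so a match on numbers alone is a flag to inspect rather than proof. Identical free-text answers are much harder to explain innocently.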

The editors communicated our concerns to the authors and pretty soon we received word that the authors had “worked night-and-day” to correct the errors. There was some urgency because the issue in which the RRR would appear was going to press.  We were reassured that the corrected data still showed the effect such that the conclusions of the commentary (“you guys are replication bullies”) remained unaltered and the commentary could be included in the issue.

Because I already knew that the key analysis was not significant after removal of the duplicates, I was curious how significance was reached in this new version. The authors had helpfully posted a “note on file replacement”: 

The first thing that struck me was that the note mentioned 69 duplicates whereas there were 73 in the original file. Also puzzling was the surprise appearance of 7 new subjects. I guess it pays to have a strong bullpen. With this new data collage, the p-value for the key effect was p=.028 (or .03).

A close comparison of the old and new data yields a different picture, though. The most important difference was that not 7 but 12 new subjects were added. In addition, for one duplicate both versions were removed. Renowned data sleuth Nick Brown analyzed these data separately from me and came up with the same numbers.
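A comparison of this kind, checking a "note on file replacement" against the files themselves, amounts to a multiset difference between the two versions. The sketch below uses hypothetical single-letter rows as stand-ins for full response records; `compare_versions` is an illustrative name, not a tool that was used in this case.

```python
from collections import Counter


def compare_versions(old_rows, new_rows):
    """Count rows removed from and added to a data file, respecting multiplicity."""
    old, new = Counter(old_rows), Counter(new_rows)
    removed = old - new  # copies present in the old file but not in the new one
    added = new - old    # copies present in the new file but not in the old one
    return sum(removed.values()), sum(added.values())


# Hypothetical stand-ins for rows of responses.
old_file = ["a", "a", "b", "c", "c"]
new_file = ["a", "b", "d", "e"]

n_removed, n_added = compare_versions(old_file, new_file)
print(f"removed: {n_removed}, added: {n_added}")
```

Counting with multiplicity matters here: a duplicate pair from which one copy was dropped shows up as one removal, whereas a pair from which both copies were dropped shows up as two.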

So history repeated itself here. The description of the data did not match the data and the “effect” was again significant just below .05 after the mixing-and-matching process.

There was much upheaval after this latest discovery, involving all of the authors of the replication project, the editors, and the commenters. I suspect that had we all been in the same room there would have been a brawl. 

The upshot of all this commotion was that this version of the commentary was withdrawn. The issue of Perspectives on Psychological Science went to press with the RRR but without the commentary.  In a subsequent issue, a commentary appeared with Hart as its sole author and without the new "data."

Who was responsible for this data debacle? After our discovery of the initial data duplication, we received an email from Fox stating that "Fox and Fox alone" was responsible for the mistakes. This sounded overly legalistic to me at the time and I’m still not sure what to make of it. 

The process of data manipulation described here appears to be one of mixing-and-matching. The sample is a collage consisting of data that can be added, deleted, and duplicated at will until a p-value of slightly below .05 (p = .03 seems popular in Hart’s papers) is reached.

I wonder if the data in the additional papers by Hart that apparently are going to be retracted are produced by the same foxy mixing-and-matching process. I hope the University of Alabama will publish the results of its investigation. The field needs openness.

Monday, January 9, 2017

Subtraction Priming

You may have come across a viral video on Facebook from "deception expert" Rick Lax, who invites you to participate in a little pop quiz involving numbers. If you haven't seen it, watch the 1-minute-plus video right now. (I'd embed the video in this post for you but I'm not sure I'm allowed to do so.)

If you were like me, you thought of the number 7 at the end. Of course, this is exactly the number Lax wanted you to come up with.

So how does it work? Or does it work at all? There was some discussion about this urgent matter on Facebook in the Psychological Methods Discussion Group. The moderator of that group, Uli Schimmack, who also thought of 7, suggested this was the result of priming. But then he questioned his explanation: "We don't know because we don't know how often he gets it right? We just see 1 million shares. It is like reading Psych Science. We only see the successes."

This makes sense. In theory there could be a massive file drawer of unshared videos by people who didn't pick 7 and the whole thing could be the result of the social media version of publication bias, sharing bias. Uninteresting.

Others in the group provided links to papers showing that people pick 7 here simply because it's the most popular number. Also uninteresting.

But is our world really this mundane? I refused to believe this, and so did Uli. So what else could be going on?

I hypothesized that I got the number 7 because it was the only number between 5 and 12 that was not mentioned. Here are the numbers we get to see in the video:

 5 + 3 = 8
 9 + 2 = 11
10 - 4 = 6

My thinking was that people have a desire to be autonomous. This would then, ironically, direct them toward the number 7, as all the other electable numbers had already been "suggested." I was heartened to see that moving people away from numbers by mentioning them is indeed a trick mentalists use.
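The "only number not mentioned" observation is easy to verify mechanically. The sketch below assumes that "between 5 and 12" excludes the endpoints; the helper name `unmentioned` is made up for this illustration.

```python
import re


def unmentioned(sequence, low, high):
    """Numbers strictly between low and high that never appear in the sequence."""
    mentioned = {int(tok) for line in sequence for tok in re.findall(r"\d+", line)}
    return [n for n in range(low + 1, high) if n not in mentioned]


# The three sums shown in the video.
video = ["5 + 3 = 8", "9 + 2 = 11", "10 - 4 = 6"]
print(unmentioned(video, 5, 12))  # [7]
```

The same function can be pointed at any other opening sequence to see which number, if any, it leaves out.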

I decided to test my mentalist hypothesis. I created a simple version of the pop quiz, using similar timing to the original video. I thought that if the mentalist trick works, I should be able to shift people's preference to a different number, namely 8. I used the following numbers:

 5 + 1 = 6
 9 + 2 = 11
10 - 3 = 7

So now 8 is the number that is left out. I ran my experiment on Mechanical Turk. Here is what I found a few hours later.

Experiment 1: No 8 in sequence

So my mentalist hypothesis clearly got the finger. People still went for 7 and my 8 didn't even outperform that lousy 6.

Uli had a different hypothesis. He reasoned that people were primed by the question: pick a number between 5 and 12. After all, 12 - 5 = 7. If this priming hypothesis is right, then it should be possible to shift people's preference by changing the final question to: pick a number between 5 and 13. So I went ahead and ran that experiment on MTurk. And what do you know:

Experiment 2: Range 5 to 13

So clearly it is possible to shift preferences away from 7. Priming lives!

Obviously, there are more experiments one could do on this topic. I suspect we'll be discussing them in the Psychological Methods Group soon.

And indeed, we collected new data. If the pattern shown in the previous figure is due to subtraction priming, then we should find people reverting to the baseline preference for 7 when all they need to do is pick a number between 5 and 13, without a sequence preceding it. That's the idea we tested and here is what we found.

Experiment 3: Baseline

That looks like the pattern we got the first time (7 beats 8) and not like what we got last time. So there is something about having number selection be preceded by the additions and subtraction. Subtraction priming survives!

There are many variants of the experiment one could run. However, the best one to further isolate subtraction priming as a factor is one that uses exactly the same numbers as the second experiment but removes subtraction from the opening sequence. This can be achieved by using the sequence:

 5 + 1 = 6
 9 + 2 = 11
 7 + 3 = 10

The key difference with the second experiment is the absence of a subtraction sign. And apparently, this makes a big difference. As in experiments 1 and 3, 7 is now the preferred number, albeit by a small margin, as in all the other experiments except the first one.

Experiment 4: Addition only

So the only experiment in which the number 8 was the preferred choice was Experiment 2, in which the final trial in the opening sequence was a subtraction and in which subtraction of the endpoints of the range yielded the number 8: subtraction priming.
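The four experiments can be lined up against what subtraction priming would predict. This is only a bookkeeping sketch: the rule in `predicted_pick` is my reading of the hypothesis (the range difference wins only when the last trial is a subtraction, otherwise the popular 7 wins), Experiment 1 is assumed to have kept the original "between 5 and 12" question, and the "reported" column holds the preferred number as described in the text, not the underlying counts.

```python
BASELINE = 7  # the most popular pick when nothing primes another number


def predicted_pick(low, high, last_operation):
    """Subtraction-priming prediction: high - low, but only after a subtraction trial."""
    if last_operation == "-":
        return high - low
    return BASELINE


# (label, low, high, operator of the final trial or None, preferred number reported)
experiments = [
    ("Experiment 1: No 8 in sequence", 5, 12, "-", 7),
    ("Experiment 2: Range 5 to 13", 5, 13, "-", 8),
    ("Experiment 3: Baseline", 5, 13, None, 7),
    ("Experiment 4: Addition only", 5, 13, "+", 7),
]

for label, low, high, op, reported in experiments:
    prediction = predicted_pick(low, high, op)
    verdict = "consistent" if prediction == reported else "inconsistent"
    print(f"{label}: predicted {prediction}, reported {reported} ({verdict})")
```

That the rule fits all four experiments is not much of a surprise, since it was formulated after seeing them. It is a summary of exploratory results, and the real test is whether it holds up in new data.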

Next step: thinking of some confirmatory experiments.

*Thanks to Uli Schimmack, Laura Scherer, Robin Kok, and James Heathers for some of the references and comments used in this post.

Saturday, December 31, 2016

A Commitment to Better Research Practices (BRPs) in Psychological Science

On the brink of 2017. Time for some New Year's resolutions. I won't bore you with details about my resolutions to (1) again run 1000k (not in a row of course), (2) not live in a political bubble, (3) be far more skeptical about political polls, (4) pick up the guitar again, (5) write more blog posts, and (6) learn more about wine. Instead I want to focus on some resolutions about research practices that Brent Roberts, Lorne Campbell, and I penned (with much-appreciated feedback from Brian Nosek, Felix Schönbrodt, and Jennifer Tackett). We hope they form an inspiration to you as well. 

The Commitment

Scientific research is an attempt to identify a working truth about the world that is as independent of ideology as possible.  As we appear to be entering a time of heightened skepticism about the value of scientific information, we feel it is important to emphasize and foster research practices that enhance the integrity of scientific data and thus scientific information. We have therefore created a list of better research practices that we believe, if followed, would enhance the reproducibility and reliability of psychological science. The proposed methodological practices are applicable for exploratory or confirmatory research, and for observational or experimental methods.
1. If testing a specific hypothesis, pre-register your research, so others can know that the forthcoming tests are informative.  Report the planned analyses as confirmatory, and report any other analyses or any deviations from the planned analyses as exploratory.
2. If conducting exploratory research, present it as exploratory. Then, document the research by posting materials, such as measures, procedures, and analytical code so future researchers can benefit from them. Also, make research expectations and plans in advance of analyses—little, if any, research is truly exploratory. State the goals and parameters of your study as clearly as possible before beginning data analysis.
3. Consider data sharing options prior to data collection (e.g., complete a data management plan; include necessary language in the consent form), and make data and associated meta-data needed to reproduce results available to others, preferably in a trusted and stable repository. Note that this does not imply full public disclosure of all data. If there are reasons why data can’t be made available (e.g., containing clinically sensitive information), clarify that up-front and delineate the path available for others to acquire your data in order to reproduce your analyses.
4. If some form of hypothesis testing is being used or an attempt is being made to accurately estimate an effect size, use power analysis to plan research before conducting it so that it is maximally informative.
5. To the best of your ability maximize the power of your research to reach the power necessary to test the smallest effect size you are interested in testing (e.g., increase sample size, use within-subjects designs, use better, more precise measures, use stronger manipulations, etc.). Also, in order to increase the power of your research, consider collaborating with other labs, for example via StudySwap. Be open to sharing existing data with other labs in order to pool data for a more robust study.
6. If you find a result that you believe to be informative, make sure the result is robust.  For smaller lab studies this means directly replicating your own work or, even better, having another lab replicate your finding, again via something like StudySwap.  For larger studies, this may mean finding highly similar data, archival or otherwise, to replicate results. When other large studies are known in advance, seek to pool data before analysis. If the samples are large enough, consider employing cross-validation techniques, such as splitting samples into random halves, to confirm results. For unique studies, checking robustness may mean testing multiple alternative models and/or statistical controls to see if the effect is robust to multiple alternative hypotheses, confounds, and analytical approaches.
7. Avoid performing conceptual replications of your own research in the absence of evidence that the original result is robust and/or without pre-registering the study.  A pre-registered direct replication is the best evidence that an original result is robust.
8. Once some level of evidence has been achieved that the effect is robust (e.g., a successful direct replication), by all means do conceptual replications, as conceptual replications can provide important evidence for the generalizability of a finding and the robustness of a theory.
9. To the extent possible, report null findings.  In science, null news from reasonably powered studies is informative news.
10. To the extent possible, report small effects. Given the uncertainty about the robustness of results across psychological science, we do not have a clear understanding of when effect sizes are “too small” to matter.  As many effects previously thought to be large are small, be open to finding evidence of effects of many sizes, particularly under conditions of large N and sound measurement.
11. When others are interested in replicating your work, be cooperative if they ask for input. Of course, one of the benefits of pre-registration is that there may be less of a need to interact with those interested in replicating your work.
12. If researchers fail to replicate your work, continue to be cooperative. Even in an ideal world where all studies are appropriately powered, there will still be failures to replicate because of sampling variance alone. If the failed replication was done well and had high power to detect the effect, at least consider the possibility that your original result could be a false positive. Given this inevitability, and the possibility of true moderators of an effect, aspire to work with researchers who fail to find your effect so as to provide more data and information to the larger scientific community that is heavily invested in knowing what is true or not about your findings.

We should note that these proposed practices are complementary to other statements of commitment, such as the commitment to research transparency. We would also note that the proposed practices are aspirational. Ideally, our field will adopt many, if not all, of these practices. But we also understand that change is difficult and takes time. In the interim, it would be ideal to reward any movement toward better research practices.

Brent W. Roberts
Rolf A. Zwaan
Lorne Campbell

Wednesday, September 28, 2016

Invitation to a Registered Replication Report

Update December 17. Data collection is in full swing in labs from Buenos Aires to Berkeley and from Potsdam to Pittsburgh. Some labs have already finished while others (such as my lab) have just started. Data collection should be completed by March 1. 

Update October 24. Data collection has officially started. No fewer than 20 labs are participating! Besides investigating if the ACE replicates in native speakers, we will also examine if the effect extends to L2 speakers of English.

Mike Kaschak and Art Glenberg, discoverers of the famous ACE effect, have decided to run a registered replication of their effect. There already are 7 participating labs but we'd like to invite more participants. If you're interested in language, action, and/or replication and have access to subjects who are native speakers of English, please consider participating by responding to Mike's invitation:
Dear Colleague,

I am writing to ask whether you are interested in participating in the data collection for a multi-lab, pre-registered replication of the Action-sentence Compatibility Effect (ACE), first reported in Glenberg and Kaschak (2002). I am organizing this effort along with my colleagues Art Glenberg, Rolf Zwaan, Richard Morey, Agustin Ibanez, and Claudia Gianelli.

Your participation in this effort will involve running the Borreggine & Kaschak (2006) version of the ACE experiment (which uses spoken sentences, rather than written sentences as in Glenberg & Kaschak, 2002), following the registered protocol and sampling plan. We will provide the E Prime files required to conduct the study. Our current plan is to complete the preparations for the replication within the next month or so, with data collection to commence in the Fall of 2016, and continue through the Spring of 2017. All data collection should be completed by March 1, 2017 (if not sooner), and all data should be made available to us by April 1, 2017 (if not sooner). You will be expected to analyze the data you collect according to the registered protocol, and also to send us your raw data for analysis and eventual deposit in a public repository.

Because we are aware that different labs face different constraints with regard to the availability of research participants, our sampling plan will be as follows. If you agree to participate, we ask that you commit to collecting data from at least 60 participants, with a maximum sample size of 120 participants. We also ask that you pre-register your chosen sample size with us (sample sizes in multiples of 4, due to the counterbalancing involved in the study) before you begin data collection. We will post the sample sizes along with our pre-registration of the replication methods.

The protocol for the study and the E Prime files will be made available on the Open Science Framework.

All contributors to the data collection effort will be included as authors on the published report of the replication (as in previous published registered replications).

Thank you for considering our request. Please let us know as soon as you can whether you are willing to join our effort.

Michael Kaschak

Thursday, May 12, 2016

Disentangling Reputation from Replication

With increasing attention paid to reproducibility in science, a natural worry for researchers is, “What happens if my finding does not replicate?” With this question, Charles Ebersole, Jordan Axt, and Brian Nosek open their new article on perceptions of novelty and reproducibility, published today in PLoS Biology.

There are several ways to interpret this question, but Ebersole and colleagues are most concerned with reputational issues. In an ideal world, they note, reputations shouldn’t matter; the focus should be on the findings. But reality is different: findings are treated as possessions.

Ebersole and his co-authors draw a contrast between innovation and reproducibility in evaluating reputations. Drawing this contrast is not without precedent. Some years back, I served on the National Science Foundation program Perception, Action, and Cognition. We were told that innovation was to be an overriding criterion in evaluating proposals. Up to that point, as I understood it, the program’s predecessor had been perceived as an “old-boys-network” in which researchers who had been funded before pretty much had a ticket to renewed funding, whereas younger researchers were struggling to get in on the funding. In our program discussions the word “solid” in a review was a kiss of death for the proposal, it being a code word for “more of the same old boring stuff.”

In the last decade, we have seen the pendulum swing from “solid” to “innovative.”* The pendulum metaphor invites us to align reproducible with boring and nonreproducible with innovative. Ebersole and colleagues create this stark contrast in their survey. Enter AA and BB, two scientists in some unspecified field. AA produces “boring but certain” results; BB produces “exciting but uncertain” results. Ebersole and colleagues asked two large samples from the general public several questions about these scientific opposites. When presented with this stark choice, the general public clearly preferred AA over BB. Good for AA.

However, Ebersole and his co-authors are quick to point out that AA and BB are caricatures; after all, nobody embarks on a career to produce boring or uncertain results. The contrast is misleading because there are temporal dependencies at play. You first obtain an exciting finding and then you decide what to do next: replicate and extend this exciting finding or move on to the next exciting finding? And if our reputation is at stake, how should we respond when others attempt to replicate our findings to increase certainty independently?

The authors investigated these questions in a further survey featuring the researchers X and Y. The respondents read several scenarios involving X and Y after having received an introduction about the scientific publication process. The respondents first rated researcher X’s ability, ethics, and the level of truth of the finding. The average rating of the researcher’s ability was then used as a baseline for several scenarios that introduced researcher Y as someone who replicated or failed to replicate X’s original finding. Of interest were the reputational consequences of this for X. This figure displays the results.

I have to admit that the figure is giving me bouts of OCD (am I alone in feeling compelled to pull apart the superimposed letters?), but the message is clear. Reputation depends not so much on whether your finding is true but rather on how you respond to failed replication.
If Y does not replicate the finding, then the original result is perceived as less true. X suffers some reputational damage as well, being perceived as somewhat less ethical and less capable than before. However, what matters crucially is how X responds to the failed replication. For example, there is considerably more reputational damage if X discredits Y’s replication result. I suspect this would vary as a function of whether or not X’s criticism was perceived as justified, but this was not investigated. In contrast, there is a big reputational gain if X accepts Y’s result (see here for an actual example) and concludes that the original result might not be correct; the original effect is perceived as less true, of course. Interestingly, the finding is perceived as less true than when X criticizes the replication. The reputation gain is even bigger if X starts a replication attempt to investigate the difference between the original and replication results. Curiously, the original result is now perceived as truer than before the failed replication. The reputation gain is somewhat smaller than this if X fails to self-replicate the original finding and the original finding is perceived as less true. There is considerable reputational damage if X performs an unsuccessful self-replication and decides not to report it or doesn’t follow up on the finding at all. The former is a bit hypothetical, of course, because if X doesn’t report the failed self-replication, no one is the wiser. And if X doesn’t follow up, it is unclear whether people would pick up on the lack of a follow-up.

So much for the general public. How about students and scientists? Ebersole and colleagues presented the same scenarios to 428 students and 313 researchers (from graduate students to full professors). It turns out that scientists are more forgiving than the general public, especially when it comes to pursuing new ideas rather than following up on an initially published finding. The authors attribute this to the aforementioned drive toward innovation.

Not surprisingly, the researchers displayed a more realistic (pessimistic?) assessment of the current job market than the general population. They viewed the exciting, uncertain scientist as more likely to get a job, keep a job, and be more celebrated by wide margins.

“Despite that,” the authors note, “researchers were slightly more likely to say that they would rather be, and more than twice as likely to say that they should be, the boring, certain scientist.” Demand characteristics are likely to have played a role here. As I said earlier, who wants to be boring? The students responded more like the general public than like the scientists.

What do we make of this set of results? Clearly, it is quite artificial to present respondents with a set of idealized and decontextualized scenarios. On what basis are respondents making judgments when presented with these scenarios? Especially the general public. On the other hand, the convergence among the responses from the three different groups (general public, students, researchers) is reassuring.

The set of scenarios that was used is not only idealized but also limited. It does not exhaust the space of possible scenarios, as the authors acknowledge. For example, there is no scenario that involves a (failed) replication that is flawed because it distorts or omits (either accidentally or intentionally) parts of the original experiment. It would be important to include such a scenario in a follow-up study and then ask questions about the ability and ethics of the replicator and truth of the replication finding as well. After all, just as original experiments can be flawed, so can replications. So it only makes sense to approach replications critically.

What I take away from the article is this.

(1) We should disentangle reputation from replication. This becomes easier if we self-replicate.

(2) We should stop seeing innovation and replication as opposites. The drive to innovate means that we are bound to pursue wrong leads in most cases. Competently performed replications are a reality check. Innovation and replication are not enemies. They are two necessary components of the best mechanism at our disposal to learn about the world: science.


*Although some might see this as the main reason for the reproducibility crisis, the only way we can tell for sure is if there are more replication attempts of “boring” research. I’m willing to bet that there are considerable reproducibility issues with that kind of research as well.