Tuesday, December 10, 2013

Time, Money, and Morality (and p-Hacking?)

A paper that is in press in Psychological Science tests the hypothesis that priming someone with the concept of time makes them cheat less than someone who is not primed this way. Or, as the authors articulate the idea in the abstract: Across four experiments, we examined whether shifting focus onto time can salvage individuals’ ethicality.

I've said a lot already about the kind of theorizing and experimenting found in this type of priming research, so I just want to keep it simple this time and concentrate on something that is currently under fire in the literature, even on the pages of Psychological Science itself: the p-value.

As the abstract indicates, there are four experiments. In each experiment, the key prediction is that exposure to the "time-prime" causes people to cheat less. Each prediction is evaluated on the basis of a p-value. In Experiment 1, the prediction was that subjects would cheat less in the "time-prime" condition than in the control condition. (There also was a money-prime condition, but this was not germane to the key hypothesis.) I've highlighted the key result.


In Experiment 2 the key hypothesis was that if priming time decreases cheating by making people reflect on who they are, cheating behavior in the latter condition would not differ between participants primed with money and those primed with time. However, participants who were told that the game was a test of intelligence would show the same effect observed in Experiment 1. So the authors predicted an interaction between reflection (reflection vs. no reflection) and type of prime (time vs. money). Here are the results.


In Experiment 3 the authors manipulated self-reflection in a literal way: subjects were or were not seated in front of a mirror and this was crossed with prime condition (money vs. time). Again, the key prediction involved an interaction. 


Finally, in Experiment 4 the three priming conditions of Experiment 1 were used (money, time, control), which produced the following results.



So we have four experiments, each with its key prediction supported by a p-value between .04 and .05. How likely are these results?

This question can be answered with a method developed by Simonsohn, Simmons, and Nelson (in press). To quote from the abstract: Because scientists tend to report only studies (publication bias) or analyses (p-hacking) that “work”, readers must ask, “Are these effects true, or do they merely reflect selective reporting?” We introduce p-curve as a way to answer this question. P-curve is the distribution of statistically significant p-values for a set of studies (ps < .05).
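
The intuition behind this idea can be illustrated with a few lines of Python. This is only a rough sketch and not the p-curve procedure itself: it assumes a two-sided z-test whose statistic is normal with mean `mu` (the name `mu` and the helper `share_just_significant` are my own illustrative choices). Under the null (`mu = 0`) significant p-values are uniformly distributed, so one in five lands between .04 and .05; when there is a true effect, small p-values dominate and that share shrinks.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def share_just_significant(mu, lo=0.04, alpha=0.05):
    """Among significant results (p < alpha), the share with lo < p < alpha.

    Assumes a two-sided z-test whose statistic is distributed N(mu, 1);
    mu = 0 corresponds to the null hypothesis.
    """
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)  # |z| cutoff for p = alpha
    z_lo = STD_NORMAL.inv_cdf(1 - lo / 2)        # |z| cutoff for p = lo
    stat = NormalDist(mu, 1)

    def two_sided_tail(cutoff):
        # P(|z| > cutoff) when z ~ N(mu, 1)
        return stat.cdf(-cutoff) + (1 - stat.cdf(cutoff))

    significant = two_sided_tail(z_alpha)
    just_significant = significant - two_sided_tail(z_lo)
    return just_significant / significant


# mu = 0 is the null; larger mu means a larger true effect (illustrative values).
for mu in (0.0, 1.0, 2.0, 2.8):
    share = share_just_significant(mu)
    print(f"mu={mu:.1f}  share in (.04, .05)={share:.3f}  four in a row={share ** 4:.6f}")
```

Even in the most favorable case for "just significant" results (no effect at all), four independent significant p-values would all fall between .04 and .05 with probability .2 to the fourth power, or .0016.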

Simonsohn and colleagues have developed a web app that makes it very easy to compute p-curves. I used that app to compute the p-curve for the four experiments, using the p-values for the key hypotheses.



So if I did everything correctly, the app concludes that the experiments in this study had no evidential value and were intensely p-hacked.

It is somewhat ironic that the second author of the Psych Science paper and the first author of the p-curve paper are at the same institution. This is illustrative of the current state of methodological flux that our field is in: radically different views of what constitutes evidence co-exist in institutions and journals (e.g., Psychological Science). 




10 comments:

  1. Here is your answer: "no matter how one chooses the [N and the true effect size] under the alternatives, at most 3.7% of the p values will fall in the interval (.04; .05)". http://www.stat.duke.edu/courses/Spring10/sta122/Labs/Lab6.pdf.
    Having four of those in a row is pretty unlikely!
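
The bound quoted in the comment above can be checked approximately with a short calculation. This is a sketch under a simplifying assumption, namely a two-sided z-test whose statistic is normal with mean `mu`; it is not the derivation in the linked lab handout, and `prob_p_in_interval` is a hypothetical helper name. The code searches a grid of `mu` values for the one that puts the most probability on p-values between .04 and .05.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def prob_p_in_interval(mu, lo=0.04, hi=0.05):
    """P(lo < p < hi) for a two-sided z-test with statistic z ~ N(mu, 1)."""
    z_inner = STD_NORMAL.inv_cdf(1 - hi / 2)  # |z| that gives p = hi
    z_outer = STD_NORMAL.inv_cdf(1 - lo / 2)  # |z| that gives p = lo
    stat = NormalDist(mu, 1)
    upper_band = stat.cdf(z_outer) - stat.cdf(z_inner)
    lower_band = stat.cdf(-z_inner) - stat.cdf(-z_outer)
    return upper_band + lower_band


# Grid search over the mean of the test statistic (0 = null, up to 6 SDs).
grid = [i / 1000 for i in range(0, 6001)]
worst_mu = max(grid, key=prob_p_in_interval)
worst_share = prob_p_in_interval(worst_mu)
four_in_a_row = worst_share ** 4

print(f"worst-case mu: {worst_mu:.3f}")
print(f"largest P(.04 < p < .05): {worst_share:.4f}")
print(f"four independent studies in a row: {four_in_a_row:.2e}")
```

Under this approximation the largest share comes out close to the quoted 3.7%, and raising it to the fourth power gives a probability on the order of a few in a million for four independent studies.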

  2. A Bayes factor analysis shows that these kinds of p-values (close to the .05 boundary) have almost no evidential impact. This goes back to Edwards, Lindman, & Savage, 1963 Psych Review, and has recently been demonstrated again by Jim Berger, and, in 2013, by Valen Johnson ("Revised Standards for Statistical Evidence"). Johnson ends up recommending an alpha-level of .005. As Lindley remarked: “There is therefore a serious and systematic difference between the Bayesian and Fisherian calculations, in the sense that a Fisherian approach much more easily casts doubt on the null value than does Bayes. Perhaps this is why significance tests are so popular with scientists: they make effects appear so easily.”

    Replies
    1. Very important points that have certainly changed my outlook on things. I wonder how hard it would be to p-hack your way to a p-value of <.005.

  3. It seems to me that, to paraphrase the British politician Peter Mandelson, social and positive psychologists are "intensely relaxed" about the possibility of Type 1 error. In fact, I suspect that many of them don't sincerely consider Type 1 error to be, as the kids on the Internet say, "a thing". I found it, I published it, nobody has taken the time and effort to jump through the many hoops (some of them flaming) needed to refute it, therefore I win.

    I think that the people who pay for all this (i.e., the taxpayers in most cases) would be appalled to discover just how little understanding very many scientists have of the appropriate use of the most basic tools of their trade. Perhaps this applies "especially" to psychologists when it comes to p-hacking, although abjectly bad statistical practice seems to be common in almost every discipline.

  4. I agree that this paper appears to have the hallmarks of p-hacking. But I think we need some caution if we want to engage in post-hoc p-hackery analyses. It's one thing to state a priori "I think a set of studies that have this feature may show p-hackery"; it's another to look at the p values first and then search post hoc for evidence of p-hackery. Perhaps in the near future researchers interested in p-hacking will develop post-hoc corrections for p-hack investigations.

    Replies
    1. I agree Chris. I'm sensitive to this issue as we had to deal with it when I served on the Smeesters Committee. In this particular case, others had expressed skepticism about this study on Twitter, which I shared when I read the paper. I took a closer look and then I noticed the issue with the p-values. So here there was an a priori hypothesis, so to speak.

  5. I think the issue is that the p-values are the clues to p-hacking. Basically, I think a collection of studies with p values just below .05, fluctuating ns without explanation, and weird effect sizes (i.e., large relative to expectations) are clues to p-hacking. I hate seeing packages dotted with p values around .04.

    The solution for p-hacking is fairly simple. Run the studies again under the same conditions (preferably with larger samples to get more precise estimates). If the results hold, the field has increased confidence in the sturdiness of the findings. If the results don’t duplicate, we learn another painful lesson about the impact of chance and the downsides of QRPs.

    Replies
    1. It would indeed be best to run replications. However, as someone mentioned to me in an email yesterday, you cannot possibly refute questionable studies given the rate at which they are published. Experiment 4, however, was run on MTurk and so would be a good candidate for a replication. No worries about special booths or experimenters, etc.

  6. I think it is good that this kind of analysis is being performed and shared in a public place. I wanted to consider some details of the analysis and an alternative approach.

    Rolf focused on an effect that was repeatedly found across four experiments in Gino and Mogilner (2013): that participants were less likely to cheat when focused on time compared to participants in a control or a money-focused condition.

    These are not the only reasonable choices. Gino and Mogilner (2013) also explored the effect of a money-focus for a variety of main effects and contrasts. The p-values that are produced by these different hypothesis tests are not independent of the p-values analyzed by Rolf, and such dependencies mean that it is not appropriate to include them all in the p-curve analysis. Table 1 (http://www1.psych.purdue.edu/~gfrancis/Publications/TimeMoneyMorality/Table1.pdf ) highlights the statistics used for different analyses. The p-curve analysis for the money effect does not indicate p-hacking (p=0.76). Details of the analysis are in Figure 1 (http://www1.psych.purdue.edu/~gfrancis/Publications/TimeMoneyMorality/Figure1.pdf ). These two different conclusions are not in conflict because the statistics measure different effects. Nevertheless, concluding p-hacking from the p-curve analysis depends on which statistics are analyzed. Importantly, the p-curve analysis cannot consider both sets of statistics simultaneously because of the dependencies.

    Table 2 (http://www1.psych.purdue.edu/~gfrancis/Publications/TimeMoneyMorality/Table2.pdf ) shows the post hoc power for each experiment. Consider the column for the time-focused statistics. The power estimates are all just above one half. The Test for Excess Significance (TES) notes that the probability of all four experiments like these rejecting the null hypothesis is the product of the power values. The final row indicates that this probability is 0.079. This probability can be considered an estimate of the probability that a direct replication of the four experiments (with the same sample sizes) would all produce statistically significant outcomes. Since this probability is less than the 0.1 criterion that is commonly used for these kinds of analyses, readers should be skeptical that the reported results were produced with proper experiments and analyses.

    The money-focused power values are higher, and their product is well above the 0.1 criterion. In this respect, the TES analysis gives essentially the same conclusions as the p-curve analysis.

    The final column in Table 2 presents a more general TES analysis that considers the money-focused, the time-focused, and additional statistical results (highlighted in yellow in Table 1) that were deemed by Gino and Mogilner (2013) as providing support for their theoretical ideas. The success probability for the full set was estimated with simulated experiments that used the properties of the reported sample statistics. The 0.003 probability is so small that it is difficult to suppose that the experiments were fully reported, properly run, and properly analyzed.

    This result does not mean that there is no merit to the reported results, but it means that readers should be skeptical about the theoretical conclusions that are derived from the reported results. Moreover, it is not obvious which effects can be believed and which are suspect.

    Unlike the p-curve analysis, the TES can consider the full set of experimental results used by Gino and Mogilner (2013) to support their theoretical ideas. Applying this more general approach leads to a pretty convincing conclusion that readers should doubt the validity of the relationship between the experimental data and the theoretical claims.

    A spreadsheet describing the effect size and power estimates, along with R code for estimating power, can be downloaded from
    http://www1.psych.purdue.edu/~gfrancis/Publications/TimeMoneyMorality/
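
The TES reasoning in the comment above (multiply the power values to get the chance that every study succeeds) can be sketched as follows. This is not the R code linked in the comment and does not reproduce Table 2: it uses a normal approximation that treats the observed z statistic as the true effect, and the four p-values of .045 are hypothetical placeholders chosen only because the post reports key p-values between .04 and .05. The helper name `post_hoc_power` is likewise my own.

```python
from math import prod
from statistics import NormalDist

STD_NORMAL = NormalDist()


def post_hoc_power(p_observed, alpha=0.05):
    """Approximate post hoc power implied by a two-sided p-value.

    Assumption: the observed z statistic is taken as the true mean of the
    test statistic in an exact replication (normal approximation).
    """
    z_observed = STD_NORMAL.inv_cdf(1 - p_observed / 2)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    replication = NormalDist(z_observed, 1)
    # Probability that a replication lands beyond either critical value.
    return replication.cdf(-z_crit) + (1 - replication.cdf(z_crit))


# Hypothetical p-values, NOT the values reported in the paper.
hypothetical_p_values = [0.045, 0.045, 0.045, 0.045]
powers = [post_hoc_power(p) for p in hypothetical_p_values]
joint_success = prod(powers)  # chance that all four studies reject the null

print("power per study:", [round(w, 3) for w in powers])
print("probability that all four are significant:", round(joint_success, 3))
```

The sketch shows why p-values just under .05 imply post hoc power just above one half, and why the product over four such studies drops below the 0.1 criterion mentioned in the comment.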
