Donald Trump’s Hair and Implausible Patterns of Results

In the past few years, a set of new terms has become common parlance in post-publication discourse in psychology and other social sciences: sloppy science, questionable research practices, researcher degrees of freedom, fishing expeditions, and data that are too-good-to-be-true. An excellent new paper by Andrew Gelman and Eric Loken takes a critical look at this development. The authors point out that they regret having used the term fishing expedition in a previous article that contained critical analyses of published work.

The problem with such terminology, they assert, is that it implies conscious actions on the part of the researchers even though, as they are careful to point out, the people who have coined or are using those terms (this includes me) may not think in terms of conscious agency. The main point Gelman and Loken make in the article is that there are various ways in which researchers can unconsciously inflate effects. I will write more about this in a later post. I want to focus on the nomenclature issue here. Gelman and Loken are right that despite the post-publication reviewers’ best intentions, the terms they use do evoke conscious agency.

We need to distinguish between post-publication review and ethics investigations in this regard, as these activities have different goals. Scientific integrity committees are charged with investigating the potential wrongdoings of scientists; they need to reverse-engineer behavior from the information at their disposal (published data, raw data, interviews with the researcher, their collaborators, and so on). Post-publication review is not about research practices. It is about published results and the conclusions that can or cannot be drawn from them.

If we accept this division of labor, then we need to agree with Gelman and Loken that the current nomenclature is not well suited for post-publication review. Actions cannot be unambiguously reverse-engineered from the published data. Let me give a linguistic example to illustrate. Take the sentence “Visiting relatives can be frustrating.” Without further context, it is impossible to know which process has given rise to this utterance. The sentence is a standing ambiguity and any Chomskyan linguist will tell you that it has one surface structure (the actual sentence) and two deep structures (meanings). The sentence can mean that it is frustrating to visit relatives or that it is frustrating when they are visiting you. There is no way to tell which deep structure has given rise to this surface structure.

It is the same with published data. Are the results the outcome of a stroke of luck, optional stopping, selective removal of data, selective reporting, an honest error, or outright fraud? This is often difficult to tell and probably not something that ought to be discussed in post-publication discourse anyway.

So the problem is that the current nomenclature generally brings to mind agency. Take sloppy science. It implies that the researcher has failed to exert an appropriate amount of care and attention; science itself cannot be sloppy. As Gelman and Loken point out, p-hacking is not necessarily intended to mean that someone deliberately bent the rules (and, in fact, their article is about how researchers unwittingly inflate the effects they report; more about this interesting idea in a later post). However, the verb implies actions on the part of the researcher; it is not a description of the results of a study. The same is true, of course, of fishing expedition. It is the researchers who are going on a fishing expedition; it is not the data who have cast their lines. Questionable research practices is obviously a statement about the researcher, as is researcher degrees of freedom.

But how about too-good-to-be-true? Clearly this qualifies as a statement about the data and not about the researcher. Uri Simonsohn used it to describe the data of Dirk Smeesters, and the Scientific Integrity Committee I chaired adopted this characterization as well. Still, it has a distinctly negative connotation. Frankly, the first thing I think of when I hear too-good-to-be-true is Donald Trump’s hair. And let’s face it: no researcher on this planet wants to be associated, however remotely, with Donald Trump’s hair.

What we need for post-publication review is a term that does not imply agency or refer to the researcher (we cannot reverse-engineer behavior from the published data) and that does not have a negative connotation. A candidate is implausible pattern of results (IPR). Granted, researchers will not be overjoyed when someone calls their results implausible, but the term does not imply any wrongdoing on their part and yet does express a concern about the data.

But who am I to propose a new nomenclature? If readers of this blog have better suggestions, I’d love to hear them.

Comments

Daniel Lakens, 9 January 2014 at 20:36
I don't think 'implausible' is a very good term, because it's about believability, and we are doing science, not religion. We are talking about statistics. Statistics is about only 1 thing: probabilities. So that is the only way anyone should talk about data. Data allows us to draw inferences about probabilities. Some data patterns are improbable. Others are very likely, but only when the null-hypothesis is true, and not when the alternative hypothesis is true. Some data patterns are extremely improbable to be observed by chance alone. As long as we talk about it in terms of probability, we are doing our jobs and there is no need to worry about this topic.

Nick Brown, 9 January 2014 at 21:01
Ultimately, this is going to require someone to specify which terms apply to which implied conduct (incompetence or fraud) --- kind of like "tax avoidance" vs "tax evasion". The line between accidental and deliberate deception is just that, a line: It has zero width. Stapel and Smeesters are clearly on one side of that line; an undergraduate project that generates post hoc hypotheses and uses the same data to confirm them is very likely on the other. But it gets greyer and greyer as you turn the various dials. For example, as the researchers get more senior, you get closer to "either s/he knew this was wrong, or s/he should have known".

However, once you have decided that terms A and B refer, respectively, to "probably accidental" and "probably deliberate" misconduct, you're back to square one. If we're never prepared to come out in public and say "I reckon this is fraudulent" - which, frankly, we're not - then it all comes out a wash in the end regardless of which words we choose. Cf. how every word in English for a place to evacuate human waste is a euphemism for a washing place.

I'm currently working on an article where I can't demonstrate any fraudulent intent --- indeed, the data is undoubtedly authentic, it's just the conclusions that are absurd --- but, if there is no intent to deceive, the level of statistical incompetence (from two full professors) is quite astonishing. I suspect that outright fraud is rare, but that we're dealing with "bullshit" --- in Frankfurt's terms, that is, the person uttering it genuinely doesn't care if it's true or not, as opposed to lying where they know it's false --- more often than we may care to find out (or admit).

For what it's worth, if I heard the phrase "implausible pattern of results", it would instinctively make me think that actual fraud was being alleged. But maybe that's some form of reverse expectation whereby I'm treating an attempt at cautious language as euphemistic understatement. Get some opinions from non-Brits too. :-)

thom, 10 January 2014 at 08:44
What about unusual pattern of results (UPR)? This is more neutral because nearly all papers will contain some unusual patterns of results. Perhaps too neutral. That said - the inference of fraud or sloppiness requires multiple unusual patterns within and between papers (cf. Simonsohn, 2013).

Unknown, 10 January 2014 at 09:07
You are forgetting at least one (set of) agent(s) here, I think. You already identified the researcher and the data, but you did not mention the reviewer(s) and editor. It is precisely that perspective that I find appealing in the HIBAR label for post-publication reviewing. It could well be that your post-publication review deals with things that were actually discussed during the review process but that it is actually due to the (anonymous) reviewers or the editor that they are not included in the final manuscript.

Chris Chambers, 10 January 2014 at 10:39
Very interesting paper by Gelman and Loken, although (unusually) I find myself disagreeing with a number of their points.

1. I think they are viewing psychology through rose-tinted glasses. The survey of John et al estimated high prevalence rates for a range of QRPs (often near 100%), so much of this behaviour is conscious and bog standard. http://www.cmu.edu/dietrich/sds/docs/loewenstein/MeasPrevalQuestTruthTelling.pdf
That's not to say that there isn't also a lot of unconscious bias going on, but to argue that "We have no reason to think that researchers regularly [fish]" belies the evidence - we have every reason to think this. It would be truer to say that it isn't socially acceptable to consciously admit that QRPs are standard operating procedure in psychology, and that groupthink consolidates the delusion that such practices are justified (and necessary to compete for glamour pubs and grants), provided nobody freely admits to engaging in them. The first rule of fight club is...

2. One part of what they are describing in their paper is essentially a form of HARKing (http://www.sozialpsychologie.uni-frankfurt.de/wp-content/uploads/2010/09/kerr-1998-HARKing.pdf). Many of the cases they raise seem to involve researchers having a general idea of the kind of analysis they want to do, with several plausible hypotheses in mind and several legal analytic options. Based upon either inferential analysis or visual inspection, the researchers find what looks like the strongest effect in the data (perhaps only running one inferential analysis but clearly viewing the data descriptively) and then assign it the best-fitting hypothesis. Whether this is conscious or not, this is a questionable research practice if the researchers are going to use p values. What I find confusing is that later in the paper, Gelman and Loken explicitly advocate HARKing, which (for me at least) undermines their criticisms of the examples they cite. I find myself unsure about how their own research practices differ from those of the studies under scrutiny.

3. I found their dismissal of pre-registration unconvincing. Like many critics, they seem to be viewing pre-registration as a case where *only* pre-registered analyses are reported. This is never the case and it is a straw man. Using a pre-registered methodology, it is entirely reasonable to report exploratory analyses in addition to the confirmatory analyses. Pre-registration simply makes that distinction clear. As a community we need to face the fact that p values mean little when applied in an exploratory context because we can never know what researcher dfs were exploited.

4. Their positive message is again rose-tinted in my view. They say: "Our positive message is related to our strong feeling that scientists are interested in getting closer to the truth." This is part of why people do science, but I think this is not the primary motivation and never will be in a competitive system where the needs of science compete with the needs of individual scientists. The primary goal, I am realising, is for scientists to keep their jobs, meet their career requirements, advance their careers, support their staff, define an identity as a productive worker, and be seen as an authority by others. For many, only when those conditions are met is the quest for truth given a place at the table. The sooner we admit this and dispel the illusion of scientists as objective truth-seekers, the sooner we can re-design incentives to align the needs of the scientists with the goal of revealing truth.

Unknown, 10 January 2014 at 12:08
I'd say "results that are open to doubt".

Unknown, 10 January 2014 at 20:00
It is of course very noble and politically correct to want to avoid explicitly or implicitly accusing scientists of things like organizing fishing expeditions. The problem is, as others have pointed out, including in this blog post, that we also know that fishy procedures are actually used a lot in actual research practice.

Two points: a) not knowing that one should not use p-values if one has generated the hypothesis from the data is of course never a valid excuse for doing so. Ignorance does not protect from the law, so to speak. b) I think esp. editors and reviewers should keep in mind that questionable practices in data analysis are widespread, and therefore they should explicitly address these issues during the review process. For instance, if the hypothesis looks like it may have come from the data, one could ask the authors to address the question of how they conceived of that particular hypothesis (i.e., where it comes from). If that turns out not to be from a certain known theory, the editor could ask the author to make explicit the (presumed) fact that this hypothesis was, in fact, *not* generated after the data, but based on spontaneous intuition or something like that. Making these issues explicit raises awareness among all involved that these particular issues are important, and that it is the authors' responsibility to address them openly. So there is no accusation here, only an urge to be maximally transparent.

Finally: the problem reminds me of the problem of sick leave in large organizations. If 90% of all employees of a department call in sick for the maximum time one can do so legally without negative consequences for their salaries (stuff like that actually happens), one suspects strongly that some of those employees are cheating. The problem is that one does not know who, and does not want to falsely accuse the individual employees who are sick. The only option then probably is to send a doctor to their house when they are often sick.

Greg Francis, 13 January 2014 at 11:42
I agree with Rolf that post publication review critiques should focus on the properties of the data and only tangentially (if at all) on the behaviour of the researchers. In my own critiques I have always tried to emphasize how easy it is to introduce bias into a set of experiments. I always saw this as a way to mitigate possible negative criticism of the original authors, but some readers ignore my clarifying statements and interpret my critiques as "attacks" on the authors.

In some sense, I understand such an interpretation from readers. If a set of experimental results appear "too good to be true" (or whatever term you prefer), there are no nice ways to point out this flaw. (See http://www.stat.columbia.edu/~gelman/research/published/ChanceEthics9.pdf ). When such experimental results are published, there are two broad interpretations: malfeasance or ignorance. The former suggests fraud by the authors. The latter suggests misunderstandings about scientific practice for the authors, reviewers, and the editor. An interpretation of ignorance is arguably worse for the field than an interpretation of malfeasance, because ignorance implies problems with scientific training and misunderstanding among large groups of people. If the problems were just due to fraud, we could punish the evil doers and return to standard practice.

Regarding a term, I would be happy with "implausible pattern of results" (IPR), but I disagree with Rolf that we need a term without a negative connotation. Even constructive criticism has a negative connotation (at least for the person receiving the criticism), and I think we cannot escape that aspect of post publication review. "Too good to be true" focuses on the data, so I think it is appropriate. Schimmack (2012) suggested the term "incredible", which sounds positive at first but really means incredulous in this context.

Anonymous, 20 January 2014 at 01:04
Mr. Zwaan, seems to me like attempts to whitewash bad practices or tip toe around them in order to preserve the feeling of integrity among the academic community (instead of earning it) is precisely what needs to be left behind. Basic politeness should suffice, no need to whine about the specific words used. Psychologists should learn to give and accept criticism without taking it personally, in other words, they should grow up and learn from their mistakes, be they intentional or not.

I'm not saying it's easy given that in academia the need for status and ego gratification as well as the peer pressure pushes toward half-baked reasoning, conceptual obfuscation and butt-hurt, quantitative methods and peer review notwithstanding. As always, the good example is overshadowed by mediocrity unless its merit is recognized and there's no system that can assure this, just a matter of individual human perceptivity and integrity (a matter for psychologists to study!)

Also, it would help that psychologists were less try-hard in promoting their "science".

Anonymous, 20 January 2014 at 01:07
Mr. Zwaan, seems to me like attempts to whitewash bad practices or tip toe around them in order to preserve the feeling of integrity among the academic community (instead of earning it) is precisely what has been wrong with psychology and what needs to be left behind. In this sense, basic politeness should suffice, no need to whine about the specific words used. Psychologists should learn to give and accept criticism, in other words, they should grow up and learn from their mistakes, be they intentional or not.

It would also help if they were less try-hard in promoting their "science".

Drang naar Samenhang