Last week, we discussed the adoption of p < .05 as the threshold at which studies and scientific experiments decide whether the observed differences between measured entities represent true differences or not. Over the past 100 years, the 0.05 significance level has acquired nearly universal acceptance as the arbiter of statistical significance, and hence it has become the binary decider of whether a finding of a difference between two entities in a study is "real" or not. It is mind-blowing to sit back and appreciate how completely it dominates every study, every paper, every journal, every journal editor's decisions, and every grant application.
Thus, suppose we compare a drug to placebo for its impact on survival in a cancer and find that the group treated with drug A has a survival of 13.8 months while the placebo group has a survival of 10.9 months, but analysis of the data yields a p-value of 0.08; drug A would not be judged more effective. Would the study get published? That would depend on more details than I am providing, but most studies do get published, albeit in a less prestigious journal. The authors' conclusion might hint that there was a trend toward greater efficacy with drug A, but they certainly could not conclude that it "worked." But let me ask you: if you or your loved one had this cancer, which would you want to try?
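For readers who like to see the arithmetic, here is a minimal sketch in Python of what such a comparison might look like. The standard deviations and sample sizes below are my own invented assumptions, chosen only so that the numbers land near p = 0.08; a real survival analysis would use time-to-event methods such as a log-rank test rather than a simple comparison of means.

```python
# A minimal sketch with hypothetical summary statistics (not real trial data):
# two arms with mean survival 13.8 vs 10.9 months.  The SDs and sample sizes
# are invented so that the comparison lands near p = 0.08.
from scipy import stats

m_drug, sd_drug, n_drug = 13.8, 8.2, 50   # drug A arm (assumed values)
m_plac, sd_plac, n_plac = 10.9, 8.2, 50   # placebo arm (assumed values)

t_stat, p_value = stats.ttest_ind_from_stats(
    m_drug, sd_drug, n_drug, m_plac, sd_plac, n_plac, equal_var=True)

# 95% confidence interval for the difference in mean survival
se_diff = (sd_drug**2 / n_drug + sd_plac**2 / n_plac) ** 0.5
df = n_drug + n_plac - 2
t_crit = stats.t.ppf(0.975, df)
diff = m_drug - m_plac
print(f"p = {p_value:.2f}")   # ~0.08
print(f"difference = {diff:.1f} months, "
      f"95% CI {diff - t_crit*se_diff:.1f} to {diff + t_crit*se_diff:.1f}")
```

Even at p = 0.08, the confidence interval for the survival difference runs from just below zero to more than six months: most of the plausible range favors drug A, which is exactly the information a rigid .05 cutoff throws away.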
When Ronald Fisher adopted .05 as the probability at which to declare statistical significance, it does not seem he intended that it become an inflexible, all-or-none barrier to calling a finding significant. In the above example, the general tendency is to decide that, because the experiment did not achieve statistical significance, the null hypothesis is correct and there is no difference between drug A and placebo. But the truth is that we should never conclude there is no association or no difference just because the p-value did not fall below .05. This is a widespread problem.
Let us consider another problem with statistical significance. We conduct two studies to assess the association between tobacco use and colon cancer. In both studies we find the same effect size: smoking raises the risk of colon cancer by 40%. But in one study the p-value is .04 while in the other it is .15, so one result is deemed statistically significant while the other is not. Yet both studies had the same finding! Nonetheless, the norm would be for the nonsignificant study to be called negative and to have difficulty getting published.
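To see how this happens, here is a minimal sketch in Python with hypothetical counts; all of the numbers are my assumptions, invented so that both studies show the same relative risk of 1.4 and differ only in size.

```python
# Two hypothetical studies with the same relative risk (1.4) but different
# sample sizes; the counts are invented for illustration only.
from scipy.stats import chi2_contingency

def summarize(name, cases_exp, n_exp, cases_unexp, n_unexp):
    rr = (cases_exp / n_exp) / (cases_unexp / n_unexp)
    table = [[cases_exp, n_exp - cases_exp],
             [cases_unexp, n_unexp - cases_unexp]]
    chi2, p, _, _ = chi2_contingency(table, correction=False)  # Pearson chi-square
    print(f"{name}: RR = {rr:.1f}, p = {p:.2f}")

summarize("Larger study ", 84, 2400, 60, 2400)   # RR 1.4, p around .04
summarize("Smaller study", 42, 1200, 30, 1200)   # RR 1.4, p around .15
```

The smaller study estimates exactly the same 40% increase in risk; it simply measures it with less precision. Calling it "negative" discards that information.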
For success in academia, we often say, "Publish or perish." There is an overwhelming mandate to conduct, analyze and interpret studies in such a way as to maximize one's publications. And since research findings usually cannot be published if they are not statistically significant, there is a major incentive to maximize findings with p < .05. These efforts by investigators are often either outright cheating or skirting the edge of it.
For example, take the study with which we began this article: after an overall finding of p = .08, it would be commonplace to undertake secondary analyses of subcohorts, and inevitably, if one does enough such analyses, one or more will yield a p-value < .05. The investigator will then highlight this finding in his or her paper as if it were the goal of the study. This data-dredging is very commonly practiced and is recognized by journal editors, but it is almost impossible to monitor. Other forms of p-hacking include repeatedly analyzing the data as they accumulate, stopping once the results reach p < .05 and otherwise collecting more data until they do; adding more data after a study is completed in order to achieve a statistically significant result; and putting off decisions on whether to include outliers in the analysis until after the initial analyses have been completed. Likewise, if one is close to a statistically significant finding, one can find quasi-legitimate ways to cross the finish line: selectively exclude patients from the study for dubious reasons, change the outcome in some minor way, and so on. Since these research methods and findings are inherently dishonest and have no a priori basis, it is not surprising to learn that when scientists attempt to replicate published statistically significant findings in subsequent studies, only about 30% are successfully replicated!
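A small simulation makes the point about subgroup dredging concrete. This is an illustrative sketch, not a reproduction of any real trial; the numbers of patients, subgroups, and simulated trials are arbitrary assumptions.

```python
# Simulate trials in which the drug truly has no effect, test 20 arbitrary
# subgroups in each, and count how often at least one subgroup comes out
# "significant."  All sizes here are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, n_patients, n_subgroups = 1000, 400, 20
false_positive_trials = 0

for _ in range(n_trials):
    outcome = rng.normal(size=n_patients)            # no true drug effect
    treated = rng.integers(0, 2, size=n_patients)    # random assignment
    subgroup = rng.integers(0, n_subgroups, size=n_patients)
    p_values = []
    for g in range(n_subgroups):
        in_g = subgroup == g
        a = outcome[in_g & (treated == 1)]
        b = outcome[in_g & (treated == 0)]
        if len(a) > 1 and len(b) > 1:
            p_values.append(stats.ttest_ind(a, b).pvalue)
    if min(p_values) < 0.05:
        false_positive_trials += 1

print(f"Trials with at least one p < .05 subgroup: {false_positive_trials / n_trials:.0%}")
```

With 20 looks at data containing no real effect, roughly two-thirds of trials will produce at least one "significant" subgroup purely by chance, which is why an unplanned subgroup finding proves so little.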
A 2019 letter in Nature, co-authored by Sander Greenland of UCLA and signed by over 800 statisticians and epidemiologists, called for abandoning the term "statistical significance" in the description of study results, a more sparing use of p < .05 as a criterion for decision-making, greater reliance on 95% confidence intervals, and more thoughtful consideration of study results. Some journals have already adopted these changes, and it is likely they will become the norm throughout the biomedical research community in the coming decade.
Alfred I. Neugut, M.D., Ph.D., is a medical oncologist and cancer epidemiologist at Columbia University Irving Medical Center/New York Presbyterian and Mailman School of Public Health in New York. Email [email protected].
This article is for educational purposes only and is not intended to be a substitute for professional medical advice, diagnosis or treatment, and does not constitute medical or other professional advice. Always seek the advice of your qualified health provider with any questions you may have regarding a medical condition or treatment.