I have always believed that an educated person requires a basic familiarity with probability and statistics: how else can one manage one's investments, make sense of batting averages and baseball statistics, or understand the nature of polling data on the news? And yet when I meet and interview high school or college students for internships or summer jobs, they generally know little or nothing of statistics, though they have usually had AP calculus or similar mathematical training.
I had four semesters of calculus, which I enjoyed, but in almost 40 years of research in the lab, clinical trials and epidemiology, I cannot recall differentiating an equation or doing an integration. These skills are certainly critical for those who require them, such as engineers and physicists, but not for the average person. And insofar as medicine is concerned, every published article and study has a section that describes the frequently complex statistical methods that were used, and the study's results are couched in statistical terminology. I daresay the average clinician must somehow absorb what statistical knowledge he or she can through osmosis. But, apropos of today's article, almost all physicians and health care providers seem to have acquired an intuitive, instinctive grasp of the concepts of P < .05 and statistical significance.
I believe that over the next five to 10 years we will undergo an earthquake in how scientific studies are conducted, analyzed and interpreted in medicine and biology, possibly extending to the rest of science as well, yet most clinicians and trialists are blithely unaware of it.
A key part of any experiment or study is the comparison between two entities: How does drug A compare to drug B in improving survival? How does exposure to a chemical compare to no exposure in the etiology of a cancer? How does lower socioeconomic status compare with higher socioeconomic status in affecting compliance with a chemotherapy regimen? For all of these questions, we conduct experiments in which the initial presumption is that there is no difference between the two entities, what we refer to as the null hypothesis. At the outset, we assume that drugs A and B are equivalent in their effects, as are the chemical exposure and no exposure, and higher and lower socioeconomic status. Think of it in legal terms: innocent until proven guilty. It is the task of the investigator, through the experiment and the collected experimental data, to convince us that the two entities are indeed truly different from each other.
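In the shorthand statisticians use (an illustration of the idea, not notation taken from any particular study), the null hypothesis for the drug comparison would be written:

H0: μA = μB (the mean survival on drug A equals the mean survival on drug B)

versus the alternative hypothesis, H1: μA ≠ μB (the two means truly differ). The experiment must supply evidence strong enough to reject H0.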
Because of random variability, the outcomes will always be at least slightly different. The question is whether the difference between the two is “significant.” For example, if drug A gives a survival of 8.9 months and drug B gives a survival of 8.8 months, then it is likely that this difference is nonsignificant and the 0.1-month discrepancy is simply due to chance. However, if drug A gives a survival of 8.9 months and drug B gives a survival of 7.2 months, can we conclude that the two drugs have produced outcomes that are, in fact, truly different?
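To see how much noise chance alone can generate, consider a minimal simulation sketch. Everything here is an assumption made for illustration (the exponential survival distribution, the 8.5-month mean, the 100 patients per arm); it is not data from any study:

```python
# A minimal sketch: simulate two trial arms drawn from the SAME
# hypothetical survival distribution and compare their averages.
import numpy as np

rng = np.random.default_rng(seed=1)
true_mean = 8.5      # assumed true mean survival in months (illustrative)
n_patients = 100     # assumed patients per arm (illustrative)

arm_a = rng.exponential(scale=true_mean, size=n_patients)
arm_b = rng.exponential(scale=true_mean, size=n_patients)

# Even though the two "drugs" are identical by construction,
# the observed means will almost never coincide exactly.
print(f"arm A: {arm_a.mean():.1f} months, arm B: {arm_b.mean():.1f} months")
```

Run this with different seeds and the two arms' average survivals come out different every time, even though nothing real separates them. That is the noise against which any claimed drug effect must be judged.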
The p-value is the metric we use to decide whether the difference is significant enough to reject the null hypothesis (the presumption that the two drugs have equivalent outcomes). The p-value is a little difficult to define, but it can be understood this way: If drug A and drug B were truly equivalent (i.e., the null hypothesis is correct), how likely would we be to obtain results at least as extreme as those we observed in our experiment? In other words, if drug A and drug B are equivalent, then the outcomes of 8.9 months and 8.8 months seem plausible, with the 0.1-month difference explained by randomness. However, if drug A and drug B are equivalent, it is much less likely that we would get results as disparate as 8.9 months and 7.2 months. Such results (differing by 1.7 months or more) might occur in only 3% of experiments that we run, in which case the p-value is 3%, or 0.03. We perform statistical tests to measure the probability that a difference of this magnitude or greater would occur by chance alone.
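To make this concrete, here is a minimal sketch of one common way to estimate a p-value: a permutation test. The survival times below are invented for illustration and come from no actual trial; the logic, not the numbers, is the point:

```python
# A minimal sketch of a permutation test for estimating a p-value.
# All survival times are invented for illustration.
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical survival times (months) for two trial arms.
drug_a = np.array([10.2, 8.1, 9.5, 11.0, 7.8, 9.9, 8.4, 10.6])
drug_b = np.array([7.0, 6.5, 8.2, 7.9, 6.1, 7.4, 8.0, 6.8])

observed_diff = drug_a.mean() - drug_b.mean()

# Under the null hypothesis the labels "A" and "B" carry no information,
# so we pool all patients and reshuffle the labels many times, asking how
# often chance alone produces a gap as large as the one we observed.
pooled = np.concatenate([drug_a, drug_b])
n_a = len(drug_a)
n_permutations = 100_000

extreme = 0
for _ in range(n_permutations):
    rng.shuffle(pooled)                      # random relabeling of patients
    diff = pooled[:n_a].mean() - pooled[n_a:].mean()
    if abs(diff) >= abs(observed_diff):      # "as extreme or more extreme"
        extreme += 1

p_value = extreme / n_permutations
print(f"observed difference: {observed_diff:.2f} months; p = {p_value:.4f}")
```

The p-value here is literally the fraction of reshuffled "experiments" in which chance alone produces a difference at least as large as the observed one, the same "might only happen in 3% of experiments" reasoning described above.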
As we shall see, the dividing line or threshold for statistical significance was set, for better or for worse, at the 0.05 level. In the next two articles, we will describe how this occurred about a century ago, how this became the norm throughout the scientific world, and then discuss how recent problems have led to proposed modifications in this long-held practice.
Alfred I. Neugut, MD, PhD, is a medical oncologist and cancer epidemiologist at Columbia University Irving Medical Center/New York Presbyterian and Mailman School of Public Health in New York. Email: [email protected].
This article is for educational purposes only and is not intended to be a substitute for professional medical advice, diagnosis, or treatment, and does not constitute medical or other professional advice. Always seek the advice of your qualified health provider with any questions you may have regarding a medical condition or treatment.