In last week’s article, we described how hypothesis testing is a critical part of scientific experimentation and how significance testing is a fundamental component in that endeavor, permitting the scientist to test whether observed results differ in a non-random fashion from the null hypothesis. The probability level conventionally used to make that determination is p<.05.
While the idea of statistical significance goes back to the 1700s, its modern form really begins in 1925, when Ronald Fisher, a British statistician educated at Cambridge, selected .05, or a probability of one in 20, as a reasonable level at which to reject the null hypothesis. To put this concept in concrete terms, a finding at the p<.05 level means that if there were truly no difference between the two entities and you repeated the experiment under the same conditions 100 times, you would expect to see a difference as large as or larger than the one observed fewer than five times by chance alone.
Why did Fisher initially investigate this problem? In one of his books, he describes the circumstances. In the early 1920s, he was working as a statistician at Rothamsted, an agricultural research station north of London, studying plant genetics. At the same facility were Blanche Bristol, an algologist (she had a PhD from Birmingham University where her dissertation was on algae) and her fiancé, William Roach, a botanist. These three sat down for tea one afternoon, and Fisher, a gentleman, poured a cup of tea for the lady, putting milk in the cup first. However, she demurred, stating that she preferred the tea poured in the cup first. Fisher was skeptical about this, believing the order of pouring should not affect the flavor, but she insisted. Roach said, “Let us test her.” Is that what a loving fiancé would say?
On the spot, Fisher designed an experiment in which Dr. Bristol was given eight cups of tea, four with milk poured first and four with tea poured first, presented in random order. If one calculates the number of ways to choose four cups out of eight, it comes to 70, so her chances of guessing all eight cups correctly were one in 70, or 0.014. For the analysis of this experiment, Fisher devised a new statistical test, Fisher’s exact test, which is still in use today. Apparently, a small audience gathered to watch this experiment. In fact, the lady got them all correct. And we can note that 0.014 falls well below the .05 threshold (p<.05). How exactly she was able to recognize the order of the tea and milk I will leave to your imagination, though among the many, many articles that have been written about this tea-break episode, one mentions that George Orwell wrote an essay in which he lauded the pouring of tea first because it gives the tannins a chance to percolate. And I could recommend this experiment for the next time some obnoxious lunch guest claims to be able to distinguish between an expensive and an inexpensive bottle of wine. Parenthetically, Dr. Bristol did marry Dr. Roach in June of 1923.
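For readers who want to check the arithmetic, here is a minimal sketch in Python. The function name and variable names are my own, chosen for illustration; the only thing taken from the experiment is its design of eight cups, four of each kind, with the taster asked to pick out the four milk-first cups.

```python
from math import comb

def prob_exactly_k_correct(k, cups_per_kind=4):
    """Chance of picking exactly k of the milk-first cups by pure guessing.

    The taster selects cups_per_kind cups out of 2 * cups_per_kind;
    k of her picks are truly milk-first and the rest are not.
    """
    n = cups_per_kind
    return comb(n, k) * comb(n, n - k) / comb(2 * n, n)

# Number of distinct ways to choose four cups out of eight.
total_arrangements = comb(8, 4)

# Only one of those choices matches the true milk-first cups.
p_all_correct = prob_exactly_k_correct(4)

print(total_arrangements)        # 70
print(round(p_all_correct, 3))   # 0.014
```

Because only a perfect score is at least as extreme as a perfect score, this single probability is also the one-sided p-value for her result.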
Fisher continued to mull over this tea experiment and considered what would have happened had she switched two cups, getting three of each kind right instead of four. The probability of doing that by pure guessing is 16 in 70, or about 0.23, far above .05; had she done that, he would have been considerably less confident in her ability to discern the tea and milk order. He realized that if he used more cups, his confidence would grow, and that 12 cups, six each way, would be a preferable number for this experiment. This episode allowed him to clarify the components of a good experiment: control groups, randomization and statistical analysis, none of which were routine parts of experimental design in the early 1920s.
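The effect of a single mistake, and of moving to 12 cups, can be sketched the same way. This is my own illustration rather than Fisher's calculation; the helper name chance_of_at_least is hypothetical, and it assumes the taster always picks exactly as many cups as there are milk-first cups.

```python
from math import comb

def chance_of_at_least(k, per_kind):
    """Probability of identifying k or more milk-first cups by guessing alone."""
    total = comb(2 * per_kind, per_kind)
    hits = sum(comb(per_kind, j) * comb(per_kind, per_kind - j)
               for j in range(k, per_kind + 1))
    return hits / total

# Eight cups: exactly one pair switched (three of four correct).
p_exactly_one_swap = comb(4, 3) * comb(4, 1) / comb(8, 4)   # 16/70
# Eight cups: three or four correct, the one-sided p-value for one switch.
p_one_swap_or_better = chance_of_at_least(3, 4)             # 17/70

# Twelve cups, six each way.
p_perfect_12 = chance_of_at_least(6, 6)                     # 1/924
p_one_swap_or_better_12 = chance_of_at_least(5, 6)          # 37/924

print(f"{p_exactly_one_swap:.3f}")        # 0.229
print(f"{p_one_swap_or_better:.3f}")      # 0.243
print(f"{p_perfect_12:.3f}")              # 0.001
print(f"{p_one_swap_or_better_12:.3f}")   # 0.040
```

With eight cups, a single switched pair leaves the result nowhere near .05, whereas with 12 cups the taster could make that one mistake and still come in under the threshold.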
Fisher’s 1925 book, “Statistical Methods for Research Workers,” became one of the most influential works in the history of statistics and biomedical research. Out of the tea-break experiment, he introduced the concepts of the null hypothesis and statistical significance, including the use of .05 as the threshold for its definition. Apparently, he chose .05 from the bell curve as the area that lies in the two tails beyond 1.96 standard deviations from the mean (or average). The standard deviation is a measure of how dispersed the data are around the mean. Thus, in a bell curve, 68% of the data lie within one standard deviation of the mean while 95% fall within 1.96 standard deviations, and thus 5% of the sample is at the two ends or tails, i.e., 2.5% at each end of the curve.
Thus, the significance of 1.96, the approximate value of the 97.5th percentile point of the standard normal distribution used in probability and statistics, also originated in this book.
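These bell-curve figures are easy to confirm. The sketch below uses the NormalDist class from Python's standard library; the variable names are mine, and it assumes the standard bell curve with a mean of 0 and a standard deviation of 1.

```python
from statistics import NormalDist

# Standard bell curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Share of the data within one standard deviation of the mean.
within_one_sd = standard_normal.cdf(1.0) - standard_normal.cdf(-1.0)

# Share of the data within 1.96 standard deviations of the mean.
within_196_sd = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)

# Area in the upper tail beyond 1.96 standard deviations.
upper_tail = 1.0 - standard_normal.cdf(1.96)

# The point below which 97.5% of the data fall.
z_975 = standard_normal.inv_cdf(0.975)

print(f"{within_one_sd:.4f}")   # 0.6827
print(f"{within_196_sd:.4f}")   # 0.9500
print(f"{upper_tail:.4f}")      # 0.0250
print(f"{z_975:.3f}")           # 1.960
```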
As a postscript, before we go on to how this is currently evolving, I would be at fault if I did not comment that while Fisher’s intellectual contributions were certainly prodigious, he himself, similar to many of the British upper classes of the era, was personally odious. He was a committed eugenicist, a rabid racist and antisemite, and, during World War II, was an open pro-Nazi sympathizer and Nazi apologist. He deserves no honor or admiration from any of us.
Alfred I. Neugut, MD, PhD, is a medical oncologist and cancer epidemiologist at Columbia University Irving Medical Center/New York Presbyterian and Mailman School of Public Health in New York.
This article is for educational purposes only and is not intended to be a substitute for professional medical advice, diagnosis, or treatment, and does not constitute medical or other professional advice. Always seek the advice of your qualified health provider with any questions you may have regarding a medical condition or treatment.