What is a p-value? You might be surprised to hear that the topic is hotly debated; that the mere mention of p-values can get a statistician more excited than a fat kid on Halloween. The p-value is a probability, but it is not the probability that most people think it is. It is not the probability that the results of a trial are true. The p-value is much more complicated and counterintuitive than that.
To understand p-values, you first have to understand the null hypothesis. (And if you want to stop reading here, I understand, but these concepts are really important if you are going to critically appraise the medical literature.) The null hypothesis is the default position that there is no difference between two specified groups. We are doing research in an attempt to disprove the null hypothesis and demonstrate that there is some difference between two groups. In plain language, the p-value is essentially the probability of obtaining a result equal to or more extreme than the result actually observed, assuming that the null hypothesis is correct. (Although, even this definition might ruffle some feathers at the American Statistical Association.)
The assumption that the null hypothesis is correct can get us in a little trouble, but let’s start with a pretty simple example. Imagine I wanted to determine whether a coin was weighted to come up heads more often than tails. I would start my experiment by setting the null hypothesis: I would assume that when I flip the coin, heads and tails are equally probable results. I would then perform the experiment: flip the coin 10 times in a row. Imagine the results were 9 heads and 1 tails. With a fair coin, a result at least this lopsided (9 or more of either face) will occur about 2 times in one hundred, or 2% of the time. This corresponds to a p-value of approximately 0.02. (If I only count results with 9 or more heads, the figure is closer to 1%.) I interpret this p-value as meaning that if I assume the coin is fair (the null hypothesis), these results (or results more extreme, e.g. 10 heads and 0 tails) would occur about 2% of the time.
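For anyone who wants to check the coin arithmetic, here is a minimal sketch using only the Python standard library. The names (prob_heads, one_sided, two_sided) are my own, and the choice between a one-sided and a two-sided count is an assumption you have to make before looking at the data. The roughly 2% figure matches the two-sided count.

```python
from math import comb

FLIPS = 10
HEADS_OBSERVED = 9


def prob_heads(k, n=FLIPS):
    """Probability of exactly k heads in n flips of a fair coin."""
    return comb(n, k) / 2 ** n


# Probability of exactly the observed result (9 heads, 1 tails).
exactly_nine = prob_heads(HEADS_OBSERVED)

# One-sided p-value: 9 or more heads.
one_sided = sum(prob_heads(k) for k in range(HEADS_OBSERVED, FLIPS + 1))

# Two-sided p-value: any split at least as lopsided as 9-1, in either direction.
observed_distance = abs(HEADS_OBSERVED - FLIPS / 2)
two_sided = sum(
    prob_heads(k)
    for k in range(FLIPS + 1)
    if abs(k - FLIPS / 2) >= observed_distance
)

print(f"exactly 9 heads:        {exactly_nine:.4f}")  # 10/1024
print(f"one-sided (9+ heads):   {one_sided:.4f}")     # 11/1024
print(f"two-sided (9-1 either): {two_sided:.4f}")     # 22/1024
```

The one-sided count comes out near 1% and the two-sided count near 2%, which is a reminder that even the "simple" coin example depends on a choice the researcher makes.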
It is important to note that the calculation of the p-value assumes the null hypothesis is true. Therefore, it is impossible for the p-value to prove the null hypothesis false. It only gives you information about how likely your results are if the null hypothesis is true.
Furthermore, the p-value assumes that the study contains absolutely no bias. A p-value less than 0.05 suggests the null hypothesis is unlikely or that the study was biased. We tend to forget that second part, even though it probably explains the majority of low p-values in medicine.
Finally, although the p-value may be used as evidence against a null hypothesis, it doesn’t tell us what hypothesis should replace it. The p-value says nothing about what is true. In the coin flipping example, the p-value of 0.02 might be enough for me to reject the null hypothesis that “these results are consistent with a fair coin flip”. However, there are numerous possible explanations for these results. Maybe the coin itself is unfair, but it would be a mistake to simply accept this alternate hypothesis. Perhaps the coin was perfectly fair, but the person flipping the coin was using sleight of hand to skew the results. In medicine, we often assume a p-value less than 0.05 means that a treatment works, when in fact the p-value could simply be demonstrating some “sleight of hand” by the researchers. The p-value may allow us to reject the null hypothesis, but doesn’t tell us “the truth”.
What is significant?
The P-value does not define significance; the researcher does. Before starting an experiment, we decide (or should) on the degree of error we are willing to accept. This will depend a great deal on what we are studying.
Consider my coin. If I were using that coin to try to cheat my way into the last slice of pie at Christmas dinner, I might accept a p-value of 0.02, even if I should expect to see those results 2% of the time with a fair coin. Pie is important, but I can accept a 2% risk. However, if I were staking my life on the coin flip, I would probably demand more certainty. (And in either case, I would probably want to repeat the study to get more information.) The p-value does not define truth; it simply helps us assess uncertainty. How much uncertainty we are willing to accept will depend on the question being asked, and therefore a one-size-fits-all threshold for the p-value is ridiculous.
It is also worth noting that the p-value threshold of 0.05, although often treated as sacrosanct in medicine, has no special meaning. We use it in medicine because we had to choose something and it’s relatively practical for biologic experiments. However, much stricter thresholds are used in other areas of science. For example, in physics the standardly accepted p-value threshold is about 0.0000003 (see Fatovich and Phillips 2017). In other words, we accept far more uncertainty in medicine, when deciding whether to administer potentially deadly chemicals to patients, than is acceptable in physics when trying to decide whether the Higgs boson exists. Does that make sense to you?
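To see where a number like 0.0000003 might come from, here is a small sketch. I am assuming it refers to the "five sigma" convention in particle physics, read as a one-sided tail of a normal distribution; the function name one_sided_p is mine.

```python
from math import erfc, sqrt


def one_sided_p(sigma):
    """Upper-tail probability of a standard normal beyond `sigma` standard deviations."""
    return 0.5 * erfc(sigma / sqrt(2))


# 1.645 sigma is roughly the one-sided 0.05 threshold familiar from medicine;
# 5 sigma is the assumed physics discovery convention.
for sigma in (1.645, 2, 3, 5):
    print(f"{sigma} sigma -> p = {one_sided_p(sigma):.2e}")
```

Under that assumption, five sigma works out to a p-value of roughly 3 in 10 million, several orders of magnitude stricter than 0.05.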
It is worth repeating: The p-value does not define what is real
The p-value simply gives us a sense of the potential difference between observed data and the null hypothesis. In the eyes of Ronald Fisher, the statistician who popularized the p-value, it was an informal way to judge whether data were worthy of a second look. The p-value doesn’t define truth. The foundation of science is replication, and the p-value was only intended to tell us which studies are worth replicating.
Although the p-value provides us with some statistical information, it is never the most important information about a trial. Before we even consider the p-value, we need to know that the trial was fair (free of significant bias). In our study to determine whether I was using an unfair coin, the p-value would be pretty meaningless if you found out I was cheating, and simply ignoring the flips where tails occurred. (For a simple approach to assessing the validity of medical research, see the post: Evidence Based Medicine is Easy.) Furthermore, the statistical chance of there being a difference between groups doesn’t give us any information about the magnitude of that difference. Would you care about a “statistically significant” p-value if it meant that I had a 50.00000000001% chance of flipping heads?
There are so many pitfalls in the interpretation of the p-values that some journals have completely banned their use. I think that is probably extreme, but agree that most people in medicine would do a far better job critically appraising research if they completely ignored the p-value.
The p-value is probably the most ubiquitous and at the same time, misunderstood, misinterpreted, and occasionally miscalculated index in all of biomedical research. (Goodman 2008)
Pitfall #1: Assuming a low p-value means that the results of a trial are true
A low p-value does not prove the results of the trial are true. Unfair tests can also result in low P-values. A study’s methodology is far more important than a study’s statistics. The precise finishing times in an Olympic race wouldn’t matter if the race were unfair (one runner was on steroids). Similarly, the p-value is irrelevant if the trial is unfair.
A p-value <0.05 does not mean that you have proven your experimental hypothesis. There is no such thing as a p-value of 0. We never completely exclude the null hypothesis, we just demonstrate that it is unlikely. Furthermore, demonstrating that the experimental data is not consistent with the null hypothesis does not prove that your alternative hypothesis is correct. There are lots of reasons that two groups might look different. The study may not have been fair, or there may be some other, unrecognized explanation. In some ways, a p-value is a lot like a legal verdict. Just because someone has been declared “not guilty” (there was not enough evidence to prove he was guilty) does not mean that he is innocent.
It bears repeating: the statistician who popularized the p-value intended it as a tool to indicate results that need further scrutiny. A p-value less than 0.05 means that a result is interesting enough to study further, not that the results are fundamentally true.
Note: Results with p-values close to 0.05 have a >20% chance of being wrong. (Goodman 2001, Johnson 2013)
Pitfall #2: Declaring a trial ‘negative’ because of a p-value >0.05
A P-value >0.05 does not mean that the groups are the same. Just like a P-value doesn’t prove a treatment works, it also doesn’t prove a treatment doesn’t work. There are many reasons for trials to have false negative results. (See this post on what to do when a study is negative.)
Pitfall #3: Treating a p-value of 0.04 as fundamentally different from a p-value of 0.06
This should be obvious, but there is absolutely nothing special or magical about the p-value of 0.05, even though it is sometimes treated that way in medicine.
Pitfall #4: Conflating statistical and clinical significance
Statistical significance is not the same as clinical significance. The P-value is highly dependent on the number of patients in a trial. With very large trials, it is very easy to get “significant” p-values. However, the difference between a blood pressure of 142/90 and 141/90 will never be important to the patient, no matter how small the p-value is.
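A rough sketch of how sample size alone drives the p-value, using the blood pressure example. The standard deviation of 15 mmHg is a number I made up for illustration, and the simple two-group z-test (equal group sizes, known SD) is a simplification, not how any particular trial was analyzed.

```python
from math import erfc, sqrt


def two_sided_p(diff, sd, n_per_group):
    """Two-sided p-value from a z-test comparing two group means (equal n, known SD)."""
    standard_error = sd * sqrt(2 / n_per_group)
    z = diff / standard_error
    return erfc(abs(z) / sqrt(2))


DIFF_MMHG = 1.0  # systolic 142 vs 141
SD_MMHG = 15.0   # hypothetical spread of blood pressure in each group

# Same 1 mmHg difference, increasingly large trials.
for n in (100, 1_000, 10_000, 100_000):
    p = two_sided_p(DIFF_MMHG, SD_MMHG, n)
    print(f"n per group = {n:>7}: p = {p:.2g}")
```

The difference between the groups never changes, and it never becomes clinically meaningful, but with enough patients the p-value drops far below 0.05.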
Pitfall #5: Considering a p-value in isolation (not considering pretest probability)
The p-value ignores pretest probability. A new trial can only be judged in the context of the scientific literature. Much like diagnostics in our clinical practice, to properly interpret a p-value, we must know the pretest probability. Even if 2 trials have the exact same p-value, I am much less likely to believe the one that claims that the earth is flat than the one that claims the sun is a star. We often forget this in medicine, and will consider a single trial of a pseudoscientific medical practice like acupuncture because of a low p-value, instead of considering the broad scientific consensus that energy meridians don’t exist and acupuncture can’t work.
Unfortunately, there is no clear method to judge a study’s pre-test probability. In general, we know the chance that a new medication will help patients is low, because the vast majority of trials of new medications are negative. Therefore, a single positive study, no matter how low the p-value, should almost never convince us that a therapy works. There are too many other explanations for the low p-value, and we shouldn’t expect the trial to be positive, so we should wait for replication studies for confirmation.
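The diagnostic test analogy can be made concrete with a little arithmetic. This is only a sketch: the pretest probabilities and the 80% power are numbers I picked for illustration, the study is treated as simply "positive" at a 0.05 threshold, and bias is assumed to be zero (which it never is).

```python
def prob_effect_is_real(prior, power=0.8, alpha=0.05):
    """Chance that a 'positive' study reflects a real effect (Bayes' rule).

    prior: pretest probability that the treatment truly works
    power: chance a study detects a real effect (assumed value)
    alpha: chance of a positive result when there is no effect
    """
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)


# Hypothetical pretest probabilities: a coin toss, a typical new drug, a long shot.
for prior in (0.5, 0.1, 0.01):
    posterior = prob_effect_is_real(prior)
    print(f"pretest probability {prior:.0%} -> post-test probability {posterior:.0%}")
```

With these assumed numbers, a positive trial of a long-shot therapy (1% pretest probability) is still much more likely to be a false positive than a true one, which is why replication matters.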
Pitfall #6: Not adjusting for multiple comparisons
A p-value is valid when making a single comparison. Medical studies often make multiple (sometimes even hundreds of) comparisons. Because the p-value describes a probability, there will be times when it seems to be “significant” by chance alone. Making multiple comparisons drastically increases the probability of a chance finding. Multiple comparisons require adjustment.
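Here is a quick sketch of how fast the problem grows. It assumes the comparisons are independent and that there is truly no effect anywhere, which is a simplification. The Bonferroni correction shown is just one common (and conservative) adjustment, not the only option.

```python
ALPHA = 0.05


def chance_of_false_positive(comparisons, alpha=ALPHA):
    """Chance of at least one p < alpha among independent comparisons with no real effect."""
    return 1 - (1 - alpha) ** comparisons


def bonferroni_threshold(comparisons, alpha=ALPHA):
    """Per-comparison threshold that keeps the overall false positive rate at or below alpha."""
    return alpha / comparisons


for k in (1, 10, 20, 100):
    print(
        f"{k:>3} comparisons: "
        f"{chance_of_false_positive(k):.0%} chance of a spurious 'significant' result, "
        f"Bonferroni threshold {bonferroni_threshold(k):.4f}"
    )
```

Under these assumptions, a study making 20 comparisons has roughly a 64% chance of finding at least one "significant" result even when nothing is going on.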
Pitfall #7: Assuming the p-value is a measure of the strength of a scientific claim
P-values are not reproducible. Multiple studies of the same phenomenon will have dramatically different p-values. The p-value changes very steeply (roughly exponentially) with how close the 95% confidence interval comes to crossing the null value (0 for a difference between groups). The p-value is a measure of this single data-set, and does not correlate with the strength of the scientific claim.
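A small simulation makes the point. This is a sketch under assumptions I chose for illustration: every study examines the same real effect, the effect is sized so that a study has roughly a 50% chance of reaching p < 0.05, and each study's result is summarized as a z-score with normally distributed sampling error.

```python
import random
from math import erfc, sqrt

random.seed(1)  # fixed seed so the run is repeatable

TRUE_Z = 2.0      # hypothetical true effect, in standard-error units
N_STUDIES = 1000  # identical studies of the same phenomenon


def simulate_p_value():
    """Run one simulated study and return its two-sided p-value."""
    observed_z = random.gauss(TRUE_Z, 1.0)  # true effect plus sampling noise
    return erfc(abs(observed_z) / sqrt(2))


p_values = sorted(simulate_p_value() for _ in range(N_STUDIES))
share_significant = sum(p < 0.05 for p in p_values) / N_STUDIES

print(f"smallest p: {p_values[0]:.2g}")
print(f"median p:   {p_values[N_STUDIES // 2]:.2g}")
print(f"largest p:  {p_values[-1]:.2g}")
print(f"share with p < 0.05: {share_significant:.0%}")
```

Even though every simulated study looks at exactly the same real effect, the p-values range from tiny to nowhere near significant, and about half of the studies would be labelled "negative".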
Pitfall #8: Discounting the impact of p-hacking.
There are many forms of p-hacking, but the basic concept is that researchers will often try out several different statistical analyses, or look at multiple different datasets, and then only report the ones with “positive” p-values. The more comparisons you make, the more likely you are to find a statistically significant result by chance alone, so the practice of p-hacking completely invalidates the reported p-value. (Sometimes this practice is intentional, such as when companies selectively report their studies when trying to sell a drug. However, p-hacking is often unintentional, and perhaps even taught as normal research practice.)
There is evidence that p-hacking is extremely widespread. One study of abstracts published from 1990 to 2015 showed that 96% contained at least one p-value < 0.05. Are we that wildly successful in research? Or, are statistically nonsignificant results published less frequently (probably)? Or, do we try to find something in the data to report as significant, i.e. p-hack (likely)?
Unless you are a statistician (and maybe even if you are), you are probably interpreting p-values incorrectly. If you want to understand the medical literature, I suggest learning how to identify bias in studies, and generally ignoring p-values.
Other FOAMed Resources
A great (simple) video on p value from Cassie Kozyrkov
Goodman SN. Of P-values and Bayes: a modest proposal. Epidemiology. 2001;12(3):295-7.
Goodman S. A dirty dozen: twelve p-value misconceptions. Semin Hematol. 2008;45(3):135-40. doi: 10.1053/j.seminhematol.2008.04.003. Erratum in: Semin Hematol. 2011;48(4):302. PMID: 18582619.
Graves RS. Users’ Guides to the Medical Literature: A Manual for Evidence-Based Clinical Practice. Journal of the Medical Library Association. 2002;90(4):483.
Johnson VE. Revised standards for statistical evidence. Proceedings of the National Academy of Sciences. 2013;110(48):19313-19317.
Wasserstein RL, Lazar NA. The ASA Statement on p-Values: Context, Process, and Purpose. The American Statistician. 2016;70(2):129-133.
Wasserstein RL, Schirm AL, Lazar NA. Moving to a World Beyond “< 0.05”. The American Statistician. 2019;73(sup1):1-19.