<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" article-type="research-article"><?properties manuscript?><front><journal-meta><journal-id journal-id-type="nlm-journal-id">7705941</journal-id><journal-id journal-id-type="pubmed-jr-id">7382</journal-id><journal-id journal-id-type="nlm-ta">Sex Transm Dis</journal-id><journal-id journal-id-type="iso-abbrev">Sex Transm Dis</journal-id><journal-title-group><journal-title>Sexually transmitted diseases</journal-title></journal-title-group><issn pub-type="ppub">0148-5717</issn><issn pub-type="epub">1537-4521</issn></journal-meta><article-meta><article-id pub-id-type="pmid">28703730</article-id><article-id pub-id-type="pmc">5761065</article-id><article-id pub-id-type="doi">10.1097/OLQ.0000000000000635</article-id><article-id pub-id-type="manuscript">HHSPA930639</article-id><article-categories><subj-group subj-group-type="heading"><subject>Article</subject></subj-group></article-categories><title-group><article-title>An Illustration of Errors in Using the <italic>P</italic> Value to Indicate Clinical Significance or Epidemiological Importance of a Study Finding</article-title></title-group><contrib-group><contrib contrib-type="author"><name><surname>Kang</surname><given-names>Joseph</given-names></name><degrees>PhD</degrees></contrib><contrib contrib-type="author"><name><surname>Hong</surname><given-names>Jaeyoung</given-names></name><degrees>PhD</degrees></contrib><contrib contrib-type="author"><name><surname>Esie</surname><given-names>Precious</given-names></name><degrees>MPH</degrees></contrib><contrib contrib-type="author"><name><surname>Bernstein</surname><given-names>Kyle T.</given-names></name><degrees>PhD</degrees></contrib><contrib contrib-type="author"><name><surname>Aral</surname><given-names>Sevgi</given-names></name><degrees>PhD</degrees></contrib><aff id="A1">National Center for HIV/AIDS, Viral Hepatitis, STD, and TB Prevention, Atlanta, 
GA</aff></contrib-group><author-notes><corresp id="FN1">Correspondence: Joseph Kang, PhD, Mailstop E-02, Division of STD Prevention, Centers for Disease Control and Prevention, 12 Corporate Square Blvd, Atlanta, GA 30329. <email>yma9@cdc.gov</email></corresp></author-notes><pub-date pub-type="nihms-submitted"><day>27</day><month>12</month><year>2017</year></pub-date><pub-date pub-type="ppub"><month>8</month><year>2017</year></pub-date><pub-date pub-type="pmc-release"><day>10</day><month>1</month><year>2018</year></pub-date><volume>44</volume><issue>8</issue><fpage>495</fpage><lpage>497</lpage><!--elocation-id from pubmed: 10.1097/OLQ.0000000000000635--><abstract><p id="P1">We conducted a simulation study to illustrate that <italic>P</italic> values can suggest but not confirm statistical significance, and that they may not indicate epidemiological significance (importance). We recommend that researchers consider reporting effect sizes and <italic>P</italic> values in conjunction with confidence intervals, or point estimates with standard errors, to indicate precision (uncertainty).</p></abstract></article-meta></front><body><p id="P2">Since 1999, experts have written about the inappropriate use of the <italic>P</italic> value to make judgments about the scientific significance (importance) of research findings in leading medical and scientific journals.<sup><xref rid="R1" ref-type="bibr">1</xref>&#x02013;<xref rid="R6" ref-type="bibr">6</xref></sup> The primary concern is that a <italic>P</italic> value computed in a statistical significance test contains no information about the clinical significance&#x02014;the importance of an intervention&#x02014;or the epidemiological importance of the finding&#x02014;its relevance to the prevention and control of a disease in a population.<sup><xref rid="R7" ref-type="bibr">7</xref>,<xref rid="R8" ref-type="bibr">8</xref></sup> In 2016, the American Statistical Association (ASA) issued a formal statement clarifying the proper use and 
interpretation of the <italic>P</italic> value and advising against using the <italic>P</italic> value to determine the scientific significance of research findings.<sup><xref rid="R9" ref-type="bibr">9</xref></sup> The purpose of this article is to illustrate, with a simple simulated example, why small <italic>P</italic> values and narrow 95% confidence intervals do not indicate the clinical significance or epidemiological importance of a research finding. We recommend that authors report and interpret <italic>P</italic> values in conjunction with effect sizes and standard errors or confidence intervals, to support limited statements about the precision (uncertainty) and statistical significance of the findings. Conclusions about the clinical significance or epidemiological importance of research findings require clinical or epidemiological judgments that do not depend on statistical evidence alone.</p><sec sec-type="methods" id="S1"><title>METHODS</title><p id="P3">We devised an example of a prevalence difference known to be epidemiologically insignificant or unimportant. The hypothesis test compares the prevalence difference of 2 interventions&#x02014;A and B. To compute the <italic>P</italic> value for the test, it is assumed that intervention A reduces the prevalence of an STD by 31% in group A and intervention B reduces the prevalence of the same STD by 27% in group B. Groups A and B are equal in size, and interventions A and B are, for practical purposes, equally effective. The effect of intervention A relative to that of intervention B corresponds to a prevalence odds ratio of 1.21. We know from practical experience that a 4% difference in the effects of interventions A and B is not clinically significant or important. Let us suppose that a 20% difference would be clinically important. 
Although 20% may be arbitrary, it is comparable to the gender gap in 2014 gonorrhea rates&#x02014;120.1 cases per 10<sup>5</sup> among men and 101.3 cases per 10<sup>5</sup> among women.<sup><xref rid="R10" ref-type="bibr">10</xref></sup> An intervention that closes that gap, that is, eliminates the difference of 18.8 cases per 10<sup>5</sup>, would be epidemiologically important because closing the gap is a national goal. For our statistical simulation, the R statistical program<sup><xref rid="R11" ref-type="bibr">11</xref></sup> was used; the code is available as a supplementary document.</p></sec><sec sec-type="results" id="S2"><title>RESULTS</title><p id="P4">Using the outcomes described above, data can be readily simulated with different sample sizes. <xref rid="F1" ref-type="fig">Figure 1</xref> illustrates that as the sample size (<italic>N</italic>) and power increase, the <italic>P</italic> value becomes smaller, even when there is no change in the absolute difference of 4% (the measure of effect). All the data points of this figure were generated with the same prevalence odds ratio of 1.21 and the same absolute difference of 4%, as described in the previous section. For example, with a total sample size <italic>N</italic> of 100, <italic>P</italic> = 0.24, which is not &#x0201c;statistically significant&#x0201d; at the standard threshold of <italic>P</italic> &#x0003c; 0.05. In contrast, a sample size <italic>N</italic> of 1800 results in <italic>P</italic> = 0.002, a value conventionally considered &#x0201c;statistically significant.&#x0201d; This phenomenon occurs because the <italic>P</italic> value is directly influenced by the sample size. As sample size increases, <italic>P</italic> values become smaller, eventually crossing the 0.05 threshold to become &#x0201c;significant&#x0201d; regardless of whether the outcome is clinically significant. 
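The sample-size effect described above can be reproduced with a few lines of code. The authors' simulation was written in R (see the supplementary document); the following is an independent sketch in Python that applies a pooled two-proportion z-test to the expected counts for prevalences of 31% and 27%, so its P values illustrate the trend rather than reproduce the simulated values quoted above.

```python
import math
from statistics import NormalDist

def two_prop_p(x1, n1, x2, n2):
    """Two-sided P value from a pooled two-proportion z-test."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Expected counts for a fixed 4% absolute difference (31% vs 27%),
# evaluated at increasing total sample sizes N (equal group sizes).
for n_total in (100, 1800, 10000):
    n = n_total // 2
    p = two_prop_p(round(0.31 * n), n, round(0.27 * n), n)
    print(f"N = {n_total:5d}  P = {p:.4f}")
```

The printed P value shrinks steadily as N grows even though the 4% difference never changes, which is the pattern Figure 1 displays.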
Without context, reporting only a <italic>P</italic> value for the group difference as evidence of its clinical significance or epidemiological importance will often result in misinterpretation. As stated above, the absolute difference in prevalence between the two groups in our example is 4%, a difference that is known to be unimportant. Thus, its effect size&#x02014;the estimated difference of the prevalence rates (4%)&#x02014;should be reported as well.</p><p id="P5">As shown in <xref rid="F1" ref-type="fig">Figure 1</xref>, some simulated data sets yield small <italic>P</italic> values that indicate statistically significant differences. Instead of directly associating small <italic>P</italic> values and statistical significance with clinical significance, small <italic>P</italic> values should be interpreted as evidence that the particular set of data is incompatible with the proposed statistical model.<sup><xref rid="R9" ref-type="bibr">9</xref></sup>
<italic>P</italic> values in <xref rid="F1" ref-type="fig">Figure 1</xref> were computed under a null hypothesis assuming a prevalence odds ratio of 1.0. <italic>P</italic> values less than 0.05 mean the data are incompatible with this null hypothesis.<sup><xref rid="R9" ref-type="bibr">9</xref></sup> In the example provided, this is the only information the <italic>P</italic> value can provide. The incompatibility of the data and the statistical model&#x02019;s null hypothesis indicated by <italic>P</italic> values less than 0.05 provides justification for a preliminary, not definitive or final, rejection of the null hypothesis.</p><p id="P6"><xref rid="F2" ref-type="fig">Figure 2</xref> shows that as the sample size <italic>N</italic> increases, the confidence interval for the effect size becomes narrower. With <italic>N</italic> = 1800, the confidence intervals for the estimated prevalence rates in groups A and B are 31% to 37% and 24% to 30%, respectively. Equivalently, an estimated prevalence odds ratio of 1.38 can be reported with its confidence interval of 1.13 to 1.69 or its standard error of &#x000b1;0.13. Neither statement about precision, however, justifies any statement about clinical significance or epidemiological importance.</p></sec><sec sec-type="discussion" id="S3"><title>DISCUSSION</title><p id="P7">In this article, we illustrated that <italic>P</italic> values measure statistical significance, but not necessarily clinical significance or epidemiological importance. The evaluation of public health interventions requires more careful epidemiological investigation, including assessment of the magnitude of attributable risks, than simply reporting <italic>P</italic> values. 
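The odds-ratio interval plotted in Figure 2 at N = 1800 can be approximated with the standard Wald interval on the log odds ratio. In this Python sketch (not the authors' R code), the counts 306/900 and 243/900 are hypothetical, chosen to match the reported prevalence estimates of roughly 34% and 27%:

```python
import math

def odds_ratio_wald_ci(x1, n1, x2, n2, z=1.96):
    """Odds ratio and 95% Wald confidence interval from two binomial counts."""
    oddsr = (x1 * (n2 - x2)) / (x2 * (n1 - x1))
    # Standard error of log(OR): sqrt of the summed reciprocal cell counts
    se = math.sqrt(1 / x1 + 1 / (n1 - x1) + 1 / x2 + 1 / (n2 - x2))
    lo = oddsr * math.exp(-z * se)
    hi = oddsr * math.exp(z * se)
    return oddsr, lo, hi

# Hypothetical counts: 306 of 900 (34%) in group A vs 243 of 900 (27%) in group B
oddsr, lo, hi = odds_ratio_wald_ci(306, 900, 243, 900)
print(f"OR = {oddsr:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```

The result is close to the interval reported in the text; small discrepancies are expected because the published numbers come from the authors' simulated data rather than these assumed counts.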
When the <italic>P</italic> value was first proposed by Ronald Fisher as an index of statistical significance, it was never meant to be used as an index of clinical significance or epidemiological importance.<sup><xref rid="R12" ref-type="bibr">12</xref></sup> The threshold of 0.05 was intended to serve as an initial or preliminary indicator of potential statistical significance, neither final nor confirmatory. Some disciplines, such as genomics,<sup><xref rid="R13" ref-type="bibr">13</xref></sup> have used <italic>P</italic> values to support important discoveries, but the nature of the analysis in this respect is exploratory rather than confirmatory. Even when the <italic>P</italic> value is meant to be exploratory, the use of 0.05 as a threshold may be misleading. Observe the horizontal dotted line at a significance level of 0.05 in <xref rid="F1" ref-type="fig">Figure 1</xref>. If the dotted line were drawn at a significance level of 0.1, a larger number of simulated experiments would have statistically significant <italic>P</italic> values. Regardless of statistical significance levels, however, the prevalence difference (4%) or prevalence odds ratio (1.21) remains clinically insignificant and epidemiologically unimportant.</p><p id="P8">As shown in the simple example presented in this article, evidence of statistical significance is not evidence of clinical or epidemiological importance. Smaller <italic>P</italic> values are not necessarily associated with larger or more important effects, and larger <italic>P</italic> values are not necessarily associated with clinical insignificance or epidemiological unimportance of the effect. Recall that our example had a prevalence difference of 4%, but <italic>P</italic> values in <xref rid="F1" ref-type="fig">Figure 1</xref> varied. 
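The effect of moving the dotted line from 0.05 to 0.1 can be checked by Monte Carlo: simulate many experiments at the fixed 4% difference and count how many fall below each threshold. This Python sketch (not the authors' R code) assumes prevalences of 0.31 and 0.27 as in the example, 450 participants per group, and 500 simulated trials:

```python
import math
import random
from statistics import NormalDist

def two_prop_p(x1, n1, x2, n2):
    """Two-sided P value from a pooled two-proportion z-test."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def rejection_rate(n_per_group, alpha, p_a=0.31, p_b=0.27, sims=500, seed=7):
    """Share of simulated experiments with P < alpha (empirical power)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        x1 = sum(rng.random() < p_a for _ in range(n_per_group))
        x2 = sum(rng.random() < p_b for _ in range(n_per_group))
        hits += two_prop_p(x1, n_per_group, x2, n_per_group) < alpha
    return hits / sims

# The same simulated experiments: a 0.1 threshold declares more of them
# "significant" than 0.05 does, with no change in the 4% difference.
for alpha in (0.05, 0.1):
    print(f"alpha = {alpha}: rejected in {rejection_rate(450, alpha):.0%} of trials")
```

Because the same seeded draws are evaluated at both thresholds, the rejection fraction at 0.1 is necessarily at least as large as at 0.05, mirroring the dotted-line argument above.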
Thus, any analysis with a large sample size or high precision may produce a small <italic>P</italic> value, whereas analyses with small sample sizes or imprecise measurements may produce large <italic>P</italic> values even though the clinical or epidemiological effect may be important.</p><p id="P9">According to the ASA, &#x0201c;Cherry-picking effect sizes with small <italic>P</italic> values, also known by such terms as data dredging, significance chasing, significance questing, selective inference and &#x02018;<italic>P</italic> hacking,&#x02019; leads to spurious significant results.&#x0201d;<sup><xref rid="R9" ref-type="bibr">9</xref></sup> One way to prevent spurious findings is to report both effect sizes and the corresponding confidence intervals so that readers can decide for themselves whether the difference (or ratio) in effect is big enough to be meaningful or important on the basis of clinical or epidemiological criteria of significance. In some cases, however, effect sizes are not applicable to the result of the statistical test&#x02014;e.g., a goodness-of-fit test yields only a <italic>P</italic> value. Therefore, there is still utility in using a <italic>P</italic> value to make decisions about a model, and it should not be blindly abandoned.</p><p id="P10">Scientific writing tends to convey <italic>P</italic> values as statements about the truth of a null hypothesis, or about the probability that random chance produced the observed data. Yet, as shown in our study, whether or not the null hypothesis is rejected, the assumed clinical insignificance does not change. Evaluating the clinical significance of a finding is quite different from assessing its statistical significance.</p><p id="P11">Generally, <italic>P</italic> values can support exploratory or preliminary judgments about statistical significance, but they are neither certain, nor confirmatory, nor final. 
Without additional context, such as the effect size, a confidence interval, or a standard error, readers of scientific reports do not have sufficient information in the <italic>P</italic> value alone to make judgments about the clinical significance or epidemiological importance of a statistical finding, as clearly illustrated in our example. The recent growth of big data analytics in health science has produced very large data sets that can yield small <italic>P</italic> values for findings that may not be epidemiologically important. Indeed, assessing the importance of new discoveries in population health from such data will require more than the <italic>P</italic> value. To make study findings easier to interpret and to enable readers to make judgments about their usefulness, we recommend that, in conjunction with <italic>P</italic> values, researchers also report confidence intervals or effect sizes with standard errors.</p></sec></body><back><ack id="S4"><p>This project was supported in part by an appointment to the Research Participation Program at the National Center for HIV/AIDS, Viral Hepatitis, STD and TB Prevention, Centers for Disease Control and Prevention (CDC), administered by the Oak Ridge Institute for Science and Education through an interagency agreement between the U.S. 
Department of Energy and CDC.</p></ack><fn-group><fn fn-type="COI-statement" id="FN2"><p>Conflict of interest: none declared.</p></fn></fn-group><ref-list><ref id="R1"><label>1</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Chavalarias</surname><given-names>D</given-names></name><name><surname>Wallach</surname><given-names>J</given-names></name><name><surname>Li</surname><given-names>A</given-names></name><etal/></person-group><article-title>Evolution of reporting p values in the biomedical literature, 1990&#x02013;2015</article-title><source>JAMA</source><year>2016</year><volume>315</volume><fpage>1141</fpage><lpage>1148</lpage><pub-id pub-id-type="pmid">26978209</pub-id></element-citation></ref><ref id="R2"><label>2</label><element-citation publication-type="web"><person-group person-group-type="author"><name><surname>Baker</surname><given-names>M</given-names></name></person-group><source>Statisticians issue warning over misuse of P values</source><comment>Available at: <ext-link ext-link-type="uri" xlink:href="http://www.nature.com/news/statisticians-issue-warning-over-misuse-of-p-values-1.19503">http://www.nature.com/news/statisticians-issue-warning-over-misuse-of-p-values-1.19503</ext-link></comment><date-in-citation>Accessed October 5, 2016</date-in-citation></element-citation></ref><ref id="R3"><label>3</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Goodman</surname><given-names>SN</given-names></name></person-group><article-title>Toward Evidence-Based Medical Statistics. 
1: The P Value Fallacy</article-title><source>Ann Intern Med</source><year>1999</year><volume>130</volume><fpage>995</fpage><lpage>1004</lpage><pub-id pub-id-type="pmid">10383371</pub-id></element-citation></ref><ref id="R4"><label>4</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Greenland</surname><given-names>S</given-names></name><name><surname>Senn</surname><given-names>SJ</given-names></name><name><surname>Rothman</surname><given-names>KJ</given-names></name><etal/></person-group><article-title>Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations</article-title><source>Eur J Epidemiol</source><year>2016</year><volume>31</volume><fpage>337</fpage><lpage>350</lpage><pub-id pub-id-type="pmid">27209009</pub-id></element-citation></ref><ref id="R5"><label>5</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Greenland</surname><given-names>S</given-names></name><name><surname>Poole</surname><given-names>C</given-names></name></person-group><article-title>Living with statistics in observational research</article-title><source>Epidemiology</source><year>2013</year><volume>24</volume><fpage>73</fpage><lpage>78</lpage><pub-id pub-id-type="pmid">23232613</pub-id></element-citation></ref><ref id="R6"><label>6</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Trafimow</surname><given-names>D</given-names></name><name><surname>Marks</surname><given-names>M</given-names></name></person-group><article-title>Editorial</article-title><source>Basic Appl Soc Psych</source><year>2015</year><volume>37</volume><fpage>1</fpage><lpage>2</lpage></element-citation></ref><ref id="R7"><label>7</label><element-citation publication-type="journal"><person-group 
person-group-type="author"><name><surname>Dixon</surname><given-names>P</given-names></name></person-group><article-title>The p-value fallacy and how to avoid it</article-title><source>Can J Exp Psychol</source><year>2003</year><volume>57</volume><fpage>189</fpage><lpage>202</lpage><pub-id pub-id-type="pmid">14596477</pub-id></element-citation></ref><ref id="R8"><label>8</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Hunter</surname><given-names>JE</given-names></name></person-group><article-title>Needed: A ban on the significance test</article-title><source>Psychol Sci</source><year>1997</year><volume>8</volume><fpage>3</fpage><lpage>7</lpage></element-citation></ref><ref id="R9"><label>9</label><element-citation publication-type="web"><collab>American Statistical Association</collab><source>Statement on statistical significance and P-values</source><comment>Available at: <ext-link ext-link-type="uri" xlink:href="http://www.amstat.org/asa/files/pdfs/P-ValueStatement.pdf">http://www.amstat.org/asa/files/pdfs/P-ValueStatement.pdf</ext-link></comment><date-in-citation>Accessed October 5, 2016</date-in-citation></element-citation></ref><ref id="R10"><label>10</label><element-citation publication-type="web"><collab>Centers for Disease Control and Prevention</collab><year>2014</year><source>Sexually Transmitted Diseases Surveillance: Gonorrhea</source><comment>Available at: <ext-link ext-link-type="uri" xlink:href="https://www.cdc.gov/std/stats14/gonorrhea.htm">https://www.cdc.gov/std/stats14/gonorrhea.htm</ext-link></comment><date-in-citation>Accessed January 23, 2017</date-in-citation></element-citation></ref><ref id="R11"><label>11</label><element-citation publication-type="book"><source>R: A language and environment for statistical computing [computer program] 3.3.2</source><publisher-loc>Vienna, Austria</publisher-loc><publisher-name>R Foundation for Statistical 
Computing</publisher-name><year>2016</year></element-citation></ref><ref id="R12"><label>12</label><element-citation publication-type="web"><person-group person-group-type="author"><name><surname>Nuzzo</surname><given-names>R</given-names></name></person-group><source>P values, the &#x02018;gold standard&#x02019; of statistical validity, are not as reliable as many scientists assume</source><comment>Available at: <ext-link ext-link-type="uri" xlink:href="http://www.nature.com/news/scientific-method-statistical-errors-1.14700">http://www.nature.com/news/scientific-method-statistical-errors-1.14700</ext-link></comment><date-in-citation>Accessed October 5, 2016</date-in-citation></element-citation></ref><ref id="R13"><label>13</label><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Klein</surname><given-names>RJ</given-names></name><name><surname>Zeiss</surname><given-names>C</given-names></name><name><surname>Chew</surname><given-names>EY</given-names></name><etal/></person-group><article-title>Complement factor H polymorphism in age-related macular degeneration</article-title><source>Science</source><year>2005</year><volume>308</volume><fpage>385</fpage><lpage>389</lpage><pub-id pub-id-type="pmid">15761122</pub-id></element-citation></ref></ref-list></back><floats-group><fig id="F1" orientation="portrait" position="float"><label>Figure 1</label><caption><p><italic>P</italic> value and power versus sample size simulation with the same absolute difference of 4%.</p></caption><graphic xlink:href="nihms930639f1"/></fig><fig id="F2" orientation="portrait" position="float"><label>Figure 2</label><caption><p>Confidence interval for odds ratio (OR) versus sample size with the same absolute difference of 4%.</p></caption><graphic xlink:href="nihms930639f2"/></fig></floats-group></article>