In a meta-analysis, we assemble a sample of independent, non-identically distributed p-values. Fisher’s combination procedure provides a chi-squared test of whether the p-values were sampled from the null uniform distribution. After rejecting the null uniform hypothesis, we are faced with the problem of how to combine the assembled p-values. We first derive a distribution for the p-values, parameterized by the standardized mean difference (SMD) and the sample size; it includes the uniform as a special case. The maximum likelihood estimate (MLE) of the SMD can then be obtained from the independent, non-identically distributed p-values. The MLE can be interpreted as a weighted average of the study-specific estimates of the effect size with shrinkage. The method is broadly applicable to p-values obtained in the maximum likelihood framework. Simulation studies show our method can effectively estimate the effect size with as few as 6 p-values in the meta-analysis. We also present a Bayes estimator for the SMD and a method to account for publication bias. We demonstrate our methods on several meta-analyses that assess the potential benefits of citicoline for patients with memory disorders or patients recovering from ischemic stroke.

In a meta-analysis, we usually cannot access the individual patient data and have to work with aggregated data such as the reported p-values. A common approach is to estimate the overall effect size. In this report we concentrate on combining the different p-values to provide such an estimate.

Let p_1, p_2, …, p_n denote the p-values assembled from n independent studies^{1}.

Fisher’s combination procedure (Fisher, 1932^{2}; Section 14.8.3 of Sutton, Abrams et al. 2000, page 220^{3}) rejects for large values of T = −2 ∑_{i=1}^{n} log(p_i), which follows a chi-squared distribution with 2n degrees of freedom when the p_i are independently sampled from the uniform distribution, i.e. the distribution of the p-values under the null hypothesis of no between-group difference.
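For concreteness, here is a minimal sketch of Fisher’s procedure in Python (an illustration, not the authors’ code); the closed-form chi-squared survival function applies because the degrees of freedom 2n are even:

```python
# Fisher's combination test: T = -2 * sum(log p_i) follows a chi-squared
# distribution with 2n degrees of freedom under the null of uniform p-values.
# For even df = 2n, the survival function has the closed form
# exp(-t/2) * sum_{k=0}^{n-1} (t/2)^k / k!.
import math

def fisher_combine(pvalues):
    """Return (T, combined p-value) for Fisher's combination test."""
    n = len(pvalues)
    t = -2.0 * sum(math.log(p) for p in pvalues)
    half = t / 2.0
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= half / k          # (t/2)^k / k!, built up iteratively
        total += term
    return t, math.exp(-half) * total
```

With a single p-value the procedure returns that p-value unchanged, which is a convenient sanity check on the survival-function formula.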

As is often the case in a meta-analysis, after we reject the null hypothesis, we are still faced with the problem of how to combine the assembled p-values. There are other methods to combine p-values. Cousins’ review paper^{4} is a summary with some historical account. Won (2009)^{5}, Chen and Nadarajah (2014)^{6}, and Chen, Yang et al. (2014)^{7} provide some recent developments on the topic. Briefly, the combination procedure takes the form of T = ∑_{i=1}^{n} w_i g(p_i), where w_i ≥ 0 is the weight given to study i and g(·) is a transformation function. Taking w_i = 1 and g(p) = −2 log(p) recovers Fisher’s test, and allowing unequal weights gives a weighted generalization of Fisher’s test^{8}. Chen, Yang et al. (2014)^{7} further generalize the test.

Which transformation function g(·) and which weights to use remain open questions (Section 14.8.3 of Sutton, Abrams et al. 2000^{3}). Three real-data meta-analysis examples are analyzed later in this report.


In two-sided hypothesis testing, consider a continuous-valued test statistic with null density f_0 and alternative density f_1, and define z_p to be the pth quantile of the null distribution. The density of the p-value under the alternative is g(p) = f_1(z_{1−p})/f_0(z_{1−p}), or the likelihood ratio evaluated at the upper p quantile of the null distribution. This result was used for goodness of fit tests^{9} and later to understand the interpretation of p-values by Hung, O’Neill et al. (1997)^{11} and Donahue (1999)^{10}. Yu and Zelterman (2017)^{12} used it to develop a parametric model to estimate the proportion from the null in p-value mixtures. Koziol and Tuckwell (1999)^{13} used this result to develop a Bayesian method to combine statistical tests.

The current work concentrates on the two-tailed normal test. We first derive the density function for the p-value. Then we develop several approaches to combining p-values to provide an estimate of the standardized mean difference.

The quantile function for the two-tailed normal test is z_{1−p/2} = Φ^{−1}(1 − p/2), and the density of the p-value follows as g(p; θ) = {φ(z_{1−p/2} − θ) + φ(z_{1−p/2} + θ)}/{2φ(z_{1−p/2})}, where θ is the mean of the test statistic under the alternative. Distributions for p-values have been derived before^{14}. Yu and Zelterman (2017)^{12} also derived the distribution for p-values generated using the chi-squared test. These distributions may provide alternatives to model p-values. However, their parameterization does not provide as clear an interpretation as the distribution above.

Its cumulative distribution is G(p; θ) = 1 − Φ(z_{1−p/2} − θ) + Φ(−z_{1−p/2} − θ), for 0 < p < 1. Setting θ = 0 recovers the uniform distribution, G(p; 0) = p.
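As a numerical check, the likelihood-ratio form of the density for a two-sided normal test, g(p; θ) = {φ(z_{1−p/2} − θ) + φ(z_{1−p/2} + θ)}/{2φ(z_{1−p/2})}, can be verified to integrate to one (a stdlib sketch; `theta` denotes the mean of the test statistic under the alternative):

```python
# Density of the two-sided normal-test p-value when the test statistic is
# N(theta, 1).  theta = 0 gives the uniform density g(p) = 1.
from statistics import NormalDist

_N = NormalDist()

def pvalue_density(p, theta):
    z = _N.inv_cdf(1.0 - p / 2.0)                 # z_{1-p/2}
    return (_N.pdf(z - theta) + _N.pdf(z + theta)) / (2.0 * _N.pdf(z))
```

Under θ = 0 the density is identically 1; for θ > 0 it places more mass near p = 0, which is exactly what makes small p-values informative about the effect size.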

In a two-sample normal test with means μ_1, μ_2 and common variance σ^2, the mean of the test statistic under the alternative is θ = δ √{n_1 n_2/(n_1 + n_2)}, where δ = (μ_2 − μ_1)/σ is the standardized mean difference (SMD^{3}). The density function of the p-value is therefore parameterized by δ and the sample sizes (n_1, n_2).

If the standard deviation σ is unknown and estimated from the data, the test statistic follows a t distribution; the density f_t of the noncentral t can be found in Johnson, Kotz, and Balakrishnan (1995, p 516). The two-tailed quantile function is t_{1−p/2}, the (1 − p/2)th quantile of the central t distribution, and the density of the p-value has the same likelihood-ratio form.

Three estimates of SMD, including the maximum likelihood estimate (MLE), a Bayesian estimate, and an estimate that accounts for publication bias, are developed next.

Let p_i denote the p-value from the i^{th} study, and let n_{1i} and n_{2i} denote the study-specific sample sizes.

We recognize different studies may have different μ_{2i} − μ_{1i} and σ_i for reasons such as the studies may be conducted on different populations and/or may have used different measurement instruments. This issue of study heterogeneity is discussed elsewhere^{15,16}. In this report we focus on a meta-analysis in which the investigator would like to combine the p-values to infer what an overall effect is. We use the SMD δ as this overall effect. We adapt the convexity plot^{17} for our model, and the plot provides an indication of heterogeneity. We also conduct simulations to assess the impact of heterogeneity on our estimate.

With this set-up, the log-likelihood for δ is the sum over the independent studies, ℓ(δ) = ∑_{i=1}^{n} log g(p_i; δ, n_{1i}, n_{2i}).

Its score function with respect to δ is ℓ′(δ) = ∑_{i=1}^{n} ∂ log g(p_i; δ, n_{1i}, n_{2i})/∂δ.

The observed information is the negative second derivative −ℓ″(δ) evaluated at the MLE; its inverse provides a variance estimate for δ̂.

Zero is always one root of the score function ℓ′(δ), so δ = 0 must be compared against any interior maximum.

If we assume there is only one p-value observation, then the score function contains a single term.

This score function shows that if the sample size is not small and the effect size is away from zero, the score has a root away from zero, and the likelihood is maximized near the study-specific effect-size estimate.

We examine the sample sizes (n_1, n_2) in the individual studies to understand their impact on the likelihood function and the consequence on the estimate of δ.

Some algebra shows that the MLE can be written as a weighted average of the study-specific effect-size estimates, with a shrinkage factor that discounts the studies with small sample sizes.

To directly maximize the likelihood, a one-dimensional numerical search over δ is sufficient in practice.

We next present an indication of heterogeneity among the studies. The indication is the convexity plot developed by Lambert and Roeder (1995)^{17}. These plots are presented with the data examples.

With the distribution of the p-values specified, a Bayesian estimate of δ is obtained by placing a prior on δ and sampling from the posterior^{18}. We ran 20,000 iterations with 3 chains. The mean of the posterior and the 95% credible intervals are reported in the tables.

We next propose a method to account for the publication bias that has been of concern in meta-analysis. It is well recognized that studies with larger p-values are less likely to be accepted for publication by journals, and this bias results in truncated p-values when they are assembled from the literature. To minimize this bias, investigators are urged to conduct a thorough search on the subject to include as complete a sample of studies as possible, and even to use their connections to obtain data from known but unpublished studies. Statistically, this phenomenon can be described as the observed p-values being truncated at a threshold. Since p-values larger than the threshold are not observed, the likelihood is built from the truncated density, g(p; δ, n_{1i}, n_{2i}) divided by its cumulative distribution evaluated at the threshold.

A practical issue here is how to choose the threshold for the truncation.
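The truncation adjustment can be sketched as follows (an illustration, not the authors’ code; `tau` is the assumed publication threshold and `theta` the mean of the test statistic under the alternative):

```python
# Truncated p-value model for publication bias: only p <= tau is observed, so
# the observed density is g(p; theta) / G(tau; theta), where G is the CDF of
# the two-sided normal-test p-value when the statistic is N(theta, 1).
from statistics import NormalDist

_N = NormalDist()

def pvalue_cdf(p, theta):
    z = _N.inv_cdf(1.0 - p / 2.0)
    return (1.0 - _N.cdf(z - theta)) + _N.cdf(-z - theta)

def truncated_density(p, theta, tau):
    if p > tau:
        return 0.0                                 # unpublished region
    z = _N.inv_cdf(1.0 - p / 2.0)
    g = (_N.pdf(z - theta) + _N.pdf(z + theta)) / (2.0 * _N.pdf(z))
    return g / pvalue_cdf(tau, theta)              # renormalize to (0, tau]
```

Replacing the ordinary density with `truncated_density` in the log-likelihood gives the truncation-adjusted MLE.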

Another publication bias concerns unequal probabilities for studies with different p-values to be accepted by journals. That means the studies we assembled have different probabilities of getting into the meta-analysis sample. One way to account for this bias is to use inverse probability weighting^{19}. However, to apply inverse probability weighting, we need to know the acceptance probabilities, which are rarely available in practice.

We simulate a spectrum of possible meta-analyses. The scenarios are determined by the number of studies (n), the study-specific sample sizes (n_{1i}, n_{2i}), and the effect size δ.

The top two panels show that larger sample sizes n_{1i} and n_{2i} in the studies help with the estimate in terms of smaller standard errors. This suggests the practical value of combining a set of p-values from small, under-powered studies. The bottom two panels show the estimates for the same simulation set-up except for a sample of 12 p-values instead of 6 as in the two top panels. The estimates show similar excellent performance in terms of bias. The improvement with more (12) p-values is seen in the smaller standard errors and in the power against H_0 : δ = 0.

From these simulations, we can also report the power of the Wald test against H_0 : δ = 0. The power depends on the effect size, the sample sizes n_{1i} and n_{2i} in the i^{th} study, and the number of studies n.

In the above simulations, all the studies included in the meta-analysis have exactly the same effect size. In practice, these studies may have heterogeneous effect sizes. We therefore also simulate heterogeneous scenarios and report the estimates and the power of the Wald test against H_0 : δ = 0.

From these simulations, we have observed that our methods provide a good estimate of the overall common effect size when all the studies have the same SMD. When the effect size is heterogeneous, our methods still provide a reasonable summary of the overall effect. The Wald test against the null hypothesis H_0 : δ = 0 retains reasonable power in these settings.

Citicoline, also named cytidine diphosphocholine or CDP-choline, is a widely available supplement in the US. It is a drug approved for the treatment of acute ischemic stroke in Europe^{20}. It has been studied in many clinical trials to evaluate its potential benefits for patients with memory disorders^{15} and ischemic stroke^{20,21,22}. Many meta-analyses have been conducted to assemble the evidence^{15,20}. Here we reanalyze three recent meta-analyses to demonstrate our methods.

Fioravanti and Yanagi (2005)^{15} reported a carefully-conducted meta-analysis to assess the efficacy of citicoline in the treatment of cognitive, emotional, and behavioral deficits associated with chronic cerebral disorders in the elderly. The outcomes they examined include attention, memory, behavioral, the Alzheimer’s disease assessment scale (ADAS-cog), clinical evaluation of improvement, the clinician’s global impression of change (CIBIC) score, and tolerability measures. Among the endpoints, clinical evaluation of improvement and tolerability measures are assessed using odds ratios. Outcomes ADAS-cog and CIBIC score were only assessed in 1 study and 2 studies, respectively. Therefore, in this report only the continuous outcomes attention, memory, behavioral, and memory recall are analyzed using our methods.

We computed the p-values from the data reported in Fioravanti and Yanagi (2005)^{15} using the two-sample t-test. Almost identical p-values were obtained when we also applied the two-sample normal test. The bottom of the table reports our estimates of the overall SMD for each endpoint.

For endpoint attention, the log-likelihood plotted in the figure is maximized at δ̂ = 0, indicating no detectable treatment effect on attention.

For endpoint memory measures, the MLE of the SMD is 0.23 with 95% CI (0.09, 0.37). The study of Bonavita (1983)^{23} contributed a very small p-value. The meta-analysis by Fioravanti and Yanagi (2005)^{15} deemed this study an outlier since it “used an idiosyncratic non-standardized procedure for memory evaluation”, and they repeated their analysis with this study removed. We would like to note that in our set-up this is not a reason to remove a study from the meta-analysis. Table Analysis 1.2 in Fioravanti and Yanagi (2005)^{15} clearly shows that there was heterogeneity among the studies that collected memory measures. The density curves for the p-values, given in the figure, also reflect this heterogeneity.

The MLE of the SMD for the behavioural endpoint is 0.23 with 95% CI (0.08, 0.38), and for memory recall it is 0.18 with 95% CI (0.00, 0.36).

The Bayesian estimates, reported in the bottom rows of the table, are close to the MLEs.

In summary, our analyses are consistent with the results reported in Fioravanti and Yanagi (2005)^{15} for these endpoints, supporting a small but statistically significant treatment effect of citicoline on memory, behavioral, and memory recall.

The effect of citicoline on recovery from stroke has not been consistent. While the largest trial to date, the International Citicoline Trial on acUte Stroke (ICTUS)^{21}, found no benefit of administering citicoline on survival or recovery from stroke, many early smaller-sized trials and two meta-analyses^{20,21} support some beneficial effect in the treatment of acute ischemic stroke.

Secades et al. (2016)^{20} conducted a systematic review to identify published randomized, double-blinded, placebo-controlled clinical trials of citicoline in patients with acute ischemic stroke. They assembled 10 studies in a meta-analysis to assess whether treatment with citicoline (started within 14 days of stroke onset) improves independence when compared with placebo. The binary outcome independence is defined as a modified Rankin Scale score of 0–2 or equivalent. The contributions of these studies to the 3 meta-analyses are summarized in the table.

When reporting the results of the ICTUS trial, Dávalos et al.^{21} also conducted a meta-analysis to put their results in context. They included five earlier studies and their ICTUS in the meta-analysis. These 5 studies are a subset of the studies in the meta-analysis by Secades et al. (2016)^{20}. The reasons why Dávalos et al. (2012) included these studies can be found on page 355^{21}. The results are in the last column of the table.

These meta-analyses reported odds-ratio (OR) as the efficacy measure since the outcome independence is binary. We worked with the log OR since the common model-based MLE of log OR is asymptotically normally distributed. This is generally true for many model-based parameter estimates based on the maximum likelihood theory. Our methods can be broadly applied to combine p-values from these studies due to the normality of the estimates.
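To illustrate why such p-values fit the framework, the Wald p-value for a log OR from a 2×2 table rests on exactly the asymptotic normality mentioned above (a sketch with hypothetical counts: `a`, `b` are events/non-events on treatment and `c`, `d` on control):

```python
# Wald test for a log odds ratio from a 2x2 table.  The MLE of log OR is
# asymptotically normal with SE = sqrt(1/a + 1/b + 1/c + 1/d), so the
# resulting two-sided p-value behaves like a two-sided normal-test p-value.
import math
from statistics import NormalDist

def log_or_pvalue(a, b, c, d):
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = log_or / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return log_or, se, p
```

For example, a table with 40/60 events on treatment and 25/75 on control gives an OR of 2.0 and a two-sided p-value near 0.02.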

For all 4 meta-analyses, our methods provided OR estimates that are smaller than the originally reported estimates^{20,21}, suggesting a small but still statistically significant citicoline effect. The MLEs, truncated MLEs, and Bayesian estimates are given at the bottom of the table.

In summary, our analyses are consistent with the results reported in earlier meta-analyses^{20,21}, suggesting a smaller but still statistically significant treatment effect of citicoline on post-stroke independence.

This work started with a set of p-values to be combined and we ended up developing an estimator of SMD. We would like to comment on a connection between our maximum likelihood-based method and some of the existing methods in the literature.

The popular z-tests can be generally formulated as Z = ∑_{i=1}^{n} w_i Φ^{−1}(1 − p_i) / (∑_{i=1}^{n} w_i^2)^{1/2}, where w_i is the weight for study i. When w_i = 1, this is the well-known Stouffer test^{24}. A limitation of this approach is that studies with different sample sizes give us estimates with different precision, and this needs to be accounted for when we combine the studies. When w_i = n_i, where n_i is the sample size for study i, we obtain a sample-size weighted z-test^{25}. Our method is similar to this procedure, but with a shrinkage factor that discounts studies with small sample sizes. Other researchers suggested that we use the square root of the sample size or the inverse of the estimated standard error as weights^{1}. Our MLE provides a theoretical justification for choosing the weight, and it has a built-in shrinkage against chance findings in small-sized studies.
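The weighted z formulation can be sketched as follows (one-sided p-values; `weights` defaulting to 1 gives the Stouffer test):

```python
# Weighted z combination: Z = sum(w_i * Phi^{-1}(1 - p_i)) / sqrt(sum(w_i^2)),
# compared against N(0, 1) to obtain a combined one-sided p-value.
import math
from statistics import NormalDist

_N = NormalDist()

def stouffer(pvalues, weights=None):
    if weights is None:
        weights = [1.0] * len(pvalues)            # equal weights: Stouffer test
    z = sum(w * _N.inv_cdf(1.0 - p) for w, p in zip(weights, pvalues))
    z /= math.sqrt(sum(w * w for w in weights))
    return 1.0 - _N.cdf(z)
```

Note how a heavily weighted large study dominates the combined statistic, which is the precision issue the shrinkage in our MLE is designed to handle.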

Another issue in meta-analysis is that the assembled p-values may be obtained using different test statistics. For tests that are not associated with an effect size indicator, e.g. nonparametric tests, d_{equivalent} or r_{equivalent} developed and discussed in Rosenthal and Rubin (2003)^{26}, Kraemer (2005)^{27}, and Hsu (2005)^{28} may be obtained for the individual studies. Then traditional weighting schemes may be used to combine them. We derived the distribution of the p-values for the 2-tailed normal test and the t-test. The MLE of the SMD is developed for the normal test. This may appear to be a limitation of our method since our method would be ideally applied to a set of p-values obtained using the normal test. However, it is straightforward for readers to extend the MLE for their specific tests or even for a mixture of tests as long as the parameter has the same interpretation, since the likelihood can be constructed accordingly for p-values from different tests. In the case of regression analysis of randomized clinical trials with two arms, when covariates that are orthogonal to the treatment assignment are included in the model, the estimate of the between-arm difference remains the same, but the residual error variance σ^2 decreases. The net effect is improved power. However, when the number of covariates is far less than the sample size, the distribution of the p-values still follows the derived distribution approximately.

We developed our method for a 2-tailed test in which the direction of the efficacy is lost, as is the case for many other methods discussed above. This appears to be a limitation. From the technical aspect, our method can be modified to combine one-sided p-values. However, one-sided tests are not common and need to be justified prospectively for use in practice (Section 5.5, page 25, International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use, 1998^{29}). Furthermore, one should always carefully examine the studies included in the meta-analysis and be particularly cautious when combining a set of p-values where some of the studies point to opposite directions of efficacy, if ever one decides to pursue such a meta-analysis.

There is nonparametric work^{14,30,31,32} that describes distributions for p-values. These nonparametric distributions may fit p-value mixtures well; however, they do not explicitly parameterize the effect size and the sample size in the distribution. Therefore, it is difficult to utilize them in meta-analyses. Our method, using derived distributions for p-values, allows us to establish a spectrum of estimates for the SMD, including an estimator to account for publication bias and Bayesian estimators.

This work was supported in part by Vanderbilt CTSA grant 1ULTR002243 from NIH/NCATS, R01 CA149633 from NIH/NCI, R21 HL129020, P01 HL108800 from NIH/NHLBI, R01 FD004778 from FDA (CY) and grants P50-CA196530, P30-CA16359, R01-CA177719, R01-ES005775, R41-A120546, U48-DP005023, and R01-CA168733 awarded by NIH (DZ).

Financial disclosure

None reported.

Conflict of interest

The authors declare no potential conflict of interests.

Density function of the p-value for sample sizes n_1 = n_2 = 30, n_1 = n_2 = 80, n_1 = n_2 = 200, and n_1 = n_2 = 100.

Log-likelihood function for endpoints attention (A), memory (M), behavioural (B), and memory recall (MR) of the CDP-choline data example.

Top: Estimate of δ from simulated meta-analyses of 6 p-values. Bottom: the same set-up with 12 p-values.

Estimate of δ and power of the Wald test against H_0 : δ = 0 in simulations with heterogeneous effect sizes.

Jackknife influence of each p-value on the MLE δ̂.

This convexity plot indicates a large amount of heterogeneity in behavior (B); a small amount in memory (M); but none in attention (A) or memory recall (MR).

Effect size, sample size and study power.

Effect Size | n_1 = n_2 | Power (%) | Effect Size | n_1 = n_2 | Power (%)
---|---|---|---|---|---
0.2 | 30 | 11 | 0.6 | 30 | 62
0.2 | 50 | 16 | 0.6 | 50 | 84
0.2 | 80 | 24 | 0.6 | 80 | 96
0.2 | 100 | 29 | 0.6 | 100 | 98
0.2 | 150 | 40 | 0.6 | 150 | > 99
0.2 | 200 | 51 | 0.6 | 200 | > 99

Estimated overall effect size δ for the endpoints in the meta-analysis of Fioravanti and Yanagi (2005)^{15} on the effect of citicoline. The sample sizes (n_1, n_2) for the citicoline and placebo groups are also noted under the p-values if they differ across the endpoints.

Study^{†} | Sample Size (n_1, n_2) | Attention | Memory | Behavioural | Memory Recall
---|---|---|---|---|---
Alvarez 1999 | (12, 16) | 0.7313 | 0.4932 | |
Barbagallo 1988 | (60, 65) | 0.3471 | 0.9718 | 0.2894 |
 | | (56, 59) | (44, 47) | (60, 65) |
Bonavita 1983 | (20, 20) | | 1.5e-08 | |
Capurso 1996 | (17, 14) | 0.1122 | 0.2347 | 0.1122 |
Cohen 2003 | (15, 15) | 0.5770 | 0.3523 | 0.9400 | 0.3523
Falchi Delitala 1984 | (15, 15) | | 1.03e-08 | |
Madariaga 1978 | (16, 16) | | 0.0017 | |
Motta 1985 | (25, 25) | 0.0482 | 0.2402 | 0.2404 |
Piccoli 1994 | (43, 43) | 0.8660 | 0.5579 | 0.6370 | 0.5579
 | | (34, 33) | (35, 34) | (43, 43) | (35, 34)
Senin 2003 | (220, 232) | 0.9530 | 0.1107 | 0.3956 | 0.1107
 | | (216, 226) | (216, 221) | (220, 232) | (216, 221)
Sinforiani 1986 | (26, 32) | 0.6253 | 0.0596 | 0.0417 | 0.0596
Spiers 1996 | (46, 44) | | 0.5272 | |
Estimates | | | | |
MLE | | 0.000 | 0.2283 | 0.2307 | 0.1801
95% CI | | N/A | (.0856, .3709) | (.0828, .3787) | (.0018, .3583)
MLE^{‡} | | 0.000 | 0.2252 | 0.2253 | 0.1039
95% CI | | N/A | (.0821, .3683) | (.0757, .3748) | (.0000, .3253)
Bayes | | 0.07 | 0.22 | 0.22 | 0.17
95% CI | | (.00, .19) | (.06, .37) | (.04, .37) | (.01, .34)

^{†} Detailed references for the studies are in Fioravanti and Yanagi (2005)^{15}.

^{‡} MLE based on the truncated distribution.

Meta-analysis estimates of the odds ratio for independence after stroke from (A) all studies, (B) and (C) two subsets analyzed by Secades et al. (2016)^{20}, and (D) a meta-analysis by Dávalos et al. (2012)^{21}. The sample sizes (n_1, n_2) for the citicoline and placebo groups are noted under the p-values if the study contributes a subset of subjects to the analysis.

Columns (A)–(C) are meta-analyses by Secades et al. (2016); column (D) is the meta-analysis by Dávalos et al. (2012).

Study^{†} | Sample Size (n_1, n_2) | (A) All | (B) Not | (C) On | (D) Dávalos
---|---|---|---|---|---
Boudouresques 1980 | (23, 22) | .0092 | .0092 | |
Goas 1980 | (31, 33) | .0264 | .0264 | |
Corso 1982 | (17, 16) | .9943 | .9943 | |
Tazaki 1988 | (136, 136) | .00005 | .00005 | .00005 |
USA 1 1997 | (193, 64) | .3169 | .3169 | .2353 | .3169
 | | | | (65, 64) |
USA 2 1999 | (267, 127) | .4440 | .4440 | .4440 |
USA 3 2000 | (52, 48) | .9085 | .9085 | .9085 |
USA 4 2001 | (452, 446) | .0240 | .0240 | .0044 | .0240
 | | | | (396, 368) |
Alviarez 2007 | (29, 30) | .3668 | .3668 | .3668 |
ICTUS 2012 | (1148, 1150) | .5384 | .7153 | .7153 | .5384
 | | | (613, 615) | (613, 615) |
Estimates | | | | |
MLE | | 1.10 | 1.13 | 1.09 | 1.09
95% CI | | (1.03, 1.17) | (1.04, 1.21) | (.99, 1.19) | (1.02, 1.16)
MLE^{‡} | | 1.10 | 1.13 | 1.07 | 1.08
95% CI | | (1.02, 1.17) | (1.04, 1.21) | (0.97, 1.18) | (1.01, 1.16)
Bayes | | 1.09 | 1.12 | 1.09 | 1.08
95% CI | | (1.02, 1.17) | (1.03, 1.21) | (1.01, 1.19) | (1.01, 1.16)

^{†} Detailed references for the studies are in Secades et al. (2016)^{20}.

^{‡} MLE based on the truncated distribution.