Researchers increasingly use meta-analysis to synthesize the results of several studies in order to estimate a common effect. When the outcome variable is continuous, standard meta-analytic approaches assume that the primary studies report the sample mean and standard deviation of the outcome. However, when the outcome is skewed, authors sometimes summarize the data by reporting the sample median and one or both of (i) the minimum and maximum values and (ii) the first and third quartiles, but do not report the mean or standard deviation. To include these studies in meta-analysis, several methods have been developed to estimate the sample mean and standard deviation from the reported summary data. A major limitation of these widely used methods is that they assume that the outcome distribution is normal, which is unlikely to be tenable for studies reporting medians. We propose two novel approaches to estimate the sample mean and standard deviation when data are suspected to be non-normal. Our simulation results and empirical assessments show that the proposed methods often perform better than the existing methods when applied to non-normal data.
Meta-analysis is a statistical approach for pooling data from related studies that is widely used to provide evidence for medical research. To pool studies in an aggregate data meta-analysis, each study must contribute an effect measure (e.g., the sample mean for one-group studies, the sample means for two-group studies) of the outcome and its variance. However, primary studies may differ in the effect measures reported. Although the sample mean is the usual effect measure reported for continuous outcomes, authors often report the sample median when data are skewed and may not report the mean.^{1} This occurs commonly for time-based outcomes, such as time delays in the diagnosis and treatment of tuberculosis^{2, 3} or colorectal cancer^{4} or length of hospital stay^{5–7}. Other examples in medical research include muscle strength and mass^{8}, molecular concentration levels^{9}, tumor sizes^{10}, motor impairment scores^{11}, and intraoperative blood loss^{12}. When primary studies report the sample median of an outcome, they typically report the sample size and one or both of (i) the sample minimum and maximum values and (ii) the first and third quartiles.
The same effect measure must be obtained from all primary studies in an aggregate data meta-analysis. In order to meta-analyze a collection of studies in which some report the sample mean and others report the sample median, Hozo et al.^{13}, Bland^{14}, Wan et al.^{15}, Kwon and Reis^{16}, and Luo et al.^{17} have recently published methods to estimate the sample mean and standard deviation from studies that report medians. These methods have been widely used to meta-analyze the means for one-group studies and the raw or standardized difference of means for two-group studies. Reflecting how commonly these methods are used, Google Scholar listed 3,315 articles citing Hozo et al.^{13} and 866 articles citing Wan et al.^{15} as of October 23, 2019.
Commonly used methods that have been proposed to estimate the sample mean and standard deviation in this context can be divided into formula-based methods and simulation-based methods. The methods developed by Luo et al.^{17} and Wan et al.^{15} are the best-performing formula-based methods for estimating the sample mean and standard deviation, respectively. A major limitation of these methods is that they assume the outcome variable is normally distributed, which may be unlikely because otherwise the authors would have reported the mean. Consequently, Kwon and Reis^{16} recently proposed a simulation-based method which is based on different parametric assumptions of the outcome variable. Although the Kwon and Reis^{16} sample mean estimator has not been compared to the formula-based method of Luo et al.^{17}, their proposed standard deviation estimator performed better than the formula-based method of Wan et al.^{15} for skewed data when the assumed parametric family is correct. Limitations of this simulation-based method are that it (i) is computationally expensive, (ii) requires users to write their own distribution-specific code, and (iii) can be highly sensitive in its performance to several conceptual and computational decisions that one must make when implementing the method (see
We propose two novel methods to estimate the sample mean and standard deviation for skewed data when the underlying distribution is unknown. The proposed methods overcome several limitations of the existing methods, and we demonstrate that the proposed approaches often perform better than the existing methods when applied to skewed data.
The objectives of this paper are to describe the existing and proposed methods for estimating the sample mean and standard deviation, systematically evaluate their performance in a simulation study, and empirically evaluate their performance on real-life data sets.
In the following section, we describe the existing and proposed methods. In ‘Results’, we report the results of a simulation investigating the performance of the methods. We illustrate these methods on an example data set and evaluate their accuracy in ‘Example’. In ‘Discussion’, we summarize our findings and provide recommendations for data analysts.
Throughout this paper, we use the following notation for sample summary statistics: minimum value (
The sample mean estimator of Luo et al.^{17} and the sample standard deviation estimator of Wan et al.^{15} are formula-based methods that are derived from the assumption that the outcome variable is normally distributed.
Luo et al. developed the following sample mean estimators in scenarios
Building on the sample mean estimators of Hozo et al.^{13}, Wan et al.^{15}, and Bland^{14} in
Wan et al. proposed the following sample standard deviation estimators in scenarios
The standard deviation estimators of Wan et al. are derived using relationships between the distribution standard deviation and the expected values of order statistics for normally distributed data. The expected values of the minimum and maximum values and first and third quartiles are estimated by the respective sample values. The expected values of other order statistics are estimated using Blom’s method^{18}.
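To make the structure of these estimators concrete, the following Python sketch implements the three standard deviation formulas as they are commonly stated from Wan et al.^{15}. The function names are hypothetical, and the expressions are an assumption about the published formulas (a range or interquartile range divided by twice a normal quantile obtained from Blom's approximation) that should be checked against the original paper before use.

```python
from statistics import NormalDist

# Standard normal quantile function (inverse CDF).
_z = NormalDist().inv_cdf


def sd_from_range(minimum, maximum, n):
    """SD from the sample range, assuming normal data.

    Assumed form: (max - min) / (2 * z((n - 0.375) / (n + 0.25))).
    """
    return (maximum - minimum) / (2 * _z((n - 0.375) / (n + 0.25)))


def sd_from_iqr(q1, q3, n):
    """SD from the interquartile range, assuming normal data.

    Assumed form: (q3 - q1) / (2 * z((0.75 * n - 0.125) / (n + 0.25))).
    """
    return (q3 - q1) / (2 * _z((0.75 * n - 0.125) / (n + 0.25)))


def sd_from_range_and_iqr(minimum, q1, q3, maximum, n):
    """Average of the range-based and IQR-based estimates."""
    return 0.5 * (sd_from_range(minimum, maximum, n) + sd_from_iqr(q1, q3, n))
```

For summary data that resemble a standard normal sample of size 100 (quartiles near ±0.67, extremes near ±2.5), each function returns a value close to 1.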
Wan et al. were the first to propose a standard deviation estimator in
For the purpose of the analyses presented herein, we refer to the approach which uses the method of Luo et al. to estimate the sample mean and the method of Wan et al. to estimate the sample standard deviation as the Luo/Wan method.
The following two subsections describe the proposed methods for estimating the sample mean and standard deviation from
The QE method was originally introduced in McGrath et al.^{20} for estimating the variance of the median when summary measures of
We prespecify several candidate parametric families of distributions for the outcome variable, namely the normal, lognormal, gamma, beta, and Weibull. The parameters of each candidate distribution are estimated by minimizing the distance between the observed and distribution quantiles. Let
Details concerning the implementation of the optimization algorithm for minimizing
The distribution with the best fit (i.e., yielding the smallest value of
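The quantile-matching idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used in this paper: it considers only the scenario with the first quartile, median, and third quartile, uses only two of the candidate families (normal and lognormal), and replaces the bounded quasi-Newton optimizer with a simple derivative-free compass search. All function names are hypothetical.

```python
import math
from statistics import NormalDist

Z = NormalDist().inv_cdf
PROBS = (0.25, 0.5, 0.75)  # probabilities matched to (q1, median, q3)


def normal_quantile(theta, p):
    mu, sigma = theta
    return mu + sigma * Z(p)


def lognormal_quantile(theta, p):
    mu, sigma = theta
    return math.exp(mu + sigma * Z(p))


def objective(quantile_fn, theta, observed):
    """Sum of squared differences between observed and fitted quantiles."""
    if theta[1] <= 0:  # the scale parameter must stay positive
        return math.inf
    return sum((quantile_fn(theta, p) - x) ** 2 for p, x in zip(PROBS, observed))


def compass_search(f, x0, step=0.1, tol=1e-10, max_iter=10000):
    """Derivative-free minimizer: try +/- step on each coordinate, halve on failure."""
    x, fx = list(x0), f(x0)
    for _ in range(max_iter):
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                y = x.copy()
                y[i] += delta
                fy = f(y)
                if fy < fx:
                    x, fx, improved = y, fy, True
        if not improved:
            step /= 2
            if step < tol:
                break
    return x, fx


def qe_estimate(q1, med, q3):
    """Fit each candidate, keep the smallest objective, return (family, mean, sd)."""
    observed = (q1, med, q3)
    iqr_z = 2 * Z(0.75)
    candidates = {
        "normal": (normal_quantile, [med, (q3 - q1) / iqr_z]),
        "lognormal": (lognormal_quantile,
                      [math.log(med), (math.log(q3) - math.log(q1)) / iqr_z]),
    }
    fits = {}
    for name, (qfn, start) in candidates.items():
        theta, value = compass_search(
            lambda t, qfn=qfn: objective(qfn, t, observed), start)
        fits[name] = (value, theta)
    best = min(fits, key=lambda k: fits[k][0])
    mu, sigma = fits[best][1]
    if best == "normal":
        return best, mu, sigma
    mean = math.exp(mu + sigma ** 2 / 2)
    return best, mean, mean * math.sqrt(math.exp(sigma ** 2) - 1)
```

The starting values in this sketch come from moment-style guesses based on the quartiles; the initial values actually used for the QE method are described in the appendix.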
Luo et al.^{17} and Wan et al.^{15} assumed that a sample
In brief, the BC method consists of the following four steps. First, an optimization algorithm, such as the algorithm of Brent^{22}, optimizes the power parameter
Box-Cox transformations
Equivalently, inverse Box-Cox transformations
Box and Cox^{23} argued that Box-Cox transformations can transform a dataset into a more normally distributed dataset. Moreover, for every value of
The optimization step for finding
Then, the BC method applies the Box-Cox transformations with this value of
Let
The mean and standard deviation of
Numerical integration can solve the two equations above. Moreover, the following Monte Carlo simulation can compute the mean and standard deviation of
Recall that
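The four steps of the BC method can be sketched in Python for the scenario with the first quartile, median, and third quartile. Several choices below are assumptions made for illustration and may differ from the implementation in this paper: the power parameter is chosen by minimizing a squared, scale-free asymmetry of the transformed quartiles, golden-section search over the arbitrary interval [-3, 3] stands in for the algorithm of Brent, the transformed mean uses an equal-weight average of the three quantiles rather than the Luo et al. weights, and Monte Carlo draws that fall outside the domain of the inverse transformation are discarded. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def box_cox(x, lam):
    """Box-Cox transform; expm1 keeps precision when lam is close to 0."""
    if abs(lam) < 1e-12:
        return math.log(x)
    return math.expm1(lam * math.log(x)) / lam


def inv_box_cox(y, lam):
    """Inverse transform; returns None when y is outside the valid domain."""
    if abs(lam) < 1e-12:
        return math.exp(y)
    base = lam * y + 1
    return base ** (1 / lam) if base > 0 else None


def asymmetry(lam, q1, med, q3):
    """Squared asymmetry of the transformed quartiles, scaled by their spread."""
    t1, t2, t3 = (box_cox(v, lam) for v in (q1, med, q3))
    return (((t3 - t2) - (t2 - t1)) / (t3 - t1)) ** 2


def find_lambda(q1, med, q3, lo=-3.0, hi=3.0, tol=1e-8):
    """Golden-section search for the power parameter (stand-in for Brent)."""
    ratio = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if asymmetry(c, q1, med, q3) < asymmetry(d, q1, med, q3):
            b = d
        else:
            a = c
    return (a + b) / 2


def bc_estimate(q1, med, q3, n, draws=50_000, seed=1):
    lam = find_lambda(q1, med, q3)                            # step 1
    t1, t2, t3 = (box_cox(v, lam) for v in (q1, med, q3))     # step 2
    z = NormalDist().inv_cdf((0.75 * n - 0.125) / (n + 0.25))
    mu_t = (t1 + t2 + t3) / 3                                 # step 3 (simplified)
    sd_t = (t3 - t1) / (2 * z)
    rng = random.Random(seed)                                 # step 4
    back = [inv_box_cox(rng.gauss(mu_t, sd_t), lam) for _ in range(draws)]
    back = [x for x in back if x is not None]
    return lam, fmean(back), stdev(back)
```

Fixing the seed, as in this sketch, makes repeated calls on the same summary data return identical estimates.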
We conducted a simulation study to systematically compare the performance of the existing and proposed approaches when the truth is known.
To be consistent with the work already conducted in this area, we generated data from the same distributions considered in previous studies^{13–17}. As used by Bland^{14}, we used the normal distribution with
For each distribution, a sample of size
We used the following sample sizes in our simulations: 25, 50, 75, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1 000. A total of 1 000 repetitions were performed for each combination of data generation parameters under scenarios
As used in previous studies^{13, 15, 16}, the average relative error (ARE) was used as a performance measure. For repetition
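As a small illustration of this performance measure, the Python sketch below computes the ARE over a set of repetitions. It assumes that the relative error in each repetition is the estimate minus the true value, divided by the true value, which is the convention in the cited studies; the function names are hypothetical.

```python
def relative_error(estimate, truth):
    """Signed relative error of one estimate (assumed definition)."""
    return (estimate - truth) / truth


def average_relative_error(estimates, truths):
    """Mean of the per-repetition relative errors."""
    errors = [relative_error(e, t) for e, t in zip(estimates, truths)]
    return sum(errors) / len(errors)
```

Because the errors are signed, overestimates and underestimates can cancel, so an ARE near zero indicates low bias rather than low variability.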
As used in Luo et al.^{17}, we also used the relative mean squared error (RMSE) to evaluate the performance of all methods. Letting
In the following subsections, we present the results of the simulation study using the set of outcome distributions considered by Bland^{14}, as these distributions were selected to investigate the effect of skewness on the estimators. The results of the sensitivity analyses, in which we used the set of outcome distributions used by other authors^{13, 15–17}, are given in
Because the simulation results in scenarios
For estimating the sample mean, the BC method performed best under each distribution and nearly all sample sizes (
The BC method performed best for estimating the sample standard deviation, achieving AREs of magnitude less than 0.03 in nearly all scenarios investigated in
Model selection for the QE method generally performed well. When the outcome distribution was LogNormal(5,0.25), the QE method selected the lognormal distribution between 58.1% (when
The BC and QE sample mean estimators performed substantially better than the Luo et al. sample mean estimator in all scenarios investigated in
Similar trends held for the corresponding sample standard deviation estimators. The QE and BC methods performed considerably better than the Wan et al. sample standard deviation estimator in nearly all scenarios in
Lastly, model selection performance was similar to that observed in
In this section, we illustrate the use of the existing and proposed methods when applied to a real-life meta-analysis of a continuous, skewed outcome. Specifically, we used data collected for an individual participant data (IPD) meta-analysis of the diagnostic accuracy of the Patient Health Questionnaire-9 (PHQ-9) depression screening tool.^{24, 25} We chose to use data from an IPD meta-analysis because 1)
Our analysis focused on the patient scores of the PHQ-9, which is a self-administered screening tool for depression. PHQ-9 scores are measured on a scale from 0 to 27, where higher scores are indicative of higher depressive symptoms. Previous studies have found that the distribution of PHQ-9 scores in the general population is right-skewed^{26–28}.
For each of the 58 primary studies, we calculated the sample median, minimum and maximum values, and first and third quartiles of the PHQ-9 scores of all patients in order to mimic the scenarios where an aggregate data meta-analysis extracts
Some primary studies used weighted sampling. When extracting
As PHQ-9 scores are integer-valued, PHQ-9 scores of 0 were observed in most of the primary studies. However, a minimum value and/or first quartile value of 0 results in complications for the QE method when estimating the parameters of the lognormal distribution, as the parameter constraints for the QE method implicitly assume that the extracted summary data are strictly positive. Therefore, when applying all methods, a value of 0.5 was added to the extracted summary data. After estimating the sample mean and standard deviation from the shifted summary data, 0.5 was subtracted from the estimated sample mean.
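The shift described above can be wrapped around any estimator, as in the following Python sketch. The wrapper and the toy estimator are hypothetical illustrations; the toy estimator is a placeholder and is not one of the methods evaluated in this paper. Only the mean is shifted back, because a standard deviation is unchanged by adding a constant to the data.

```python
def estimate_with_shift(summary, n, estimator, shift=0.5):
    """Add `shift` to every summary value, estimate, then undo the shift.

    `summary` maps names (e.g. "q1", "med", "q3") to reported values.
    `estimator(summary, n)` must return a (mean, sd) pair.
    """
    shifted = {name: value + shift for name, value in summary.items()}
    mean, sd = estimator(shifted, n)
    return mean - shift, sd


def toy_estimator(summary, n):
    """Placeholder estimator used only to demonstrate the wrapper."""
    mean = (summary["q1"] + summary["med"] + summary["q3"]) / 3
    sd = (summary["q3"] - summary["q1"]) / 1.35
    return mean, sd


mean, sd = estimate_with_shift({"q1": 0, "med": 3, "q3": 9}, 50, toy_estimator)
```

With this wrapper, an estimator that requires strictly positive inputs never receives a value of 0, while the returned mean remains on the original PHQ-9 scale.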
We compared the derived estimated sample means and standard deviations to the true sample means and standard deviations (
We meta-analyzed the PHQ-9 scores using the true study-specific sample means and standard deviations (
The primary studies were highly heterogeneous. When using the true study-specific sample means and standard deviations, the
Lastly, we investigated the skewness of the PHQ-9 scores. To mimic how data analysts may evaluate skewness based on available summary data, we used Bowley’s coefficient to quantify skewness, as it only depends on
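For reference, Bowley's coefficient can be computed from the three quartiles alone, as in this short Python sketch (the function name is hypothetical). The coefficient lies between -1 and 1, is 0 when the quartiles are symmetric about the median, and is positive for right-skewed data.

```python
def bowley_skewness(q1, med, q3):
    """Bowley's quartile skewness: (q3 + q1 - 2 * med) / (q3 - q1)."""
    return (q3 + q1 - 2 * med) / (q3 - q1)
```

For example, quartiles of 0, 3, and 9 give (9 + 0 - 6) / 9 = 1/3, indicating right skew.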
We performed additional analyses to explore the sensitivity of the addition of 0.5 to all summary data. When adding 0.1 or 0.01 to all summary data, all methods obtained similar results.
We proposed two methods to estimate the sample mean and standard deviation from commonly reported quantiles in meta-analysis. Because studies typically report the sample median and other sample quantiles when data are skewed, our analyses focused on the application of the proposed QE and BC methods to skewed data. We compared the QE and BC methods to the widely used methods of Wan et al.^{15} and Luo et al.^{17} in a simulation study and in a real-life meta-analysis.
We found that the QE and BC sample mean estimators performed well, typically yielding average relative error values approaching zero as the sample size increased. In the simulation study and the empirical evaluation, the QE and BC sample mean estimators performed better than the methods of Luo et al. in nearly all scenarios.
Although the BC sample standard deviation estimator performed best or comparably to the best performing method in the primary analyses of the simulation study, the sensitivity analyses and empirical evaluations did not clearly indicate a best performing approach for estimating the sample standard deviation. For all methods, the magnitude of the relative errors for estimating the sample standard deviation was typically higher than for estimating the sample mean.
In practice, the existing and proposed methods enable data analysts to incorporate studies that report medians in meta-analysis. Therefore, we compared the performance of the methods at the meta-analysis level using data from a real-life individual participant data meta-analysis. In this analysis, the methods that performed best for estimating the sample mean often resulted in the most accurate pooled mean estimates as well. As the QE and BC methods performed best for estimating the sample mean, these methods also performed best at the meta-analysis level.
In our empirical assessments, we assumed that all primary studies reported
Repeated applications of the BC method to the same summary data will result in slightly different estimates of the sample mean and standard deviation. This is because the BC method uses Monte Carlo simulation to perform the inverse transformation (i.e., to solve
Our analyses focused on skewed data. As expected, when data were generated from a normal distribution, the Luo et al. sample mean estimators and the Wan et al. sample standard deviation estimators performed best (see
Kwon and Reis^{16, 33} proposed methods for estimating the sample mean and standard deviation from the same sets of summary data considered in this work that are based on applying approximate Bayesian computation (ABC). Unlike the methods of Luo et al. and Wan et al. which assume that the outcome variable is normally distributed, the ABC methods can be applied under different parametric assumptions of the underlying distribution (i.e., normal and skewed distributions). We considered including the ABC methods in this paper. However, we found that several implementation decisions strongly affected the performance of the method in the simulation study and empirical assessments. As investigating how to best implement the ABC methods would be beyond the scope of this paper, we decided not to include these methods in this paper and intend to study this in greater detail in future work.
This work has several limitations. Although the settings in our simulation study were based on those used in previous studies^{13–17} to make a fair comparison between methods, these settings are not exhaustive and results may vary in other settings. Additionally, our simulation study focused solely on the performance of the methods for estimating the sample mean and standard deviation. In future work, we intend to conduct a simulation study investigating the performance of the methods at the meta-analysis level (e.g., for estimating the pooled effect measure and heterogeneity).
Strengths of this work include (
In summary, we recommend the QE and BC methods for estimating the sample mean and standard deviation when data are suspected to be non-normal, as they often outperformed the existing methods in the analyses presented herein. To make these methods widely accessible, we developed the R package ‘estmeansd’ (available on CRAN)^{19} which implements these methods and launched a webpage (available at
This study was funded by the Canadian Institutes of Health Research (CIHR; KRS134297). BDT and AB were supported by Fonds de recherche du Québec – Santé (FRQS) researcher salary awards. BLevis was supported by a CIHR Frederick Banting and Charles Best Canada Graduate Scholarship doctoral award. KER and NS were supported by CIHR Frederick Banting and Charles Best Canada Graduate Scholarship master’s awards. AWL and MA were supported by FRQS Masters Training Awards. DBR was supported by a Vanier Canada Graduate Scholarship. YW was supported by a FRQS Postdoctoral Training Fellowship. PMB was supported by a studentship from the Research Institute of the McGill University Health Centre. DN was supported by a G.R. Caverhill Fellowship from the Faculty of Medicine, McGill University. The primary studies by Amoozegar and by Fiest et al. were funded by the Cumming School of Medicine, University of Calgary, and Alberta Health Services through the Calgary Health Trust, as well as the Hotchkiss Brain Institute. SBP was supported by a Senior Health Scholar award from Alberta Innovates Health Solutions. Collection of data for the study by Arroll et al. was supported by a project grant from the Health Research Council of New Zealand. Data collection for the study by Ayalon et al. was supported by a grant from Lundbeck International. The primary study by Khamseh et al. was supported by a grant (M288) from Tehran University of Medical Sciences. The primary study by Bombardier et al. was supported by the Department of Education, National Institute on Disability and Rehabilitation Research, Spinal Cord Injury Model Systems: University of Washington (grant No H133N060033), Baylor College of Medicine (grant No H133N060003), and University of Michigan (grant No H133N060032). Collection of data for the primary study by Kiely et al. was supported by National Health and Medical Research Council (grant No 1002160) and Safe Work Australia.
PB was supported by Australian Research Council Future Fellowship FT130101444. Collection of data for the primary study by Zhang et al. was supported by the European Foundation for Study of Diabetes, the Chinese Diabetes Society, Lilly Foundation, Asia Diabetes Foundation, and Liao Wun Yuk Diabetes Memorial Fund. RC was supported by a United States National Institute of Mental Health (NIMH) grant (5F30MH096664), and the United States National Institutes of Health (NIH) Office of the Director, Fogarty International Center, Office of AIDS Research, National Cancer Center, National Heart, Blood, and Lung Institute, and the NIH Office of Research for Women’s Health through the Fogarty Global Health Fellows Program Consortium (1R25TW00934001) and the American Recovery and Reinvestment Act. YC received support from NIMH (R24MH071604) and the Centers for Disease Control and Prevention (R49 CE002093). Collection of data for the primary study by Delgadillo et al. was supported by a grant from St Anne’s Community Services, Leeds, UK. Collection of data for the primary study by Fann et al. was supported by grant RO1 HD39415 from the US National Center for Medical Rehabilitation Research. The primary study by Fischer et al. was funded by the German Federal Ministry of Education and Research (01GY1150). Data for the primary study by Gelaye et al. was supported by a grant from the NIH (T37 MD001449). Collection of data for the primary study by Gjerdingen et al. was supported by grants from the NIMH (R34 MH072925, K02 MH65919, P30 DK50456). The primary study by Eack et al. was funded by the NIMH (R24 MH56858). Collection of data for the primary study by Hobfoll et al. was made possible in part by grants from NIMH (RO1 MH073687) and the Ohio Board of Regents. BJH received support from a grant awarded by the Research and Development Administration Office, University of Macau (MYRG201500109FSS).
Collection of data provided by MHärter and KR was supported by the Federal Ministry of Education and Research (grants No 01 GD 9802/4 and 01 GD 0101) and by the Federation of German Pension Insurance Institute. The primary study by Henkel et al. was funded by the German Ministry of Research and Education. The primary study by Hides et al. was funded by the Perpetual Trustees, Flora and Frank Leith Charitable Trust, Jack Brockhoff Foundation, Grosvenor Settlement, Sunshine Foundation, and Danks Trust. Data for the study by Razykov et al. was collected by the Canadian Scleroderma Research Group, which was funded by the CIHR (FRN 83518), the Scleroderma Society of Canada, the Scleroderma Society of Ontario, the Scleroderma Society of Saskatchewan, Sclérodermie Québec, the Cure Scleroderma Foundation, Inova Diagnostics Inc, Euroimmun, FRQS, the Canadian Arthritis Network, and the Lady Davis Institute of Medical Research of the Jewish General Hospital, Montreal, QC. MHudson was supported by a FRQS Senior Investigator Award. Collection of data for the primary study by Hyphantis et al. was supported by a grant from the National Strategic Reference Framework, European Union, and the Greek Ministry of Education, Lifelong Learning and Religious Affairs (ARISTEIAABREVIATE, 1259). The primary study by Inagaki et al. was supported by the Ministry of Health, Labour and Welfare, Japan. The primary study by Twist et al. was funded by the UK National Institute for Health Research under its Programme Grants for Applied Research Programme (grant reference No RPPG06061142). NJ was supported by a Canada Research Chair in Neurological Health Services Research and an AIHS Population Health Investigator Award. KMK was supported by funding from an Australian National Health and Medical Research Council fellowship (grant No 1088313). The primary study by Lamers et al. was funded by the Netherlands Organisation for Health Research and Development (grant No 94503047).
The primary study by Liu et al. was funded by a grant from the National Health Research Institute, Republic of China (NHRIEX979706PI). The primary study by Lotrakul et al. was supported by the Faculty of Medicine, Ramathibodi Hospital, Mahidol University, Bangkok, Thailand (grant No 49086). The primary studies by Osório et al. were funded by Reitoria de Pesquisa da Universidade de São Paulo (grant No 09.1.01689.17.7) and Banco Santander (grant No 10.1.01232.17.9). BLöwe received research grants from Pfizer, Germany, and from the medical faculty of the University of Heidelberg, Germany (project 121/2000) for the study by Gräfe et al. Collection of data for the primary study by Williams et al. was supported by an NIMH grant to LM (RO1MH069666). The primary study by Mohd Sidik et al. was funded under the Research University Grant Scheme from Universiti Putra Malaysia, Malaysia, and the Postgraduate Research Student Support Accounts of the University of Auckland, New Zealand. The primary study by Santos et al. was funded by the National Program for Centers of Excellence (PRONEX/FAPERGS/CNPq, Brazil). The primary study by Muramatsu et al. was supported by an educational grant from Pfizer US Pharmaceutical Inc. FLO was supported by Productivity Grants (PQCNPq2 number 301321/20167). Collection of primary data for the study by Pence et al. was provided by NIMH (R34MH084673). The primary study by Persoons et al. was supported by a grant from the Belgian Ministry of Public Health and Social Affairs and a restricted grant from Pfizer Belgium. The primary study by Picardi et al. was supported by funds for current research from the Italian Ministry of Health. The primary study by Rooney et al. was funded by the UK National Health Service Lothian NeuroOncology Endowment Fund. JS was supported by funding from Universiti Sains Malaysia.
The primary study by Sidebottom et al. was funded by a grant from the United States Department of Health and Human Services, Health Resources and Services Administration (grant No R40MC07840). Simning et al.’s research was supported in part by grants from the NIH (T32 GM07356), Agency for Healthcare Research and Quality (R36 HS018246), NIMH (R24 MH071604), and the National Center for Research Resources (TL1 RR024135). LS received PhD scholarship funding from the University of Melbourne. Collection of data for the studies by Turner et al. was funded by a bequest from Jennie Thomas through the Hunter Medical Research Institute. The study by van SteenbergenWeijenburg et al. was funded by Innovatiefonds Zorgverzekeraars. The study by Wittkampf et al. was funded by the Netherlands Organization for Health Research and Development (ZonMw) Mental Health Program (No 100.003.005 and 100.002.021) and the Academic Medical Center/University of Amsterdam. PAV was supported by the Fund for Innovation and Competitiveness of the Chilean Ministry of Economy, Development and Tourism, through the Millennium Scientific Initiative (grant No IS130005). The primary study by Thombs et al. was done with data from the Heart and Soul Study. The Heart and Soul Study was funded by the Department of Veterans Epidemiology Merit Review Program, the Department of Veterans Affairs Health Services Research and Development service, the National Heart Lung and Blood Institute (R01 HL079235), the American Federation for Ageing Research, the Robert Wood Johnson Foundation, and the Ischemia Research and Education Foundation. No other authors reported funding for primary studies or for their work on this study. No funder had any role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
Declaration of conflicting interests
All authors have completed the ICMJE uniform disclosure form at
In the QE method, the parameters of a candidate distribution are estimated by minimizing the objective function,
We set the initial values for the parameters in the optimization algorithm as follows. First, we apply the methods of Luo et al.^{17} and Wan et al.^{15} to estimate the sample mean and standard deviation, respectively, from
To minimize
The algorithm is considered to converge when the objective function is reduced by a factor of less than 10^{7} of machine tolerance. In each application of the QE method in the simulation study, the algorithm converged for at least three distributions. If the algorithm failed to converge for a given candidate distribution, that candidate distribution was excluded from the model selection procedure.
Parameter constraints for the L-BFGS-B algorithm.
Scenario  Candidate Distribution 



 Normal  
LogNormal  
Gamma  
Beta  
Weibull  
Normal  
LogNormal  
Gamma  
Beta  
Weibull 
To estimate the sample mean and standard deviation using the BC method, the use of Box-Cox transformations requires the solutions to the following problems.
The first problem is defined as follows. In
Equivalently, this problem can be restated as finding
To find
The second problem arises when
ARE of the Luo/Wan (red line, hollow circle), QE (blue line, solid triangle), and BC (green line, solid circle) methods in scenario
ARE of the Luo/Wan (red line, hollow circle), QE (blue line, solid triangle), and BC (green line, solid circle) methods in scenario
Forest plot from the meta-analysis of mean PHQ-9 scores. The study-specific estimates represent the true sample means and their 95% CIs. The pooled estimate shown was obtained using the true study-specific sample means and standard deviations. In the “Mean PHQ-9” column, the true study-specific sample means and their 95% CIs as well as the pooled mean and its 95% CI are given.
ARE of the methods when applied to estimate the sample means and standard deviations of the 58 primary studies. In each column, the ARE value closest to zero is in bold. The presented ARE values were rounded to two decimal places.
ARE for  ARE for  






 
Luo/Wan  −0.14  −0.15  −0.10  −0.15  −  −0.08 
QE  −  0.06  0.00  −  0.34  − 
BC  −0.08 

 −0.25  0.06  0.11 
Estimates of the pooled mean PHQ-9 score and their 95% CIs when using the study-specific derived estimated sample means and standard deviations. For the pooled estimates under the “


 

Luo/Wan  5.76 [5.15, 6.37]  5.68 [5.06, 6.29]  5.97 [5.36, 6.58] 
QE 
 6.88 [6.22, 7.53] 

BC  6.09 [5.48, 6.69] 
 6.58 [6.01, 7.14] 