The relationship between the mean and variance is an implicit assumption of parametric modeling. While many distributions in the exponential family have a theoretical mean-variance relationship, it is often the case that the data under investigation are correlated, thus varying from the relation. We present a generalized method of moments estimation technique for modeling certain correlated data by adjusting the mean-variance relationship parameters based on a canonical parameterization. The proposed mean-variance form describes overdispersion using two parameters and implements an adjusted canonical parameter which makes this approach feasible for all distributions in the exponential family. Test statistics and confidence intervals are used to measure the deviations from the mean-variance relation parameters. We use the modified relation as a means of fitting generalized quasi-likelihood models to correlated data. The performance of the proposed modified generalized quasi-likelihood model is demonstrated through a simulation study and we highlight the importance of accounting for overdispersion in the evaluation of adolescent obesity data collected from a U.S. longitudinal study.

As a common statistical measure, the variance is often relied on to evaluate the model fit and to understand the differences between the responses through the construction of test statistics and confidence intervals. The form of the variance is often assumed based on the underlying distribution of the responses. In fact, the variance is related to the mean for most distributions in the exponential family. However, while the responses may be on a certain scale or resemble a certain distribution, extraneous variation can impact the mean-variance relationship. Extraneous variation, or so-called overdispersion, is often present in longitudinal or clustered data arising from a hierarchical data structure.

Ignoring overdispersion in the fit of correlated data results in summary statistics, including test statistics, with a larger variance than expected.^{1} It often leads to a loss of efficiency in using statistics appropriate for the single-parameter family.^{2} Studies have shown that ignoring overdispersion and thereby misspecifying the model biases the covariate effects and greatly impacts the standard error of the coefficients.^{3, 4} While underdispersion, the case when the variation is smaller than expected, may occur and also impacts the accuracy of the analysis when it is not appropriately specified, McCullagh and Nelder^{5} have suggested that overdispersion may be the norm. Various methods have been proposed to identify the underlying variation and provide corrections to improve estimates of the variance.^{6, 7}

Overdispersion or underdispersion is often identified by estimating the parameters in the mean-variance relationship and measuring the deviations from the theoretical values under the assumed distribution. Kukush et al.^{8} considered a pair of mean and variance functions with a common parameter vector ^{9} considered two parameters ^{10} and score test statistics for overdispersed Poisson and binomial models.^{11} Xiang et al.^{12} provided a score test for overdispersion in a zero-inflated Poisson mixed regression model. Yang, Hardin, and Addy^{13} modified the score statistic to test for overdispersion in the zero-inflated generalized Poisson mixed model. While these tests work well for identifying overdispersion, current parameterizations are limited to one parameter or are only applicable to distributions that have a particular form for the variance.

Overdispersed data are analyzed with appropriate statistical models such as generalized estimating equations, generalized linear mixed models, and joint modeling of the mean and dispersion.^{14} Generalized estimating equations account for correlation through the selection of a covariance structure for the correlated responses.^{15} Generalized linear mixed models have been used to model overdispersion in non-normal data.^{16} These models incorporate random effects, through random intercepts and random slopes, to account for correlation due to clustering.^{17} The joint modeling of the mean and the variance uses an additional dispersion submodel to address the overdispersion in a generalized linear model context.^{18} Joint modeling allows one to simultaneously model both the mean and the variance through submodels. This technique has been extended to consider joint modeling in hierarchical generalized linear model structures.^{19, 20}

Quasi-likelihood models are useful in cases where the underlying distribution is unspecified.^{21} This modeling technique relaxes the distributional assumption in the random component and instead relies on the specification of a mean-variance function. The regression parameter estimates and standard errors are obtained from the specified mean-variance relationship and estimates of the covariance matrix in a quadratic form. The quasi-likelihood approach possesses many good properties, including unbiased estimates and small standard errors as compared to alternative methods.^{22} While the quasi-likelihood method is appropriate for evaluating overdispersed data, the form of the variance has been limited to a single multiplicative overdispersion parameter.
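To make the single multiplicative overdispersion parameter concrete, the following minimal sketch estimates it as the Pearson chi-square statistic divided by its degrees of freedom. It assumes a Poisson-type variance function, Var(Y) = φμ, and an intercept-only mean; the function name `pearson_dispersion` and the toy counts are illustrative and are not taken from the study.

```python
def pearson_dispersion(y, mu, n_params=1):
    """Pearson estimate of phi in Var(Y) = phi * mu (assumed Poisson-type variance)."""
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi2 / (len(y) - n_params)

# Toy counts (hypothetical): one large value inflates the variance.
counts = [0, 1, 2, 3, 14]
mean = sum(counts) / len(counts)          # intercept-only fitted mean
phi_hat = pearson_dispersion(counts, [mean] * len(counts))
print(phi_hat)                            # values well above 1 suggest overdispersion
```

A value of `phi_hat` near 1 would be consistent with the nominal Poisson variance, while larger values indicate extra variation that a single multiplicative factor can only rescale, not reshape.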

This paper proposes a modified generalized quasi-likelihood (MGQL) model which utilizes a canonical two-parameter mean-variance relation. The proposed canonical parameterization is flexible and can be used to represent the form of the variance for any distribution in the exponential family. The incorporation of this mean-variance relationship in the MGQL extends quasi-likelihood models to describe a larger class of variance functions in the analysis of correlated data.

In ^{23} In ^{24} This study collected health information on adolescents over four waves of interviews, and the responses are highly correlated due to the nested structure of the longitudinal study. We demonstrate the use of the MGQL to appropriately account for overdispersion in the evaluation of risk factors associated with obesity.

Let the vector of observations _{i} with known functions ^{5} be

If

Using the expectation of the derivative of the likelihood

The Poisson distribution and the binomial distribution are members of the exponential family and are commonly used to analyze count data and binary data, respectively. The Poisson distribution has probability mass function
^{5, 25}
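As a quick numerical check of the theoretical mean-variance relationships for these two families, the sketch below computes the mean and variance directly from the standard Poisson and binomial probability mass functions. The parameter values (a Poisson mean of 3, and a binomial with 10 trials and success probability 0.3) are arbitrary choices for illustration.

```python
import math

def poisson_pmf(y, mu):
    return math.exp(-mu) * mu ** y / math.factorial(y)

def binomial_pmf(y, n, pi):
    return math.comb(n, y) * pi ** y * (1 - pi) ** (n - y)

def mean_and_variance(pmf, support):
    """Mean and variance of a discrete distribution summed over `support`."""
    mean = sum(y * pmf(y) for y in support)
    var = sum((y - mean) ** 2 * pmf(y) for y in support)
    return mean, var

# Poisson: the variance equals the mean (support truncated at 99; the tail is negligible).
pois_mean, pois_var = mean_and_variance(lambda y: poisson_pmf(y, 3.0), range(100))
# Binomial: the variance equals n * pi * (1 - pi).
bin_mean, bin_var = mean_and_variance(lambda y: binomial_pmf(y, 10, 0.3), range(11))
print(pois_mean, pois_var, bin_mean, bin_var)
```

Observed data whose sample variance departs markedly from these theoretical values, given the sample mean, would indicate over- or underdispersion relative to the assumed family.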

Generalized quasi-likelihood models use the specification of the mean-variance relationship to evaluate correlated data. Consider vectors of correlated observations _{i}. Let _{i} with elements ^{21, 23} The mean of the response vector, ^{th} finite moments of _{ij}. The partial derivative matrix _{i} such that

For

For

In the covariance matrix,

For

For

The quasi-likelihood estimate ^{23} A specification of the GQL model is important as consistency of the regression parameter estimates depends on correctly specifying the link function and the efficiency depends on a correctly specified variance function.
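For readers who want to see how a quasi-likelihood estimating equation of this general type can be solved, the following is a minimal sketch for clustered binary responses with a logit link. It is not the authors' implementation: the exchangeable working correlation with a fixed value `rho`, the function name `gql_logit_fit`, and the simulated data are all assumptions made for illustration. The iteration is a Gauss-Newton update on the sum over clusters of D_i' Σ_i^{-1} (y_i − μ_i).

```python
import numpy as np

def gql_logit_fit(y, X, rho=0.0, max_iter=50, tol=1e-8):
    """Solve sum_i D_i' Sigma_i^{-1} (y_i - mu_i) = 0 for beta under a logit link.

    y: (K, n) binary responses; X: (K, n, p) covariates.
    Sigma_i = A_i^{1/2} R A_i^{1/2}, with R exchangeable (an assumption here).
    """
    K, n, p = X.shape
    R = (1.0 - rho) * np.eye(n) + rho * np.ones((n, n))
    beta = np.zeros(p)
    for _ in range(max_iter):
        lhs = np.zeros((p, p))
        rhs = np.zeros(p)
        for i in range(K):
            mu = 1.0 / (1.0 + np.exp(-X[i] @ beta))
            v = mu * (1.0 - mu)                    # Bernoulli variance function
            D = v[:, None] * X[i]                  # derivative of mu_i w.r.t. beta
            sd = np.sqrt(v)
            Sigma = sd[:, None] * R * sd[None, :]  # working covariance of y_i
            Sinv_D = np.linalg.solve(Sigma, D)
            lhs += D.T @ Sinv_D
            rhs += Sinv_D.T @ (y[i] - mu)
        step = np.linalg.solve(lhs, rhs)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # Model-based covariance from the last accumulated D' Sigma^{-1} D.
    return beta, np.linalg.inv(lhs)

# Illustrative data (hypothetical): 300 clusters of 5 independent binary responses.
rng = np.random.default_rng(1)
X = rng.standard_normal((300, 5, 2))
prob = 1.0 / (1.0 + np.exp(-X @ np.array([1.0, 1.0])))
y = rng.binomial(1, prob)
beta_hat, cov_hat = gql_logit_fit(y, X, rho=0.0)
print(beta_hat, np.sqrt(np.diag(cov_hat)))
```

With `rho=0.0` the update reduces to ordinary logistic regression scoring, which makes the sketch easy to check; a misspecified Σ_i would still give consistent estimates of β in this setup but with a loss of efficiency, in line with the remark above about the variance function.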

For a random variable ^{9} suggested the mean-variance relationship,

For example, if

If

Let _{n} is a symmetric, positive definite weight matrix of dimension ^{26, 27} Then,

We make use of a two-step GMM approach, with an identity weight matrix in the first step. In the second step, the weights are selected as an estimate of the optimal weight matrix for GMM as
^{28} Thus, the vector of GMM estimates for the mean-variance relation parameters
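The two-step procedure can be sketched numerically as follows. Because the exact moment conditions are not reproduced here, the sketch substitutes a generic power-form relation, Var(Y) = φ μ^p, with three illustrative moment conditions built from the instruments (1, m, log m) applied to cluster-level sample means m and sample variances s². These choices, the grid search over p, and all function names are assumptions and should not be read as the authors' specification.

```python
import numpy as np

def moment_parts(m, s2, p):
    """Instruments z and the averages a = mean(z * s2), b = mean(z * m**p)."""
    z = np.column_stack([np.ones_like(m), m, np.log(m)])
    a = (z * s2[:, None]).mean(axis=0)
    b = (z * (m ** p)[:, None]).mean(axis=0)
    return z, a, b

def gmm_fit(m, s2, W, p_grid):
    """Minimise g' W g with g = a - phi * b(p); phi is profiled out in closed form."""
    best = None
    for p in p_grid:
        _, a, b = moment_parts(m, s2, p)
        phi = (b @ W @ a) / (b @ W @ b)
        g = a - phi * b
        q = g @ W @ g
        if best is None or q < best[0]:
            best = (q, phi, p)
    return best[1], best[2]

def two_step_gmm(m, s2, p_grid=np.linspace(0.0, 3.0, 301)):
    # Step 1: identity weight matrix.
    phi1, p1 = gmm_fit(m, s2, np.eye(3), p_grid)
    # Step 2: weight by the inverse of the estimated moment covariance.
    z, _, _ = moment_parts(m, s2, p1)
    g_i = z * (s2 - phi1 * m ** p1)[:, None]
    S = g_i.T @ g_i / len(m)
    phi2, p2 = gmm_fit(m, s2, np.linalg.inv(S), p_grid)
    return (phi1, p1), (phi2, p2)

# Illustrative data (hypothetical): 300 clusters of 200 gamma responses
# generated so that Var(Y) = 2 * mean**1.5.
rng = np.random.default_rng(0)
mu = rng.uniform(2.0, 10.0, size=300)
var = 2.0 * mu ** 1.5
y = rng.gamma(shape=mu ** 2 / var, scale=var / mu, size=(200, 300))
m, s2 = y.mean(axis=0), y.var(axis=0, ddof=1)
step1, step2 = two_step_gmm(m, s2)
print(step1, step2)
```

Profiling φ out analytically leaves a one-dimensional search over p, which avoids dependence on a general-purpose optimizer; with only two moment conditions the model would be exactly identified and the choice of weight matrix would not affect the estimates.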

An alternative approach is to fix one parameter at a time and estimate the second parameter using one moment condition. Thus, an extension of GMM is to make use of additional moment conditions, such as
^{29}

To identify the mean-variance relation in clustered data, consider _{ij}, the ^{th} observation in the ^{th} cluster, _{i} through the link function _{i} represents the variation between clusters such that

The general mean-variance relationship, obtained for the data across all the clusters, is

The parameters ^{30} We do not require complete distributional assumptions, as is required with maximum likelihood estimators, and the estimates are obtainable even when likelihood methods are computationally burdensome.^{26} The GMM estimators for ^{31}

Assume that the data come from a quasi-exponential family. The sample moments are asymptotically normally distributed, so we have
^{30} For the mean-variance relationship parameters

In the optimal case, the weight matrix is selected as ^{26} In practice, the covariance matrix is evaluated using

Significant overdispersion is identified through two hypothesis tests of the overdispersion parameters

Then the z-test statistics
_{α} is the ^{th} quantile from the standard normal distribution.^{28}
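A z-test of this kind reduces to a few lines of arithmetic. The sketch below assumes a two-sided test of a single parameter against its theoretical value, with the standard error supplied by the user; the function name `z_test` and the numbers in the example are hypothetical.

```python
from statistics import NormalDist

def z_test(estimate, null_value, std_error, alpha=0.05):
    """Two-sided z-test (assumed form): returns z, p-value, and the reject decision."""
    z = (estimate - null_value) / std_error
    critical = NormalDist().inv_cdf(1 - alpha / 2)   # standard normal quantile
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, abs(z) > critical

# Hypothetical values: estimate 1.6, theoretical value 1.0, standard error 0.2.
z, p_value, reject = z_test(1.6, 1.0, 0.2)
print(z, p_value, reject)
```

In practice the standard error would come from the estimated asymptotic covariance of the GMM estimators rather than being fixed in advance.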

In this section, we propose a modified generalized quasi-likelihood model for correlated data based on the canonical parameterization. As correlated data necessitate dealing with extravariation, we rely on our two-parameter mean-variance relation. The GQL approach relies on the specification of the mean-variance relationship rather than a distributional assumption. We address the correlation through the empirical mean-variance estimates of

The generalized quasi-likelihood estimating _{i}, ^{23} it relies on the estimate of the covariance

This modification makes use of the GMM estimates of

We simulate hierarchical binary data and compare the estimation of the regression parameters over 1000 iterations under three approaches: the MGQL model, which incorporates GMM estimates of the mean-variance parameters into the quasi-likelihood model framework; a GQL model; and a generalized linear mixed model (GLMM). The two-level binary data contain 50 clusters with 10 observations in each cluster, with the linear predictor _{1} and _{2} are generated from standard normal distributions. The canonical mean-variance parameter relation under the Bernoulli distribution is

We evaluate hierarchical binary data with normally distributed random effects. The random intercept _{i} associated with each cluster is generated from
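A data-generating step of this kind can be sketched as follows. The cluster sizes and the standard normal covariates follow the description above; the logit link, the absence of a fixed intercept, the regression coefficients (both set to 1), and the random-intercept standard deviation `sigma` are assumptions made for illustration, as is the function name.

```python
import numpy as np

def simulate_clustered_binary(n_clusters=50, n_per_cluster=10,
                              beta=(1.0, 1.0), sigma=1.0, seed=0):
    """Two-level binary data with a normally distributed random intercept.

    beta and sigma are assumed values, not taken from the source.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_clusters, n_per_cluster, 2))   # covariates x1, x2
    gamma = sigma * rng.standard_normal(n_clusters)           # cluster random intercepts
    eta = x @ np.asarray(beta) + gamma[:, None]               # linear predictor
    prob = 1.0 / (1.0 + np.exp(-eta))                         # logit link (assumed)
    y = rng.binomial(1, prob)
    return y, x, gamma

y_sim, x_sim, gamma_sim = simulate_clustered_binary()
print(y_sim.shape, y_sim.mean())
```

Repeating this step with different seeds and refitting each model on every replicate gives the kind of Monte Carlo comparison summarized in the simulation tables.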

The simulation results demonstrate that the MGQL approach performs well and suggest that the MGQL model recovers the true values when relying on the estimated mean-variance relationship in the covariance matrix. While the parameter estimates are similar across the three methods, the standard errors for the MGQL estimates of _{1} and _{2} are lower than the standard errors of the GQL approach across all values of

We evaluate the performance of the MGQL, GQL, and GLMM for non-normally distributed random effects. The random effects _{i} are generated under the t-distribution with 4 degrees of freedom, which has heavier tails than the normal distribution. The model parameter estimates and standard errors are reported in

As seen in the previous simulation, MGQL tends to be more efficient than the GQL approach for estimates of _{1} and _{2} compared to the GQL model and GLMM. In addition, for values of

The Add Health Study is a longitudinal study in the United States of adolescents in 7^{th} through 12^{th} grade, with information collected over four waves of interviews between 1994 and 2008.^{24} The data are available on the Add Health website (

The covariates activity scale and feeling scale as well as the random effect are found to be significant across all three models. The regression parameter estimates for activity scale are negative, indicating that increased physical activity is associated with a lower probability of obesity. Similar estimates are produced for the GQL and GLMM approaches, while the MGQL estimate is slightly smaller in magnitude

It is common to assume that the variance of a random variable is a function of the mean, although it is often the case that the true variance in the data may be inflated due to underlying correlation or the hierarchical data structure. While the presence of overdispersion impacts the accuracy of statistical evaluations, the MGQL is a modeling approach that appropriately fits correlated data. The MGQL approach is flexible as it accounts for correlation through an extended representation of the covariance. The canonical parameterization is tractable in the power form for any distribution in the exponential family. Moreover, deviations in the variance can be readily identified using the proposed GMM estimators of the mean-variance parameters

This research uses data from Add Health, a program project directed by Kathleen Mullan Harris and designed by J. Richard Udry, Peter S. Bearman, and Kathleen Mullan Harris at the University of North Carolina at Chapel Hill, and funded by grant P01-HD31921 from the Eunice Kennedy Shriver National Institute of Child Health and Human Development, with cooperative funding from 23 other federal agencies and foundations. Special acknowledgment is due Ronald R. Rindfuss and Barbara Entwisle for assistance in the original design. Information on how to obtain the Add Health data files is available on the Add Health website (

Funding

Disclaimer: The work for this paper was conducted while the first author was at Arizona State University. The findings and conclusions in this paper are those of the authors and do not necessarily represent the views of the National Center for Health Statistics, Centers for Disease Control and Prevention.

Declaration of Conflicting Interests

The authors declare that there is no conflict of interest.

Model Fit Simulation Results for Normally Distributed Random Effects

Model | β_{1} Est | β_{1} SE | β_{2} Est | β_{2} SE | Random effect Est | Random effect SE | Iterations
---|---|---|---|---|---|---|---
MGQL | 1.0083 | 0.1228 | 1.0181 | 0.1231 | 0.5687 | 0.2177 | 5.5
GQL | 1.0053 | 0.1319 | 1.0120 | 0.1320 | 0.5837 | 0.2019 | 5.2
GLMM | 1.0040 | 0.1316 | 1.0110 | 0.1318 | 0.5706 | 0.1872 | -
MGQL | 1.0093 | 0.1243 | 1.0241 | 0.1247 | 0.7822 | 0.1880 | 4.5
GQL | 1.0045 | 0.1344 | 1.0159 | 0.1348 | 0.7948 | 0.1814 | 4.0
GLMM | 1.0035 | 0.1327 | 1.0149 | 0.1331 | 0.7790 | 0.1869 | -
MGQL | 1.0115 | 0.1281 | 1.0237 | 0.1284 | 0.9841 | 0.1924 | 4.5
GQL | 1.0047 | 0.1372 | 1.0137 | 0.1375 | 1.0014 | 0.1916 | 4.0
GLMM | 1.0041 | 0.1286 | 1.0131 | 0.1288 | 0.9832 | 0.1943 | -
MGQL | 1.0103 | 0.1333 | 1.0225 | 0.1336 | 1.1837 | 0.2067 | 4.6
GQL | 1.0025 | 0.1401 | 1.0126 | 0.1404 | 1.2044 | 0.2096 | 4.1
GLMM | 1.0023 | 0.1220 | 1.0125 | 0.1222 | 1.1835 | 0.2108 | -
MGQL | 1.0148 | 0.1402 | 1.0253 | 0.1405 | 1.3951 | 0.2280 | 4.7
GQL | 1.0078 | 0.1437 | 1.0146 | 0.1439 | 1.4120 | 0.2326 | 4.2
GLMM | 1.0080 | 0.1137 | 1.0147 | 0.1139 | 1.3874 | 0.2335 | -

Model Fit Simulation Results for t-Distributed Random Effects

Model | β_{1} Est | β_{1} SE | β_{2} Est | β_{2} SE | Random effect Est | Random effect SE | Iterations
---|---|---|---|---|---|---|---
MGQL | 1.0185 | 0.1245 | 1.0141 | 0.1243 | 0.7653 | 0.1922 | 5.0
GQL | 1.0148 | 0.1347 | 1.0099 | 0.1345 | 0.7831 | 0.1818 | 4.0
GLMM | 1.0136 | 0.1324 | 1.0089 | 0.1320 | 0.7671 | 0.1883 | -
MGQL | 1.0150 | 0.1287 | 1.0071 | 0.1283 | 1.0047 | 0.1953 | 4.7
GQL | 1.0143 | 0.1379 | 1.0041 | 0.1374 | 1.0214 | 0.1935 | 4.1
GLMM | 1.0143 | 0.1281 | 1.0041 | 0.1275 | 1.0075 | 0.1977 | -
MGQL | 1.0134 | 0.1354 | 1.0119 | 0.1352 | 1.2435 | 0.2136 | 4.9
GQL | 1.0118 | 0.1414 | 1.0111 | 0.1413 | 1.2542 | 0.2151 | 4.2
GLMM | 1.0127 | 0.1190 | 1.0120 | 0.1191 | 1.2417 | 0.2194 | -
MGQL | 1.0104 | 0.1427 | 1.0090 | 0.1425 | 1.4770 | 0.2385 | 4.7
GQL | 1.0102 | 0.1450 | 1.0093 | 0.1449 | 1.4821 | 0.2414 | 4.3
GLMM | 1.0119 | 0.1207 | 1.0110 | 0.1205 | 1.4698 | 0.2466 | -
MGQL | 1.0128 | 0.1507 | 1.0050 | 0.1503 | 1.7140 | 0.2705 | 4.8
GQL | 1.0135 | 0.1490 | 1.0076 | 0.1487 | 1.7109 | 0.2716 | 4.4
GLMM | 1.0155 | 0.1328 | 1.0096 | 0.1327 | 1.6969 | 0.2770 | -

Parameter estimates and standard errors for adolescent obesity data

Model | Statistic | Activity scale | Feeling scale | Random effect
---|---|---|---|---
MGQL | Estimate | −1.0921 | −0.6758 | 2.6078
MGQL | Std. Error | 0.0326 | 0.0650 | 0.0794
GQL | Estimate | −1.3458 | −0.5041 | 2.7022
GQL | Std. Error | 0.0449 | 0.0718 | 0.0974
GLMM | Estimate | −1.3438 | −0.6527 | 2.6900
GLMM | Std. Error | 0.0471 | 0.0726 | 0.0959