03706251170BiometricsBiometricsBiometrics0006-341X1541-042021955029440590810.1111/j.1541-0420.2011.01677.xNIHMS677784ArticleA latent variable approach to study gene-environment interactions in the presence of multiple correlated exposuresSánchezBrisa N.brisa@umich.eduKangShanMukherjeeBhramarDepartment of Biostatistics, University of Michigan, Ann Arbor, MI USA 481098420152892011620122242015682466476Summary

Many existing cohort studies initially designed to investigate disease risk as a function of environmental exposures have collected genomic data in recent years with the objective of testing for gene-environment interaction (G × E) effects. In environmental epidemiology, interest in G × E arises primarily after a significant effect of the environmental exposure has been documented. Cohort studies often collect rich exposure data, as a result, assessing G × E effects in the presence of multiple exposure markers further increases the burden of multiple testing, an issue already present in both genetic and environment health studies. Latent variable (LV) models have been used in environmental epidemiology to reduce dimensionality of the exposure data, gain power by reducing multiplicity issues via condensing exposure data, and avoid collinearity problems due to presence of multiple correlated exposures. We extend the LV framework to characterize gene-environment interaction in presence of multiple correlated exposures and genotype categories. Further, similar to what has been done in case-control G × E studies, we use the assumption of gene-environment (G-E) independence to boost the power of tests for interaction. The consequences of making this assumption, or the issue of how to explicitly model G-E association has not been previously investigated in LV models. We postulate a hierarchy of assumptions about the LV model regarding the different forms of G-E dependence and show that making such assumptions may influence inferential results on the G, E, and G × E parameters. We implement a class of shrinkage estimators to data adaptively trade-off between the most restrictive to most flexible form of G-E dependence assumption and note that such class of compromise estimators can serve as a benchmark of model adequacy in LV models. We demonstrate the methods with an example from the Early Life Exposures in Mexico City to Neuro-Toxicants (ELEMENT) study of lead exposure, iron metabolism genes, and birth weight.

Gene-environment independenceprincipal componentsshrinkage estimationstructural equation models
1. Introduction

It is now clear from many lines of evidence that pure genetics or pure environmental factors play only a partial role in the etiology of most complex diseases. Instead, it is now accepted that the majority of chronic diseases likely stem from interactions between genetic traits, “G”, and environmental factors, “E” –an exponentially growing area of study (Khoury and Wacholder, 2009). Characterizing gene-environment interactions, “G × E effects”, is critical in understanding the biological mechanisms of disease etiology and can impact preventive medicine and public health by informing the way clinicians advise their patients and the way public health practitioners assess risk and set policy. Statistical approaches that heighten our ability to understand G × E effects can accelerate mapping of the so far elusive environmental footprint of disease etiology.

Established environmental health cohorts that have demonstrated modest health effects of the environment are now collecting genomic data to test G × E effects. However, G × E interaction studies are statistically difficult problems because of exposure measurement error, multiple potential exposure markers, and prohibitive sample sizes required to reach adequate power. Statistical methods to boost efficiency for testing G × E effects have primarily been developed for case-control studies, where imposing the assumption of independence between environmental exposures and inherited genetic susceptibility factors, so called G-E independence, boosts efficiency of G × E effect estimates (Chatterjee and Carroll, 2005, and references therein). Hybrid approaches that protect against bias under departures from independence constraints have also been proposed (e.g., Mukherjee and Chatterjee, 2008; Li and Conti, 2009; Chen et al., 2009).

Latent variable (LV) models have been used in environmental health studies to extract features from a set of correlated biomarkers, thus reducing dimensionality of exposure data and multiple testing burden, and enhancing power (e.g., Budtz-Jørgensen et al., 2003b). These models have enormous potential for modeling gene-environment interaction between multiple genes and multiple environmental exposures, an area which is still in its formative phase. Chatterjee et al. (2006) motivate a one degree of freedom test for gene-gene interactions from a latent variable perspective in a case-control study. However the issue of modeling G-E dependence through the LV framework is not discussed. More general LV models, as proposed here, can incorporate and estimate G-E association structures, which are of interest to environmental health researchers, and also impose constraints such as independence. In our motivating example, the relationship between iron metabolism genes and lead exposure is of interest in itself (e.g., Hopkins et al., 2008). Very few attempts exist that pose a general LV model for studying G × E effects in cross-sectional or cohort studies (e.g., Dhungana et al., 2007; Rathouz et al., 2008; and Javaras et al., 2010), especially those investigating an array of G-E dependence structures.

The Early Life Exposures in MExico City to Neuro-Toxicants (ELEMENT) study motivates our work. ELEMENT consists of four longitudinal birth cohorts in Mexico City, constituting over 2,000 mother infant pairs with prospectively collected exposure data and several anthropometric, cardiovascular, neuro-development, and behavioral outcomes. Genotyping for these cohorts is underway, with genotyping for the first cohort completed on a set of candidate genes (≈ 400 pairs). Figure 1 is a path diagram describing relationships between four biomarkers of prenatal lead exposure, their interaction with iron metabolism genes, and birth weight.

In Section 2, we describe a model structure to summarize a group of biomarkers into latent variables, and the spectrum of G-E dependence models we consider. In Section 3, we describe maximum likelihood estimation (MLE) and a general class of shrinkage estimators to data-adaptively compromise between the most stringent and most flexible models for G-E association. Small scale simulation studies (Section 4) bring out salient features of our methodology. Section 5 presents analyses of our motivating example. Section 6 discusses the use of LV models for G × E studies, and shrinkage estimators in general LV modeling beyond the G × E context.

2. A Latent Exposure Model for <italic>G × E</italic> Studies2.1 Model Representation

For the ith of N individuals, let Yi represent a univariate health outcome (here birth weight), and Ui be an l × 1 vector of latent exposures measured indirectly by a set of p measurements Ei. In the example, l = 1, p = 4, and Ui is prenatal lead exposure (Figure 1). Let Zi and Wi represent q × 1 and r × 1 covariate vectors, respectively. Without any loss of generality, genotype classes are represented by a categorical variable Gi with classes g = 0, ⋯, G. Genotype classes may arise from data on biallelic polymorphisms where a ‘risk’ allele ‘A’ may alter the exposure metabolism pathway or affect health (with the reference allele denoted by ‘a’) or from combinations of genetic markers measured at multiple loci (e.g., risk alleles A, B). In the lead example we consider two single nucleotide polymorphisms (SNPs) implicated in iron metabolism, but due to sparsity of data, we assume Gi can take two possible values: zero for wild-type on both SNPs (’aa’ and ’bb’), and 1 for at least one copy of either of the risk alleles (i.e., ’Aa’, ’AA’, ’Bb’, or ’BB’), consistent with dominant/recessive models for genetic susceptibility. Alternatively, genetic groups can arise as categorization of an underlying genetic risk score that combines several markers identified by existing genome wide association studies (Qi et al, 2011), or from infant-mother genotype combinations at a single locus in studies of pre-natal fetal exposure.

The latent variable model is then specified in two stages: a health outcome model and an exposure model. In the outcome model, the association between the outcome Yi, exposure Ui and genetic category Gi = g conditional on covariates Zi is characterized by either Yi=β0,g+βU,gUi+βZ,gZi+εi, orYi=β0+βUUi+g=1G{βgIgi+βg×UIgi·Ui+βZZi+βg×ZIgi·Zi}+εi, using genotype class indicators Igi = I(Gi = g). The mean-zero error εi has variance σ2. Equation (1) is written using the multiple group notation (Bollen 1989), which is useful in writing the likelihood (Section 3), and β0,g, βU,g, βZ,g are parameters specific to class g = 0, ⋯, G. Such notation is standard in widely used LV software. In (1), the gene-environment interaction test is specified as H0 : βU,0 = ⋯ = βU,g, i.e., homogeneity of the environment’s effects across genetic groups, and is equivalent to testing the interaction parameters in (2), i.e., H0 : βg×U = βU,gβU,0 = 0 for all g = 1, ⋯, G. G × E interactions are of interest in both environmental and genetic epidemiology, but in environmental epidemiology the G × E question arises after showing main effects of exposure on outcome, and heterogeneity of effects across genetic subgroups as well as the exposure effect within class g, βU,g, are of primary interest.

The exposure model consists of a model for the latent variable (3) dependent on covariates Wi, and a measurement model (4) relating the observed exposure measurements to the latent variable Ui=α0g+αWWi+ξi Ei=νg+ΛgUi+δi with (3) and (4) again written in the multiple group notation. Regression coefficients α0g and αW are l × 1 and l × r matrices with γg = α0g − α00 being effect of genotype class on exposure Ui, and covariates Wi (r × 1) may help predict exposure levels for a given subject (e.g., occupation). The zero-mean error terms, ξi, are assumed independent of εi, and have category-specific l × l covariance matrices Φg. Means vector νg and factor loading matrix Λg are p × 1 and p × l, respectively, and δi has zero mean and p × p covariance matrix Θg.

Although LV models are helpful in many respects, one well known problem is the potential for lack of identifiability. Standard identifiability constraints have been developed for linear latent variable models (Bollen, 1989), and identifiability of latent class models has also been investigated (Huang and Bandeen-Roche, 2004). Essentially, model parameters are constrained to ensure identifiability; for example, some entries of νg and Λg are fixed to 0 or 1, although sometimes algebraic proofs of identifiability are needed (Sánchez et al., 2005). In the lead example, the constraints νg = (0, νg,2, νg,3, νg,4) and Λg = (1, λg,2, λg,3, λg,4) fix the mean and scale of the latent exposure to those of patella lead. Parameters in Φg are typically unconstrained, while the off-diagonal elements of Θg are typically, although not necessarily, restricted to be zero denoting conditional independence between Ei ’s given Ui. However, theoretical identifiability may not necessarily guarantee numerical stability of results, which also depends on sample sizes. Some investigators recommend at least 5 to 10 observations per parameter estimated (Westland, 2010). Hence, users need to be attentive as to what model the available sample size allows them to fit.

2.2 Modeling G-E Dependence

In many G × E studies, it may be natural to assume that an individual’s environmental exposure is independent of genetic factors, but may not be realistic when the gene and the exposure share a metabolic pathway. For example, iron metabolism genes may increase lead absorption (Hopkins et al., 2008), and characterizing such dependence may shed insight into the mechanistic process.

Varying degrees of G-E dependence can be modeled through imposing further constraints on the exposure model parameters (Figure 2). The most restrictive assumption is that parameters are homogeneous across genotypes. We use ‘A0’ to denote this full G-E independence A0:(α0g,Φg,νg,Λg,Θg)=(α0,Φ,ν,Λ,Θ). With A0, the exposure model (3)(4) has at least l parameters in each of α0 and Φ, pl factor loadings Λ and pl intercepts ν, and p parameters Θ, for a total of at least 3p parameters. In the lead exposure model, A0 totals 13 exposure model parameters.

A first step at relaxing A0 is to allow the intercepts and variances in the latent variable equation (3) to differ by genotype. Letting γg = α0g − α00 be the gene effect on the latent exposure we have A1:(νg,Λg,Θg)=(ν,Λ,Θ),butγg0orΦg1Φ0Ifor at least oneg. Relaxing these constraints is very natural, since genotype status may increase absorption of pollutants from the environment as well as change exposure variability. Genotype status may change the distribution of the observed exposure measures, Ei, but modeling change in the distribution of the underlying exposure is a parsimonious way of modeling changes in the actual Ei. This assumption has at least 3p + 2lG exposure model parameters; 15 in the lead example. Alternatively, one could restrict the latent variable variances Φg to be equal across genotype subgroups; that is, a slightly modified assumption A1* : (Φg, νg, Λg, Θg) = (Φ, ν, Λ, Θ).

Next, constraints of equal means ν and factor loadings Λ can be removed, A2:Θg=Θ. Biological mechanisms that modify transfer of pollutants from one compartment to another are consistent with this assumption. In the lead example, three of the observed prenatal lead exposure biomarkers are measured on the mother, while umbilical cord blood on the offspring. The transfer rates from maternal compartments (i.e., blood and bone) to offspring compartment may vary by child’s genotype, such that factor loading λ4 may vary by genotype. A2 has at least 3p + 2Gp exposure model parameters; 21 in the lead example.

Lastly, all equality constraints on the parameters can be removed, namely, A3:All model parameters differ by genotype, yielding at least 4p + 2Gp exposure model parameters; 26 in the lead example. In classical multiple group analyses (Bollen, 1989), one might additionally posit that the structure of the whole model might differ by genotype; e.g., that the number of latent variables differs by genotype. We restrict our attention to assumptions A0–A3, where the model structure is the same across genotypes.

Assumptions A0–A3 have different number of parameters; an increase from 13 to 26 parameters in the example. Constraints on exposure model parameters may increase efficiency and power for the main hypotheses (G, E or G × E tests), but could induce bias in parameters of interest if they are incorrectly assumed. Model evaluation strategies are available that may help select which assumption A0–A3 fits the data best (Bentler and Hu, 1995). However, this task may not be straightforward since many fit criteria exist for LV models. Furthermore, testing all parameter constraints may incur a high rate Type 1 error rate, since the study will unlikely be powered to detect significant differences across genotypes in all model parameters.

3. Parameter estimation

Although various estimation procedures have been proposed for LV models (Bollen, 1989), full MLE is the most common estimation procedure given its wide availability in software packages. We review MLE and describe the implementation of shrinkage estimators that combine MLE estimates. Using shrinkage estimators may be a more suitable approach for parameter estimation. The strategy would be to fit the most restrictive model A0 and the most flexible model A3 (or A2 depending on sample size), and then use shrinkage to derive the final composite estimates.

3.1 Maximum likelihood estimation

Let Yi*=(Yi,Ei) and θ be the vector of all model parameters. Assuming that εi, ξi and δi are normally distributed, and integrating over the latent variable, the joint marginal distribution of the observed outcome and exposures, f(Yi*|Gi=g,Zi,Wi;θ), is a multivariate normal density, with moments given by E(Yi*|Gi=g,Zi,Wi;θ)=νg*+Λg*(α0,g+WiαW)+βZ,g*Zi and Var(Yi*|Gi=g,Zi,Wi;θ)=Λg*Φg(Λg*)+Θg*, where νg*=(β0,gνg)Λg*=(βU,gΛg)βZ,g*=(βZ,g0p×q)Θg*=(σ201×p0p×1Θg) The log likelihood of θ is then, (θ)=i=1Ng=0Glogf(Yi*|Gi=g,Zi,Wi;θ). Parameter estimates are obtained by maximizing ℓ(θ), or equivalently, solving the score equations S(θ)=i=1Ng=0GSi=0, where Si=logf(Yi*|Gi=g,Zi,Wi;θ)/θ is the contribution of the ith observation to the score. Variances for parameters can be obtained by inverting the information matrix, Iθ = −E(∂2ℓ(θ)/∂θθ), or by computing robust variances var^R(θ^)=B1AB, where A=1/Ni=1Ng=0GSiSi and B=1/Ni=1Ng=0GSi/θ.

3.2 Shrinkage estimation

Shrinkage estimators have been used in outcome-dependent sampling based studies as a way to balance bias and efficiency gains from assuming G-E independence while using a retrospective likelihood formulation of the model. Shrinkage approaches will enhance efficiency only when the retrospective model improves efficiency. A typical formulation will factorize the retrospective likelihood p(G, E, Z|Y; θ1, θ2, θ3) as p(Y|G, E, Z; θ1)p(G|E, Z; θ2)p(E, Z; θ3)/p(Y; θ1, θ2, θ3) with θ1 being the outcome model parameters, θ2 and θ3 describing the G-E dependence and exposure-covariate associations, respectively, and the G-E association reflected through the term p(G|E, Z; θ2). Because p(Y; θ1, θ2, θ3) = ∑G, E, Z p(Y|G, E, Z; θ1)p(G|E, Z; θ2)p(E, Z; θ3) (in the denominator) depends on the specification of p(G|E, Z; θ2), the MLE of θ1 depends on the assumed model for the G-E association conditional on covariates. It has been shown (Chatterjee and Carroll, 2005) that assuming conditional G-E independence, p(G|E, Z; θ2) = p(G|Z; θ2), in case-control studies leads to large efficiency gain for estimating the G × E interaction parameter in the outcome model p(Y|G, E, Z; θ1). However, under violation of the independence assumption, these estimators are biased. Shrinkage estimators then arise as a weighted average of two estimators: one obtained under dependence and the other obtained under independence; the weights are chosen in a data-adaptive fashion and reflect the uncertainty around the conditional G-E association (Mukherjee and Chatterjee, 2008). Chen et al. (2009) propose a class of shrinkage estimators applicable, in principle, to estimation problems beyond G × E effects.

In a cohort or cross-sectional study, maximization of the joint likelihood with respect to outcome model parameters, even from modeling the joint distribution p(Y, G, E|Z; θ1, θ2, θ3), will be independent of the specification of p(G, E|Z; θ2, θ3). Because of the lack of outcome dependent sampling, and thus absence of conditioning on Y, the log likelihood becomes a sum of two terms that can be maximized separately. However, the proposed LV framework allows us to incorporate constraints on the G-E association (Section 2.2), raising the possibility of gaining efficiency. In the LV framework, estimation of G-E model parameters cannot be completely disentangled from estimation of outcome model parameters due to integration of the likelihood over the latent variable. The extent of efficiency and power gains for testing of outcome parameters due to such constraints depends on design and effect size settings that we investigate in our simulation study. Although extensive work exists on using G-E independence assumptions in case-control studies, this paper is the first to propose and study the implications such assumptions in a LV setting.

We follow Chen et al. (2009) to describe how estimates obtained under assumptions A0 and A1 − A3 can be combined. Denoting the two estimates by θ̂A0 and θ̂A*, θ^shrink=θ^A*+KMV(θ^A0θ^A*) is the shrinkage estimator, where A* is any of A1 − A3 and shrinkage weights are KMV = ( + ψ̂ψ̂)−1, with ψ̂ = θ̂A0θ̂A* and is the estimated asymptotic covariance matrix of ψ̂. Alternatively, one may use a diagonal weight matrix KCW where the kth diagonal element is V^k/(V^k+ψ^k2); k being the kth diagonal element of and ψ̂k the kth component of ψ̄. Choosing KCW leads to ‘component-wise’ shrinkage, since the weights used for a given component of θ̂shrink depend only on the variance and bias related to that component; we call these estimates EBCW. In contrast, using KMV leads to so-called multivariate shrinkage (Chen et al., 2009), which we refer to as EBMV. In both cases we use superscripts to denote which estimates were combined, e.g., EBMV03 denotes combination of estimates obtained under assumptions A0 and A3. Note that (5) is only defined for parameters that are common to both models, and we have slightly abused notation by using θ̂A* in (5) to only represent the subset of parameters that are equivalent to those in θ̂A0.

Additional considerations about shrinkage estimators are worth mentioning. First, CW shrinkage may be desirable in terms of efficiency gain, compared to multivariate shrinkage in small samples, because large sampling error in the off-diagonals of undermines the potential efficiency gain from multivariate shrinkage (Chen et al., 2009). Second, the form of the weights imply that EBMV estimates will be more prone to favoring the more flexible models. To see this, note that the component-wise shrinkage weights (the kth diagonal element of KCW) can be re-written as 1/(1+χk2), where χk2=ψ^k2/V^k is the ratio of the squared difference in the kth parameter between the two models divided by variance of the difference. The MV shrinkage weights can be similarly re-written: 1/(1 + χ2) where χ2 = ψ̂ −1ψ̂. χk2 and χ2 can be interpreted as a bias-variance ratios; when they are smaller than one, the EB estimates will lean toward the simpler model. In contrast to CW shrinkage, the MV shrinkage weight is the same for all parameters, but the ratio χ2 is a weighted sum of all the bias-variance ratios for all model parameters. Hence, in MV shrinkage, a given parameter might be shrunk given not only its own bias-variance ratio χk2 but also given bias-variance ratios for other parameters. Furthermore, χk2 has expectation (approx) 1 when independence holds, whereas χ2 has expectation equal to the number of model parameters estimated with assumption A0. Hence, MV shrinkage weights 1/(1 + χ2) will almost always be small, leading to EB estimates closer to those from more flexible models. Third, the weights summarize information about model fit in the sense of comparing differences in estimated parameters. If the models fit equally well, then corresponding parameters would likely be similar. Large differences in corresponding parameters indicate a poorer fitting model (e.g., constrained model). Hence, shrinkage estimates can help assess model adequacy, with smaller weights for the constrained model implying the more flexible model is preferred. Finally, both CW and MV shrinkage estimators will asymptotically converge to those from the more flexible model.

Chen et al. (2009) provide formal arguments to derive the variance for θ̂shrink. Heuristically, the variance can be obtained by treating θ̂shrink as a function of two random variables, θ̂A0 and θ̂A*, with joint covariance matrix Σ=var((θ^A0,θ^A*)). Letting h(θA0, θA*) = θA* + V(V + ψψ)−1 ψ, with ψ = θA0θA*, and employing the multivariate Delta theorem, then Var (θ̂shrink) ≈ H Σ̂H, where H=h(θA0,θA*)/(θA0,θA*)|θA0=θ^A0,θA*=θ^A*. Matrix Σ̂ = D−1 CD−⊤ is constructed using the sandwich-variance formula where C=1/Ni=1NPiPi and D=1/Ni=1NPi/θ, and Pi=(SA0,iSA*,i) is a stacked vector of likelihood score contributions from each model.

3.3 Simpler estimation strategies

Instead of positing a latent exposure model, (1)(4), one may fit separate multiple linear regression models (MLR) on each exposure measure, or one regression on their first principal component (PCA), and its interactions with G. We include these simple approaches in Sections 4 and 5.

4. Simulation Studies

We conducted a small scale simulation study to examine the finite sample properties of estimators under various settings of the true data generating model using l = 1, p = 4, and two genetic classes. Genetic class was generated as a binary variable with prevalence 0.2, similar to our data example. Since there are only two gene classes and one latent exposure, in this section (and section 5) we denote the gene effect among unexposed, the exposure effect among wild types, and the interaction parameters as βG, βU, βG×U. We investigate the estimators’ properties under two scenarios of the G-E association: independence (A0) and dependence (A3). We used either βU = βG = βG×U = 0 or βU = 1, βG = βG×U = 2 (i.e., standardized effects of 0.2, 0.4, 0.4, respectively, since outcome variance was σ2 = 52). See Supplementary Materials for full design.

Type I error

When G and E are independent, all approaches retain rejection probabilities (P(R)) of approximately 0.05 for tests at the 0.05 significance level (Table 1, scenario A0). When the data are generated under G-E dependence (Table 1, scenario A3), MLE estimates derived assuming independence (A0) have inflated Type I error probabilities for β̂U and β̂G. While EBCW retain inflated Type I error rates, EBMV estimates do not.

Efficiency/Power

When G-E independence holds (Table 2, scenario A0), gains in efficiency for β̂G and β̂G×U estimated from A0 compared to A1–A3 are very clear: the variance ratio (Var.R) for β̂G×U estimated under A3 vs A0 is 1.90. Efficiency gains translate to large power gains: power for β̂G×U is 0.44 under A3 and 0.66 under A0. Compared to A3, EBCW03 and EBMV03 have lower variance ratios (1.46 and 1.74, respectively) and higher power (0.53 and 0.49, respectively). The PCA approach has power comparable to that from A0, despite the bias in β̂U and β̂G×U.

Bias

When G-E independence does not hold, all parameter estimates have large biases, except those from A3 and EBMV03 (Table 2, scenario A3). Of the MLEs, bias is larger when A0 or A1 are assumed; further, A1 vs. A0 does not result in uniformly less (absolute) bias for both βU and βG×U. Simply relaxing the assumption of different mean and variance for the latent variable, but not the measurement model, may not be sufficient to reduce bias, and could in fact increase it.

EBCW estimates are approximately half way between A0 and the more flexible models, although the exact distance varies depending on the magnitude of the coefficient. As such, they retain some of the bias of A0 estimates when G-E dependence exists. For EBCW01, the bias persists and is larger for βU and βG×U than the bias in A0 for these parameters. In contrast, EBMV02 and EBMV03 are generally closer to the more flexible model (see Section 3.2), and mostly eliminate the bias in A0.

As would be expected from the measurement error literature, parameter estimates using only E1 or PCA as the predictor in multiple regression analysis are biased, i.e., β̂U and β̂G×U are attenuated. However, note that β̂G have large bias as well, due to the G-E dependence and the measurement error in E1 or the first PC in measuring exposure. Measurement error in one predictor (e.g., E1) can induce bias in regression coefficients of covariates measured without error (e.g., G) that are correlated with the error-prone predictor (Budtz-Jørgensen et al., 2003a). The bias in β̂G can be in either positive or negative under the alternative hypothesis (Huang et al., 2005), although estimates will be unbiased when there is no exposure effect (βU = βG×U = 0, Table 1).

MSE

EBMV03 estimates eliminate the bias in A0, but are less efficient; EBCW03 retain some bias, but achieve smaller mean squared error (MSE), hence a better bias-efficiency tradeoff. For example, although EBCW03 incurs 15% bias, its MSE is 0.93, in contrast to an MSE of 1.31 for EBMV03. EBCW estimators achieve a better bias-variance compromise in small samples compared to EBMV.

Additional simulation results for the case of null main effects and small interaction parameter: βU = 0, βG = 0, and βG×U = 0.1 demonstrate that efficiency gains in A0 vs A3 are still observed (Var.R=1.5). Although to a smaller degree, bias in β̂G×E persisted when incorrectly assuming A0 (11% v.s. 3% bias in A3).

Recommendation

For hypothesis testing alone, PCA approaches may be just as good as using a full latent variable model because they maintain Type I error (Table 1) and have power comparable to A0 (Table 2). However, in terms of both bias and efficiency, using EBMV03 is our recommended estimation strategy. Although EBMV03 has slightly higher MSE than EBCW03, it yields unbiased estimates, provides better control of Type I error, and has higher power than EBCW03.

5. Modeling lead exposure, iron metabolism genes, and birth weight

We use data from the first ELEMENT cohort, where the following prenatal lead exposure biomarkers were collected on the mother and child: maternal blood lead levels at delivery and umbilical cord blood lead as well as maternal bone lead levels (patella and tibia) (Gonzalez-Cossio et al., 1997). Birth weight is the health outcome of interest in our analysis. To be included in this analysis, children had to be genotyped, and have measured birth weight and least one of the four prenatal exposure biomarkers (N=406). Missing data on covariates was imputed five times (Raghunathan et al., 2002). Parameter estimates were obtained from the imputed data sets and combined across the imputed data sets according to standard formulae (Little and Rubin, 2002, pg. 86).

Deleterious effects of prenatal lead exposure on birth weight have been demostrated (Gonzalez-Cossio et al., 1997), and the main effects of lead exposure in this sample using the LV model are significant (β̂U = −54.12, se(β̂U) = 25.1, TU = −2.16, Supplementary Materials Table 2). However, individuals with at least one iron metabolism gene variant may be protected against reduced birth weight due to lead exposure (Cantonwine et al., 2010). However, since iron metabolism genes appear to up-regulate iron and lead absorption (Hopkins et al., 2008), there may be dependence between genotype status and lead exposure. We use two SNPs related to iron metabolism, variants of the hemochromatosis gene (C282Y and H63D), and dichotomize genotype into wild type for both (’aa’ and ’bb’) or variant for any (’Aa’, ’AA’, ’bB’, or ’BB’); both SNPs were in Hardy-Weinberg equilibrium, and 83 participants (20%) were classified as variants.

Increasing lead exposure among wild types is associated with decreased birth weight (negative β̂U in Table 3). The largest point estimate is obtained under assumption A1, whereas the lowest is obtained using MLR with the observed patella lead measure as the exposure marker. Such large attenuation in the MLR estimate is due to measurement error of patella lead in capturing prenatal exposure. Similarly, the MLR effect estimated from a PCA-derived exposure measure is attenuated, consistent with the simulation studies. Among the MLE and EB estimates, those obtained from assumptions A0 and A1* have the smallest standard errors, that increase with increasing flexibility of the model as expected. The PCA and MLR standard errors are much smaller, and therefore, even though the point estimates are also largely attenuated, the test statistics are similar to those for the MLE estimates. Component-wise shrinkage estimates are approximately half-way between those from A0 and those from the more flexible models A1*-A3. EBCW01 is closer to the estimates obtained assuming A1, but EBCW02 and EBCW03 are closer to the estimates from A0 than from those obtained with A2 or A3. In contrast, EBMV02 and EBMV03 are closer to the estimates from A2 or A3 compared to those from A0. CW shrinkage favors simpler models since it only trades off bias-variance in one parameter at a time, where as MV shrinkage favors the more flexible model because it simultaneously considers differences in all model parameters (Section 3.2).

Estimates and standard errors for β̂G are fairly constant across G-E assumptions and EB estimates. However, β̂G from MLR and PCA are much higher (more negative) than those from MLE. This can be due to bias arising due to exposure-gene correlation and exposure measurement error (Budtz-Jørgensen et al., 2003a; Huang et al., 2005).

Being variant for iron metabolism genes is protective against reduced the birth weight due to lead exposure (β̂G×U are positive), as hypothesized (Cantonwine et al., 2010). Whereas assumptions about G-E independence did not ultimately alter the conclusions for the main effect among wild types (e.g., all t-test statistics TU < −2.4 in all assumptions), conclusions about β̂G×U are impacted by the assumed G-E model. The most flexible model yields largest estimated effects (105.0g with A3 vs 92.4g with A0) and the standard error is lower (49.2 in A3 vs 63.5 in A0), resulting in t-statistics for the G × U effect as low as TG×U = 1.46 (A0) and as large as 2.13 (A3). This is likely due to a higher degree of overall model residual variance explained in A3 due to an increased number of parameters (i.e., lowest −2 log likelihood, Table 4).

Differences in MLE estimates for the outcome parameters can be largely explained by a few key differences exposure model parameters by genotype (Table 4). While there is little difference in average exposure levels between wild types and variants (small γ̂g), the variance of the latent variable is twice as high among variants (Φ̂g=1 = 2.11) than among wild types (Φ̂g=0 = 1.05). Residual variances for E1 and E2, Θ11 and Θ22, also appear to differ between genotypes (51% and 72% difference, respectively), as does λ4 (17% difference). This deserves further study–e.g., differences in Θ11 and Θ22 might be due to maternal genotypes, which are inherently correlated to infant genotype. Such investigation is out of the scope of the current work, but this finding highlights the utility of LV models in elucidating potential biological pathways.

In this example, implementing multivariate shrinkage was possible only for combining estimates from A0 with those from A3 and A0 with A2 estimates. MV shrinkage using A0 and A1 (and A1*) estimates resulted in a numerically singular variance matrix Σ=Var((θ^A0,θ^A*)), likely due to measurement model parameters being too similar (and correlated) when only making small changes in the latent variable model (3). Although in the example we implemented all approaches for exposition, and even though standard model fit criteria (Table 4) would point toward model A1 being a better model in this particular example, as a general strategy we prefer outcome model parameters estimated using the EBMV03 approach. This approach avoids the potential for increased Type I errors due to fitting multiple models before arriving at a final model, and minimizes bias in outcome model parameters that may persist due to differences in exposure model parameters associated to genotype that may not be declared “significantly different” due to lack of power.

6. Discussion

The presence of multiple correlated measures of exposure exacerbates existing challenges in G × E studies. The current paper is the first step towards an integrated framework where LV models are used to reduce the dimensionality of the exposure measure, thereby limiting the number of tests made and boosting power. Due to the general model formulation, it is easy to accommodate measurement errors in predictors, a pervasive problem in environmental epidemiology, and reduce multicollinearity concerns. Furthermore, a major intuitive appeal of the LV approach is that it provides not only estimates of the disease model parameters, but also a clearer picture of the underlying G-E association and helps capture the essence of the scientific problem. For a genetic marker and exposure which may have a common metabolic pathway, this model is more meaningful to practitioners than a multiple regression model relating Yi to Gi and Ei which is not informative about the association between Gi and Ei.

Because of the flexibility afforded by LV models, one challenge is the potential for model misspecification. In this particular application of LV models, we described various specifications of the G-E association, and discussed how restrictions in the G-E model boost efficiency of the G × E associations, but may incur bias when such restrictions are incorrectly made. We proposed a strategy where one would fit a restricted model and the most flexible model afforded by the data, and then combine estimates based on shrinkage ideas. The proposed approach yields estimates that data-adaptively compromise between bias and variance, and avoids having to fit and re-fit models until a best fitting model is found. Alternatively, estimation could proceed in two stages. First, the most flexible model could be estimated, and genotype differences in exposure model parameters tested. In the second stage, parameters that were found to not differ by genotype would be constrained to be equal across genotypes. However, such two-stage approach would also suffer from inflated Type I error (Mukherjee and Chatterjee, 2008). Yet another alternative, with a similar flavor to what we proposed here, is to average parameter estimates obtained under various G-E assumptions according to prior information of the G-E association (Li and Conti, 2009) or using model fit criteria as weights (Hjort and Claeskens, 2003). Further still, one could use LASSO or Ridge penalties to select which exposure model parameters vary by genotype (Leoutsakos et al., 2010). Lastly, extensions of the methods proposed could include using a continuous genetic risk score G, such that a larger number of genetic categories can be (indirectly) included without collapsing to a few categories due to limited sample size. Such extension may not be straightforward since the multiple group analysis used here would not apply. Compromise estimators like the ones presented have not been used in the LV modeling literature, but can be a tool to achieve improved modeling strategies and robustness in LV models in applications even beyond G × E studies.

It is possible that one may use the proposed approach for screening G × E effects in genome-wide interaction studies. In our simulation studies, the estimation procedure takes approximately 0.36 minutes per data set in a desktop computer with 3.2GHz Intel processor and 1GB RAM. In the advent of cluster and parallel computing the proposed approach is scalable to genome-wide studies. Nevertheless, if the intent is solely testing, and not estimation, the PCA approach may be suitable, since, as shown in the simulation studies, it had comparable power to the proposed shrinkage estimates, despite substantial bias. Employing dimension-reduction approaches to the environmental exposure data will reduce multiple testing problems since only one genome-wide scan would be needed, instead of one scan for each observed exposure. Our methods are particularly appealing to study G × E effects with a given environmental exposure and genetic subclasses defined through genes on a related metabolic pathway.

The availability of higher dimensional genomic data, and multiple continuous or categorical outcomes point to several extensions of our work. General LV models encompass latent class models (Skrondal and Rabe-Hesketh 2004), hence one could posit a latent class model for multiple genetic factors, Gi, which borrows strength from multiple loci and can minimize the chance of false positives (Schumacher and Kraft, 2007). Recent proposals (Chatterjee et al., 2006) pose gene-gene interaction models based on a latent variable approach, and can be extended to reduce the dimension of gene-gene-environment interaction models. Similar to what we have done for the exposure model in the present paper, a latent outcome model to summarize correlated multivariate or longitudinal outcome data Yi can be proposed. One would summarize multivariate correlated outcomes by latent traits, i.e., express Yi in terms of latent outcomes fi (e.g., Budtz-Jørgensen et al., 2003b), and estimate model parameters for a regression of fi on Ui and Gi. When Yi involves repeated measures over time (e.g., growth curves), the model for the observed multivariate vector Yi for subject i, measured at multiple time points may contain a random slope and random intercept, which are inherently latent variables. The random effects can be modeled as dependent on Ui and Gi and other covariates, such that inferences on how exposure and genes modify growth rates can naturally be obtained. Moreover, multivariate observations reflecting latent variables repeated over time (Roy and Lin, 2000), and time-to-event data (Proust-Lima et al., 2009) can be incorporated. In summary, extensions of the present model can involve summarization of all three data components: Y, G, E.

Supplementary MaterialAcknowledgements

The authors thank ELEMENT investigators for proving data for the example, as well as the following NIEHS grants that supported data collection: K23ES000381; P01 ES012874; P42 ES05947; R01 ES013744; R01 ES014930; R01 ES007821. The authors also acknowledge salary support from grants NSF DMS 1007494, NIEHS R01 ES016932 and R01 ES017022, and NIEHS/EPA 1-P20-SE018171-01.

Supplementary Materials

Supplementary Materials referenced in Section 4 are available under the Paper Information link at the Biometrics website http://www.tibs.org/biometrics.

ReferencesBentlerPMHuLTHoyleRHEvaluating model fitStructural Equation Modeling1995LondonSage7699BollenKAStructural Equations with Latent Variables1989New YorkJohn Wiley & SonsBudtz-JørgensenEKeidingNGrandjeanPWeihePWhiteRFConsequences of exposure measurement error for confounder identification in environmental epidemiologyStatistics in Medicine2003a223089310012973789Budtz-JørgensenEKeidingNGrandjeanPWeihePWhiteRFStatistical methods for the evaluation of health effects of prenatal mercury exposureEnvironmetrics2003b14105120CantonwineDHuHTellez-RojoMMSánchezBNLamadrid-FigueroaHEttingerASMercado-GarciaAHernandez-AvilaMWrightROHFE gene variants modify the association between maternal lead burden and infant birthweight: a prospective birth cohort study in Mexico City, MexicoEnvironmental Health201094320659343CarrollRJRuppertDStefanskiLCrainiceanuCMonographs on Statistics and Applied ProbabilityMeasurement Error in Nonlinear Models: A Modern Perspective20062nd editionBoca Raton, FLChapman & HallChatterjeeNCarrollRJSemiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studiesBiometrika200592399418ChatterjeeNKalayliogluZMoslehiRPetersUWacholderSPowerful multilocus tests of genetic association in the presence of gene-gene and gene-environment interactionsAmerican Journal of Human Genetics2006791002101617186459ChenYHChatterjeeNCarrollRJShrinkage estimators for robust and efficient inference in haplotype-based case-control studiesJournal of the American Statistical Association200910422023319430598DhunganaPEskridgeKMBaenzigerPSCampbellBTGillKSDweikatIAnalysis of genotype-by-environment interaction in wheat using a structural equation model and chromosome substitution linesCrop Science200747477484Gonzalez-CossioTPetersonKESaninLHFishbeinEPalazuelosEAroAHernandez-AvilaMHuHDecrease in birth weight in relation to maternal bone-lead burdenPediatrics19971008568629346987HjortNLClaeskensGFrequentist model average estimatorsJournal of the American Statistical Association200398879899HopkinsMREttingerASHernandez-AvilaMSchwartzJTellez-RojoMMLamadrid-FigueroaHBellingerDHuHWrightROVariants in iron metabolism genes predict higher blood lead levels in young childrenEnvironmental Health Perspectives20081161261126618795173HuangGHBandeen-RocheKBuilding an identifiable latent class model with covariate effects on underlying and measured variablesPsychometrika200469532HuangLSWangHKCoxCAssessing interaction effects in linear measurement error modelsJournal of the Royal Statistical Society Series C-Applied Statistics2005542130JavarasKNHudsonJILairdNMFitting ACE structural equation models to case-control family dataGenetic Epidemiology20103423824519918760KhouryMJWacholderSInvited commentary: from genome-wide association studies to gene-environment-wide interaction studies–challenges and opportunitiesAmerical Journal of Epidemiology2009169227230discussion 234–235.LiDLContiDVDetecting gene-environment interactions using a combined case-only and case-control approachAmerican Journal of Epidemiology200916949750419074774LittleRJARubinDBStatistical Analysis with Missing Data20022nd editionNew JerseyJohn Wiley & Sons, IncMukherjeeBChatterjeeNExploiting gene-environment independence for analysis of case-control studies: An empirical bayes-type shrinkage estimator to trade-off between bias and efficiencyBiometrics20086468569418162111Proust-LimaCJolyPDartiguesJFJacqmin-GaddaHJoint modelling of multivariate longitudinal outcomes and a time-to-event: A nonlinear latent class approachComputational Statistics & Data Analysis20095311421154QiLMaJQiQHartialaJAllayeeHCamposHGenetic risk score and risk of myocardial infarction in HispanicsCirculation201112337438021242481RaghunathanTESolenbergerPVan HoewykJIVEware: Imputation and Variance Estimation Software User guideSurvey Methodology Program2002Ann ArborUniversity of MichiganRathouzPVan HulleCRodgersJWaldmanILaheyBSpecification, testing, and interpretation of gene-by-measured-environment interaction models in the presence of gene-environment correlationBehavioral Genetics200838301315RoyJLinXHLatent variable models for longitudinal data with multiple continuous outcomesBiometrics2000561047105411129460SánchezBNBudtz-JørgensenERyanLMHuHStructural equation models: A review with applications to environmental epidemiologyJournal of the American Statistical Association200510014431455SchumacherFRKraftPA bayesian latent class analysis for whole-genome association analyses: an illustration using the gaw 15 simulated rheumatoid arthritis dense scan dataBMC Proc20071Suppl 1S11218466453SkrondalARabe-HeskethSGeneralized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models2004Boca Raton, FLChapman & HallWestlandJCLower bounds on sample size in structural equation modelingElectronic Commerce Research and Applications20109476487

Path diagram showing relationships between exposure biomarkers, latent prenatal lead exposure, iron metabolism genes, and birth weight.

Path diagrams showing gene-environment dependence assumptions.

Bias, variance ratios (Var.R), mean squared error (MSE), and rejection probabilities (P(R)) for outcome model parameter estimates under two scenarios of true exposure model parameters. Outcome model parameters set at βU = βG = βG×U = 0, σ2 =5; sample size was N = 350, with 500§ replicates.

DataModelEst.Methodaβ̂UBiasVar.RbMSEP(R)β̂GBiasVar.RbMSEP(R)β̂G×UBiasVar.RbMSEP(R)
A0A00.001(Ref)0.150.063−0.051(Ref)0.460.048−0.041(Ref)0.650.036
A10.00(1.02)0.160.059−0.05(1.05)0.460.044−0.04(1.10)0.690.038
A20.00(1.01)0.150.059−0.04(1.11)0.480.042−0.05(1.15)0.720.036
A30.00(1.02)0.150.061−0.04(1.12)0.490.042−0.04(1.15)0.740.044
EBCW01 0.00(1.02)0.150.061−0.05(0.99)0.460.046−0.04(0.99)0.660.050
EBCW02 0.00(1.02)0.150.069−0.04(1.02)0.460.046−0.04(1.01)0.660.050
EBCW03 0.00(1.02)0.150.069−0.05(1.03)0.460.044−0.04(1.04)0.670.052
EBMV01 0.00(1.03)0.160.063−0.05(1.02)0.460.044−0.04(1.04)0.690.046
EBMV02 0.00(1.03)0.150.063−0.04(1.09)0.480.044−0.05(1.08)0.710.044
EBMV03 0.00(1.03)0.150.063−0.04(1.09)0.480.042−0.04(1.07)0.730.055
E1c0.00(0.23)0.030.044−0.04(1.38)0.640.0520.00(0.24)0.150.048
PCAd0.00(0.23)0.030.063−0.04(1.02)0.460.0480.01(0.24)0.150.042
A3A00.031(Ref)0.180.126−0.071(Ref)0.770.1420.031(Ref)0.520.043
A10.04(2.59)0.320.038−0.10(2.46)1.150.0630.02(1.21)0.660.050
A20.03(1.42)0.180.036−0.08(2.07)0.950.0560.03(1.08)0.570.043
A30.03(1.27)0.160.038−0.09(2.26)1.040.0470.04(1.19)0.630.050
EBCW01 0.03(1.38)0.230.020−0.07(1.18)0.850.1010.02(1.81)0.560.007
EBCW02 0.03(0.83)0.180.171−0.07(1.10)0.800.1190.03(1.65)0.520.011
EBCW03 0.03(0.84)0.170.153−0.08(1.16)0.820.1060.03(1.61)0.530.018
EBMV01 0.04(2.59)0.320.036−0.10(2.37)1.150.0740.02(1.16)0.660.050
EBMV02 0.03(1.39)0.180.050−0.08(1.98)0.950.0560.03(1.04)0.570.054
EBMV03 0.03(1.25)0.160.054−0.09(2.14)1.030.0590.04(1.13)0.630.061
E1c0.01(0.28)0.040.070−0.00(1.97)0.870.052−0.01(0.23)0.120.047
PCAd−0.01(0.36)0.040.038−0.05(1.57)0.750.056−0.01(0.24)0.120.061

A0, A1, A2, A3 denote MLE; A1* not included because combining A0 and A1* resulted in singular variance matrix Σ (see Section 5).

Ratios of empirical variances, comparing to variance of A0

Multiple regression using one exposure marker, E1

Multiple regression using 1st principal component of (E1, E2, E3, E4) as exposure marker

4.6% (A0) and 7.0% (A3) data sets excluded to lack of convergence or unstable results, see Supplementary Materials for details.

Percent bias, variance ratios (Var.R), mean squared error (MSE), and rejection probabilities (P(R)) for outcome model parameter estimates under two scenarios of true exposure model parameters. Outcome model parameters set at βU = 1, βG = βG×U = 2, σ2 = 5; sample size was N = 350, with 500§ replicates.

DataModelEst.Methodaβ̂UBias%Var.RbMSEP(R)β̂GBias%Var.RbMSEP(R)β̂G×UBias%Var.RbMSEP(R)
A0A0−1.3%1(Ref)0.170.743.4%1(Ref)0.470.85−1.4%1(Ref)0.740.66
A1−1.0%(1.02)0.180.743.0%(1.14)0.540.781.3%(1.18)0.830.62
A2−1.0%(1.03)0.180.743.0%(1.73)0.740.633.1%(1.74)1.060.47
A3−0.7%(1.04)0.180.742.9%(1.76)0.750.633.1%(1.90)1.180.44
EBCW01 −1.2%(1.01)0.170.733.2%(1.08)0.500.82−0.3%(1.08)0.760.64
EBCW02 −1.2%(1.02)0.170.743.4%(1.45)0.560.71−0.5%(1.36)0.820.56
EBCW03 −1.0%(1.02)0.170.753.3%(1.47)0.560.72−1.1%(1.46)0.850.53
EBMV01 −1.0%(1.02)0.180.733.0%(1.12)0.540.791.3%(1.13)0.830.64
EBMV02 −1.0%(1.03)0.180.733.0%(1.66)0.730.643.1%(1.64)1.040.50
EBMV03 −0.7%(1.04)0.180.732.9%(1.67)0.720.652.8%(1.74)1.140.49
E1c−65.4%(0.22)0.460.463.2%(1.33)0.490.83−63.6%(0.23)1.790.41
PCAd−53.0%(0.22)0.310.743.3%(0.94)0.470.86−51.5%(0.23)1.240.65
A3A012.1%1(Ref)0.200.8635.7%1(Ref)1.240.94−27.6%1(Ref)0.920.48
A145.6%(2.57)0.550.78−18.7%(2.57)1.370.35−42.0%(1.24)1.500.27
A28.9%(1.43)0.190.7910.3%(3.79)1.960.51−13.5%(1.86)1.210.40
A33.7%(1.26)0.170.80−1.4%(5.18)2.140.410.9%(2.65)1.340.39
EBCW01 32.8%(2.51)0.430.720.0%(2.83)1.230.44−37.5%(1.94)1.300.19
EBCW02 10.4%(0.94)0.190.8828.4%(3.47)1.580.67−21.3%(1.98)0.940.31
EBCW03 7.4%(0.95)0.180.8621.8%(4.45)1.560.57−14.6%(2.48)0.930.30
EBMV01 45.5%(2.54)0.550.77−18.6%(2.55)1.360.37−42.0%(1.22)1.510.27
EBMV02 9.0%(1.38)0.190.7910.7%(4.03)1.930.50−13.7%(1.97)1.200.39
EBMV03 3.8%(1.23)0.170.79−0.7%(4.89)2.090.440.4%(2.50)1.310.41
E1c−63.5%(0.27)0.440.51110.7%(1.97)5.590.99−78.0%(0.23)2.590.23
PCAd−44.2%(0.33)0.230.7837.3%(1.52)1.290.88−63.7%(0.23)1.770.48

A0, A1, A2, A3 denote MLE; A1* not included because combining A0 and A1* resulted in singular variance matrix Σ (see Section 5).

Ratios of empirical variances, comparing to variance of A0

Multiple regression using one exposure marker, E1

Multiple regression using 1st principal component of (E1, E2, E3, E4) as exposure marker

3.6% (A0) and 2.4% (A3) data sets excluded to lack of convergence or unstable results, see Supplementary Materials for details.

Outcome model parameter estimates, robust standard errors, and t-statistics obtained using MLE under assumptions A0–A3 and shrinkage-based estimates combining assumptions. Coefficients β̂U and β̂G×U have been re-scaled to represent changes in birth weight(g) associated with an increase of 10 µg Pb/g in patella bone mass. Models are adjusted for maternal age, parity, education and marital status

Est. Methoda,bβ̂Use(β̂U)TUβ̂Gse(β̂G)TGβ̂G×Use(β̂G×U)TG×U
A0−80.5929.97−2.69−97.7551.30−1.9192.4263.501.46
A1*−79.5129.79−2.67−99.0951.45−1.9391.9262.561.47
A1−98.4439.54−2.49−98.4751.30−1.92103.0952.711.96
A2−85.1433.75−2.52−98.8351.42−1.9290.6653.571.69
A3−86.6235.01−2.47−100.7451.41−1.96105.0049.242.13
EBCW01* −80.0529.76−2.69−98.0251.31−1.9191.9063.251.45
EBCW01 −91.1240.15−2.27−97.8951.31−1.91101.9561.061.67
EBCW02 −81.7332.13−2.54−97.9051.24−1.9192.5860.071.54
EBCW03 −82.4033.04−2.49−98.9351.37−1.9394.6362.101.52
EBMV02 −85.0233.64−2.53−98.8051.40−1.9290.7053.761.69
EBMV03 −86.2834.66−2.49−100.5851.39−1.96104.3049.812.09
E1c−41.5815.06−2.76−191.5071.07−2.6960.5931.191.94
PCAd−46.5815.34−3.04−187.9269.27−2.7158.5230.041.95

Estimates from models A0, A1, and A1*, were not sufficiently distinguishable from each other, hence EBMV01 and EBMV01* could not be obtained (Section 5).

Multiple regression using patella lead, E1

Multiple regression using 1st principal component of E1, E2, E3, E4 as exposure marker

p-value < 0.05

Exposure model parameter estimates and robust standard errors obtained under assumptions A0–A3 described by Figure 1.

A0A1*A1A2A3





Model for latent variableEst(SE)Est(SE)Est(SE)Est(SE)Est(SE)
α01.536 (0.078)1.519 (0.086)1.526 (0.081)1.506 (0.087)1.506 (0.086)
γg0.084 (0.178)0.050 (0.182)0.148 (0.207)0.148 (0.208)
Φg=01.230 (0.312)1.253 (0.317)0.835 (0.218)1.090 (0.306)1.056 (0.333)
Φg=11.645 (0.489)1.440 (0.460)2.112 (0.916)
Measurement Model
Estimates for wild types
ν21.031 (0.053)1.021 (0.057)1.024 (0.056)1.040 (0.057)1.040 (0.056)
ν31.759 (0.024)1.757 (0.025)1.758 (0.025)1.756 (0.027)1.756 (0.027)
ν42.043 (0.023)2.041 (0.023)2.042 (0.023)2.034 (0.026)2.034 (0.026)
λ20.553 (0.126)0.542 (0.124)0.680 (0.136)0.527 (0.136)0.558 (0.157)
λ30.140 (0.036)0.138 (0.035)0.156 (0.036)0.129 (0.041)0.131 (0.043)
λ40.125 (0.032)0.124 (0.032)0.140 (0.033)0.114 (0.037)0.116 (0.039)
Θ111.172 (0.283)1.148 (0.289)1.401 (0.219)1.236 (0.262)1.258 (0.311)
Θ220.710 (0.099)0.717 (0.097)0.624 (0.103)0.671 (0.096)0.627 (0.104)
Θ330.218 (0.017)0.218 (0.017)0.218 (0.017)0.218 (0.017)0.217 (0.019)
Θ440.194 (0.015)0.194 (0.015)0.194 (0.015)0.195 (0.015)0.197 (0.017)
Θ340.168 (0.014)0.168 (0.014)0.168 (0.014)0.168 (0.014)0.168 (0.016)
Estimates for variantsa
ν20.880 (0.156)0.920 (0.139)
ν31.746 (0.059)1.751 (0.056)
ν42.053 (0.055)2.056 (0.051)
λ20.768 (0.162)0.496 (0.205)
λ30.180 (0.058)0.142 (0.063)
λ40.161 (0.053)0.136 (0.058)
Θ110.612 (0.809)
Θ221.077 (0.259)
Θ330.226 (0.040)
Θ440.183 (0.032)
Θ340.171 (0.033)
Model fit criteriab, (criterion for good fit)
Num Paramc2324253136
−2LL (smaller is better)12302.812302.612296.612291.412287.0
AIC (smaller is better)12348.812350.612346.512353.412359.0
BIC(smaller is better)12440.912446.712446.612477.612503.2
CFId (>.95)0.9630.9620.9700.9680.967
TLId (>.95)0.9610.9590.9670.9620.958
RMSEAd (<.05)0.0410.0420.0380.0410.043

Parameters estimates for variants are the same for as for wild types unless shown here

For the exposure and outcome model combined

Including outcome model parameters

See Bentler and Hu (1995) for definitions