The problem of misclassification is common in epidemiological and clinical research. In some cases, misclassification may be incurred when measuring both exposure and outcome variables. It is well known that the validity of analytic results (e.g. point and confidence interval estimates for odds ratios of interest) can be forfeited when no correction effort is made. Therefore, valid and accessible methods with which to deal with these issues remain in high demand. Here, we elucidate extensions of well-studied methods in order to facilitate misclassification adjustment when a binary outcome and a binary exposure variable are both subject to misclassification. By formulating generalizations of the assumptions underlying the well-studied “matrix” and “inverse matrix” methods into the framework of maximum likelihood, our approach allows flexible modeling of a richer set of misclassification mechanisms when adequate internal validation data are available. The value of our extensions, and a strong case for the internal validation design, are demonstrated by means of simulations and an analysis of bacterial vaginosis and trichomoniasis data from the HIV Epidemiology Research Study.
In many epidemiologic and clinical studies, one aims to quantify the association between binary disease and exposure status, for instance, via odds ratios (ORs) based on 2 × 2 tables. A common practical problem is that misclassification may exist in one or both variables. The threats to the validity of analytic results that stem from misclassification have received considerable attention. For example, the “matrix method” discussed in epidemiological textbooks (
We recognize the practical need to develop intuitive methods for estimating ORs in 2 × 2 tables with a more general view of misclassification. In particular,
Here, we seek to further extend the focus within the 2 × 2 table setting in a way that allows full generalization of the assumed misclassification process, and as a result subsumes the preceding treatments as special cases. This extension is driven by the practicalities of study design and analysis, as we focus on flexible modeling to account for complex misclassification via a rich internal validation sample when both binary variables are subject to errors in measurement. Far from being solely a theoretical exercise, the extension is directly motivated by real data for which we demonstrate that only this most general misclassification model is adequate.
In Section 2, we provide a maximum-likelihood (ML) framework that can be viewed as a practical facilitation of generalized versions of the matrix and inverse matrix methods. To our knowledge, it constitutes the first generalization of the matrix method identity to account for both dependent and differential misclassification and the first generalization of the inverse matrix identity to account for misclassification of both
Consider a 2 × 2 table in which one measures an error-prone surrogate
The observed data likelihood contribution for an observation with (
The first and second terms in
Alternatively, one may choose to parameterize the observed data likelihood contribution in terms of positive and negative predictive values, that is,
Assuming independent misclassification implies that Pr(
When assuming nondifferential and independent misclassification, we define SE
This corresponds to the setting originally studied by
Sections 2.1.1–2.1.3 outline three misclassification mechanisms. However, other possibilities exist; for example,
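To fix ideas about how these mechanisms nest, the following minimal sketch (in our own notation, not the authors' code) treats the fully general mechanism as an arbitrary table of joint error probabilities, with the independent, nondifferential special case recovered by factorizing the joint error into marginal SE/SP terms:

```python
def observed_cell_prob(a, b, pi, err):
    """Pr(X*=a, Y*=b) = sum over (x, y) of Pr(X*=a, Y*=b | X=x, Y=y) * Pr(X=x, Y=y).
    pi[x][y] holds the true cell probabilities; err[(a, b, x, y)] encodes an
    arbitrary (possibly dependent and differential) misclassification mechanism."""
    return sum(err[(a, b, x, y)] * pi[x][y] for x in (0, 1) for y in (0, 1))

def indep_nondiff_err(se_x, sp_x, se_y, sp_y):
    """Independent, nondifferential special case: the joint error probability
    factors into marginal terms governed only by each variable's SE and SP."""
    def p(a, x, se, sp):  # Pr(surrogate = a | truth = x)
        return (se if a == 1 else 1 - se) if x == 1 else (1 - sp if a == 1 else sp)
    return {(a, b, x, y): p(a, x, se_x, sp_x) * p(b, y, se_y, sp_y)
            for a in (0, 1) for b in (0, 1) for x in (0, 1) for y in (0, 1)}
```

Intermediate mechanisms (e.g. independent but differential errors) correspond to other structured choices of the `err` table, while the fully general mechanism leaves all of its entries free.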
In general, the main study likelihood piece based on observed data pairs (
We generalize the concept of the matrix method and its extensions (
Under other assumptions, the matrix
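To make the matrix-method identity concrete, the following sketch (notation and all numerical values are ours) stacks the four surrogate cell probabilities as p* = M p and inverts the relation to recover the true cell probabilities; the simplest independent, nondifferential case is used to build M here, but the same identity applies to any invertible error matrix:

```python
import numpy as np

cells = [(0, 0), (0, 1), (1, 0), (1, 1)]  # (x, y) orderings of the 2 x 2 table

def error_matrix(se_x, sp_x, se_y, sp_y):
    """M[(a, b), (x, y)] = Pr(X*=a, Y*=b | X=x, Y=y) under independent,
    nondifferential misclassification (an illustrative special case)."""
    def p(a, x, se, sp):  # Pr(surrogate = a | truth = x)
        return (se if a == 1 else 1 - se) if x == 1 else (1 - sp if a == 1 else sp)
    return np.array([[p(a, x, se_x, sp_x) * p(b, y, se_y, sp_y)
                      for (x, y) in cells] for (a, b) in cells])

def corrected_or(p_star, M):
    """Solve the matrix-method identity p* = M p for the true cell
    probabilities, then form the odds ratio p11 * p00 / (p10 * p01)."""
    p = dict(zip(cells, np.linalg.solve(M, p_star)))
    return (p[(1, 1)] * p[(0, 0)]) / (p[(1, 0)] * p[(0, 1)])
```

Under dependent and/or differential misclassification, only the construction of M changes; the inversion step is identical.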
The inverse matrix identity directly expresses true cell probabilities as sums of products of surrogate cell probabilities and predictive values. Here, we extend the proposal of
In contrast to the generalized matrix method, there is no matrix inversion involved in computing the corrected OR through the generalized inverse matrix method. In principle, this could confer a numerical advantage in practice, although again direct use of the identity is generally restricted to the setting of sensitivity analysis.
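A corresponding sketch of the inverse-matrix identity (again in our own notation): corrected cell probabilities arise from sums of products of predictive values and surrogate cell probabilities alone, with no inversion step.

```python
cells = [(0, 0), (0, 1), (1, 0), (1, 1)]  # (x, y) cells of the 2 x 2 table

def true_cell_probs(p_star, pred):
    """Inverse-matrix identity: pi[(x, y)] equals the sum over (a, b) of
    Pr(X=x, Y=y | X*=a, Y*=b) * Pr(X*=a, Y*=b).
    p_star[(a, b)] holds surrogate cell probabilities; pred[((x, y), (a, b))]
    holds the predictive values."""
    return {xy: sum(pred[(xy, ab)] * p_star[ab] for ab in cells) for xy in cells}
```

Because only products and sums appear, no invertibility condition is needed; the price is that predictive values, unlike SE/SP, depend on the underlying true cell probabilities.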
The estimate of the corrected OR is
When allowing full generality, that is, dependent and differential misclassification, it can be shown that a full likelihood approach based on the proposed main/internal validation design is equivalent regardless of whether parameterized based on predictive values or SE/SP probabilities (
If parameterizing in terms of SE and SP values, all the
The internal validation subsample likelihood is given by
There are no closed-form solutions for the MLEs based on the overall likelihood written in terms of SE and SP. Interestingly, however, closed forms exist for the predictive value parameterization in the most general case. For example, one can readily verify that
The remaining closed-form MLEs are displayed in
When the misclassification process is not fully general (e.g. assuming independent misclassification and/or nondifferential misclassification of either variable), the equivalence between the likelihoods based on the SE/SP and predictive value parameterizations no longer holds. In such cases, it appears that there are no simple closed forms for likelihood-based
In general, we recommend the use of the ML approach for optimal efficiency and the ease of numerically computing standard errors. Optimizing the full main/internal validation likelihood under either parameterization is readily achieved by taking advantage of numerical procedures in standard statistical software. As such, we view the matrix and inverse matrix constructs more as instructive identities than as practical analysis tools, unless they are to be used solely for sensitivity analyses. Straightforward multivariate delta-method calculations allow computing the approximate standard error of the corrected
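As a small illustration of this numerical strategy (our own sketch, not the authors' software), the following fits a saturated multinomial model to the main-study counts of Table 1 by numerical likelihood optimization and obtains a delta-method standard error for the uncorrected log(OR) from a finite-difference Hessian; the same recipe carries over to the full main/internal validation likelihood:

```python
import numpy as np
from scipy.optimize import minimize

counts = np.array([497.0, 23.0, 138.0, 29.0])  # cells (0,0), (0,1), (1,0), (1,1)

def probs(theta):
    z = np.append(theta, 0.0)                  # softmax keeps probabilities valid
    return np.exp(z) / np.exp(z).sum()

def nll(theta):                                # negative multinomial log-likelihood
    return -np.sum(counts * np.log(probs(theta)))

def grad(theta):                               # analytic gradient of nll
    return (counts.sum() * probs(theta) - counts)[:3]

res = minimize(nll, np.zeros(3), jac=grad, method="BFGS")

def log_or(theta):
    p = probs(theta)
    return np.log(p[3] * p[0] / (p[1] * p[2]))

def hessian(f, x, h=1e-4):                     # central-difference Hessian
    n = len(x)
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            def fp(si, sj):
                xx = x.copy()
                xx[i] += si * h
                xx[j] += sj * h
                return f(xx)
            H[i, j] = (fp(1, 1) - fp(1, -1) - fp(-1, 1) + fp(-1, -1)) / (4 * h * h)
    return H

cov = np.linalg.inv(hessian(nll, res.x))       # inverse observed information
d = 1e-6
g = np.array([(log_or(res.x + d * e) - log_or(res.x - d * e)) / (2 * d)
              for e in np.eye(3)])
se = np.sqrt(g @ cov @ g)                      # delta-method SE of log(OR)
```

For this saturated model the result reproduces the familiar closed form sqrt(1/497 + 1/23 + 1/138 + 1/29); the value of the numerical route is that it applies unchanged to misclassification models for which no closed form exists.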
A natural question one might ask is whether measuring (
While our focus has been on cross-sectional sampling, the case–control sampling scheme is also worthy of discussion. Here, we consider “case–control” studies as those where case oversampling is conducted based on the error-prone responses. In other words, observations with
When correcting the estimate of the OR, we would ideally choose the misclassification mechanism that generated the observed data. Here, we provide a straightforward model selection procedure to guide practitioners. For ease of discussion, denote the dependent and differential misclassification model as “Model 1”, followed by “Model 2” (the independent and differential misclassification model in Section 2.1.2) and “Model 3” (the completely nondifferential model in Section 2.1.3). Model 1 reflects a fully general misclassification mechanism, while Model 2 can be regarded as a generalization of
Define AIC
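The selection rule itself is simple to implement. In the sketch below, the maximized log-likelihoods and parameter counts are hypothetical placeholders, not fitted values:

```python
def aic(max_loglik, n_params):
    """Akaike information criterion: smaller is better."""
    return -2.0 * max_loglik + 2.0 * n_params

# Hypothetical maximized log-likelihoods and parameter counts for the three
# candidate misclassification models of Section 2.7 (placeholder values only).
fits = {
    "Model 1 (dependent, differential)": (-950.2, 15),
    "Model 2 (independent, differential)": (-958.7, 11),
    "Model 3 (nondifferential)": (-956.9, 7),
}
best = min(fits, key=lambda m: aic(*fits[m]))
```

With these placeholder values, the reduced Model 3 wins despite its lower log-likelihood, because its smaller parameter count more than offsets the fit penalty.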
Our motivating example comes from the HERS. This is a multicenter prospective cohort study with a total of 1,310 women enrolled in four U.S. cities from 1993 to 1995 (
We consider 916 patients with complete observations on both errorprone and goldstandard diagnoses of BV and TRICH at the fourth HERS visit. We selected Visit 4, because a previous examination uncovered a complex misclassification process underlying the assessment of BV status at that visit (
With the proposed model selection approach (Section 2.7), Model 1 yields the smallest AIC value among the three candidate models. Therefore, we retain the fully general Model 1 as the final model, suggesting that the HERS data require one to account for dependent misclassification that is differential with respect to both
As discussed in Section 2.5, when utilizing both the goldstandard and surrogate measures of BV and TRICH for all 916 subjects in order to specify the corresponding full likelihood
Our first simulation experiment evaluates the performance of the proposed methods under conditions mimicking the HERS example (Section 3). Cell counts were simulated from a multinomial distribution with cell probabilities of (
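A minimal sketch of this style of data generation follows (the cell probabilities and error rates below are our own illustrative values, not those mimicking the HERS):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 916                                   # total sample size, as in the HERS example
# Hypothetical true cell probabilities Pr(X=x, Y=y) for cells (0,0),(0,1),(1,0),(1,1)
# and hypothetical nondifferential error rates (values ours, for illustration only)
pi = np.array([0.55, 0.05, 0.25, 0.15])
se_x, sp_x, se_y, sp_y = 0.85, 0.95, 0.80, 0.97

true_counts = rng.multinomial(N, pi)
xy = np.repeat([(0, 0), (0, 1), (1, 0), (1, 1)], true_counts, axis=0)

def flip(truth, se, sp):
    """Surrogate generation: keep 1s with prob. SE, turn 0s into 1s with prob. 1 - SP."""
    u = rng.random(truth.shape)
    return np.where(truth == 1, u < se, u < 1 - sp).astype(int)

x_star = flip(xy[:, 0], se_x, sp_x)
y_star = flip(xy[:, 1], se_y, sp_y)

def log_or(x, y):
    n = [[np.sum((x == a) & (y == b)) for b in (0, 1)] for a in (0, 1)]
    return np.log(n[1][1] * n[0][0] / (n[1][0] * n[0][1]))
```

Comparing `log_or(x_star, y_star)` against the true log(OR) implied by `pi` over repeated draws reproduces, in miniature, the bias assessments reported in the simulation tables.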
The corrected results using the generalized matrix methods discussed in Section 2.1.1 agree well with the MLEs when ML estimates of misclassification probabilities are supplied. However, when simpler crude estimates obtained from the validation subsample are inserted into the generalized matrix method, the results are unsatisfactory, even producing negative probability estimates in some cases (
The results in Section 4.1 suggest the importance of misclassification model selection to ensure the model is specified correctly (or, at least, generally enough). Extensive simulations were performed to evaluate the performance of the proposed AIC-based model selection strategy (
The simulation results in
We have considered the classic problem of analyzing 2 × 2 tables, when both binary variables are subject to misclassification. Our main contributions are twofold. First, we have expanded the well-studied matrix (
In the context studied here, the ability to apply a sufficiently general misclassification model can be critical if one hopes to obtain a valid estimate of association. Our motivating example involving BV and TRICH assessments from the HERS illustrates this point extremely well, as we find evidence suggesting bias in all estimates of the OR except the one based on the fully general dependent and differential misclassification model introduced in this article. When misclassification of either variable is differential, the naïve log(OR) estimator can be biased in either direction. Moreover, the HERS example demonstrates that a corrected estimate based on an incorrect nondifferential error assumption for either variable can be worse than the naïve estimate. For this reason, we urge practitioners not to simply assume nondifferential misclassification of either variable, unless that assumption is supported by the data or there is no other recourse.
It should be noted that the familiar matrix and inverse matrix methods as applied in practice are equivalent to special cases of the proposed likelihood-based approach only when MLEs of the misclassification rates are plugged into the generalized matrix identities. Otherwise, estimators based on application of the matrix and inverse matrix methods are not fully efficient. For this reason, we favor the approach advocated here, in which the full main/internal validation study likelihood is utilized. If one is also interested in obtaining a confidence interval for the OR, numerical optimization of the likelihood function greatly reduces the complexity of delta-method-based calculations for computing standard errors to accompany the adjusted log(OR) estimate (
We have proposed a straightforward model selection procedure for practitioners who not only seek a valid analytic result but also pursue the more precise result that may be achievable via a correctly reduced misclassification model. The procedure has been demonstrated to perform stably, permitting the choice of simpler models when the estimated OR deviates acceptably little from that of the general model. However, since the saturated model allowing dependent and differential misclassification is always valid and appeared to sacrifice little efficiency in our simulations given an adequate validation sample, it may often be prudent to avoid model selection and simply settle upon the saturated misclassification model.
Our findings suggest that when designing large-scale epidemiologic studies for which standard outcome (
We are currently investigating natural extensions of the current work to the multivariable regression and longitudinal settings, with internal validation subsampling to facilitate misclassification adjustments. Future work could involve specific consideration of cost-efficient internal validation designs when both
Assuming differential misclassification with independence,
In general,
Under the most general misclassification model (Model 1 in Section 2.7), we may write the likelihood as follows:
The above term can be rewritten as:
Since the term
BV and TRICH data of 916 participants at the fourth HERS visit
Main study
 

        Wet mount TRICH
CLIN BV  −  +  Total
−  497  23  520
+  138  29  167
Total  635  52  687
Internal validation sample  
CLIN BV = 1, WET TRICH = 1, LAB BV = 1, CULTURE TRICH = 1  7  
CLIN BV = 1, WET TRICH = 1, LAB BV = 1, CULTURE TRICH = 0  0  
CLIN BV = 1, WET TRICH = 1, LAB BV = 0, CULTURE TRICH = 1  3  
CLIN BV = 1, WET TRICH = 1, LAB BV = 0, CULTURE TRICH = 0  0  
CLIN BV = 1, WET TRICH = 0, LAB BV = 1, CULTURE TRICH = 1  11  
CLIN BV = 1, WET TRICH = 0, LAB BV = 1, CULTURE TRICH = 0  28  
CLIN BV = 1, WET TRICH = 0, LAB BV = 0, CULTURE TRICH = 1  0  
CLIN BV = 1, WET TRICH = 0, LAB BV = 0, CULTURE TRICH = 0  8  
CLIN BV = 0, WET TRICH = 1, LAB BV = 1, CULTURE TRICH = 1  2  
CLIN BV = 0, WET TRICH = 1, LAB BV = 1, CULTURE TRICH = 0  0  
CLIN BV = 0, WET TRICH = 1, LAB BV = 0, CULTURE TRICH = 1  4  
CLIN BV = 0, WET TRICH = 1, LAB BV = 0, CULTURE TRICH = 0  1  
CLIN BV = 0, WET TRICH = 0, LAB BV = 1, CULTURE TRICH = 1  11  
CLIN BV = 0, WET TRICH = 0, LAB BV = 1, CULTURE TRICH = 0  34  
CLIN BV = 0, WET TRICH = 0, LAB BV = 0, CULTURE TRICH = 1  11  
CLIN BV = 0, WET TRICH = 0, LAB BV = 0, CULTURE TRICH = 0  109  
Total  229 
Description and likelihood contributions for 16 possible types of observations under the internal validation sampling
Obs. type  Description  Likelihood contribution in terms of SE and SP  Likelihood contribution in terms of predictive values 

1  SE  PPV  
2  (1−SP  (1−PPV  
3  SE  PPV  
4  (1−SP  (1−PPV  
5  (1−SE  (1−NPV  
6  SP  NPV  
7  (1−SE  (1−NPV  
8  SP  NPV  
9  SE  PPV  
10  (1−SP  (1−PPV  
11  SE  PPV  
12  (1−SP  (1−PPV  
13  (1−SE  (1−NPV  
14  SP  NPV  
15  (1−SE  (1−NPV  
16  SP  NPV 
Note: See Section 2.1 for the definitions of the terms.
Results of analysis of 916 women at Visit 4 in the HERS, effects of correction models on OR estimates under various misclassification assumptions
Model  Est. log(OR) (SE)  OR (95% CI)  AIC
Naïve  1.54 (0.26)  4.65 (2.81, 7.69)
Gold standard  1.14 (0.18)  3.13 (2.21, 4.43)
Main/internal validation: Model 1  1.18 (0.33)  3.24 (1.14, 5.35)  1,935.0
Main/internal validation: Model 2  1.25 (0.32)  3.48 (1.25, 5.71)  1,946.0
Main/internal validation: Model 3  1.58 (0.31)  4.84 (1.90, 7.78)  1,942.9
Notes:
CLIN BV vs wet mount TRICH for all 916 subjects.
LAB BV vs culture TRICH for all 916 subjects.
229 internal validation and 687 main study observations. Model 1 assumes dependent and differential misclassification.
Model 2 assumes independent and differential misclassification.
Model 3 assumes completely nondifferential misclassification.
Results of simulations addressing main/internal validation study-based analysis mimicking HERS data
Model  Mean log(OR) estimate (SD)  95% CI coverage
Naïve  1.42 (0.23)  67.4% 
Gold standard  1.15 (0.18)  93.6% 
Model 1  1.16 (0.34)  95.7% 
Model 2  1.28 (0.34)  93.3% 
Model 3  1.58 (0.31)  72.4% 
Notes: 500 simulations; 229 internal validation and 687 main study observations per simulation. True log(OR) = 1.14.
Model assuming dependent and differential misclassification.
Model assuming independent and differential misclassification.
Model assuming completely nondifferential misclassification.
Performance of model selection with main/internal validation study-based analysis under a negative association
Model  Mean log(OR) estimate (SD)  Mean SE  95% CI coverage
Naïve  −0.32 (0.15)  0.15  0 
Gold standard  −1.10 (0.14)  0.15  95.4% 
Model 1  −1.10 (0.28)  0.29  95.2% 
Model 2  −1.10 (0.28)  0.28  94.8% 
Model 3 (underlying model)  −1.10 (0.27)  0.27  95.4% 
Model selection  −1.10 (0.27)  0.27  95.4% 
Naïve  −0.61 (0.15)  0.15  9.4% 
Gold standard  −1.10 (0.16)  0.15  93.2% 
Model 1  −1.10 (0.30)  0.28  94.4% 
Model 2 (underlying model)  −1.10 (0.29)  0.28  94.6% 
Model 3  −1.28 (0.26)  0.26  90.0% 
Model selection  −1.10 (0.29)  0.28  94.2% 
Naïve  0.82 (0.27)  0.20  0 
Gold standard  −1.11 (0.15)  0.15  94.6% 
Model 1 (underlying model)  −1.12 (0.28)  0.27  94.1% 
Model 2  −1.00 (0.27)  0.27  85.2% 
Model 3  −0.62 (0.27)  0.27  58.3% 
Model selection  −1.11 (0.28)  0.27  93.2% 
Notes: 500 simulation studies; 229 internal validation observations and 687 main study observations per simulation. Data were generated from a multinomial distribution with cell probabilities of (
Model 3 selected 88.8% of the time.
Model 2 selected 92.4% of the time.
Model 1 selected 85.0% of the time.
Performance of model selection with main/internal validation study-based analysis under a moderate positive association
Model  Mean log(OR) estimate (SD)  Mean SE  95% CI coverage
Naïve  0.22 (0.13)  0.14  1.2%
Gold standard  0.81 (0.14)  0.14  94.6%
Model 1  0.82 (0.27)  0.26  94.8%
Model 2  0.82 (0.26)  0.26  95.0%
Model 3 (underlying model)  0.82 (0.25)  0.25  95.8%
Model selection  0.82 (0.25)  0.25  95.8%
Naïve  −0.28 (0.14)  0.14  0
Gold standard  0.81 (0.14)  0.14  94.6%
Model 1  0.81 (0.26)  0.26  95.6%
Model 2 (underlying model)  0.81 (0.25)  0.25  94.8%
Model 3  0.60 (0.25)  0.25  84.6%
Model selection  0.81 (0.25)  0.25  95.0%
Naïve  1.64 (0.17)  0.17  7.2%
Gold standard  0.82 (0.14)  0.14  94.6%
Model 1 (underlying model)  0.81 (0.25)  0.26  95.0%
Model 2  0.93 (0.24)  0.25  86.6%
Model 3  1.60 (0.24)  0.24  17.0%
Model selection  0.81 (0.25)  0.26  94.4%
Notes: 500 simulation studies; 229 internal validation observations and 687 main study observations per simulation. Data were generated from a multinomial distribution with cell probabilities of (
Model 3 selected 88.0% of the time.
Model 2 selected 94.0% of the time.
Model 1 selected 94.8% of the time.
Performance of model selection with main/internal validation study-based analysis under a strong positive association
Model  Mean log(OR) estimate (SD)  Mean SE  95% CI coverage
Naïve  0.46 (0.14)  0.14  0
Gold standard  1.80 (0.14)  0.15  96.4%
Model 1  1.82 (0.28)  0.30  96.8%
Model 2  1.82 (0.28)  0.29  96.8%
Model 3 (underlying model)  1.82 (0.27)  0.28  96.4%
Model selection  1.82 (0.28)  0.28  96.4%
Naïve  −0.20 (0.15)  0.15  0
Gold standard  1.80 (0.16)  0.15  93.8%
Model 1  1.81 (0.31)  0.29  93.6%
Model 2 (underlying model)  1.81 (0.31)  0.29  93.4%
Model 3  1.59 (0.30)  0.29  85.8%
Model selection  1.81 (0.31)  0.29  93.2%
Naïve  1.98 (0.18)  0.18  68.2%
Gold standard  1.79 (0.14)  0.15  96.8%
Model 1 (underlying model)  1.80 (0.28)  0.29  97.0%
Model 2  1.95 (0.28)  0.28  92.8%
Model 3  2.57 (0.27)  0.27  19.8%
Model selection  1.80 (0.28)  0.29  97.0%
Notes: 500 simulation studies; 229 internal validation observations and 687 main study observations per simulation. Data were generated from a multinomial distribution with cell probabilities of (
Model 3 selected 87.2% of the time.
Model 2 selected 90.6% of the time.
Model 1 selected 95.8% of the time.