To investigate the ability of the propensity score to reduce confounding bias in the presence of nondifferential misclassification of treatment, using simulations.

Using an example from the pregnancy medication safety literature, we carried out simulations to quantify the effect of nondifferential misclassification of treatment under varying scenarios of sensitivity and specificity, exposure prevalence (10%, 50%), outcome type (continuous and binary), true outcome (null and increased risk), confounding direction, and different propensity score applications (matching, stratification, weighting, regression), and obtained measures of bias and 95% confidence interval coverage.

All methods were subject to substantial bias towards the null due to nondifferential exposure misclassification (range: 0% to 47% for 50% exposure prevalence and 0% to 80% for 10% exposure prevalence), particularly if specificity was low (<97%). Propensity score stratification produced the least biased effect estimates. We observed that the impact of sensitivity and specificity on the bias and coverage for each adjustment method is strongly related to prevalence of exposure: as exposure prevalence decreases and/or outcomes are continuous rather than categorical, the effect of misclassification is magnified, producing larger biases and loss of coverage of 95% confidence intervals. Propensity score matching resulted in unpredictably biased effect estimates.

The results of this study underline the importance of assessing exposure misclassification in observational studies in the context of propensity score methods. While propensity score methods reduce confounding bias, bias owing to nondifferential misclassification is of potentially greater concern.

Propensity score methods are used to estimate causal effects in observational studies when there are systematic imbalances in confounders across the treatment groups under study, under assumptions of consistency, exchangeability, positivity, and correct model specification.^{1,2} The technique has gained popularity in the medical literature,^{3} and much of the recent methodologic literature has focused on which covariates should be included in the propensity score,^{4} as well as on the performance of different propensity score methods under different outcome types.^{5,6} However, the role of treatment misclassification as a threat to the ability of the propensity score to reduce bias in the estimation of treatment effects has, to the best of our knowledge, not been explored, although one previous study suggested that this potential for bias should be explored and quantified.^{7} Misclassification of exposure is known to cause bias, which may be towards or away from the null, depending on the type of misclassification,^{8–10} and so appreciating the potential impact of misclassification on the propensity score is vital to understanding its operating characteristics.

To ground this methodologic exploration in the real world, we will use an example from the pregnancy medication safety literature. Research on the safety and efficacy of medication use during pregnancy poses particular exposure misclassification problems. Birth cohort studies, such as the Norwegian Mother and Child Cohort Study,^{11} collect medication use data directly from mothers via prospective self-report. Several studies on the accuracy of prospective maternal recall of medication use during pregnancy suggest that while specificity is often high (values of 0.99 to 1.00), sensitivity may be low (0.17 to 0.41), particularly for medications taken intermittently, such as analgesics.^{12,13} Comparing prescription redemptions in administrative databases^{14} to self-report data often shows substantial disagreement between these information sources,^{12,13} and because pregnancy is a major predictor of medication discontinuation,^{15} women may incorrectly be classified as exposed when they have reduced or discontinued medication use. Many medications that women take during pregnancy may be acquired over-the-counter (OTC), or from other sources, and will not be captured in databases relying on prescription fills. In all of these scenarios, misclassification of a binary exposure is likely to be nondifferential with respect to outcome, and so has an expectation of bias towards the null over many studies, although individual studies may be biased towards or away from the null.^{9,10,16}

To our knowledge, no study has examined the impact of nondifferential exposure misclassification on the performance of the most common propensity score methods employed in the pharmacoepidemiology literature. Using a simulation study constructed with realistic parameters from the pregnancy medication safety literature, we compared the validity of estimates of the exposure effect derived from the application of propensity score methods under varying degrees of nondifferential exposure misclassification. Secondarily, we have compared the bias and coverage resulting from misclassified propensity scores across a variety of common applications of the propensity score. Our aim was to determine the extent to which the ability of propensity score methods to reduce bias was affected by nondifferential exposure misclassification, and additionally, whether some applications of propensity score methods perform better or worse under certain misclassification scenarios.

No ethics review was required because this is a simulation study.

For this simulation study, we consider the case of NSAID use during pregnancy and a continuous outcome, birth weight. NSAIDs are analgesic medications available through both prescription and OTC avenues. Prior research on the safety of NSAID use during pregnancy has produced inconsistent results, with some studies suggesting an increased risk of malformations^{17,18} or low birth weight,^{19} while others find no effect.^{20} Despite recommendations that women discontinue NSAID use during the first and third trimesters of pregnancy, as many as 19% of women use NSAIDs during pregnancy,^{21,22} with wide variation in prevalence (7% to 19%). This variation reflects whether drug utilization studies considered prescription use only or prescription plus OTC use, as well as differences in prescribing practices between countries. Further, among persons with certain pain indications, such as migraine or arthritis, the prevalence of NSAID use is even higher. In studies using drug registries or administrative records, women who acquired NSAIDs OTC will not be counted as exposed, meaning that some women will be considered unexposed when they were truly exposed (i.e. decreased sensitivity). Conversely, registry and administrative data reflect only medications prescribed or dispensed, not medications actually consumed, which means that some women classified as exposed are truly unexposed (i.e. decreased specificity).

The data were generated to closely follow realistic scenarios for NSAID exposure, pregnancy outcome, and confounders. Details on the parameters used for the simulation are outlined in the table of simulation parameters. We simulated five confounders (X_{1} through X_{5}) and an outcome Y; the proposed causal model is shown in the causal diagram. The confounders were X_{1} (analogous to indication for NSAID use, e.g. severe pain, with prevalence 0.50),^{23} X_{2} (analogous to folate supplementation, with prevalence 0.60),^{24} X_{3} (analogous to smoking during pregnancy, with prevalence 0.15),^{25} X_{4} (analogous to concomitant opioid use, with prevalence 0.05),^{26} and X_{5} (analogous to maternal age, with a mean of 30 and a standard deviation of 5). Exposure A, with a prevalence of 50% or 10%, was simulated conditional on these confounders. The nodes A* and U_{A} represent the misclassified exposure and all sources of error leading to misclassification of the exposure, respectively. Exposure was simulated as in the treatment model below.

Model used to simulate probability of treatment, A, conditional on confounders.
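As a rough illustration, the treatment model can be sketched in Python (this is a sketch, not the authors' R code; the intercept `alpha0 = -3.0` is an assumed value tuned to give roughly 10% exposure prevalence, while the slope coefficients come from the simulation parameters table):

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 100_000

# Confounders with the prevalences/means from the simulation parameters table
x1 = rng.binomial(1, 0.50, n)   # indication for NSAID use (e.g. severe pain)
x2 = rng.binomial(1, 0.60, n)   # folate supplementation
x3 = rng.binomial(1, 0.15, n)   # smoking during pregnancy
x4 = rng.binomial(1, 0.05, n)   # concomitant opioid use
x5 = rng.normal(30, 5, n)       # maternal age

# Treatment model:
#   logit Pr(A=1) = alpha0 + 0.8*x1 - 0.05*x2 + 0.4*x3 + 1.2*x4 + 0.01*x5
# alpha0 = -3.0 is an illustrative intercept giving Pr(A=1) near 10%
alpha0 = -3.0
logit_a = alpha0 + 0.8 * x1 - 0.05 * x2 + 0.4 * x3 + 1.2 * x4 + 0.01 * x5
a = rng.binomial(1, 1 / (1 + np.exp(-logit_a)))
```

For the 50%-prevalence scenario, only the intercept would need retuning (to roughly −0.8 under these coefficients).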

We considered two possible outcome specifications: a continuous variable with a mean of 3500 grams and a standard deviation (SD) of 500 grams, analogous to the mean and SD of birth weight (grams) among term births, and a binary variable with a prevalence of 5%, analogous to low birth weight among term births.^{27} The outcome variables were generated conditional on the exposure A and the confounders X_{1} through X_{5}, from the models described below.

Model used to simulate continuous outcome, Y, conditional on treatment A and confounders.

Model used to simulate binary outcome, Y, conditional on treatment A and confounders.
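Under the same assumptions, both outcome models can be sketched as follows (slope coefficients are taken from the simulation parameters table; the intercepts 3450 and −4.8 are assumed values tuned to give a marginal mean near 3500 g and an outcome prevalence near 5%, and exposure is drawn independently here purely for brevity):

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 100_000

# Confounders and a 10%-prevalence exposure (simplified: independent of X here)
x1 = rng.binomial(1, 0.50, n)
x2 = rng.binomial(1, 0.60, n)
x3 = rng.binomial(1, 0.15, n)
x4 = rng.binomial(1, 0.05, n)
x5 = rng.normal(30, 5, n)
a = rng.binomial(1, 0.10, n)

# Continuous outcome: birth weight in grams, true treatment effect -200 g,
# residual SD 500 g; beta0 = 3450 is an assumed intercept (marginal mean ~3500)
y_cont = (3450 - 200 * a - 20 * x1 - 38 * x2 - 300 * x3 - 50 * x4
          + 5 * x5 + rng.normal(0, 500, n))

# Binary outcome: low birth weight, true treatment effect log odds 0.7;
# the intercept -4.8 is an assumed value giving ~5% outcome prevalence
logit_y = (-4.8 + 0.7 * a + 0.2 * x1 + 0.1 * x2 + 0.7 * x3
           + 0.15 * x4 + 0.05 * x5)
y_bin = rng.binomial(1, 1 / (1 + np.exp(-logit_y)))
```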

For the continuous outcome, the true mean difference in birth weight was set to −200 grams (a 200-gram decrease), a difference that would be of clinical concern.^{28} Similarly, the true effect size for the binary outcome was set to an odds ratio of 2.0 (log odds of 0.7). We also considered continuous and binary outcome scenarios in which the true effect of treatment was zero. We simulated joint confounding by the confounders X_{1} through X_{5} to produce effect estimates that (1) overestimated the true effect size by about 15% or (2) underestimated the true effect size by about 15%. Overall, the data-generation process was designed to yield realistic, clinically meaningful true effect sizes that were moderately biased by confounding.

To assess the impact of varying degrees of exposure misclassification, we created misclassified exposure variables with sensitivity and specificity each taking values of 1.00, 0.99, 0.97, 0.95, 0.90, 0.80, and 0.70, values informed by validation studies of maternal recall of medication use.^{12,13} We investigated all 49 possible combinations, assuming nondifferential misclassification; because results from the 70% scenarios were uniformly poor, we have included these data only in the supplementary materials.
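A minimal sketch of nondifferential misclassification of a binary exposure (illustrative Python, not the study's R code; the function name `misclassify` is our own):

```python
import numpy as np

def misclassify(a, sensitivity, specificity, rng):
    """Nondifferentially misclassify a binary exposure vector.

    Truly exposed units are observed exposed with probability `sensitivity`;
    truly unexposed units are observed unexposed with probability `specificity`.
    Misclassification does not depend on the outcome (nondifferential).
    """
    a = np.asarray(a)
    observed_if_exposed = rng.binomial(1, sensitivity, a.size)        # 1 stays 1
    observed_if_unexposed = rng.binomial(1, 1 - specificity, a.size)  # 0 flips to 1
    return np.where(a == 1, observed_if_exposed, observed_if_unexposed)

rng = np.random.default_rng(7)
a_true = rng.binomial(1, 0.10, 100_000)
a_star = misclassify(a_true, sensitivity=0.90, specificity=0.97, rng=rng)
```

Note the arithmetic behind the prevalence dependence: at 10% true prevalence with specificity 0.97, false positives make up 0.90 × 0.03 = 0.027 of the sample against 0.10 × 0.90 = 0.09 true positives, so nearly a quarter of observed-exposed units are misclassified, which is why specificity losses dominate for rare exposures.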

We first fit a propensity score model using logistic regression with the correctly classified exposure, A, as the dependent variable and X_{1} through X_{5} as independent variables; we derived the predicted probability of prenatal exposure to NSAIDs from this model. We then fit propensity score models for all combinations of sensitivity and specificity, resulting in 49 predicted probabilities of exposure. We used the propensity scores to calculate inverse probability of treatment weights (IPTW), in which exposed units received weights of [1/PS] and unexposed received weights of [1/(1-PS)]. We also calculated standardized morbidity/mortality rate weights (SMRW), where exposed individuals received weights of 1 and unexposed received weights of [PS/(1-PS)]. After deriving the propensity score and weights, we fit outcome models in six ways: (1) fitting an unadjusted model; (2) adjusting for the propensity score as a covariate in the multivariable model; (3) matching on the propensity score, using a nearest-neighbor 1:1 matching algorithm and a caliper equal to 0.2 of the standard deviation of the logit of the propensity score; (4) calculating five strata of the propensity score based on the distribution of the propensity score in the exposed, stratifying the outcome model to estimate the effect of exposure on outcome within each stratum, and calculating a pooled effect estimate across strata; (5) fitting an IPT-weighted model; and (6) fitting an SMR-weighted model. Matching and stratification were performed using an R package,^{29} and weighted models were fit with robust standard errors using an R package.^{30} We performed steps 1–6 for the perfectly classified exposure and for each of the 49 misclassified exposure variables.
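The weight formulas and the exposed-quantile stratification described above can be sketched as follows (an illustrative Python translation of the formulas, not the R implementation used in the study; function names are our own):

```python
import numpy as np

def iptw_weights(ps, a):
    """IPTW (targets the ATE): exposed get 1/PS, unexposed get 1/(1-PS)."""
    ps, a = np.asarray(ps, dtype=float), np.asarray(a)
    return np.where(a == 1, 1.0 / ps, 1.0 / (1.0 - ps))

def smrw_weights(ps, a):
    """SMRW (targets the ATT): exposed get 1, unexposed get PS/(1-PS)."""
    ps, a = np.asarray(ps, dtype=float), np.asarray(a)
    return np.where(a == 1, 1.0, ps / (1.0 - ps))

def ps_strata(ps, a, k=5):
    """Stratum labels (0..k-1) from quantiles of the PS among the exposed."""
    ps, a = np.asarray(ps, dtype=float), np.asarray(a)
    cuts = np.quantile(ps[a == 1], np.linspace(0, 1, k + 1))[1:-1]
    return np.digitize(ps, cuts)

# Toy usage on three units
ps = np.array([0.2, 0.5, 0.8])
a = np.array([1, 0, 1])
iptw = iptw_weights(ps, a)   # [5.0, 2.0, 1.25]
smrw = smrw_weights(ps, a)   # [1.0, 1.0, 1.0]
```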

The results of the simulation are presented by outcome type (continuous or categorical), direction of confounding (underestimate vs overestimate), exposure prevalence (50% vs 10%), and true effect size (increased risk vs. no effect). Performance of PS methods varied according to the scenarios considered, with some overall trends emerging. Results for continuous and categorical outcome models in which exposure prevalence was 10% are presented in the accompanying figures and table.


PS-matched estimates initially appeared to outperform other methods as specificity decreased, particularly in cases where confounding was negative. However, this advantage was limited to better coverage: percent bias for matched estimates tended to be comparable to IPTW, SMRW, and PS-adjusted results. Examination of the sample sizes included in the PS-matched samples suggests that coverage is improved because of smaller sample sizes and correspondingly wider confidence intervals.


We compared the performance of five common implementations of propensity score methods, including adjustment, two methods of weighting, matching, and stratification, for varying exposure prevalence, and for both continuous and categorical outcome models. We found that all methods were vulnerable to bias due to misclassification of exposure, and that losses of specificity had a greater impact on effect estimates than losses of sensitivity. PS matching more often produced estimates with worse coverage and greater bias, although in the presence of even moderate misclassification, all methods showed substantial loss of coverage and increase in bias.

The effects of misclassification were more extreme for an exposure with 10% prevalence compared to 50% prevalence. A recent simulation study applying PS methods to rare exposures found that an exposure prevalence of 10% produced estimates with low bias and acceptable variability in comparison to very rare exposures;^{31} however, that study did not consider systematic error due to misclassification. In practice, studies with an exposure prevalence of 10% may be more biased than expected if exposure data are misclassified.^{32} Additionally, we found that estimates from categorical outcome models were less biased than estimates from continuous outcome models where exposure prevalence and misclassification were similar.

It is unsurprising that PS stratification is less vulnerable to misclassification than PS matching. In PS matching, an exposed individual is matched to an unexposed individual, conditional on the propensity score. In the case of moderate to low specificity and low sensitivity, few truly exposed individuals are included in the outcome analysis, which will clearly result in bias towards the null. Prior research on the performance of matching estimators compared to other PS methods is conflicting, with one study showing that matching on the PS is preferable to stratification for purposes of reducing bias due to confounding;^{33} however, that work assumed perfect classification of exposure. Other recent work suggests that PS matching can substantially increase bias compared to other estimators.^{34} Our results are more in line with the latter study, and suggest that other sources of bias, such as misclassification, should be considered when selecting an adjustment method.

It is less clear why PS stratification should outperform PS adjustment or weighting. PS stratification methods estimate the treatment effect within each stratum, and so fitting the parametric model within each stratum relies only on local, rather than global assumptions; this increases the robustness of the estimate,^{29} and could explain why PS stratification methods appear less vulnerable to bias due to misclassification of the exposure.

Given that the unadjusted estimates were often less biased, with better coverage, than the adjusted estimates, particularly under poor sensitivity and specificity, a possible, and alluring, conclusion is that researchers are better off not adjusting for confounders. While this holds in this simulation, where the magnitude of bias due to confounding is known and fixed, the conclusion should not be generalized to observational research, where that magnitude is unknown. Rather, this finding illustrates a long-understood phenomenon when working with real data: effect estimates are subject to multiple sources of bias.

Propensity score stratification, adjustment, and inverse probability of treatment weighting estimate the average treatment effect in the population (ATE): the effect we would expect if the entire population were exposed compared with the entire population being unexposed. Matching and standardized morbidity/mortality rate weighting, by contrast, estimate the average treatment effect in the treated (ATT). One recent study examining the performance of different PS methods for a rare exposure found that ATT estimators were more reliable than ATE estimators,^{31} which is not consistent with our findings, although this may be explained by the fact that we simulated our outcome model in a full cohort (ATE). Further research should seek to clarify whether ATE or ATT measures are more susceptible to exposure misclassification.

As with all simulation studies, this study is likely an oversimplification of reality. We simulated independent, rather than correlated confounders, and examined scenarios where only the exposure, not the confounders, was misclassified. One prior study on the impact of misclassification of confounders included in the propensity score found that even small levels of covariate measurement error reduced the ability of the propensity score to control confounding; further, higher correlation among covariates led to increased bias.^{35} This suggests that we might expect to see more extreme levels of bias, if we had used a more complex confounding structure. We elected to limit our study to misclassification of the exposure, rather than exposure and confounders, but future studies should examine the impact of joint misclassification of exposure, as well as differential exposure misclassification with respect to outcome, on effect estimation within the propensity score context.

This is not the first study to have observed and described a problem of exposure misclassification in the pregnancy medication literature,^{12,13} and others have noted that misclassification of exposure in epidemiologic studies is an endemic and serious problem in the field.^{16} Indeed, the idea that misclassification of exposure will result in bias of effect estimates, likely towards the null, is not new in observational research,^{8,9,36} and various methods, including regression calibration and multiple imputation as well as probabilistic bias analysis, have emerged to address this problem,^{37–41} although the application of these techniques has not yet been tested in propensity score methods. Further, studies of medication safety during pregnancy have recognized,^{42} and in some cases taken steps to correct for,^{43} exposure misclassification. Our study adds to the current literature by underlining the importance of considering multiple sources of systematic bias, not just confounding, when using propensity score methods, particularly propensity score matching.

It is important to note that values of sensitivity and specificity that we refer to as “low” are in fact common in studies of medication use during pregnancy,^{12,13} and that for some medications, such as OTC analgesics, both maternal recall and capture in automated electronic databases may be far worse than the “worst case” 80% scenario we used in this study. Nondifferential misclassification tends to result in bias towards the null,^{16} so this kind of systematic error will generally not result in false positive studies. However, because studies of the safety and efficacy of drugs used in pregnancy are almost exclusively observational, the fact that we are almost certainly failing to detect meaningful risks should be of major concern.

Financial support: This project was not directly funded. Drs. Wood and Chrysanthopoulou are supported by internal institutional funds. Hedvig Nordeng is funded by the ERC StG DrugsInPregnancy (639377). Kate Lapane receives salary support from investigator initiated grants from NCI (R21CA198172), NCATS (TL1TR001454), NIA (R21AG046839), NINR (R56NR015498), NIOSH (R21OH010769), NIGMS (R25GM113686), and Merck.

The authors have no other financial or other conflicts of interest to disclose.

Selected portions of this work were presented as a poster at the 2016 annual meeting of the Society for Epidemiologic Research and as an oral presentation at the annual meeting of the International Society for Pharmacoepidemiology.

IPTW: inverse probability of treatment weight

PS: propensity score

Sens: sensitivity

Spec: specificity

NSAID: non-steroidal anti-inflammatory drug

OTC: over-the-counter

SMRW: standardized morbidity/mortality rate weight

Causal diagram showing measurement error, U_{A}, leading to nondifferential misclassification of the exposure, A, into A*. X_{1} through X_{5} are confounders of the A-Y association. In this simulation study, A is NSAID use in pregnancy, Y is birth weight, X_{1} is indication for NSAID use, X_{2} is folate supplementation, X_{3} is smoking during pregnancy, X_{4} is concomitant opioid use, and X_{5} is maternal age.

Results from continuous outcome models showing (A) Percent bias and (B) coverage for positive confounding. Results are for propensity score matching, regression adjustment, stratification, weighted, and unadjusted models, under varying values of sensitivity and specificity, with a true mean difference of −200 and exposure prevalence set to 10%. Percent bias is calculated as [(observed – truth)/truth]*100%. Coverage is defined as the percent of simulations in which the confidence interval of the effect estimate contained the true effect.

Results from categorical outcome models showing (A) Percent bias and (B) coverage for positive confounding. Results are for propensity score matching, regression adjustment, stratification, weighted, and unadjusted models, under varying values of sensitivity and specificity, with the true log-odds of Y set to 0.7 and exposure prevalence set to 10%. Percent bias is calculated as [(observed – truth)/truth]*100%. Coverage is defined as the percent of simulations in which the confidence interval of the effect estimate contained the true effect.

Parameters of the simulation study

Variable | Prevalence or mean (SD) | Effect size (Treatment model) | Parameter (Treatment model) | Effect size (Outcome model) | Parameter (Outcome model) |
---|---|---|---|---|---|
A | 10%, 50% | | α_{0} | −200 [0.7, 0] | β_{A} |
X_{1} | 50% | 0.8 [−0.8] | α_{1} | −20 [0.2, −0.2] | β_{1} |
X_{2} | 60% | −0.05 [0.05] | α_{2} | −38 [0.1, −0.1] | β_{2} |
X_{3} | 15% | 0.4 [−0.4] | α_{3} | −300 [0.7, −0.7] | β_{3} |
X_{4} | 5% | 1.2 [−1.2] | α_{4} | −50 [0.15, −0.15] | β_{4} |
X_{5} | 30 (5) | 0.01 [−0.01] | α_{5} | 5 [0.05, −0.05] | β_{5} |
Y | 3500 (500), 5% | | | | β_{0} |

Confounders X_{1} to X_{5} were simulated to have parameters analogous to indication for medication use (X_{1}), folic acid supplementation (X_{2}), smoking during pregnancy (X_{3}), concomitant opioid use (X_{4}), and maternal age (X_{5}). A has a prevalence similar to NSAID use during pregnancy, and its simulated effect on Y corresponds to a 200-gram decrease in birth weight. Y was simulated to have the mean and variation of birth weight among term infants in the normal birthing population (for continuous outcomes) and the expected prevalence of low birth weight (5%; for categorical outcomes). Effect sizes for simulation parameters are shown with alternate scenarios in square brackets: e.g. the treatment effect size for the continuous outcome model was −200, with 0.7 and 0 considered as alternate scenarios. Effect sizes for categorical variables are given in log odds, e.g., an odds ratio of 2.0 is equal to a log odds of 0.7.

Selected results for continuous and categorical outcome models with 10% exposure prevalence

Continuous outcome models (true mean difference −200):

Sens/Spec | PS Adjusted β | Bias (%) | Cov. | PS Matched β | Bias (%) | Cov. | PS Stratified β | Bias (%) | Cov. | IPT Weighted β | Bias (%) | Cov. | SMR Weighted β | Bias (%) | Cov. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1.00/1.00 | −199.9 | −0.1 | 95.7 | −198.7 | −0.7 | 95.2 | −200.5 | 0.2 | 94.5 | −199.8 | −0.1 | 95.4 | −199.8 | −0.1 | 96.1 |
1.00/0.99 | −185.1 | −7.4 | 84.2 | −184.5 | −7.7 | 87.9 | −189.9 | −5.0 | 89.2 | −182.3 | −8.8 | 81.8 | −185.9 | −7.1 | 86.8 |
1.00/0.97 | −161.4 | −19.3 | 25.6 | −163.2 | −18.4 | 53.2 | −172.3 | −13.8 | 56.2 | −156.3 | −21.8 | 20.1 | −163.0 | −18.5 | 32.0 |
0.99/1.00 | −199.6 | −0.2 | 95.4 | −198.5 | −0.8 | 96.8 | −200.1 | 0.1 | 94.6 | −199.5 | −0.2 | 95.5 | −199.5 | −0.3 | 96.0 |
0.97/1.00 | −198.9 | −0.5 | 95.2 | −198.9 | −0.5 | 94.8 | −199.5 | −0.3 | 94.8 | −199.1 | −0.5 | 95.2 | −198.8 | −0.6 | 95.7 |
0.95/1.00 | −198.2 | −0.9 | 95.5 | −197.7 | −1.2 | 96.2 | −198.5 | −0.7 | 94.1 | −198.5 | −0.8 | 95.5 | −198.0 | −1.0 | 96.1 |
0.90/1.00 | −196.8 | −1.6 | 95.0 | −196.4 | −1.8 | 94.9 | −197.0 | −1.5 | 94.6 | −197.4 | −1.3 | 94.8 | −196.5 | −1.8 | 95.7 |
0.99/0.99 | −184.5 | −7.7 | 84.4 | −184.4 | −7.8 | 89.9 | −189.4 | −5.3 | 89.0 | −181.8 | −9.1 | 81.0 | −185.2 | −7.4 | 85.7 |
0.99/0.97 | −160.6 | −19.7 | 25.6 | −161.2 | −19.4 | 50.9 | −171.7 | −14.2 | 54.6 | −155.5 | −22.2 | 20.2 | −162.2 | −18.9 | 30.8 |
0.97/0.99 | −183.8 | −8.1 | 82.4 | −183.6 | −8.2 | 88.8 | −189.0 | −5.5 | 88.3 | −180.9 | −9.5 | 80.7 | −184.4 | −7.8 | 84.4 |
0.97/0.97 | −159.0 | −20.5 | 21.8 | −160.0 | −20.0 | 49.0 | −169.6 | −15.2 | 50.4 | −153.9 | −23.0 | 17.6 | −160.5 | −19.7 | 26.4 |
0.95/0.99 | −182.5 | −8.7 | 81.5 | −182.7 | −8.6 | 90.1 | −187.5 | −6.3 | 87.1 | −179.9 | −10.1 | 78.8 | −183.2 | −8.4 | 83.0 |
0.95/0.97 | −157.6 | −21.2 | 19.9 | −159.0 | −20.5 | 47.3 | −168.7 | −15.7 | 49.1 | −152.5 | −23.7 | 15.4 | −159.2 | −20.4 | 24.0 |
0.90/0.99 | −180.4 | −9.8 | 76.8 | −180.8 | −9.6 | 87.6 | −185.0 | −7.5 | 84.0 | −178.0 | −11.0 | 75.1 | −180.8 | −9.6 | 79.2 |
0.90/0.97 | −154.4 | −22.8 | 15.7 | −154.8 | −22.6 | 39.5 | −164.8 | −17.6 | 40.9 | −149.5 | −25.3 | 12.4 | −155.8 | −22.1 | 19.3 |

Categorical outcome models (true log odds 0.7):

Sens/Spec | PS Adjusted β | Bias (%) | Cov. | PS Matched β | Bias (%) | Cov. | PS Stratified β | Bias (%) | Cov. | IPT Weighted β | Bias (%) | Cov. | SMR Weighted β | Bias (%) | Cov. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1.00/1.00 | 0.70 | −0.2 | 96.1 | 0.69 | −2.0 | 94.9 | 0.70 | −0.4 | 96.1 | 0.69 | −1.6 | 96.3 | 0.69 | −1.7 | 95.5 |
1.00/0.99 | 0.66 | −6.2 | 93.5 | 0.65 | −7.3 | 93.1 | 0.67 | −4.8 | 95.5 | 0.64 | −8.5 | 93.2 | 0.65 | −7.0 | 92.9 |
1.00/0.97 | 0.59 | −15.6 | 85.1 | 0.59 | −16.0 | 89.5 | 0.61 | −12.3 | 90.6 | 0.57 | −18.2 | 81.8 | 0.59 | −15.6 | 84.4 |
0.99/1.00 | 0.70 | −0.4 | 95.7 | 0.69 | −1.6 | 95.0 | 0.69 | −0.7 | 96.0 | 0.69 | −1.8 | 96.0 | 0.69 | −1.9 | 95.2 |
0.97/1.00 | 0.69 | −1.1 | 95.7 | 0.68 | −2.4 | 95.0 | 0.69 | −1.6 | 95.7 | 0.68 | −2.4 | 96.2 | 0.68 | −2.7 | 95.5 |
0.95/1.00 | 0.69 | −1.1 | 95.7 | 0.69 | −1.7 | 95.3 | 0.69 | −1.6 | 95.9 | 0.68 | −2.3 | 96.2 | 0.68 | −2.7 | 94.8 |
0.90/1.00 | 0.68 | −2.9 | 95.4 | 0.67 | −4.2 | 94.7 | 0.68 | −3.5 | 95.6 | 0.67 | −3.8 | 95.3 | 0.67 | −4.6 | 94.1 |
0.99/0.99 | 0.66 | −6.4 | 94.1 | 0.65 | −7.1 | 93.2 | 0.66 | −5.1 | 94.7 | 0.64 | −8.5 | 93.1 | 0.65 | −7.2 | 93.5 |
0.99/0.97 | 0.59 | −16.2 | 84.0 | 0.59 | −16.1 | 89.0 | 0.61 | −12.9 | 89.8 | 0.57 | −18.9 | 81.4 | 0.59 | −16.2 | 84.4 |
0.97/0.99 | 0.65 | −7.0 | 93.4 | 0.65 | −7.3 | 93.9 | 0.66 | −5.8 | 95.1 | 0.64 | −9.1 | 93.9 | 0.64 | −7.9 | 93.0 |
0.97/0.97 | 0.58 | −16.8 | 83.3 | 0.58 | −17.3 | 87.5 | 0.60 | −13.7 | 88.8 | 0.56 | −19.3 | 80.7 | 0.58 | −16.8 | 83.2 |
0.95/0.99 | 0.65 | −7.5 | 94.5 | 0.65 | −7.6 | 93.5 | 0.66 | −6.3 | 95.0 | 0.63 | −9.5 | 93.2 | 0.64 | −8.4 | 93.6 |
0.95/0.97 | 0.58 | −17.8 | 82.4 | 0.58 | −17.5 | 86.6 | 0.60 | −14.5 | 88.8 | 0.56 | −20.5 | 78.0 | 0.58 | −17.8 | 82.9 |
0.90/0.99 | 0.64 | −8.7 | 93.1 | 0.64 | −8.0 | 93.9 | 0.65 | −7.8 | 94.7 | 0.63 | −10.5 | 92.6 | 0.63 | −9.6 | 91.7 |
0.90/0.97 | 0.56 | −19.7 | 78.8 | 0.56 | −19.9 | 86.0 | 0.58 | −16.5 | 87.1 | 0.54 | −22.2 | 76.5 | 0.56 | −19.8 | 78.7 |

Abbreviations: IPT: inverse probability of treatment; SMR: standardized mortality/morbidity rate; PS: propensity score; Sens: sensitivity; Spec: specificity; Cov: coverage.

Observed effect of exposure, A, on outcome Y, expressed as mean difference between exposed and unexposed (for continuous outcome models) and log odds (log odds of 0.7 is equal to an odds ratio of 2.0) for categorical outcome models.

Percent bias, calculated as: [(β_{OBSERVED} − β_{TRUE})/β_{TRUE}]*100.

Percentage of simulations in which the true effect (−200 for continuous outcome models and 0.7 for categorical outcome models) is included in the 95% confidence interval.

1:1 matching with a caliper equal to 20% of the standard deviation of the logit of the PS.

Five strata based on distribution of the propensity score within the exposed group.

IPT weighted analyses where exposed individuals received weights of [1/PS] and unexposed received weights of [1/(1-PS)].

SMR weighted analyses where exposed individuals received weights of 1 and unexposed received weights of [PS/(1-PS)].
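The percent bias and coverage measures defined in the notes above can be computed as, for example (an illustrative Python sketch; function names are our own):

```python
import numpy as np

def percent_bias(estimates, truth):
    """Percent bias across simulations: [(mean observed - truth) / truth] * 100."""
    return (np.mean(estimates) - truth) / truth * 100.0

def coverage(ci_lower, ci_upper, truth):
    """Percent of simulations whose 95% CI contains the true effect."""
    lo, hi = np.asarray(ci_lower), np.asarray(ci_upper)
    return np.mean((lo <= truth) & (truth <= hi)) * 100.0

# Toy usage: two simulated estimates of a true mean difference of -200
pb = percent_bias([-190.0, -190.0], -200.0)                     # -5.0
cov = coverage([-210.0, -190.0], [-190.0, -170.0], -200.0)      # 50.0
```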

Different propensity score methods show varying levels of bias due to nondifferential exposure misclassification.

Bias due to misclassification was greater than bias due to choice of propensity score method.

Differences between methods were more pronounced for lower-prevalence exposures.

Losses in specificity resulted in more bias than losses in sensitivity.

Propensity score matching most often performed worst of the methods compared.