Along with the rapid emergence of web surveys to address time-sensitive priority topics, various propensity score (PS)-based adjustment methods have been developed to improve population representativeness for nonprobability- or probability-sampled web surveys subject to selection bias. Conventional PS-based methods construct pseudo-weights for web samples using a higher-quality reference probability sample. The bias reduction, however, depends on the outcome and variables collected in both web and reference samples. A central issue is identifying variables for inclusion in PS-adjustment. In this paper, directed acyclic graph (DAG), a common graphical tool for causal studies but largely under-utilized in survey research, is used to examine and elucidate how different types of variables in the causal pathways impact the performance of PS-adjustment. While past literature generally recommends including all variables, our research demonstrates that only certain types of variables are needed in PS-adjustment. Our research is illustrated by NCHS’ Research and Development Survey, a probability-sampled web survey with potential selection bias, PS-adjusted to the National Health Interview Survey, to estimate U.S. asthma prevalence. Findings in this paper can be used by National Statistics Offices to design questionnaires with variables that improve web-samples’ population representativeness and to release more timely and accurate estimates for priority topics.

Producing timely data is a priority of National Statistics Offices (NSOs). However, some of the more timely data collections, including web-based surveys, may be subject to biases relative to large nationally representative surveys conducted by NSOs due to lower coverage and response rates. Adjusting these timelier sources with less timely but higher quality reference surveys may decrease their biases.

Selection bias has been acknowledged in different areas (

The amount of bias reduction, however, varies depending on the outcome and variables that are collected in both the target and reference data sources.

High-quality probability samples surveyed through well-designed questionnaires are in great demand as reference surveys for at least two reasons: 1) Different PS adjustment methods, including PS-based weighting and matching methods, require a high-quality probability sample as the reference in order to create a set of pseudo-weights for the target sample to better represent the underlying target population; 2) Different target samples may use a common high-quality probability sample as the reference for cost efficiency by using the same questions with exact wordings to avoid potential reporting/measurement error. Given a high-quality population representative reference survey, we are interested in identifying the types of variables that are critical for collection in the target sample to improve its external validity in estimating population quantities. The findings can be used in turn to plan for future surveys.

The target sample motivating this research is collected through the National Center for Health Statistics’ (NCHS) Research and Development Survey (RANDS), a probability-based panel survey that has been conducted using online and phone administration (

Propensity model variable inclusion has been widely studied in different areas, including clinical trial or medical research and survey research. In clinical trial research, participants are included for clinical and experimental purposes (mainly for treatment effect estimation) and are not necessarily representative of the U.S. population. Simulations (

In survey research, propensity analyses have been conducted to estimate response propensity (

This paper, in contrast to the interest of estimating treatment effects in clinical research, aims to estimate population quantities such as the population mean. We are interested in identifying key auxiliary information in a reference probability survey to improve the external validity of inferences from a target dataset. This is an important obligation for survey designers because the choice and inclusion of these variables has a tremendous effect on both the bias and the precision of the estimates of population quantities. This differs from the goal of nonresponse adjustment which uses chosen covariates for predicting response propensities as, in nonresponse adjustment research, respondents are nested within the sampled units, and respondents and nonrespondents share common sampling design variables. As a result, unweighted analysis of response propensity can be performed conditional on the design and response predictive variables (

This paper examines how different types of variables included in a propensity model affect population mean estimation from target samples, using the directed acyclic graph (DAG), a common graphical tool in causal studies that is largely under-utilized in survey research. The DAG is used to identify the types of variables in the causal pathway whose inclusion in the PS model yields the lowest bias and highest precision under various scenarios. Estimated population means and their variances are evaluated analytically and numerically under various mis-specified propensity models, with and without interactive effects. Different levels of variable correlation in the finite population are considered to mimic real data scenarios. The findings are applied to RANDS, with NHIS as the reference, to estimate the prevalence of asthma in the U.S. The RANDS evaluation demonstrates the advantage of this approach over one in which the propensity model includes all available variables.

The results from this research provide insight for data analysts on propensity model construction to improve the population representativeness of target samples. They also provide insight for questionnaire designers on the critical auxiliary information to collect from the reference survey. Using these results, NSOs can design the questionnaires for both the target and reference surveys and release accurate estimates for priority topics from more timely data sources.

We first introduce some notation. Suppose Y is a binary outcome of interest (e.g., for estimating the prevalence of a disease or health condition: Y=1 if the event occurs and 0 otherwise). In the context of survey sampling, suppose A is the binary selection indicator (i.e., A=1 if a population unit participates in the target sample and 0 otherwise). Note that A indicates target sample participation, with a value of one representing population units that are recruited and respond to the survey.

We adapt the framework of

variables related to both the outcome Y and the selection indicator A of the target sample, termed confounders (x_{1});

variables related to Y but not related to A, termed outcome predictors (x_{2});

variables related to A but not related to Y, termed selection variables (x_{3}).

We now present some background about PS adjustment methods. For estimation of the finite population (FP) mean of a binary outcome

More specifically in PS-based adjustment methods, the population mean _{c}_{c}_{i}_{c}_{j}_{c}_{1}-_{3}, and are available in both the target sample and the reference survey, while the outcome variable _{c}

Various PS-based adjustment methods, including PS weighting and PS matching methods, have been developed under the following assumptions. First, the reference survey sample (in our real data example, the NHIS), through weighting, properly represents the target population of interest. Second, all finite population units have a positive participation rate (i.e., each individual in the population has a positive propensity to volunteer to participate in the RANDS panel). Third, conditional exchangeability holds with no unmeasured confounders; that is, the probability that an individual in the FP participates in the target sample is not related to that individual's outcome after adjusting for all measured variables. It is common practice to measure the variables in the target sample using the same question wording as in the reference survey to avoid potential reporting or measurement error.

While PS weighting and PS matching methods have similar assumptions, PS _{c}_{c}_{j}

In _{1}, _{2}, and _{3} are mutually independent in the FP and study how the PS-based adjustment methods reduce the bias and variance through the incorporation of different types of variables in the propensity models. We further consider real situations in

It is readily shown in _{1} induce the bias when we use the simple sample mean to estimate the population mean _{1}, but not _{2} or _{3}. This result is consistent with the bias calculation below. For selection variables (_{3}) or predictors (_{2}), we have _{1}, PS-based adjustment methods create pseudo-weights and reweight the target sample such that the weighted sample distribution of the confounder _{1} is same as that in the FP, i.e., _{1} ⊥ _{1}-A is blocked (i.e., there is no information exchange between the two nodes) by reweighting the target sample and hence

As a result, the estimator _{1}), where _{1} is the realized value of the confounder _{1}, is approximately unbiased. Analogously, it is readily shown that the estimator _{1} distribution between the target sample and the FP, including _{1}, _{2}), _{1}, _{3}), or _{1}, _{2}, _{3}), is also unbiased. Note that the three sets of pseudo-weights of _{1}, _{2}), _{1}, _{3}), or _{1}, _{2}, _{3}) balance the _{1} distribution and also the distribution of _{2}, _{3}, or _{2} and _{3}, respectively, between the target sample _{c}

In contrast, pseudo-weights of _{2}), _{3}) or _{2}, _{3}) do not balance the _{1} distribution and therefore the corresponding weighted estimators in

Among the four unbiased estimators based on _{1}), _{1}, _{2}), _{1}, _{3}), and _{1}, _{2}, _{3}), we compare their efficiencies. We first compare the variance of _{1} versus _{1}, _{3}), denoted by

Note the selection variable is independent of the outcome and thus the pseudo-weights based on _{3} are non-informative of the outcome Y. The corresponding pseudo-weighted mean, although adding no bias, loses efficiency due to the differential non-informative pseudo-weights. Taking the adjusted logistic propensity pseudo-weights (denoted by ALP in _{c}_{x} is the regression coefficient associated with _{j}(_{c}

For simple illustration, assume _{j}_{1}, _{3}) = _{j}_{1})_{j}_{3}) and _{j}_{3}) are noninformative weights since _{3} ⊥ _{j}_{3}) weights. Thus,

Note that the model parameter _{x}_{2} in

Along the same line of justification,

In summary, to achieve unbiasedness and efficiency of pseudo-weighted mean estimators, the propensity model that considers confounders (_{1}) alone, or together with outcome predictors (_{2}), should be used to construct the pseudo-weights in

The above justification assumes the logistic regression

We now consider more realistic scenarios in which the confounders, the outcome predictors, and the selection variables can be correlated to each other. _{1} and _{3}, _{1} and _{2}, and _{2} and _{3}, respectively. In addition, any two or all three pairs can be correlated simultaneously in the FP.

For unbiased estimation of the FP mean of Y using the target sample (A=1), we need to block all paths connecting A and Y such that

As shown by the dotted lines in _{1}-_{3}). By the backdoor criteria, _{1} blocks the identified paths in _{1}) so that the _{1} distribution in the weighted target sample is same as that in the FP in _{1} and _{3} or _{1} and _{2} are correlated (i.e., _{x1x3} ≠ 0 or _{x1x2} ≠ 0). Thus, the _{1})-weighted target sample mean of Y is an unbiased estimator of the FP mean. In _{2} or _{3}, in addition to _{1}, block the two identified paths (_{2} or _{3}, in addition to _{1}, denoted by _{1}, _{2}) or _{1}, _{3}), should be constructed for the target sample units when the pair of _{2} and _{3} are correlated in _{x2x3} ≠ 0). This result also applies to cases when any two pairs or all three pairs of covariates are simultaneously correlated in the FP, and the _{1}, _{2})- or _{1}, _{3})-pseudo-weighted target sample means are approximately unbiased.
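The path-blocking argument can be checked mechanically. The following is a minimal sketch, not the authors' code: it encodes a DAG with edges x1→A, x1→Y, x2→Y and x3→A, represents a correlation between two covariates by a hypothetical latent common cause (the `U_..` nodes are an assumption made purely for illustration), and tests whether A and Y are d-separated given a conditioning set, using the standard moralized ancestral graph criterion.

```python
from itertools import combinations

def ancestors(parents, nodes):
    """Return `nodes` together with all of their ancestors in the DAG."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(parents, a, y, given):
    """True if every path between a and y is blocked by the set `given`."""
    given = set(given)
    keep = ancestors(parents, {a, y} | given)
    # Moralize the ancestral graph: link each node to its parents,
    # and link ("marry") parents that share a child.
    nbrs = {v: set() for v in keep}
    for child in keep:
        ps = list(parents.get(child, ()))
        for p in ps:
            nbrs[child].add(p)
            nbrs[p].add(child)
        for p, q in combinations(ps, 2):
            nbrs[p].add(q)
            nbrs[q].add(p)
    # Delete the conditioning set and look for any remaining a-y connection.
    seen, stack = {a}, [a]
    while stack:
        v = stack.pop()
        if v == y:
            return False
        for u in nbrs[v] - given - seen:
            seen.add(u)
            stack.append(u)
    return True

def dag(*correlated_pairs):
    """Selection DAG; each correlated pair gets a hypothetical latent cause."""
    parents = {"A": ["x1", "x3"], "Y": ["x1", "x2"]}
    for u, v in correlated_pairs:
        latent = f"U_{u}_{v}"
        parents.setdefault(u, []).append(latent)
        parents.setdefault(v, []).append(latent)
    return parents

independent = dag()
x2_x3 = dag(("x2", "x3"))
two_pairs = dag(("x1", "x2"), ("x1", "x3"))

for name, g in [("independent", independent), ("x2~x3", x2_x3),
                ("x1~x2 and x1~x3", two_pairs)]:
    for z in [set(), {"x1"}, {"x1", "x2"}, {"x1", "x3"}]:
        print(name, sorted(z), d_separated(g, "A", "Y", z))
```

Under this latent-cause encoding, conditioning on x1 alone suffices when the covariates are independent, while adding x2 or x3 is needed once x2 and x3 are correlated, or once x1 is correlated with both x2 and x3 (in which case x1 acts as a collider between the two latent causes).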

In summary, similar to the scenario shown in _{1} (_{1}, _{2}) or (_{1}, _{3}) (

In practice, the variable types (confounder, predictor, selection variable) need to be identified for propensity model construction. Since we are not concerned about model interpretation, parametric models with complex functional forms or nonparametric models can be fitted. In our RANDS example (

Alternative model selection criteria can be employed, such as Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC) (

The true propensity model of the underlying selection mechanism of the target sample (A=1) is often unknown but complicated, which may involve covariate terms of higher orders of nonlinearity and/or nonadditivity. For example, _{1} and _{2} (or _{1} and _{3}) can interactively affect the outcome Y (or selection indicator A), the scenario considered in simulation study 3 (to be shown in

For example, PS weighting methods (such as the ALP) can be sensitive to model misspecification (_{1}, _{3}, and their interaction _{1} * _{3}, produce unbiased estimators. The estimators, however, are biased if the model is misspecified, for example, the interaction term is omitted from the propensity model.

In contrast, PS _{1} in _{1} and _{3} or _{1} and _{2} in _{1} * _{3} interaction) in the propensity model (as shown in

Simulation studies were conducted to evaluate the performance of the mean estimator from

We generate a finite population _{1i}, _{2i}, _{3i}, _{i}_{1}, _{2}, and _{3} follow standard trivariate normal distributions with pairwise correlations _{x1x2}, _{x1x3}, _{x2x3}. A binary outcome Y is generated following the Bernoulli distribution with a mean of

We specify (_{0}, _{1}, _{2}) = (−1, .5, .5) so that _{1} and _{2} are associated with _{12} = 0.5 or 0 with and without the interaction term. As a result, the FP mean

A sample of size _{c}_{c}_{0} + _{1}_{1} + _{3}_{3} + _{13}_{1}_{3}) so that the inclusion probability is

We specify _{0}, _{1}, _{3}) = (−1, .5, .5) so that _{1} and _{3} are associated with _{13} = .5 or 0 with or without the interactive effect in the propensity model. We have the target sample participation rate of

The inclusion probabilities (i.e., sample weights) are masked in the analysis and treated as unknown (i.e., equal sample weights of 1 used). Note that the target sample without weights is not representative of the population.

An independent probability sample of size _{s}_{s}
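The data-generating steps described above can be sketched in code. This is an illustrative sketch under stated assumptions, not the authors' simulation program: the sizes `N`, `n_c` and `n_s` are hypothetical, the target sample is drawn by Poisson sampling with inclusion probability proportional to exp(−1 + 0.5·x1 + 0.5·x3), the reference sample is a simple random sample, and the pseudo-weights use an ALP-style odds form (1 − p)/p from a weighted logistic regression on the stacked samples (an assumed form of the ALP weights). Interaction terms are omitted.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_weighted_logit(X, z, w, n_iter=30):
    """Weighted logistic regression fitted by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta)
        grad = X.T @ (w * (z - p))
        hess = (X * (w * p * (1 - p))[:, None]).T @ X
        beta += np.linalg.solve(hess, grad)
    return beta

def alp_mean(y_c, X_c, X_s, d_s):
    """Pseudo-weighted mean of y in the target sample (assumed ALP-style)."""
    n_c = len(X_c)
    # Stack target units (z=1, weight 1) on reference units (z=0, weight d_s).
    X = np.column_stack([np.ones(n_c + len(X_s)), np.vstack([X_c, X_s])])
    z = np.r_[np.ones(n_c), np.zeros(len(X_s))]
    w = np.r_[np.ones(n_c), d_s]
    beta = fit_weighted_logit(X, z, w)
    p_c = expit(X[:n_c] @ beta)
    pw = (1 - p_c) / p_c            # pseudo-weight = inverse estimated odds
    return np.sum(pw * y_c) / np.sum(pw)

def simulate(seed=2024, N=200_000, n_c=4_000, n_s=4_000, rho=(0.0, 0.0, 0.0)):
    """One simulated run; N, n_c, n_s are hypothetical sizes."""
    rng = np.random.default_rng(seed)
    r12, r13, r23 = rho
    cov = np.array([[1, r12, r13], [r12, 1, r23], [r13, r23, 1]])
    x = rng.multivariate_normal(np.zeros(3), cov, size=N)
    # Outcome depends on x1 (confounder) and x2 (outcome predictor).
    y = rng.binomial(1, expit(-1 + 0.5 * x[:, 0] + 0.5 * x[:, 1]))
    # Selection depends on x1 (confounder) and x3 (selection variable).
    size = np.exp(-1 + 0.5 * x[:, 0] + 0.5 * x[:, 2])
    in_c = rng.random(N) < n_c * size / size.sum()
    in_s = rng.choice(N, size=n_s, replace=False)
    d_s = np.full(n_s, N / n_s)     # reference survey weights
    est = {"fp": y.mean(), "naive": y[in_c].mean()}
    models = {"w(x1)": [0], "w(x3)": [2], "w(x1,x2)": [0, 1]}
    for name, cols in models.items():
        est[name] = alp_mean(y[in_c], x[in_c][:, cols], x[in_s][:, cols], d_s)
    return est

est = simulate()
for name, value in est.items():
    print(f"{name:9s} {value:.4f}")
```

With independent covariates, one would expect the unweighted mean and the w(x3) estimate to sit above the finite population mean, and the w(x1) and w(x1,x2) estimates to sit close to it; passing a different `rho` allows the correlated scenarios to be explored.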

Pseudo-weighted means, i.e., (_{1}), outcome predictors (_{2}), the selection variables (_{3}), and/or their interactions, were compared. Three simulation studies are conducted with results presented in _{x1x2} = _{x1x3} = _{x2x3} = 0) without interaction effects of covariates on the outcome or the target sample inclusion (i.e., _{12} = _{13} = 0). Simulation 2 varies the covariate correlation in the FP by (_{x1x2}, _{x1x3}, _{x2x3}) = (.6,0,0), (0,.6,0), (0,0,.6), (.6,.6,0), (.6,0,.6), (0,.6,.6), or (.6,.6,.6) , while keeping _{12} = _{13} = 0. Simulation 3 further complicates the underlying outcome model and the propensity model by including the interaction terms with _{12} = _{13} = 0.5.

^{th} simulated target sample under various analytical propensity models. The w(_{1}), w(_{12}), and w(_{13}) denote the propensity models including main effects of, respectively, _{1}, _{1} and _{2}, _{1} and _{3}. Models including _{2} only, _{3} only, and _{2} and _{3} are denoted as w(_{2}), w(_{3}), and w(_{23}), respectively.

Three observations are made in _{1}, i.e. w(_{1}), w(_{12}), w(_{13}), produce approximately unbiased estimates of the FP mean of Y; the estimates are badly biased under the propensity models which include _{2} only, _{3} only, or _{2} and _{3}. _{3}) yields inflated variance estimates compared to w(_{1}) or w(_{2}), and w(_{2}) has the smallest empirical variances. _{1}) yields the most efficient estimates relative to w(_{12}) or w(_{13}).

_{2} or _{3}, in addition to _{1}, produced approximately unbiased estimates across various correlations; see the shaded two columns of w(_{12}) and w(_{13}). _{12}) and w(_{13}), the empirical variance estimates and MSEs under w(_{12}) tend to be smaller than those under w(_{13}). _{1} in the propensity model, i.e., _{1}), although efficient, may induce bias, especially when correlation exists between _{2} and _{3}.

Simulation 3 compares biases of estimated population means by the KW matching method and the ALP weighting method when the underlying outcome and propensity models include the interaction terms, i.e., _{12} = _{13} = 0.5 (see _{2} or _{3} in addition to the confounder _{1}, are considered and they are 1) w(_{12}), _{1} and _{2} main effects only, 2) w(_{13}), _{1} and _{3} main effects only, 3) w(_{1} * _{2}), _{1} and _{2} main effects and their interaction, and 4) w(_{1} * _{3}), including _{1} and _{3} main effects and their interaction. Recall KW is a type of PS matching method and expected to be more robust to model misspecification compared to the ALP method. As expected, the KW method consistently yields approximately unbiased estimates across four propensity models with or without interaction terms. In the contrast, the ALP approach directly uses the inverse of the participation rates estimated from the assumed propensity model as pseudo-weights, and the ALP estimates are approximately unbiased only under the true propensity model w(_{1} * _{3}). Furthermore, it can be observed that biases of the ALP estimates are consistently closer to zero than the KW under the true model. Results with covariate correlations (_{x1x2}, _{x1x3}, _{x2x3}) = (.6,0,0), (0,0,.6), (.6,0,.6) and (0,.6,.6) showed a similar pattern and hence are not shown.

RANDS, a series of web-based probability panel surveys conducted at NCHS (

Data from the third round of RANDS (RANDS 3) is evaluated. RANDS 3 was collected in 2019 using NORC’s AmeriSpeak® Panel (

Common covariates available in RANDS 3 and the 2019 NHIS that were potentially related to diagnosed asthma or the selection indicator were considered (see

To check for correlation between covariates, bivariate correlations were assessed on the weighted NHIS data. Bivariate correlations for all selected covariates were statistically significant. Prior to evaluating the propensity models, the survey weights for both data sets were normalized to their respective sample sizes (n=2,646 for RANDS, n=31,997 for NHIS) as suggested by
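Normalizing survey weights to the sample size amounts to rescaling them so that they sum to n. A minimal sketch, with an illustrative function name and made-up weights:

```python
def normalize_weights(weights):
    """Rescale weights so that they sum to the number of sampled units."""
    n = len(weights)
    total = sum(weights)
    return [w * n / total for w in weights]

raw = [1200.0, 800.0, 2500.0, 500.0]   # made-up survey weights
normalized = normalize_weights(raw)
print(normalized, sum(normalized))
```

The rescaling leaves the relative sizes of the weights unchanged, so weighted means are unaffected; only the total changes.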

A full propensity model (denoted by model.all) that includes all covariates and their pairwise interactions was used to create pseudo-weights. Due to the large number of parameters in the full model, estimated propensity scores can be unstable. As a result, some form of stepwise propensity model selection methods have been conducted in different studies (_{1} and _{3}), which can be main effects of covariates or their nonlinear/nonadditive combinations such as pairwise interactions, are recommended as terms for inclusion. Based on the simulation results, we expect that the pseudo-weighted mean under model.x13 would be unbiased but with higher variability when compared with the estimates under model.x12 that includes the confounders and outcome predictors.

Accordingly, we conducted the outcome model selection using backward selection on the reference survey (e.g., the 2019 NHIS), to identify terms which were confounders or outcome predictors. We defined the selected model as model.x12.n (contains _{1} and _{2}) with “n” indicating that the outcome predictors were identified using the NHIS. However, it is often the case that the reference probability surveys have no collected information on the outcome variable. In this case, we have only the target sample (e.g., RANDS) available for outcome model selection. With the assumption of the conditional noninformative sampling of the target sample, it is expected the unweighted regression of the outcome would produce unbiased estimates of regression coefficients (_{1} and _{2}) indicating that the outcome predictors were identified using RANDS. The common terms in model.x13 and model.x12 (denoted by either x12.n or x12.r based on the information available) are confounders, and the corresponding propensity model is denoted by model.x1. The identified covariate types under each model are reported in the

The outcome models utilized the observations in the NHIS or the RANDS, whereas the propensity model utilized the observations in the combined NHIS and RANDS data, from which the estimated propensities were obtained and used for construction of the KW pseudo-weight for each individual in RANDS. Note that RANDS has panel weights, which were computed as an overall sampling weight for the selection of each panel member from the sampling frame and the selection of the panel member into RANDS. We considered two scenarios of 1) panel weights or 2) no panel weights for the propensity analysis.
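To make the construction of kernel-weighted pseudo-weights concrete, the sketch below distributes each reference unit's survey weight across target units in proportion to a kernel of the distance between their estimated propensities. It is a simplified illustration, not the authors' implementation: the Gaussian kernel, the fixed `bandwidth`, and the function name are all assumptions.

```python
import math

def kw_pseudo_weights(ps_target, ps_ref, d_ref, bandwidth=0.05):
    """Kernel-weighting sketch: share each reference weight among target units."""
    weights = [0.0] * len(ps_target)
    for p_i, d_i in zip(ps_ref, d_ref):
        # Gaussian kernel of the propensity distance to every target unit.
        k = [math.exp(-0.5 * ((p_i - p_j) / bandwidth) ** 2) for p_j in ps_target]
        total = sum(k)
        if total == 0.0:
            # All kernels underflowed: give the weight to the nearest target unit.
            j = min(range(len(ps_target)), key=lambda t: abs(p_i - ps_target[t]))
            weights[j] += d_i
            continue
        for j, k_j in enumerate(k):
            weights[j] += d_i * k_j / total
    return weights

ps_target = [0.2, 0.5, 0.8]     # made-up propensities for target units
ps_ref = [0.25, 0.75]           # made-up propensities for reference units
d_ref = [100.0, 300.0]          # made-up reference survey weights
kw = kw_pseudo_weights(ps_target, ps_ref, d_ref)
print(kw, sum(kw))
```

Because each reference weight is fully shared out, the pseudo-weights sum to the total of the reference survey weights, and target units whose propensities resemble heavily weighted reference units receive larger pseudo-weights.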

Various propensity models that included different types of covariates were evaluated by the coefficient of variation (CV) of the KW pseudo-weights (
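Two of the summary measures used in this kind of evaluation are simple to compute. The sketch below assumes the CV is the standard deviation of the pseudo-weights divided by their mean, and that relative bias is the percentage difference of an estimate from the reference estimate; the function names and the example numbers are illustrative only.

```python
import statistics

def weight_cv(weights):
    """Coefficient of variation of a set of pseudo-weights."""
    return statistics.pstdev(weights) / statistics.fmean(weights)

def relative_bias_pct(estimate, reference):
    """Percentage difference of an estimate from the reference estimate."""
    return 100.0 * (estimate - reference) / reference

example_weights = [0.5, 1.0, 1.5, 1.0]          # made-up pseudo-weights
print(round(weight_cv(example_weights), 4))
print(relative_bias_pct(12.0, 10.0))            # made-up prevalence values
```

A larger CV signals more variable pseudo-weights and, other things equal, less precise weighted estimates.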

Four observations can be made from

In brief, for the evaluation of diagnosed asthma using the RANDS data, we recommend the pseudo-weights constructed under model.x12.n, with the confounders and outcome predictors selected from the reference survey (e.g., NHIS). In situations where the outcome variable is not collected in the reference survey but is available only in the target sample (e.g., RANDS), model.x12.r can serve as the alternative model to construct the KW pseudo-weights, assuming conditional noninformative sampling holds for the target sample.

Identifying and collecting the most useful information in more timely target samples and in higher-quality reference surveys can increase the ability of NSOs to produce timely estimates with lower bias from target samples. This paper examined how different types of variables included in a propensity model affect the performance of PS-based pseudo-weighted estimators of the population mean from a target sample. Means and variances of estimated population means under various mis-specified propensity models, including different types of variables with and without interactive effects, were evaluated analytically and numerically. Different levels of variable correlation in the finite population were also considered to reflect real data scenarios. We have the following findings: 1) confounders, the variables related to both the selection indicator and the outcome of interest, are important variables to include in the propensity model; 2) pseudo-weights that balance the distribution of the outcome predictor x_{2} or the selection variable x_{3}, in addition to the confounder x_{1}, denoted by w(x_{1}, x_{2}) or w(x_{1}, x_{3}), should be constructed for the target sample units so that the corresponding pseudo-weighted target sample mean is approximately unbiased; 3) compared to w(x_{1}, x_{3}), the pseudo-weights w(x_{1}, x_{2}) are more efficient in estimating population means. In contrast, including selection variables rather than outcome predictors in the propensity model tended to inflate the estimated variances. Intuitively, the outcome predictor is related to the outcome variable; including outcome predictors in the propensity score model captures differences in the outcome between the reference and target samples, which results in weights related to the outcome and therefore yields estimates with smaller variances.
Finally, the findings were applied to real target data from RANDS, a survey that uses commercial probability panels and has potential selection bias. Under the model with confounders and outcome predictors (model.x12) or the model with confounders and selection variables (model.x13), the RANDS estimate of U.S. asthma prevalence had the greatest bias reduction (relative bias ranging from 11.37% to 13.51% compared to the NHIS) when the panel weights were not used to construct the KW pseudo-weights, compared to the original panel-weighted RANDS estimates (relative bias of 25.31%).

Results from this paper have several important applications in practice for NSOs that collect data from both target surveys and high-quality reference surveys. First, this study provides a principled approach to select covariates for the PS model. Rather than including all variables or selecting certain demographic variables, covariates are assessed based on their variable type (confounder, outcome predictor, selection variable) to be included in the PS model for population mean estimation. Second, guidance on how to design the questionnaire for a target survey with specific research questions (e.g., SARS-CoV2 seropositivity web survey by

The proposed variable inclusion strategies have limitations that can be of interest for future research. First, the strategy is developed for single-outcome studies with research questions related to one outcome of interest, e.g., SARS-CoV2 seropositivity study (

The findings and conclusions in this paper are those of the authors and do not necessarily represent the views of the National Center for Health Statistics, Centers for Disease Control and Prevention.

Covariate types (X_{1}, confounder; X_{2}, outcome predictor; X_{3}, selection variable) reported for each model covariate used in the real data analysis (

No. | Variable | Panel weights: Model.n | Panel weights: Model.r | No weights: Model.n | No weights: Model.r |
---|---|---|---|---|---|
1 | Age group (years) | X_{1} | X_{1} | X_{1} | X_{1} |
2 | Sex | X_{1} | X_{1} | X_{1} | X_{1} |
3 | Race/Ethnicity | X_{1} | X_{1} | X_{1} | X_{1} |
4 | Marital status | X_{1} | X_{1} | X_{1} | X_{1} |
5 | Education level | X_{1} | X_{1} | X_{1} | X_{1} |
6 | Smoking status | X_{1} | X_{1} | X_{1} | X_{1} |
7 | Diagnosed with high cholesterol | X_{1} | X_{1} | X_{1} | X_{1} |
8 | Diagnosed with COPD, emphysema, or chronic bronchitis | X_{1} | X_{1} | X_{1} | X_{1} |
9 | Diagnosed with diabetes | X_{1} | X_{1} | X_{1} | X_{1} |
10 | Diagnosed with hypertension | X_{1} | X_{1} | X_{1} | X_{1} |
11 | Employment status | X_{1} | X_{1} | X_{1} | X_{1} |
12 | Age group (years) * Sex | | | | |
13 | Age group (years) * Race/Ethnicity | X_{1} | X_{1} | X_{1} | X_{1} |
14 | Age group (years) * Marital status | X_{3} | X_{1} | X_{3} | X_{1} |
15 | Age group (years) * Education level | X_{3} | X_{3} | X_{3} | X_{3} |
16 | Age group (years) * Smoking status | X_{2} | X_{2} | X_{2} | |
17 | Age group (years) * Diagnosed with high cholesterol | | | | |
18 | Age group (years) * Diagnosed with COPD, emphysema, or chronic bronchitis | X_{3} | X_{3} | X_{3} | |
19 | Age group (years) * Diagnosed with diabetes | X_{3} | X_{1} | X_{3} | |
20 | Age group (years) * Diagnosed with hypertension | X_{3} | X_{3} | | |
21 | Age group (years) * Employment status | X_{3} | X_{3} | | |
22 | Sex * Race/Ethnicity | | | | |
23 | Sex * Marital status | X_{2} | X_{2} | X_{2} | X_{2} |
24 | Sex * Education level | X_{1} | X_{3} | X_{1} | X_{3} |
25 | Sex * Smoking status | X_{2} | X_{1} | X_{3} | |
26 | Sex * Diagnosed with high cholesterol | X_{2} | X_{2} | X_{2} | X_{2} |
27 | Sex * Diagnosed with COPD, emphysema, or chronic bronchitis | X_{2} | | | |
28 | Sex * Diagnosed with diabetes | X_{2} | | | |
29 | Sex * Diagnosed with hypertension | X_{3} | X_{3} | | |
30 | Sex * Employment status | X_{3} | X_{3} | | |
31 | Race/Ethnicity * Marital status | X_{2} | X_{2} | X_{1} | X_{1} |
32 | Race/Ethnicity * Education level | X_{3} | X_{3} | X_{3} | X_{3} |
33 | Race/Ethnicity * Smoking status | X_{3} | X_{3} | X_{3} | X_{3} |
34 | Race/Ethnicity * Diagnosed with high cholesterol | | | | |
35 | Race/Ethnicity * Diagnosed with COPD, emphysema, or chronic bronchitis | | | | |
36 | Race/Ethnicity * Diagnosed with diabetes | | | | |
37 | Race/Ethnicity * Diagnosed with hypertension | X_{2} | X_{2} | | |
38 | Race/Ethnicity * Employment status | | | | |
39 | Marital status * Education level | X_{2} | X_{2} | | |
40 | Marital status * Smoking status | X_{2} | X_{2} | | |
41 | Marital status * Diagnosed with high cholesterol | X_{2} | X_{2} | | |
42 | Marital status * Diagnosed with COPD, emphysema, or chronic bronchitis | X_{2} | X_{2} | | |
43 | Marital status * Diagnosed with diabetes | X_{2} | X_{2} | X_{2} | X_{2} |
44 | Marital status * Diagnosed with hypertension | X_{2} | X_{2} | | |
45 | Marital status * Employment status | X_{2} | X_{2} | | |
46 | Education level * Smoking status | X_{2} | | | |
47 | Education level * Diagnosed with high cholesterol | X_{3} | X_{3} | X_{3} | X_{3} |
48 | Education level * Diagnosed with COPD, emphysema, or chronic bronchitis | | | | |
49 | Education level * Diagnosed with diabetes | | | | |
50 | Education level * Diagnosed with hypertension | | | | |
51 | Education level * Employment status | | | | |
52 | Smoking status * Diagnosed with high cholesterol | | | | |
53 | Smoking status * Diagnosed with COPD, emphysema, or chronic bronchitis | X_{2} | X_{2} | | |
54 | Smoking status * Diagnosed with diabetes | X_{2} | X_{2} | | |
55 | Smoking status * Diagnosed with hypertension | | | | |
56 | Smoking status * Employment status | | | | |
57 | Diagnosed with high cholesterol * Diagnosed with COPD, emphysema, or chronic bronchitis | X_{2} | X_{2} | | |
58 | Diagnosed with high cholesterol * Diagnosed with diabetes | X_{2} | | | |
59 | Diagnosed with high cholesterol * Diagnosed with hypertension | X_{3} | X_{3} | X_{3} | X_{3} |
60 | Diagnosed with high cholesterol * Employment status | X_{3} | X_{3} | | |
61 | Diagnosed with COPD, emphysema, or chronic bronchitis * Diagnosed with diabetes | X_{2} | X_{2} | | |
62 | Diagnosed with COPD, emphysema, or chronic bronchitis * Diagnosed with hypertension | | | | |
63 | Diagnosed with COPD, emphysema, or chronic bronchitis * Employment status | | | | |
64 | Diagnosed with diabetes * Diagnosed with hypertension | | | | |
65 | Diagnosed with diabetes * Employment status | | | | |
66 | Diagnosed with hypertension * Employment status | | | | |

DAG for three types of covariates with the selection indicator (A) and the outcome (Y)

DAG for _{1}-_{3}(a), _{1}-_{2}(b) and _{2}-_{3}(c). Blocking dotted path(s) to have

Bias of Kernel Weighting (KW) vs. Adjusted Logistic Propensity (ALP) Estimated under Various Propensity Models with w(_{12}), w(_{13}), w(_{1} * _{2}), and w(_{1} * _{3}) including, respectively, main effects of _{1} and _{2}, main effects of _{1} and _{3}, main and interactive effects of _{1} and _{2}, and main and interactive effects of _{1} and _{3}, with (_{x1x2}, _{x1x3}, _{x2x3}) = (0,0,0) (a), (0,.6,0) (b), (.6,.6,0) (c), and (.6,.6,.6) (d), to cover 0, 1, 2, and 3 pair(s) of covariate correlations in the FP, and interactive effects _{12} = _{13} = 0.5. Propensity model with w(_{1} * _{3}) is the true model.

Results from population mean estimation^{1} under various propensity score models^{2} with covariate correlations (_{x1x2}, _{x1x3}, _{x2x3}) = (0, 0, 0) and interaction effects _{12} = _{13} = 0.

Measure | Sample^{3} | w(x_{1}) | w(x_{2}) | w(x_{3}) | w(x_{12}) | w(x_{13}) | w(x_{23}) |
---|---|---|---|---|---|---|---|
Bias (×10^{2}) | 4.61 | 0.26 | 4.50 | 4.83 | 0.26 | 0.41 | 4.77 |
EmpVar (×10^{4}) | 2.20 | 2.68 | 2.62 | 2.96 | 2.92 | 3.43 | 3.32 |
MSE (×10^{4}) | 23.48 | 2.75 | 22.85 | 26.31 | 2.99 | 3.60 | 26.04 |

Kernel weighting estimator (

w(x_{1}), w(x_{2}), w(x_{3}), w(x_{12}), w(x_{13}), and w(x_{23}) denote pseudo-weighted means with pseudo-weights constructed under the propensity model with main effect(s) of x_{1}, x_{2}, x_{3}, x_{1} and x_{2}, x_{1} and x_{3}, and x_{2} and x_{3}, respectively.

sample denotes the unweighted mean
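As a consistency check on the table above, the reported MSE should equal the squared bias plus the empirical variance. This assumes the MSE is computed as the average squared error over simulation runs, which is an assumption rather than something stated here. Using the rounded entries for the x_{1}-only and x_{3}-only models:

```latex
\begin{align*}
\mathrm{MSE} &= \mathrm{Bias}^2 + \mathrm{EmpVar},\\
w(x_1):\quad (0.26\times10^{-2})^2 + 2.68\times10^{-4}
  &= 0.0676\times10^{-4} + 2.68\times10^{-4} \approx 2.75\times10^{-4},\\
w(x_3):\quad (4.83\times10^{-2})^2 + 2.96\times10^{-4}
  &= 23.33\times10^{-4} + 2.96\times10^{-4} \approx 26.29\times10^{-4}.
\end{align*}
```

The first value matches the reported 2.75; the second differs from the reported 26.31 only in the last digits, as would be expected when the bias and variance are rounded before squaring and adding.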

Results from population mean estimation^{1} under various propensity score models^{2} by covariate correlations with interaction effects _{12} = _{13} = 0.

Measure | Sample^{3} | w(x_{1}) | w(x_{2}) | w(x_{3}) | w(x_{12}) | w(x_{13}) | w(x_{23}) |
---|---|---|---|---|---|---|---|
Correlations (x_{1}x_{2}, x_{1}x_{3}, x_{2}x_{3}) = (.6, 0, 0) | | | | | | | |
Bias (×10^{2}) | 7.35 | 0.37 | 2.98 | 7.60 | 0.37 | 0.59 | 3.25 |
EmpVar (×10^{4}) | 2.15 | 2.59 | 2.64 | 2.77 | 2.66 | 2.88 | 2.84 |
MSE (×10^{4}) | 56.14 | 2.72 | 11.52 | 60.57 | 2.79 | 3.23 | 13.42 |
Correlations (x_{1}x_{2}, x_{1}x_{3}, x_{2}x_{3}) = (0, .6, 0) | | | | | | | |
Bias (×10^{2}) | 7.27 | 0.32 | 7.16 | 3.21 | 0.30 | 0.41 | 3.12 |
EmpVar (×10^{4}) | 2.17 | 3.60 | 2.39 | 3.68 | 3.53 | 4.05 | 3.66 |
MSE (×10^{4}) | 54.98 | 3.70 | 53.67 | 13.97 | 3.62 | 4.22 | 13.39 |
Correlations (x_{1}x_{2}, x_{1}x_{3}, x_{2}x_{3}) = (0, 0, .6) | | | | | | | |
Bias (×10^{2}) | 7.55 | 2.98 | 4.65 | 4.87 | 0.26 | 0.37 | 4.83 |
EmpVar (×10^{4}) | 2.01 | 2.52 | 2.38 | 2.57 | 2.66 | 2.75 | 2.66 |
MSE (×10^{4}) | 59.00 | 11.38 | 24.00 | 26.30 | 2.73 | 2.89 | 26.03 |
Correlations (x_{1}x_{2}, x_{1}x_{3}, x_{2}x_{3}) = (.6, .6, 0) | | | | | | | |
Bias (×10^{2}) | 9.81 | −1.04 | 5.36 | 6.09 | 0.54 | 0.54 | 1.70 |
EmpVar (×10^{4}) | 2.16 | 3.45 | 2.94 | 3.79 | 3.94 | 3.98 | 4.19 |
MSE (×10^{4}) | 98.39 | 4.52 | 31.61 | 40.93 | 4.23 | 4.27 | 7.09 |
Correlations (x_{1}x_{2}, x_{1}x_{3}, x_{2}x_{3}) = (.6, 0, .6) | | | | | | | |
Bias (×10^{2}) | 10.27 | 3.11 | 1.67 | 7.60 | 0.50 | 0.60 | 2.36 |
EmpVar (×10^{4}) | 2.33 | 2.84 | 2.96 | 2.94 | 2.93 | 3.11 | 3.00 |
MSE (×10^{4}) | 107.86 | 12.51 | 5.76 | 60.72 | 3.18 | 3.46 | 8.58 |
Correlations (x_{1}x_{2}, x_{1}x_{3}, x_{2}x_{3}) = (0, .6, .6) | | | | | | | |
Bias (×10^{2}) | 10.09 | 3.11 | 7.24 | 1.56 | 0.33 | 0.44 | 2.34 |
EmpVar (×10^{4}) | 1.98 | 3.65 | 2.63 | 3.28 | 3.45 | 3.61 | 3.48 |
MSE (×10^{4}) | 103.83 | 13.33 | 54.98 | 5.71 | 3.55 | 3.80 | 8.97 |
Correlations (x_{1}x_{2}, x_{1}x_{3}, x_{2}x_{3}) = (.6, .6, .6) | | | | | | | |
Bias (×10^{2}) | 13.28 | 1.73 | 4.52 | 4.81 | 0.77 | 0.89 | 3.30 |
EmpVar (×10^{4}) | 1.98 | 3.93 | 3.39 | 3.88 | 3.91 | 4.29 | 3.97 |
MSE (×10^{4}) | 178.29 | 6.91 | 23.83 | 26.99 | 4.50 | 5.07 | 14.88 |

^{1} Kernel weighting estimator (

^{2} w(x_{1}), w(x_{2}), w(x_{3}), w(x_{12}), w(x_{13}), and w(x_{23}) denote pseudo-weighted means with pseudo-weights constructed under the propensity model with main effect(s) of x_{1}; x_{2}; x_{3}; x_{1} and x_{2}; x_{1} and x_{3}; and x_{2} and x_{3}, respectively.

^{3} Sample denotes the unweighted mean.

Distribution of selected covariates and asthma in the Research and Development Survey (RANDS) 3 and the 2019 National Health Interview Survey (NHIS)

| Variable | Subgroup | RANDS (n=2,646), n | RANDS, % | RANDS, Wt % | NHIS (n=31,997), n | NHIS, Wt % |
|---|---|---|---|---|---|---|
| Ever diagnosed with asthma | Yes | 431 | 16.4 | 16.9 | 4,229 | 13.5 |
| | No | 2,197 | 83.6 | 83.1 | 27,718 | 86.5 |
| Age group (years) | 18-34 | 721 | 27.2 | 29.9 | 7,058 | 29.7 |
| | 35-49 | 652 | 24.6 | 24.1 | 7,250 | 24.3 |
| | 50-64 | 687 | 26.0 | 25.1 | 8,313 | 24.9 |
| | 65+ | 586 | 22.1 | 20.9 | 9,376 | 21.1 |
| Sex | Male | 1,318 | 49.8 | 48.3 | 14,733 | 48.3 |
| | Female | 1,328 | 50.2 | 51.7 | 17,261 | 51.7 |
| Race/Ethnicity | Non-Hispanic white | 1,729 | 65.3 | 63.1 | 21,915 | 63.2 |
| | Non-Hispanic black | 273 | 10.3 | 11.9 | 3,483 | 11.8 |
| | Non-Hispanic other | 227 | 8.6 | 8.5 | 2,447 | 8.5 |
| | Hispanic | 417 | 15.8 | 16.5 | 4,152 | 16.5 |
| Marital status | Married | 1,282 | 48.5 | 47.7 | 14,759 | 52.4 |
| | Widowed | 134 | 5.1 | 4.5 | 3,115 | 6.0 |
| | Divorced | 350 | 13.2 | 12.4 | 4,317 | 9.0 |
| | Separated | 50 | 1.9 | 1.8 | 456 | 1.2 |
| | Never married | 618 | 23.4 | 24.3 | 6,368 | 22.5 |
| | Living with partner | 212 | 8.0 | 9.3 | 2,136 | 8.9 |
| Education level | High school diploma or less | 577 | 21.8 | 38.8 | 11,155 | 39.9 |
| | Some college | 1,222 | 46.2 | 27.7 | 9,386 | 31.1 |
| | Bachelor's degree or more | 847 | 32.0 | 33.5 | 11,277 | 29.0 |
| Smoking status^{1} | Current | 409 | 15.5 | 17.2 | 4,296 | 14.0 |
| | Former | 811 | 30.8 | 28.9 | 7,973 | 22.5 |
| | Never | 1,411 | 53.6 | 53.9 | 18,931 | 63.5 |
| Diagnosed with high cholesterol | Yes | 976 | 37.1 | 36.4 | 9,179 | 24.9 |
| | No | 1,657 | 62.9 | 63.6 | 22,697 | 75.1 |
| Diagnosed with COPD, emphysema, or chronic bronchitis | Yes | 213 | 8.1 | 8.4 | 1,787 | 4.6 |
| | No | 2,420 | 91.9 | 91.6 | 30,158 | 95.4 |
| Diagnosed with diabetes^{2} | Yes | 279 | 10.6 | 10.5 | 3,355 | 9.3 |
| | No | 2,352 | 89.4 | 89.5 | 28,594 | 90.7 |
| Diagnosed with hypertension | Yes | 989 | 37.5 | 37.0 | 11,480 | 31.7 |
| | No | 1,648 | 62.5 | 63.0 | 20,458 | 68.3 |
| Employment status | Paid employee | 1,630 | 61.6 | 58.6 | 18,810 | 64.6 |
| | Looking for work | 166 | 6.3 | 7.2 | 485 | 2.0 |
| | Not looking for work | 850 | 32.1 | 34.2 | 11,919 | 33.4 |

Notes: n = unweighted sample size; % = unweighted percent; Wt % = weighted percent.

^{1} Smoking status: A current smoker is defined as someone who has smoked at least 100 cigarettes in their lifetime and now smokes every day or some days. A former smoker is defined as someone who has smoked at least 100 cigarettes in their lifetime and does not smoke now. Never smokers are defined as persons who have smoked fewer than 100 cigarettes in their lifetime.

^{2} Diagnosed diabetes excludes pre-diabetes and gestational diabetes.
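The unweighted percentages in the table can be reproduced from the counts. The sketch below assumes each percentage uses the item-specific total as its denominator (for asthma, 431 + 2,197 = 2,628 rather than the full 2,646), which is consistent with the tabulated values but is not stated in the notes; the helper name `unweighted_percent` is illustrative.

```python
def unweighted_percent(counts: dict) -> dict:
    """Convert subgroup counts to unweighted percentages (one decimal).

    Assumes the denominator is the sum of the listed subgroup counts,
    i.e. respondents with a missing value for the item are excluded.
    """
    total = sum(counts.values())
    return {group: round(100 * n / total, 1) for group, n in counts.items()}


# RANDS counts for "Ever diagnosed with asthma", copied from the table
asthma_rands = {"Yes": 431, "No": 2197}
print(unweighted_percent(asthma_rands))
```

The weighted percentages cannot be recovered this way, because they depend on the survey weights of individual respondents.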

Analysis Results for estimation of the prevalence of diagnosed asthma for adults from RANDS 3 under various propensity models and RANDS 3 weights

Propensity model^{1} | CV^{2}(KW) | relBias^{3} (%) | se^{4} (× 10^{2}) | MSE^{5} (× 10^{4}) |
---|---|---|---|---|
Original panel-weighted | 0.91 | 25.31 | 0.98 | 12.56 |
Unweighted | 0 | 21.89 | 0.72 | 9.19 |
Panel weights | | | | |
Model.all | 1.13 | 17.55 | 1.21 | 7.04 |
Model.x13 | 1.10 | 13.35 | 1.04 | 4.31 |
Model.x12.n | 1.07 | 11.41 | 0.93 | 3.23 |
Model.x1.n | 1.07 | 12.38 | 0.95 | 3.67 |
Model.x12.r | 1.08 | 12.85 | 0.97 | 3.94 |
Model.x1.r | 1.08 | 12.85 | 0.97 | 3.93 |
No weights | | | | |
Model.all | 0.83 | 14.02 | 1.07 | 4.70 |
Model.x13 | 0.80 | 13.51 | 0.97 | 4.24 |
Model.x12.n | 0.70 | 11.38 | 0.81 | 3.01 |
Model.x1.n | 0.69 | 13.67 | 0.82 | 4.06 |
Model.x12.r | 0.73 | 11.37 | 0.84 | 3.04 |
Model.x1.r | 0.71 | 13.44 | 0.84 | 3.98 |

^{1} Original panel-weighted denotes the RANDS 3 estimate using the original panel weights without PS adjustment; Unweighted denotes the RANDS 3 estimate using weight = 1 without PS adjustment; Model.all: the full propensity model with all main and pairwise interaction terms; Model.x13: the propensity model including selected terms of the confounders and selection predictors; Model.x12.n: the propensity model including terms of the confounders and outcome predictors selected using the National Health Interview Survey (NHIS); Model.x12.r: the propensity model including terms of the confounders and outcome predictors selected using the Research and Development Survey (RANDS). Panel weights indicates that the RANDS 3 original panel weights were used as the base weight for the PS adjustment. No weights indicates that the RANDS 3 original panel weights were not included in the PS adjustment.

^{4} se = standard error of the estimated mean.
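To make the relBias and MSE columns concrete, the sketch below recomputes them for the unweighted RANDS 3 estimate. It assumes relBias is the difference between the estimate and the NHIS weighted prevalence, expressed as a percentage of the NHIS value, and that MSE is squared bias plus squared standard error; both definitions are assumptions here, since the corresponding footnotes are not reproduced above. The inputs (16.4% and 13.5%) are the rounded values from the covariate distribution table, so the results agree with the reported 21.89 and 9.19 only approximately.

```python
def rel_bias_pct(estimate_pct: float, benchmark_pct: float) -> float:
    """Relative bias (%) of an estimate against a benchmark (assumed definition)."""
    return 100.0 * (estimate_pct - benchmark_pct) / benchmark_pct


def mse_x1e4(estimate_pct: float, benchmark_pct: float, se_x1e2: float) -> float:
    """MSE on the x10^4 scale, assuming MSE = bias^2 + se^2.

    A difference in percentage points equals the bias on the x10^2 scale,
    so squaring it gives the squared bias on the x10^4 scale.
    """
    bias_x1e2 = estimate_pct - benchmark_pct
    return bias_x1e2 ** 2 + se_x1e2 ** 2


rands_unweighted_pct = 16.4  # unweighted asthma prevalence in RANDS 3
nhis_weighted_pct = 13.5     # weighted asthma prevalence in NHIS (benchmark)
se_unweighted_x1e2 = 0.72    # reported se for the unweighted estimate

print(f"relBias = {rel_bias_pct(rands_unweighted_pct, nhis_weighted_pct):.2f}%")
print(f"MSE (x10^4) = {mse_x1e4(rands_unweighted_pct, nhis_weighted_pct, se_unweighted_x1e2):.2f}")
```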