<!DOCTYPE article
PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD with MathML3 v1.3 20210610//EN" "JATS-archivearticle1-3-mathml3.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="1.3" xml:lang="en" article-type="research-article"><?properties open_access?><?properties manuscript?><processing-meta base-tagset="archiving" mathml-version="3.0" table-model="xhtml" tagset-family="jats"><restricted-by>pmc</restricted-by></processing-meta><front><journal-meta><journal-id journal-id-type="nlm-journal-id">9918646055106676</journal-id><journal-id journal-id-type="pubmed-jr-id">52735</journal-id><journal-id journal-id-type="nlm-ta">Surv Stat</journal-id><journal-id journal-id-type="iso-abbrev">Surv Stat</journal-id><journal-title-group><journal-title>Survey statistician</journal-title></journal-title-group><issn pub-type="ppub">0214-3240</issn><issn pub-type="epub">2521-991X</issn></journal-meta><article-meta><article-id pub-id-type="pmid">37576783</article-id><article-id pub-id-type="pmc">10422982</article-id><article-id pub-id-type="manuscript">HHSPA1921960</article-id><article-categories><subj-group subj-group-type="heading"><subject>Article</subject></subj-group></article-categories><title-group><article-title>Multiple Imputation of Missing Complex Survey Data using SAS<sup>&#x000ae;</sup>: A Brief Overview and An Example Based on the Research and Development Survey (RANDS)</article-title></title-group><contrib-group><contrib contrib-type="author"><name><surname>He</surname><given-names>Yulei</given-names></name><xref rid="A1" ref-type="aff">1</xref><xref rid="CR1" ref-type="corresp">1</xref></contrib><contrib contrib-type="author"><name><surname>Zhang</surname><given-names>Guangyu</given-names></name><xref rid="A2" ref-type="aff">2</xref></contrib></contrib-group><aff id="A1"><label>1</label>Division of Research and Methodology, U.S. Centers for Disease Control and Prevention</aff><aff id="A2"><label>2</label>National Center for Health Statistics, U.S. Centers for Disease Control and Prevention</aff><author-notes><corresp id="CR1">
<label>1</label>
<email>wdq7@cdc.gov</email>
</corresp></author-notes><pub-date pub-type="nihms-submitted"><day>7</day><month>8</month><year>2023</year></pub-date><pub-date pub-type="ppub"><month>1</month><year>2023</year></pub-date><pub-date pub-type="pmc-release"><day>12</day><month>8</month><year>2023</year></pub-date><volume>87</volume><fpage>37</fpage><lpage>47</lpage><permissions><license><ali:license_ref xmlns:ali="http://www.niso.org/schemas/ali/1.0/" specific-use="textmining" content-type="ccbylicense">https://creativecommons.org/licenses/by/4.0/</ali:license_ref><license-p>This is an Open Access article distributed under the terms of the <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution Licence</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.</license-p></license></permissions><abstract id="ABS1"><p id="P1">Multiple imputation (MI) is a widely used analytic approach to address missing data problems. SAS<sup>&#x000ae;</sup> (SAS Institute Inc, Cary, N.C.) has established MI procedures including PROC MI and PROC MIANALYZE. We illustrate the use of these procedures for conducting MI analysis of complex survey data by an example from RANDS. <xref rid="S1" ref-type="sec">Section 1</xref> contains the introduction. <xref rid="S2" ref-type="sec">Section 2</xref> includes some necessary methodological background. <xref rid="S6" ref-type="sec">Section 3</xref> shows a MI example with an arbitrary missing data pattern. <xref rid="S10" ref-type="sec">Section 4</xref> concludes the paper with a discussion.</p></abstract><kwd-group><kwd>Complex Survey</kwd><kwd>Missing Data</kwd><kwd>Multiple Imputation</kwd><kwd>SAS<sup>&#x000ae;</sup></kwd></kwd-group></article-meta></front><body><sec id="S1"><label>1</label><title>Introduction</title><p id="P2">Population-based studies often rely on surveys to collect information and conduct data analysis. 
However, survey data are often subject to nonresponse or missing data problems. Multiple imputation (MI) is arguably one of the most popular statistical strategies to handle missing data issues in many fields (<xref rid="R6" ref-type="bibr">Rubin 1987</xref>; <xref rid="R4" ref-type="bibr">He et al., 2022</xref>), including survey nonresponse problems.</p><p id="P3">The default option in statistical software is to remove cases with missing values from the analysis (i.e., case-deletion). The practicality of MI rests on its successful implementation in some mainstream software packages (e.g., SAS<sup>&#x000ae;</sup> and R) so that practitioners can use straightforward programming statements to conduct the analysis. For example, <xref rid="R1" ref-type="bibr">Berglund and Heeringa (2014)</xref> provided an overview of MI and its applications, using SAS<sup>&#x000ae;</sup> for illustration. Similar research literature can be found for other software packages. In addition, practitioners can refer to the software documentation for guidance.</p><p id="P4">Missing data problems in complex surveys pose some unique challenges (<xref rid="S1" ref-type="sec">Section 2</xref>). For survey item nonresponse problems, MI has proven to be a useful analytical tool supported by a large body of literature (e.g., <xref rid="R6" ref-type="bibr">Rubin 1987</xref>; <xref rid="R4" ref-type="bibr">He et al. 2022</xref>, Chapter 10). However, most of the literature has focused on the technical aspects of MI and has touched less on the programming components. 
In addition, the relevant programming literature and documentation are largely targeted to non-survey types of data.</p><p id="P5">To fill this gap, the aim of this paper is to provide a brief overview and a real example of MI for complex survey data using SAS<sup>&#x000ae;</sup> programming statements (version 9.4; additionally, the users can also use the free cloud SAS platform on <ext-link xlink:href="https://www.sas.com/en_us/software/on-demand-for-academics.html" ext-link-type="uri">https://www.sas.com/en_us/software/on-demand-for-academics.html</ext-link>).</p></sec><sec id="S2"><label>2.</label><title>Method Background</title><sec id="S3"><label>2.1</label><title>Missing data mechanism</title><p id="P6">Briefly speaking, the missing data mechanism of an incomplete variable describes how the probability of its missingness (i.e., being missing) is related to the original data. In general, there are three types of missing data mechanisms: (1) Missing completely at random (MCAR): the missingness of a variable is not related to any variable in the data; (2) Missing at random (MAR): the missingness of a variable is only related to other fully-observed variables in the data; (3) Missing not at random (MNAR): the missingness of a variable is related to the missing values after controlling for other fully-observed variables.</p></sec><sec id="S4"><label>2.2</label><title>Multiple Imputation</title><p id="P7">To conduct a MI analysis of a dataset, an appropriate missing data mechanism (e.g., MAR) is first assumed. Then a statistical imputation model is formulated to relate the missing variable(s) with observed variable(s) in the dataset. Next, missing values are imputed (i.e., replaced) by random draws from their posterior predictive distributions or their approximations derived from the imputation model. Such a procedure is independently repeated multiple (say M) times, resulting in M sets of imputed values. 
Early research (e.g., <xref rid="R6" ref-type="bibr">Rubin 1987</xref>) suggested setting M=5 is sufficient for regular analyses applied to datasets with a small or moderate amount of missing data. More recent research (e.g., <xref rid="R4" ref-type="bibr">He et al. 2022</xref>, Section 3.3.3) has shown that larger numbers (e.g., M &#x0003e; 5) might be desired when computing and data storage resources are available. After imputation, each of the M completed datasets, including both the observed and the imputed values, is analyzed separately and results in M sets of analysis results/estimates. Finally, these M sets of results are combined to yield a single set of statistical inference using the so-called Rubin&#x02019;s combining rules (<xref rid="R6" ref-type="bibr">Rubin 1987</xref>).</p></sec><sec id="S5"><label>2.3</label><title>Multiple Imputation for Complex Survey Missing Data Problems</title><p id="P8">Most surveys are based on sample designs with one or more complex features such as stratification, clustering of sampled elements, and weighting to compensate for differential probabilities of sample inclusion or varying response rates. Therefore, it is essential to incorporate this design information for survey data analysis (<xref rid="R2" ref-type="bibr">Cochran 1977</xref>). Survey data analysis procedures accounting for the design information are readily available in SAS<sup>&#x000ae;</sup> (<xref rid="S6" ref-type="sec">Section 3</xref>).</p><p id="P9">The above principle also holds for analyzing multiply-imputed complex survey data. Additionally, a principled MI procedure for complex survey missing data problems should also include the design information in the imputation process. However, there exist alternative practical options for incorporating the sample design (e.g., <xref rid="R4" ref-type="bibr">He et al. 2022</xref>, Section 10.3). 
Here we outline a hierarchical, trial-and-error strategy:</p><list list-type="order" id="L1"><list-item><p id="P10">Include the survey weight as a variable (predictor) in the imputation;</p></list-item><list-item><p id="P11">To include information about the sampling strata and clusters:</p><list list-type="simple" id="L2"><list-item><p id="P12">(2.1) First, create a new categorical variable that combines the sampling strata and the nested clusters, and include this variable in the imputation;</p></list-item><list-item><p id="P13">(2.2) If the imputation model has estimation issues due to the large number of categories in the combined variable above, then collapse clusters with small sample sizes within each sampling stratum, or include only the sampling strata variable in the imputation;</p></list-item><list-item><p id="P14">(2.3) If the model estimation issue still exists because some strata have very few units, then collapse these small-sample strata together to ensure each final stratum has a sufficient sample size, and then include the collapsed-strata variable in the imputation.</p></list-item></list></list-item></list><p id="P15">An additional major challenge for surveys is that missing data often occur in multiple variables, and this issue is usually coupled with the fact that survey variables are typically bounded. A feasible MI approach is the so-called &#x0201c;Fully Conditional Specification&#x0201d; (FCS) strategy, which imputes each incomplete variable based on a model that includes all other variables as the predictors and then cycles through all incomplete variables sequentially. FCS is arguably the most popular MI strategy for multivariate survey missing data problems (<xref rid="R4" ref-type="bibr">He et al. 2022</xref>, Chapter
7).</p></sec></sec><sec id="S6"><label>3.</label><title>A Multiple Imputation Example using SAS<sup>&#x000ae;</sup></title><sec id="S7"><label>3.1</label><title>Major SAS<sup>&#x000ae;</sup> Procedures</title><p id="P16">The two main SAS<sup>&#x000ae;</sup> procedures for MI are PROC MI and PROC MIANALYZE. Other SAS<sup>&#x000ae;</sup> procedures and data steps are also often used, depending on the analytic goals and contexts. Here we outline five major programming stages in a typical MI analysis.</p><p id="P17">Stage 1 (processing): Processing data before imputation to construct the working dataset including both the target missing and fully-observed variables. Exploratory analyses are often conducted at this stage.</p><p id="P18">Stage 2 (imputation): Running imputation M times by applying PROC MI to the working dataset.</p><p id="P19">Stage 3 (analysis): Applying the planned (post-imputation) analysis to the completed datasets by running SAS<sup>&#x000ae;</sup> statistical procedures. In the context of complex survey data, these procedures typically include PROC SURVEYMEANS, PROC SURVEYREG, etc.</p><p id="P20">Stage 4 (combining): Combining the results to yield the final estimates with PROC MIANALYZE.</p><p id="P21">Stage 5 (evaluation): An evaluation analysis that typically compares results among different MI models and with the case-wise deletion method.</p></sec><sec id="S8"><label>3.2</label><title>Data Background</title><p id="P22">The example is illustrated using a subset of Research and Development Survey (RANDS) (<ext-link xlink:href="https://www.cdc.gov/nchs/rands/" ext-link-type="uri">https://www.cdc.gov/nchs/rands/</ext-link>), a series of probability-sampled web-based surveys conducted by the National Center for Health Statistics (e.g., <xref rid="R3" ref-type="bibr">He et al, 2020</xref>). 
Specifically, we use some variables from the publicly released RANDS during COVID-19 data (the 3<sup>rd</sup> round), which is a special series of RANDS used to rapidly report on the impact of the COVID-19 pandemic (<xref rid="R5" ref-type="bibr">Irimata and Scanlon, 2022</xref>). The original dataset contains 5,458 records; it can be downloaded from <ext-link xlink:href="https://www.cdc.gov/nchs/rands/data.htm" ext-link-type="uri">https://www.cdc.gov/nchs/rands/data.htm</ext-link>. <xref rid="T1" ref-type="table">Table 1</xref> briefly describes the variables used in the example.</p></sec><sec id="S9"><label>3.3</label><title>Sample Code and Output</title><p id="P23">Stage 1: The selected variables contain no missing values in the original data. For illustrative purposes, we created around 20% missing values in both INCOME and MARITAL_NEW. The missingness of INCOME is related to AGE, GENDER, EDUC, and INTERNET; the missingness of MARITAL_NEW is related to AGE, EDUC, INTERNET, and HHSIZE. Because these predictors are all fully observed, the missingness of both variables follows MAR (<xref rid="S3" ref-type="sec">Section 2.1</xref>). For illustration, the key missingness-generating step for INCOME is included as follows (the initial dataset is called rands_covid3_new, while the new one is called rands_covid_missing):</p><preformat position="float" xml:space="preserve">
data rands_covid_missing; 
  set rands_covid3_new; 
  p_miss_INCOME = exp(-2+0.5*EDUC-0.5*GENDER-0.01*AGE+0.5*INTERNET) 
          /(1+exp(-2+0.5*EDUC-0.5*GENDER-0.01*AGE+0.5*INTERNET)); 
  rnumber_INCOME = ranuni(20110411); 
  If rnumber_INCOME &#x0003c; p_miss_INCOME then R_miss_INCOME =1; 
           else R_miss_INCOME=0; 
  If R_miss_INCOME = 1 then INCOME=.; 
run; 
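/* Illustrative addition, not part of the authors' program: a quick check 
   that the step above produced roughly the intended 20% missingness. 
   It assumes only the dataset and variables created above. */ 
proc freq data=rands_covid_missing; 
  tables R_miss_INCOME;   /* share with R_miss_INCOME=1 is the realized missing rate of INCOME */ 
run; 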
</preformat><p id="P24">In SAS<sup>&#x000ae;</sup>, missing values are coded by &#x0201c;.&#x0201d; (dot). In the code above, INCOME is set as missing if a uniform random number is less than a pre-specified missingness probability, which is related to other variables by a logit function. As outlined in <xref rid="S7" ref-type="sec">Section 3.1</xref>, additional SAS<sup>&#x000ae;</sup> data steps and exploratory analyses can be done for the data processing stage of the MI analysis.</p><p id="P25">Stage 2: We first briefly discuss some possible modeling strategies. Since both INCOME and MARITAL_NEW have missing values, the desirable imputation strategy is FCS (<xref rid="S5" ref-type="sec">Section 2.3</xref>). Under FCS, there exist alternative modeling options, some of which are included as follows:</p><list list-type="order" id="L3"><list-item><p id="P26">INCOME has 16 categories (i.e., 1&#x02013;16) with an ordinal nature. Although each integer value does not represent the same dollar amount range, for simplicity we only consider these integers as our imputation and analysis metric. For convenience of illustration, this variable can be treated as a positive continuous variable and modeled via a linear regression model conditional on other variables. However, the imputed INCOME values can take fractional numbers. To preserve the integer format, a naive post-imputation rounding step can be taken; imputed values less than 1 can be set as 1 and those above 16 can be set as 16. Additionally, PROC MI has an option to force the imputed values being generated within a pre-specified range (e.g., [1,16]), and then rounding is only necessary for imputed values within the range. On the other hand, INCOME can also be imputed using the predictive mean matching (PMM) method (e.g., <xref rid="R4" ref-type="bibr">He et al. 2022</xref>, Section 5.5). 
Briefly, PMM can be viewed as an MI extension of hot-deck imputation, where each missing value is replaced with an observed response from a &#x0201c;similar&#x0201d; unit. In our example, PMM can naturally preserve the range and integer format of the imputations without the need for rounding.</p></list-item><list-item><p id="P27">MARITAL_NEW has two categories, so it can be modelled using a logistic regression conditional on other variables. Alternatively, binary or nominal variables such as MARITAL_NEW can be imputed via a discriminant analysis model. That is, stratified by MARITAL_NEW, other variables are assumed to follow a multivariate normal distribution (e.g., <xref rid="R4" ref-type="bibr">He et al. 2022</xref>, Section 4.3.2).</p></list-item></list><p id="P28">The sample code is as follows:</p><preformat position="float" xml:space="preserve">
proc mi data =rands_covid_missing seed =197789 out= income_impute nimpute =5 
               min = 1 . . . . . . . . max = 16 . . . . . . . . ; 
        class EDUC GENDER INTERNET MARITAL_NEW S_VSTRAT_COMBINE ; 
       fcs nbiter=20 reg (INCOME/details) logistic (MARITAL_NEW / details likelihood=augment) ; 
       *fcs nbiter=20 regpmm (INCOME/details) logistic (MARITAL_NEW/details likelihood=augment); 
       *fcs nbiter=20 reg (INCOME/details) discrim (MARITAL_NEW/classeffects =include details); 
       *fcs nbiter=20 regpmm (INCOME/details) discrim (MARITAL_NEW/classeffects=include details); 
       var INCOME AGE WEIGHT_CALIBRATED EDUC GENDER INTERNET MARITAL_NEW HHSIZE S_VSTRAT_COMBINE; 
run; 
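/* Illustrative addition, not part of the authors' program: a sketch of the 
   naive post-imputation rounding described for the "reg" option. 
   The output dataset name income_impute_round is hypothetical. */ 
data income_impute_round; 
  set income_impute;                         /* stacked imputations, indexed by _Imputation_ */ 
  INCOME = min(max(round(INCOME), 1), 16);   /* nearest integer, kept within [1,16] */ 
run; 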
</preformat><p id="P29">We provide some additional remarks about the above code.</p><list list-type="alpha-lower" id="L4"><list-item><p id="P30">The input dataset is &#x0201c;rands_covid_missing&#x0201d;; the output dataset containing the multiple imputation results is &#x0201c;income_impute&#x0201d;; &#x0201c;nimpute=&#x0201d; specifies the number of imputations (we use 5 in this example); &#x0201c;seed=&#x0201d; specifies the initial random seed used in MI. Fixing the random seed can render reproducible results.</p></list-item><list-item><p id="P31">The variables included in the imputation are specified after &#x0201c;var&#x0201d;. Among them, categorical variables are specified after &#x0201c;class&#x0201d;.</p></list-item><list-item><p id="P32">To include the design variables, we initially include WEIGHT_CALIBRATED and the combined strata and PSU variable (S_VSTRAT and S_VPSU, respectively) in the model (after &#x0201c;var&#x0201d;). However, the model has estimation problems because some sampling strata have very few samples. As a result, SAS<sup>&#x000ae;</sup> would issue warnings in log files. They would also be noticed by checking the regression coefficients of the output. Therefore, we collapse some small strata so that each final stratum has at least 10 samples, which is coded by the new variable S_VSTRAT_COMBINE. We also exclude S_VPSU from the model.</p></list-item><list-item><p id="P33">fcs nbiter=20 reg (INCOME/details) logistic (MARITAL_NEW/details likelihood=augment). This statement specifies that we use FCS to impute both INCOME and MARITAL_NEW. 
Specifically, &#x0201c;nbiter=20&#x0201d; requests 20 burn-in iterations before each imputation; &#x0201c;reg (INCOME/details)&#x0201d; specifies a linear regression model for INCOME, and the &#x0201c;details&#x0201d; option requests output of the regression coefficients of the model fit across all imputations; &#x0201c;logistic (MARITAL_NEW/details)&#x0201d; specifies a logistic regression imputation model for MARITAL_NEW with coefficients output; &#x0201c;likelihood=augment&#x0201d; specifies a robust logistic regression to deal with possible data separation issues (e.g., <xref rid="R4" ref-type="bibr">He et al. 2022</xref>, Section 4.3.2.4).</p></list-item><list-item><p id="P34">We can specify &#x0201c;min=1&#x0201d; and &#x0201c;max=16&#x0201d; in the &#x0201c;proc mi&#x0201d; statement to force the imputed values of INCOME to fall in this range. Each list has one entry per variable, in the order of the &#x0201c;var&#x0201d; statement; for the variables that do not need bounds, &#x0201c;min&#x0201d; and &#x0201c;max&#x0201d; are set to missing values (&#x0201c;.&#x0201d;).</p></list-item><list-item><p id="P35">fcs nbiter=20 regpmm (INCOME/details) logistic (MARITAL_NEW/details likelihood=augment).</p><p id="P36">This statement (commented out with a &#x0201c;*&#x0201d;) specifies another modeling option: a PMM imputation for INCOME and a logistic regression imputation for MARITAL_NEW.</p></list-item><list-item><p id="P37">fcs nbiter=20 reg (INCOME/details) discrim (MARITAL_NEW/classeffects =include details). This statement (commented out with a &#x0201c;*&#x0201d;) specifies another modeling option: a linear regression imputation for INCOME and a discriminant analysis model for MARITAL_NEW. For the latter, &#x0201c;classeffects=include&#x0201d; specifies that all of the remaining variables, both continuous and categorical, are included in the discriminant analysis.</p></list-item><list-item><p id="P38">fcs nbiter=20 regpmm (INCOME/details) discrim (MARITAL_NEW/classeffects =include details). 
This statement (commented out with a &#x0201c;*&#x0201d;) specifies another modeling option: a PMM imputation for INCOME and a discriminant analysis model for MARITAL_NEW.</p></list-item></list><p id="P39">We now include some output from the above code and provide remarks. For ease of illustration, we separate the output into four parts and then comment on them one by one.</p><p id="P40">Output 1</p><table-wrap position="anchor" id="T3"><table frame="void" rules="none"><colgroup span="1"><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/></colgroup><tbody><tr><th align="center" valign="top" colspan="2" rowspan="1">The MI Procedure</th></tr><tr><th align="center" valign="top" colspan="2" rowspan="1">Model Information</th></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">
<bold>Data Set</bold>
</td><td align="left" valign="top" rowspan="1" colspan="1">WORK.RANDS_COVID_MISSING</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">
<bold>Method</bold>
</td><td align="left" valign="top" rowspan="1" colspan="1">FCS</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">
<bold>Number of Imputations</bold>
</td><td align="left" valign="top" rowspan="1" colspan="1">5</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">
<bold>Number of Burn-in Iterations</bold>
</td><td align="left" valign="top" rowspan="1" colspan="1">20</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">
<bold>Seed for random number generator</bold>
</td><td align="left" valign="top" rowspan="1" colspan="1">197789</td></tr></tbody><tbody><tr><th align="center" valign="top" colspan="2" rowspan="1">FCS Model Specification</th></tr><tr><th align="left" valign="top" rowspan="1" colspan="1">Method</th><th align="left" valign="top" rowspan="1" colspan="1">Imputed Variables</th></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">Regression</td><td align="left" valign="top" rowspan="1" colspan="1">INCOME AGE WEIGHT_CALIBRATED HHSIZE</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">Logistic Regression</td><td align="left" valign="top" rowspan="1" colspan="1">MARITAL_NEW</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">Discriminant Function</td><td align="left" valign="top" rowspan="1" colspan="1">EDUC GENDER INTERNET S_VSTRAT_COMBINE</td></tr></tbody></table></table-wrap><p id="P41">Output 1 provides some general information about the imputation model setup and the variables included. For categorical variables, the discriminant analysis imputation model is the default option.</p><p id="P42">Output 2 shows the missingness pattern of the variables and some descriptive statistics of the associated subgroups. Specifically, Group 1 has all variables fully observed, denoted by &#x0201c;X&#x0201d; for each variable; Group 2 has only MARITAL_NEW with missing values (denoted by &#x0201c;.&#x0201d;); Group 3 has only INCOME with missing values; and Group 4 has missing values in both INCOME and MARITAL_NEW. The means of the continuous variables of each subgroup are also displayed. 
For instance, the average age from Group 1 (=53.386) is higher than those from the other three groups.</p><p id="P43">Output 2</p><table-wrap position="anchor" id="T4"><table frame="box" rules="all"><colgroup span="1"><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/></colgroup><thead><tr><th align="center" valign="top" colspan="16" rowspan="1">Missing Data Patterns</th></tr><tr><th align="right" valign="top" rowspan="2" colspan="1">Group</th><th align="left" valign="top" rowspan="2" colspan="1">INCOME</th><th align="left" valign="top" rowspan="2" colspan="1">AGE</th><th align="left" valign="top" rowspan="2" colspan="1">WEIGHT_CALIBRATED</th><th align="left" valign="top" rowspan="2" colspan="1">EDUC</th><th align="left" valign="top" rowspan="2" colspan="1">GENDER</th><th align="left" valign="top" rowspan="2" colspan="1">INTERNET</th><th align="left" valign="top" rowspan="2" colspan="1">MARITAL_NEW</th><th align="left" valign="top" rowspan="2" colspan="1">HHSIZE</th><th align="left" valign="top" rowspan="2" colspan="1">S_VSTRAT_COMBINE</th><th align="right" valign="top" rowspan="2" colspan="1">Freq</th><th align="right" valign="top" rowspan="2" colspan="1">Percent</th><th align="center" valign="top" colspan="4" rowspan="1">Group Means</th></tr><tr><th align="right" valign="top" rowspan="1" colspan="1">INCOME</th><th 
align="right" valign="top" rowspan="1" colspan="1">AGE</th><th align="right" valign="top" rowspan="1" colspan="1">WEIGHT_CALIBRATED</th><th align="right" valign="top" rowspan="1" colspan="1">HHSIZE</th></tr></thead><tbody><tr><td align="right" valign="top" rowspan="1" colspan="1">
<bold>1</bold>
</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="right" valign="top" rowspan="1" colspan="1">3289</td><td align="right" valign="top" rowspan="1" colspan="1">60.26</td><td align="right" valign="top" rowspan="1" colspan="1">9.9811</td><td align="right" valign="top" rowspan="1" colspan="1">53.386</td><td align="right" valign="top" rowspan="1" colspan="1">0.9476</td><td align="right" valign="top" rowspan="1" colspan="1">2.4387</td></tr><tr><td align="right" valign="top" rowspan="1" colspan="1">
<bold>2</bold>
</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1"/><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="right" valign="top" rowspan="1" colspan="1">948</td><td align="right" valign="top" rowspan="1" colspan="1">17.37</td><td align="right" valign="top" rowspan="1" colspan="1">9.9535</td><td align="right" valign="top" rowspan="1" colspan="1">48.800</td><td align="right" valign="top" rowspan="1" colspan="1">1.1708</td><td align="right" valign="top" rowspan="1" colspan="1">3.9409</td></tr><tr><td align="right" valign="top" rowspan="1" colspan="1">
<bold>3</bold>
</td><td align="left" valign="top" rowspan="1" colspan="1"/><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="right" valign="top" rowspan="1" colspan="1">941</td><td align="right" valign="top" rowspan="1" colspan="1">17.24</td><td align="right" valign="top" rowspan="1" colspan="1"/><td align="right" valign="top" rowspan="1" colspan="1">49.865</td><td align="right" valign="top" rowspan="1" colspan="1">0.9972</td><td align="right" valign="top" rowspan="1" colspan="1">2.5622</td></tr><tr><td align="right" valign="top" rowspan="1" colspan="1">
<bold>4</bold>
</td><td align="left" valign="top" rowspan="1" colspan="1"/><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1"/><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="left" valign="top" rowspan="1" colspan="1">X</td><td align="right" valign="top" rowspan="1" colspan="1">280</td><td align="right" valign="top" rowspan="1" colspan="1">5.13</td><td align="right" valign="top" rowspan="1" colspan="1"/><td align="right" valign="top" rowspan="1" colspan="1">46.867</td><td align="right" valign="top" rowspan="1" colspan="1">1.0469</td><td align="right" valign="top" rowspan="1" colspan="1">4.2107</td></tr></tbody></table></table-wrap><p id="P44">Output 2 also shows that the data have an arbitrary missing data pattern. In contrast, a monotone missingness pattern is usually seen in longitudinal studies, where once a subject drops out, that subject&#x02019;s measurements at all later times are missing. Note that PROC MI has specific options for imputing monotone missing data. 
However, for brevity, they are not covered in this paper.</p><p id="P45">Output 3</p><table-wrap position="anchor" id="T5"><table frame="void" rules="none"><colgroup span="1"><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/></colgroup><thead><tr><th align="center" valign="top" colspan="8" rowspan="1">Regression Models for FCS Method</th></tr><tr><th align="left" valign="top" rowspan="2" colspan="1">Imputed Variable</th><th align="left" valign="top" rowspan="2" colspan="1">Effect</th><th align="right" valign="top" rowspan="2" colspan="1">EDUC</th><th align="center" valign="top" colspan="5" rowspan="1">Imputation</th></tr><tr><th align="right" valign="top" rowspan="1" colspan="1">1</th><th align="right" valign="top" rowspan="1" colspan="1">2</th><th align="right" valign="top" rowspan="1" colspan="1">3</th><th align="right" valign="top" rowspan="1" colspan="1">4</th><th align="right" valign="top" rowspan="1" colspan="1">5</th></tr></thead><tbody><tr><td align="left" valign="top" rowspan="1" colspan="1">
<bold>INCOME</bold>
</td><td align="left" valign="top" rowspan="1" colspan="1">
<bold>Intercept</bold>
</td><td align="right" valign="top" rowspan="1" colspan="1">.</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.223674</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.202220</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.219967</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.191550</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.188843</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">
<bold>INCOME</bold>
</td><td align="left" valign="top" rowspan="1" colspan="1">
<bold>AGE</bold>
</td><td align="right" valign="top" rowspan="1" colspan="1">.</td><td align="right" valign="top" rowspan="1" colspan="1">0.020476</td><td align="right" valign="top" rowspan="1" colspan="1">0.029064</td><td align="right" valign="top" rowspan="1" colspan="1">0.038018</td><td align="right" valign="top" rowspan="1" colspan="1">0.022661</td><td align="right" valign="top" rowspan="1" colspan="1">0.018259</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">
<bold>INCOME</bold>
</td><td align="left" valign="top" rowspan="1" colspan="1">
<bold>WEIGHT_CALIBRATED</bold>
</td><td align="right" valign="top" rowspan="1" colspan="1">.</td><td align="right" valign="top" rowspan="1" colspan="1">0.042331</td><td align="right" valign="top" rowspan="1" colspan="1">0.031441</td><td align="right" valign="top" rowspan="1" colspan="1">0.069224</td><td align="right" valign="top" rowspan="1" colspan="1">0.061784</td><td align="right" valign="top" rowspan="1" colspan="1">0.034479</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">
<bold>INCOME</bold>
</td><td align="left" valign="top" rowspan="1" colspan="1">
<bold>EDUC</bold>
</td><td align="right" valign="top" rowspan="1" colspan="1">2.000</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.377725</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.396039</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.394906</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.384208</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.329835</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">
<bold>INCOME</bold>
</td><td align="left" valign="top" rowspan="1" colspan="1">
<bold>EDUC</bold>
</td><td align="right" valign="top" rowspan="1" colspan="1">3.000</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.036021</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.014313</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.023572</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.011021</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.077057</td></tr></tbody></table><table frame="void" rules="none"><colgroup span="1"><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/></colgroup><thead><tr><th align="center" valign="top" colspan="7" rowspan="1">Logistic Models for FCS Method</th></tr><tr><th align="left" valign="top" rowspan="2" colspan="1">Imputed Variable</th><th align="left" valign="top" rowspan="2" colspan="1">Effect</th><th align="center" valign="top" colspan="5" rowspan="1">Imputation</th></tr><tr><th align="right" valign="top" rowspan="1" colspan="1">1</th><th align="right" valign="top" rowspan="1" colspan="1">2</th><th align="right" valign="top" rowspan="1" colspan="1">3</th><th align="right" valign="top" rowspan="1" colspan="1">4</th><th align="right" valign="top" rowspan="1" colspan="1">5</th></tr></thead><tbody><tr><td align="left" valign="top" rowspan="1" colspan="1">
<bold>MARITAL_NEW</bold>
</td><td align="left" valign="top" rowspan="1" colspan="1">
<bold>Intercept</bold>
</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.246513</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.105072</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.137450</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.165294</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.110486</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">
<bold>MARITAL_NEW</bold>
</td><td align="left" valign="top" rowspan="1" colspan="1">
<bold>INCOME</bold>
</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.885190</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.903693</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.923058</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.902241</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.809001</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">
<bold>MARITAL_NEW</bold>
</td><td align="left" valign="top" rowspan="1" colspan="1">
<bold>AGE</bold>
</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.524000</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.520803</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.542015</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.571203</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.524950</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">
<bold>MARITAL_NEW</bold>
</td><td align="left" valign="top" rowspan="1" colspan="1">
<bold>WEIGHT_CALIBRATED</bold>
</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.339081</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.282057</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.328680</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.264126</td><td align="right" valign="top" rowspan="1" colspan="1">&#x02212;0.374018</td></tr></tbody></table></table-wrap><p id="P46">Output 3 shows some details about the fit for each of the imputation models used in FCS. If we use the modeling option &#x0201c;fcs nbiter=20 reg (INCOME/details) logistic (MARITAL_NEW/details likelihood=augment)&#x0201d; in PROC MI, then the output contains the linear regression coefficients for INCOME and logistic regression coefficients for MARITAL_NEW across 5 imputations. For simplicity we do not include all coefficients here. Specifically, the results under &#x0201c;Regression Models for FCS Method&#x0201d; lists the coefficients for fitting INCOME. For example, the coefficient for AGE is 0.020476 for the 1<sup>st</sup> imputation, 0.029064 for the 2<sup>nd</sup> imputation, etc. The results under &#x0201c;Logistic Models for FCS Method&#x0201d; lists the coefficients for fitting MARITAL_NEW. For instance, the coefficient for AGE is &#x02212;0.524000 for the 1<sup>st</sup> imputation, &#x02212;0.520803 for the 2<sup>nd</sup> imputation, etc.</p><p id="P47">We previously discussed the need for collapsing some small strata and excluding clusters to achieve stable model estimates. 
Without these steps, in addition to warning messages in the SAS<sup>&#x000ae;</sup> log, we would also see some very extreme logistic regression coefficients (e.g., outside the range [&#x02212;5,5]) in Output 3.</p><p id="P48">Output 4</p><table-wrap position="anchor" id="T6"><table frame="void" rules="none"><colgroup span="1"><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/></colgroup><thead><tr><th align="center" valign="top" colspan="8" rowspan="1">Variance Information (5 Imputations)</th></tr><tr><th align="left" valign="top" rowspan="2" colspan="1">Variable</th><th align="center" valign="top" colspan="3" rowspan="1">Variance</th><th align="right" valign="top" rowspan="2" colspan="1">DF</th><th align="right" valign="top" rowspan="2" colspan="1">Relative Increase in Variance</th><th align="right" valign="top" rowspan="2" colspan="1">Fraction Missing Information</th><th align="right" valign="top" rowspan="2" colspan="1">Relative Efficiency</th></tr><tr><th align="right" valign="top" rowspan="1" colspan="1">Between</th><th align="right" valign="top" rowspan="1" colspan="1">Within</th><th align="right" valign="top" rowspan="1" colspan="1">Total</th></tr></thead><tbody><tr><td align="left" valign="top" rowspan="1" colspan="1">
<bold>INCOME</bold>
</td><td align="right" valign="top" rowspan="1" colspan="1">0.001115</td><td align="right" valign="top" rowspan="1" colspan="1">0.003021</td><td align="right" valign="top" rowspan="1" colspan="1">0.004360</td><td align="right" valign="top" rowspan="1" colspan="1">41.96</td><td align="right" valign="top" rowspan="1" colspan="1">0.443073</td><td align="right" valign="top" rowspan="1" colspan="1">0.337540</td><td align="right" valign="top" rowspan="1" colspan="1">0.936761</td></tr></tbody></table><table frame="void" rules="none"><colgroup span="1"><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/></colgroup><thead><tr><th align="center" valign="top" colspan="11" rowspan="1">Parameter Estimates (5 Imputations)</th></tr><tr><th align="left" valign="top" rowspan="1" colspan="1">Variable</th><th align="right" valign="top" rowspan="1" colspan="1">Mean</th><th align="right" valign="top" rowspan="1" colspan="1">Std Error</th><th align="right" valign="top" colspan="2" rowspan="1">95% Confidence Limits</th><th align="right" valign="top" rowspan="1" colspan="1">DF</th><th align="right" valign="top" rowspan="1" colspan="1">Minimum</th><th align="right" valign="top" rowspan="1" colspan="1">Maximum</th><th align="right" valign="top" rowspan="1" colspan="1">Mu0</th><th align="right" valign="top" rowspan="1" colspan="1">t for H0: Mean=Mu0</th><th align="right" valign="top" rowspan="1" colspan="1">Pr &#x0003e; |t|</th></tr></thead><tbody><tr><td align="left" valign="top" rowspan="1" colspan="1">
<bold>INCOME</bold>
</td><td align="right" valign="top" rowspan="1" colspan="1">10.014665</td><td align="right" valign="top" rowspan="1" colspan="1">0.066028</td><td align="right" valign="top" rowspan="1" colspan="1">9.881411</td><td align="right" valign="top" rowspan="1" colspan="1">10.14792</td><td align="right" valign="top" rowspan="1" colspan="1">41.96</td><td align="right" valign="top" rowspan="1" colspan="1">9.974504</td><td align="right" valign="top" rowspan="1" colspan="1">10.065734</td><td align="right" valign="top" rowspan="1" colspan="1">0</td><td align="right" valign="top" rowspan="1" colspan="1">151.67</td><td align="right" valign="top" rowspan="1" colspan="1">&#x0003c;.0001</td></tr></tbody></table></table-wrap><p id="P49">Output 4 shows some combined estimates after MI. It displays only simple means for continuous variables (e.g., INCOME) and some associated statistics. Note that it might be inappropriate to use this output as the basis for final results. For example, the mean estimate of INCOME here does not account for the complex survey design of RANDS.</p><p id="P50">Stage 3: We use the mean estimates as an analytical example. The example code is as follows:</p><preformat position="float" xml:space="preserve">
proc surveymeans data=income_impute; 
   weight WEIGHT_CALIBRATED; 
   strata S_VSTRAT; 
   cluster S_VPSU;
   var INCOME MARITAL_NEW; 
   by _imputation_; 
   ods output Statistics = mean_income_imp; 
run; 
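/* Optional check (an illustrative addition, not part of the original program):
   list the per-imputation means and standard errors saved by ODS above.
   The variable names follow those used in the Stage 4 code; these records
   are the inputs to the combining step. */
proc print data=mean_income_imp;
   var _imputation_ varname mean stderr;
run;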
</preformat><p id="P51">For illustration, we estimate the overall means of INCOME and MARITAL_NEW using PROC SURVEYMEANS, which uses the survey design information including strata, clusters, and weights. The option &#x0201c;data=income_impute&#x0201d; reads in the output dataset from PROC MI. In that dataset, the variable &#x0201c;_imputation_&#x0201d; indexes the imputed datasets (i.e., 1&#x02013;5), and the dataset has 27,290 (=5458&#x000d7;5) records. A &#x0201c;by&#x0201d; statement is used to run the analysis separately for each imputed dataset. Finally, the statement &#x0201c;ods output statistics = mean_income_imp&#x0201d; stores the output of the 5 analyses in the dataset &#x0201c;mean_income_imp&#x0201d; for carrying out the combining step in Stage 4.</p><p id="P52">Output 5 shows the means and standard errors of both variables from the 1<sup>st</sup> imputed dataset. It contains the default output from PROC SURVEYMEANS. For example, the mean of the completed INCOME is 10.161342 and its standard error estimate is 0.110726. The full SAS<sup>&#x000ae;</sup> output would include results from all 5 imputations and distribution plots of both variables (details not shown).</p><p id="P53">Output 5</p><table-wrap position="anchor" id="T7"><table frame="void" rules="none"><colgroup span="1"><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/></colgroup><thead><tr><th align="center" valign="top" colspan="2" rowspan="1">The SURVEYMEANS Procedure Imputation Number=1</th></tr><tr><th align="center" valign="top" colspan="2" rowspan="1">Data Summary</th></tr></thead><tbody><tr><td align="left" valign="top" rowspan="1" colspan="1">
<bold>Number of Strata</bold>
</td><td align="right" valign="top" rowspan="1" colspan="1">71</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">
<bold>Number of Clusters</bold>
</td><td align="right" valign="top" rowspan="1" colspan="1">159</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">
<bold>Number of Observations</bold>
</td><td align="right" valign="top" rowspan="1" colspan="1">5458</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">
<bold>Sum of Weights</bold>
</td><td align="right" valign="top" rowspan="1" colspan="1">5457.99708</td></tr></tbody></table><table frame="void" rules="none"><colgroup span="1"><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/></colgroup><thead><tr><th align="center" valign="top" colspan="6" rowspan="1">Statistics</th></tr><tr><th align="left" valign="top" rowspan="1" colspan="1">Variable</th><th align="right" valign="top" rowspan="1" colspan="1">N</th><th align="right" valign="top" rowspan="1" colspan="1">Mean</th><th align="right" valign="top" rowspan="1" colspan="1">Std Error of Mean</th><th align="right" valign="top" colspan="2" rowspan="1">95% CL for Mean</th></tr></thead><tbody><tr><td align="left" valign="top" rowspan="1" colspan="1">
<bold>INCOME</bold>
</td><td align="right" valign="top" rowspan="1" colspan="1">5458</td><td align="right" valign="top" rowspan="1" colspan="1">10.161342</td><td align="right" valign="top" rowspan="1" colspan="1">0.110726</td><td align="right" valign="top" rowspan="1" colspan="1">9.94129666</td><td align="right" valign="top" rowspan="1" colspan="1">10.3813876</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">
<bold>MARITAL_NEW</bold>
</td><td align="right" valign="top" rowspan="1" colspan="1">5458</td><td align="right" valign="top" rowspan="1" colspan="1">0.646931</td><td align="right" valign="top" rowspan="1" colspan="1">0.010449</td><td align="right" valign="top" rowspan="1" colspan="1">0.62616660</td><td align="right" valign="top" rowspan="1" colspan="1">0.6676959</td></tr></tbody></table></table-wrap><p id="P54">Stage 4: We synthesize the results from the multiply-imputed datasets using PROC MIANALYZE.</p><p id="P55">For example, the following code combines the survey mean estimates for INCOME.</p><preformat position="float" xml:space="preserve">
proc mianalyze data=mean_income_imp edf=88; 
   modeleffects mean; 
   stderr stderr; 
   where varname = 'INCOME'; 
   ods output parameterestimates=MI_results_income; 
run; 
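
/* Illustrative sketch (not part of the original program): the same combining
   step applied to MARITAL_NEW. The output dataset name MI_results_marital
   is a hypothetical name chosen for this example. */
proc mianalyze data=mean_income_imp edf=88;
   modeleffects mean;
   stderr stderr;
   where varname = 'MARITAL_NEW';
   ods output parameterestimates=MI_results_marital;
run;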
</preformat><p id="P56">The procedure reads in the dataset mean_income_imp, which contains the separate estimates from the multiply-imputed datasets. The option &#x0201c;EDF=&#x0201d; is not specified by default but is necessary for complex survey data analysis because it specifies the complete-data degrees of freedom used in the combining step. In this example, we set the degrees of freedom to the number of clusters minus the number of strata in the dataset (159 &#x02212; 71 = 88; see Output 5). The statement &#x0201c;modeleffects mean&#x0201d; specifies that the estimands to be combined are the mean estimates. The statement &#x0201c;stderr stderr&#x0201d; names the variable that holds the standard errors associated with the means. The statement &#x0201c;where varname = &#x02018;INCOME&#x02019;&#x0201d; indicates that the combining step applies only to INCOME. Finally, &#x0201c;ods output parameterestimates=MI_results_income&#x0201d; saves the combined estimates to the dataset MI_results_income.</p><p id="P57">Output 6</p><table-wrap position="anchor" id="T8"><table frame="void" rules="none"><colgroup span="1"><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/></colgroup><thead><tr><th align="center" valign="top" colspan="2" rowspan="1">The MIANALYZE Procedure</th></tr><tr><th align="center" valign="top" colspan="2" rowspan="1">Model Information</th></tr></thead><tbody><tr><td align="left" valign="top" rowspan="1" colspan="1">
<bold>Data Set</bold>
</td><td align="left" valign="top" rowspan="1" colspan="1">WORK.MEAN_INCOME_IMP</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">
<bold>Number of Imputations</bold>
</td><td align="left" valign="top" rowspan="1" colspan="1">5</td></tr></tbody></table><table frame="void" rules="none"><colgroup span="1"><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/></colgroup><thead><tr><th align="center" valign="top" colspan="8" rowspan="1">Variance Information (5 Imputations)</th></tr><tr><th align="left" valign="top" rowspan="2" colspan="1">Parameter</th><th align="center" valign="top" colspan="3" rowspan="1">Variance</th><th align="right" valign="top" rowspan="2" colspan="1">DF</th><th align="right" valign="top" rowspan="2" colspan="1">Relative Increase in Variance</th><th align="right" valign="top" rowspan="2" colspan="1">Fraction Missing Information</th><th align="right" valign="top" rowspan="2" colspan="1">Relative Efficiency</th></tr><tr><th align="right" valign="top" rowspan="1" colspan="1">Between</th><th align="right" valign="top" rowspan="1" colspan="1">Within</th><th align="right" valign="top" rowspan="1" colspan="1">Total</th></tr></thead><tbody><tr><td align="left" valign="top" rowspan="1" colspan="1">
<bold>Mean</bold>
</td><td align="right" valign="top" rowspan="1" colspan="1">0.000756</td><td align="right" valign="top" rowspan="1" colspan="1">0.013659</td><td align="right" valign="top" rowspan="1" colspan="1">0.014453</td><td align="right" valign="top" rowspan="1" colspan="1">80.302</td><td align="right" valign="top" rowspan="1" colspan="1">0.058117</td><td align="right" valign="top" rowspan="1" colspan="1">0.055225</td><td align="right" valign="top" rowspan="1" colspan="1">0.997246</td></tr></tbody></table><table frame="void" rules="none"><colgroup span="1"><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/></colgroup><thead><tr><th align="center" valign="top" colspan="11" rowspan="1">Parameter Estimates (5 Imputations)</th></tr><tr><th align="left" valign="top" rowspan="1" colspan="1">Parameter</th><th align="right" valign="top" rowspan="1" colspan="1">Estimate</th><th align="right" valign="top" rowspan="1" colspan="1">Std Error</th><th align="right" valign="top" colspan="2" rowspan="1">95% Confidence Limits</th><th align="right" valign="top" rowspan="1" colspan="1">DF</th><th align="right" valign="top" rowspan="1" colspan="1">Minimum</th><th align="right" valign="top" rowspan="1" colspan="1">Maximum</th><th align="right" valign="top" rowspan="1" colspan="1">Theta0</th><th align="right" valign="top" rowspan="1" colspan="1">t for H0: Parameter=Theta0</th><th align="right" valign="top" rowspan="1" colspan="1">Pr &#x0003e; |t|</th></tr></thead><tbody><tr><td align="left" valign="top" rowspan="1" colspan="1">
<bold>Mean</bold>
</td><td align="right" valign="top" rowspan="1" colspan="1">10.230448</td><td align="right" valign="top" rowspan="1" colspan="1">0.120219</td><td align="right" valign="top" rowspan="1" colspan="1">9.991217</td><td align="right" valign="top" rowspan="1" colspan="1">10.46968</td><td align="right" valign="top" rowspan="1" colspan="1">80.302</td><td align="right" valign="top" rowspan="1" colspan="1">10.196713</td><td align="right" valign="top" rowspan="1" colspan="1">10.294728</td><td align="right" valign="top" rowspan="1" colspan="1">0</td><td align="right" valign="top" rowspan="1" colspan="1">85.10</td><td align="right" valign="top" rowspan="1" colspan="1">&#x0003c;.0001</td></tr></tbody></table></table-wrap><p id="P58">Output 6 shows the results from PROC MIANALYZE. The combined mean estimate of INCOME is 10.230448, its standard error is 0.120219, and the 95% confidence limits are (9.991217, 10.46968). Detailed explanations of other statistics (e.g., between/within variance) can be found in the literature (e.g., <xref rid="R4" ref-type="bibr">He et al. 2022</xref>, Chapter 3).</p><p id="P59">Stage 5: We conduct some diagnostics and evaluation. We have considered different modeling options for INCOME and MARITAL_NEW (Section 3.2.2). In this example, since we create the missing values, the imputation analysis results can also be compared with those from the complete data as well as from the case-deletion method. The programming code for Stage 5 would consist of running different MI models and analyses (e.g., remarks (d)&#x02013;(h) after PROC MI in <xref rid="S9" ref-type="sec">Section 3.3</xref>). Omitting the details, the evaluation results are summarized in <xref rid="T2" ref-type="table">Table 2</xref>.</p><p id="P60">The mean estimates from the case-deletion method are considerably lower than those from the complete-data analysis because the missingness mechanism is MAR. In general, all MI methods correct for the bias to some extent. In addition, MI analyses yield generally narrower confidence intervals than the case-deletion method. 
Among the different MI methods applied, it seems that when INCOME is imputed via PMM, the corresponding results are the closest to those from the complete-data analysis for both variables. Therefore, we would choose PMM+logit as the final MI modeling option.</p></sec></sec><sec id="S10"><label>4.</label><title>Discussion</title><p id="P61">We provide some simple illustrations of how to use SAS<sup>&#x000ae;</sup> to conduct MI analysis for complex survey data. In addition to providing some sample code and output, we provide some general guidance on constructing imputation models and running some evaluations. The full programming code is available at <ext-link xlink:href="https://github.com/he-zhang-hsu/multiple_imputation_book/tree/Survey_statistician" ext-link-type="uri">https://github.com/he-zhang-hsu/multiple_imputation_book/tree/Survey_statistician</ext-link>. Additional references on SAS<sup>&#x000ae;</sup>-based MI applications can be found in <xref rid="R1" ref-type="bibr">Berglund and Heeringa (2014)</xref> and relevant SAS documentation. References on MI strategies and applications, including those for non-survey data and their implementation in other software such as the R (<ext-link xlink:href="https://www.r-project.org/" ext-link-type="uri">https://www.R-project.org/</ext-link>) package &#x0201c;mice&#x0201d; (see <xref rid="R7" ref-type="bibr">van Buuren and Groothuis-Oudshoorn, 2011</xref>), can be found in <xref rid="R4" ref-type="bibr">He et al. (2022)</xref>.</p></sec></body><back><ref-list><title>References</title><ref id="R1"><mixed-citation publication-type="book"><name><surname>Berglund</surname><given-names>P</given-names></name> and <name><surname>Heeringa</surname><given-names>S</given-names></name>. <source>Multiple Imputation of Missing Data Using SAS</source>. <year>2014</year>. 
<publisher-loc>Cary, NC</publisher-loc>: <publisher-name>SAS Institute Inc</publisher-name>.</mixed-citation></ref><ref id="R2"><mixed-citation publication-type="book"><name><surname>Cochran</surname><given-names>WG</given-names></name>. <source>Sampling Techniques</source>, <edition>3rd</edition> Edition. <year>1977</year>. <publisher-loc>New York</publisher-loc>: <publisher-name>Wiley</publisher-name>.</mixed-citation></ref><ref id="R3"><mixed-citation publication-type="journal"><name><surname>He</surname><given-names>Y</given-names></name>, <name><surname>Cai</surname><given-names>B</given-names></name>, <name><surname>Shin</surname><given-names>H-C</given-names></name>, <name><surname>Beresovsky</surname><given-names>V</given-names></name>, <name><surname>Parsons</surname><given-names>V</given-names></name>, <name><surname>Irimata</surname><given-names>K</given-names></name>, <etal/>
<article-title>The National Center for Health Statistics&#x02019; 2015 and 2016 Research and Development Surveys</article-title>. <source>National Center for Health Statistics. Vital Health Stat</source>
<volume>1</volume>(<issue>64</issue>). <year>2020</year>.</mixed-citation></ref><ref id="R4"><mixed-citation publication-type="book"><name><surname>He</surname><given-names>Y</given-names></name>, <name><surname>Zhang</surname><given-names>G</given-names></name>, <name><surname>Hsu</surname><given-names>CH</given-names></name>. <source>Multiple Imputation of Missing Data in Practice: Basic Theory and Analysis Strategies</source>, <edition>1st</edition> Edition. <year>2022</year>. <publisher-name>Chapman and Hall/CRC Press</publisher-name>.</mixed-citation></ref><ref id="R5"><mixed-citation publication-type="journal"><name><surname>Irimata</surname><given-names>KE</given-names></name> and <name><surname>Scanlon</surname><given-names>PJ</given-names></name>. <year>2022</year>. <article-title>The Research and Development Survey (RANDS) during COVID-19</article-title>. <source>Statistical Journal of the International Association for Official Statistics</source>
<volume>38</volume>(<issue>1</issue>): <fpage>13</fpage>&#x02013;<lpage>21</lpage>.<pub-id pub-id-type="pmid">35928170</pub-id></mixed-citation></ref><ref id="R6"><mixed-citation publication-type="book"><name><surname>Rubin</surname><given-names>DB</given-names></name>. <source>Multiple Imputation for Nonresponse in Surveys</source>. <year>1987</year>. <publisher-loc>New York</publisher-loc>: <publisher-name>Wiley</publisher-name>.</mixed-citation></ref><ref id="R7"><mixed-citation publication-type="journal"><name><surname>van Buuren</surname><given-names>S</given-names></name> and <name><surname>Groothuis-Oudshoorn</surname><given-names>K</given-names></name> (<year>2011</year>). <article-title>mice: Multivariate Imputation by Chained Equations in R</article-title>. <source>Journal of Statistical Software</source>, <volume>45</volume>(<issue>3</issue>), <fpage>1</fpage>&#x02013;<lpage>67</lpage>. DOI <pub-id pub-id-type="doi">10.18637/jss.v045.i03</pub-id>.</mixed-citation></ref></ref-list></back><floats-group><table-wrap position="float" id="T1"><label>Table 1:</label><caption><p id="P62">Variables Used in the Example</p></caption><table frame="box" rules="all"><colgroup span="1"><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/></colgroup><thead><tr><th align="left" valign="top" rowspan="1" colspan="1">Variable</th><th align="left" valign="top" rowspan="1" colspan="1">SAS<sup>&#x000ae;</sup> name</th><th align="left" valign="top" rowspan="1" colspan="1">Specifications</th></tr></thead><tbody><tr><td align="left" valign="top" rowspan="1" colspan="1">Age in years</td><td align="left" valign="top" rowspan="1" colspan="1">AGE</td><td align="left" valign="top" rowspan="1" colspan="1">18&#x02013;70; Age &#x02265; 70 is top-coded</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">Sex</td><td align="left" valign="top" rowspan="1" colspan="1">GENDER</td><td align="left" 
valign="top" rowspan="1" colspan="1">Male/Female</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">Education</td><td align="left" valign="top" rowspan="1" colspan="1">EDUC</td><td align="left" valign="top" rowspan="1" colspan="1">High school diploma or less/ Some college/Bachelor&#x02019;s degree or higher</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">Marital status</td><td align="left" valign="top" rowspan="1" colspan="1">MARITAL_NEW</td><td align="left" valign="top" rowspan="1" colspan="1">Married or living with partners / Others<xref rid="TFN2" ref-type="table-fn">*</xref></td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">Household internet use</td><td align="left" valign="top" rowspan="1" colspan="1">INTERNET</td><td align="left" valign="top" rowspan="1" colspan="1">Yes/No</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">Household size</td><td align="left" valign="top" rowspan="1" colspan="1">HHSIZE</td><td align="left" valign="top" rowspan="1" colspan="1">1&#x02013;6; household size &#x0003e;=6 is top-coded</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">Household income</td><td align="left" valign="top" rowspan="1" colspan="1">INCOME</td><td align="left" valign="top" rowspan="1" colspan="1">1&#x02013;16<xref rid="TFN3" ref-type="table-fn">**</xref></td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">Sampling strata</td><td align="left" valign="top" rowspan="1" colspan="1">S_VSTRAT</td><td align="left" valign="top" rowspan="1" colspan="1">71 sampling strata in the original data</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">Sampling clusters</td><td align="left" valign="top" rowspan="1" colspan="1">S_VPSU</td><td align="left" valign="top" rowspan="1" colspan="1">2 to 7 clusters per stratum</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">Survey weights</td><td align="left" valign="top" rowspan="1" 
colspan="1">WEIGHT_CALIBRATED</td><td align="left" valign="top" rowspan="1" colspan="1">0.0096&#x02013;17.6472<xref rid="TFN4" ref-type="table-fn">***</xref></td></tr></tbody></table><table-wrap-foot><fn id="TFN1"><p id="P63">Note:</p></fn><fn id="TFN2"><label>*</label><p id="P64">collapsed from 6 categories in the original data; &#x0201c;Others&#x0201d; has four categories: widowed, divorced, separated, and never married.</p></fn><fn id="TFN3"><label>**</label><p id="P65">1: &#x0003c; $5000; 2: $5000&#x02013;9999; 3: $10000&#x02013;14999; 4: $15000&#x02013;19999; 5: $20000&#x02013;24999; 6: $25000&#x02013;29999; 7: $30000&#x02013;34999; 8: $35000&#x02013;39999; 9: $40000&#x02013;49999; 10: $50000&#x02013;59999; 11: $60000&#x02013;74999; 12: $75000&#x02013;84999; 13: $85000&#x02013;99999; 14: $100000&#x02013;124999; 15: $125000&#x02013;149999; 16: &#x0003e; $150000.</p></fn><fn id="TFN4"><label>***</label><p id="P66">normalized survey weights after calibrating to adjust for possible selection bias of RANDS.</p></fn></table-wrap-foot></table-wrap><table-wrap position="float" id="T2"><label>Table 2:</label><caption><p id="P67">Mean Estimates of INCOME and MARITAL_NEW from Different Methods</p></caption><table frame="box" rules="all"><colgroup span="1"><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/></colgroup><thead><tr><th align="left" valign="top" rowspan="1" colspan="1">Method</th><th align="left" valign="top" rowspan="1" colspan="1">INCOME</th><th align="left" valign="top" rowspan="1" colspan="1">MARITAL_NEW</th></tr></thead><tbody><tr><td align="left" valign="top" rowspan="1" colspan="1">Complete-data</td><td align="left" valign="top" rowspan="1" colspan="1">10.38 (10.14, 10.62)</td><td align="left" valign="top" rowspan="1" colspan="1">0.613 (0.592, 0.634)</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">Case-deletion</td><td align="left" valign="top" 
rowspan="1" colspan="1">10.17 (9.91, 10.43)</td><td align="left" valign="top" rowspan="1" colspan="1">0.589 (0.565, 0.614)</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">MI: linear+logit</td><td align="left" valign="top" rowspan="1" colspan="1">10.36 (10.12, 10.59)</td><td align="left" valign="top" rowspan="1" colspan="1">0.624 (0.600, 0.648)</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">MI: linear+discriminant</td><td align="left" valign="top" rowspan="1" colspan="1">10.35 (10.13, 10.58)</td><td align="left" valign="top" rowspan="1" colspan="1">0.620 (0.596, 0.643)</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">MI: (constrained) linear+logit</td><td align="left" valign="top" rowspan="1" colspan="1">10.26 (10.03, 10.49)</td><td align="left" valign="top" rowspan="1" colspan="1">0.621 (0.597, 0.644)</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">MI: (constrained) linear+discriminant</td><td align="left" valign="top" rowspan="1" colspan="1">10.26 (10.03, 10.49)</td><td align="left" valign="top" rowspan="1" colspan="1">0.623 (0.600, 0.645)</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">MI: PMM+logit</td><td align="left" valign="top" rowspan="1" colspan="1">10.39 (10.17, 10.62)</td><td align="left" valign="top" rowspan="1" colspan="1">0.621 (0.597, 0.644)</td></tr><tr><td align="left" valign="top" rowspan="1" colspan="1">MI: PMM+discriminant</td><td align="left" valign="top" rowspan="1" colspan="1">10.40 (10.16, 10.64)</td><td align="left" valign="top" rowspan="1" colspan="1">0.620 (0.596, 0.644)</td></tr></tbody></table><table-wrap-foot><fn id="TFN5"><p id="P68">Note: 1. 95% confidence intervals are given in parentheses. 2. INCOME is modelled by either &#x0201c;linear&#x0201d; or &#x0201c;PMM&#x0201d;; MARITAL_NEW is modelled by either &#x0201c;logit&#x0201d; or &#x0201c;discriminant&#x0201d;. 3. 
&#x0201c;constrained&#x0201d; denotes that imputed values for INCOME are forced to lie in [1, 16]. 4. Fractional numbers are rounded where applicable.</p></fn></table-wrap-foot></table-wrap></floats-group></article>