Increasing numbers of individuals are choosing to opt out of population-based sampling frames due to privacy concerns. This is especially a problem in the selection of controls for case–control studies, as the cases often arise from relatively complete population-based registries, whereas control selection requires a sampling frame. If opt out is also related to risk factors, bias can arise.
We linked breast cancer cases who reported having a valid driver’s license from the 2004–2008 Wisconsin women’s health study (N = 2,988) with a master list of licensed drivers from the Wisconsin Department of Transportation (WDOT). This master list excludes Wisconsin drivers who requested that their information not be sold by the state. Multivariable-adjusted selection probability ratios (SPR) were calculated to estimate potential bias when using this driver’s license sampling frame to select controls.
A total of 962 cases (32%) had opted out of the WDOT sampling frame. Cases who were younger than 40 years (SPR = 0.90), had income either unreported (SPR = 0.89) or greater than $50,000 (SPR = 0.94), had lower parity (SPR = 0.96 per one-child decrease), or used hormones (SPR = 0.93) were significantly less likely to be covered by the WDOT sampling frame (α = 0.05 level).
Our results indicate the potential for selection bias due to differential opt out between various demographic and behavioral subgroups of controls. As selection bias may differ by exposure and study base, the assessment of potential bias needs to be ongoing.
SPRs can be used to predict the direction of bias when cases and controls stem from different sampling frames in population-based case–control studies.
Selection bias in population-based cancer research can affect the validity of the evidence used for public health practice (
An official master file of drivers with valid licenses is available for epidemiologic studies in many states (
This analysis used data from the breast cancer cases enrolled in the Wisconsin women’s health study (WWHS), a federally funded population-based case–control study designed to examine the associations of lifestyle factors and genetics with breast cancer risk (
We matched the participating 2,988 WWHS breast cancer cases to WDOT driver’s license files of individuals who had not "opted out" to estimate the completeness of the driver’s license sampling frame used to select controls. WWHS cases were linked with data from the 2006 WDOT master file. A dichotomous variable was created indicating whether each WWHS case record matched a record from the driver’s license master file. First, exact match linkages were conducted on the basis of gender, last name, first name, date of birth, and zip code between WWHS case records and WDOT driver’s license records. Second, manual review of exact match linkages was conducted based first on gender, last name, first name, and date of birth and then based on gender, last name, and date of birth for all remaining unmatched case records. Manual review of the remaining unmatched WWHS cases, aided by various weighting schemes developed by the National Center for Health Statistics for the National Death Index, was then used to find any additional cases with matching WDOT license records (
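The deterministic passes described above can be sketched as follows. This is a minimal illustration, not the study's linkage software: the field names, the normalization rule, and the example records are hypothetical, and the final manual review with weighting schemes is not modeled.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, str]

# Key sets for successive exact-match passes, strictest first.
# Field names are hypothetical; the passes mirror the text.
PASSES: List[Tuple[str, ...]] = [
    ("gender", "last_name", "first_name", "dob", "zip"),
    ("gender", "last_name", "first_name", "dob"),
    ("gender", "last_name", "dob"),
]


def make_key(record: Record, fields: Tuple[str, ...]) -> Tuple[str, ...]:
    """Build a comparison key, ignoring case and surrounding whitespace."""
    return tuple(record[f].strip().upper() for f in fields)


def link_cases(cases: List[Record], frame: List[Record]) -> Dict[str, Optional[int]]:
    """Map each case_id to the 0-based pass that matched it, or None."""
    result: Dict[str, Optional[int]] = {c["case_id"]: None for c in cases}
    for pass_no, fields in enumerate(PASSES):
        frame_keys = {make_key(r, fields) for r in frame}
        for c in cases:
            # Only cases left unmatched by stricter passes are retried.
            if result[c["case_id"]] is None and make_key(c, fields) in frame_keys:
                result[c["case_id"]] = pass_no
    return result


# Hypothetical example records.
cases = [
    {"case_id": "c1", "gender": "F", "last_name": "Smith", "first_name": "Ann",
     "dob": "1950-01-02", "zip": "53703"},
    {"case_id": "c2", "gender": "F", "last_name": "Jones", "first_name": "Mary",
     "dob": "1948-05-06", "zip": "53711"},
    {"case_id": "c3", "gender": "F", "last_name": "Lee", "first_name": "Kate",
     "dob": "1960-07-08", "zip": "53590"},
    {"case_id": "c4", "gender": "F", "last_name": "Brown", "first_name": "Sue",
     "dob": "1955-03-04", "zip": "53715"},
]
frame = [
    {"gender": "F", "last_name": "SMITH", "first_name": "ANN",
     "dob": "1950-01-02", "zip": "53703"},
    {"gender": "F", "last_name": "JONES", "first_name": "MARY",
     "dob": "1948-05-06", "zip": "53705"},
    {"gender": "F", "last_name": "LEE", "first_name": "KATHERINE",
     "dob": "1960-07-08", "zip": "53590"},
]

link_result = link_cases(cases, frame)
# Dichotomous coverage indicator: True if any pass found a frame record.
on_frame = {cid: p is not None for cid, p in link_result.items()}
```

Cases left with `None` after the last pass would be the ones forwarded to manual review.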
It has been shown that the ratio of the probability of being on the sampling frame among individuals with a risk factor divided by the corresponding probability among those without the risk factor is a measure of the bias in the estimated odds ratio (OR) of disease for the factor (
As risk factor information is not available for controls who are not on the driver’s license sampling frame, it is not possible to directly estimate γ/δ for the controls. Under the assumption that opting out patterns are similar for cases and controls, however, γ/δ is the ratio associated with opt out among both cases and controls, and we can estimate γ/δ from our match of cases to the driver’s license file. The selection probability ratio (SPR) 1/(γ/δ) can then be used to estimate the magnitude of bias. Note that this approach uses data on a subset of cases solely to estimate bias in the control group, whereas all cases will be included in the final case–control analysis.
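As a rough numerical illustration of this relationship, the sketch below applies the standard selection-bias algebra for a 2×2 table: if exposed and unexposed cases are selected with probabilities α and β, and exposed and unexposed controls with probabilities γ and δ, the observed OR equals the true OR multiplied by (α/β)/(γ/δ). The assignment of α to exposed cases and β to unexposed cases is an assumption (it is immaterial when both are 1), and the true OR of 1.5 is hypothetical. The coverage ratio of 0.93 is the value reported for hormone use, taken here as an estimate of γ/δ.

```python
def bias_factor(coverage_ratio: float, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Multiplicative bias in the observed OR.

    coverage_ratio is gamma/delta: the probability that an exposed control
    is on the sampling frame divided by that of an unexposed control.
    alpha and beta are the case selection probabilities (assumed 1 here,
    i.e. complete case ascertainment).
    """
    if coverage_ratio <= 0:
        raise ValueError("coverage_ratio must be positive")
    return (alpha / beta) / coverage_ratio


def observed_or(true_or: float, coverage_ratio: float) -> float:
    """Observed OR implied by a true OR and a control coverage ratio."""
    return true_or * bias_factor(coverage_ratio)


# Hypothetical true OR of 1.5 with gamma/delta = 0.93.
factor = bias_factor(0.93)
biased = observed_or(1.5, 0.93)
print(f"bias factor = {factor:.3f}, observed OR = {biased:.3f}")
```

A coverage ratio below 1 gives a factor above 1, so the observed OR is pushed upward; a ratio of exactly 1 leaves the OR unchanged.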
Analyses were conducted with SAS version 9.1 (SAS Institute). Among the 2,988 WWHS cases, we modeled probabilities of cases being on the sampling frame provided in 2006 by the WDOT on the basis of demographics (e.g., age, education, and income), lifestyle characteristics (e.g., physical activity, obesity, and alcohol intake), and medical factors (e.g., cancer screening and comorbidities). Multivariable-adjusted SPRs were estimated by fitting a generalized linear model with the log link function, Poisson distribution, and robust error variance to the WWHS cases (
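For orientation only, the sketch below computes an unadjusted coverage ratio with a log-scale Wald confidence interval from a 2×2 table of counts. This is a deliberately simpler stand-in, not the multivariable log-link Poisson model with robust variance used in the analysis, and the counts are hypothetical.

```python
import math
from typing import Tuple


def coverage_ratio_ci(
    covered_exposed: int,
    n_exposed: int,
    covered_unexposed: int,
    n_unexposed: int,
    z: float = 1.96,
) -> Tuple[float, float, float]:
    """Unadjusted ratio of coverage probabilities with a Wald CI.

    Returns (ratio, lower, upper). The standard error is the usual
    delta-method value for the log of a ratio of two proportions.
    """
    p1 = covered_exposed / n_exposed
    p0 = covered_unexposed / n_unexposed
    ratio = p1 / p0
    se_log = math.sqrt((1 - p1) / covered_exposed + (1 - p0) / covered_unexposed)
    lower = math.exp(math.log(ratio) - z * se_log)
    upper = math.exp(math.log(ratio) + z * se_log)
    return ratio, lower, upper


# Hypothetical counts: 600 of 1,000 exposed cases and 1,400 of 2,000
# unexposed cases are found on the sampling frame.
ratio, lower, upper = coverage_ratio_ci(600, 1000, 1400, 2000)
print(f"ratio = {ratio:.2f} ({lower:.2f}, {upper:.2f})")
```

Unlike the regression model, this calculation does not adjust for other characteristics, so it would only approximate an adjusted SPR when confounding by the other covariates is negligible.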
The approach here to obtain ORs for specific risk factors and sampling ratios for coverage probabilities from separate regression analyses does not fulfill the criterion of the same weighting across adjustment variables or strata. Hence, the sampling ratios cannot be used to directly correct the ORs (
The WDOT master file of licensed drivers for November 2006 included 3,018,192 records. This master file was linked with the 2,988 breast cancer cases that participated in the WWHS. A total of 2,026 (67.8%) of the WWHS cases were found to have a matching record on the WDOT master file. Of these 2,026 cases, 1,477 (72.9%) had an exact match based on gender, last name, first name, date of birth, and zip code. An additional 391 cases (19.3%) were matched on the basis of gender, last name, first name, and date of birth, and 66 (3.3%) were matched on gender, last name, and date of birth. Manual review, aided by weighting schemes based on those used by the National Death Index, matched 92 (4.5%) additional cases.
Awareness of selection bias in specific studies and study design types is important for both researchers and policy makers in public health. In case–control studies, selection bias may be compounded as cases and controls are often taken from different sampling frames but assumed to represent the same study base. Hence, selection is associated with disease outcome, and if this selection is also associated with specific risk factors, bias may result.
This study used breast cancer cases from an interview-based case–control study to examine coverage error of the WDOT sampling frame used to select controls from the population. This approach assumes that factors associated with coverage are similar in both case and control groups. In this database, 31% of cases (N = 926) found in the WDOT master file renewed their driver’s license after being diagnosed with cancer. For those cases, we were unable to determine their opt-out status before diagnosis. However, removing those 31% of cases did not change the factors determined to be associated with "opt out". In addition, to argue that the determinants of opting out of the WDOT sampling frame are the same in this population of cases as in the general population, one would have to assume that nonparticipants have the same determinants of opting out as the cases that took part in this study, and that these determinants have the same magnitude of association with opting out. While this may be a valid assumption, more research is needed on the cases that did not participate in the WWHS study to evaluate whether the opt-out determinants remain consistent.
This research indicates that breast cancer cases under 40 years of age are more likely to "opt out". Bias due to this finding in breast cancer research would be expected to be minimal with approximately 6% of all incident breast cancer cases occurring in women under the age of 40 (
Of interest to breast cancer research were the findings that parity, income, and the use of hormone replacement therapy were associated with coverage on the sampling frame used to select controls. Women with higher parity, lower income, and never users of postmenopausal hormones are also less likely to develop breast cancer. Using a sampling frame that has fewer women with low parity, fewer women with high incomes, and fewer women who have ever used hormones to select controls may bias the results of a case–control study when those risk factors are included as primary exposures or as covariates. By assuming the same SPRs for cases and controls, we estimate control SPRs < 1.0 for lower parity, higher income, and use of hormone replacement therapy. Control SPRs < 1.0 would create observed ORs that are numerically higher than the truth (positive bias) when evaluating these factors in case–control studies that use driver’s license master files to identify controls. For example, although there is widespread agreement that postmenopausal estrogen–progesterone therapy is a risk factor for the development of breast cancer, an OR from a case–control study that selects controls from a driver’s license master file allowing opt out for privacy reasons would likely be overestimated. For conditions other than breast cancer, these factors (income, postmenopausal hormone use, and parity) are also likely associated with socioeconomic variables.
One simple option for reducing selection bias may be to exclude cases not on the sampling frame used to select the controls. It is often assumed that this exclusion ensures that the study bases for cases and controls are comparable. However, this option requires one of the assumptions that we made in our investigation: that factors responsible for opting out are the same between cases and controls. Selection bias will remain even after excluding cases that could not be approached to serve as controls if this assumption is not met. In addition, the exclusion of potentially up to 32% of cases would be wasteful in terms of statistical power and precision. Besides using SPRs to approximate the selection bias due to inadequate sampling frame coverage, a researcher could calculate the predicted probability of coverage for each control and use inverse probability weighting or propensity scores to adjust for selection bias (
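The inverse probability weighting idea mentioned above can be sketched as follows. This is an illustrative toy calculation under stated assumptions, not a recommended analysis: all cases are assumed to be included, each control's predicted coverage probability is assumed known (in practice it would come from a fitted coverage model), and every number is hypothetical.

```python
from typing import List, Tuple


def weighted_odds_ratio(
    case_exposures: List[int],
    controls: List[Tuple[int, float]],
) -> float:
    """Exposure OR with controls weighted by 1 / coverage probability.

    case_exposures: 0/1 exposure indicators, one per case.
    controls: (exposure indicator, predicted coverage probability) pairs.
    """
    a = sum(case_exposures)                           # exposed cases
    b = len(case_exposures) - a                       # unexposed cases
    c = sum(1 / p for e, p in controls if e == 1)     # weighted exposed controls
    d = sum(1 / p for e, p in controls if e == 0)     # weighted unexposed controls
    return (a * d) / (b * c)


# Hypothetical data: 40 exposed and 60 unexposed cases; sampled controls
# include 18 exposed (coverage 0.6) and 56 unexposed (coverage 0.8).
case_exposures = [1] * 40 + [0] * 60
controls = [(1, 0.6)] * 18 + [(0, 0.8)] * 56

adjusted = weighted_odds_ratio(case_exposures, controls)
# Setting every weight to 1 reproduces the unweighted (naive) OR.
naive = weighted_odds_ratio(case_exposures, [(e, 1.0) for e, _ in controls])
print(f"naive OR = {naive:.2f}, weighted OR = {adjusted:.2f}")
```

In this toy example the exposed controls are under-covered, so the unweighted OR is larger than the weighted one, matching the direction of bias expected when the control coverage ratio is below 1.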
Investigators should evaluate the comparability of each sampling frame’s study base when designing the study. When coverage of the study base differs between cases and controls, investigators should calculate the expected direction of bias indicated by the SPR for each exposure of interest and use one of the established correction methods to adjust results whenever a new exposure of interest is evaluated. A previous study (
Also, this study assumed that the case sampling frame, the Wisconsin Cancer Reporting System, was 100% complete. This allows for a simplified calculation of SPRs based on setting α and β to 1. There appears to be regional variation in reporting due to privacy concerns of neighboring states (
This study and linkage procedures have some limitations. In addition to the assumptions discussed earlier, some errors in linkage between WWHS cases and the WDOT master list of licensed drivers may have occurred. Linkage procedures could partially explain the observed association between marital status and coverage. The first 3 linkage procedures used last name and address when merging the driver’s license list with WWHS case data. Last name and/or address often change when marital status changes. However, these data are normally updated when a license is renewed (every 7 years in Wisconsin). Also, linkages focusing on date of birth, aided by various weighting schemes, were evaluated to reduce misclassification. The WWHS parent study obtained a master list of licensed drivers in 2004 and 2006. However, due to cost constraints only the 2006 master file was prepared for linkage. Cases that were interviewed in 2004 may have moved away, passed away or had a disease progression that would explain absence from the 2006 master list of licensed drivers. However, the 2-year emigration and mortality rates are low for these women. Additional linkage errors probably resulted in nondifferential misclassification bias, resulting in attenuated SPRs.
Our results indicate the potential for selection bias due to differential opt out between various demographic and behavioral subgroups of controls. The potential for bias due to inadequate coverage of the study base will increase if more individuals opt out of inclusion in common sampling frames used in cancer research. All current control ascertainment schemes, including random digit dialing, have coverage issues, and as response rates and participation rates decline, understanding the effects of sampling frames that do not fully enumerate the study base will be critical to establishing the validity of study results.
No potential conflicts of interest were disclosed.
The authors thank Leonelo Bautista, Nora Cate Schaeffer, Paul Peppard, John Hampton, Julie McGregor, and Laura Stephenson for study assistance provided.
This work was supported by NIH grants CA47147, CA67264, and CD000712.
Effects of differential sampling frame coverage of controls by exposure in a case–control study where the sampling frame for cases is assumed to have full case reporting. (a) In this study, under the assumption that opting out patterns are similar for cases and controls, we illustrate the likely magnitude of γ/δ via the corresponding ratio of probabilities of a case being on the drivers' license file used for controls. (b) True OR assuming no systematic or random error.
SPRs for driver's license sampling frame coverage for breast cancer cases in the WWHS, 2004–2008
| Characteristic | Percent of cases | SPR (95% CI) |
|---|---|---|
| Age, y | ||
| 20–39 | 6.2 | 1 (reference) |
| 40–69 | 93.8 | 1.12 (1.00–1.26) |
| Marital status at diagnosis | ||
| Single, never married | 5.4 | 1 (reference) |
| Married | 77.6 | 0.90 (0.79–1.01) |
| Living with partner | 2.2 | 0.71 (0.55–0.91) |
| Divorced, separated | 9.1 | 0.83 (0.72–0.95) |
| Widowed | 4.7 | 0.89 (0.76–1.04) |
| Income | ||
| <$50,000 | 49.2 | 1 (reference) |
| ≥$50,000 | 39.7 | 0.94 (0.89–1.00) |
| Missing | 11.1 | 0.89 (0.81–0.98) |
| Postmenopausal hormone use | ||
| Never | 66.0 | 1 (reference) |
| Ever | 33.2 | 0.93 (0.88–0.98) |
| Ever use of antidepressant medication | ||
| No | 68.3 | 1 (reference) |
| Yes | 31.7 | 0.95 (0.93–0.98) |
| Race | ||
| White, non-Hispanic | 95.1 | 1 (reference) |
| Other | 3.4 | 1.06 (0.94–1.20) |
| Education | ||
| College degree or more | 40.6 | 1 (reference) |
| Some college | 26.3 | 1.01 (0.94–1.08) |
| High school or less | 33.1 | 1.02 (0.96–1.09) |
| Parity | ||
| ≥3 | 36.5 | 1 (reference) |
| 2 | 36.7 | 0.94 (0.89–1.00) |
| 1 | 11.7 | 0.86 (0.78–0.94) |
| 0 | 14.3 | 0.87 (0.80–0.96) |
Not all categories sum to 100% due to missing values.
SPRs and 95% CIs were derived by fitting a generalized linear model with the log link function.
Adjusted for parity, education, race, antidepressant medication use, hormone use, income, marital status at diagnosis, and age (over/under 40).