The Behavioral Risk Factor Surveillance System (BRFSS) is commonly used for estimating the prevalence of chronic disease. One limitation of the BRFSS is that valid estimates can only be obtained for states and larger geographic regions. Limited health data are available on the county level and, thus, many have used small-area analysis techniques to estimate the prevalence of disease on the county level using BRFSS data.

This study compared the validity and precision of 4 small-area analysis techniques for estimating the prevalence of 3 chronic diseases (asthma, diabetes, and hypertension) by race on the county level. County-level reference estimates obtained through local data collection were compared with prevalence estimates produced by direct estimation, synthetic estimation, spatial data smoothing, and regression. Discrepancy statistics used were Pearson and Spearman correlation coefficients, mean square error, mean absolute difference, mean relative absolute difference, and rank statistics.

The regression method produced estimates of the prevalence of chronic disease by race on the county level that had the smallest discrepancies for a large number of counties.

Regression is the preferable method when applying small-area analysis techniques to obtain county-level prevalence estimates of chronic disease by race using a single year of BRFSS data.

The Behavioral Risk Factor Surveillance System (BRFSS) collects uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases in adults (

Several statistical procedures for small-area analysis have been developed to help fill the local data void. Small-area analysis is a statistical procedure that provides a better estimate when the sample size for an area is too small or nonexistent. These approaches, as discussed in Jia et al (

The most commonly used methods include direct estimation, synthetic estimation, spatial data smoothing, and regression analysis (

It is unknown which small-area technique produces the most valid and precise results for racial subgroup estimates on the county level, and the validity and precision of the BRFSS for county-level estimation of chronic disease prevalence have not been discussed in the literature. I examined the validity and precision of the BRFSS for estimating the prevalence of disease by racial subgroup on the county level (

I examined the reliability and accuracy of direct estimation, synthetic estimation, spatial smoothing, and regression for small-area analysis. I compared the 2003 BRFSS prevalence estimates produced by each method with county-level reference estimates of asthma, hypertension, and diabetes for non-Hispanic whites and non-Hispanic blacks.

County-level reference estimates were obtained for US counties from publicly available county-level data collected in 2003 (eg, data for New York City come from the New York City Department of Health Community Health Assessment) for which the prevalence of asthma, diabetes, or hypertension was available or could be calculated for non-Hispanic whites and blacks. Most of the prevalence estimates by race and county used in this analysis were available on the Internet; other estimates were obtained by contacting state and county departments of health. For some counties, estimates were not available for all 3 diseases or for both racial subgroups; thus, selection bias is possible. In the 2000 US Census, the percentage of non-Hispanic blacks for those counties included in this analysis varied from 1% to 56% (an average of 9% across counties). Likewise, the percentage of non-Hispanic whites varied from 33% to 96% (an average of 77% across counties). Most counties had a mix of urban and rural areas; 65% of the population in these counties live in urban areas. Seven counties were urban and 10 counties were rural. Because of the variety of geographic locations, demographic composition, and mix of rural, urban, and suburban counties for which county-level estimates were obtained, I believe this analysis is generalizable to US counties not included in this analysis that have similar characteristics.

Prevalence estimates for asthma were obtained from a sequence of 2 BRFSS questions. Survey participants were first asked, "Have you ever been told by a doctor, nurse, or other health professional that you had asthma?" Respondents who answered yes were then asked, "Do you still have asthma?" Respondents who answered yes to both questions were considered to have asthma. The prevalence of diabetes and the prevalence of hypertension were calculated from survey participants' responses to the questions "Have you ever been told by a doctor that you have diabetes?" and "Have you ever been told by a doctor, nurse, or other health professional that you have high blood pressure?" respectively. Respondents who answered yes were then asked, "Was this only when you were pregnant?"; those who answered "yes, but only during pregnancy" were considered not to have chronic diabetes or hypertension for the purpose of this analysis.
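The case definitions above can be sketched as two small functions. This is a minimal illustration only; the function and parameter names are hypothetical and do not correspond to actual BRFSS variable names or response codes.

```python
def has_current_asthma(ever_told_asthma: bool, still_has_asthma: bool) -> bool:
    """Asthma case: 'yes' to both the 'ever told' and the 'still have' question."""
    return ever_told_asthma and still_has_asthma


def has_chronic_condition(ever_told: bool, only_during_pregnancy: bool = False) -> bool:
    """Diabetes or hypertension case: 'yes' to the 'ever told' question,
    excluding respondents who report the condition only during pregnancy."""
    return ever_told and not only_during_pregnancy


# Hypothetical respondents
asthma_case = has_current_asthma(ever_told_asthma=True, still_has_asthma=True)
gestational_only = has_chronic_condition(ever_told=True, only_during_pregnancy=True)
```

Under these definitions, a respondent who once had asthma but no longer does, or who had diabetes or hypertension only during pregnancy, is not counted as a case.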

Direct prevalence estimates for asthma, hypertension, and diabetes were calculated by race and county by using weighted 2003 BRFSS data for counties with more than 50 respondents.
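A direct estimate of this kind is essentially a weighted proportion computed within one race-by-county cell. The sketch below shows only the point estimate and the respondent threshold stated above; the function name and inputs are hypothetical, and it does not reproduce the variance estimation that survey software performs for a complex sampling design.

```python
from typing import Optional, Sequence

MIN_RESPONDENTS = 50  # text above: direct estimates only for more than 50 respondents


def direct_estimate(has_disease: Sequence[bool],
                    weights: Sequence[float]) -> Optional[float]:
    """Weighted prevalence for one race-by-county cell, or None if the
    cell has too few respondents for a direct estimate."""
    if len(has_disease) != len(weights):
        raise ValueError("has_disease and weights must have the same length")
    if len(has_disease) <= MIN_RESPONDENTS:
        return None
    total_weight = sum(weights)
    if total_weight <= 0:
        return None
    # Sum of survey weights among cases divided by the sum of all weights
    case_weight = sum(w for d, w in zip(has_disease, weights) if d)
    return case_weight / total_weight
```

Returning `None` for small cells makes it explicit which race-by-county combinations need one of the other small-area techniques.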

The synthetic estimate for county

Equation 1

where _{ij}

The demographic population estimates (_{ij}_{i.}
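As an illustration of how a synthetic estimate is typically assembled, the sketch below applies group-specific prevalence rates from a larger area to a county's demographic composition. The function name, the demographic groups, and the example numbers are hypothetical, and the use of larger-area rates as the reference is an assumption made for this sketch rather than a detail taken from this study.

```python
from typing import Mapping


def synthetic_estimate(county_pop_by_group: Mapping[str, float],
                       ref_prevalence_by_group: Mapping[str, float]) -> float:
    """Population-weighted average of larger-area prevalence rates, using the
    county's own population counts in each demographic group as weights."""
    total_pop = sum(county_pop_by_group.values())
    if total_pop <= 0:
        raise ValueError("county population must be positive")
    expected_cases = sum(pop * ref_prevalence_by_group[group]
                         for group, pop in county_pop_by_group.items())
    return expected_cases / total_pop


# Hypothetical county composition and hypothetical larger-area rates
county_pop = {"18-44": 600, "45-64": 300, "65+": 100}
ref_rates = {"18-44": 0.05, "45-64": 0.10, "65+": 0.20}
estimate = synthetic_estimate(county_pop, ref_rates)
```

Because the rates come from outside the county, the estimate reflects only the county's demographic mix, not any local departure from the larger-area rates.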

Spatial prevalence estimates were obtained by using the weighted "head-banging" spatial data smoothing algorithm (_{i}_{i}

high screen for county _{i}

low screen for county _{i}

The weights were based on the county population. If the estimated prevalence rate for county
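The role of the high and low screens can be illustrated with a deliberately simplified sketch. It assumes that the pairs of neighboring counties have already been chosen and it omits any further weight-based rules, so it is not the full weighted head-banging algorithm; all names and numbers are hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (prevalence estimate, population weight)


def weighted_median(points: Sequence[Point]) -> float:
    """Smallest value whose cumulative weight reaches half the total weight."""
    ordered = sorted(points)
    half = sum(weight for _, weight in ordered) / 2.0
    running = 0.0
    for value, weight in ordered:
        running += weight
        if running >= half:
            return value
    raise ValueError("at least one point is required")


def smooth_once(value: float,
                neighbor_pairs: Sequence[Tuple[Point, Point]]) -> float:
    """One simplified smoothing pass for a single county."""
    lows = [min(a, b) for a, b in neighbor_pairs]    # lower endpoint of each pair
    highs = [max(a, b) for a, b in neighbor_pairs]   # higher endpoint of each pair
    low_screen = weighted_median(lows)
    high_screen = weighted_median(highs)
    # Median of the three values pulls an outlying estimate toward the screens
    return sorted([low_screen, value, high_screen])[1]


# Hypothetical neighbors: ((estimate, population), (estimate, population))
pairs = [((0.08, 1000.0), (0.12, 2000.0)),
         ((0.10, 1500.0), (0.11, 500.0)),
         ((0.09, 800.0), (0.15, 1200.0))]
smoothed_high = smooth_once(0.30, pairs)
smoothed_mid = smooth_once(0.10, pairs)
```

An estimate that already lies between the two screens is left unchanged, whereas an outlying estimate is moved to the nearer screen; repeating the pass increases the amount of smoothing.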

Multilevel logistic regression models with random effects were used to obtain county prevalence estimates:

Equation 2

logit(p_{ij}) = x_{ij}'β + α_{i}

where x_{ij} = (x_{ij1},...,x_{ijq})' is the vector of covariates, β = (β_{1},...,β_{q})' is the corresponding vector of fixed effects, and α_{i} is the random effect for county i, assumed to be normally distributed with mean 0 and variance σ^{2}. If the random effect term was too small to affect the accuracy of estimated county prevalence rates (<0.001%), to simplify analysis, the random effects were not estimated and were assumed to have a value of 0. Even when the random effect term was assumed to be 0 it was still included in the model to improve estimation for the fixed effects and to ensure correct selection of the variables for inclusion in the model (
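To show how a fitted model of this form turns into a prevalence estimate, the sketch below applies the inverse logit to a linear predictor made up of covariates, fixed effects, and a county random effect. The notation (a covariate vector `x`, fixed effects `beta`, random effect `alpha_i`) and the numbers are assumptions for illustration; the coefficients are not estimates from this study.

```python
import math
from typing import Sequence


def inverse_logit(eta: float) -> float:
    """Map a log-odds value to a probability, avoiding overflow for large |eta|."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def predicted_prevalence(x: Sequence[float],
                         beta: Sequence[float],
                         alpha_i: float = 0.0) -> float:
    """Predicted probability from fixed effects plus a county random effect.
    alpha_i defaults to 0, matching the case where the random effect is
    treated as negligible."""
    if len(x) != len(beta):
        raise ValueError("x and beta must have the same length")
    eta = sum(xk * bk for xk, bk in zip(x, beta)) + alpha_i
    return inverse_logit(eta)


# Hypothetical intercept-only model with and without a county effect
p_fixed_only = predicted_prevalence([1.0], [-2.0])
p_with_county = predicted_prevalence([1.0], [-2.0], alpha_i=0.5)
```

A positive county effect raises the predicted prevalence relative to the fixed-effects-only prediction, and a county effect of 0 leaves it unchanged.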

Analysis was conducted by using SAS/STAT version 9.1 (SAS Institute, Inc, Cary, North Carolina) with SAS-callable Sudaan version 9.0 (RTI, Research Triangle Park, North Carolina) to adjust for the complex sampling design in BRFSS (_{i}_{i}_{i}_{i}

Pearson and Spearman correlation coefficients

Mean square error (MSE):

Mean absolute difference (MAD):

Mean relative absolute difference (MRAD):

Rank statistics (

In each equation,

BRFSS does not identify counties with a population of less than 150,000; these counties were excluded from analysis. The Pearson correlation coefficient, MSE, MAD, and MRAD are parametric statistics that assume normality. For this analysis, I assumed that the errors between the small-area BRFSS estimates and the county-level reference estimates were normally distributed, relying on the central limit theorem; all discrepancy statistics were based on sample sizes greater than 50. Spearman correlation coefficients and rank statistics are provided as nonparametric alternatives in case the normality assumption is violated.

The Pearson and Spearman correlation statistics are numerical representations of scatterplots and provide a more objective way to test the hypothesis that the BRFSS prevalence estimates and county-level estimates are linearly correlated. Ideally, the small-area BRFSS prevalence estimates equal the county-level reference estimates and therefore lie on a straight line at a 45-degree angle. Using the Pearson correlation coefficient and its nonparametric counterpart, the Spearman correlation coefficient, I tested the null hypothesis that the BRFSS estimates and reference estimates are not linearly related. Correlation coefficients close to 1 indicate that BRFSS prevalence estimates and county-level estimates have a high linear correlation and thus that the small-area analysis technique produces valid and precise estimates. Good small-area prevalence estimates have MSE, MAD, MRAD, and rank statistics close to 0, indicating very little discrepancy with county-level estimates.
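The discrepancy statistics can be sketched with the standard library alone. The definitions below are the conventional ones and are an assumption about the exact formulas used here: MSE as the mean squared difference, MAD as the mean absolute difference, and MRAD as the mean absolute difference relative to the reference estimate. The rank statistic is omitted because its definition is not given in this section, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2.0 + 1.0
        for k in order[pos:end + 1]:
            ranks[k] = shared_rank
        pos = end + 1
    return ranks


def discrepancy_stats(estimates: Sequence[float],
                      references: Sequence[float]) -> Dict[str, float]:
    """Compare small-area estimates with reference estimates.
    MRAD divides by the reference, so references must be nonzero."""
    if len(estimates) != len(references) or len(estimates) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    diffs = [e - r for e, r in zip(estimates, references)]
    n = len(diffs)
    return {
        "pearson": pearson(estimates, references),
        # Spearman is the Pearson correlation of the ranks
        "spearman": pearson(average_ranks(estimates), average_ranks(references)),
        "mse": sum(d * d for d in diffs) / n,
        "mad": sum(abs(d) for d in diffs) / n,
        "mrad": sum(abs(d) / r for d, r in zip(diffs, references)) / n,
    }


# Hypothetical estimates for 3 counties
stats = discrepancy_stats([0.10, 0.20, 0.30], [0.10, 0.25, 0.20])
```

Estimates that match the reference values exactly give correlations of 1 and MSE, MAD, and MRAD of 0, which is the ideal case described above.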

Of the 1,937 BRFSS estimates of race by county, 906 (47%) had subgroup sample sizes of less than 50, the minimum needed for direct estimation of prevalence (

For the prevalence of asthma by race, 190 BRFSS prevalence estimates were compared with the corresponding 190 county-level reference estimates. Direct estimation produced the largest discrepancy statistics (

For the prevalence of diabetes by race, 181 county-level reference estimates were compared with the corresponding BRFSS prevalence estimates. Direct estimation had the largest discrepancy. Spatial smoothing ranged from second best to second worst depending on the amount of smoothing (number of times algorithm is repeated). Synthetic estimation performed slightly better than direct estimation and produced significant correlation coefficients. Regression showed significance only in the nonparametric Spearman correlation coefficient. Overall, the regression approach showed the least amount of discrepancy, making it the better small-area analysis technique for estimating the prevalence of diabetes by race on the county level (

For the prevalence of hypertension by race, 182 county-level reference estimates were compared with BRFSS estimates. Direct estimation and spatial smoothing showed the biggest discrepancies (

I examined data for non-Hispanic whites and non-Hispanic blacks because the prevalence of asthma, hypertension, and diabetes was consistently measured for these groups. Consistently measured reference prevalence estimates for other racial/ethnic groups (eg, Asians, Native Americans, Pacific Islanders, Hispanics) were hard to obtain because of small sample sizes. Generalizability of small-area analysis techniques for these subpopulations has not been validated and is an area for future research.

Direct estimation had the largest discrepancies, likely because the BRFSS is not designed to produce subpopulation county-level estimates and subgroup sample sizes on the county level are small. This was especially true for non-Hispanic blacks and demonstrates a major limitation of this technique. Although regression appears to be the best small-area analysis technique, synthetic estimation and spatial smoothing often performed better than regression when no county-level variables were significantly associated with the outcome. Other smoothing methods may be appropriate for this type of analysis, which raises questions about the proper choice of smoothing technique and the appropriate degree of smoothing for estimation. The synthetic method has been used widely in public health practice, likely because of the ease of calculation. However, researchers have also used Bayesian methods and complex regression analysis to produce estimates; a comparison of these approaches may also prove beneficial.

This area of research is limited by the lack of systematic local data collection of chronic disease prevalence by race/ethnicity. Development and refinement of small-area analysis techniques rely heavily on statistically sound reference estimates. It was challenging to obtain county-level reference estimates by race; this was especially true for non-Hispanic blacks because the estimates were often unstable as a result of small sample sizes. Because the reference estimates were limited to publicly available data, selection bias is possible.

Statistically sound local-level estimates of chronic disease by race may improve our ability to address racial/ethnic disparities in chronic disease using evidence-based public health. Small-area analysis can provide reliable county-level estimates for the prevalence of chronic disease by race using BRFSS data when a county has few respondents. The BRFSS is a probability sample of US households with a telephone. Telephone coverage varies by state and subpopulation, which raises issues of selection bias in BRFSS data collection. Despite its limitations, the BRFSS remains the best available source of health data for substate estimation.

This study and the work of Dr Goodman were supported by the Robert Wood Johnson Foundation New Connections Program.

Discrepancy Statistics Comparing 2003 Behavioral Risk Factor Surveillance System (BRFSS) Estimates and 2003 County-Level Estimates for Prevalence of Asthma

Statistic | Direct | Spatial Smoothing | Synthetic | Regression |
---|---|---|---|---|
Pearson correlation coefficient | 0.0250 | −0.0272 | 0.7624 | 0.8277 |
Spearman correlation coefficient | −0.0413 | 0.0394 | 0.6820 | 0.7721 |
Mean square error (MSE) | 0.2768 | 0.3044 | 0.1529 | 0.1496 |
Mean absolute difference (MAD) | 0.4443 | 0.4688 | 0.3557 | 0.3451 |
Mean relative absolute difference (MRAD) | 3.9961 | 3.8929 | 0.6481 | 0.5278 |
Rank statistics | 0.3141 | 0.3463 | 0.2372 | 0.1471 |

Correlation coefficients close to 1 indicate that BRFSS prevalence estimates and county-level estimates have a high linear correlation, thus producing valid and precise estimates. MSE, MAD, MRAD, and rank statistics close to 0 indicate little discrepancy with county-level estimates.

Discrepancy Statistics Comparing 2003 Behavioral Risk Factor Surveillance System (BRFSS) Estimates and 2003 County-Level Estimates for Prevalence of Diabetes

Statistic | Direct | Spatial Smoothing | Synthetic | Regression |
---|---|---|---|---|
Pearson correlation coefficient | 0.0515 | 0.1096 | 0.0541 | 0.1328 |
Spearman correlation coefficient | 0.1291 | 0.1506 | 0.2068 | 0.2309 |
Mean square error (MSE) | 0.0121 | 0.0527 | 0.0083 | 0.0020 |
Mean absolute difference (MAD) | 0.0655 | 0.2396 | 0.0563 | 0.0351 |
Mean relative absolute difference (MRAD) | 0.8819 | 2.0876 | 0.6075 | 0.5554 |
Rank statistics | 0.0872 | 0.1622 | 0.0688 | 0.0178 |

Correlation coefficients close to 1 indicate that BRFSS prevalence estimates and county-level estimates have a high linear correlation, thus producing valid and precise estimates. MSE, MAD, MRAD, and rank statistics close to 0 indicate little discrepancy with county-level estimates.

Discrepancy Statistics Comparing 2003 Behavioral Risk Factor Surveillance System (BRFSS) Estimates and 2003 County-Level Estimates for Prevalence of Hypertension

Statistic | Direct | Spatial Smoothing | Synthetic | Regression |
---|---|---|---|---|
Pearson correlation coefficient | 0.0599 | 0.1984 | −0.0525 | 0.0573 |
Spearman correlation coefficient | 0.0913 | 0.1294 | −0.0731 | 0.2153 |
Mean square error (MSE) | 0.0466 | 0.1046 | 0.0386 | 0.0315 |
Mean absolute difference (MAD) | 0.1382 | 0.2396 | 0.1987 | 0.1654 |
Mean relative absolute difference (MRAD) | 0.4809 | 0.8327 | 0.2720 | 0.2067 |
Rank statistics | 0.1805 | 0.2601 | 0.0965 | 0.0535 |

Correlation coefficients close to 1 indicate that BRFSS prevalence estimates and county-level estimates have a high linear correlation, thus producing valid and precise estimates. MSE, MAD, MRAD, and rank statistics close to 0 indicate little discrepancy with county-level estimates.