Count data are often collected in chronic disease research, and sometimes these data have a skewed distribution. The number of unhealthy days reported in the Behavioral Risk Factor Surveillance System (BRFSS) is an example of such data: most respondents report zero days. Studies have either categorized the Healthy Days measure or used linear regression models. We used alternative regression models for these count data and examined the effect on statistical inference.
Using responses from participants aged 35 years or older from 12 states that included a homeownership question in their 2009 BRFSS, we compared 5 multivariate regression models (logistic, linear, Poisson, negative binomial, and zero-inflated negative binomial) with respect to 1) how well the modeled data fit the observed data and 2) how model selections affect inferences.
Most respondents (66.8%) reported zero mentally unhealthy days. The distribution was highly skewed (variance = 58.7, mean = 3.3 d). Zero-inflated negative binomial regression provided the best-fitting model, followed by negative binomial regression. A significant independent association between homeownership and number of mentally unhealthy days was not found in the logistic, linear, or Poisson regression model but was found in the negative binomial model. The zero-inflated negative binomial model showed that homeowners had 24% higher odds than nonowners of having excess zero mentally unhealthy days (adjusted odds ratio, 1.24; 95% confidence interval, 1.08–1.43), but it did not show an association between homeownership and the number of unhealthy days.
Our comparison of regression models indicates the importance of examining data distribution and selecting models with appropriate assumptions. Otherwise, statistical inferences might be misleading.
Medscape, LLC is pleased to provide online continuing medical education (CME) for this journal article, allowing clinicians the opportunity to earn CME credit.
This activity has been planned and implemented in accordance with the Essential Areas and policies of the Accreditation Council for Continuing Medical Education through the joint sponsorship of Medscape, LLC and Preventing Chronic Disease. Medscape, LLC is accredited by the ACCME to provide continuing medical education for physicians.
Medscape, LLC designates this Journal-based CME activity for a maximum of 1
All other clinicians completing this activity will be issued a certificate of participation. To participate in this journal CME activity: (1) review the learning objectives and author disclosures; (2) study the education content; (3) take the posttest with a 70% minimum passing score and complete the evaluation at
Upon completion of this activity, participants will be able to:
Distinguish characteristics of different tools for data analysis
Analyze how data regarding self-reported health can be skewed in the Behavioral Risk Factor Surveillance System (BRFSS) survey
Evaluate results of different evaluation tools on count data from the BRFSS survey
Ellen Taratus, Editor,
Charles P. Vega, MD, Associate Professor and Residency Director, Department of Family Medicine, University of California, Irvine. Disclosure: Charles P. Vega, MD, has disclosed no relevant financial relationships.
Disclosures: Hong Zhou, Paul Z. Siegel, Rashid S. Njai, Charlotte Kent, Youlian Liao, William W. Thompson, and John Barile have disclosed no relevant financial relationships.
Hong Zhou, MS, MPH, Division of Health Informatics and Surveillance, Center for Surveillance, Epidemiology and Laboratory Services, Centers for Disease Control and Prevention, Atlanta, Georgia. Paul Z. Siegel, MD, MPH; Rashid S. Njai, PhD; Charlotte Kent, PhD; and Youlian Liao, MD, National Center for Chronic Disease Prevention and Health Promotion, Centers for Disease Control and Prevention, Atlanta, Georgia. William W. Thompson, PhD, National Center on Birth Defects and Developmental Disabilities, Centers for Disease Control and Prevention, Atlanta, Georgia. John Barile, PhD, Department of Psychology, University of Hawaii at Manoa, Manoa, Hawaii.
Researchers of chronic disease often gather data that are measured on a continuum rather than as a “present–absent” or “yes–no” dichotomy. Examples include the following: episodes of a symptom; number of sick days, cigarettes smoked, or alcoholic drinks consumed; measures of health care use, such as number of doctor visits or days of hospitalization; and costs incurred (in dollars). Such measures are referred to as “count” data; that is, the observations can have only nonnegative integer values (0, 1, 2, 3, . . . ). Such data are most often gathered during a specified period of time (eg, the past month or year). For some of these measures, most study participants may have a zero count (eg, no episode of a symptom, no cigarettes smoked, no use of health care services). These data are typically not normally distributed, and the positive skew in their distribution cannot be resolved by data transformation. The Centers for Disease Control and Prevention’s (CDC’s) health-related quality of life (HRQOL) Healthy Days measure (
The Behavioral Risk Factor Surveillance System (BRFSS) questionnaire includes an HRQOL section composed of 3 questions related to respondents’ healthy days. These questions ask respondents to report the number of days in the previous 30 days when 1) their physical health was not good, 2) their mental health was not good, and 3) poor physical or mental health kept them from doing their usual activities.
Two simple and familiar methods have often been used to analyze Healthy Days data. The first categorizes the data into 2 (eg, ≥14 vs <14 d) (
The second most common method of analyzing the association between various risk factors and the number of reported physically and mentally unhealthy days uses linear regression models and keeps the outcome in its original scale of 0 to 30 days.
This study used data from the 12 states that included a question on homeownership in their 2009 BRFSS to examine the independent relationship between homeownership and number of mentally unhealthy days. Studies have shown that homeownership is associated with several health outcomes.
BRFSS is a state-based system of annual health surveys.
The independent variable for this study was homeownership, based on the following question in the BRFSS: “Do you own or rent your home?” The response options are own, rent, or other arrangement (such as group home or staying with friends or family without paying rent). We classified respondents who rented a home or lived by other arrangement as nonhomeowners. The outcome measure was the number of days reported by respondents to the question: “Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?” Covariates included age, sex, race/ethnicity, education, household income, marital status, household size, and employment status. The 2009 BRFSS questionnaire is available at
There were 68,258 adults aged 18 years or older who responded to both the homeownership and mentally unhealthy days questions in the 12 states. We limited the analysis to the 60,113 people aged 35 years or older, because those younger than 35 were unlikely to own a home. We excluded 550 (0.9%) people who had missing data for any of these covariates: education, marital status, household size, and employment status. People with missing data on household income (n = 6,582; weighted percentage, 7.5%) were classified as a separate category (“unknown”) and were not excluded from the analysis. The analyzed sample included 59,563 adults (22,568 men and 36,995 women).
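The recoding and exclusion rules described above can be sketched in a few lines of Python. This is an illustration only, not the authors' Stata code: the function names are hypothetical, and the sketch assumes household income is available as a dollar amount (or missing), which may not match how the survey records it. The 14-day cut point is the one used for the logistic model in this study.

```python
from typing import Optional


def is_homeowner(response: str) -> bool:
    """'own' -> homeowner; 'rent' or 'other arrangement' -> nonhomeowner."""
    return response.strip().lower() == "own"


def income_category(income: Optional[float]) -> str:
    """Keep missing income as its own 'unknown' category instead of excluding it."""
    if income is None:
        return "unknown"
    if income < 25_000:
        return "<25,000"
    if income < 50_000:
        return "25,000 to <50,000"
    return ">=50,000"


def at_least_14_days(unhealthy_days: int) -> bool:
    """Dichotomized outcome for the logistic model (>=14 d vs <14 d)."""
    if not 0 <= unhealthy_days <= 30:
        raise ValueError("unhealthy days must be between 0 and 30")
    return unhealthy_days >= 14


# Sample flow reported in the text: respondents aged 35 or older, minus
# those with missing education, marital status, household size, or employment.
aged_35_or_older = 60_113
excluded_missing_covariates = 550
analyzed_sample = aged_35_or_older - excluded_missing_covariates
```

Keeping "unknown" as an income category retains the 6,582 respondents with missing income in the analysis, which is the choice the text describes.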
We first examined the distribution of mentally unhealthy days, including the frequency of zero, mean, median, skew, and variance. We then examined the associations between homeownership and number of mentally unhealthy days by using 5 models: logistic, linear, Poisson, negative binomial, and zero-inflated negative binomial regression.
For each model, we plotted the sample (observed) percentage distribution of the number of unhealthy days (from 0 to 30) against the distribution predicted by the model. If the percentage distribution predicted by a model closely matched the observed distribution in the plot, the model was considered a good fit to the data.
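The comparison above was visual. As a rough numeric companion, the sketch below sums the absolute gaps between observed and predicted percentages. The summary measure and the names `total_abs_gap`, `observed`, and `predicted` are assumptions made for illustration, not the authors' method, and for brevity only the categories 0 to 3 days are included, using the percentages reported for the figures in this article.

```python
from typing import Dict, List


def total_abs_gap(observed: List[float], predicted: List[float]) -> float:
    """Sum of absolute differences between two percentage distributions."""
    if len(observed) != len(predicted):
        raise ValueError("distributions must cover the same categories")
    return sum(abs(o - p) for o, p in zip(observed, predicted))


# Percentages for 0, 1, 2, and 3 mentally unhealthy days (subset only).
observed = [66.78, 3.54, 5.89, 3.07]
predicted: Dict[str, List[float]] = {
    "Poisson": [8.06, 16.04, 18.53, 16.48],
    "Negative binomial": [65.52, 8.38, 4.52, 3.06],
    "Zero-inflated negative binomial": [68.22, 4.06, 3.16, 2.61],
}

gaps = {name: total_abs_gap(observed, pred) for name, pred in predicted.items()}
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: {gap:.2f} percentage points")
```

A smaller total indicates a predicted distribution closer to the observed one over these categories; a different subset or a different distance measure could rank closely matched models differently.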
In the modeling, we simultaneously adjusted for age (35–44, 45–54, 55–64, and ≥65 y), sex, race/ethnicity (non-Hispanic white, non-Hispanic black, Hispanic, and all others), education level (less than high school, high school graduate to <4 y of college, and ≥4 y of college), household income (<$25,000, $25,000 to <$50,000, ≥$50,000, and unknown), marital status (married, divorced/widowed/separated, and never married), household size (1 or 2, 3 or 4, 5 or 6, and ≥7), and employment status (employed, unemployed, homemaker, retired, and unable to work). In the univariate analyses, all of these covariates were significantly associated with homeownership and with the number of mentally unhealthy days. We therefore treated these covariates as confounders of the relation between homeownership and number of unhealthy days and included them in our multivariate models.
We used Stata version 12 (StataCorp LP, College Station, Texas) to perform all statistical analyses and take into account the complex sampling design of the survey.
Among adults aged 35 years or older, about four-fifths (79.3%) owned a home.
Characteristic  Unweighted Sample Size  Weighted % (95% CI)
Age, y
35–44  9,034  26.8 (26.0–27.7)
45–54  13,997  27.7 (26.9–28.6)
55–64  15,281  21.8 (21.1–22.5)
≥65  21,251  23.7 (23.0–24.3)
Sex
Male  22,568  47.7 (46.8–48.6)
Female  36,995  52.3 (51.4–53.2)
Race/ethnicity
Non-Hispanic white  43,901  66.8 (65.8–67.8)
Non-Hispanic black  6,008  8.8 (8.3–9.3)
Hispanic  3,399  15.4 (14.5–16.3)
Other  6,255  9.0 (8.4–9.6)
Education
<High school  5,575  11.6 (10.9–12.4)
High school graduate to <4 y of college  34,130  51.0 (50.1–51.9)
≥4 y of college  19,858  37.4 (36.5–38.2)
Household income, $
<25,000  15,262  22.6 (21.8–23.4)
25,000 to <50,000  15,006  22.7 (21.9–23.4)
≥50,000  22,713  47.2 (46.3–48.1)
Unknown  6,582  7.5 (7.2–7.9)
Marital status
Married  34,624  68.9 (68.1–69.7)
Divorced, widowed, or separated  19,373  21.1 (20.5–21.8)
Never married  5,566  10.0 (9.4–10.6)
Household size
1 or 2  18,104  14.4 (14.0–14.8)
3 or 4  31,618  52.5 (51.6–53.4)
5 or 6  8,346  26.4 (25.6–27.3)
7 or more  1,495  6.7 (6.0–7.4)
Employment status
Employed  29,110  56.3 (55.4–57.1)
Unemployed  2,879  7.0 (6.5–7.6)
Homemaker  4,259  8.2 (7.7–8.7)
Retired  18,785  21.6 (21.0–22.3)
Unable to work  4,530  6.9 (6.5–7.4)
Homeownership
Own  49,574  79.3 (78.5–80.2)
Do not own  9,989  20.7 (19.8–21.5)
No. of mentally unhealthy days
0  42,029  66.8 (65.9–67.6)
1–10  11,285  22.2 (21.5–23.0)
11–20  2,587  5.0 (4.6–5.4)
21–30  3,662  6.0 (5.6–6.5)
Data are from 12 states: Alabama, Arkansas, California, Hawaii, Illinois, Kansas, Louisiana, Nebraska, New Mexico, Oklahoma, South Carolina, and Wisconsin. Percentages and 95% confidence intervals are weighted.
The logistic regression analysis found no significant association.
Regression Model  Parameter Estimate  Standard Error  P Value
Model 1: Logistic (≥14 d vs <14 d)  −0.139  (0.113)  .22
Model 2: Linear  −0.456  (0.257)  .08
Model 3: Poisson  −0.085  (0.059)  .15
Model 4: Negative binomial  −0.137  (0.065)  .04
Model 5: Zero-inflated negative binomial
Zero-inflated component  0.216  (0.072)  .003
Negative binomial component  −0.011  (0.050)  .83
Nonhomeowner is the reference group in all models. All models included the following covariates: age groups, sex, race/ethnicity, education, household income, marital status, household size, and employment status.
Alabama, Arkansas, California, Hawaii, Illinois, Kansas, Louisiana, Nebraska, New Mexico, Oklahoma, South Carolina, and Wisconsin.
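The parameter estimates in the table are on the log scale for the logistic, count, and zero-inflation models, so exponentiating them gives ratio measures. The sketch below, which assumes a normal-approximation 95% confidence interval (estimate ± 1.96 × standard error; the interval method used with the survey design may differ), converts the zero-inflated component to an odds ratio and the negative binomial estimate to a ratio of mean unhealthy days. The function name `ratio_with_ci` is hypothetical.

```python
import math
from typing import Tuple


def ratio_with_ci(estimate: float, std_error: float, z: float = 1.96) -> Tuple[float, float, float]:
    """Exponentiate a log-scale coefficient and its approximate 95% CI bounds."""
    point = math.exp(estimate)
    lower = math.exp(estimate - z * std_error)
    upper = math.exp(estimate + z * std_error)
    return point, lower, upper


# Zero-inflated component of Model 5 (homeowner vs nonhomeowner).
or_zero, or_low, or_high = ratio_with_ci(0.216, 0.072)
print(f"Odds ratio for excess zeros: {or_zero:.2f} ({or_low:.2f}-{or_high:.2f})")

# Model 4, negative binomial: ratio of expected counts of unhealthy days.
rate_ratio, rr_low, rr_high = ratio_with_ci(-0.137, 0.065)
print(f"Ratio of mean unhealthy days: {rate_ratio:.2f} ({rr_low:.2f}-{rr_high:.2f})")
```

The first result reproduces the adjusted odds ratio of 1.24 (1.08–1.43) reported for the zero-inflated component.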
Both linear and Poisson regression models underestimated the percentage of nonoccurrence (0 days) and overestimated the percentage in the category 1 to 9 days.
Comparison of the observed percentage distribution of number of mentally unhealthy days and the percentage distribution predicted by the multivariate linear and Poisson regression models. Data were obtained from the 2009 Behavioral Risk Factor Surveillance System in 12 states.
No. of Mentally Unhealthy Days  Observed, %  Linear, %  Poisson, %
0  66.78  1.37  8.06
1  3.54  3.37  16.04
2  5.89  7.25  18.53
3  3.07  20.09  16.48
4  1.51  24.74  12.66
5  3.51  15.55  8.99
6  0.38  11.53  6.16
7  1.21  7.49  4.19
8  0.28  5.17  2.86
9  0.06  2.59  1.97
10  2.78  0.67  1.36
11  0.01  0.20  0.93
12  0.23  —  0.63
13  0.01  —  0.42
14  0.39  —  0.27
15  2.75  —  0.17
16  0.01  —  0.10
17  0.00  —  0.06
18  0.01  —  0.04
19  0.00  —  0.02
20  1.57  —  0.01
21  0.07  —  0.01
22  0.02  —  0.00
23  0.02  —  0.00
24  0.01  —  0.00
25  0.45  —  0.00
26  0.09  —  0.00
27  0.09  —  0.00
28  0.11  —  0.00
29  0.11  —  0.00
30  5.02  —  0.00
Negative binomial regression resulted in a better fit of the data than did either linear or Poisson regression.
Comparison of the observed percentage distribution of number of mentally unhealthy days and the percentage distribution predicted by the negative binomial and zeroinflated negative binomial models. Data were obtained from the 2009 Behavioral Risk Factor Surveillance System in 12 states.
No. of Mentally Unhealthy Days  Observed, %  Negative Binomial, %  Zero-Inflated Negative Binomial, %
0  66.78  65.52  68.22
1  3.54  8.38  4.06
2  5.89  4.52  3.16
3  3.07  3.06  2.61
4  1.51  2.28  2.22
5  3.51  1.80  1.92
6  0.38  1.47  1.67
7  1.21  1.23  1.48
8  0.28  1.04  1.31
9  0.06  0.90  1.17
10  2.78  0.79  1.04
11  0.01  0.69  0.94
12  0.23  0.62  0.84
13  0.01  0.55  0.76
14  0.39  0.49  0.69
15  2.75  0.45  0.63
16  0.01  0.40  0.57
17  0.00  0.37  0.52
18  0.01  0.34  0.47
19  0.00  0.31  0.43
20  1.57  0.28  0.39
21  0.07  0.26  0.36
22  0.02  0.24  0.33
23  0.02  0.22  0.30
24  0.01  0.21  0.28
25  0.45  0.19  0.26
26  0.09  0.18  0.24
27  0.09  0.17  0.22
28  0.11  0.15  0.20
29  0.11  0.14  0.19
30  5.02  0.13  0.17
The zero-inflated negative binomial regression provided a better fit of the data than did negative binomial regression.
Using the association between homeownership and CDC’s Healthy Days measure as an example, we demonstrated how the choice of model can influence statistical inference, that is, the process of drawing conclusions from empirical data. We did not find an independent association between homeownership and number of mentally unhealthy days with the logistic, linear, or Poisson regression models. The negative binomial model showed that homeowners had a modestly but significantly lower number of unhealthy days than nonhomeowners. The zero-inflated negative binomial model indicated an association between homeownership and whether individuals reported any mentally unhealthy days but not between homeownership and the number of unhealthy days.
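The two-part result from the zero-inflated negative binomial model is easier to read with the model's structure in view: with some probability an observation is an "excess" zero, and otherwise it is drawn from an ordinary negative binomial count distribution. The sketch below writes out that mixture under a common mean-and-dispersion parameterization of the negative binomial. The names `nb_pmf`, `zinb_pmf`, `pi_zero`, `mu`, and `alpha` are illustrative assumptions, not the authors' notation, and the sketch ignores covariates and the 30-day ceiling of the outcome.

```python
import math


def nb_pmf(y: int, mu: float, alpha: float) -> float:
    """Negative binomial probability of count y, with mean mu and
    variance mu + alpha * mu**2 (requires mu > 0 and alpha > 0)."""
    r = 1.0 / alpha          # shape parameter
    p = r / (r + mu)         # "success" probability
    log_pmf = (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(p) + y * math.log(1.0 - p)
    )
    return math.exp(log_pmf)


def zinb_pmf(y: int, pi_zero: float, mu: float, alpha: float) -> float:
    """Zero-inflated negative binomial: an excess zero with probability
    pi_zero, otherwise a negative binomial count."""
    count_part = (1.0 - pi_zero) * nb_pmf(y, mu, alpha)
    return pi_zero + count_part if y == 0 else count_part


# Illustrative parameter values only (not estimates from this study).
probabilities = [zinb_pmf(y, pi_zero=0.3, mu=2.0, alpha=0.5) for y in range(500)]
print(f"P(Y = 0) = {probabilities[0]:.3f}")
```

In this structure, a covariate such as homeownership can shift `pi_zero` (the zero-inflated component) without shifting `mu` (the negative binomial component), which is the pattern reported above.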
We found that a zero-inflated negative binomial model fit the observed number of mentally unhealthy days reported in BRFSS data better than any of the other models we tested. Although Poisson regression is designed for count data, it assumes that the variance equals the mean and therefore could not accommodate the overdispersion in these data (variance, 58.7; mean, 3.3 d). Overdispersion may result in misleading inferences about regression parameters.
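A back-of-the-envelope check using only the summary statistics reported in this article illustrates the overdispersion problem. The sketch below is unadjusted (no covariates) and uses a method-of-moments dispersion estimate, both of which are simplifying assumptions, so its numbers are not expected to match the fitted multivariate models.

```python
import math

mean_days = 3.3        # reported mean number of mentally unhealthy days
variance_days = 58.7   # reported variance
observed_zero_share = 0.668  # 66.8% of respondents reported zero days

# A Poisson distribution has variance equal to its mean, so a ratio far
# above 1 signals overdispersion.
dispersion_ratio = variance_days / mean_days

# Share of zeros implied by a Poisson distribution with the same mean.
poisson_zero_share = math.exp(-mean_days)

# Negative binomial with variance mu + alpha * mu**2: solve for alpha from
# the reported moments, then compute its implied share of zeros.
alpha = (variance_days - mean_days) / mean_days ** 2
nb_zero_share = (1.0 + alpha * mean_days) ** (-1.0 / alpha)

print(f"Variance-to-mean ratio: {dispersion_ratio:.1f}")
print(f"Zeros implied by Poisson: {poisson_zero_share:.1%}")
print(f"Zeros implied by negative binomial: {nb_zero_share:.1%}")
print(f"Zeros observed: {observed_zero_share:.1%}")
```

Under these assumptions the Poisson distribution implies far fewer zeros than were observed, whereas the extra dispersion parameter of the negative binomial brings the implied share of zeros much closer to the data.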
The main purpose of this data analysis was not to establish or affirm the “true” relationships between homeownership and number of mentally unhealthy days. We applied various models to BRFSS Healthy Days data as an example to illustrate the importance of appropriate model selection. The study has several limitations. First, it was based on self-reported data from the 12 states that elected to include the social context module in their 2009 BRFSS. Second, the survey was conducted through telephone interviews; people without telephones and those who used only cell phones were excluded, and these people may be less likely to be homeowners. Third, the BRFSS is a cross-sectional survey: information on the outcome measure (number of mentally unhealthy days) and on the characteristics of the respondents (eg, homeownership) was assessed at a single point in time. Hence, determining whether the characteristics preceded or followed the outcomes was not possible.
Any statistical inference requires some assumptions, and incorrect assumptions can invalidate statistical inference.
We believe that when the assumptions of analytic techniques are carefully matched to the nature of the data distribution, the results will be more accurate and compelling. False results can mislead researchers, the public, and policy makers and are potentially detrimental to public health. The selection of data analytic techniques is not a trivial statistical matter. Using appropriate analytic procedures will maximize the accuracy and utility of the findings on factors that are of great importance in clinical, policy, and fiscal decisions.
We have received no funding for this study. At the time of the research, Hong Zhou was affiliated with the Division of Community Health, National Center for Chronic Disease Prevention and Health Promotion, Centers for Disease Control and Prevention.
The opinions expressed by authors contributing to this journal do not necessarily reflect the opinions of the U.S. Department of Health and Human Services, the Public Health Service, the Centers for Disease Control and Prevention, or the authors' affiliated institutions.
To obtain credit, you should first read the journal article. After reading the article, you should be able to answer the following, related, multiple-choice questions. To complete the questions (with a minimum 70% passing score) and earn continuing medical education (CME) credit, please go to
Which of the following statements regarding different models of data analysis is
Logistic regression evaluates data on a continuum of the complete scale of values
Ordinary least-squares linear regression is the primary model for count data
Poisson regression is the simplest model for count data
Zero-inflated negative binomial regression cannot allow for count data that are skewed
What is the most common answer from patients regarding the number of poor health days per month on the Behavioral Risk Factor Surveillance System (BRFSS) survey?
0
6
10
14
Which of the following statements regarding the results of different data analysis tools is
The Poisson regression analysis correctly predicted that 3% of participants had no mentally unhealthy days
Linear and Poisson regression models overestimated the percentage with no mentally unhealthy days and underestimated the proportion of participants with 1 to 9 unhealthy days
Home ownership failed to affect the percentage of mentally unhealthy disease days in all study analyses
The zero-inflated negative binomial regression model provided a better fit of the data compared with negative binomial regression