The Foodborne Diseases Active Surveillance Network (FoodNet) is currently using a negative binomial regression model to estimate temporal changes in the incidence of

The Foodborne Diseases Active Surveillance Network (FoodNet) is a collaboration among the Centers for Disease Control and Prevention (CDC), 10 state health departments, the U.S. Department of Agriculture’s Food Safety and Inspection Service (USDA-FSIS), and the Food and Drug Administration (FDA). FoodNet conducts active, population-based surveillance for laboratory-confirmed infections caused by nine bacterial and parasitic pathogens commonly transmitted through food. The FoodNet surveillance area includes the full states of Connecticut, Georgia, Maryland, Minnesota, New Mexico, Oregon, and Tennessee, and selected counties in California, Colorado, and New York. One aim of FoodNet is to track changes over time in the incidence of these nine pathogens. FoodNet currently uses a negative binomial regression model to estimate temporal changes (

The FoodNet model is applied to data aggregated by year and FoodNet site to account for the growth of the surveillance area from 5 sites in 1996 to 10 sites in 2004 and to adjust for site-to-site variation in incidence. This level of aggregation limits our ability to explore variation in incidence across smaller geographic areas or units of time, or by demographic features of individual cases, such as patients’ age and sex, all of which have been shown to describe unique characteristics of

Zero-augmented models consist of two separate components: one that models case counts (using a negative binomial distribution) and one that models the proportion of zeros (using a binomial distribution). Zero-inflated and hurdle models differ in whether the count component can yield a count of zero. Zero-inflated models assume that zeros can be either structural zeros or observational zeros; zeros are therefore generated by both the binary and count components, and the model has an additional mixing parameter not present in hurdle models. Hurdle models assume that all zeros are structural zeros; they model zeros only in the binary component and use a conditionally specified negative binomial distribution that is truncated to begin at a count of one (
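The distinction between the two zero-augmented structures can be made concrete with their probability mass functions. The following is a minimal sketch, not the code used in this analysis; the function names and the parameterization (mean `mu`, dispersion `theta`, structural-zero probability `pi`, hurdle zero probability `p_zero`) are assumptions chosen for illustration.

```python
import math

def nb_pmf(y, mu, theta):
    """Negative binomial pmf with mean mu and variance mu + mu**2 / theta."""
    log_p = (math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
             + theta * math.log(theta / (theta + mu))
             + y * math.log(mu / (theta + mu)))
    return math.exp(log_p)

def zinb_pmf(y, mu, theta, pi):
    """Zero-inflated NB: a structural zero with probability pi,
    otherwise a draw from the NB (which can itself be zero)."""
    base = (1.0 - pi) * nb_pmf(y, mu, theta)
    return pi + base if y == 0 else base

def hurdle_nb_pmf(y, mu, theta, p_zero):
    """Hurdle NB: zeros come only from the binary part;
    positive counts follow an NB truncated to start at one."""
    if y == 0:
        return p_zero
    truncated = nb_pmf(y, mu, theta) / (1.0 - nb_pmf(0, mu, theta))
    return (1.0 - p_zero) * truncated

# Illustrative (hypothetical) parameter values
print(zinb_pmf(0, 0.43, 0.5, 0.2), hurdle_nb_pmf(0, 0.43, 0.5, 0.8))
```

In the zero-inflated form, the probability of a zero is a mixture of `pi` and the NB zero probability; in the hurdle form it is `p_zero` alone, which is why only the zero-inflated model carries a mixing parameter.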

Consequently, zero-augmented models, hurdle and zero-inflated, may be useful to model

We examined zero-augmented modifications (zero-inflated, hurdle) of the regression model used by FoodNet to estimate changes over time and added predictors to account for additional sources of variation in incidence. We hypothesized that modeling structural zeros and including demographic variables would improve the fit of FoodNet’s

Data were available for 48088 cases of

Case-patients were classified by age group [Age_Group: less than 5 (1), 5–17 (2), 18–24 (3), 25–44 (4), 45–64 (5), and 65+ (6) years of age] using categories that were used in previous FoodNet publications and that represent different life stages: preschool age, school age, college age, younger working age, older working age, and retirement age (

The distribution and basic statistics of case counts and incidence were examined for all subgroups. The annual observed county incidences were divided into quartiles. The quartiles were used to construct choropleth maps in which counties were shaded by incidence quartile using QGIS version 1.8.0 (
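The quartile classification behind such maps can be sketched as follows. This is an illustration only: the helper names are hypothetical, the example incidences are made up, and the "inclusive" quartile method is an assumption, since the text does not state which quartile definition was used.

```python
import statistics

def incidence_per_100k(cases, population):
    """Crude incidence per 100,000 population."""
    return 100000.0 * cases / population

def quartile_class(values):
    """Assign each value to a quartile class 1-4 using sample quartile cut points."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    classes = []
    for v in values:
        if v <= q1:
            classes.append(1)
        elif v <= q2:
            classes.append(2)
        elif v <= q3:
            classes.append(3)
        else:
            classes.append(4)
    return classes

# Hypothetical county incidences per 100,000 (not FoodNet data)
county_incidence = [0, 5, 10, 15, 20, 25, 30, 35]
print(quartile_class(county_incidence))
```

Each class would then map to one shade in the choropleth.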

The data were evaluated for overdispersion by comparing the overall mean and variance of case counts for each subgroup (

The first model was a negative binomial (NB) model that included Year and State as nominal categorical predictors. Season, Age_Group, and Sex were added as categorical predictors to produce the next model (NB.Plus). To focus on the mixture difference between the zero-inflated (ZINB) and hurdle (Hurdle NB) models and to facilitate comparison, these models were first built with no covariates in the component that models the proportion of zeros. Zero-inflated negative binomial and hurdle models were then fitted using forward selection. Forward selection was used rather than backward elimination because the saturated models did not converge or were overfit. Variables were added one at a time to each model component separately, and the significant variables were used in the final combined models (ZINB Full, Hurdle NB Full) (

The zero-augmented and non-zero-augmented models were estimated by a maximum likelihood algorithm. The Akaike information criterion (AIC), Bayesian information criterion (BIC), and −2 log-likelihood were computed for comparison. The BIC-corrected Vuong test was used to compare the fit of non-nested models and the likelihood ratio test was used to compare the fit of nested models (
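The comparison statistics named above can be sketched in a few lines. This is a hedged illustration rather than the software used in the analysis: the function names are hypothetical, the Vuong statistic is written in its usual form with a Schwarz (BIC) correction, and the standard deviation uses the n − 1 denominator as an assumption. The AIC check uses the magnitude of the −2 log-likelihood and the parameter count reported for the Hurdle NB model in the goodness-of-fit table.

```python
import math

def aic(neg2_loglik, k):
    """AIC from -2 log-likelihood and number of estimated parameters k."""
    return neg2_loglik + 2 * k

def bic(neg2_loglik, k, n):
    """BIC from -2 log-likelihood, parameter count k, and sample size n."""
    return neg2_loglik + k * math.log(n)

def vuong_bic(loglik1, loglik2, k1, k2):
    """BIC-corrected Vuong z statistic from per-observation log-likelihoods.
    Positive values favour model 1; negative values favour model 2."""
    n = len(loglik1)
    m = [a - b for a, b in zip(loglik1, loglik2)]
    mean_m = sum(m) / n
    sd_m = math.sqrt(sum((x - mean_m) ** 2 for x in m) / (n - 1))
    correction = (k1 - k2) * math.log(n) / 2.0
    return (sum(m) - correction) / (math.sqrt(n) * sd_m)

# Reported values for the Hurdle NB model: |-2 log-likelihood| = 115539, 25 parameters
print(aic(115539, 25))

# Hypothetical per-observation log-likelihoods for two non-nested models
ll1 = [-1.0, -0.5, -2.0, -0.7]
ll2 = [-1.2, -0.9, -1.8, -1.1]
print(vuong_bic(ll1, ll2, 3, 2))
```

Swapping the two models flips the sign of the Vuong statistic, which is why the pairwise comparison table is antisymmetric.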

Models were assessed by evaluating the mean absolute error using leave-one-out cross-validation (
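The cross-validation loop itself is simple to sketch. In the example below, an intercept-only mean predictor stands in for the regression models, which would be refitted at each step in the actual analysis; the function names and the toy counts are assumptions for illustration.

```python
def loocv_mae(y, fit, predict):
    """Leave-one-out cross-validated mean absolute error.
    fit(train) returns a model; predict(model) returns a predicted count."""
    errors = []
    for i in range(len(y)):
        train = y[:i] + y[i + 1:]          # drop observation i
        model = fit(train)                 # refit without it
        errors.append(abs(y[i] - predict(model)))
    return sum(errors) / len(errors)

def fit_mean(train):
    """Stand-in model: the mean of the training counts."""
    return sum(train) / len(train)

def predict_mean(model):
    return model

# Hypothetical case counts (not FoodNet data)
counts = [0, 0, 1, 3]
print(loocv_mae(counts, fit_mean, predict_mean))
```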

On average 5027 (±SD 300) cases of

To provide a visual representation of geographic variation in incidence among counties, quartiles of annual county level incidence were mapped for Minnesota, Georgia, New Mexico, and Oregon as examples (

Variance (1.71) and mean (0.43) of all the
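The reported variance and mean can be turned into a quick overdispersion check. The sketch below is illustrative only: the method-of-moments dispersion is a pooled approximation, not the dispersion estimated by the regression models, and the function names are hypothetical.

```python
import math

def dispersion_ratio(mean, variance):
    """Variance-to-mean ratio; a Poisson distribution has ratio 1."""
    return variance / mean

def nb_theta_from_moments(mean, variance):
    """Method-of-moments NB dispersion, from variance = mean + mean**2 / theta."""
    if variance <= mean:
        raise ValueError("no overdispersion: variance must exceed the mean")
    return mean ** 2 / (variance - mean)

def poisson_zero_prob(mean):
    return math.exp(-mean)

def nb_zero_prob(mean, theta):
    return (theta / (theta + mean)) ** theta

mean, variance = 0.43, 1.71          # values reported in the text
ratio = dispersion_ratio(mean, variance)
theta = nb_theta_from_moments(mean, variance)
print(ratio, theta)
print(poisson_zero_prob(mean), nb_zero_prob(mean, theta))
```

A ratio well above 1 indicates overdispersion. Under these assumptions the negative binomial also assigns a noticeably higher probability to zero than a Poisson distribution with the same mean, so a high share of zeros does not by itself imply zero inflation.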

All variables included in the non-zero-augmented models (NB, NB.Plus), both count and zero portions of the Hurdle models, and the count portion of the ZINB model were statistically significant predictors in the models. The ZINB Full was not included in the model comparison because none of the variables added by forward step selection were significant in the binary portion of the model. The individual model results are shown in

The count components of all models (NB, NB.Plus, Hurdle NB, Hurdle NB Full, ZINB) had similar results in terms of coefficient direction, magnitude, and significance. However, Tennessee, year 2010, and age group 65+ were significant in the NB, NB.Plus and the ZINB models but not in the count components of the Hurdle NB and Hurdle NB Full models. The other difference was that the age group that includes 45–64 year olds was significant in the count component of the Hurdle NB and Hurdle NB Full models but not in the count component of the ZINB model.

The zero-component intercepts in the zero-augmented models all had large negative coefficient values, which does not support the idea of zero inflation in the data. This is further supported by the goodness of fit evaluations summarized in

Adding the demographic variables to the non-augmented model (NB versus NB.Plus) decreased the mean absolute error by 0.0249. Zero augmentation of the NB.Plus model increased the mean absolute error, by 2.726e-6 for the ZINB model and by 0.0165 for the Hurdle NB model. There were 72918 zero case counts in the dataset, and the hurdle models predicted exactly this number. After rounding the predicted number of zeros to the nearest integer, both the ZINB and NB.Plus models predicted 73403 zeros, 485 more than the observed number of zero counts. The hurdle models were superior at predicting zero counts because of their truncated structure.
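The difference in predicted zeros follows from how each model produces them. The sketch below is an illustration under stated assumptions, not the analysis code: the counts, fitted means, and dispersion are made up, and the hurdle's binary component is reduced to an intercept-only model, for which the maximum likelihood estimate of the zero probability is the observed zero fraction. With a logit link and an intercept, the same equality of fitted and observed zero totals holds when covariates are added.

```python
def nb_zero_prob(mu, theta):
    """Probability of a zero count under an NB with mean mu and dispersion theta."""
    return (theta / (theta + mu)) ** theta

def expected_zeros_nb(mus, theta):
    """Expected number of zeros: sum of per-observation NB zero probabilities."""
    return sum(nb_zero_prob(mu, theta) for mu in mus)

def expected_zeros_hurdle(counts):
    """Intercept-only hurdle: the fitted zero probability is the observed
    zero fraction, so the expected number of zeros equals the observed number."""
    n = len(counts)
    p_zero = sum(1 for c in counts if c == 0) / n
    return n * p_zero

# Hypothetical counts and fitted NB means (not FoodNet data)
counts = [0, 0, 0, 1, 2, 0, 5, 0]
fitted_means = [0.2, 0.3, 0.1, 0.9, 1.5, 0.4, 3.0, 0.2]
print(expected_zeros_hurdle(counts), expected_zeros_nb(fitted_means, 0.5))
```

The count-based models have no such constraint, so their expected number of zeros can differ from the observed number.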

The aim of this analysis was to explore different methods to analyze campylobacteriosis case counts ascertained by FoodNet surveillance sites at a finer geographic level, to evaluate the effect on incidence of covariates that may vary geographically, and examine the characteristics of zero counts in FoodNet

Zero-augmented modifications (zero-inflated, hurdle) of the regression models were used to examine a possible separation of observational and structural zeros. We anticipated that a significant proportion of zero case counts were observational; differences in county size and population demographics among the FoodNet surveillance sites result in very small subpopulation sizes among counties and a high probability that no cases will be observed among many counties. Our finding that the hurdle models did not fit the data well supports this assumption. Although we hypothesized that several surveillance and epidemiologic factors may contribute to structural zeros in the data, our analysis suggests that zero inflation is not apparent at the level of disaggregation of demographic covariates we studied; this finding is supported by the observation that inclusion of zero-augmentation mixing fractions did not improve the models’ fit.

Although zero inflation was not present in the dataset, zero-augmented modeling techniques are likely to be important for future analyses including modeling of other pathogens under FoodNet surveillance. Our models included only data ascertained by FoodNet active surveillance activities, and it is likely that inclusion of data from sites conducting passive surveillance, as well as data obtained from other sources, such as household income and access to healthcare, would contribute to the presence of structural zeros in the modeled data. The differences in data collection associated with different surveillance systems and data sources would likely result in excess zero case counts where at least a portion (structural zeros) arise from a process different from the positive counts. Although both hurdle and zero-inflated models may be used to model this type of data, it is likely best modeled by a zero-inflated model because the zeros are modeled as a mixture of both observational and structural zeros.

We removed the California observations because there were no zero case counts in any county subgroup, complicating our exploration of models for zero case counts. Removal of the California data eliminated convergence issues and allowed exploration of the effect of zero inflation. Removing the California data decreased the dataset’s variance but overdispersion was still prominent. A negative binomial distribution helped in modeling the overdispersed data; however, there were still case counts that were outside the expected distribution. These case counts may be associated with undetected outbreaks (i.e., clusters of cases originating from a common exposure) which were not excluded from the analysis. Further exploration of these outliers, using compound distributions, would help better characterize them and might yield more information on risk factors of potential outbreaks (

The addition of the demographic and seasonal variables when modeling


Observed county incidence per 100000 in A) Minnesota, B) Georgia, C) New Mexico and D) Oregon in 2011. Counties are shaded based on the quartiles of county annual incidence per 100000.

Count frequency of

Residual boxplot of negative binomial model with demographic covariates (NB.Plus)

Goodness of fit and statistics comparison by model

Model 2 (rows) vs. Model 1 (columns) | Test | Hurdle NB | NB | Hurdle NB Full | ZINB | NB.Plus |
---|---|---|---|---|---|---|
M0 | LR | | | | | |
M0 | V (BIC) | 79.0, | 79.9, | 91.5, | 89.5, | 89.5, |
Hurdle NB | LR | | | | | |
Hurdle NB | V (BIC) | | 9.2, | 37.6, | 40.4, | 40.5, |
NB | LR | | | | | |
NB | V (BIC) | (−9.2), | | 29.6, | 33.4, | 33.5, |
Hurdle NB Full | LR | | | | | |
Hurdle NB Full | V (BIC) | (−37.6), | (−29.6), | | 3.7, 0.0001 | 3.9, 5.1e-5 |
ZINB | LR | | | | | |
ZINB | V (BIC) | (−40.4), | (−33.4), | (−3.7), 0.0001 | | 174.2, |
NB.Plus | LR | | | | | |
NB.Plus | V (BIC) | (−40.5), | (−33.5), | (−3.9), 5.1e-5 | (−174.2), | |
−2 × log likelihood | | −115539 | −114403 | −109482 | −109525 | −109525 |
Number of parameters estimated | | 25 | 17 | 47 | 25 | 24 |
AIC | | 115589 | 114437 | 109576 | 109575 | 109573 |
BIC | | 115825 | 114597 | 110019 | 109811 | 109799 |
MAE | | 0.3963 | 0.4046 | 0.3809 | 0.3798 | 0.3798 |
Predicted no. zeros | | 72918 | 73540 | 72918 | 73403 | 73403 |

Models are listed from left to right and top to bottom as their fits improve;

Hurdle NB = Hurdle negative binomial with covariates in the count component only, NB = Negative binomial without demographic covariates, Hurdle NB Full = hurdle negative binomial with covariates in both zero and count components, ZINB = Zero-inflated negative binomial with covariates in the count component only, NB.Plus = Negative binomial with demographic covariates;

Null model; LR= Likelihood ratio test; V (BIC) = Vuong BIC corrected Non-Nested Hypothesis Test-Statistic;

= p-value less than 2.2e-16 when testing model 1 versus model2 with alpha < 0.05;

Number of parameters estimated; AIC =