Twenty years of data provide valuable insights for the design of large automated outbreak detection systems.

Outbreak detection systems for use with very large multiple surveillance databases must be suited both to the data available and to the requirements of full automation. To inform the development of more effective outbreak detection algorithms, we analyzed 20 years of data (1991–2011) from a large laboratory surveillance database used for outbreak detection in England and Wales. The data relate to 3,303 distinct types of infectious pathogens, with a frequency range spanning 6 orders of magnitude. Several hundred organism types were reported each week. We describe the diversity of seasonal patterns, trends, artifacts, and extra-Poisson variability to which an effective multiple laboratory-based outbreak detection system must adjust. We provide empirical information to guide the selection of simple statistical models for automated surveillance of multiple organisms, in the light of the key requirements of such outbreak detection systems, namely, robustness, flexibility, and sensitivity.

The past decade has witnessed much interest in real-time outbreak detection methods for infectious diseases, driven by worries about the possibility of large-scale bioterrorism, public concern about emerging and reemerging infections, and the increased availability of computerized data (

In England and Wales, automated laboratory surveillance of infectious diseases has been undertaken since the early 1990s. Laboratory surveillance is based on counts of laboratory isolates of infectious pathogens, usually classified for epidemiologic purposes by subtype or phage type. The organism reports come mainly from samples sent to hospital laboratories or to specialist laboratories when additional typing is required, as for salmonellae.

This automated system was designed to supplement the frontline investigator-led outbreak detection methods used by national and regional epidemiologists, with the primary aim of identifying geographically distributed outbreaks that may have escaped local detection. In a typical week, several hundred different pathogens are reported; the automated system provides a back-up and the assurance that the entire database is routinely scanned. The output comprises a short list of organisms with potential outbreaks for review, ranked according to an exceedance score that measures the degree of statistical aberrance. The statistical methodology of the system was described previously (

Much research on statistical methods of prospective outbreak detection has been aimed at identifying unusual clusters of 1 syndrome or disease (

We are reviewing the statistical methods used in the England and Wales system. The first stage of this review, reported here, has been to carry out a detailed analysis of the data accumulated over the 2 decades since 1991. We aimed to document some of the generic features of surveillance data and their imperfections across the range of organisms of interest and to identify the key problems confronting automated outbreak detection systems. Specifically, we endeavored to answer 2 key questions: How diverse are the patterns displayed by the range of organisms monitored? How complex must a statistical algorithm be to handle this diversity?

The data were provided by the Health Protection Agency (

The outbreak detection system runs automatically every weekend, processing the previous week’s reports. Thus, the time unit of analysis is by the week unless otherwise specified. We obtained weekly counts of all infectious disease organisms reported to the Health Protection Agency between week 1, 1991, and week 52, 2011, by date of report and date of specimen collection. In years with 53 weeks, the week 53 count was added to the week 1 count of the following year. To mitigate the effect of delays at the end of the series, only isolates with specimen dates through week 26 of 2011 were used in the analyses. All analyses are by week of specimen collection unless otherwise specified.
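The week-53 adjustment described above can be sketched as follows (a minimal illustration; the function name and the `(year, week) -> count` data layout are ours, not the surveillance system's):

```python
from collections import defaultdict

def fold_week53(weekly_counts):
    """Fold week-53 counts into week 1 of the following year, so every
    year contributes exactly 52 weekly counts.

    weekly_counts: dict mapping (year, week) -> count (hypothetical layout).
    """
    folded = defaultdict(int)
    for (year, week), count in weekly_counts.items():
        if week == 53:
            folded[(year + 1, 1)] += count   # add to week 1 of next year
        else:
            folded[(year, week)] += count
    return dict(folded)
```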

Calculating rates and other organism-specific statistics is complicated by the fact that it is not possible to distinguish between genuine zeroes, corresponding to organisms looked for but not found, and missing values that arise when organisms are not sought. It is highly likely that some organisms that were identified toward the end of the study period would not have been identified by the tests that were performed a decade or so earlier. Rates and trends calculated without taking any account of this feature would be biased. To reduce this bias, we recoded all leading sequences of zeroes as missing. However, this in turn introduces a selection bias, because every time series would then start with a nonzero count. To mitigate this, we reduced the first nonzero count by 1.
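The recoding scheme above can be sketched as follows (illustrative only, not the authors' code; NaN marks a missing value):

```python
import math

def recode_leading_zeros(counts):
    """Recode the leading run of zeros as missing (NaN), then reduce the
    first nonzero count by 1 to offset the selection bias introduced by
    forcing every series to start with a nonzero count."""
    out = [float(c) for c in counts]
    i = 0
    while i < len(out) and out[i] == 0:
        out[i] = math.nan   # organism presumably not yet sought
        i += 1
    if i < len(out):
        out[i] -= 1         # bias adjustment for the forced nonzero start
    return out
```

Zeros occurring after the first nonzero count are left untouched, since those are plausibly genuine "looked for but not found" weeks.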

The statistical models are described informally; a technical account is provided in the

log(average count in week t) = long-term trend component in week t + seasonal component for the month containing week t

We fitted a range of such log-linear models to the data, incorporating a smooth long-term trend component and monthly seasonality for each series of organism counts (

Variance of count in week t = dispersion × average count in week t (equation 1)

Models of this form are called quasi-Poisson. The dispersion in equation 1 is a constant specific to each organism. In a Poisson model, the dispersion equals 1; a dispersion >1 allows greater variability than the Poisson model.
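A numerical sketch of this model class: a Poisson log-linear fit by iteratively reweighted least squares, with a Pearson estimate of the dispersion constant of equation 1. This is our own minimal illustration, not the system's implementation; the design matrix here assumes an intercept, a linear trend, and month indicators, as in the models described above.

```python
import numpy as np

def design_matrix(weeks, months):
    """Intercept, linear trend, and 11 month dummies (January baseline)."""
    X = np.zeros((len(weeks), 13))
    X[:, 0] = 1.0
    X[:, 1] = weeks
    for i, m in enumerate(months):
        if m > 1:
            X[i, m] = 1.0          # columns 2..12 hold months 2..12
    return X

def fit_poisson_glm(X, y, n_iter=25):
    """Poisson regression via iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -30, 30)   # guard against overflow
        mu = np.exp(eta)
        z = eta + (y - mu) / mu            # working response
        XtW = X.T * mu                     # Poisson working weights are mu
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta

def pearson_dispersion(X, y, beta):
    """Pearson chi-square / df: estimates the dispersion of equation 1."""
    mu = np.exp(X @ beta)
    return np.sum((y - mu) ** 2 / mu) / (len(y) - X.shape[1])
```

For genuinely Poisson data the Pearson dispersion estimate is close to 1; values well above 1 signal the extra-Poisson variability discussed below.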

We also investigated the negative binomial model. This model satisfies equation 1 but also allows a greater degree of skewness (that is, asymmetry around the mean) than the Poisson model. For the negative binomial model,

log(skewness of count in week t) = constant − 0.5 × log(average count in week t) (equation 2)

where the constant in equation 2 is nonnegative. For the Poisson model, this constant is zero; a positive constant allows greater positive skewness than the Poisson model.

We sought a simple family of models that adequately describes all organisms, rather than a well-fitting model for any particular organism. Formal goodness-of-fit tests were not used because they can be unreliable with sparse data. Our criterion was that the relationships between mean, variance, and skewness should be adequately described. To display these relationships for each organism, we subdivided the data into 41 half-years, dropped weeks 52 (or 53) and 1 (which are atypical, as noted above), and de-seasonalized the data. We then calculated the mean, variance, and skewness in each half-year.

For each organism, we investigated the validity of equation 1 by plotting the log of the variance of the weekly counts against the log of the average weekly count in the 41 half-years. If equation 1 holds, the points should lie on a straight line with slope 1. We obtained the histogram of these slopes; a narrow spread around 1 suggests that the quasi-Poisson model is adequate.

Similarly, we investigated the validity of equation 2 by plotting the skewness of the weekly counts against the log of the average weekly counts. If equation 2 holds, the points should lie on the curve determined by this equation, for which the coefficient of the log of the average weekly count is −0.5. We obtained these coefficients and plotted their histogram; a narrow spread around −0.5 suggests that the negative binomial model is adequate.
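The half-yearly diagnostics of equations 1 and 2 can be sketched as follows (illustrative; `block=26` weeks approximates a half-year, and the slope functions correspond to the two checks just described):

```python
import numpy as np

def block_moments(counts, block=26):
    """Mean, variance, and sample skewness of weekly counts within
    consecutive 26-week blocks (an approximate half-year)."""
    means, variances, skews = [], [], []
    x = np.asarray(counts, dtype=float)
    for start in range(0, len(x) - block + 1, block):
        b = x[start:start + block]
        m, v = b.mean(), b.var(ddof=1)
        if v > 0:
            means.append(m)
            variances.append(v)
            skews.append(((b - m) ** 3).mean() / v ** 1.5)
    return np.array(means), np.array(variances), np.array(skews)

def variance_slope(means, variances):
    """Slope of log(variance) on log(mean); equation 1 implies slope 1."""
    return np.polyfit(np.log(means), np.log(variances), 1)[0]

def skewness_slope(means, skews):
    """Coefficient of log(mean) in a fit to log(skewness); equation 2
    implies -0.5. Blocks with nonpositive skewness estimates are dropped."""
    keep = skews > 0
    return np.polyfit(np.log(means[keep]), np.log(skews[keep]), 1)[0]
```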

We present the results in 5 subsections: global features of the surveillance system; frequency distributions; means, seasonality and trends; dispersion; and relationships between mean, variance, and skewness. Additional details are available in the online Technical Appendix.

More than 9 million individual isolates were collected with specimen dates from week 1 in 1991 through week 26 in 2011. These isolates were of 3,303 different organism types.

Weekly counts of organisms by date of specimen collection, England and Wales, 1991–2011: A) isolates; B) organism types.

The strong upward trends shown in

On average, the weekly count of isolates is the same, whether ordered by week of specimen or by week of report; this is also true of organism counts. The variation in the differences in counts reflects the variability in delays from specimen collection to reporting, which can be considerable (

The distribution of delays between date of specimen and date of report varies from organism to organism, with the median typically in the range of 7–28 days, depending on the complexity of the laboratory procedures involved. For example, modal delays for salmonellae are increased by the additional subtyping step required. Extreme delays are not uncommon, owing to late submissions or data entry errors (
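Per-organism delay summaries of this kind can be computed with a sketch like the following (hypothetical data layout: paired lists of `datetime.date` values):

```python
from datetime import date
from statistics import median

def median_delay_days(specimen_dates, report_dates):
    """Median specimen-to-report delay in days for one organism."""
    return median((r - s).days for s, r in zip(specimen_dates, report_dates))
```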

There is huge variation in frequency, seasonality, and trends among the 3,303 organism types reported.

Weekly counts for 6 selected organisms, by date of specimen collection, England and Wales, 1991–2011.

Organism name | Mean weekly count |
---|---|
 | 1,480 |
 | 899 |
 | 764 |
Clostridium difficile toxin detection | 313 |
Rotavirus | 303 |
 | 267 |
 | 167 |
 | 119 |
Herpes simplex virus untyped | 102 |
Herpes simplex virus type 2 | 100 |
 | 96 |
Herpes simplex virus type 1 | 92 |
 | 86 |
 | 86 |
 | 81 |
Norovirus | 76 |
 | 57 |
 | 54 |
 | 52 |
 | 50 |
 | 49 |
Adenovirus untyped | 43 |
 | 36 |
 | 35 |
 | 34 |
Cytomegalovirus | 29 |
 | 26 |
 | 21 |
 | 19 |
 | 14 |

This variation in the number of nonzero counts is mirrored by the maximum weekly count for each organism. For 90% of all organisms, the weekly maximum was

The large increase in numbers of organisms reported over time (

Distributions of mean (A) and SD (B) of weekly counts for all organisms, England and Wales, 1991–2011.

We fitted log-linear models to the 2,254 organisms for which nonzero counts spanned >1 year. The distribution of slope parameters for linear trend is shown in

A) Distribution of estimated linear trend parameters (units: per week) for data on 2,250 organisms (excluding 4 organisms with extreme slopes), England and Wales, 1991–2011. B) Stacked bar chart of modal seasonal period for 2,254 organisms. The black bar sections represent organisms for which the seasonal effect is statistically significant.

Some organisms displayed evidence of nonconstant seasonality. Rotavirus, for example, which typically peaks in the early months of the year, had slightly earlier peaks in the earlier years of data collection.

For 1,333 (59%) organisms, the dispersion (that is, the ratio of variance to mean, equation 1) is >1, indicating that the variability of weekly counts of that organism is greater than that of a Poisson distribution. There is a general tendency for the dispersion to increase with the mean: the more common the organism, the less appropriate a Poisson model tends to be (

In some cases, a contributing factor to the extra variation is large systematic variation in diagnostic practice, resulting in large variations in reporting intensity, notably, long runs of zeroes, as with

Relationships between mean, variance, and skewness were investigated for the 1,001 organisms with dispersion >1 for which nonzero means and variance were obtained for

Relationships between mean and variance for data on organisms collected, England and Wales, 1991–2011. A) The log of variance plotted against log of mean for

For 538 (54%) of these 1,001 organisms, the slope of the best-fit line is significantly different from 1, the value corresponding to the quasi-Poisson model, and in 535 of these the slope is >1 (the exceptions are

Most organisms, other than the most common, displayed a degree of positive skewness, that is, long upper tails. The plots of skewness against log(mean), though often broadly exponential, showed more scatter than those of log(variance) against log(mean).

Relationships between mean and skewness for data on organisms collected, England and Wales, 1991–2011. A) Skewness plotted against log of mean for

For 486 (49%) of the 1,001 organisms, the slope parameter (on the log scale) is significantly different from −0.5, the value corresponding to the negative binomial. For 475 of these it was greater than −0.5. Again, departures from this reference value were moderate, most slope parameters lying between −0.6 and 0, as shown in

These results indicate that the quasi-Poisson model provides an adequate, though far from perfect, account of the week-to-week variability in organism counts for the broad range of organisms considered. The negative binomial model may also provide an adequate representation of these highly heterogeneous data; because this model accounts for the skewness in the data, which the quasi-Poisson model does not, it may provide more accurate threshold values, above which counts are declared to be aberrant.

We have undertaken a detailed analysis of the global features of a large surveillance database accumulated during >20 years. Most striking is the variety of temporal patterns, in terms of frequencies, trends, and seasonality. Some valuable general conclusions emerge of direct relevance to the design of outbreak detection systems.

The first stems from the great variation in organism frequency, which stretches over 6 orders of magnitude (from 10^{−3} to 10^{3} per week). The sensitivity and specificity of the detection system should remain broadly constant over this range, so that the system performs well for both rare and common organisms. The primary output from a multiple outbreak detection system is likely to be a ranking of aberrances in decreasing order of the statistical evidence underpinning them. The correctness of the ordering is arguably more important than achieving nominal sensitivity and specificity levels, so that attention is focused on the most discrepant organisms. In practice, this means that outbreak detection methods used with multiple surveillance systems must perform robustly and consistently over the range of frequencies expected (or a large part of this range).
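The ranking-first output described above amounts to a simple sort of per-organism scores (a toy sketch with hypothetical names; computing the exceedance scores themselves is outside this illustration):

```python
def weekly_shortlist(exceedance_scores, top=20):
    """Return organisms ranked by decreasing exceedance score.

    exceedance_scores: dict mapping organism name -> score (hypothetical).
    """
    ranked = sorted(exceedance_scores, key=exceedance_scores.get, reverse=True)
    return ranked[:top]
```

The point of the design is that only the ordering of scores, not their absolute calibration, determines which organisms reviewers see first.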

A second conclusion is that the systematic components of the statistical outbreak detection models must be able to cope automatically with the idiosyncrasies of individual data series, notably seasonality and trends, without requiring intervention by the user. This necessitates the use of suitably flexible modeling environments, though excessive flexibility can itself cause problems of overfitting. A careful balance needs to be struck: for example, between the detailed modeling required to incorporate seasonal effects, which is crucial for some organisms, while recognizing that such effects are not greatly relevant for many others. In addition, robust numerical algorithms that are guaranteed to work for all but known extreme data configurations are essential.

Third, our analyses provide empirical support for the use of a single, robust algorithm across this range of organisms. The data suggest that the great majority of organisms can adequately—though far from perfectly—be represented by a statistical model in which the variance is proportional to the mean, such as the quasi-Poisson or negative binomial models. Some improvement would nevertheless be possible through the use of more general models in which the variance is proportional to a power of the mean. Such more general distributions, based on birth processes, have been studied (

These conclusions apply specifically to the use of automated biosurveillance as a second line of defense in support of investigator-led outbreak detection methods, as implemented in England and Wales. Thus, we seek a system that performs adequately over the entire range of organisms, to be scanned routinely, rather than one that is optimized for a particular organism. We believe that integrating investigator-led and automated surveillance in this way plays best to the strengths of each method.

Each week, the England and Wales detection system flags ≈20 organisms, listed in decreasing order of aberrance, for further investigation. A proportion of these results are false positives and do not correspond to genuine outbreaks. The remainder are genuine outbreaks, many of which will also have been picked up, as intended, by the front-line investigator-led network of surveillance specialists. Occasionally, the system picks up genuine outbreaks that have escaped detection by other means. These events often involve pathogens with a wide geographic distribution and relatively high baseline frequency of reporting. Such dispersed outbreaks may be overlooked at the local level, where they often equate to only marginal increases, but nationally may represent noteworthy events. Recent examples include outbreaks of

Our current efforts to improve the system aim to reduce the false-positive rate while maintaining sufficient power to detect genuine outbreaks. Some of the key issues to be revisited, in the light of the findings presented here, are the treatment of trends and seasonality and the calculation of thresholds. Other issues are how to handle past outbreaks and delays between specimen collection and reported identification. The data and experience gained from >20 years of automated biosurveillance will provide valuable empirical underpinning for such improvements.

Automated biosurveillance data from England and Wales, 1991–2011. This online appendix provides technical details of statistical methods, further technical description of results, and 5 supplemental figures.

We thank James Freed and Doris Otomewo for extracting the LabBase data.

This research was supported by a project grant from the UK Medical Research Council and by a Royal Society Wolfson Research Merit Award.

Dr Enki is a postdoctoral researcher in statistics at Open University, United Kingdom. His research interests include statistical epidemiology and multivariate analysis.