This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
The Centers for Disease Control and Prevention's (CDC's) BioSense system provides near-real time situational awareness for public health monitoring through analysis of electronic health data. Determination of anomalous spatial and temporal disease clusters is a crucial part of the daily disease monitoring task. Our study focused on finding useful anomalies at manageable alert rates according to available BioSense data history.
The study dataset included more than 3 years of daily counts of military outpatient clinic visits for respiratory and rash syndrome groupings. We applied four spatial estimation methods in implementations of space-time scan statistics cross-checked in Matlab and C. We compared the utility of these methods according to the resultant background cluster rate (a false alarm surrogate) and sensitivity to injected cluster signals. The comparison runs used a spatial resolution based on the facility zip code in the patient record and a finer resolution based on the residence zip code.
Simple estimation methods that account for day-of-week (DOW) data patterns yielded a clear advantage both in background cluster rate and in signal sensitivity. A 28-day baseline gave the most robust results for this estimation; the preferred baseline is long enough to remove daily fluctuations but short enough to reflect recent disease trends and data representation. Background cluster rates were lower for the rash syndrome counts than for the respiratory counts, likely because of seasonality and the large scale of the respiratory counts.
The spatial estimation method should be chosen according to characteristics of the selected data streams. In this dataset with strong day-of-week effects, the overall best detection performance was achieved using subregion averages over a 28-day baseline stratified by weekday or weekend/holiday behavior. Changing the estimation method for particular scenarios involving different spatial resolution or other syndromes can yield further improvement.
Epidemiologists responsible for routine public health surveillance seek early indications of possible disease outbreaks. A growing collection of data types has become available to aid this surveillance, but with the analysis of these data comes the need to reduce the search possibilities. With cluster detection methodology, analysis of data sources can help by indicating the likely location and extent of a potential outbreak. Spatial and spatiotemporal scan statistics have the additional advantages of controlling the number of alarms resulting from multiple testing and accounting for usual clustering patterns, as found in the data [
Some natural clustering is expected, as in urban areas with high population density. If cluster detection methods do not account for this heterogeneity, customarily high concentrations will be signaled as anomalous. Only departures from the usual distribution should be flagged. In the implementation of the widely used SaTScan software [
A natural choice for the expected spatial distribution is the census population by sub-region of the geographic area under surveillance. In the United States, for example, population counts stratified by 3-digit zip code, 5-digit zip code, or census tract are freely downloadable [
If substantial (typically multi-year) quality historical data are available, detailed modeling of counts at the sub-region level could be attempted to calculate expected distributed counts. Kleinman et al [
For clarity, we distinguish two separate problems in applying data with spatial information to detection of disease clusters. The first problem is how to aggregate the data into sub-regions. Ozonoff et al and Olson et al. [
The choice of an effective method for estimating the case distribution is a function of the monitored data source. Even within the same data source, for example, estimation of the current distribution for a broad, seasonal syndromic filtering of clinical records is likely different from the same estimation for a rare sub-syndrome occurring evenly throughout the year. For the current study, we used data from BioSense, a US national biosurveillance system operated by the Centers for Disease Control and Prevention (CDC) [
The study dataset included 3 years (1/1/2004–12/31/2006) of DoD patient records from military outpatient clinics in the state of Texas, which has a large, widespread military population. These records were classified according to their
One obvious difference between the geographic patient distribution for the study dataset and the general census distribution results from the large fraction of non-Texas residents among these patients who live near the military bases, with the result that zip codes near those bases are overrepresented relative to census figures. However, for reasons of both patient privacy and military operational security, the spatial distribution of people eligible for treatment at base clinics is unavailable for analysis. Because of commuting by Texas residents, it cannot be assumed from Figure
Using location fields in patient records to seek infectious disease clusters, the underlying assumption is that the field value approximates the location of exposure to the disease-causing agent. The location of exposure is difficult to determine in individual-based studies and much more so in automated, population-based systems. The records in the study dataset contain fields for zip codes of the treatment facility and of the patient's home. In the DoD population whose records compose this dataset, most of the active duty personnel work on military bases that include treatment facilities. Therefore, a working hypothesis is that the facility zip code should be used to detect an outbreak in which exposures occur in the work environment; whereas, the home zip code should be used to detect an outbreak resulting from residence-based exposures. However, the patient home zip codes in the study data often refer to the home town of the patient, not to the zip code of military duty at the time of treatment. In such cases, the facility zip code is a better choice for the current residence. There is no practical way to know which zip code to use for the residence-based analysis, so in view of the number of patients who commute, we have used the following "100-mile rule" on patient zip codes to prepare the data before the analysis.
▪ First, we included all records that have either a facility zip code or a home zip code inside Texas. We also included records from four facilities outside of Texas but near the Texas-Oklahoma and Texas-Louisiana borders because of the large numbers of Texas residents commuting across the border and using those facilities.
▪ Second, we calculated for each record the distance between the home and facility zip codes.
▪ For the residence-based analysis, we used the patient home zip code if this distance was less than 100 miles. Otherwise, we used the facility zip code.
▪ If the home zip code field was empty, the facility zip code was used as the residence zip code. If the facility zip code was missing or outside Texas, the home zip code was used.
We used this reasonably simple rule because a more detailed assignment would require information that was available neither for the study nor for routine monitoring. In a normal situation most people would probably not travel more than a couple of hours to a facility unless there were no other choices.
Restriction of the study data set to those records classified in syndrome groups yielded 1,233 residence zip codes and 32 facility zip codes. The residence zip codes were well distributed in the eastern half of the state with high concentrations around the cities, but much sparser to the west. Most of the 32 facility zip codes were to the east with a high concentration in the San Antonio area. The west was covered only by two facilities at El Paso in the western tip. The mean daily record counts per residence zip code were 0.29 and 2.67 for rash and respiratory syndromes, respectively. The corresponding mean daily record counts per facility zip code were 4 and 38. Thus, respiratory counts exceeded rash counts by a factor of >9. Although rash counts have a relatively smooth pattern with a slightly increasing trend during the 3-year period, respiratory counts have a strong seasonal pattern (Figure
We used two types of estimation approaches to calculate expected counts: a baseline-mean approach and a space-time permutation approach. Figure
Figure
▪
▪
▪
We used baseline lengths of 7, 28, and 56 days (1, 4, and 8 weeks) with a 2-day buffer between baseline and test day. As in the EARS methods [
The expected count for each sub-region is calculated as a fraction of the total number of cases in the test period, set to 1 day in this study. It is critical to correctly estimate this fraction to the correct recognition of spatial aberrations. In the traditional scan statistics application of SaTScan, this fraction is the population of the sub-region divided by the total population of the surveillance region. In the current data environment, where the denominator population is unknown, we estimate this fraction as the number of sub-region baseline cases divided by the sum of baseline cases in the entire region. The only difference in this estimate for the C2-, W2-, and DOW-based prediction methods is which days are included in the baseline.
This approach must be adjusted for the possibility that the expected count for a sub-region might be zero. For many test statistics, including the Poisson generalized likelihood ratio [
a) The test day is a weekday.
b) There are 50 cases in the entire region for the test day and 200 data subregions.
c) There are a total of 1,000 cases on weekdays in the baseline.
d) Sub-region
If one or more of the cases above are true, then the expected number of cases is [(75 + 1)/(1000 + 200)] * 50 = 3.167. See Appendix 1 for the formal statement of this estimation method.
Figure
We implemented the algorithms described in the previous sections into C and MatLab computer programs both for research purposes and for possible use in BioSense. Development of codes giving identical random draws in both software environments allowed perfect agreement of results despite the diverse coding methods. These methods were also developed to facilitate comparison of the spatial estimation methods, to allow inspection of all variables at all stages, and to incorporate two of the following features: (1) computation of the statistical significance of maximum likelihood clusters using the extreme value (or Gumbel) distribution [
Our implementation searches cylinders with circular bases. The cylinder height is the number of days in the current test period, so that for a 7-day cylinder, a candidate cluster might have from 1 to 7 days of data. The analysis that follows focuses on comparing spatial estimation methods and is restricted to 1-day cylinders. For the spatial search, any point grid of possible cluster centers could be used, and our study used the set of centroids of zip codes in the data. For analyses based on facility and presumed residence zip codes, the numbers of zip codes were 32 and 1,233, respectively. We defined the radius of a candidate cluster as the maximum distance from the center to another centroid in the cluster, and we allowed radii as large as 100 km for the residence zip code runs and 300 km for the clinic zip runs, based on possible care seeking behavior.
We applied the
This statistic gives a measure of the likelihood of the observed counts inside and outside the cluster given the expected distribution and the total number of observed cases. Candidate clusters within the search space are ranked in descending order by their GLR values. The next step is to decide which if any of these clusters should be flagged as statistically significant.
The LLR values used as the test statistic do not comply with a known distribution in general, so thresholds for significance are usually determined empirically. The empirical approach is to randomly generate a large set of simulated data distributions under the null hypothesis of the expected spatial distribution for the test period. For each simulated distribution
We use the same procedure for generating null distributions for the C2-based, W2-based, and DOW-based estimation methods. In this procedure, we form a probability vector
from the set of sub-region expected values
For the space-time permutation method, we generated trial distributions by implementing the procedure in [
We applied the search for the maximum test statistic for each random trial distribution exactly as for the observed data and collected the set of trial LLR maxima.
To determine significance of the observed LLR maxima, we applied the Gumbel distribution using parameters derived from the set of trial maxima [
The p-value is then tested for significance against a threshold chosen to control the background cluster rate. We used a p-value threshold of 0.0027, for a nominal daily background rate of one cluster per year, in the results below. Following [
Reliable quantitative measurement of the sensitivity and specificity of a cluster detection method requires multiple signals distributed in time over the sub-regions of interest. Documentation of authentic outbreaks sufficient to provide enough detail for this purpose is very rare. Previous power studies [
The signal generation method is as follows:
▪ For each trial, a signal start date is selected, and we generate a stochastic epidemic curve using random draws from a lognormal distribution [
▪ Parameters of this distribution are supplied by the user to control the median peak day and time spread of the injected cases.
▪ These random draws are tabulated to give the total number of injected cases on each day. For each day with injected cases, these cases must be assigned to sub-regions, or zip codes in this study.
▪ Three geometries have been implemented to reflect hypotheses of spread of an aerosol pathogen. The implemented geometries include radial, wedge-shaped, and hourglass-shaped spread at rates that the user may specify.
▪ For the results given here, we used the radial spread, in which the drop-off of cases from the centre zip code is proportional to
In the current study, total signals consisted of either 50 or 100 cases added to the original respiratory counts for either home or clinic zip codes. For epicurve generation, we used lognormal parameters ζ = 1.3, σ = .4. These parameters gave region-wide time series epidemic curves of 8–13 days in duration. The radial spatial distribution was calculated using distance from the signal centre in kilometres, and for our Texas DoD dataset, we used a dispersion factor of
For the sequential runs without injects to assess background cluster rates, the update procedure is depicted in Figure
Cross-checking results of the C and Matlab software described above for implementing scan statistics, we compared the four estimation methods described in the Methods section using both facility and patient residence zip codes with 7-, 28-, and 56-day sliding baselines. The measures of effectiveness were compared using the background cluster rate and sensitivity to simulated clusters at chosen significance thresholds.
We applied the four estimation methods to scan all test dates and all coordinates in this study dataset. Because they are to be used for daily surveillance, and disease outbreaks are expected to be rare, a primary concern for using these methods is the actual cluster determination rate, or how often the method will indicate a need for investigation. If alarms occur too often, then the alarms corresponding to true health events might be ignored.
We ran the purely spatial cluster detection algorithm for the 3 years of data with no correction for temporal multiple testing to get the relative cluster determination rates of the four methods. In the absence of information about any true outbreaks during the data period, we could not refer to significant clusters as true or false alarms, so as a surrogate for a false alarm rate, we measured the background alarm rate as the number of significant clusters per unit time for the p-value 0.003.
Table
Number of clusters found per 100 weekdays/weekends, DoD Texas, 2004–2006, (p-value = 0.003, sub-region = 2).
| 2.0 | 1.9 | 2.0 | 0.9 | 1.7 | 1.9 | 1.5 | 2.2 | 1.8 | 1.8 | |||
| 1.3 | 0.6 | 0.6 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | |||
| 21.6 | 21.4 | 25.3 | 13.9 | 18.5 | 18.9 | 19.9 | 17.0 | 18.9 | 23.4 | |||
| 21.3 | 23.9 | 22.0 | 9.2 | 8.0 | 7.6 | 6.1 | 22.3 | 23.6 | 23.6 | |||
| 4.3 | 1.8 | 1.4 | 1.9 | 1.5 | 5.4 | 2.7 | 1.9 | 1.3 | 1.5 | |||
| 2.2 | 1.6 | 1.6 | 1.6 | 0.6 | 3.2 | 1.6 | 2.2 | 2.2 | 1.9 | |||
| 20.1 | 13.0 | 15.3 | 13.3 | 13.7 | 23.7 | 18.9 | 10.9 | 11.1 | 14.5 | |||
| 44.3 | 36.6 | 35.0 | 19.7 | 20.4 | 19.4 | 15.9 | 44.6 | 34.7 | 34.4 | |||
▪ From Table
▪ For the rash syndrome, the W2 estimation method with the 28-day baseline has a clear advantage over the unstratified averaging of the C2 method at the facility level, but not at the residence level, at which the day-of-week effect is less clear. The plotted levels of cluster alerts suggest that the full respiratory syndrome is too noisy for this dataset in the sense that its high mean and overdispersed variance cause much more significant but random clustering than would be expected for a sparser outcome variable, or especially for a dataset in which spatial distribution is similar to that of census data. Replacement by a more specific sub-syndrome such as influenza-like illness (ILI) should be considered; even though ILI counts are also seasonal, counts are much smaller, and their spatial distribution is likely to be fairly stable for most of the year.
▪ For the respiratory syndrome, management of day-of-week (DOW) effects is a primary concern, and the W2 and DOW methods clearly yield fewer cluster alerts than methods that ignore these effects. See Figures
▪ The longest baseline of 56 days gives reduced background clustering only in certain circumstances, such as when the DOW method is used at the residence zip level. The finer temporal DOW stratification, combined with the finer spatial stratification, appears to require a longer baseline. For coarser aggregation, the longer baseline can actually increase the number of cluster alerts, likely because of seasonal effects.
▪ Extending the temporal stratification from the W2 to the DOW method never yields a clear overall advantage in reduced alerting.
▪ The W2 method with a 28-day baseline is a safe choice for controlling the number of cluster alerts overall and gives as clear advantage in some situations. However, for a sparse outcome variable such as the rash visit counts and a fine spatial subdivision of records, a longer baseline might reduce the alerting.
As described in the Methods section, we randomly generated 127 visit count signals with a region-wide lognormal epidemic curve and a radically decaying spatial distribution. Separate signal sets were generated for totals of 50 and 100 outbreak cases. For each set, we added these signals to the authentic respiratory syndrome background data for a series of detection trials.
Figure
In (a) and (b), the x-axis gives the estimated background cluster rates from the dotted curves of Figure
The above results substantially support the idea that accurate spatial estimation methods are essential for robust detection of anomalous spatial clusters at controlled background cluster rates. Among the SaTScan user community, when an expected spatial distribution of cases is not available in the form of a population file with reliable estimates, a common method for deriving subregion expected counts is the conditioning on marginal distributions for both dates and subregions, as in the space-time permutation scan statistic [
The evaluation criteria presented for comparing these estimation methods are the background cluster rate of each method and the ability to detect injected signals. For the background cluster rate, Figures
a) The complexity of spatial estimation should be appropriate for the time series structure and reflect what is known about the data. Simple methods are preferable unless the data structure is rich enough to derive advantage from more complex methods such as finer stratification (e.g., the DOW method) or the regression modeling of [
b) Choosing the baseline interval length involves a trade-off between stability and capturing recent data behavior. In the charted results for the flat average (C2 based) method, 7-day baseline estimates produced consistently higher background cluster rates than the 28-day estimates. Lengthening the baseline further gives more stability, but can actually degrade the clustering performance if the time series mean changes either because of unmodeled seasonal effects or because of altered data provider participation or other common changes in the data acquisition process. Lengthening the baseline from 28 days to 56 days yielded an advantage for certain situations, such as when the DOW stratification was used, but in some situations produced higher background cluster rates.
c) In some health monitoring systems, a single alerting method with fixed parameters might be required for operational reasons. Given such a requirement, the weekday or weekend/holiday stratified averages (W2-based) with a 28-day baseline provide a safe and sometimes the best choice among the methods tested for the DoD outpatient visit data. The unstratified averages might yield marginally better results if day-of-week effects are absent, but could fare much worse if such effects are present. Moreover, day-of-week effects are not always apparent from visual inspection.
d) The results suggest that more robust clustering performance is possible in a system whereby the estimation method and its parameters could be chosen according to the data. For example, one could use day-of-week stratification time series with rich counts like the respiratory syndrome and large subregions, as in the facility-level runs. A longer baseline could be advantageous for more rare syndrome groupings such as rash, especially if seasonal effects are absent.
The experience of this study also suggested that the p-value threshold for cluster significance should be determined by test runs using authentic data for different outcome variables. Finally, clustering runs on background data could help evaluate the utility of syndrome groupings. The year-round high mean and variance of the respiratory syndrome count data lead to a high background cluster rate, as measured by the sequential background runs. This observation suggests a more selective record filtering, such as a subsyndrome for influenza-like illness. The runs using the sparser rash syndrome counts support this approach as did additional runs with other syndrome groupings.
This study was conducted as part of an effort to bridge the gap between surveillancerelated research and the need for robust and sensitive routine health monitoring. The BioSense data environment has grown to include various types of data from an increasing number of hospitals and other clinical data providers. Statistical behaviors such as seasonal trends, day-of-week effects, and age distributions vary among these sources, and appropriate syndromic classifications are still under study. Next steps are to seek appropriate spatial estimation methods for the new BioSense clinical data sources and record groupings and to standardize the method selection criteria. Regardless of the cluster detection method of choice in the future, reliable spatial background estimation will remain essential for robust surveillance.
The authors declare that they have no competing interests.
JX carried out the evaluation of spatial estimation methods and cluster detection studies, participated in the design, and drafted the manuscript.
HB conceived of the study, participated in its design and coordination, carried out the ROC analysis and helped to draft the manuscript.
LM carried out and programmed C source code for the signal simulation and injection studies, and draft the portion of manuscript.
JE carried out and programmed space-time scan statistics study cross-checked in Matlab, implemented and tested Gumbel algorithm in Matlab.
ML carried out and programmed space-time scan statistics study in C.
JT conceived of the study, and participated in its design and helped to draft the manuscript.
All authors read and approved the final manuscript.
Computing expected values for each sub-region by the modified baseline method:
Let
Where,
The
We acknowledge Drs Kenneth Cox and Julie Pavlin, and other US Department of Defense personnel who have graciously shared data with the BioSense program for several years.