This is an Open Access article distributed under the terms of the Creative Commons Attribution License (

The spatial scan statistic is a widely used statistical method for the automatic detection of disease clusters from syndromic data. Recent work in the disease surveillance community has proposed many variants of Kulldorff's original spatial scan statistic, including expectation-based Poisson and Gaussian statistics, and has incorporated a variety of time series analysis methods to obtain expected counts. We evaluate the detection performance of twelve variants of spatial scan, using synthetic outbreaks injected into four real-world public health datasets.

The relative performance of methods varies substantially depending on the size of the injected outbreak, the average daily count of the background data, and whether seasonal and day-of-week trends are present. The expectation-based Poisson (EBP) method achieves high performance across a wide range of datasets and outbreak sizes, making it useful in typical detection scenarios where the outbreak characteristics are not known. Kulldorff's statistic outperforms EBP for small outbreaks in datasets with high average daily counts, but has extremely poor detection power for outbreaks affecting more than

Our results suggest four main conclusions. First, spatial scan methods should be evaluated for a variety of different datasets and outbreak characteristics, since focusing only on a single scenario may give a misleading picture of which methods perform best. Second, we recommend the use of the expectation-based Poisson statistic rather than the traditional Kulldorff statistic when large outbreaks are of potential interest, or when average daily counts are low. Third, adjusting for seasonal and day-of-week trends can significantly improve performance in datasets where these trends are present. Finally, we recommend discontinuing the use of randomization testing in the spatial scan framework when sufficient historical data is available for empirical calibration of likelihood ratio scores.

Systems for automatic disease surveillance analyze electronically available public health data (such as hospital visits and medication sales) on a regular basis, with the goal of detecting emerging disease outbreaks as quickly and accurately as possible. In such systems, the choice of statistical methods can make a substantial difference in the sensitivity, specificity, and timeliness of outbreak detection. This paper focuses on methods for spatial biosurveillance (detecting clusters of disease cases that are indicative of an emerging outbreak), and provides a systematic comparison of the performance of these methods for monitoring hospital Emergency Department and over-the-counter medication sales data. The primary goal of this work is to determine which detection methods are appropriate for which data types and outbreak characteristics, with an emphasis on finding methods which are successful across a wide range of datasets and outbreaks. While this sort of analysis is essential to ensure that a deployed surveillance system can reliably detect outbreaks while keeping false positives low, most currently deployed systems which employ spatial detection methods simply use the default approaches implemented in software such as SaTScan [

In our typical disease surveillance task, we have daily count data aggregated at the zip code level for data privacy reasons. For each zip code $s_i$, we have a time series of counts $c_i^t$, where $t = 0$ represents the current day and $t = 1 \ldots t_{max}$ represent the counts from 1 to $t_{max}$ days ago respectively. Here we consider two types of data: hospital Emergency Department (ED) visits and sales of over-the-counter (OTC) medications. For the ED data, counts represent the number of patients reporting to the ED with a specified category of chief complaint (e.g. respiratory, fever) for that zip code on that day. For the OTC sales data, counts represent the number of units of medication sold in a particular category (e.g. cough/cold, thermometers) for that zip code on that day. Given a single data stream (such as cough and cold medication sales), our goal is to detect anomalous increases in counts that correspond to an emerging outbreak of disease. A related question, but one that we do not address here, is how to combine multiple streams of data, in order to increase detection power and to provide greater situational awareness. Recent statistical methods such as the multivariate Poisson spatial scan [

For this problem, a natural choice of outbreak detection method is the

Here we focus on the use of spatial scan methods for syndromic surveillance, monitoring patterns of health-related behaviors (such as hospital visits or medication sales) with the goal of rapidly detecting emerging outbreaks of disease. We assume that an outbreak will result in increased counts (e.g. more individuals going to the hospital or buying over-the-counter medications) in the affected region, and thus we wish to detect anomalous increases in count that may be indicative of an outbreak. Such increases could affect a single zip code, multiple zip codes, or even all zip codes in the monitored area, and we wish to achieve high detection power over the entire range of outbreak sizes. We note that this use of spatial scan statistics is somewhat different than their original use for spatial analysis of patterns of chronic illness, in which these methods were used to find localized spatial clusters of increased disease rate. One major difference is that we typically use historical data to determine the expected counts for each zip code. We then compare the observed and expected counts, in order to find spatial regions where the observed counts are significantly higher than expected, or where the ratio of observed to expected counts is significantly higher inside than outside the region.

Many recent variants of the spatial scan differ in two main criteria: the set of potential outbreak regions considered, and the statistical method used to determine which regions are most anomalous. While Kulldorff's original spatial scan approach [

In this study, we compare the expectation-based Poisson and expectation-based Gaussian statistics to Kulldorff's original statistic. For each of these methods, we consider four different methods of time series analysis used to forecast the expected count for each location, giving a total of 12 methods to compare. Our systematic evaluation of these methods suggests several fundamental changes to current public health practice for small-area spatial syndromic surveillance, including use of the expectation-based Poisson (EBP) statistic rather than the traditional Kulldorff statistic, and discontinuing the use of randomization testing, which is computationally expensive and did not improve detection performance for the four datasets examined in this study. Finally, since the relative performance of spatial scan methods differs substantially depending on the dataset and outbreak characteristics, an evaluation framework which considers multiple datasets and outbreak types is useful for investigating which methods are most appropriate for use in which outbreak detection scenarios.

In the spatial disease surveillance setting, we monitor a set of spatial locations $s_i$, and are given an observed count (number of cases) $c_i$ and an expected count $b_i$ corresponding to each location. For example, each $s_i$ may represent the centroid of a zip code, the corresponding count $c_i$ may represent the number of Emergency Department visits with respiratory chief complaints in that zip code for some time period, and the corresponding expectation $b_i$ may represent the expected number of respiratory ED visits in that zip code for that time period, estimated from historical data. We then wish to detect any spatial regions

The spatial scan statistic [_{i}, and finding those regions which maximize some likelihood ratio statistic. Given a set of alternative hypotheses _{1}(_{0 }(representing no clusters), the likelihood ratio

If the null or alternative hypotheses have any free parameters, we can compute the likelihood ratio statistic using the maximum likelihood parameter values [

Once we have found the regions with the highest scores

Kulldorff's original spatial scan approach [_{beat }is the number of replica datasets with maximum region score higher than
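The p-value computation used in randomization testing can be illustrated with a short sketch. It assumes the commonly used Monte Carlo form $p = (R_{beat} + 1) / (R + 1)$, where $R$ is the number of replica datasets generated under the null hypothesis and $R_{beat}$ is the number of replicas whose maximum region score exceeds the observed maximum. The function name and argument layout are hypothetical and are not taken from any particular software package.

```python
def monte_carlo_p_value(observed_max_score, replica_max_scores):
    """Monte Carlo p-value for the highest-scoring region.

    observed_max_score: maximum region score on the real data.
    replica_max_scores: maximum region score of each replica dataset
        generated under the null hypothesis (an assumed input; how the
        replicas are generated depends on the scan statistic used).
    """
    n_replicas = len(replica_max_scores)
    # Count replicas whose best region beats the best region of the real data.
    n_beat = sum(1 for score in replica_max_scores if score > observed_max_score)
    # Adding 1 to numerator and denominator keeps the p-value strictly positive.
    return (n_beat + 1) / (n_replicas + 1)
```

With 999 replicas, none of which beats the observed score, this gives a p-value of 1/1000, which is the smallest value obtainable without generating more replicas.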

We consider three different variants of the spatial scan statistic: Kulldorff's original Poisson scan statistic [

The Poisson distribution is commonly used in epidemiology to model the underlying randomness of observed case counts, making the assumption that the variance is equal to the mean. If this assumption is not reasonable (i.e. counts are "overdispersed" with variance greater than the mean, or "underdispersed" with variance less than the mean), we should instead use a distribution which separately models mean and variance. One simple possibility is to assume a Gaussian distribution, and both the Poisson and Gaussian distributions lead to simple and easily computable score functions

A second distinction in our models is whether the score function

All three methods assume that each observed count $c_i$ is drawn from a distribution with mean proportional to the product of the expected count $b_i$ and an unknown relative risk $q_i$. For the two Poisson methods, we assume $c_i \sim \mathrm{Poisson}(q_i b_i)$, and for the expectation-based Gaussian method, we assume $c_i \sim \mathrm{Gaussian}(q_i b_i, \sigma_i)$. The expectations $b_i$ are obtained from time series analysis of historical data for each location $s_i$. For the Gaussian statistic, the variance $\sigma_i^2$ is also estimated from the historical data for location $s_i$, using the mean squared difference between the observed counts

Under the null hypothesis of no clusters $H_0$, the expectation-based statistics assume that all counts are drawn with relative risk $q_i = 1$ everywhere, so that each count has mean equal to its expected count $b_i$. Kulldorff's statistic assumes instead that all counts are drawn with relative risk $q_i = q_{all}$ everywhere, for some unknown constant $q_{all}$. The value of $q_{all}$ is estimated by maximum likelihood: $q_{all} = C_{all} / B_{all}$, where $C_{all}$ and $B_{all}$ are the aggregate observed count $\sum_i c_i$ and aggregate expected count $\sum_i b_i$ for all locations $s_i$ respectively.

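Under the alternative hypothesis $H_1(S)$, representing a cluster in region $S$, the expectation-based statistics assume that the expected counts inside region $S$ are multiplied by some constant $q_{in} > 1$, and thus $q_i = q_{in}$ inside region $S$ and $q_i = 1$ outside region $S$. The value of $q_{in}$ is estimated by maximum likelihood. For the expectation-based Poisson statistic, the maximum likelihood value of $q_{in}$ is $C_{in} / B_{in}$, where $C_{in}$ and $B_{in}$ are the aggregate count $\sum_{s_i \in S} c_i$ and aggregate expectation $\sum_{s_i \in S} b_i$ respectively. For the expectation-based Gaussian statistic, the maximum likelihood value of $q_{in}$ is $C'_{in} / B'_{in}$, where $C'_{in} = \sum_{s_i \in S} c_i b_i / \sigma_i^2$ and $B'_{in} = \sum_{s_i \in S} b_i^2 / \sigma_i^2$. These sufficient statistics can be interpreted as weighted sums of the counts $c_i$ and expectations $b_i$ respectively, where the weighting of a location $s_i$ is inversely proportional to the coefficient of variation. The resulting likelihood ratio statistics are

$$F_{EBP}(S) = \left(\frac{C_{in}}{B_{in}}\right)^{C_{in}} \exp(B_{in} - C_{in})$$

for $C_{in} > B_{in}$, and $F_{EBP}(S) = 1$ otherwise, and

$$F_{EBG}(S) = \exp\left(\frac{(C'_{in})^2}{2 B'_{in}} + \frac{B'_{in}}{2} - C'_{in}\right)$$

for $C'_{in} > B'_{in}$, and $F_{EBG}(S) = 1$ otherwise.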

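Kulldorff's scan statistic uses fundamentally different assumptions than the expectation-based statistics: under the alternative hypothesis $H_1(S)$, representing a cluster in region $S$, it assumes that the expected counts inside and outside region $S$ are multiplied by some unknown constants $q_{in}$ and $q_{out}$ respectively, where $q_{in} > q_{out}$. In this case, the maximum likelihood estimate of $q_{in}$ is $C_{in} / B_{in}$ and the maximum likelihood estimate of $q_{out}$ is $C_{out} / B_{out}$, where $C_{in}$ and $B_{in}$ are the aggregate count and aggregate expectation inside region $S$, and $C_{out} = C_{all} - C_{in}$ and $B_{out} = B_{all} - B_{in}$ are the corresponding aggregates outside region $S$. The resulting likelihood ratio statistic is

$$F_{KULL}(S) = \left(\frac{C_{in}}{B_{in}}\right)^{C_{in}} \left(\frac{C_{out}}{B_{out}}\right)^{C_{out}} \left(\frac{C_{all}}{B_{all}}\right)^{-C_{all}}$$

if $C_{in} / B_{in} > C_{out} / B_{out}$, and $F_{KULL}(S) = 1$ otherwise.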
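The three score functions follow directly from the Poisson and Gaussian models described above, and a minimal sketch may help make them concrete. The functions below return the logarithm of the likelihood ratio (so "no cluster" corresponds to a score of 0 rather than 1). The function names, argument order, and the choice to work on the log scale are illustrative assumptions, not the interface of any existing package.

```python
import math


def ebp_score(c_in, b_in):
    """Log-likelihood ratio for the expectation-based Poisson statistic.

    c_in, b_in: aggregate observed and expected counts inside the region.
    """
    if b_in <= 0 or c_in <= b_in:
        return 0.0  # only increases over the expected count are scored
    return c_in * math.log(c_in / b_in) + b_in - c_in


def ebg_score(counts, baselines, variances):
    """Log-likelihood ratio for the expectation-based Gaussian statistic.

    counts, baselines, variances: per-location values inside the region.
    """
    # Weighted sufficient statistics: each location is weighted by b_i / sigma_i^2.
    c_w = sum(c * b / v for c, b, v in zip(counts, baselines, variances))
    b_w = sum(b * b / v for b, v in zip(baselines, variances))
    if b_w <= 0 or c_w <= b_w:
        return 0.0
    return c_w * c_w / (2.0 * b_w) + b_w / 2.0 - c_w


def _c_log_ratio(c, b):
    # c * log(c / b), with the convention 0 * log(0) = 0.
    return c * math.log(c / b) if c > 0 else 0.0


def kulldorff_score(c_in, b_in, c_all, b_all):
    """Log-likelihood ratio for Kulldorff's Poisson scan statistic."""
    c_out, b_out = c_all - c_in, b_all - b_in
    if b_in <= 0 or b_out <= 0:
        return 0.0  # degenerate region: nothing inside or nothing outside
    if c_in / b_in <= c_out / b_out:
        return 0.0  # risk inside is not higher than risk outside
    return (_c_log_ratio(c_in, b_in) + _c_log_ratio(c_out, b_out)
            - _c_log_ratio(c_all, b_all))
```

As a usage example, `kulldorff_score(20, 10, 200, 100)` returns 0.0, because the ratio of observed to expected counts is 2 both inside and outside the region, whereas `ebp_score(20, 10)` is positive for the same region. This is the uniform, global increase that Kulldorff's statistic cannot distinguish from the null hypothesis.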

It is an open question as to which of these three spatial scan statistics will achieve the highest detection performance in real-world outbreak detection scenarios. We hypothesize that EBG will outperform EBP for datasets which are highly overdispersed (since in this case the Poisson assumption of equal mean and variance is incorrect) and which have high average daily counts (since in this case the discrete distribution of counts may be adequately approximated by a continuous distribution). Furthermore, we note that Kulldorff's statistic will not detect a uniform, global increase in counts (e.g. if the observed counts were twice as high as expected for all monitored locations), since the ratio of risks inside and outside the region would remain unchanged. We hypothesize that this feature will harm the performance of KULL for outbreaks which affect many zip codes and thus have a large impact on the global risk

In the typical prospective surveillance setting [_{i}, we have a time series of counts _{max}, where time 0 represents the present. Our goal is to find any region _{max}. Space-time scan statistics are considered in detail in [

The moving average may be adjusted for day of week trends (MA-DOW) by computing the proportion of counts _{i }on each day of the week (

When we predict the expected count for a given location on a given day, we choose the corresponding value of

The 28-day moving average takes seasonality into account by only using the most recent four weeks of data, but it may lag behind fast-moving seasonal trends, causing many false positives (if it underestimates expected counts for an increasing trend) or false negatives (if it overestimates expected counts for a decreasing trend). Thus we can perform a simple seasonal adjustment by multiplying the 28-day moving average by the ratio of the "global" 7-day and 28-day moving averages:

This "moving average with current week adjustment" (MA-WK) method has the effect of reducing the lag time of our estimates of global trends. One potential disadvantage is that our estimates of the expected counts using the 7-day average may be more affected by an outbreak (i.e. the estimates may be contaminated with outbreak cases), but using global instead of local counts reduces the variance of our estimates and also reduces the bias resulting from contamination. We can further adjust for day of week (MA-WK-DOW) by multiplying by seven times the appropriate
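The moving-average baselines described above can be sketched as follows. This is an illustration under stated assumptions: each location's history is a list of daily counts ordered from oldest to newest and excluding the current day, the "global" averages are computed from counts summed over all locations, and the day-of-week proportions are estimated from each location's own history (the original method may pool or smooth these proportions differently). All function and variable names are hypothetical.

```python
def moving_average(history, window=28):
    """Mean of the most recent `window` daily counts."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def ma_wk_baseline(histories, location):
    """28-day moving average with current week adjustment (MA-WK).

    histories: dict mapping location -> list of daily counts (same length).
    """
    local_28 = moving_average(histories[location], 28)
    # Daily totals over all locations, used for the "global" trend.
    global_daily = [sum(day) for day in zip(*histories.values())]
    global_7 = moving_average(global_daily, 7)
    global_28 = moving_average(global_daily, 28)
    if global_28 <= 0:
        return local_28  # no global information; fall back to plain MA
    return local_28 * global_7 / global_28


def dow_proportions(history, weekdays):
    """Proportion of a location's counts falling on each day of the week.

    weekdays: list of integers 0..6, one per entry of `history`.
    """
    totals = [0.0] * 7
    for count, day in zip(history, weekdays):
        totals[day] += count
    grand_total = sum(totals)
    if grand_total <= 0:
        return [1.0 / 7.0] * 7  # assumed fallback: no day-of-week effect
    return [t / grand_total for t in totals]


def ma_dow_baseline(history, weekdays, target_weekday):
    """Moving average adjusted for day of week (MA-DOW)."""
    beta = dow_proportions(history, weekdays)
    # Multiply by seven times the proportion for the day being predicted.
    return 7.0 * beta[target_weekday] * moving_average(history, 28)
```

When counts have doubled over the most recent week in every location, the plain 28-day average still reflects mostly the older level, while the MA-WK estimate is scaled up by the ratio of the global 7-day and 28-day averages.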

We obtained four datasets consisting of real public health data for Allegheny County: respiratory Emergency Department visits from January 1, 2002 to December 31, 2002, and three categories of over-the-counter sales of medication (cough/cold and anti-fever) or medical supplies (thermometers) from October 1, 2004 to January 4, 2006. We denote these four datasets by ED, CC, AF, and TH respectively. The OTC datasets were collected by the National Retail Data Monitor [

Dataset description

dataset | minimum daily count | maximum daily count | mean daily count | standard deviation |
ED | 5 | 62 | 34.40 | 8.34 |
TH | 4 | 99 | 41.44 | 17.96 |
CC | 338 | 5474 | 2428.46 | 923.47 |
AF | 83 | 2875 | 1321.70 | 279.88 |

Minimum, maximum, mean, and standard deviation of daily counts for each of the four public health datasets (respiratory ED visits, OTC thermometer sales, OTC cough/cold medication sales, and OTC anti-fever medication sales).

Our first set of experiments used a semi-synthetic testing framework (injecting simulated outbreaks into the real-world datasets) to evaluate detection power. We considered a simple class of circular outbreaks with a linear increase in the expected number of cases over the duration of the outbreak. More precisely, our outbreak simulator takes four parameters: the outbreak duration $T$, the outbreak severity $\Delta$, and the minimum and maximum number of zip codes affected, $k_{min}$ and $k_{max}$. Then for each injected outbreak, the outbreak simulator randomly chooses the start date of the outbreak $t_{start}$, number of zip codes affected $k$, and center zip code $s_{center}$. The outbreak is assumed to affect $s_{center}$ and its $k - 1$ nearest neighbors. On each day $t$ of the outbreak, $t = 1 \ldots T$, the outbreak simulator injects Poisson($t w_i \Delta$) cases into each affected zip code, where $w_i$ is the "weight" of each affected zip code, set proportional to its total count

We performed three simulations of varying size for each dataset: "small" injects affecting 1 to 10 zip codes, "medium" injects affecting 10 to 20 zip codes, and "large" injects affecting all monitored zip codes in Allegheny County (88 zip codes for the ED dataset, and 58 zip codes for the three OTC datasets). For the ED and TH datasets, we used Δ = 3, Δ = 5, and Δ = 10 for small, medium, and large injects respectively. For the AF dataset, we used Δ = 30, Δ = 50, and Δ = 100, and for the CC dataset, we used Δ = 60, Δ = 100, and Δ = 200 for the three sizes of inject. We used a value of
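A simulator of this kind can be sketched in a few lines of Python. The sketch makes several assumptions that should be read as illustrative rather than as the exact procedure used in the study: the number of cases injected on day $t$ of the outbreak is drawn from a Poisson distribution with mean $t \cdot w_i \cdot \Delta$, the weights $w_i$ are normalized to sum to 1 over the affected zip codes, the affected zip codes are the center and its nearest neighbors by centroid distance, and every zip code has a positive total count. The function names and data layout are hypothetical.

```python
import math
import random


def poisson_draw(mean, rng):
    """Draw from Poisson(mean) with Knuth's method, in chunks of mean <= 30.

    A sum of independent Poisson draws is Poisson with the summed mean, so
    chunking keeps exp(-mean) away from floating-point underflow.
    """
    total = 0
    while mean > 0:
        step = min(mean, 30.0)
        limit, product, k = math.exp(-step), rng.random(), 0
        while product > limit:
            product *= rng.random()
            k += 1
        total += k
        mean -= step
    return total


def simulate_outbreak(centroids, totals, n_days, duration, delta,
                      k_min, k_max, rng):
    """Simulate one circular outbreak with linearly increasing severity.

    centroids: dict zip code -> (x, y) centroid.
    totals: dict zip code -> total historical count (assumed positive).
    Returns (start_day, affected_zips, injected), where injected maps
    day index -> {zip code: number of injected cases}.
    """
    zips = list(centroids)
    start = rng.randrange(n_days - duration + 1)
    k = rng.randint(k_min, k_max)
    center = rng.choice(zips)
    # The center zip code and its k - 1 nearest neighbors by centroid distance.
    by_distance = sorted(
        zips, key=lambda z: math.dist(centroids[z], centroids[center]))
    affected = by_distance[:k]
    total_count = sum(totals[z] for z in affected)
    weights = {z: totals[z] / total_count for z in affected}
    injected = {}
    for t in range(1, duration + 1):
        injected[start + t - 1] = {
            z: poisson_draw(t * weights[z] * delta, rng) for z in affected}
    return start, affected, injected
```

Passing a seeded `random.Random` instance as `rng` makes a simulation run reproducible, which is convenient when the same set of injected outbreaks must be presented to every detection method being compared.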

We note that simulation of outbreaks is an active area of ongoing research in biosurveillance. The creation of realistic outbreak scenarios is important because of the difficulty of obtaining sufficient labeled data from real outbreaks, but is also very challenging. State-of-the-art outbreak simulations such as those of Buckeridge et al. [

We tested a total of twelve methods: each combination of the three scan statistics (KULL, EBP, EBG) and the four time series analysis methods (MA, MA-DOW, MA-WK, MA-WK-DOW) discussed above. For all twelve methods, we scanned over the same predetermined set of search regions. This set of regions was formed by partitioning Allegheny County using a 16 × 16 grid, and searching over all rectangular regions on the grid with size up to 8 × 8. Each region was assumed to consist of all zip codes with centroids contained in the given rectangle. We note that this set of search regions is different than the set of inject regions used by our outbreak simulator: this is typical of real-world outbreak detection scenarios, where the size and shape of potential outbreaks are not known in advance. Additionally, we note that expected counts (and variances) were computed separately for each zip code, prior to our search over regions. As discussed above, we considered four different datasets (ED, TH, CC, and AF), and three different outbreak sizes for each dataset. For each combination of method and outbreak type (dataset and inject size), we computed the method's proportion of outbreaks detected and average number of days to detect as a function of the allowable false positive rate.
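The construction of the search regions can be illustrated with a short sketch. It assumes that the grid is laid over the bounding box of the zip code centroids (the study partitions the county itself, so the exact grid extent may differ) and represents each region as the set of zip codes whose centroids fall inside the rectangle. Rectangles containing the same zip codes are treated as one region, and empty rectangles are discarded. The function name and data layout are hypothetical.

```python
def rectangular_regions(centroids, grid=16, max_size=8):
    """Enumerate distinct zip code sets covered by grid-aligned rectangles.

    centroids: dict zip code -> (x, y) centroid.
    grid: number of grid cells along each axis.
    max_size: maximum rectangle width and height, in grid cells.
    Returns a set of frozensets of zip codes.
    """
    xs = [x for x, _ in centroids.values()]
    ys = [y for _, y in centroids.values()]
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)

    def cell(value, lo, hi):
        # Map a coordinate to a grid cell index in 0 .. grid - 1.
        if hi == lo:
            return 0
        return min(int((value - lo) / (hi - lo) * grid), grid - 1)

    cells = {z: (cell(x, x_lo, x_hi), cell(y, y_lo, y_hi))
             for z, (x, y) in centroids.items()}

    regions = set()
    for width in range(1, max_size + 1):
        for height in range(1, max_size + 1):
            for gx in range(grid - width + 1):
                for gy in range(grid - height + 1):
                    members = frozenset(
                        z for z, (i, j) in cells.items()
                        if gx <= i < gx + width and gy <= j < gy + height)
                    if members:
                        regions.add(members)
    return regions
```

Because the expected counts and variances are computed per zip code before the search, the score of any region returned here can be obtained by summing the per-zip-code quantities over its members.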

To do this, we first computed the maximum region score _{S }

The detection performance of each of the 12 methods is presented in Tables

Comparison of detection power on ED and TH datasets, for varying outbreak sizes

method | ED large | ED medium | ED small | TH large | TH medium | TH small |

KULL MA | 6.05 (35.4%) | 3.11 (98.4%) | 3.67 (84.9%) | 6.58 (12.8%) | 4.36 (92.8%) | 4.06 (93.1%) |

KULL MA-DOW | 5.74 (39.7%) | 3.04 (98.3%) | 3.60 (85.1%) | 6.54 (12.2%) | 4.36 (92.4%) | 4.06 (92.2%) |

KULL MA-WK | 6.01 (37.0%) | 3.11 (98.5%) | 3.65 (85.0%) | 6.57 (13.1%) | 4.36 (92.8%) | 4.06 (93.1%) |

KULL MA-WK-DOW | 5.74 (40.0%) | 3.04 (98.3%) | 3.61 (85.1%) | 6.54 (12.3%) | 4.36 (92.4%) | 4.06 (92.2%) |

EBP MA | ||||||

EBP MA-DOW | 3.79 (96.5%) | |||||

EBP MA-WK | 3.53 (96.3%) | |||||

EBP MA-WK-DOW | 3.64 (93.2%) | 3.56 (97.5%) | 3.81 (94.7%) | |||

EBG MA | 2.98 (100%) | 2.77 (99.9%) | 3.37 (88.4%) | 4.46 (88.8%) | 4.34 (90.2%) | 4.27 (86.6%) |

EBG MA-DOW | 3.04 (100%) | 2.90 (99.4%) | 3.41 (88.3%) | 5.00 (78.4%) | 4.94 (77.9%) | 4.76 (74.5%) |

EBG MA-WK | 2.99 (99.9%) | 2.78 (99.8%) | 4.71 (79.8%) | 4.42 (88.2%) | 4.31 (85.1%) | |

EBG MA-WK-DOW | 3.15 (99.7%) | 2.97 (98.8%) | 3.40 (87.9%) | 5.24 (63.2%) | 5.02 (73.1%) | 4.76 (73.6%) |

Average days to detection, and percentage of outbreaks detected, at 1 false positive per month. Methods in bold are not significantly different (in terms of days to detect, at

Comparison of detection power on CC and AF datasets, for varying outbreak sizes

method | CC large | CC medium | CC small | AF large | AF medium | AF small |

KULL MA | 6.43 (14.8%) | 2.53 (100%) | 2.33 (99.5%) | 6.64 (10.6%) | 3.20 (100%) | 3.00 (99.3%) |

KULL MA-DOW | 5.69 (60.8%) | 6.44 (23.9%) | ||||

KULL MA-WK | 6.43 (14.8%) | 2.53 (100%) | 2.33 (99.5%) | 6.64 (10.6%) | 3.20 (100%) | 3.00 (99.3%) |

KULL MA-WK-DOW | 5.69 (60.8%) | 6.44 (23.9%) | ||||

EBP MA | 4.61 (87.5%) | 4.07 (96.7%) | 4.13 (94.0%) | 3.95 (99.9%) | 3.87 (98.0%) | |

EBP MA-DOW | 4.59 (88.6%) | 4.06 (95.8%) | 4.10 (94.4%) | 4.70 (97.5%) | 4.37 (99.0%) | 4.32 (95.9%) |

EBP MA-WK | 3.30 (98.2%) | 2.76 (100%) | 2.83 (99.2%) | 4.64 (82.5%) | 4.07 (98.2%) | 3.87 (96.7%) |

EBP MA-WK-DOW | 2.58 (99.9%) | 2.57 (99.3%) | 4.65 (83.8%) | 4.01 (98.5%) | 3.89 (97.0%) | |

EBG MA | 4.73 (80.7%) | 4.30 (89.5%) | 4.43 (76.5%) | 4.80 (91.4%) | 4.56 (94.5%) | 4.47 (84.1%) |

EBG MA-DOW | 4.81 (80.5%) | 4.36 (89.8%) | 4.54 (75.0%) | 4.96 (89.2%) | 4.68 (93.2%) | 4.70 (78.9%) |

EBG MA-WK | 3.73 (91.5%) | 3.07 (99.4%) | 3.12 (95.4%) | 4.93 (75.7%) | 4.47 (93.3%) | 4.27 (85.9%) |

EBG MA-WK-DOW | 3.68 (92.4%) | 3.03 (99.5%) | 3.06 (96.3%) | 5.04 (74.0%) | 4.54 (92.0%) | 4.36 (84.2%) |

Average days to detection, and percentage of outbreaks detected, at 1 false positive per month. Methods in bold are not significantly different (in terms of days to detect, at

For the datasets of respiratory Emergency Department visits (ED) and over-the-counter sales of thermometers (TH) in Allegheny County, the EBP methods displayed the highest performance for all three outbreak sizes, as measured by the average time until detection and proportion of outbreaks detected. There were no significant differences between the four variants of EBP, suggesting that neither day-of-week nor seasonal correction is necessary for these datasets. For small outbreaks, the EBG and KULL methods performed nearly as well as EBP (between 0.1 and 0.6 days slower). However, the differences between methods became more substantial for the medium and large outbreaks: for large outbreaks, EBG detected between 0.5 and 1.5 days slower than EBP, and KULL had very low detection power, detecting less than 40% of outbreaks and requiring over three additional days for detection.

For the dataset of cough and cold medication sales (CC) in Allegheny County, the most notable difference was that the time series methods with adjustment for seasonal trends (MA-WK) outperformed the time series methods that do not adjust for seasonality, achieving 1–2 days faster detection. The relative performance of the EBP, EBG, and KULL statistics was dependent on the size of the outbreak. However, the variants of the EBP method with adjustment for seasonality (EBP MA-WK and EBP MA-WK-DOW) were able to achieve high performance across all outbreak sizes. For small to medium-sized outbreaks, KULL outperformed EBP by a small but significant margin (0.3 to 0.5 days faster detection) when adjusted for day of week, and performed comparably to EBP without day-of-week adjustment. For large outbreaks, KULL again performed poorly, detecting three days later than EBP, and only detecting 15–61% of outbreaks (as compared to 98–99% for EBP).

For the dataset of anti-fever medication sales (AF) in Allegheny County, the results were very similar to the CC dataset, except that seasonal adjustment (MA-WK) did not improve performance. EBP methods performed best for large outbreaks and achieved consistently high performance across all outbreak sizes, while KULL outperformed EBP by about 1.2 days for small to medium-sized outbreaks. As in the other datasets, KULL had very low power to detect large outbreaks, detecting less than 25% of outbreaks and requiring more than six days to detect.

To further quantify the relationship between outbreak size and detection power, we measured the average number of injected cases needed for each method to detect 90% of outbreaks at 1 false positive per month, as a function of the number of zip codes affected. For this experiment, we used the same time series method for each detection method (MA-WK for the CC dataset, and MA for the other datasets). We also used the same set of scan regions for each detection method, searching over the set of distinct circular regions centered at each zip code, as in [_{S }

In this experiment, we saw substantial differences in the relative performance of methods between the datasets with low average daily counts (ED and TH) and the datasets with high average daily counts (CC and AF). For the ED and TH datasets, the EBP method outperformed the EBG and KULL methods (requiring fewer injected cases for detection) across the entire range of outbreak sizes. While EBP and EBG required a number of injected cases that increased approximately linearly with the number of affected zip codes, KULL showed dramatic decreases in detection power and required substantially more injected cases when more than 1/3 of the zip codes were affected. For the CC and AF datasets, EBP and EBG again required a number of injected cases that increased approximately linearly with the number of affected zip codes, with EBP outperforming EBG. KULL outperformed EBP when less than 2/3 of the zip codes were affected, but again showed very low detection power as the outbreak size became large.

In typical public health practice, randomization testing is used to evaluate the statistical significance of the clusters discovered by spatial scanning, and all regions with p-values below some threshold (typically,

We first examined whether the p-values produced by randomization testing are properly calibrated for our datasets. For each combination of method and dataset, we computed the p-value for each day of data with no outbreaks injected, using randomization testing with

False positive rates with randomization testing

method | ED dataset | TH dataset | CC dataset | AF dataset |
KULL MA | .046 | .141 | .544 | .358 |
KULL MA-DOW | .050 | .146 | .284 | .202 |
KULL MA-WK | .050 | .114 | .568 | .337 |
KULL MA-WK-DOW | .053 | .130 | .289 | .186 |
EBP MA | .068 | .162 | .517 | .403 |
EBP MA-DOW | .071 | .141 | .409 | .340 |
EBP MA-WK | .064 | .159 | .520 | .422 |
EBP MA-WK-DOW | .071 | .149 | .348 | .332 |
EBG MA | .334 | .398 | .244 | .204 |
EBG MA-DOW | .473 | .496 | .268 | .249 |
EBG MA-WK | .349 | .390 | .218 | .226 |
EBG MA-WK-DOW | .466 | .485 | .252 | .249 |

Proportion of days significant at

Thus we compared the detection power of each method with and without randomization testing, using empirically determined p-value and score thresholds corresponding to an actual false positive rate of

Detection power with and without randomization testing

method | ED dataset | TH dataset | CC dataset | AF dataset |

KULL MA | 3.23/3.23 | 4.41/4.23 | 5.40/ | 4.52/ |

KULL MA-DOW | 3.45/3.04 | 4.95/ | 3.73/ | 3.65/ |

KULL MA-WK | 3.31/3.23 | 5.26/ | 6.04/ | 5.30/ |

KULL MA-WK-DOW | 3.19/3.04 | 5.20/ | 3.57/ | 3.99/ |

EBP MA | 2.54/2.50 | 3.95/ | 6.36/ | 5.89/ |

EBP MA-DOW | 2.65/2.53 | 3.51/3.44 | 4.59/4.10 | 5.62/ |

EBP MA-WK | 2.74/2.50 | 5.04/ | 5.84/ | 5.11/ |

EBP MA-WK-DOW | 2.92/2.59 | 4.31/ | 5.05/ | 5.30/ |

EBG MA | 4.50/ | 5.90/ | 4.94/4.43 | 4.92/4.63 |

EBG MA-DOW | 5.48/ | 5.15/4.66 | 5.61/ | 5.00/4.79 |

EBG MA-WK | 4.87/ | 5.92/ | 3.82/ | 4.58/4.43 |

EBG MA-WK-DOW | 5.53/ | 5.92/ | 4.90/ | 4.53/4.56 |

Average days to detection at 1 false positive per month for "medium-sized" outbreaks injected into each dataset, using empirically determined thresholds on p-value (computed by randomization testing) and score (without randomization testing) respectively. If there is a significant difference between the detection times with and without randomization, the better-performing method is marked in bold.

One potential solution is to perform many more Monte Carlo replications, requiring a further increase in computation time. To examine this solution, we recomputed the average number of days to detection for the EBP MA method on each dataset, using

Recent work by Abrams et al. [

Detection power with and without randomization testing, using empirical/asymptotic p-values

method | ED dataset | TH dataset | CC dataset | AF dataset |

KULL MA | 3.17/3.23 | 4.24/4.23 | 2.60/2.55 | 3.18/3.19 |

KULL MA-DOW | 3.26/3.04 | 4.23/4.09 | 2.26/2.26 | 2.83/2.80 |

KULL MA-WK | 3.21/3.23 | 4.02/4.23 | 2.58/2.55 | 3.08/3.19 |

KULL MA-WK-DOW | 3.21/3.04 | 4.03/4.09 | 2.41/2.26 | 2.82/2.80 |

EBP MA | 2.48/2.50 | 3.28/3.29 | 3.90/3.99 | |

EBP MA-DOW | 2.49/2.53 | 3.57/3.44 | 4.20/4.36 | |

EBP MA-WK | 2.67/2.50 | 3.64/3.40 | 3.08/ | 4.62/ |

EBP MA-WK-DOW | 2.92/2.59 | 4.02/3.75 | 2.90/ | 4.53/ |

EBG MA | 2.84/2.91 | 4.00/4.19 | 4.52/4.43 | 4.35/4.63 |

EBG MA-DOW | 3.05/3.01 | 4.92/4.66 | 4.67/4.50 | 4.83/4.79 |

EBG MA-WK | 2.91/2.87 | 4.20/4.24 | 3.25/3.16 | 4.51/4.43 |

EBG MA-WK-DOW | 2.95/3.04 | 4.99/4.73 | 3.08/2.96 | 4.46/4.56 |

Average days to detection at 1 false positive per month for "medium-sized" outbreaks injected into each dataset, using empirically determined thresholds on p-value (computed by randomization testing, using empirical/asymptotic p-values [

In Table

Score and p-value thresholds corresponding to one false positive per month

method | ED dataset | TH dataset | CC dataset | AF dataset |
KULL MA | 7.3/0.029 | 10.6/3.7 × 10^{-3} | 25.0/2.4 × 10^{-7} | 18.6/9.0 × 10^{-6} |
KULL MA-DOW | 6.0/0.034 | 8.7/4.6 × 10^{-3} | 15.4/4.8 × 10^{-5} | 12.0/5.9 × 10^{-4} |
KULL MA-WK | 7.3/0.025 | 10.6/5.3 × 10^{-3} | 25.0/2.0 × 10^{-7} | 18.6/1.0 × 10^{-5} |
KULL MA-WK-DOW | 6.0/0.033 | 8.7/5.6 × 10^{-3} | 15.4/6.0 × 10^{-5} | 12.0/3.2 × 10^{-4} |
EBP MA | 6.7/0.025 | 10.3/2.6 × 10^{-3} | 68.6/3.7 × 10^{-13} | 31.4/1.3 × 10^{-11} |
EBP MA-DOW | 6.0/0.029 | 9.2/1.9 × 10^{-3} | 57.8/2.6 × 10^{-11} | 30.9/1.7 × 10^{-9} |
EBP MA-WK | 6.4/0.030 | 10.2/2.8 × 10^{-3} | 34.9/1.4 × 10^{-12} | 33.9/3.0 × 10^{-14} |
EBP MA-WK-DOW | 6.0/0.019 | 9.1/3.2 × 10^{-3} | 26.8/5.3 × 10^{-11} | 29.4/6.3 × 10^{-10} |
EBG MA | 13.9/4.5 × 10^{-5} | 20.9/2.1 × 10^{-7} | 28.6/8.2 × 10^{-11} | 19.9/6.1 × 10^{-7} |
EBG MA-DOW | 15.9/2.7 × 10^{-6} | 27.9/3.4 × 10^{-11} | 30.8/3.5 × 10^{-13} | 23.7/2.1 × 10^{-8} |
EBG MA-WK | 13.5/6.1 × 10^{-5} | 21.0/3.1 × 10^{-7} | 16.7/1.2 × 10^{-6} | 17.5/2.1 × 10^{-6} |
EBG MA-WK-DOW | 16.0/2.2 × 10^{-6} | 26.8/1.6 × 10^{-11} | 19.4/7.1 × 10^{-8} | 20.9/6.0 × 10^{-8} |

Score threshold (computed without randomization testing) and p-value threshold (computed by randomization testing, using empirical/asymptotic p-values) corresponding to an observed false positive rate of 0.0329, i.e. one false positive per month.
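Empirical calibration of a score threshold, as reported in the table above, can be sketched as follows. The idea is to take the maximum region score from each historical day with no injected outbreak and choose the threshold so that the fraction of those days exceeding it does not exceed the target false positive rate (0.0329 for one false positive per month). The function names are hypothetical, and the rule "alert when the score is strictly greater than the threshold" is an assumption of this sketch.

```python
def score_threshold(background_scores, false_positive_rate):
    """Threshold that at most the given fraction of background days exceed.

    background_scores: maximum region score for each day without an outbreak.
    """
    ranked = sorted(background_scores, reverse=True)
    # Number of background days that are allowed to trigger an alert.
    allowed = int(false_positive_rate * len(ranked))
    if allowed >= len(ranked):
        return float("-inf")  # every day may alert
    return ranked[allowed]


def days_to_detect(outbreak_scores, threshold):
    """First outbreak day (1-based) whose score exceeds the threshold.

    outbreak_scores: maximum region score on each day of an injected outbreak.
    Returns None if the outbreak is never detected.
    """
    for day, score in enumerate(outbreak_scores, start=1):
        if score > threshold:
            return day
    return None
```

The same calibration can be applied to p-values from randomization testing by thresholding on the negated p-value, which is how a score threshold and a p-value threshold can be matched to the same observed false positive rate.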

A number of other evaluation studies have compared the performance of spatial detection methods. These include studies comparing the spatial scan statistic to other spatial detection methods [

Nevertheless, it is important to acknowledge several limitations of the current study, which limit the generality of the conclusions that can be drawn from these experiments. First, this paper focuses specifically on the scenario of monitoring syndromic data from a small area (a single county) on a daily basis, with the goal of rapidly detecting emerging outbreaks of disease. In this case, we wish to detect higher than expected recent counts of health-related behaviors (hospital visits and medication sales) which might be indicative of an outbreak, whether these increases occur in a single zip code, a cluster of zip codes, or even the entire monitored county. This is different than the original use of spatial scan statistics for analysis of spatial patterns of chronic illnesses such as cancer, where we may not compare observed and expected counts, but instead attempt to detect clusters with higher disease rates inside than outside. Similarly, while we focused on county-level surveillance, responsibility for outbreak detection ranges across much broader levels of geography (e.g. state, national, and international), and larger-scale disease surveillance efforts might have very different operational requirements and limitations. Second, spatial syndromic surveillance approaches (including all of the methods considered in this study) might not be appropriate for all types of disease outbreaks. Our simulations focused on outbreaks for which these approaches are likely to have high practical utility. Such outbreaks would affect a large number of individuals (thus creating detectable increases in the counts being monitored), exhibit spatial clustering of cases (since otherwise spatial approaches might be ineffective), and have non-specific early-stage symptoms (since otherwise earlier detection might be achieved by discovering a small number of highly indicative disease findings). 
Third, our retrospective analysis did not account for various sources of delay (including lags in data entry, collection, aggregation, analysis, and reporting) which might be present in prospective systems. Any of these sources might result in additional delays between the first cases generated by an outbreak and its detection by a deployed surveillance system. Similarly, the absolute results (number of days to detect) are highly dependent on the number and spatial distribution of injected cases; for these reasons, the comparative performance results reported here should not be interpreted as an absolute operational metric. Fourth, while differences in the relative performance of methods between datasets demonstrate the importance of using multiple datasets for evaluation, this study was limited by data availability to consider only four datasets from a single county, three of which were different categories of OTC sales from the same year. Expansion of the evaluation to a larger number of datasets, with a higher degree of independence between datasets, would provide an even more complete picture of the relative performance of methods. Finally, this analysis used existing health datasets which were aggregated to the zip code level prior to being made available for this study. Data aggregation was necessary to protect patient privacy and preserve data confidentiality, but can result in various undesirable effects related to the "modifiable areal unit problem" (MAUP) [

Next, we consider several issues regarding the detection of large outbreaks affecting most or all of the monitored zip codes. In the original spatial scan setting, where the explicitly stated goal was to detect significant differences in disease rate inside and outside a region, such widespread increases might not be considered relevant, or might be interpreted as a decreased disease rate outside the region rather than an increased rate inside the region. However, our present work focuses on the detection of emerging outbreaks which result in increased counts, and when we are monitoring a small area (e.g. a single county), many types of illness might affect a large portion of the monitored area. In this case, it is essential to detect such widespread patterns of disease, and to distinguish whether differences in risk are due to higher than expected risk inside the region or lower than expected risk outside the region. Kulldorff's description of the SaTScan software [

It is also informative to consider our empirical results (in which EBP outperformed KULL for large outbreak sizes on all four datasets, and for small outbreak sizes on two of four datasets) in light of the theoretical results of Kulldorff [ ($q_{in} > q_{out}$) as compared to the null hypothesis of spatially uniform risk ($q_{in} = q_{out} = q_{all}$). While KULL is optimal for differentiating between these two hypotheses, it is not necessarily optimal for differentiating between outbreak and non-outbreak days which do not correspond to these specific hypotheses. Even when no outbreaks are occurring, the real-world health datasets being monitored are unlikely to correspond to the hypothesis of independent Poisson-distributed counts and spatially uniform risk; they may be overdispersed, exhibit spatial and temporal correlations, and contain outliers or other patterns due to non-outbreak events. Similarly, real-world outbreaks may not result in a constant, multiplicative increase in expected counts for the affected region, as assumed by KULL. Finally, we note that Kulldorff's notion of an "individually most powerful" test is somewhat different from that of a "uniformly most powerful" test, being geared mainly toward correct identification of the affected cluster as opposed to determination of whether or not the monitored area contains any clusters. Our empirical results demonstrate that high detection power in the theoretical setting (assuming ideal data generated according to known models) may not correspond to high detection power in real-world scenarios when the given model assumptions are violated.
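
The contrast between KULL and EBP for widespread outbreaks can be illustrated with a small Python sketch. It assumes the commonly stated Poisson log-likelihood ratio forms of the two statistics for a single candidate region; the function names `ebp_score` and `kull_score` and the toy counts are hypothetical and are not taken from the code used in this study.

```python
import math


def ebp_score(c, b):
    """Expectation-based Poisson log-likelihood ratio for one region.

    c: observed count in the region, b: expected (baseline) count.
    Only increases over the baseline produce a positive score.
    """
    if b <= 0 or c <= b:
        return 0.0
    return c * math.log(c / b) + b - c


def kull_score(c_in, b_in, c_all, b_all):
    """Kulldorff-style Poisson log-likelihood ratio for one region.

    Compares the count-to-baseline ratio inside the region with the
    ratio outside it; c_all and b_all are totals over the whole area.
    """
    c_out = c_all - c_in
    b_out = b_all - b_in
    if b_in <= 0 or b_out <= 0:
        return 0.0
    if c_in / b_in <= c_out / b_out:
        return 0.0  # no excess inside relative to outside
    score = c_in * math.log(c_in / b_in) - c_all * math.log(c_all / b_all)
    if c_out > 0:
        score += c_out * math.log(c_out / b_out)
    return score


# Hypothetical county: 10 zip codes, baseline 10 per zip code (total 100).
# Candidate region: 5 zip codes (baseline 50).

# Case 1: a 50% increase in the region only.
local_kull = kull_score(c_in=75, b_in=50, c_all=125, b_all=100)
local_ebp = ebp_score(c=75, b=50)

# Case 2: the same 50% increase in every zip code of the county.
wide_kull = kull_score(c_in=75, b_in=50, c_all=150, b_all=100)
wide_ebp = ebp_score(c=75, b=50)

print(local_kull, local_ebp)
print(wide_kull, wide_ebp)
```

In the second case the ratio of count to baseline is identical inside and outside the region, so the Kulldorff-style score is zero for every candidate region, while the expectation-based score is unchanged from the first case.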

This study compared the performance of twelve variants of the spatial scan statistic on the detection of simulated outbreaks injected into four different real-world public health datasets. We found that the relative performance of methods differs substantially depending on the size of the injected outbreak and various characteristics of the dataset (average daily count, and whether day-of-week and seasonal trends are present). Our results demonstrate that the traditional (Kulldorff) spatial scan statistic approach performs poorly for detecting large outbreaks that affect more than two-thirds of the monitored zip codes. However, the recently proposed expectation-based Poisson (EBP) and expectation-based Gaussian (EBG) statistics achieved high detection performance across all outbreak sizes, with EBP consistently outperforming EBG. For small outbreaks, EBP outperformed Kulldorff's statistic on the two datasets with low average daily counts (respiratory ED visits and OTC thermometer sales), while Kulldorff's statistic outperformed EBP on the two datasets with high average counts (OTC cough/cold and anti-fever medication sales). Using a simple adjustment for seasonal trends dramatically improved the performance of all methods when monitoring cough/cold medication sales, and adjusting for day-of-week improved the performance of Kulldorff's statistic on the cough/cold and anti-fever datasets. In all other cases, a simple 28-day moving average was sufficient to predict the expected counts in each zip code for each day. Finally, our results demonstrate that randomization testing is not necessary for spatial scan methods when performing small-area syndromic surveillance to detect emerging outbreaks of disease. No significant performance gains were obtained from randomization on our datasets, and in many cases the resulting p-values were miscalibrated, leading to high false positive rates and reduced detection power.
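
As a rough illustration of how expected counts might be obtained for one zip code, the Python sketch below computes a 28-day moving average and a simplified day-of-week variant. The day-of-week variant (averaging only the days in the window that share the target weekday) is an assumption made for illustration and is not necessarily the adjustment used in this study; the function names and the toy series are hypothetical.

```python
def moving_average_baseline(counts, window=28):
    """Expected count for the next day: mean of the last `window` counts."""
    recent = counts[-window:]
    if not recent:
        raise ValueError("at least one historical count is required")
    return sum(recent) / len(recent)


def day_of_week_baseline(counts, window=28):
    """Expected count for the next day, using same-weekday history only.

    Assumes counts[i] was observed on day i and that days i and i + 7
    fall on the same weekday. The day being predicted is len(counts).
    """
    n = len(counts)
    target = n % 7
    start = max(0, n - window)
    same_weekday = [counts[i] for i in range(start, n) if i % 7 == target]
    if not same_weekday:
        # Fewer than seven days of history: fall back to the plain average.
        return moving_average_baseline(counts, window)
    return sum(same_weekday) / len(same_weekday)


# Hypothetical series: four weeks with higher counts on the last two
# days of each week (for example, a weekend effect in OTC sales).
history = [10, 10, 10, 10, 10, 20, 20] * 4

plain = moving_average_baseline(history)
adjusted = day_of_week_baseline(history)
print(plain, adjusted)
```

For a series with a strong weekly pattern, the plain moving average overestimates the expected count on low-volume weekdays and underestimates it on high-volume ones, which is one way to see why a day-of-week adjustment can matter on datasets where such trends are present.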

These results suggest the following practical recommendations regarding the use of spatial scan methods for outbreak detection:

1. When evaluating the relative performance of different spatial scan methods, we recommend using a variety of different datasets and outbreak characteristics for evaluation, since focusing only on a single outbreak scenario may give a misleading picture of which methods perform best.

2. The traditional (Kulldorff) spatial scan statistic has very poor performance for large outbreak sizes, and thus we recommend the use of the expectation-based Poisson (EBP) statistic instead when large outbreaks are of potential interest. If only small outbreaks are of interest, we recommend the use of EBP on datasets with low average daily counts and Kulldorff's statistic on datasets with high average daily counts.

3. Adjustments for seasonal and day-of-week trends can significantly improve performance in datasets where these trends are present.

4. If a sufficient amount of historical data is available, we recommend empirical calibration of likelihood ratio scores (using the historical distribution of maximum region scores) instead of the current practice of statistical significance testing by randomization. If little historical data is available, we recommend the use of empirical/asymptotic p-values, and a threshold much lower than
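
The empirical calibration described in the fourth recommendation can be sketched in Python as follows. This is a minimal illustration, assuming that the maximum region score has already been computed for each historical day; the function names, the strict "greater than" alert rule, and the toy scores are hypothetical.

```python
def fraction_exceeding(historical_max_scores, todays_max_score):
    """Fraction of historical days whose maximum region score was at
    least as high as today's maximum region score."""
    n = len(historical_max_scores)
    if n == 0:
        raise ValueError("historical scores are required for calibration")
    exceed = sum(1 for s in historical_max_scores if s >= todays_max_score)
    return exceed / n


def score_threshold(historical_max_scores, false_positive_rate):
    """Score threshold such that, apart from ties, at most the given
    fraction of historical days would have scored strictly above it."""
    ranked = sorted(historical_max_scores, reverse=True)
    if not ranked:
        raise ValueError("historical scores are required for calibration")
    allowed_alerts = int(false_positive_rate * len(ranked))
    allowed_alerts = min(allowed_alerts, len(ranked) - 1)
    return ranked[allowed_alerts]


# Hypothetical maximum region scores from eight historical days.
history = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

threshold = score_threshold(history, false_positive_rate=0.25)
today = 6.5
alert = today > threshold
print(threshold, fraction_exceeding(history, today), alert)
```

An alert is raised when today's maximum region score exceeds the threshold, so the false positive rate is set directly from the historical score distribution rather than from randomization-based p-values.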

We are in the process of using the evaluation framework given here to compare a wide variety of other spatial biosurveillance methods, including Bayesian [

The author declares that they have no competing interests.

The author wishes to thank Greg Cooper and Jeff Lingwall for their comments on early versions of this paper. This work was partially supported by NSF grant IIS-0325581 and CDC grant 8-R01-HK000020-02. A preliminary version of this work was presented at the 2007 Annual Conference of the International Society for Disease Surveillance, and a one-page abstract was published in the journal