Surveillance of univariate syndromic data as a potential indicator of developing public health conditions has been used extensively. This paper aims to improve outbreak detection performance by using a background forecasting algorithm based on the adaptive recursive least squares method combined with a novel treatment of the Day of the Week effect.

Previous work by the first author has suggested that univariate recursive least squares (RLS) analysis of syndromic data can be used to characterize the background upon which a prediction and detection component of a biosurveillance system may be built. An adaptive implementation is used to deal with data non-stationarity. In this paper we develop and implement the RLS method for background estimation of univariate data. The distinctly dissimilar distribution of data for different days of the week, however, can affect filter implementations adversely, and so a novel procedure based on linear transformations of the sorted values of the daily counts is introduced. Seven-day-ahead daily predicted counts are used as background estimates. A signal injection procedure is used to examine the integrated algorithm's ability to detect synthetic anomalies in real syndromic time series. We compare the method to a baseline CDC forecasting algorithm known as the W2 method.

We present detection results in the form of Receiver Operating Characteristic curve values for four different injected signal to noise ratios using 16 sets of syndromic data. We find improvements in the false alarm probabilities when compared to the baseline W2 background forecasts.

The current paper introduces a prediction approach for city-level biosurveillance data streams such as time series of outpatient clinic visits and sales of over-the-counter remedies. This approach uses RLS filters modified by a correction for the weekly patterns often seen in these data series, and a threshold detection algorithm from the residuals of the RLS forecasts. We compare the detection performance of this algorithm to the W2 method recently implemented at CDC. The modified RLS method gives consistently better sensitivity at multiple background alert rates, and we recommend that it should be considered for routine application in bio-surveillance systems.

Timely detection of an outbreak is a major goal of public health data surveillance. Many techniques have been developed in the last several years to address anomaly detection in univariate time series. For instance, the CDC's current methods for time series aberration detection are based on Xbar and CUSUM control charts [

Multistream (multivariate) anomaly detection has also received some attention in the hope that a major outbreak could have early indications in some streams. For example, over-the-counter medication sales have been suggested in the literature as a potential early indicator of developing public health conditions, in particular in cases of interest to biosurveillance [

In a previous publication [

The finite impulse response implementation of the background predictor is, however, adversely affected by several issues. Many syndromic time series have low counts, and in these cases any potential advantages over simpler methods are lost. We therefore limit our study to those syndromic series that have mean and median daily counts of more than 100. Seasonal fluctuations have periods long enough that our adaptive methods handle them well. However, a strong Day of the Week (DOW) effect has too short a period, and we illustrate the DOW problem using the Respiratory-1 syndromic data from military outpatient clinics of a major metropolitan area in the United States. We define respiratory illness as acute infection of the upper and/or lower respiratory tract, excluding chronic conditions such as chronic bronchitis, asthma, and sinusitis [

Figure

Figure _{j }[^{st }week's data would be _{1 }[1], _{2 }[1],..., _{7 }[1]. Now we sort each {_{j }[_{j }≤ _{j}, is denoted by

As indicated in the previous section we deal exclusively with syndromic time series whose mean and median daily counts are greater than 100 [e.g., see Table

Time series data from military clinics – means and standard deviations

| Data Descriptor | Mean | SDev |
|---|---|---|
| | 335 | 196 |
| | 829 | 449 |
| | 682 | 277 |
| | 448 | 239 |
| | 393 | 242 |
| | 373 | 167 |
| | 368 | 208 |
| | 359 | 225 |
| | 352 | 177 |
| | 250 | 196 |
| | 247 | 131 |
| | 247 | 126 |
| | 227 | 138 |
| | 215 | 122 |
| | 214 | 110 |
| | 186 | 90 |

The normalization method is based on the sorted data values shown in figure _{j }[

and applied to the sorted data values to obtain the normalized data with the day of the week effect removed:

Normalized data

The process described above is performed on successive days and the output is used to make background predictions based on the recursive least squares method that we describe now. Denoting the clinical data time series on day number

and the recursive least squares method to compute the filter coefficients adaptively. The filter coefficients

includes a forgetting factor λ (see the appendix) equivalent to a 4-week effective memory length _{λ},

which corresponds to λ = 0.9655. See the appendix for a description of the updating procedure. The RLS recursive solution, using finite impulse response (FIR) filters and described in detail in the appendix, is then used in the following algorithm: For a k-step-ahead prediction of the background we compute multiple predictions for each day as follows. Denote the current step by

Thus in our method each day is predicted multiple times. Among the multiplicity of the error terms to be fed back into the recursive algorithm update equations, we have obtained better background estimates by using the error with the smallest magnitude.
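The adaptive filtering step can be illustrated with a minimal sketch of a textbook exponentially weighted RLS update for an FIR one-step predictor. This is not the authors' implementation: the update formulas in the paper's appendix may differ in detail, and the names `RLSPredictor`, `run_one_step`, `effective_memory`, and `min_magnitude_error`, the default filter order of 7, and the initialization constant `delta` are all assumptions made for illustration. The sketch also assumes the common relation N = 1/(1 − λ) between the forgetting factor and the effective memory length, under which λ = 0.9655 corresponds to about 29 days, close to four weeks.

```python
import numpy as np


def effective_memory(lam):
    """Effective memory length in samples, assuming N = 1 / (1 - lam)."""
    return 1.0 / (1.0 - lam)


def min_magnitude_error(errors):
    """Pick the error with the smallest magnitude among several predictions
    of the same day (the selection rule described in the text)."""
    return min(errors, key=abs)


class RLSPredictor:
    """Textbook exponentially weighted RLS for an FIR filter (illustrative)."""

    def __init__(self, order=7, lam=0.9655, delta=100.0):
        self.lam = lam                   # forgetting factor
        self.w = np.zeros(order)         # FIR filter coefficients
        self.P = delta * np.eye(order)   # inverse correlation matrix estimate

    def predict(self, x):
        """Filter output for the regressor x (most recent sample first)."""
        return float(self.w @ np.asarray(x, dtype=float))

    def update(self, x, d):
        """Update the coefficients with target d; return the a priori error."""
        x = np.asarray(x, dtype=float)
        e = d - self.w @ x                           # a priori prediction error
        Px = self.P @ x
        g = Px / (self.lam + x @ Px)                 # gain vector
        self.w = self.w + g * e
        self.P = (self.P - np.outer(g, Px)) / self.lam
        return float(e)


def run_one_step(series, order=7, lam=0.9655):
    """One-step-ahead predictions and errors over a whole series."""
    series = np.asarray(series, dtype=float)
    rls = RLSPredictor(order=order, lam=lam)
    preds = np.full(len(series), np.nan)
    errs = np.full(len(series), np.nan)
    for n in range(order, len(series)):
        x = series[n - order:n][::-1]    # previous `order` samples, newest first
        preds[n] = rls.predict(x)
        errs[n] = rls.update(x, series[n])
    return preds, errs
```

In the multi-look scheme described above, each day receives several predictions; `min_magnitude_error` shows only the selection of the smallest-magnitude error, which would then take the place of `e` in the coefficient update. A k-step-ahead variant would build the regressor from samples ending k days before the target day.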

We have applied the methods of the previous section to several univariate data streams. The first data set is the Respiratory-1 syndromic data (defined earlier) from military outpatient clinics of a major metropolitan area in the United States. In figure

To better quantify the differences in the context of a biosurveillance alerting system, we analyzed the detection performance of our method and compared it to that obtained using the W2 predictions. This analysis was based on the injection of a log-normal signal into authentic time series of clinical visit counts. The signal type follows the observation of Sartwell [

We chose the Respiratory-1 data consisting of 1402 consecutive authentic daily visit counts as our background time series. For each given day, the estimated (predicted) background was used to normalize the actual daily count for that day (a simple division), and the result thresholded at different levels. Note that the day under consideration matches the maximum signal level, in the sense that if day

_{rise}: _{Fall}] = background[_{rise}: _{Fall}] × (1 + α

where _{Rise }and _{Fall }indicate the rise time and the fall time of the signal and the index

A detection was recorded each time a threshold was exceeded. We found the probability of detection by dividing the number of detections on days when the signal was actually present by the total number of such days. Similarly, a probability of false alarm was computed from the days on which the signal was absent. The false alarm rate over a period of interest was computed by multiplying the false alarm probability by the number of days in that period, e.g. the false alarm rate per week was found by multiplying the false alarm probability by 7.
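The bookkeeping described above can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name `detection_stats`, its argument names, and the strict "greater than" comparison against the threshold are assumptions. Each daily count is divided by its predicted background, the ratio is thresholded, and the alarms are tallied separately for days with and without the injected signal.

```python
import numpy as np


def detection_stats(counts, background, signal_present, threshold):
    """Return (detection probability, false alarm probability,
    false alarms per week) for a single threshold.

    counts         -- observed daily counts (with any injected signal)
    background     -- predicted background for the same days
    signal_present -- boolean flags, True on days carrying the signal
    threshold      -- alarm threshold on the ratio counts / background
    """
    counts = np.asarray(counts, dtype=float)
    background = np.asarray(background, dtype=float)
    signal_present = np.asarray(signal_present, dtype=bool)

    ratio = counts / background      # normalize by the predicted background
    alarms = ratio > threshold       # an alarm each time the threshold is exceeded

    p_detect = alarms[signal_present].mean()        # alarms on signal days
    p_false_alarm = alarms[~signal_present].mean()  # alarms on signal-free days
    return p_detect, p_false_alarm, 7.0 * p_false_alarm


# Small made-up example: six days, two of which carry a signal.
pd_, pfa_, weekly_ = detection_stats(
    counts=[100, 120, 100, 200, 100, 140],
    background=[100, 100, 100, 100, 100, 100],
    signal_present=[False, True, False, True, False, False],
    threshold=1.3,
)
```

Repeating the call over a range of thresholds yields the (false alarm, detection) pairs that make up a Receiver Operating Characteristic curve.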

Figure

The next set of data consists of aggregated counts of syndromic surveillance data from the BioALIRT (Bio-Event Advanced Leading Indicator Recognition Technology) program that was conducted by DARPA [

The data comprise three types of daily syndromic counts from ten large metropolitan areas: diagnoses from military clinics, filled military prescriptions, and civilian physician office visits. Out of the available 30 time series we selected the 15 with the largest mean values; they represent daily visit counts classified in the Respiratory (Resp) and Gastrointestinal (GI) syndrome groups. These data together with their means and standard deviations are listed in Table

The current paper introduces a prediction approach for city-level biosurveillance data streams such as time series of outpatient clinic visits and over-the-counter remedy sales. This approach uses recursive-least-squares filters modified by a correction for the weekly patterns often seen in these data series. Unlike regression methods, these filters have the ability to adapt quickly to short-term trends, and this ability is essential for sensitivity to anomalies on a daily basis. Unlike some other adaptive methods based on autoregressive error modeling, this approach is applicable to many data types without detailed analysis. This flexibility is essential in surveillance applications.

For the study presented above, we formed a threshold detection algorithm from the residuals of these RLS forecasts. We compared the detection performance of this algorithm to the W2 method recently implemented at CDC. Like the day-of-week normalization above, the W2 method is a modification of the standard C2 aberration detector for improved handling of day-of-week effects. These two detectors were compared for power to detect realistic, simulated signals injected into each of a set of 16 actual datasets. These datasets were drawn from several sources, and the modified RLS-based method gave consistently better sensitivity at multiple background alert rates. Thus, this forecast technique should be further considered for routine application in biosurveillance systems.

Limitations of this modified RLS forecast method are the scale of time series required for application of the weekly pattern correction and the amount of data history required for useful filter coefficients. Regarding the necessary data scale, this forecaster gave improvements for time series with mean values as low as 30 counts per day, but informal tests have suggested that the method is applicable for lower scales as long as weekend counts do not drop off to zero. For the training data issue, 25% of the 700 days of the BioALIRT data (about a 6-month warmup) was used for the RLS forecasts in the simulation comparisons. The effective extent of these limitations and possible enhancements to overcome them are subjects for further research.

The RLS adaptation presented here should also be considered for multivariate forecasts. The adaptive modeling of cross-correlation effects combined with the ability to capture trends on a short time scale suggests a possible detection advantage over multivariate statistical process control charts. The increasing availability of multiple data sources emphasizes the need for tools that can effectively combine various types of statistical evidence.

OTC: over the counter (medications); BioALIRT: Bio-Event Advanced Leading Indicator Recognition Technology; RLS: Recursive Least Squares.

The authors declare that they have no competing interests.

The idea of predicting clinical data using recursive least squares with feedback of minimum error among multiple looks, and the associated computer programs were developed by AHN. Data was provided by HB, who also wrote most of the conclusion, and contributed to the responses to the referees.


This journal article was supported by Grant 1-R01-PH000024-01 from CDC. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of CDC.