This is an Open Access article distributed under the terms of the Creative Commons Attribution License.

When reporting incidence rate estimates for relatively rare health conditions, associated case counts are often assumed to follow a Poisson distribution. Case counts obtained from large-scale electronic surveillance systems are often inflated by the presence of false positives, however, and adjusted case counts based on the results of a validation sample will have variances which are hyper-Poisson. This paper presents a simple method for constructing interval estimates for incidence rates based on case counts that are adjusted downward using an estimate of the predictive value positive of the surveillance case definition.

Large-scale surveillance for selected medical or health conditions often relies on electronic data sources which provide comprehensive coverage of a given population. For example, the Centers for Disease Control and Prevention conduct surveillance of brain injuries involving hospitalization or death, based on electronic hospital discharge and vital statistics data received from twelve to fifteen states each year.

As with most surveillance methods, an operational case definition may admit some records that do not represent true cases under a strict clinical definition ("false positives") and may also fail to capture some records representing true cases ("false negatives"). The customary terms reflecting these aspects of an operational case definition are predictive value positive (PVP) and sensitivity, defined in the present context as the conditional probabilities:

PVP = Pr{case meets clinical definition | case detected under operational definition};

sensitivity = Pr{case detected under operational definition | case meets clinical definition}.

Depending on the extent to which false positives and/or false negatives are believed to influence the surveillance process, it may be appropriate to use estimates of PVP and/or sensitivity to adjust incidence rate estimates accordingly. It is not generally possible to assess PVP or sensitivity using electronic surveillance data alone. The most direct approach to obtaining the additional data required for estimation of PVP involves manual review of medical records for a random sample of provisional cases identified via the operational case definition. Obtaining the additional data necessary for estimation of sensitivity may be more labor-intensive, particularly when considering an uncommon condition. Without additional "markers" (apart from the operational case definition) to narrow the scope of review, it may be necessary to select a very large sample of general medical records in order to identify enough true cases to support a stable estimate of sensitivity.

The methodology described in this paper is oriented to surveillance of relatively rare health conditions. Because validation data quantifying the influence of false positives will typically be easier to obtain than data quantifying the influence of false negatives in this setting, the development concentrates on incidence rate estimates reflecting adjustments for PVP. This emphasis is not intended to diminish the potential influence of false negatives; rather, it reflects the logistical difficulties associated with obtaining data on false negatives as part of ongoing surveillance. If there is sufficient doubt surrounding the sensitivity of case ascertainment for any particular surveillance process, the proposed methodology should be applied with due caution.

For a given surveillance period, it is assumed that case confirmation data are available for a random sample (selected without replacement) of provisional cases. Data obtained through such validation efforts allow estimation of PVP as well as adjustments to case counts to eliminate the bias due to false positives. To illustrate, suppose that for a set period (e.g., one year) of observation:

N = size of the at-risk population covered by the surveillance system;

M = count of provisional cases detected under the operational case definition;

M_{T} = count of true cases (unknown) among the provisional cases;

M_{F} = count of false positive cases (unknown) among the provisional cases = M - M_{T};

S = number of provisional cases sampled for case confirmation;

C_{T} = count of confirmed true cases among those sampled;

C_{F} = count of cases determined to be false positives among those sampled = S - C_{T}.

The usual estimate of PVP is given by:

\widehat{PVP} = C_{T}/S = C_{T}/(C_{T} + C_{F}).
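The PVP estimate and the downward adjustment of the provisional case count can be sketched in a few lines. This is a minimal illustration, not the author's code: the function names and the validation counts are hypothetical, and the adjusted count is taken to be M multiplied by the estimated PVP.

```python
def estimate_pvp(confirmed_true: int, confirmed_false: int) -> float:
    """Estimate PVP as C_T / (C_T + C_F) from a validation sample."""
    sampled = confirmed_true + confirmed_false
    if sampled <= 0:
        raise ValueError("validation sample must contain at least one case")
    return confirmed_true / sampled


def adjusted_case_count(provisional: int, confirmed_true: int,
                        confirmed_false: int) -> float:
    """Scale the provisional count M by the estimated PVP (0 when M = 0)."""
    if provisional == 0:
        return 0.0
    return provisional * estimate_pvp(confirmed_true, confirmed_false)


# Hypothetical example: 500 provisional cases, 100 reviewed, 90 confirmed.
pvp_hat = estimate_pvp(confirmed_true=90, confirmed_false=10)
m_hat = adjusted_case_count(provisional=500, confirmed_true=90,
                            confirmed_false=10)
print(f"estimated PVP = {pvp_hat:.2f}, adjusted case count = {m_hat:.1f}")
```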

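Noting that \widehat{PVP} estimates the proportion M_{T}/M of true cases among the provisional cases, the PVP-adjusted case count is:

\hat{M}_{T} = M·\widehat{PVP} = M·C_{T}/S. (1)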

Case counts obtained through comprehensive surveillance may be considered inherently variable even though they are essentially census-level quantities, in the sense that a case count can be viewed as representing one observation from a hypothetically repeatable process. Suppose for the moment that the true case count M_{T} could be determined. When reporting the corresponding incidence rate R = M_{T}/N one might also make use of the variance estimate \widehat{Var}(R) = M_{T}/N^{2}, under the assumption that M_{T} represents one observation from a Poisson process.

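The remainder of this paper addresses three aspects of the problem outlined above: (i) a simple model for the true and false positive case counts within the defined framework, (ii) selected properties of the estimator \hat{M}_{T}, and (iii) a simple method for constructing interval estimates for the underlying true case count and the corresponding incidence rate.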

To evaluate the proposed estimator \hat{M}_{T}, a model for the joint behavior of M, M_{T}, and M_{F} is needed. For a given at-risk population and surveillance period it will be assumed that the provisional case count M is generated according to a Poisson process with parameter λ. Each provisional case, independent of other provisional cases, will be assumed to be a true case with probability equal to the underlying PVP. These assumptions are reflected in the following mixture model:

M ~ POI(λ);

M_{T}|M ~ BIN(M, PVP)

where POI denotes the Poisson distribution and BIN denotes the binomial distribution. The count of false positive cases is implicitly given by M_{F} = M - M_{T}. It is well established that under this type of decomposition M_{T} and M_{F} are independent Poisson random variables such that M_{T} ~ POI(τ) and M_{F} ~ POI(φ), where τ = λ·PVP and φ = λ·(1-PVP).
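The decomposition result can be checked empirically. The following is a small simulation sketch using only the Python standard library; the helper names, the seed, and the parameter values (λ = 100, PVP = 0.8) are illustrative assumptions. If M_{T} and M_{F} are independent Poisson variables, each sample variance should be close to the corresponding sample mean and the sample covariance should be close to zero.

```python
import random


def poisson_draw(lam, rng):
    """Count arrivals of a rate-lam Poisson process during one time unit."""
    count = 0
    t = rng.expovariate(lam)
    while t <= 1.0:
        count += 1
        t += rng.expovariate(lam)
    return count


def simulate_counts(lam, pvp, reps, seed=0):
    """Draw (M_T, M_F) pairs under M ~ POI(lam), M_T | M ~ BIN(M, pvp)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(reps):
        m = poisson_draw(lam, rng)
        m_true = sum(1 for _ in range(m) if rng.random() < pvp)
        pairs.append((m_true, m - m_true))
    return pairs


def summarize(pairs):
    """Return sample means, sample variances and the sample covariance."""
    n = len(pairs)
    mean_t = sum(t for t, _ in pairs) / n
    mean_f = sum(f for _, f in pairs) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in pairs) / (n - 1)
    var_f = sum((f - mean_f) ** 2 for _, f in pairs) / (n - 1)
    cov = sum((t - mean_t) * (f - mean_f) for t, f in pairs) / (n - 1)
    return mean_t, var_t, mean_f, var_f, cov


mean_t, var_t, mean_f, var_f, cov = summarize(simulate_counts(100, 0.8, 5000))
print(f"M_T: mean {mean_t:.1f}, variance {var_t:.1f}")
print(f"M_F: mean {mean_f:.1f}, variance {var_f:.1f}")
print(f"covariance: {cov:.2f}")
```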

This section examines several important properties of the estimator \hat{M}_{T}.

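E[\hat{M}_{T}] = τ, (2)

and when f·λ is sufficiently large, where f denotes the fraction of provisional cases sampled for confirmation (so that S is approximately f·M):

Var(\hat{M}_{T}) ≈ τ + τ·(1-PVP)·(1-f)/f. (3)

Equality (2) indicates that \hat{M}_{T} is an unbiased estimator of τ. The first component of (3) is the variance of M_{T}. The second component approximates the addition to variance that results from the case count adjustment based on \widehat{PVP}.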

It is noted in passing that when case populations are typically small, it may be feasible to adopt the practice of confirming all provisional cases. Under this approach \hat{M}_{T} = M_{T} and it follows that Var(\hat{M}_{T}) = τ.
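The size of the extra variance is easy to tabulate. The sketch below assumes a fixed sampling fraction f (so that S is roughly f·M) and uses the large-sample form τ + τ·(1-PVP)·(1-f)/f, which follows from the Poisson-binomial model under that assumption; the function name and the example values are hypothetical. With f = 1 (all provisional cases confirmed) the expression reduces to the Poisson variance τ.

```python
def approx_variance(lam: float, pvp: float, f: float) -> float:
    """Large-sample variance of the PVP-adjusted count under a fixed
    sampling fraction f (assumed form: tau + tau*(1-pvp)*(1-f)/f)."""
    if not 0.0 < f <= 1.0:
        raise ValueError("sampling fraction f must lie in (0, 1]")
    tau = lam * pvp
    return tau + tau * (1.0 - pvp) * (1.0 - f) / f


# Ratio of the adjusted-count variance to the plain Poisson variance tau.
for f in (0.10, 0.25, 0.50, 1.00):
    ratio = approx_variance(1000, 0.7, f) / (1000 * 0.7)
    print(f"f = {f:.2f}: variance ratio = {ratio:.2f}")
```

The ratio depends only on PVP and f, not on λ, which is why a smaller validation fraction inflates the variance by the same factor at every population size.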

The remaining objective is the formulation of a simple method for constructing interval estimates for τ and the corresponding incidence rate. From (2) it is already known that \hat{M}_{T} is an unbiased estimator of τ.

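Based on (4) an approximate (1-α)·100% confidence interval (adjusted for the false positive bias) for the recurring case count τ is given by:

\hat{M}_{T} ± z_{α/2}·\sqrt{\widehat{Var}(\hat{M}_{T})},

where \widehat{Var}(\hat{M}_{T}) denotes the estimated variance of \hat{M}_{T} and z_{α/2} represents the appropriate quantile of the standard normal distribution. The corresponding interval estimate for the population-based incidence rate is:

(\hat{M}_{T} ± z_{α/2}·\sqrt{\widehat{Var}(\hat{M}_{T})})/N,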

where it will be recalled that N is the size of the at-risk population under surveillance. As an example, suppose that an interval estimate providing 95% relative frequency of coverage is desired for the population-based incidence rate. Table 1 gives estimated relative coverage frequencies of the nominal 95% interval for selected values of PVP, λ, and f.

Table 1. Estimated Relative Coverage Frequencies of a Nominal 95% Interval with Variance Correction.

| f | PVP = 0.70, λ = 100 | PVP = 0.70, λ = 500 | PVP = 0.70, λ = 1000 | PVP = 0.80, λ = 100 | PVP = 0.80, λ = 500 | PVP = 0.80, λ = 1000 | PVP = 0.90, λ = 100 | PVP = 0.90, λ = 500 | PVP = 0.90, λ = 1000 |
|---|---|---|---|---|---|---|---|---|---|
| 0.10 | 0.92 | 0.94 | 0.95 | 0.92 | 0.95 | 0.95 | 0.94 | 0.95 | 0.95 |
| 0.25 | 0.94 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 |
| 0.50 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 0.94 | 0.95 | 0.95 |
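For a single surveillance period, the interval calculation can be sketched as follows. This is an illustration under stated assumptions rather than the paper's exact procedure: the variance is estimated by plugging \widehat{PVP} and the realized sampling fraction S/M into the large-sample variance expression, and the function name and input values are hypothetical.

```python
import math
from statistics import NormalDist


def adjusted_rate_interval(m, c_true, s, n, alpha=0.05):
    """Approximate (1 - alpha) interval for the PVP-adjusted incidence rate.

    m: provisional case count, s: validation sample size,
    c_true: confirmed true cases in the sample, n: at-risk population size.
    The variance uses a plug-in estimate (an assumption of this sketch).
    """
    if not 0 < s <= m:
        raise ValueError("sample size s must satisfy 0 < s <= m")
    pvp_hat = c_true / s
    frac = s / m                      # realized sampling fraction
    m_hat = m * pvp_hat               # PVP-adjusted case count
    var_hat = m_hat + m_hat * (1.0 - pvp_hat) * (1.0 - frac) / frac
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * math.sqrt(var_hat)
    return (m_hat - half_width) / n, (m_hat + half_width) / n


# Hypothetical year: 500 provisional cases, 125 reviewed, 100 confirmed,
# in a population of one million.
lower, upper = adjusted_rate_interval(m=500, c_true=100, s=125, n=1_000_000)
print(f"rate per 100,000: {lower * 1e5:.1f} to {upper * 1e5:.1f}")
```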

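To illustrate the importance of the correction to the variance, Table 2 gives the corresponding estimated coverage frequencies when the interval is constructed without the variance correction. Coverage falls well below the nominal level, particularly when the sampling fraction f is small.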

Table 2. Estimated Relative Coverage Frequencies of a Nominal 95% Interval without Variance Correction.

| f | PVP = 0.70, λ = 100 | PVP = 0.70, λ = 500 | PVP = 0.70, λ = 1000 | PVP = 0.80, λ = 100 | PVP = 0.80, λ = 500 | PVP = 0.80, λ = 1000 | PVP = 0.90, λ = 100 | PVP = 0.90, λ = 500 | PVP = 0.90, λ = 1000 |
|---|---|---|---|---|---|---|---|---|---|
| 0.10 | 0.73 | 0.70 | 0.68 | 0.78 | 0.77 | 0.76 | 0.86 | 0.84 | 0.84 |
| 0.25 | 0.84 | 0.85 | 0.85 | 0.87 | 0.88 | 0.88 | 0.92 | 0.91 | 0.91 |
| 0.50 | 0.91 | 0.92 | 0.91 | 0.92 | 0.93 | 0.93 | 0.93 | 0.94 | 0.94 |
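A coverage comparison of this kind can be approximated with a short simulation. The sketch below is not the author's simulation code and is not expected to reproduce the tabulated values exactly. It assumes a validation sample of size max(1, ⌈f·M⌉) drawn without replacement, a plug-in variance estimate for the corrected interval, and the adjusted count itself as the variance for the uncorrected interval; all names, the seed, and the parameter values are illustrative.

```python
import math
import random
from statistics import NormalDist


def poisson_draw(lam, rng):
    """Count arrivals of a rate-lam Poisson process during one time unit."""
    count = 0
    t = rng.expovariate(lam)
    while t <= 1.0:
        count += 1
        t += rng.expovariate(lam)
    return count


def one_replicate(lam, pvp, f, rng):
    """Return (adjusted count, corrected variance, uncorrected variance)."""
    m = poisson_draw(lam, rng)
    if m == 0:
        return 0.0, 0.0, 0.0
    m_true = sum(1 for _ in range(m) if rng.random() < pvp)
    s = max(1, math.ceil(f * m))          # assumed sampling rule
    # Sample without replacement; indices below m_true are the true cases.
    c_true = sum(1 for i in rng.sample(range(m), s) if i < m_true)
    pvp_hat = c_true / s
    m_hat = m * pvp_hat
    frac = s / m
    var_corrected = m_hat + m_hat * (1.0 - pvp_hat) * (1.0 - frac) / frac
    return m_hat, var_corrected, m_hat


def coverage(lam, pvp, f, reps=2000, seed=1, alpha=0.05):
    """Estimate coverage of tau = lam * pvp with and without the correction."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    tau = lam * pvp
    hits_corrected = hits_uncorrected = 0
    for _ in range(reps):
        m_hat, var_c, var_u = one_replicate(lam, pvp, f, rng)
        hits_corrected += abs(m_hat - tau) <= z * math.sqrt(var_c)
        hits_uncorrected += abs(m_hat - tau) <= z * math.sqrt(var_u)
    return hits_corrected / reps, hits_uncorrected / reps


corrected, uncorrected = coverage(lam=200, pvp=0.7, f=0.10)
print(f"with correction: {corrected:.3f}, without: {uncorrected:.3f}")
```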

Extensions to independent subgroups (e.g., age groups) and aggregates (e.g., age-adjusted rates) are straightforward. Provided that subgroup boundaries do not divide the surveillance population too finely, the error associated with the interval estimation method described above should remain minimal.

This paper was motivated by considerations related to analysis of data from the brain injury surveillance system mentioned in the introduction. Beginning with surveillance year 2000, a number of participating states identified provisional cases which were subsequently determined to be false positives upon in-depth review. Preliminary estimates of PVP were observed to fall close to 0.9 for some states, suggesting the need for adjusted incidence rate estimates. This issue is also relevant in a broader context, as a wide range of PVP estimates has been reported for other surveillance systems.

Adjustments to incidence rate estimates to eliminate the false positive bias are straightforward. However, since the PVP estimates used to make such downward adjustments are subject to random variation, the adjusted rates have an additional source of variation beyond what is usually assumed. Interval estimates failing to account for this fact may have coverage frequencies well below the nominal level. This paper presents a simple method of interval estimation for rates that have been adjusted to remove the bias due to false positives, applicable in large-scale surveillance settings.

The methodology presented does not address the potential bias associated with false negatives. In situations where validation data also support estimation of sensitivity, surveillance case counts could be further adjusted to reduce or eliminate such bias. This in turn would introduce another source of variation in the adjusted case counts and associated rates. Other types of sampling plans might also be considered. For example, a fixed sample size s* might be preferred, in which case S = min(s*, M) and an alternate expression for Var(\hat{M}_{T}) would be needed.

In the sampling procedure considered, the size of the validation sample depends on the provisional case count M. To make the analysis generic, the sample size will be denoted by s(M) where s(·) depends on the particular sampling procedure but is assumed positive whenever M > 0. The PVP-adjusted case count (1) can then be defined more precisely as:

\hat{M}_{T} = M·C_{T}/s(M) if M > 0; \hat{M}_{T} = 0 if M = 0, (A.1)

where implicitly \widehat{PVP} = C_{T}/s(M). When M > 0 the distribution of C_{T} conditional on M and M_{T} is hypergeometric, that is, C_{T}|M, M_{T} ~ HYP(s(M), M_{T}, M). It is not difficult to show that when M > 0 the distribution of C_{T} conditional on M only is binomial, that is, C_{T}|M ~ BIN(s(M), PVP). It follows that E[\hat{M}_{T}|M] = M·PVP for all M ≥ 0, and therefore:

E[\hat{M}_{T}] = E[M·PVP] = λ·PVP = τ.

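To determine Var(\hat{M}_{T}), the variance is decomposed by conditioning on M:

Var(\hat{M}_{T}) = E[Var(\hat{M}_{T}|M)] + Var(E[\hat{M}_{T}|M]).

Since E[\hat{M}_{T}|M] = M·PVP and Var(M) = λ, the second component is λ·PVP^{2}. Evaluation of the first component of variance is more complicated. Defining:

g(M) = M^{2}/s(M) if M > 0; g(M) = 0 if M = 0,

it follows from (A.1) and the fact that C_{T}|M ~ BIN(s(M), PVP) when M > 0 that:

Var(\hat{M}_{T}) = PVP·(1-PVP)·E[g(M)] + λ·PVP^{2}.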

The task is thus reduced to determining E[g(M)]. When s(M) =

Numerical calculation of Var(
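For an arbitrary sampling rule s(M), the expectation E[g(M)] can be evaluated numerically by summing over the Poisson probability mass function. The sketch below assumes g(M) = M²/s(M) for M > 0 (and 0 otherwise), so that the variance is PVP·(1-PVP)·E[g(M)] + λ·PVP²; the function names, the truncation point of the sum, and the example sampling rules (a 25% fraction rounded up, and a fixed size s* = 50) are illustrative choices, not taken from the paper.

```python
import math


def poisson_pmf(m: int, lam: float) -> float:
    """Poisson probability mass, computed on the log scale for stability."""
    return math.exp(-lam + m * math.log(lam) - math.lgamma(m + 1))


def numeric_variance(lam, pvp, sample_size, tail_sd=12):
    """Var of the adjusted count: pvp*(1-pvp)*E[M^2/s(M)] + lam*pvp^2.

    sample_size: function mapping M > 0 to the validation sample size s(M).
    The sum is truncated tail_sd standard deviations above lam.
    """
    upper = int(lam + tail_sd * math.sqrt(lam)) + 20
    expected_g = sum(poisson_pmf(m, lam) * m * m / sample_size(m)
                     for m in range(1, upper + 1))
    return pvp * (1.0 - pvp) * expected_g + lam * pvp ** 2


def fixed_fraction(m):
    """Sample 25% of provisional cases, rounded up (assumed rule)."""
    return math.ceil(0.25 * m)


def fixed_size(m):
    """Sample min(s*, M) cases with s* = 50 (assumed value)."""
    return min(50, m)


def census(m):
    """Confirm every provisional case."""
    return m


for name, rule in (("fixed fraction", fixed_fraction),
                   ("fixed size", fixed_size),
                   ("census", census)):
    print(f"{name}: Var = {numeric_variance(1000, 0.8, rule):.1f}")
```

Under the census rule the result equals τ = λ·PVP, and under the fixed-fraction rule it should sit just below τ + τ·(1-PVP)·(1-f)/f because rounding up slightly enlarges the sample.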

The following is proposed as an estimator of the right-hand side of (A.2):

Defining:

it follows from the treatment in Appendix A that the expected value of the variance estimator (B.1) conditioned on M is:

Then, since

When s(M) =

Algebraic simplification results in:

As f·λ becomes large, approximation (A.2) results.

The author(s) declare that they have no competing interests.