Among the goals of the molecular epidemiology of infectious disease are to quantify the extent of ongoing transmission of infectious agents and to identify host- and strain-specific risk factors for disease spread. Using computer simulations to reconstruct populations of tuberculosis patients and to sample from them, I demonstrate the potential bias in estimates of recent transmission and of the impact of risk factors for clustering. The bias consistently leads to underestimation of both recent transmission and the impact of risk factors for recent transmission.

Molecular epidemiology makes use of the genetic diversity within strains of infectious organisms to track the transmission of these organisms in human populations. It is used extensively to differentiate reactivation tuberculosis (TB), which is due to a remote infection, from disease caused by recently transmitted organisms. This approach is based on the concept that epidemiologically related organisms share similar or identical genetic fingerprints, while unrelated organisms differ at some genetic loci. Isolates of

In addition to distinguishing primary TB from reactivation disease, these molecular techniques have been used to identify risk factors for recent transmission in population-based epidemiologic studies

Two different methods have been used to estimate the proportion of clustered cases. The first method, usually referred to as the “n” method, uses the number of all cases that fall into clusters as the estimator of clustered cases. The “n minus one” method assumes that one case per cluster is a case of reactivation TB and thus removes one case per cluster from the counts of “clustered” cases. The “n minus one” approach gives a number of clustered cases that is always less than that calculated by the n approach. Covariates associated with clustered fingerprints are taken to be host-specific risk factors for recent transmission of
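The difference between the two estimators can be illustrated with a minimal sketch. The function name `proportion_clustered` and the example cluster distribution are hypothetical and are not taken from the study; a cluster distribution is represented as a mapping from cluster size to the number of clusters of that size, with size 1 denoting unique isolates.

```python
def proportion_clustered(cluster_counts, method="n"):
    """Proportion of cases counted as clustered.

    cluster_counts maps cluster size k to the number of clusters of that
    size; k == 1 denotes isolates with a unique fingerprint.
    """
    total_cases = sum(k * n for k, n in cluster_counts.items())
    if total_cases == 0:
        raise ValueError("cluster_counts contains no cases")

    # Cases whose fingerprint is shared with at least one other case.
    cases_in_clusters = sum(k * n for k, n in cluster_counts.items() if k >= 2)

    if method == "n":
        clustered = cases_in_clusters
    elif method == "n-1":
        # Treat one case per cluster as the reactivated source case.
        n_clusters = sum(n for k, n in cluster_counts.items() if k >= 2)
        clustered = cases_in_clusters - n_clusters
    else:
        raise ValueError("method must be 'n' or 'n-1'")
    return clustered / total_cases


# Hypothetical distribution: 40 unique isolates, 10 pairs,
# 5 clusters of size 3, and 5 clusters of size 5 (100 cases in total).
example = {1: 40, 2: 10, 3: 5, 5: 5}
print(proportion_clustered(example, "n"))    # 60 of 100 cases
print(proportion_clustered(example, "n-1"))  # 40 of 100 cases
```

In this example the “n minus one” estimate is lower by exactly the number of clusters (20) divided by the number of cases (100).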

These population-based molecular studies are often based on random or convenience samples drawn from available clinical isolates of

The magnitude of the bias incurred by sampling strategies depends both on the sampling fraction and the frequency distribution of sizes of clusters in the population. A recent simulation study of the influence of sampling on estimates of recent TB transmission demonstrated that an increase in sampling fraction yields an increase in the proportion of isolates identified as clustered
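The dependence on sampling fraction and cluster-size distribution can be explored with a small Monte Carlo sketch. It assumes that every case is sampled independently with the same probability `p`; the function names and the example distribution are hypothetical and do not reproduce the cited simulation study.

```python
import random


def sample_unique_proportion(cluster_counts, p, rng):
    """Draw one sample and return the proportion of sampled isolates
    that appear unique (None if the sample is empty)."""
    sampled_total = 0
    sampled_unique = 0
    for k, n_clusters in cluster_counts.items():
        for _ in range(n_clusters):
            # Number of members of this cluster that end up in the sample.
            m = sum(1 for _ in range(k) if rng.random() < p)
            sampled_total += m
            if m == 1:
                # A lone sampled member looks like a unique isolate.
                sampled_unique += 1
    if sampled_total == 0:
        return None
    return sampled_unique / sampled_total


def mean_unique_proportion(cluster_counts, p, n_reps=500, seed=0):
    """Average the apparent proportion of unique isolates over n_reps samples."""
    rng = random.Random(seed)
    draws = [sample_unique_proportion(cluster_counts, p, rng) for _ in range(n_reps)]
    draws = [d for d in draws if d is not None]
    return sum(draws) / len(draws)


# Hypothetical cluster distribution (size -> number of clusters).
example = {1: 40, 2: 10, 3: 5, 5: 5}
for fraction in (1.0, 0.7, 0.5, 0.1):
    print(fraction, round(mean_unique_proportion(example, fraction), 3))
```

With complete sampling the sketch returns the true proportion of unique isolates; as the sampling fraction falls, more members of true clusters are sampled alone and the apparent proportion of unique isolates rises.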

The purpose of this study is to investigate biases inherent in estimating measures of clustering and risk factors for clustering when common sampling strategies are used to collect the empirical data. Since the true distributions of cluster sizes cannot be directly observed if sampling is not complete, I used a Monte Carlo simulation model to generate a variety of hypothetical cluster distributions based on simple assumptions about TB transmission. These distributions represent a wide range of potential data structures reflecting heterogeneous transmission parameters, contact networks, and sociodemographic variables. Accordingly, my aim here is not to model TB transmission dynamics with precision but to generate a collection of heterogeneous cluster distributions that could be used to demonstrate the effects of sampling, given a variety of potential transmission settings.

Generally, the microsimulation model enumerates a population of discrete individuals, each of whom is characterized by a vector of variables that affect risk for TB infection, for clinical disease, and for transmitting infection once infected. Persons are assigned to a series of social and physical spaces such as households, neighborhoods, and multineighborhood communities. The model also specifies the stochastic processes by which latent disease reactivates, infection progresses to primary TB, immunity is conferred by vaccination or by previous infection, and duration of disease is determined. Persons to whom disease is transmitted during the simulation acquire a variable reflecting the strain number of the source of their infection; thus, chains of disease transmission can be identified as “clusters” of cases sharing a specific strain number. The model is run over a time period during which these stochastic processes may occur. Output of the model includes standard measures of the incidence of infection and disease, the prevalence of infectious TB over time, and a count of cluster sizes. Five different cluster distributions were generated on the basis of running the model for 4 years with input variables specific to the different geographic and social settings in which TB is transmitted. The assumptions and baseline input variables for the model have been described
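The way clusters arise from strain labels can be shown with a deliberately simplified toy model. It is not the microsimulation described above: it ignores households, immunity, latency, and time, and it assumes a geometric number of secondary cases per case. All names and parameter values are hypothetical.

```python
import random
from collections import Counter


def draw_secondary_cases(mean_secondary, rng):
    """Geometric number of secondary cases with the requested mean
    (an assumed offspring distribution, chosen only for simplicity)."""
    q = mean_secondary / (1.0 + mean_secondary)
    count = 0
    while rng.random() < q:
        count += 1
    return count


def simulate_cluster_counts(n_reactivations, mean_secondary, seed=0, max_cases=100_000):
    """Return a mapping from cluster size to number of clusters of that size."""
    rng = random.Random(seed)
    # Each reactivation case introduces a new strain number.
    strain_of_case = list(range(n_reactivations))
    pending = list(strain_of_case)  # strains of cases that have not yet transmitted
    # max_cases is a soft cap that guards against supercritical runs.
    while pending and len(strain_of_case) < max_cases:
        strain = pending.pop()
        for _ in range(draw_secondary_cases(mean_secondary, rng)):
            # A secondary case inherits the strain number of its source.
            strain_of_case.append(strain)
            pending.append(strain)
    cluster_sizes = Counter(strain_of_case)          # strain -> cluster size
    return dict(Counter(cluster_sizes.values()))     # size -> number of clusters


print(sorted(simulate_cluster_counts(200, 0.5, seed=1).items()))
```

Raising `mean_secondary` shifts the distribution toward larger clusters, which is the property the sampling analysis depends on.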

The proportion of unique cases calculated after sampling and the variance of that proportion were estimated as follows. Using the “n” method to estimate the proportion of clustered cases, we assume that the true set of isolates is composed of $N_k$ clusters of size $k$, $k = 1, \ldots, k_{max}$. Further, we assume that each subject in the true set of isolates is sampled independently with a common sampling probability $p$.

Let $I_{ijk} = 1$ if the $i^{th}$ subject of cluster $j$ ($j = 1, \ldots, N_k$) of size $k$ is sampled, and $I_{ijk} = 0$ otherwise.

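Now let $U_{jk} = 1$ if exactly one subject is sampled from the $j^{th}$ cluster of size $k$ and $U_{jk} = 0$ otherwise, so that $E(U_{jk}) = kp(1-p)^{k-1}$, and let $N_{jk}$ be the number of subjects sampled from the $j^{th}$ cluster of size $k$. Hence,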

The expectation and large sample variance of the random variable (
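As a hedged reconstruction rather than the author's own notation, the quantities involved can be written out under the stated assumption of independent sampling with probability $p$. Here $U$ denotes the number of sampled isolates that appear unique and $N$ the total number of sampled isolates; both symbols, and the first-order (delta method) approximation for the ratio, are assumptions made for this sketch.

```latex
\begin{align*}
E(U) &= \sum_{k=1}^{k_{max}} N_k \, k p (1-p)^{k-1},
\qquad
E(N) = p \sum_{k=1}^{k_{max}} k N_k, \\[4pt]
E\!\left(\frac{U}{N}\right) &\approx r = \frac{E(U)}{E(N)}, \\[4pt]
\mathrm{Var}\!\left(\frac{U}{N}\right) &\approx
\frac{\mathrm{Var}(U) - 2r\,\mathrm{Cov}(U, N) + r^{2}\,\mathrm{Var}(N)}{E(N)^{2}}, \\[4pt]
\text{where, with } q_k &= k p (1-p)^{k-1}, \\
\mathrm{Var}(U) &= \sum_{k=1}^{k_{max}} N_k \, q_k (1 - q_k), \\
\mathrm{Var}(N) &= p(1-p) \sum_{k=1}^{k_{max}} k N_k, \\
\mathrm{Cov}(U, N) &= \sum_{k=1}^{k_{max}} N_k \, q_k (1 - kp).
\end{align*}
```

The covariance term uses the fact that a cluster contributes to $U$ only when exactly one of its members is sampled, in which case it contributes exactly one isolate to $N$.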

These simulations were repeated using the “n minus one” approach, in which one case per cluster is removed from the count of clustered cases and added to the count of reactivation cases. The analytic solution follows the same logic (Appendix 2).

The magnitude of bias in the odds ratios of potential risk factors introduced by the misclassification of clustering due to sampling error was also assessed. Hypothetical risk factors for clustering were postulated and assigned “true” odds ratios of 2, 5, and 10. The prevalence of each risk factor in the absence of clustering was set at 0.1. Each exposure was thus randomly assigned to 10% of the unclustered cases and to the proportion of clustered cases needed to obtain the specified odds ratio in each of the modeled data sets. The odds ratios were recalculated after sampling by moving the clustered cases that were sampled as unique from the category of recently transmitted cases to the category of reactivated cases and reassessing the respective exposure status for these outcomes.
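The attenuation of the odds ratio can be sketched in expectation. The sketch below assumes that misclassification is independent of exposure and takes the fraction of clustered cases that appear unique as a given input (`frac_misclassified`); it works with expected counts rather than random assignment, and all names and example numbers are hypothetical.

```python
def exposure_prevalence_in_clustered(true_or, prev_unclustered=0.1):
    """Exposure prevalence among clustered cases that yields the true odds ratio."""
    odds = true_or * prev_unclustered / (1.0 - prev_unclustered)
    return odds / (1.0 + odds)


def observed_odds_ratio(true_or, n_unclustered, n_clustered,
                        frac_misclassified, prev_unclustered=0.1):
    """Expected odds ratio after a fraction of clustered cases is
    misclassified as unique (reactivated)."""
    if not 0.0 <= frac_misclassified < 1.0:
        raise ValueError("frac_misclassified must be in [0, 1)")
    p_clustered = exposure_prevalence_in_clustered(true_or, prev_unclustered)

    # Clustered cases that were sampled as unique move to the other category.
    moved = n_clustered * frac_misclassified
    exposed_apparent_unique = n_unclustered * prev_unclustered + moved * p_clustered
    total_apparent_unique = n_unclustered + moved
    p_apparent_unique = exposed_apparent_unique / total_apparent_unique

    # Remaining clustered cases keep their exposure prevalence under the
    # assumption that misclassification does not depend on exposure.
    odds_clustered = p_clustered / (1.0 - p_clustered)
    odds_unique = p_apparent_unique / (1.0 - p_apparent_unique)
    return odds_clustered / odds_unique


# Hypothetical data set: 40 unclustered and 60 clustered cases.
for frac in (0.0, 0.1, 0.3, 0.6):
    print(frac, round(observed_odds_ratio(5, 40, 60, frac), 2))
```

Because the misclassified cases carry the higher exposure prevalence of the clustered group into the comparison group, the observed odds ratio moves toward 1 as the misclassified fraction grows.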

Output from the transmission model (

Output statistics | Sudan (high burden) | NY prison (high burden) | Algeria (moderate burden) | US prison (moderate burden) | Netherlands (low burden) |
---|---|---|---|---|---|
Tuberculosis incidence^{a} | 190 | 581 | 32 | 82 | 14 |
Consensus incidence estimates | 200 | NA^{b} | 44 | NA^{b} | 10 |
ARI^{c} | 0.025 | 0.046 | 0.003 | 0.005 | 0.001 |
Maximum cluster size | 87 | 19 | 9 | 17 | 15 |
Mean cluster size | 10.2 | 3.2 | 1.7 | 2.9 | 1.7 |
Proportion of unique isolates | 0.181 | 0.253 | 0.432 | 0.289 | 0.490 |

^{a}Incidence per 100,000. Consensus incidence estimates are shown for comparison with estimates obtained from the model.
^{b}No data available.
^{c}ARI = Annual risk of infection.

Setting | Measure | Sampling fraction 1 | Sampling fraction 0.7 | Sampling fraction 0.5 | Sampling fraction 0.1 |
---|---|---|---|---|---|
Sudan | Proportion of reactivated isolates, n method | 0.18 | 0.19 (0.18-0.19) | 0.21 (0.19-0.23) | 0.37 (0.30-0.43) |
Sudan | Proportion of reactivated isolates, “n minus one” method | 0.28 | 0.32 (0.30-0.34) | 0.35 (0.30-0.41) | 0.54 (0.47-0.62) |
Sudan | Odds ratio^{a} | 2 | 1.88 (1.87-1.97) | 1.77 (1.66-1.88) | 1.34 (1.28-1.45) |
Sudan | Odds ratio^{a} | 5 | 4.18 (4.13-4.78) | 3.51 (3.01-4.18) | 1.84 (1.60-2.18) |
Sudan | Odds ratio^{a} | 10 | 7.52 (7.38-9.27) | 1.84 (1.67-1.84) | 2.37 (2.08-2.99) |
NY prison | Proportion of reactivated isolates, n method | 0.12 | 0.14 (0.12-0.16) | 0.16 (0.13-0.20) | 0.45 (0.28-0.62) |
NY prison | Proportion of reactivated isolates, “n minus one” method | 0.33 | 0.36 (0.33-0.38) | 0.39 (0.32-0.45) | 0.67 (0.61-0.73) |
NY prison | Odds ratio^{a} | 2 | 1.77 (1.60-1.95) | 1.62 (1.45-1.83) | 1.16 (1.12-1.29) |
NY prison | Odds ratio^{a} | 5 | 3.51 (2.73-4.58) | 2.78 (2.18-3.83) | 1.37 (1.25-1.37) |
NY prison | Odds ratio^{a} | 10 | 5.79 (4.07-8.66) | 4.17 (2.99-6.58) | 1.58 (1.39-2.12) |
Algeria | Proportion of reactivated isolates, n method | 0.43 | 0.48 (0.45-0.51) | 0.16 (0.13-0.20) | 0.45 (0.28-0.62) |
Algeria | Proportion of reactivated isolates, “n minus one” method | 0.65 | 0.71 (0.69-0.74) | 0.76 (0.69-0.83) | 0.92 (0.81-0.99) |
Algeria | Odds ratio^{a} | 2 | 1.81 (1.75-1.92) | 1.67 (1.45-1.97) | 1.29 (1.18-1.73) |
Algeria | Odds ratio^{a} | 5 | 3.68 (2.79-4.82) | 3.02 (2.99-4.73) | 1.98 (1.77-2.33) |
Algeria | Odds ratio^{a} | 10 | 6.58 (5.55-8.23) | 5.05 (4.17-6.58) | 2.62 (2.26-3.28) |
US prison | Proportion of reactivated isolates, n method | 0.29 | 0.33 (0.29-0.39) | 0.37 (0.29-0.48) | 0.68 (0.35-1.00) |
US prison | Proportion of reactivated isolates, “n minus one” method | 0.33 | 0.35 (0.31-0.38) | 0.37 (0.32-0.41) | 0.62 (0.50-0.73) |
US prison | Odds ratio^{a} | 2 | 1.8 (1.62-1.98) | 1.67 (1.45-1.97) | 1.29 (1.18-1.73) |
US prison | Odds ratio^{a} | 5 | 3.68 (2.79-4.82) | 3.02 (2.99-4.73) | 2.11 (1.62-5.31) |
US prison | Odds ratio^{a} | 10 | 6.86 (5.30-9.66) | 4.67 (3.02-9.13) | 2.11 (1.62-5.31) |
Netherlands | Proportion of reactivated isolates, n method | 0.49 | 0.62 (0.55-0.69) | 0.62 (0.55-0.69) | 0.89 (0.77-1.00) |
Netherlands | Proportion of reactivated isolates, “n minus one” method | 0.65 | 0.78 (0.72-0.85) | 0.78 (0.72-0.85) | 0.93 (0.79-1.00) |
Netherlands | Odds ratio^{a} | 2 | 1.8 (1.62-1.98) | 1.67 (1.57-1.81) | 1.40 (1.31-1.49) |
Netherlands | Odds ratio^{a} | 5 | 3.68 (2.79-4.82) | 3.05 (2.63-3.78) | 2.02 (1.79-2.32) |
Netherlands | Odds ratio^{a} | 10 | 6.86 (5.30-9.66) | 4.76 (3.86-6.47) | 3.86 (2.39-3.25) |

^{a}For odds ratio rows, the value under sampling fraction 1 is the “true” odds ratio. Confidence intervals for odds ratios are based on the results of 2 by 2 tables, with data adjusted for the mean misclassification introduced by sampling.

The bias in the proportions of clustered and unclustered cases results from misclassification of cluster status due to inadequate sampling; this misclassification also biases the results of analyses of risk factors for recent transmission in the direction of the null hypothesis of no effect.

The recent development of molecular methods to accurately type infectious organisms has led to a marked proliferation in studies of the molecular epidemiology of infectious diseases, especially of TB. The goals of many of these studies have been to address the longstanding problem of assessing the relative proportions of incident TB cases due to recent transmission and to chronic or reactivated disease and to identify risk factors for recent transmission. A systematic bias that consistently underestimates the proportion of cases due to recent transmission could present a serious impediment to the constructive use of molecular typing techniques for studying the epidemiology of infectious disease.

The results of this study show the extent to which bias can be introduced by sampling strategies commonly used in the molecular epidemiology of TB. Depending on the underlying distribution of cluster sizes, the overestimation of the proportion of unique TB isolates in a sample may be sizable, even when up to 70% of the complete data are sampled. The odds ratios for risk factors for clustering are also consistently and markedly underestimated with this approach. The findings of this study support the conclusions of previous investigators

I considered how much impact this kind of sampling bias might have had on the studies of the molecular epidemiology of TB published to date. Many researchers report on a convenience sample of cases drawn from one or more clinical sites, without providing an estimate of the number of incident cases in the area in question during the period in which the cases were collected (

In industrialized countries with lower rates of incident TB, researchers have tried to enroll a complete cohort of patients by making use of public health reporting systems to identify and fingerprint all new cases of clinical TB in a defined geographic region during a specified time period (

The “complete” data sets used to estimate bias in this study were generated through stochastic epidemic modeling that outputs cluster distributions in addition to estimates of the incidence of TB infection and disease. Multiple demographic and disease-specific parameters have been found to affect cluster distributions, and many potential “transmission scenarios” could be generated by varying these parameters. In addition, the length of the study period and the stability of the molecular markers used will impact the observed patterns of clustering (

These results demonstrate that estimates of clustering based on molecular fingerprinting of a population of isolates of infectious agents may be severely biased. When these methods are used to estimate the extent of primary and reactivation disease in a community, they consistently underestimate recent transmission. In circumstances in which the error is greatest, the bias may undermine the value of an investigation by providing a community with false reassurance that ongoing transmission is being curtailed and therefore that control measures are adequate.

The findings of this study further suggest that molecular methods in epidemiology require the development of both appropriate epidemiologic study design and analytic tools to yield meaningful assessments of disease transmission. In particular, they imply that estimates of recent transmission obtained by molecular methods cannot be compared across studies which have used different sampling fractions and in which the distribution of cluster size can reasonably be expected to vary. One way for molecular epidemiologists to approach this problem is to provide sensitivity analyses estimating the potential error involved, given prior expectations of cluster distributions and an estimate of the fraction of cases sampled in a particular study. The analytic solution presented here can be easily programmed and used to explore the range of potential error under a variety of hypothetical transmission scenarios.
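A sensitivity analysis of this kind might be programmed along the following lines. This is a minimal sketch, not the author's code: it assumes independent sampling with a common probability `p`, approximates the expected proportion of apparently unique isolates by a ratio of expectations, and uses a hypothetical prior cluster distribution.

```python
def expected_unique_proportion(cluster_counts, p):
    """Approximate expected proportion of sampled isolates that appear unique.

    cluster_counts maps cluster size k to the number of clusters of that
    size in the complete data; p is the sampling fraction.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # A cluster of size k yields one apparently unique isolate when exactly
    # one of its k members is sampled: probability k * p * (1 - p)**(k - 1).
    expected_unique = sum(n * k * p * (1.0 - p) ** (k - 1)
                          for k, n in cluster_counts.items())
    expected_sampled = sum(n * k * p for k, n in cluster_counts.items())
    return expected_unique / expected_sampled


def sensitivity_table(cluster_counts, fractions=(1.0, 0.7, 0.5, 0.1)):
    """Expected apparent proportion of unique isolates at each sampling fraction."""
    return {p: expected_unique_proportion(cluster_counts, p) for p in fractions}


# Hypothetical prior expectation of the cluster distribution.
prior = {1: 40, 2: 10, 3: 5, 5: 5}
for fraction, value in sensitivity_table(prior).items():
    print(fraction, round(value, 3))
```

Repeating the calculation over several plausible cluster distributions gives a range for the error attributable to incomplete sampling in a given study.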

We wish to derive the expectation and variance of the random variable

and the variance of

It only remains to evaluate

Lemma: Under our assumptions,

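Proof: by independence, $\mathrm{Cov}(U, N) = \sum_{k} \sum_{j} \mathrm{Cov}(U_{jk}, N_{jk})$, where $U_{jk}$ indicates that exactly one subject is sampled from the $j^{th}$ cluster of size $k$ and $N_{jk}$ is the number of subjects sampled from that cluster. Further, $\mathrm{Cov}(U_{jk}, N_{jk}) = E(U_{jk} N_{jk}) - E(U_{jk}) E(N_{jk}) = E(U_{jk}) - E(U_{jk}) E(N_{jk})$, because $U_{jk} N_{jk} = U_{jk}$. The result follows since $E(U_{jk}) = kp(1-p)^{k-1}$ and $E(N_{jk}) = kp$.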

The bias in the proportion of reactivated cases after sampling when the clustered cases are counted by using the “n minus one” method is described below. The number of cases considered to be due to reactivation is the sum of the unique cases and the source cases. The “true” number of source cases is equal to the number of clusters in the complete data set,

We are interested in finding the number of source cases after sampling. Since the number of source cases in a sample is equal to the “true” number of source cases minus the source cases that are not sampled or are sampled as unique, we need to estimate the expected number of clusters not sampled and the expected number of clusters sampled as unique. Let E(CL0) and E(CL1) be the expected values of the numbers of clusters not sampled or sampled as unique, respectively. Then, by using the nomenclature defined in the text and following the logic there described:

The expected number of source cases after sampling a fraction

The overall estimate of the proportion of reactivated cases can then be obtained by summing the number of unique cases after sampling with the number of source cases and dividing by the expected number of sampled isolates,
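The logic of this appendix can be sketched numerically. The closed forms used for E(CL0) and E(CL1) below are assumptions that follow from independent sampling with probability `p` (a cluster of size k is missed entirely with probability (1 - p)^k and sampled as unique with probability kp(1 - p)^(k - 1)); the function name and example distribution are hypothetical.

```python
def expected_reactivated_n_minus_one(cluster_counts, p):
    """Approximate expected proportion of reactivated cases after sampling,
    counting clustered cases by the "n minus one" method."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    clusters = {k: n for k, n in cluster_counts.items() if k >= 2}

    true_source_cases = sum(clusters.values())
    # E(CL0): clusters with no member sampled.
    expected_cl0 = sum(n * (1.0 - p) ** k for k, n in clusters.items())
    # E(CL1): clusters with exactly one member sampled (sampled as unique).
    expected_cl1 = sum(n * k * p * (1.0 - p) ** (k - 1) for k, n in clusters.items())
    expected_source = true_source_cases - expected_cl0 - expected_cl1

    # Apparent unique cases: truly unique isolates that are sampled,
    # plus clusters sampled as unique.
    expected_unique = cluster_counts.get(1, 0) * p + expected_cl1
    expected_sampled = sum(n * k * p for k, n in cluster_counts.items())
    return (expected_unique + expected_source) / expected_sampled


# Hypothetical cluster distribution (size -> number of clusters).
example = {1: 40, 2: 10, 3: 5, 5: 5}
for fraction in (1.0, 0.7, 0.5, 0.1):
    print(fraction, round(expected_reactivated_n_minus_one(example, fraction), 3))
```

The numerator equals the expected number of distinct fingerprints in the sample, so the estimate rises toward 1 as the sampling fraction falls.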

Suggested citation: Murray M. Sampling Bias in the Molecular Epidemiology of Tuberculosis. Emerg Infect Dis. [serial on the Internet]. 2002 Apr [date cited]. Available from

I thank Jamie Robins for substantial input into the analytic solutions presented in this paper and Sidney Atwood, Jean Marie Arduino, Barry Bloom, Sam Bozeman, Marc Lipsitch, and Christopher Murray for their help and advice on this manuscript.

This work was supported by National Institutes of Health grant k08 A1-01430-01.

Dr. Murray is an assistant professor of epidemiology at the Harvard Medical School and an infectious disease clinician at the Massachusetts General Hospital. Her research interests include epidemiologic methods, the molecular epidemiology of tuberculosis, and the transmission dynamics of infectious diseases.