Reductions in tuberculosis (TB) transmission have been instrumental in lowering TB incidence in the United States. Sustaining and augmenting these reductions are key public health priorities.

We fit mechanistic transmission models to distributions of genotype clusters of TB cases reported to the Centers for Disease Control and Prevention during 2012–2016 in the United States and separately in California, Florida, New York, and Texas. We estimated the mean number of secondary cases generated per infectious case (R_{0}) and individual-level heterogeneity in R_{0} at state and national levels and assessed how different definitions of clustering affected these estimates.

In clusters of genotypically linked TB cases that occurred within a state over a 5-year period (reference scenario), the estimated R_{0} was 0.29 (95% confidence interval [CI], .28–.31) in the United States. Transmission was highly heterogeneous; 0.24% of simulated cases with individual R_{0} >10 generated 19% of all recent secondary transmissions. The R_{0} estimate was 0.16 (95% CI, .15–.17) when a cluster was defined as cases occurring within the same county over a 3-year period. Transmission varied across states: estimated R_{0}

TB transmission in the United States is characterized by pronounced heterogeneity at the individual and state levels. Improving detection of transmission clusters through incorporation of whole-genome sequencing and identifying the drivers of this heterogeneity will be essential to reducing TB transmission.

Tuberculosis (TB) incidence in the United States fell by more than 70% between 1993 and 2017; reductions in transmission driven by progress in detecting and treating latent TB infection among persons recently exposed have been a key component of this decline [

Transmission of

We used data from the US National Tuberculosis Surveillance System and the National Tuberculosis Genotyping Service for TB cases reported from the 50 US states and District of Columbia during 2012–2016 to infer the distribution of TB clusters in the United States and independently in California, Florida, New York, and Texas. Cases were defined as clustered if they had matching spacer oligonucleotide typing (spoligotype) and 24-locus MIRU-VNTR genotyping results, were reported within specified geographic boundaries (ie, the same county or state), and occurred during 1 of 2 time periods (ie, 2012–2016 or 2014–2016). A cluster definition that included cases reported within a single state boundary and within 2012–2016 was defined as the reference scenario.

We used a branching process framework to describe recent transmission and cluster formation [R_{0}, the reproductive number, equal to the average number of secondary cases resulting from a single case. We assume that the number of secondary cases follows a Poisson distribution with parameter

To incorporate individual-level heterogeneity in transmission, we varied the value of R_{0}. Model I (“homogeneous” model) assumes that the individual reproductive number is constant (equal to R_{0}) so that all individuals have the same infectious potential and the number of secondary cases resulting from each case is Poisson-distributed. Model II (“Susceptible-Infectious-Recovered-type” model) assumes that the individual reproductive number is distributed exponentially, similar to assumptions in standard Susceptible-Infectious-Recovered (SIR) type compartmental models. Model III (“overdispersed” model), a model previously used to capture heterogeneity in transmission of TB and other infectious diseases [, assumes that the individual reproductive number is gamma distributed with mean R_{0} and shape parameter k. Model IV (“long-tailed” model) assumes that the individual reproductive number is lognormally distributed (with parameters μ and σ^{2} of the underlying normal distribution), where larger values of σ^{2} are indicative of increased heterogeneity.
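The four models can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation: it assumes each case draws an individual reproductive number ν from the model's distribution (with mean R_{0}) and then generates a Poisson(ν) number of secondary cases. The function names, the default values `k=0.5` and `sigma=1.5`, and the normal approximation used for large Poisson means are illustrative assumptions.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw a Poisson(lam) variate using only the standard library."""
    if lam <= 0:
        return 0
    if lam > 30:
        # Normal approximation for large means (approximate, for illustration).
        return max(0, round(rng.gauss(lam, math.sqrt(lam))))
    # Knuth's multiplication method for small means.
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_nu(model, r0, rng, k=0.5, sigma=1.5):
    """Individual reproductive number with mean r0 (k, sigma are hypothetical)."""
    if model == "homogeneous":      # Model I: constant
        return r0
    if model == "sir":              # Model II: exponential with mean r0
        return rng.expovariate(1.0 / r0)
    if model == "overdispersed":    # Model III: gamma, shape k, scale r0 / k
        return rng.gammavariate(k, r0 / k)
    if model == "long_tailed":      # Model IV: lognormal with mean r0
        mu = math.log(r0) - sigma ** 2 / 2
        return rng.lognormvariate(mu, sigma)
    raise ValueError(f"unknown model: {model}")


def simulate_cluster(model, r0, rng, max_size=10_000):
    """Final size of one transmission cluster started by a single case."""
    size, active = 1, 1
    while active > 0 and size < max_size:
        offspring = sum(
            sample_poisson(sample_nu(model, r0, rng), rng) for _ in range(active)
        )
        size += offspring
        active = offspring
    return size


if __name__ == "__main__":
    rng = random.Random(2016)
    for model in ("homogeneous", "sir", "overdispersed", "long_tailed"):
        sizes = [simulate_cluster(model, 0.29, rng) for _ in range(5000)]
        large = sum(1 for s in sizes if s >= 10)
        print(model, "clusters with >=10 cases:", large)
```

Because all four models share the same mean, differences in the simulated cluster distributions come only from the shape of the ν distribution, which is the contrast the model comparison relies on.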

We used a likelihood-based framework to evaluate and compare the fit of each model described above to the observed data. Using the likelihood function (described in detail in the
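For the homogeneous model, the cluster-size likelihood has a closed form, which makes the inference step easy to sketch. The example below assumes a subcritical Poisson branching process in which every cluster starts from a single case and is fully observed, so cluster sizes follow a Borel distribution; it ignores censoring and the heterogeneous models. The cluster counts are hypothetical and are not taken from the surveillance data.

```python
import math


def borel_log_pmf(n, r0):
    """log P(cluster size = n) when offspring are Poisson(r0), with 0 < r0 < 1."""
    return (n - 1) * math.log(r0 * n) - r0 * n - math.lgamma(n + 1)


def log_likelihood(cluster_counts, r0):
    """Log-likelihood of observed clusters; cluster_counts maps size -> count."""
    return sum(
        count * borel_log_pmf(size, r0) for size, count in cluster_counts.items()
    )


def mle_r0_poisson(cluster_counts):
    """Closed-form maximum likelihood estimate: 1 - (clusters / cases)."""
    n_clusters = sum(cluster_counts.values())
    n_cases = sum(size * count for size, count in cluster_counts.items())
    return 1.0 - n_clusters / n_cases


if __name__ == "__main__":
    # Hypothetical cluster distribution (size -> number of clusters).
    counts = {1: 700, 2: 150, 3: 60, 4: 30, 5: 20, 8: 5, 12: 2}
    r0_hat = mle_r0_poisson(counts)
    print("estimated R0:", round(r0_hat, 3))
    print("log-likelihood at the estimate:", round(log_likelihood(counts, r0_hat), 2))
```

The heterogeneous models would need their own cluster-size distributions (and, in general, numerical maximization), which this sketch does not attempt.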

To assess the sensitivity of model inference to possible imperfections in data, we conducted a simulation study in which we considered 2 mechanisms by which observed data could differ from true clustering. First, we assumed underreporting of clustered TB cases due to factors such as cases not being reported in local jurisdictions, cases not being culture-confirmed (eg, pediatric cases) or isolates not being genotyped, or cases being right- or left-censored over time. Second, we assumed overascertainment of clusters due to inclusion of imported cases of matching genotype (ie, not due to local or recent transmission). We generated synthetic cluster distributions by simulating the branching process models under various assumptions about R_{0} and individual-level heterogeneity (taken as true parameter values), which also incorporated imperfections in data described above. For each synthetic cluster distribution, we then applied the likelihood-based inference method to estimate both R_{0} and individual-level heterogeneity (estimated parameter values). By comparing true parameter values to their corresponding estimates, we inferred the sensitivity of each estimated parameter value to underreporting or overascertainment. (See
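The logic of this simulation study can be sketched for the simplest case. The example below is an illustration under stated assumptions rather than the procedure used in the study: it simulates clusters from the homogeneous (Poisson) model, drops each case independently with a hypothetical probability to mimic underreporting, and re-estimates R_{0} with the closed-form estimator for that model. Overascertainment and the heterogeneous models are not covered.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's method; adequate for the small means used here."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_cluster_size(r0, rng):
    """Final size of a cluster started by one case, Poisson(r0) offspring."""
    size, active = 1, 1
    while active > 0:
        offspring = sum(sample_poisson(r0, rng) for _ in range(active))
        size += offspring
        active = offspring
    return size


def thin(size, p_observe, rng):
    """Number of cases in a cluster that remain after underreporting."""
    return sum(1 for _ in range(size) if rng.random() < p_observe)


def estimate_r0(sizes):
    """Maximum likelihood estimate under the homogeneous model."""
    return 1.0 - len(sizes) / sum(sizes)


def run_experiment(true_r0=0.3, p_observe=0.7, n_clusters=20_000, seed=1):
    """Return (estimate from complete data, estimate from thinned data)."""
    rng = random.Random(seed)
    true_sizes = [simulate_cluster_size(true_r0, rng) for _ in range(n_clusters)]
    observed = [thin(size, p_observe, rng) for size in true_sizes]
    observed = [size for size in observed if size > 0]  # empty clusters vanish
    return estimate_r0(true_sizes), estimate_r0(observed)


if __name__ == "__main__":
    complete, thinned = run_experiment()
    print("estimate from complete clusters:", round(complete, 3))
    print("estimate with underreporting:  ", round(thinned, 3))
```

Comparing the two returned estimates with the value passed as `true_r0` mirrors the comparison of true and estimated parameter values described above.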

Of 35 313 genotyped TB cases reported during 2012–2016 in the United States, 13 159 (37%) were clustered under the reference definition of having the same genotyping result as at least 1 other case that was reported from the same state during the same period (

Of the 4 models considered, model IV (long-tailed model), which assumed the highest level of individual-level heterogeneity, provided the best statistical fit. The MLE under model IV was statistically >1000 times more likely to explain the data than models I–III (
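The size of this gap can be checked directly from the log-likelihood differences reported in the model comparison table. The sketch below assumes that relative likelihood is the plain likelihood ratio, exp(difference in log-likelihood), with no penalty for the number of parameters; the exact definition used in the study is given in a table footnote that is incomplete here. The helper names are illustrative.

```python
import math

# Differences in maximized log-likelihood relative to the best-fitting model,
# as reported in the model comparison table for models I-III.
LOG_LIK_DIFF = {"I": -1450.19, "II": -2468.19, "III": -170.99}


def relative_likelihood_log10(log_lik_diff):
    """log10 of exp(log_lik_diff), computed without floating-point underflow."""
    return log_lik_diff / math.log(10)


def is_below(log_lik_diff, ratio=1 / 1000):
    """True if the likelihood ratio exp(log_lik_diff) is smaller than `ratio`."""
    return log_lik_diff < math.log(ratio)


if __name__ == "__main__":
    for model, diff in LOG_LIK_DIFF.items():
        print(
            f"model {model}: relative likelihood is about 10^"
            f"{relative_likelihood_log10(diff):.0f}; "
            f"below 1/1000: {is_below(diff)}"
        )
```

Working on the log scale matters here because exponentiating differences of this size underflows to zero in double precision.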

The estimated distribution of the individual reproductive number revealed substantial individual-level heterogeneity (

The estimated value of R_{0}
R_{0} >10 fell between 16% and 19% regardless of cluster definition (

The cluster distribution of TB cases and the corresponding estimates of individual-level reproductive numbers varied considerably at the state level. For example, the proportion of clusters with ≥10 cases was nearly 8-fold larger in Texas compared with New York (R_{0} >10 to the total secondary cases varied from 9.5% (from 0.13% of individuals) in Florida to 20% (from 0.3% of individuals) in California (

Under- and overascertainment of clusters had a predictable effect on the inference of R_{0}. R_{0} was underestimated (bottom left quadrant in

This model-based analysis of genotype-clustered TB cases in the United States revealed that there is substantial heterogeneity in transmission. We estimated that 95% of individual cases transmit to fewer than 1 secondary case each and contribute only 38% of overall secondary transmission. By contrast, 0.24% of cases were estimated to transmit to 10 or more secondary cases, resulting in 19% of all secondary cases. This degree of heterogeneity is larger than described with prior models (ie, negative binomial distribution) [
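Summaries of this kind follow from the fitted distribution of the individual reproductive number. As a minimal sketch, assuming that distribution is lognormal with mean R_{0}, the fraction of cases above a threshold and their share of expected secondary cases both have closed forms. The value of `sigma` below is hypothetical (the fitted parameters are not reproduced here), so the output illustrates the calculation and does not reproduce the reported percentages.

```python
import math
from statistics import NormalDist


def lognormal_tail_summary(r0, sigma, threshold):
    """Fraction of cases with nu > threshold and their share of transmission.

    nu is lognormal with mean r0; sigma is the standard deviation of log(nu).
    """
    mu = math.log(r0) - sigma ** 2 / 2          # so that E[nu] = r0
    z = (math.log(threshold) - mu) / sigma
    std_normal = NormalDist()
    fraction_of_cases = 1.0 - std_normal.cdf(z)
    # E[nu; nu > c] / E[nu] = Phi((mu + sigma^2 - ln c) / sigma) = 1 - Phi(z - sigma)
    share_of_transmission = 1.0 - std_normal.cdf(z - sigma)
    return fraction_of_cases, share_of_transmission


if __name__ == "__main__":
    # r0 = 0.29 is the reference estimate; sigma = 2.0 is a hypothetical value.
    fraction, share = lognormal_tail_summary(r0=0.29, sigma=2.0, threshold=10)
    print(f"cases with nu > 10: {fraction:.2%}")
    print(f"share of expected secondary cases: {share:.2%}")
```

The share of transmission always exceeds the fraction of cases, and the gap widens as `sigma` grows, which is the sense in which a long-tailed distribution concentrates transmission in a few individuals.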

The characteristics of _{0} in Texas was twice as high, suggesting that more cases in Texas reflect recent transmission, whereas more cases in New York may represent reactivation of latent infection or importation. These findings are consistent with estimates of recent transmission from the Centers for Disease Control and Prevention [

Conventional genotyping has known limitations that can lead to underestimation or overestimation of clustering. Underestimation may occur if true transmission-linked cases are not detected (eg, individuals move out of a jurisdiction or are reported elsewhere), do not have a specimen culture showing _{0}) to fall or rise proportionally. The choice of time periods that we examined seemed to have less impact on estimates of transmission. None of the mechanisms mentioned above substantially affected estimates of individual-level heterogeneity in transmission, which remained high in all of our sensitivity analyses. When choosing the appropriate administrative level at which to define clusters, it is important to additionally consider the geographic size and population of administrative units, their interconnectedness, the relative value of a sensitive vs specific definition, and the level at which any response could be organized.

Conventional genotyping methods may also overestimate clustering by falsely attributing transmission links to cases that share common ancestry but are not related by recent transmission. TB often has a decades-long latency period, genotyping cannot be performed during this latent period, and molecular changes occur slowly; these factors can limit the use of conventional genotyping to estimate transmission. For example, cases that result from a commonly circulating (endemic) strain might reactivate at similar times and thus could share a genotype but not reflect recent transmission events. Genotyping clusters defined by 24-locus MIRU-VNTR could encompass transmission events up to 3 decades in the past [

Recent and local transmission can be corroborated through identification of epidemiologic links [

The high degree of heterogeneity in the individual reproductive number estimated here might reflect not only individual-level factors but also the environmental conditions and societal and healthcare provider–related factors that individuals experience. Communities or populations in which background

In conclusion, this model-based analysis of molecular surveillance data in the United States suggests that, although the overall rate of recent TB transmission is generally low, a small fraction of TB cases probably plays an important role in driving transmission at the population level. Understanding the drivers of this heterogeneity, by identifying populations, settings, and activities that are more frequently associated with large outbreaks, could improve outbreak prevention and response (through early and accurate detection of large clusters), reduce TB transmission, and improve TB-related resource allocation in the United States and more broadly.

This work was supported by the CDC, National Center for HIV/AIDS, Viral Hepatitis, STD, and TB Prevention Epidemiologic and Economic Modeling Agreement (grant 5U38PS004646).

Supplementary Data

Supplementary materials are available at

Genotype cluster distribution of tuberculosis (TB) cases in the United States. Shown are the frequency of observed genotype TB clusters of various sizes in the United States based on cases reported within a given state and occurring within a 5-year time period (2012 to 2016)

Fitting branching process models to genotype cluster distributions of tuberculosis (TB) in the United States. We fit branching process models to the cluster distribution consisting of genotyped TB cases occurring within US state boundaries over a 5-year time period (shown in

Underlying individual-level heterogeneity of tuberculosis (TB) transmission. Shown is the probability density function corresponding to the best-fit Poisson lognormal model, describing the distribution of the individual reproductive number under the reference scenario (clustering based on genotyped cases reported within state boundary and occurring between 2012 and 2016). The solid vertical line shows the mean of the distribution (ie,

Comparing model-based inferences under different definitions of tuberculosis clusters in the United States. We fit Poisson lognormal models to 4 cluster distributions, each using a different geographic boundary and time window for cluster ascertainment. Shown are the estimated mean reproductive number (ie,

State-level heterogeneity in tuberculosis cluster distributions and transmission across 4 US states, California, Florida, New York, and Texas, between 2014 and 2016.

Notations and Symbols Used in the Study, Detailed Descriptions, and Underlying Assumptions

Notations and Symbols | Description | Underlying Assumption |
---|---|---|
R_{0} | Reproductive number, or average number of secondary cases resulting from a single case | Theoretical concept |
ν | Individual reproductive number, expected number of secondary cases resulting from each individual | Assumed to vary based on the underlying models; see |
| Estimated reproductive number, based on maximum likelihood estimate | Estimates are aimed to capture cases resulting from recent transmission (and exclude cases resulting from reactivation that occur at longer time scales) |
| Offspring distribution of a branching process that describes the probability distribution of the number of secondary cases resulting from a single case | Varies based on the underlying models; see |

Description of Four Models of Individual-level Heterogeneity and Comparison of Their Statistical Fits to the Reference Data

Model | Model Description | Underlying Distribution of Individual Reproductive Number, ν; Resulting Distribution of Secondary Cases | Maximum Likelihood Estimate, Log Scaled (Difference in Log Likelihood Units Relative to the Highest Estimate) | Relative Likelihood Compared With the Best Model^{a} |
---|---|---|---|---|
I, Homogeneous^{b} | Assumes no individual-level heterogeneity, that is, all individuals have the | ν is constant; Poisson(R_{0}); R_{0} | −16 787.68 (−1450.19) | <1/1000 |
II, SIR-type^{b} | Reflecting assumption in standard Susceptible-Infectious-Recovered-type compartmental models, assumes exponentially distributed individual reproductive numbers | ν is exponentially distributed; geometric(R_{0}); R_{0}(1 + R_{0}) | −17 804.98 (−2468.19) | <1/1000 |
III, Overdispersed | Assumes that the number of secondary cases from an individual is overdispersed and the degree of overdispersion is estimated | ν is gamma distributed; negative binomial(R_{0}, k) | −15 507.78 (−170.99) | <1/1000 |
IV, Long-tailed | Assumes that individual-level heterogeneity is lognormally distributed (allowing for even larger heterogeneity) | ν is lognormally distributed; Poisson lognormal(μ, σ^{2}), where μ and σ^{2} are, respectively, the mean and variance of the underlying normal distribution; R_{0} [1 + R_{0} (exp(σ^{2}) − 1)] | − |

Relative likelihood is given by the quantity exp((_{min} − _{min} is the

Poisson and geometric models are specific instances of the negative binomial model. Negative binomial model with dispersion parameter k→ ∞ is a Poisson model, and k = 1 is a geometric model.
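These limiting cases can be verified numerically. The sketch below uses the mean and dispersion parameterization of the negative binomial (mean R_{0}, dispersion k); the function names and the value k = 10^6 standing in for k → ∞ are illustrative choices.

```python
import math


def neg_binomial_pmf(x, r0, k):
    """P(X = x) for a negative binomial with mean r0 and dispersion k."""
    log_p = (
        math.lgamma(x + k) - math.lgamma(k) - math.lgamma(x + 1)
        + k * math.log(k / (k + r0))
        + x * math.log(r0 / (k + r0))
    )
    return math.exp(log_p)


def poisson_pmf(x, r0):
    """P(X = x) for a Poisson distribution with mean r0."""
    return math.exp(x * math.log(r0) - r0 - math.lgamma(x + 1))


def geometric_pmf(x, r0):
    """P(X = x) for a geometric distribution on 0, 1, 2, ... with mean r0."""
    p = 1.0 / (1.0 + r0)
    return p * (1.0 - p) ** x


if __name__ == "__main__":
    r0 = 0.29
    for x in range(4):
        print(
            x,
            round(neg_binomial_pmf(x, r0, 1e6), 6), round(poisson_pmf(x, r0), 6),
            round(neg_binomial_pmf(x, r0, 1.0), 6), round(geometric_pmf(x, r0), 6),
        )
```

Each printed row pairs the large-k negative binomial with the Poisson probability and the k = 1 negative binomial with the geometric probability, so the agreement can be read off column by column.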