An unprecedented number of nationwide tuberculosis (TB) prevalence surveys will be implemented between 2010 and 2015, to better estimate the burden of disease caused by TB and assess whether global targets for TB control set for 2015 are achieved. It is crucial that results are analysed using best-practice methods.
The objective of this paper is to provide new theoretical and practical guidance on best-practice methods for the analysis of TB prevalence surveys, including analyses at the individual as well as the cluster level and correction for biases arising from missing data.
TB prevalence surveys have a cluster sample survey design; typically 50-100 clusters are selected, with 400-1000 eligible individuals in each cluster. The strategy recommended by the World Health Organization (WHO) for diagnosing pulmonary TB in a nationwide survey is symptom and chest X-ray screening, followed by smear microscopy and culture examinations for those with an abnormal X-ray and/or TB symptoms. Three possible methods of analysis are described and explained. Method 1 is restricted to participants, and individuals with missing data on smear and/or culture results are excluded. Method 2 includes all eligible individuals irrespective of participation, through multiple missing value imputation. Method 3 is restricted to participants, with multiple missing value imputation for individuals with missing smear and/or culture results, and inverse probability weighting to represent all eligible individuals. The results for each method are then compared and illustrated using data from the 2007 national TB prevalence survey in the Philippines. Simulation studies are used to investigate the performance of each method.
A cluster-level analysis, and Methods 1 and 2, gave similar prevalence estimates (660 per 100,000 aged ≥ 10 years old), with a higher estimate using Method 3 (680 per 100,000). Simulation studies for each of 4 plausible scenarios show that Method 3 performs best, with Method 1 systematically underestimating TB prevalence by around 10%.
Both cluster-level and individual-level analyses should be conducted, and individual-level analyses should be conducted both with and without multiple missing value imputation. Method 3 is the safest approach to correct the bias introduced by missing data and provides the single best estimate of TB prevalence at the population level.
National population-based surveys of the prevalence of pulmonary tuberculosis (TB) disease in adults can be used to measure the burden of disease caused by TB, to measure trends in this burden when repeat surveys are performed and to understand why people with TB have not been detected or diagnosed by national TB control programmes (NTPs). Surveys are of greatest relevance in countries with a high burden of TB in which surveillance data capture much less than 100% of cases. Global targets for reductions in disease burden set for 2015 include halving prevalence rates compared with their level in 1990; the other targets are that mortality rates should be halved between 1990 and 2015, and that TB incidence should be falling by 2015 [
The Global Task Force on TB Impact Measurement is hosted by the World Health Organization (WHO) with a mandate to ensure the best-possible assessment of whether 2015 global targets for reductions in TB disease burden are achieved [
TB prevalence surveys have a cluster sample survey design, in which groups of individuals are sampled, with clusters selected at random from an area sampling frame with probability proportional to size (PPS). While the classic method of using each survey cluster as the unit of analysis has been carefully and thoroughly described for a TB prevalence survey [
Findings from national TB prevalence surveys completed in 2007 in the Philippines and Viet Nam have been published [
This paper (outlined in Figure
Paper outline.
For on-going and future TB prevalence surveys, the eligible population is defined as individuals aged ≥15 years old who were already resident in the selected cluster at the time of the survey team’s first pre-survey visit [
There are 2 co-primary outcomes in a TB prevalence survey: (1) smear-positive pulmonary TB and (2) bacteriologically-confirmed pulmonary TB (smear-positive and/or culture-positive). The TB case definition, and the screening strategy used to identify pulmonary TB, in a national-level prevalence survey are summarised in Figure
TB case definition, and screening strategy for pulmonary TB.
The number of individuals who were enumerated, were eligible to participate, and who participated at various stages of the survey should be summarised, for example as depicted in Figure
Before the two co-primary outcomes are analysed, it is essential to describe the completeness and internal consistency of the “core” data, i.e. the data that must be collected in all TB prevalence surveys. This is covered in detail in the WHO handbook [
The two outcomes of smear-positive pulmonary TB and bacteriologically-confirmed pulmonary TB should be analysed separately. Here, we illustrate methods for an individual-level analysis using the outcome of bacteriologically-confirmed pulmonary TB, which we refer to hereafter as pulmonary TB. The analytical approach is the same for any other binary (yes or no) outcome, for example TB diagnosed using the recently endorsed molecular test Xpert MTB/RIF [
Individual-level analyses of pulmonary TB prevalence are performed using logistic regression, in which the log odds, i.e.
Two types of logistic regression model are recommended for the analysis of a TB prevalence survey, both of which allow for the clustering in the sampling design. These are: (1) logistic regression, with robust standard errors calculated from observed between-cluster variability and (2) random-effects logistic regression, in which a parameter for between-cluster variation in pulmonary TB prevalence is included in the probability model.
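As a notational sketch of the two model types, let π_ij denote the probability that individual i in cluster j has pulmonary TB. The symbols α, u_j and σ_u below are illustrative assumptions, not notation taken from the original analysis, and both models are shown without covariates.

```latex
% (1) Logistic regression with robust standard errors:
%     a single intercept; clustering is handled in the variance estimate only.
\log\!\left(\frac{\pi_{ij}}{1-\pi_{ij}}\right) = \alpha,
\qquad
\hat{p} = \frac{e^{\hat{\alpha}}}{1 + e^{\hat{\alpha}}}

% (2) Random-effects logistic regression:
%     a cluster-specific deviation u_j enters the probability model itself.
\log\!\left(\frac{\pi_{ij}}{1-\pi_{ij}}\right) = \alpha + u_j,
\qquad
u_j \sim N\!\left(0, \sigma_u^{2}\right)
```

In model (1) the back-transformed intercept is the overall prevalence estimate. In model (2) the intercept describes a cluster with u_j = 0, which is one reason the two models can give different overall estimates.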
Random-effects logistic regression models may be preferred for quantifying the association between risk factors and pulmonary TB prevalence, because they provide a full probability model for the data including the between-cluster variability in true TB prevalence. However, the estimation process used in these models produces a “shrunken” point estimate of the overall nationwide pulmonary TB prevalence that is too low because it is calculated as a geometric, and not arithmetic, mean of the observed cluster-specific prevalence values. Therefore, robust standard error logistic regression models, which are “population-average” models within a generalised estimating equations framework, are preferred for the overall estimation of nationwide pulmonary TB prevalence.
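The direction of this “shrinkage” can be illustrated with a small numerical sketch. It is a simplification: it back-transforms the mean of the cluster-level log odds, which corresponds to a geometric mean of the cluster odds. It does not fit a random-effects model. The cluster prevalence values and function names are hypothetical.

```python
import math

def logit(p):
    """Log odds of a proportion p (0 < p < 1)."""
    return math.log(p / (1.0 - p))

def expit(x):
    """Inverse of the logit: converts log odds back to a proportion."""
    return 1.0 / (1.0 + math.exp(-x))

def arithmetic_mean_prevalence(cluster_prev):
    """Plain average of the cluster-specific prevalence values."""
    return sum(cluster_prev) / len(cluster_prev)

def back_transformed_mean_log_odds(cluster_prev):
    """Average on the log-odds scale, then convert back to a proportion."""
    mean_log_odds = sum(logit(p) for p in cluster_prev) / len(cluster_prev)
    return expit(mean_log_odds)

# Hypothetical cluster-specific prevalence values (proportions), for illustration only.
cluster_prev = [0.002, 0.004, 0.006, 0.010, 0.020]

arith = arithmetic_mean_prevalence(cluster_prev)
shrunk = back_transformed_mean_log_odds(cluster_prev)

print(f"arithmetic mean:               {arith * 1e5:.0f} per 100,000")
print(f"back-transformed log-odds mean: {shrunk * 1e5:.0f} per 100,000")
```

For prevalence values below 50%, the back-transformed value cannot exceed the arithmetic mean, and the gap widens as between-cluster variability increases.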
To estimate overall pulmonary TB prevalence, it is recommended to use 3 methods of analysis in total, one of which does not account for missing data and two of which attempt to correct for bias due to missing data. In Figure
Methods 1-3, placed within a conceptual framework for analytical methods that attempt to correct for bias introduced by missing data.
This method uses a logistic regression model with robust standard errors, no missing value imputation, and analysis is restricted to survey participants (=N2 in Figure
This method uses a logistic regression model with robust standard errors, with missing value imputation for survey non-participants as well as participants, and includes all individuals who were eligible for the survey in the analysis (=N1 in Figure
The third method is also a logistic regression model with robust standard errors, with missing value imputation done among the subset of survey participants who were eligible for sputum examination but for whom smear and/or culture results were missing, and inverse probability weighting applied to all survey participants. This method aims to represent the whole of the survey eligible population (=N1 in Figure
Three main types of missing data mechanism have been distinguished in the literature [
(i) Missing completely at random (MCAR)
Data are MCAR if the probability that an individual has missing data on the outcome, pulmonary TB, is NOT related to either a) the value of the outcome (that is, TB case yes or no) or b) an individual characteristic that is a risk factor for the outcome (for example age, sex, stratum, cluster, TB symptoms). In this case, analysis can be restricted to individuals who DO participate fully in the survey, and an unbiased estimate of the true overall prevalence of pulmonary TB in the population will be obtained. In other words, the (probabilistic) sampling design itself automatically allows for “completely at random” missing data.
(ii) Missing at random (MAR)
In the context of a TB prevalence survey, data are MAR if two conditions are fulfilled. First, the probability that an individual has missing data for the outcome variable of pulmonary TB (yes or no)
If data are MAR, the observed prevalence of pulmonary TB can be used to predict TB (yes or no) for individuals for whom data are missing, provided this is done with stratification on at least an individual’s age, sex, area of residence, TB symptoms, and field chest X-ray reading. Having done this, an unbiased estimate of the true overall prevalence of pulmonary TB in the population can be obtained.
(iii) Missing not at random (MNAR)
Data are MNAR if the probability of an individual having missing data on the outcome variable (that is, TB case yes or no) is different for individuals who have pulmonary TB compared with individuals who do not have pulmonary TB, even after post-stratification of individuals using characteristics that are known to be risk factors for pulmonary TB (such as area of residence, age, sex). If data are MNAR, it is not possible to correct the estimate of pulmonary TB prevalence simply by using missing value imputation based on the patterns in the observed data. Instead, a sensitivity analysis is required (see below), which is an area of on-going research [
The observed data themselves cannot be used to distinguish between MAR and MNAR. Missing value imputation is implemented under the assumption that data are MAR.
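The practical difference between MCAR and MAR can be sketched with expected values for a population split into two risk groups. All numbers and names below are hypothetical. The sketch assumes that, within each group, participation is unrelated to TB status, which is the MAR condition.

```python
def true_prevalence(groups):
    """Population prevalence: group prevalence weighted by population share."""
    return sum(g["share"] * g["prev"] for g in groups)

def complete_case_prevalence(groups):
    """Expected prevalence when analysis is restricted to participants."""
    cases = sum(g["share"] * g["part"] * g["prev"] for g in groups)
    participants = sum(g["share"] * g["part"] for g in groups)
    return cases / participants

def stratified_prevalence(groups):
    """Observed prevalence among participants in each group, re-weighted by
    the group's share of the eligible population (valid under MAR)."""
    total = 0.0
    for g in groups:
        observed_prev = (g["share"] * g["part"] * g["prev"]) / (g["share"] * g["part"])
        total += g["share"] * observed_prev
    return total

# Hypothetical groups: share of eligible population, true prevalence, participation.
mcar = [
    {"share": 0.4, "prev": 0.012, "part": 0.90},
    {"share": 0.6, "prev": 0.004, "part": 0.90},
]
mar = [
    {"share": 0.4, "prev": 0.012, "part": 0.80},  # higher-risk group participates less
    {"share": 0.6, "prev": 0.004, "part": 0.95},
]

truth = true_prevalence(mar)
cc_mcar = complete_case_prevalence(mcar)
cc_mar = complete_case_prevalence(mar)
strat_mar = stratified_prevalence(mar)

print(f"true: {truth:.5f}  complete-case (MCAR): {cc_mcar:.5f}")
print(f"complete-case (MAR): {cc_mar:.5f}  stratified (MAR): {strat_mar:.5f}")
```

In this sketch the complete-case estimate is unbiased under MCAR and too low under MAR, because the higher-risk group is under-represented among participants. Stratifying on the group recovers the true value. Under MNAR no such correction is available from the observed data.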
In a TB prevalence survey, it is usually the case (based on experience to date) that age, sex, stratum, and cluster are known for all (or almost all) eligible individuals, while there will be missing data on TB symptoms, field and central chest X-ray readings, smear and culture results, and the primary outcome of pulmonary TB.
It is essential to start by exploring the extent to which data are missing, in order to understand the possible biases that may result from an analysis that is restricted to survey participants and to choose imputation models that make the MAR assumption plausible. The following three variables should be summarised: the proportion of eligible individuals who participated in the symptom and chest X-ray screening; the proportion of those with two sputum samples among people eligible for sputum examination; and the proportion with smear and culture results from 0, 1 or 2 sputum samples. These summaries should be produced overall and broken down by individual risk factors for pulmonary TB such as age group, sex and stratum, to identify which individual characteristics predict missingness.
Missing value imputation is done using regression models in a procedure called “imputation by chained equations”, and can be implemented using standard statistical software packages such as Stata, SAS, and R [
Our recommendation, following from this, is as follows. The outcome variable in a TB prevalence survey is pulmonary TB; sputum smear and culture results, the field and central chest X-ray reading, and TB symptoms are used in combination to define if an individual has pulmonary TB (see Additional file
The process described in Additional file
The overall prevalence of pulmonary TB is calculated for each imputed dataset. The national-level pulmonary TB prevalence estimate is then calculated as the average of the pulmonary TB prevalence values from each imputed dataset, with a 95% CI that takes into account both the sampling design and the uncertainty due to missing value imputation. In Stata, this can be done using the
Multiple imputation is an efficient method for accounting for missing data, provided the imputation models are specified appropriately [
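Pooling estimates across imputed datasets can be sketched with Rubin's rules. The point estimate is the average of the per-dataset estimates. The total variance adds the average within-dataset variance (which would come from the design-based, cluster-robust analysis of each imputed dataset) to an inflated between-dataset variance. This is a minimal sketch: the function names and input values are hypothetical, pooling is shown on the prevalence scale for simplicity, and the normal-approximation interval stands in for the t-based reference distribution that statistical packages typically use.

```python
import math

def pool_rubin(estimates, variances):
    """Combine per-imputation estimates and their variances with Rubin's rules.

    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least 2 imputations and one variance per estimate")
    q_bar = sum(estimates) / m                                   # pooled point estimate
    within = sum(variances) / m                                  # average within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1) # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, total

def normal_ci(estimate, total_variance, z=1.96):
    """Approximate 95% CI (normal approximation; a simplifying assumption)."""
    half_width = z * math.sqrt(total_variance)
    return estimate - half_width, estimate + half_width

# Hypothetical prevalence estimates (proportions) and variances from 3 imputed datasets.
estimates = [0.0066, 0.0068, 0.0067]
variances = [1.0e-6, 1.0e-6, 1.0e-6]

pooled, total_var = pool_rubin(estimates, variances)
low, high = normal_ci(pooled, total_var)
print(f"pooled prevalence: {pooled * 1e5:.0f} per 100,000 "
      f"(approx. 95% CI {low * 1e5:.0f} to {high * 1e5:.0f})")
```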
Survey participants can be divided into two groups, eligible or ineligible for sputum examination. Individuals who were ineligible for sputum examination are assumed not to have pulmonary TB, unless they had a normal field chest X-ray reading but an abnormal central chest X-ray reading. For those eligible for sputum examination (N6 in Figure
For each imputed dataset, a point estimate and 95% CI for population pulmonary TB prevalence is then calculated, using logistic regression with robust standard errors and weights. Weights are calculated for each combination of cluster, age group, and sex. This is done by a) counting the number of eligible individuals in each combination of cluster, age group, and sex (Nijk, for cluster
An advantage of using IPW combined with MI, rather than just MI, is that it is relatively simple and transparent to calculate the probability of survey participation by cluster, age group and sex, compared with adjusting for non-participation through the use of a multivariable imputation model [
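The weighting step can be sketched as follows, assuming each weight is the number of eligible individuals divided by the number of participants in the same combination of cluster, age group, and sex. This is the usual inverse of the participation probability and is stated here as an assumption. The data and function names are hypothetical, and the sketch gives a point estimate only, whereas a confidence interval would still require cluster-robust standard errors.

```python
from collections import Counter

def ipw_weights(eligible_cells, participant_cells):
    """Weight per (cluster, age group, sex) cell: eligible count / participant count.

    Cells with eligible individuals but no participants receive no weight and
    would need to be merged with a neighbouring cell in practice.
    """
    n_eligible = Counter(eligible_cells)
    n_participants = Counter(participant_cells)
    return {cell: n_eligible[cell] / n for cell, n in n_participants.items()}

def weighted_prevalence(participants, weights):
    """Weighted proportion with TB among participants.

    participants: list of (cell, has_tb) pairs, with has_tb equal to 0 or 1.
    """
    numerator = sum(weights[cell] * has_tb for cell, has_tb in participants)
    denominator = sum(weights[cell] for cell, _ in participants)
    return numerator / denominator

# Hypothetical example: two cells within one cluster.
cell_a = (1, "15-34", "M")   # 10 eligible, 5 participated
cell_b = (1, "35+", "F")     # 10 eligible, 10 participated
eligible = [cell_a] * 10 + [cell_b] * 10
participants = ([(cell_a, 1)] + [(cell_a, 0)] * 4 +
                [(cell_b, 1)] + [(cell_b, 0)] * 9)

weights = ipw_weights(eligible, [cell for cell, _ in participants])
unweighted = sum(tb for _, tb in participants) / len(participants)
weighted = weighted_prevalence(participants, weights)
print(f"unweighted: {unweighted:.4f}  weighted: {weighted:.4f}")
```

Each participant in the under-represented cell counts for two eligible individuals, so that cell's prevalence contributes in proportion to its share of the eligible population.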
If point estimates of pulmonary TB prevalence and their confidence intervals vary greatly among Methods 1–3, it is essential to try to understand the reasons for the differences and to interpret the survey results in light of these limitations. Method 1 introduces biases, as explained above, so it is not surprising if it provides a prevalence estimate that differs from the one obtained from Methods 2 and 3. If the prevalence estimates from Methods 2 and 3 are considerably different, this may be due to misspecification of the imputation models used in Method 2.
A simple way to implement a sensitivity analysis is to use as a starting point the imputed datasets that were created using Method 2.
For an “extreme” situation in which there are no pulmonary TB cases among non-participants, the prevalence of pulmonary TB is estimated simply as the observed number of pulmonary TB cases divided by the total eligible survey population. For an opposite “extreme” in which the risk of pulmonary TB is twice as high among non-participants as in participants (within sub-groups defined by stratum, age group, sex, and other variables included in the imputation model for pulmonary TB), the number of pulmonary TB cases among non-participants is estimated for each imputed dataset as 2t_i, where t_i is the number of pulmonary TB cases that were imputed in the i-th imputed dataset. The overall pulmonary TB prevalence is then calculated by adding the average of the 2t_i values to the number of pulmonary TB cases among survey participants, and dividing this sum by the total eligible survey population.
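The two “extreme” calculations can be sketched as a short function. All input values are hypothetical and the function name is our own. The doubling factor is exposed as a parameter so that other assumed risk ratios can be explored.

```python
def sensitivity_range(cases_among_participants, n_eligible, imputed_cases, risk_ratio=2.0):
    """Return (lower, upper) prevalence under the two extreme assumptions.

    lower: no TB cases among non-participants.
    upper: risk among non-participants is `risk_ratio` times that implied by
           the imputation, averaged over the imputed datasets.
    imputed_cases: number of imputed TB cases in each imputed dataset.
    """
    lower = cases_among_participants / n_eligible
    mean_scaled = sum(risk_ratio * t for t in imputed_cases) / len(imputed_cases)
    upper = (cases_among_participants + mean_scaled) / n_eligible
    return lower, upper

# Hypothetical inputs: 100 observed cases, 20,000 eligible individuals,
# and 10, 12 and 8 imputed cases in three imputed datasets.
lower, upper = sensitivity_range(100, 20000, [10, 12, 8])
print(f"range: {lower * 1e5:.0f} to {upper * 1e5:.0f} per 100,000")
```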
Simulation studies were done for 4 plausible scenarios through which missing data could be generated in TB prevalence surveys. We explored missingness of data on the outcome of prevalent TB by age, sex, stratum and cluster. We chose these four variables on the basis that they are associated both with the outcome and the reason for missingness [
Missing values were then introduced into this dataset to create 1000 datasets with missing data on the field chest X-ray reading and TB symptoms, and smear and culture results, for each of the following 4 scenarios:
1. Differential participation by age group, sex, and stratum (n = 3), with overall participation approximately 90%; 15% of smear and culture results missing completely at random among individuals eligible for sputum examination; overall, 19% of eligible individuals with missing data on pulmonary TB.
2. Differential participation by age group, sex, and cluster (n = 50), with overall participation approximately 90%; 15% of smear and culture results missing completely at random among individuals eligible for sputum examination; overall, 20% of eligible individuals with missing data on pulmonary TB.
3. As for 2, but among individuals eligible for sputum examination, the probability of missing smear and culture results varied among the 3 strata; overall, 20% of eligible individuals with missing data on pulmonary TB.
4. As for 2, but among individuals eligible for sputum examination the probability of missing smear and culture results varied among the 50 clusters; overall, 20% of eligible individuals with missing data on pulmonary TB.
To illustrate the 3 methods of analysis outlined above, we use the 2007 national TB prevalence survey in the Philippines. In this example, the eligible survey population was individuals aged ≥10 years old, which is different from the current WHO recommendation for the survey population to consist of individuals aged ≥15 years old [
Overall, participation was high at 90% of eligible individuals, though it was higher in rural and urban areas than in the capital city, lower among 20–39 year olds than other age groups, and the age-pattern of survey participation differed between men and women (data not shown). Additional details about the survey are provided elsewhere [
Results for the prevalence of pulmonary TB are summarised in Table
Prevalence of pulmonary TB (per 100,000 population) in the Philippines 2007 national TB prevalence survey

| Estimate (95% CI) | Cluster-level analysis | Method 1¹ | Method 2² | Method 3³ |
|---|---|---|---|---|
| Overall | 663 (516–810) | 660 (520–810) | 660 (530–800) | 680 (530–830) |
| Metro Manila | 671 (238–1105) | 670 (100–1240) | 640 (160–1120) | 710 (100–1320) |
| Other urban | 671 (421–921) | 660 (470–860) | 680 (500–860) | 700 (490–910) |
| Rural | 655 (447–863) | 660 (450–870) | 650 (460–850) | 660 (440–870) |

| Crude prevalence⁵ | Cases/participants screened (per 100,000, 95% CI) |
|---|---|
| Overall | 136/20,544 (660, 560–780) |
| Metro Manila | 15/2,253 (670, 370–1100) |
| Other urban | 50/7,519 (660, 490–880) |
| Rural | 71/10,772 (660, 520–830) |

¹ Robust standard errors.
² Robust standard errors with missing value imputation.
³ Robust standard errors with missing value imputation and inverse probability weighting.
⁴ Stratum-specific estimates are calculated from an overall regression model including all clusters and all individuals, with stratum fitted as a fixed effect in the model.
⁵ Crude prevalence is calculated as the total number of individuals with a positive smear and/or culture result divided by the total number of individuals who have been screened for TB by chest X-ray and/or interview. The confidence interval for this estimate is calculated with exact binomial probability theory.
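The crude prevalence and an exact binomial confidence interval can be sketched as follows. The sketch assumes that “exact binomial” refers to the Clopper-Pearson interval, and finds the limits by bisection on the binomial distribution function using only the standard library. The function names are our own. The counts are the overall crude figures given in the table (136 cases among 20,544 individuals screened).

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)

def _bisect(func, lo, hi, target, iterations=60):
    """Solve func(p) = target on [lo, hi] for a function that increases with p."""
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def exact_binomial_ci(x, n, alpha=0.05):
    """Clopper-Pearson interval for x events among n individuals."""
    if x == 0:
        lower = 0.0
    else:
        # Lower limit: P(X >= x | p) = alpha / 2
        lower = _bisect(lambda p: 1.0 - binom_cdf(x - 1, n, p), 0.0, x / n, alpha / 2.0)
    if x == n:
        upper = 1.0
    else:
        # Upper limit: P(X <= x | p) = alpha / 2, i.e. P(X > x | p) = 1 - alpha / 2
        upper = _bisect(lambda p: 1.0 - binom_cdf(x, n, p), x / n, 1.0, 1.0 - alpha / 2.0)
    return lower, upper

cases, screened = 136, 20544
crude = cases / screened
ci_low, ci_high = exact_binomial_ci(cases, screened)
print(f"crude prevalence: {crude * 1e5:.0f} per 100,000 "
      f"(exact 95% CI {ci_low * 1e5:.0f} to {ci_high * 1e5:.0f})")
```

This interval treats individuals as independent and ignores the clustered design, which is why the crude estimate is reported separately from the cluster-level and model-based analyses.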
The point prevalence estimate of pulmonary TB from Method 3, combining multiple imputation with inverse probability weighting, is slightly higher than the estimates from Methods 1 and 2, at 680 per 100,000 and with a slightly wider confidence interval.
Among survey participants, multiple imputation of missing smear and culture results increases the estimate of the prevalence of pulmonary TB from 660 to 670 per 100,000. This is a relatively small increase, reflecting that among individuals eligible for sputum examination the proportion with missing data on smear and/or culture results was very low. Using inverse probability weighting to account for differentials in survey participation by cluster, age group, and sex increases the prevalence estimate from 670 to 680 per 100,000.
Overall, the cluster-level analysis and the results from each of Methods 1, 2, and 3 show that the best estimate of pulmonary TB prevalence is of the order of 660–680 per 100,000 population among individuals aged ≥10 years old, with the 95% CIs across these analyses spanning 516 to 830 per 100,000 population.
A sensitivity analysis in which pulmonary TB prevalence among non-participants ranges from 0 to being twice as high as among participants, gives a range of the point estimate of pulmonary TB prevalence from 595 to 731 per 100,000 population, compared with the estimate from Methods 1 and 2 of 660 per 100,000.
For all of scenarios 1–4, we analysed each of the 1000 datasets using Methods 1, 2 and 3. For both Methods 2 and 3, 20 imputed datasets were created for each of the 1000 “starting” datasets. Simulation results showed that for all 4 scenarios, Method 1 underestimated TB prevalence by an average of approximately 9–10%, with prevalence estimates lower than the true value of 1263 per 100,000 for 97% of the Scenario 4 simulations. Method 2 overestimated TB prevalence by an average of between 1.0% and 1.4%, while Method 3 estimated TB prevalence to an average that was within 1% of the true value. Details of the results are summarised in Table
Simulation study results, for 4 scenarios of how missing data could arise in a prevalence survey

| Scenario | Method 1¹: mean (SD)⁴ | Relative bias (%)⁵ | Method 2²: mean (SD)⁴ | Relative bias (%)⁵ | Method 3³: mean (SD)⁴ | Relative bias (%)⁵ |
|---|---|---|---|---|---|---|
| 1 | 1143 (60.0) | −10 | 1276 (65.5) | 1.0 | 1273 (66.0) | 0.8 |
| 2 | 1144 (64.0) | −9 | 1279 (70.4) | 1.3 | 1270 (70.1) | 0.6 |
| 3 | 1139 (65.0) | −10 | 1278 (71.4) | 1.2 | 1269 (71.8) | 0.5 |
| 4 | 1144 (64.8) | −9 | 1281 (71.2) | 1.4 | 1272 (71.7) | 0.7 |

¹ Robust standard errors.
² Robust standard errors with missing value imputation.
³ Robust standard errors with missing value imputation and inverse probability weighting.
⁴ Mean estimate of pulmonary TB prevalence (per 100,000 population aged ≥10 years old), over 1000 simulations, and standard deviation of the 1000 pulmonary TB prevalence estimates. The true value of TB prevalence in these data was 1263 per 100,000 population.
⁵ The relative bias is defined as the percentage 100 × (mean − true)/true. Negative values indicate underestimation, and positive values overestimation, of the true prevalence by the simulated series of data.
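The relative bias can be checked with a short calculation. The function name is our own. The inputs are the true prevalence of 1263 per 100,000 and mean estimates reported in the simulation table.

```python
def relative_bias_percent(mean_estimate, true_value):
    """Relative bias as a percentage: 100 * (mean - true) / true."""
    return 100.0 * (mean_estimate - true_value) / true_value

TRUE_PREVALENCE = 1263  # per 100,000, as stated for the simulated data

# Mean estimates reported for the first scenario in the table (Methods 1, 2 and 3).
for label, mean_estimate in [("Method 1", 1143), ("Method 2", 1276), ("Method 3", 1273)]:
    bias = relative_bias_percent(mean_estimate, TRUE_PREVALENCE)
    print(f"{label}: {bias:+.1f}%")
```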
We recommend that the method that uses the cluster as the unit of analysis should remain the first step in the analysis of a TB prevalence survey [
Following a general recommendation [
Overall, we recommend Method 3, inverse probability weighting combined with multiple imputation of missing data among individuals eligible for sputum examination, as the method that provides the safest approach and the single best estimate of population pulmonary TB prevalence.
CC: Complete case; CI: Confidence interval; IPW: Inverse probability weighting; MAR: Missing at random; MCAR: Missing completely at random; MI: Multiple imputation; MNAR: Missing not at random; NTP: National TB Programme; PPS: Probability proportional to size; TB: Tuberculosis; WHO: World Health Organization.
All authors have no competing interests to declare.
SF, CS, and KF wrote the paper; all co-authors suggested edits and gave comments on drafts of the manuscript, and all approved the final version. SF and CS led the analytical work, with important contributions from NY, RD, FM, PG, IO, and KF. RD, EB, ET, and FM contributed to the chapter in the WHO handbook on the analysis of TB prevalence surveys (2011). IO is the lead person in WHO for TB prevalence surveys. JL and RV took key roles in the planning and implementation of the TB prevalence survey that was conducted in the Philippines during 2007.
Multiple missing value imputation for analysis of pulmonary TB prevalence.
Charalambos Sismanidis, Katherine Floyd, Ikushi Onozaki, and Philippe Glaziou are staff members of the World Health Organization. The authors alone are responsible for the views expressed in this publication and they do not necessarily represent the decisions or policies of the World Health Organization.
The findings and conclusions in this manuscript are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.