Development of methods to accurately estimate HIV incidence rate remains a challenge. Ideally, one would follow a random sample of HIV-negative individuals under a longitudinal study design and identify incident cases as they arise. Such designs can be prohibitively resource intensive and therefore alternative designs may be preferable. We propose such a simple, less resource-intensive study design and develop a weighted log likelihood approach which simultaneously accounts for selection bias and outcome misclassification error. The design is based on a cross-sectional survey which queries individuals’ time since last HIV-negative test, validates their test results with formal documentation whenever possible, and tests all persons who do not have documentation of being HIV-positive. To gain efficiency, we update the weighted log likelihood function with potentially misclassified self-reports from individuals who could not produce documentation of a prior HIV-negative test and investigate large sample properties of validated sub-sample only versus pooled sample estimators through extensive Monte Carlo simulations. We illustrate our method by estimating incidence rate for individuals who tested HIV-negative within 1.5 and 5 years prior to Botswana Combination Prevention Project enrolment. This paper establishes that accurate estimates of HIV incidence rate can be obtained from individuals’ history of testing in a cross-sectional cohort study design by appropriately accounting for selection bias and misclassification error. Moreover, this approach is notably less resource-intensive compared to longitudinal and laboratory-based methods.

Incidence, the rate of new cases of a disease within a specified period of time,^{1} plays a key role in understanding the dynamics of an epidemic and in evaluating the impact of public health interventions to control its spread. In the case of Human Immunodeficiency Virus (HIV) in Sub-Saharan Africa, obtaining reliable incidence estimates poses several challenges, including a lengthy incubation time that results in slow accumulation of data and a lack of precise estimates for the incubation period and false recency rates, among others.^{2,3} A direct approach is to estimate HIV incidence rate from longitudinal studies in which a representative sample of disease-free individuals is followed over time and new cases are recorded. Incidence rate is then computed as the number of persons newly infected with HIV during a specified time period divided by the cumulative person-time at risk of infection.^{4} However, cohort studies are resource intensive, time consuming, and also subject to selection bias if retention is low.^{5} To avoid such problems, methods have been developed that estimate incidence rate from cross-sectional data. Among these are assay-based techniques, which use levels and proportions of certain antibodies in a blood sample to indicate whether an infection is recent or has been present for some time. These methods have been justified by maximum likelihood criteria and improved by incorporating past prevalence and false recency rates into incidence estimates.^{6,7,8,2} Other approaches are based on mathematical models that decompose observed changes in prevalence between two sero-surveys into contributions of new infections and mortality, assuming incidence remains constant between surveys.^{9,10} However, these methods are sensitive to the use of antiretroviral therapy (ART); hence, with a global commitment to scaling up ART to include every HIV-positive individual,^{11} they are bound to misclassify some incident cases as established ones and vice versa.
An alternative method is to estimate HIV incidence rate from a cross-sectional cohort study design, where selection of subjects is done presently and assessment covers both individuals’ present and past experiences.^{12} This design is less resource intensive as it allows one to calculate rates in a one-time survey and according to ^{12} To estimate HIV incidence, we propose a weighted log likelihood approach based on a cross-sectional study design, that queries individuals’ time since they last tested for HIV, validates the test results with formal documentation whenever possible, and tests all persons who do not have documentation of an HIV-positive status. The weights correct for differences in key HIV risk factors between persons with and without documentation of most recent HIV test. To gain efficiency, we incorporate into the weighted log-likelihood available information on error-prone self-reports from individuals who could not produce documentation. Our approach addresses two potential problems that arise in the cross-sectional cohort study design;

^{13,14} Properties of such weighted estimators may be deduced by viewing them as solutions to a set of estimating equations and appealing to the well-established theory of M-estimation.^{15,16} To gain more statistical efficiency, we incorporate into the weighted log-likelihood available information on error-prone self-reports from individuals who could not produce documentation.

^{17} We account for misclassification error by incorporating into our proposed weighted log likelihood function an explicit probabilistic model relating self-report records to documented dates of last HIV test estimated in the validated sub-sample where both are available. We refer to this new approach as a “pooled cross-sectional cohort study design”, where the additional term refers to the fact that we are augmenting documented dates with self-reports.

To investigate finite sample properties of our weighted estimator, we conduct extensive Monte Carlo simulations, and compare our new estimator with other available methods. We then use the “Ya Tsie” data (also known as Botswana Combination Prevention Project, or BCPP) to estimate HIV incidence rates for 1.5 and 5 years prior to the survey simultaneously accounting for selection bias and misclassification error. The rest of the paper is organised as follows; In

The methods developed in this paper are largely motivated by the BCPP study, a pair-matched cluster-randomized trial funded by the United States President’s Emergency Plan for AIDS Relief and designed to test whether a package of combination prevention interventions reduces population-level cumulative 30-month HIV incidence. The trial is being conducted in 30 communities in Botswana (15 matched pairs) with a total population of about 180,000 people, representing nearly 10 % of Botswana’s estimated population. Fifteen communities were randomized to a combination prevention arm and 15 to a non-intervention arm. Interventions in the combination prevention group include home-based and mobile HIV testing and counselling; point-of-care CD4 testing; linkage to care support; expanded ART; and enhanced male circumcision services. Detailed BCPP study procedures were previously published.^{18} As part of this study, a random sample of 12,610 adults in 30 communities throughout Botswana, representing approximately 20 % of their respective households, was enrolled. HIV status was obtained for 99.7 % of trial participants at enrolment (either through a documented positive HIV status or in-home rapid testing). Additionally, self-reported information on prior HIV testing and, when available, corresponding documentation of the self-reported result were also obtained at enrolment. In our analysis, these two sources of information, (1) self-reported and (2) documented dates of most recent HIV-negative tests, were combined to retrospectively construct a cohort of HIV-negative persons, all of whom underwent HIV testing at enrolment. Participants who reported dates of their last HIV-negative test in the last 1.5 years (about 548 days) prior to BCPP enrolment were included in the primary analysis. In secondary analysis, we expanded the study population to include all subjects reporting dates of last HIV-negative test within the prior 5 years (about 1,825 days).
We defined an incident case as a person with a previous HIV-negative test result who subsequently tested HIV-positive on the date of BCPP enrolment. Person-time at risk of infection was calculated from the date of the most recent HIV-negative test to the date of testing at BCPP enrolment. In this way, we identified 6,542 and 6,942 individuals for the primary and secondary analyses respectively.
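As a rough illustration of the case definition and person-time calculation described above, the following Python sketch computes a crude incidence rate per 100 person-years from a few hypothetical records. The function names, the 365.25-day year and the example dates are assumptions made for illustration and are not taken from the BCPP analysis. Note also that this crude rate attributes the whole follow-up interval to each incident case, whereas the likelihood approach developed in this paper treats the infection time as unobserved within that interval.

```python
from datetime import date

def person_years(last_negative: date, enrolment: date) -> float:
    """Years at risk between the last HIV-negative test and the enrolment test."""
    return (enrolment - last_negative).days / 365.25

def crude_incidence_per_100py(records):
    """records: list of (last_negative_date, enrolment_date, positive_at_enrolment)."""
    cases = sum(1 for _, _, positive in records if positive)
    total_py = sum(person_years(neg, enrol) for neg, enrol, _ in records)
    return 100.0 * cases / total_py

# Hypothetical records for illustration only (not BCPP data).
records = [
    (date(2013, 1, 1), date(2014, 1, 1), False),
    (date(2013, 7, 1), date(2014, 1, 1), True),
    (date(2012, 7, 1), date(2014, 1, 1), False),
]
rate = crude_incidence_per_100py(records)
print(f"crude incidence: {rate:.2f} per 100 person-years")
```

With only three records the resulting figure is meaningless as an estimate; the point is the bookkeeping of cases and person-time.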

Let _{i} denote person _{i} denote documented time from the last HIV negative test to BCPP enrolment and HIV test, hereafter referred to as retrospective follow-up time. We define a constant _{i} ≤ _{i} be corresponding discretized version of _{i} with _{i}, the former may be viewed as a misclassified version of the latter. Let _{i} = 1 if individual _{i} = 0 denotes an individual with self-reported HIV negative test during the at-risk follow-up period without formal documentation. We refer to these persons as validated (_{T} denote population density of _{i}, _{F} be population density of _{i} given

Throughout our analysis, we make the following assumptions;

Non-differential misclassification, i.e.,

Coarsened misclassification, i.e.,

Constant hazard rate of infection, i.e.,_{i} ~ exponential (

Constant hazard of testing times, i.e., _{i} ~ exponential(_{.}

Missing at random (MAR), i.e.,

Assumption 1 implies that the probability mass function of self-reported information given documented date of last HIV negative test, HIV status, selection into validated or non-validated sub-samples and a set of covariates measured at BCPP enrolment only depends on documented date of last HIV negative test. Furthermore, from Assumption 2, this function depends on true retrospective follow-up time only through the time interval it belongs to. We encode this model as polytomous logistic regression for misclassification error with unknown parameters

Assumption 3 is reasonable for short enough retrospective follow-up time such as 1.5 years and could be relaxed by assuming a piece-wise constant hazard if necessary. It is also important to note that Assumption 4 can be replaced by an alternative choice of parametric model. Under Assumption 5, we specify a parametric model for the selection process of the form;

We propose to estimate γ with the standard maximum likelihood estimator

Let

Under assumptions 1 to 4 and the weaker Assumption 5, i.e., selection into the validated sub-sample depends only on observed data

Note that the un-weighted estimating equation corresponds to

In order to evaluate the performance of our proposed estimator, we conducted Monte Carlo simulations under conditions motivated by the BCPP dataset. For all individuals (_{i} and time to sero-conversion _{i} from exponential distributions with parameters 0.3 and 0.2 respectively. As defined earlier, HIV status at the cross-sectional survey was then given as _{1i} and _{2i} from Normal (0,4.41) and (0,1.44) respectively, then _{i} = 1) and non-validated (_{i} = 0) sub-samples comparable to BCPP, we simulated a binary variable _{i} from Bernoulli _{i} into 4 categories to define _{i}, taking values 0 if _{i} > 7 in order to have a reasonably short period of retrospective follow-up motivated by BCPP. To simulate self-reported times since individuals’ last HIV negative test, we constructed

We performed two types of simulations: un-weighted and weighted analyses. For each, we compared the validated sample-only and pooled sample estimators. In the un-weighted analyses, the validated sample-only estimator uses exclusively the documented, error-free times since individuals’ last HIV-negative test, while the pooled estimator also incorporates error-prone self-reports, accounting for misclassification but not for selection. The weighted analyses adjusted individuals’ log-likelihood contributions with inverse probability weights of selection into validated or non-validated sub-samples given _{1i},_{2i} and _{3i} in the two estimators to account for selection bias. Estimated weights were computed based on the MLE of a correctly specified logistic regression model for ^{19}
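The contrast between un-weighted and weighted validated-only analyses can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the simulation design used in the paper: testing and sero-conversion times are drawn from exponential distributions with rates 0.3 and 0.2, documentation is made to depend on HIV status alone through the hypothetical probabilities in `p_doc`, the selection probabilities are estimated by group proportions rather than by a logistic regression on covariates, and self-reports and misclassification are ignored. The likelihood contribution assumed here is P(positive | follow-up t) = 1 - exp(-λt), which follows from a constant hazard λ.

```python
import math
import random

def constant_hazard_mle(times, status, weights, lo=1e-6, hi=5.0, tol=1e-9):
    """Root of the weighted score for lambda under P(Y=1 | T=t) = 1 - exp(-lambda*t)."""
    def score(lam):
        total = 0.0
        for t, y, w in zip(times, status, weights):
            # d/d(lambda) of log(1 - exp(-lam*t)) is t / (exp(lam*t) - 1);
            # d/d(lambda) of -lam*t is -t.
            total += w * (t / math.expm1(lam * t) if y else -t)
        return total
    # The score is decreasing in lambda, so bisection finds its single root.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

random.seed(2024)
n, true_rate = 20000, 0.2
times = [random.expovariate(0.3) for _ in range(n)]           # time since last negative test
status = [random.expovariate(true_rate) <= t for t in times]  # sero-converted before survey
# Hypothetical selection mechanism: positives are less likely to hold documentation.
p_doc = {False: 0.6, True: 0.3}
validated = [random.random() < p_doc[y] for y in status]

# Estimated selection probabilities: validated fraction within each status group.
p_hat = {}
for y in (False, True):
    group = [v for v, s in zip(validated, status) if s == y]
    p_hat[y] = sum(group) / len(group)

v_times = [t for t, v in zip(times, validated) if v]
v_status = [y for y, v in zip(status, validated) if v]
unweighted = constant_hazard_mle(v_times, v_status, [1.0] * len(v_times))
weighted = constant_hazard_mle(v_times, v_status, [1.0 / p_hat[y] for y in v_status])
print(f"unweighted: {unweighted:.3f}  weighted: {weighted:.3f}  truth: {true_rate}")
```

In this toy setting the un-weighted validated-only estimate falls well below the true rate, while the inverse-probability-weighted estimate is close to it.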

We used BCPP enrolment data to obtain both un-weighted and weighted estimates of HIV incidence rates for individuals reporting negative status in the last 1.5 and 5 years prior to this survey using the validated sample-only and pooled sample estimators. ^{20} Evidence of misclassification error is shown in

We have proposed a resource-efficient cross-sectional cohort study design that estimates HIV incidence rate by querying individuals’ history of HIV testing, validating it with formal documentation where possible, and testing all persons not known to be HIV-positive at the cross-sectional visit. We proposed, and validated through extensive Monte Carlo simulations, a corresponding weighted log likelihood estimator for incidence rate under this study design and model assumptions. This estimator combines individuals’ self-reported and documented times since their last HIV-negative test and simultaneously accounts for possible selection bias and misclassification error, assuming no model misspecification. Our estimator is therefore robust to both potential sources of bias in cross-sectional cohort studies, as shown in simulation studies and an application to BCPP enrolment data. Our best estimate of HIV incidence rate is largely consistent with figures from the Botswana AIDS Impact Survey, which estimated a crude incidence rate of 1.35 per 100 person-years at risk of infection using the Recent Infection Testing Algorithm (RITA),^{21} and with BCPP laboratory-based estimates, which rely on recency assays (incidence rate = 1.06, 95 % CI = (0.70, 1.42)), but it is notably more statistically efficient and less resource intensive. Our confidence intervals were 77 % and 71 % narrower than those of Abuelezam et al.,^{20} who obtained inverse probability weighted incidence rates of 0.98 (0.32, 1.65) and 1.01 (0.52, 1.51) per 100 person-years at risk of infection for the 1.5 and 5 year retrospective follow-up periods respectively, using validated-only samples. Although our estimator is evidently more efficient, it has potential threats and limitations. These include differential exiting from the study population over time, i.e., the rate at which participants exit depends on HIV status. Common causes may be high mortality and out-migration rates among HIV-positive individuals.
We acknowledge that this scenario may seriously affect our estimator. However, because Botswana is close to the 90-90-90 targets of high ART coverage and viral suppression,^{18} we hope that mortality and migration are not major problems in the BCPP baseline data set. Moreover, we expect more countries to commit to the UNAIDS 90-90-90 targets in future; this should reduce HIV-related mortality rates and hence the effect of this type of differential exiting from the study population.
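The interval-width comparison with Abuelezam et al. quoted in this discussion can be checked with simple arithmetic. The sketch below assumes that the comparison is between the weighted pooled 95 % limits reported in the results table of this paper, (1.1220, 1.4220) for 1.5 years and (1.0260, 1.3170) for 5 years, and the published limits (0.32, 1.65) and (0.52, 1.51); the function name is hypothetical.

```python
def percent_narrower(ours, theirs):
    """Percentage by which the interval `ours` is narrower than `theirs`."""
    ours_width = ours[1] - ours[0]
    theirs_width = theirs[1] - theirs[0]
    return 100.0 * (1.0 - ours_width / theirs_width)

# Weighted pooled limits from this paper versus validated-only limits of Abuelezam et al.
narrower_1_5_years = percent_narrower((1.1220, 1.4220), (0.32, 1.65))
narrower_5_years = percent_narrower((1.0260, 1.3170), (0.52, 1.51))
print(round(narrower_1_5_years), round(narrower_5_years))
```

Under that assumption the widths are 0.30 versus 1.33 and 0.291 versus 0.99, which reproduces the reported 77 % and 71 %.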

Another potential limitation is that HIV status and testing may not be independent. That is, individuals may go for testing because of a sero-conversion-related illness or because they have engaged in some form of risk-inducing behaviour; as a result, sampled individuals would differ systematically from those excluded. As a sensitivity analysis, we regressed an indicator variable (_{i}=1 for individuals who reported to have tested within _{i}. This analysis additionally accounts for bias related to dependence between HIV testing and variables such as risk-inducing behaviours. For the two retrospective follow-up times respectively, the validated-only analysis appears to be robust to these additional potential sources of selection bias, i.e., 1.13 (SE=0.33) and 1.00 (SE=0.22), while the pooled analysis results were more sensitive, with point estimates increasing by 37 % and 42 %, i.e., to 1.74 (SE=0.12) and 1.66 (SE=0.11). Results are reported in

Our estimator is also sensitive to misspecification of the selection model used to construct inverse probability weights. That is, if this model is misspecified, the incidence rate estimated under our proposed method will generally be biased.^{14,22} In the future, we hope to develop a doubly robust inverse probability weighted estimator to additionally account for possible partial model misspecification. We did not formally account for a possible clustering effect; however, this effect was negligible and is unlikely to affect our results. A notable possibility is that self-reported time since last HIV test could depend on a person’s underlying HIV status (differential misclassification), e.g., individuals may report that they tested negative six months ago when they actually know that they are positive. We intend to address this problem by using instrumental variables in future work.

Financial disclosure

This work was supported through the Sub-Saharan African Network for TB/HIV Research Excellence (SANTHE), a DELTAS Africa Initiative [grant # DEL-15-006]. The DELTAS Africa Initiative is an independent funding scheme of the African Academy of Sciences (AAS)’s Alliance for Accelerating Excellence in Science in Africa (AESA) and supported by the New Partnership for Africa’s Development Planning and Coordinating Agency (NEPAD Agency) with funding from the Wellcome Trust [grant #107752/Z/15/Z] and the UK government. The views expressed in this publication are those of the author(s) and not necessarily those of AAS, NEPAD Agency, Wellcome Trust or the UK government.

The BCPP study was supported by the US President’s Emergency Plan for AIDS Relief (PEPFAR) through the Centers for Disease Control and Prevention (CDC) under the terms of cooperative agreement U01 GH000447 and GH0001911.

Eric Tchetgen Tchetgen and Kathleen E. Wirth received support from the National Institutes of Health (NIH) R01AI27271.

Eric Tchetgen Tchetgen also received a grant from the National Cancer Institute (NCI) R01CA222147.

Conflict of interest

The authors declare no potential conflicts of interest.

Data availability

The BCPP enrolment dataset is available upon request from BCPP team. Contact Person: Molly Pretorius Holme,

From

Among this group, an individual is HIV positive

So

Therefore

From Assumption 4 we have,

Therefore for this group, an individual’s data likelihood is formally expressed as;

For Δ_{i} = 1, the integral in

For

Corresponding score equation for

Under Assumptions 1 to 4 and weaker Assumption 5, i.e., selection into the validated sub-sample depends only on observed data

We propose to estimate

We show that the estimating equation has mean zero (i.e., unbiased) at the true value of

Now,

This result shows that the estimating equation is unbiased at the true value

Suppose that ^{15}

AMN means “asymptotically multivariate normal.” The asymptotic variance of

Human Immunodeficiency Virus

Probabilities assumed for simulating self-reported times since last HIV-negative test conditional on documented ones

| _{i} | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| | 0.71 | 0.27 | 0.01 | 0.01 |
| | 0.27 | 0.79 | 0.16 | 0.12 |
| | 0.1 | 0.12 | 0.73 | 0.29 |
| | 0.01 | 0.01 | 0.10 | 0.58 |

Source: Monte Carlo Simulation.

Monte Carlo simulation results for validated sample-only versus pooled sample estimators for un-weighted and weighted log likelihood functions in presence of selection bias and misclassification error, n=7000

Estimate | Un-weighted: Validated-only | Un-weighted: Pooled | Weighted: Validated-only | Weighted: Pooled |
---|---|---|---|---|
Monte Carlo Bias | 0.0525 | 0.0591 | 0.0001 | 0.0025 |
Monte Carlo Percent Bias | 26.2362 | 29.5718 | 0.0669 | 1.2454 |
Monte Carlo Mean Square Error | 0.0028 | 0.0035 | 0.00005 | 0.00003 |
Relative Efficiency | 93.5667 | 117.7333 | 1.5333 | 1 |

Source: Monte Carlo Simulation Results.
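For readers who wish to reproduce summaries of this kind, the Python sketch below computes Monte Carlo bias, percent bias and mean square error from a list of replicate estimates. The function names and the replicate values are hypothetical. Relative efficiency is taken here as the ratio of an estimator's mean square error to that of a reference estimator, which is an assumption; it appears consistent with the table above, where the weighted pooled estimator has value 1 and the ratios of the rounded mean square errors are close to the reported figures.

```python
from statistics import mean

def monte_carlo_summary(estimates, truth):
    """Bias, percent bias and mean square error of replicate estimates."""
    bias = mean(estimates) - truth
    mse = mean((e - truth) ** 2 for e in estimates)
    return {"bias": bias, "percent_bias": 100.0 * bias / truth, "mse": mse}

def relative_efficiency(mse, reference_mse):
    """MSE of an estimator relative to a reference estimator (assumed definition)."""
    return mse / reference_mse

# Hypothetical replicate estimates of a true rate of 0.2 (not the paper's output).
summary = monte_carlo_summary([0.19, 0.21, 0.22], truth=0.2)
ratio = relative_efficiency(0.0028, 0.00003)  # rounded values from the table
print(summary, round(ratio, 1))
```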

Un-weighted and weighted estimates of HIV incidence rates per 100 person-years at risk, standard errors and 95 % confidence intervals for individuals with an HIV-negative test in the 1.5 and 5 years prior to BCPP enrolment.

Retrospective follow-up | Estimate | Un-weighted: Validated only | Un-weighted: Pooled | Weighted: Validated only | Weighted: Pooled |
---|---|---|---|---|---|
1.5 years | Incidence Rate | 1.3509 | 8.8753 | 1.1021 | 1.2665 |
1.5 years | Standard Error | 0.3747 | 0.5526 | 0.3310 | 0.0790 |
1.5 years | 95 % lower limit | 0.6166 | 7.7922 | 0.5400 | 1.1220 |
1.5 years | 95 % upper limit | 2.0853 | 9.9583 | 1.8150 | 1.4220 |
5 years | Incidence Rate | 1.1410 | 4.9690 | 1.0189 | 1.1671 |
5 years | Standard Error | 0.2379 | 0.3014 | 0.2370 | 0.0790 |
5 years | 95 % lower limit | 0.6747 | 4.3782 | 0.5690 | 1.0260 |
5 years | 95 % upper limit | 1.6073 | 5.5597 | 1.5020 | 1.3170 |

Source: BCPP enrolment. Standard errors reported for weighted analyses are from bootstrap samples with replacement.
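The note above states that standard errors for the weighted analyses come from bootstrap samples drawn with replacement. A minimal sketch of that idea in Python follows; the function `bootstrap_se`, the number of replicates, the crude-rate estimator and the simulated records are all assumptions for illustration. In a weighted analysis one would typically re-estimate the selection weights inside each bootstrap replicate, which this sketch does not attempt.

```python
import random
from statistics import stdev

def bootstrap_se(data, estimator, n_boot=500, seed=0):
    """Standard deviation of an estimator over resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(n_boot):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(estimator(resample))
    return stdev(replicates)

def crude_rate_per_100py(records):
    """records: list of (years_at_risk, positive_at_survey)."""
    cases = sum(1 for _, positive in records if positive)
    return 100.0 * cases / sum(years for years, _ in records)

# Hypothetical records for illustration only (not BCPP data).
rng = random.Random(1)
records = [(rng.uniform(0.1, 1.5), rng.random() < 0.05) for _ in range(400)]
point = crude_rate_per_100py(records)
se = bootstrap_se(records, crude_rate_per_100py)
print(f"rate: {point:.2f} per 100 person-years, bootstrap SE: {se:.2f}")
```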

Baseline characteristics of N=7221 participants with HIV negative test result prior to in-home HIV testing during the BCPP enrolment survey, overall and according to availability of accompanying test documentation and time since last documented test, Botswana, 2013–2015.

Availability of documented HIV negative result | ||||
---|---|---|---|---|

Characteristic | Overall (N=7221) | Self-reported, no documentation (N=4740) | Documented within 5 years (N=2378) | Documented within 1.5 years (N=1967) |

Age at the BHS (n=7221) | ||||

16 to 24 years | 1945 (27) | 1282 (27) | 659 (28) | 584 (30) |

25 to 34 years | 2474 (34) | 1631 (34) | 821 (35) | 698 (35) |

35 to 44 years | 1124 (16) | 718 (15) | 394 (17) | 323 (16) |

45 to 54 years | 881 (12) | 556 (12) | 303 (13) | 225 (11) |

55 to 64 years | 797 (11) | 553 (12) | 201 (8) | 139 (7) |

Female (n=7221) | 4664 (65) | 2967 (63) | 1626 (68) | 1361 (69) |

Pregnant at BHS | 215 (6) | 60 (3) | 154 (12) | 151 (13) |

Education (n=7183) | ||||

Primary or less | 1799 (25) | 1173 (25) | 565 (24) | 430 (22) |

Junior secondary | 2538 (35) | 1670 (35) | 845 (36) | 706 (36) |

Senior secondary | 1401 (19) | 924 (19) | 475 (20) | 416 (21) |

Higher than senior secondary | 1445 (20) | 947 (20) | 482 (20) | 408 (21) |

Income per month (n=7168) | ||||

None | 3709 (52) | 2396 (51) | 1252 (53) | 1030 (53) |

Less than $ 96 | 1183 (17) | 776 (17) | 386 (16) | 336 (17) |

$ 96 to $ 477 | 1645 (23) | 1074 (23) | 557 (24) | 449 (23) |

More than $ 477 | 631 (9) | 452 (10) | 172 (7) | 144 (7) |

Nights spent outside the community, past year (n=7204) | ||||

0 nights | 3016 (42) | 1904 (40) | 1059 (45) | 870 (44) |

1 to 6 nights | 1565 (22) | 1053 (22) | 489 (21) | 385 (20) |

1 to 2 weeks | 699 (10) | 474 (10) | 214 (9) | 179 (9) |

3 weeks to less than 1 month | 795 (11) | 569 (12) | 223 (9) | 194 (10) |

1 to 3 months | 806 (11) | 529 (11) | 269 (11) | 231 (12) |

More than 4 months | 323 (4) | 201 (4) | 118 (5) | 105 (5) |

Self-reported timing of most recent negative HIV test (n=7072) | ||||

In the last month | 378 (5) | 100 (2) | 277 (12) | |

1 to 5 months ago | 1337 (19) | 649 (14) | 683 (29) | 680 (35) |

6 to 12 months ago | 1978 (28) | 1262 (27) | 712 (30) | 676 (34) |

More than 12 months ago | 3379 (48) | 2595 (56) | 695 (29) | 331 (17) |

Age at first sexual intercourse | ||||

10 to 14 years | 138 (2) | 89 (2) | 49 (2) | 44 (3) |

15 to 17 years | 1673 (27) | 1105 (28) | 548 (26) | 458 (26) |

18 to 21 years | 3559 (58) | 2278 (57) | 1226 (58) | 1025 (58) |

22 years or older | 781 (13) | 490 (12) | 283 (13) | 229 (13) |

Inconsistent condom use, past year | 3603 (64) | 2366 (64) | 1197 (63) | 1010 (63) |

Transactional sex, past year (n=5803) | 342 (6) | 181 (5) | 158 (8) | 135 (8) |

Source: BCPP enrolment data.

Self Reported without documentation.

Documented within 5 years.

Documented within 1.5 years.

Proportions calculated among female participants.

Proportions calculated among persons reporting any lifetime sexual activity.

Proportions calculated among persons reporting one or more sexual partners during the past year.

Evidence of misclassification error on BCPP dataset in months

| Documented times (_{i}) | Self-reported: less than 1 month | Self-reported: 1 to 5 months | Self-reported: 5 to 12 months | Self-reported: more than 12 months |
|---|---|---|---|---|
| | 194 | 56 | 5 | 4 |
| | 75 | 535 | 106 | 39 |
| | 3 | 80 | 487 | 96 |
| | 3 | 9 | 78 | 192 |

Source: BCPP enrolment data.
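The cross-tabulated counts above can be turned into empirical misclassification proportions with a short Python sketch. It assumes that rows index documented categories and columns index self-reported categories; the column totals of 680, 676 and 331 match the self-reported timing counts shown in the baseline table for participants documented within 1.5 years, which supports reading the columns as self-reports. The function name is hypothetical, and these raw proportions are only a descriptive check, not the polytomous logistic model fitted in the paper.

```python
counts = [
    [194, 56, 5, 4],
    [75, 535, 106, 39],
    [3, 80, 487, 96],
    [3, 9, 78, 192],
]

def row_proportions(table):
    """Proportion of each row's total that falls in each column."""
    return [[cell / sum(row) for cell in row] for row in table]

column_totals = [sum(row[j] for row in counts) for j in range(4)]
total = sum(column_totals)
# Share of participants whose self-report falls in the documented category.
agreement = sum(counts[k][k] for k in range(4)) / total
proportions = row_proportions(counts)
for row in proportions:
    print(" ".join(f"{p:.2f}" for p in row))
print(f"overall agreement: {agreement:.3f}")
```

The diagonal accounts for 1,408 of the 1,962 cross-classified participants, so roughly 72 % of self-reports fall in the same category as the documented time.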

Sensitivity analysis results

Retrospective follow-up | Estimate | Weighted sensitivity analysis: Validated only | Weighted sensitivity analysis: Pooled |
---|---|---|---|
1.5 years | Incidence Rate | 1.1322 | 1.7358 |
1.5 years | Standard Error | 0.3304 | 0.1234 |
1.5 years | 95 % lower limit | 0.4980 | 0.1234 |
1.5 years | 95 % upper limit | 1.8110 | 1.9890 |
5 years | Incidence Rate | 0.9974 | 1.6554 |
5 years | Standard Error | 0.2226 | 0.1107 |
5 years | 95 % lower limit | 0.6029 | 1.4460 |
5 years | 95 % upper limit | 1.4535 | 1.8790 |

Source: BCPP enrolment. Standard errors reported for weighted sensitivity analyses are from bootstrap samples with replacement.