A potential difficulty in the analysis of biomarker data occurs when data are subject to a detection limit, often defined as the point below which true values cannot be measured reliably. Multiple regression-type models designed to analyze such data exist. Studies have compared the bias among such models, but few have compared their statistical power. This simulation study compares approaches for analyzing two-group, cross-sectional data with a Gaussian-distributed outcome by exploring the statistical power and effect size confidence interval coverage of four models that can be implemented in standard software. We found that a Tobit model fit by maximum likelihood provides the best power and coverage. An example using HIV-1 RNA data illustrates the inferential differences among these models.

Biomarkers play an important role in the identification, surveillance and treatment of various disorders.^{1} Such measurements can be a challenge to analyze because biomarker outcomes often have non-normal, truncated distributions.^{2} The “censored” data contained in the truncated part of the distribution occur because of limitations of technology to measure the amount of biomarker present in a sample and can result in a sizable amount of data below the detection limit. For instance, Human Immunodeficiency Virus Type 1 (HIV-1) Ribonucleic Acid (RNA) data are subject to a lower assay detection limit. HIV-1 RNA levels are associated with transmission, where lower values greatly reduce the probability of transmission.^{3} Estimates of the percentage of Human Immunodeficiency Virus (HIV) patients in care who have a nondetectable viral load vary; a recent estimate is 29%,^{4} but some samples may have much larger censoring rates (e.g., 74%^{5}), possibly due to the use of older, less-sensitive assays or greater rates of effective treatment or adherence.

We refer to the minimum detectable concentration of a biomarker as the “detection limit”. This is the value below which samples cannot be reliably quantified at the low end of the distribution. Others may split this limit into two components: a limit of detection and a limit of quantification.^{6} A biomarker may also have a maximum concentration at which samples cannot be reliably quantified, sometimes described as the “limit of linearity”.^{7} For this study, we considered a single detection limit, although multiple detection limits can arise in practice; for example, one report^{8} describes a situation where some observations were collected with the Amplicor Standard assay and others with a different assay, and multiple detection limits have likewise been considered for biomarker^{9, 10} and pharmacokinetic^{11} data.

Multivariable statistical modeling techniques have been developed or adapted specifically for outcomes with non-detectable values. One model treats the distribution of the biomarker outcome as a mixture of two distributions: a Bernoulli distribution (to model whether or not the observation lies above the detection limit) and a continuous distribution. These models have been adapted for use with biomarkers^{12–14} and environmental exposures.^{15} Extensions of this approach have been made to a Bayesian framework^{16} and to repeated measures data.^{17} Another approach treats all observations as coming from the same underlying distribution and handles non-detectable values through their probability of falling below the detection limit. This model was proposed by Tobin^{18} using least squares, and Amemiya^{19} provided the theoretical groundwork for finding maximum likelihood estimators (MLEs) using this approach. Recent theoretical research using this framework has focused on models for longitudinal data fit either by the EM algorithm^{20–22} or by directly maximizing a likelihood.^{23, 24} In these models the likelihood function consists of the probability density function (pdf) for detectable values and the cumulative distribution function (cdf) evaluated at the detection limit for non-detectable values.

Despite the availability of these models for data with a detection limit and available resources to assist in their implementation,^{10, 25} investigators frequently choose other models. A popular approach has been to insert a single value for all observations at or below the detection limit.^{10, 20, 23, 24, 26, 27} Deleting all observations below the detection limit can introduce bias^{11, 28} when data are not missing completely at random.^{29} Another common approach in HIV/AIDS research is to dichotomize the distribution of HIV-1 RNA concentration at the detection limit.^{30–34} Dichotomizing has also been identified as a trend across all medical research.^{35} Although creating a binary outcome from a continuous variable results in less power^{36, 37} and may produce wider confidence intervals,^{38} some authors provide justification for dichotomizing continuous variables.^{39, 40} For example, a logistic regression model may be preferred as the question of interest may be to reduce (or raise) an outcome below (or above) some clinically relevant level.

Many studies have compared biases or coverage rates between models that deal with data subject to a detection limit,^{6, 9, 10, 13, 16, 20, 23, 26, 41–43} but few have explored the power of these models. Those that studied power focused on the ability to find a statistically significant association between mean exposure and a regulatory limit^{10, 15, 44} or between treatment groups.^{6, 10, 45} Of these, only Jin et al.^{10} systematically varied the rate at which data fall below the detection limit. However, Jin et al.^{10} focused on longitudinal data and did not consider the logistic regression or Bernoulli-Gaussian mixture model approaches.

This study addresses this lack of research on the influence of the censoring rate on the statistical power and confidence interval coverage of regression-type models designed to relate multiple predictors to an outcome with a detection limit. The motivation behind this study is HIV-1 RNA, which appears to have an underlying Gaussian distribution in multiple HIV-1 RNA reservoirs including plasma,^{46–50} seminal fluid,^{48} cervicovaginal fluid,^{51} nasal secretions,^{51} rectal secretions,^{52} breast milk,^{53} and cerebrospinal fluid,^{46} though in some instances these are heavily truncated by the detection limit. Therefore, our outcome is generated from a Gaussian distribution. Our focus is on power to detect a mean difference between two treatment groups, which we assessed by performing simulations. Power is defined as the ability to detect a significant effect when a true effect exists, and confidence interval coverage as the rate at which the true parameter is included in the simulation confidence interval. We compare linear regression after replacing censored values with the detection limit or half the detection limit, logistic regression after dichotomizing at the detection limit, the Tobin-Amemiya (Tobit) model,^{18, 19} a Bernoulli-Gaussian mixture model,^{12} a Bernoulli-Gaussian mixture using the long-term survivor likelihood,^{13} and the Buckley-James estimator,^{54} a nonparametric regression model. Although the hypothesis of interest may dictate logistic regression as the appropriate model, we view this question as if the study purpose is to find a statistically significant association between an outcome with a limit of detection and a binary explanatory variable. To our knowledge, this study is the first to compare this group of models while varying the censoring rate across a range as wide as 0.1 to 0.9. Comparing the confidence interval coverage with the power should allow us to make more comprehensive recommendations on which models perform best with cross-sectional data for our simulation parameters.

Consider data from n participants, where y_{i} is the outcome subject to a detection limit d and x_{i} records each participant’s group status; hence either x_{i} = 0 or x_{i} = 1. The model contains an intercept β_{0} and the treatment difference β_{1}.

In the single-value imputation model, non-detectable observations are replaced by a single value and the linear model y_{i} = β_{0} + β_{1}x_{i} + ε_{i} is fit, where the errors ε_{i} are assumed to follow a Gaussian distribution N(0, σ^{2}) and be independent. Estimation is via maximum likelihood. We considered both choices of imputed value in these simulations, i.e., half the limit of detection (the half-LOD model) or the limit itself (the at-LOD model). Additionally, it should be noted that, if transformations are performed, the imputed single value should also be transformed.
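The single-value imputation approach is simple enough to sketch directly. Below is a minimal Python illustration (the study itself used SAS and R; the function name, toy data, and detection limit are ours, for illustration only):

```python
import numpy as np

def impute_and_fit(y, x, lod, frac=0.5):
    """Replace non-detectable outcomes (y below the detection limit)
    with frac * lod (frac=0.5 gives the half-LOD model, frac=1.0 the
    at-LOD model), then fit y = b0 + b1*x by least squares."""
    y = np.asarray(y, dtype=float).copy()
    y[y < lod] = frac * lod
    X = np.column_stack([np.ones_like(y), np.asarray(x, dtype=float)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta  # [intercept b0, group difference b1]

# Toy data: control mean 3.0, treatment mean 2.5, detection limit 2.0
rng = np.random.default_rng(1)
x = np.repeat([0.0, 1.0], 200)
y = 3.0 - 0.5 * x + rng.normal(0.0, 1.0, size=400)
b0, b1 = impute_and_fit(y, x, lod=2.0)
```

Note that the imputation pulls both group means toward the imputed value, so the fitted difference is generally a biased estimate of the true difference.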

The second model uses the dichotomized outcome z_{i}, defined as z_{i} = 1 if y_{i} is detectable (y_{i} > d) and z_{i} = 0 otherwise, and fits a logistic regression of z_{i} on x_{i}. Logistic regression estimates can be unstable when the number of events per variable (EPV) is small.^{55} Hence, in an attempt to avoid reporting unstable and potentially biased results but still estimate power, we used an exact logistic regression procedure^{56} when there were 10 or fewer EPV. The rate of exact method use is noted in the
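With a single binary covariate, the logistic-regression slope for detectability reduces to the log of the sample odds ratio, so the dichotomization model can be sketched from the 2×2 table alone. A minimal Python illustration (our own names and toy counts; the exact logistic procedure used in the study is not reproduced here):

```python
import numpy as np

def detectable_odds_ratio(y, x, lod):
    """Dichotomize y at the detection limit; because x is binary, the
    logistic-regression slope equals the log of the sample odds ratio
    of detectability, computed here from the 2x2 table together with
    a Wald 95% confidence interval."""
    z = (np.asarray(y) >= lod).astype(int)   # 1 = detectable
    x = np.asarray(x).astype(int)
    a = np.sum((z == 1) & (x == 1))          # treated, detectable
    b = np.sum((z == 0) & (x == 1))          # treated, non-detectable
    c = np.sum((z == 1) & (x == 0))          # control, detectable
    d = np.sum((z == 0) & (x == 0))          # control, non-detectable
    or_hat = (a * d) / (b * c)
    se = np.sqrt(1/a + 1/b + 1/c + 1/d)      # Wald SE of the log OR
    lo, hi = np.exp(np.log(or_hat) + np.array([-1.96, 1.96]) * se)
    return or_hat, lo, hi

# Deterministic toy table: treated 30 detectable / 20 not,
# control 40 detectable / 10 not -> OR = (30*10)/(20*40) = 0.375
y = np.array([3.0] * 30 + [1.0] * 20 + [3.0] * 40 + [1.0] * 10)
x = np.array([1] * 50 + [0] * 50)
or_hat, lo, hi = detectable_odds_ratio(y, x, lod=2.0)
```

The Wald interval shown here is only adequate when all four cell counts are reasonably large, which is exactly why the study switched to an exact procedure at low EPV.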

For the third model, the likelihood becomes a mixture of the Gaussian pdf and cdf (henceforth, we will refer to this as the Tobin-Amemiya model). Using the same y_{i} as in the previous model, the likelihood function can be defined as

L(β_{0}, β_{1}, σ) = ∏_{y_{i} > d} (1/σ) φ((y_{i} − β_{0} − β_{1}x_{i})/σ) × ∏_{y_{i} ≤ d} Φ((d − β_{0} − β_{1}x_{i})/σ),

where d is the detection limit, φ denotes the Gaussian pdf (with mean β_{0} + β_{1}x_{i}) and Φ the Gaussian cdf evaluated at the detection limit.^{18, 19} This model was fit via PROC NLMIXED in SAS software.^{25}
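Outside of PROC NLMIXED, the same likelihood can be maximized with any general-purpose optimizer. The following Python sketch is our own illustration, not the study’s code; the simulated data are hypothetical, and σ is parameterized on the log scale to keep it positive:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def tobit_negloglik(theta, y, x, lod):
    """Negative log-likelihood of the Tobin-Amemiya (Tobit) model:
    Gaussian pdf for detectable values, Gaussian cdf at the detection
    limit for censored values."""
    b0, b1, log_sigma = theta
    sigma = np.exp(log_sigma)
    mu = b0 + b1 * x
    det = y > lod                              # detectable observations
    ll = norm.logpdf(y[det], mu[det], sigma).sum()
    ll += norm.logcdf((lod - mu[~det]) / sigma).sum()
    return -ll

# Simulate two groups, then censor at a detection limit of 2.0
rng = np.random.default_rng(2)
x = np.repeat([0.0, 1.0], 300)
y = 3.0 - 0.5 * x + rng.normal(0.0, 1.0, 600)
lod = 2.0
y_obs = np.maximum(y, lod)                     # censored values sit at the limit
fit = minimize(tobit_negloglik, x0=[np.mean(y_obs), 0.0, 0.0],
               args=(y_obs, x, lod), method="Nelder-Mead")
b0_hat, b1_hat, sigma_hat = fit.x[0], fit.x[1], np.exp(fit.x[2])
```

Unlike the imputation approach, the censored observations here contribute only the probability of falling below the limit, so the group-difference estimate is approximately unbiased.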

For the Bernoulli-Gaussian mixture model, a Bernoulli likelihood is substituted in place of the Gaussian cdf. To estimate the success probability parameter in the Bernoulli likelihood, a logistic regression model is used. Additionally, the Gaussian pdf is multiplied by the probability that the observation is detectable, p_{i}, obtained from the logistic regression model for subject i. This model was also fit via PROC NLMIXED in SAS software.^{25}
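To make the contrast with the Tobin-Amemiya likelihood concrete, the following Python sketch (our own notation; g0 and g1 are the hypothetical logistic intercept and slope) evaluates the mixture’s negative log-likelihood:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import expit

def bg_mixture_negloglik(theta, y, x, lod):
    """Negative log-likelihood of the Bernoulli-Gaussian mixture:
    a logistic model (g0, g1) for detectability and, for detectable
    values, the Gaussian pdf weighted by the detection probability."""
    b0, b1, log_sigma, g0, g1 = theta
    sigma = np.exp(log_sigma)
    p = expit(g0 + g1 * x)                 # P(detectable | x)
    det = y > lod
    ll = np.log(p[det]).sum() + norm.logpdf(y[det], b0 + b1 * x[det], sigma).sum()
    ll += np.log1p(-p[~det]).sum()         # log(1 - p) for non-detectables
    return -ll

# One detectable and one non-detectable observation, all parameters at
# "neutral" values (mean 3, sigma 1, detection probability 0.5)
val = bg_mixture_negloglik([3.0, 0.0, 0.0, 0.0, 0.0],
                           np.array([3.0, 1.0]), np.array([0.0, 1.0]), 2.0)
```

Here a non-detectable observation contributes only log(1 − p_{i}); the Gaussian component carries no information about how far below the limit it lies, which is the key structural difference from the Gaussian-cdf term of the Tobin-Amemiya model.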

Also, we performed simulations with the following likelihood function:^{13}
This likelihood originates from long-term survivor models in survival analysis.^{57} In our situation, the “long-term survivors” term is considered to be a proportion of non-detectable observations which are in truth detectable. The probability of observation y_{i} being detectable (p_{i}) is defined in the same way as in the Bernoulli-Gaussian likelihood.

Finally, we included a non-parametric regression model for censored data, the Buckley-James estimator.^{54} Full details of the Buckley-James estimator can be found elsewhere.^{54, 58} In short, the estimator iteratively replaces each censored residual with its conditional expectation, computed from the Kaplan-Meier estimate of the residual distribution, and refits least squares until the coefficient estimates converge.

We fit the Buckley-James estimator by calling the bj function in the rms package^{59} from PROC IML. This uses Buckley and James’^{54} variance formula, which Lai and Ying^{58} noted has potential shortcomings.

Our primary goal with these simulations was to assess the probability that a test statistic falls below a level-of-significance threshold given certain parameters, i.e., to find the statistical power. Hence, our main outcome in this study is whether or not the null hypothesis of no association (β_{1} = 0) can be rejected; in other words, whether the p value for the treatment effect is less than 0.05.

We also wished to measure the 95% confidence interval coverage of these models. In order to measure all effects similarly, we based the coverage on effect sizes. For effects derived from a logistic likelihood function, we converted odds ratios to effect sizes by using Chinn’s^{60} conversion, where the effect size equals ln(odds ratio) multiplied by √3/π.^{61} The expected standard errors of the true effect size were calculated from the standard large-sample expression for the standard error of a standardized mean difference.
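As a concrete illustration, Chinn’s conversion and a large-sample standard error of the kind described above can be computed directly. In this Python sketch the SE formula is the usual Hedges-Olkin form, which we assume for illustration since the paper’s exact expression is not reproduced in this excerpt:

```python
import math

def chinn_effect_size(odds_ratio):
    """Chinn's conversion from an odds ratio to a standardized
    effect size: d = ln(OR) * sqrt(3) / pi."""
    return math.log(odds_ratio) * math.sqrt(3.0) / math.pi

def effect_size_se(d, n1, n2):
    """Large-sample standard error of a standardized mean difference
    (the usual Hedges-Olkin form; an assumption here, not necessarily
    the paper's exact expression)."""
    return math.sqrt((n1 + n2) / (n1 * n2) + d * d / (2.0 * (n1 + n2)))

d = chinn_effect_size(0.5)     # halving the odds of detectability
se = effect_size_se(d, 64, 64)
```

An odds ratio of 1 maps to an effect size of 0, so the conversion preserves the null hypothesis across the logistic and Gaussian scales.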

Finally, we explored the bias in the estimated point estimates, standard errors of the difference of means, and effect sizes. Bias was computed by subtracting the median value across all simulations from the true effect. These results are included in the

We chose to vary the difference between the groups (β_{1}) from 0.00 to −1.00 in increments of 0.25. On the log_{10} scale these differences correspond to a difference in means one quarter as large as the standard deviation (or, in terms of HIV-1 RNA, a decrease from 1000 to 562.3 copies/mL), up to a difference as large as the standard deviation (a decrease from 1000 to 100 copies/mL).

For each parameter set, we chose a target power of 0.80, with per-arm sample sizes computed using Lachenbruch’s^{62} formula so that the nominal power is constant across all censoring rates. When β_{1} = 0, we fixed the sample size at 300 per arm. We also computed the R^{2} of each simulated data set prior to censoring to demonstrate the fit of the model with fully observed data. Additionally, we performed simulations with a fixed sample size of 50 for all parameter sets.

Rates of censoring varied from 10% to 90% in increments of 20 percentage points, a range consistent with censoring rates reported for HIV-1 RNA data.^{46–53}

We excluded models with unstable estimates, namely those that failed to reach convergence, possessed a non-positive-definite Hessian matrix, or had any parameter with a large standard error (defined as five or greater). For the Tobin-Amemiya and Bernoulli-Gaussian models, estimates from the linear regression with half the limit of detection and from the logistic regression, respectively, were used as starting values.

Simulated data were generated from the model y_{i} = β_{0} + β_{1}x_{i} + ε_{i} for each treatment group, with independent standard Gaussian errors.^{61}
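The simulation design described above can be sketched end to end. The following Python sketch is our own simplification: it uses half-LOD imputation with a t-test rather than the full set of models, and places the detection limit at a chosen quantile of the control distribution; it illustrates how power is estimated by Monte Carlo:

```python
import numpy as np
from scipy.stats import norm, ttest_ind

def simulate_power(beta1, n_per_arm, censor_rate, n_sims=400, seed=0):
    """Monte Carlo power sketch: generate y = b0 + beta1*x + N(0,1),
    place the detection limit at the censor_rate quantile of the
    control distribution, impute half the limit for non-detectables,
    and t-test the group means. (The study's models are richer; this
    only illustrates the simulation skeleton.)"""
    rng = np.random.default_rng(seed)
    b0 = 3.0
    lod = b0 + norm.ppf(censor_rate)          # control-arm censoring rate
    rejections = 0
    for _ in range(n_sims):
        x = np.repeat([0.0, 1.0], n_per_arm)
        y = b0 + beta1 * x + rng.normal(0.0, 1.0, 2 * n_per_arm)
        y = np.where(y < lod, lod / 2.0, y)   # half-LOD imputation
        _, p = ttest_ind(y[x == 0], y[x == 1])
        rejections += p < 0.05
    return rejections / n_sims

# n = 64 per arm at 10% censoring was sized for roughly 80% power
power = simulate_power(beta1=-0.5, n_per_arm=64, censor_rate=0.1)
```

With β_{1} = 0 the same function estimates the empirical type I error rate, which should stay near the nominal 0.05.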

Simulations were performed in SAS software, version 9.2^{63} using the MVN macro^{64} and in R version 2.15.2.^{65} Figures were created using the ggplot2 package.^{66}

For brevity, the true difference in means between the outcomes of the two groups will be reported as the parameter β_{1}. The proportion of outcome variance (R^{2}) accounted for in the simple linear model prior to censoring is summarized with the simulation parameters.

The highest numbers of simulations that failed to converge were observed for treatment effects further from zero and at lower rates of censoring.

Proportions of null hypothesis rejection for each model are displayed by censoring rate and treatment difference.

Confidence interval coverage was close to 95% for the Tobin-Amemiya model, but lower for the Gaussian component of the Bernoulli-Gaussian model and the Gaussian component of the long-term survivor likelihood.

When the sample size is fixed at 50, the differences between models are largely similar to those observed with sample sizes providing 80% power, but the differences are not as pronounced. Especially at higher rates of censoring, the differences are almost negligible. Lines for logistic regression possessed a concave shape, which is consistent with sample size tables when the outcome rate is varied (e.g., Hsieh^{67}).

Results from the estimates of bias largely mirror the results shown in the power and coverage proportions.

These data come from a project to assess the effect of raltegravir, a novel HIV medication, using data from the HIV Outpatient Study (HOPS) during the years 2007 to 2010.^{68} The HOPS is a prospective, observational cohort of HIV-positive participants enrolled from nine medical facilities in the United States. In Buchacz et al.,^{68} the baseline dataset includes data on each patient’s medical history and current treatment at his or her initiation of raltegravir treatment or first drug regimen change post 1/1/2007. For this example, we utilize the baseline data and explore the association between participants’ HIV-1 RNA values at baseline and whether or not the patient was being prescribed raltegravir.

Our example is interesting primarily because it contradicts our simulation results. Summary statistics for patients’ log_{10} HIV-1 RNA levels show differing distributions by group. The non-raltegravir group has a higher mean and median when considering only the detectable values. When all participants are included, however, the differences between the groups are much smaller (under both imputation models).

As shown in our example, the model chosen can affect whether the null hypothesis of no treatment effect is rejected. The simulation results indicate that, for sample sizes with 80% power and across all censoring rates, the Tobin-Amemiya model produced an equivalent or better combination of power and effect size confidence interval coverage compared with the other models simulated. When considering the bias in point and standard error estimates, our results are consistent with those of Jin et al.^{10} and, despite analyzing longitudinal data, the crossover trial simulations of Karon et al.^{6} These results have some similarities to Beal,^{11} though they are more supportive of using the Tobin-Amemiya model. Beal felt the Gaussian cdf-pdf model was preferable, but argued that models which omit all points below the LOD have value. His reasoning was that the Tobin-Amemiya model may be troublesome to implement and the shortcomings of simple models may not be large, especially at small censoring rates. We agree that, when the censoring rate is small, a model which omits observations below the LOD will perform similarly, but the implementation difficulties are minimal with currently available software. Computation problems with the Tobin-Amemiya model only arose when a small percentage of points were detectable.

The Bernoulli-Gaussian mixture model performed well, but generally had power roughly 5–10% lower than the Tobin-Amemiya model. At high levels of censoring, the difference between the two models decreased, but the latter was still superior. We experienced the most convergence problems with the Bernoulli-Gaussian model at both high and low levels of censoring. Hence, the concave shape of the power function may be partially due to only certain simulated datasets converging, which introduced bias into the results. Even in situations with good convergence, the Bernoulli-Gaussian model produced poorer confidence interval coverage for the Gaussian effect. Our simulations suggest that this model may be fully reliable only in situations when 0.3 ≤

For the long-term survivor model, rates of null hypothesis rejection hovered around 0.15 for all censoring rates, similar to the rates seen in Karon et al.^{6} for the same likelihood. Hence, this likelihood appears oversensitive, and we do not feel it can be recommended for analyzing a cross-sectional, Gaussian outcome with a detection limit.

The Buckley-James estimator exhibited good confidence interval coverage but poor power, especially at higher rates of censoring. We are unsure why its power was so much lower, and further study is warranted.

Surprisingly, the power of the at-LOD and half-LOD models was not appreciably greater than that of other models included in these simulations. Multiple studies have shown inserting a single value below the detection limit produces biased estimates,^{10, 20, 23, 24, 26} especially standard error estimates that are too low. Hence, a priori, we felt the bias in the standard errors would drive power upward beyond 80%. Since the power was approximately 80% and the effect size confidence interval coverage was good (except at high censoring rates), these biases appear to counteract one another, consistent with Jin et al.^{10} Although these counteracting biases gave the half-LOD model similar power to other models, inferences from such a model may be unwise since point and variance estimates are biased. Hence, it does not appear wise to use the single-value imputation models except in situations where the censoring level is low enough to make minimal difference in the results.

The same overall conclusion is true for logistic regression. Dichotomizing at the detection limit resulted in less power than most of the other models included in these simulations when

At

When the sample size is fixed at 50, the differences between models are not as pronounced. The Tobin-Amemiya model again appears to be the best choice, but the differences are almost negligible, especially at higher rates of censoring.

For the fixed sample size simulations, the most striking aspect is the observed loss of power at higher rates of censoring. The loss of power can be substantial when dichotomizing a continuous outcome.^{36} This seems reasonable; as the amount of information decreases, there is less ability to find a difference. None of these models were able to compensate for this loss of information. Data with an outcome possessing a high level of censoring will have significantly reduced power to find a difference regardless of the model. This is important to recognize for study planning as failure to account for the detection limit may leave a study underpowered. Projecting the likely rate of non-detectable values for a study outcome may be challenging, but this may be crucial to achieve a desired level of power. Although Lachenbruch’s^{62} formula does not have the same null hypothesis we used in these simulations, it performed well at producing sample sizes with the appropriate levels of power.

Any simulation study is limited by a number of conditions. The linear predictors in these simulations included an intercept and one predictor. In most settings, multiple covariates will be used in models, meaning our models may not adequately represent settings where confounding is present. Regarding the simulation construction, one may expect the decreased performance of the Bernoulli-Gaussian mixture model compared to the Tobin-Amemiya model because the data generating mechanism more closely follows the Tobin-Amemiya model. Our focus was on models to analyze log-transformed measurements of HIV-1 RNA, which typically follow a Gaussian distribution. Hence, the Tobin-Amemiya model matches these assumptions better than any of the other models used in these simulations. Data arising from another distribution, such as a point mass-Gaussian mixture, may produce different conclusions; in particular, a Tobin-Amemiya model may be sub-optimal. We also did not consider the consequences of analyzing an outcome from a non-Gaussian distribution. Another potential limitation was the choice to include the treatment parameter in both the logistic and Gaussian components of the mixture. Determination of statistical significance was made with a likelihood-ratio test; other approaches to defining the statistical significance of the treatment parameter estimate may give different results. Finally, in many applications, more than one independent variable will be included in a model. Some models may perform better than others with informative covariates, and the differences among models observed in these simulations might not be the same with additional covariates.

Extensions of this research could look at similar models using simulated data with a different underlying distribution. For instance, data from a semicontinuous or zero-inflated distribution would match the assumptions of the Bernoulli-Gaussian model better which may result in increased performance. Hurdle models^{69} may also be appropriate or models using a truncated Gaussian as the continuous distribution. As increasingly sensitive assays are developed and used, the likelihood of observing a Bernoulli-Gaussian distribution grows. Hence, a Bernoulli-Gaussian distribution may become the standard in HIV research soon. Any of these situations could be tested under a longitudinal setting as well. We were able to approximate the nominal level of power in the Tobin-Amemiya model with a published formula.^{62} Other models are available which may provide better estimates, but are more complicated to implement.^{70, 71} Covariates informative of the censoring could be added to the models to determine if a difference in performance is achieved. Finally, although we chose not to include any models which used multiple imputation (MI)^{72} to impute data below the detection limit, these models have been used in similar settings. Comparisons between several MI implementations for semicontinous data,^{73} comparisons to mixed models,^{27} and of a MI approach using bootstrapping^{41–43} have been performed.

The Tobin-Amemiya model provided power and effect size coverage at or above that of the other models included in this simulation study, along with appropriate test size. Hence, the Tobin-Amemiya model appeared to be the best choice for analyzing an outcome subject to a detection limit. At very low and very high censoring rates, all of these models could provide poor power and may fail to converge. Researchers and analysts must be cautious when analyzing data with a high proportion of observations below a detection limit: the reduction in power is great with any of the models we evaluated, and little utility may exist in analyzing such data. Exact logistic regression may be advisable with high or low censoring rates and large treatment differences.

The authors thank Kate Buchacz and the investigators of the HIV Outpatient Study (HOPS) for allowing use of data from the HOPS and Timothy A. Green and Lillian S. Lin for their helpful reviews. The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Conflict of Interest Statement

The authors declare that there is no conflict of interest.

Null hypothesis rejection proportion (power), broken down by method, difference between treatment and control means, and sample size providing 80% power (solid lines; sample sizes for the row are listed along the bottom of plots in the fourth column) and a fixed sample size of 50 (dashed lines), 10,000 simulations, 5% level of significance. Dot−dash lines indicate nominal power of 0.80, except when there is no difference in means (row 1) where the nominal power is equal to the type I error rate of 0.05. Grey shading denotes 95% confidence interval.

Confidence interval coverage of true effect size, broken down by method (and component if applicable), difference between treatment and control means, and sample size providing 80% power (solid lines; sample sizes for the row are listed along the bottom of plots in the fifth column) and a fixed sample size of 50 (dashed lines), 10,000 simulations, 5% level of significance. Dot−dash lines indicate expected confidence interval coverage of 0.95. Grey shading denotes 95% confidence interval. In method titles, B=Bernoulli component and G=Gaussian component.

Summary of simulation parameters; all parameter sets use an error standard deviation of one with 10,000 simulations.

Mean difference between groups (β_{1}) | Percentage censored | N per arm^{a} | R^{2}: Median (IQR)^{b} |
---|---|---|---|
0.00 | 10 | 300 | 0.0008 (0.0002, 0.0022) |
0.00 | 30 | 300 | 0.00 (0.00, 0.00) |
0.00 | 50 | 300 | 0.00 (0.00, 0.00) |
0.00 | 70 | 300 | 0.00 (0.00, 0.00) |
0.00 | 90 | 300 | 0.00 (0.00, 0.00) |
−0.25 | 10 | 250 | 0.02 (0.01, 0.02) |
−0.25 | 30 | 244 | 0.02 (0.01, 0.02) |
−0.25 | 50 | 274 | 0.02 (0.01, 0.02) |
−0.25 | 70 | 357 | 0.02 (0.01, 0.02) |
−0.25 | 90 | 717 | 0.02 (0.01, 0.02) |
−0.50 | 10 | 64 | 0.06 (0.04, 0.09) |
−0.50 | 30 | 62 | 0.06 (0.04, 0.09) |
−0.50 | 50 | 70 | 0.06 (0.04, 0.09) |
−0.50 | 70 | 93 | 0.06 (0.04, 0.08) |
−0.50 | 90 | 195 | 0.06 (0.04, 0.08) |
−0.75 | 10 | 30 | 0.13 (0.08, 0.18) |
−0.75 | 30 | 29 | 0.13 (0.08, 0.19) |
−0.75 | 50 | 33 | 0.13 (0.08, 0.18) |
−0.75 | 70 | 44 | 0.13 (0.09, 0.17) |
−0.75 | 90 | 100 | 0.13 (0.10, 0.15) |
−1.00 | 10 | 17 | 0.21 (0.14, 0.30) |
−1.00 | 30 | 17 | 0.21 (0.14, 0.30) |
−1.00 | 50 | 20 | 0.21 (0.14, 0.29) |
−1.00 | 70 | 27 | 0.21 (0.15, 0.27) |
−1.00 | 90 | 68 | 0.20 (0.17, 0.24) |

Sample sizes for each value of β_{1} were calculated using Lachenbruch’s^{62} formula. When β_{1} = 0, the sample size was fixed at 300 per arm.

R^{2} calculated by using simulated data prior to censoring with the standard R^{2} formula of simple linear regression.

Number of simulations out of 10,000 that failed to satisfy convergence criteria, possessed a non-positive-definite Hessian matrix, or had a parameter with a large standard error (five or greater) for simulations using Lachenbruch’s^{62} sample size formula.

Mean difference between groups (β_{1}) | Percentage censored | Gaussian pdf-cdf | Bernoulli-Gaussian | Long-term survivor | Buckley-James |
---|---|---|---|---|---|
0.00 | 10 | | | | |
0.00 | 30 | | | | |
0.00 | 50 | 46 | | | |
0.00 | 70 | 84 | | | |
0.00 | 90 | 1 | 270 | | |
−0.25 | 10 | 2 | | | |
−0.25 | 30 | | | | |
−0.25 | 50 | 35 | | | |
−0.25 | 70 | 1 | 128 | | |
−0.25 | 90 | 15 | | | |
−0.50 | 10 | 138 | 152 | | |
−0.50 | 30 | | | | |
−0.50 | 50 | 1 | 546 | | |
−0.50 | 70 | 2 | 592 | | |
−0.50 | 90 | 27 | 293 | | |
−0.75 | 10 | 2167 | 2253 | | |
−0.75 | 30 | 9 | 117 | | |
−0.75 | 50 | 97 | 918 | | |
−0.75 | 70 | 261 | 1555 | | |
−0.75 | 90 | 99 | 99 | 848 | 929 |
−1.00 | 10 | 5986 | 6906 | | |
−1.00 | 30 | 536 | 1317 | | |
−1.00 | 50 | 654 | 1330 | | |
−1.00 | 70 | 39 | 39 | 1389 | 2086 |
−1.00 | 90 | 1257 | 1257 | 3302 | 2164 |

Summary statistics for log base 10 HIV-1 RNA by raltegravir use in the HIV Outpatient Study (HOPS).

Group | N | Mean (SE) | Median (IQR) |
---|---|---|---|
Detectable values only | | | |
Non-raltegravir | 422^{1, 2} | 4.03 (0.05) | 4.34 (1.64) |
All data, half the limit of detection imputed for nondetects | | | |
Non-raltegravir | 896 | 2.35 (0.06) | 0.85 (3.43) |
All data, zero imputed for nondetects | | | |
Non-raltegravir | 896 | 2.80 (0.05) | 1.70 (2.58) |

Model estimates for log base 10 HIV-1 RNA, baseline data of raltegravir study, the HIV Outpatient Study (HOPS). Root mean squared error (RMSE) is an estimate of the residual standard deviation.

Parameter | Estimate or odds ratio (95% CI) | p value |
---|---|---|
At-LOD imputation | | |
Gaussian | −0.15 (−0.28, −0.01) | 0.03 |
Half-LOD imputation | | |
Gaussian | −0.14 (−0.31, 0.03) | 0.11 |
Logistic regression | | |
Logistic | 1.04 (0.85, 1.26) | 0.73 |
Tobin-Amemiya | | |
Gaussian | −0.16 (−0.44, 0.11) | 0.25 |
Bernoulli-Gaussian | | |
Logistic | 1.04 (0.85, 1.26) | 0.73 |
Long-term survivor | | |
Logistic | 1.05 (0.86, 1.29) | 0.62 |
Buckley-James | | |
Gaussian | −0.28 (−0.45, −0.10) | 0.002 |