Adaptive designs are gaining popularity in early phase clinical trials because they enable investigators to change the course of a study in response to accumulating data. We propose a novel design to simultaneously monitor several endpoints, including efficacy, futility, toxicity and other outcomes, in early phase, single-arm studies. We construct a recursive relationship to compute the exact probabilities of stopping for any combination of endpoints without the need for simulation, given prespecified decision rules. The proposed design is flexible in the number and timing of interim analyses. An R Shiny app with a user-friendly web interface has been created to facilitate the implementation of the proposed design.
Clinical studies to evaluate the toxicity and efficacy of a novel treatment are normally conducted in separate phases. Conventionally, Phase I trials are first-in-human studies aimed at identifying the maximum tolerated dose (MTD) of an experimental agent. In a subsequent Phase II trial, the candidate drug will be evaluated at the MTD to determine whether it has sufficient therapeutic activity to warrant further testing. The majority of Phase II clinical studies are designed as single-arm trials without a control group and are commonly conducted following a two-stage process.^{1}
The sample sizes of most Phase I trials are too small to allow accurate identification of the MTD, so patients in Phase II trials might be exposed to subtherapeutic doses or to overly toxic doses resulting in excessive serious adverse events (SAEs). To overcome these issues, Phase I dose-expansion cohorts (DECs) are now frequently used to assess preliminary efficacy and to further characterize toxicity. In the development of immune checkpoint blockade, for example, the use of DECs serves to generate a continuum of the drug development process, blurring the distinctions between Phase I dose finding, Phase II proof of concept and Phase III comparative efficacy trials.^{2,3} Despite their popularity, the majority of DECs are designed without sample size justification or treated as single-arm Phase II trials without statistically accounting for toxicities.^{4} Boonstra et al. state: “Regulatory agencies and others have expressed concern about the uncritical use of dose expansion cohorts (DECs) in Phase I oncology trials.”^{5}
Consider a single-arm study to evaluate the efficacy of a novel treatment. We are interested in testing if the response rate to this new treatment is higher than the response rate in historical controls. Hereafter, we refer to this study as our prototype. Based on the commonly used Simon two-stage design, our prototype trial might be conducted in two stages, with the option to terminate accrual after the first stage if the number of responses is below the futility bound. Various extensions to the Simon two-stage design have been developed.^{6–11} Notably, Mander and Thompson investigated designs optimal under the alternative hypothesis and constructed two-stage designs with the flexibility to stop early for efficacy, which leads to substantial reductions in expected sample size.^{8} Bryant and Day proposed a two-stage, two-endpoint design by integrating safety considerations into the early stopping rule of Phase II trials.^{7} Based on curtailed sampling procedures, Chen and Chi introduced two-stage designs with two dependent binary endpoints and demonstrated the effectiveness of their method in reducing the expected sample size when the treatment lacks efficacy or is too toxic.^{9} Li et al. proposed new two-stage designs including provisions for early termination due to sufficient effectiveness and safety, ineffectiveness and toxicity.^{10}
When the outcome of interest can be quickly observed, continuous monitoring designs are particularly useful. Using simulated annealing, Chen and Lee constructed optimal stopping boundaries for the continuous monitoring of futility.^{12} Law et al. proposed curtailed designs allowing a trial to be terminated early for efficacy or futility after evaluating the response of every patient.^{13} Ivanova et al. constructed a Pocock-type boundary to continuously monitor toxicity in Phase II clinical trials.^{14} The sequential probability ratio test (SPRT) is not appropriate in this context because of its open-ended nature.^{15} The majority of continuous monitoring designs are proposed in a Bayesian framework based on the posterior or the predictive probability of exceeding the historical response rate.^{16–18} A number of Bayesian designs have been proposed to consider both toxicity and response to treatment in the Phase II setting based on the Dirichlet-multinomial model.^{19–22} Sambucini^{23} developed predictive probability rules for monitoring bivariate binary outcomes; Teramukai et al.^{24} proposed a Bayesian adaptive design allowing futility stopping with continuous safety monitoring; Zhong et al.^{25} applied a copula model to describe the joint probability of efficacy and toxicity. These Bayesian designs rely on extensive simulations to search for the optimal design parameters and to determine trial operating characteristics. Implementing these methods is also challenging because statisticians need to be informed in real time when new data become available. Complicated computations are often required to provide updated estimates of efficacy or toxicity, and user-friendly software for these methods is often lacking.
Schultz et al. proposed a recursive formula to calculate the exact probability of accepting or rejecting a null hypothesis for multiple-stage designs with a binary endpoint,^{26} and similar approaches have been adopted by some of the previously cited continuous monitoring designs.^{12,13} In this work, we extend the univariate recursive relationship to calculate the exact probabilities of making go/no-go decisions in single-arm clinical trials with multiple binary-valued endpoints. Based on the extended recursive relationship, we propose the unified exact design, which provides a unified statistical framework for making futility, efficacy and/or toxicity stopping decisions in early phase clinical trials with a multiple-stage or continuous monitoring design. The proposed design provides transparent decision rules and is easy to implement because the cutoff values for stopping for toxicity, efficacy, or futility are spelled out a priori, without the need to call upon statistical support mid-trial. If none of these stopping criteria is met, we continue to enroll patients until a maximum sample size is reached. Unlike the previously cited Bayesian methods, which rely on simulation to determine the operating characteristics of their designs, we can calculate the exact frequentist probability of continuing or stopping a trial for any combination of these causes.
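To make the univariate recursion concrete, the following Python sketch propagates the probability of each response count among trials that have not yet stopped, and accumulates the probability of stopping at each look. It is an illustration under stated assumptions rather than the authors' implementation: the function name and arguments are hypothetical, futility is taken to mean "responses strictly below the futility bound" and efficacy "responses strictly above the efficacy bound", and the bounds in the usage lines are those tabulated for the prototype (looks at 5, 10, 15 and 20 patients).

```python
from math import comb


def exact_stopping_probs(p, looks, futility, efficacy):
    """Cumulative stopping probabilities for one binary endpoint.

    p        : true response probability
    looks    : increasing interim sample sizes
    futility : futility bound at each look (stop if responses < bound)
    efficacy : efficacy bound at each look (stop if responses > bound)
    Returns a list of (n, cumulative P(futility), cumulative P(efficacy)).
    """
    # alive[x] = P(x responses so far AND the trial has not stopped yet)
    alive = {0: 1.0}
    cum_fut = cum_eff = 0.0
    enrolled = 0
    results = []
    for n, f, e in zip(looks, futility, efficacy):
        m = n - enrolled  # patients added since the previous look
        step = [comb(m, j) * p**j * (1 - p) ** (m - j) for j in range(m + 1)]
        # Recursive step: convolve the surviving paths with the new cohort.
        updated = {}
        for x, prob in alive.items():
            for j, pj in enumerate(step):
                updated[x + j] = updated.get(x + j, 0.0) + prob * pj
        # Apply the decision rule at this look.
        alive = {}
        for x, prob in updated.items():
            if x > e:
                cum_eff += prob
            elif x < f:
                cum_fut += prob
            else:
                alive[x] = prob
        enrolled = n
        results.append((n, cum_fut, cum_eff))
    return results


# Prototype bounds as tabulated in this paper, under a 10% response rate.
for n, cpf, cpe in exact_stopping_probs(
    0.10, looks=[5, 10, 15, 20], futility=[0, 1, 3, 4], efficacy=[2, 2, 3, 4]
):
    print(f"n={n:2d}  CPF={cpf:.4f}  CPE={cpe:.4f}")
```

Under these assumptions the sketch agrees with the tabulated prototype values at the early looks, for example a cumulative futility probability of about 0.3487 and a cumulative efficacy probability of about 0.0702 at n = 10 when the response rate is 10%. Because the probabilities are computed by exact enumeration, no simulation is involved.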
The remainder of this paper is organized as follows. We introduce the recursive relationship in
We denote by
After enrolling and evaluating the first
According to Schultz et al.,^{26} the recursive relationship relating
The marginal probability of continuing the trial after observing the
Consider a study design with prespecified stopping rules for efficacy (
After enrolling and evaluating
if
if
otherwise, continue to enroll the
After evaluating all the
if
if
Denote
Let
Denote
Based on the recursive relationship, we can calculate the interim conditional power of the study, which is the probability of ever stopping for efficacy after observing
We suggest choosing stopping bounds
Let us give an example of setting efficacy and futility monitoring bounds based on our prototype. Consider
The cumulative probabilities of stopping for efficacy and futility using these bounds are shown in
Early stopping for efficacy is not always preferred in early phase clinical trials, so the proposed design provides the option to ignore early efficacy stopping by specifying
In addition to the binary indicator for response, let us assume each patient has a binary valued indication of a SAE. Let
Toxicity and efficacy are not generally independent.^{27} In immuno-oncology trials, for example, we can expect to see the patient developing a rash as a sign the treatment is taking effect. Define the efficacy-toxicity odds ratio as
Define
Let
Intuitively the sequences
After observing the outcomes of each patient, we can take one of four actions: stop for toxicity, stop for efficacy, stop for futility, or continue to accrue based on observed data up to the maximum sample size
After
if
if
if
After all the
if
if
if
Given the prespecified decision rules, the conditional probability of stopping the trial for toxicity after
Likewise, the conditional probability of stopping for efficacy after
Bryant and Day considered different approaches to determine the design parameters for the two-stage, two-endpoint design and recommended the approach assuming independence (
Let
The efficacy and futility stopping bounds (
Given a specific response-toxicity odds ratio
The statistical power of a study is given by
Note the false alarm rate
Yin et al. developed two-stage and multiple-stage designs by identifying the parameter values at which the maximum type I error rate and the minimum power are achieved when safety and response rate are jointly tested.^{11} In practice, implementation of this approach is difficult due to the intensive computations required to find the stopping boundaries. As detailed in this section, the approach proposed here sacrifices statistical sophistication for practical advantage such that the joint monitoring of response and safety can be implemented in studies with limited sample sizes while reducing the computational complexity in boundary identification.
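The joint monitoring of response and safety described in this section can be sketched in Python as a recursion over the pair (number of responses, number of SAEs). This is a minimal illustration, not the authors' code. The function name, the cell-probability argument and the precedence of the rules (toxicity checked first, then efficacy, then futility, following the order in which the actions are listed above) are assumptions, as is the independence of response and toxicity in the usage lines. The bounds are those tabulated for the prototype.

```python
def joint_stopping_probs(cells, looks, futility, efficacy, toxicity):
    """Cumulative stopping probabilities with response and SAE monitored jointly.

    cells = (p00, p01, p10, p11): per-patient probabilities, where the first
    index is response (0/1) and the second is SAE (0/1).
    At each look (assumed precedence): stop for toxicity if SAEs > t,
    otherwise for efficacy if responses > e, otherwise for futility if
    responses < f.
    Returns a list of (n, CPF, CPE, CPT).
    """
    p00, p01, p10, p11 = cells
    outcomes = ((0, 0, p00), (0, 1, p01), (1, 0, p10), (1, 1, p11))
    # alive[(r, s)] = P(r responses, s SAEs, trial not yet stopped)
    alive = {(0, 0): 1.0}
    cpf = cpe = cpt = 0.0
    enrolled = 0
    results = []
    for n, f, e, t in zip(looks, futility, efficacy, toxicity):
        # Recursive step: add one patient at a time until the next look.
        for _ in range(n - enrolled):
            updated = {}
            for (r, s), prob in alive.items():
                for dr, ds, q in outcomes:
                    key = (r + dr, s + ds)
                    updated[key] = updated.get(key, 0.0) + prob * q
            alive = updated
        enrolled = n
        # Apply the decision rules to every surviving path.
        kept = {}
        for (r, s), prob in alive.items():
            if s > t:
                cpt += prob
            elif r > e:
                cpe += prob
            elif r < f:
                cpf += prob
            else:
                kept[(r, s)] = prob
        alive = kept
        results.append((n, cpf, cpe, cpt))
    return results


# Null efficacy (10% response), acceptable toxicity (10% SAE), independence.
p_resp, p_tox = 0.10, 0.10
cells = ((1 - p_resp) * (1 - p_tox), (1 - p_resp) * p_tox,
         p_resp * (1 - p_tox), p_resp * p_tox)
for n, cpf, cpe, cpt in joint_stopping_probs(
    cells, [5, 10, 15, 20], [0, 1, 3, 4], [2, 2, 3, 4], [2, 2, 3, 4]
):
    print(f"n={n:2d}  CPF={cpf:.4f}  CPE={cpe:.4f}  CPT={cpt:.4f}")
```

Because independence is assumed here, the efficacy and futility columns need not match the tabulated prototype values, which may rest on a different association between response and toxicity. Passing cell probabilities derived from a chosen odds ratio would cover the dependent case without changing the recursion.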
The joint monitoring of multiple endpoints is of special interest in certain disease areas. Monitoring multiple endpoints allows clinical trialists to comprehensively capture complex efficacy or toxicity profiles when a single binary endpoint is not adequate to account for the complexities of outcomes. Consider a clinical trial of prostate cancer with multiple binary endpoints including objective response per RECIST (OR), SAE, and reduction in prostate-specific antigen (PSA). PSA is a widely used biomarker of tumor burden for patients with prostate cancer. PSA response (PR) is defined as at least 50% reduction in the level of PSA and is known to be a good predictor of overall survival.^{29} We will use this hypothetical prostate cancer trial to illustrate how to jointly monitor multiple endpoints.
Denote
Denote
The conditional probability of stopping for safety after
The conditional probability of stopping for efficacy or futility after
Similarly, we can calculate the conditional probability of stopping for PSA responses when its upper (
The expansion from (
We would like to test the following hypothesis on PR
The stopping boundaries for PR are determined in a similar manner as the stopping boundaries for response rate. Specifically, we find the suitable upper (
In this section, we retrospectively apply the unified exact design to a Phase II study of Olaparib proposed at our institution. This study aims at evaluating the response rate to Olaparib in patients with acute myeloid leukemia carrying isocitrate dehydrogenase mutations. The MTD has been determined in a Phase I trial. Previous experience suggests the response rate for patients with these tumors is approximately 10%. We are interested in testing
The proposed design has the flexibility to include efficacy stopping. Consider a design with interim analyses planned after the first 5, 10, 15, and 20 patients have been accrued and evaluated. The corresponding efficacy bound is
Consider the continuous monitoring of efficacy and futility after enrolling five patients, and the continuous monitoring of toxicity outcomes for all the patients, with a maximum sample size of 19. To implement a unified exact design for this trial, we specify
The efficacy and futility cutoff values for this design are labeled in
This trial can be stopped as early as after two patients are accrued, if the first two patients both develop SAEs. We can also choose to stop this trial if more than two of the first five patients respond to the treatment. We continue to accrue if the number of SAEs is below the toxicity bound, and the number of responses is between the efficacy and the futility bound. If the trial reaches its planned maximum sample size of 19 patients, we conclude the treatment is promising if at least five patients have responded and at most four patients have SAEs.
The probabilities of early stopping using continuous monitoring under different scenarios are shown in
In this paper, we proposed the unified exact design to simultaneously monitor the efficacy, futility and toxicity outcomes of a single-arm clinical trial. We developed a recursive relationship to calculate the exact probabilities of stopping for any combination of these outcomes. Compared to the Simon two-stage design, which only allows one futility check and has no provisions for safety monitoring, the unified exact design provides the flexibility to stop the trial early for all the necessary causes. Although we choose stopping bounds based on the tail areas of a binomial distribution in this paper, Bayesian methods can also be used to set the stopping bounds. It is beyond the scope of this paper to discuss other types of stopping bounds; however, it should be noted that the proposed recursive method is valid as long as the stopping bounds are nondecreasing functions of sample size. Our future work will focus on computationally efficient methods to optimize the stopping bounds.
When results from previous studies are available, we can specify the joint probability of efficacy, futility and toxicity based on historical data. Given a correlation structure, it is possible to find stopping bounds producing smaller expected sample sizes compared to the stopping bounds found assuming independence between response and toxicity. However, these data are not always available, and the performance of the unified exact design can be negatively affected if the toxicity-efficacy relationship is misspecified. To deal with this problem, an adaptive two-stage design similar to that of Zang and Lee^{31} can be used to learn the efficacy–toxicity relationship in the first stage, assuming independence of these outcomes, and to jointly model efficacy and toxicity outcomes in the next stage.
The unified exact design allows the stopping boundaries to be completely specified prior to the start of a trial, removing the need for complex computations in the midst of a trial. All possible decision rules can be tabulated, which helps to convey the statistical design to trial practitioners. Continuous monitoring gives clinical trialists the advantage of altering the course of a trial in response to real-time data; however, continuous monitoring is difficult to implement if the outcome of interest is not quickly available. The proposed method is flexible in the number and timing of interim analyses and accommodates both multiple-stage and continuous monitoring designs.
To provide a more accessible user interface, we created a web application using Shiny, which can be accessed at
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Institutes of Health, CTSA Grant Number UL1TR001863; National Institutes of Health Grant Number P30CA16359, P50CA196530, P50CA121974, R01CA177719, R01ES005775, R01CA223481, R41A120546, U48DP005023, U01CA235747, R35CA197574, R01CA168733, UM1 CA186689, P50 DE03070701, P50 CA121974, and a Pilot Grant from Yale Cancer Center.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Efficacy and futility stopping bounds for the prototype and a sample path. A study will be stopped for efficacy or futility if the number of responses is above the efficacy bound (e) or below the futility bound (f). The sample path (dotted line) crosses the futility bound when two out of the first 18 patients respond to the treatment, leading to a futility decision. Under the null hypothesis, the response rate is 10%. Under the alternative, the response rate is 35%. This design achieved 90.1% power at a significance level of 0.093.
The cumulative probabilities of stopping for efficacy (left panel) and futility (right panel). Under the null hypothesis, the response rate is 10%. Under the alternative, the response rate is 35%. This design achieved 90.1% power at a significance level of 0.093.
The toxicity bound and the cumulative probability of safety stopping for the prototype. The trial-wide probability of safety stopping is 0.0855, 0.6231 and 0.9489, assuming the true probability of a patient developing SAEs is 0.10, 0.25, and 0.40, respectively. We stop for safety if the number of patients with SAEs exceeds the toxicity bound. The highest SAE rate considered safe is 0.10 and the false alarm rate we are willing to tolerate is 0.10.
The cumulative probabilities of stopping the prototype for efficacy, futility, toxicity, or any cause in different scenarios. (a) and (c) represent treatment with acceptable toxicity profiles, assuming the probability of SAE is 0.10, whereas (b) and (d) represent treatment with unacceptable toxicity levels, assuming the probability of SAE is 0.40. The probability of response is 0.10 and 0.35 under the null and efficacious scenarios, respectively.
The probability
Without SAEs  With SAEs  Marginal  

No Response 

 1 − 
Response 



Marginal  1 − 
 1 
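As an illustration of how the four cells of such a response-by-SAE table can be obtained from the two marginal probabilities and an odds ratio, the sketch below solves the standard quadratic equation for the probability that a patient both responds and has an SAE. The function and argument names are hypothetical, and the odds ratio is assumed to be defined as (p11 × p00) / (p10 × p01), with the first index denoting response and the second denoting SAE. An odds ratio of 1 corresponds to independence.

```python
from math import sqrt


def joint_cells(p_resp, p_tox, odds_ratio=1.0):
    """Cell probabilities of the 2x2 response-by-SAE table.

    Returns (p00, p01, p10, p11), where the first index is response (0/1)
    and the second is SAE (0/1), given the marginal response probability,
    the marginal SAE probability and the odds ratio
    (p11 * p00) / (p10 * p01).
    """
    if abs(odds_ratio - 1.0) < 1e-12:
        # Independence: the joint probability is the product of the marginals.
        p11 = p_resp * p_tox
    else:
        # p11 is the smaller root of
        # (psi - 1) p11^2 - [1 + (psi - 1)(pR + pT)] p11 + psi pR pT = 0.
        a = 1.0 + (p_resp + p_tox) * (odds_ratio - 1.0)
        b = 4.0 * odds_ratio * (odds_ratio - 1.0) * p_resp * p_tox
        p11 = (a - sqrt(a * a - b)) / (2.0 * (odds_ratio - 1.0))
    p10 = p_resp - p11              # response without SAE
    p01 = p_tox - p11               # SAE without response
    p00 = 1.0 - p_resp - p_tox + p11
    return p00, p01, p10, p11


# Example: 35% response, 10% SAE, positive association (odds ratio 2).
p00, p01, p10, p11 = joint_cells(0.35, 0.10, odds_ratio=2.0)
print(f"p00={p00:.4f}  p01={p01:.4f}  p10={p10:.4f}  p11={p11:.4f}")
```

The row and column sums of the returned cells reproduce the marginal probabilities, so the same marginal operating characteristics can be examined under different assumed strengths of association.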
The cumulative probabilities of efficacy (CPE) and futility (CPF) stopping for our prototype using the unified exact design.


 CPF  CPE  CPF  CPE  CPF  CPE  CPF  CPE 

5  0  2  0.0000  0.0086  0.0000  0.0579  0.0000  0.1631  0.0000  0.2352 
10  1  2  0.3487  0.0702  0.1074  0.3222  0.0282  0.6172  0.0135  0.7384 
15  3  3  0.8189  0.0893  0.4042  0.4171  0.1314  0.7471  0.0649  0.8558 
20  4  4  0.8731  0.0968  0.4628  0.4640  0.1518  0.8044  0.0741  0.9011 
Note: In the prototype, we aim to test if the response rate to a novel agent is greater than 10%, assuming a target response rate of 35%. Interim analyses are conducted after 5, 10, 15, and 20 patients have been accrued and evaluated. For each interim sample size
The cumulative probabilities of efficacy (CPE), futility (CPF), and toxicity (CPT) stopping for our prototype using the unified exact design.



 CPF  CPE  CPT  CPF  CPE  CPT  CPF  CPE  CPT  CPF  CPE  CPT 

5  0  2  2  0.0000  0.0084  0.0086  0.0000  0.0051  0.3174  0.0000  0.2325  0.0086  0.0000  0.1484  0.3174 
10  1  2  2  0.3266  0.0648  0.0696  0.0641  0.0133  0.8287  0.0128  0.6996  0.0540  0.0031  0.2299  0.7182 
15  3  3  3  0.7543  0.0819  0.0809  0.0984  0.0144  0.8814  0.0606  0.8076  0.0585  0.0082  0.2394  0.7452 
20  4  4  4  0.8027  0.0886  0.0819  0.1001  0.0146  0.8844  0.0690  0.8490  0.0593  0.0087  0.2413  0.7489 
Note: The prototype tests if the response rate to a novel agent is greater than 10%, assuming a target response rate of 35%. Any SAE rate greater than 10% is considered overly toxic. For each interim sample size