## Abstract

The standard phase II trial design has changed dramatically over the past decade. Randomized phase II studies have essentially become the standard phase II design in oncology for a variety of reasons. The use of these designs is motivated by concerns about the use of historical data to determine if a new agent or regimen shows promise of activity. However, randomized phase II designs come with the cost of increased study duration and patient resources. Progression-free survival (PFS) is an important endpoint used in many phase II designs. In many clinical settings, changes in PFS with the introduction of a new treatment may represent true benefit in terms of the gold standard outcome, overall survival (OS). The phase II/III design has been proposed as an approach to shorten the time of discovery of an active regimen. In this article, design considerations for a phase II/III trial are discussed and presented in terms of a model defining the relationship between OS and PFS. The design is also evaluated using 15 phase III trials completed in the Southwest Oncology Group (SWOG) between 1990 and 2005. The model provides a framework to evaluate the validity and properties of using a phase II/III design. In the evaluation of SWOG trials, three of four positive studies would have also proceeded to the final analysis and 10 of 11 negative studies would have stopped at the phase II analysis if a phase II/III design had been used. Through careful consideration and thorough evaluation of design properties, substantial gains could occur using this approach. *Clin Cancer Res; 19(10); 2646–56. ©2013 AACR*.

## Introduction

The standard phase II trial design has changed dramatically over the past decade. Randomized phase II studies have essentially become the standard phase II design in oncology for a variety of reasons. The use of these designs is motivated by concerns about the use of historical data to determine if a new agent or regimen shows promise of activity. These concerns stem in part from the preponderance of failed phase III trials in cancer research. When systematic differences exist between the study population and the historical control population or in the assessment of outcomes between the 2 populations, then evaluations of efficacy are subject to bias, resulting in false leads or missed opportunities. Although overall survival (OS) generally remains the primary endpoint in most phase III trials, progression-free survival (PFS) instead of response rate is increasingly used in the phase II setting. As noted by Zhang and colleagues (1), PFS assessment can be subject to bias based on the assessment schedule, consistency of evaluators, and a number of other factors.

Randomized phase II studies can demand significant time and patient resources (2, 3). They are typically 2 to 4 times larger than single-arm trials. Relative to phase III studies, they generally require similar levels of review and include the same degree of effort to initiate and conduct. Therefore, a well-designed randomized phase II trial may require a significant effort and time to complete, delaying the time to a definitive answer.

To address the aforementioned efficiency concerns, numerous authors have promoted a combination of randomized phase II and III studies (4–11). In essence, a randomized phase II/III design is a randomized phase III design with an early look to stop for futility, but not for early signs of efficacy (6). This early interim analysis may use an alternative endpoint other than the primary endpoint for the phase III trial; for example, PFS may be used for the first interim analysis, when the primary endpoint for the final analysis is OS. At this interim analysis, the go/no go decision conforms closely to the positive/negative conclusion from a typical randomized phase II trial, and as such there is a higher probability of stopping early for futility than with the typical early interim analysis.

In general, the most clinically meaningful outcome continues to be OS. However, as discussed by Korn and Crowley (12) and Villaruz and Socinski (13), PFS is becoming an endpoint thought to have clinical merit in its own right. In this article, we discuss design considerations for a phase II/III study using PFS as the primary endpoint for the phase II component and OS as the primary endpoint for the phase III. This design necessitates making certain assumptions about the relationship between PFS and OS. Design options and qualifications are evaluated using an example study setting. In addition, to explore the validity of the proposed model, PFS and OS data from published phase III studies leading to a U.S. Food and Drug Administration (FDA)–approved agent, or thought to be practice changing, are used. The article concludes with an evaluation of the properties when applied to phase III trials completed by Southwest Oncology Group (SWOG). To anchor this discussion of phase II/III designs, we propose the following as the basic foundation for designing a phase II/III trial.

## Design Considerations for Phase II/III Trials

The standard randomized phase II design using PFS as the primary endpoint will target larger improvements than the follow-up phase III using OS and use type I error rates between 10% and 20% (14). Although such a design may be based on improvement in PFS, interpretation of trial results often takes into consideration multiple pieces of information, such as OS, response, and toxicity. In contrast, a phase II/III study requires the specification of one rule. The phase II component of a phase II/III trial is a modification of the stand-alone phase II design, using similar effect sizes and error rates. The first interim analysis in a phase II/III trial is considered the phase II analysis, with the analysis time determined on the basis of the number of events defined by the phase II parameters. Futility is established at this interim analysis if the alternative hypothesis is rejected at a level equal to 1 − power of the phase II design or, equivalently, if the study fails to reject the null hypothesis at the type I error rate of the phase II design. Therefore, in the design of a phase II/III trial, it is important to evaluate the proposed model and design choices as a unified study rather than simply combining phase II and III study designs.
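The futility rule described above can be sketched in a few lines. This is a minimal illustration rather than the authors' analysis code: it assumes 1:1 randomization, the common normal approximation Var(log HR) ≈ 4/d for d observed PFS events, and an HR expressed as control hazard over experimental hazard (so values above 1 favor the experimental arm). The function name `phase2_decision` and its arguments are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist


def phase2_decision(hr_hat: float, n_events: int, alpha: float) -> str:
    """Return 'continue' or 'stop' at the phase II interim analysis.

    hr_hat   : estimated PFS hazard ratio (control / experimental), assumed convention
    n_events : number of PFS events observed at the interim analysis
    alpha    : one-sided type I error rate of the phase II component
    """
    # Approximate standardized test statistic under 1:1 randomization,
    # using Var(log HR) ~ 4 / n_events (an approximation, not an exact result).
    z = log(hr_hat) * sqrt(n_events / 4.0)
    # One-sided critical value for the phase II type I error rate.
    z_crit = NormalDist().inv_cdf(1.0 - alpha)
    # Failing to reject the null hypothesis for PFS is treated as futility.
    return "continue" if z > z_crit else "stop"


if __name__ == "__main__":
    for hr in (1.0, 1.1, 1.6):
        print(hr, phase2_decision(hr, n_events=120, alpha=0.10))
```

A single pre-specified rule of this kind is what distinguishes the phase II/III interim analysis from the more holistic interpretation of a stand-alone phase II trial.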

Steps to design a phase II/III trial:

1. Determine if the setting is appropriate.
2. Define the assumed relationship between the phase II and III endpoints.
3. Define the phase II and III study parameters: target effect measure and size, type I and type II error rates.
4. Evaluate phase II designs in terms of feasibility/timeliness of analysis and impact on phase III design properties.
5. Evaluate the properties of the phase II/III design and adjust design parameters if necessary.

## Materials and Methods

Each specific step to design a phase II/III study is discussed later in this article. Step 1 is discussed in terms of key considerations. Step 2 is discussed by presenting a possible model to define the association between PFS and OS. The discussion of step 3 uses this model to determine the study parameters and is further illustrated with an example setting in which a phase II/III design might be appropriate. A literature review of phase III studies reporting on PFS and OS is used to evaluate the proposed model. Finally, to evaluate the performance of a phase II interim analysis on completed phase III studies, a “phase II analysis” is conducted on phase III studies led by SWOG between 1990 and 2005 that included OS as the primary outcome.

### Determine if the setting is appropriate

The first step, determination of appropriateness, is based less on statistical concepts and trial design issues than on clinical considerations. Specifically, within a disease setting and treatment type, there has to exist an intermediate outcome (such as PFS) for which it is expected that a treatment effect on this outcome represents an effect on the primary outcome (such as OS). It would seem that the most appropriate setting would be in more advanced disease with limited numbers of effective therapies available for postprogression treatment. Of note, one could design a phase II/III study using the same endpoint for the phase II component as for the overall trial; however, there will not be the same reduction in time and patient savings relative to using an earlier endpoint for the phase II component. In terms of evaluating the feasibility of this approach, the event rate is likely quite important because the basis of using such a design is time saving. If the rates are low, then in order for the sample size to be in the phase II range, the phase II analysis would require a temporary closure to accumulate enough events. In this case, it is important to consider how feasible it would be to conduct such a study because such an interim analysis may be more informative than the standard interim analysis and could affect the integrity of the trial (15, 16).

### Define the assumed relationship between the phase II and III endpoints

Given that continuation of a phase II/III design past the phase II interim analysis is based solely on a rule defined for the phase II endpoint, it should be the case that differences in the phase II outcome capture the effect of treatment on the phase III outcome. In the context of PFS and OS, this assumption would indicate that all or almost all of the treatment benefit occurs before progression, and most of the difference in OS is due to differences in progression. This assumption is consistent with the criteria put forward by Prentice (17) in specifying the necessary conditions for a surrogate endpoint. Specifically, a key component of the criteria is that the surrogate endpoint captures any relationship between the treatment and the primary endpoint.

To describe the relationship between PFS and OS, we use the model proposed by Goldman and colleagues (18). This model is a mixture of survival distributions specified in terms of the time to progression (TTP), time to death without progression (*S*_{Pre}), and time to death following progression (*S*_{Post}). Figure 1 depicts this model, with *λ*_{TTP,c}(*t*), *λ*_{Pre,c}(*t*), and *λ*_{Post,c}(*t*) representing the hazard rates at time *t* for the survival distributions for TTP, *S*_{Pre}, and *S*_{Post}, for the control treatment (*c* = 0) and the experimental treatment (*c* = 1). The observed values for PFS and OS are: PFS = min{TTP, *S*_{Pre}}; OS = *S*_{Pre} if a patient dies before progression; and OS = TTP + *S*_{Post} if the patient progresses before death. It follows that if TTP and *S*_{Pre} are exponentially distributed, then so is PFS, with a hazard rate of *λ*_{TTP} + *λ*_{Pre}. However, as shown by Goldman and colleagues (18), in this model OS is not exponentially distributed even when there are constant rates of transition between the states defined by progressive disease or death, so that hazard rates for survival are not constant over time. The HR for OS in this model is the HR averaged over the observation time period.
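The structure of this model can be illustrated with a short simulation. The sketch below is only an illustration under stated assumptions: all three transition times are taken to be independent and exponential with constant hazards, and the hazard values in the usage example (per month) are hypothetical and not taken from the article. The names `simulate_patient` and `simulate_arm` are likewise invented for this example.

```python
import random
from statistics import mean


def simulate_patient(rng, lam_ttp, lam_pre, lam_post):
    """Draw (PFS, OS) for one patient under constant hazards.

    lam_ttp  : hazard of progression
    lam_pre  : hazard of death before progression (may be 0)
    lam_post : hazard of death after progression
    """
    ttp = rng.expovariate(lam_ttp)
    # With no risk of death before progression, S_Pre is never observed.
    s_pre = rng.expovariate(lam_pre) if lam_pre > 0 else float("inf")
    pfs = min(ttp, s_pre)  # PFS = min{TTP, S_Pre}
    if s_pre < ttp:
        os_time = s_pre  # death before progression
    else:
        os_time = ttp + rng.expovariate(lam_post)  # progression, then death
    return pfs, os_time


def simulate_arm(n, lam_ttp, lam_pre, lam_post, seed=0):
    """Simulate n patients on one treatment arm with a fixed seed."""
    rng = random.Random(seed)
    return [simulate_patient(rng, lam_ttp, lam_pre, lam_post) for _ in range(n)]


if __name__ == "__main__":
    # Hypothetical hazards per month, chosen only for illustration.
    arm = simulate_arm(20000, lam_ttp=0.12, lam_pre=0.02, lam_post=0.15, seed=1)
    print("mean PFS:", mean(p for p, _ in arm))
    print("expected under exponential PFS:", 1 / (0.12 + 0.02))
```

The simulated mean PFS should sit close to 1/(*λ*_{TTP} + *λ*_{Pre}), consistent with PFS being exponential when TTP and *S*_{Pre} are, whereas OS is the sum or mixture of these times and is not exponential.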

As depicted in Fig. 1, treatment has an indirect effect on survival time through its effect on TTP and a direct effect on death. If the main effect of treatment is through an indirect effect on survival due to an effect on progression, then progression represents a surrogate, and *λ*_{Pre} is 0, or close to 0. However, even with a small risk of death before progression, it could well be assumed that *λ*_{Pre} does not vary across treatment arms, and then the PFS HR is smaller than the TTP HR. In addition, if *λ*_{Pre} is a function of time, then the PFS HR is not constant over time.

### Define the phase II and III study parameters

Design of the phase II and III components requires specification of the effect measure and target effect size and the type I and II error rates. The phase II parameters are more accurately defined in terms of the phase III design parameters. Specifically, a phase II type I error occurs if the null hypothesis for PFS is rejected when there is in fact no difference in OS between the treatment arms. Likewise, a phase II type II error occurs if there truly is an effect of the study regimen on OS but the study is stopped at the phase II analysis.

Because the phase II interim analysis is much more aggressive than standard interim analysis boundaries, this analysis has a greater impact on the study design properties. For example, if both phase II and III designs specify 90% power, the adjusted power is 81% (6). Therefore, depending on the power of the phase II component, the adjusted power of the phase III component could be significantly decreased.
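The power adjustment quoted above is simple arithmetic and can be checked directly. The sketch below assumes the product rule the text applies (phase II power multiplied by phase III power); the function names are hypothetical, and the second helper simply inverts the product to show what phase III power would be needed to hit a chosen adjusted target.

```python
def adjusted_power(phase2_power: float, phase3_power: float) -> float:
    """Adjusted power of a phase II/III design under the product rule."""
    return phase2_power * phase3_power


def phase3_power_needed(target_adjusted: float, phase2_power: float) -> float:
    """Phase III power required to reach a target adjusted power."""
    return target_adjusted / phase2_power


if __name__ == "__main__":
    print(adjusted_power(0.90, 0.90))       # both components at 90% power
    print(adjusted_power(0.90, 0.95))       # phase III power raised to 95%
    print(phase3_power_needed(0.86, 0.90))  # phase III power for ~86% adjusted
```

Raising the phase III power recovers part of the loss, but at the cost of a larger phase III sample size.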

Determination of the effect measures and sizes for the 2 outcomes is more complicated. Quite often, although studies are designed to evaluate the HR between 2 treatment arms, clinically meaningful benefits are defined in terms of the difference in medians or percentage alive at a given time point. Under the model proposed, the multiplier that defines the relationship between the HR for PFS and OS is a time-varying function of the postprogression hazard rate and the hazard rates for PFS on both treatment arms. When the hazard rate for survival postprogression is larger, i.e., when time between progression and death is shorter, the multiplier is closer to 1 and the HRs for PFS and OS are more similar. It follows that in this situation, the ratio of medians is smaller than the average HR.

Assuming PFS is a perfect surrogate for OS, the HR for OS at time *t*, *Λ*_{OS}(*t*), can be shown to be the product of the HR for PFS, *Λ*_{PFS}, and a multiplier that is a function of the survival functions for PFS and for survival postprogression.
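One way to see the form of such a multiplier is to work through a simplified special case. The derivation below is a sketch under assumptions that are not stated in this form in the text: no deaths before progression (*λ*_{Pre} = 0), exponential PFS with hazard *λ*_{PFS,c} on arm *c*, exponential postprogression survival with a common hazard *λ*_{Post} ≠ *λ*_{PFS,c} on both arms, and HRs expressed as control over experimental. It is not necessarily the expression given by the authors.

```latex
\begin{align*}
S_{PFS,c}(t) &= e^{-\lambda_{PFS,c} t}, \qquad S_{Post}(t) = e^{-\lambda_{Post} t}, \\[4pt]
S_{OS,c}(t) &= \frac{\lambda_{Post}\, S_{PFS,c}(t) - \lambda_{PFS,c}\, S_{Post}(t)}
                    {\lambda_{Post} - \lambda_{PFS,c}}, \\[4pt]
h_{OS,c}(t) &= \lambda_{PFS,c}\, \lambda_{Post}\,
               \frac{S_{PFS,c}(t) - S_{Post}(t)}
                    {\lambda_{Post}\, S_{PFS,c}(t) - \lambda_{PFS,c}\, S_{Post}(t)}, \\[4pt]
\Lambda_{OS}(t) &= \frac{h_{OS,0}(t)}{h_{OS,1}(t)}
  = \Lambda_{PFS} \times
    \frac{\bigl[S_{PFS,0}(t) - S_{Post}(t)\bigr]
          \bigl[\lambda_{Post}\, S_{PFS,1}(t) - \lambda_{PFS,1}\, S_{Post}(t)\bigr]}
         {\bigl[\lambda_{Post}\, S_{PFS,0}(t) - \lambda_{PFS,0}\, S_{Post}(t)\bigr]
          \bigl[S_{PFS,1}(t) - S_{Post}(t)\bigr]},
\qquad \Lambda_{PFS} = \frac{\lambda_{PFS,0}}{\lambda_{PFS,1}}.
\end{align*}
```

In this special case the second factor depends on *t*, so the OS hazards are not proportional, and it tends to 1 as *λ*_{Post} grows large, which matches the qualitative behavior described for the multiplier.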

Here, the multiplier is generally less than 1, implying that *Λ*_{PFS} is generally some factor larger than *Λ*_{OS}(*t*). However, given that the target in clinical terms is usually both an HR and a difference in medians, the approach we use here is to “attribute” the targeted absolute difference in median OS to PFS to determine clinically meaningful HRs. This approach does ignore the fact that OS does not have proportional hazards in this model.

This model highlights that in design and analysis the true model is generally unknown, but for practicality and consistency across studies, standard analysis approaches such as the log-rank test statistic or Cox regression are generally used. It is most likely that the deviations from proportional hazards and constant hazards will result in a reduction in power, but by averaging over the times, a meaningful measure is given. Again, this emphasizes the importance of evaluating the design using the proposed analysis approach.

### Evaluate feasibility and design properties

To illustrate the evaluation of feasibility and design properties, the motivating example used here is treatment of extensive stage small-cell lung cancer (E-SCLC). The standard of care for E-SCLC has essentially been a platinum agent (cisplatin) in combination with etoposide for more than 20 years (19). Median survival is 9 to 10 months and median PFS is approximately 5 months. Moreover, to date, only topotecan has been approved by the FDA for the second-line treatment of SCLC, and its efficacy is quite modest. This highlights a setting where there is a desperate need to improve care and to do so quickly.

Using this example in E-SCLC, candidate designs for the phase II component of a randomized phase II/III trial were determined and compared with stand-alone phase II and III designs. A 3-month improvement in median survival would be considered clinically valuable. Therefore, these designs targeted a 33% improvement in median survival, from 9 to 12 months. Applying the 3-month difference to PFS, this would be equivalent to a 60% improvement in median PFS, from 5 to 8 months. The sample sizes were determined on the basis of a uniform accrual rate of 20 patients per month. Four follow-up scenarios were considered: 0, 3, 6, and 9 months of follow-up. A stand-alone randomized phase II design would likely use 9 months of follow-up. Table 1 details the required events and sample size for varying levels of type I and II error rates.
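The translation from target medians to HRs and required events can be sketched as follows. This is a rough illustration, not the calculation behind Table 1: it assumes exponential survival (so the HR equals the ratio of medians), 1:1 randomization, a one-sided test, and the Schoenfeld approximation for the number of events. It ignores accrual and follow-up, which the designs in the text account for. The function names are hypothetical.

```python
from math import ceil, log
from statistics import NormalDist


def hr_from_medians(median_control: float, median_experimental: float) -> float:
    """HR (control / experimental) implied by two medians under exponential survival."""
    return median_experimental / median_control


def schoenfeld_events(hr: float, alpha: float, power: float) -> int:
    """Approximate events needed for a one-sided log-rank test, 1:1 allocation."""
    std_normal = NormalDist()
    z_sum = std_normal.inv_cdf(1.0 - alpha) + std_normal.inv_cdf(power)
    return ceil(4.0 * z_sum ** 2 / log(hr) ** 2)


if __name__ == "__main__":
    hr_pfs = hr_from_medians(5, 8)   # median PFS 5 -> 8 months
    hr_os = hr_from_medians(9, 12)   # median OS 9 -> 12 months
    print("PFS HR:", hr_pfs, "OS HR:", round(hr_os, 3))
    for alpha in (0.10, 0.20):
        for power in (0.85, 0.90):
            print(alpha, power, schoenfeld_events(hr_pfs, alpha, power))
```

Because the same 3-month absolute difference is a much larger relative effect on PFS than on OS, the phase II component needs far fewer events than the phase III component.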

Table 1 shows that a phase II/III design is quite feasible in this setting with a relatively short time to the phase II analysis and modestly small sample sizes even for the setting with no follow-up. In fact, because of the rapid event rate, for all of the designs the phase II interim analysis with no temporary closure for additional follow-up occurs at the same time or before the designs with follow-up, including the stand-alone randomized phase II design.

As discussed earlier, design of the phase III component of a phase II/III trial requires the choice of either a study with lower adjusted power or an increase in the phase III power to recover the power loss from the phase II analysis. A stand-alone phase III design with 81% power would require 506 patients with an expected analysis time of 38 months, whereas a phase II/III trial with adjusted 81% power (90%*90%) would require 638 patients with an analysis time of 44 months. A stand-alone phase III design with 86% power would require 570 patients with an expected analysis time of 41 months, whereas a phase II/III trial with adjusted 86% power (95%*90%) would require 764 patients with an analysis time of 51 months. These designs use the design parameters stated earlier, a one-sided 2.5% log-rank test for significance (assuming exponential survival) and 12 months of follow-up. Therefore, using a stand-alone phase II design with 10% error rates (see Table 1), a stand-alone phase II followed by a phase III with 81% power would require 666 patients (160 + 506) and a total study period of 55 months, whereas the phase II/III design would require 638 patients and a total study period of 44 months. Similarly, a phase III trial with 86% power would require 730 patients (160 + 570) and a total study period of 58 months for the stand-alone studies and 764 patients and a total study period of 51 months for the phase II/III design. This shows that in the given setting, a phase II/III design would reduce the time to a definitive answer relative to the traditional phase II followed by phase III designs.
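The comparison in the preceding paragraph can be tabulated with a few lines of arithmetic. The sketch below uses only the patient counts and study durations quoted in the text; the `Pathway` class and `compare` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pathway:
    label: str
    patients: int
    months: int


def compare(sequential: Pathway, combined: Pathway) -> dict:
    """Patients and months saved by the phase II/III design (negative = cost)."""
    return {
        "patients_saved": sequential.patients - combined.patients,
        "months_saved": sequential.months - combined.months,
    }


# Figures quoted in the text: stand-alone phase II (160 patients) plus phase III,
# versus a single phase II/III trial with the same adjusted power.
seq_81 = Pathway("phase II then phase III, 81% power", 160 + 506, 55)
comb_81 = Pathway("phase II/III, adjusted 81% power", 638, 44)
seq_86 = Pathway("phase II then phase III, 86% power", 160 + 570, 58)
comb_86 = Pathway("phase II/III, adjusted 86% power", 764, 51)

if __name__ == "__main__":
    print(compare(seq_81, comb_81))
    print(compare(seq_86, comb_86))
```

In the 81% scenario the phase II/III design uses 28 fewer patients and finishes 11 months sooner; in the 86% scenario it finishes 7 months sooner but requires 34 more patients, so the gain there is in time rather than accrual.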

### Evaluation of PFS and OS in literature review

In the literature review of phase III studies, 8 breast cancer (20–27), 13 colorectal (28–40), 11 kidney (41–51), 4 leukemia (52–55), 10 lung and mesothelioma (56–65), 1 lymphoma (66), 6 myeloma (67–72), and 2 prostate cancer (73, 74) publications were identified. To evaluate the relationship between PFS and OS, the absolute difference in median PFS and OS, the ratio of median PFS values and the PFS HR, and the ratio of median OS values and the OS HR were compared using linear regression. These data are presented in Fig. 2. Figure 2A presents the association between the difference in median PFS and OS. The blue line represents equality between change in OS and PFS, and the red line presents the regression of the difference in median OS on the difference in median PFS. The regression coefficient for PFS is 0.89 and is statistically significant at a level less than 0.0001. Figure 2B presents the association between the PFS HR and the OS HR. Again, the blue line represents equality and the red the regression line. Although the magnitude of association was significantly less than the association with the absolute difference in medians, the PFS HR was statistically associated with the OS HR at level 0.04, with a regression coefficient of 0.28. Figure 2C and D present the comparisons of the ratio of medians to the HRs for PFS and OS. Both values were highly statistically significant at a level less than 0.0001 with coefficients of 0.72 for the PFS comparison and 0.70 for the OS comparison, indicating the HRs for PFS and OS are reasonably well represented by the ratio of medians.

### Phase II/III evaluation of SWOG trials

SWOG provided an opportunity to evaluate this approach on real data. Fifteen phase III trials run through SWOG between 1990 and 2005 that included OS as the primary endpoint and had data on either PFS or relapse-free survival were identified (56, 73, 75–86). Table 2 includes a description of each of these trials, including the disease setting, treatments, and design properties. Of these trials, 4 were positive and the remaining 11 studies failed to reject the null hypothesis.

Of the 15 trials, 6 trials were identified as occurring in disease settings in which there would likely be little gain in time and patient accrual using this approach, given the assumed PFS times in the control arms. Specifically, S8797 with a 5-year PFS rate of 56%, S9313 with a 5-year disease-free survival (DFS) rate of 96%, S9415 with a median DFS of 8.5 years, S9321 with a median PFS of 1.3 years, S9438 with a 2-year DFS rate of 30%, and S9008 with a median RFS of 20 months were identified as poorer candidates for a phase II/III design using OS as the primary endpoint and PFS as the phase II endpoint, and perhaps these populations are better addressed by separate phase II and III studies. It is possible that in these disease settings there are better endpoints that would alleviate this issue and allow for the use of a phase II/III design.

Table 3 summarizes the actual number of events, sample size, percentage of actual accrual, and the phase II decision for each power/type I error combination. In general, the phase II decision was consistent across the range of error rates. Of the trials identified as poor candidates, 2 (S8797 and S9438) would have essentially gone to full accrual, and the remaining 4 (S9313, S9415, S9321, and S9008) would generally have reached at least 50% accrual before the phase II analysis, with S9313 accruing around 2,000 patients and S9415 accruing around 1,000 patients by the analysis.

Of the positive trials, 3 of 4 would have continued at the phase II analysis across all scenarios. S9308, the positive study that stopped at the phase II analysis, was positive for both OS and PFS at the final analysis, with a median PFS of 2 versus 4 months and a median OS of 6 versus 8 months. The PFS HR used in the analysis was 2, which is likely too large. Evaluating this study with an HR of 1.6 (equivalent to a 1.2-month improvement), the study would have continued in all scenarios except in the phase II setting with 85% power and 20% type I error.

Of the negative trials, 10 of 11 would have stopped at the phase II analysis for the majority of scenarios. Interestingly, S0124 would have never stopped early, and at the final analysis, whereas OS was not significantly different between the treatment arms, PFS was significantly different between the treatment arms. Results were mixed in S9509 and S0001, with the studies failing to stop for futility based on the evaluation with all scenarios with 20% type I error rates for S9509 and those with 85% and 90% power for S0001. In addition, both studies would have failed to stop with the 85% power, 20% type I error combination.

Because the phase II power has a large impact on the adjusted phase II/III power, our recommendation is to use at least 90% power for the phase II design. Then, the choice of the type I error would be based on both acceptability of the proportion of negative studies that will proceed to the phase III portion of the study and the sample size and timing of the phase II analysis. The evaluation of the SWOG studies indicates that a false-positive rate of 20% may be too liberal.

## Discussion

Although this article focuses upon the statistical properties of a new clinical trial design, it is also important to note that this design has already been used in clinical oncologic research. For example, SWOG S1203, an ongoing study, uses early monitoring based on complete response, whereas the primary outcome is event-free survival. In addition, a coauthor on this article, acting as a consultant for Threshold Pharmaceuticals, recommended this design for the conduct of a randomized clinical trial of doxorubicin versus the combination of doxorubicin and TH302, a prodrug of the DNA alkylator bromo-isophosphoramide mustard. The company reported that this design enabled them to raise sufficient capital to fund this registration trial.

The evaluation of SWOG studies presented in this article was based on the proposed model for PFS and OS. However, we note that numerous authors have discussed modeling the relationship between PFS and OS (87–92). The proposed model used in this article is just one such possible model. We find the model to be an intuitive one for the setting described. For the design of a phase II/III study, the study team should assess the disease setting and potential outcomes to be used and then determine which model is most appropriate.

It is important to note that while a phase II/III design has a high likelihood of stopping at the phase II analysis under the null hypothesis, implementation of a phase II/III design does mean that the study sponsor and team are prepared (in terms of availability of financial and patient resources) for the study to continue to the phase III portion. In addition, another possible downside to this approach is that a separate publication of the phase II study will not occur, as it is inappropriate to publish the results of interim analyses of an ongoing study.

## Conclusions

The phase II/III design has been suggested as an approach to speeding up drug development. Although not applicable in all settings, use of a phase II/III design with PFS and OS, as described in this article, could reduce the time to discovery not only in study time but also by removing the development time between a phase II and a phase III trial. Through careful consideration and thorough evaluation of design properties, substantial gains could occur using this approach.

## Disclosure of Potential Conflicts of Interest

L. Baker is employed as the president of The Hope Foundation and is a consultant/advisory board member of Millenium and Cytrx. No potential conflicts of interest were disclosed by the other authors.

## Authors' Contributions

**Conception and design:** M.W. Redman, B.H. Goldman, M. LeBlanc, L.H. Baker

**Development of methodology:** M.W. Redman, B.H. Goldman

**Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.):** M.W. Redman, L.H. Baker

**Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis):** M.W. Redman, B.H. Goldman, L.H. Baker

**Writing, review, and/or revision of the manuscript:** M.W. Redman, M. LeBlanc, A. Schott, L.H. Baker

**Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases):** M.W. Redman

**Study supervision:** L.H. Baker

## Grant Support

This investigation was supported in part by the following PHS Cooperative Agreement grants awarded by the National Cancer Institute, Department of Health and Human Services (DHHS): CA32102, CA38926, CA46441, CA105409, CA42777 and NIH grant CA090998.

## Acknowledgments

The authors thank Patricia Arlauskas, Harry Erba, Bruce Redman, and Manuel Valdevieso for their extensive work in conducting the literature review and gathering the relevant data.

- Received January 18, 2013.
- Revision received March 18, 2013.
- Accepted March 18, 2013.

- ©2013 American Association for Cancer Research.
