Purpose: The best phase II design and endpoint for growth inhibitory agents is controversial. We simulated phase II trials by resampling patients from a positive (sorafenib vs. placebo; TARGET) and a negative (AE 941 vs. placebo) phase III trial in metastatic renal cancer to compare the ability of various designs and endpoints to predict the known results.
Experimental Design: A total of 770 and 259 patients from TARGET and the AE 941 trial, respectively, were resampled (5,000 replicates) to simulate phase II trials with α = 0.10 (one-sided). Designs/endpoints: single arm, two-stage with response rate (RR) by Response Evaluation Criteria in Solid Tumors (RECIST; 37 patients); and randomized, two arm (20–35 patients per arm) with RR by RECIST, mean log ratio of tumor sizes (log ratio), progression-free survival (PFS) rate at 90 days (PFS-90), and overall PFS.
Results: Single-arm trials were positive with RR by RECIST in 55% and 1% of replications for sorafenib and AE 941, respectively. Randomized trials versus placebo with 20 patients per arm were positive with RR by RECIST in 55% and 7%, log ratio in 88% and 25%, PFS-90 in 64% and 15%, and overall PFS in 69% and 9% of replications for sorafenib and AE 941, respectively.
Conclusions: Compared with the single-arm design and the randomized design comparing PFS, the randomized phase II design with the log ratio endpoint has greater power to predict the positive phase III result of sorafenib in renal cancer, but a higher false positive rate for the negative phase III result of AE 941. Clin Cancer Res; 18(8); 2309–15. ©2012 AACR.
See commentary by LeBlanc and Tangen, p. 2130
The high failure rate of phase III oncology trials emphasizes that current phase II trials are not sufficiently informative about the efficacy, or lack thereof, of new experimental treatments. This is especially the case for growth inhibitory (cytostatic) drugs, for which conventional designs and endpoints suitable for cytotoxic agents (e.g., single-arm design with response rate by RECIST) may not accurately predict eventual phase III results. In this study, we apply a resampling methodology to 2 completed trials (1 positive and 1 negative) in metastatic renal cancer to simulate and compare various phase II designs and endpoints. We found that a randomized design with an early continuous change in tumor size endpoint was the best predictor of the positive result, although it had a higher false positive rate for the negative result. If validated in future studies, this design/endpoint has the potential to make phase II trials of growth inhibitory drugs more informative with relatively short follow-up and feasible sample sizes.
Phase II trials in oncology assess whether a novel agent or combination has enough antitumor activity to warrant further study. Although there are many definitions of antitumor activity, the most widely adopted is an “objective response,” defined as a reduction in the sum of unidimensional target lesions by 30% or more according to the Response Evaluation Criteria in Solid Tumors (RECIST; ref. 1). Single-arm phase II trials in oncology typically use the proportion of objective responders [i.e., response rate (RR)] as the primary measure of efficacy (2). Although there is increasing support for randomized phase II trials in oncology (3–6), oncology phase II trials are still significantly less likely to use control subjects than comparable trials in other specialties (7).
Unfortunately, the high rate of failure in phase III oncology trials emphasizes that current phase II trials are not sufficiently informative. Only 57% of oncology drugs succeed in phase III trials compared with 68% of nononcology drugs (8). Furthermore, less than 5% of positive phase II combination therapy trials have eventually resulted in improved standards of care (9). Although it is possible that the prevalence of truly active agents is lower in oncology than in nononcology, it is reasonable to question whether single-arm designs with RR endpoints are contributing to the low success rate. Ratain and Eckhardt pointed out that many drugs, particularly those developed based on activity against a molecular target, may be active without causing a high degree of tumor regression (10). El Maraghi and Eisenhauer found that 4 such drugs that were eventually approved on the basis of clinical benefit had RECIST RRs less than 10%, 2 of them less than 5% (11). These data call into question the wisdom of using RECIST RR as the endpoint for phase II trials.
Another major shortcoming of RR as an endpoint is the loss of statistical information that occurs when a continuous change in tumor size is dichotomized (12). To avoid this, Karrison and colleagues proposed using the mean log ratio of tumor size at 8 weeks versus baseline as the endpoint in a randomized phase II trial (13), and subsequent modeling studies have supported the potential utility of this endpoint (14, 15). In the setting of measurable disease, calculating the log ratio does not require any additional data compared with RECIST RR or PFS.
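The contrast between the dichotomized and continuous endpoints can be illustrated with a minimal sketch. The patient values and function names below are hypothetical and serve only to show how the same two measurements per patient yield either a RECIST response flag or a log ratio.

```python
import math

def log_ratio(baseline_mm: float, followup_mm: float) -> float:
    """Natural log of follow-up tumor size over baseline; negative means shrinkage."""
    return math.log(followup_mm / baseline_mm)

def is_recist_response(baseline_mm: float, followup_mm: float) -> bool:
    """Objective response: sum of target lesions reduced by 30% or more."""
    return followup_mm <= 0.70 * baseline_mm

# Hypothetical (baseline, follow-up) sums of target lesion diameters in mm.
patients = [(100.0, 75.0), (80.0, 84.0), (120.0, 80.0), (60.0, 57.0)]

ratios = [log_ratio(b, f) for b, f in patients]
responses = [is_recist_response(b, f) for b, f in patients]

# Continuous endpoint: every patient's change contributes to the mean.
mean_log_ratio = sum(ratios) / len(ratios)
# Dichotomized endpoint: a 25% shrinkage counts the same as growth.
response_rate = sum(responses) / len(responses)

print(f"mean log ratio = {mean_log_ratio:.3f}, RR = {response_rate:.2f}")
```

In this toy cohort the first patient shrinks by 25% and is counted as a nonresponder, yet still lowers the mean log ratio, which is the information that dichotomization discards.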
To explore these issues further, we evaluated a variety of phase II designs and endpoints using data from 2 completed phase III trials in renal cancer. The first is a phase III trial of sorafenib versus placebo (TARGET), a “positive” trial that showed a clinically meaningful improvement in progression-free survival (PFS) and led to approval by the U.S. Food and Drug Administration of sorafenib for renal cancer (16). The second is a phase III trial of AE 941 (shark cartilage extract) versus placebo in the same disease, a “negative” trial that showed no improvement in PFS (17). To compare the operating characteristics of single-arm and randomized phase II designs with various endpoints, we randomly selected (hereafter, resampled) individual patients from the actual phase III trials and simulated phase II trials using data from these patients. The ideal design and endpoint would accurately predict both the positive phase III result with sorafenib and the negative phase III result with AE 941.
Materials and Methods
TARGET randomized 903 patients with advanced renal cancer who had failed prior immunotherapy to sorafenib 400 mg (n = 451) or placebo (n = 452) by mouth twice daily; investigator-assessed data were obtained from the analysis conducted in May 2005. The AE 941 trial randomized 305 patients with metastatic renal cancer who had failed prior immunotherapy to AE 941 120 mL (n = 153) or placebo (n = 152) by mouth twice daily. Both trials were designed with overall survival as the primary endpoint. TARGET was powered to detect a HR of 1.33, but was stopped early after a prespecified interim analysis evaluating PFS. The AE 941 trial was powered to detect a HR of 1.5. The proportional hazards assumption was reasonably well met in TARGET, and the observed PFS curves were virtually the same in the AE 941 trial. Additional methodologic details are available in the original reports of these 2 trials (16, 17).
A total of 133 and 46 patients from TARGET and the AE 941 trial, respectively, were excluded from our analyses (Figs. 1A and B). Data included treatment assignment, unidimensional measurements of target lesions from computed tomography (CT) scans (every 6 weeks in TARGET; every 8 weeks in the AE 941 trial), PFS, and whether or not the event was censored. Target lesions were selected for tumor measurement as follows: (i) lesions measured consistently across all CT scans were identified, (ii) the 3 largest lesions were kept, and (iii) lymph nodes were excluded if 3 other lesions were available. These revised target lesions were summed to calculate tumor sizes at baseline and at 6 or 8 weeks. Patients who missed the first CT scan had tumor sizes imputed by assuming a linear change between baseline and the second CT scan. Those who had a PFS event before the first CT scan had tumor sizes imputed by assuming a percentage increase equal to the largest in that arm; the alternative approach of inverse probability weighting could not be used because covariates that correlate with outcomes were not available in the provided data sets.
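The two imputation rules described above can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code: the function names are hypothetical, and the default scan times (weeks 6 and 12) assume the TARGET schedule, with weeks 8 and 16 as the analogous choice for the AE 941 trial.

```python
from typing import Sequence, Tuple

def impute_missed_first_scan(baseline: float, second_scan: float,
                             first_week: float = 6.0,
                             second_week: float = 12.0) -> float:
    """Linear interpolation between baseline (week 0) and the second CT scan."""
    fraction = first_week / second_week
    return baseline + fraction * (second_scan - baseline)

def impute_early_event(baseline: float,
                       arm_pairs: Sequence[Tuple[float, float]]) -> float:
    """Apply the largest observed relative increase in the arm to this baseline.

    arm_pairs holds (baseline, first-scan) tumor sizes for patients in the
    same arm who did have a first scan.
    """
    worst_ratio = max(first / base for base, first in arm_pairs)
    return baseline * worst_ratio

# Hypothetical values in mm.
missed = impute_missed_first_scan(baseline=100.0, second_scan=120.0)
early = impute_early_event(50.0, [(100.0, 110.0), (80.0, 100.0), (60.0, 54.0)])
print(missed, early)
```

The second rule is deliberately pessimistic: a patient who progresses or dies before the first scan is assigned the worst growth seen in that arm instead of being dropped.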
Random sampling with replacement was carried out at the level of the individual patient. Sampling with replacement was conducted because this effectively treats the empirical cumulative distribution function as reflective of the population distribution. Five thousand replicates were made for each simulated trial, yielding a simulation margin of error (95% CI half-width) of less than ±1.5 percentage points. Simulated trials were classified as positive or negative according to statistical criteria outlined below, and the percentage of positive trials was compared across designs, endpoints, and sample sizes. Statistical analyses were conducted with standard software (STATA, version 11.2). A one-sided α (type I error rate) of 0.10 was used in all cases.
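A minimal sketch of patient-level resampling and of the simulation margin of error is given below. It assumes a normal-approximation (Wald) interval for a proportion, which may differ from the interval method used in the actual analysis; the function names are hypothetical.

```python
import math
import random

def resample(patients, n, rng):
    """Draw n patients with replacement from the observed cohort."""
    return rng.choices(patients, k=n)

def margin_of_error(p_hat, replicates, z=1.96):
    """Half-width of the normal-approximation 95% CI for a simulated proportion."""
    return z * math.sqrt(p_hat * (1.0 - p_hat) / replicates)

# The half-width is largest when the proportion of positive trials is 50%.
worst_case = margin_of_error(0.5, 5000)

# Example: a simulated trial of 37 patients drawn from 384 hypothetical patient IDs.
rng = random.Random(2012)
trial = resample(list(range(384)), 37, rng)

print(f"worst-case margin of error = {100 * worst_case:.2f} percentage points")
```

Because the same patient can be drawn more than once, each replicate behaves like a fresh sample from a population whose distribution matches the observed cohort.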
Study designs evaluated
The single-arm design was an optimal, 2-stage design (2), in which patients were resampled from the sorafenib or AE 941 arm only. To test the null hypothesis that RR by RECIST (1) at 6 weeks (for TARGET) or 8 weeks (for AE 941) was 5% or less versus the alternative that RR was 20% or more with 90% power, 37 patients were required. Trials stopped early if there were no responses in the first 12 patients and were positive if there were 4 or more responses in 37 patients. This design was selected because it reflects what investigators would likely choose for a single-arm trial based on RR by RECIST for the 2 drugs in question. Other endpoints, such as median PFS, were not considered for single-arm designs because results using such endpoints are difficult to interpret in the absence of concurrent controls (18). This is especially true for renal cancer, as the unpredictable natural history of the disease makes historical controls unreliable (19).
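The operating characteristics of this 2-stage rule can be checked from binomial probabilities. The sketch below assumes independent responses with a fixed true RR; the function names are hypothetical, and the defaults encode the rule stated above (stop if 0 responses in the first 12 patients; positive if 4 or more responses in 37).

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k responses among n patients with true RR p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def prob_early_stop(p, n1=12, r1=0):
    """Probability of stopping after stage 1 (r1 or fewer responses in n1)."""
    return sum(binom_pmf(k, n1, p) for k in range(r1 + 1))

def prob_positive(p, n1=12, r1=0, n=37, r=3):
    """Probability the trial continues past stage 1 and ends with more than r responses."""
    n2 = n - n1
    total = 0.0
    for x1 in range(r1 + 1, n1 + 1):
        need = max(r + 1 - x1, 0)  # responses still required in stage 2
        tail = sum(binom_pmf(x2, n2, p) for x2 in range(need, n2 + 1))
        total += binom_pmf(x1, n1, p) * tail
    return total

type_one_error = prob_positive(0.05)    # true RR at the null value
power = prob_positive(0.20)             # true RR at the alternative
early_stop_null = prob_early_stop(0.05)

print(f"type I error = {type_one_error:.3f}, power = {power:.3f}, "
      f"P(early stop | RR = 5%) = {early_stop_null:.3f}")
```

Calling `prob_positive` with other values of the true RR shows how quickly the chance of a positive single-arm trial falls for a drug whose RECIST RR lies between the null and alternative values.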
Randomized designs of sorafenib versus placebo or AE 941 versus placebo were carried out with 1:1 randomization and sample sizes of 20, 25, 30, and 35 patients per arm. The sample size of 20 patients per arm was selected to correspond approximately to the sample size for the single-arm design. The sample sizes of 25, 30, and 35 patients per arm were chosen to test the effect of increasing sample size, while keeping the total number of patients feasible for a phase II trial conducted at a single institution or through a small consortium. Endpoints included RR by RECIST at 6 or 8 weeks, mean log ratio of tumor size at 6 or 8 weeks relative to baseline (log ratio; ref. 13), PFS rate at 90 days (PFS-90), and overall PFS. Arms were compared with a χ² test for RR by RECIST, a 2-sample t test for log ratio, a χ² test for PFS-90, and a log-rank test for overall PFS. Trials were positive if patients on the treatment arm did significantly better than patients on the placebo arm (1-sided P < 0.10). Trials were stopped early (i.e., halfway) for all endpoints (except overall PFS) if the treatment arm was performing worse than the placebo arm, a rule associated with a very small loss in power (20).
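The randomized design with the log ratio endpoint can be sketched as a small simulation. This is an illustration, not the STATA analysis: the patient pools are synthetic, the function names are hypothetical, the 2-sample t test is replaced by a normal approximation to keep the example within the Python standard library, and 500 replicates are used instead of 5,000 for speed.

```python
import math
import random
from statistics import NormalDist, mean, variance

def one_sided_p(treat, placebo):
    """One-sided P value that the treatment mean log ratio is lower (less growth).

    Normal approximation to the 2-sample t reference distribution (assumption).
    """
    se = math.sqrt(variance(treat) / len(treat) + variance(placebo) / len(placebo))
    z = (mean(treat) - mean(placebo)) / se
    return NormalDist().cdf(z)

def simulate_trial(treat_pool, placebo_pool, n_per_arm, rng, alpha=0.10):
    """Resample both arms with replacement and return True if the trial is positive."""
    treat = rng.choices(treat_pool, k=n_per_arm)
    placebo = rng.choices(placebo_pool, k=n_per_arm)
    half = n_per_arm // 2
    # Interim rule: stop halfway if the treatment arm shows more growth than placebo.
    if mean(treat[:half]) > mean(placebo[:half]):
        return False
    return one_sided_p(treat, placebo) < alpha

def percent_positive(treat_pool, placebo_pool, n_per_arm, replicates, seed=0):
    """Percentage of simulated trials that are positive."""
    rng = random.Random(seed)
    hits = sum(simulate_trial(treat_pool, placebo_pool, n_per_arm, rng)
               for _ in range(replicates))
    return 100.0 * hits / replicates

# Synthetic log ratios: placebo tumors grow, "active drug" tumors shrink slightly.
gen = random.Random(1)
placebo_pool = [gen.gauss(0.10, 0.20) for _ in range(300)]
active_pool = [gen.gauss(-0.10, 0.20) for _ in range(300)]

active = percent_positive(active_pool, placebo_pool, n_per_arm=20, replicates=500)
null = percent_positive(placebo_pool, placebo_pool, n_per_arm=20, replicates=500)
print(f"active vs placebo: {active:.1f}% positive; placebo vs placebo: {null:.1f}% positive")
```

Resampling both arms from the same pool, as in the second call, gives an empirical check that the design holds its nominal false positive rate when there is truly no difference.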
Results
For patients included in the analyses from TARGET, PFS was significantly longer in the sorafenib arm (P < 0.0001 by log-rank test, Supplementary Fig. S1A), consistent with the results of TARGET (16). For patients included in the analyses from the AE 941 trial, there was no significant difference in PFS between the 2 arms (P = 0.68 by log-rank test, Supplementary Fig. S1B). Tables 1 and 2 summarize the results for patients included in the analyses from TARGET and the AE 941 trial, respectively, based on the actual trial data.
Table 3 shows the results of resampling simulations for single-arm and randomized designs using data from TARGET. For a single-arm design with RR by RECIST, 55.2% (95% CI: 53.8–56.6) of simulated trials were positive. Similarly, for a randomized design with RR by RECIST and 20 patients per arm, 55.0% (95% CI: 53.6–56.4) of simulated trials showed a significant benefit for sorafenib compared with placebo. Using log ratio increased the percentage of positive trials to 87.7% (95% CI: 86.8–88.6). Randomized designs with PFS-90 and PFS resulted in a percentage of positive trials that was higher than RR by RECIST but lower than log ratio. Increasing the sample size in randomized designs to 25, 30, and 35 patients per arm gradually increased the percentage of positive trials for all endpoints, but in all cases log ratio outperformed the other endpoints.
Table 4 shows the results of resampling simulations for single-arm and randomized designs using data from the AE 941 trial. For a single-arm design with RR by RECIST, 0.9% (95% CI: 0.6–1.2) of simulated trials were positive. For a randomized design with RR by RECIST and 20 patients per arm, 6.9% (95% CI: 6.1–7.6) of simulated trials showed a significant benefit for AE 941 compared with placebo. Using log ratio increased the percentage of positive trials to 24.7% (95% CI: 23.5–26.0). Again, randomized designs with PFS-90 and PFS resulted in a percentage of positive trials that was higher than RR by RECIST but lower than log ratio.
Discussion
Our study shows the potential utility of a database of phase III trials for subsequent investigation, particularly for clinical trial simulations. We compared phase II designs and endpoints in renal cancer, including an early endpoint obtained after the first CT scan at 6 or 8 weeks and recorded on a continuous scale. Early endpoints were emphasized because of their value for making early decisions about drug efficacy, and the continuous scale was incorporated because it increases statistical efficiency. Because we resampled data from a phase III trial that showed the efficacy of sorafenib in renal cancer, the best designs and endpoints were those that most frequently showed that sorafenib is active in renal cancer. We also resampled data from a phase III trial that showed no efficacy of AE 941 in renal cancer to evaluate the false positive rates with these designs and endpoints. The large number of simulated trials allowed precise estimates of the frequency of a positive result. The stipulated 1-sided α level of 10% is consistent with the more exploratory nature of a phase II study.
The results of our study show that a randomized phase II design with log ratio is preferable to more conventional designs and endpoints for predicting the phase III results of sorafenib in renal cancer. We did not attempt to use endpoints such as log ratio or PFS-90 in a single-arm design, as it would be difficult to specify a null hypothesis from historical data and validity would be compromised by potential selection bias. According to expert consensus guidelines, single-arm phase II monotherapy trials should generally use RR by RECIST as the endpoint (18). For sorafenib, this conventional design with 37 patients has a false negative rate of 45% (55% of simulated trials were positive). Furthermore, a positive single-arm trial means that the null hypothesis (which could correspond to a low RR) can be rejected, but does not necessarily establish that the RR is high (21). Although slightly better with a comparable number of patients, the conventional randomized phase II design (with PFS) with 20 patients per arm still has a false negative rate of 31% for sorafenib in renal cancer. In contrast, the randomized design with log ratio and the same number of patients has a false negative rate of only 12%, and even lower false negative rates with larger sample sizes. To achieve a certain threshold of power (e.g., 80% or 90%) with PFS, one would need a larger sample size and longer trial duration than would be required for a similar trial with log ratio. Assuming that all patients have measurable disease, no additional data are required for calculating log ratio compared with RR by RECIST or PFS.
The randomized design with log ratio, as shown by the AE 941 results, had false positive rates ranging between 25% and 30% (increasing with sample size), which are higher than the rates observed for other designs/endpoints and higher than the 10% type I error rate that would be expected if the drug were truly equivalent to placebo. Patients in the AE 941 arm had slightly less tumor growth at 8 weeks compared with those in the placebo arm (Table 2), suggesting that AE 941 had a small growth inhibitory effect without a corresponding PFS benefit (i.e., this was not a perfectly negative trial). Some activity of AE 941 in metastatic renal cancer is further supported by the observation of 2 objective responses in the original phase II trial (22). Thus, AE 941 may not be a true negative control.
Randomized trials that were simulated by resampling twice from the placebo arm of either trial had the expected false positive rate of less than 10% (data not shown). Log ratio is not a perfect surrogate for PFS and, like any surrogate endpoint, may increase the risk of false positives (23). For drugs with large treatment effects (e.g., RR by RECIST >20%), it is likely that any of the designs evaluated will be positive. However, for active drugs with smaller treatment effects (e.g., sorafenib), a randomized design is most appropriate.
There are a number of limitations of our study. First, we excluded a minority (∼15%) of patients on each trial from our analyses, primarily due to incomplete tumor size data. We addressed this limitation by showing that PFS of included patients was consistent with published results of the trials. Second, we used a small number of lesions to assess tumor burden, a limitation that was necessary to restrict our analyses to consistently measured lesions. Third, our method of imputing tumor size data for patients who had a PFS event before the first CT scan (more of whom were in the placebo arms) might have exaggerated both drug effects, but excluding these patients would have done the opposite. Fourth, the potential use of log ratio is only applicable to disease settings with measurable disease. Fifth, our negative control (AE 941) may actually have some activity. Finally, the results about log ratio apply only to early assessment (at 6 to 8 weeks) during treatment with a growth inhibitory drug for renal cancer, a disease that is known to have a heterogeneous prognosis. These results might not be generalizable to drugs that have different mechanisms of action, or to other disease settings.
In contrast to our results, Fridlyand and colleagues concluded that PFS was a superior phase II endpoint to percentage change in tumor burden by resampling data from 6 phase III trials in colorectal cancer, breast cancer, and non–small cell lung cancer (24). These results emphasize that further analyses of the continuous log ratio endpoint are necessary before it can be broadly accepted. However, log ratio may be useful as a primary endpoint in settings in which an early decision is required or in which early crossover is preferred on ethical grounds.
Also in contrast to our results, An and colleagues concluded that continuous tumor measurement–based metrics were no better than categorical metrics (complete/partial response vs. stable disease vs. progressive disease) using data from 3 phase III trials of cytotoxic therapy in colorectal cancer and non–small cell lung cancer (25). However, the efficiency of continuous measurement–based metrics depends on the quality of the tumor measurement data. Large multicenter trials that are not audited by industry sponsors commonly do not collect full tumor measurement data as prescribed by RECIST. In studies in which the average number of lesions assessed was small, the advantage of continuous assessments over categorical assessments would be expected to diminish. Resolving the contradictions in the literature will require assessing these thresholds in future analyses on a disease-by-disease basis. In addition, An and colleagues (25) did not address power but rather the ability of an early outcome measure to predict subsequent survival using a landmark analysis approach, with predictive ability assessed by the concordance index. The inclusion of patients who progressed or died before the first CT scan in our study is also an important methodologic difference.
In conclusion, this study supports the potential use of an alternative endpoint for randomized phase II trials of growth inhibitory drugs, the log ratio of tumor size at 6 or 8 weeks compared with baseline. This design/endpoint is better than more conventional designs/endpoints for detecting the efficacy of sorafenib, but may increase the risk of false positives with drugs that eventually prove to lack clinical benefit. Additional empirical evidence is needed before definitive conclusions can be reached about whether log ratio is an endpoint that should be used in phase II trials.
Disclosure of Potential Conflicts of Interest
R.R. Bies: employment, Centre for Addiction and Mental Health, University of Toronto; other commercial research support, Eli Lilly and Company; consultant/advisory board, Scientific Advisory Board Metrum Institute, LACDR. M.L. Maitland: other commercial research support, Bayer and EMD Serono; consultant/advisory board, Pfizer, Astellas Pharma and Abbott labs. W.M. Stadler: commercial research grant, Bayer, Novartis, Genentech, GlaxoSmithKline and Pfizer; honoraria from speakers bureau, Pfizer and Bayer; ownership interest, Abbott; consultant/advisory board, Novartis, Pfizer, Genentech, Caremark and Aveo. The other authors disclosed no potential conflicts of interest. The funders did not have any involvement in the design of the study; the collection, analysis, and interpretation of the data; the writing of the manuscript; or the decision to submit the manuscript for publication.
This work was supported by NIH grant T32GM007019 and a Conquer Cancer Foundation Translational Research Professorship award to M.J. Ratain; National Cancer Institute Mentored Career Development Award K23CA124802 to M.L. Maitland.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
The authors thank Bayer Pharmaceuticals for providing the data from TARGET that was used in this research and Drs. Bernard Escudier and Daniel Croteau for providing the data from the AE 941 trial that was used in this research.
Note: Supplementary data for this article are available at Clinical Cancer Research Online (http://clincancerres.aacrjournals.org/).
This article was presented, in part, at the 46th Annual Meeting of the American Society of Clinical Oncology, June 2–5, 2010, Chicago, Illinois.
- Received July 13, 2011.
- Revision received January 6, 2012.
- Accepted January 20, 2012.
- ©2012 American Association for Cancer Research.