Abstract
Purpose: In clinical cancer trials for evaluating neoadjuvant chemotherapy, tumor downstaging is frequently used as a surrogate end point for overall survival. We evaluated the surrogacy of tumor downstaging using data from a follow-up observational study in bladder cancer.
Experimental Design: A total of 586 patients (from 32 Japanese hospitals) who underwent radical cystectomy for invasive bladder cancer (clinical T2 to T4) between 1990 and 2000 were analyzed. We considered changes over time in clinical stage at diagnosis and pathologic stage at cystectomy as a surrogate end point, and survival time after cystectomy as a true end point. First, we developed a new criterion for tumor downstaging. Second, we statistically evaluated surrogacy for the criterion using Prentice's criteria.
Results: To develop the criterion of end points based on tumor downstaging, we selected the best classification among all possible classifications in an attempt to separate prognosis for patients. The hazard ratios after adjustment for prognostic factors in the intermediate effect patients and the poor effect patients were 1.9 (95% confidence interval, 1.0-3.7) and 5.0 (95% confidence interval, 2.6-9.8), respectively, compared with that in the good effect patients. The conditions for correlation and conditional independency of Prentice's criteria were satisfied approximately. Neoadjuvant chemotherapy had a statistically significant tumor downstaging effect, whereas there was no difference in survival between treatment groups.
Conclusions: The tumor downstaging effect could be an appropriate intermediate end point for screening novel neoadjuvant chemotherapy for invasive bladder cancer. The dataset from a follow-up study was useful for evaluating the surrogacy of end points.
 surrogate end point
 observational study
 tumor downstaging
 neoadjuvant chemotherapy
 invasive bladder cancer
Appropriate surrogate end points are critical for developing new therapies through evaluation of biological activity. The surrogate end point is a test, measurement, score, or some other similar variable that is used in place of a clinical event in the design of a trial, or in summarizing results from it. It is used because the variable is believed to be correlated with the clinical event of interest and because of its perceived utility in yielding detectable treatment differences (1). In clinical cancer trials, overall survival is considered to be the most reliable and definitive true end point. However, surrogate end points such as tumor burden outcomes (including objective tumor effect, disease-free survival, and progression-free survival) or biomarkers (including prostate-specific antigen) have been widely used because trials with the true clinical outcome are often longer and larger. In a recent analysis of oncologic drugs in the U.S., 68% (39 of 57) of the regular approvals and all of the 14 accelerated approvals in the last 13 years were based on end points other than overall survival (2). To use a valid and reliable surrogate end point in cancer clinical trials, we should evaluate the surrogacy of end points on a case-by-case basis because the adequacy of a surrogate end point is highly dependent upon the type and/or stage of cancer, and on other available therapies.
For statistical validation of surrogate end points, Prentice (3) proposed the validity criterion that a valid between-group analysis of the surrogate end point also constitutes a valid analysis of the true clinical end point. Freedman et al. (4) showed that these criteria were not straightforward to verify by hypothesis testing. Recently, Buyse et al. (5) have proposed two new measures, termed “relative effect” and “adjusted association.” However, to explore the validity of a surrogate end point by these measures, we have to combine information from several randomized clinical trials testing the effect of a treatment on both the surrogate and the true end points (6). In practice, we rarely have information about both end points from even a single randomized clinical trial before designing a future clinical trial for new agents. Such situations have motivated us to assess the surrogacy of end points using available information other than randomized studies. In clinical trials for evaluating neoadjuvant chemotherapy in bladder cancer, “tumor downstaging” is frequently used as a surrogate end point for overall survival. Clinical staging with transurethral resection (TUR) is very important in treatment planning and prognosis. However, the reliability of TUR staging is a problem. The disparity between clinical and pathologic staging may be caused by repeat TUR, i.e., TUR effect, and measurement error (7). We developed a new criterion of tumor downstaging effect and evaluated the surrogacy of tumor downstaging using data from a follow-up observational study in invasive bladder cancer.
Patients and Methods
A total of 1,131 patients who underwent radical cystectomy for invasive bladder cancer between 1990 and 2000 at 32 Japanese hospitals were retrospectively registered (8). The information that was collected from the medical records included age, gender, histology, clinical staging, and pathologic staging according to the tumor-node-metastasis classification (9), and the presence of perioperative systemic chemotherapy. In the present study, 586 patients who had clinical stage T2 to T4, N0, M0, transitional cell carcinoma, and who were less than 80 years old were included.
Figure 1 shows a schema of treatment group comparison. The patients were divided into two treatment groups, i.e., the neoadjuvant chemotherapy (NAC) group and the no neoadjuvant chemotherapy (non-NAC) group. After the clinical staging was done based on diagnostic TUR, chemotherapy followed by radical cystectomy was done in the NAC group, and only cystectomy was done in the non-NAC group. More precise pathologic staging was done at the time of cystectomy.
Statistical analysis. Prentice's criterion for evaluating the surrogacy of end points is a set of four conditions as follows (3, 5, 10):
PC1: f(T | Z) ≠ f(T) so the treatment affects the distribution of T,
PC2: f(S | Z) ≠ f(S) so the treatment affects the distribution of S,
PC3: f(T | S) ≠ f(T) so the surrogate affects the distribution of T,
PC4: f(T | S, Z) = f(T | S) so that conditionally on S, T is independent of Z.
where, for example, f(T | Z) is the conditional distribution of the true end point T given the treatment assignment Z, and S is the surrogate end point. In the present study, the treatment Z is set to 0 for the non-NAC group and 1 for the NAC group. The candidate surrogate end point S is a tumor downstaging effect based on the difference between clinical stage and pathologic stage and the true end point T is overall survival after cystectomy. Therefore, in this setting, the PC1 means that neoadjuvant chemotherapy must affect overall survival, PC2 means that neoadjuvant chemotherapy must affect tumor downstaging, PC3 means that tumor downstaging must be correlated with overall survival, and PC4 means that tumor downstaging must fully capture the net effect of neoadjuvant chemotherapy on overall survival.
The survival curves were estimated with the Kaplan-Meier method. The Cox proportional hazards model was used to estimate hazard ratios (HR) after adjustment for covariates. All statistical analyses were done by using SAS version 8.02 (SAS Institute, Inc., Cary, NC).
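As an illustration of the Kaplan-Meier (product-limit) estimator named above, the following is a minimal Python sketch. It is not the SAS procedure used in the study; the function name `kaplan_meier` and the toy follow-up times are hypothetical and are not taken from the study data. Patients censored at a given time are treated as still at risk for deaths occurring at that same time, which is the usual convention.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct death time (product-limit estimator).

    times  -- follow-up time for each patient
    events -- 1 if death was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    leaving = Counter(times)  # every patient leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d > 0:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]  # remove deaths and censored patients at t
    return curve


# Hypothetical toy data (months after cystectomy), for illustration only.
toy_times = [6, 12, 12, 20, 30, 30, 45, 60]
toy_events = [1, 1, 0, 1, 0, 1, 0, 0]
km_curve = kaplan_meier(toy_times, toy_events)
for t, s in km_curve:
    print(f"t={t:>3}  S(t)={s:.3f}")
```

The curve only steps down at observed death times; censored observations shrink the risk set without changing the estimate at that time.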
Results
A total of 586 patients [481 men (82%) and 105 women (18%)], with a mean age of 65.2 years (range, 33-80 years), were treated with radical cystectomy with bilateral lymph node dissection. Of the 586 patients, 183 (31%) were treated with neoadjuvant chemotherapy. The combination of methotrexate, vinblastine, doxorubicin, and cisplatin was used as neoadjuvant chemotherapy in 43% of patients, for 1.5 cycles on average. The other patients were treated with modified cisplatin-based regimens including methotrexate, epirubicin, and cisplatin; cisplatin, cyclophosphamide, and doxorubicin; and cisplatin, adriamycin, and methotrexate, as well as other miscellaneous regimens (11–15). The distributions of prognostic factors in treatment groups were as follows: mean patient age was 65.8 years (SD, 8.8) and 63.7 years (SD, 8.6) in the non-NAC and NAC groups, respectively. The proportion of patients with positive lymph node involvement was slightly higher in the non-NAC group (17.4%) than in the NAC group (14.2%), but the proportion with clinical T3 or T4 was much higher in the NAC group (70.5%) than in the non-NAC group (49.6%). The proportions of patients receiving postoperative chemotherapy were similar in both groups, i.e., 23.1% in the non-NAC group and 23.0% in the NAC group.
Development of tumor downstaging effect criterion. We estimated HRs on the overall survival after cystectomy by 10 combinations of clinical and pathologic stage after adjustment for age, lymph node involvement, and adjuvant chemotherapy (Table 1). The estimated HRs by treatment group were similar to those in all cases. First, the 10 combinations were ordered according to the size of HR [1, T2 to P0/1 (HR, 1); 2, T3/4 to P0/1 (HR, 1.5); 3, T2 to P2a (HR, 1.9); 4, T3/4 to P2a (HR, 2.2); 5, T2 to P2b (HR, 2.4); 6, T2 to P3 (HR, 4.3); 7, T3/4 to P2b (HR, 4.6); 8, T3/4 to P3 (HR, 5.3); 9, T3/4 to P4 (HR, 5.3); 10, T2 to P4 (HR, 11.1)] in all cases. Second, we selected the best classification among all possible classifications in an attempt to separate the prognosis of patients with respect to Akaike's information criterion. The total number of examined classifications was 45: 9 for two categories (good/poor) and 36 for three categories (good/intermediate/poor). For example, the examined classifications were 1 (good) versus 2 to 10 (poor), 1 to 2 versus 3 to 10,…, 1 to 9 versus 10 for two categories, and 1 (good) versus 2 (intermediate) versus 3 to 10 (poor), 1 versus 2 to 3 versus 4 to 10, 1 versus 2 to 4 versus 5 to 10,…, 1 to 8 versus 9 versus 10 for three categories.
As a result, patients were classified into three categories, i.e., good effect (1, T2 to P0/1), intermediate effect (2-5, T2 to P2a/2b or T3/4 to P0/1/2a), and poor effect (6-10, T2 to P3/4 or T3/4 to P2b/3/4). Survival curves according to the tumor downstaging effect are shown in Fig. 2A. The HRs in the intermediate effect patients and the poor effect patients were 1.9 [95% confidence interval (CI), 1.0-3.7] and 5.0 (95% CI, 2.6-9.8), respectively, compared with that in the good effect patients after adjustment for age, lymph node involvement, and adjuvant chemotherapy. The risks by tumor downstaging effect were similar between treatment groups (Fig. 2B and C).
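The count of 45 candidate classifications follows from splitting the 10 HR-ordered combinations into contiguous groups: 9 possible single cut points for two categories and C(9, 2) = 36 pairs of cut points for three categories. The Python sketch below enumerates them; the function name `contiguous_classifications` is hypothetical, and the step of fitting a Cox model to each candidate and comparing Akaike's information criterion is not reproduced here.

```python
from itertools import combinations


def contiguous_classifications(n_items, n_groups):
    """Split the ordered items 1..n_items into n_groups contiguous groups."""
    result = []
    # Choosing n_groups - 1 cut points among the n_items - 1 gaps
    # defines one classification.
    for cuts in combinations(range(1, n_items), n_groups - 1):
        bounds = (0,) + cuts + (n_items,)
        groups = [tuple(range(lo + 1, hi + 1))
                  for lo, hi in zip(bounds, bounds[1:])]
        result.append(groups)
    return result


two = contiguous_classifications(10, 2)    # good / poor
three = contiguous_classifications(10, 3)  # good / intermediate / poor
print(len(two), len(three), len(two) + len(three))

# The classification reported in the text: 1 | 2-5 | 6-10
selected = [(1,), (2, 3, 4, 5), (6, 7, 8, 9, 10)]
print(selected in three)
```

Restricting the search to contiguous groups keeps the categories ordered by risk, so each candidate remains interpretable as good, intermediate, or poor effect.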
Statistical evaluation for surrogacy of the end point. It is obvious that to fulfill the PC3 condition, tumor downstaging must be correlated with overall survival because we selected the tumor downstaging in such a way that the patients can be classified based on their overall survival. To verify the PC4 condition that tumor downstaging must fully capture the net effect of neoadjuvant chemotherapy on overall survival, it is usually stated that the coefficient corresponding to treatment effect corrected for tumor downstaging is required to be equal to zero. The HRs between treatment groups by tumor downstaging effect, pooled HR and their 95% CIs were estimated after adjustment for age, lymph node involvement, and adjuvant chemotherapy (Table 2). The estimated pooled HR was 1.06 (95% CI, 0.77-1.47) when stratifying by tumor downstaging effect. Although the nonsignificance of the test in which HR = 1 does not prove the PC4 condition, it was suggested that PC4 might be plausible in this study because the pooled HR was close to 1.
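To make the PC4 check concrete, the reported pooled HR and its CI can be converted back into an approximate test of a zero treatment coefficient. The Python sketch below assumes a Wald-type interval that is symmetric on the log-HR scale, which is an assumption on our part and is not stated in the analysis; the function name `wald_from_ci` is hypothetical. Because the inputs are rounded, the result is only approximate.

```python
import math
from statistics import NormalDist


def wald_from_ci(hr, lower, upper, level=0.95):
    """Back out SE(log HR), the z statistic, and the two-sided P value
    from a reported hazard ratio and its confidence interval."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    # Assumed: the interval is log(HR) +/- z_crit * SE.
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(hr) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return se, z, p


# Pooled HR stratified by tumor downstaging effect, as reported in the text.
se, z, p = wald_from_ci(1.06, 0.77, 1.47)
print(f"SE(log HR)={se:.3f}  z={z:.2f}  P={p:.2f}")
```

A small z statistic here is compatible with PC4 but, as noted above, does not prove it; the width of the interval shows how large a residual treatment effect the data could still accommodate.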
Because the data are not from randomized trials, strictly speaking, the inference for treatment comparison is not valid and thus the PC1 and PC2 conditions cannot be evaluated. However, we attempted to verify the PC1 and PC2 conditions after adjustment for the confounding factors. For evaluating PC2, we used the Cochran-Mantel-Haenszel statistic with rank score, i.e., the stratum-adjusted Wilcoxon test, because of the imbalance of clinical stage distribution between treatment groups. The effect of neoadjuvant chemotherapy on tumor downstaging effect was statistically significant (χ^{2} = 16.1, P = 0.001; Table 3).
To evaluate the PC1 condition, we compared the overall survival between treatment groups by clinical stage. In clinical stage T2, the treatment effect was not statistically significant (HR, 0.87; 95% CI, 0.44-1.70) after adjustment for age, lymph node involvement, and adjuvant chemotherapy. Similarly, in clinical stage T3 or T4, the treatment effect was not statistically significant (HR, 0.98; 95% CI, 0.67-1.43).
Discussion
In this study, we proposed a new tumor downstaging criterion based on prognosis in invasive bladder cancer patients for evaluating neoadjuvant chemotherapy. Objective tumor response has been a widely accepted measure of cancer chemotherapy activity. According to international standards, including WHO criteria (16) and Response Evaluation Criteria in Solid Tumors (17), patients are usually classified into either responders (complete response or partial response) or nonresponders (no change or progressive disease). The objective tumor response can be assessed even in single-arm studies. However, in the NAC group of the present study, overall survival did not differ between responders and nonresponders to neoadjuvant chemotherapy (adjusted HR, 1.09; 95% CI, 0.59-2.03; Fig. 3). Therefore, objective tumor response might not be a valid surrogate end point for evaluating neoadjuvant chemotherapy in invasive bladder cancer.
Some investigators defined the criterion for tumor downstaging (7, 18). However, few data were available with regard to clinical staging and pathologic staging for patients who were treated with or without neoadjuvant chemotherapy, and no definite criterion has been developed based on the prognosis of patients. In the present study, the HRs among clinical stages were different even on the same pathologic stage, especially on P_{0/1} and P_{2b} in the non-NAC group (Table 1). This suggests that unmeasurable components, including the clinician's subjective judgment on clinical stage, might reflect different prognoses. With regard to tumor downstaging in invasive bladder cancer, it is questionable to generalize the findings to other cancers because downstaging can occur without chemotherapy when the tumor is removed by the diagnostic TUR (7). In addition to the TUR effect, misclassification in the staging system, called staging error, has to be considered. In the present study, the proportion of good downstaging effect was 29% even in the non-NAC group. This means that a control group is essential for evaluating therapies in invasive bladder cancer if the tumor downstaging effect is used as an end point of clinical trials.
We statistically evaluated the surrogacy of the end point using data from a follow-up observational study. Prentice's criterion was useful for that purpose, especially for the evaluation of PC3 (correlation) and PC4 (conditional independency). In the present study, the PC3 and PC4 conditions were satisfied approximately. Although the study is not a randomized trial, it is suggested that the neoadjuvant chemotherapy affects tumor downstaging, i.e., PC2 (tumor downstaging benefit) is acceptable, but the treatment does not affect overall survival, i.e., PC1 (survival benefit) is unacceptable. Our data thus give an actual example of the hypothetical situations described in other articles (5, 10), which showed that PC2 does not imply PC1. As another actual case, a randomized trial for locally advanced bladder cancer concluded that the survival benefit of neoadjuvant chemotherapy was of borderline statistical significance (P = 0.06), whereas the tumor downstaging effect was statistically significant (P = 0.001; ref. 7). Do the inconsistent results between PC1 and PC2 depend on differences in statistical power for evaluating these conditions? We calculated the power of two kinds of statistical tests, i.e., the Wilcoxon rank-sum test for tumor downstaging effect and the log-rank test for overall survival, based on our data. If the expected proportions of downstaging effect are 0.50 (good), 0.39 (intermediate), and 0.11 (poor) in the NAC group and 0.29 (good), 0.55 (intermediate), and 0.16 (poor) in the non-NAC group from the data in clinical stage T2, a sample size of 96 in each group will have 80% power to reject the null hypothesis using a Wilcoxon rank-sum test with a 0.05 two-sided significance level (19). On the other hand, if the expected 5-year survival probability in the non-NAC group is 0.5, 0.6, or 0.7 and the HR is 0.87, the corresponding sample size in each group will be 1,595, 2,004, or 2,683, respectively, using a 0.05 level two-sided log-rank test for equality of survival curves (20).
The difference in statistical power is critical for evaluating the PC1 and PC2 conditions. In two recently published studies, survival for patients treated with neoadjuvant methotrexate, vinblastine, doxorubicin, and cisplatin was superior to that for patients treated with cystectomy alone, with an HR of 0.74 (95% CI, 0.55-0.99) in a randomized trial (7, 21), and platinum-based combination chemotherapy showed a survival benefit with an HR of 0.87 (95% CI, 0.78-0.97) in a meta-analysis of individual patient data (22). These results suggest that the HR we assumed for the power calculation is plausible. However, an important question for implementing neoadjuvant chemotherapy for patients with invasive bladder cancer remains, i.e., how do we select the appropriate patients for combination therapy (23).
Buyse et al. (5, 6) have emphasized that we have to combine information from several randomized clinical trials testing the effects of treatment on both surrogate and true end points to explore the validity of a surrogate end point. In practice, we must assess the surrogacy of a candidate end point without data from a randomized trial, because the primary objective of a randomized trial will often be to evaluate survival benefit; if the survival benefit were already known to be true, then one would have to question the value of conducting such a study. Nonetheless, the purpose of the evaluation of surrogacy should be restricted to identifying “appropriate intermediate end points” (10). Fleming et al. (24) also pointed out that surrogate end points can be useful in phase 2 screening trials for identifying whether a new intervention is biologically active and for guiding decisions about whether the intervention is promising enough to justify a large definitive trial with clinically meaningful outcomes. The basic premise is that we cannot predict a treatment effect on the true end point from the effect on the surrogate end point. In conclusion, the tumor downstaging effect could be an appropriate intermediate end point in phase 2 trials for screening novel neoadjuvant chemotherapy in invasive bladder cancer. The dataset from a follow-up study was useful for evaluating the surrogacy of end points.
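The contrast in required sample size can be roughly illustrated with Schoenfeld's approximation for the number of deaths needed by a log-rank test. This is a sketch under stated assumptions and is not the method of reference 20: we assume 80% power (the text states this level only for the Wilcoxon calculation), 1:1 allocation, proportional hazards, and complete follow-up of every patient to 5 years with no other censoring. The function names are hypothetical. The output is therefore of the same order as, but not expected to match, the figures quoted above.

```python
import math
from statistics import NormalDist


def schoenfeld_events(hr, alpha=0.05, power=0.80):
    """Total deaths needed by a two-sided log-rank test with 1:1 allocation
    (Schoenfeld's approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return 4 * (z_alpha + z_beta) ** 2 / math.log(hr) ** 2


def patients_per_group(hr, control_survival, alpha=0.05, power=0.80):
    """Convert required deaths into patients per group.

    Assumed: proportional hazards, so treated survival is
    control_survival ** hr, and all patients are followed to the time
    point at which control_survival is quoted.
    """
    treated_survival = control_survival ** hr
    mean_death_prob = 1 - (control_survival + treated_survival) / 2
    total = schoenfeld_events(hr, alpha, power) / mean_death_prob
    return math.ceil(total / 2)


events = schoenfeld_events(0.87)
print(f"deaths required: {events:.0f}")
for s0 in (0.5, 0.6, 0.7):
    print(f"5-year survival {s0}: about {patients_per_group(0.87, s0)} per group")
```

Under these assumptions the survival comparison needs well over ten times as many patients per group as the 96 quoted for the downstaging comparison, which is the point of the power argument.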
Acknowledgments
We thank the hospitals associated with Kyoto University, Nara Medical University, and Nagoya University for providing clinical data.
Footnotes

Grant support: Grant-in-Aid for Scientific Research on Priority Areas from the Ministry of Education, Culture, Sports, and Technology of Japan.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
 Accepted October 10, 2005.
 Received July 24, 2005.
 Revision received September 17, 2005.