Abstract
In this report, we use new patient data to test three popular models developed to predict the outcome of definitive radiation therapy. The data come from 240 men with localized prostate cancer who were treated with definitive radiation therapy at a community hospital. All three models tested were based on three commonly available variables: pretreatment prostate-specific antigen (PSA), Gleason score, and tumor stage. We used the Cox proportional hazards model and the logistic regression model to relate these variables to outcome. We discovered that in our data, the optimal way to use pretreatment PSA was as natural log(PSA), the optimal way to use T stage was in three categories (T_{1} and T_{2}, T_{3}, and T_{4}), and the optimal use of Gleason score was as <7 versus ≥7. Nevertheless, models confined to the optimal use of these three variables leave much uncertainty about important outcomes, such as the probability of relapse within 5 years.
INTRODUCTION
Men with carcinoma of the prostate and their physicians must choose among several treatments or even between treatment and “watch and wait.” Relative to some cancers, the situation is good because the treatments are effective, and for many men the tumor does not shorten survival. For example, both surgery and definitive radiotherapy achieve long disease-free intervals for men with localized tumor, and there are several effective hormonal treatments for patients not curable by local treatment or for those who suffer a relapse. To make the situation even better, carcinoma of the prostate enjoys the best serum tumor marker of any malignancy, i.e., PSA (1). Use of PSA has not only assisted the diagnosis of prostate cancer but has also significantly improved our ability to evaluate tumor stage and to predict the likelihood of success of several treatments. For example, many have formed statistical models using pretreatment values of PSA to predict the outcome of either prostatectomy or definitive radiotherapy (2–36). Finally, after treatment, serial measurements of PSA have proven useful for evaluating the effectiveness of treatment, and rises in PSA have often preceded other measures of tumor recurrence (37).
Developing models to predict the outcome of prostatectomy has the advantage of using pathology results for quick and objective outcomes, which often consist of binary events such as the absence or presence of tumor in lymph nodes, in seminal vesicles, or at surgical margins (6, 7). For these outcomes, the logistic regression model is appropriate (38, 39). Continuous outcomes such as tumor volume can be modeled using the general linear model (40). Both of these approaches allow their models to be easily validated with new data, a step that is necessary before they are used in the clinic. When tested, some models have been found to explain a limited amount of variation and have not always validated well with new data (41–43). Part of the difficulty may be due to a complex relationship between serum PSA and tumor volume.
When we consider prediction of outcomes for definitive radiation therapy, the modeling becomes more difficult. For example, there are no quick, objective outcomes analogous to the pathological observations on prostatectomies. Instead, the earliest outcome is a rising PSA, which has proven difficult to define. Instead of a binary event, the dependent variable in the statistical analysis is most often time to PSA failure. Because many of these patients are either cured or die of other causes or because there is not sufficient time to follow all patients to failure, the event of PSA failure is observed for just a fraction of the patients (the uncensored ones), and the remaining patients must be considered “censored.” Thus, it takes more time and more patients to effectively model the outcome of definitive radiation therapy, and details such as the number of uncensored patients and the length of follow-up time are critical to the success of the modeling. Table 1 gives some of these relevant details for several of the published models (20–28, 30–32). These data demonstrate how important length of follow-up is, because as the publication years increase, so do the numbers of uncensored patients. Because the usual way to analyze the time to treatment failure is with the Cox proportional hazards model (44), the number of uncensored patients is critical. The uncensored patients determine the numerator of the likelihood function, which is critical to the Cox model (44, 45), so that studies with few uncensored patients are not likely to form useful models, especially when there are multiple prognostic variables. The median length of follow-up is also critical because shorter durations of follow-up generally produce fewer uncensored patients. Because PSA is a continuous variable, it can be used in its original units, broken into several discrete levels, or as its logarithm (log). Similarly, grade and stage can be broken into various levels.
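The dependence of the Cox model on uncensored patients can be illustrated with a minimal sketch of the partial log-likelihood. The function name and the toy inputs are hypothetical, and ties among event times are handled in the simple Breslow manner, which is an assumption made only for illustration. Each uncensored patient contributes one term, whereas a censored patient appears only in the risk sets of others.

```python
import math
from typing import Sequence


def cox_partial_loglik(times: Sequence[float],
                       events: Sequence[int],
                       scores: Sequence[float]) -> float:
    """Partial log-likelihood for given hazard scores (log relative hazards).

    times  : time to relapse or to last follow-up
    events : 1 if relapse was observed (uncensored), 0 if censored
    scores : linear predictor for each patient
    Ties are treated with the Breslow approximation (an assumption).
    """
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored patients contribute no term of their own
        # Risk set: everyone still under observation at time t_i
        risk = sum(math.exp(s) for t, s in zip(times, scores) if t >= t_i)
        total += scores[i] - math.log(risk)
    return total


def number_of_likelihood_terms(events: Sequence[int]) -> int:
    """Number of terms in the partial likelihood = number of uncensored patients."""
    return sum(1 for d in events if d)
```

With few observed relapses the sum has few terms, which is one way to see why studies with short follow-up support only small models.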
Because the Cox model provides a relative hazard rather than an absolute hazard, validating these models with new data is not straightforward. One can validate the choice of variables as well as how they are used, i.e., through discrete levels or as continuous variables. Furthermore, if the model provides sufficient details, one can also calculate the model’s relative hazard function and compare this to observed time to failure or to discrete outcomes of failure over key time intervals. In this study, we used previously unpublished data from a community hospital to demonstrate the importance of the routine prognostic variables of PSA, clinical stage, and grade and how their several transformations relate to time to tumor relapse after definitive radiotherapy. We then form an optimal model for these data using PSA, clinical stage, and histological grade and compare this model to three previously published ones. Finally, we present new results regarding the limitations of models based solely on PSA, clinical stage, and grade.
PATIENTS AND METHODS
This study is based on 240 men with prostate cancer who were treated with definitive radiation therapy at Moore Regional Hospital in Pinehurst, NC. Table 2 summarizes their characteristics, including their age, Tumor-Node-Metastasis stage (46), Gleason tumor grade (47), radiation dose, median and range of follow-up in months, number of patients followed to relapse (i.e., uncensored), and median and range of time to relapse. Patients were treated with external beam radiation delivered to the prostate, seminal vesicles, and periprostatic area through anteroposterior and lateral fields to a dose of 4500 cGy at the rate of 180 cGy/fraction. Then the field was reduced slightly, and the remainder of the treatment was delivered at the rate of 200 cGy/fraction. The patients were given neither neoadjuvant nor adjuvant androgen ablation therapy.
The models that we focused on for this analysis are the three in Table 1 that appear to include over 100 uncensored patients. As the table indicates, these three used PSA as four levels, as three levels, or with the logarithm transformation. The models categorized grade into either two, three, or four levels, and all three models used clinical stage as two levels.
Definition of Relapse.
Patients were considered treatment failures if they suffered biochemical relapse, required hormonal treatment (after radiation therapy), or were thought to have died of tumor. Biochemical relapse was defined as a sequence of two rises in PSA (i.e., a sequence of three increasing values of PSA) after the nadir, with the third value >1.4 ng/ml. In addition, we counted as relapsed a few patients with a postnadir PSA >9 ng/ml but for whom there were not three recorded values of PSA after the nadir. For these circumstances, the time of relapse was set as midway between the time of nadir and the first of the rising PSA levels.
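The biochemical relapse rule can be sketched in code as follows. This is only one possible reading of the definition: the function name and data layout are hypothetical, the nadir is assumed to be the lowest recorded post-treatment PSA (earliest one if tied), the three rising values are assumed to be consecutive measurements strictly after the nadir, and the midpoint rule for the time of relapse is applied to both kinds of relapse, which the text does not state explicitly.

```python
from typing import List, Optional, Tuple


def biochemical_relapse_time(series: List[Tuple[float, float]],
                             third_value_min: float = 1.4,
                             high_psa: float = 9.0) -> Optional[float]:
    """Return an estimated relapse time in months, or None if no relapse.

    series: (months_since_treatment, psa_ng_per_ml) pairs sorted by time.
    """
    if not series:
        return None
    # Assumption: nadir = lowest PSA value; min() keeps the earliest if tied.
    nadir_idx = min(range(len(series)), key=lambda i: series[i][1])
    nadir_time = series[nadir_idx][0]
    after = series[nadir_idx + 1:]

    # Rule 1: three consecutive increasing values, the third above 1.4 ng/ml.
    for i in range(len(after) - 2):
        a, b, c = after[i][1], after[i + 1][1], after[i + 2][1]
        if a < b < c and c > third_value_min:
            # Midway between the nadir and the first rising value
            return (nadir_time + after[i][0]) / 2.0

    # Rule 2: fewer than three post-nadir values, but one exceeds 9 ng/ml.
    if len(after) < 3:
        for t, psa in after:
            if psa > high_psa:
                return (nadir_time + t) / 2.0
    return None
```

For example, a patient with a nadir of 0.8 ng/ml at 6 months followed by values of 1.0, 1.5, and 2.2 ng/ml at 9, 12, and 15 months would be assigned a relapse time of 7.5 months under these assumptions.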
Statistical Methods.
We used the Cox proportional hazards model (44) to relate time to relapse as a dependent variable to the independent variables of pretreatment PSA, stage, and grade. The software we used was the COXPH program in the S-PLUS software package (MathSoft, Inc., Seattle, WA). To compare models, we used the overall model χ^{2} statistic based on the likelihood ratio as well as the independent variables’ Ps. To test prior models, we incorporated their reported optimal linear functions of PSA, clinical stage, and grade into the Cox model analysis of our data. If for simplicity we call the logarithm of the relative hazard function from the Cox model a “hazard score” (hs), then in the Cox model we can write the hazard (h) relative to a baseline hazard (h0) according to the following equation: h = h0 × exp(hs), or equivalently hs = log(h/h0). In the setting of these models, hs is a linear function of levels of PSA, clinical stage, and Gleason score. Thus, when a reported model provided the coefficients for the linear relationship, we used these to calculate hs and then tested this as a variable in the Cox analysis. We normalized the hs to our data by subtracting its mean value for our patients. Because the hazard index used by Zagars et al. (28) is the ratio h/h0, we could derive the hs as the logarithm of h/h0. Specifically, we formed an algorithm to calculate h/h0 based on their reported relationships between h/h0 and PSA, stage, and grade and then verified that the h/h0 calculated by the algorithm agreed closely with the results of their report. Then we took the logarithm of h/h0 to be hs.
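Because hs is defined as the logarithm of the relative hazard, h/h0 = exp(hs), and the two steps described here (taking the logarithm of a reported h/h0 and subtracting the mean over the patients) are simple to sketch. The function names and the example ratios below are hypothetical and are not values from any of the cited models.

```python
import math
from typing import List, Sequence


def hazard_scores_from_ratios(hazard_ratios: Sequence[float]) -> List[float]:
    """Convert relative hazards h/h0 into hazard scores hs = log(h/h0)."""
    if any(r <= 0 for r in hazard_ratios):
        raise ValueError("relative hazards must be positive")
    return [math.log(r) for r in hazard_ratios]


def normalize_scores(scores: Sequence[float]) -> List[float]:
    """Center hazard scores by subtracting their mean in the study population."""
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]


# Illustrative relative hazards for three hypothetical patients
example_hs = normalize_scores(hazard_scores_from_ratios([1.0, math.e, math.e ** 2]))
```

Centering changes only the baseline hazard against which each patient is compared; differences in hs between patients are unaffected.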
Although the Pisansky model did not produce an hs, it did give a rule that was a linear combination of PSA, stage, and grade as follows:
Here, T is 0 if the clinical stage is either T_{1} or T_{2} and 1 if it is either T_{3} or T_{4}. G7 is 0 if Gleason score is <7 and otherwise equals 1. After normalizing the Rule to our population, we used this as an hs in the Cox model. When there were discrete categories of PSA, stage, or grade, we used dummy variables for these. For example, if the discrete category of PSA between 4 and 10 ng/ml is used, then the dummy variable value would be 1 when PSA was between these points and otherwise 0. Finally, we used the logistic regression model to relate the optimal hs to the discrete outcome of PSA failure (38, 39).
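The coding of covariates described above can be sketched as follows. The function names are hypothetical, the default cutpoints of 4, 10, and 20 ng/ml are used only as an example of three cutpoints, and the handling of values that fall exactly on a cutpoint (assigned to the higher level) is an assumption.

```python
import math
from bisect import bisect_right
from typing import Dict, List, Sequence


def psa_level_dummies(psa: float,
                      cutpoints: Sequence[float] = (4.0, 10.0, 20.0)) -> List[int]:
    """One 0/1 dummy per PSA level above the lowest (reference) level."""
    level = bisect_right(cutpoints, psa)  # 0 = below the first cutpoint
    return [1 if level == k else 0 for k in range(1, len(cutpoints) + 1)]


def continuous_model_covariates(psa: float, t_stage: int, gleason: int) -> Dict[str, float]:
    """Covariates for a model using log(PSA), T3 and T4 dummies, and G7.

    t_stage is the clinical T stage as an integer from 1 to 4; T1 and T2
    together form the reference category.
    """
    if psa <= 0:
        raise ValueError("PSA must be positive to take its logarithm")
    return {
        "log_psa": math.log(psa),                # natural logarithm of PSA
        "T3": 1.0 if t_stage == 3 else 0.0,
        "T4": 1.0 if t_stage == 4 else 0.0,
        "G7": 1.0 if gleason >= 7 else 0.0,      # Gleason score <7 versus >=7
    }
```

A patient with a PSA of 7 ng/ml would have the dummies [1, 0, 0] under the cutpoint coding, whereas the continuous coding would use log(7) directly.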
RESULTS
Optimum Use of PSA, Clinical Stage, and Grade.
We compared three ways of using pretreatment PSA as a predictive variable for time to relapse: with two cutpoints, with three cutpoints, or as the natural logarithm of PSA. Three cutpoints in PSA produce four levels, e.g., PSA <4 ng/ml, PSA between 4 and 10 ng/ml, PSA between 10 and 20 ng/ml, and PSA >20 ng/ml (28), and this also produces three dummy variables, each taking a value of either 0 or 1 (see “Patients and Methods”). Two cutpoints in PSA yield three levels or two dummy variables. The results appear at the top of Table 3 (part 1). Here we see the model χ^{2} statistics for three models, the first using the PSA cutpoints of Zagars et al. (28), the second using the cutpoints of Green et al. (32), and the third using log(PSA) as suggested by Pisansky et al. (30). There is one P for each dummy variable. The best model (highest χ^{2}) was the one using log(PSA), and this provided the lowest individual P for PSA.
Part 2 of Table 3 shows our analyses of clinical stage. In this part, the three listed models analyzed clinical stage as two categories, three categories, or four categories, and this was accomplished through the use of dummy variables T_{2}, T_{3}, and T_{4}. Each dummy variable takes the value of 1 only if the tumor is in that stage; otherwise, it is 0 (see footnote to Table 3). The results show that splitting T_{3} and T_{4} stages into separate variables improved slightly on the model that used just a binary stage categorization of T_{1} and T_{2} versus T_{3} and T_{4}, because the model χ^{2} statistic rose from 58.2 to 60.8. Although we had just three patients with stage T_{4}, the P for this variable was 0.013, persuading us to separate T_{3} from T_{4} at least until further data become available. Nevertheless, splitting T_{1} and T_{2} stages did not improve the model any further, as indicated by the lack of further increase in the χ^{2} statistic as well as by the individual P of 0.55 for the dummy variable T_{2}.
Finally, part 3 of Table 3 shows our analyses of histological grade. The three listed models used the Gleason score as either a two-, three-, or four-category variable. The results show that any one of these yielded a slightly improved model compared with those in part 2, but without producing individual Ps < 0.05. Neither the three- nor the four-category grade scores appeared better than a two-category score. The details for the final and best Cox model for these data are given in Table 4. Gleason score is included there as a binary variable, although its P is borderline, because it has been found so important in other studies. Radiation dosage was not a significant variable in this analysis (P = 0.54).
Comparison with Models of Zagars, Green, and Pisansky.
Next, we compared the model of Table 4 to those of Zagars et al. (28), Green et al. (32), and Pisansky et al. (30). Specifically, we calculated hs from the linear combinations of PSA, stage, and grade that had been found to be optimal by these three previously published models. Figs. 1, 2, and 3 show how these values of hs compared with the hs obtained from the model of Table 4. In Figs. 1 and 2, we see that there is a noisy relationship between the two hazard scores, largely because the ones on the vertical axes used limited levels of PSA and the one on the horizontal axis used PSA as a continuous variable. By contrast, Fig. 3 demonstrates a relatively close linear relationship between the Pisansky rule and the optimal hazard score. Table 5 compares the Cox model χ^{2} statistics obtained with these three models for hs against the χ^{2} for the model of Table 4. We see that although the model χ^{2} statistics of the Zagars model and Green model were large and their hs significantly associated with time to relapse in our patients (model Ps < 1.0 × 10^{−11}), their χ^{2} statistics were lower than those of either the Pisansky model or the optimal model of Table 4, and they also required more coefficients. In other words, both the Pisansky model and the one of Table 4 explained more of the variance in the data than did the Zagars model or the Green model.
Model Fit and Residual Noise.
Although it is useful to identify variables that are prognostic for tumor recurrence, i.e., ones with low Ps in the Cox model, this result is not by itself sufficient. Published tables of prognostic variables, their Ps, and coefficients do not tell the whole story. To prove that the resulting multivariate model is useful, we needed to see how well it predicts observable events and how well it fits the data. Unless a model relates to observable events, it cannot help patients and their physicians make decisions about treatment. Furthermore, a good model should allow one to discriminate good from bad outcomes.
One way to demonstrate how well a model fits the data on outcomes is to examine the residuals for the Cox model, which are often given as the deviance residual (48). Such residuals provide a measure of the difference between observed events and what the model predicts. Fig. 4 shows a plot of these versus time from treatment for our best model of Table 4. If the model fits the data well, the residuals should follow a horizontally flat line located at 0 on the vertical axis. What we see instead on the graph is a curved line, which is higher at the beginning and lower later. The higher trend of residuals for earlier times implies that the model underpredicts relapses early in follow-up, and the trend of the line below 0 later indicates that it overpredicts relapses later.
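For readers unfamiliar with deviance residuals, the following sketch shows the standard textbook form, computed from each patient's event indicator and the cumulative hazard that the fitted model assigns to that patient at the observed time. The function names are hypothetical, and the cumulative hazard is taken as a given input rather than estimated here.

```python
import math


def martingale_residual(event: int, cum_hazard: float) -> float:
    """Observed events (0 or 1) minus the number expected under the model."""
    return event - cum_hazard


def deviance_residual(event: int, cum_hazard: float) -> float:
    """Deviance residual: sign(M) * sqrt(-2 * (M + d * log(d - M))).

    d is the event indicator and M the martingale residual, so d - M equals
    the model's cumulative hazard for the patient.
    """
    if cum_hazard < 0:
        raise ValueError("cumulative hazard cannot be negative")
    m = martingale_residual(event, cum_hazard)
    if event:
        if cum_hazard == 0:
            raise ValueError("an observed event needs a positive cumulative hazard")
        inner = m + math.log(cum_hazard)
    else:
        inner = m
    value = math.sqrt(max(0.0, -2.0 * inner))
    return math.copysign(value, m) if m != 0 else 0.0
```

A positive residual corresponds to a relapse that occurred when the model expected few events (underprediction), and a negative residual to a patient who had accumulated a large expected hazard (overprediction).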
Another way to test how the model relates to critical outcomes is to relate its hazard score to the probability of tumor relapse over some specified time. For example, Fig. 5 shows a plot of the observed probability of tumor relapse in 5 years against the hazard score for the model of Table 4. Clearly, as the hazard score increases above zero, the probability of relapse rises, but the points at the top (relapses) and the bottom (no relapses) show that for hazard scores between −1 and 1, there are both relapsing patients and nonrelapsing patients. Fig. 6 demonstrates this overlap further. It is a boxplot of hazard scores plotted separately for those who did not relapse and for those who did. The shaded area gives the 25th to 75th percentiles of hazard score, the white line gives the median, and the more distant lines provide extreme values. The plot shows that there is substantial overlap in hazard scores for relapsed and nonrelapsed patients. Thus, this best model leaves much uncertainty about an important outcome, and logistic analysis indicated that the hazard score accounted for just 18.7% of the deviance, or noise, in the data.
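The fraction of deviance accounted for by a hazard score in a logistic analysis can be sketched as one minus the ratio of the model deviance to the deviance of an intercept-only model. The code below is a minimal illustration with a single covariate, fitted by Newton-Raphson; the function names and the ten data points are hypothetical and are not the study data, and whether this matches the exact calculation used in the study is an assumption.

```python
import math
from typing import List, Sequence, Tuple


def fit_logistic(x: Sequence[float], y: Sequence[int], iters: int = 25) -> Tuple[float, float]:
    """Fit logit P(y=1) = a + b*x by Newton-Raphson; returns (a, b)."""
    a = b = 0.0
    for _ in range(iters):
        p = [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        g0 = sum(yi - pi for yi, pi in zip(y, p))                 # score for a
        g1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))   # score for b
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def binomial_deviance(y: Sequence[int], p: Sequence[float]) -> float:
    """Minus twice the log-likelihood for 0/1 outcomes."""
    return -2.0 * sum(math.log(pi) if yi else math.log(1.0 - pi) for yi, pi in zip(y, p))


def deviance_explained(x: Sequence[float], y: Sequence[int]) -> float:
    """Fraction of the null deviance accounted for by the covariate x."""
    a, b = fit_logistic(x, y)
    fitted: List[float] = [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in x]
    base = sum(y) / len(y)
    null_dev = binomial_deviance(y, [base] * len(y))
    return 1.0 - binomial_deviance(y, fitted) / null_dev


# Hypothetical hazard scores and 5-year relapse indicators (1 = relapsed)
toy_hs = [-2.0, -1.0, -1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
toy_relapse = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]
toy_fraction = deviance_explained(toy_hs, toy_relapse)
```

The fit requires outcomes that overlap across hazard scores, as they do in the data described above; with perfectly separated outcomes the Newton-Raphson iterations would not converge.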
DISCUSSION
The results of this study suggest that for prognostic purposes, the optimal use of PSA is as its logarithm, that stage T_{3} should probably be separated from stage T_{4}, and that Gleason score is probably best used as a binary variable with a cutpoint between 6 and 7. The finding about logarithm of PSA confirms the recent conclusion of Movsas et al. (29) that the logarithm transformation of pretreatment PSA provides the best prognosticator for eventual PSA relapse after definitive radiotherapy. Although Movsas et al. (29) did not provide the coefficients of their final model using log(PSA), it is likely that their model is similar to the one in Table 4, because both use nearly the same variables. The results also confirm the nearly optimal nature of Pisansky’s rule for the same reason. Using cutpoints of PSA such as the ones that Zagars et al. (28) or Green et al. (32) used produces weaker predictive models. Because we had only three patients who were stage T_{4}, it is possible that the significant association between T_{4} and time to relapse may not validate with additional data. Nevertheless, the low P of 0.013 persuades us to separate T_{3} from T_{4}, at least until more data on this issue become available. In our data, histological grade was of borderline importance, although like others we found the break at the score of 7 to be the best way to use the Gleason score.
Despite the prognostic value of log(PSA), clinical stage, and histological grade, we have shown that models confined to only these three variables do not predict the outcome ideally and in fact explain only 18.7% of the deviance in important outcomes, such as the probability of recurrence in 5 years. Thus, they leave much uncertainty, and we should not be satisfied with such limited results. The question that remains is how to improve our ability to predict who will relapse and when.
Clearly, one way to improve the above models is to add more prognostic variables, and in this regard, needle biopsies can help. For example, the amount of tumor present in core biopsies is clearly important. Using 813 patients treated with radical prostatectomy, Narayan et al. (49) demonstrated that a staging system based on the number of positive biopsy cores was superior to the ordinary staging based on digital rectal exam. Their best models for predicting ECE or involvement of seminal vesicles or lymph nodes relied on the combination of biopsy-based staging, PSA, and Gleason score. Egan and Bostwick (50) demonstrated that after controlling for log(PSA), the percentage of biopsy involved by tumor was significantly related to the probability of tumor outside the prostate, and recently we found that two measures of the amount of tumor in the biopsies were important. Both tumor length and the number of positive cores provided significant information for predicting tumor volume, and this occurred after controlling for the effects of both PSA and Gleason score (42). Finally, to the degree that a set of six core biopsies samples a small portion of the prostate, it is possible that increasing the number of cores could add prognostic information simply by sampling the prostate and tumor more completely.
Refinements in the Gleason grading system could also add prognostic information. The Gleason score is formed by adding the predominant grade number to any less dominant grade that may be present. For example, if the predominant grade is 3 and the lesser one 4, then the Gleason score for this 3,4 tumor is 3+4 or 7. In this way, the Gleason score equates a 3,4 tumor with a 4,3 tumor, although their predominant grades differ. At least two groups have demonstrated that these two tumor types have a different prognosis, i.e., the predominance of a grade type is information that may be prognostically important (51, 52). Two groups have also found that the percentage of high-grade tumor (i.e., Gleason grades 4 or 5) present is important prognostic information (53, 54). For example, McNeal et al. (53) demonstrated that the volume of high-grade tumor related closely to the probability of tumor in lymph nodes, and Herman et al. (54) have shown recently that the amount of Gleason grades 4 or 5 is significantly related to survival after radical prostatectomy.
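The information lost when two grades are summed into a single score can be made explicit with a short sketch. The function name and the returned field names are hypothetical, and the extra indicator for a high-grade predominant pattern is only one possible refinement of the kind discussed here.

```python
from typing import Dict, Union


def gleason_summary(primary: int, secondary: int) -> Dict[str, Union[int, str]]:
    """Summarize a tumor's predominant (primary) and lesser (secondary) grades."""
    for grade in (primary, secondary):
        if not 1 <= grade <= 5:
            raise ValueError("Gleason grades range from 1 to 5")
    score = primary + secondary
    return {
        "score": score,                            # conventional Gleason score
        "pattern": f"{primary}+{secondary}",       # keeps the order of the grades
        "G7": int(score >= 7),                     # binary variable <7 versus >=7
        "primary_high_grade": int(primary >= 4),   # predominant grade 4 or 5
    }


tumor_34 = gleason_summary(3, 4)
tumor_43 = gleason_summary(4, 3)
```

Both example tumors have a score of 7 and the same G7 value, but they differ in the pattern and in the predominant-grade indicator.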
Immunohistochemical staining for MIB-1, p53, and bcl-2 has been found recently to add useful prognostic information (55–63). In general, increased staining for these three markers has implied a decreased disease-free interval, although the results have been mixed, and at times marker staining was significant only in univariate analyses. Curiously, many studies of these immunohistochemical markers did not include PSA as a covariate in the modeling; therefore, without the strongest routine prognosticator, it is difficult to decide their importance. Furthermore, because the level of staining for both p53 and MIB-1 has correlated positively with PSA (56, 57), it is possible that some of the prognostic importance of staining would diminish with PSA included in the models. Finally, because the number of uncensored patients in these eight cited studies was limited (average, 36; range, 17–66), further study of such markers is needed.
p27 is another cell cycle-associated marker found recently to be prognostic in prostate cancer (64–67). In general, decreased immunohistochemical staining for p27 has implied a worse prognosis, although not all have found that staining for p27 was important in multivariable Cox models. Because just two of the four cited studies here included PSA as a potential covariate and because the number of uncensored patients was small (range, 23–44), the results are preliminary.
Beyond biopsy-related observations, it is possible that transrectal ultrasonography can provide prognostic information. For example, D’Amico et al. (33–36) formulated a model using PSA, transrectal ultrasonography, and Gleason score to predict tumor volume, and they have suggested that this predicted volume could replace or add to the usual clinical stage. Their model deals with two issues: (a) it corrects the PSA for the amount of benign prostate tissue present; and (b) it corrects the PSA according to the grade, because for any tumor volume, higher grade has implied lower values of PSA (68–70). Nevertheless, two studies have shown that the predicted volume agrees only loosely with observed tumor volumes (42, 43).
Endorectal MRI has shown promise for detecting tumors outside the prostate, including in the seminal vesicles, and this MRI-based detection was significantly associated with PSA failure after prostatectomy (71–77). The studies cited here (71–76) have reported that the sensitivity of endorectal MRI for detecting ECE ranged from 13 to 95% (weighted mean, 49%), and the specificity for ECE ranged from 82 to 100% (weighted mean, 92%). The corresponding sensitivity for detecting tumor in the seminal vesicles ranged from 23 to 100% (weighted mean, 56%), and the specificity for this outcome ranged from 84 to 100% (weighted mean, 93%). Clearly, these results imply that endorectal MRI provides useful information about local stage of tumor. Furthermore, D’Amico et al. (77) have demonstrated that in a multivariate model, endorectal MRI improved the prediction of either ECE or positive seminal vesicles. Specifically, they showed that predicting ECE was optimized by the combination of positive core biopsies, serum PSA, Gleason score, presence of clinical stage T_{2c}, and endorectal MRI. Predicting tumor in the seminal vesicles was optimized by the combination of PSA, percentage of biopsies, Gleason score, and endorectal MRI. Finally, these same authors demonstrated that four of these variables (all except clinical stage) were significantly associated with the time to PSA failure. Thus, regardless of the outcome, they demonstrated that using endorectal MRI with PSA, grade, and a measure of the extent of tumor in the biopsies produced a multivariable model that was better than one using endorectal MRI by itself. Had they used the natural logarithm of PSA instead of four categories (three cutpoints), they might have found even better results.
In summary, we have verified that the best way to use PSA in predictive models for outcome after definitive radiation therapy is as its natural logarithm, and our results support separating clinical stage T_{3} from T_{4} as well as dividing Gleason score into <7 versus ≥7. Nevertheless, predictive models based on just these three variables are not sufficient because they leave much residual uncertainty. Improvement on such models will likely require additional variables, including measures of the amount of tumor in the biopsy cores and endorectal MRI, and it is possible that refinements of Gleason grading and immunohistochemical markers could help.
Acknowledgments
We greatly appreciate the help of Robert Clough in entering and providing much of the data.
Footnotes

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

1 To whom requests for reprints should be addressed, at Laboratory Medicine (113), VA Medical Center, Durham, NC 27705. Phone: (919) 286-0411; Fax: (919) 286-6818; E-mail: voll002{at}duke.edu

2 The abbreviations used are: PSA, prostate-specific antigen; ECE, extracapsular tumor; MRI, magnetic resonance imaging.
 Accepted June 14, 1999.
 Received March 22, 1999.
 Revision received May 28, 1999.