Abstract
Tumor-node-metastasis (TNM) staging is the standard system for estimating the prognosis of breast cancer patients. However, this system does not exploit information yielded by markers of the biological aggressiveness of breast cancer and is clearly unsatisfactory for optimal treatment decision-making and for patient counseling. We have developed a prognostic model, based on a few routinely evaluated prognostic variables, that produces quantitative estimates of the risk of relapse for individual breast cancer patients. We used data from 2441 of 2990 consecutive breast cancer patients to develop an artificial neural network (ANN) for predicting the probability of relapse over 5 years. The prognostic variables used were patient age, tumor size, number of axillary metastases, estrogen and progesterone receptor levels, S-phase fraction, and tumor ploidy. Model performance was evaluated in terms of discrimination ability and quantitative precision. Predictions were validated on an independent series of 310 patients from an institution in another country. The ANN discriminated patients according to their risk of relapse better than the TNM classification (P = 0.0015). The model’s quantitative estimates were accurate, and this accuracy was confirmed on the series from the second institution. The 5-year relapse risk yielded by the model varied greatly within the same TNM class, particularly for patients with four or more nodal metastases. The model discriminates prognosis better than the TNM classification and is able to identify patients with strikingly different risks of relapse within each TNM class.
INTRODUCTION
The best cure rates for early breast cancer are achieved by a combination of locoregional and systemic treatments (the latter usually referred to as “adjuvant treatments”; Refs. 1, 2, 3). However, cure rates are unrelated to the extent of locoregional intervention (4). This has prompted a tendency toward less extensive locoregional treatment (breast-conserving surgery) and a large body of research aimed at developing new adjuvant regimens.
Clinicians now have a wide choice of adjuvant regimens, ranging from low-toxicity hormonal regimens to very aggressive high-dose myeloablative chemotherapy with hematopoietic stem cell support, which is still under investigation. However, the wider the choice of adjuvant regimens, the harder it is to determine which treatment is best for a given patient. Decision-making about adjuvant treatment involves three issues. For each patient, one must first determine the likelihood of disease recurrence if no further therapy is administered and then the efficacy of each therapy available for that patient. Third, the predicted therapeutic effect of each specific regimen must be weighed against potential side effects so as to determine the probable net benefit. Therefore, prognostic estimation remains one of the basic issues of treatment decision-making, not simply to identify patients who, because of their excellent prognosis, should be spared any adjuvant treatment-induced toxicity but also to identify those high-risk patients who should be given the opportunity of very aggressive regimens.
At present, TNM staging (abbreviations are listed in footnote 3) is used to classify breast cancer patients according to prognosis (5, 6). This type of categorization, based on the extent of the disease at diagnosis, does not exploit the many markers of biological aggressiveness of breast cancer that are now widely available, and it is not satisfactory for optimal treatment decision-making and patient counseling. On the other hand, it has been demonstrated that practicing oncologists are inaccurate when making prognostic estimates based on information obtained from multiple prognostic variables (7). Hence, it is widely recognized that a new prognostic system is needed for breast cancer patients, and the development of such a system is a major goal of cancer research.
Using data from a large breast cancer database, we generated a prognostic model based on an ANN. This model yielded numerical estimates for risk of relapse of individual breast cancer patients that we validated on an independent data set of patients treated and followed at another institution.
PATIENTS AND METHODS
Data.
The model described here was constructed using data from 2990 patients stored in the “San Antonio database.” These data pertain to a cohort of breast cancer patients whose tumor specimens were consecutively received at the Division of Medical Oncology, University of Texas Health Science Center at San Antonio (San Antonio, TX) for prognostic factor evaluation. Patients had undergone surgical resection between 1974 and 1992 at a broad range of sites across the United States. Data were collected prospectively and partly verified by audit visits to the individual treatment sites, with direct checks of the original source documents. After computerization, database information was verified with the referring physicians to minimize data entry errors. Recurrences were actively searched for by systematic patient follow-up. The follow-up schedule varied according to each center’s policy.
The following information was available for these cases: (a) age; (b) TS; (c) number of axillary lymph nodes affected by tumor (#Nodes); (d) ER levels; (e) PgR levels; (f) proportion of tumor cells in S-phase (S-phase fraction); and (g) tumor ploidy. The analysis was restricted to patients who fulfilled the following criteria: (a) invasive cancer; (b) follow-up of at least 2 months; and (c) no missing values for prognostic factor data. The data profile is reported in Fig. 1.
Age was recorded at the time of surgery. TS was defined as the larger diameter of the invasive tumor at pathological examination. At least six axillary lymph nodes were examined in each patient. ER and PgR were assayed with the dextran-coated charcoal method as described elsewhere (8) and are expressed as femtomoles per milligram of cytoplasmic protein (fmol/mg protein). Ploidy and S-phase fraction were estimated with DNA flow cytometry of fresh or frozen tumor specimens as previously reported (9). Prognostic variables (b) through (f) were logarithmically transformed to reduce the asymmetry of the distribution of the values and to obtain approximate normality. This stabilized the variance and reduced the impact of the small number of cases with very high values for given prognostic variables (10, 11). Treatment information was not available at the time of the analysis and was not used for modeling. See Table 1 for a list of the prognostic variables used for the construction of the model and their respective transformations.
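The logarithmic transformation described above can be illustrated with a minimal Python sketch. The exact transformations are those listed in Table 1, which is not reproduced here, so the offset of 1 (used to keep zero values finite) and the example ER values are assumptions made only for illustration.

```python
import math

def log_transform(value, offset=1.0):
    """Natural log of (value + offset).

    The offset is an assumption; it keeps zero values (e.g., receptor-negative
    tumors or node-negative patients) finite.
    """
    if value < 0:
        raise ValueError("prognostic values are expected to be non-negative")
    return math.log(value + offset)

# Hypothetical ER levels (fmol/mg protein), including one extreme value.
er_levels = [0.0, 5.0, 40.0, 150.0, 1200.0]
transformed = [log_transform(v) for v in er_levels]
```

On the raw scale the largest value is 240 times the second smallest; after transformation the whole series spans only about 7 units, which is the variance-stabilizing effect the text refers to.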
To reduce the risk of overfitting of the model and to test its generalizing ability, we randomly split the entire data sample into three subsets of approximately equal size: (a) a training set; (b) a testing set; and (c) an internal validation set (12, 13). The training set was used to develop the model, the testing set to select the optimal duration of the training process, and the internal validation set to assess performance.
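A random three-way split of this kind might be sketched as follows. The function name, the fixed seed, and the use of integer patient identifiers are illustrative assumptions, not details taken from the study.

```python
import random

def three_way_split(records, seed=0):
    """Shuffle the records and cut them into three near-equal subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    first_cut, second_cut = n // 3, 2 * n // 3
    return (shuffled[:first_cut],
            shuffled[first_cut:second_cut],
            shuffled[second_cut:])

# 2441 eligible patients, represented here by hypothetical integer identifiers.
training, testing, validation = three_way_split(range(2441))
```

Fixing the seed makes the partition reproducible, which matters because the testing and validation sets must stay untouched by the weight-fitting step.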
External Validation.
After the model was constructed, a data set from a different institution (“external validation set”) became available. This database contains information regarding 1996 consecutive cases of breast cancer treated between 1977 and 1993 and followed at the University Federico II of Naples, Italy. Complete prognostic data for the variables in the model were available for only 310 patients; these were used to check the robustness of the estimates of the model across an independent series. See Fig. 1 for the profile of the data.
ANN Modeling.
DFS was defined as the time elapsing between surgery and any of the following events, whichever occurred first: local recurrence, distant relapse, second breast cancer, or death. Patients without any event were censored at the time of the last follow-up.
The ANN model was constructed using a variation of the technique for censored data that we described previously (14, 15, 16, 17). It is a multilayer perceptron with one hidden layer of five processing elements. The hyperbolic tangent was chosen as the activation function of the processing elements. We used data in the training set for the learning process of the ANN (training). During this process, prognostic variables pertaining to each patient in this data set were iteratively presented to the ANN, the target value being each patient’s relapse status at 60 months. For patients whose status at 60 months was unknown (patients censored before 60 months), the conditional probability of relapse from the Kaplan-Meier curve was used as a surrogate target value. This conditional probability is expressed as the ratio between the average probability of relapse at 60 months and the average probability of relapse at the time the patient was censored. The backpropagation-of-error algorithm was used to adjust the connection weights during the learning process (18).
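The handling of censored patients can be made concrete with a short sketch. It rests on an assumption about how the surrogate target is computed: it uses the standard conditional form 1 − S(60)/S(t_c), where S is the Kaplan-Meier disease-free fraction and t_c is the censoring time. The authors’ exact computation is described in Refs. 14 to 17 and may differ; the function names and the six-patient data set are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit curve as a list of (event time, disease-free fraction)."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    surviving = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        relapses = removed = 0
        while i < len(data) and data[i][0] == t:   # gather ties at time t
            relapses += data[i][1]
            removed += 1
            i += 1
        if relapses:
            surviving *= 1.0 - relapses / at_risk
            curve.append((t, surviving))
        at_risk -= removed
    return curve

def survival_at(curve, t):
    """Disease-free fraction at time t (step function, 1.0 before first event)."""
    fraction = 1.0
    for event_time, value in curve:
        if event_time > t:
            break
        fraction = value
    return fraction

def training_target(time, relapsed, curve, horizon=60.0):
    """Target for one patient: known status at the horizon, or a surrogate."""
    if relapsed and time <= horizon:
        return 1.0                      # relapse observed within 60 months
    if time >= horizon:
        return 0.0                      # known to be disease-free at 60 months
    # Censored before the horizon: conditional probability of relapse by 60
    # months given disease-free at the censoring time (assumed formulation).
    return 1.0 - survival_at(curve, horizon) / survival_at(curve, time)

# Hypothetical follow-up times (months) and relapse indicators (1 = relapse).
times = [10, 20, 30, 40, 70, 80]
events = [1, 0, 1, 0, 0, 1]
curve = kaplan_meier(times, events)
targets = [training_target(t, e, curve) for t, e in zip(times, events)]
```

In this toy series the patient censored at 20 months receives a target of 0.25, whereas the patient censored at 40 months receives 0 because no relapse occurs between 40 and 60 months.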
The testing set was then used to check when, in the training process, the ANN had reached optimal predictive power (curtailed training; Ref. 13) and to adjust the model’s calibration. The validation set was finally used to assess the predictive power of the model.
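The network and the curtailed-training procedure described in the two preceding paragraphs can be sketched in plain Python. This is a toy illustration and not the NeuralWare implementation used in the study: the logistic output unit, the squared-error loss, the learning rate, the weight initialization, and the synthetic data are all assumptions. Only the single hidden layer of five hyperbolic-tangent units, backpropagation, and the selection of the stopping point on the testing set come from the text.

```python
import copy
import math
import random

def init_network(n_inputs, n_hidden=5, seed=0):
    """Small random weights; the last entry of each weight row is the bias."""
    rng = random.Random(seed)
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return {"hidden": hidden, "output": output}

def forward(net, x):
    """Return (hidden activations, predicted probability) for one patient."""
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x)) + row[-1])
              for row in net["hidden"]]
    z = sum(w * h for w, h in zip(net["output"], hidden)) + net["output"][-1]
    return hidden, 1.0 / (1.0 + math.exp(-z))   # logistic output (assumption)

def train_epoch(net, samples, rate=0.1):
    """One pass of backpropagation of error with per-patient weight updates."""
    for x, target in samples:
        hidden, out = forward(net, x)
        delta_out = (out - target) * out * (1.0 - out)
        for j, h in enumerate(hidden):
            delta_hidden = delta_out * net["output"][j] * (1.0 - h * h)
            for i, v in enumerate(x):
                net["hidden"][j][i] -= rate * delta_hidden * v
            net["hidden"][j][-1] -= rate * delta_hidden
            net["output"][j] -= rate * delta_out * h
        net["output"][-1] -= rate * delta_out

def mse(net, samples):
    return sum((forward(net, x)[1] - t) ** 2 for x, t in samples) / len(samples)

def train_with_early_stopping(net, training, testing, max_epochs=50):
    """Keep the weights from the epoch with the lowest testing-set error."""
    best_error, best_net, best_epoch = mse(net, testing), copy.deepcopy(net), 0
    for epoch in range(1, max_epochs + 1):
        train_epoch(net, training)
        error = mse(net, testing)
        if error < best_error:
            best_error, best_net, best_epoch = error, copy.deepcopy(net), epoch
    return best_net, best_epoch, best_error

def make_toy_samples(n, seed):
    """Synthetic patients: 7 standardized inputs, binary relapse target."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(7)]
        risk = 1.0 / (1.0 + math.exp(-(x[0] + 0.5 * x[1])))
        samples.append((x, 1.0 if rng.random() < risk else 0.0))
    return samples

training = make_toy_samples(150, seed=1)
testing = make_toy_samples(150, seed=2)
network = init_network(n_inputs=7)
initial_error = mse(network, testing)
best_net, best_epoch, best_error = train_with_early_stopping(
    network, training, testing)
```

Because the stopping point is chosen on the testing set, that set no longer gives an unbiased estimate of performance, which is why a separate validation set is needed for the final assessment.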
ANNs were implemented on a Sun SPARCstation IPX (Sun Microsystems, Inc.) using the NeuralWorks Professional II/Plus software package (NeuralWare, Inc., Pittsburgh, PA).
Evaluation Criteria.
The performance of the ANN model was assessed in terms of refinement and calibration. Refinement is defined as the ability of a model to rank patients on the basis of their true probability of relapse, i.e., to discriminate between patients with different prognoses. Calibration is defined as the numerical precision of the predicted probability of relapse, i.e., how close the model’s predictions are to the true probability of relapse (19, 20).
The ANN model was used to make predictions of risk of relapse at 5 years for patients in the internal validation set and in the Naples database. The refinement of the model was evaluated as the area under the ROC curve (Az) for predicting relapses on both data sets. This area expresses the probability that, for two randomly chosen patients whose status at 5 years is known, one of whom has relapsed and the other has not, the relapsed patient had the higher predicted risk (21). It summarizes the model’s sensitivity and specificity across all decision thresholds. The values of the area under the ROC curve were calculated by maximum likelihood estimation, and the statistical significance of the differences between these values was evaluated by a z statistic (22, 23). These estimations were made with the software “Indroc” for independent data and “Corroc2” for correlated data (see footnote 4).
The calibration of the ANN model was tested by comparing the average predicted probability with the observed relapse rate at 5 years of follow-up within different prediction ranges. Observed relapse rates were estimated according to the product-limit method (24).
The relationships between the ANN projections and the current TNM prognostic system were analyzed by examining the distribution of the ANN projections for each TNM class. Individual probabilities of relapse, as predicted by the ANN model, were clustered into 10 (0.1-wide) intervals, and a smoothed plot of the percentage of patients falling into each interval was reported for all of the TNM classes.
All P values were two-sided.
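The pairwise interpretation of the area under the ROC curve can be computed directly, as in the sketch below. Note that this is the nonparametric (empirical) estimate, not the maximum likelihood estimate that the study obtained with the software cited above; the function name and the five-patient example are hypothetical.

```python
def concordance_auc(risks, relapsed):
    """Fraction of (relapsed, non-relapsed) pairs ranked correctly.

    Ties in predicted risk count as half a correct ranking.
    """
    positives = [r for r, y in zip(risks, relapsed) if y]
    negatives = [r for r, y in zip(risks, relapsed) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one relapsed and one relapse-free patient")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

# Hypothetical predicted risks and known 5-year status (1 = relapsed).
predicted = [0.9, 0.8, 0.3, 0.6, 0.2]
status = [1, 0, 1, 0, 0]
auc = concordance_auc(predicted, status)   # 4 of 6 pairs ranked correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of relapsed from relapse-free patients.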
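A calibration check of this kind might look like the following sketch. Patients are ranked by predicted risk and divided into groups; within each group the mean prediction is set against a product-limit estimate of the 5-year relapse rate. Equal-sized groups, the function names, and the eight-patient example are assumptions for illustration.

```python
def km_relapse_rate(times, events, horizon=60.0):
    """Product-limit estimate of the probability of relapse by `horizon`."""
    surviving = 1.0
    at_risk = len(times)
    # At tied times, relapses are processed before censored observations.
    for time, event in sorted(zip(times, events), key=lambda p: (p[0], -p[1])):
        if time > horizon:
            break
        if event:
            surviving *= 1.0 - 1.0 / at_risk
        at_risk -= 1
    return 1.0 - surviving

def calibration_table(predictions, times, events, n_groups=8):
    """Rows of (mean predicted risk, observed relapse rate, group size)."""
    order = sorted(range(len(predictions)), key=lambda i: predictions[i])
    rows = []
    for g in range(n_groups):
        start = g * len(order) // n_groups
        stop = (g + 1) * len(order) // n_groups
        members = order[start:stop]
        if not members:
            continue
        mean_predicted = sum(predictions[i] for i in members) / len(members)
        observed = km_relapse_rate([times[i] for i in members],
                                   [events[i] for i in members])
        rows.append((mean_predicted, observed, len(members)))
    return rows

# Hypothetical predictions, follow-up times (months), and relapse indicators.
predictions = [0.1, 0.2, 0.15, 0.05, 0.7, 0.8, 0.9, 0.6]
times = [70, 65, 30, 80, 12, 40, 20, 75]
events = [0, 0, 1, 0, 1, 1, 1, 0]
rows = calibration_table(predictions, times, events, n_groups=2)
```

A well-calibrated model yields rows in which the first two entries are close to each other in every risk group.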
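The binning step can be sketched as follows; the smoothing applied to the plotted curves is not specified in the text and is omitted. The class labels, function name, and predicted risks are hypothetical.

```python
def risk_histogram(predictions, n_bins=10):
    """Percentage of patients whose predicted risk falls in each interval."""
    counts = [0] * n_bins
    for p in predictions:
        index = min(int(p * n_bins), n_bins - 1)   # a risk of 1.0 joins the last bin
        counts[index] += 1
    return [100.0 * c / len(predictions) for c in counts]

# Hypothetical ANN predictions grouped by modified TNM class.
by_class = {
    "T1N0": [0.06, 0.12, 0.18, 0.33],
    "N4+": [0.22, 0.55, 0.81, 0.97],
}
histograms = {label: risk_histogram(risks) for label, risks in by_class.items()}
```

Expressing each bin as a percentage of its own class makes classes of very different sizes directly comparable on one plot.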
RESULTS
The distribution of the prognostic characteristics of the patients in the four data sets used is shown in Table 2. Apart from the longer follow-up and the higher proportion of node-positive patients in the Naples database, there was no other significant difference in the variable distribution among the data sets.
After completion of the training process, the ANN model was used to project a quantitative estimate of the 5-year probability of recurrence for each patient in the internal validation set and in the Naples database (external validation set). The predictive accuracy of these estimates is shown in Table 3. The discrimination ability of the model, expressed in terms of Az values, is good and does not show any degradation across data sets (Az: 0.728 versus 0.732; z = −1.60; P = 0.10). These Az values represent the probability that random pairs of patients (one relapsed and one disease-free) are correctly ranked by the model on the basis of their predicted risk. The calibration of the ANN model, i.e., how close the individual model predictions are to the actual probability of relapse, is also reported in Table 3. To generate this table, patients were ranked on the basis of their ANN-predicted risk and then divided into eight groups of increasing risk (ANN1 to ANN8). For each group, the average model prediction, the observed relapse rate at 5 years, and the number of patients falling in that prediction range were tabulated. Individual patients’ predictions of 5-year probability of relapse ranged from 0.06 to 0.97 and were well distributed along the entire prognostic range. A comparison of predicted and observed values shows that the prediction of the model is consistent with the true relapse rate in each risk group, except for a small discrepancy in the ANN7 group of the external validation set (0.725 predicted versus 0.865 observed).
To explore the clinical potential of the model, patients in the two validation sets were pooled (n = 1124). The predictive ability (refinement) of the model was then compared with that of TNM staging. According to usual clinical practice, TNM was modified to account for the number of axillary lymph node metastases, resulting in five TNM classes: T_{1}N_{0}, T_{2}N_{0}, T_{3}N_{0}, N_{1–3} (patients with 1–3 metastatic nodes), and N_{4+} (patients with 4 or more metastatic nodes). As shown in Table 4, the predictive ability of the ANN model, expressed in terms of Az value, is significantly higher than that of the modified TNM classification (P = 0.0015). The clinical meaning of this difference is well depicted by the DFS curves according to the five TNM classes and the eight ANN-based prognostic groups (Fig. 2).
To reveal the disagreements between the ANN model and TNM staging, Fig. 3 plots the distribution of the ANN predictions for patients belonging to the five classes of the modified TNM system. The average ANN-projected 5-year probability of relapse was consistent with the observed relapse rate for each TNM class, but individual predictions ranged widely within each single TNM category, with a wide overlap across different classes. In other words, according to the projections of the ANN model, the probability of relapse in each modified TNM class is markedly heterogeneous. This is particularly true for the N_{4+} patients, some of whom have an ANN-projected risk lower than that of some patients in each of the lower TNM classes (ANN prediction range for N_{4+} patients, 0.19–0.97). To investigate whether this variability of estimates for N_{4+} patients could be related to an interaction with the adjuvant treatment, the analysis was repeated for patients who were both ER- and PgR-negative (Fig. 3). ANN estimates for this subset of patients were less widely distributed, as expected from the known interaction of steroid receptors with endocrine treatments (2, 3). Nevertheless, a substantial heterogeneity of the ANN estimates was still present (ANN prediction range for N_{4+}/receptor-negative patients, 0.35–0.97).
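The five-class grouping can be written as a small function, sketched here under stated assumptions: the size cutoffs are the standard TNM thresholds (T1, up to 2 cm; T2, more than 2 cm and up to 5 cm; T3, more than 5 cm), and T4 disease, which is defined by local extension rather than size, is not handled. The function name and string labels are hypothetical.

```python
def modified_tnm_class(tumor_size_cm, positive_nodes):
    """Assign one of the five modified TNM classes used for comparison."""
    if positive_nodes >= 4:
        return "N4+"
    if positive_nodes >= 1:
        return "N1-3"
    # Node-negative patients are split by tumor size (standard T cutoffs).
    if tumor_size_cm <= 2.0:
        return "T1N0"
    if tumor_size_cm <= 5.0:
        return "T2N0"
    return "T3N0"

# Hypothetical patients as (tumor size in cm, number of positive nodes).
patients = [(1.5, 0), (3.0, 0), (6.0, 0), (2.5, 2), (1.0, 7)]
classes = [modified_tnm_class(size, nodes) for size, nodes in patients]
```

Note that in this scheme tumor size no longer affects the class once any node is positive, which is one reason patients within a node-positive class can differ widely in predicted risk.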
DISCUSSION
In the attempt to obtain a more reliable prognosis for breast cancer patients than that provided by classical TNM staging, practicing oncologists now face a growing number of prognostic parameters. Paradoxically, this wealth of prognostic information has not improved clinicians’ skill in predicting the outcome for breast cancer patients but has rather led to confusion. A survey of medical oncologists showed a surprisingly wide variation in the estimates of risk of relapse provided by different expert clinicians for a given patient (7). The conflicting estimates arise from the difficulty in taking into account complex, and sometimes conflicting, prognostic information. The subjective weighting of each prognostic variable does not yield a consistent quantitative evaluation of recurrence risk, and there is no standard procedure that takes into account various prognostic factors in an integrated fashion.
The aim of this study was to generate a prognostic model in which numerical information from several prognostic variables is integrated so as to yield quantitative predictions of the risk of relapse for individual breast cancer patients. The advantages and disadvantages of the diverse multivariate analysis techniques that can be used to generate such a model are amply discussed elsewhere (17, 25). We used an approach based on ANNs. ANNs are a form of artificial intelligence devised to solve complex pattern recognition and prediction problems in many fields, including biomedicine (26, 27, 28, 29, 30). Feedforward ANNs like the one that we used are trained with historical data to learn the correlation between input variable patterns and a specified output. The training consists of an iterative process of presentation of known data to the network. During training, the network changes its internal structure by updating its connection weights, using a mathematical algorithm that minimizes the global error in predicting the output. Once the training is terminated, an ANN can be presented with new patterns of input variables to make predictions about the output. In theory, given a large enough historical database, a properly constructed and trained ANN can learn any kind of functional or statistical relationship between the input variables and the output, no matter how complex and nonlinear it is. Because prognostic estimation is essentially a problem of pattern recognition and prediction, ANN analysis seems particularly suited to this task.
The application of ANNs to survival analysis was initially viewed with skepticism by biostatisticians and with exaggerated enthusiasm by the “neural net community,” followed by a progressive convergence of the two views. Many groups are trying to merge classical statistical techniques and ANNs, and it has been shown that the backpropagation-of-error algorithm can be used to obtain maximum likelihood estimates in standard regression models for survival data (31, 32). Although many clinicians view ANNs as magic boxes able to solve complex problems, feedforward networks, like the one that we used, share some features and limitations with classical statistical techniques (33, 34). Indeed, ANNs should be viewed as flexible nonlinear multiple-regression models that, by searching in a multidimensional variable space, are particularly good at detecting complex interactions among variables in large data sets (16, 17).
Using simulations with synthetic data sets, we previously demonstrated that there are three conditions in which ANNs should be better than standard statistical models for survival analysis: (a) when the prognosis of patients is a complex, nonlinear, unknown function of the prognostic variables; (b) when this function varies over time; and (c) when there are complex interactions among prognostic variables (16, 17). Complexity is a common feature of biological systems; thus, with the advent of new biological prognostic variables, these three situations are very likely to be present, and we venture to suggest that ANN-based models will be better than standard statistical modeling in this setting.
Major drawbacks of ANN analysis are the complexity of the training process and the lack of standardization in selecting optimal starting parameters (such as the number of hidden layers and processing elements) for the network being trained. However, subtleties of training are irrelevant to a network whose predictions have been validated. Moreover, although the training process may appear cumbersome, once trained, the ANN model can be easily implemented as an easy-to-use prediction program on any desktop, notebook, or palmtop computer. Prototypes of this kind of software have already been developed at our institution and will be made available to the interested reader.
Here we demonstrate that several prognostic factors may be integrated as numerical variables into an ANN model able to generate survival predictions for individual patients. In our model, we used only well-established prognostic variables that are routinely used and whose measurement is well standardized. The ANN-based model produced very accurate predictions of probability of relapse that are robust across different series of patients. The ANN predictions were better than those obtained with the commonly used TNM variant in terms of ability to discriminate the 5-year risk of disease recurrence among breast cancer patients. Furthermore, the calibration of the model, i.e., the precision of the quantitative estimation of this risk, was very accurate and fitted surprisingly well to a series of patients treated and followed in a different country.
Our model, however, has some limitations. First, detailed treatment information was not available when we constructed the model and is not taken into account by the model. Predictions of the model may thus underestimate the risk of relapse that patients would face without treatment (their natural prognosis). This bias, however, is likely to be modest for node-negative patients because the majority of them in the San Antonio database did not receive any adjuvant therapy (percentage of treated patients: T_{1}N_{0} = 39%; T_{2}N_{0} = 39%; and T_{3}N_{0} = 40%). On the other hand, about 89% of node-positive patients whose data were used for model generation were treated. This may raise concerns about the usefulness of our ANN model. Nonetheless, because no patient whose data were entered into the model received high-dose chemotherapy, the model predictions can be viewed as estimates of the prognosis of the “average-treated” node-positive subject. Therefore, because all new node-positive patients are being treated, the model predictions are still useful: they provide information that may inform both the decision to treat with very aggressive regimens and patient counseling. A second limitation of the ANN model deserves comment: in the database used for model construction, only 6% of the patients had tumors less than 1 cm in size. This is because we required prognostic information that was not easily obtainable for small tumors. Thus, the model should be used with caution for projections of relapse rates for patients with such small tumors.
Another finding of our study is worthy of discussion. The ANN quantitative estimates of probability of relapse, when applied to patients grouped by the modified TNM classification, highlight the prognostic heterogeneity of each TNM group. This coincides with clinical experience, in which the outlook of patients belonging to the same TNM class can be quite different. In our study, the heterogeneity was particularly striking for N_{4+} patients, whose projected relapse risks overlapped those of patients belonging to other TNM classes, including T_{1}N_{0} patients. As shown, this heterogeneity is partially due to the inclusion in the model of ER and PgR information and, thus, related to some extent to the predictive effect of these variables on the efficacy of adjuvant tamoxifen, rather than to a pure prognostic effect. Nonetheless, even among receptor-negative patients, a substantial heterogeneity of the ANN estimates is observed. Because the effect of endocrine treatment is poor for these patients, and given that chemotherapy efficacy is not actually related to any other variable in the model (1), this variability of the ANN estimates may be linked only to a genuine prognostic effect of the covariates. This observation may have important clinical implications. Indeed, N_{4+} patients are generally considered at high risk of relapse and are candidates for very aggressive experimental regimens, including new drug combinations or high-dose myeloablative regimens with autologous stem cell reinfusion, which are usually very toxic and have a high incidence of side effects and a nontrivial rate of toxic deaths. According to our model, some N_{4+} patients are being overtreated, whereas patients in lower TNM classes but with a worse estimated prognosis are denied potentially more effective treatment. Should the estimates of our model be confirmed, selection of patients for aggressive regimens based only on the number of axillary metastases should be abandoned.
Various multivariate prognostic indices based on biological markers have been proposed (35, 36, 37). These indices have been used to classify patients into a few groups with different prognoses. Some of these indices perform well, and one of them, the Nottingham Prognostic Index, has been validated prospectively (37). They all, however, suffer in theory from the same shortcoming as the TNM system, i.e., heterogeneous prognoses for patients assigned to the same prognostic class. By contrast, our model, rather than classifying patients into different prognostic groups, produces quantitative predictions of the probability of relapse on a continuous scale.
In conclusion, three findings of this study are clinically relevant: (a) using a few prognostic variables, we have generated an ANN model that produces quantitative estimates that represent an improvement on the classical TNM staging of breast cancer patients; (b) the model’s estimates confirm that the system used at present is inadequate for optimal treatment decision-making; and (c) the accuracy of the model’s projections is maintained on a series of patients from a different country. This accuracy, however, is amenable to further improvement by (a) using detailed treatment information; (b) using new biological variables; and (c) generating models on subsets of patients who pose specific therapeutic questions.
Acknowledgments
We thank Jean Gilder for editing the manuscript.
Footnotes

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

↵1 Supported by NIH Grants P01 CA30195 and P50 CA58183, Cancer Center Support Grant P30 CA54174, Associazione Italiana Ricerca sul Cancro (AIRC), and Ministero dell’Università e Ricerca Scientifica (MURST).

↵2 To whom requests for reprints should be addressed, at Cattedra di Oncologia Medica, Università ‘Federico II,’ via Sergio Pansini 5, 80131 Napoli, Italy. Phone: 39817464286; Fax: 39817462066; Email: delauren{at}unina.it

3 The abbreviations used are: TNM, tumor-node-metastasis; ANN, artificial neural network; TS, tumor size; ER, estrogen receptor; PgR, progesterone receptor; DFS, disease-free survival; ROC, receiver operating characteristic; fmol, femtomole(s).

4 Charles E. Metz, Helen B. Kronman, Pu-Lan Wang, and Jong-Her Shen, Department of Radiology, The University of Chicago, Chicago, IL.
 Accepted September 3, 1999.
 Received March 9, 1999.
 Revision received September 2, 1999.