
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
Imaging, Diagnosis, Prognosis |
Authors' Affiliations: 1 Mary Babb Randolph Cancer Center, Departments of 2 Statistics, 3 Medicine and Division of Hematology/Oncology, 4 Microbiology, Immunology, and Cell Biology, and 5 Community Medicine, West Virginia University; and 6 The Pathology and Physiology Research Branch, Health Effects Laboratory Division, National Institute for Occupational Safety and Health, Morgantown, West Virginia
Requests for reprints: Lan Guo, Mary Babb Randolph Cancer Center/Department of Community Medicine, West Virginia University, 1814 HSS, 1 Medical Center Drive, Morgantown, WV 26506-9300. Phone: 304-293-6455; Fax: 304-293-4667; E-mail: lguo{at}hsc.wvu.edu.
| Abstract |
|---|
|
|
|---|
Experimental Design: Expression profiles of 7,650 genes were investigated on an unselected group of 99 node-negative and node-positive breast cancer patients to identify prognostic gene signature of recurrence and metastases. The identified gene signature was validated on independent 78 patients with primary invasive carcinoma (T1/T2 and N0) and on 58 patients with locally advanced breast cancer (T3/T4 and/or N2). The gene predictors were identified using a combination of random forests and linear discriminant analysis function.
Results: This study identified a new 28-gene signature that achieved highly accurate disease-free survival and overall survival (both at P < 0.001, time-dependent receiver operating characteristic analysis) in individual breast cancer patients. Patients categorized into high-risk, intermediate-risk, and low-risk groups had distinct disease-free survival (P < 0.005, Kaplan-Meier analysis, log-rank test) in three patient cohorts. A strong association (P < 0.05) was identified between risk groups and tumor size, tumor grade, estrogen receptor and progesterone receptor status, and HER2/neu overexpression in the studied cohorts. We also identified 14-gene predictors of nodal status and 9-gene predictors of tumor grade.
Conclusions: This study has established a population-based approach to predicting breast cancer outcomes at the individual level exclusively based on gene expression patterns. The 28-gene recurrence signature has been validated as quantifying the probability of recurrence and metastases in patients with heterogeneous histology and disease stage.
In this work, we present a population-based study to predict recurrence and metastases of breast cancer by using gene expression patterns in tumors. Sotiriou et al. (8) originally undertook the population-based study from a regional cancer center where 350 new patients a year were referred in from a population of 1.5 million. Within the years 1993 to 1995, 700 new breast cancer cases were seen, of which 99 cases were representative of the population with comparable overall survival adjusted for tumor size and nodal status. Almost all of these 99 patients received adjuvant therapy after surgery, according to accepted standards of practice at that time. Based on the gene profiling on these 99 tumors, we identified a new 28-gene signature to predict recurrence of breast cancer. Furthermore, the accuracy of the 28-gene signature in disease-free survival prediction was validated (P < 0.001) on two independent patient cohorts (1, 7). The cohort of van't Veer et al. (1) contained 78 patients with sporadic cancer all under the age of 55 years with no lymph node involvement, who were not treated with adjuvant chemotherapy. The cohort of Sorlie et al. (7) included 78 cases, 51 of which were with locally advanced T3/T4 and/or N2 breast cancer treated with primary chemotherapy. In this study, a recurrence score function was used to define a patient's risk for relapse and metastases for individualized clinical decision-making. Patients categorized into high risk, low risk, and intermediate risk had distinct disease-free survival (P < 0.005, Kaplan-Meier analysis, log-rank test) in all three cohorts. A strong association (P < 0.05) was identified between risk groups and tumor size, tumor grade (13), ER and progesterone receptor (PR) status, and HER2/neu overexpression in the studied patient cohorts (1, 7, 8). The recurrence score was also predictive of overall survival (P < 0.001) in the cohorts from Sotiriou et al. (8) and Sorlie et al. (7). To further reveal the molecular pathogenesis of breast cancer, we also identified 14-gene predictors of nodal status and 9-gene predictors of tumor grade using the data from Sotiriou et al. (8) in this population-based study.
| Materials and Methods |
|---|
|
|
|---|
The cohort from Sorlie et al. (7) contained 78 breast carcinomas (71 ductal, 5 lobular, and 2 ductal carcinomas in situ obtained from 77 different individuals; two independent tumors from one patient diagnosed at different times). Fifty-one of the patients were part of a prospective study on locally advanced breast cancer (T3/T4 and/or N2 tumors) treated with preoperative doxorubicin monotherapy. A total of 58 patients with complete follow-up information were included in this study. The detailed clinical information of all samples is available at the Proceedings of the National Academy of Sciences journal Web site and the author Web site.8
The clinical data of 78 patients from van't Veer et al. (1) were analyzed in this study, including 34 patients developed metastasis within 5 years and 44 patients continued to be disease-free after 5 years. Twenty patients with hereditary breast cancer (BRCA1 or BRCA2 carriers) were excluded from this study. All of these patients were diagnosed between 1983 to 1996 with primary invasive carcinoma (T1 or T2), no axillary metastases (N0), and all under the age of 55 years. The complete patient data are publicly available at the Nature journal Web site and the author Web site.9
Acquisition of gene expression profiles. There were 7,650 genes assayed by cDNA microarrays on 99 patient samples as described previously by Sotiriou et al. (8). RNA was isolated by using the Trizol method (Invitrogen, Carlsbad, CA). Total RNA from the Universal Human Reference (Stratagene, La Jolla, CA) was amplified and used as a reference for cDNA microarray analysis (8). The detailed protocols for RNA amplification and cDNA probe labeling and hybridization are available online.10
The data set from Sorlie et al. (7) includes 9,216 genes screened on 78 patient samples. Total RNA was isolated by phenol-chloroform extraction (Trizol, Life Technologies), and mRNA was purified by either magnetic separation using Dynabeads (Dynal) or the Invitrogen FastTrack 2.0 kit (7). All experiments and microarray assays were described previously (7) and the detailed protocols are available at Stanford University Web site.11,12
The data provided by van't Veer et al. (1) contain 24,500 genes screened on 98 patient samples. The detailed protocols for RNA isolation, cRNA labeling, and gene expression profiling using microarray were described in the original publication (1).
Missing value replacement and gene matching. The data from Sotiriou et al. (8) were used as training set. The genes that had missing values in more than five samples were excluded from the study. For the remaining genes, missing values were replaced by using the EMV package in software R.13 The missing values were estimated based on a k-nearest neighbor algorithm (k = 20). This algorithm first selects k nearest genes that do not contain any missing values to the one with at least one missing value, based on the Euclidean distance. Then, the missing values are replaced by the average of the neighbors. Genes screened on different microarray platforms were matched by their Unigene Cluster IDs with the interactive MatchMiner (14) Web site interface.14
Feature selection for biomarker identification. Feature selection is important to identify relevant and important genes and to remove irrelevant genes and noise from large scale microarray data sets. A combination of random forests (15) and linear discriminant analysis (LDA) was used to identify gene signatures for predicting breast cancer recurrence/metastases, tumor grade, and nodal status. Random forests of software R was first used to identify a small subset of genes from the original microarray data. LDA of software SAS15 was used to further refine the gene signature.
Random forests is a generalization of the standard tree algorithms (16). The basic step of random forests is to form diverse tree classifiers from a single training set. Each tree is built on a bootstrap sample from the training set. The variables used for splitting the tree nodes are a random subset of the whole variables set. The classification decision of a new case is obtained by majority voting (unless the cutoff value is user defined) over all trees. In random forests, about one third of the cases in the bootstrap sample are not used in growing the tree. These cases are called "out-of-bag" cases and are used to evaluate the algorithm performance. A very important function of random forests is variable importance evaluation. The importance of a variable is defined in terms of its contribution to classification accuracy. Based on the variable importance measure, backward elimination was used to identify the gene subset with the smallest out-of-bag error rate. Here, the out-of-bag error rate was not used to assess the prediction accuracy of the identified gene subsets. Instead, it served as a stopping rule for feature selection. The varSelRF package of software R (17) was used according to the following steps: (1) build a forest with N trees and obtain a ranking of variable importance; (2) remove 20% of the least important variables; (3) construct a new forest with K trees; (4) repeat steps 2 and 3 until two genes are left; and (5) select the gene subset with the smallest out-of-bag error rate.
In the experiments, we chose N = 3,000 and K = 1,000 because the large number of trees in the initial forests are likely to produce stable importance measures (17). We did not follow the "1-SE rule" as suggested by Diaz-Uriarte et al. (17). This rule chooses the smallest gene subset, whose error rate is within one SE of the minimum error rate of all forests. We used the "0-SE rule," which identifies the gene subset with the smallest out-of-bag error rate. The "0-SE rule" usually selects more genes than the "1-SE rule" does. Because further gene filtering would be done by using LDA, we chose the gene subsets with the lowest prediction error using random forests.
Discriminant analysis was used to determine which variables discriminate two or more naturally occurring groups in prognosis. Given several variables as the data representation, each class is modeled as multivariate normal distribution with a covariance matrix and a mean vector. Instances are classified to the label of the nearest mean vector based on Mahalanobis distance. The decision surfaces between classes become linear if the classes have a common covariance matrix. When the distribution within each group is assumed to be multivariate normal, a parametric method can be used to develop a discriminant function. Such function is determined by a measure of generalized square distance, which is based on the pooled covariance matrix as well as the prior probabilities of group membership. The generalized squared distance
from input x to class i is
![]() |
is the squared distance from x to group I; mi is the p-dimensional mean vector for group I; V is the pooled covariance matrix and g(i) depends on the prior probability of class i. In practice, the prior probability can be assumed as equal for all groups (refer to SAS Users' Manual). In this study, we assumed equal prior probability and thus g(i) = 0. x is classified into class I, if
is the smallest among all the distance measures. We selected the gene markers using backward selection of stepwise discriminant analysis with software SAS. Classifier evaluation. LDA of software SAS was used to refine the gene signature obtained from random forests and assess the classification accuracy of models in predicting 5-year relapse-free survival, tumor grade, and nodal status based on the identified gene signatures. Leave-one-out cross-validation was used in the evaluation (18).
To evaluate the accuracy of survival prediction, time-dependent receiver operating characteristic (ROC) analysis for censored data (19, 20) was done with software R. Time-dependent ROC analysis extends the concepts of sensitivity, specificity, and ROC curves for time-dependent binary disease variables in censored data. In our study, the binary disease variable Ri(t) = 1, if patient i has recurrent or metastatic breast cancer before time t; otherwise, Ri(t) = 0. For a diagnostic marker M, both sensitivity and specificity are defined as a function of time t:
![]() |
![]() |
Statistical methods. To estimate a patient's recurrent and metastatic potential, risk scores were generated by fitting the identified gene predictors in a Cox regression model as covariates. The distribution of the risk scores was used to classify the patients into three groups: high risk, low risk, and intermediate risk. Kaplan-Meier analysis was used to assess the disease-free survival probability of three risk groups in the studied patient cohorts (1, 7, 8). To evaluate the association between risk groups and clinicopathologic variables in three patient cohorts (1, 7, 8), the
2 test or Fisher's exact test was used. All statistical testing was done with software R.
| Results |
|---|
|
|
|---|
|
|
|
The identified 28-gene signature of recurrent and metastatic potential is unique. There is no overlap between our gene signature and previously reported survival signatures (1, 8, 12). A literature search found that 17 genes of this 28-gene signature are related to tumorigenesis, and 9 genes are directed linked to breast cancer pathogenesis (Table 1). Furthermore, among the nine breast cancerrelated genes, five genes are the established breast cancer biomarkers [i.e., MCM2 (22), RAD50 (23), PDGFRA (24, 25), S100P (26, 27), and FGF2 (28; Table 1).
Significant association of expression profiledefined risk groups with clinicopathologic characteristics. The clinical variables, such as nodal status, tumor size, tumor grade, ER status, and HER2/neu overexpression in breast cancer patients affect the disease outcomes. In three studied cohorts (1, 7, 8), the distribution of the recurrence risk scores in Cox modeling was used to stratify patients into three groups: high risk, low risk, and intermediate risk (the details are provided in Supplementary Materials). Kaplan-Meier analysis showed that disease-free survival was significantly different (P < 0.005, log-rank tests) among the risk groups in all patient cohorts (Fig. 3 ). The clinical characteristics of each risk group in the studied cohorts were reported in Supplementary Tables S5 to S7, including average disease-free survival days, ER and PR status, HER2/neu overexpression, nodal status, age, tumor size, grade, and treatment received. The results indicated that the expression profiles of the identified 28-gene signature were strongly associated with the clinicopathologic variables, including tumor size, tumor grade, ER and PR status, and HER2/neu overexpression (P < 0.05; Table 2 ). There was insufficient evidence to support the association between ER status and risk groups in Sorlie et al. (7). The reason might be that 80% of the patients in this study exhibited ER positive. No strong evidence was found that age (representing menopausal status) of the patients was associated with the expression profiles of the tumors (Table 2).
|
|
|
| Discussion |
|---|
|
|
|---|
50% of patients died in the absence of recurrent breast cancer (1). In addition, the recurrence gene signature predicts the relapse-free interval and metastases-free interval. Therefore, the expression profiledefined prognostic model robustly predicts all the outcomes we examined. Microarray analyses of breast cancer have identified gene expression profiles associated with patient survival. Sorlie et al. (7) found the expression profiles to distinguish ER-positive from ER-negative tumors with distinct outcomes in a locally advanced cohort treated with primary chemotherapy. van't Veer et al. (1) established a 70-gene signature to predict metastatic potential in an untreated, node-negative cohort. Sotiriou et al. (8) showed the concordance with these previous analyses in node-positive and node-negative patients with the majority receiving adjuvant treatment. Here, we present a powerful systems biology approach to identify unique gene predictors of breast cancer recurrence and metastases, which achieved highly accurate prognosis at the individual level in these three cohorts (1, 7, 8). Our results are remarkable in their high prognostic accuracy for the data from these previous studies despite the differences in patient populations, treatments used, and technology platforms used.
The variables, such as patient's age, tumor size, measurement of ER/PR (by ligand-binding assay), and HER2/neu status (by fluorescence in situ hybridization), are routinely used as predictors of recurrence in breast cancer and are incorporated into current treatment guidelines (34, 35). The results of this study show that the expression profiles of our recurrence gene predictors were strongly associated (P < 0.05) with tumor size, tumor grade, ER, PR, and HER2/neu in the studied patient cohorts. The gene expression profiles in tumors were not significantly associated with the patient's age in all three cases.
We evaluated the recurrence gene signature in the context of the interobserver variability in tumor grading that is typical in oncology practice. Tumor grade correlates with the likelihood of recurrence when analyzed in large populations of patients (12). However, previous studies have also reported that the grading of breast cancer entails subjectivity, leading to considerable variability among pathologists (12, 3640). Our results indicate that the expression of a recurrence gene signature is strongly associated (P < 0.05) with tumor grade in the patients from all three cohorts.
A recent report by a Breast Task Force serving the American Joint Committee on Cancer did not add tumor grade to its staging criteria because of the interobserver variability problem in the current grading system (39). To produce a consistent and standard system for tumor grading, we identified a nine-gene signature to predict tumor grade. The expression profiles of the nine genes (Table 3) accurately classified 85% of the tumor grades in the population-based cohort from Sotiriou et al. (8), with a sensitivity of 87% and a specificity of 83%. The prediction accuracy of our identified nine-gene grade signature was notably high.
The high dimensionality of microarray data has complicated major diagnostic and prognostic breakthroughs in cancer treatment and puts a premium on innovative data mining methods. This article presents a novel computational gene selection system for the identification of molecular signatures. An integration of random forests and LDA was used in this study. Random forests is characterized as an effective machine learning method for processing noisy large-scale data sets. Therefore, we used this algorithm to filter out noninformative genes sequentially from the original microarray data until a small subset of genes was obtained. Then, LDA was used to further filter out more genes. Our previous studies have shown that an integrative feature selection system based on random forests and other state-of-the-art techniques enables the identification of important biomarkers (41, 42). This study is another successful example of the combinatorial gene selection system, providing a general guideline for developing robotic prognosis toward personalized medicine.
| Footnotes |
|---|
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Note: Supplementary data for this article are available at Clinical Cancer Research Online (http://clincancerres.aacrjournals.org/).
Disclaimer: The findings and conclusions in this report are those of the author(s) and do not necessarily represent the views of the National Institute for Occupational Safety and Health.
7 http://www.pnas.org/cgi/content/full/100/18/10393 ![]()
8 http://smd.stanford.edu/cgi-bin/publication/viewPublication.pl?pub_no= 95 ![]()
9 http://www.rii.com/publications/2002/vantveer.html ![]()
10 http://nciarray.nci.nih.gov/reference/index.html ![]()
11 http://cmgm.Stanford.edu/pbrown/ ![]()
12 http://genome-www.stanford.edu/molecularportraits/ ![]()
14 http://discover.nci.nih.gov/matchminer/MatchMinerInteractiveLookup.jsp ![]()
Received 9/ 5/06; revised 12/ 5/06; accepted 1/ 8/07.
| References |
|---|
|
|
|---|
in breast cancer is associated with tumour progression. Breast Cancer Res 2005;7:R78895.[CrossRef][Medline]
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
| Cancer Research | Clinical Cancer Research |
| Cancer Epidemiology Biomarkers & Prevention | Molecular Cancer Therapeutics |
| Molecular Cancer Research | Cancer Prevention Research |
| Cancer Prevention Journals Portal | Cancer Reviews Online |
| Annual Meeting Education Book | Meeting Abstracts Online |