Imaging, Diagnosis, Prognosis
Authors' Affiliations: Departments of 1 Surgery, 2 Pathology, 3 Biostatistics, and 4 Human Genetics, and 5 Center for Biomedical Informatics, Hillman Cancer Center, University of Pittsburgh, Pittsburgh, Pennsylvania
Requests for reprints: Tony E. Godfrey, Mount Sinai School of Medicine, One Gustave L. Levy Place, Box 1079, New York, NY 10029. Phone: 212-659-9082; Fax: 212-849-2523; E-mail: tony.godfrey{at}mssm.edu.
| Abstract |
|---|
Experimental Design: Two previously published lung adenocarcinoma microarray data sets were reanalyzed. Patients were separated into two groups based on pathologic lymph node positive (pN+) or negative (pN0) status, and prediction analysis of microarray (PAM) was used for training and validation to classify nodal status. Overall survival analysis was performed based on PAM classifications.
Results: In the training phase, a 318-gene set gave classification accuracy of 88.4% when compared with pathology. Survival was significantly worse in PAM-positive compared with PAM-negative patients overall (P < 0.0001) and also when confined to pN0 patients only (P = 0.0037). In the validation set, classification accuracy was again 94.1% in the pN+ patients but only 21.2% in the pN0 patients. However, among the pN0 patients, recurrence rates and overall survival were significantly worse in the PAM-positive compared with PAM-negative patients (P = 0.0258 and 0.0507).
Conclusions: Analysis of gene expression profiles from primary tumor may predict lymph node status but frequently misclassifies pN0 patients as node positive. Recurrence rates and overall survival are worse in these "misclassified" patients, implying that they may in fact have occult disease spread.
Key Words: Lung cancer; metastasis; metastasis genes; metastasis model; gene expression profiling
14% (1). Even in stage I (lymph node negative) patients, recurrence rates range from 25% to 50% and result in poor (50-70%) 5-year survival in this group (2). Current staging methods are unable to identify those patients who will ultimately have poor outcome, and we are therefore in need of a clinically useful approach to better stratify patients with respect to the risk of recurrence and survival. One much-studied approach is to enhance the detection of tumor spread to lymph nodes using either immunohistochemistry for tumor cells or molecular detection of cancer-related RNA. A review of this literature indicates that occult lymph node metastases (micrometastases) are indeed clinically relevant in non-small cell lung cancer and confer worse prognosis (3). Recently, however, Beer et al. reported that microarray analysis of gene expression in lung adenocarcinomas can also predict outcome in stage I patients (4). In that study, survival of the low-risk stage I patients was almost 90%, whereas in the high-risk group, survival was similar to that expected for pathologic N1 (pN1) lung cancer patients (42%; ref. 5). Also interesting was the fact that the stage III patients did not show a significant survival difference between the high-risk and low-risk groups. Whereas this may be attributable to the small sample size, an alternative hypothesis is that the microarray data identify patients at high risk for occult lymph node disease in the stage I group and that this is irrelevant in the stage III patients who already have overtly positive lymph nodes. Although this hypothesis is very tenuous based solely on the data from Beer et al., two recent articles lend some support (6, 7). In the first article by Ramaswamy et al., the authors analyzed gene expression in primary adenocarcinoma samples and compared this with gene expression in metastatic tumors derived from unmatched primary adenocarcinomas.
Using these data, a 128-gene signature was identified that distinguished reasonably well between primary tumors and metastases. When applied to 62 lung adenocarcinomas, the authors found that survival was significantly worse in patients whose tumors bore the metastasis-associated gene expression signature (P = 0.009). The authors, however, did not report on correlation of the microarray data with pathologic nodal status in these 62 patients. From their data, Ramaswamy et al. concluded that some primary tumors are preconfigured to metastasize, and that this propensity is detectable at the time of initial diagnosis. In the second study, Huang et al. analyzed breast cancer samples and identified separate gene sets that seem to predict lymph node metastasis and overall patient outcome. Specifically relating to lymph node status, the authors identified a "metagene" signature that predicted pathologic nodal involvement with
90% accuracy. From these articles, it seems that while metastatic potential and patient outcome are clearly interrelated events, they may be the result of distinct biological processes, the signatures for which can be identified by analysis of the primary tumor. The goals of this study were to reanalyze publicly available gene expression data from lung adenocarcinoma samples and identify gene expression patterns that correlated with lymph node metastases. These gene expression patterns were then applied to pathologically node-negative patients to determine if the primary tumor microarray data could identify patients at risk for occult disease and worse survival.
| Materials and Methods |
|---|
Data compilation and preanalysis. There were 7,129 Affymetrix probe sets (genes) in the first data set and 12,625 probe sets in the second set. Probe sets with a NULL gene name (ESTs) were excluded, and a total of 5,377 genes were subsequently identified as being common to both data sets (Supplementary data). When a gene was represented by multiple probe sets, its expression value was taken as the average of the values from all of those probe sets. To assess data quality and to identify any differences between data sets or between pN0 and pN+ patients, a variety of data visualization and data quality measurements were employed. These included data quality graphs (mean versus variance; multiplicative-additive plots; correlograms) and measures (confounding index; among-array coefficient of variation; mean group correlations). All analyses were done using the University of Pittsburgh Cancer Institute web application for microarray analysis (caGEDA; cancer gene expression data analyzer), located at http://bioinformatics.upmc.edu/GE2/GEDA.html (9).
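The compilation step described above (drop unnamed probe sets, average probe sets that map to the same gene, keep genes common to both data sets) can be illustrated with a minimal sketch. The function names, the input layout, and the toy expression values are assumptions made for illustration only; they are not taken from caGEDA or from the study data.

```python
from collections import defaultdict
from statistics import mean


def average_by_gene(probe_rows):
    """Collapse probe-set rows to one row per gene by averaging per sample.

    probe_rows: list of (gene_name, [expression value per sample]).
    Rows without a gene name (None or empty string) are dropped.
    """
    grouped = defaultdict(list)
    for gene, values in probe_rows:
        if not gene:
            continue  # unnamed probe set (e.g., EST): excluded
        grouped[gene].append(values)
    # zip(*rows) walks the samples column by column across probe sets
    return {
        gene: [mean(column) for column in zip(*rows)]
        for gene, rows in grouped.items()
    }


def common_genes(set_a, set_b):
    """Sorted gene names present in both gene-level data sets."""
    return sorted(set(set_a) & set(set_b))


# Toy, made-up values (two samples in one set, three in the other)
michigan = average_by_gene([
    ("COL1A1", [100.0, 200.0]),
    ("COL1A1", [300.0, 400.0]),
    ("MGP", [50.0, 60.0]),
    (None, [1.0, 2.0]),
])
harvard = average_by_gene([
    ("COL1A1", [10.0, 20.0, 30.0]),
    ("HLA-B", [5.0, 6.0, 7.0]),
])
shared = common_genes(michigan, harvard)
```

In this toy input, the two COL1A1 probe sets are averaged sample by sample, the unnamed row is discarded, and only COL1A1 is shared between the two sets.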
Data analysis and prediction of lymph node status. Prediction of lymph node status was done in a training set (Michigan data) and test set (Harvard data) approach using prediction analysis of microarray (PAM) software available free from the URL http://www-stat.stanford.edu/~tibs/PAM. This software uses a variation of shrunken-centroid classification with an automated gene selection step integrated into the algorithm. During 10-fold cross-validation, a shrinkage factor (Δ) is varied in search of optimal classifier performance. The value that returns the lowest classification error with the fewest genes is favored as optimal. The variable Δ is a soft-thresholding variable that is reported to usually produce more reliable estimates of the true means (10). The Δ value obtained from training on the entire data set from the Michigan study was subsequently used for predicting the lymph node status of each sample in the Harvard data set. Survival comparisons between the predicted classes were done using Kaplan-Meier survival estimates and the log-rank test for determining survival differences.
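The soft-thresholding idea behind shrunken-centroid classification can be sketched as follows. This is a simplified illustration, not the PAM implementation: it omits the per-gene standardization that PAM applies before shrinking, and the function names, class labels, and numbers are hypothetical.

```python
def soft_threshold(d, delta):
    """Shrink d toward zero by delta; anything within delta of zero becomes 0."""
    magnitude = abs(d) - delta
    if magnitude <= 0:
        return 0.0
    return magnitude if d > 0 else -magnitude


def shrunken_centroids(class_means, overall_mean, delta):
    """Shrink each class centroid toward the overall centroid, gene by gene.

    Simplification (assumption): differences are shrunk on the raw scale,
    without the per-gene standardization used by PAM.
    """
    return {
        label: [o + soft_threshold(m - o, delta)
                for m, o in zip(means, overall_mean)]
        for label, means in class_means.items()
    }


def active_genes(shrunk, overall_mean):
    """Indices of genes where some class centroid still differs from the overall mean."""
    return [
        i for i, o in enumerate(overall_mean)
        if any(centroid[i] != o for centroid in shrunk.values())
    ]


# Hypothetical per-class mean expression for three genes
overall = [5.0, 5.0, 5.0]
means = {"pN0": [4.0, 4.75, 5.0], "pN+": [7.0, 5.5, 5.0]}
shrunk = shrunken_centroids(means, overall, delta=0.5)
kept = active_genes(shrunk, overall)
```

With a threshold of 0.5 in this toy case, only the first gene keeps a class-specific centroid; the other two collapse onto the overall mean and no longer contribute to classification. Raising the threshold removes more genes, which is how gene selection is built into the classifier.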
| Results |
|---|
Correlation of prediction analysis of microarray nodal classification with recurrence and survival. In the training data set, Kaplan-Meier survival estimates showed that survival was significantly worse in PAM-positive cases than in PAM-negative cases (P < 0.0001) for all 86 patients. Furthermore, when the survival analysis was restricted to the pN0 patients only, we found that survival was worse in 9 PAM-positive cases compared with 60 PAM-negative cases (P = 0.0037; Fig. 3A). Thus, PAM successfully identified a subgroup (13%) of patients with worse outcome within the pN0 patients based on the microarray nodal status classification.
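For readers unfamiliar with the Kaplan-Meier estimate used in these comparisons, the product-limit calculation can be sketched as below. The follow-up times and event indicators are invented for illustration and are not patient data from either data set; the log-rank test itself is not shown.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times:  follow-up time for each patient
    events: 1 if death was observed at that time, 0 if censored
    Returns [(time, survival probability)] at each time with a death.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Gather every patient whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set
        at_risk -= leaving
    return curve


# Hypothetical follow-up in months; the second patient at 12 months is censored
times = [6, 12, 12, 20, 30]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
```

Censored patients reduce the number at risk without lowering the survival estimate, which is why the curve steps down only at times where a death occurred.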
| Discussion |
|---|
More recently, analysis of gene expression in the primary tumor has been shown to correlate with patient outcome (4). Thus, it is possible that surgical removal and pathologic evaluation of lymph nodes may not be necessary to accurately predict a patient's survival probability. However, because survival is linked to effective treatment, and effective treatment is currently determined by nodal stage, it is unclear whether predicting patient outcome alone will be sufficient to avoid the need for accurate lymph node assessment. With this in mind, other recent studies have also shown that it may be possible to predict metastatic potential (6) and specifically lymph node metastases (7, 11) using analysis of gene expression in the primary tumor. For example, Huang et al. used microarrays to determine gene expression in 89 primary breast tumors and then used complex statistical approaches to identify patterns and interactions between genes that correlated with patient outcome. The authors identified patterns of gene expression that were associated with either lymph node involvement or survival, indicating that these two end points represent distinct biological processes, involving different gene sets. Interestingly, the genes that correlated with survival were the classic oncogenes associated with cell cycle, growth, cell signaling, etc., whereas the genes associated with lymph node involvement tended to be chemokines, chemokine receptors, and IFN-associated genes. This is very similar to the data reported by Ramaswamy et al. and raises the hypothesis that the ability of a tumor to metastasize is dependent on the host response to the tumor, and that this is what can actually be detected by analysis of the primary tumor. Thus, it is possible that analysis of the primary tumor may provide information on both the probability of survival and lymph node metastasis.
Understanding the genomic differences between these closely related end points may prove useful for determining the appropriate treatment of non-small cell lung cancer patients.
In the current study, we specifically aimed to determine whether analysis of gene expression in primary non-small cell lung cancer could predict lymph node status and, if so, predict outcome in lymph node-negative patients, many of whom are presumed to have occult lymph node metastases (3). For our study, we used pathologic lymph node status as the training classifier, and used microarray analysis software (PAM) to identify gene patterns that specifically correlate with pathologic nodal status. PAM identified 318 significant genes that were associated with nodal metastases. Of the top 50 genes (Table 2), 35 are overexpressed in tumors with lymph node metastases and the remaining 15 genes are underexpressed. These genes are involved in a wide spectrum of biological functions including extracellular matrix, protein synthesis/degradation, cell adhesion and structure, and regulation of transcription/translation. Interestingly, among the 35 overexpressed genes, four genes are from the collagen gene family and four are from the S100 calcium binding protein family. The up-regulation of collagen genes in tumors with nodal metastasis is consistent with observations that interactions between tumor cells and the surrounding stroma are critical for tumor cell invasion (12-14). High levels of collagen gene expression have also been observed in advanced gastric and ovarian cancers using cDNA microarrays and serial analysis of gene expression, and in colorectal cancer using differential display reverse transcription-PCR (15-17). Overexpression of S100A4 and S100A2 has also been associated with metastasis and survival in patients with early-stage breast cancer (18) and non-small cell lung cancer (19). Of the underexpressed genes, down-regulation of the HLA-B gene may allow tumor cells to escape the T-cell response, a potential mechanism for evading immune surveillance (20-22).
Down-regulation of another gene, MGP, has also been seen in colon cancer (23) and has been associated with metastasis in prostate adenocarcinoma (24). Therefore, capture of these genes as the top genes associated with nodal metastases by PAM classification probably reflects the biological pathways contributing to the nodal metastases phenotype.
One clear weakness in our study is that many of the pN0 patients may actually have occult nodal involvement and, therefore, our gold standard for the training classifier is not necessarily accurate in pathologically node-negative cases. However, it should be 100% accurate in the case of node-positive tumors. Indeed, in both the training and validation data sets, we found extremely high (94%) concordance between microarray analysis and pathologic node status for tumors with pathologically positive lymph nodes. In node-negative tumors, however, microarray analysis of the training set data predicted that 9 of 69 (13%) pN0 tumors were actually node positive. This finding agrees quite well with estimates of the prevalence of occult lymph node disease in non-small cell lung cancer (3), and survival of these nine patients was significantly worse than for the remaining 60 pN0 patients (P = 0.0037). In the validation set, however, an unexpectedly high number (41 of 52, 79%) of pN0 tumors were classified as positive by PAM. Clearly, this is too high to be explained by the presence of occult disease alone and, therefore, we considered the following alternative possibilities: (1) Average expression values from the Harvard data set were much higher than those from the Michigan data set. This might cause prediction bias when the classifier trained on the Michigan data set was applied to the Harvard data. To reduce the differences between the two data sets, we normalized the data using Z transformation but did not find any improvement in the performance of training and testing. (2) PAM was overtrained with the training data set. To test this hypothesis, a simulated data set that included 5,377 values (randomly assigned from 10 to 2,000) in 86 samples with two classes (randomly assigned with 1 or 2) was generated and applied to PAM for classification training.
The overall classification error with cross-validation was 40% to 50% and classification probabilities for individual samples were from 40% to 60% (Supplementary Fig. 1). This shows that PAM classified the simulated data randomly (as it should) and indicates that the method does not overtrain for classification. (3) The patient populations in the two data sets differed significantly in terms of staging and/or outcome. Comparing patient survival between the two data sets showed that patients from the Harvard data set had significantly worse survival than patients from the Michigan data set (P = 0.0245). Interestingly, this difference was completely due to differences in survival among the pN0 patients in the two data sets (P = 0.058), whereas there was no difference in survival of the pN+ patients between the two data sets (P = 0.9499; Supplementary Fig. 2).
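The Z transformation mentioned under possibility (1) can be sketched as follows. The text does not state whether standardization was applied per gene or per sample, so the per-gene, per-data-set version shown here is an assumption, and the gene name and values are invented.

```python
from statistics import mean, pstdev


def z_transform(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant gene: no information to scale
    return [(v - mu) / sigma for v in values]


def z_transform_by_gene(dataset):
    """Apply the Z transformation to each gene within one data set.

    dataset: {gene_name: [expression value per sample]}
    Assumption: standardization is done per gene, separately in each data set.
    """
    return {gene: z_transform(values) for gene, values in dataset.items()}


# Hypothetical gene measured on very different scales in the two data sets
michigan = {"GENE_A": [100.0, 200.0, 300.0]}
harvard = {"GENE_A": [1000.0, 2000.0, 3000.0]}
michigan_z = z_transform_by_gene(michigan)
harvard_z = z_transform_by_gene(harvard)
```

After the transformation, the two toy data sets are on the same scale even though the raw Harvard values are tenfold higher. This removes differences in average intensity but cannot correct for differences in the underlying patient populations.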
When we analyzed survival in PAM-negative and PAM-positive patients within Harvard pN0 patients, 46.3% of PAM-positive patients suffered disease recurrence compared with only 9% in the PAM-negative patients. Thus, we believe that the poor outcome in the pN0 Harvard patients reflects understaging (possibly occult nodal disease) in a higher than normal percentage of patients, and to some degree this is responsible for the high number of patients called node positive in the microarray analysis. Furthermore, when we used the Harvard data set for training, 70% of the pathology-positive and 38% of the pathology-negative patients were incorrectly classified by PAM. Overall, cross-validated classification accuracy when training with the Harvard data set was only 54%. Again, we hypothesize that this inability to attain good classification accuracy (even within the training set itself) may be a result of a high frequency of occult lymph node disease in this patient set. Despite this, however, 22 patients who did not suffer disease recurrence were still called node positive in our analysis (when training with the Michigan data set), implying that many patients who were probably truly node negative were misclassified in the test set. This may be a result of differences between the two microarray data sets or may reflect the need to refine the analysis tools used for prediction. Another possibility is that some of the misclassified tumors do actually have the potential to metastasize but have not yet done so. Presumably, metastasis requires not only metastatic ability but also time, and it is therefore possible that the actual detection of lymph node disease (overt or occult) will prove to be a more powerful predictor of outcome than the primary tumor gene expression. Conversely, however, one could envision a scenario where tumor cells have spread to lymph nodes (or distant sites) but are unable to divide, grow, and/or evade the host immune system. 
This may be encoded in the genetics of the primary tumor in which case the gene expression data may be the more powerful predictor.
In summary, our classification based on gene expression in primary tumors agreed very closely with pathologic nodal status in pathologically node-positive patients. Furthermore, recurrence rates and overall survival were worse in the pN0 patients who were "misclassified" as positive by microarray analysis, and it seems likely that the analysis may be identifying patients whose tumors have already metastasized or have the potential to do so. However, one problem with this interpretation of the data is that we do not have actual data on occult lymph node disease (detected by immunohistochemistry or reverse transcription-PCR) in pN0 patients. Instead, we are using recurrence and overall survival as surrogates for occult disease and this end point may reflect not only lymph node spread but also hematogenous spread and clinically occult metastases to other organs. Furthermore, because lymph node metastasis and survival are closely correlated, it is possible that despite training based on lymph node status, our analysis is actually identifying genes that correlate with overall outcome and not specifically with the ability to metastasize. Thus, we realize that our explanation for the unexpectedly high false-positive results from microarray analysis may in fact be somewhat circular. This issue can only be resolved in a situation where tumor gene expression, accurate clinical and surgical staging, and presence of occult nodal disease are determined in the same set of patients.
| Footnotes |
|---|
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Note: Supplementary data for this article are available at Clinical Cancer Research Online (http://clincancerres.aacrjournals.org/) or (http://www.mssm.edu/labs/godfrt01/publications/supp.htm).
L. Xi and T.E. Godfrey are currently at Department of Medicine, Mount Sinai School of Medicine, New York, NY 10029.
Received 12/13/04; revised 2/11/05; accepted 3/3/05.
| References |
|---|