
Imaging, Diagnosis, Prognosis
Authors' Affiliations: 1 University of Pennsylvania Medical Center and 2 The Wistar Institute, Philadelphia, Pennsylvania; 3 Columbia University Medical Center and 4 Memorial Sloan-Kettering Cancer Center, New York, New York; 5 University of Minnesota Medical Center, Minneapolis, Minnesota; 6 University of North Carolina-Chapel Hill, Chapel Hill, North Carolina; and 7 University of Oxford, Oxford, United Kingdom
Requests for reprints: Louise C. Showe, The Wistar Institute, 3601 Spruce Street, Philadelphia, PA 19104. E-mail: lshowe{at}wistar.upenn.edu.
| Abstract |
|---|
Experimental Design: Gene expression patterns derived from 28 patients with HNSCC or LSCC from a single center were analyzed using penalized discriminant analysis. Validation was done on previously published data for 134 total subjects from four independent Affymetrix data sets.
Results: We identified a panel of 10 genes (CXCL13, COL6A2, SFTPB, KRT14, TSPYL5, TPM3, KLK10, MMP1, GAS1, and MYH2) that accurately distinguished these two tumor types. This 10-gene classifier was validated on 122 subjects derived from four independent data sets, with an average accuracy of 96%. Gene expression values were validated by quantitative reverse transcription-PCR on 12 independent samples (seven HNSCC and five LSCC). The 10-gene classifier was also used to determine the site of origin of 12 lung lesions from patients with prior HNSCC.
Conclusions: The results suggest that penalized discriminant analysis using these 10 genes will be highly accurate in determining the origin of squamous cell carcinomas in the lungs of patients with previous head and neck malignancies.
In some cases, the distinction between a lung metastasis and a second primary lung carcinoma can be easily made on clinical grounds. The presence of multiple pulmonary nodules is usually considered evidence of metastatic disease. However, in subjects who present with a solitary lung nodule, the distinction between metastasis and primary carcinoma can be more problematic. Usually, patients with HNSCC who are found to have solitary pulmonary lesions undergo surgery or needle biopsy with pathologic evaluation. If the lung lesion is also of squamous cell histology, the distinction between metastasis and primary LSCC is extremely difficult. Currently, this distinction is made by comparison of histologic grade or by the presence of other premalignant changes in the respiratory epithelium; however, the accuracy of this approach is unclear.
Making the correct diagnosis has practical importance for choice of therapy. Although patients with either a primary LSCC or a solitary HNSCC metastasis may be eligible for surgical resection, the choice of surgical procedure and the use of adjuvant therapy are usually different in these situations. Additionally, patients with early-stage LSCC have a significantly better prognosis than patients with metastatic HNSCC.
Recent gene expression studies have shown the potential to classify the origin of human carcinoma cell lines (3) and human tumors (4, 5). We have compared HNSCC and LSCC tumors using gene expression profiling with the goal of identifying a small number of differentially expressed genes that could ultimately prove useful and practical in distinguishing primary lung cancer from HNSCC metastases to the lung. Using a training set/validation set approach, we show that penalized discriminant analysis (PDA) can correctly classify patients with HNSCC and LSCC with high accuracy using a discriminant model with as few as 10 genes. The gene expression results were further validated by quantitative reverse transcription-PCR (QRT-PCR) data derived from 12 independent samples for 19 genes. Our classification algorithm also correctly classified a set of 12 squamous lung lesions of undetermined origin from patients with prior HNSCC.
| Materials and Methods |
|---|
Intraoperative tumor samples were routinely dissected from surrounding normal tissue, but no microdissection was done. H&E staining was done to verify the presence of >70% tumor cells. Samples were immediately frozen in liquid nitrogen before RNA extraction.
RNA preparation, target preparation, and hybridization. RNA was extracted from the tumor specimens as previously described (7). All hybridization protocols were conducted as described in the Affymetrix GeneChip Expression Analysis Technical Manual at the University of Pennsylvania Microarray Core. RNA was hybridized to Affymetrix U133A GeneChips (Affymetrix) using standard conditions in an Affymetrix fluidics station.
External data sources. Gene expression profiling data of HNSCC and LSCC tumor samples were provided by four external sources. The samples were analyzed on two different Affymetrix chips, U133A and U95Av2. U133A data included 41 HNSCC samples from the University of Minnesota (8). U95Av2 data sets included 11 LSCC samples from Columbia University (9, 10), 21 LSCC samples from the Dana-Farber Cancer Institute (11), and 49 samples (18 LSCC, 31 HNSCC) from Memorial Sloan-Kettering Cancer Center (12). U95Av2 data from 12 squamous cell lung lesions from patients with previous HNSCC were also provided by Memorial Sloan-Kettering Cancer Center (12). The Dana-Farber Cancer Institute data are available online.8 All other published data were kindly provided by investigators at the individual institutions. Patient characteristics and details of tissue acquisition, RNA isolation, and array hybridization have been previously described for these four data sets.
Identifying U95Av2 and U133A common genes. Common genes were linked between the two chip types using Affymetrix probe set identifiers. Probe sets that were common between the two different platforms (U95Av2 versus U133A) were aligned using the "best match" file.9 This spreadsheet identifies the probe sets from the two platforms that are most similar based on several factors, including target sequence match and percentage identity. A total of 9,530 probe sets were overlapping between U95Av2 and U133A.
Microarray normalization. The CEL files for each data set were reprocessed using a publicly available implementation of Robust Multichip Average expression summary (RMAExpress) Version 0.3 (13). Default settings were used for background adjustment, quantile normalization, and log2 transformation. Samples from the different institutions were processed as independent groups.
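As an illustration of the quantile normalization step only, the following is a minimal sketch and not the RMAExpress implementation. It assumes each sample is a list of probe intensities of equal length, does not treat ties specially, and omits background adjustment and probe set summarization. The function name and the toy intensities are hypothetical.

```python
import math

def quantile_normalize(columns):
    """Force every sample (column) to share the same intensity distribution.

    columns: list of samples, each a list of intensities of equal length.
    Ties are ranked in input order (a simplification).
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Mean intensity at each rank, taken across all samples.
    rank_means = [sum(sc[i] for sc in sorted_cols) / len(columns) for i in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized

# Toy intensities (hypothetical): two samples, three probes each.
samples = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
qn = quantile_normalize(samples)
log2_qn = [[math.log2(v) for v in col] for col in qn]
```

After normalization every sample contains the same set of values, and only their assignment to probes differs.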
Distance weighted discrimination. The distance weighted discrimination (DWD) method is a generalization of the support vector machine, a multivariate technique (14). DWD has been previously shown to be well suited for correction of the systematic biases associated with microarray data sets (15). DWD performance and robust quantification of systematic bias have been reported to be superior to those of classic methods (such as principal component analysis, linear discriminant analysis, and standard linear support vector machine). A detailed description of DWD is given in ref. (16). The DWD calculations were carried out using a Java-based version of the DWD method.10 The following settings were used for the input variables: (a) DWD type, nonstandardized DWD, and (b) mean adjustment type, centered at the second mean.
Hierarchical clustering. Hierarchical clustering was done using the Pearson correlation distance metric and Ward's linkage. For visual enhancement of Figs. 1 and 2 (showing the results of biclustering of samples and the selected genes), the clustering was carried out after the values for each gene were converted to z scores by subtracting the corresponding gene mean that was computed over all samples being clustered, and dividing by the corresponding SD. Additionally, to keep figure space as compact as possible, the relative length of the main stem that partitions the clustered samples into two main subclusters has been reduced 5-fold in Fig. 3A and B, depicting clustering of samples before and after DWD transformation.
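The per-gene z score conversion described above can be sketched as follows. This is an illustration under stated assumptions: rows are genes, columns are samples, and the sample SD is used (the text does not say whether the sample or population SD was applied). The function name and toy values are hypothetical.

```python
from statistics import mean, stdev

def zscore_rows(matrix):
    """Convert each gene (row) to z scores across all samples (columns)."""
    scored = []
    for row in matrix:
        mu = mean(row)
        sd = stdev(row)  # assumption: sample SD (n - 1 denominator)
        scored.append([(x - mu) / sd for x in row])
    return scored

# Toy log2 expression values (hypothetical): two genes, three samples.
expr = [[1.0, 2.0, 3.0], [10.0, 10.0, 40.0]]
z = zscore_rows(expr)
```

Each row of `z` then has mean 0 and SD 1, so genes with very different absolute expression levels contribute on a comparable scale in the heat map.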
PDA with recursive feature elimination. The genes that contribute the most to the classification model were selected as follows:
First, the 30% of genes least differentially expressed between the HNSCC and LSCC data sets were eliminated based on the P values from a univariate t test. A progressive scheme of gene reduction was then applied in which the least informative genes (usually from 1% to 10%) were removed iteratively. This process was repeated until only one gene remained. A discriminant model was fitted at each reduction and each gene was assigned a computed "predictive power" (discriminant weight × SD), which estimates the contribution of that gene to the discriminant score. The discriminant scores (either positive or negative) define which of the two experimental classes a particular sample belongs to and how well each sample is classified.
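The elimination loop can be illustrated with a short sketch. It is not the authors' implementation: a diagonal linear discriminant (difference of class means divided by pooled variance) stands in for PDA, the initial t test filter is omitted, and a fixed 10% of genes is dropped per round. All function names and the toy data are hypothetical.

```python
from statistics import mean, pvariance, pstdev

def predictive_power(X, y):
    """Return |weight x SD| per gene.

    X: list of samples (rows) by genes (columns); y: list of 0/1 class labels.
    The weight comes from a diagonal linear discriminant, a stand-in for PDA.
    """
    powers = []
    for g in range(len(X[0])):
        col = [row[g] for row in X]
        a = [v for v, lab in zip(col, y) if lab == 0]
        b = [v for v, lab in zip(col, y) if lab == 1]
        pooled_var = (pvariance(a) * len(a) + pvariance(b) * len(b)) / len(col)
        weight = (mean(b) - mean(a)) / (pooled_var + 1e-9)
        powers.append(abs(weight * pstdev(col)))
    return powers

def recursive_elimination(X, y, drop_fraction=0.1):
    """Refit, drop the least informative genes, and repeat until one remains.

    Returns gene indices with the earliest-removed genes first and the
    surviving gene last.
    """
    remaining = list(range(len(X[0])))
    removed = []
    while len(remaining) > 1:
        sub = [[row[g] for g in remaining] for row in X]
        powers = predictive_power(sub, y)
        n_drop = max(1, int(len(remaining) * drop_fraction))
        n_drop = min(n_drop, len(remaining) - 1)
        worst = sorted(range(len(remaining)), key=lambda i: powers[i])[:n_drop]
        for i in sorted(worst, reverse=True):
            removed.append(remaining.pop(i))
    return removed + remaining

# Toy data (hypothetical): gene 0 separates the classes, genes 1 and 2 do not.
X = [[1.0, 5.0, 3.0], [1.2, 4.0, 3.1], [0.9, 5.5, 2.9],
     [3.0, 4.5, 3.0], [3.1, 5.2, 3.2], [2.9, 4.8, 2.8]]
y = [0, 0, 0, 1, 1, 1]
ranking = recursive_elimination(X, y)
```

Reading `ranking` from the end gives the most informative genes, which is how a top-10 or top-100 list could be taken from such a procedure.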
Resampling procedure. To evaluate the robustness of our classifier and to estimate the confidence intervals for the classification scores for each sample in the independent validation set, PDA with recursive feature elimination was carried out on 100 subsets of the University of Pennsylvania training set and applied to classify the validation samples. The 100 training subsets were generated by random resampling without replacement (jackknifing) from 28 samples in the University of Pennsylvania data set. Each subset contained 90% of the 28 original samples, with the same proportion of LSCC and HNSCC.
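A minimal sketch of the stratified resampling scheme follows. The 18/10 class split used in the example is a hypothetical placeholder for illustration, and the function name, seed, and rounding rule are assumptions rather than details taken from the study.

```python
import random

def stratified_subsets(labels, n_subsets=100, fraction=0.9, seed=0):
    """Draw subsets without replacement, keeping class proportions fixed."""
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    subsets = []
    for _ in range(n_subsets):
        chosen = []
        for members in by_class.values():
            k = round(len(members) * fraction)  # assumption: round per class
            chosen.extend(rng.sample(members, k))
        subsets.append(sorted(chosen))
    return subsets

# Hypothetical class split of the 28 training samples.
labels = ["HNSCC"] * 18 + ["LSCC"] * 10
subsets = stratified_subsets(labels)
```

Each subset would then be used to refit the classifier, and the spread of the resulting discriminant scores for a validation sample gives its confidence interval.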
Quantitative real-time PCR. Gene-specific primers (IDT, Inc.) were designed with the Light Cycler Probe Design Software, Version 1.0 (Idaho Technology, Inc.), and ABI PRISM PrimerExpress software, Version 2.0 (Applied Biosystems, Inc.). Primers were selected from the 3' half of the message using sequence retrieved from the GenBank database and in almost all cases from different exons. The PCR reaction was done in 20 µL as previously described (20) using the Chromo4 PTC-200 Peltier Thermal Cycler (MJ Research). All primers were designed to have a melting temperature of 60°C. The PCR cycle variables were as follows: 95°C 3 min hot start, 40 cycles of 95°C 20 s, 60°C 10 s, 72°C 20 s, and 78°C 5 s (to ensure elimination of side product). SYBR green I fluorescence intensity was measured at the end of each 72°C extension as previously described (20). Results were normalized to GAPDH as the housekeeping gene and values were calculated relative to a standard curve generated using the Stratagene universal standard RNA, which had been supplemented with RNA from the Jar and HT3 epithelial cell lines. The same standard RNA mixture was used for all comparisons. Product specificity was assessed by melting curve analysis and selected samples were run on 2% agarose gels for size assessment. Quality of real-time PCR was determined in two ways: the amplification efficiencies had to be 100 ± 10%, and the correlation coefficients (r²) had to be >0.95. The cDNA for PCR amplification was prepared from 0.5 µg of the amplified RNA using Superscript II as previously described. The amplified RNA was generated from 250 ng of total RNA subjected to one round of linear amplification using the RiboAmp RNA Amplification Kit (Arcturus, Inc.). Some samples were also assayed from cDNA prepared from total RNA with similar results.
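The two quality criteria can be checked from a standard curve as sketched below. The text does not state how efficiency was computed; this sketch assumes the commonly used relation efficiency = 10^(-1/slope) - 1, where the slope comes from regressing threshold cycle on log10 of input quantity. The function name and the dilution series are hypothetical.

```python
def standard_curve(log10_quantity, ct):
    """Least-squares fit of Ct against log10(quantity).

    Returns (slope, r_squared, efficiency_percent).
    """
    n = len(ct)
    mx = sum(log10_quantity) / n
    my = sum(ct) / n
    sxx = sum((x - mx) ** 2 for x in log10_quantity)
    syy = sum((c - my) ** 2 for c in ct)
    sxy = sum((x - mx) * (c - my) for x, c in zip(log10_quantity, ct))
    slope = sxy / sxx
    r_squared = sxy ** 2 / (sxx * syy)
    # Assumed relation: perfect doubling per cycle gives a slope near -3.32.
    efficiency_percent = (10 ** (-1 / slope) - 1) * 100
    return slope, r_squared, efficiency_percent

# Hypothetical 10-fold dilution series of the standard RNA.
log10_q = [0, 1, 2, 3, 4]
ct_values = [30.0, 26.7, 23.3, 20.0, 16.7]
slope, r2, eff = standard_curve(log10_q, ct_values)
passes_qc = 90 <= eff <= 110 and r2 > 0.95
```

A run would be accepted under the stated criteria only when `passes_qc` is true.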
| Results |
|---|
Validation of the discriminant model on the independent test set. The discriminant model using the genes identified by PDA with recursive feature elimination on the University of Pennsylvania training set was then applied to classify 72 HNSCC and 50 LSCC samples in the DWD-adjusted validation cohort. The observed accuracy of classification as a function of the total number of genes retained in the discriminant model is shown in Fig. 3. Values are shown for classifiers ranging from 1 to 100 genes. There is little change in accuracy between 100 and 5 genes. Because the classification accuracies were essentially the same with 5 or 10 genes, we used the 10 genes in further tests to accommodate greater heterogeneity that may exist in a larger sample set. Using this 10-gene classifier, the measurements of average accuracy, sensitivity, and specificity were each calculated to be 96%. Therefore, 10 genes are sufficient to robustly discriminate the HNSCC samples from LSCC samples in the validation set.
In applying the 10-gene classifier, each sample in the validation set is given a discriminant score that is a measure of how well it is classified. The discriminant scores for each individual subject in the validation cohort are shown in Fig. 4. Of the 122 total samples, only five were misclassified: three LSCC and two HNSCC. Two of the misclassified LSCC samples were considered to be borderline cases. These samples had a low predictive score, shown by the low column height in Fig. 4, and error bars that cross the zero line separating the two classes. The 10 genes used for this classification include chemokine ligand 13 (CXCL13); collagen, type VI, α2 (COL6A2); surfactant protein B (SFTPB); keratin 14 (KRT14); TSPY-like 5 (TSPYL5); tropomyosin 3 (TPM3); kallikrein 10 (KLK10); matrix metalloproteinase 1 (MMP1); growth arrest-specific 1 (GAS1); and myosin, heavy polypeptide 2, skeletal muscle, adult (MYH2). These are highlighted in yellow on the tree view in Fig. 2.
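The reported counts can be turned into single-pass performance figures with simple arithmetic. The sketch below uses only the numbers given in the text (72 HNSCC and 50 LSCC samples, with two HNSCC and three LSCC misclassified). Treating HNSCC as the "positive" class is an assumption made here for illustration. The per-class rates from this single set of counts differ slightly from the 96% reported for each measure, which the text describes as averages.

```python
# Counts reported in the text.
n_hnscc, n_lscc = 72, 50
missed_hnscc, missed_lscc = 2, 3

correct = (n_hnscc - missed_hnscc) + (n_lscc - missed_lscc)
accuracy = correct / (n_hnscc + n_lscc)

# Assumption: HNSCC is the "positive" class, LSCC the "negative" class.
sensitivity = (n_hnscc - missed_hnscc) / n_hnscc
specificity = (n_lscc - missed_lscc) / n_lscc

print(f"accuracy={accuracy:.3f} "
      f"sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
```

Overall accuracy is 117 of 122 samples, which rounds to the reported 96%.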
| Discussion |
|---|
It is important to differentiate between these possibilities because the prognosis and treatment of a primary versus metastatic lesion are different. Studies have shown that the surgical approach for a primary LSCC should be a lobectomy rather than a lesser resection (23), whereas the goal of resection in pulmonary metastases is to remove all gross tumor while preserving as much normal parenchyma as possible. This can usually be achieved via a wedge resection. Additionally, the role of lymph node dissection, which is standard for primary lung cancer resection, is not well defined in pulmonary metastasectomy (24). The choice of adjuvant therapy is also affected: platinum-based chemotherapy is now frequently used for primary lung cancer, whereas its role after metastasectomy has not been studied. Finally, the 5-year survival of early-stage lung cancer after lobectomy approaches 80%, but is much lower in patients with metastatic disease (25).
Recent studies suggest that the use of genetic abnormalities can help with the distinction between primary LSCC and metastasis. Leong et al. (26) compared tumors from 16 patients with HNSCC and a paired solitary lung nodule for loss of heterozygosity on chromosomal arms 3p and 9p. The use of loss of heterozygosity distinguished 13 of the 16 cases as primary lung cancer or metastasis based on discordant versus concordant allelic patterns between the index tumor and the lung lesion. Of the top 100 genes in our study, only two (WNT5A, TPM2) are located on one of these chromosomal arms. This is not surprising as both 3p and 9p are frequently lost in both HNSCC and LSCC and would therefore be less likely to lead to identification of differentially expressed genes in these two tumor types.
A separate study using loss of heterozygosity suggests that many squamous lung lesions in patients with HNSCC that are currently classified as metastases based on clinical criteria may in fact be primary lung cancers (27). Although loss of heterozygosity is potentially useful, this technique is time consuming, not widely available, not completely accurate, and, most importantly, requires appropriate tissue from both the primary and the lung lesion.
Comparison with other studies. Although a number of studies have been published examining gene profiles in HNSCC (8, 28) and LSCC (9) with their tissues of origin, to our knowledge, the patterns in these two types of tumors have only been compared in one previous study (12). Talbot et al. used gene expression profiling to compare 21 lung cancer and 31 tongue cancer samples and were able to distinguish between HNSCC and LSCC tumors using hierarchical clustering with 100 to 500 genes. The accuracy of their predictions decreased when the number of genes was reduced below 100. An important advantage of our discriminant model over the traditional hierarchical clustering/t test approach is the accuracy that was achieved using a small number of genes. Our 10-gene classifier also correctly classified 96% of the samples from the Talbot et al. study. Although as few as five genes could be used with equal accuracy, we used the 10-gene classifier in these studies.
A major concern in small array-based studies is the high degree of heterogeneity that exists within a single tumor type and whether the samples properly capture that case-to-case heterogeneity. In this particular study, factors that have not been considered include tobacco use and human papillomavirus status (for the HNSCC cases). Nevertheless, gene expression differences between the two tumor types were striking and our 10-gene classifier was evaluated with testing and validation data sets using several different groups of external samples. This allowed us to show that the model built and the genes selected using the University of Pennsylvania data were highly accurate when evaluated on these external data sets. Most biomarker studies are done in a single institution and usually on small numbers of samples. If validation is done, it is usually with split sample or 10-fold cross-validation approaches, which, if not used carefully, can lead to bias in gene selection and "overfitting" of the data (29). These studies show that combining data sets of different origins is a possible way to avoid this problem.
Analysis of specific pathways and genes. Some of the most useful discriminating genes we detected were the lung surfactant genes, which were significantly higher in the LSCC. This is not surprising, given the lung epithelial origin of these tumors. However, because the tumor samples used for the gene expression studies were not microdissected and thus potentially contained up to 30% nontumor tissues, a potential explanation for this finding of high surfactant gene expression in the LSCC, but not in HNSCC, could be contamination from normal lung tissue in our original LSCC samples. Because of the availability of an antisurfactant protein C polyclonal antibody that worked well in paraffin-fixed tissues, we stained some of the LSCC specimens to determine the cellular localization of the SP-C. Our staining studies showed strong cytoplasmic staining in LSCC tumor cells (data not shown), demonstrating that the increased gene expression was not simply due to contaminating lung tissues. In addition, the 10-gene classifier, including the surfactant genes, easily distinguished LSCC from normal adjacent lung tissue using data available in the Memorial Sloan-Kettering Cancer Center data set further supporting the observation that the differential expression is tumor associated (data not shown).
Another major gene family with increased expression in lung cancers is the GAGE (G antigen) genes. GAGE proteins are a large group of cancer/testis antigens consisting of GAGE-1 through GAGE-8 (30). Although the function of most of the cancer/testis antigens is not known, GAGE proteins have been implicated in inhibition of apoptosis and chemotherapy resistance (31, 32). GAGE protein expression is present in 40% of lung cancers and is associated with poor prognosis (33). Detrimental effects of GAGE expression on survival have also been shown in esophageal and brain tumors (34, 35). Interestingly, GAGE gene expression was up-regulated in only a subset of LSCC (and no HNSCC). The significance of these proteins in the pathophysiology or prognosis of these tumors is as yet unknown.
One of the most striking changes we observed in our data set was the difference in expression of specific cytokeratin genes in these two types of tumors. All eukaryotic cells contain a cytoskeleton composed of three distinct filamentous structures: microfilaments, intermediate filaments, and microtubules (36). The intermediate filament protein family includes several hundred different members that are divided into several groups. Cytokeratins constitute type I and type II intermediate filaments and are subdivided based on isoelectric point (type II CKs 1-8 are basic; type I CKs 9-20 are acidic). Stratified squamous epithelia express mostly CKs 1 to 6 and 9 to 17, whereas CKs 7, 8, and 18 to 20 are identified in simple epithelia (36). During malignant transformation of normal cells, the cytokeratin patterns are usually maintained.
The pattern of gene expression differences identified in our study showed a "stratified squamous epithelial" pattern in the HNSCC tumors with higher expression of CKs 1 and 14 (up 3.6- and 62-fold, respectively) and lower expression of CKs 18 and 19 (down 3.9- and 10-fold, respectively). Although both upper airway epithelium and bronchial epithelium are composed of stratified squamous cells, it is not surprising that HNSCC tumors are more likely to exhibit a stratified squamous pattern given their location in the upper aerodigestive tract.
Many genes in the collagen family were also up-regulated in head and neck tumors when compared with squamous cell lung cancer. Five collagen-related genes (COL6A2, COL1A2, COL10A1, COL3A1, and COL6A3) were found in our top 100 genes selected by PDA and had expression ratios ranging from +1.8 to +4.0. In the tumor microenvironment, collagens are a major component of the extracellular matrix, which is primarily secreted by stromal cells and inflammatory cells (37). Thus, the higher expression of collagen in the head and neck tumors may simply reflect a higher proportion of stromal elements compared with the lung cancer samples. There is recent data, however, that suggests that certain collagen genes are expressed in the tumor cells themselves. For example, ovarian cancer cells have been shown to highly express several extracellular matrix proteins, including collagen VI, and this was associated with resistance to cisplatin in vitro (38).
The high expression of collagens in the head and neck tumors was mirrored by higher levels of three matrix metalloproteinases, MMP1, MMP3, and MMP10, which were increased by 12.4-, 8.2-, and 2.6-fold, respectively, when compared with the lung cancers. MMP-1, or collagenase-1, is expressed in a wide variety of cancers and in most cases is associated with increased invasion and poorer survival (39). MMP-3, which is secreted by fibroblasts, can activate tumor-derived MMP-1 and other collagenases, leading to increased collagen degradation and tumor invasion (39). In head and neck tumors, high levels of MMP-1 and MMP-3 are associated with greater tumor invasiveness and incidence of lymph node metastases (40). The higher levels of MMP gene expression in our study may have been due to the higher proportion of HNSCC tumors with lymph node metastases when compared with the LSCC tumors (61% versus 20%).
Significance and future directions. Although our data was derived from primary LSCC and HNSCC samples, we postulate that our predictive approach will be able to determine the origin of lung nodules in patients with previous HNSCC. We were able to conduct a first test of this hypothesis by validating our 10-gene classifier using data provided by Talbot et al. (12) from a set of 12 squamous cell lung lesions of "unknown etiology" derived from patients with previous HNSCC. As shown in Fig. 6, our predictions closely matched the results of the 500-gene classifier set of Talbot and were consistent with the final clinical classifications of these tumors (12). How the 10-gene classifier performs in a prospective series of squamous cell lung nodules from patients with HNSCC is the subject of ongoing investigation.
In addition to using microarray data, we are also studying ways to use our data in other types of assays. We have, in a limited fashion, shown the use of gene expression ratios using QRT-PCR to distinguish between these two tumor types. This method, originally developed by Gordon et al., has several potential advantages: It does not require a housekeeping gene to be used as a reference, is independent of the platform used for data acquisition, and requires very small amounts of RNA. We ultimately plan to develop PCR classifiers that can be used in paraffin-embedded tissues. Recent advances in PCR technology allow the measurement of gene expression from RNA harvested from paraffin-embedded tumors, which are most commonly used for standard clinical pathology (41). We are currently developing and testing PCR primers that will work well in paraffin-fixed tissue and are collecting a series of well-characterized pathologic specimens to validate this approach. We are also evaluating potential immunohistochemical markers, such as antisurfactant protein C antibodies using tissue arrays. The use of antibody staining remains the most commonly used technique in diagnostic pathology and would be the method most easily adopted into routine clinical practice. If protein ratios mirror the RNA ratios, this could be a useful diagnostic approach.
Summary. The ongoing refinements in surgical therapy and in adjuvant chemotherapy for head and neck cancer and lung cancer make the distinction between primary LSCC and lung metastasis increasingly important. We have identified a 10-gene classifier that we believe can distinguish primary squamous cell tumors of each type. This finding represents a potentially exciting new molecular diagnostic method, but will need to be further validated before it can be used clinically. We are now actively pursuing the use of both gene expression and immunohistochemical methods in HNSCC patients who present with a solitary lung nodule to further validate our result. Because there is not yet a true "gold standard," our assessment of accuracy will require careful and somewhat long-term clinical follow-up.
| Footnotes |
|---|
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Note: Supplementary data for this article are available at Clinical Cancer Research Online (http://clincancerres.aacrjournals.org/).
A. Vachani and M. Nebozhyn contributed equally to this work.
8 http://research.dfci.harvard.edu/meyersonlab/lungca/
9 http://www.affymetrix.com/support/technical/comparison_spreadsheets.affx
10 https://genome.unc.edu/pubsup/dwd
Received 7/10/06; revised 12/5/06; accepted 2/21/07.
| References |
|---|
… Taxol and γ-irradiation. Cancer Biol Ther 2002;1:380-7.