Abstract
Purpose: Oral cancer is a major health problem worldwide and in the U.S. The 5-year survival rate for oral cancer has not improved significantly over the past 20 years and remains at ∼50%. Patients diagnosed at an early stage of the disease typically have an 80% chance for cure and functional outcome; however, most patients are identified when the cancer is advanced. Thus, a convenient and accurate way to detect oral cancer early will decrease patient morbidity and mortality. The ability to noninvasively monitor oral cancer onset, progression, and treatment outcomes has two prerequisites: identification of specific biomarkers for oral cancers, and noninvasive access to and monitoring of these biomarkers that could be conducted at the point of care (i.e., practitioner's or dentist's office) by minimally trained personnel.
Experimental Design: Here, we show that DNA microarray gene expression profiling of matched tumor and normal specimens can identify distinct anatomic site expression patterns and a highly significant gene signature distinguishing normal from oral squamous cell carcinoma (OSCC) tissue.
Results: Using a supervised learning algorithm, we generated a 25-gene signature for OSCC that can classify normal and OSCC specimens. This 25-gene molecular predictor was 96% accurate on cross-validation, averaging 87% accuracy using three independent validation test sets and failing to predict non–oral tumors.
Conclusion: Identification and validation of this tissue-specific 25-gene molecular predictor in this report is our first step towards developing a new, noninvasive, microfluidic-based diagnostic technology for mass screening, diagnosis, and treatment of pre-OSCC and OSCC.
- Oral squamous cell carcinoma
- Screening
- Gene signature
- Prediction
- Head and neck/oral cancers
- Cancer surveillance and screening
- Gene expression profiling
- Molecular diagnosis and prognosis
Head and neck cancers are the sixth most common cancer worldwide and are associated with low survival and high morbidity (1). Cancers of the oral cavity account for 40% of head and neck cancers and include squamous cell carcinomas of the tongue, floor of the mouth, buccal mucosa, lips, hard and soft palate, and gingiva (2, 3). Despite therapeutic and diagnostic advances, the 5-year survival rate for oral squamous cell carcinoma (OSCC) remains at ∼50% (2–4). In addition, aggressive treatment of OSCC is controversial because it can lead to severe disfigurement and morbidity (5). As a result, many patients with OSCC are either overtreated or undertreated, with significant personal and socioeconomic effects.
One of the fundamental factors accounting for the poor outcome of patients with OSCC is that a great proportion of oral cancers are diagnosed at advanced stages and are therefore treated late. Early detection of oral cancer lesions will greatly reduce the morbidity associated with late disease treatment and improve overall patient survival. For example, early detection could lead to frequent patient monitoring, dietary changes, counseling on and cessation of smoking and drinking, preventative drug administration, and/or lesion removal. As such, early diagnosis and treatment of OSCC has been shown to lead to a mean survival rate of >80% and a good quality of life after treatment (6). However, no methodology currently exists for early, accurate, and easy mass screening for oral cancer lesions.
Currently, clinical examination and histopathologic studies are the standard diagnostic methods used to ascertain whether biopsied material is a precancerous or cancerous lesion (7). Biopsies are invasive procedures typically involving surgical techniques. Furthermore, biopsies are limited when it comes to lesion size. For example, small lesions may not provide enough material for accurate diagnosis, whereas biopsies taken from large lesions may not accurately reflect every histopathologic aspect of the lesion. Finally, the biopsy, as a diagnostic tool, has limited sensitivity. Thus, additional methodologies would be helpful in screening for premalignant and malignant oral cancer lesions. However, noninvasive detection of oral cancer cells will require easy access to the site in which these cancers typically arise, as well as a readily available source of cells. Oral cavity saliva meets both of these criteria. Identifying cancer lesions early, accurately, and easily from the cellular population in saliva, in turn, requires the identification of unique biomarkers.
Typically, genetic changes in cancer cells lead to altered gene expression patterns that can be identified long before the cancer phenotype has manifested. When compared with normal mucosa, those changes that occur in the cancer cell can be used as biomarkers. Attempts to find biomarkers that identify premalignant OSCC and cancerous lesions have resulted in several candidate genes associated with OSCC tumor progression including p53, cyclin D1, and epidermal growth factor receptor (8, 9). However, to date, no single gene has shown sufficient diagnostic utility in OSCC. Thus, as in many other cancers, clinical diagnosis will require considering the combined influence of many genes. Not surprisingly, the expression patterns of many genes have shown dramatic correlations with tumor behavior and patient outcome. Indeed, microarray analyses of several tumor types have shown that global expression profiling can distinguish tumor from normal cells, as well as the class and subtype of cancer, far better than current histopathologic diagnostic systems (10–15). Recent independent studies (15–19) carried out by various research groups indicate that OSCC cells have a unique gene transcription profile, which differs from that of normal cells. Interestingly, none of these studies has tested the gene profiles for their ability to identify or predict OSCC.
To date, no accurate, cost-efficient, and reproducible method exists that enables the mass screening of patients for OSCC. Recent literature strongly supports the notion that microarray analysis of human cancers far exceeds conventional criteria with regard to diagnosis. Thus, the goal of this study was to identify a gene expression signature for OSCC. By analyzing microarray data from patient-matched normal and OSCC tissue, validating the results at the RNA and protein level, and using several independent validation gene data sets, we identified a tissue-specific gene expression signature that can predict OSCC. This OSCC signature is our first step towards developing a microfluidic “lab-on-a-chip” system for rapid (real-time), point-of-care detection and diagnosis of oral cancer.
Materials and Methods
Patient samples and characteristics. Consent was obtained for all patients in this study in accordance with guidelines set forth by the Institutional Review Board at the University of Pennsylvania and the University of Pittsburgh. All matched patient normal and tumor samples and unmatched normal and tumor samples were obtained from surgical resection specimens from patients undergoing surgery for OSCC using standardized procedures. The Penn data set was isolated and microarrayed at the University of Pennsylvania, whereas the RO data set was kindly provided by Dr. Ruth Muschel (Children's Hospital of Philadelphia, Philadelphia, PA) and has been previously reported (12). The OSCC data set and other human data sets used in this study were downloaded from Geo DataSets (National Center for Biotechnology Information). In this study, cancers of the oral cavity included squamous cell carcinomas of the tongue, floor of the mouth, buccal mucosa, lips, hard and soft palate, and gingiva (2, 3). After resection, matched normal and tumor were fresh-frozen in liquid nitrogen. Samples were banked at −80°C for storage until later use. All normal samples were obtained at the greatest distance from the tumor, typically 2 to 3 cm, in which no gross appearance of tumor, leukoplakia, or erythroplakia could be detected. Touch preparation analysis was used to confirm that each normal-appearing mucosa did not contain tumor.
All sections were evaluated cytologically and diagnosis was confirmed. All tissue sections were fixed and stained with H&E and evaluated by two pathologists (J. Hunt and M.M. Feldman) and histologic analyses were done to ensure that each tumor specimen was pure for microarray analysis, containing >80% tumor tissue, and that each normal section did not contain dysplasia or carcinoma. Those samples that did not meet these criteria were rejected for this study. Archival material of normal sections were derived from patients who either had surgical tooth extractions or had non–epithelial-related pathology.
Extraction of total RNA and microarray analysis. For RNA isolation, each tissue specimen was placed in a liquid nitrogen–chilled mortar and the tissue ground to a fine powder. The liquid nitrogen was evaporated, and the tissue was homogenized in Trizol (Invitrogen, Carlsbad, CA). Total RNA was isolated using the Trizol method and dissolved in RNase-free water. To remove contaminants, the RNA was purified using RNeasy spin columns (Qiagen, Inc., Valencia, CA). Each specimen typically yielded 50 μg of total RNA.
Total RNA samples were submitted to the University of Pennsylvania Microarray Facility for microarray analysis using Affymetrix U133A chips. Samples were run on an Agilent Bioanalyzer to confirm integrity and concentration. For target preparation and hybridization, all protocols were conducted as described in the Affymetrix GeneChip Expression Analysis technical manual. Briefly, 5 to 8 μg of total RNA were converted to first-strand cDNA using Superscript II reverse transcriptase (Invitrogen) primed by a poly(T) oligomer that incorporates the T7 promoter. Second-strand cDNA synthesis was followed by in vitro transcription for linear amplification of each transcript and incorporation of biotinylated CTP and UTP (Enzo RNA Labeling Kit, Affymetrix). The cRNA products were fragmented to 200 nucleotides or less, heated at 99°C for 5 minutes and hybridized for 16 hours at 45°C to U133A GeneChips microarrays (Affymetrix). The microarrays were washed at low (6× saline-sodium phosphate-EDTA) and high (100 mmol/L MES, 0.1 mol/L NaCl) stringency and stained with streptavidin-phycoerythrin. Fluorescence was amplified by adding biotinylated antistreptavidin and an additional aliquot of streptavidin-phycoerythrin stain. A confocal scanner was used to collect fluorescence signal at 3-μm resolution after excitation at 570 nm. The average signal from two sequential scans was calculated for each microarray feature.
Analysis of microarray data. Initial data analysis was done using Affymetrix Microarray Suite 5.0 to quantitate expression levels for targeted genes; default values provided by Affymetrix were applied to all analysis variables. Border pixels were removed, and the average intensity of pixels within the 75th percentile was computed for each probe. The average of the lowest 2% of probe intensities occurring in each of 16 microarray sectors was set as background and subtracted from all features in that sector. Probe pairs were scored positive or negative for the detection of the targeted sequence by comparing signals from the perfect match and mismatch probe features. The number of probe pairs meeting the default discrimination threshold (τ = 0.015) is used to assign a call of absent, present, or marginal for each assayed gene, and a P value is calculated to reflect confidence in the detection call. A weighted mean of probe fluorescence (corrected for nonspecific signal by subtracting the mismatch probe value) was calculated using the one-step Tukey's biweight estimate. This signal value, a relative measure of the expression level, was computed for each assayed gene. Global scaling was applied to allow comparison of gene signals across multiple microarrays: after exclusion of the highest and lowest 2%, the average total chip signal was calculated and used to determine what scaling factor was required to adjust the chip average to an arbitrary target of 150. All signal values from one microarray were then multiplied by the appropriate scaling factor.
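The global scaling step described above can be illustrated with a minimal sketch. It assumes the signal values of one chip are available as a plain list; the function names and the way the 2% trim is rounded are illustrative choices, not the Affymetrix Microarray Suite implementation.

```python
def scaling_factor(signals, target=150.0, trim=0.02):
    """Return the factor that brings the trimmed mean of `signals` to `target`.

    `trim` is the fraction of values dropped at each end (2% in the text).
    Rounding the trim count down is an assumption of this sketch.
    """
    ordered = sorted(signals)
    k = int(len(ordered) * trim)  # number of values dropped at each end
    kept = ordered[k:len(ordered) - k] if k else ordered
    trimmed_mean = sum(kept) / len(kept)
    return target / trimmed_mean


def scale_chip(signals, target=150.0, trim=0.02):
    """Multiply every signal on one chip by that chip's scaling factor."""
    factor = scaling_factor(signals, target, trim)
    return [value * factor for value in signals]
```

Because each chip receives its own factor, chips with different overall brightness end up with the same trimmed average, which is what makes gene signals comparable across arrays.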
For statistical analysis, all data were normalized for comparison across arrays using GeneSpring default normalization settings: for data transformation, measurements <0.01 were set to 0.01; per chip, data were normalized to the 50th percentile; and per gene, data were normalized to the median. Genes differentially expressed between matched patient normal and tumor samples were obtained using GeneSpring and statistical analysis of microarrays (SAM; ref. 16). Briefly, after normalization, all gene expression data were filtered for those genes that were present in the matched patient normal and tumor samples. The 15,311 genes satisfying this filter were further analyzed for differences between normal and tumor samples using ANOVA with Benjamini-Hochberg multiple testing correction factor at P ≤ 0.001. SAM was then used to determine fold expression levels. Unsupervised clustering was done with Cluster (17) at the settings described using Pearson's correlation distance metric and complete linkage clustering followed by visualization in Treeview (17). Principal components analysis was used to compare site-specific expression in the oral cavity and was done using GeneSpring (principal components analysis on conditions).
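For readers unfamiliar with the Benjamini-Hochberg correction mentioned above, the following is a minimal sketch of the standard step-up adjustment. It is not the GeneSpring implementation; the function name is hypothetical, and the input is assumed to be a list of raw per-gene P values (for example, from ANOVA).

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A gene would then be retained when its adjusted value falls at or below the chosen cutoff (0.001 or 0.05 in the analyses described in the text).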
Immunohistochemistry. Immunohistochemistry was done to validate the differential expression of selected genes in tissue sections and to localize the tissue expression of the genes. Sections were incubated at 70°C for 20 minutes, deparaffinized in xylene (20 minutes at room temperature), and then rehydrated through a series of graded ethanol solutions (20 minutes at room temperature) followed by water (10 minutes at room temperature). Endogenous peroxide was quenched through treatment with hydrogen peroxide solution (15 minutes at room temperature). To enhance antigen exposure, specimens for matrix metalloproteinase-1 (MMP-1) and laminin-5 γ2 chain (Ln-5γ2) were incubated in 1% sodium citrate solution at 100°C for 10 minutes and then cooled to room temperature. The detection of MMP-3 required no antigen retrieval methods. Nonspecific binding sites were blocked with horse serum (Vector Laboratories, Burlingame, CA) for 30 minutes. The slides were then incubated overnight at 4°C with monoclonal mouse antibodies directed against the Ln-5γ2 chain (D4B5; Chemicon, Inc., Billerica, MA), MMP-1 (36665; R&D, Minneapolis, MN), or MMP-3 (SL1 IID4; Chemicon). Biotinylated secondary antibody (anti-mouse, 30 minutes at room temperature; Vector Laboratories), a biotin-avidin complex (30 minutes at room temperature; Vector Laboratories), and a chromogenic substrate (3,3′-diaminobenzidine, 10 minutes at room temperature; Vector Laboratories) were then applied. Sections were counterstained with hematoxylin for 5 minutes.
Quantitative real-time PCR analysis. Changes in mRNA levels were compared by quantitative real-time PCR analysis, using the Bio-Rad (Hercules, CA) MyiQ single-color real-time PCR detection system. All gene-specific primers used for real-time PCR were purchased from Qiagen. Five micrograms of total RNA from normal and tumor specimens were converted to cDNA using Superscript II (Invitrogen) according to the manufacturer's specifications. PCR reaction mixtures consisted of 2 μL of Faststart DNA Master SYBR Green I mixture [containing TaqDNA polymerase, reaction buffer, deoxynucleotide triphosphate mix (with dUTP instead of dTTP), SYBR Green I dye, and 10 mmol/L of MgCl2], 0.5 μmol/L of each target primer stock, 2 or 4 mmol/L MgCl2 (Ln-5γ2 chain, MMP-1, and MMP-3) in a final reaction volume of 20 μL. β-Gus was used as the internal control for normalization and Universal Human RNA (Stratagene, La Jolla, CA) was used as the standard reference (18). Cumulative fluorescence was measured at the end of the extension phase of each cycle. Product-specific amplification was confirmed by a melting curve analysis and agarose gel electrophoresis analysis. Quantification was done at the log-linear phase of the reaction and cycle numbers obtained at this point were plotted against a standard curve prepared with serially diluted samples.
Class prediction. The support vector machine (SVM; GeneSpring) algorithm was used to identify a gene predictor set from the patient-matched normal and tumor microarray data. SVM uses recognition and regression estimates to identify class prediction gene sets using a training set of microarray data. The SVM algorithm attempts to find a hyperplane that provides separation between the different input data classes such that there is a maximal distance between the hyperplane and the nearest point on any one of the input classes. Furthermore, using the training data set, SVM performs a 10-fold cross-validation to select a set of predictor genes which leads to the smallest error rate. The patient-matched tumor and normal samples were used as the training set and cross-validated. The Penn, RO, OSCC, and various tumor data sets were used as the test sets. SVM was set to use the Golub method. In the Golub method, each gene is tested for its ability to discriminate between the classes using a signal-to-noise score, which is given by:
\[ S = \frac{\mu_1 - \mu_2}{\sigma_1 + \sigma_2} \]
where μi and σi, i = 1, 2, are the mean and SD of the expression values over the samples in class i. Genes with the highest scores are kept for subsequent calculations. The Golub method of gene selection thus calculates the difference in means between the two classes divided by the sum of their SDs to identify the best set of predictors. The number of genes used for predictors was set at 25, and the gene pool consisted of the 2,207 genes identified by filtering for only genes present and analyzing by ANOVA with Benjamini-Hochberg multiple testing correction factor at P ≤ 0.05. Once this set of 25 predictor genes was identified, it was applied to each test set to test the classification accuracy. All SVM calculations were done using kernel function set to polynomial dot product (order 1) and diagonal scaling factor set to 0. Prediction analysis of microarrays (19) and k-nearest neighbors confirmed SVM results (ref. 9; data not shown).
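The signal-to-noise gene selection described above can be sketched as follows. This is an illustration, not the GeneSpring code. It makes three assumptions: the sample SD is used, genes are ranked by the absolute value of the score so that both up- and down-regulated genes can be chosen, and every gene varies within at least one class (otherwise the denominator would be zero). The data layout and function names are hypothetical.

```python
from statistics import mean, stdev


def signal_to_noise(class1, class2):
    """Golub-style score: difference in class means over the sum of class SDs."""
    return (mean(class1) - mean(class2)) / (stdev(class1) + stdev(class2))


def top_genes(expression, labels, n):
    """Return the n genes with the largest absolute signal-to-noise score.

    expression: dict mapping gene name -> list of expression values per sample
    labels:     list of class labels (1 or 2), one per sample, same order
    """
    scores = {}
    for gene, values in expression.items():
        c1 = [v for v, label in zip(values, labels) if label == 1]
        c2 = [v for v, label in zip(values, labels) if label == 2]
        scores[gene] = signal_to_noise(c1, c2)
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)[:n]
```

In the study, the selected genes (n = 25) then served as the input features of the linear-kernel SVM, which was trained on the patient-matched samples.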
Results
Patient-paired normal and tumor specimens. To identify differentially expressed genes in oral mucosa and OSCC, we obtained matched patient OSCC and normal specimens from 13 patients undergoing surgical resection for OSCC at the University of Pennsylvania and at the University of Pittsburgh. The clinical characteristics of the 13 paired tumor/normal patients are shown in Supplemental Table S1. This group is representative of the general population of patients with OSCC, having a median age of 59, and a greater percentage (54%) were male. All OSCC specimens were located in the oral cavity and included squamous cell carcinomas of the tongue, floor of the mouth, and buccal mucosa (Supplemental Table S1). Of the patients reporting, >90% smoked tobacco and/or drank alcohol (data not shown). We chose to compare patient-paired normal and tumor specimens in order to provide the most statistically representative database for distinguishing gene expression differences between tumor and normal samples. All RNA was isolated at the University of Pennsylvania.
Distinct expression profiles define tumor and normal samples. Initially, to identify differentially expressed genes between oral mucosa and OSCC, the microarray data were normalized, filtered for only the genes present, and analyzed using ANOVA with Benjamini-Hochberg multiple testing correction factor at P ≤ 0.05 (20). This yielded a highly discriminating set of 2,207 genes. To visualize the gene expression data, hierarchical clustering was done without reference to the normal and tumor class labels (Fig. 1A). The samples evenly partitioned into two major groups corresponding to normal and OSCC samples. More genes were differentially regulated in the tumor samples than the normal specimens in this 2,207–gene set. Within each tumor and normal cluster were two major classes. This separation was most apparent in the clustering of the normal samples (Fig. 1A and B). Six of the seven (>85%) normal samples that were derived from tongue tissue were in one class. In contrast, >80% (five of six) of those samples obtained from sites other than the tongue, including the mandible, floor of the mouth, gum, and buccal mucosa, made up the other class (Fig. 1B). A similar clustering of the OSCC tongue specimens and non-tongue OSCC specimens was also present in the tumor samples (Fig. 1B). Principal components analysis highlighted the separation of the tongue and non-tongue specimens in the normal and tumor groups (Fig. 2A). The separation of site-specific gene expression was readily apparent in the normal specimens but was slightly less distinct in the OSCC samples (Fig. 2A). Additionally, within the OSCC tumor samples was a subcluster of those primary OSCC tumors that were associated with nodal disease (Fig. 1). Although expression profiling allowed the distinct separation of tongue versus non-tongue samples, several samples clustered into groups that were inconsistent with their histology. However, these results do indicate that there are distinct expression patterns within the oral cavity that can identify normal and tumor tissue sites of origin.
Fig. 1. Gene expression profiles of 13 normal oral mucosal and 13 patient-paired OSCC specimens show distinct separation. A total of 2,207 genes were identified after the microarray data were normalized, filtered for only the genes present, and analyzed using ANOVA with Benjamini-Hochberg multiple testing correction factor at P ≤ 0.05. A, to visualize the data, samples and genes were analyzed by unsupervised hierarchical clustering using Cluster and graphically represented in Treeview (17). Gene signatures of interest are highlighted by dark lines. Tree headings for tumor and normal specimens are labeled. B, two major classes of samples were identified using the 2,207 genes identified in (A) representing an exact separation of the tumor and normal specimens. Tree headings for these two groups are labeled. Anatomic sites in the oral cavity from which samples were surgically removed are labeled as well as the stage and tissue number. Samples are subclustered into tongue and non-tongue classes under each of the tumor and normal headings.
Fig. 2. Unsupervised classification analysis of patient-matched normal oral mucosal and OSCC samples. A, principal components analysis was done on tongue and non-tongue oral cavity OSCC and normal samples. Perspective image with the tongue and non-tongue samples as principal components and axes. Tongue (▪), and non-tongue samples (), clustering of similar samples (encircled points). B, expression patterns up-regulated and associated with tongue samples are labeled. Regions identified in Fig. 1 as site-specific are enlarged. C, expression patterns up-regulated and associated with non-tongue samples are labeled. Regions identified in Fig. 1 as site-specific are enlarged. Headings for tumor and normal specimens are labeled.
Closer inspection of the site-specific gene expression patterns identified in Fig. 1 showed distinct gene expression profiles for normal tongue tissue compared with normal non-tongue sites (Fig. 2B and C). In particular, several enzymes associated with biochemical processes were up-regulated in the tongue including cytochrome family members, aldehyde dehydrogenase, monoglyceride lipase, transglutaminase, sulfotransferase, arachidonate 12-lipoxygenase, glutathione S-transferase, and others. In addition, several signal transduction molecules, including RAB proteins and Rho-GTPase activating proteins, and the c-erb-b2 receptor were elevated in the tongue specimens (Fig. 2B).
Although the non-tongue samples did express some biochemical enzymes, the number was far less than that for the tongue samples (Fig. 2C). Unlike the tongue specimens, the non-tongue samples had several types of receptors up-regulated including growth factor receptors, prostaglandin receptors, and G protein–coupled receptors. Likewise, several growth factors were also elevated, for example, the epidermal growth factor, fibroblast growth factor-2, and WNT inhibitory factor. Finally, non-tongue tissues showed expression for several transcription factors, including ets, zinc finger, AP-2 and p300/CBP, and signaling molecules like H-Ras suppressor and Grb-2 like proteins. Together, these results indicate that there are unique gene expression patterns between tongue and non-tongue sites in the oral cavity.
Gene expression signature for OSCC. To more thoroughly characterize and identify the most significantly differentially expressed genes in OSCC and normal mucosa, we used a combination of ANOVA with the Benjamini-Hochberg multiple testing correction factor at P ≤ 0.001 and SAM to analyze the patient-matched tumor/normal samples (16, 20). This resulted in a list of 92 genes that are highly significantly differentially expressed in OSCC and normal tissue (Table 1). As shown in Table 1, the majority of genes in this list (95%) are up-regulated in OSCC as compared with normal (similar to that presented in Fig. 1). This list contains genes up-regulated from 2-fold to >70-fold in OSCC with P values that ranged from 1 × 10^−7 to 0.001 (Table 1). Likewise, the genes which were down-regulated in OSCC ranged from 2- to 33-fold with P values of 4 × 10^−8 to 7.5 × 10^−4.
Table 1. Differentially expressed genes in normal mucosa and OSCC
Although several genes in Table 1 are relatively unknown, many have been implicated in OSCC development and progression. These include molecules associated with the extracellular matrix, matrix proteolysis, cell to cell adhesion, migration, and other processes. For example, several collagen chains, two laminin-5 chains, six different MMPs (MMP-1, -3, -9, -11, -12, and -13), and plasminogen activator of urokinase were all up-regulated in OSCC. In addition, genes regulating cell to cell adhesion and motility including snail homologue 2, myosin, meltrin α, and lysyl oxidase–like 2 were identified as being up-regulated in OSCC. The functions of those genes down-regulated in OSCC are presently unknown. However, several of the up-regulated genes have previously been reported as markers of or being involved in OSCC tumor development and progression.
Validation of the OSCC gene signature. To confirm the findings from the microarray analysis, real-time PCR was done using primers specific for a sampling of genes that were at the top, middle, and bottom of the genes listed in Table 1. This included MMP-1, Ln-5γ2, and MMP-3. The amplification efficiencies and expression values of the primer/probe sets at various dilutions were compared with the amplification efficiencies of the internal control gene β-Gus. We selected β-Gus as the internal control because it was uniformly expressed across all samples by microarray. Quantitative real-time PCR results were analyzed using the 2^−ΔΔCT method (18). As shown in Fig. 3, the expression levels of MMP-1, Ln-5γ2 chain, and MMP-3 were all elevated in the tumor specimens as compared with the adjacent normal tissue. The fold expression of MMP-1, Ln-5γ2 chain, and MMP-3 in tumor versus normal tissue was in agreement with that determined by microarray analysis (Supplemental Table S2). Overall, these results confirm, at the RNA level, the differential gene expression signature that distinguishes OSCC tumors and normal mucosa.
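The relative quantification used above can be written out as a short worked sketch. It assumes threshold cycle (CT) values are available for the target gene and for the internal control (β-Gus in this study) in both tumor and normal tissue, and that amplification efficiency is close to 100% for both primer sets. The function name and the CT values in the usage note are hypothetical.

```python
def fold_change(ct_target_tumor, ct_ref_tumor, ct_target_normal, ct_ref_normal):
    """Relative expression (tumor vs. normal) by the 2^-ddCT method."""
    # Normalize the target gene to the internal control in each tissue.
    delta_tumor = ct_target_tumor - ct_ref_tumor
    delta_normal = ct_target_normal - ct_ref_normal
    # Compare tumor against normal and convert cycles to fold change.
    delta_delta = delta_tumor - delta_normal
    return 2 ** -delta_delta
```

For instance, with made-up CT values of 20 (target) and 18 (control) in tumor and 25 (target) and 18 (control) in normal tissue, ΔΔCT is −5, which corresponds to a 32-fold higher expression in the tumor.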
Fig. 3. Validation analyses of gene expression profiling. A, quantitative real-time PCR of matched normal and tumor specimens (n = 5 pairs) for MMP-1 (top), Ln-5γ2 (middle), and MMP-3 (bottom). All expression values are normalized to expression of internal control gene β-Gus. All real-time reactions were done in triplicate; bars, SD. Only a total of seven of the paired samples were analyzed by real-time PCR due to a lack of sufficient sample. B, immunohistochemical staining analysis of MMP-1, Ln-5γ2 chain, and MMP-3 in OSCC and normal oral mucosa. In OSCC (b, d, and f), staining of MMP-1 (b), Ln-5γ2 chain (d), and MMP-3 (f) was detected around and within the OSCC tumor islands. Staining of normal oral mucosa (a, c, and e) could not detect the expression of MMP-1 (a) or MMP-3 (e). The Ln-5γ2 chain (c) was correctly detected in the basement membrane of the oral mucosa (↑). Magnification, ×400. A total of seven different OSCC tumor and normal sections were stained by immunohistochemistry.
This sampling of genes was next examined at the protein level in paraffin-embedded archival samples, both OSCC and normal, by immunohistochemistry. All OSCC sections displayed elevated expression of MMP-1, MMP-3, and Ln-5γ2 within the tumor islands (Fig. 3). In contrast, with the exception of the Ln-5γ2, these proteins could not be detected in the normal sections. Ln-5γ2 was detected, as expected, in the basement membrane of normal oral epithelium (ref. 21; Fig. 3B). Thus, these results show that we have validated the OSCC gene signature at the RNA and protein levels.
OSCC gene signature predicts tumor and normal samples. To identify an OSCC tumor predictor, the patient-matched OSCC tumor and normal microarray data were next analyzed using an SVM, a supervised machine-learning class prediction algorithm (22). To identify the best gene predictors, the SVM was trained on the patient-matched OSCC tumor/normal data and cross-validated using the Golub gene selection method (9), with the number of possible gene predictors set at 25 genes. The cross-validation of the matched patient tumor and normal sample data shows that this 25-gene set predictor could classify tumor and normal samples with 100% accuracy (Supplemental Table S3). Interestingly, a majority of the margins in Supplemental Table S3, 25 of 26 (96%), were ≥0.5 and are considered confident classifications (23, 24). However, the margin for sample 04-0123 was <0.05 and is considered an unreliable prediction. Thus, this reduced the accuracy to 96%. We were able to retain this 96% accuracy prediction rate using as few as 10 genes; however, the 25 predictors yielded the best performance when applied to other data sets below (data not shown).
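The way unreliable margins lower the reported accuracy can be made explicit with a small sketch. It assumes each prediction is recorded as a (true label, predicted label, margin) tuple and that a prediction counts toward accuracy only when it is both correct and confident. The 0.5 confidence threshold is taken from the text; the function name and the margin values in the usage note are illustrative.

```python
def confident_accuracy(predictions, threshold=0.5):
    """Fraction of samples that are classified correctly AND confidently.

    predictions: list of (true_label, predicted_label, margin) tuples
    threshold:   minimum absolute margin for a confident classification
    """
    confident_correct = sum(
        1
        for true_label, predicted_label, margin in predictions
        if true_label == predicted_label and abs(margin) >= threshold
    )
    return confident_correct / len(predictions)
```

With 26 correctly classified samples of which one has a margin below the threshold, this gives 25/26, or about 96%, matching the cross-validation figure quoted above.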
To test the predictive strength of the 25-gene predictor identified by SVM and cross-validation, we tested it on three independent oral cancer data sets. The Penn data set comprised two normal specimens and 13 OSCC specimens, the RO data set (which has been previously reported) consisted of 18 OSCC tumors, and the OSCC data set consisted of 5 tumor and 4 normal samples (from GSE1722; Geo DataSet, National Center for Biotechnology Information; ref. 12). Using SVM and the Golub method of gene selection, all (100%) specimens of the Penn data set were correctly classified (Table 2). However, the margins of two predictions were considered unreliable, resulting in an accuracy of 87%. The 25-gene set predictor was able to accurately classify 86% of the RO OSCC data (one incorrect prediction and two unreliable; Supplemental Table S4). The OSCC predictor correctly classified all nine samples as tumor or normal in the OSCC data set. However, one prediction was considered unreliable, giving an accuracy of 89% (Supplemental Table S5). Similar results for the above analyses were obtained using k-nearest neighbor and prediction analysis of microarrays (data not shown). Thus, testing a total of 44 OSCC and normal specimens, the 25-gene predictor was able to correctly classify 42 samples, producing an average accuracy rate of >87%. As a follow-up to this initial work, we are beginning prospective studies to test the ability of the OSCC gene signature(s) to predict OSCC in a larger sample population and to improve the positive identification accuracy rate. Finally, applying the predictor to existing Affymetrix U133A chip–derived data sets (National Center for Biotechnology Information Geo DataSets) from 20 other human tumors, including breast, renal clear cell tumor, acute myeloid leukemia, lymphoblastic leukemia, and Barrett's-associated adenocarcinomas, resulted in accuracies of only 25% (Supplemental Table S7), illustrating the tissue specificity of the 25-gene predictor.
Table 2. Prediction of Penn data set of OSCC and normal specimens (87% accurate; 15 correct predictions; 2 unreliable)
Many of the genes that make up the 25-gene predictor have previously been implicated in and described for OSCC (Table 3). However, several predictor genes have not been directly associated with OSCC tumorigenesis and will therefore provide starting points for further investigations. Several of the genes identified in the predictor were also present in the set of highly significant genes differentially expressed between normal and tumor (Table 1). The predictor gene set comprised several epithelial marker genes, with categories of potential interest including genes encoding extracellular matrix components; genes involved in cell adhesion, including fasciclin; genes involved in cell to cell integrity, for example, lysyl oxidase–like 2 and snail-homologue 2; genes encoding hydrolyzing activities, including proteins involved in the degradation of the extracellular matrix such as MMP-1, MMP-9, MMP-11, and urokinase; and cytokines such as inhibin βA and parathyroid hormone–like hormone. As previously proposed, the development of OSCC involves stromal and immune-regulatory components, and many of the predictor genes belong to these categories.
Table 3. OSCC tumor and normal gene predictors
Discussion
In the present study, we performed expression profiling on patient-matched normal mucosa and OSCC tumors to identify a gene expression signature capable of distinguishing OSCC tissue from normal. The use of patient-paired normal and tumor specimens provided the most statistically representative database for distinguishing gene expression differences between tumor and normal samples. By microarray analysis, we identified a highly significant set of differentially expressed genes between normal and OSCC tissue. Furthermore, using this signature with the supervised machine-learning algorithm SVM, we were able to classify three independent test data sets with accuracies ranging from 87% to 100%. This is the first report to date that has used patient tumor/normal samples to identify a gene signature for the prediction of OSCC. Finally, this report satisfies our first requirement in developing a microfluidic lab-on-a-chip system for rapid (real-time), point-of-care screening, detection, and diagnosis of oral cancer: identification of a gene signature that predicts OSCC.
We have used several approaches for the analysis of gene expression data with regard to clinicopathologic variables. Our initial approach, gene selection using an ANOVA test with Benjamini-Hochberg multiple testing correction, was used to examine similarities and differences among the paired tumor/normal samples in their patterns of gene expression. In agreement with our results, several previous studies using unmatched and matched OSCC tumor specimens and normal samples showed a similar distinct clustering of normal and tumor samples (13–15). As with these previous studies, we did not find any hierarchical clustering of the samples as related to stage of disease or tumor grade. The most differentially expressed genes were up-regulated in OSCC samples, whereas those in the normal tissue were down-regulated. More interestingly, and in contrast with previous studies, both the tumor samples and, to a greater extent, the normal samples clustered according to their anatomic sites in the oral cavity. Tissue from the tongue comprised one major cluster, whereas tissue from other sites, including buccal mucosa, mandible epithelial, gum tissue, and floor of the mouth, populated the other cluster.
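The Benjamini-Hochberg correction named above can be sketched in a few lines. The sketch assumes the standard step-up procedure, in which each sorted P value is scaled by the number of tests divided by its rank and the adjusted values are then forced to be monotone. The function name and the example P values are invented for illustration; they are not results from this study, and the ANOVA step that would produce the P values is not shown.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that each adjusted value is
    # never larger than the one ranked above it.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene P values (placeholders, not study data).
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted = benjamini_hochberg(p_values)
significant = [p <= 0.05 for p in adjusted]
print(adjusted, significant)
```

Genes whose adjusted P value falls at or below the chosen false discovery rate would be retained for clustering; the 0.05 cutoff used here is only an example.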
Using a highly significant statistical and data-filtering approach, we identified 92 genes differentially expressed between OSCC and normal mucosa (P < 0.001). Furthermore, the strong correlation of real-time PCR with the array data for gene expression and the validation using immunohistochemistry strongly indicate that the 92 genes in this list are representatively expressed genes in OSCC. Thus, these studies indicate that many of the genes identified by microarray analysis are highly relevant to OSCC development and/or progression. For example, work by us and others has shown that urokinase, MMP-1, MMP-3, MMP-11, MMP-13, and laminin-5 (mainly the γ2 and α3 chains) are up-regulated and play a significant role in OSCC tumor development and progression (25–28). The MMPs are instrumental in the degradation of the extracellular environment, allowing OSCC tumor growth and invasion, whereas laminin-5 plays a fundamental role in tumor growth and migration/invasion (25). In addition, STAT-1 was identified as being up-regulated and is believed to play an important role in the tobacco-induced pathogenesis of oral cancer (29). As in several recent studies, we found both snail homologue and lysyl oxidase to be expressed in OSCC. These two genes alter cell to cell adhesion and are suspected of being important in OSCC tumor invasion (30). It is interesting that many of these genes, normally expressed during wound healing, especially those associated with proteolysis and motility, are not properly regulated and seem to become constitutively overexpressed in OSCC.
To date, >20 studies incorporating microarray analysis have reported on the genetic changes associated with OSCC. Unfortunately, many of these studies have used a variety of gene expression arrays and platforms, and thus, it is difficult to make a direct comparison with our results (14, 15). In addition, none of these previous studies has tested its OSCC gene signature for its ability to predict OSCC using an independent validation data set as shown here. As expected, there is a consensus that distinct gene expression patterns exist when normal and primary OSCC tumors are compared. For example, Ginos et al., who compared OSCC samples to unmatched normal subjects using Affymetrix U133A chips, identified several genes that overlapped with those found in this study (13). In addition, Mendez et al. used microarrays to analyze both OSCC and normal specimens (31). As with our work here, they concluded that oral carcinomas are distinguishable from normal oral tissue, using oligonucleotide arrays that contained probes representing only 7,000 full-length human genes. Interestingly, these authors indicated that there was expression profile heterogeneity among tumors of a particular histopathologic grade and stage, and that no statistically significant differences in gene expression were found between early-stage disease and late-stage disease (31). Our data agree with their assessment that there was no correlation with grade and stage but do not agree with their findings on invasive disease (see footnote 8). These differences are most likely due to the different microarray chips used in the Mendez et al. study and in the one presented here. However, these results do illustrate that distinct genes are expressed in OSCC compared with normal samples.
The OSCC prediction signature from this study will provide the impetus to continue work with our collaborators from the University of Pennsylvania School of Engineering and Applied Science in the design, development, and testing of a disposable handheld microfluidic device for the clinical assessment of OSCC (32).
To test for clinical applicability, we assessed whether different data sets of genes and tissues could be predicted by the OSCC gene expression signature. We selected SVM, a supervised machine-learning technique, for our prediction studies (22). SVM was chosen because it has been used successfully in several microarray studies and seems to be superior to similar algorithms such as k-nearest neighbor and prediction analysis of microarrays (9, 23, 33, 34). Cross-validation of the training set, which consisted of the paired tumor/normal samples from two institutions, resulted in an accuracy rate of 96% using at least 25 genes per class. The most appropriate test of predictive accuracy is to validate the predictor on an independent set of samples. Therefore, to provide a superior means of testing the predictor, we did not split the data set and use leave-one-out validation. Instead, we validated the 25-gene predictor and obtained accuracy rates using three independent validation sets: an independent OSCC and normal sample set from the University of Pennsylvania, a previously published OSCC sample set, and one obtained from the National Center for Biotechnology Information GEO DataSets (12). The 25-gene predictor had an overall accuracy ranging from 86% to 89% for these three validation sets. How these numbers compare with the early clinical diagnosis accuracy of OSCC is currently under investigation. It is difficult to determine why some samples were incorrectly classified. This may be the result of other tissue components within the sample (e.g., bone) or because some samples were mistakenly labeled.
Finally, we tested the OSCC 25-gene predictor's ability to classify non-oral cavity tumor and normal samples. The OSCC predictor displayed poor classifications, with accuracies of only 25% (75% of the samples were predicted incorrectly), when microarray data sets (obtained from National Center for Biotechnology Information GEO DataSets) were derived from non–OSCC human cancers. This indicated that the oral cancer gene predictor set was tissue-specific. In addition, several of these genes present in Table 2 are those differentially expressed in OSCC tumor and normal mucosa components. Interestingly, several of the predictor genes have not been directly associated with OSCC tumor development or progression, and thus, provide areas for further investigations. Categories of potential interest include genes (putatively) encoding extracellular matrix components, in particular, several collagen chains; genes involved in matrix degradation, including urokinase and three members of the matrix metalloproteinase family; genes regulating cell to cell and cell to extracellular matrix adhesion including snail homologue, lysyl oxidase, and fasciclin-like; and genes involved in cell growth and migration, for example, chemokine ligand 13, inhibin, and myosins X and IB. Thus, we have identified a highly significant set of genes that are expressed in OSCC which will provide opportunities for further investigations into OSCC development and progression.
These results represent the initial stage of bringing a gene signature into the clinic for patient screening. To date, no accurate, cost-efficient, and reproducible method exists that enables mass screening of patients for OSCC. Our results show that such a method is possible using the OSCC gene signature identified here. This gene signature, after further testing, could be useful in identifying the site of tumor origin and/or identification of neoplastic lesions before any gross appearance of tumor. As a follow-up to this initial work, we have begun prospective studies using this OSCC gene signature. In addition, we are beginning the development of a new, noninvasive, microfluidic-based diagnostic device using this OSCC gene signature to distinguish oral cancer cells from normal mucosa. A multidisciplinary group of scientists from the University of Pennsylvania School of Medicine and the School of Engineering and Applied Science is planning to develop a device that will be used clinically for the early mass screening of individuals for the presence of OSCC. It is anticipated that this microfluidic lab-on-a-chip will be sensitive enough to identify a suspect lesion before it is detectable during a routine clinical exam. Finally, determining this gene predictor's ability to clinically diagnose oral dysplasias and other potential neoplastic oral lesions is currently under way as well.
Footnotes
7 Ziober et al., manuscript in preparation.
8 Manuscript in preparation.
Grant support: NIH grants DE15856-01 and DE015626-01 (B.L. Ziober).
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Note: Supplementary data for this article are available at Clinical Cancer Research Online (http://clincancerres.aacrjournals.org/).
- Accepted May 25, 2006.
- Received March 6, 2006.
- Revision received May 21, 2006.