Purpose: This study was performed to discover prognostic genomic markers associated with postoperative outcome of stage I to III non–small cell lung cancer (NSCLC) that are reproducible between geographically distant and demographically distinct patient populations.
Experimental Design: American patients (n = 27) were stratified on the basis of recurrence and microarray profiling of their tumors was performed to derive a training set of 44 genes. A larger Korean patient validation cohort (n = 138) was also stratified by recurrence and screened for these genes. Four reproducible genes were identified and used to construct genomic and clinicogenomic Cox models for both cohorts.
Results: Four genomic markers, DBN1 (drebrin 1), CACNB3 (calcium channel beta 3), FLAD1 (PP591; flavin adenine dinucleotide synthetase), and CCND2 (cyclin D2), exhibited highly significant differential expression in recurrent tumors in the training set (P < 0.001). In the validation set, DBN1, FLAD1 (PP591), and CACNB3 were significant by Cox univariate analysis (P ≤ 0.035), whereas only DBN1 was significant by multivariate analysis. Genomic and clinicogenomic models for recurrence-free survival (RFS) were equally effective for risk stratification of stage I to II or I to III patients (all models P < 0.0001). For stage I to II or I to III patients, 5-year RFS of the low- and high-risk patients was approximately 70% versus 30% for both models. The genomic model for overall survival of stage I to III patients was improved by addition of pT and pN stage (P < 0.0013 vs. 0.010).
Conclusion: A 4-gene prognostic model incorporating the multivariate marker DBN1 exhibits potential clinical utility for risk stratification of stage I to III NSCLC patients. Clin Cancer Res; 17(9); 2934–46. ©2011 AACR.
This article is featured in Highlights of This Issue, p. 2603
Lung cancer is a global problem, but the validation of prognostic genomic markers for recurrence-free survival (RFS) of lung cancer patients may be influenced by local differences in population genetics and environmental factors, including tobacco exposure, diet, and exercise. We hypothesized that comparison of 2 geographically distant and demographically distinct patient cohorts could facilitate the identification and validation of reproducible genomic markers associated with RFS, by eliminating genomic markers that vary between populations. Four genes associated with recurrence in an American training cohort were also identified in a larger Korean validation cohort. One gene, DBN1, was also significant by multivariate analysis. Genomic and clinicogenomic modeling using these genes risk-stratified both cohorts and stage I to II patients of the validation cohort (P < 0.0001). These 4 internationally validated recurrence-associated genes are strong candidates for broader analysis in larger patient cohorts, and likely will contribute to personalized lung cancer management strategies in the multidisciplinary lung cancer clinic.
The discovery of genomic markers that are prognostic of non–small cell lung cancer (NSCLC) recurrence could change clinical practice by identifying patients who would do well regardless of adjuvant chemotherapy, thus sparing those patients considerable treatment-related toxicities. Recently, patients identified as low-risk for death from lung cancer on the basis of stratification by a 15-gene signature exhibited worse outcomes when treated with adjuvant cisplatin/vinorelbine (1). Many microarray studies of NSCLC gene expression have been performed with the purpose of finding gene markers predictive of overall survival (OS; refs. 2–6), but few individual genes are reproducible.
Although earlier microarray studies successfully identified gene markers associated with OS, in some cases independent of stage, some studies did not distinguish between death from NSCLC and other causes. Lung cancer markers linked to survival alone exhibit limited clinical utility, because there are significant competing causes of mortality including postoperative mortality (7, 8), the occurrence of second primary cancers including second primary lung cancer, cardiovascular disease, and chronic obstructive pulmonary disease (COPD). Among stage I NSCLC patients, 5-year disease-specific survival is 77% for pT1N0M0 patients and 62% for pT2N0M0 patients (9), while 5-year OS is 67% and 57% for these groups (10). This comparison indicates that even for stage I patients there is a significant death rate from competing causes of death. Recently, there has been an emphasis on disease-specific survival, which has been more helpful (1). Another approach is to study recurrence-free survival (RFS), which has led to significant progress in identification of genomic markers that are associated with lung cancer-specific outcomes (11, 12). Although these studies have led to identification of multivariate gene groups, to the best of our knowledge, they have not led to the discovery of single gene multivariate markers for RFS, whereas pT (9) and pN1 stage (13) are multivariate.
An as yet unrealized goal is to develop stage-independent genomic models for NSCLC that are prognostic for RFS and reproducible between differing patient populations. This goal, if accomplished, would make it possible to predict the likelihood of recurrence within 5 years of surgery based solely on genomic markers, thus giving the patient and oncologist information to make adjuvant therapy decisions that are related to the patient's individual tumor biology rather than stage (I–III). This type of model would be very useful clinically, because patients seen in the NSCLC clinic present with a wide range of stages but may carry an individual risk that differs from that predicted by stage alone. To achieve this goal, reproducible multivariate genomic markers for RFS are needed. Most studies are internally validated by dividing a relatively homogeneous patient cohort, such as one consisting of North American patients, the majority of whom are current or former smokers, into smaller test and larger validation sets. These studies are limited by the demographic and geographic similarity of the 2 populations. Recent studies have successfully identified genomic markers associated with RFS; nonetheless, the lead genomic markers derived within exclusively North American or Asian patient cohorts are not generally reproducible between demographically distinct and geographically distant populations (11, 12). Differences between genomic markers for RFS may be related to chance or reflect differing mechanisms of cancer recurrence. Differences in recurrence patterns between distinct populations could relate to genetics or interaction of genetics with diet, exercise, and secondhand smoke exposure. Although much can be learned about cancer outcome disparities by studying genomic markers within specific racial or ethnic groups, much can also be learned about what is similar between differing groups.
Genomic markers that remain the same between differing patient groups are therefore more likely to be broadly useful and independent of confounding factors. Our hypothesis is that improved reproducibility of genomic markers can be achieved by identifying markers that are reproducible between demographically distinct and geographically distant populations.
A novel approach is to identify genomic markers that are of univariate significance in a small training set and then retest them for univariate or multivariate significance in a larger validation group that is demographically distinct. We therefore took a group of 44 lead genomic markers for RFS from a small American training group (stages I–III) and found 4 genes that also exhibited either univariate or multivariate significance in a larger Korean validation group (11). These 4 genomic markers were strongly associated with recurrence in both groups and resulted in a genomic model that was prognostic for recurrence independent of clinical variables, including stage and histology.
Materials and Methods
Training set patients
This study was approved by the Indiana University Purdue University at Indianapolis (IUPUI)/Clarian Institutional Review Board (IRB; no. 0201-58), and the banking of tissue was performed under a separate protocol approved by the same IRB (no. 9401-17; IU-Lilly Tissue Bank). A longitudinal database of consented patients undergoing NSCLC resection with curative intent from July 12, 1999, to January 2, 2002, was searched for banked NSCLC tumor tissue with sufficient tumor content. The tumor resections in the single institution training set were performed at Indiana University by 2 thoracic surgeons who used consistent procedures for lobectomy or pneumonectomy during the period of the study (8, 14). The lobectomy/pneumonectomy procedure involved ligation of the vein leading to the tumor-containing lobe or lung before arterial ligation. All patients had complete peribronchial and mediastinal lymph node dissections. Patients treated on this study were subsequently followed by the multidisciplinary Thoracic Oncology Program at Indiana University.
All patients (n = 27) with stage Ia to IIIb NSCLC who were evaluable for recurrence at 2 years of follow-up after surgery were included in the study. The median follow-up time for the recurrence-free patients was 57 months. NSCLC recurrence, if suspected, was histologically confirmed. All patient-related data were deidentified by the tissue bank staff so that the investigators could access only coded frozen tissue specimens, coded paraffin slides, and relevant variables for multivariate analysis that were associated with the code number for the patient.
While none of the patients in the training set received adjuvant chemotherapy, one recurrent patient and 2 nonrecurrent patients received carboplatin-based chemotherapy as neoadjuvant treatment. One recurrent patient (no. 11) received chemotherapy 8 months before resection, having participated in a phase II study of paclitaxel/carboplatin and cetuximab. The same patient was continued on cetuximab at the time of recurrence, which was 8 months following resection. One nonrecurrent patient (no. 13) received paclitaxel/carboplatin 2 months before resection. Another nonrecurrent patient (no. 21) received paclitaxel/carboplatin for 2 cycles before resection, which was performed 7 weeks after initiation of chemotherapy.
NSCLC tumor histology was confirmed by hematoxylin/eosin (H&E) staining of associated tissue blocks. The frozen samples included in the study ranged from 40% to 90% tumor tissue, with a mean of 59% ± 11% SD. Of the 30 specimens that were available for study, 3 were rejected because they contained mainly stroma or necrosis, or exhibited less than 40% tumor epithelium. The percentage of malignant cells in the frozen tumor specimen could not be determined definitively from the single 5-μm frozen section, in part because of the difficulty of completely counting the admixture of interspersed normal stromal cells, endothelial cells, and immunocytes, which were much smaller than the tumor cells.
RNA isolation and purification
Tumors were collected by a tissue procurement service in the operating room, stored in liquid nitrogen immediately following resection, and kept frozen until RNA isolation was performed. Specimens were removed from liquid nitrogen, transported on dry ice to the laboratory, and homogenized using a Polytron homogenizer (Brinkmann Instruments) in ice-cold Trizol reagent (Gibco BRL). RNA was extracted with chloroform and precipitated using isopropanol with glycogen as carrier. The RNA pellet was washed in 75% ethanol and resuspended in DEPC-treated MilliQ water. The RNA was diluted in a guanidinium thiocyanate-containing RLT buffer (Qiagen Sciences, Inc.) and further purified using a silica-gel-based membrane in RNeasy MinElute spin columns. The RNA quality was confirmed by electrophoresis on a 1% nondenaturing agarose gel or by Agilent Bioanalyzer, which revealed intact 18S and 28S rRNA in all samples. UV spectroscopy (A210−A350) was performed to confirm RNA purity; the A260/A280 ratio was 1.99 to 2.00 for all samples.
cDNA was prepared from total RNA (10 μg) using the Superscript Choice system (Gibco-BRL). A T7-(dT) oligonucleotide primer was used for first strand synthesis. Double-stranded cDNA was purified by phenol/chloroform extraction and the Phase Lock Gel method (Eppendorf-5 Prime). Biotin-labeled cRNA was synthesized using a BioArray RNA Amplification and Labeling kit (ENZO), cleaned and fragmented (15). The cRNA was biotin labeled and hybridized on U133A GeneChip arrays for 17 hours (45°C) in an Affymetrix GeneChip 640 Hybridization Oven (Affymetrix). Washing and staining of the chips were performed using the Affymetrix GeneChip Fluidics Station 400. Arrays were scanned in an HP GeneArray Scanner (Hewlett Packard), and data were analyzed with Affymetrix Microarray Suite v5.0 software (MAS5).
Microarray analysis of discriminatory genes
Probes that were not detected as “present” in at least half of the samples using the standard MAS5 parameters were removed from analysis (16, 17). The remaining probes were analyzed using Welch's t-test, assuming unequal variance, on the log2 transformed MAS5 signals. The Affymetrix chip for one patient, sample no. 26, was damaged; consequently, this patient was deleted from the gene analyses.
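The filtering and testing steps described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the encoding of MAS5 detection calls as "P"/"M"/"A" strings, and the example values are assumptions. Only the t statistic and the Welch–Satterthwaite degrees of freedom are computed; converting them to a P value would require a t-distribution routine from a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def passes_present_filter(calls, min_fraction=0.5):
    """Keep a probe only if it is called present ('P') in at least half of the samples."""
    present = sum(1 for call in calls if call == "P")
    return present >= min_fraction * len(calls)


def welch_t(group_a, group_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    va = variance(group_a) / len(group_a)  # squared standard error, group A
    vb = variance(group_b) / len(group_b)  # squared standard error, group B
    t = (mean(group_a) - mean(group_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(group_a) - 1) + vb ** 2 / (len(group_b) - 1))
    return t, df


# Hypothetical log2-transformed signals for a single probe (not study data)
recurrent = [8.1, 8.4, 8.9, 8.6]
nonrecurrent = [7.2, 7.5, 7.1, 7.6, 7.4]
t_stat, dof = welch_t(recurrent, nonrecurrent)
```

Because the variances are not pooled, the degrees of freedom are generally non-integer and smaller than the pooled-variance value, which makes the test more conservative when the two groups differ in spread.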
Quantitative reverse transcriptase PCR
First strand cDNA synthesis was performed using SuperScript III reverse transcriptase (RT; Invitrogen Corp.) with total RNA (5 μg) and oligodT12–18 primer. The reaction mix was diluted (1:12.5) and used for quantitative PCR (qPCR) analysis. The primers and probes used for the PCR were purchased from Applied Biosystems (Foster City). The probes used for DBN1 (Hs00365623_m1, exon boundary 9–10), PP591 (Hs00611011_m1, exon boundary 5–6), CCND2 (Hs00277041_m1, exon boundary 1–2), and CACNB3 (Hs00167873_m1, exon boundary 9–10) were all Fluorescein Amidite (FAM) labeled. A FAM-labeled glyceraldehyde-3-phosphate dehydrogenase (GAPDH) probe (433376F) was used in separate experiments to normalize the Ct value between the samples. The Applied Biosystems master mix was used for the qPCR reaction and the manufacturer's instructions were followed. The qPCR reactions were performed in an iCycler (Bio-Rad Laboratories) at 50°C (2 minutes) and 95°C (10 minutes), followed by 40 cycles consisting of 95°C (15 seconds) and 60°C (1 minute) steps. The reactions were performed in triplicate, and the GAPDH Ct values were subtracted from the raw sample Ct values to get the corrected Ct, which was converted into the relative RNA amount using the formula 2^(−corrected Ct). All 27 patient samples were used for qPCR analysis.
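The Ct normalization can be illustrated with a short sketch. The function name and the example Ct values are hypothetical, and averaging the triplicate wells before subtraction is an assumption, since the text does not state how the replicates were combined.

```python
from statistics import mean


def relative_rna_amount(target_cts, gapdh_cts):
    """Convert replicate Ct values into a GAPDH-normalized relative RNA amount.

    corrected Ct = mean(target Ct) - mean(GAPDH Ct)
    relative amount = 2 ** (-corrected Ct)
    """
    corrected_ct = mean(target_cts) - mean(gapdh_cts)
    return 2 ** (-corrected_ct)


# Hypothetical triplicate Ct values for one tumor sample (not study data)
dbn1_amount = relative_rna_amount([25.1, 24.9, 25.0], [20.0, 20.1, 19.9])
```

A corrected Ct that is 1 cycle lower corresponds to a 2-fold higher relative RNA amount, under the usual assumption of perfect doubling per PCR cycle.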
Hierarchical clustering analysis of genes associated with recurrence
For hierarchical clustering, 51 probe sets that differed between the groups (P < 0.001) were selected. The log2 transformed MAS5 signals were normalized [(signal-mean)/SD] and then arrays were clustered using the hierarchical clustering function of Partek Genomics Suite, version 6.3 2007 (Partek, Inc.), with Euclidean distance and average linkage. Pearson's dissimilarity and average linkage were used to cluster the probe sets. Normalization was performed on each probe set to ensure that no individual probe set would have undue influence on the clustering.
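The normalization and the two distance measures named above can be sketched in plain Python. This is an illustration only, not the Partek implementation: the use of the sample SD, the definition of Pearson's dissimilarity as 1 − r (one common convention; the scaling used by the software is not stated in the text), and the example signals are all assumptions.

```python
from math import sqrt
from statistics import mean, stdev


def zscore(values):
    """Normalize one probe set across arrays: (signal - mean) / SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def euclidean(x, y):
    """Euclidean distance, used here for clustering arrays."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_dissimilarity(x, y):
    """1 - Pearson correlation, used here for clustering probe sets (assumed form)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1 - cov / (sx * sy)


# Hypothetical log2 signals: rows are probe sets, columns are arrays (patients)
signals = [
    [8.1, 8.4, 7.2, 7.0],
    [5.0, 5.3, 6.1, 6.4],
    [9.9, 9.7, 9.1, 9.0],
]
normalized = [zscore(row) for row in signals]
arrays = list(zip(*normalized))  # transpose to one profile per array
array_distance = euclidean(arrays[0], arrays[1])
probe_dissimilarity = pearson_dissimilarity(normalized[0], normalized[1])
```

Average linkage would then merge, at each step, the two clusters with the smallest mean pairwise distance between their members.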
Validation set patients
The validation set consisted of 138 patients from Samsung Medical Center. Of these, 69 exhibited no recurrence following surgery (group NR) and 69 exhibited recurrence after surgery (group R). The details of the patients of this validation set are described in Lee and colleagues, 2008 (11). The patients were a mix of different NSCLC stages, established by pathologic staging after surgery: 64%, 17%, and 19% were stages I, II, and III, respectively. The tumor node metastasis (TNM) staging of the patients was widely distributed: 17% of patients were pT1, 68% pT2, 7% pT3, and 7% pT4. Most of the patients had no nodal involvement (pN0, 71%), while 20% were pN1 and 9% were pN2. Of the total, 63 patients had adenocarcinoma and 75 patients had squamous cell carcinoma. The microarray data were processed using gene chip robust multiarray average (GCRMA) normalization (18) with perfect match (PM) and perfect match/mismatch (PM/MM) modeling.
Among the 69 nonrecurrent patients, 3 received adjuvant combination chemotherapy. Specifically, 1 patient received a combination of fluorouracil, leucovorin, ifosfamide, and dexamethasone, another patient received combination of etoposide and dexamethasone, and a third patient received a combination of cisplatin and paclitaxel. Nine of the recurrent patients received different adjuvant chemotherapy or biologically targeted regimens including: gefitinib, etoposide/dexamethasone, cisplatin/paclitaxel, cisplatin/etoposide/dexamethasone, and docetaxel/cisplatin/dexamethasone/gemcitabine.
Clinical, genomic, and clinicogenomic models of RFS in the training set
The objective was to derive 3 models for RFS: models based solely on patient characteristics (i.e., clinical model), solely on gene expression levels (i.e., genomic model), and on both (i.e., clinicogenomic model). To model RFS outcomes in the training set, multivariate analyses were performed using a stepwise Cox proportional hazards model on the 27 patients (stage I–III). For the clinical model, patient characteristics including age, sex, race, and smoking history were considered. Smoking history was categorized as: current smoker (C, defined as smoking <1 year before surgery), former smoker (F, defined as quit >1 year before surgery), and never smoker (N). For each of the models, a subsequent Cox score was calculated and dichotomized at its median value into low- and high-risk groups. The RFS curves were estimated using the Kaplan–Meier method and compared by log-rank test. The variables used for the clinical model of the training set were: histology, pathologic stage (pStage), sex, race, and smoking. For the genomic and clinicogenomic models the natural logarithm of gene expression levels determined by qPCR was included, using the genes DBN1, FLAD1 (PP591; flavin adenine dinucleotide synthetase), CACNB3, and CCND2, which were identified as being differentially expressed in both the training and validation sets. Median value of the Cox score was determined as previously described (11) and was used to dichotomize the patients into low- and high-risk groups for Kaplan–Meier plots. The equations used for the modeling are given in the Figure 2 legend. The SAS program codes used to derive the models are provided in the Supplementary Material (SAS codes).
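The scoring and dichotomization step can be sketched as follows. The coefficients below are placeholders chosen for illustration (the fitted equations are given in the Figure 2 legend and are not reproduced here), and assigning scores equal to the median to the low-risk group is an assumption.

```python
from math import log
from statistics import median

# Placeholder Cox coefficients, NOT the fitted values from the study
COEFFICIENTS = {"DBN1": 0.38, "FLAD1": 0.25, "CACNB3": 0.20, "CCND2": -0.15}


def cox_score(expression):
    """Linear predictor: sum of coefficient x ln(qPCR expression level)."""
    return sum(beta * log(expression[gene]) for gene, beta in COEFFICIENTS.items())


def dichotomize(scores):
    """Split patients at the median Cox score into low- and high-risk groups."""
    cutoff = median(scores)
    return ["high" if score > cutoff else "low" for score in scores]


# Hypothetical relative expression levels for 4 patients (not study data)
patients = [
    {"DBN1": 1.0, "FLAD1": 1.0, "CACNB3": 1.0, "CCND2": 1.0},
    {"DBN1": 2.0, "FLAD1": 1.0, "CACNB3": 1.0, "CCND2": 1.0},
    {"DBN1": 0.5, "FLAD1": 1.0, "CACNB3": 1.0, "CCND2": 1.0},
    {"DBN1": 4.0, "FLAD1": 1.0, "CACNB3": 1.0, "CCND2": 1.0},
]
scores = [cox_score(p) for p in patients]
groups = dichotomize(scores)
```

The two resulting groups are what the Kaplan–Meier curves and the log-rank test compare.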
Genomic and clinicogenomic RFS models of the validation set
Genomic and clinicogenomic models for the stage I to III patients of the validation set were developed similar to a prior study of the same cohort, for which the clinical model has already been published (11). The genomic markers used for the validation set model were: DBN1, FLAD1 (PP591; flavin adenine dinucleotide synthetase), CACNB3, and CCND2. For the validation set, there were several Affymetrix probes for each gene: 2 for DBN1, 2 for CACNB3, 5 for CCND2, and 2 for FLAD1 (Supplementary Table S2). To optimize the model, all the probe combinations were considered (2 × 2 × 5 × 2 = 40). Single Affymetrix probes were selected for each gene using a stepwise parameter method (11) that allowed selection of an optimized model of 4 unique probes (Supplementary Table S2). To develop the clinicogenomic model for the validation set, the same stepwise parameter selection was also used for filtering of clinical variables. Pathologic T stage (P = 0.0003) and pathologic N stage (P = 0.0038) were identified as being of utility among the considered variables, whereas age, gender, cell type, tumor size, smoking status, and tumor differentiation were not.
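The size of the probe search space follows directly from the per-gene probe counts, as this short sketch shows. The probe labels are placeholders; the actual probe set identifiers are listed in Supplementary Table S2.

```python
from itertools import product

# Number of Affymetrix probes per gene, as stated in the text
probe_counts = {"DBN1": 2, "CACNB3": 2, "CCND2": 5, "FLAD1": 2}

# Placeholder labels standing in for the real probe set IDs
probes = {
    gene: [f"{gene}_probe{i + 1}" for i in range(n)]
    for gene, n in probe_counts.items()
}

# Every way of choosing exactly one probe per gene
combinations = list(product(*probes.values()))
n_combinations = len(combinations)  # 2 x 2 x 5 x 2
```

Each tuple in `combinations` is one candidate 4-probe model; a selection procedure such as the stepwise method described above would then pick among them.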
Median value of the Cox score was determined as previously described (11) and was used to dichotomize the patients into low- and high-risk groups for Kaplan–Meier plots. The equations used for the modeling are given in the Figure 2 legend. The SAS program codes used to derive the models are provided in the Supplementary Material (SAS codes).
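For readers less familiar with the Kaplan–Meier plots referred to here, the product-limit estimate for one risk group can be sketched as below. The function name, the event coding (1 = recurrence, 0 = censored), and the follow-up times are illustrative assumptions; the study itself used SAS.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time with at least one event.

    events: 1 = recurrence observed, 0 = censored at that time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_at_t = sum(1 for tt, _ in data if tt == t)             # patients leaving the risk set at t
        d_at_t = sum(1 for tt, e in data if tt == t and e == 1)  # recurrences at t
        if d_at_t > 0:
            survival *= 1 - d_at_t / at_risk
            curve.append((t, survival))
        at_risk -= n_at_t
        i += n_at_t
    return curve


# Hypothetical follow-up in months for 5 patients (not study data)
curve = kaplan_meier([5, 8, 12, 20, 30], [1, 0, 1, 0, 0])
```

Censored patients leave the risk set without lowering the curve, which is why the second step in this example divides by 3 patients at risk rather than 4.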
Genomic and clinicogenomic RFS modeling of stage I and II patients in the validation set
Stage I to II patients of the validation set were selected for Cox analysis. The genomic markers used for the genomic modeling were those described above (DBN1, FLAD1, CACNB3, and CCND2). The clinicogenomic model combined the genomic information with pT and pN stage data. The Cox equations used for the modeling are given in the Figure 2 legend. The SAS program codes used to derive the models are provided in the Supplementary Material (SAS codes). Due to the relatively small number of patients, this type of analysis was not performed on the training set.
Genomic and clinicogenomic OS modeling of validation set
Stage I to III patients of the validation set were included in this Cox analysis. All deaths included in this analysis were due to lung cancer, including complications associated with lung cancer (disease-specific OS). To optimize the OS model, a Cox proportional hazards regression model with a stepwise parameter selection method was developed. The Cox equations used for the modeling are given in the Figure 3 legend. The SAS program codes used to derive the models are provided in the Supplementary Material (SAS codes). Due to the relatively small number of patients, this analysis was not performed on the training set.
Results
American patient training set
The training set of patients accrued at the Indiana University Thoracic Oncology Program consisted of 27 patients with stage Ia to IIIb NSCLC who were evaluable for recurrence at 2 years of follow-up. All histologies of NSCLC were included. This patient cohort was reflective of a broad patient population seen in an American academic medical center without selection for histology or stage. This approach could be useful because genomic markers derived from this type of study could potentially be applied to clinical decision-making in a general thoracic oncology practice. There were 11 patients who experienced recurrence within 2 years of resection (group R) and 16 patients who did not (group NR). Patient characteristics, including stage of disease, recurrence status, histology, age at surgery, smoking history, survival after surgery, time to recurrence (TTR), gender, and adjuvant therapy are described in Table 1. The majority of the NSCLC patients exhibited adenocarcinoma histology (n = 19). Mean age of the patients at operation was 61.8 (SD = 12.8) years (range 34–81 years).
The median TTR of group R was 12 months and the median OS was 20 months. This study had a 57-month median follow-up and the 2-year cut-off captured 81% of the recurring patients. Among the NR group, 19% of the patients recurred after 2 years, the recurrences being at 32, 50, and 53 months and these patients were still alive at the time of data collection closure. Age at operation, race, gender, and smoking status were analyzed in conjunction with gene expression data in subsequent Cox multivariate analysis (see below).
Genomic markers associated with NSCLC recurrence in the training set
To identify genes that were differentially expressed in recurring patients, genomic microarray analysis of tumor gene expression was performed by Affymetrix U133A chip hybridization. Candidate genomic recurrence markers (Table 2) were identified from 12,956 evaluable probes in the microarray by statistical analysis of log2-transformed signals (Welch's t-test; P ≤ 0.001). This analysis resulted in 51 probes corresponding to 44 genes that were differentially expressed between the R and NR groups (Table 2; raw data available at the GEO database; GSE9971; Supplementary Table S1).
Hierarchical clustering analysis of training set genes associated with NSCLC recurrence
Hierarchical clustering using the 51 probes associated with recurrence separated the training set patients who recurred from those who did not. With patients grouped by recurrence status at 2 years of follow-up, this clustering identified 3 subgroups of genes exhibiting upregulation and 1 exhibiting downregulation in recurrence (Fig. 1). The most statistically significant markers upregulated in recurrence were, in order of significance by Welch's t-test, FLJ20343, DKFZp566O084, CACNB3, CYP3A5, and DBN1 (Table 2). The first 3 markers were associated with 1 cluster of genes upregulated in patients exhibiting recurrence (group 1), while CYP3A5 was associated with a second cluster (group 2), and DBN1 with a third (group 3). Genes in groups 1 to 3 were associated with a broad range of t-test P values (ranging from 0.001 to 0.00001). The most statistically significant markers that were downregulated in recurring patients were, in order of significance, C14orf118, STAT2, ATF7IP, HIPK3, and HLA-DOA (Table 2), and this group clustered together (group 4). Group 4 exhibited a narrower distribution of t-test P values (ranging from 0.0009 to 0.00008), consistent with the larger size of this group.
Screening of a Korean patient validation set for concordance of candidate genomic markers
We hypothesized that screening of a geographically distant and demographically distinct patient population for recurrence-associated genomic markers would lead to the identification of more reproducible genes for the study of NSCLC prognosis. To find a distinct and larger validation set, the GEO database was screened for NSCLC studies with similar stage grouping and no selection based on histology, performed on a similar genomics platform. A Korean study of NSCLC recurrence-associated genomic markers, GSE8894, met the criteria for comparison with the American training set and was probed for genomic markers associated with recurrence common to both sets (11). The Kaplan–Meier curve for the Korean patients exhibited a median disease-specific survival time of 69.4 months and a 5-year survival percentage of 56.2%, comparable to but perhaps slightly shorter than North American patients with resected NSCLC (ref. 9; Supplementary Fig. S1).
Each of the 44 genes identified in the American training set was tested in the Korean data set for significance by univariate and multivariate Cox analysis. The 4 genes that were most significant by univariate Cox analysis were, in order of significance, DBN1, FLAD1 (PP591), CACNB3, and CCND2 (Table 3). The first 3 genes exhibited P values less than 0.05 and the fourth, exhibiting a P value of 0.08, was retained for model building. By multivariate analysis, only DBN1 exhibited significance (P = 0.0095; Table 3). Nonetheless, 2 of the genes approached multivariate significance, FLAD1 (P = 0.0720) and CCND2 (P = 0.0713), while CACNB3 did not (P = 0.1813; Table 3). These results indicate that DBN1 is potentially of value as a multivariate marker and the combination of the 4 genes was effective in model building (see below).
Confirmation of DBN1, CACNB3, FLAD1 (PP591), and CCND2 expression in the training set by qPCR
The utility of prognosis-associated genes identified by microarray analysis is increased if they are also assayable by qPCR (11). Increased expression of the DBN1, CACNB3, and FLAD1 (PP591) genes in the recurrent NSCLC tumors was confirmed in the training set by qPCR (Table 4). Decreased expression of CCND2 in the recurring patients of the training set approached, but did not reach, statistical significance (Table 4). These expression values were used to perform subsequent Cox regression analysis of the training set.
Clinical, genomic, and clinicogenomic modeling of the training and validation sets
A clinical model of the training set was developed, based on histology, pStage, sex, race, and smoking history (Fig. 2A). The clinical model was effective at separating the low- and high-risk patients in the training set (P = 0.0032). The median RFS for the training set was 17.2 months, while the 5-year RFS was 83.4% and 28.6% for the low- and high-risk groups, respectively (Fig. 2A; Table 5). Nonetheless, application of a similar clinical model to the larger validation cohort was less successful (P = 0.0518; ref. 11). A genomic model developed from the training set was more effective than the clinical model at risk stratification of the training (P < 0.0001; Fig. 2A) and validation sets (P < 0.0001; Fig. 2B). Using the genomic model, the 5-year RFS for the low- and high-risk groups was 92.3% versus 15.4% and 67.5% versus 32.8% in the training and validation sets, respectively (Table 5). Using the clinicogenomic model, patients were effectively risk-stratified in the training (P < 0.0001) and validation sets (P < 0.0001; Fig. 2A and B). Using the clinicogenomic model, the 5-year RFS for the low- and high-risk groups was 92.3% versus 15.4% and 67.0% versus 33.3% for the training and validation sets, respectively (Table 5). In summary, the genomic and clinicogenomic models exhibit clinical utility because the difference in 5-year RFS between the low- and high-risk groups is more than 2-fold, indicating a substantial clinical effect.
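The "more than 2-fold" statement can be checked directly from the 5-year RFS percentages quoted above. The dictionary layout and variable names are illustrative; the numbers are those reported in the text.

```python
# 5-year RFS (%) for the (low-risk, high-risk) groups, as quoted in the text
rfs_5yr = {
    ("genomic", "training"): (92.3, 15.4),
    ("genomic", "validation"): (67.5, 32.8),
    ("clinicogenomic", "training"): (92.3, 15.4),
    ("clinicogenomic", "validation"): (67.0, 33.3),
}

# Ratio of low-risk to high-risk 5-year RFS for each model and cohort
fold_difference = {key: low / high for key, (low, high) in rfs_5yr.items()}
all_above_2_fold = all(ratio > 2 for ratio in fold_difference.values())
```

The ratios are much larger in the 27-patient training set than in the validation set, as would be expected for models evaluated on the data used to fit them.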
Clinical, genomic, and clinicogenomic modeling of stage I and II patients in the validation set
Because the decision to offer chemotherapy or not is crucial for early stage patients, we reanalyzed the stage I to II patients in the validation set using the 4 genomic markers to develop genomic and clinicogenomic models for this risk group. The genomic model risk-stratified the stage I to II patients (P < 0.0001; Fig. 2C), exhibiting 5-year RFS of 73.2% versus 33.8% for the low- and high-risk groups, respectively (Fig. 2C and Table 5). The clinicogenomic model also risk-stratified stage I to II patients (P < 0.0001; Fig. 2C), exhibiting 5-year RFS of 69.6% versus 30.3% for the low- and high-risk groups, respectively (Fig. 2C and Table 5). These results support the utility of the 4 genes for risk model development for stage I to II patients.
Genomic and clinicogenomic modeling of the validation set based on disease-specific OS
Risk stratification on the basis of disease-specific OS is another important test of the utility of the 4 genomic markers. Therefore, genomic and clinicogenomic models of disease-specific OS were developed for the validation set. Both models were effective at risk-stratifying the patients into low- and high-risk groups (P < 0.0001; Fig. 3). Using the genomic and clinicogenomic models, the 5-year disease-specific survival for the low- and high-risk groups was 63.3% versus 37.0% and 67.3% versus 44.2%, respectively (Table 5). These differences in disease-specific OS were at least 1.5-fold, indicating clinical utility.
Analysis of multivariate marker DBN1
The genomic marker DBN1 was identified as significant in the genomic and clinicogenomic models of RFS for stages I to III, RFS for stages I to II, and disease-specific OS (stage I–III RFS genomic: HR = 1.463, CI = 1.088–1.967; stage I–III RFS clinicogenomic: HR = 1.758, CI = 1.248–2.476; stage I–II RFS genomic: HR = 1.455, CI = 1.038–2.04; stage I–II RFS clinicogenomic: HR = 1.72, CI = 1.165–2.541; OS genomic: HR = 1.484, CI = 1.057–2.082; OS clinicogenomic: HR = 1.627, CI = 1.115–2.367; Supplementary Table S3). The addition of pT and pN stage information improved the significance of DBN1 for all models (genomic vs. clinicogenomic: RFS stage I–III, P = 0.0119 vs. 0.0013; RFS stage I–II, P = 0.0297 vs. 0.0064; OS, P = 0.0226 vs. 0.0117). This finding indicates that DBN1 serves as a component of the genomic model that can be improved by the addition of clinical stage.
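The relationship between the reported hazard ratios, confidence intervals, and P values can be illustrated with a Wald-type calculation. This sketch assumes that the intervals are 95% Wald intervals on the log hazard ratio, which the text does not state explicitly; the function name is hypothetical.

```python
from math import erfc, log, sqrt


def wald_from_ci(hr, lower, upper, z_crit=1.959964):
    """Recover the Wald z statistic and two-sided P value from an HR and its CI.

    Assumes a symmetric 95% interval on the log scale:
    SE = (ln(upper) - ln(lower)) / (2 * z_crit).
    """
    se = (log(upper) - log(lower)) / (2 * z_crit)
    z = log(hr) / se
    p = erfc(abs(z) / sqrt(2))  # two-sided P value under the standard normal
    return z, p


# DBN1 in the stage I-III RFS models, using the values reported above
z_genomic, p_genomic = wald_from_ci(1.463, 1.088, 1.967)
z_clinicogenomic, p_clinicogenomic = wald_from_ci(1.758, 1.248, 2.476)
```

Under these assumptions the recovered P values land close to the reported 0.0119 and 0.0013, which suggests that the reported intervals and P values are internally consistent.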
Discussion
We hypothesized that comparison of 2 geographically distant and demographically distinct patient cohorts, namely American and Korean, could facilitate the identification and validation of NSCLC genomic markers associated with RFS. Four genes were associated with recurrence in the American training set and validated in the larger Korean patient cohort. Genomic and clinicogenomic modeling for RFS risk-stratified both cohorts with high statistical significance, and genomic modeling alone was sufficient to risk-stratify the groups. Genomic and clinicogenomic models also risk-stratified stage I to II patients of the validation set. For stage I to II or I to III patients, the 5-year RFS of the low- and high-risk patients differed by 2-fold or more. The genomic and clinicogenomic models also predicted 5-year disease-specific OS, exhibiting a 1.5 to 1.7-fold difference between low- and high-risk groups.
The comparison of American and Korean cohorts resulted in discovery of DBN1 as a multivariate biomarker, which by itself exhibits significant prognostic utility for RFS that is improved by the addition of the clinical markers of pT and pN stage. Because addition of clinical stage information increased the HR of the DBN1 marker in all patient groups and for all outcomes, DBN1 may be a platform on which to build future models consisting of other multivariate markers. Additional multivariate markers may be discovered by similar comparisons of demographically distinct NSCLC patient populations. For example, the validation set from this study could be used as the training set for an even larger demographically distinct patient cohort, such as a larger American patient group. This iterative process would confirm that some or all of the genomic markers derived here can contribute to further model building, thus allowing model refinement.
It is notable that the genomic markers that correlate between the training and validation groups were not necessarily the ones exhibiting the lowest univariate P values for differential expression or the greatest fold-differences in expression. This indicates that if modeling were to be performed solely on the basis of P value or fold-differences, genes that could contribute to more robust modeling would be missed. Although clustering of gene expression patterns identified 4 clusters, only genes belonging to 3 of the clusters contributed to the model. Of note, one of these clusters contained 2 of the 4 genes, including DBN1, which is the only gene significant by multivariate analysis. The approach taken here suggests that successful genomic modeling can be performed even with a relatively small number of significant genomic markers identified by comparison of very different training and validation sets. Furthermore, the approach taken may be effective at identifying more reproducible and broadly applicable genomic markers, which are essential for efficient clinical trial development of a risk model. Thus, adopting this approach, a single, committed gene set can be derived that is statistically significant across a number of patient populations, independent of demographic factors, with informative Cox models as the product.
Of importance, the 4 genomic markers identified are all amenable to qPCR assay, which is an important factor in determining clinical development of the 4-gene set. The utilization of qPCR data for Cox model development confirms their robust amplification and indicates that the differences in their expression between recurrent and nonrecurrent patients are sufficient to overcome background noise (11). These findings also provide a robust assay to allow rapid confirmation of the results in other NSCLC patient cohorts.
In summary, the genomic model developed in the present study identifies recurrence-associated genes that are internationally validated and are strong candidates for broader analysis in larger patient cohorts. The validated genomic model exhibits utility because of the magnitude of the absolute differences between 5-year RFS and OS of the low- and high-risk patients. The utility of the 4-gene set is independent of stage, and it is applicable to RFS of stage I to II and stage I to III patients as well as OS of stage I to III patients. Together these results support the further development of the 4-gene set or its multivariate DBN1 component for future NSCLC risk model development. Of particular importance, risk stratification has recently shown that low-risk NSCLC patients may exhibit decreased disease-specific survival when treated with adjuvant cisplatin/vinorelbine chemotherapy (1). With confirmation, our 4 genomic markers could be prospectively tested in a future clinical trial to determine the effectiveness of risk stratification of patients independent of stage and could also potentially be used to test the effectiveness of novel adjuvant chemotherapy regimens in low- and high-risk patients.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
D.A. Potter acknowledges grants NIH P20GM066402, R01 CA113570, the Flight Attendant Medical Research Institute, Walther Cancer Research Prize, the Thoracic Oncology Program at Indiana University Simon Cancer Center, the Walther Oncology Center at Indiana University, the Cancer Experimental Therapeutics Initiative of the Masonic Cancer Center, and the Dr. Barbara Bowers Oncology Fund of the Fairview Foundation. The training set microarray experiments were carried out using the facilities of the Center for Medical Genomics at Indiana University School of Medicine, which is supported in part by a grant to H.J. Edenberg from the Indiana 21st Century Research and Technology Fund, and from the Indiana Genomics Initiative (INGEN). This work was supported in part by NIH P30CA077598 utilizing the support of the Biostatistics and Bioinformatics Core of the Masonic Cancer Center, University of Minnesota, shared resource. We acknowledge a grant from Eli Lilly and Co. to support the tissue bank, and we thank Carol Boyd and Christina Beard for excellent technical help with tissue banking.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
We thank Drs. Lawrence Einhorn, Nasser Hanna, David Flockhart, David Donner, Jorge Capdevila, Ming Sound Tsao, Ignacio Wistuba, David Beer, Mitch Raponi, Joan Schiller, Peter Ravdin, Faris Farassati, Janice Blum, Penni Black, Arkadiusz Dudek, and Miriam Garland for helpful discussions. We are grateful to Michael Franklin for outstanding editing help. This article is dedicated to the memory of Dr. Stephen Williams, past Director of the IU Simon Cancer Center whose vision led to the IU/Lilly Tumor Bank.
Note: Supplementary data for this article are available at Clinical Cancer Research Online (http://clincancerres.aacrjournals.org/).
- Received July 9, 2010.
- Revision received December 2, 2010.
- Accepted January 3, 2011.
- ©2011 American Association for Cancer Research.