Purpose: Recent studies sought to refine lung cancer classification using gene expression microarrays. We evaluate the extent to which these studies agree and whether results can be integrated.
Experimental Design: We developed a practical analysis plan for cross-study comparison, validation, and integration of cancer molecular classification studies using public data. We evaluated genes for cross-platform consistency of expression patterns, using integrative correlations, which quantify cross-study reproducibility without relying on direct assimilation of expression measurements across platforms. We then compared associations of gene expression levels to differential diagnosis of squamous cell carcinoma versus adenocarcinoma via reproducibility of the gene-specific t statistics and to survival via reproducibility of Cox coefficients.
Results: Integrative correlation analysis revealed a large proportion of genes in which the patterns agreed across studies more than would be expected by chance. Correlation of t statistics for diagnosis of squamous cell carcinoma versus adenocarcinoma is high (0.85) and increases (0.925) when using only the most consistent genes identified by integrative correlation. Correlations of Cox coefficients ranged from 0.13 to 0.31 (0.33–0.49 with genes selected for consistency). Although we find genes that are significant in multiple studies but show discordant effects, their number is approximately that expected by chance. We report genes that are reproducible by integrative analysis, significant in all studies, and concordant in effect.
Conclusions: Cross-study comparison revealed significant, albeit incomplete, agreement of gene expression patterns related to lung cancer biology and identified genes that reproducibly predict outcomes. This analysis approach is broadly applicable to cross-study comparisons of gene expression profiling projects.
Lung cancer is currently classified according to morphological patterns (e.g., squamous cell carcinoma, adenocarcinoma, small cell carcinoma, and large cell carcinoma). However, this classification is frequently ineffective in predicting the biological behavior of these cancers, and therefore, molecular characteristics such as gene expression profiles are being considered as alternative classification approaches. Using gene array measurements of expression profiles, several groups have recently reported findings to suggest that distinctive molecular profiles could lead to refinement of classification and prognostication of lung cancer (1, 2, 3, 4, 5) .
Attempts to integrate and cross-study validate the results of various gene expression profiling projects are complicated by the use of diverse microarray platforms, sample sets, protocols, and analytical approaches. Furthermore, review of data summaries from publication yields, in some cases, apparent conflicts of findings. For example, ornithine decarboxylase was listed as a gene highly expressed in the good outcome class in one study (1) and the poor outcome class for another (2) .
To approach systematically the issue of comparing results of existing lung cancer profiling projects, we carried out a combined analysis of three projects that measured gene expression in large sets of tumors representing a spectrum of lung cancer histological patterns (1, 2, 3) . One study, a collaboration between investigators at Stanford University and the Humboldt University Institute of Pathology (here, the Stanford study; Ref. 1 ), used 24,000 element cDNA arrays to profile 67 lung tumors of various histological patterns. A second study, performed by scientists at the Dana-Farber Cancer Institute and the Massachusetts Institute of Technology (here, the Harvard study), used Affymetrix oligonucleotide arrays Hu95a representing 12,600 transcripts to profile 203 samples, including 186 lung tumor samples (2) , again of various histological patterns. The third study was conducted at the University of Michigan (here, the Michigan study) and used Affymetrix arrays HG6800 to profile 108 cases of adenocarcinoma (3) .
The Harvard and Stanford gene expression profiling projects reported that hierarchical clustering analysis of the gene expression patterns could distinguish the major morphological classes of lung cancer (i.e., small cell carcinoma, squamous cell carcinoma, and adenocarcinoma) and also define subgroups of adenocarcinoma tumors. Subclassification of adenocarcinoma appeared to be different in the two projects, however. For example, the hierarchical clustering in the Stanford study defined three groups of adenocarcinoma, whereas the clustering in the Harvard study defined four groups of adenocarcinoma. More significantly, the Stanford group reported impressive differences in survival among the three groups, with patients from one class (group 2) experiencing no mortality at up to 54 months and patients in another class (group 3) experiencing 100% mortality by 16 months (1) . In contrast, only a relatively modest difference in survival was seen for one of the Harvard classes compared with the others (2) . The Michigan study, which differed from the Harvard and Stanford studies by measuring gene expression patterns only in cases of adenocarcinoma, found modest differences in survival among different subgroups defined by hierarchical clustering. However, gene expression signatures significantly predictive of outcome could be defined when supervised statistical methods were applied by the Michigan investigators.
Although all three studies showed promise in terms of defining molecular profiles for lung cancer, the extent of concordance among these profiles could not be readily appreciated from the data as summarized in the manuscripts. To determine the level of consistency between the various studies with regard to class characteristics and to identify specific predictive markers, we initiated a comparative reanalysis of these studies, using the data available as supplements to the published studies.
Materials and Methods
All data used for our analysis were obtained from web-based information supporting the published manuscripts (1, 2, 3) . We focused our analysis on genes that were (a) measured in all three projects and (b) met criteria, established by investigators in the original studies, for inclusion in their cluster analysis. Criteria included quality checks and minimum requirements on variation in expression levels across different tumors and are described in detail in the original studies. There were 3171 common genes in the three data sets, as recognized by cross-referencing Unigene cluster numbers using the Bioconductor annotate package (6) . These included 307 genes that met criteria for consideration in the cluster analysis.
To facilitate comparisons, expression indices for both Affymetrix data sets were recomputed using the Robust Multichip Analysis (RMA) method (7) from the original CEL files. We averaged expression indices over multiple Affymetrix probe sets that corresponded to the same Unigene cluster number. We applied a logarithmic transformation to the cDNA intensity ratios of the Stanford study, to improve comparability with the expression indices produced using the RMA method. Our subsequent analysis was conducted in three phases. First, using data from all samples, we used integrative correlation analysis, a novel feature of this article described in detail below, to assess overall reproducibility of gene coexpression patterns across the three studies and to identify genes with relatively consistent coregulation patterns. Second, we compared the Harvard and Stanford data for genes that distinguish two major histological classes of lung cancer, squamous cell carcinoma, and adenocarcinoma. Finally, using cases of adenocarcinoma only, we compared the three studies for genes that predict survival. The latter two comparisons were both performed using the 307 genes set and using a subset that show consistent coregulation as quantified using the integrative correlation analysis (the consistent set).
To assess reproducibility of gene coexpression patterns across studies, we performed an integrative correlation analysis. We started by calculating all possible pairwise correlations of gene expression across samples within individual projects. These are computed within a given platform. Define ρ sp to be the Pearson correlation coefficient for a pair p of genes in study s. By examining whether these correlations agree across studies, we can quantify the reproducibility of results without relying on direct comparison of expression across platforms. Before proceeding to a combined analysis, we thus evaluate reproducibility both overall and by gene. Overall reproducibility is assessed by scatterplots of ρ sp versus ρ s′p for studies s and s′ and also summarized by correlations of correlations coefficients (termed here integrative correlations), given by I(s,s′) = Σp (ρ sp − ρ s)(ρ s′p − ρ s′), where ρ s and ρ s′ are the averages of correlation coefficients over all pairs in studies s and s′, respectively, and the summation ranges over all possible pairs. Bootstrap confidence intervals on I are obtained by resampling tumors. Gene-specific reproducibility across studies s and s′ for gene g is computed similarly, but only considering pairs in which one of the two genes is gene g. With three studies, this generates three integrative correlations for each gene. The average of these is used a reproducibility score to select genes for combined analysis. In the analysis, genes were included in the reproducible set if they had a reproducibility score above the median of the 307 scores. Pearson correlations reflect linear coexpression patterns. To assess the sensitivity of our results to the assumption of linearity, we computed Spearman correlations as well. The results of these two types of correlations were very similar, and only analyses using Pearson correlations are shown.
To perform a cross-study validation of genes that distinguish major histological classes of lung cancer, we considered squamous cell carcinoma and adenocarcinoma, using the Harvard and Stanford data sets only, because the squamous subgroup had not been included in the Michigan study. We computed, separately for each gene (and in each study), a t statistic for the differentiation of adenocarcinomas versus squamous lung cancer. We then correlated these t statistics across the two studies.
To perform a cross-study validation analysis of genes that help prognosticating survival we fit, separately for each study and gene, a Cox proportional hazard model using gene expression as a predictor and time to earliest adverse event as the response. Observations were censored for loss to follow-up. Predictors were divided by their SD before analysis to improve comparability of the resulting coefficients. The package survival (8) from the software package R4 was used for these analyses.
We approximated the false discovery rate for the survival analysis by the ratio between the number of significant Ps expected by chance alone and the number of observed significant Ps (9) . In this calculation, we assumed all discordant discoveries were false discoveries. We determined the reference distribution for the gene-specific integrative correlation analysis by randomly permuting gene labels within each study and then reanalyzing the data.
The dissimilarity of the gene expression array technology, the inclusion of incompletely overlapping sets of genes, and the use of independent sets of tumor samples do not allow direct comparisons of the results of the three lung cancer gene expression profiling studies. Therefore, we initiated an analysis to compare the coordinate patterns of gene expression across different lung cancer specimens in the three projects. Previous investigations have shown that genes with related functions show coordinate patterns of expression across different samples, providing the basis for clustering genes in the analysis of gene expression data (10, 11, 12, 13) . Therefore, we reasoned that the consistency of gene coexpression patterns would reflect the overall consistency of the data sets. To perform this analysis, each of the possible gene-gene correlations was calculated within a data set, and these correlations were then compared across data sets.
The three possible two-way comparisons across studies are demonstrated in the scatterplots shown in Fig. 1,A–C⇓ , where each point represents a correlation between two genes, with the pairwise correlation from one study on the x axis and from another on the y axis. Perfectly agreeing points fall on the x = y line. Overall integrative correlations are 0.54, 0.43, and 0.33, whereas the corresponding, highly asymmetric, and bootstrap confidence intervals are (0.37, 0.55), (0.29, 0.47), and (0.18,0.36). These results imply strong evidence of a positive association between gene coexpressions across studies (e.g., these intervals do not contain 0), but they also suggest that comparison across studies is challenging.
The highest correlation for a two-way comparison was seen for the comparison of Harvard and Stanford data. This could reflect the similar use of lung cancer cases with a variety of histological patterns in both studies. Correlations increase with the heterogeneity of the samples; for example, when only data for adenocarcinomas were considered, the three correlations decreased to 0.36, 0.40, and 0.22, respectively, with the highest correlation corresponding to the comparison of Harvard and Michigan data. Also, correlations would be smaller if we did not filter genes for quality of spots and evidence of variation across samples. For example, when the full set of 3171 genes is used, correlations decrease to 0.23, 0.28, and 0.14. Again, the highest correlation corresponding to the comparison of Harvard and Michigan data.
Consistency of individual gene coexpression patterns is investigated by evaluating integrative correlations only on pairwise correlations involving that gene. Fig. 2, A and B⇓ , illustrates a noisy gene (gene-specific integrative correlation of −0.05) and a highly reproducible gene (gene-specific integrative correlation of 0.80). Overall, gene-specific integrative correlation varied considerably for the 307 genes, ranging from −0.30 to 0.86. The distribution of the gene-specific reproducibility score, i.e., the average of these integrative correlations over the three possible pairwise comparisons across studies, is shown in Fig. 2C⇓ . Although this analysis provides a relatively indirect comparison of the data sets, it shows that expression levels of a large number of genes are measured in a relatively consistent manner across different projects. Furthermore, this analysis allows us to identify specific genes for which consistency of measurement appears inferior.
Next, we compared the Harvard and Stanford studies for variations of expression levels of specific genes as related to the classification of tumors according to histology. The histological diagnoses of squamous cell carcinoma and adenocarcinoma are generally highly sensitive and specific, and therefore, we used these classifications as a reference to compare the predictive value of genes for diagnosing these morphological categories. Analyzing each data set individually, we calculated a t statistic to measure the relationship of each gene to this diagnostic distinction. Comparing these t ratios for individual genes across the two studies shows a correlation of 0.85 when all 307 genes are considered, demonstrating a high reproducibility of these two studies in the gene expression classification of well-defined lung cancer phenotypes (Fig. 3)⇓ . When only the relatively more consistent 25, 50, and 75% of the genes (based on integrative correlations) are considered, this correlation increased to 0.90, 0.921, and 0.924. Thus, restricting the cross-study comparison to genes that have relatively consistent coexpression patterns can considerably improve the consistency of these two independent data sets for lung cancer classification.
The third phase of our analysis was to validate across studies the associations of gene expression data with survival. This analysis included data from all three studies and used Cox regression coefficients, calculated for each gene using data from each data set separately. These coefficients for all two-way comparisons of the three projects are presented in Fig. 4,A–C⇓ . Correlations across studies are 0.13, 0.31, and 0.28. Reproducibility across study is less pronounced than in the case of prediction of histology, consistently with the fact that predicting survival is generally a harder task than comparing histology. In all three comparisons, the number of genes that show a significant and concordant effect on survival far exceeds the number of genes that show a significant and discordant effect. The latter group is approximately what should be expected by chance.
We then performed the same analysis using the genes in the consistent set based on integrative correlations. Although integrative correlations do not make use of outcome data, this selection resulted in substantially improved cross-study concordance of the results (Fig. 4D)⇓ . Most of the genes that showed significant association of survival in two studies, but discordant effect signs were eliminated. The range of cross-study correlations of Cox regression coefficients went from 0.13–0.31 to 0.33–0.49 (0.26–0.38 at 25% reproducibility cutoff, 0.40–0.43 at 75%). Table 1⇓ lists the 14 genes that were identified as significant predictors of survival in all three studies at the 0.3 level and were concordant in sign. These four conditions lead to a combined 0.00675 significance level and an estimated two false discoveries or a 14.3% false discovery rate.
The accruing number of genomics studies for cancer classification, the variety of genomic technologies and protocols used, and the cost of these studies pose two important questions. First, how can we systematically use information available from multiple studies to assess reproducibility of genomic findings? Second, how can we efficiently integrate genomics information across studies?
In microarrays, because of the technical variations with probe quality and printing, it is not obvious whether gene-to-gene comparison across platform is justified because measurements could reflect more strongly the quality of the probes rather than actual gene expression differences. Recently, Rhodes et al. (14) considered supervised genomic analyses and examined the similarity of significance values for each gene across various prostate cancer gene expression data sets using meta-analysis methods combined with multiple testing. This cross-study comparative analysis demonstrated a reasonably consistent pattern of gene dysregulation in prostate cancer compared with normal prostate, measured by the various studies, and indicated the potential for combined analyses in microarray data.
Here, we developed a systematic statistical approach, termed integrative correlation, for assessing reproducibility of gene expression patterns across studies in both supervised and unsupervised settings. Integrative correlation analysis and cross-study comparison of t statistics and standardized proportional hazard coefficients, all of which are unitless quantities, bypass the need for normalizing expression measurements across platforms, a task for which no well-established methodology exists. We find that focusing on genes meeting quality requirements and calculating integrative correlations across data sets provide a basis for selecting a subset of genes that ultimately show more consistent associations with histological classification and outcome. This approach demonstrates an improved correlation across the various studies and also helps to identify a subset of genes that may be highly promising for reproducible profiling of lung cancer.
Gene-specific reproducibility scores show substantial variability within the set of genes studied. Although some show excellent reproducibility and many show reproducibility far above what can be expected by change alone, a fraction of genes show low reproducibility. The latter could be because of deficiencies in the probes in one or more of the platforms or a high ratio of technical to biological variation in one or more of the studies. Although we reported only on the genes meeting investigator’s criteria, we analyzed the complete as well. Conclusions are similar although reproducibility decreases.
With regard to the question of integrating knowledge across studies, our comparative analysis of three data sets provides encouragement that there is a significant level of consistency with regard to genes that distinguish well-defined classes of lung cancer (i.e., squamous cell versus adenocarcinoma) and genes that are associated with lung cancer patient survival. In particular, significant but discordant findings, which have been the source of controversy, occur approximately as should be expected by chance alone. However, there is also significant scatter in the comparative correlations of gene expression levels with relation to classification or outcome.
Our integrative correlation analysis indicates that there remains a substantial component of unexplained variability across studies. This variability arises as the result of (a) biological differences among the samples and populations considered in the three studies, (b) technological differences between the platforms, and (c) random technical variation. To quantify these three sources of variation separately would be helpful, but it requires new studies, including cross-platform gene expression measurements of a common set of tumors and replicate measurements in each study. Additional development of methods for integrating gene expression data sets, including extension of the current within-study normalization methods to cross-study comparisons, will significantly enhance our ability to extract information from existing data.
We thank Dr. Mark Segal and two anonymous reviewers provided useful comments.
Grant support: National Cancer Institute Grants 5P30 CA06973-39 and CA058184 (Lung Cancer Specialized Programs of Research Excellence).
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Requests for reprints: Giovanni Parmigiani, The Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins, 550 North Broadway, Suite 1103, Baltimore, MD 21230. Phone: (410) 614-3426; Fax: (410) 955-0859; E-mail:
↵4 Internet address: http://www.r-project.org.
- Received November 12, 2003.
- Revision received January 6, 2004.
- Accepted January 16, 2004.