Abstract
Purpose: The goals of this analysis were to (a) determine concordance of gene expression results from replicate experiments, (b) examine prediction agreement of multigene predictors on replicate data, and (c) assess the robustness of prediction results in the face of noise.
Patients and Methods: Affymetrix U133A gene chips were used for gene expression profiling of 97 fine-needle aspiration biopsies from breast cancer. Thirty-five cases were profiled in replicates: 17 within the same laboratory, 11 in two different laboratories, and 15 to assess manual and robotic labeling. We used data from 62 cases to develop 111 distinct pharmacogenomic predictors of response to therapy. These were tested on cases profiled in duplicates to determine prediction agreement and accuracy. To evaluate the robustness of the pharmacogenomic predictors, we also introduced random noise into the informative genes in one half of the replicates.
Results: The average concordance correlation coefficient was 0.978 (range, 0.96-0.99) for intralaboratory replicates, 0.962 (range, 0.94-0.98) for between-laboratory replicates, and 0.971 (range, 0.93-0.99) for manual versus robotic labeling. The mean % prediction agreement on replicate data was 97% (95% CI, 0.96-0.98; SD, 0.006), 92% (95% CI, 0.90-0.93; SD, 0.009), and 94% (95% CI, 0.92-0.95; SD, 0.008) for support vector machines, diagonal linear discriminant analysis, and k-nearest neighbor prediction methods, respectively. Mean accuracy in the test set was 77% (95% CI, 0.74-0.79; SD, 0.014), 66% (95% CI, 0.63-0.73; SD, 0.015), and 64% (95% CI, 0.60-0.67; SD, 0.016), respectively.
Conclusion: Gene expression results obtained with Affymetrix U133A chips are highly reproducible within and across two high-volume laboratories. Pharmacogenomic predictions yielded >90% agreement in replicate data.
- Pharmacogenomic predictors
- Gene expression profiling
- Breast cancer
Gene expression profiling represents a potential new diagnostic tool (1). The complex data yielded by these experiments are analyzed to discover new molecular classes of cancer, identify individual genes with prognostic or predictive value, and develop multigene predictors of clinical outcome (2–6). The promise of this latter approach is that simultaneous assessment of multiple genes will result in more accurate predictions than measuring any single gene alone. Before wide-scale adoption and clinical testing of a proposed new diagnostic method, it is important to evaluate the technical reproducibility of the results within and between laboratories and in replicate experiments.
Noise that could influence prediction results is introduced into gene expression data by experimental variables, such as disease heterogeneity, different tissue acquisition methods, and the technical noise that is inherent in any analytic method. The effect of tissue sampling method on gene expression profiles obtained from breast cancer has been examined previously (7). For example, fine-needle aspiration (FNA) and core-needle biopsy of the same tumor yield tissues with different cellular composition that in turn results in slightly different transcriptional profiles. It is also well understood that different gene expression profiling techniques measure the expression of the same gene with different sensitivity and on a different scale; therefore, different profiling platforms often yield substantially different results even from the same samples (8–10). These observations suggest that, for optimal performance, multigene predictors will need to be developed from and tested on gene expression data obtained with a standardized profiling method from samples that were collected in a uniform manner. However, even if uniform tissue sampling and standard operating procedures are meticulously followed, a certain amount of technical noise in the data is unavoidable. It is not known to what extent the reproducibility of multigene predictors is influenced by this technical noise; however, this can be tested in replicate experiments from the same sample.
In this article, we assess the effect of technical noise on the reproducibility of the prediction results of learning algorithm-based multigene predictors of response to preoperative chemotherapy. We used gene expression data obtained from FNAs of 97 breast cancers. We previously reported that FNA material contains predominantly neoplastic cells and some infiltrating leukocytes but is by and large devoid of stromal elements (7). Gene expression data generated from FNAs capture the molecular characteristics of the invasive cancer, including molecular class (3). We developed pharmacogenomic predictors from 62 of the cases and subsequently tested these predictors on an independent set of 35 cases profiled in duplicate (test set). Each total RNA specimen in this test set was profiled at least twice as part of an effort to estimate within-laboratory (n = 17) and between-laboratory (n = 11) reproducibility of gene expression results. Four samples were included in both the within-laboratory and between-laboratory replicate experiments. An additional 15 of the 35 RNA samples were processed and labeled in duplicate to compare the performance of manual and robotic labeling. The first goal of these experiments was to determine the concordance of gene expression results in replicate experiments using the Affymetrix U133A chips and standard operating and data normalization procedures. The second goal was to examine prediction agreement on replicate data. The third goal was to assess the robustness of prediction results in the face of increasing simulated noise. To accomplish this last goal, we introduced Gaussian noise into one set of the duplicate gene expression data and plotted how prediction agreement between the unperturbed and the noise-added replicate sets declines as the added noise increases.
Patients and Methods
Patient samples. FNAs of breast cancer were collected in a prospectively designed pharmacogenomic marker discovery study at the Nellie B. Connally Breast Center of the University of Texas M.D. Anderson Cancer Center (MDACC). The goal of the ongoing clinical study is to develop multigene predictors of pathologic complete response (CR) to preoperative therapy (11). The current analysis was undertaken to examine the technical reproducibility and robustness of prediction results. Gene expression results from 97 patients with stage I to III breast cancer are included in this analysis, including 35 cases that were profiled in replicates. FNA was done using a 23- or 25-gauge needle before starting preoperative chemotherapy. Cells from two to three passes were collected into vials containing 1 mL of RNAlater solution (Ambion, Austin, TX) and stored at −80°C. Approximately 80% of all aspirations yielded the ≥0.6 μg of total RNA that was required for gene expression profiling. The main reason for failure to obtain sufficient RNA was acellular aspirations (low cell yield). The median RNA yield of the 97 specimens used in this study was 1.8 μg (range, 0.5-25 μg). Thirty-three of the 97 RNA specimens were included in a previous pharmacogenomic analysis using cDNA arrays (6). These 33 cases were profiled on both platforms (Affymetrix U133A and proprietary cDNA), and the results of the cross-platform comparison of gene expression data were published separately (8).
Preoperative chemotherapy consisted of paclitaxel followed by 5-fluorouracil, doxorubicin, and cyclophosphamide. All patients underwent surgery after completion of 24 weeks of preoperative chemotherapy, and patients without any residual invasive cancer in the breast and axillary lymph nodes were considered to have pathologic CR. The study was approved by the Institutional Review Board of MDACC, and all patients signed an informed consent.
RNA extraction and gene expression profiling. Total RNA was extracted from FNA samples using the RNeasy kit (Qiagen, Inc., Valencia, CA). The amount and quality of RNA were assessed with a DU-640 U.V. spectrophotometer (Beckman Coulter, Fullerton, CA), and RNA was considered adequate for further analysis if the A260/280 ratio was ≥1.8 and the total RNA yield was ≥0.6 μg. cRNA was generated using the standard T7 amplification protocol. Second-strand cDNA synthesis was done in the presence of DNA polymerase I, DNA ligase, and RNase H, and the resulting double-stranded cDNA was blunt-ended using T4 DNA polymerase and purified by phenol/chloroform extraction. No second-round amplification was done. cRNA was generated in the presence of biotin-ribonucleotides using the BioArray High Yield RNA transcript labeling kit (Enzo Laboratories, Farmingdale, NY). The biotin-labeled cRNA was purified using Qiagen RNeasy columns (Qiagen), quantified, and fragmented at 94°C for 35 minutes in the presence of 1× fragmentation buffer. The robotic cDNA and cRNA purifications used the AMPure PCR purification system (Agencourt, Beverly, MA). Fragmented cRNA was hybridized to Affymetrix U133A gene chips overnight at 42°C. The hybridization cocktail was prepared as described in the Affymetrix technical manual. Total RNA was the starting point for all replicate experiments.
Data analysis. dChip V1.3 (http://dchip.org) software was used to generate probe level signal intensities and for normalization of data across arrays, including the replicate experiments. This program normalizes the gene expression data to one standard array that represents a chip with median overall intensity from a predetermined set of experiments. Quality metrics, including median intensity, % probe set outliers, and % single probe outliers for each chip, were used in conjunction with alignment checks to identify poor-quality chips (12). The first 50 chips from MDACC, passing our quality assurance, were normalized to a baseline chip chosen by dChip with median signal. After normalization, model-based expression index signal intensity values for these 50 chips were fit by the perfect match only model in dChip, and the probe sensitivity indices were stored in a .psi file (13). All subsequent U133A CEL files in this study were processed using the same baseline chip and fitted probe sensitivity indices in the stored .psi file. Probe set level expression values used in this study were log 10 of 1 plus the model-based expression index value.
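As a concrete illustration of the expression transform described above, the following sketch (in Python rather than the dChip software actually used) converts hypothetical model-based expression index values to the log scale used in this study; the array and variable names are illustrative, not from the original pipeline.

```python
# A minimal sketch of the probe set level transform described above:
# expression value = log10(1 + model-based expression index).
import numpy as np

def transform_mbei(mbei_matrix: np.ndarray) -> np.ndarray:
    """Convert dChip model-based expression index (MBEI) values to the
    log scale used in this study: log10(1 + MBEI)."""
    return np.log10(1.0 + mbei_matrix)

# Example with a hypothetical samples x probe-sets matrix of MBEI values.
mbei = np.array([[120.0, 0.0, 4500.0],
                 [ 98.0, 3.0, 5100.0]])
expr = transform_mbei(mbei)
```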
Concordance correlation coefficients and Pearson correlation coefficients were calculated to assess the reproducibility of the 35 replicate pairs within and between the two laboratories and by the two different labeling methods (14). Prediction performance was based on 31 of the 35 replicate cases that had clinical information regarding response to treatment. The intralaboratory replicate was used in all prediction analyses for the four cases that had three replicates. Prediction algorithms included diagonal linear discriminant analysis (DLDA), k-nearest neighbor (KNN), and support vector machines (SVM). The KNN method used three nearest neighbors (k = 3), and the radial kernel method was used for SVM (15). Gene expression values from 62 cases (pathologic CR = 19, residual disease = 43) that received uniform preoperative chemotherapy were used to build predictors using the three different algorithms with increasing numbers of probe sets included in the predictor. To determine the differentially expressed probe sets, probe sets were ordered from smallest to largest P value from unequal variance t tests comparing expression values from pathologic CR cases versus residual disease cases. DLDA and KNN predictors were built using the following 38 subset sizes: n = 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 125, 150, 175, 200, 300, 400, 500, 600, 700, 800, 900, 1,000, 1,100, 1,200, 1,300, 3,000, 5,000, 10,000, 15,000, and 22,283. The SVM algorithm used the same subset sizes, except 10,000, 15,000, and 22,283. Each of the 31 test case samples was then predicted as pathologic CR or residual disease for each of the paired replicates. These 31 pairs of predictions were compared for agreement, and the percent agreement was calculated. The percent accuracy and the area under the receiver operating characteristic curve (AUC) were also calculated when considering the true clinical outcome for the patient. Mean accuracy and mean AUC were assessed for the first 30 distinct predictors (probe set sizes, 2-1,000 differentially expressed probe sets) for each of the prediction algorithms.
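A hedged sketch of this prediction-agreement analysis is shown below. It is not the authors' S-PLUS code: it ranks probe sets by unequal-variance t tests, fits a KNN (k = 3) and a radial-kernel SVM classifier on the top-k probe sets (DLDA is omitted because scikit-learn has no direct diagonal LDA equivalent), and computes the percent agreement between paired replicate predictions. All array names are hypothetical.

```python
# Sketch: rank probe sets by t-test P value, then measure how often the
# two replicate profiles of each test case receive the same prediction.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def rank_probe_sets(X_train, y_train):
    """Order probe sets by ascending P value from unequal-variance t tests
    comparing pathologic CR (y == 1) versus residual disease (y == 0)."""
    _, p = ttest_ind(X_train[y_train == 1], X_train[y_train == 0],
                     axis=0, equal_var=False)
    return np.argsort(p)

def replicate_agreement(model, X_train, y_train, X_rep1, X_rep2, ranked, k):
    """Fit `model` on the top-k probe sets and report how often the two
    replicate profiles of each test case get the same predicted class."""
    idx = ranked[:k]
    model.fit(X_train[:, idx], y_train)
    return float(np.mean(model.predict(X_rep1[:, idx]) ==
                         model.predict(X_rep2[:, idx])))

# Usage with hypothetical arrays: X_train (62 x 22,283), y_train (62,),
# and X_rep1 / X_rep2 (31 x 22,283) holding the paired replicate profiles.
# ranked = rank_probe_sets(X_train, y_train)
# for k in (2, 3, 4, 5, 10, 30, 100):
#     knn = KNeighborsClassifier(n_neighbors=3)
#     svm = SVC(kernel="rbf")
#     print(k, replicate_agreement(knn, X_train, y_train, X_rep1, X_rep2, ranked, k),
#              replicate_agreement(svm, X_train, y_train, X_rep1, X_rep2, ranked, k))
```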
The three prediction algorithms, each using the top 30 probe sets, were used to assess sensitivity to added noise. The levels of additional noise were determined by calculating the SD of log 10 expression values within each probe set for the top 1,000 informative genes and for the least informative 1,000 genes within the training set of 62 cases. The SD levels chosen were 0.01, 0.02, 0.03, 0.05, 0.1, 0.2, 0.3, and 0.5. Random Gaussian noise with mean = 0 and each of these SD levels was added to perturb one half of each replicate pair. The prediction from the unperturbed replicate was compared with the predicted value from the perturbed replicate for all 31 cases, and the prediction agreement was calculated. Predictions were made using 100 different perturbations for each level of SD of added noise. All statistical analyses were done using S-PLUS V 6.1 (Insightful Corp., Seattle, WA).
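Under the same illustrative variable names as in the previous sketch, a minimal version of this noise-sensitivity simulation might look as follows: Gaussian noise with mean 0 and a given SD is added to the 30 informative probe sets in one half of each replicate pair, and agreement with the unperturbed predictions is averaged over 100 perturbations. This is an assumption-laden sketch, not the original S-PLUS implementation.

```python
# Sketch: mean agreement between unperturbed and noise-perturbed replicates.
import numpy as np

def noise_agreement(model, X_rep1, X_rep2, idx30, sd, n_iter=100, seed=0):
    """Return mean prediction agreement between the unperturbed replicate
    (X_rep1) and the noise-perturbed replicate (X_rep2) over n_iter draws
    of Gaussian noise (mean 0, given SD) added to the 30 informative genes."""
    rng = np.random.default_rng(seed)
    baseline = model.predict(X_rep1[:, idx30])
    agreements = []
    for _ in range(n_iter):
        noise = rng.normal(0.0, sd, size=(X_rep2.shape[0], len(idx30)))
        agreements.append(np.mean(model.predict(X_rep2[:, idx30] + noise) == baseline))
    return float(np.mean(agreements))

# Noise levels examined (SD of added Gaussian noise on the log10 scale):
# for sd in (0.01, 0.02, 0.03, 0.05, 0.1, 0.2, 0.3, 0.5):
#     print(sd, noise_agreement(fitted_model, X_rep1, X_rep2, idx30, sd))
```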
Results
Intralaboratory and interlaboratory reproducibility of gene expression profiles. Seventeen total RNA specimens were split, labeled, and hybridized in duplicate within the same laboratory several months apart, and 11 RNA samples were labeled and hybridized in two different laboratories (MDACC and Millennium Core Genomic facilities) several months apart following the same operating procedure. Fifteen other RNA samples were split and labeled in duplicate simultaneously, with manual labeling done by technicians and robotic labeling done by a custom-built robot at the transcriptional profiling core facility of Millennium Pharmaceuticals, Inc. (Cambridge, MA). Both the original and the replicate data sets were normalized uniformly and centrally at the biostatistics department of MDACC. Concordance and Pearson correlation coefficients of the paired replicates are presented in Table 1. The reproducibility of results within and across the two laboratories and by the two different labeling methods was high. The average concordance correlation coefficient was 0.978 (range, 0.96-0.99) for replicates within the same laboratory; it was 0.962 (range, 0.94-0.98) for replicates between laboratories and 0.971 (range, 0.93-0.99) for manual versus robotic labeling. It is important to realize that, even if the overall concordance for replicate experiments is high, a substantial percentage of individual gene expression measurements show ≥2-fold variation because of the tens of thousands of variables measured. For example, with an overall concordance of 97.98%, 1.31% of all gene expression measurements show ≥2-fold variation. This raises the following question: what degree of concordance is necessary to reproduce the same prediction results when multigene predictors are applied to the replicate data?
Correlation coefficients of replicate experiments
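For reference, the two replicate-agreement statistics reported in Table 1, Pearson's correlation and Lin's concordance correlation coefficient (CCC), can be computed per replicate pair of log-scale expression profiles as in the following sketch (illustrative Python, not the original S-PLUS code).

```python
# Sketch: Pearson r and Lin's CCC for one pair of replicate profiles.
import numpy as np

def pearson_and_ccc(x, y):
    """Return (Pearson r, Lin's CCC) for two replicate expression vectors.
    CCC = 2*s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    r = np.corrcoef(x, y)[0, 1]
    ccc = (2 * np.cov(x, y, ddof=1)[0, 1] /
           (x.var(ddof=1) + y.var(ddof=1) + (x.mean() - y.mean()) ** 2))
    return r, ccc
```

Unlike the Pearson coefficient, the CCC penalizes systematic shifts between replicates, which is why it is the more conservative of the two measures of reproducibility.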
Prediction agreements in technical replicates. Perfect replication of gene expression data from experiment to experiment is impossible. From a practical point of view, experimental noise could be considered acceptable if it does not alter the conclusion from an experiment. In the case of multigene predictors, if the same outcome prediction can be made on data from the original and replicate samples of the same RNA, then the experimental noise is practically insignificant. To examine this question, we developed and trained KNN, DLDA, and SVM class predictors from 62 samples to predict complete response to preoperative chemotherapy and tested these predictors, in combination with a broad range of genes, on replicate data from 31 independent cases. The goal was to examine the prediction agreement on replicate sets of data generated from the same samples. Figure 1 illustrates the prediction agreement for the three different prediction methods with increasing numbers of genes included in the predictor (corresponding to 111 different predictors). The % agreement of prediction results was very high. The mean % agreement for the DLDA-based predictors was 92% [95% confidence interval (95% CI), 0.90-0.93; SD, 0.009]; for the KNN predictors, it was 94% (95% CI, 0.92-0.95; SD, 0.008), and for the SVM predictors, it was 97% (95% CI, 0.96-0.98; SD, 0.006). Each prediction method performed well over a broad range of numbers of informative genes. These results indicate that the technical reproducibility of gene expression data is sufficiently high to yield essentially identical prediction results from duplicate samples.
Prediction agreement between replicate experiments as a function of the number of probe sets included in the classifier. KNN, DLDA, and SVM class predictors were trained on 62 cases and tested on replicate data from 31 independent cases. The X-axis shows the number of genes included in the prediction models. The Y-axis shows how often the same outcome was predicted on matched pairs of replicate results.
We also assessed the reproducibility of dendrograms in hierarchical clustering using 689 “intrinsic genes” as described previously (3). Reproducibility of dendrogram branches was moderate, suggesting that this method of classification may be more susceptible to technical noise than supervised class prediction algorithms (Supplementary Fig. S1).
Prediction accuracy. Developing the optimal response predictor was not the primary goal of this analysis. Sample size calculations, done by fitting learning curves to prediction results and extrapolating further gains in accuracy as sample size increases, suggested that larger training sets would be needed to develop predictors that operate close to their plateau (16, 17). A separate report will be prepared when this sample size is achieved and when a sufficiently large independent validation set is accrued (18). However, we did examine the prediction accuracy of the 111 predictors developed from 62 cases and tested on the 31 independent cases, which corresponded to 62 gene expression measurements because each case was profiled in duplicate. The training set was composed of 19 cases with pathologic CR and 43 cases with residual disease. In the test set, six cases had pathologic CR and 25 had residual disease. Accuracy was assessed for the first 30 distinct predictors for SVM, DLDA, and KNN. Mean accuracy of the SVM predictors was 77% (95% CI, 0.74-0.79; SD, 0.014); it was 66% (95% CI, 0.63-0.73; SD, 0.015) for the DLDA and 64% (95% CI, 0.60-0.67; SD, 0.016) for the KNN predictors. These results indicate that pharmacogenomic response predictors developed from 62 cases have an overall prediction accuracy of around 70%. Receiver operating characteristic curves were also calculated for the first 30 predictors for each of the three different prediction methods. The mean areas under the receiver operating characteristic curves were 0.75, 0.71, and 0.69 for the SVM, DLDA, and KNN predictors, respectively. These values indicate tests with reasonably strong discriminating value. We expect the accuracy of pharmacogenomic predictors to improve as the training set size increases.
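As a brief illustration of the accuracy and AUC summaries above, the sketch below assumes hypothetical arrays y_true (1 = pathologic CR, 0 = residual disease), y_pred (predicted class calls), and y_score (a continuous classifier score, for example an SVM decision value); it is not the original analysis code.

```python
# Sketch: percent accuracy and area under the ROC curve for one predictor.
from sklearn.metrics import accuracy_score, roc_auc_score

def accuracy_and_auc(y_true, y_pred, y_score):
    """Return (accuracy, AUC) given true outcomes, class calls, and scores."""
    return accuracy_score(y_true, y_pred), roc_auc_score(y_true, y_score)
```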
Sensitivity analysis of class predictors to simulated noise. We used mathematical simulation to examine further how variation in gene expression data due to experimental noise could affect the predictions of three distinct predictors: the 30-gene KNN, DLDA, and SVM models. Gaussian noise was added to the gene expression values of the 30 informative genes in one half of the replicate data pairs, and prediction agreement was compared between the unperturbed and perturbed pairs. To provide context, the real, observed noise in the training data (n = 62) was plotted as the SD of log 10 expression values for the top 1,000 informative genes and for the least informative 1,000 genes in Fig. 2. Figure 3 shows how the average prediction agreement decreases as the noise increases in one half of the replicate data pairs. Of the three distinct predictors, DLDA and KNN seemed to be more susceptible to noise than SVM. However, each of the predictors tolerated added noise (above the real experimental noise already present due to technical duplication) quite well, DLDA and KNN up to noise levels of SD = 0.3 and SVM up to SD = 0.5. These results indicate that pharmacogenomic predictors are robust well above the noise levels that are encountered in real experiments in high-volume core laboratories.
Observed variation in gene expression values in the training data. Frequency histograms of the SD of the top 1,000 differentially expressed genes and also the least differentially expressed 1,000 genes are presented. The majority of probes in both groups had SD ≤ 0.2.
Replicate prediction agreement as a function of added Gaussian noise. The X-axis indicates the SD of the random noise that was added to the 30 informative genes in one half of the replicate data pairs. The Y-axis shows the percent prediction agreement between the perturbed and unperturbed replicates at increasing noise levels, calculated from 100 iterations. Box plots show the performance of three distinct 30-gene predictors using the DLDA, KNN, and SVM algorithms; boxes show the interquartile range with a 95% CI for the median, whiskers are drawn at 1.5 × the interquartile range, and remaining outliers are plotted individually.
Discussion
Intralaboratory and interlaboratory reproducibility of gene expression profiles obtained from the same samples with Affymetrix U133A chips and using a standard operating procedure was high in our study. The overall concordance of the data was 96% to 99% in within-laboratory replicates and 94% to 98% in between-laboratory replicates. These results probably represent the best reproducibility achievable under optimal conditions. The experiments were done in high-volume laboratories following standard and formalized operating procedures, and all data were centrally and uniformly normalized. Our observations are consistent with other recent reports that also showed that a shared platform, standard operating procedures, and central data processing dramatically improve the reproducibility of gene expression results across laboratories (19, 20). However, even with this high level of global concordance, a substantial percentage (0.15-3%) of individual gene expression measurements show ≥2-fold variation in the replicate pairs of data. Therefore, we next examined the effect of this technical noise on the reproducibility of prediction results yielded by multigene predictors.
We developed and trained KNN, DLDA, and SVM class predictors from 62 samples to predict pathologic complete response to preoperative chemotherapy and tested these predictors on replicate data from 31 independent cases. One hundred eleven different predictors were examined to systematically evaluate the statistical performance of a broad range of pharmacogenomic predictors. The mean percent agreement of prediction results was high: 92%, 94%, and 97%, respectively, for the DLDA, KNN, and SVM methods. All three prediction algorithms performed well over a broad range of numbers of informative genes.
How do these data compare with the reproducibility of standard clinical predictive tests that are used in the management of breast cancer today? Estrogen receptor (ER) and HER-2 receptor expression measurements are routinely done on breast cancer specimens, and these results are used to select patients for antiestrogen or trastuzumab therapies, respectively. Concordance between central and local laboratory HER-2 testing was assessed in a large randomized adjuvant therapy study using trastuzumab. The same cancer specimens were tested locally and centrally with fluorescent in situ hybridization or immunohistochemistry to determine HER-2 gene amplification or protein overexpression. The level of concordance was 80% for immunohistochemistry and 85% for fluorescent in situ hybridization in two separate reports (21, 22). Similarly, only moderately good between-laboratory reproducibility was shown for ER assessment by immunohistochemistry. Interlaboratory concordance for high ER-expressing breast cancer was 99%, but for medium-expressing breast cancer it was 85%, and for low ER-expressing tumors it was only 37% (23). In the context of these reports, pharmacogenomic prediction results can be considered highly reproducible. It should be noted, however, that the more modest reproducibility of ER and HER-2 assays may reflect current practice, but it is by no means an acceptable standard. It indicates that better standardization of these routine diagnostic tests is necessary, particularly in low-volume laboratories.
To further evaluate the robustness of multigene predictors, we corrupted one half of the replicate data set by introducing random noise into the informative genes included in the prediction models. As expected, prediction agreement decreased as the added noise increased. However, each of the three distinct predictors that we tested tolerated a substantial amount of added noise. Both the DLDA and KNN 30-gene predictors yielded over 80% prediction agreement up to a noise level of SD = 0.3, and the SVM predictor up to SD = 0.5. These results indicate that pharmacogenomic predictors are robust well above the technical noise levels that are encountered in real experiments in high-volume laboratories. Our experiments did not address the effect of several other potential sources of noise in microarray data obtained from human cancer tissues. All specimens used in this study were collected at one institution (MDACC) and mostly by one pathologist who is experienced with the FNA technique (W.F.S.). Samples were processed without delay, within a few seconds of collection. Therefore, the effects of variable tissue sampling and RNA degradation are not represented in these results. However, the mathematical simulations with added noise suggest that, even if these sources of noise are factored in, pharmacogenomic predictors might still be at least as reproducible as the currently used molecular assays for ER and HER-2 detection.
Technical reproducibility is an essential but far from sufficient prerequisite for clinical use of any novel diagnostic test. A more important question is how accurate the prediction results are and what clinical value they provide. This question currently remains unanswered for pharmacogenomic predictors of response to chemotherapy. Our ongoing study has not yet reached the sample size needed to develop and validate a predictor that we believe performs close to its maximum accuracy. However, we did examine the overall predictive accuracy of 111 different predictors developed from the 62 training cases on the 31 independent cases that were included in the replicate studies. Mean accuracy was 66%, 64%, and 77% for the various DLDA, KNN, and SVM predictors, respectively. The corresponding AUCs were 0.71, 0.69, and 0.75. To put these numbers into a broader context, a recent report suggested that a ratio of the IL17BR and HOXB13 genes could be used to predict outcome after tamoxifen therapy (24). In that article, the AUC values for IL17BR and HOXB13 assessed in tissue sections were 0.79 and 0.67, respectively, and the AUC value for the ratio of the two genes was 0.81. The current diagnostic standard, ER gene expression, had an AUC value of 0.55 in that study. Our results fall within the range of AUC values that current and emerging diagnostic tests provide. The high reproducibility of results indicates that the technology is mature enough to yield reliable diagnostic tests. However, the true accuracy and clinical utility of such tests are yet to be determined in larger clinical trials.
Footnotes
- Grant support: Breast Cancer Research Foundation (L. Pusztai), Gilder Foundation (L. Pusztai), National Cancer Institute grant R01 CA106290-1 (L. Pusztai), Susan G. Komen Breast Cancer Foundation grant LF2002-044HM (W.F. Symmans), and Nellie B. Connally Breast Cancer Research Fund (G.N. Hortobagyi).
- The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
- Note: Supplementary data for this article are available at Clinical Cancer Research Online (http://clincancerres.aacrjournals.org/).
- Received July 15, 2005.
- Revision received December 21, 2005.
- Accepted January 4, 2006.