
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
Imaging, Diagnosis, Prognosis |
Authors' Affiliations: Departments of 1 Breast Medical Oncology, 2 Biostatistics and Applied Mathematics, 3 Cancer Genetics, 4 Breast Cancer Translational Research Laboratory, and 5 Pathology, University of Texas M.D. Anderson Cancer Center, Houston, Texas; and 6 Transcriptional Profiling Core Facility of Millennium Pharmaceuticals, Inc., Cambridge, Massachusetts
Requests for reprints: Lajos Pusztai, Department of Breast Medical Oncology, University of Texas M.D. Anderson Cancer Center, Unit 1354, P.O. Box 301439, Houston, TX 77230-1439. Phone: 713-792-2817; Fax: 713-794-4385; E-mail: lpusztai{at}mdanderson.org.
| Abstract |
|---|
|
|
|---|
Patients and Methods: Affymetrix U133A gene chips were used for gene expression profiling of 97 fine-needle aspiration biopsies from breast cancer. Thirty-five cases were profiled in replicates: 17 within the same laboratory, 11 in two different laboratories, and 15 to assess manual and robotic labeling. We used data from 62 cases to develop 111 distinct pharmacogenomic predictors of response to therapy. These were tested on cases profiled in duplicates to determine prediction agreement and accuracy. To evaluate the robustness of the pharmacogenomic predictors, we also introduced random noise into the informative genes in one half of the replicates.
Results: The average concordance correlation coefficient was 0.978 (range, 0.96-0.99) for intralaboratory replicates, 0.962 (range, 0.94-0.98) for between-laboratory replicates, and 0.971 (range, 0.93-0.99) for manual versus robotic labeling. The mean % prediction agreement on replicate data was 97% (95% CI, 0.96-0.98; SD, 0.006), 92% (95% CI, 0.90-0.93; SD, 0.009), and 94% (95% CI, 0.92-0.95; SD, 0.008) for support vector machines, diagonal linear discriminant analysis, and k-nearest neighbor prediction methods, respectively. Mean accuracy in the test set was 77% (95% CI, 0.74-0.79; SD, 0.014), 66% (95% CI, 0.63-0.73; SD, 0.015), and 64% (95% CI, 0.60-0.67; SD, 0.016), respectively.
Conclusion: Gene expression results obtained with Affymetrix U133A chips are highly reproducible within and across two high-volume laboratories. Pharmacogenomic predictions yielded >90% agreement in replicate data.
Noise that could influence prediction results is introduced into gene expression data by experimental variables, such as disease heterogeneity, different tissue acquisition methods, and the technical noise, that is inherent to any analytic method. The effect of tissue sampling method on gene expression profiles obtained from breast cancer has been examined previously (7). For example, fine-needle aspiration (FNA) and core-needle biopsy of the same tumor yield tissues with different cellular composition that in turn results in slightly different transcriptional profiles. It is also well understood that different gene expression profiling techniques measure the expression of the same gene with different sensitivity and on a different scale; therefore, different profiling platforms often yield substantially different results even from the same samples (810). These observations suggest that for optimal performance multigene predictors will need to be developed from and tested on gene expression data obtained with a standardized profiling method from samples that were collected in a uniform manner. However, even if uniform tissue sampling and standard operating procedures are meticulously followed, a certain amount of technical noise in the data is unavoidable. It is not known to what extent the reproducibility of multigene predictors is influenced by this technical noise; however, this can be tested in replicate experiments from the same sample.
In this article, we assess the effect of the technical noise on the reproducibility of the prediction results of learning algorithm-based multigene predictors of response to preoperative chemotherapy. We used gene expression data obtained from FNA of from 97 breast cancers. We previously reported that FNA material contains predominantly neoplastic cells and some infiltrating leukocytes but is by large devoid of stromal elements (7). Gene expression data generated from FNAs captures the molecular characteristics of the invasive cancer, including molecular class (3). We develop pharmacogenomic predictors from 62 of the cases and subsequently tested these predictors on an independent set of 35 cases profiled in duplicates (test set). Each total RNA specimen in this test set was profiled at least twice as part of an effort to estimate within-laboratory (n = 17) and between-laboratory (n = 11) reproducibility of gene expression results. Four samples were included in both the within-laboratory and between-laboratory replicates experiments. An additional 15 of the 35 RNA samples were processed and labeled in duplicate to assess performance of manual labeling and robotic labeling. The first goal of these experiments was to determine concordance of gene expression results in replicate experiments using the Affymetrix U133A chips and standard operating and data normalization procedures. The second goal was to examine prediction agreement on replicate data. The third goal was to assess the robustness of prediction results in the face of increasing simulated noise. To accomplish this last goal, we introduced Gaussian noise into one set of the duplicate gene expression data and plotted how prediction agreement declines between the unperturbed and the noise-added replicate sets as the added noise increases.
| Patients and Methods |
|---|
|
|
|---|
0.6 µg total RNA that was required for gene expression profiling. The main reason for failure to obtain sufficient RNA was acellular aspirations (low cell yield). Median RNA yield of the 97 specimens used in this study was 1.8 µg (0.5-25 µg). Thirty-three of the 97 RNA specimens were included in a previous pharmacogenomic analysis using cDNA arrays (6). These 33 cases were profiled on both platforms (Affymetrix U133A and proprietary cDNA), and the results of the cross-platform comparison of gene expression data were published separately (8). Preoperative chemotherapy consisted of paclitaxel followed by 5-fluorouracil, doxorubicin, and cyclophosphamide. All patients underwent surgery after completion of 24 weeks of preoperative chemotherapy, and patients without any residual invasive cancer in the breast and axillary lymph nodes were considered to have pathologic CR. The study was approved by the Institutional Review Board of MDACC, and all patients signed an informed consent.
RNA extraction and gene expression profiling. Total RNA was extracted from FNA samples using the RNAeasy kit (Qiagen, Inc., Valencia, CA). The amount and quality of RNA was assessed with DU-640 U.V. Spectrophotometer (Beckman Coulter, Fullerton, CA), and it was considered adequate for further analysis if the A260/280 ratio was
1.8, and the total RNA yield was
0.6 µg. cRNA was generated using standard T7 amplification protocol. Second-strand cDNA synthesis was done in the presence of DNA Polymerase I, DNA ligase, and Rnase H, and the resulting double-stranded cDNA was blunt-ended using T4 DNA polymerase and purified by phenol/chloroform extraction. No second-round amplification was done. cRNA was generated in the presence of biotin-ribonucleotides using the BioArray High Yield RNA transcript labeling kit (Enzo Laboratories, Farmingdale, NY). The biotin-labeled cRNA was purified using Qiagen RNeasy columns (Qiagen), quantified, and fragmented at 94°C for 35 minutes in the presence of 1x fragmentation buffer. The robotic cDNA and cRNA purification used Ampure PCR purification System (Agencourt, Beverly, MA). Fragmented cRNA was hybridized to Affymetrix U133A gene chip, overnight at 42°C. Hybridization cocktail was prepared as described in the Affymetrix technical manual. Total RNA was the starting point for all replicate experiments.
Data analysis. dChip V1.3 (http://dchip.org) software was used to generate probe level signal intensities and for normalization of data across arrays, including the replicate experiments. This program normalizes the gene expression data to one standard array that represents a chip with median overall intensity from a predetermined set of experiments. Quality metrics, including median intensity, % probe set outliers, and % single probe outliers for each chip, were used in conjunction with alignment checks to identify poor-quality chips (12). The first 50 chips from MDACC, passing our quality assurance, were normalized to a baseline chip chosen by dChip with median signal. After normalization, model-based expression index signal intensity values for these 50 chips were fit by the perfect match only model in dChip, and the probe sensitivity indices were stored in a .psi file (13). All subsequent U133A CEL files in this study were processed using the same baseline chip and fitted probe sensitivity indices in the stored .psi file. Probe set level expression values used in this study were log 10 of 1 plus the model-based expression index value.
Concordance correlation coefficients and Pearson correlation coefficients were calculated to assess the reproducibility of the 35 replicate pairs within and between the two laboratories and by the two different labeling methods (14). Prediction performance was based on 31 of the 35 replicate cases that had clinical information regarding response to treatment. The intralaboratory replicate was used in all prediction analysis for the four cases that had three replicates. Prediction algorithms included diagonal linear discriminant analysis (DLDA), k-nearest neighbor (KNN), and support vector machines (SVM). The KNN method used three nearest neighbors (k = 3), and the radial kernel method was used for SVM (15). Gene expression values from 62 cases (pathologic CR = 19, residual disease = 43) that received uniform preoperative chemotherapy were used to build predictors using the three different algorithms with increasing numbers of probe sets included in the predictor. To determine the differentially expressed probes sets, probes sets were ordered by smallest to largest Ps from unequal variance t tests comparing expression values from pathologic CR cases versus residual disease cases. DLDA and KNN predictors were built using the following 38 subset sizes: n = 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 125, 150, 175, 200, 300, 400, 500, 600, 700, 800, 900, 1,000, 1,100, 1,200, 1,300, 3,000, 5,000, 10,000, 15,000, and 22,283. The SVM algorithm used the same subset sizes, except 10,000, 15,000, and 22,283. Each of the 31 test case samples was then predicted as pathologic CR or residual disease for each of the paired replicates. These 31 pairs of predictions were compared for agreement, and the percent agreement was calculated. The percent accuracy and area under the receiver operating characteristic curve (AUC) was also calculated when considering the true clinical outcome for the patient. Mean accuracy and mean AUC was assessed for the first 30 distinct predictors (probe set sizes, 2-1,000 differentially expressed probe sets) for each of the prediction algorithms.
The three prediction algorithms using the top 30 probe sets were used to assess the sensitivity to added noise. The levels of additional noise were determined by calculating the SD of log 10 expression values within each probe set for the top 1,000 informative genes and for the least informative 1,000 genes within the training set of 62 cases. The SD levels chosen were 0.01, 0.02, 0.03, 0.05, 0.1, 0.2, 0.3, and 0.5. Random Gaussian noise with mean = 0 and using each SD level was added to perturb one half of each replicate pair. The prediction from the unperturbed replicate was compared with the predicted value from the perturbed replicate for all 31 cases, and the prediction agreement was calculated. Predictions were made using 100 different perturbations for each level of SD of added noise. All statistical analysis was done using S-PLUS V 6.1 (Insightful Corp., Seattle, WA).
| Results |
|---|
|
|
|---|
2-fold variation because of the tens of thousands of variables measured. For example, with an overall concordance of 97.98%, 1.31% of all gene expression measurements show
2-fold variation. This raises the following question: what degree of concordance is necessary to reproduce the same prediction results when multigene predictors are applied to the replicate data?
|
|
Prediction accuracy. To develop the optimal response, predictor was not the primary goal of this analysis. Sample size calculations by fitting learning curves to prediction results and extrapolating further gains in accuracy as sample size increases suggested that larger training samples would be needed to develop predictors that operate close to their plateau (16, 17). A separate report will be prepared when this sample size is achieved and when a sufficiently large independent validation set is accrued (18). However, we did examine the prediction accuracy of the 111 predictors developed from 62 cases and tested on the 31 independent cases that corresponded to 62 gene expression measurements. The training set was composed of 19 cases with pathologic CR and 43 cases with residual disease. In the test set, six cases had pathologic CR and 25 had residual disease. Accuracy was assessed for the first 30 distinct predictors for SVM, DLDA, and KNN. Mean accuracy of the SVM predictors was 77% (95% CI, 0.74-0.79; SD, 0.014); it was 66% (95% CI, 0.63-0.73; SD, 0.015) for the DLDA and 64% (95% CI, 0.60-0.67; SD, 0.016) for the KNN predictors. These results indicate that pharmacogenomic response predictors developed from 62 cases have overall prediction accuracy of around 70%. Receiver operating characteristic curves were also calculated for the first 30 predictors for each of the three different prediction methods. The mean areas under the receiver operating characteristic curves were 0.75, 0.71, and 0.69 for the SVM, DLDA, and KNN predictors, respectively. These values indicate tests with reasonably strong discriminating value. We expect the accuracy of pharmacogenomic predictors to improve as the training set size increases.
Sensitivity analysis of class predictors to simulated noise. We examined further how variation in gene expression data due to experimental noise could affect the prediction accuracy of three distinct predictors, including the 30-gene KNN, DLDA, and SVM models using mathematical simulation. Gaussian noise was added to the gene expression values of the 30 informative genes in one half of the replicate data pairs, and prediction agreement was compared between the unperturbed and perturbed pairs. To provide context, the real, observed noise in the training data (n = 62) was plotted as SD of log 10 expression values for the top 1,000 informative genes and for the least informative 1,000 genes on Fig. 2 . Figure 3 shows how the average prediction agreement decreases as the noise increases in one half of the replicate data pairs. Of the three distinct predictors, DLDA and KNN seemed to be more susceptible to noise than SVM. However, each of the predictors tolerated added noise (above the real experimental noise already present due to technical duplication) quite well up to noise levels of SD = 0.3, and SVM up to SD = 0.5. These results indicate that pharmacogenomic predictors are robust well above the noise levels that are encountered in real experiments in high volume core laboratories.
|
|
| Discussion |
|---|
|
|
|---|
2-fold variation in the replicate pairs of data. Therefore, we next examined the effect of this technical noise on the reproducibility of prediction results yielded by multigene predictors. We developed and trained KNN, DLDA, and SVM class predictors from 62 samples to predict pathologic complete response to preoperative chemotherapy and tested these predictors on replicate data from 31 independent cases. One hundred eleven different predictors were examined to systematically evaluate the statistical performance of a broad range of pharmacogenomic predictors. The mean percent agreement of prediction results was high, 92%, 94%, and 97%, respectively, for DLDA, KNN, and SVM methods. All three prediction algorithms done well over a broad number of informative genes.
How does this data compare with the reproducibility of standard clinical predictive tests that are used in the management of breast cancer today? Estrogen receptor (ER) and HER-2 receptor expression measurements are routinely done on breast cancer specimens, and these results are used to select patients for antiestrogen or trastuzumab therapies, respectively. Concordance between central and local laboratory HER-2 testing was assessed in a large randomized adjuvant therapy study using trastuzumab. The same cancer specimens were tested locally and centrally with fluorescent in situ hybridization or immunohistochemistry to determine HER-2 gene amplification or protein overexpression. The level of concordance for immunohistochemistry was 80%, and it was 85% for fluorescent in situ hybridization in two separate reports (21, 22). Similarly, only moderately good between-laboratory reproducibility was shown for ER assessment by immunohistochemistry. Interlaboratory concordance for high ER-expressing breast cancer was 99%, but for medium expressing breast cancer, it was 85%, and for low ER-expressing tumors, it was only 37% (23). In the context of these reports, the reproducibility of pharmacogenomic prediction results can be considered highly reproducible. It should be noted, however, that the more modest reproducibility of ER and HER-2 assays may represent the current practice, but it is by no means acceptable. It indicates that better standardization of these routine diagnostic tests is necessary particularly in low-volume laboratories.
To further evaluate the robustness of multigene predictors, we corrupted one half of the replicate data set by introducing random noise into the informative genes included in the prediction models. As expected, prediction agreement decreased as the added noise increased. However, each of the three distinct predictors that we tested tolerated substantial amount of added noise. Both the DLDA and KNN 30 gene predictors yielded over 80% prediction agreement up to noise level of SD = 0.3, and the SVM predictor up to SD = 0.5. These results indicate that pharmacogenomic predictors are robust well above the technical noise levels that are encountered in real experiments in high-volume laboratories. Our experiments did not address the effect of several other potential sources of noise in microarray data obtained from human cancer tissues. All specimens used in this study were collected at one institution (MDACC) and mostly by one pathologist who is experienced with the FNA technique (W.F.S.). Samples were processed without delay within a few seconds after taking them. Therefore, the effect of variable tissue sampling and RNA degradation are not represented in these results. However, the mathematical simulations with added noise suggest that even if these sources of noise are factored in pharmacogenomic predictors might still be at least as reproducible as the currently used molecular assays for ER and HER-2 detection.
Technical reproducibility is an essential but far from sufficient prerequisite for clinical use of any novel diagnostic test. A more important question is how accurate the prediction results are and what clinical value they provide. This question currently remains unanswered for pharmacogenomic predictors of response to chemotherapy. Our ongoing study has not yet reached the sample size to develop and validate a predictor that we believe performs close to its maximum accuracy. However, we did examine the overall predictive accuracy of 111 different predictors developed from the 62 training cases on the 31 independent cases that were included in the replicate studies. Mean accuracy was 66%, 64%, and 77% for the various DLDA, KNN, and SVM predictors, respectively. The corresponding AUCs were 0.71, 0.69, and 0.75. To put these numbers into a broader context, a recent report suggested that a ratio of IL17BR and HOXB13 genes could be used to predict outcome after tamoxifen therapy (24). In that article, the AUC values for IL17R and HOXB13 assessed in tissue sections were 0.79 and 0.67, respectively, and the AUC value for the ratio of the two genes was 0.81. The current diagnostic standard, ER gene expression, had an AUC value of 0.55 in that study. Our results fall within the range of AUC values that current and emerging diagnostic tests provide. The high reproducibility of results indicates that the technology is mature enough to yield reliable diagnostic tests. However, the true accuracy and clinical use of such tests is yet to be determined in larger clinical trials.
| Footnotes |
|---|
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Note: Supplementary data for this article are available at Clinical Cancer Research Online (http://clincancerres.aacrjournals.org/).
Received 7/15/05; revised 12/21/05; accepted 1/ 4/06.
| References |
|---|
|
|
|---|
This article has been cited by other articles:
![]() |
B. K.T. Tan, L. K. Tan, K. Yu, P. H. Tan, M. Lee, L. H. Sii, C. Y. Wong, G. H. Ho, A. W.Y. Yeo, P. K.H. Chow, et al. Clinical Validation of a Customized Multiple Signature Microarray for Breast Cancer Clin. Cancer Res., January 15, 2008; 14(2): 461 - 469. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. Pusztai, K. Anderson, and K. R. Hess Pharmacogenomic Predictor Discovery in Phase II Clinical Trials for Breast Cancer Clin. Cancer Res., October 15, 2007; 13(20): 6080 - 6086. [Abstract] [Full Text] [PDF] |
||||
![]() |
F. Peintinger, K. Anderson, C. Mazouni, H. M. Kuerer, C. Hatzis, F. Lin, G. N. Hortobagyi, W. F. Symmans, and L. Pusztai Thirty-Gene Pharmacogenomic Test Correlates with Residual Cancer Burden after Preoperative Chemotherapy for Breast Cancer Clin. Cancer Res., July 15, 2007; 13(14): 4078 - 4082. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. Pusztai Chips to Bedside: Incorporation of Microarray Data into Clinical Practice Clin. Cancer Res., December 15, 2006; 12(24): 7209 - 7214. [Full Text] [PDF] |
||||
![]() |
L. Pusztai, C. Mazouni, K. Anderson, Y. Wu, and W. F. Symmans Molecular Classification of Breast Cancer: Limitations and Potential Oncologist, September 1, 2006; 11(8): 868 - 877. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
| Cancer Research | Clinical Cancer Research |
| Cancer Epidemiology Biomarkers & Prevention | Molecular Cancer Therapeutics |
| Molecular Cancer Research | Cancer Prevention Research |
| Cancer Prevention Journals Portal | Cancer Reviews Online |
| Annual Meeting Education Book | Meeting Abstracts Online |