
Molecular Oncology, Markers, Clinical Correlates |
National Cancer Institute, Bethesda, Maryland 20892 [L. M. M., R. A., S. E. T.]; Memorial Sloan-Kettering Cancer Center, New York, New York 10021 [C. C-C.]; University of Southern California School of Medicine/Norris Comprehensive Cancer Center, Los Angeles, California 90033 [R. C.]; University of Haifa, 31 905 Haifa, Israel [D. F.]; Laval University, Quebec City, Quebec G1R 2J6, Canada [Y. F.]; University of Texas M.D. Anderson Cancer Center, Houston, Texas 77030 [H. B. G.]; The Emmes Corporation, Potomac, Maryland 20854 [A. P.]; and University of California at San Francisco, San Francisco, California 94143 [F. M. W.]
| ABSTRACT |
|---|
| INTRODUCTION |
|---|
This study was undertaken by the five institutions comprising the National Cancer Institute Bladder Cancer Marker Network to evaluate the reproducibility of p53 expression as measured by IHC. We describe the intra- and interlaboratory reproducibility and differences due to staining protocol and scoring criteria that one might experience for the p53 IHC assay in laboratories using the same primary antibody. Although we show that the results are generally reproducible, some variability did exist. Such results should be considered in decisions regarding combining study results across laboratories and in decisions regarding centralized versus noncentralized assays for large prospective diagnostic marker studies.
| MATERIALS AND METHODS |
|---|
The design, consisting of 50 blocks, with two sections from every block stained and initially scored at each of the five laboratories, allowed for estimation of within-laboratory agreement based on initial scorings to within ± 11% (with 95% confidence), assuming a true agreement rate of at least 80%. The maximum (95% confidence) margin of error would be ± 14%, attained if the true rate was 50%. Smaller margins of error were possible for estimation of between-laboratory agreement based on initial scorings, for comparisons using second scorings in addition to first, and for estimation of overall agreement rates averaged over the five laboratories.
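The quoted margins of error can be checked with a short calculation. This is a minimal sketch, assuming the margins come from the usual normal-approximation confidence interval for a proportion with the 50 blocks treated as the independent units; the source does not state the formula, and the function name `margin_of_error` is illustrative.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Half-width of a normal-approximation 95% confidence interval
    for a proportion p estimated from n independent units."""
    return z * math.sqrt(p * (1.0 - p) / n)

N_BLOCKS = 50  # number of tumor blocks in the design

moe_80 = margin_of_error(0.80, N_BLOCKS)  # true agreement rate of 80%
moe_50 = margin_of_error(0.50, N_BLOCKS)  # worst case, true rate of 50%

print(f"agreement 80%: +/- {moe_80 * 100:.0f}%")
print(f"agreement 50%: +/- {moe_50 * 100:.0f}%")
```

Under this assumption the half-widths round to 11% and 14%, matching the figures given for the design.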
p53 IHC.
Each institution stained the 100 slides using antigen retrieval,
according to their usual protocols. All institutions used the same lot
of anti-p53 antibody (PAb1801 Ab-2; Oncogene Research Products,
Cambridge, MA, kindly provided by CalBiochem). Staining was performed
in a timely manner, not more than 30 days after receipt of the slides.
Each institution used its own positive and negative controls.
The avidin-biotin peroxidase method was used by all laboratories, with modifications. Sections were deparaffinized and treated with 3% H2O2 to block endogenous peroxidase activity. They were then immersed in 0.01 M citric acid (pH 6.0) in a microwave oven for 15 min to enhance antigen retrieval (2). After cooling, slides were incubated with normal horse serum for 10 min to block nonspecific staining, followed by an overnight incubation at 4°C with primary antibody at 2 µg/ml (3). After extensive washing, sections were incubated at room temperature for 30 min with biotinylated horse antimouse antibody (1:200 final dilution; Vector Laboratories, Burlingame, CA), and then for 30 min with avidin-biotin peroxidase complexes (1:25 final dilution; Vector Laboratories). Diaminobenzidine (0.06%) was used as the final chromogen, and hematoxylin was used as the nuclear counterstain.
Each of the laboratories used variations on this general protocol. Laboratory A used 10% H2O2 for blocking, and sections were incubated with normal horse serum for 30 min. Laboratory B used a final concentration of 1 µg/ml primary antibody, a 1:500 dilution for biotinylated horse antimouse IgG, incubation with streptavidin-horseradish peroxidase (1:200; Zymed Labs, South San Francisco, CA) instead of with avidin-biotin peroxidase complexes, and 0.05% rather than 0.06% diaminobenzidine. Laboratory C microwaved for 17 min, blocked in serum for 20 min, and incubated with primary antibody at room temperature. In addition, laboratory C used the Ultra Streptavidin Detection System (Signet Laboratories, Inc., Dedham, MA). Laboratory D incubated the primary antibody at room temperature at a final concentration of 2.5 µg/ml, used a 1:100 final dilution of avidin-biotin peroxidase complexes, and used 0.03% rather than 0.06% diaminobenzidine. Laboratory E differed only in using a 15-min microwave treatment.
Scoring.
The entire tissue section was screened for positive tumor cells,
defined as cells with nuclear staining. The estimated percentage of
tumor cells that stained positively, the intensity of the staining (1+
to 4+), and an overall assessment of positive or negative for the slide
were recorded along with identifiers and comments on background and
cytoplasmic staining. The percentage of cells staining positively for
p53 was recorded on an ordered categorical scale: 0 = zero, 1 = 1–5%, 2 = 6–10%, 3 = 11–20%, 4 = 21–30%,
5 = 31–40%, 6 = 41–50%, 7 = 51–70%, 8 =
71–100%, or "cannot be assessed." The overall (binary) assessment
for a slide involved a judgment of the percentage of stained cells
subjected to a "positivity" threshold, and the result was recorded
as positive, negative, "equivocal," or "cannot be assessed."
The thresholds applied by the laboratories involved in this study were
10% (laboratory A),
10% (laboratory B), >5% (laboratory C),
10% (laboratory D), and
20% (laboratory E). All results were
recorded on a standardized score sheet. A single individual in each
laboratory performed all scoring for this study.
Rescoring.
Each institution received 20 slides from each of the other institutions
for rescoring. These were combined with 20 slides that were originally
stained and scored at that institution, making a total of 100 slides.
Rescoring criteria were identical to the primary scoring.
Completeness of Data.
In total, 500 slides (10 sections from each of 50 tumor blocks) were
prepared for analysis in the study. One slide was lost in shipping,
leaving 499 slides on which scoring was attempted. For purposes of
analysis, the percentage of staining scores coded as cannot be assessed
and binary scores coded as either cannot be assessed or equivocal were
treated as missing values. The result was that all but 36 of the 499
slides had both a first and second percentage of staining score, and
all but 38 slides had usable (non-missing) values in both a first and
second binary scoring. Five slides having neither a first nor a second
binary and percentage of staining score were later found to have all
come from the same tumor block in which the portion of the block from
which the sections were cut contained minimal or no tumor. Results from
the 10 sections cut and stained from that tumor block were excluded
from all subsequent analyses even if some laboratories had reported
scores for them. Because of the small percentage of measurements
(<3%) excluded from the analyses, it was not expected that ignoring
the missing values would introduce significant bias into the results.
Statistical Analyses.
The two measures of p53 status evaluated for their reproducibility in
this study are the percentage of stained cells, recorded on the
categorical scale described in the "Scoring" section above, and the
overall (binary) assessment of p53 positivity. Reproducibility was
assessed both by computing simple percentage of agreement between pairs
of scores and by examining differences in average staining percentage
levels and rates of overall positivity across laboratories.
Percentage of Agreement.
Estimates of the percentage of agreement were calculated as the
percentage of matching scorings in all pairs of scorings made on the
same slide or on two different slides from the same block, depending on
the comparison of interest. Standard errors for these estimates were
calculated using a stratified jackknife procedure for clustered data
where the blocks were the independent (primary) sampling units and the
strata were the contributing institutions. The stratified jackknife
variance estimator used here is similar to the usual jackknife
estimator except that rather than iteratively deleting single
observations, all observations from a given block (primary sampling
unit) are deleted at each iteration, and the "deleted" estimate is
computed by reweighting observations from blocks in the same stratum
from which the given block was deleted. A precise definition is given
by Korn and Graubard (22).
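The delete-a-block procedure described above can be sketched as follows. This is an illustrative implementation of a generic stratified delete-one-cluster jackknife for a ratio estimate, not the authors' code; the exact estimator defined by Korn and Graubard may differ in detail. The function names and the toy data are hypothetical. Each input record is `(block, stratum, score_1, score_2)` for one pair of scorings.

```python
import math
from collections import defaultdict

def block_totals(pairs):
    """Collapse scoring pairs to per-block [matches, pairs] counts."""
    totals = defaultdict(lambda: [0, 0])
    stratum_of = {}
    for block, stratum, s1, s2 in pairs:
        totals[block][0] += int(s1 == s2)
        totals[block][1] += 1
        stratum_of[block] = stratum
    return totals, stratum_of

def weighted_agreement(totals, weight):
    """Weighted proportion of matching pairs; weight maps block -> float."""
    num = sum(weight(b) * m for b, (m, _) in totals.items())
    den = sum(weight(b) * t for b, (_, t) in totals.items())
    return num / den

def jackknife_agreement(pairs):
    """Return (agreement, standard error) with blocks as sampling units."""
    totals, stratum_of = block_totals(pairs)
    by_stratum = defaultdict(list)
    for block, stratum in stratum_of.items():
        by_stratum[stratum].append(block)

    theta = weighted_agreement(totals, lambda b: 1.0)
    variance = 0.0
    for stratum, blocks in by_stratum.items():
        n_h = len(blocks)
        if n_h < 2:
            raise ValueError("each stratum needs at least two blocks")
        for dropped in blocks:
            # Delete one block and reweight the rest of its stratum.
            def weight(b, dropped=dropped, stratum=stratum, n_h=n_h):
                if b == dropped:
                    return 0.0
                return n_h / (n_h - 1) if stratum_of[b] == stratum else 1.0
            theta_deleted = weighted_agreement(totals, weight)
            variance += (n_h - 1) / n_h * (theta_deleted - theta) ** 2
    return theta, math.sqrt(variance)

# Hypothetical toy data: four blocks from one institution, one pair each.
demo_pairs = [
    ("b1", "inst1", 1, 1),
    ("b2", "inst1", 1, 1),
    ("b3", "inst1", 0, 1),
    ("b4", "inst1", 1, 1),
]
theta, se = jackknife_agreement(demo_pairs)
print(theta, se)
```

With one pair per block and a single stratum, this reduces to the ordinary standard error of a sample proportion, which is a convenient sanity check on the reweighting.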
Staining Percentage Levels and Positivity Rates.
Descriptive statistics for staining percentage levels and staining
positivity rates were obtained by calculating means for all
staining-by-scoring laboratory combinations. For the percentage of cell
staining measurement, each of the response categories 0–8 was scored
by its category midpoint. For example, measurements falling into
category 2 (6–10%) were scored as 8%. The mean value of the midpoint
scorings was then computed for each staining-by-scoring laboratory
combination. The mean of the overall binary assessments recorded by a
given staining-by-scoring laboratory combination is equal to the
proportion of positively scored slides for that combination. The
standardized means for scoring laboratories and for staining
laboratories were each balanced over the effects of the other factor,
i.e., staining laboratory means were standardized to a
hypothetical situation in which equal numbers of slides were scored at
the five laboratories, and scoring laboratory means were standardized
to a situation in which equal numbers of slides were stained at the
five laboratories. This removes imbalance resulting from the fact that
in this study all slides stained in a particular laboratory received
their first scoring from that same laboratory. The means and their
SEs were calculated by SUDAAN PROC DESCRIPT (23),
using a stratified jackknife procedure similar to that described above
for percentage of agreement estimates.
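The midpoint scoring and the balancing of means can be illustrated with a small sketch. It assumes that "standardized" means here amount to averaging the staining-by-scoring cell means with equal weight over the other factor; the exact SUDAAN computation may differ. Only the midpoint for category 2 (8%) is stated in the text; the other midpoints are computed from the category ranges. The function names and the toy records are hypothetical.

```python
from collections import defaultdict

# Category ranges in percent of stained tumor cells (categories 0-8).
CATEGORY_BOUNDS = {
    0: (0, 0), 1: (1, 5), 2: (6, 10), 3: (11, 20), 4: (21, 30),
    5: (31, 40), 6: (41, 50), 7: (51, 70), 8: (71, 100),
}

def midpoint(category):
    """Midpoint of a category's percentage range."""
    low, high = CATEGORY_BOUNDS[category]
    return (low + high) / 2

def cell_means(records):
    """Mean midpoint score for each (staining_lab, scoring_lab) cell."""
    sums = defaultdict(lambda: [0.0, 0])
    for staining_lab, scoring_lab, category in records:
        cell = sums[(staining_lab, scoring_lab)]
        cell[0] += midpoint(category)
        cell[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

def standardized_means(cells, factor):
    """Average cell means with equal weight over the other factor."""
    index = {"staining": 0, "scoring": 1}[factor]
    grouped = defaultdict(list)
    for key, mean in cells.items():
        grouped[key[index]].append(mean)
    return {lab: sum(vals) / len(vals) for lab, vals in grouped.items()}

# Hypothetical records: (staining_lab, scoring_lab, category).
records = [
    ("A", "A", 2), ("A", "A", 2), ("A", "B", 3),
    ("B", "A", 0), ("B", "B", 1),
]
cells = cell_means(records)
by_staining = standardized_means(cells, "staining")
by_scoring = standardized_means(cells, "scoring")
print(by_staining, by_scoring)
```

In the toy data the cell for laboratory A staining and scoring its own slides holds more slides than the others, so the raw mean for slides stained at A differs from the equally weighted mean; this is the kind of imbalance the standardization is meant to remove.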
Cumulative logit proportional odds models (24) for correlated ordinal data were used to analyze p53 cell staining percentages, and logistic regression models for correlated binary data were used to analyze overall p53 staining positivity assessments. Both models contained additive terms representing staining and scoring effects. Staining-by-scoring interaction terms were initially considered in each model but were not significant and were dropped from the model. These analyses were implemented in SUDAAN PROC MULTILOG (23) using generalized estimating equations methods (25) with a "working" independence correlation matrix. Blocks were designated as primary sampling units, and contributing institutions formed the strata.
| RESULTS |
|---|
|
|
|---|
The percentage of agreement between scorings on two different
slides from the same block was computed as a measure of interslide
variability. It may reflect both staining variability and biological
variability as well as scoring variability. Fig. 6 shows the
percentage of agreement of
binary scores on sections from the same block stained in different
laboratories but scored by the same laboratory (staining laboratory not
necessarily the same as scoring laboratory). The average overall
estimated percentage of agreement in Fig. 6 is 86% (SE, 2.8%).
If the slides were scored in different laboratories in addition to
being stained at different laboratories, the estimated percentage of
agreement between binary scorings of two slides from the same block
decreased slightly to 83% (SE, 2.6%).
To investigate the potential contribution of biological variation to the interslide variation, interslide percentage of agreement estimates in which the staining laboratory and the scoring laboratory were held fixed were examined for association with distance between slides. If there was substantial biological variation in p53 status on the scale represented by the amount of specimen encompassed by the collection of sections cut from each tumor block, then one might expect that the percentage of agreement would be decreased for more widely separated sections. No such association was found, and therefore, it could not be concluded that there was evidence for biological heterogeneity on this scale.
The percentage of agreement estimates for binary scoring by sources of
variability represented are summarized in Table 1. The estimates in
Table 1 suggest that
for the five laboratories involved in this study, staining differences
contributed more to the total variability than scoring or biological
variability.
| DISCUSSION |
|---|
In this report, we were able to identify both considerable concordance and some discordance among the p53 assays in these five experienced laboratories. Intralaboratory reproducibility was generally quite good. Regarding interlaboratory reproducibility, agreement of binary assessments was good at the extremes of low or high nuclear tumor cell staining percentages. Considering the full range of cell staining percentages, agreement of binary assessments was reduced, largely because of differences exhibited by two of the five laboratories. There were significant differences in staining by one of the laboratories (laboratory A) compared with the other four, and significant differences in overall (binary) scoring by another laboratory (laboratory C) compared with the rest. Slides stained by laboratory A experienced significantly less p53 cell staining. Laboratory C used a lower cell staining percentage threshold for scoring slides as overall positive, and as a result, scored significantly more slides as overall positive compared with the other four laboratories. The remaining three of five laboratories exhibited good agreement with one another in both staining and scoring.
To improve scoring agreement, one must first determine whether disagreements are occurring in the estimation of cells staining positive, at the binary (thresholded) level, or both. For the five laboratories studied here, it appeared that the major scoring differences resulted from use of different thresholds. The observed intraslide agreement on the binary scorings was improved by use of a uniform threshold of >5% or >10%, largely due to bringing laboratory C's threshold "in line" with the other laboratories. If there had been large differences at the cell staining percentage level, then allowing each laboratory to adjust for its staining and scoring patterns by use of its individually selected threshold might be most appropriate. Admittedly, this study was not designed to evaluate all possible thresholds because the cell staining percentage data were collected as grouped (categorical) data. In addition, any "optimal" threshold suggested by the data would require validation on an independent data set to determine not only whether the revised threshold(s) improve reproducibility, but whether the new measurements yield overall more accurate prognostic information.
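The effect of a uniform threshold can be illustrated on grouped data. This is a hypothetical sketch: because percentages were recorded in categories (1 = 1–5%, 2 = 6–10%, 3 = 11–20%, and so on), a threshold of >5% corresponds to category 2 or higher and >10% to category 3 or higher. The category values and function names below are invented for illustration and are not study data.

```python
# Lowest category lying entirely above each uniform threshold.
MIN_POSITIVE_CATEGORY = {">5%": 2, ">10%": 3}

def is_positive(category, threshold):
    """Binary call for a grouped staining category under a threshold."""
    return category >= MIN_POSITIVE_CATEGORY[threshold]

def binary_agreement(cats_1, cats_2, thr_1, thr_2):
    """Fraction of slides on which two scorers make the same binary call."""
    matches = sum(
        is_positive(a, thr_1) == is_positive(b, thr_2)
        for a, b in zip(cats_1, cats_2)
    )
    return matches / len(cats_1)

# Hypothetical slides on which two scorers record identical categories.
categories = [0, 1, 2, 2, 3, 5, 8]

mixed = binary_agreement(categories, categories, ">5%", ">10%")
uniform = binary_agreement(categories, categories, ">10%", ">10%")
print(mixed, uniform)
```

Even with identical category assessments, the two slides in category 2 are called differently under mixed thresholds, while a uniform threshold restores complete agreement on the binary call.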
The most obvious way to address staining differences is to standardize the laboratory staining protocols. The present study was intentionally designed without a standardized staining protocol to assess the reproducibility of results under existing practices, and to address the question of whether it would be valid to retrospectively combine assay results from different laboratories. The answer to that question seems to be that it would not be advisable to combine assay results from different laboratories for use in retrospective studies without consideration of potential differences in assay results between laboratories. Even with standardized protocols, it is possible that some subtle differences would remain; therefore, reproducibility should be re-assessed after standardization. Just as in the case of optimizing scoring criteria, final selection of a standardized staining protocol should be made with consideration of prognostic value.
The diminished reproducibility observed in binary scoring when cell staining percentages were in the low to mid range may be explained by multiple factors. The first, and most obvious, is that the laboratories' thresholds fell in that range. Even if the laboratories were in perfect agreement on the cell staining percentage assessments, their binary assessments may have disagreed because they used different positivity thresholds. The second factor is that we found proportionately less variability in the cell staining percentage assessments within and between laboratories at the two extremes of the cell staining percentage scale. It would make sense that if a specimen was virtually devoid of p53-overexpressing tumor cells or was very densely populated with overexpressing tumor cells, there would be little room for subjective interpretation. These results suggest that for specimens with cell staining percentages falling in this intermediate range, analyses of additional sections, or additional analyses such as mutation analysis of the same section, might be advisable. Duplicate staining and assessment of such borderline staining usually is performed by the participating laboratories. However, the study design did not permit these additional analyses. Whether such a reassessment would decrease the observed variability in these low-to-mid range slides is uncertain.
Some issues that could not be addressed by this study are how well laboratories might agree on p53 assessments of lower stage or grade tumor specimens, and how well results might agree if staining was performed on sections cut from different blocks of the same tumor. In addition, variability due to different lots of antibodies could not be addressed. This study used a common lot of primary antibody and might therefore be expected to show a better level of reproducibility than if multiple lots of antibody or different primary antibodies had been used.
There are several implications of these results for conducting future studies to attempt to resolve the issue of p53 prognostic value in bladder cancer. Great care must be exercised in drawing conclusions across studies using different laboratories because there can be substantial variation in assay results. Some assay methods (or even a standardized assay carried out in some hands) may ultimately prove more valid than others in the sense of more accurately measuring something of biological or prognostic significance. To formally conduct a retrospective study by combining preexisting assay results from different laboratories would, at minimum, require that statistical adjustments for the assaying laboratories be attempted, and it would likely require that assay measurements be available in their most "raw" form rather than preprocessed into dichotomous values. However, such a solution might only be approximate and may require additional data comparing the laboratories' assays to one another or to a common reference. For large-scale prospective studies, centralization of all assays to a single laboratory using the current best candidate assay procedure would be desirable, but if multiple laboratories are to be involved, then verification of interlaboratory reproducibility must occur before embarking on the study. Or, if a best candidate assay method has not yet emerged, a prospective study involving multiple assays for the same marker per specimen provides an extraordinary opportunity to simultaneously address issues of prognostic value and reproducibility. Alternatively, such a study may be conducted on a retrospective specimen collection by performing multiple assays on each specimen.
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
1 This work was supported by the National Cancer
Institute Bladder Tumor Marker Network, Grants CA47538 (to C. C-C.),
CA70903 (to R. C.), CA47526 (to Y. F.), CA56973 (to H. B. G.), and CA47537 (to F. M. W.).
2 To whom requests for reprints should be
addressed, at National Cancer Institute, Biometric Research Branch,
Room 739, Executive Plaza North, MSC 7434, 6130 Executive Boulevard,
Bethesda, MD 20892-7434. Phone: (301) 402-0636; Fax: (301) 402-0560;
E-mail: McShaneL{at}ctep.nci.nih.gov
3 Members of the Bladder Tumor Marker Network are
Roger Aamodt (National Cancer Institute, Bethesda, Maryland), Carlos
Cordon-Cardo (Memorial Hospital, New York, New York), Richard Cote
(University of Southern California School of Medicine/Norris
Comprehensive Cancer Center, Los Angeles, California), Yves Fradet
(Laval University, Quebec City, Quebec, Canada), H. Barton Grossman
(University of Texas M.D. Anderson Cancer Center, Houston, Texas), and
Frederic M. Waldman (University of California, San Francisco,
California).
4 The abbreviation used is: IHC,
immunohistochemistry.
Received 10/4/99; revised 2/4/00; accepted 2/7/00.
| REFERENCES |
|---|