Abstract
Purpose: Several scoring systems have been developed to distinguish between benign and malignant adnexal tumors. However, few of them have been externally validated in new populations. Our aim was to compare their performance on a prospectively collected large multicenter data set.
Experimental Design: In phase I of the International Ovarian Tumor Analysis multicenter study, patients with a persistent adnexal mass were examined with transvaginal ultrasound and color Doppler imaging. More than 50 end point variables were prospectively recorded for analysis. The outcome measure was the histologic classification of excised tissue as malignant or benign. We used the International Ovarian Tumor Analysis data to test the accuracy of previously published scoring systems. Receiver operating characteristic curves were constructed to compare the performance of the models.
Results: Data from 1,066 patients were included; 800 patients (75%) had benign tumors and 266 patients (25%) had malignant tumors. The morphologic scoring system used by Lerner gave an area under the receiver operating characteristic curve (AUC) of 0.68, whereas the multimodal risk of malignancy index used by Jacobs gave an AUC of 0.88. The corresponding values for logistic regression and artificial neural network models varied between 0.76 and 0.91 and between 0.87 and 0.90, respectively. Advanced kernel-based classifiers gave an AUC of up to 0.92.
Conclusion: The performance of the risk of malignancy index was similar to that of most logistic regression and artificial neural network models. The best result was obtained with a relevance vector machine with radial basis function kernel. Because the models were tested on a large multicenter data set, results are likely to be generally applicable.
Keywords: ovarian mass; mathematical model; scoring system; ultrasound; prediction; gynecological cancers: ovarian; risk assessment; diagnostic imaging
There is a need for adequate preoperative assessment of an adnexal mass, because the presumed diagnosis determines the management. Functional cysts and some benign cysts may be treated conservatively. In benign cysts, laparoscopic treatment may decrease hospitalization length and surgical morbidity (1–3). However, if an adnexal mass is likely to be malignant, the patient should be referred to a gynecologic oncologist for surgical staging and optimal debulking. Spilling of an early stage ovarian cancer during laparoscopy or suboptimal debulking with residual disease may worsen the prognosis (4–7).
Experienced sonologists can, in most cases, correctly determine the character of an adnexal mass on the basis of their own subjective impression, achieving a sensitivity with regard to malignancy of >95% and a specificity of ∼90% (8, 9). Less experienced sonologists may be helped by scoring systems or mathematical models that can be used to discriminate between benign and malignant adnexal masses.
A whole series of prediction models (scoring systems, logistic regression models, and artificial neural networks) has previously been developed (10). Most of them have not been tested prospectively on a large population; therefore, their generalizability is unknown.
The aim of this study was to prospectively validate the performance of published scores and mathematical models to predict malignancy in an adnexal mass by applying the models to the data collected in a large multicenter study by the International Ovarian Tumor Analysis (IOTA) group (11).
Materials and Methods
Data set
The data set was collected during phase I of the IOTA study, the first study from the IOTA group: a prospective multicenter study done in nine centers in five countries (11). The aim of the IOTA study was to collect sonographic and demographic data from >1,000 patients with an adnexal mass in order to develop mathematical models to predict malignancy in an ovarian mass. All details have been described previously (11). The primary outcome was the histologic classification of the excised tissue as malignant or benign. Data from 1,066 patients were available for analysis. Serum CA125 values were known for 809 (75.9%) of the 1,066 patients; models that included CA125 were tested only on these 809 patients. Overall, 800 (75%) tumors were benign and 266 (25%) were malignant.
Selection of scores and mathematical models to predict malignancy
The candidate models were identified partly by literature search. We selected models for which the statistical formula had been published, provided that all variables used in the formula were also recorded in phase I of the IOTA study. In this way, 17 models were selected: three scoring systems [two risk of malignancy indices (RMI and RMI2) and Lerner's scoring system; refs. 12–14], seven logistic regression models (15–21), three artificial neural network models (19, 20), and four least squares support vector machine (LSSVM) and relevance vector machine (RVM) models (20, 22). LSSVMs and RVMs are advanced mathematical models that are highly capable of handling nonlinear data. They are believed to perform well when used prospectively and are briefly described below.
The variables used in the selected scoring systems and models are shown in Table 1 . The following models/scores did not fulfill the inclusion criteria: the scoring systems of DePriest et al. (23), Ferrazzi et al. (24), Alcazar et al. (25), and Sassone et al. (26); the logistic regression model of Alcazar et al. (27); and the neural networks from Biagiotti et al. (28), Clayton et al. (29), Bishop (30), and Tailor et al. (31). The reasons for exclusion are listed in the Supplementary Table that is published online.
LSSVMs. SVMs are learning machines that, using a kernel function, nonlinearly transform the input space into a high-dimensional feature space, in which a linear classifier is constructed. Depending on the type of transformation (linear or nonlinear kernel), this linear classifier in the feature space coincides with either a linear or a nonlinear classifier in the original input space. The SVM looks for a linear classifier in the feature space by maximizing the margin between the data of both classes (leading to simpler models). SVMs thus seek a tradeoff between minimizing complexity and minimizing the number of misclassifications. The resulting model is sparse in that only a few cases from the training set are used to construct the decision boundary. These cases are called the “support vectors” and are usually located on or close to the decision boundary. The LSSVM is a variant of the standard SVM in which training of the model is greatly simplified (20, 32). Lu et al. developed a linear (using a linear kernel) and a nonlinear (using a radial basis function or RBF kernel to transform the input data) LSSVM model in a Bayesian framework (20, 22).
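To illustrate the kernel idea (this is not the authors' Bayesian LSSVM, for which no public implementation is assumed here, but a standard soft-margin SVM used as a stand-in), a minimal sketch on simulated nonlinearly separable data:

```python
# Sketch of a kernel SVM classifier; scikit-learn's SVC is a standard
# soft-margin SVM, shown here only to illustrate the kernel principle
# described in the text, not the Bayesian LSSVM of Lu et al.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy data: two noisy concentric rings -- not linearly separable in the
# input space, but separable after the RBF kernel's implicit feature map.
n = 200
radius = np.r_[np.ones(n // 2), 3 * np.ones(n // 2)]
angle = rng.uniform(0, 2 * np.pi, n)
X = np.c_[radius * np.cos(angle), radius * np.sin(angle)] + rng.normal(0, 0.2, (n, 2))
y = np.r_[np.zeros(n // 2), np.ones(n // 2)]

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print("training accuracy:", clf.score(X, y))
# Only the support vectors (cases on or near the boundary) define the model.
print("support vectors:", clf.n_support_.sum(), "of", n, "cases")
```

With a linear kernel the same classifier would fail on these data, which is the distinction between the linear and RBF variants mentioned above.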
RVMs. RVMs were inspired by the SVM model formulation, but because they do not work with a feature space, there are fewer restrictions on valid transformations of the input space (32); as a result, they remain fundamentally different from SVMs. RVM models take a Bayesian perspective, and the resulting model is again sparse (33). However, the cases used for constructing the decision boundary are prototypical examples of each class rather than cases close to the decision boundary. Lu et al. developed a linear and a nonlinear (using an RBF kernel) RVM model (20, 22). The principles of logistic regression analysis, artificial neural networks, SVMs, and RVMs are described in refs. (20, 30, 32–37).
Modification of variables
Some variables used in some scores or models lacked an exact corresponding variable in the IOTA database. The IOTA variables that we used in these cases are shown in the Supplementary Tables published online.
Statistical analysis
Statistical analyses were carried out using Statistical Analysis Software version 9.1 (SAS Institute, Inc.). The performance of the models and scores was compared with respect to the area under the receiver operating characteristic (ROC) curve (AUC). After applying the cutoff value for predicting malignancy suggested in the original publication, the sensitivity, specificity, positive predictive value, negative predictive value, positive likelihood ratio (LR+), and negative likelihood ratio (LR−) of each score/model could be calculated. We used the cutoffs reported in the original articles, although the optimal cutoff is highly dependent on population characteristics, e.g., the proportions of malignant and exceptional tumors. Because the sensitivity and related measures depend on the cutoff value chosen, whereas the AUC does not, we considered the AUC the most important measure of diagnostic performance (38, 39). The nonparametric procedure described in ref. (38) was used to test whether the AUCs of two models differed.

To compare all 17 scores/models with each other, we used a ranking method based on the work of Pepe et al. (40). All models were ranked with respect to a chosen criterion. This ranking, however, is influenced by sampling variability, which is accounted for by computing the probability that a method is ranked among the κ best methods, Pm(κ) (ref. 40). One thousand bootstrap samples were drawn from the IOTA phase I data, restricted to patients with known CA125 to allow comparison of all models. For each bootstrap sample, the criterion was computed for each of the 17 models, and the models were ranked within each bootstrap sample from best to worst. This allowed us to compute how many times each model was ranked among the best 5 and the best 10, in order to obtain Pm(5) and Pm(10). As already mentioned, the main criterion used was the AUC. However, based on the work of Pepe et al., we also looked at the partial AUC (pAUC; ref. 40).
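The bootstrap ranking just described can be sketched as follows; because the IOTA data and the 17 fitted models are not reproduced here, the outcomes and model outputs are simulated:

```python
# Sketch of the Pepe-style bootstrap ranking: for each bootstrap sample,
# compute the AUC of every model, rank the models from best to worst, and
# count how often each lands among the kappa best.  All data are simulated.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n_patients, n_models, n_boot, kappa = 500, 5, 1000, 2

y = rng.integers(0, 2, n_patients)                 # simulated outcome (0/1)
# Simulated model outputs: true signal plus model-specific noise, so that
# lower-index models are genuinely better discriminators.
noise = np.linspace(0.5, 2.0, n_models)
scores = y[:, None] + rng.normal(0, noise, (n_patients, n_models))

top_counts = np.zeros(n_models)
for _ in range(n_boot):
    idx = rng.integers(0, n_patients, n_patients)  # resample with replacement
    if y[idx].min() == y[idx].max():               # AUC needs both classes
        continue
    aucs = [roc_auc_score(y[idx], scores[idx, m]) for m in range(n_models)]
    order = np.argsort(aucs)[::-1]                 # best model first
    top_counts[order[:kappa]] += 1

p_m = top_counts / n_boot                          # P_m(kappa) per model
print(np.round(p_m, 2))
```

The truly best model ends up with a Pm(κ) near 1 and the worst near 0, which is how Table 4's Pm(5) and Pm(10) columns separate the top performers from the rest.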
The pAUC is the area under the part of the ROC curve that satisfies a minimum acceptable specificity. We believe that the minimum acceptable specificity when predicting malignancy of ovarian tumors is 75%; therefore, we computed the partial AUC, pAUC(0.75), for the part of the ROC curve in which the specificity is at least 75% (i.e., the leftmost part of the curve).
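A minimal sketch of pAUC(0.75) on simulated scores, integrating the empirical ROC curve only where the false-positive rate is at most 0.25 (specificity ≥ 75%):

```python
# Sketch of a partial AUC with a minimum specificity: integrate the ROC
# curve only over false-positive rates <= 0.25, i.e., specificity >= 75%.
# The scores are simulated; the IOTA model outputs are not public.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(2)
y = np.r_[np.zeros(300), np.ones(100)]                  # 0 = benign, 1 = malignant
scores = np.r_[rng.normal(0, 1, 300), rng.normal(1.5, 1, 100)]

fpr, tpr, _ = roc_curve(y, scores)

max_fpr = 0.25                                          # specificity >= 75%
# Interpolate the ROC curve at max_fpr, then integrate from 0 to max_fpr
# with the trapezoidal rule.
tpr_at_max = np.interp(max_fpr, fpr, tpr)
keep = fpr <= max_fpr
fpr_clip = np.r_[fpr[keep], max_fpr]
tpr_clip = np.r_[tpr[keep], tpr_at_max]
pauc = float(np.sum(np.diff(fpr_clip) * (tpr_clip[:-1] + tpr_clip[1:]) / 2))
print("pAUC(0.75):", round(pauc, 3))
```

By construction the maximum attainable value is 0.25 (a perfect test over that specificity range), so pAUC values are only comparable at the same specificity floor.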
Results
The diagnostic performance of the scores and models tested is shown in Tables 2 and 3 and in Fig. 1. The scoring system that did best when applied to the IOTA data was the RMI of Jacobs et al. (12), with an AUC of 0.88. The AUCs of the scores ranged from 0.66 to 0.88 (Tables 2 and 3; Fig. 1A). The AUCs of the logistic regression models ranged from 0.72 to 0.91, with the logistic regression model of Lu et al. (20) performing best (AUC, 0.91; Tables 2 and 3; Fig. 1B). All neural networks did well, their AUCs ranging from 0.88 to 0.89, and so did the LSSVMs and RVMs, with AUCs of 0.91 to 0.92 (Tables 2 and 3; Fig. 1C).
Figure 1D shows the best performing model per category, the RVM with RBF kernel of Lu et al. (22) having the highest AUC (0.923). Table 4 shows the 17 scores and models ranked from high to low AUC. The five best performing models were the RVMs, the LSSVMs, and the logistic regression model from Lu et al. (20, 22). By using the ranking method from Pepe et al. (40), the five best performing models were the same (Table 4). The probability of being among the best five models, Pm(5), or the best 10 models, Pm(10), is much higher for these models than for any other model. The other criterion used, the pAUC(0.75), also ranked these five models at the top (Table 4).
Discussion
This is the first report on prospective testing of multiple mathematical models to predict malignancy in an adnexal mass using a large multicenter database (41–43). External validation (i.e., prospective testing of a model on a new database from another center) is the ultimate test a model has to pass before it can be used in clinical practice. Internal validation, i.e., testing the performance of the model on a test set of data from the same center, is less adequate because it gives no information on the generalizability of the model. The advantage of the IOTA database is that data were collected in different centers with different referral patterns, resulting in a diverse study population that mimics a more universal population.
Most models tested on the IOTA database did worse than originally reported. For an excellent model, the AUC should reach 0.90 (35). Only the logistic regression model of Lu and colleagues (20), the LSSVM with linear and RBF kernels, and the RVM with linear and RBF kernels did so (20, 22).
There are several plausible reasons why the performance in the original articles was better than the results that we report. One is that some variables in some scores and models lacked an exact corresponding variable in the IOTA database, so that the variables had to be redefined on the basis of the information available in the IOTA database. This was true for both RMIs, the Lerner score, and the Prömpeler logistic regression model. Second, one of the causes of poorer performance in a prospective test is differences in population characteristics between the data from the original study and the data from the prospective study. If model development was based on a small sample of tumors, the particularities of that specific sample will have great influence on the model. However, Table 3 shows the population size in the original reports, and we could not find a linear correlation between the size of the data set in the original report and the performance of the score or model. The prevalence of malignancy in a population and the mix of exceptional and borderline tumors characterize a population. The prevalence of malignancy, for example, is 25% in the IOTA phase I study, but in the original reports of the models that were tested it varied between 22% and 42%. Models that have been developed on a large, multicenter study sample representative of a general population are likely to be more robust; conversely, a model developed in a tertiary referral center might perform less well when applied in a regional gynecology center. Because the IOTA data set is a multicenter data set collected in hospitals from different countries and with different referral patterns, it is more likely to represent a “universal” population.
A third explanation for models incorporating ultrasound variables performing worse when tested prospectively is that the results of some ultrasound variables may be difficult to reproduce, in particular those including an element of subjectivity, e.g., regular versus irregular cyst wall, and those highly dependent on the examination technique adopted, e.g., Doppler variables. Results of spectral Doppler ultrasound examinations depend on the tumor vessel investigated; velocity variables like peak systolic velocity and time-averaged maximum velocity are angle-dependent; and estimation of the color content of the tumor scan to assess tumor vascularity is subjective. This may make scores and models incorporating Doppler variables difficult to reproduce.
In the IOTA database, all variables had been defined using strict criteria, but these criteria may not have corresponded exactly to those used when creating the score or model to be tested. The models that did best in this study had all been developed on data that had, at least partly, been collected by examiners who also collected data for the IOTA study, and all had been created at institutions that contributed data to the IOTA study. This means that the variables used in these models were probably defined similarly when the models were created and tested, and that the tumor populations in which the models were created and tested were also similar. This may partly explain the apparently superior performance of these models. When using the AUC as a measure of diagnostic performance, the performance of the RMI of Jacobs et al. (12) was surprisingly similar to that of the artificial neural network models from Timmerman et al. (19) (AUC, 0.883 versus 0.876 and 0.889); even against the very best mathematical model, the maximum difference in AUC was only 0.04 (0.883 versus 0.923). Also in other studies, the RMI of Jacobs has been shown to perform very well when tested prospectively (13, 18, 44, 45). The RMI was developed on a relatively small population (143 patients), but its robustness may be explained by the inclusion of variables that do not require high ultrasound skills, except for the diagnosis of metastatic lesions (CA125, menopausal status, tumor with any ultrasound morphology other than a unilocular cyst, ascites, bilateral lesions, and presence of metastases at scanning). On the other hand, it requires knowledge of the serum CA125 level, which is expensive and time-consuming to obtain and not always available at the time the scan is done. The RMI is one of the oldest multimodal scoring systems that combine a tumor marker with ultrasound variables.
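The RMI computation itself is simple enough to sketch; the variable coding below reflects our reading of the original report of Jacobs et al. (12) and should be verified against that publication before any use:

```python
# Sketch of the risk of malignancy index (RMI) of Jacobs et al.:
# RMI = U x M x CA125, where U is an ultrasound score and M a menopausal
# score.  The coding below follows our reading of the original 1990 report
# and is illustrative only.
def rmi(ca125_u_ml, postmenopausal, n_ultrasound_features):
    """n_ultrasound_features: number of the five findings present
    (multilocular cyst, solid areas, metastases, ascites, bilateral lesions)."""
    u = 0 if n_ultrasound_features == 0 else (1 if n_ultrasound_features == 1 else 3)
    m = 3 if postmenopausal else 1
    return u * m * ca125_u_ml

# A score above the published cutoff of 200 suggests referral for
# oncological surgery.
print(rmi(50, postmenopausal=True, n_ultrasound_features=3))   # 50 * 3 * 3 = 450
```

The multiplicative form explains the index's dependence on CA125: with a U or M score of 0 or 1, even a moderately elevated marker dominates the result.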
When we used the cutoffs reported in the original articles, the RBF kernel model picked up 51 additional malignancies in comparison with the RMI (218 of 242 versus 167 of 242 malignant masses) owing to its higher sensitivity (Table 4). Furthermore, when using the ranking method of Pepe et al. (40), the RMI was not listed among the top five best performing models.
Finally, the statistical procedure used to develop a model is of utmost importance. The final model depends on the selection of the different variables. The goodness of fit is important because it reflects the accuracy with which the final regression model describes the data (46). Although the AUC represents the quality of a test and is very useful for comparing the performance of different models, the selection of cutoff levels depends on the study population and the preference of the investigator: if a high sensitivity is chosen, few malignancies will be missed, whereas increasing the cutoff will lead to a higher specificity and thus fewer false-positive results. It is interesting to note that the sensitivities and specificities of the scores/models when using the cutoffs recommended in the original publications were not very good, and much worse than in the original publications. The optimal cutoff point of any test depends on the proportion of cancers in the study population; therefore, if one uses the same cutoff in a population with a significantly different proportion of malignancies, one can expect the model to perform worse (Table 3).
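The cutoff tradeoff described above can be made concrete on simulated model outputs (the distributions here are hypothetical, not the IOTA models'):

```python
# Sketch: sensitivity and specificity as a function of the cutoff applied
# to a model's output.  Raising the cutoff trades sensitivity for
# specificity.  Output distributions are simulated, not from IOTA data.
import numpy as np

rng = np.random.default_rng(3)
benign = rng.normal(0.2, 0.15, 800)      # simulated outputs, benign tumors
malignant = rng.normal(0.6, 0.2, 266)    # simulated outputs, malignant tumors

for cutoff in (0.3, 0.5):
    sens = np.mean(malignant >= cutoff)  # fraction of malignancies flagged
    spec = np.mean(benign < cutoff)      # fraction of benign tumors cleared
    print(f"cutoff {cutoff}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

The AUC is unchanged by the choice of cutoff, which is why the comparison in this study rests on the AUC rather than on any single sensitivity/specificity pair.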
We conclude that the AUCs of most logistic regression and artificial neural network models were similar. The best results were obtained with kernel-based vector machine models, whose use resulted in the correct classification of a substantial number of additional malignancies. Because the models were tested on a large multicenter data set, the results are likely to be generally applicable.
Appendix A
IOTA Steering Committee
Dirk Timmerman, Lil Valentin, Thomas H. Bourne, William P. Collins, Sabine Van Huffel, and Ignace Vergote.
IOTA principal investigators (in alphabetical order)

Jean-Pierre Bernard, Maurepas, France

Thomas H. Bourne, London, United Kingdom

Enrico Ferrazzi, Milan, Italy

Davor Jurkovic, London, United Kingdom

Fabrice Lécuru, Paris, France

Andrea Lissoni, Monza, Italy

Ulrike Metzger, Paris, France

Dario Paladini, Naples, Italy

Antonia Testa, Rome, Italy

Dirk Timmerman, Leuven, Belgium

Lil Valentin, Malmö, Sweden

Caroline Van Holsbeke, Leuven, Belgium

Sabine Van Huffel, Leuven, Belgium

Ignace Vergote, Leuven, Belgium

Gerardo Zanetta, Monza, Italy.
Footnotes

Grant support: This research was supported by interdisciplinary research grants of the Katholieke Universiteit Leuven, Belgium (IDO/99/03 and IDO/02/09 projects), by the Belgian Programme on Interuniversity Poles of Attraction, by the Concerted Action Project AMBioRICS of the Flemish Community, by the EU Network of Excellence BIOPATTERN in the IST programme, entitled “Computational Intelligence for Biopattern Analysis in Support of eHealthcare” (contract no. FP6-2002-IST-508803), by research grants from the Swedish Medical Research Council (K200172X1160506A, K200272X1160507B, K200473X1160509A, and K200673X11605113), by funds administered by the Malmö General Hospital Foundation for the fight against cancer, and by a Swedish governmental grant from the region of Scania.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

Note: Supplementary data for this article are available at Clinical Cancer Research Online (http://clincancerres.aacrjournals.org/).
Received December 15, 2006.
Revision received March 5, 2007.
Accepted April 2, 2007.