Abstract
Purpose: To prospectively test the mathematical models for calculation of the risk of malignancy in adnexal masses that were developed on the International Ovarian Tumor Analysis (IOTA) phase 1 data set on a new data set and to compare their performance with that of pattern recognition, our standard method.
Methods: Three IOTA centers included 507 new patients who all underwent transvaginal ultrasound examination using the standardized IOTA protocol. The outcome measure was the histologic classification of excised tissue. The diagnostic performance of 11 mathematical models that had been developed on the phase 1 data set and of pattern recognition was expressed as area under the receiver operating characteristic curve (AUC) and as sensitivity and specificity when using the cutoffs recommended in the studies where the models had been created. For pattern recognition, a receiver operating characteristic curve was constructed from the levels of diagnostic confidence, and its AUC was calculated.
Results: All IOTA models performed very well and quite similarly, with sensitivity and specificity ranging between 92% and 96% and 74% and 84%, respectively, and AUCs between 0.945 and 0.950. A least squares support vector machine with linear kernel and a logistic regression model had the largest AUCs. For pattern recognition, the AUC was 0.963, sensitivity was 90.2%, and specificity was 92.9%.
Conclusion: This internal validation of mathematical models to estimate the malignancy risk in adnexal tumors shows that the IOTA models had a diagnostic performance similar to that in the original data set. Pattern recognition used by an expert sonologist remains the best method, although the difference in performance between it and the best mathematical model is not large.
 ultrasonography
 ovarian neoplasms
 color Doppler sonography
 logistic models
Translational Relevance
For clinicians, ultrasound pattern recognition is the standard method for the preoperative prediction of malignancy in an adnexal mass. In the hands of experienced ultrasound examiners, this strategy results in a sensitivity of 96% and a specificity of 90% (1). Less experienced examiners using pattern recognition will probably not achieve such high sensitivity and specificity. Therefore, the development of mathematical models to predict malignancy in adnexal masses is worthwhile. Because the prospective testing of most previously published models was disappointing, most likely because the models had been developed in a single small center and because neither examination technique nor terms to describe the ultrasound findings had been standardized, we developed 11 new models in a multicenter study (the IOTA study) including 1,066 patients scanned using the same standardized ultrasound protocol. In phase 1 of the IOTA study, all models proved to perform very well with AUC of >0.92. This area is larger than that of the risk of malignancy index, which is often regarded as a standard test (2–6). Internal validation is the next step in the evaluation process of new diagnostic tests. This article describes the internal validation of our new models. The AUCs of all models were similar to those in the IOTA study phase 1 and to that of pattern recognition. These results are promising, but external validation of the models in the hands of less experienced examiners remains to be done.
Correct preoperative discrimination between benign and malignant adnexal masses is important, because the preoperative diagnosis will determine the treatment of the patient. Incorrect diagnosis is likely to result in incorrect treatment and this may worsen the prognosis. Because benign and malignant adnexal masses show different ultrasound morphology, a transvaginal ultrasound examination can be used to discriminate preoperatively between benign and malignant masses (1, 2), one of the best ultrasound methods being pattern recognition, that is, subjective evaluation of grayscale and Doppler ultrasound findings by the ultrasound examiner (3, 4). Several scoring systems and mathematical models using ultrasound variables have been developed for the preoperative prediction of probability of malignancy. However, before these models can be used in daily practice, they should be tested prospectively in new populations. Some of these models have been tested prospectively on small data sets with disappointing results (5–7).
We have prospectively tested 17 scoring systems and mathematical models for calculation of malignancy risk in adnexal masses on a large data set collected by the International Ovarian Tumor Analysis (IOTA) collaborative group, the IOTA phase 1 data set (8). Unfortunately, most models and scoring systems performed worse than they had done in the studies where they had been created (8). However, the primary aim of the IOTA phase 1 study was not to prospectively test previously developed methods for discriminating between benign and malignant adnexal masses but to prospectively collect clinical information and standardized data from ultrasound examinations in a large number of patients with adnexal tumors to create new scoring systems and mathematical models to distinguish benign from malignant adnexal tumors (9). More than 50 variables of 1,066 patients with 1,233 masses were analyzed. Most variables were grayscale and color Doppler variables, but clinical and demographic variables were recorded as well. All patients were scanned by an expert sonologist of the IOTA group in one of the nine European centers following a strict protocol that has been published previously (10). For the development of all models, the database was divided into a training set of 754 patients and an independent test set of 312 patients. All models performed well when tested on the test set of patients (n = 312). Their areas under the receiver operating characteristic (ROC) curve (AUCs) ranged from 0.93 to 0.95 (9, 11, 12).
The aim of the present study was to prospectively test the mathematical models developed on the IOTA phase 1 data set on a new data set and to compare their performance with that of pattern recognition, which is assumed to be the standard method.
Materials and Methods
New prospective data set. Patients were recruited from three ultrasound centers: University Hospitals Leuven, Malmö University Hospital, and Istituto di Clinica Ostetrica e Ginecologica. These centers had contributed two thirds of the cases to the IOTA phase 1 data set (25%, 30%, and 12%, respectively). Consecutive patients meeting the inclusion criteria of the IOTA phase 1 study (9) underwent ultrasound examination by the same examiners as in the IOTA phase 1 study following the same IOTA study protocol and using the same or similar ultrasound systems as in the IOTA phase 1 study. Patient recruitment started immediately after completion of the IOTA phase 1 study (June 2002) and finished in December 2005. The IOTA study protocol and the IOTA terms and definitions have been described in detail elsewhere (9, 10). In accordance with the IOTA phase 1 study protocol, the ultrasound examiner was obliged to give his/her subjective impression in two ways: (a) classification of each mass as benign or malignant based on subjective evaluation of ultrasound findings (pattern recognition) and (b) expressing his/her level of confidence as follows: benign, probably benign, uncertain, probably malignant, or malignant. The category “uncertain” was split into two subcategories: uncertain but first classified as benign and uncertain but first classified as malignant. The criteria described in ref. 13 were used for pattern recognition.
Pattern recognition and the following mathematical models developed on the IOTA phase 1 data set were tested on the newly collected study population: two logistic regression models [LR1 (12 variables) and LR2 (6 variables); ref. 9], three least squares support vector machine (LSSVM) models (LSSVM lin, LSSVM rbf, and LSSVM add rbf; ref. 11), three relevance vector machine (RVM) models (RVM lin, RVM rbf, and RVM add rbf; ref. 3), and three neural networks (Bay MLP 112a, Bay MLP 112b, and Bay Perc 11; ref. 12).
LR1 was developed using state-of-the-art statistical logistic regression modeling, with due attention to multicollinearity, possible interactions, and the assumption of linearity in the logit. It retained 12 independent variables to predict malignancy. These variables were selected using both automated and manual selection procedures. LR1 gave an AUC of 0.942 on the test set (n = 312). Sensitivity and specificity were 93% and 76%, respectively, using a risk cutoff of 0.10. LR1 was the main logistic regression model. A reduced model (LR2) using only 6 variables was developed as well (9).
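To illustrate how a logistic regression model such as LR1 converts predictor values into a risk estimate that is then dichotomized at the 0.10 cutoff, a minimal Python sketch is given here. The intercept, the coefficients, and the three variable names are hypothetical placeholders chosen for illustration; they are not the published LR1 coefficients, which are reported in ref. 9. Treating a risk equal to the cutoff as malignant is likewise an assumption.

```python
import math

# Hypothetical intercept and coefficients, for illustration only.
# They are NOT the published LR1 coefficients (see ref. 9).
INTERCEPT = -4.0
COEFFICIENTS = {
    "age_years": 0.03,
    "ascites": 1.5,
    "max_solid_diameter_mm": 0.04,
}


def predicted_risk(patient):
    """Return the logistic-model probability of malignancy for one patient."""
    z = INTERCEPT + sum(COEFFICIENTS[name] * patient[name] for name in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-z))


def classify(risk, cutoff=0.10):
    """Label a mass malignant when the risk reaches the cutoff (assumed: >=)."""
    return "malignant" if risk >= cutoff else "benign"


# Hypothetical patient: 60 years old, ascites present, 20 mm solid component.
patient = {"age_years": 60, "ascites": 1, "max_solid_diameter_mm": 20}
risk = predicted_risk(patient)
print(f"predicted risk = {risk:.3f} -> {classify(risk)}")
```

A low cutoff such as 0.10 favors sensitivity over specificity, which is consistent with the 93% sensitivity and 76% specificity reported for LR1 on the test set.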
Kernel methods were also applied for binary classification in refs. 12, 14–16. The methods considered were Bayesian LSSVM (17) and RVM (17) with linear, radial basis function (RBF), and additive RBF kernels. LSSVMs and RVMs are mathematically advanced methods that are flexible in creating a nonlinear separation between two classes. Mathematically, these methods make a linear separation after applying a kernel function to the data. Relative to the original data, the use of a kernel function may introduce nonlinearity in the separation between the two classes. RVMs work with any function (not necessarily a positive semidefinite kernel function), so that RVMs are not true kernel-based methods, whereas LSSVMs are. Using a linear kernel, the separation between classes will still be linear with respect to the original data. A typical kernel function to obtain nonlinear separation is the RBF. The additive RBF kernel (18) is an extension of the RBF kernel that gives more insight into how the model works. More explanation on LSSVMs and RVMs can be found in refs. 11, 17, 18. More information about the Bayesian approach to LSSVMs is given in ref. 11. Variable selection was done using forward and backward selection in a Bayesian LSSVM framework. The variable selection analyses resulted in 12 variables being included in the mathematical models (Table 1). When these models were tested on the test set of IOTA phase 1, all models had similar performance, with test set AUCs ranging from 0.94 to 0.95, accuracy from 83% to 86%, sensitivity from 91% to 93%, and specificity from 81% to 84%.
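The three kernel types named above can be written down compactly. The following Python sketch is illustrative only: the parameterization exp(−‖x − z‖²/σ²), the default bandwidth, and the definition of the additive RBF kernel as a sum of one-dimensional RBF kernels over the input variables are assumptions made here, and the exact forms used in the IOTA models are described in refs. 11 and 18.

```python
import math


def linear_kernel(x, z):
    """Inner product: the separation stays linear in the original variables."""
    return sum(a * b for a, b in zip(x, z))


def rbf_kernel(x, z, sigma=1.0):
    """RBF kernel on the full input vector (assumed form exp(-||x-z||^2 / sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / sigma ** 2)


def additive_rbf_kernel(x, z, sigma=1.0):
    """Sum of one-dimensional RBF kernels, one term per input variable (assumed form)."""
    return sum(math.exp(-((a - b) ** 2) / sigma ** 2) for a, b in zip(x, z))


x = [1.0, 2.0, 3.0]
z = [1.5, 2.0, 2.0]
print(linear_kernel(x, z), rbf_kernel(x, z), additive_rbf_kernel(x, z))
```

Because the additive kernel is a sum of per-variable terms, the contribution of each variable to the model output can be inspected separately, which is the sense in which it gives more insight into how the model works.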
Next to the kernel methods, a Bayesian multilayer perceptron using the evidence procedure (19, 20) was applied as another type of model that can incorporate nonlinearity (14). Variable selection was done using Automatic Relevance Determination and model selection using cross-validation. A linear model using a Bayesian perceptron was also developed. The best Bayesian multilayer perceptron included 11 variables with two hidden neurons (Bay MLP 112a). All three neural networks used different sets of variables; the models included 6 to 12 of the variables, which are listed in Table 1.
Statistical analysis. The diagnostic performance of each model was expressed as the AUC (21) and the partial AUC (22). The partial AUC is the area under the part of the ROC curve that is defined by the lowest acceptable specificity. Because we believe that the lowest acceptable specificity when predicting malignancy in ovarian tumors is 75%, we computed the partial AUC (0.75), that is, the area under that part of the ROC curve where the specificity is ≥75% (the leftmost part of the ROC curve; ref. 21). Using the six levels of diagnostic confidence as different cutoffs, a ROC curve for pattern recognition could be constructed as well. The diagnostic performance of the models was also expressed as sensitivity, specificity, positive and negative predictive value, and positive and negative likelihood ratio when using the risk cutoff recommended in the original article describing the model. However, sensitivity and specificity of a model depend on the cutoff chosen, whereas the AUC reflects overall test performance. Therefore, we considered the AUC to be the most important measure of diagnostic performance (20). Because the pairwise statistical comparison of all 11 models would yield 55 different P values, we preferred to compare the performance of the models by using the ranking method based on the work of Pepe et al. (23). In this method, all models are ranked with regard to a chosen criterion. For our analysis, the main criterion was the AUC. Because the ranking is influenced by sampling variability, the probability that a method is ranked among the κ best methods, Pm(κ), is computed using 1,000 bootstrap samples (23). A bootstrap sample is a new data set generated from the original data set, where both have the same sample size. The bootstrap sample is constructed by randomly selecting patients from the original data set. Patients are selected “with replacement,” meaning that a patient can be selected more than once or not at all for a given bootstrap sample (25).
For each bootstrap sample, the AUCs were computed and the models were ranked according to their AUC. This allowed us to compute how many times each model was ranked as the best, Pm(1), among the best three, Pm(3), and among the best five, Pm(5). The mean rank over all bootstrap samples was also computed.
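The bootstrap ranking procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the function names, the handling of tied AUCs (earlier-listed models rank first), the redrawing of bootstrap samples that contain only one outcome class, and the toy data are all choices made here.

```python
import random


def auc(y, scores):
    """Empirical AUC: P(malignant score > benign score), ties counted as 1/2."""
    pos = [s for s, label in zip(scores, y) if label == 1]
    neg = [s for s, label in zip(scores, y) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ranking(y, model_scores, n_boot=1000, ks=(1, 3, 5), seed=0):
    """Return Pm(k) for each model and k, and the mean rank of each model."""
    rng = random.Random(seed)
    n = len(y)
    names = list(model_scores)
    top_counts = {m: {k: 0 for k in ks} for m in names}
    rank_sums = {m: 0 for m in names}
    done = 0
    while done < n_boot:
        # Draw n patients with replacement from the original data set.
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:  # assumption: redraw if only one class is present
            continue
        aucs = {m: auc(yb, [model_scores[m][i] for i in idx]) for m in names}
        ordered = sorted(names, key=lambda m: -aucs[m])  # rank 1 = largest AUC
        for rank, m in enumerate(ordered, start=1):
            rank_sums[m] += rank
            for k in ks:
                if rank <= k:
                    top_counts[m][k] += 1
        done += 1
    pm = {m: {k: top_counts[m][k] / n_boot for k in ks} for m in names}
    mean_rank = {m: rank_sums[m] / n_boot for m in names}
    return pm, mean_rank


# Toy data (not study data): 10 benign (0) and 10 malignant (1) masses.
y = [0] * 10 + [1] * 10
toy_scores = {
    "A": [i / 20 for i in range(20)],             # separates the classes perfectly
    "C": [((7 * i) % 20) / 20 for i in range(20)],  # arbitrary intermediate model
    "B": [1 - i / 20 for i in range(20)],         # ranks every pair the wrong way
}
pm, mean_rank = bootstrap_ranking(y, toy_scores, n_boot=200)
print(pm)
print(mean_rank)
```

Because the same `auc` function accepts any ordered score, the six levels of diagnostic confidence can be coded as integers from 1 to 6 and passed to it to obtain an AUC for pattern recognition.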
To determine differences between benign and malignant tumors in numerical ultrasound and clinical variables, the θ index for effect size [with its 95% confidence interval (95% CI)] was used as the main indicator (24). The θ index can take on any value between 0.5 and 1 and can be interpreted as the degree of overlap between the benign and the malignant groups. A θ value of 1 means no overlap and a θ value of 0.5 means maximal overlap. The θ index is mathematically identical to the AUC, but a ROC curve analysis has different objectives (25). Nonetheless, this relationship may help nonstatisticians to interpret θ. For dichotomous variables, the difference between percentages was computed, together with a 95% CI using method 10 as described by Newcombe (26).
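Both quantities in this paragraph are simple to compute. The Python sketch below is an illustration under stated assumptions: the effect-size index is computed as the probability that a randomly chosen malignant case has a larger value than a randomly chosen benign case (ties counted as one half) and is then folded so that it lies between 0.5 and 1, and the CI for a difference between two independent proportions is the score-based interval without continuity correction, built from the Wilson intervals of the two proportions. The function names and the example counts are hypothetical.

```python
import math


def theta_index(benign_values, malignant_values):
    """Effect size: P(malignant > benign) + 0.5 * P(tie), folded into [0.5, 1]."""
    wins = sum((m > b) + 0.5 * (m == b)
               for m in malignant_values for b in benign_values)
    p = wins / (len(benign_values) * len(malignant_values))
    return max(p, 1.0 - p)  # assumption: direction is ignored


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a single proportion (no continuity correction)."""
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def newcombe_diff_ci(x1, n1, x2, n2, z=1.959964):
    """Difference p1 - p2 with a score-based CI for two independent proportions."""
    p1, p2 = x1 / n1, x2 / n2
    l1, u1 = wilson_interval(x1, n1, z)
    l2, u2 = wilson_interval(x2, n2, z)
    d = p1 - p2
    lower = d - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return d, lower, upper


# Arbitrary example counts (not study data): 56 of 70 versus 48 of 80.
d, lower, upper = newcombe_diff_ci(56, 70, 48, 80)
print(f"difference = {d:.3f}, 95% CI {lower:.4f} to {upper:.4f}")
print(theta_index([1, 2, 3, 4], [3, 5, 6, 7]))
```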
To compare levels of sensitivity or specificity between different approaches, we computed the difference in sensitivity or specificity and constructed a 95% CI for paired proportions using method 10 as described by Newcombe (27).
Results
The prospectively collected study population consisted of 507 patients with complete information for all 17 variables used in the models tested. Of these 507 patients, 287 (57%) were examined in Leuven, Belgium, 96 (19%) in Rome, Italy, and 124 (24%) in Malmö, Sweden. The malignancy rate in the whole study population was 28% (143 of 507), with a malignancy rate of 30% (85 of 287) in Leuven, 33% (32 of 96) in Rome, and 21% (26 of 124) in Malmö. Histologic diagnoses are shown in Table 2. Tables 3 and 4 present clinical data and ultrasound findings in benign and malignant tumors separately for each of the three participating centers.
Diagnostic performance of the models tested
The performance of the models tested is shown in Table 5 together with the results of pattern recognition. The models performed very well and quite similarly, with AUCs ranging from 0.945 to 0.950, sensitivity from 91.6% to 95.1%, and specificity from 73.9% to 83.8%. An LSSVM with linear kernel and the logistic regression model LR1 had the largest AUC (0.950). The largest difference in AUC between any two models was only 0.005 (Table 5).
The logistic regression model LR2 with 6 variables had the highest partial AUC (0.75), that is, 0.208. By using the ranking method for AUCs of Pepe et al. (23), the LSSVM with linear kernel and with rbf kernel and the two logistic regression models were ranked among the five best performing models. However, according to Pm(1), the LSSVM with rbf kernel ranked seventh (Table 5).
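To help interpret the partial AUC (0.75), note that under the definition used here (area under the ROC curve where specificity is ≥75%, that is, where the false-positive rate is ≤0.25) its maximum attainable value is 0.25. A minimal Python sketch of the computation follows; the trapezoidal rule with linear interpolation at the boundary is one common convention and is an assumption here, as are the function names and the toy data.

```python
def roc_points(y, scores):
    """Empirical ROC points (FPR, TPR), one per distinct score, from (0, 0) to (1, 1)."""
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    pairs = sorted(zip(scores, y), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Process all cases tied at this score together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def partial_auc(y, scores, min_specificity=0.75):
    """Area under the ROC curve restricted to specificity >= min_specificity."""
    max_fpr = 1.0 - min_specificity
    pts = roc_points(y, scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Interpolate linearly to the boundary of the region of interest.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data (not study data): a perfectly separating score and an uninformative one.
y = [0, 0, 0, 0, 1, 1, 1, 1]
perfect = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
constant = [0.5] * 8
print(partial_auc(y, perfect))   # upper bound of the partial AUC (0.75)
print(partial_auc(y, constant))  # chance-level value
```

Setting `min_specificity=0.0` returns the full AUC, so the same function covers both measures.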
Using the subjective classification as benign or malignant, sensitivity was 90.2% and specificity was 92.9%. This corresponds to a positive likelihood ratio of 12.7 and a negative likelihood ratio of 0.11 (Table 5; Fig. 1). When the expert sonologists expressed the level of diagnostic confidence with which they had made the diagnosis of benignity or malignancy, they were uncertain about their diagnosis in 8% (39 of 507) of the cases and completely confident about the benign or malignant character of the mass in 46% (226 of 507) of the cases. The use of the different levels of confidence resulted in an AUC for pattern recognition of 0.963.
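The likelihood ratios quoted for pattern recognition follow directly from the reported sensitivity and specificity, using the standard definitions LR+ = sensitivity / (1 − specificity) and LR− = (1 − sensitivity) / specificity. A short Python check (the function name is ours):

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (positive LR, negative LR) from sensitivity and specificity."""
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg


# Values reported for pattern recognition: sensitivity 90.2%, specificity 92.9%.
lr_pos, lr_neg = likelihood_ratios(0.902, 0.929)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}")  # LR+ = 12.7, LR- = 0.11
```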
Finally, Table 6 and Fig. 1 compare the performance of the best IOTA model (LSSVM lin; ref. 11) with the assumed standard method, pattern recognition. The AUC for LSSVM lin was 0.950 versus 0.963 for pattern recognition (difference = −0.013; 95% CI, −0.034 to 0.005). Using the cutoff that was reported in the original publication, sensitivity was 91.6% for LSSVM lin versus 90.2% for pattern recognition (difference = 1.4%; 95% CI, −4.2% to 7.1%) and specificity was 83.8% for LSSVM lin versus 92.9% for pattern recognition (difference = −9.1%; 95% CI, −13.1% to −5.2%).
Results per center
Leuven. All IOTA models performed very well when tested on the data collected in Leuven. The AUC values ranged from 0.947 to 0.958. A Bayesian neural network had the largest AUC (0.958). The two logistic regression models also had large AUCs (0.956 and 0.957; Table 5).
Rome. The performance of the IOTA models was moderate to good when tested on the data collected in Rome. The AUCs ranged from 0.878 to 0.905 with the largest AUC for a Bayesian perceptron network (Table 5).
Malmö. All IOTA models performed very well on the data collected in Malmö. The AUCs ranged from 0.975 to 0.992. The logistic regression model LR1 had the largest AUC (Table 5).
Discussion
Pattern recognition by an experienced sonologist is an excellent method for discriminating between benign and malignant adnexal masses and should probably be regarded as the standard method for preoperative classification of adnexal masses (4, 13). However, the ability to discriminate between benign and malignant adnexal masses using pattern recognition increases with increasing experience (1), and in daily clinical practice, it is impossible to ask an expert's opinion on every adnexal mass. The aim of developing mathematical models is to improve the less experienced examiner's ability to discriminate between benign and malignant adnexal masses so that it approaches that of an expert ultrasound examiner.
In the United Kingdom, the Royal College of Obstetricians and Gynaecologists recommends the use of the risk of malignancy index (28, 29), but in ref. 9, we compared the performance of the two IOTA logistic regression models with the risk of malignancy index on a test set of 236 patients and showed that the risk of malignancy index performed significantly worse: AUC 0.936 and 0.916 for the logistic regression models versus 0.870 for risk of malignancy index (P = 0.0038; ref. 9). The strategy followed by the American College of Obstetricians and Gynecologists is the use of guidelines based on demographic and ultrasound variables (CA 125 level: >35 units/mL for postmenopausal patients and >200 units/mL for premenopausal patients, evidence of ascites on ultrasound or computed tomography, a nodular or fixed pelvic mass, evidence of abdominal or distant metastases on computed tomography, and family history of at least one first-degree relative with ovarian or breast cancer) to classify an adnexal mass as benign or malignant and to select which patient should be referred to a tertiary center with a gynecologic oncology department (30).
When the performance of these guidelines was tested on a data set of 837 patients, the guidelines reached a sensitivity of 79% and 93% for the premenopausal and postmenopausal groups, respectively, and a specificity of 70% and 60%. After a revision of these guidelines that decreased the cutoff level of serum CA 125 in the premenopausal group (from 200 to 67 units/mL), sensitivity was 75% and 91% for premenopausal and postmenopausal patients, respectively, and specificity 91% and 76%.
Because the performance of these Royal College of Obstetricians and Gynaecologists and American College of Obstetricians and Gynecologists “gold standards” was not optimal and the performance of the IOTA models on a test set of 236 patients was very good, we believed that the IOTA models deserved internal and external validation. Unfortunately, we cannot compare the performance of the IOTA models with that of the Royal College of Obstetricians and Gynaecologists or American College of Obstetricians and Gynecologists guidelines because we have not prospectively collected information on physical examination or computed tomography findings concerning the presence of metastases.
In this internal validation, the IOTA models performed as well as or better than they had done in the original studies where they had been created.
The robustness of the IOTA models may be explained by their having been developed on a large data set in which data had been collected from many centers using a standardized ultrasound protocol and standardized terms and definitions. On the other hand, the models were validated in the three centers that had provided most of the data to the IOTA phase 1 study, which means that our study presents the results of internal validation only. The models performed better in the centers in Leuven and Malmö than in Rome. This may be explained by differences in pathology between centers (Table 2). The center in Rome included more primary invasive tumors and metastatic tumors but fewer borderline tumors than the other centers, and it included more fibromas and abscesses. Fibromas and abscesses are often misclassified by mathematical models because of their solid character and high vascularity. Differences in performance between centers may also be explained by differences between ultrasound examiners in their evaluation of certain ultrasound features (e.g., in the evaluation of color score, which is highly subjective) and by differences in ultrasound equipment (the Doppler sensitivity of the ultrasound system being important when assessing the color score).
Pattern recognition in the hands of an experienced ultrasound examiner is a very good method for discriminating between benign and malignant adnexal tumors. The mathematical models seem to have slightly higher sensitivity and lower specificity than pattern recognition, but this is dependent on the cutoff chosen for the mathematical model. In this study, the mathematical models performed nearly as well as pattern recognition when comparing the ROC curves. The models now need to undergo external validation, that is, to be tested in new centers. Moreover, their performance in the hands of less experienced examiners (for whom the models were created) needs to be determined.
Conclusion
This is the first study to prospectively test the IOTA mathematical models built to predict preoperatively the benign or malignant character of an adnexal mass. Although our results are those of an internal validation, they are reassuring. Of all the models that were developed to discriminate between benign and malignant adnexal tumors, the logistic regression models and the LSSVM with linear and rbf kernel had the highest AUCs and were ranked among the five best performing models. However, because the differences in AUCs between the 11 models were extremely small, these rankings may be clinically irrelevant. The models now need to undergo external validation, that is, they need to be tested in completely new centers that did not take part in the IOTA phase 1 study. They should also be tested by ultrasound examiners with varying experience. None of the models tested was superior to subjective evaluation of ultrasound findings by an experienced ultrasound examiner.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Footnotes

Grant support: Research Council of the Katholieke Universiteit Leuven (GOAAMBioRICS, CoE EF/05/006 Optimization in Engineering OPTEC), Belgian Federal Science Policy Office IUAP P6/04 (“Dynamical Systems, Control and Optimization,” 2007–2011), EU: BIOPATTERN (FP62002IST 508803), ETUMOUR (FP62002LIFESCIHEALTH 503094), Healthagents (IST200427214), Swedish Medical Research Council (grants K200172X1160506A, K200272X1160507B, K200473X1160509A, and K200673X11605113), Malmö University Hospital, Allmänna Sjukhusets i Malmö Stiftelse för bekämpande av cancer (Malmö General Hospital Foundation for Fighting Against Cancer), and ALFmedel and Landstingsfinansierad regional forskning (two Swedish governmental grants).

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
 Accepted October 2, 2008.
 Received January 14, 2008.
 Revision received October 2, 2008.