Numerous analyses of new markers are published routinely. Discovery of a new marker may be important for a variety of reasons. For example, the new marker might yield insight into a disease process. However, the common application and analysis of a new marker is that of stratifying patients with respect to the outcome. The typical scenario is the collection of a dataset containing the new marker, established markers, and patient outcome. In this setting the scientist desires from the biostatistician an analysis of the empirical value of the new marker. This report addresses such an analysis. While there may be other value in the discovery of a particular marker, the central issue here is often prediction of outcome. When a new marker is identified, the founder would like others to measure and use it because of the belief that it would be better able to predict patient outcome and, thereby, improve patient counseling and help patients make better treatment decisions (1). However, the methods often used to evaluate and demonstrate the usefulness of a new marker are in need of improvement. In this article, the typical approach to the evaluation of a new marker is discussed, and an alternative is suggested. In the typical approach, the association of a new marker with established markers is examined, and the univariable and multivariable analyses of the new marker are performed. It is argued that a better approach is to compare the predictive ability of the multivariable model that contains the marker to the predictive ability of the model that lacks the marker (2).
Association with Established Marker(s)
One often begins the analysis of a new marker by first presenting its association or correlation with an established marker (e.g., tumor grade). For example, higher expression levels of the new marker in patients with high-grade tumors might be found. However, the value of an analysis like this is not clear. The results of this correlation analysis are not conclusive regarding the value of the new marker. For example, one would not want to see that a new marker correlated perfectly with an existing marker, as this would imply that the new marker was redundant. That is, equivalent predictions could be obtained by using an established marker. Unless the new marker is cheaper to measure than the established marker, or the new marker allows the patient to avoid a painful procedure (e.g., biopsy), correlation analysis provides little insight into the potential value of the new marker.
The next analysis often provided is a plot of Kaplan-Meier curves for the new marker. An example of this can be seen in Fig. 1, where curves illustrate survival for high and low expression levels of a new marker. These curves are indeed informative regarding the time to failure for groups of patients. However, the typical concern of the founder of the marker is simply whether the curves are distinct. Again, this analysis contributes very little and does not answer our central question of whether the new marker is of value. The major weakness of this analysis is that established markers are not considered. Just because the new marker shows distinct survival differences does not mean that equivalent separation cannot already be achieved by using an established marker or combination of established markers. This limitation often prompts the plot of Fig. 2. Here, the new marker separation is compared with that of an established marker (e.g., stage) to show that the new marker provides a wider, more significant separation of prognostic groups.
There are several limitations to the analysis provided in Fig. 2, which, again, does not directly answer the question of the value of the new marker. First, Fig. 2 presumes that there is only one established marker available in the disease (stage, in this example) and that the single established marker is binary (early versus advanced, again in this example). If there is more than one established marker, or if the established marker has more than two levels, one has to question how the two groups in the right panel of Fig. 2 were chosen. However, of greater concern is whether prognostic value is lost by collapsing the established marker information into the two groups. The real question is whether, after considering everything we know about the patients presently, it is possible to predict their outcomes with greater accuracy if the new marker is considered. Because the right panel of Fig. 2 is likely not optimally incorporating everything we know about the patient, the Fig. 2 analysis is inadequate for making the case for the new marker. In the unlikely scenario that there really is only one established marker for our disease, and that it is indeed binary, there remains the problem in Fig. 2 of whether the separation in the left panel (P < 0.01) is statistically significantly better than the separation provided in the right panel (P = 0.04). Mere comparison of the two P values is not definitive.
In general, the definitive marker assessment is multivariable analysis. For example, consider Table 1. Unfortunately, results such as those in Table 1 are also plagued with limitations and do not directly test the value of the new marker. For example, the P value essentially tests whether the hazard ratio is 1, not whether the prediction is improved by the new marker (3, 4, 5). This is a problem because several issues affect the numerical value of the hazard ratio: (1) how the new marker was coded, for example, categorized or continuous; (2) how the existing markers were coded; (3) which existing markers were included in the analysis; (4) whether stepwise variable selection was performed or only the variables significant in univariable analysis were included; and (5) how the variables were modeled (e.g., transforms, splines). In short, there are many potential judgment calls that can affect the hazard ratio and render it somewhat subjective. Continuous variables generally have smaller hazard ratios, because the ratio is expressed per unit change, so the hazard ratio also makes them look bad. This leads to categorization of the continuous variables somehow, a process often fraught with difficulty (6). Yet another concern with the analysis in Table 1 is that it assumes that a Cox regression (which is necessary to provide the hazard ratios) is the best prediction model. This might not be the case. An alternative (e.g., a classification and regression tree) might provide the most accurate predictions presently available with standard markers and/or the standard plus novel markers. If this is so, this alternative should be incorporated in the marker evaluation. Thus, the standard multivariable analysis does not address the central question of whether our new marker permits us to predict patient outcome more accurately than we are presently able.
Suggested Alternative: Change in Concordance Index
How, then, can one show that patient outcome can be predicted more accurately when one has knowledge of the new marker? An attractive solution is to show the improvement in predictive accuracy that is obtained when the new marker is added to a model containing the established markers. To do this, however, one must first choose a metric for predictive accuracy. For example, predictive accuracy can be measured by the concordance index. The concordance index is the probability that, given two randomly selected patients, the patient with the worse outcome is, in fact, predicted to have a worse outcome (7). This measure, similar to the area under the receiver operating characteristic curve, ranges from 0.5 (i.e., chance or a coin flip) to 1 (perfect ability to rank patients). As a measure of a model’s predictive ability, the concordance index admittedly might not be the perfect metric, and methods of comparing concordance indices do need further development, but it is perhaps the best measure presently available (8). It is particularly attractive because it does not require that we specify a cutpoint in the predicted value, as simple classification accuracy would.
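To make the definition concrete, the following is a minimal sketch of a concordance index for censored survival data, written in plain Python. It is an illustration under stated assumptions, not the exact estimator of reference (7): the function name `concordance_index` and its arguments are hypothetical, pairs with tied follow-up times are simply skipped, and tied predictions are counted as half concordant.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Fraction of usable patient pairs in which the patient who failed
    earlier was assigned the higher predicted risk.

    times       -- follow-up time for each patient
    events      -- 1 if failure was observed, 0 if censored
    risk_scores -- model output; higher means worse predicted outcome
    """
    concordant = 0.0
    usable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that patient `a` has the shorter follow-up.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is usable only if the times differ and the shorter time is
        # an observed failure; otherwise we cannot say who did worse.
        # (Assumption: tied times are skipped for simplicity.)
        if times[a] == times[b] or not events[a]:
            continue
        usable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # worse outcome, higher predicted risk
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied prediction counts as a coin flip
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable
```

Note that no cutpoint in the predicted value is needed: only the ordering of the risk scores within each usable pair matters.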
With a measure of predictive accuracy in place, one now needs to show how it is affected by the inclusion of the new marker. Consider Table 2. In this table, three models are being compared with the full model containing all variables. Each of the three models lacks one variable. The model lacking established marker 1 is compared with the full model and the degree to which predictive accuracy is reduced (drop in concordance index) is shown. Here, not knowing established marker 1 would reduce our predictive accuracy, as measured by the concordance index, by 0.1. The critical row of Table 2 is the third. This shows that the predictive accuracy is improved by 0.15 when the new marker is measured. Thus, the incremental value of the new marker has been established, and a framework is provided that does not presume a particular form of the prediction model (e.g., a Cox regression). By focusing on the predictive accuracy measure, the framework allows for the use of any form of prediction model, if shown to provide superior predictive ability.
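The comparison of a full model against models that each lack one variable can be sketched as follows. This is only an illustration: the helper names `c_index` and `drop_table` are hypothetical, the patient data and risk scores are made up, and the risk scores are assumed to come from models that were fitted elsewhere (by whatever method predicts best).

```python
from itertools import combinations


def c_index(times, events, risk):
    """Compact concordance index (assumption: tied times are skipped)."""
    num = den = 0.0
    for i, j in combinations(range(len(times)), 2):
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if times[a] == times[b] or not events[a]:
            continue  # pair is not usable
        den += 1
        num += 1.0 if risk[a] > risk[b] else 0.5 if risk[a] == risk[b] else 0.0
    return num / den


def drop_table(times, events, full_risk, reduced_risks):
    """Return {omitted marker: C(full model) - C(model lacking that marker)}."""
    c_full = c_index(times, events, full_risk)
    return {name: c_full - c_index(times, events, risk)
            for name, risk in reduced_risks.items()}


# Made-up data for illustration only (6 patients, 2 censored).
times = [2, 4, 5, 7, 9, 12]
events = [1, 1, 0, 1, 1, 0]
full_risk = [0.9, 0.8, 0.5, 0.6, 0.3, 0.1]   # model with all markers
reduced_risks = {                            # each model lacks one marker
    "established marker 1": [0.9, 0.5, 0.8, 0.6, 0.3, 0.1],
    "new marker": [0.5, 0.8, 0.9, 0.3, 0.6, 0.1],
}

table = drop_table(times, events, full_risk, reduced_risks)
for name, drop in table.items():
    print(f"without {name}: drop in concordance index = {drop:.3f}")
```

Because `drop_table` sees only predicted risks, the same comparison applies whether the predictions come from a Cox regression, a tree, or any other model.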
Although this article is meant to be more of a conceptual piece, some notes about the mechanics of Table 2 are in order. For example, the drop in concordance index should be bias corrected (i.e., it should not merely reflect overfit). This means that the predictive ability should be representative of what would be expected when the model is applied to future patients. This can easily be done by comparing cross-validated predicted probabilities (i.e., probabilities produced for patients not used to derive the prediction model). Similarly, bootstrapping can be used to provide 95% confidence intervals and P values.
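One possible way to carry out these mechanics is sketched below. Everything here is an assumption made for illustration: the function names, the k-fold scheme, and the percentile bootstrap are not prescribed by this article, and `fit` stands for any user-supplied routine that takes training data and returns a prediction function. As a simplification, the bootstrap resamples patients together with their cross-validated predictions rather than refitting the models in every replicate.

```python
import random
from itertools import combinations


def c_index(times, events, risk):
    """Compact concordance index (assumption: tied times are skipped)."""
    num = den = 0.0
    for i, j in combinations(range(len(times)), 2):
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if times[a] == times[b] or not events[a]:
            continue
        den += 1
        num += 1.0 if risk[a] > risk[b] else 0.5 if risk[a] == risk[b] else 0.0
    return num / den  # raises ZeroDivisionError if no pair is usable


def out_of_fold_risk(rows, times, events, fit, k=5, seed=0):
    """Predict each patient with a model fitted without that patient."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    risk = [None] * len(rows)
    for fold in range(k):
        test = idx[fold::k]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        predict = fit([rows[i] for i in train],
                      [times[i] for i in train],
                      [events[i] for i in train])
        for i in test:
            risk[i] = predict(rows[i])
    return risk


def bootstrap_drop_ci(times, events, risk_full, risk_reduced,
                      n_boot=1000, level=0.95, seed=0):
    """Percentile interval for C(full) - C(reduced), resampling patients."""
    rng = random.Random(seed)
    n = len(times)
    drops = []
    for _ in range(n_boot):
        s = [rng.randrange(n) for _ in range(n)]
        t = [times[i] for i in s]
        e = [events[i] for i in s]
        try:
            drops.append(c_index(t, e, [risk_full[i] for i in s])
                         - c_index(t, e, [risk_reduced[i] for i in s]))
        except ZeroDivisionError:
            continue  # replicate had no usable pairs; skip it
    if not drops:
        raise ValueError("no usable bootstrap replicates")
    drops.sort()
    lo = drops[int((1 - level) / 2 * len(drops))]
    hi = drops[max(int((1 + level) / 2 * len(drops)) - 1, 0)]
    return lo, hi
```

In use, `out_of_fold_risk` would be called twice, once with the rows containing all markers and once with the new marker removed, and the two resulting risk vectors passed to `bootstrap_drop_ci`. An interval that excludes zero would suggest that the new marker improves predictive accuracy for future patients.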
One additional question regarding Table 2 is which established markers should be included. The simple recommendation is to include all other markers that are available and believed to be prognostic, ideally based on biologic rationale, at the time the new marker is measured (9). There may not be universal agreement on which variables are believed to be prognostic or on the biologic rationale for some of the markers. Therefore, it is important to at least include what is generally felt to be the least common denominator with regard to the list of variables and their measurement scales. Going beyond this only makes a stronger statement regarding the new marker. Previous analyses with other, completely separate data sets, as well as clinical judgment, should determine the list of established markers. In particular, univariable analyses and stepwise variable selection on the same data set (or a subset of these data) clearly should not be performed for determining which variables to include in this table. The reason for this is that these methods are biased (10), and this bias will cause the new marker to look better than it really is (4, 8). The markers included should comprise all of the markers that would routinely be used to predict patient outcome. Remember the question: does the new marker contribute to our ability to predict patient outcome beyond what we can already achieve based on everything we know about the patient? We need to maximize the value and potential contribution of everything we know before we assess the new marker. Conceivably, for a more direct test of improvement in predictive accuracy, one could use the predicted probability from the established model as the only “established marker” to compare against.
Another attraction of the proposed methodology is that some of the subjective modeling aspects that might affect the hazard ratio estimate can now be made more objective by the focus on predictive accuracy as the criterion. Again, modeling choices should be made with the goal of maximizing predictive accuracy; in particular, the most accurate model lacking the marker of interest should be compared with the most accurate model containing the marker. Whatever cutpoints, transforms, etc., produce the most accurately predicting model should be used, which makes these typically subjective choices more objective.
This article has dealt primarily with the question of whether a new marker is truly a prognostic or predictive factor. However, this question leads to several related questions.
Is the New Marker Better than Established Marker X?
The answer to this question really doesn’t matter. Again, it is suggested to ask whether the new marker contributes beyond what is already known, and one typically knows more than established marker X. The goal is not to replace established marker X but instead to improve on the performance achieved by using all established markers.
What Is the Most Important Prognostic Factor? Is It the New Marker?
Similarly, the answer to this question really doesn’t matter. Whether the new marker is the most important factor is not the issue. The real question is whether the new marker contributes to our ability to predict patient outcome. If it does, we should consider routinely measuring and using it, regardless of its rank in importance. Having said this, the drop in concordance index would be a good measure of importance, as it is a bottom-line assessment of predictive accuracy. Granted, the concordance index may be affected by anything that affects the models, such as sample size, selection of variables, measurement error, and modeling methods. Nonetheless, any approach that results in improvement in the concordance index is valuable, because this translates into better ability to predict individual patient outcomes.
What Are the Prognostic Factors for this Disease?
The answer to this question does not directly help the individual patient. The prognostic factors, themselves, do not allow a prediction of patient outcome. Rather, they must be combined or organized in some fashion to form a model, and the model predicts patient outcome. Thus, a more proper question would be: “What is the most accurate prediction model for this disease?” After this, our interest is whether this model contains the new marker or, if not, whether the model would be improved by its addition. If neither, the new marker is not a prognostic factor and, thus, not important.
The methods commonly used for marker evaluation are problematic. Instead, an analysis of a marker’s impact on the concordance index of the prediction model is recommended.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Requests for reprints: Michael W. Kattan, Departments of Urology and Biostatistics, Memorial Sloan-Kettering Cancer Center, 1275 York Avenue, Mailbox 27, New York, NY 10021. Phone: (646) 422-4386; Fax: (630) 604-3605; E-mail:
- Received August 19, 2003.
- Revision received October 8, 2003.
- Accepted October 14, 2003.