## Abstract

To provide an overview of noninferiority trials in oncology with a special emphasis on methodologic issues, we conducted a systematic review of randomized trials assessing noninferiority of antineoplastic treatments. We identified 72 articles, of which 65 were randomized phase III trials with a single control arm, 3 were factorial phase III trials, and 4 were randomized phase II trials. Forty-six were trials in lung, colorectal, or breast cancer. The quality of reporting improved chronologically (*P* < 0.01); the major deficiencies were claims of noninferiority when the results did not meet statistical criteria for noninferiority (7 articles) or when the noninferiority margin was not prespecified (5 articles). Four trials (6%) presented plans for switching from superiority to noninferiority. The analysis populations were intent to treat (ITT) in 52, per-protocol set (PPS) in 6, and both ITT and PPS in 11 trials. Noninferiority margins were set in 68 trials (94%); 1 trial used both the conventional and effect retention methods, 17 trials used the conventional method, 5 trials used the effect retention method, and in 45 trials, the method was not specified. Some trials used margins that possibly were larger than the assured effects of the active controls. No trials explicitly took into consideration uncertainty in historical data. Two trials (3%) specified 2 values of margins. Our findings highlight critical deficiencies in design and reporting of noninferiority trials. Seven practical recommendations are presented. *Clin Cancer Res; 18(7); 1837–47. ©2012 AACR*.

## Introduction

A variety of circumstances of comparative effectiveness research in oncology can lead to the undertaking of a noninferiority trial, but a typical scenario is one in which an experimental treatment is potentially less toxic, less costly, or easier to administer than a conventional treatment, all of which may outweigh a likely loss of efficacy (1). To date, a few agents, such as fulvestrant (2, 3), capecitabine (4–6), and pemetrexed (7), have been approved by the U.S. Food and Drug Administration (FDA) on the basis of results of noninferiority trials.

This type of trial has inherent complexities in design and reporting. The aim of a noninferiority trial is to show that the experimental treatment is not worse than the active control by more than a prespecified small amount, known as a noninferiority margin (8, 9). Although much has been written about the noninferiority margin (1, 10–12), considerable uncertainty remains about how it should be determined. Unfortunately, there is little published experience on how to select a margin (9), so empirical research is needed. In principle, a noninferiority margin should be prespecified and be taken into account in the sample size calculation. Switching from superiority to noninferiority, that is, specification of a margin after viewing the results, can produce an increase in the alpha error rate (refer to the following section for the definition of alpha in noninferiority trials). For analysis, both intent-to-treat (ITT) and per-protocol set (PPS) analyses should be conducted to examine differences between the 2 results (1). These methodologic factors are often underreported, and the resulting conclusions can be misleading (13, 14). The CONSORT checklist was extended to noninferiority trials in 2008 (15), but it is unknown whether it contributes to better communication. This systematic review, therefore, aims to provide an overview of previously published noninferiority trials in cancer with a special emphasis on these methodologic issues.

## Overview of Methods for Selecting a Noninferiority Margin

Traditionally, a noninferiority margin has been selected by the size of effects that are considered to be of no clinical relevance or to be outweighed by other benefits of the experimental treatment; this method is called the conventional method (8, 9). Figure 1 illustrates the concept of this method. Here, the margin (Fig. 1A) was set at an HR of no clinical relevance, 1.25, and the upper limit of the 95% confidence interval (CI) would be compared with 1.25 (Fig. 1). Unfortunately, this method has a limitation because, for example, if the HR of the active control compared with best supportive care (BSC) was not 1/1.25 or smaller but, for example, 1/1.2, showing noninferiority would not represent evidence of any efficacy (9). Thus, it has been recognized in the past few years that a noninferiority trial aiming for drug approval must be designed to prove that the experimental treatment has an effect greater than zero (e.g., efficacious over BSC; refs. 1, 10). To meet this requirement, the FDA proposed the “effect retention” or “putative placebo” method (11, 12), in which the ratio of the effect of the experimental treatment to the effect of the active control, as compared with BSC or other reference treatments, is tested against a prespecified fraction. Figure 1 also depicts the 50% effect retention method. In this example, the HR of BSC compared with the standard treatment (Fig. 1C) is assumed to be 1.3 (termed “M1” by the FDA; ref. 1), with a 95% lower confidence limit of 1.2. To test whether 50% of the effect of the standard is preserved, the margin (Fig. 1B) is set at the midpoint between HR = 1 and M1, specifically (1.3 + 1)/2 = 1.15 (termed “M2” by the FDA; ref. 1), as shown in Fig. 1. Alternatively, one can choose a conservative margin that takes into consideration uncertainty about precision and consistency of the effect. For example, the 95%–95% approach (11) defines the margin as one half of the 95% lower confidence limit of the effect of the standard (Fig. 1).
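As a numeric sketch of these two approaches, using the hypothetical figures from the example above (not values from any particular trial; the log-HR scale for the 95%–95% computation is one common convention, assumed here rather than mandated by the text):

```python
from math import exp, log

# Hypothetical historical estimate: HR of BSC vs. the active control
# is 1.3 ("M1"), with a 95% lower confidence limit of 1.2.
m1 = 1.3
lcl = 1.2

# 50% effect retention margin ("M2"): the midpoint between HR = 1
# and M1 on the HR scale, as in the example above.
m2 = (m1 + 1) / 2              # (1.3 + 1) / 2 = 1.15

# 95%-95% approach: retain one half of the 95% lower confidence
# limit of the effect, computed here on the log-HR scale.
m_95_95 = exp(0.5 * log(lcl))  # sqrt(1.2), roughly 1.095

print(round(m2, 3), round(m_95_95, 3))
```

Note that the 95%–95% margin is considerably stricter than M2, reflecting the extra conservatism built in to guard against uncertainty in the historical estimate.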

The sample size of a noninferiority trial is typically determined to ensure a high level of power. Notably, a noninferiority trial reverses the roles of the null and alternative hypotheses; the null corresponds to the noninferiority margin, and the alternative is typically specified as equivalent treatment outcomes. Using this terminology, “power” is defined as the probability of concluding noninferiority under the alternative hypothesis (16), and the alpha error rate is defined as the probability of concluding noninferiority under the null hypothesis. In most cases, half of one minus the confidence level of the 2-sided CI (e.g., 2.5% for a 95% CI or 5% for a 90% CI) corresponds to the alpha error rate.
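For time-to-event endpoints, the number of events required under these hypotheses can be sketched with the Schoenfeld approximation (a standard formula; the margin, alpha, and power values below are illustrative, not taken from any specific trial):

```python
from math import ceil, log
from statistics import NormalDist

def ni_events(margin_hr, alpha_one_sided=0.025, power=0.80, alt_hr=1.0):
    """Approximate number of events needed for a noninferiority
    comparison of two survival curves (Schoenfeld approximation,
    1:1 randomization). The null corresponds to HR = margin_hr; the
    alternative is HR = alt_hr (1.0 = equivalent treatment outcomes)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha_one_sided)
    z_beta = NormalDist().inv_cdf(power)
    delta = log(margin_hr) - log(alt_hr)
    return ceil(4 * (z_alpha + z_beta) ** 2 / delta ** 2)

# Margin HR = 1.25, one-sided alpha 2.5%, 80% power under equivalence:
print(ni_events(1.25))  # 631 events
```

Because the calculation depends on the logarithm of the margin, even modestly tighter margins inflate the required number of events sharply, which is one reason noninferiority trials tend to be large.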

## Materials and Methods

### Search strategy

We did an electronic search of PubMed and the Cochrane Central Register of Controlled Trials on October 19, 2010, using the search terms Neoplasms AND non-inferiority OR non-inferior OR noninferiority OR noninferior OR “non inferiority” OR “not inferior,” as well as a manual search mainly of reference lists. Three authors (Y. Kataoka, Y. Kinjo, and S. Tanaka) then screened the titles and abstracts to identify potentially relevant trials, and the final selections were made from reading the full texts. Inclusion criteria were English language; publication as a full-length article; randomized controlled trials; publication as a noninferiority trial, even if the trial was originally planned as a superiority trial; and chemotherapy, adjuvant therapy, surgery, or any other antineoplastic treatment. Articles describing only the design of the trial or results other than the primary results (e.g., quality-of-life data or pooled analysis) were excluded. We deleted duplicate publications (i.e., the same trial described in several articles).

### Data extraction

The entire text of each included article was evaluated in a structured fashion for prespecified attributes. The attributes included year of publication, type of malignancy, experimental and control treatments, primary endpoint, design (number of arms, factorial design or not, phase II or phase III), accrued sample size and determination, choice of analysis population (ITT, PPS, or both; full analysis set was classified as ITT), presence of an interim analysis, methods for noninferiority analysis, value and justification of noninferiority margins, and switching from superiority to noninferiority (i.e., a noninferiority analysis was conducted after the trial failed to show superiority, with or without a prespecified margin). We classified methods for noninferiority analysis into the conventional method, the effect retention method, others, and not specified. Only trials that reported retention proportions were regarded as effect retention methods. Trials that specified a noninferiority margin without further description were classified as not specified. When the primary endpoint was not explicitly specified, we considered the primary endpoint as the one used for sample size calculation or the endpoint listed first. Data were extracted independently by 2 of the 3 authors and were tabulated. Differences in the data were resolved through a consensus process.

### Evaluation of quality of reporting

We evaluated the quality of reporting by items specific to noninferiority trials in the extension of the CONSORT checklist (15). Specifically, we counted how many of the following items were reported in each article: items number 1 to 7 (title and abstract, background, participants, interventions, objectives, outcomes, and sample size); 12 (statistical methods); 16 (numbers analyzed); 17 (outcomes and estimation); and 20 (interpretation). These items include descriptions of a prespecified margin, sample size calculation, choice of analysis population, and interpretation of the results taking into account the noninferiority hypothesis. Item 20 was examined in terms of accordance between the authors' conclusions and fulfillment of the prespecified noninferiority criteria.

### Statistical analysis

Categorical variables were described with frequencies and percentages. Chronologic trends in the number of reported items in the extension of the CONSORT checklist and the proportion of trials with PPS or with both ITT and PPS were examined by tests for linear trend using univariate linear and logistic models, respectively. All *P*-values are 2-sided, and *P* < 0.05 is considered as significant. All statistical analyses were done using SAS version 9.2 (SAS Institute).

## Results

### Selection of articles

A total of 244 citations were identified (235 from PubMed, 106 from the Cochrane Central Register of Controlled Trials, and 8 by the manual search; see Supplementary Fig. S1). After excluding 172 articles that did not meet the inclusion criteria, we reviewed 72 articles. As shown in Fig. 2, an increasing trend in conducting this type of trial was observed (4.9 articles per year between 2001 and 2007, 12.7 articles per year between 2008 and 2010). The extracted data of the 72 articles are provided in Supplementary Tables S1 and S2.

### Characteristics of noninferiority trials

Table 1 summarizes the characteristics of the 72 noninferiority trials. Among them, 65 (90%) were randomized phase III trials with a single control arm, 3 (4%) were factorial phase III trials, and 4 (6%) were randomized phase II trials. All of the factorial designs were 2-by-2 comparisons with pooling of treatment arms, and multiplicity was not adjusted. A total of 46 (64%) were trials in lung, colorectal, or breast cancer, and 45 trials (63%) compared regimens of chemotherapy (33 first line, 12 second line). Other modalities included adjuvant or neoadjuvant chemotherapy (12 trials); radiotherapy (5 trials); surgery (4 trials); and multimodality (3 trials). Only 26 trials (36%) evaluated overall survival (OS) as a primary endpoint. Other primary endpoints included progression-free survival (PFS) or time to progression (17 trials, 24%); disease-free survival (DFS), recurrence-free survival (RFS), time to recurrence (TTR), or local recurrence (13 trials, 18%); objective response rate (ORR; 6 trials, 8%); biochemical or clinical failure (2 trials, 3%); and 8 other endpoints (clinical lesion response, complete resection, failure-free survival, lymph node yield, major cytogenetic response, sonographic response rate, proportion of patients without progression at 2 years, and suppression of testosterone). In most of the trials, the primary endpoints were time-to-event outcomes (58 trials, 81%) or binary outcomes (13 trials, 18%; data not shown).

### Quality of reporting

Figure 2 also shows that the number of reported items increased linearly over time (*P* < 0.01), suggesting chronologic improvement in the quality of reporting. All of the items listed above were reported in 35.3% (12/34) of the articles between 2001 and 2007, whereas this proportion increased to 50.0% (19/38) between 2008 and 2010. Overall, only 31 articles (43%) satisfied all of the items. One item was not reported in 22 articles (31%), and 2 or more items were lacking in 19 articles (26%). Importantly, 15 articles (21%) lacked item 20; that is, the authors claimed noninferiority when the results did not meet statistical criteria for noninferiority (7 articles), the noninferiority margin was not prespecified (5 articles), or they drew other misleading conclusions (3 articles).

### Switching from superiority and other statistical issues

A total of 9 trials (13%) switched from superiority to noninferiority. Among them, 5 trials (7%) conducted noninferiority analyses without prespecified margins. The margins were specified after the analysis of superiority had been conducted (4 trials) or after the start of patient enrollment (1 trial). The other 4 trials (6%) had a prespecified plan of switching from superiority to noninferiority. Three of the 4 trials adjusted for multiplicity by the Bonferroni method, the Hochberg method, or the closed testing procedure.

More than half of the 72 trials accrued 500 patients or more in total, and most had apparent statistical power of 80% or more. However, 11 trials (15%) calculated the sample size as a superiority trial, and 9 trials (13%) calculated it as a noninferiority trial under alternative hypotheses of superiority, that is, assuming that the experimental treatment was slightly better than the control; both practices suggest that these trials were potentially underpowered for testing the noninferiority hypothesis.

Alpha error rates of the 4 randomized phase II trials were generally high or underreported, although they had sufficient power (alpha/power: 5%/80%; ref. 17; 20%/75%; ref. 18; 20%/80%; ref. 19; and not reported/90%; ref. 20). The primary endpoints in 3 trials were not OS, but were ORR, suppression of testosterone, and PFS, respectively, making the noninferiority margins less clinically relevant. All the randomized phase II trials concluded noninferiority of the experimental treatments.

A total of 26 trials (36%) reported on a planned interim analysis, and 9 trials (13%) were terminated early. Among them, only 1 trial reported that a formal interim analysis of primary endpoint showed the efficacy of the experimental treatment. Four trials were stopped because of futility of the noninferiority comparison. Two trials were stopped because of toxicity of the experimental treatment. Two trials were stopped because of low accrual rate, and 1 of these trials concluded that the experimental regimen was promising despite the immaturity of the data. Of the 9 trials terminated early, only 2 trials used a factorial design or a multiarm randomized phase II design, and it was not clear whether the complexity of multiarm noninferiority trials makes early termination highly likely.

The analysis populations were ITT in 52 trials (72%) and PPS in 6 trials (8%), and 11 trials (15%) reported results of both ITT and PPS. The proportion of trials with PPS or both analysis populations increased chronologically; however, the increasing trend was not significant (*P* = 0.12). The proportion was still low [14.7% (5/34) between 2001 and 2007, and 31.6% (12/38) between 2008 and 2010]. Three trials (4%) did not report which analysis population was analyzed.

### Methods for selecting a noninferiority margin

Sixty-eight trials (94%) reported their analyses with the use of noninferiority margins (Table 2). Among them, 1 trial used both the conventional and effect retention methods, 17 trials used the conventional method, 5 trials used the effect retention method, and 45 trials used noninferiority margins but did not specify the method for selecting the margins. Two trials (3%) specified 2 values of margins. In the 18 trials using the conventional method, the margins were justified on the basis of efficacy (12 trials), trade-off between toxicity and efficacy (3 trials), or requirement for drug approval (3 trials). Only 7 of these trials provided references to historical data. A trial in breast cancer determined the noninferiority margin by a clinically least-acceptable HR of tegafur–uracil as compared with cyclophosphamide + methotrexate + 5-fluorouracil (5-FU; CMF), derived from a structured questionnaire among investigators (21). Among the 6 trials using the effect retention method, a retention proportion of 50% was most frequent (5 trials). Five trials cited historical data for the effect of the active control. We did not find any trials that explicitly took uncertainty in historical data into consideration.

Among the 55 trials that used noninferiority margins and primary time-to-event endpoints, 38 trials specified margins in terms of an HR, 15 trials specified them in terms of difference in survival function, and 2 trials reported margins according to both of these expressions. One trial used a Bayesian design, which calculates Bayesian predictive probability of the HR being greater than 0.8046 with the use of a noninformative prior (22). Two trials determined noninferiority by a *P*-value for a superiority test higher than 0.05 or 0.09. One trial was designed to test noninferiority, but the usual analysis for superiority was reported.
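The Bayesian criterion can be illustrated with a normal approximation on the log-HR scale: under a flat (noninformative) prior, the posterior for the log HR is centered at the observed value, so the probability that the true HR exceeds a threshold is a normal tail area. All numbers below are hypothetical, and the cited trial's actual predictive-probability calculation may differ:

```python
from math import log
from statistics import NormalDist

def posterior_prob_hr_above(hr_hat, se_log_hr, threshold):
    """P(true HR > threshold) under a flat prior on log HR and a
    normal approximation to the likelihood of the observed log HR."""
    z = (log(threshold) - log(hr_hat)) / se_log_hr
    return 1 - NormalDist().cdf(z)

# Hypothetical data: observed HR = 0.95 with SE(log HR) = 0.10;
# posterior probability that the true HR exceeds 0.8046:
p = posterior_prob_hr_above(0.95, 0.10, 0.8046)
print(round(p, 3))  # about 0.95
```

With a flat prior this posterior probability numerically mirrors a one-sided frequentist confidence statement, which is why noninformative-prior Bayesian criteria often track conventional CI-based rules closely.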

### Review in selected cancers

#### Advanced non–small cell lung cancer.

Eight trials of first-line chemotherapy (23–29) and 5 trials of second-line chemotherapy (7, 30–33) were published in advanced non–small cell lung cancer (NSCLC; Table 3). Active controls were platinum-doublet regimens in all the trials of first-line chemotherapy and docetaxel in all the trials of second-line chemotherapy. Ten of the 13 trials evaluated OS as a primary endpoint, and 3 recent trials used PFS. In the trials of first-line chemotherapy that evaluated OS, the margins ranged from 1.176 to 1.33. Three of them were larger than 1.3, an estimate of the HR of BSC compared with platinum-doublet therapy from a meta-analysis (34). Unlike those of the first-line setting, the margins in trials of second-line chemotherapy ranged from 1.11 to 1.25, which were much less than the inverse of the HR of docetaxel 75 mg/m², 1/0.56 = 1.79, although the 95% CI of that HR is wide (0.35–0.88; refs. 7, 35). In our classification, 6 trials did not meet statistical criteria for noninferiority, but 3 of the 6 trials nevertheless concluded noninferiority.

#### Advanced colorectal cancer.

Seven trials of first-line chemotherapy (4, 5, 36–40) and 3 trials of second-line chemotherapy (41–43) were published in advanced colorectal cancer (Table 3). Half of the 10 trials evaluated noninferiority of capecitabine-containing regimens. Molecular-targeting agents were not investigated. The primary endpoints were OS (3 trials), PFS (5 trials), and ORR (2 trials). In the trials in 2001 and 2002, 5-FU + leucovorin was selected as an active control, whereas 5-FU + leucovorin + biweekly oxaliplatin (FOLFOX4) or 5-FU + leucovorin + irinotecan (FOLFIRI) was selected in most of the recent trials. Capecitabine was approved by the FDA on the basis of the 2 trials published in 2001 (4, 5). The primary endpoint of these trials was ORR, but the FDA conducted a noninferiority analysis of OS using the 50% effect retention method (11). In this retrospective analysis, the FDA derived an HR of 5-FU alone compared with 5-FU + leucovorin of 1.26 (95% CI, 1.09–1.46) from a meta-analysis of 10 trials. Generally, noninferiority margins in advanced colorectal cancer were in a narrow range (1.18–1.33), but some were larger than the FDA's estimate. Two trials conducted post hoc noninferiority analyses. The conclusions of the other 8 trials were consistent with our classification.

#### Early breast cancer.

Six trials of adjuvant chemotherapy (21, 22, 44–47), 1 trial of neoadjuvant chemotherapy (48), and 2 trials of radiotherapy (49, 50) were published on early breast cancer (Table 3). Active controls in the adjuvant and neoadjuvant chemotherapy trials included CMF, doxorubicin + cyclophosphamide, or doxorubicin + cyclophosphamide + docetaxel. Only 1 trial evaluated OS. Seven of the 9 trials evaluated DFS, RFS, or local recurrence. Unlike the trials in advanced diseases, noninferiority margins were specified in terms of a difference in survival rate more frequently than an HR. The noninferiority margins in HR varied between 1.25 and 1.40, which were slightly larger than those in advanced NSCLC and colorectal cancer, but they seem to be small enough to prove superiority to no adjuvant therapy, given that CMF-based adjuvant chemotherapy decreases the annual recurrence rate of breast cancer by an HR of 1/1.69 (95% CI, 1.16–1.36; ref. 51). In our classification, 5 trials did not meet statistical criteria for noninferiority, but 2 of the 5 trials concluded noninferiority.

## Discussion

Our findings highlight critical deficiencies in design and reporting of noninferiority trials. We present 7 practical recommendations for this type of trial, as summarized in Table 4.

### (i) Prespecification of a noninferiority margin

The most critical error we found was post hoc noninferiority analysis, which was conducted in 7% of the examined trials. This error typically occurs when the results of a superiority trial are negative but show a relatively small difference in the primary endpoint. There are at least 2 criticisms of such an analysis. First, specification of the margin after viewing the results can produce an increase in the alpha error rate. Second, such a trial lacks statistical power, because the sample size was originally calculated to detect superiority. We strongly recommend specification of a noninferiority margin prior to the start of patient enrollment. A couple of trials had a prespecified plan of switching using adjustment by multiple testing methods, but such designs are not established and need further methodologic research.

### (ii) Methods for selecting a noninferiority margin

Our descriptive analysis showed that some trials selected margins that possibly were larger than the assured effects of the active controls. Although the conventional method for selection of margin was common, the effect retention method was also used in 8% of the trials, despite the fact that this idea is relatively novel. One reason could be the FDA requirement (1), but actually, this method has attractive features, particularly in oncology. Often, experimental agents are toxic, and the worst that could happen is that agents with no efficacy would be commonly used in clinical practice despite their toxicity. The effect retention method is tailored to avoid this risk. Further, the calculation is explicit and objective.

In the past, some experimental treatments were compared with established therapy, and an arbitrary clinical cutoff of an increased risk of 25% or 33% was used; that is, the margin was an HR of 1.25 or 1.33. Our findings indicate the risk of such an arbitrary cutoff. A careful examination of the assured effect of the active control is, therefore, necessary even when the margin is selected by the conventional method.

Strictly speaking, a formal effect retention method requires taking into account uncertainty in the estimate of the effect of the active control from historical evidence, such as the 95%–95% and the Bayesian approaches (11), but such consideration on uncertainty was rarely applied in practice (Table 2). Further methodologic research in this area is needed.

### (iii) Historical evidence on the effect of active control

A reliable estimate of the effect compared with BSC or other reference treatments is necessary to guarantee that the margin is no larger than the effect of the active control. Fleming (12) emphasized the need for an objective choice of evidence to avoid potential selection bias. He described a case study in which the noninferiority margin was selected by the trial sponsor on the basis of historical data after excluding a subset of patients, yielding substantially more impressive results (7, 12). The best way to reduce bias and random error would be to conduct a meta-analysis of historical trials selected systematically, but in practice, most of the trials we examined relied on results from only 1 trial or did not refer to historical data at all (Table 3).

### (iv) Alternative hypothesis in sample size calculation

We found that 9 noninferiority trials calculated the sample size under alternative hypotheses of superiority (Table 1). This practice leads to potential underpowering for testing the noninferiority hypothesis. The sample size of a noninferiority trial should be calculated to provide adequate power under the alternative hypothesis of equivalence unless there is a clinically and scientifically valid reason to do otherwise.
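A rough calculation (Schoenfeld approximation, hypothetical numbers) shows how severe the loss of power can be: a trial with a margin HR of 1.25 needs about 291 events for 80% power if the alternative assumes the experimental arm is slightly better (true HR = 0.9), but that same trial has far less power when the treatments are truly equivalent:

```python
from math import log, sqrt
from statistics import NormalDist

nd = NormalDist()
z_alpha = nd.inv_cdf(0.975)   # one-sided alpha = 2.5%

margin = 1.25
events = 291                  # sized for 80% power assuming true HR = 0.9

# Power to conclude noninferiority if the treatments are truly
# equivalent (true HR = 1); 1:1 randomization:
power_equiv = nd.cdf(sqrt(events / 4) * log(margin) - z_alpha)
print(round(power_equiv, 2))  # about 0.48, far below the nominal 80%
```

In other words, sizing under a superiority-type alternative roughly halves the power available in the realistic scenario the trial is meant to address.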

### (v) Randomized phase II noninferiority design

This review suggested that the major differences between randomized phase II noninferiority trials and phase III noninferiority trials are alpha error rate and primary endpoint. Unlike a phase II superiority trial, a large alpha error rate in a phase II noninferiority trial is not reasonable, because it leads to a risk of treating future patients with an unacceptably inferior treatment. Moreover, in phase II settings, that is, screening of promising antineoplastic agents, short-term endpoints such as ORR or other biomarkers for antitumor activity are often required. Unfortunately, such endpoints do not reflect patients' benefit directly. If the noninferiority margin is not clinically relevant, a randomized phase II noninferiority design does not seem to be advantageous over conventional single-arm or randomized phase II designs.

### (vi) Analysis population

We found that the use of PPS or of both ITT and PPS was not common in past noninferiority trials. ITT analysis is generally preferable as a conservative approach in superiority trials but not in noninferiority trials. If, for example, violations in eligibility criteria or cross-over from 1 regimen to the other are present, noninferiority can be more easily shown by ITT analysis because the treatment effect would be diluted. Therefore, regulatory authorities suggest that it is important to conduct both ITT and PPS analyses and that differences in results using the 2 analyses will need close examination (1).
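The dilution can be seen with simple arithmetic (hypothetical response rates; a sketch, not a model of any real trial): with crossover, the ITT estimate for the experimental arm is a mixture of the two treatments, so a truly inferior treatment can appear to fall within the margin:

```python
# Hypothetical binary outcome: true response rates of 50% (control)
# vs. 40% (experimental); noninferiority margin = 8 percentage points.
p_control, p_experimental = 0.50, 0.40
margin = 0.08

# The true difference, 10 points, exceeds the margin: inferior.
true_diff = p_control - p_experimental

# ITT analysis with 30% of experimental-arm patients crossing over
# to the control treatment: the observed experimental-arm rate is a
# mixture of the two true rates.
crossover = 0.30
p_itt = (1 - crossover) * p_experimental + crossover * p_control  # 0.43
itt_diff = p_control - p_itt                                      # 0.07

print(true_diff > margin, itt_diff < margin)  # True True
```

A PPS analysis, which excludes the crossover patients, would recover the 10-point difference here, which is why discrepancies between the ITT and PPS results deserve close scrutiny.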

### (vii) Reporting

Inherent complexities in noninferiority trials make it difficult to communicate their results accurately and fairly. The major deficiency in reporting is a misleading conclusion. We suggest that drawing a conclusion by simply comparing the obtained results with the prespecified margin avoids this problem. We also found that multiple noninferiority margins (e.g., ref. 7) can be an obstacle to interpretation and communication. The CONSORT checklist should reflect these findings. Despite their fundamental role, the calculation of noninferiority margins and references to historical data were often underreported. Only 37% of the trials provided sufficient information to distinguish between the conventional method and the effect retention method (Table 2). To clarify the validity of a study design, justification for the margin should be reported with reference to historical data in addition to the items in the CONSORT checklist.

Finally, several limitations of this study warrant mention. First, the citations we identified may not be exhaustive. Second, modifications of design, such as switching from superiority to noninferiority, were possibly missed if the detail was not reported. Third, we could not examine the effect of selection of inappropriate patients, poor compliance with intended treatments, and insufficient follow-up, although these issues are important in noninferiority trials.

## Disclosure of Potential Conflicts of Interest

No potential conflicts of interest were disclosed.

## Footnotes

**Note:** Supplementary data for this article are available at Clinical Cancer Research Online (http://clincancerres.aacrjournals.org/).

- Received June 28, 2011.
- Revision received December 24, 2011.
- Accepted January 9, 2012.

- ©2012 American Association for Cancer Research.