Using Diversity for Classifier Ensemble Pruning: An Empirical Investigation

The concept of 'diversity' has been one of the main open issues in the field of multiple classifier systems. In this paper we address a facet of diversity related to its effectiveness for ensemble construction, namely, explicitly using diversity measures for ensemble construction techniques based on the kind of overproduce and choose strategy known as ensemble pruning. Such a strategy consists of selecting the (hopefully) more accurate subset of classifiers out of an original, larger ensemble. Whereas several existing pruning methods use some combination of individual classifiers' accuracy and diversity, it is still unclear whether such an evaluation function is better than the bare estimate of ensemble accuracy. We empirically investigate this issue by comparing two evaluation functions in the context of ensemble pruning: the estimate of ensemble accuracy, and its linear combination with several well-known diversity measures. This can also be viewed as using diversity as a regularizer, as suggested by some authors. To this aim we use a pruning method based on forward selection, since it allows a direct comparison between different evaluation functions. Experiments on thirty-seven benchmark data sets, four diversity measures and three base classifiers provide evidence that using diversity measures for ensemble pruning can be advantageous over using only ensemble accuracy, and that diversity measures can act as regularizers in this context.

Using Diversity for Classifier Ensemble Pruning. . .

Introduction
During twenty years of research in the classifier ensemble field, understanding the notion of diversity has been one of the main goals [1,2]. A general agreement exists on the qualitative definition of diversity and on its role in classifier ensembles; basically, to obtain an effective (accurate) ensemble, its members should be as accurate and diverse as possible, where 'diverse' means that they should not make coincident errors [1,2]. Individual accuracy and diversity are well known to be contrasting goals, which means that a trade-off between them has to be achieved. On the other hand, formally defining and measuring diversity, as well as explicitly using it for ensemble construction, turned out not to be straightforward.
A number of diversity measures have been proposed over the years [1,2,3]. Most measures have been derived intuitively, as attempts to formally characterize the pattern of error of individual classifiers (e.g., the Double-Fault and Disagreement measures [2]). In particular, it has been clearly pointed out that diversity measures alone cannot be monotonically related to ensemble accuracy, since the latter depends on a trade-off between diversity and individual classifiers' performance [2,4]. For instance, searching for a diversity measure strongly related to ensemble performance runs the risk of 'replacing a simple calculation of the ensemble error by a clumsy proxy which we call diversity' [2] (ch. 8). A few other measures have been inspired by exact error decompositions derived in the regression field, despite the lack of a direct analogy to classification problems [5]. The Kohavi-Wolpert Variance [3] (and our attempt in [6]) was inspired by the bias-variance-covariance error decomposition of [7]. The measure derived in [8] (which we extended in [6]) was inspired by the ambiguity decomposition of [9], and provided useful insights, leading to the concept of 'good' and 'bad' patterns of diversity. Such measures were motivated by the goal of obtaining exact, additive decompositions of the ensemble error into terms accounting for individual classifiers' performance, and terms hopefully interpretable as diversity. Several authors also analyzed, empirically or analytically, the connection between ensemble performance on one side, and the pattern of individual classifiers' performance and existing diversity measures on the other side (e.g., [4,10]). Such a relationship turned out to be far from clear-cut, and no 'right' diversity measure has emerged so far.
Besides theoretical investigations on defining diversity and using this concept to explain ensemble performance, a considerable research effort has been spent toward the practical goal of explicitly using diversity measures for ensemble construction. Among existing methods, almost all follow the overproduce and choose approach. It consists of first generating a large ensemble (e.g., using Bagging) and then selecting the most accurate subset of classifiers. The overproduce and choose approach is also known as ensemble pruning, selection or thinning. It is supported by theoretical and empirical evidence showing that a (suitable) subset of the available classifiers could outperform the original ensemble [11,12,13].
Since ensemble pruning has exponential complexity in the size of the original ensemble, several heuristics have been proposed. In this context, diversity measures have been used in the objective function of pruning methods, to attain a trade-off between individual classifiers' performance and diversity. The effectiveness of using diversity measures to this aim has however been questioned by several authors, based also on empirical evidence [3,4,13], and [2] (ch. 8.3). In particular, its actual advantage over directly evaluating ensemble performance (estimated, e.g., from validation data) is not clear yet. It is also well known that popular and effective ensemble construction techniques like Bagging and Boosting do not use any explicit diversity measure. Nevertheless, despite the questionable effectiveness of heuristic pruning approaches, a theoretically grounded analysis in [14] related to ensembles of binary classifiers combined by majority voting has shown that (a suitable measure of) diversity can have a regularization effect in ensemble pruning.
Based on the above premises, the aim of this work is to compare the effectiveness of explicitly using existing diversity measures in ensemble pruning, against the direct estimation of ensemble performance. This is a follow-up of our preliminary work [15]. In particular, inspired by [14], we evaluate whether several well-known diversity measures can have a regularization effect on the (estimate of) ensemble accuracy. To this aim we consider a pruning method based on the forward selection (FS) algorithm, since it allows a direct comparison between evaluation functions. We then compare the estimated ensemble accuracy against its linear combination with a given diversity measure, using the latter as a regularizer. We carry out experiments on 37 benchmark data sets. We use the popular Bagging as the ensemble construction technique and majority voting as the fusion rule, and evaluate a subset of the ten well-known diversity measures analyzed in [3]. Our results show that using diversity measures for ensemble pruning can be advantageous over using only ensemble accuracy, and that diversity measures can act as regularizers in this context.

Previous Work on Using Diversity for Ensemble Design
As pointed out in Sec. 1, diversity measures have been explicitly used so far for ensemble construction only in pruning methods. The only exception is [16], where a diversity measure was used in an ensemble learning algorithm.
In [17] ensemble pruning methods have been categorized as follows:
• Ranking-based: individual classifiers are first ranked according to some criterion, and then the top-L ones are selected as the final ensemble.
• Clustering-based: individual classifiers are first clustered based on the similarity of their predictions; each cluster is then pruned to remove redundant classifiers, and the remaining ones in each cluster are finally combined.
• Optimization-based: methods search for a subset of the original ensemble that optimizes a given objective function, which can include a diversity measure. To avoid exhaustive search, three main heuristic search strategies have been proposed: hill climbing, genetic algorithms, and semi-definite programming.
Given an initial ensemble E, FS picks the best individual classifier and then iteratively selects, among the remaining classifiers, the one that maximizes a given objective function. It stops either when a predefined ensemble size is reached, or when all the classifiers from the original ensemble have been selected; in the latter case, FS returns the best ensemble among the ones obtained at each iteration. The BS algorithm works similarly, iteratively removing from E one classifier at a time. More refined versions of FS/BS have also been proposed, which include a back-fitting step [19].
In the context of optimization-based pruning, three kinds of objective functions have been proposed so far:
• The ensemble accuracy [19,21], combined with a diversity measure in [14].
• A given diversity measure (disregarding the performance of individual classifiers and of the ensemble) [19,20,23].
• Ad hoc measures specifically devised for ensemble pruning, which combine into a single scalar the individual classifiers' performance and the complementarity (diversity) between their errors [18,22,23,24].
A different and theoretically grounded view on the role of diversity in ensemble pruning was proposed in [14], in the context of ensembles of binary classifiers combined by majority voting: using a suitable diversity measure it was shown that promoting diversity can be seen as a regularization technique. A pruning method was also proposed based on these results, which exploits a strategy similar to FS: it starts with the most accurate classifier from the original ensemble, then iteratively sorts the remaining classifiers based on their diversity (evaluated using the proposed measure) with the current sub-ensemble, and among the most diverse ones it selects the classifier which leads to the next most accurate sub-ensemble.
Cavalcanti et al. [25] tackle the problem of diversity measures for ensemble pruning using a genetic algorithm. Another method for ensemble pruning, based on a margin- and diversity-based measure, is proposed by Guo et al. in [26].
It is also worth mentioning two ensemble construction techniques [27,28] which are not pruning techniques but are related to the pruning criteria considered in this work. They consist of building individual classifiers from different subsets of the available features, analogously to the well-known Random Subspace Method (RSM) [29]. The difference with respect to RSM is that they use a feature selection criterion analogous to the optimization-based pruning criterion mentioned above (including FS in [28]), and evaluate the individual classifiers on the basis of a trade-off between individual classifiers' accuracy and diversity. In particular, in [28] a linear combination of these two quantities was used as the objective function, and five different measures of diversity were considered.
In our previous work [15] we carried out a preliminary comparison between using the ensemble accuracy as the evaluation measure and using existing, ad hoc measures proposed for pruning methods, which combine the performance of the individual classifiers (not of the ensemble) and the complementarity between their errors. In this work we carry out a direct comparison of ensemble accuracy against its combination with well-known diversity measures that do not include individual classifiers' performance, and are not specifically devised for ensemble pruning.

Aim of this work
As mentioned in Sec. 1, many existing ensemble pruning methods use heuristic evaluation functions that combine the performance of individual classifiers and some measure of their diversity.
It is then interesting to understand whether and under what conditions such evaluation functions are more effective (in terms of the performance of the resulting ensemble) than directly evaluating the performance of the considered ensembles (estimated, e.g., from validation data) during the pruning procedure. Quite surprisingly, so far such a comparison has been carried out by only a few authors [14,19,22,23,24], and only with a limited scope. In particular, it was often limited to the proposed evaluation measure, and used different and incomparable experimental set-ups (i.e., different data sets, base classifiers, ensemble construction methods, etc.). We also point out that, among these works, only in [14,24] did the use of the proposed evaluation functions provide a statistically significant improvement over a direct estimation of ensemble performance.
To sum up, so far no clear evidence has been provided about the effectiveness of using diversity measures for ensemble pruning. A notable exception is the work of [14], where an original view of the role of diversity as a regularizer in ensemble design was proposed and theoretically investigated, in the case of binary classifiers combined by majority voting, and with a specific diversity measure. Their theoretical results showed that promoting diversity during ensemble design can actually have a regularization effect. Based on these results, a specific ensemble pruning method was then proposed in [14].
Based on the above premises, and inspired by [14], the aim of this work is to investigate whether existing diversity measures can also have a regularization effect in ensemble pruning, with respect to the (estimate of) ensemble accuracy. More precisely, we consider two evaluation functions: ensemble accuracy A alone, and its linear combination with a given diversity measure D, given by A + λD (with λ > 0), which is the usual form of regularization terms.
To carry out a direct comparison between such evaluation functions we consider a pruning method based on the forward selection (FS) algorithm. We first build an ensemble of N classifiers using a given ensemble construction technique, then we use FS to obtain a subset of L < N classifiers, for a given L. We consider the basic version of FS: it starts with the best (estimated) individual classifier of the original ensemble, then it iteratively selects from the remaining classifiers the one that provides the best evaluation function (either A or A + λD) on the new candidate ensemble. The pseudo code is shown in Alg. 1.
Require: an ensemble E of N classifiers; a desired ensemble size L < N; a validation set V; an objective function f_obj (to be computed on V).
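The body of Alg. 1 is not reproduced above, so the following is only a minimal Python sketch of the basic FS procedure as described in the text, not the authors' pseudo code. The function name `forward_selection` and the two callables `individual_score` (validation accuracy of a single classifier) and `f_obj` (evaluation of a candidate sub-ensemble, e.g. A or A + λD) are hypothetical names introduced here for illustration; classifiers are referred to by their index in E.

```python
from typing import Callable, List


def forward_selection(
    n_classifiers: int,
    target_size: int,
    individual_score: Callable[[int], float],
    f_obj: Callable[[List[int]], float],
) -> List[int]:
    """Greedy forward selection over classifier indices 0..n_classifiers-1.

    individual_score(i) and f_obj(subset) are assumed to be computed on the
    validation set V; both are placeholders supplied by the caller.
    """
    if not 0 < target_size < n_classifiers:
        raise ValueError("target_size must satisfy 0 < L < N")

    remaining = list(range(n_classifiers))

    # Step 1: start from the best (estimated) individual classifier.
    first = max(remaining, key=individual_score)
    selected = [first]
    remaining.remove(first)

    # Step 2: repeatedly add the classifier giving the best objective value
    # for the enlarged candidate ensemble, until L classifiers are selected.
    while len(selected) < target_size:
        best = max(remaining, key=lambda i: f_obj(selected + [i]))
        selected.append(best)
        remaining.remove(best)

    return selected
```

Because the objective is passed in as a callable, the same search can be run once with A and once with A + λD, which is the kind of direct comparison the text refers to.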

Diversity Measures
In this section we describe the diversity measures used in this work. We started from the ten measures analyzed in [3]: Q-statistic (Q), Correlation coefficient (ρ), Disagreement (Dis), Double-fault (DF), Kohavi-Wolpert variance (KW), Interrater agreement (κ), Entropy (E), Difficulty (θ), Generalised diversity (GD) and Coincident failure diversity (CFD). They include pairwise measures (defined on two classifiers), namely Q, ρ, Dis and DF, and non-pairwise measures (defined on a classifier ensemble of any size), namely E, KW, κ, θ, GD and CFD. They also include measures that require the true label of the samples on which they are computed (all except E and Dis), and measures that do not (E and Dis). For pairwise measures, the diversity of an ensemble of more than two classifiers is computed as the average value over all distinct pairs of ensemble members.
In [3] it was observed that some of the considered measures are strongly correlated (positively or negatively). We therefore decided to select only a subset of the least correlated measures. To this aim we estimated the correlation between all pairs of such measures by simulating the outputs of two binary classifiers on 1,500 input instances. For both classifiers we randomly and independently generated 1,500 binary values (0 and 1) from a uniform distribution. For diversity measures defined in terms of classification outcomes (correct/incorrect, which requires the true class label to be known), these values represent incorrect (0) or correct (1) decisions; for diversity measures defined in terms of classifier decisions (namely, Entropy and Disagreement), they represent the predicted labels of a two-class problem (which does not require the true class labels). We repeated the above procedure twenty times, and evaluated the correlation coefficient between every distinct pair of diversity measures.
These values are reported in Tab. 1. It is worth noting that our results qualitatively agree with the ones reported in [3], although they have been obtained using different data.
Based on these results, we first selected the two least correlated measures, i.e., θ and DF (their correlation is 0.4492, see Tab. 1). All the other measures exhibit a quite high correlation with either θ or DF. Among them, we selected two further measures exhibiting the lowest maximum correlation with θ and DF, which turn out to be Dis and GD. Note that the four selected measures include pairwise and non-pairwise measures, as well as measures defined in terms of classification outcomes and in terms of classifier decisions. We report their definition for completeness (see [3] for the definition of the other measures).
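The simulation described above can be sketched as follows for a single pair of measures (DF and Dis). This is an illustrative sketch only: the function names are hypothetical, DF and Dis are computed with their standard definitions from the literature (fraction of instances on which both outputs are 0, and fraction on which the two outputs differ), and the same 0/1 draws are used for both measures as a simplification.

```python
import random
from math import sqrt


def double_fault(out1, out2):
    # Fraction of instances on which both outputs are 0 (both "incorrect").
    return sum(1 for x, y in zip(out1, out2) if x == 0 and y == 0) / len(out1)


def disagreement(out1, out2):
    # Fraction of instances on which the two outputs differ.
    return sum(1 for x, y in zip(out1, out2) if x != y) / len(out1)


def pearson(xs, ys):
    # Plain Pearson correlation coefficient between two equal-length lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def simulate_correlation(n_instances=1500, n_repeats=20, seed=0):
    """Correlation between DF and Dis over repeated random simulations."""
    rng = random.Random(seed)
    df_values, dis_values = [], []
    for _ in range(n_repeats):
        # Independent, uniformly distributed binary outputs of two classifiers.
        out1 = [rng.randint(0, 1) for _ in range(n_instances)]
        out2 = [rng.randint(0, 1) for _ in range(n_instances)]
        df_values.append(double_fault(out1, out2))
        dis_values.append(disagreement(out1, out2))
    return pearson(df_values, dis_values)
```

Repeating the last step for every distinct pair of measures would give a correlation table of the kind reported in Tab. 1; the values obtained depend on the random seed and are not expected to reproduce the paper's figures.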
Consider two classifiers C_1 and C_2, and let m be the number of instances on which these measures are computed, a the number of instances correctly classified by both C_1 and C_2, b the number of instances correctly classified only by C_1, c the number of instances correctly classified only by C_2, d the number of instances incorrectly classified by both C_1 and C_2, and p_i the accuracy of C_i (i = 1, 2) estimated on the same set of instances. DF is a pairwise measure proposed in [30], GD is a non-pairwise measure proposed in [31], and Dis is a pairwise measure proposed in [32]. Finally, Difficulty (θ) is a non-pairwise measure proposed in [33], which is defined as the variance of the pairwise Dis measure computed for all distinct pairs of classifiers.
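The defining equations are not reproduced in this text. For reference, the forms commonly given in the diversity literature (e.g., in [3]) can be written with the counts a, b, c, d and m introduced above. This is a reconstruction under the assumption that the standard definitions are the ones intended; the symbols q_i, q(1), q(2) and the use of L for the number of classifiers in the evaluated ensemble are introduced here only for illustration.

```latex
\begin{align}
  DF  &= \frac{d}{m}, \qquad
  Dis  = \frac{b + c}{m}, \qquad m = a + b + c + d, \\
  GD  &= 1 - \frac{q(2)}{q(1)}, \qquad
  q(1) = \sum_{i=1}^{L} \frac{i}{L}\, q_i, \qquad
  q(2) = \sum_{i=1}^{L} \frac{i}{L}\,\frac{i-1}{L-1}\, q_i ,
\end{align}
```

Here q_i denotes the probability that exactly i out of the L classifiers fail on a randomly drawn instance. Under these definitions larger values of Dis and GD indicate higher diversity, whereas larger values of DF indicate more coincident errors.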

Experimental Setting
As explained in Sec. 3, the aim of our experiments is to compare two ensemble evaluation functions for ensemble pruning, using the basic FS pruning strategy described in Alg. 1: the ensemble performance, evaluated as the classification accuracy A estimated from validation data, and its linear combination with a given diversity measure D evaluated on the same validation set, A + λD, with λ > 0.
To this aim we create an initial ensemble E composed of N = 100 classifiers, and prune it to an ensemble of L classifiers, with L = 5, 15, 25, 35, using the FS algorithm. We used Bagging to obtain E, as it is a well-known ensemble creation technique, and has already been used to this aim for ensemble pruning, e.g., [11,34]. We used majority voting as the combining rule, since it is the standard choice for Bagging [35].
In our experiments we used three different base classifiers: Multi-Layer Perceptron Neural Networks (NN), Decision Trees (DT) and K-Nearest Neighbors (K-NN). We used their standard Matlab implementation (Neural Networks and Statistics and Machine Learning Toolboxes). In particular, for NNs we used the patternnet function with a learning rate η = 0.05, gradient descent with momentum as the learning algorithm, and a maximum of 1000 epochs as a stop criterion. For DTs we used the Gini impurity criterion, the χ² stopping criterion, and the default threshold equal to 1 for the pre-pruning stopping criterion. For K-NN we used K = 1.
In the evaluation function A + λD we used several values of λ: 0.2, 0.5, and 0.7. We also considered the four diversity measures chosen in Sec. 4: DF, θ, Dis and GD.
We carried out our experiments on 37 benchmark data sets from the UCI Machine Learning Repository Database, containing only numerical attributes and no missing values (see Tab. 2). They represent a remarkable range of classification problems: the number of patterns ranges from 160 to 10992, the number of classes from 2 to 10, and feature set size from 2 to 85. We randomly subdivided each data set, using stratified sampling, into a training set, a validation set and a test set. The size of the training set is defined as explained in Sec. 5.1. The size of the validation set was chosen as 1/3 of the training set, and the remaining instances were used as the testing set. We repeated this procedure for 20 runs, and evaluated the resulting average accuracy on testing samples.
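To make the two evaluation functions concrete, the sketch below computes the majority-voting accuracy A of a sub-ensemble on validation data and combines it with a diversity term as A + λD. It is a minimal illustration under stated assumptions: the function names are hypothetical, the diversity term is the pairwise disagreement averaged over all distinct pairs (in its standard form, the fraction of instances on which two classifiers differ), and voting ties are resolved in favour of the label encountered first, which is an arbitrary choice not specified in the text.

```python
from itertools import combinations


def majority_vote_accuracy(predictions, labels):
    """predictions[k][j] is the label predicted by classifier k for instance j."""
    correct = 0
    for j, true_label in enumerate(labels):
        votes = {}
        for preds in predictions:
            votes[preds[j]] = votes.get(preds[j], 0) + 1
        # max() returns the first label with the highest count (tie rule).
        winner = max(votes, key=votes.get)
        correct += int(winner == true_label)
    return correct / len(labels)


def mean_pairwise_disagreement(predictions):
    """Disagreement averaged over all distinct pairs of ensemble members."""
    pairs = list(combinations(predictions, 2))
    if not pairs:
        return 0.0  # a single classifier has no pairwise diversity
    total = 0.0
    for p, q in pairs:
        total += sum(1 for u, v in zip(p, q) if u != v) / len(p)
    return total / len(pairs)


def regularized_objective(predictions, labels, lam=0.5):
    """Evaluation function A + lambda * D on the same validation data."""
    accuracy = majority_vote_accuracy(predictions, labels)
    diversity = mean_pairwise_disagreement(predictions)
    return accuracy + lam * diversity
```

Setting `lam=0` recovers the accuracy-only evaluation function. For measures where smaller values indicate higher diversity (such as DF), the text does not state how the sign of the diversity term is handled, so that case is deliberately left out of this sketch.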
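The data subdivision described above can be sketched as follows. This is a hedged illustration, not the authors' code: `stratified_split` and its parameters are hypothetical names, the split is performed class by class to preserve class proportions, and rounding of the per-class set sizes is an assumption made here.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction, seed=0):
    """Return (train, validation, test) index lists.

    Within each class, a fraction train_fraction of the instances goes to the
    training set, one third of that amount to the validation set, and the
    remaining instances to the test set.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, validation, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(train_fraction * len(indices))
        n_val = round(n_train / 3)  # validation set is 1/3 of the training set
        train += indices[:n_train]
        validation += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, validation, test
```

Calling the function with 20 different seeds would mimic the 20 runs mentioned in the text; the training fraction itself is chosen per data set as described in Sec. 5.1.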

Choice of the training set size
For each data set we chose the training set size that maximizes the (estimated) difference between the highest and lowest accuracy attained by different ensembles of a given size L. The rationale is that, if all ensembles of L classifiers obtained from the initial ensemble E exhibit a similar accuracy, it becomes difficult to evaluate the difference (if any) between different pruning methods (in our case, different evaluation functions used in the same pruning method). Fig. 1 illustrates the idea.
To this aim we carried out preliminary experiments, considering training set sizes ranging from 1% to 70% of the whole data set. For NNs, we also considered different numbers of hidden units, between 3 and 20. Since considering different ensemble sizes L is computationally costly, and considering all possible subsets of size L of a given ensemble is obviously infeasible, we only considered ensembles of size L = N/2 = 50, and estimated the performance of the best and worst such ensembles with that of the ensembles made up of the L best and of the L worst individual classifiers, respectively.
The resulting training set sizes used in the rest of our experiments are shown in Tab. 3. For NNs the number of hidden units is also shown.
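The gap estimate described above can be sketched as follows, under the assumptions that ensembles are combined by majority voting and that individual classifiers are ranked by their accuracy on the same data. The names `majority_vote_accuracy` and `accuracy_gap` are hypothetical, and voting ties are resolved in favour of the first label encountered.

```python
def majority_vote_accuracy(predictions, labels):
    """predictions[k][j] is the label predicted by classifier k for instance j."""
    correct = 0
    for j, true_label in enumerate(labels):
        votes = {}
        for preds in predictions:
            votes[preds[j]] = votes.get(preds[j], 0) + 1
        winner = max(votes, key=votes.get)  # first label with the top count
        correct += int(winner == true_label)
    return correct / len(labels)


def accuracy_gap(predictions, labels, size):
    """Accuracy of the 'size' best minus that of the 'size' worst classifiers.

    The two ensembles stand in for the (intractable) best and worst
    ensembles of the given size.
    """
    individual = [
        sum(1 for p, y in zip(preds, labels) if p == y) / len(labels)
        for preds in predictions
    ]
    # Classifier indices sorted from most to least accurate.
    order = sorted(range(len(predictions)), key=lambda k: individual[k],
                   reverse=True)
    best = [predictions[k] for k in order[:size]]
    worst = [predictions[k] for k in order[-size:]]
    return (majority_vote_accuracy(best, labels)
            - majority_vote_accuracy(worst, labels))
```

The training set size (and, for NNs, the number of hidden units) would then be the candidate configuration for which this gap is largest.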

Statistical test
To compare the two considered ensemble pruning evaluation functions we carried out a test of statistical significance between the corresponding average test set accuracy over the different runs of our experiments. To this aim we chose the Wilcoxon signed-rank test, as it is recommended in [36] for comparing two algorithms over multiple data sets, which is the setting considered in our experiments. This is a non-parametric statistical hypothesis test that can be used to determine whether two dependent samples were drawn from populations having the same distribution. This test is used to evaluate the statistical significance of the obtained results, i.e., whether it is possible to reject the null hypothesis that the observed values (in our case, the accuracies obtained by different ensembles) are different only by chance. We used a significance level of 0.05.
Table 2. Characteristics of the data sets.
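As an illustration of the test, the sketch below computes the Wilcoxon signed-rank statistic for two paired lists of per-data-set accuracies and a two-sided p-value. It is a simplified sketch, not the procedure used by the authors: the p-value relies on the normal approximation without tie or continuity corrections, zero differences are discarded, and the function name is hypothetical. Statistical libraries (for example `scipy.stats.wilcoxon`) provide more complete implementations.

```python
from math import erfc, sqrt


def wilcoxon_signed_rank(x, y):
    """Return (w_plus, w_minus, p_value) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    # Normal approximation of the null distribution of the rank sum.
    mean = n * (n + 1) / 4
    std = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / std
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail probability
    return w_plus, w_minus, p_value
```

With the paired accuracies of the two evaluation functions over the data sets as `x` and `y`, the null hypothesis would be rejected when `p_value` falls below 0.05, and comparing `w_plus` with `w_minus` indicates which of the two performed better.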

Experimental Results
For each pruned ensemble size L, base classifier, diversity measure and value of λ, Tab. 4 shows the results of our experiments in terms of the statistical significance of the difference in test set accuracy of the FS pruning method implemented using the two considered evaluation functions. More precisely, the null hypothesis is that there is no difference between these evaluation functions. In Tab. 4, entries marked with 'A' mean that, for the corresponding pruned ensemble size, base classifier, diversity measure and value of λ, using only ensemble accuracy (estimated from validation data) as the evaluation function is significantly better (according to the Wilcoxon signed-rank test) than using its linear combination with the diversity measure. Entries marked with 'D' mean the opposite (the latter evaluation function is significantly better than the former). We point out that the null hypothesis has always been rejected; therefore, every entry of Tab. 4 is marked with either 'A' or 'D'. These results provide quite strong evidence that a linear combination of ensemble accuracy and of a diversity measure between ensemble members outperforms the use of ensemble accuracy alone as the pruning evaluation function, to a statistically significant extent.
Table 3. For each data set, the number of hidden units for the NN base classifiers (second column) and the training set size for the three base classifiers (NNs, DTs and k-NNs).
The table clearly shows that using A + λD as the evaluation function in the FS algorithm provides statistically significantly better pruned ensembles than using accuracy alone, in almost all the considered cases. The only exceptions can be observed for the largest considered ensembles (L = 35) of DT classifiers, when DF and θ were used as diversity measures, and the λ coefficient was 0.2 and 0.5; and for ensembles of various sizes of NN classifiers, when the other two diversity measures (Dis and GD) were used and the λ coefficient was 0.5 and 0.7. It is also worth noting that the A + λD evaluation function always outperformed its counterpart A for ensembles of K-NN classifiers, and with the only exception of the largest ensembles (L = 35) for the DT classifier. With regard to the diversity measures, using DF, θ and GD in the A + λD evaluation function turned out to be worse than using A alone only for 2 out of the 108 combinations of pruned ensemble size, base classifier and value of λ (3 diversity measures, 4 ensemble sizes, 3 base classifiers and 3 values of λ); using Dis, this happened for 4 out of the 36 combinations. Due to the lack of space, we have not included the detailed results in the paper; they are available on the Pralab website.
As far as our experiments are concerned, we can conclude that well-known, 'generic' ensemble diversity measures (i.e., not specifically devised for ensemble pruning) seem to be useful when used together with ensemble accuracy as the pruning evaluation function. In particular, such diversity measures seem to act as regularizers of the estimated ensemble accuracy, which is in agreement with the more specific results of [14].

Conclusions
Whereas the usefulness of diversity measures for ensemble construction has been questioned by some authors, their specific role as regularizers has been recently pointed out in [14] based on theoretical results as well as on empirical evidence in the context of ensemble pruning, although in a specific setting (binary classifiers, and an ad hoc diversity measure). As a follow-up of our preliminary work [15], in this paper we investigated the effectiveness of well-known, generic diversity measures in ensemble pruning. In particular, we considered their use in the ensemble evaluation function of pruning methods based on the forward selection strategy, by linearly combining them with ensemble accuracy (estimated from validation data). This can be viewed as using diversity measures as regularizers, in the spirit of [14].
As far as our experiments are concerned, our empirical results provided evidence that generic ensemble diversity measures can also be useful when used together with ensemble accuracy as the pruning evaluation function. This is in agreement with the results we obtained in [15], related to ad hoc evaluation functions proposed by other authors for ensemble pruning, which combine individual classifiers' (not ensemble) accuracy and diversity (more precisely, complementarity between their errors). Our results also show that generic diversity measures can have a regularization effect on the estimated ensemble accuracy, in the context of ensemble pruning. This provides some evidence that the results of [14], related to a specific diversity measure, could be extended to generic diversity measures.

Figure 1
Qualitative illustration of the criterion used for choosing the training set size and the number of hidden units in NN classifiers (X axis): maximizing the accuracy gap between the best and the worst ensemble of a given size (see text for the details).

Table 1
Correlation coefficient between each pair of the diversity measures considered in [3].

Table 4
Outcome of the statistical significance test for the comparison between the use of the evaluation functions A and A + λD (see text) for ensemble pruning, for several ensemble sizes L, values of λ, base classifiers and diversity measures. 'A' means that the evaluation function A is statistically significantly better than A + λD, 'D' means the opposite (see text for the details).