Sparsity-driven weighted ensemble classifier

In this letter, a novel weighted ensemble classifier is proposed that improves classification accuracy while minimizing the number of classifiers. The ensemble weight finding problem is modeled as a cost function with the following terms: (a) a data fidelity term aiming to decrease the misclassification rate, (b) a sparsity term aiming to decrease the number of classifiers, and (c) a non-negativity constraint on the weights of the classifiers. The proposed cost function is non-convex and hard to solve; thus, convex relaxation techniques and novel approximations are employed to obtain a numerically efficient solution. The proposed method achieves better or similar performance compared to state-of-the-art classifier ensemble methods while using a smaller number of classifiers.


Introduction
To improve classification accuracy, more than one machine learning algorithm may be combined. This process is known by different names in different domains, such as classifier fusion, classifier ensemble, classifier combination, mixture of experts, committees of neural networks, voting pool of classifiers, and others [1].
Ensembles can be categorized as weak or strong classifier ensembles according to the type of classifiers used. Weak classifiers are machine learning algorithms with fast training times but lower individual classification accuracy. Owing to their fast training times, weak classifier ensembles contain a large number of classifiers, typically 50-200. Strong classifiers, on the other hand, have slow training times but higher individual generalization accuracy. Owing to their slow training times, strong classifier ensembles contain a small number of classifiers, typically 3-7. Different methods are used to combine classifiers [2]. One of the simplest is majority voting, in which every classifier in the ensemble casts a single vote and the most voted result becomes the output [2]. Another common approach that uses majority voting in its decision stage is the Bootstrap aggregating (Bagging) algorithm [3]. Bagging trains weak classifiers on subsets drawn from the same dataset using uniform sampling with replacement.
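A minimal sketch of these two building blocks, majority voting and bootstrap resampling, assuming binary labels in {−1, +1} (all names are illustrative):

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Uniform sampling with replacement over the rows, as in bagging."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

def majority_vote(predictions):
    """predictions: (l, m) array of {-1, +1} votes, one row per classifier.
    Returns the majority-voted label for each of the m samples
    (an odd number of classifiers avoids ties)."""
    return np.sign(predictions.sum(axis=0))

votes = np.array([[ 1,  1, -1],
                  [ 1, -1, -1],
                  [-1,  1, -1]])
print(majority_vote(votes))   # -> [ 1  1 -1]
```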
Instead of giving every classifier a single vote, weighted voting may be used [2]. The weighted majority voting (WMV) algorithm [2] uses the accuracy of individual classifiers to find the weights: classifiers that achieve better accuracy in the training step get larger weights for their votes and thus become more influential in the voting.
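As an illustration of weight-by-accuracy voting, the following sketch uses the classical WMV log-odds weights w_i = ln(p_i / (1 − p_i)); the exact weighting formula is an assumption here, since the text only states that weights derive from individual accuracies.

```python
import numpy as np

def wmv_weights(accuracies):
    """Classical WMV choice: log-odds of each classifier's accuracy,
    so more accurate classifiers get larger votes."""
    p = np.asarray(accuracies, dtype=float)
    return np.log(p / (1.0 - p))

def weighted_vote(predictions, w):
    """predictions: (l, m) in {-1, +1}; w: (l,) non-negative weights."""
    return np.sign(w @ predictions)

w = wmv_weights([0.9, 0.6, 0.55])   # the 0.9 classifier dominates
votes = np.array([[-1,  1],
                  [ 1,  1],
                  [ 1, -1]])
print(weighted_vote(votes, w))   # -> [-1.  1.]
```

Note that the strong (0.9 accuracy) classifier overrides the two weaker ones on the first sample, which plain majority voting would not do.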
Other researchers proposed different approaches to find suitable weights for combining classifiers. For example, heuristic optimization techniques were used for this purpose [4,5]: Zhang et al. [4] proposed differential evolution, while Sylvester and Chawla [5] used genetic algorithms. These two methods do not model the ensemble weight finding problem explicitly; instead, they use the classification accuracy of the whole ensemble as the fitness function.
Another common approach to ensembling classifiers is boosting. The most widely used boosting algorithm, AdaBoost [6], trains weak classifiers iteratively and adds them to the ensemble. Unlike in bagging, subset creation is not randomized in boosting: in each iteration, subsets are created according to the results of previous iterations, i.e., data misclassified in previous subsets are more likely to be included. In addition, every weak classifier is weighted according to its accuracy.
Other methods are also used to combine classifiers. Verma and Hassan [7] proposed a multi-layered hybrid method for classifier ensembles. The first two layers use Self-Organizing Maps (SOM) and k-means to cluster the data, and the last layer uses a Multi-Layer Perceptron (MLP) to fuse the clusters. They generate parallel classifiers using this hybrid technique, and the final decision is made by majority voting. Maudes et al. [8] use different projection methods to generate new features, whose feature space is generally larger than the original one. They train base SVM classifiers on these new features and ensemble them using boosting techniques.
Although using more classifiers increases the generalization performance of an ensemble classifier, this performance increase stops after a point and may even lead to accuracy reduction. To put it another way, similar classifiers do not contribute much to the overall accuracy. This deficiency can be mitigated by increasing classifier diversity [1,4,9]. Although classifier diversity has no universally accepted definition, it affects ensemble accuracy [9]. However, an increase in ensemble size reduces classifier diversity, which in turn reduces classification accuracy [2,9]. Therefore, improving the diversity of classifiers is a challenge in ensemble classifier studies.
Ahmad [10] is among the researchers who studied classifier diversity. He proposed using features generated by kernel functions in decision tree ensembles: different kernel features are generated using kernel functions, and decision tree classifiers are trained on randomly chosen kernel features and original features. These new features introduce diversity into decision tree ensembles. Similarly, Lee et al. [11] used hierarchical fair competition-based parallel genetic algorithms (HFC-PGA) for ensembling neural networks. In a standard genetic algorithm, individuals with high fitness values saturate the population pool with their descendants very quickly; HFC-PGA prevents this phenomenon by using parallel populations with diversity. These parallel populations may have low fitness values but are highly diverse among themselves. Using different populations with diverse features, Lee et al. [11] trained diverse neural networks and then combined them into an ensemble using a negative correlation rule. Similarly, Kim et al. [12] proposed an ensemble approach for biological data; their approach is similar to boosting, but they also used sparse features in their weak classifiers.
Another method to improve ensemble performance is pruning (ensemble selection): after the ensemble is trained, classifiers are removed from it in such a way that overall performance is not reduced. Liu et al. [13] proposed using the Greedy Randomized Adaptive Search Procedure (GRASP) for ensemble selection: they constructed AdaBoost decision tree ensembles, pruned them with GRASP, and compared the results with the original AdaBoost. Zhang et al. [14] applied the GRASP algorithm to Extreme Learning Machine ensembles, and Zhang and Dai [15] improved this earlier work [14] using path relinking.
Most of the above-mentioned studies do not formulate the ensemble problem as a mathematical model. Other researchers, however, modeled ensemble weight finding as a mathematical optimization problem and proposed different methods for solving these models.
Zhang and Zhou [16] formulated the weight finding problem as a linear programming problem and solved it accordingly. Yin et al. [17] proposed a weight finding cost function that encompasses a data regularization term, a sparsity term, and a diversity term, and solved it using a genetic algorithm. In a later study, they improved their approach [18], and finally they solved the same cost function using convex optimization techniques [19].
Mao et al. [20] proposed 0-1 matrix decomposition and quadratic form [21] methods to find classifier weights. Sen and Erdogan [22] proposed a different cost function and a different loss function to solve the weight finding problem.
Inspired by the aforementioned studies [16,20,21] and the authors' earlier work [23], the sparsity-driven weighted ensemble classifier (SDWEC) is proposed here. The proposed cost function and solution differ from previous studies. The cost function consists of the following terms: (1) a data fidelity term with a sign function, aiming to decrease the misclassification rate, (2) an L1-norm sparsity term, aiming to decrease the number of classifiers, and (3) a non-negativity constraint on the weights of the classifiers. The cost function proposed in SDWEC is hard to solve since it is non-convex and non-differentiable; thus, (a) the sign operation is convex-relaxed using a novel approximation, and (b) the non-differentiable L1-norm sparsity term and the non-negativity constraint are approximated using log-sum-exp and Taylor series. SDWEC improves classification accuracy while minimizing the number of classifiers used in the ensemble. Since the number of classifiers in the ensemble decreases, the testing time of the whole ensemble decreases according to the sparsity level of SDWEC.

Sparsity-driven Weighted Ensemble Classifier
An ensemble consists of l classifiers, trained on the training dataset. We aim to increase ensemble accuracy on the test dataset by finding suitable weights for the classifiers using the validation dataset. The ensemble weight finding problem is modeled with the matrix equation sgn(Hw) ≈ y, in which the classifier predictions are weighted such that the obtained prediction for each data row becomes approximately equal to the expected result.
Matrix H consists of l classifier predictions for m data rows drawn from the validation dataset. Our aim is to find suitable weights w in a sparse manner while preserving the condition sgn(Hw) ≈ y, where sgn is the sign function. For this model, the following cost function is proposed:

min_w (1/m) ||sgn(Hw) − y||_2^2 + (λ/l) ||w||_1   subject to   w ≥ 0     (1)

In equation 1, the first term acts as a data fidelity term and minimizes the difference between the true labels and the ensemble predictions. The base classifiers of the ensemble give binary predictions (−1 or 1), and these predictions are multiplied with the weights and passed through the sign function. To make this term independent of the data size, it is divided by m (the number of data rows). The second term is a sparsity term [24] that forces the weights to be sparse [23]; therefore, a minimum number of classifiers is utilized. In the sparsity term, any Lp-norm (0 ≤ p ≤ 1) can be used. When p < 1, the weights become sparser as p gets closer to 0. However, for 0 < p < 1 the sparsity term becomes non-convex and thus harder to minimize, and for p = 0 the L0-norm problem becomes NP-hard [25]. Here, the L1-norm is used as a convex relaxation of the Lp-norm [24,26]. Similar to the data fidelity term, this term is normalized by dividing it by l (the number of individual classifiers). The third term is a non-negativity constraint. Since the base binary classifiers use (−1, 1) as class labels, negative weights flip the sign of the prediction and thus change the predicted class label. To prevent this, the constraint forces the weights to be non-negative. Using Lagrange multipliers and the identity |x| = max(−x, x), the cost function is transformed into equation 2:

(1/m) ||sgn(Hw) − y||_2^2 + (λ/l) Σ_r max(−w_r, w_r) + β Σ_r max(−w_r, 0)     (2)
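Read concretely, the fidelity and sparsity terms described above can be evaluated as in this sketch, which assumes a squared L2 fidelity term and a penalty form of the non-negativity constraint (both assumptions about the exact displayed equation):

```python
import numpy as np

def sdwec_cost(H, y, w, lam, beta=0.0):
    """Sketch of the SDWEC-style cost: data fidelity + L1 sparsity
    (+ optional penalty on negative weights). Norm choices are assumptions."""
    m, l = H.shape
    fidelity = np.sum((np.sign(H @ w) - y) ** 2) / m
    sparsity = lam * np.sum(np.abs(w)) / l
    nonneg = beta * np.sum(np.maximum(-w, 0.0))
    return fidelity + sparsity + nonneg

# H: predictions of l=3 classifiers on m=4 validation rows, labels in {-1, +1}
H = np.array([[ 1,  1, -1],
              [-1, -1,  1],
              [ 1, -1,  1],
              [-1,  1,  1]], dtype=float)
y = np.array([1., -1., 1., -1.])
w = np.array([1.0, 0.5, 0.0])   # sparse weights: third classifier unused
print(sdwec_cost(H, y, w, lam=0.1))   # zero fidelity, small sparsity penalty
```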
In equation 2, the w ≥ 0 constraint is better satisfied as β becomes larger. Equation 2 is a non-convex function, since the sgn function creates jumps on the cost surface. In addition, the max function is non-differentiable. Due to these two functions, max and sgn, equation 2 is hard to minimize. Therefore, we propose a novel convex relaxation of sgn, given in equation 3. Figure 1 shows the approximation of the sign function using equation 3.
The sign function is relaxed as sgn(H_s w) ≈ S_s H_s w (equation 3), where ε is a small positive constant. We also introduce a new constant ŵ as a proxy for w; therefore, S_s = (|H_s ŵ| + ε)^−1 is also a constant. However, this sgn approximation is accurate only around the introduced constant ŵ; therefore, the approximated cost function needs to be solved iteratively. Additionally, the max function is approximated with log-sum-exp [27] as max(−w_r, w_r) ≈ (1/γ) ln(e^(−γ w_r) + e^(γ w_r)). The accuracy of the log-sum-exp approximation improves as γ, a positive constant, increases. Double-precision floating point can represent values up to about 10^308 in magnitude [28]; this means that γ|w_r| should be less than 710, where exp(709) ≈ 10^308, otherwise the exponential function will produce infinity (∞). At w_r = 0, there is no danger of numerical overflow in the exponential terms of the log-sum-exp approximation; thus, large γ values can be used. But as |w_r| gets larger, there is a danger of numerical overflow, since e^(γ|w_r|) may exceed the double-precision floating point limits.
To remedy this numerical overflow problem, a novel adaptive γ approximation is proposed, where γ_r is the adaptive form of γ, defined as γ_r = γ(|ŵ_r| + ε)^−1. One can decrease or increase γ to tune the approximation accuracy. Figure 2 shows the proposed adaptive γ and the resulting approximations for two different ε and γ values. The validity of the approximation can be checked by taking limits at −∞, 0, and +∞ with respect to w_r; these limits are −x, (ln 2)/γ_r, and x, respectively. As |x| gets larger, the dependency on γ decreases; thus, the proposed adaptive γ approximation is less prone to numerical overflow than the standard log-sum-exp approximation.
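The standard log-sum-exp overflow and the adaptive-γ remedy described above can be demonstrated with a short numerical sketch (the γ and ε values are illustrative):

```python
import numpy as np

EPS = 1e-6

def abs_logsumexp(x, gamma):
    """|x| = max(-x, x) via log-sum-exp; overflows once gamma * |x| > ~709."""
    return np.log(np.exp(-gamma * x) + np.exp(gamma * x)) / gamma

def abs_adaptive(x, w_hat, gamma):
    """Adaptive-gamma variant: gamma_r = gamma / (|w_hat| + eps), which keeps
    the exponent bounded near x = w_hat and so avoids overflow."""
    g = gamma / (np.abs(w_hat) + EPS)
    return np.log(np.exp(-g * x) + np.exp(g * x)) / g

x = 200.0
print(abs_logsumexp(x, gamma=10.0))          # exp(2000) overflows -> inf
print(abs_adaptive(x, w_hat=x, gamma=10.0))  # finite, close to |x| = 200
```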
Applying the adaptive γ approximation leads to the cost function in equation 8, where n is the iteration number.
To achieve second-order accuracy and to obtain a linear solution after taking the derivative of the cost function, equation 8 is expanded as a second-order Taylor series centered on ŵ_r, yielding equation 9.
In equation 9, A_r contains the constant terms, B_r the w_r terms, and C_r the w_r^2 terms after the Taylor expansion. To ensure that w_r changes slowly, a new regularization term, ||w_r − ŵ_r||_2^2, is added to the cost function. The refined cost function is given in equation 10.
Equation 10 can be written in matrix-vector form as equation 11. Due to the employed numerical approximations, negative weights of small magnitude may occur around zero. Since our feasible set is w ≥ 0, back-projection onto this set is performed after solving the linear system at each iteration of Algorithm 1. This kind of back-projection onto the feasible domain is common; an example can be seen in [29]. Additionally, small weights in the ensemble do not contribute to the overall accuracy; therefore, these small weights are thresholded after the iterations are completed. An example run of Algorithm 1 can be seen in Figure 3, where the cost values for equations 2 and 11 decrease steadily. As seen in Figure 3, the difference between the non-convex cost function and its convex relaxation is minimal, especially in the final iterations, which shows that they converge to very similar values. Since the convex equation 11 and the non-convex equation 2 converge to similar points, the converged point is within close proximity of the global minimum.
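The back-projection onto w ≥ 0 and the final thresholding of small weights can be sketched as follows (the threshold value is illustrative, not the one used in the experiments):

```python
import numpy as np

def project_nonnegative(w):
    """Back-project onto the feasible set w >= 0 after each linear solve."""
    return np.maximum(w, 0.0)

def threshold_small_weights(w, tau=1e-3):
    """Zero out weights too small to affect the ensemble decision,
    increasing sparsity (fewer classifiers needed at test time)."""
    w = w.copy()
    w[np.abs(w) < tau] = 0.0
    return w

w = np.array([0.42, -1e-7, 0.0005, 0.31])   # raw solution of one iteration
w = project_nonnegative(w)                   # clip tiny negative weight
w = threshold_small_weights(w)               # drop negligible classifiers
print(w)                       # -> [0.42 0.   0.   0.31]
print(np.count_nonzero(w))     # -> 2 classifiers kept out of 4
```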

Algorithm 1 SDWEC Pseudo code
The approximated convex-relaxed equation 11 and the non-convex equation 2 stay close to each other due to the ||w − ŵ||_2^2 term and the employed iterative minimization approach. These results show the success of the proposed approximations.

Experimental Results
To evaluate its performance, SDWEC has been compared with the following algorithms on well-known UCI datasets and the NSL-KDD dataset [30]: a single tree classifier (C4.5), bagging [3], WMV [2], and the state-of-the-art ensemble QFWEC [21]. In all ensemble methods, 200 base classifiers (C4.5) are used. Each dataset is divided into training (80%), validation (10%), and testing (10%) sets. This process has been repeated 10 times for cross-validation, and mean values are reported in Table 1. The QFWEC accuracy values in Table 1 are higher than in the original publication [21] because the weights are found using the validation dataset instead of the training dataset, which provides better generalization.

Experimental Results - Sparsity
The principle of parsimony (sparsity) states that simple explanations should be preferred to complicated ones [24]. In machine learning, sparsity is mostly used for feature selection; in our study, it is used to select among weak classifiers. Depending on the dataset and the hyper-parameters used, SDWEC achieves different sparsity levels. SDWEC was applied to 11 different datasets and achieved sparsity levels between 0.80 and 0.88, see Figure 4. This means that among 200 weak classifiers, 24 (0.88 sparsity) to 40 (0.80 sparsity) classifiers were used in the ensemble. Results for two configurations, SDWEC-A and SDWEC-B, are given in Table 1. SDWEC-A has no sparsity; all 200 base classifiers are used in the ensemble, so it has superior performance at the cost of testing time. SDWEC-A has the best accuracy in 4 of 10 datasets and is very close to the top performers in the others. SDWEC-B has 0.90 sparsity, so 20 of the 200 base classifiers are used in the ensemble; nonetheless, it has the best accuracy in 2 of 10 datasets. In addition, its accuracy values are marginally lower (about 2%) while its testing time is significantly better (90% lower) than the other approaches. The corresponding values can be seen in Table 1.

Theoretical and Experimental Analysis of Computational Complexity
In this section, the computational complexity of SDWEC is analyzed. The computational complexity of every line of Algorithm 1 is given in Table 2, and the final computational complexity is determined. In Table 2, m is the number of data rows, l is the number of classifiers, and k is the iteration count.

The computational complexity of the for loop is O(ml), dominated by the SH multiplication in line 10 of Algorithm 1, where S is a diagonal matrix. With iteration count k, the final computational complexity of SDWEC is O(kml), which is linear in k, m, and l. This analysis shows the linearity of the proposed minimization and its computational efficiency. Table 3 shows the training time of SDWEC on various datasets. The NSL-KDD dataset (100778 rows) has about 25 times more data rows than the other datasets used. These results show that practical execution times are in alignment with the theoretical computational complexity analysis; the differences between the theoretical analysis and actual execution times are due to implementation issues and caching in CPU architectures.
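The O(ml) cost of the SH product follows from S being diagonal: the multiplication reduces to scaling each row of H, rather than a full O(m·m·l) matrix product. A short sketch with illustrative sizes:

```python
import numpy as np

m, l = 1000, 200
rng = np.random.default_rng(0)
H = rng.choice([-1.0, 1.0], size=(m, l))   # binary classifier predictions
s = rng.random(m)                          # diagonal entries of S

# np.diag(s) @ H would build an m x m matrix and cost O(m*m*l);
# broadcasting the diagonal over the rows of H costs only O(m*l).
SH = s[:, None] * H

assert np.allclose(SH, np.diag(s) @ H)     # same result, far cheaper
print(SH.shape)   # -> (1000, 200)
```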

Conclusion
In this article, a novel sparsity-driven ensemble classifier method has been presented. An efficient and accurate solution has been developed for the original cost function, which is hard to minimize, non-convex, and non-differentiable. The proposed solution uses a novel convex relaxation technique for the sign function and a novel adaptive log-sum-exp approximation that reduces numerical overflows. SDWEC has been compared with other ensemble methods on well-known UCI datasets and the NSL-KDD dataset. By tuning the parameters of SDWEC, a sparser ensemble, and thus a better testing time, can be obtained with a small decrease in accuracy.