International Journal of Computational Intelligence Systems

Volume 11, Issue 1, 2018, Pages 962 - 978

Sparsity-driven weighted ensemble classifier

Authors
Atilla Özgür1, a.oezguer@jacobs-university.de, Fatih Nar2, fatih.nar@gidatarim.edu.tr, Hamit Erdem3, herdem@baskent.edu.tr
1Logistics Engineering, Jacobs University, Campus Ring 1 28759 Bremen, Germany
2Computer Engineering, Konya Food and Agriculture University, Dede Korkut Mah. Beyşehir Cad. No:9, Meram / Konya / Turkey
3Electrical Engineering, Başkent University, Bağlıca Kampüsü Fatih Sultan Mahallesi Eskişehir Yolu 18. km, Ankara 06790, Turkey
Received 20 April 2017, Accepted 5 April 2018, Available Online 20 April 2018.
DOI
10.2991/ijcis.11.1.73
Keywords
Machine Learning; Ensemble; Convex Relaxation; Classification; Classifier Ensembles
Abstract

In this study, a novel sparsity-driven weighted ensemble classifier (SDWEC) that improves classification accuracy and minimizes the number of classifiers is proposed. Using pre-trained classifiers, an ensemble is formed in which base classifiers vote according to assigned weights. These weights directly affect the accuracy of the ensemble. In the proposed method, the ensemble weight-finding problem is modeled as a cost function with the following terms: (a) a data fidelity term aiming to decrease the misclassification rate, (b) a sparsity term aiming to decrease the number of classifiers, and (c) a non-negativity constraint on the weights of the classifiers. As the proposed cost function is non-convex and thus hard to solve, convex relaxation techniques and novel approximations are employed to obtain a numerically efficient solution. The sparsity term of the cost function allows a trade-off between accuracy and testing time when needed. The efficiency of SDWEC was tested on 11 datasets and compared with state-of-the-art classifier ensemble methods. The results show that SDWEC provides better or similar accuracy levels using fewer classifiers and reduces the testing time of the ensemble.

Copyright
© 2018, the Authors. Published by Atlantis Press.
Open Access
This is an open access article under the CC BY-NC license (http://creativecommons.org/licences/by-nc/4.0/).

1. Introduction

The accuracy of classification can be improved by using more than one classifier. This process is known by different names in different domains, such as classifier fusion, classifier ensemble, classifier combination, mixture of experts, committees of neural networks, or voting pool of classifiers [1].

Ensembles can be categorized as weak or strong depending on the type of classifier used [2]. Weak classifiers use machine learning algorithms with fast training times and lower classification accuracy. Due to their fast training times, weak classifier ensembles contain a high number of classifiers, such as 50–200. On the other hand, strong classifiers individually have slow training times and high generalization accuracy. Due to their slow training times, strong classifier ensembles contain as few as 3–7 classifiers.

Although using more classifiers increases the generalization performance of an ensemble classifier, the benefit diminishes after a point. To put it another way, similar classifiers do not contribute much to the overall accuracy. This deficiency can be removed by increasing classifier diversity [1, 3, 4]. Therefore, finding new diversity measurements [5] and improving existing ones [4] are ongoing research efforts in ensemble studies.

Research on ensembles can be categorized into two groups according to the construction method: (a) combining pre-trained classifiers, and (b) constructing the classifiers and the ensemble together.

Methods in the first group (a) are the easiest to understand and the most commonly used methods to create ensembles. The classifiers are trained using a training set and then combined in an ensemble. The simplest method to combine classifiers is majority (plurality) voting. In the majority voting method, every classifier in the ensemble casts a single vote, and the most voted class is the output. A well-known approach that uses majority voting in its decision stage is the bootstrap aggregating algorithm (bagging) [6]. Bagging trains weak classifiers on the same dataset using uniform sampling with replacement; the classifiers are then combined using simple majority voting [7]. Instead of using a single vote for every classifier, weighted voting may be used [7]. The standard weighted majority voting (WMV) algorithm [7] uses the accuracies of the individual classifiers to find the weights. Classifiers with better accuracies in the training step get higher weights for their votes and thus become more influential in the voting.
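
As an illustration, the following small sketch (with made-up predictions and weights, not the WMV weight-derivation itself) contrasts plain majority voting with weighted majority voting:

```python
import numpy as np

# Predictions of l=3 classifiers for m=4 samples, labels in {-1, +1}
H = np.array([[+1, -1, +1],
              [-1, -1, +1],
              [+1, +1, -1],
              [-1, +1, +1]])

# Plain majority voting: every classifier gets one equal vote
majority = np.sign(H.sum(axis=1))

# Weighted majority voting: votes are scaled by per-classifier weights,
# e.g. derived from individual training accuracies (values invented here)
w = np.array([1.5, 0.4, 0.4])
weighted = np.sign(H @ w)

print(majority)   # [ 1. -1.  1.  1.]
print(weighted)   # [ 1. -1.  1. -1.]  -> the two schemes disagree on the last sample
```

For the last sample, the single high-weight classifier overrules the other two, which is exactly the behavior weighted voting is designed to allow.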

Kuncheva and Rodríguez [8] proposed a probabilistic framework for classifier ensembles. This framework shows the relationships between four combiners: majority voting, weighted voting, recall voting, and naive Bayes voting. According to the experiments of Kuncheva and Rodríguez [8] on 73 benchmark datasets, there is no definite best combiner among those four. These results conform to the "no free lunch" theorem [9]: no universal classifier exists that is suitable for every problem. Numerous other methods have been proposed for finding weights to combine pre-trained classifiers (Table 1); these methods are summarized in Section 1.1. Similar to the approaches in Table 1, the main focus of this study is a new approach, based on convex optimization techniques, for finding weights in an ensemble of pre-trained classifiers.

Study | Year | Classifiers | Method | Size | Sparse | Cost Function | Regularizer | Notes
Sylvester and Chawla [17] | 2006 | 12 Different Classifiers | Genetic Algorithms | 120 | No | No Information | No Information |
Li and Zhou [18] | 2009 | Decision Tree | Quadratic Programming | 100 | Yes | Hinge Loss | L1 |
Kim et al. [19] | 2011 | Decision Tree | Matrix Decomposition | 64 | No | Indicator Loss | No Regularization |
Mao et al. [20] | 2011 | Decision Tree, SVM¹ | Matching Pursuit | 100 | Yes | Sign Loss | No Regularization |
Zhang and Zhou [21] | 2011 | KNN² | Linear Programming | 100 | Yes | Hinge Loss | L1 |
Goldberg and Eckstein [22] | 2012 | No Information | Linear Programming | NI | Yes | Indicator Loss | L0 | (a)
Santos et al. [23] | 2012 | SVM¹, MLP³ | Genetic Algorithms | 6 | No | No Cost Function | No Regularization |
Yin et al. [24] | 2012 | Neural Networks | Genetic Algorithms | 100 | Yes | Square Loss | L1 | (b)
Meng and Kwok [25] | 2013 | Decision Tree, SVM¹, KNN² | Domain Heuristic | 3 | No | No Cost Function | No Regularization |
Tinoco et al. [26] | 2013 | SVM¹, MLP³ | Linear Programming | 6 | Yes | Hinge Loss | L1 | (c)
Hautamaki et al. [27] | 2013 | Logistic Regression | Nelder–Mead | 12 | Yes | Cross-entropy [28] | L1, L2, L1 + L2 | (d)
Şen and Erdoğan [29] | 2013 | 13 Different Classifiers | Convex Opt. | 130 | Yes | Hinge Loss | L1, Group Sparsity |
Mao et al. [30] | 2013 | Decision Tree | Singular Value Decomposition | 10 | No | Absolute Loss | No Regularization |
Yin et al. [31] | 2014 | Neural Networks | Genetic Algorithms | 100 | Yes | Square Loss | L1 | (e)
Yin et al. [32] | 2014 | Neural Networks | Quadratic Programming | 100 | Yes | Square Loss | L1 | (f)
Zhang et al. [3] | 2014 | 5 Different Classifiers | Differential Evolution | 5 | No | No Cost Function | No Regularization |
Mao et al. [33] | 2015 | Decision Tree | Quadratic Form | 200 | No | Square Loss | L1 |

¹ SVM: Support Vector Machine. ² KNN: K-Nearest Neighbor. ³ MLP: Multi-Layer Perceptron.
(a) No experimental results. (b) Diversity term based on Yule's Q statistic is used. (c) Improved version of [23]. (d) 3 regularizers are compared. (e) Journal version of [24]. (f) Convex quadratic model of [31] and [24].

Table 1: Ensemble weight-finding studies that use pre-trained classifiers

In the second category (b), ensemble construction and classifier construction affect each other. AdaBoost [10] is a well-known example of this category; it trains weak classifiers iteratively and adds them to the ensemble. Different from bagging, subset creation is not randomized in boosting. At each iteration, subsets are obtained using the results of previous iterations; that is, data misclassified in previous iterations are more likely to be included. For combining the classifiers, standard weighted majority voting is used.

Gurram and Kwon [11] used a similar approach to classify remote sensing images. Randomly selected features were used to train weak SVM classifiers, and the training and combination of these classifiers were optimized together. Lee et al. [12] combined weak neural network classifiers in their ensemble. Genetic algorithms were used to find weights for the neural network neurons and to increase diversity among the networks; these diverse neural networks were then combined using the negative correlation rule, so the networks were trained and combined in one step. Tian and Feng [13] proposed an approach that combines feature sub-selection and ensemble optimization. They proposed a three-term cost function: a classification accuracy term, a diversity term, and a feature size term, and solved it using population-based heuristic optimization. Zhang et al. [14] used kernel sparse representation-based classifiers for an ensemble in the face recognition domain: features were projected to higher dimensions using kernels, and a sparse representation of these features was found using optimization techniques. Similarly, Kim et al. [15] proposed an ensemble approach for biological data; their approach was similar to boosting, but they selected sparse features in their weak classifiers. Özgür and Erdem [16] used genetic algorithms to select features and find weights for a classifier ensemble. They combined different strong classifiers and experimented on the NSL-KDD dataset.

1.1. Related works: ensembles that combine pre-trained classifiers

The focus of this study is to combine pre-trained classifiers so that the combined accuracy of the ensemble is better than that of the individual classifiers. This study aims to accomplish this objective in a sparse manner, so that not all of the classifiers are used in the ensemble; therefore, weak decision tree classifiers are used in the experiments. Although some other sparse approaches [11, 14, 15, 34] were mentioned above, this section only reviews ensemble studies that propose methods for finding weights for pre-trained base classifiers.

Sylvester and Chawla [17] proposed genetic algorithms to find suitable weights for ensemble base classifiers (see Table 1). Similar to most heuristic solution techniques, they did not explicitly define a cost function but used classification accuracy as the fitness function. ID3 decision trees, J48 decision trees (C4.5), JRIP rule learner (Ripper), Naive Bayes, NBTree (Naive Bayes trees), One Rule, logistic model trees, logistic regression, decision stumps, multi-layer perceptron (MLP), SMO (support vector machine), and IBk (k-nearest neighbors) classifiers from the Weka toolbox [35] were used in the experiments.

Li and Zhou [18] modeled the ensemble weight-finding problem using a cost function that consists of hinge loss and L1 regularization. This cost function was minimized using quadratic programming. Decision tree weak classifiers and UCI datasets were used for the experiments. A semi-supervised version was also suggested.

Zhang and Zhou [21] formulated the weight-finding problem using three different cost functions: LP1 uses a cost function that consists of hinge loss only; LP2 uses a cost function that consists of hinge loss and L1 regularization; LP3 allows the weights to be negative. These cost functions were minimized using linear programming. They used the K-nearest neighbor (KNN) algorithm as the base classifier and UCI datasets in their experiments.

Kim et al. [19] proposed an approach similar to boosting. They considered two weight vectors, one for the classifiers and one for the instances. Hard-to-classify instances get more weight and correspondingly affect the classifier weight vector more. Different from boosting, their approach works with pre-trained classifiers. The weights for the ensemble were found using matrix decomposition and an iterative algorithm. Decision tree weak classifiers and UCI datasets were used for the experiments.

Mao et al. [20] proposed a matching pursuit algorithm to find weights for ensemble base classifiers. Since matching pursuit is a sparse approximation algorithm [36], their approach includes sparsity. Decision tree and SVM weak classifiers and UCI datasets were used for the experiments.

Goldberg and Eckstein [22] modeled the weight-finding problem with an indicator loss function and an L0 regularization function. According to Goldberg and Eckstein [22], this problem is NP-hard in special cases. They gave different relaxation strategies for solving this problem, together with relaxation bounds. Different from the other studies, this work was purely theoretical.

Santos et al. [23] combined MLP and SVM algorithms to classify remote sensing images. They did not give an explicit cost function but used genetic algorithms to find the weights. An improved version of their study [26] modeled the weight-finding problem using hinge loss and L1 regularization; this cost function was minimized using linear programming. In both versions, remote sensing images were classified using an ensemble of MLP and SVM classifiers.

Meng and Kwok [25] combined decision tree (J48), K-nearest neighbor, and SVM classifiers. They suggested the following domain heuristic for the classifier weights: "...weighted ranking (precision of false alarm > recall of false alarm > classification accuracy) is an appropriate and correct way to decide the weight values with high confidence in ensemble selection..." [25].

Hautamaki et al. [27] investigated sparse ensembles in the speaker verification domain. The ensemble weight-finding problem was modeled using a cross-entropy loss function and three different regularization functions: L1, L2, and L1 + L2. These cost functions were minimized using the Nelder–Mead method. Logistic regression classifiers were used in the experiments.

Yin et al. [24] modeled the ensemble weight-finding problem with a cost function consisting of a square loss term, L1 regularization, and a diversity term based on Yule's Q statistic. They used neural network classifiers on 6 UCI datasets in their experiments. In their first study [24], the proposed cost function was minimized using genetic algorithms. In their second study [31], the Pascal 2008 webspam dataset was added to the experiments. Finally, convex optimization techniques [32] were used to minimize the same cost function.

Şen and Erdoğan [29] modeled the ensemble weight-finding problem using a cost function that consists of hinge loss and two different regularization functions, L1 and group sparsity. This cost function was minimized using convex optimization techniques. In their experiments, 13 different classifiers were compared on 12 UCI datasets and 3 other datasets using the CVX toolbox [37, 38].

Zhang et al. [3] proposed differential evolution for finding suitable weights for ensemble base classifiers. Similar to most heuristic solution techniques, they did not explicitly define a cost function but used classification accuracy as the fitness function. Decision tree (J4.8), Naive Bayes, Bayes Net, K-nearest neighbor, and ZeroR classifiers from the Weka toolbox [35] were used in the experiments.

Mao et al. [30] modeled the ensemble weight-finding problem using a cost function that consists of absolute loss only; the proposed cost function was minimized using 0–1 matrix decomposition. In a later study [33], Mao et al. proposed a cost function that consists of square loss and an L1 regularization function; this cost function was minimized using a quadratic form approximation. Both studies used decision tree weak classifiers and UCI datasets in their experiments.

As can be seen from Table 1, numerous approaches exist for finding weights in ensemble classification. Inspired by the studies [16, 21, 30, 33, 39], the sparsity-driven weighted ensemble classifier (SDWEC) is proposed. SDWEC can use either strong or weak classifiers in the ensemble. In this study, a weak classifier, the decision tree, is used as the base classifier, since choosing a small number of classifiers among a large number of weak classifiers leads to high accuracy with a shorter testing time. The proposed cost function consists of the following terms: (1) a data fidelity term with the sign function, aiming to decrease the misclassification rate; (2) an L1-norm sparsity term, aiming to decrease the number of classifiers; and (3) a non-negativity constraint on the weights of the classifiers. The cost function proposed in SDWEC is hard to solve since it is non-convex and non-differentiable; thus, (a) the sign operation is convex-relaxed using a novel approximation, and (b) the non-differentiable L1-norm sparsity term and the non-negativity constraint are approximated using log-sum-exp and Taylor series. The contributions of SDWEC can be summarized as follows:

  1. A new cost function is proposed for the ensemble weight-finding problem.

  2. This cost function is minimized using novel convex relaxation and approximation techniques for the sign function and the absolute value function.

  3. SDWEC provides similar or better classification accuracy while minimizing the number of classifiers used in the ensemble.

  4. Depending on the sparsity level of SDWEC, the number of classifiers used in the ensemble decreases; thus, the testing time of the whole ensemble decreases.

  5. The sparsity level of SDWEC allows a trade-off between testing accuracy and testing time when needed.

  6. The computational complexity of SDWEC is analyzed theoretically and experimentally; it is linear in the number of data rows, the number of classifiers, and the number of algorithm iterations.

2. Sparsity-driven weighted ensemble classifier

An ensemble consists of l classifiers trained on the training dataset. We aim to increase the ensemble accuracy on the test dataset by finding suitable classifier weights using the validation dataset. The ensemble weight-finding problem is modeled with the following matrix equation.

$$\operatorname{sgn}\left(\underbrace{\begin{bmatrix} -1 & \cdots & +1 & +1 \\ \vdots & & & \vdots \\ -1 & \cdots & +1 & -1 \end{bmatrix}}_{H_{m \times l}} \underbrace{\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_{l-1} \\ w_l \end{bmatrix}}_{w_{l \times 1}}\right) \approx \underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{m-1} \\ y_m \end{bmatrix}}_{y_{m \times 1}}$$

H: classifier results in {−1, 1}^(m×l)
m: number of samples in the validation dataset
l: number of individual classifiers
w: classifier weights
y: true labels in {−1, 1}^(m×1) for the validation dataset
In this matrix equation, the classifier predictions are weighted so that the obtained prediction for each data row becomes approximately equal to the expected result. Matrix H consists of l classifier predictions for m data rows drawn from the validation dataset, and vector y contains the labels of the validation dataset. Our aim is to find suitable weights w in a sparse manner while preserving the condition sgn(Hw) ≈ y (sign function). For this model, the following cost function is proposed:

$$J(w) = \frac{\lambda}{m}\sum_{s=1}^{m}\left(\operatorname{sgn}(H_s w) - y_s\right)^2 + \frac{1}{l}\left\|w\right\|_1 \quad \text{subject to} \quad w \geq 0 \tag{1}$$

λ: data fidelity coefficient (λ > 0)
Hs: the s-th row vector of matrix H
ys: the s-th label of vector y

In equation 1, the first term acts as a data fidelity term and minimizes the difference between the true labels and the ensemble predictions. The base classifiers of the ensemble give binary predictions (−1 or 1); these predictions are weighted and passed through the sign function, which yields {−1, 0, 1} as the ensemble result. To make this term independent of the data size, it is divided by m (the number of data rows).
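
To make the notation concrete, the following toy NumPy sketch (with invented H, w, and y values) evaluates the data fidelity term of equation 1:

```python
import numpy as np

# Toy validation predictions H (m=5 samples, l=4 classifiers) and true labels y
H = np.array([[+1, +1, -1, +1],
              [-1, -1, -1, +1],
              [+1, -1, +1, +1],
              [-1, +1, -1, -1],
              [+1, +1, +1, -1]])
y = np.array([+1, -1, +1, -1, -1])
m = H.shape[0]

w = np.array([0.8, 0.0, 0.4, 0.2])   # candidate ensemble weights
ensemble = np.sign(H @ w)            # values in {-1, 0, +1}

lam = 0.1
data_fidelity = (lam / m) * np.sum((ensemble - y) ** 2)
print(ensemble)        # last sample is misclassified
print(data_fidelity)   # 0.08: only the misclassified sample contributes
```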

The second term is a sparsity term [40] that forces the weights to be sparse [39]; therefore, a minimum number of classifiers is utilized. In the sparsity term, any Lp-norm (0 ⩽ p ⩽ 1) can be used, and the weights become sparser as p gets closer to 0. However, when 0 ⩽ p < 1, the sparsity term becomes non-convex and the problem becomes harder to solve; when p is 0 (L0-norm), the problem becomes NP-hard [41]. Here, the L1-norm is used as a convex relaxation of the Lp-norm [40, 42]. Similar to the data fidelity term, this term is normalized by dividing by l (the number of individual classifiers).

The third term is a non-negativity constraint. Since the base binary classifiers use the values −1 and 1 as class labels, negative weights flip the sign of a prediction and thus change its class label. To avoid this problem, the constraint term is added to force the weights to be non-negative.

Using |x| = max(−x, x) as the definition of the L1-norm and using the penalty method [43] to transform the constraint in equation 1 into a penalty term (i.e., w ⩾ 0 → max(−wr, 0), 1 ⩽ r ⩽ l), the unconstrained cost function below is obtained:

$$J^{(n)}(w) = \frac{\lambda}{m}\sum_{s=1}^{m}\left(\operatorname{sgn}(H_s w) - y_s\right)^2 + \frac{1}{l}\sum_{r=1}^{l}\max(-w_r, w_r) + \frac{\beta^{(n)}}{l}\sum_{r=1}^{l}\max(-w_r, 0) \tag{2}$$

In equation 2, n is the iteration number, since the constrained cost function in equation 1 is converted into a series of unconstrained problems using the penalty method. Due to the employed penalty method, the constraint w ⩾ 0 is better satisfied as the penalty coefficient β(n) is increased in each iteration, where β(1) > 0 in the first iteration. Equation 2 is a non-convex function, since the sgn function creates jumps on the cost function surface. In addition, the max function is non-differentiable. The max and sgn functions in equation 2 are therefore hard to minimize. We thus propose a novel convex relaxation for sgn, given in equation 3. Figure 1 shows the approximation of the sign function using equation 3.

$$\operatorname{sgn}(H_s w) \approx \frac{H_s w}{|H_s \hat{w}| + \varepsilon} = S_s H_s w \tag{3}$$
where
$$S_s = \left(|H_s \hat{w}| + \varepsilon\right)^{-1} \tag{4}$$
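
A quick numeric check of the relaxation in equations 3 and 4 (with invented values): the approximation reproduces sgn(Hsw) well at the linearization point ŵ but degrades as w moves away from it.

```python
import numpy as np

eps = 0.1
H_s = np.array([+1, -1, +1, +1])          # one row of H
w_hat = np.array([0.5, 0.2, 0.0, 0.3])    # linearization point (proxy for w)

S_s = 1.0 / (abs(H_s @ w_hat) + eps)      # constant from equation 4

# At w = w_hat the relaxation S_s * (H_s @ w) is close to sgn(H_s @ w) ...
w = w_hat.copy()
print(np.sign(H_s @ w), S_s * (H_s @ w))  # 1.0  vs  ~0.86

# ... but it drifts as w moves away from w_hat, which is why the
# approximated cost function must be solved near w_hat.
w = w_hat + 1.0
print(np.sign(H_s @ w), S_s * (H_s @ w))  # 1.0  vs  ~3.71
```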

Figure 1: Sign function approximation using equation 3. Dotted lines are approximations of equation 3 at various points.

In this equation, ε is a small positive constant. We also introduce a new constant ŵ as a proxy for w. Therefore, Ss = (|Hsŵ| + ε)⁻¹ is also a constant. However, this sgn approximation is only accurate around the introduced constant ŵ; therefore, the approximated cost function needs to be solved around ŵ. Additionally, the max function is approximated with log-sum-exp [44] as follows:

$$\max(-w_r, w_r) \approx \frac{1}{\gamma}\log\left(e^{-\gamma w_r} + e^{\gamma w_r}\right) \tag{5}$$

The accuracy of the log-sum-exp approximation improves as γ, a positive constant, increases. In the double-precision floating-point format [45], values up to about 10³⁰⁸ in magnitude can be represented. This means that γ|wr| should be less than 710, where exp(709) ≈ 10³⁰⁸; otherwise the exponential function will produce infinity (∞). At wr = 0, there is no danger of numerical overflow in the exponential terms of the log-sum-exp approximation; thus, large γ values can be used. But as |wr| gets larger, there is a danger of numerical overflow in the exponential terms, since e^(γ|wr|) may exceed the double-precision floating-point upper limit.

To remedy this numerical overflow issue, a novel adaptive γ approximation is proposed, where γr is the adaptive form of γ, defined as γr = γ(|ŵr| + ε)⁻¹. The accuracy of the approximation can be improved by decreasing ε or increasing γ. Figure 2 shows the proposed adaptive γ and the resulting approximations for two different sets of values (γ = 10, ε = 0.1) and (γ = 10, ε = 1).
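
The following sketch (with illustrative values) contrasts the standard log-sum-exp approximation of |wr| with the adaptive-γ variant: with γr = γ(|ŵr| + ε)⁻¹, the exponent γr·wr stays bounded near wr = ŵr, so the exponentials do not overflow.

```python
import numpy as np

def lse_abs(w, gamma):
    # Standard log-sum-exp approximation of max(-w, w) = |w|
    return np.log(np.exp(-gamma * w) + np.exp(gamma * w)) / gamma

def lse_abs_adaptive(w, w_hat, gamma=10.0, eps=0.1):
    # Adaptive gamma: gamma_r = gamma / (|w_hat| + eps)
    gamma_r = gamma / (np.abs(w_hat) + eps)
    return np.log(np.exp(-gamma_r * w) + np.exp(gamma_r * w)) / gamma_r

w = np.array([0.0, 0.5, 120.0])
print(lse_abs(w, gamma=10.0))        # overflows for w = 120 (gamma*|w| = 1200 > 709) -> inf
print(lse_abs_adaptive(w, w_hat=w))  # finite: gamma_r*|w| = 10*|w|/(|w| + 0.1) <= 10
```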

Figure 2: Adaptive gamma (γr) L1 approximation with different ε values.

The validity of the approximation can be checked by taking the limits at −∞, 0, and +∞ with respect to wr. These limits are −wr, (ε log 2)/γ, and wr when wr goes to −∞, 0, and +∞, respectively. As |wr| gets larger, the dependency on γ decreases; thus, the proposed adaptive γ approximation is less prone to numerical overflow than the standard log-sum-exp approximation.

The regularization term given in equation 6 is added to the unconstrained cost function (equation 2) since the approximated cost function needs to be solved around ŵ. This new regularization term forces the solution to stay around ŵ by imposing a quadratic penalty between the solution and ŵ. Because of this term, the solution changes slowly in each iteration; thus, it is called slow-step regularization. The main drawback of the penalty method is the need to increase the penalty coefficient in each iteration, theoretically up to infinity, which leads to ill-conditioning in the minimization of the cost function. However, increasing β(n) in each iteration is not needed here, since the changes in the solution during the minimization are small thanks to the employed slow-step regularization. Therefore, the penalty coefficient is used as a constant β with a small value (i.e., β < 10²) for all iterations. Note that using a small value for the penalty coefficient β leads to a numerically well-posed minimization problem.

$$\frac{1}{l}\sum_{r=1}^{l}\left(w_r - \hat{w}_r\right)^2 \tag{6}$$

Application of the adaptive γ approximation leads to the following equations:

$$\max(-w_r, w_r) \approx \frac{1}{\gamma_r}\log\left(e^{-\gamma_r w_r} + e^{\gamma_r w_r}\right) \tag{7}$$
$$\beta\max(-w_r, 0) \approx \frac{\beta}{\gamma_r}\log\left(e^{-\gamma_r w_r} + 1\right) = P(w_r) \tag{8}$$

Use of the slow-step regularization in equation 6 and the log-sum-exp approximation with adaptive γ leads to the cost function shown in equation 9.

$$J^{(n)}(w) = \frac{\lambda}{m}\sum_{s=1}^{m}\left(S_s H_s w - y_s\right)^2 + \frac{1}{l}\sum_{r=1}^{l}\frac{1}{\gamma_r}\log\left(e^{-\gamma_r w_r} + e^{\gamma_r w_r}\right) + \frac{1}{l}\sum_{r=1}^{l}\frac{\beta}{\gamma_r}\log\left(e^{-\gamma_r w_r} + 1\right) + \frac{1}{l}\sum_{r=1}^{l}\left(w_r - \hat{w}_r\right)^2 \tag{9}$$

In order to achieve second-order accuracy and to obtain a linear system after taking the derivative of the cost function, equation 9 is expanded as a second-order Taylor series centered at ŵr, leading to equation 10.

$$J^{(n)}(w) = \frac{\lambda}{m}\sum_{s=1}^{m}\left(S_s H_s w - y_s\right)^2 + \frac{1}{l}\sum_{r=1}^{l}\left(A_r + B_r w_r + C_r w_r^2\right) + \frac{1}{l}\sum_{r=1}^{l}\left(w_r - \hat{w}_r\right)^2 \tag{10}$$

In equation 10, Ar represents the constant terms, while Br and Cr are the coefficients of wr and wr², respectively. If the wr values differ significantly from the expansion point ŵr, the Taylor approximation diverges from the true cost function. In the proposed method, the employed slow-step regularization also ensures the accuracy of the Taylor approximations.

Equation 10 can be written in a matrix-vector form as follows:

$$J^{(n)}(w) = \frac{\lambda}{m}(SHw - y)^\top(SHw - y) + \frac{1}{l}\left(v_A^\top \mathbf{1} + v_B^\top w + w^\top C w\right) + \frac{1}{l}(w - \hat{w})^\top(w - \hat{w}) \tag{11}$$

S: matrix (diagonal) form of Ss
1: vector of ones
vA: vector form of Ar
vB: vector form of Br
C: diagonal matrix form of Cr

Equation 11 is strictly convex (see the appendix for details); thus, it has a unique global minimum. Therefore, to minimize J(n)(w) in equation 11, its derivative with respect to w is taken and set to zero. This leads to a system of linear equations:

$$Mw = b \tag{12}$$
where
$$M = \frac{2\lambda}{m}(SH)^\top(SH) + \frac{2}{l}(C + I), \qquad b = \frac{2\lambda}{m}(SH)^\top y + \frac{2\hat{w} - v_B}{l}$$

In equation 12, M is a dense, symmetric, real, and positive-definite matrix of size l × l.
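
Because M is symmetric positive definite, the linear system in equation 12 can be solved with a Cholesky factorization (as noted later in Table 3). A minimal SciPy sketch, with a random SPD matrix standing in for M:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

l = 4
rng = np.random.default_rng(0)
A = rng.standard_normal((6, l))
M = A.T @ A + np.eye(l)           # symmetric positive definite, like M in equation 12
b = rng.standard_normal(l)

c, low = cho_factor(M)            # Cholesky factorization (the O(l^3) step in Table 3)
w = cho_solve((c, low), b)        # cheap triangular solves reuse the factor
print(np.allclose(M @ w, b))      # True
```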

The final model is solved iteratively using Algorithm 1. Due to the employed numerical approximations and the constant β, small negative weights may occur around zero. Since our feasible set is w ⩾ 0, back projection onto this set is performed after solving the linear system at each iteration of Algorithm 1. This kind of back projection onto the feasible domain is commonly used [46]. Additionally, small weights in the ensemble do not contribute to the overall accuracy; therefore, these small weights are thresholded after the iterations are completed.

1: H, y, λ, β, γ, ε are initialized
2: w ← 1
3: m, l ← size of H (m × l)
4: k ← 25     ▹ maximum iteration
5: for n = 1 to k do
6:     ŵ ← w
7:     γr ← γ / (|ŵr| + ε)
8:     construct S as the diagonal form of Ss
9:     construct vB and C
10:    M ← (2λ/m)(SH)ᵀ(SH) + (2/l)(C + I)
11:    b ← (2λ/m)(SH)ᵀy + (2ŵ − vB)/l
12:    solve Mw = b
13:    w ← max(w, 0)     ▹ back projection to w ⩾ 0
14: end for
15: wthreshold ← argmin over wr of (P(wr) − 10⁻³)²
16: wr ← wr if wr > wthreshold, otherwise 0

Algorithm 1: SDWEC pseudo-code
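
A minimal NumPy sketch of Algorithm 1 is given below. It assumes that the Taylor coefficients Br and Cr are obtained from the standard second-order expansion of the smoothed penalty terms of equation 9 around ŵ (the resulting Cr matches the expression given in the appendix; Br is derived under the same assumption). The final thresholding rule is simplified to a fixed small cutoff instead of the P(wr)-based rule of line 15; this is an illustrative implementation, not the authors' reference code.

```python
import numpy as np

def sdwec_weights(H, y, lam=0.1, beta=35.0, gamma=5.0, eps=0.1, k=25):
    """Sketch of SDWEC training: H is (m x l) with entries in {-1,+1}, y in {-1,+1}^m."""
    m, l = H.shape
    w = np.ones(l)
    for _ in range(k):
        w_hat = w.copy()
        gamma_r = gamma / (np.abs(w_hat) + eps)        # adaptive gamma (line 7)
        S = 1.0 / (np.abs(H @ w_hat) + eps)            # S_s, equation 4 (line 8)
        SH = S[:, None] * H                            # diag(S) @ H

        # Second-order Taylor coefficients of the smoothed per-weight penalty
        # f(w) = (1/g) log(e^{-gw} + e^{gw}) + (beta/g) log(e^{-gw} + 1),
        # expanded around w_hat: C_r = f''(w_hat)/2, B_r = f'(w_hat) - f''(w_hat) w_hat
        g, wh = gamma_r, w_hat
        f1 = np.tanh(g * wh) - beta / (np.exp(g * wh) + 1.0)
        f2 = g / np.cosh(g * wh) ** 2 + beta * g * np.exp(g * wh) / (np.exp(g * wh) + 1.0) ** 2
        C = f2 / 2.0
        vB = f1 - f2 * wh

        M = (2.0 * lam / m) * (SH.T @ SH) + (2.0 / l) * np.diag(C + 1.0)   # line 10
        b = (2.0 * lam / m) * (SH.T @ y) + (2.0 * w_hat - vB) / l          # line 11
        w = np.linalg.solve(M, b)      # M is SPD; a Cholesky solver also works (line 12)
        w = np.maximum(w, 0.0)         # back projection to w >= 0 (line 13)
    w[w < 1e-3] = 0.0                  # simplified thresholding (paper derives it from P(w_r))
    return w

# Usage (H_val: {-1,+1} prediction matrix on the validation set, y_val: {-1,+1} labels):
# w = sdwec_weights(H_val, y_val)
```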

An example run of Algorithm 1 can be seen in Figure 3, where the cost values of equations 2 and 11 decrease steadily. As seen in Figure 3, the difference between the non-convex cost function and its convex relaxation is minimal, especially in the final iterations, which shows that the two functions converge to very similar values. Since the non-convex equation 2 and the convex equation 11 converge to similar points, the converged points are within close proximity of the global minimum. The non-convex equation 2 and the convex-relaxed equation 11 stay close to each other due to the slow-step regularization term and the employed iterative approach for numerical minimization. These results show the success of the proposed approximations.

Figure 3: Cost function minimization for 4 datasets (non-convex equation 2 vs. convex-relaxed equation 11).

3. Experimental results

The performance of SDWEC has been tested on 11 datasets: 10 UCI datasets and NSL-KDD [47]. NSL-KDD is a popular dataset for intrusion detection [47, 48, 49]. In all ensemble methods, 200 base decision tree classifiers, Classification And Regression Trees (CART) [50], are used. SDWEC has been compared with the following algorithms: a single decision tree classifier (CART) [50], bagging [6], WMV [7], and the state-of-the-art ensemble method QFWEC [33]. Each dataset is divided into training (80%), validation (10%), and testing (10%) sets. This process has been repeated 10 times for cross-validation, and the mean values are reported in Table 2. The accuracy values for QFWEC in Table 2 are higher than in the original publication [33] since the weights are found using the validation dataset instead of the training dataset, which provides better generalization.

Datasets QFWEC SDWEC-A SDWEC-B WMV bagging singleC
breast 0.9736 0.9737 (0) 0.9532 (0.90) 0.9355 0.9722 0.9400
heartC 0.8085 0.8186 (0) 0.8279 (0.90) 0.8118 0.8118 0.7268
ionosphere 0.9344 0.9371 (0) 0.9427 (0.92) 0.9371 0.9342 0.8799
sonarP 0.8088 0.8136 (0) 0.8126 (0.88) 0.7893 0.8088 0.7367
vehicle 0.9788 0.9693 (0) 0.9539 (0.91) 0.9681 0.9670 0.9634
voteP 0.9576 0.9703 (0) 0.9525 (0.84) 0.8509 0.9703 0.9533
waveform 0.8812 0.8652 (0) 0.8600 (0.93) 0.8634 0.8620 0.8220
wdbcP 0.9595 0.9507 (0) 0.9418 (0.88) 0.9489 0.9507 0.9138
wine 0.9722 0.9722 (0) 0.9605 (0.89) 0.7514 0.9719 0.9500
wpbcP 0.7989 0.8036 (0) 0.7477 (0.91) 0.7850 0.7750 0.6911
NSL-KDD 0.9828 0.9766 (0) 0.9849 (0.88) 0.9610 0.9613 0.9976

SDWEC-A: λ = 0.1, β = 35, γ = 5, ε = 0.1 (mean sparsity 0.00)
SDWEC-B: λ = 10, β = 15, γ = 15, ε = 1.0 (mean sparsity 0.90)

Table 2: Comparison of accuracies (sparsity values are given in parentheses)

SDWEC finds ensemble weights for pre-trained classifiers; its workflow therefore consists of three sub-tasks (a sketch of constructing H from pre-trained base classifiers is given after this list).

  1. Training base classifiers on the training dataset: This sub-task is common to all ensemble methods that combine pre-trained classifiers. The pre-trained classifier can be a weak or a strong classifier; weak classifiers are generally faster to train with lower accuracy, while strong classifiers are slower to train with higher accuracy. The training time of the base classifiers depends on the training complexity of the method, which in turn depends on the number of data points in the training dataset (p), the number of features (d), the number of classes (i.e., binary or multi-label), and the data characteristics. The computational (time) complexity of base classifier training is independent of the proposed SDWEC method; thus, one can use a base classifier of one's choice. SDWEC aims to use few classifiers among the l trained base classifiers; therefore, weak decision tree classifiers (CART) [50] are used in the experiments.

  2. Finding ensemble weights on the validation dataset (SDWEC training): SDWEC finds the ensemble weights of the base classifiers using y and H. Here, y consists of the true labels and H consists of the l classifier predictions for the m data rows of the validation dataset. The speed of creating the matrix H depends on the choice of base classifier, the number of data points in the validation dataset (m), the number of features (d), the number of classes (i.e., binary), and the data characteristics. Therefore, this study only investigates the computational complexity (see Table 3) and execution time (see Table 4 and Figure 6) of the proposed SDWEC training method (Algorithm 1) for ensemble weight finding. Note that the computational complexity of SDWEC training only depends on the number of data points in the validation set (m), the number of classifiers (l), and the number of algorithm iterations (k) (see Table 3).

  3. Applying the ensemble to real-world data (i.e., the test dataset): The prediction time of SDWEC on test data (or unseen real-world data) depends on the prediction speed of the base classifiers and on the number of base classifiers selected by SDWEC (Algorithm 1). As the ensemble weights (w) become sparser (fewer nonzero elements in the solution w), fewer base classifiers are used in the testing phase. Thus, the testing time decreases as the weights become sparser, independently of the employed base classifier. In this study, a weak decision tree classifier is used as the base classifier since it is fast in training and prediction; thus, the testing time of SDWEC mostly depends on the sparsity of the obtained ensemble weights.

3.1. Experimental results: sparsity

The principle of parsimony (sparsity) states that simple explanations should be preferred to complicated ones [40]. In machine learning, sparsity is mostly used for feature selection. In our study, the principle of sparsity is used for selecting a subset of classifiers among the weak classifiers. In the experiments, the sparsity definition given in equation 13 is used, where 𝒮(w) = 0 corresponds to the least sparse solution and the solution becomes sparser as 𝒮(w) gets closer to 1. Depending on the dataset and the hyper-parameters used, SDWEC achieves different sparsity levels. When SDWEC is applied to 11 different datasets, sparsity levels between 0.80 and 0.88 have been achieved (Figure 4). This means that among 200 weak classifiers, from 24 classifiers (sparsity level of 0.88) to 40 classifiers (sparsity level of 0.80) are used in the ensembles.

$$\mathcal{S}(w) = 1 - \frac{1}{l}\left\|w\right\|_0 \tag{13}$$
where
$$\left\|w\right\|_0 = \#\{r \mid w_r \neq 0\}, \quad 1 \leq r \leq l$$
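
Equation 13 in code form, checked against the example in the text (24 of 200 classifiers kept gives 𝒮(w) = 0.88):

```python
import numpy as np

def sparsity(w):
    # Equation 13: fraction of zero weights; 0 = least sparse, values near 1 = very sparse
    return 1.0 - np.count_nonzero(w) / w.size

w = np.zeros(200)
w[:24] = 0.7          # 24 of 200 classifiers kept
print(sparsity(w))    # 0.88
```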

Figure 4: 4 datasets and their sparsity levels (λ = 1, β = 10, γ = 20, ε = 0.1).

Figure 5: Sparsity vs. accuracy of SDWEC. The sparsity and accuracy values are means over the 11 datasets. Corresponding values can be seen in Table 2.

Figure 6: Number of classifiers (l) versus SDWEC training time (see Table 4).

Line | Code in Alg. 1 | Complexity | Notes
6 | ŵ ← w | O(l) |
7 | γr ← γ/(|ŵr| + ε) | O(l) |
8 | construct S as diagonal form of Ss | O(ml) | S is the diagonal sparse matrix (m × m) formed from Ss (Eq. 4)
9 | construct vB | O(l) |
9 | construct C | O(l) | C is a sparse diagonal matrix
10 | X ← SH | O(ml) | X (m × l) = S (m × m) · H (m × l)
10 | XᵀX | O(l³) | X: O(l²), XᵀX: O(l³)
10 | M ← (2λ/m)[XᵀX] + (2/l)(C + I) | O(l³ + l²) |
11 | Xᵀy | O(l²) |
11 | (2ŵ − vB)/l | O(l) |
11 | b ← (2λ/m)Xᵀy + (2ŵ − vB)/l | O(l² + l) |
12 | solve Mw = b | O(l³) | Cholesky solver
13 | w ← max(w, 0) | O(l) |

M is dense, symmetric, real, and positive definite. A Cholesky solver is used to solve Mw = b, with cost O((2/3)l³).

Table 3: Computational complexity of SDWEC

Dataset | Rows (m) | Time (sec.), l = 100 | l = 200 | l = 500 | l = 1000
breast-cancer | 547 | 0.05 | 0.10 | 0.48 | 1.63
ionosphereP | 280 | 0.04 | 0.07 | 0.31 | 1.01
wpbcP | 155 | 0.03 | 0.06 | 0.26 | 0.89
wdbcP | 456 | 0.05 | 0.09 | 0.44 | 1.34
wine | 143 | 0.03 | 0.05 | 0.23 | 0.91
waveform | 4000 | 0.43 | 0.96 | 3.01 | 7.78
voteP | 186 | 0.03 | 0.07 | 0.24 | 0.97
vehicleP | 667 | 0.06 | 0.18 | 0.73 | 1.83
sonarP | 167 | 0.03 | 0.06 | 0.23 | 0.83
heartC | 239 | 0.03 | 0.07 | 0.25 | 1.02
NSL-KDD | 100778 | 12.73 | 25.95 | 80.23 | 204.59

Table 4: SDWEC training time (in seconds) on various datasets for different classifier counts l

Here, ||w||0 is the L0-norm of the vector w. Mathematically speaking, the L0-norm is not a proper norm since it is not absolutely homogeneous, although it satisfies the other norm properties. In practice, the L0-norm is a cardinality function, written in Lp-norm notation, that counts the number of nonzero elements of a given vector.

Two results with different sparsity values (A and B), chosen from Figure 5, are provided in Table 2. SDWEC-A has no sparsity: all 200 base classifiers have been used in the ensemble; thus, it has superior performance at the cost of testing time. SDWEC-A has the best accuracy values in 4 out of 10 datasets and is very close to the top performers in the others. QFWEC is only slightly more accurate than SDWEC-A in 4 datasets, while SDWEC-A is only slightly more accurate than QFWEC in another 4 datasets. SDWEC-A provides accuracies similar to the best performing method (QFWEC) since both use all base classifiers. SDWEC-B has 0.90 sparsity: only 20 of the 200 base classifiers have been used in the ensemble; nonetheless, it has the best accuracy values in 2 out of 10 datasets. Moreover, its accuracy values are only marginally lower (about 2%), while its testing time is significantly better (90%) than the other approaches. The testing time of the methods in Table 2 is defined as (1 − 𝒮(·))𝒯(l), where 𝒮(·) is the sparsity provided by the ensemble method (see equation 13) and 𝒯(l) is the testing time of all base classifiers. SDWEC-B has a 10 times faster testing time than QFWEC, since 𝒮(·) is 0 for QFWEC and 0.9 for SDWEC-B.

3.2. Computational complexity analysis

In this section, the computational complexity of SDWEC (Algorithm 1) is analyzed. First, the computational complexity of every pseudo-code line of Algorithm 1 is given in Table 3, and then the overall computational complexity is determined. In Table 3, m stands for the number of data points in the validation dataset, l stands for the number of base classifiers, and k stands for the iteration count.

The computational complexity inside the for loop is O(ml) + C₁O(l³) + C₂O(l²) + C₃O(l). Since l ≪ m, the dominant term is O(ml), for the SH multiplication in line 10 of Algorithm 1, where S is a diagonal matrix. With an iteration count of k, the final computational complexity of SDWEC is O(kml), which is linear in k, m, and l (see Table 4 and Figure 6). This computational complexity analysis shows the computational efficiency of the proposed numerical minimization.

Table 4 shows the training time (weight finding) of SDWEC on various datasets. Note that H is an input to Algorithm 1 and is calculated in a prior step; thus, the training times given in Table 4 correspond only to SDWEC training. In this experiment, the execution time depends only on the number of rows (m) and the number of classifiers (l), since a fixed iteration count is used (k = 25). In the training set, the NSL-KDD dataset (100778 instances) has about 25 times more instances than the waveform dataset (4000 instances), and the training time of NSL-KDD (25.95 s) is about 25 times that of waveform (0.96 s). In Figure 6, SDWEC training times are shown for 3 datasets with different numbers of data points (m), different numbers of classifiers (l), and a fixed iteration count. As seen in Table 4 and Figure 6, the practical execution times are in alignment with the theoretical computational complexity analysis. Slight differences between the theoretical analysis and the actual execution times are due to implementation issues and caching in CPU architectures.

4. Conclusion

In this article, a novel sparsity-driven ensemble classifier method, SDWEC, has been presented. An efficient and accurate solution for the original cost function (hard to minimize, non-convex, and non-differentiable) has been developed. A novel convex relaxation technique for the sign function and a novel adaptive log-sum-exp approximation of the max function, which reduces numerical overflow, are proposed. The computational complexity of SDWEC has been investigated theoretically and experimentally: SDWEC training is linear in the number of classifiers used (l), the number of instances in the validation dataset (m), and the number of algorithm iterations (k). SDWEC has been compared with other ensemble methods on well-known UCI datasets and NSL-KDD. According to the experiments, SDWEC decreases the number of classifiers used in the ensemble without significant loss of accuracy. By tuning the parameters of SDWEC, a sparser ensemble, and thus a shorter testing time, can be obtained with a small decrease in accuracy.

Appendix

Optimality conditions can be used to show strict convexity since equation 11 is in matrix-vector form and differentiable.

In equation 12, the first derivative of equation 11 is set to zero and a closed-form solution is obtained as a linear system, so the first-order optimality condition is satisfied.

Let G be the second derivative (Hessian) of the cost function J(n)(w) given in equation 11. A symmetric matrix G ∈ ℝˡˣˡ is called positive definite (and thus J(n)(w) is strictly convex), denoted by G ≻ 0, if xᵀGx > 0 for every x ∈ ℝˡ with x ≠ 0. Let us take the second derivative of the convex-relaxed cost function J(n)(w) given in equation 11:

$$\frac{\partial^2 J^{(n)}(w)}{\partial w^2} = \frac{2\lambda}{m}(SH)^\top(SH) + \frac{2}{l}C = G$$

Let us show that xᵀGx > 0 for all non-zero x:

$$x^\top\left(\frac{2\lambda}{m}(SH)^\top(SH) + \frac{2}{l}C\right)x > 0$$

Distributing xᵀ and x from the left and right:

$$\frac{2\lambda}{m}\,x^\top(SH)^\top(SH)\,x + \frac{2}{l}\,x^\top C\,x > 0$$

Since λ, m, and l are all positive, we just need to show that xᵀ(SH)ᵀ(SH)x > 0 and xᵀCx > 0:

$$x^\top(SH)^\top(SH)\,x > 0 \;\Longleftrightarrow\; (SHx)^\top(SHx) > 0$$

Let us define z = SHx; then zᵀz > 0, since S is a diagonal matrix with all positive elements (see equation 4), H contains nonzero elements in {−1, 1}, and x is a non-zero vector.

C is a diagonal matrix with diagonal elements Cr, 1 ⩽ r ⩽ l, which are defined as below (from the second-order Taylor approximation):

$$C_r = \frac{\gamma_r\left(4u_r^2 + 8u_r^3 + 4u_r^4 + \beta u_r + 2\beta u_r^3 + \beta u_r^5\right)}{2\left(u_r^3 + u_r^2 + u_r + 1\right)^2}$$
where ur = e^(ŵr γr). Here, β is a positive constant, γr = γ(|ŵr| + ε)⁻¹ is always positive since γ > 0, and ur is always positive since ŵrγr ⩾ 0. Thus, xᵀCx > 0 is satisfied since Cr is always positive.
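
As a numerical sanity check (with arbitrary test values), the expression for Cr above agrees with half the second derivative, at ŵr, of the smoothed penalty f(w) = (1/γr)log(e^(−γr w) + e^(γr w)) + (β/γr)log(e^(−γr w) + 1), which is what the second-order Taylor expansion in equation 10 requires:

```python
import numpy as np

def c_r_paper(w_hat, gamma_r, beta):
    # C_r as given in the appendix, with u_r = exp(w_hat * gamma_r)
    u = np.exp(w_hat * gamma_r)
    num = gamma_r * (4*u**2 + 8*u**3 + 4*u**4 + beta*u + 2*beta*u**3 + beta*u**5)
    return num / (2.0 * (u**3 + u**2 + u + 1)**2)

def c_r_taylor(w_hat, gamma_r, beta):
    # Half the second derivative of the smoothed penalty f(w), evaluated at w_hat
    g = gamma_r
    return 0.5 * (g / np.cosh(g * w_hat)**2
                  + beta * g * np.exp(g * w_hat) / (np.exp(g * w_hat) + 1)**2)

w_hat, gamma_r, beta = 0.3, 5.0, 15.0
print(c_r_paper(w_hat, gamma_r, beta), c_r_taylor(w_hat, gamma_r, beta))  # identical values
```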

Therefore, both the first-order and the second-order optimality conditions are satisfied, which shows that the cost function J(n)(w) given in equation 11 is strictly convex.

References

[7]LI Kuncheva, Combining pattern classifiers: methods and algorithms, John Wiley & Sons, 2005.
[10]Y Freund, R Schapire, and N Abe, “A short introduction to boosting”, Journal of Japanese Society for Artificial Intelligence, Vol. 14, 1999, pp. 771-780.
[16]A Özgür and H Erdem, “Feature selection and multiple classifier fusion using genetic algorithms in intrusion detection systems”, Journal of the Faculty of Engineering and Architecture of Gazi University, Vol. 33, No. 1, 2018, pp. 75-87.
[37]M Grant and S Boyd, “CVX: Matlab software for disciplined convex programming, version 2.1”, March 2014. http://cvxr.com/cvx
[38]M Grant and S Boyd, “Graph implementations for nonsmooth convex programs”, V Blondel, S Boyd, and H Kimura (editors), Recent Advances in Learning and Control, Springer-Verlag Limited, 2008, pp. 95-110. Lecture Notes in Control and Information Sciences, http://stanford.edu/~boyd/graph_dcp.html.