International Journal of Computational Intelligence Systems

Volume 9, Issue 1, January 2016, Pages 184 - 201

Hyperrectangles Selection for Monotonic Classification by Using Evolutionary Algorithms

Javier García1,, Adnan M. AlBar2,, Naif R. Aljohani2,, José-Ramón Cano1,, Salvador García3,
1Department of Computer Science, University of Jaén, EPS of Linares, Calle Alfonso X el Sabio S/N, Linares, 23700, Spain
2Information Systems Department, Faculty of Computing and Information Technology, King Abdulaziz University, Kingdom of Saudi Arabia
3Department of Computer Science and Artificial Intelligence, University of Granada, Calle Periodista Daniel Saucedo Aranda S/N, Granada, 18071, Spain
Received 4 July 2015, Accepted 3 January 2016, Available Online 20 January 2016.
10.1080/18756891.2016.1146536How to use a DOI?
Monotonic Classification; Nested Generalized Examples; Evolutionary Algorithms; Rule Induction; Instance-based Learning

In supervised learning, some real problems require the response attribute to represent ordinal values that should increase with some of the explaining attributes. They are called classification problems with monotonicity constraints. Hyperrectangles can be viewed as storing objects in ℝn which can be used to learn concepts combining instance-based classification with the axis-parallel rectangle mainly used in rule induction systems. This hybrid paradigm is known as nested generalized exemplar learning. In this paper, we propose the selection of the most effective hyperrectangles by means of evolutionary algorithms to tackle monotonic classification. The model proposed is compared through an exhaustive experimental analysis involving a large number of data sets coming from real classification and regression problems. The results reported show that our evolutionary proposal outperforms other instance-based and rule learning models, such as OLM, OSDL, k-NN and MID; in accuracy and mean absolute error, requiring a fewer number of hyperrectangles.

© 2016. the authors. Co-published by Atlantis Press and Taylor & Francis
Open Access
This is an open access article under the CC BY-NC license (

1. Introduction

The classification with monotonicity constraints, also known as monotonic classification 1, is an ordinal classification problem where a monotonic restriction is present: a higher value of an attribute in an example, fixing other values, should not decrease its class assignment. The monotonicity of relations between the dependent and explanatory variables is very usual as a prior knowledge form in data classification 2. To illustrate, while considering a credit card application 3, a $1000 to $2000 income may be considered a medium value of income in a data set. If a customer A has a medium income, a customer B has a low income (i.e. less than $1000) and the rest of input attributes remain the same, there is a relationship of partial order between A and B: B < A. Considering that the application estimates lending quantities as output class, it is quite obvious that the loan that the system should give to customer B cannot be greater than the given to customer A. If so, a monotonicity constraint is violated in the decision.

The estimation of the knowledge about monotonicity in learning models is of a great interest for two main arguments 4. Firstly, monotonicity imposes constraints on the prediction function. This decreases the size of the hypothesis space and also the complexity of the model. Secondly, the domain experts decide the acceptance or rejection of the models yielded if they are consistent with the domain knowledge, regardless of their accuracy 5.

Many data learning algorithms have been adapted to be able to handle monotonicity constraints in several styles. There are two steps to treat with monotonic classification problems. The first one is to preprocess the data 6 in order to “monotonize” the data set 7, rejecting the examples that violate the monotonic restrictions or selecting features to improve classification performance and avoid overfitting 8,9; and the second one is to force learning only monotone classification functions. Proposals of this type are: classification trees and rule induction 10,11,12,13, neural networks 14 and instance-based learning 15,16,17.

Instance-based learning was baptized as learning family in 18 and considers a set of methods widely used in machine learning 19. An analogous scheme for instance-based learning is the Nested Generalized Exemplar (NGE) theory. It was announced in 20 and perform various major adjustments to the instance-based learning model. The most relevant is that it allows two type of examples, standard examples represented as single points and generalized examples which fill a hyper-volume in ℝn. They are closely connected to the nearest neighbor classifier (NN) 21, in such a way that they are intended to enhance it. NGE learning algorithms are being more popular in recent years due to the simplicity and effectiveness of the outcome they provide.

In this manner, the hyperrectangles are generalization of examples in ℝn, according to NGE theory. Single and generalized examples coexist and hyperrectangles may be nested and inner hyperrectangles serve as exceptions to surrounding hyperrectangles. They constitute axis-parallel rectangle representations as in many of the rule learning systems 22. Using this model, a new example can be classified by estimating the Euclidean distance between it and every of the hyperrectangles stored. The class is predicted according to the label associated with the nearest hyperrectangle. In the case that two or more hyperrectangles cover the example, it is necessary to resolve a possible conflict derived from different labels among the hyperrectangles 20.

The profits of combining hyperrectangles with instances to build classification models are pointed out in the literature 23,24,25. Looking at rule induction 22, the modeling of decision surfaces derived from combinations between parallel axis separators and Voronoi diagrams (typical decision surfaces of NN classifiers) may adapt the prediction to examples drawn along curves, improving the performance in complex domains. Regarding instance-based classification 18, the hyperrectangles adds interpretability to the model and reduces storage requirements.

The generation of an optimal minimal number of hyperrectangles for classifying a set of points is NP-hard. Heuristic algorithms produce a large but finite subset of hyperrectangles from the training data. Nevertheless, it may be easy that almost all hyperrectangles modeled could be unnecessary. Thus, there is a need for selecting the most influential ones, and it can be done by using data reduction schemes 6. Evolutionary Algorithms (EAs) 26 have been used for data reduction with promising results 27. They have been favorably used in NGE learning in the past 28,29.

In this paper, we propose the utilization of EAs for hyperrectangles’ selection in monotonic classification tasks. Our goal is to increase the performance in this type of problem by means of selecting the best suitable set of hyperrectangles which optimizes the nearest hyperrectangle classification with monotonicty constraints. We compare our algorithm with other monotonic learning models belonging to both instance-based and rule learning paradigms. They are OLM 15, MID 10, OSDL 16 and monotonic k-NN 17. The experimental design incorporates non-parametrical statistical testing 30,31. The results show a significant improvement in accuracy whereas the number of examples stored in the final subset is further reduced. Besided, the main key point of our proposal is that the model built is composed of generalized examples which all fulfill the monotonicity constraints.

The paper is organized as follows. Section 2 provides an background in monotonic classification and the NGE learning model. In Section 3, all topics concerning the approach proposed are described. In Section 4 the experimentation framework is given and in Section 5 the results and analysis are presented. In Section 6, the conclusions are highlighted.

2. Background

We will do a brief review of the monotonic classification including the description of the most used techniques for monotonic classification in Subsections 2.1 and 2.2. The NGE classification will be described in Subsection 2.3.

2.1. Monotonic Classification

Ordinal classification problems are those in which the class is neither numeric nor nominal. Instead, the class values are ordered. For instance, a worker can be described as “excellent”, “good” or “bad”, and a bond can be evaluated as “AAA”, “AA”, “A”, “A-”, etc. Similar to a numeric scale, an ordinal scale has an order, but it does not possess a precise concept of distance. Ordinal classification problems are important, since they are fairly common in our daily life. Employee selection and promotion, determining credit rating, bond rating, economic performance of countries, industries and firms, and insurance underwriting, are examples of ordinal problem-solving in business. Rating manuscripts, evaluating lecturers, student admissions, and scholarships decisions for students, are examples of ordinal decision-making in academic life. Ordinal problems have been investigated in scientific disciplines such as information retrieval, psychology, and statistics for many decades 4.

A monotonic classifier is one that will not violate monotonicity constraints. Informally, the monotonic classification implies that the assigned class values are monotonically non-decreasing (in ordinal order) with the attribute values. More formally, let {xi, class(xi)} denote a set of examples with attribute vector xi = (xi,1,…,xi,m) and a class, class(xi), being n the number of instances and m the number of attributes. Let xi xh iff ∀j=1,…,m, xi,jxh,j.

A data set {xi, class(xi)} is monotonic if and only if all the pairs of examples i, h are monotonic with respect to each other 10 (see equation 1).


Some monotonic ordinal classifiers require monotonic data sets to successfully learn, although there are others that are capable of learning from non-monotonic data sets as well.

2.2. Monotonic Classification Methods

Four monotonic learning methods are pioneers and well-known in the field of monotonic classification. In the following, we will discuss each one in detail.

  • The Ordinal Learning Model (OLM) 15 is a very simple algorithm that learns ordinal concepts by eliminating non-monotonic pairwise inconsistencies. The generated concepts can be viewed as rules. During the learning phase, each example is checked against every rule in a rule-base, which is initially empty. If an example is inconsistent with a rule in the rule-base, one of them is selected at random while the other is discarded, but if the example is selected, it must be checked for consistency against all the other monotonicity rules. If it passes this consistency test, it is added as a rule. Consequently, the rule-base is kept monotonic at all times. Classification is done conservatively. All the rules are checked in decreasing order of class values against an attribute vector, and the vector is classified as the class of the first rule that covers it. If such a rule does not exist, the attribute vector is assigned the lowest possible class.

  • As its name implies, Ordinal Stochastic Dominance Learner (OSDL) 16, is based on the concept of Ordinal Stochastic Dominance (OSD). The rationale behind OSD can be given through an example: In life insurance, one may expect a stochastically greater risk to the insurer from older and sicker applicants than from younger and healthier ones. Higher premiums should reflect greater risks and vice versa. There are several definitions of stochastic ordering. The stochastic order computes when a random variable is bigger than another. Considering this order, stochastic dominance can be established as a form of stochastic order. In this case, a probability distribution over possible predictions can be ranked. The ranking depends of the nature of the data set. Stochastic dominance refers to a set of relations that may hold between a pair of distributions.

    The most commonly used, as stochastic ordering, is first OSD, which was used by 36 in the OSDL. For each vector xi, the OSDL computes two mapping functions: one that is based on the examples that are stochastically dominated by xi with the maximum label (of that subset), and the second is based on the examples that cover (i.e., dominate) xi, with the smallest label. Later, an interpolation between the two class values (based on their position) is returned as a class.

  • The monotonic kNN (Mk-NN) was proposed in 17. This method consists of two steps. In the first step, the training data is made monotone by relabeling as few cases as possible. This relabeled data set may be considered as the monotone classifier with the smallest error index in the training data. In the second step, we use a modified nearest neighbour rule to predict the class labels of new data so that violations of the restrictions of monotonicity will not occur. Considering the monotonic nearest neighbor rule, the class label assigned to a new data point x0 must lie in the interval [classmin,classmax], where

    where {xi, class(xi)} is the monotone data set. To conserve the monotonicity the choice of the class value for x0 must be in this interval.

  • A monotone extension of ID3 (MID) was proposed by Ben-David 10 using an additional impurity measure for splitting, the total ambiguity score. However, the resulting tree may not be monotone anymore even when starting from a monotone data set. MID defines the total-ambiguity-score as the sum of the entropy score of ID3 and the order-ambiguity-score. This last score is defined in terms of the non-monotonicity index of the tree, which computes the number of pair branches that are non-monotonic regarding the total possible non-monotonic pairs there may be.

2.3. NGE Learning

NGE is a learning paradigm based on class exemplars, where an induced hypothesis has the graphical shape of a set of hyperrectangles in ℝn. Exemplars of classes are either hyperrectangles or single instances 20. The input of an NGE system is a set of training examples, each described as a vector of pairs numeric_attribute/value and an associated class. Attributes can either be numerical or categorical. Numerical attributes are usually normalized in the [0,1] interval.

In NGE, an initial set of hyperrectangles in ℝn formed by single points directly taken from the data is generalized into a smaller set of hyperrectangles regarding the elements that it contains. Choosing which hyperrectangle is generalized from a subset of points or other hyperrectangles and how it is generalized depends on the concrete NGE algorithm employed.

The matching process is one of the central features in NGE learning and it allows some customization, if desired. Generally speaking, this process computes the distance between a new example and an exemplar memory object (a generalized example). Let the example to be classified be termed as x0 and the generalized example as G, regardless of whether G is formed by a single point or if it has some volume.

The model computes a match score between x0 and G by measuring the Euclidean distance between two objects. The Euclidean distance is well-known when G is a single point. Otherwise, the distance is computed as follows (considering numerical attributes):

where m is the number of attributes of the data, x0,j is the value of the jth attribute of the example x0, max(Gj) and min(Gj) are the maximum and minimum values of G for the jth attribute and maxj and minj are the maximum and minimum values for jth attribute in training data, respectively.

The distance measure represents the length of a line dropped perpendicularly from the point x0 to the nearest surface, edge or corner of G. Note that internal points in a hyperrectangle have a distance of 0 to that rectangle. In the case of overlapping rectangles, several strategies could be implemented, but it is usually accepted that a point falling in the overlapping area belongs to the smaller rectangle (the size of a hyperrectangle is defined in terms of volume). The volume is computed following the indications given in 23. In nominal attributes, the distance is 0 when two attributes have the same categorical label, and 1 on the contrary.

3. Evolutionary Selection of Hyperrectangles for Monotonic Clasification

The approach proposed in this paper, named Evolutionary Hyperrectangle Selection for Monotonic clasification by CHC (EHSMC-CHC), is fully explained in this section. First, we approximate the NGE theory to monotonic classification in Subsection 3.1 by providing some definitions. Then, we describe the process for generating the initial set of hyperrectangles in Subsection 3.2. After this, we introduce the CHC model used as an EA to perform hyperrectangle selection in Subsection 3.3. Finally, the specific issues regarding representation and fitness function are specified in Subsection 3.4.

3.1. Definitions

We represent each instance with xi = (xi,1; xi,2;.……;xi,m;class(xi)), where xi,j is the input value of the instance i in the attribute j, m is the total number of input attributes and class(xi) is the class or output value of the instance i. Let the hyperrectangles be denoted by H:(jAj;C) , where Aj is a condition belonging to the antecedent and C is the response. A formal definition is given next:


= {[jmin,jmax]if numerical,{α,β,.}if nominal,


= a value of the class,

where jmin and jmax are the minimum and maximum boundaries for the condition j respectively, and α, β,… are the possible categorical values belonging to the domain of the attribute j, if it is nominal.

The hyperrectangles should not break monotonicity in the sense we saw in Subsection 2.1. Hence, we define a partial order relation in the set of disjoint hyperrectangles, deciding when a precedent is higher than, lower than or equal to another:

Aj¯Ajif attributejis numeric andjmaxjminor if attributejis nominal andAjAj,Aj¯Ajif attributejis numeric andjmaxjminor if attributejis nominal andAjAj,Aj=Ajif attributejis numeric andjmin=jminandjmax=jmaxor if attributejis nominal andAjAj.

Then let two hyperrectangles H and H′ be defined as:


Two hyperrectangles are comparable if H H′ or H H′, and non-comparable in contrary case, since we cannot establish an order between them. Hence, an hyperrectangle H:(jAj,C) will be non-monotonic with respect to another H:(jAj,C) if:


The distance between an instance xi and an hyperrectangle H is defined as follows:

where dis j and Range are defined differently depending on the type of the attribute. If the attribute is numeric:

If the attribute is nominal:

disj={0xi,jAj,1otherwise,Range=Num. of possible values of the attribute.
where xi,j is the value of the j-th attribute of the instance i, jmax and jmin are the maximum and minimum values of the hyperrectangle H in the attribute j and maxi and mini are the maximum and minimum values of the attribute j in the data set.

The distance between two hyperrectangles H and H′ is defined as:

where dis j and Range are defined as:
if the attribute j is numeric and
if the attribute j is nominal.

3.2. Getting the Initial Set of Hyperrectangles

We start from a training set TR with n instances which consists of pairs (xi, class(xi)), i = 1, …, n. Each one of the n instances has m input attributes.

In this first phase will get a hyperrectangle set HS with N hyperrectangles whose rule representation consists of pairs (Hi, class(Hi)), i = 1, …, N, where Hi defines a set of conditions (A1,A2, …, Am) and class(Hi) defines the associated class label of the hyperrectangle. Each one of the N hyperrectangles has m conditions which can be numerical conditions, expressed in terms of minimum and maximum values in intervals [0,1]; or they can be categorical conditions, by using a set of possible values Ai = {v1i, v2i, …, vvi}, assuming that it has vi different values. Note that we make no distinction between a hyperrectangle with volume and minimal hyperrectangles formed by isolated points.

In this first stage of our method, we have used a simple heuristic which is fast and yields acceptable results. The heuristic yields a hyperrectangle from each example in the training set. For each one, it finds the k − 1 nearest neighbors being the k−th neighbor an example of a different class. Then, each hyperrectangle is expanded considering these k − 1 neighbors by using, in the case of numerical attributes, the minimal and maximal values as the limits of the interval defined, or getting all the different categorical values, in the case of nominal attributes, to form a subset of possible values from them.

Once all the hyperrectangles are obtained, the duplicated ones are removed, keeping one representative in each case. Hence |HS| ≤ |TR|. Note that point hyperrectangles are possible to be obtained using this heuristic when the nearest neighbor of an instance belongs to a different class.

3.3. CHC Model

As an evolutionary computation method, we have used the CHC model 32. CHC is a classical evolutionary model that introduces important aspects to obtain a trade-off between exploration and exploitation; such as incest prevention, restoration of the search process when it becomes locked and the fight among parents and offspring into the replacement process.

During each generation the CHC realizes the following stages:

  • NC children born after mating of NC individuals of the father population.

  • Then, among the 2NC individuals formed by parents and children takes place a struggle for survival in which only remain NC individuals for the new generation.

CHC also implements a form of heterogeneous recombination using HUX, a special recombination operator 32. HUX exchanges half of the bits that differ between parents, where the bit position to be exchanged is randomly determined. CHC also employs a method of incest prevention. Before applying HUX to the two parents, the Hamming distance between them is measured. Only those parents who differ from each other by some number of bits (mating threshold) are mated. The initial threshold is set at L/4, where L is the length of the chromosomes. If no offspring are inserted into the new population then the threshold is reduced by one.

CHC also performs a form of heterogeneous HUX recombination using a special operator recombination 32. HUX exchanged half the bits that differ between parents, where the bit position to exchange is determined randomly. CHC also employs a method of preventing incest. Before applying HUX to the both parents, the Hamming distance between them is measured. Only parents who are distinguished by some number of bits (mating threshold) are coupled. The initial threshold is stable at L / 4, where L is the length of the chromosomes. If no offspring are inserted into the new population then the threshold it decrements by one.

No mutation is applied during the recombination phase. Instead, when the population converges or the search stops making progress (i.e., the difference threshold has dropped to zero and no new offspring are being generated which are better than any member of the parent population) the population is reinitialized to introduce new diversity to the search. The chromosome representing the best solution found over the course of the search is used as a template to reseed the population. Reseeding of the population is accomplished by randomly changing 35% of the bits in the template chromosome to form each of the other NC – 1 new chromosomes in the population. The search is then resumed.

3.4. Representation and Fitness Function

Let SHS be the subset of selected hyperrectangles that result from the run of a hyperrectangle selection algorithm. Hyperrectangle selection can be considered as a search problem to which EAs can be applied. We take into account two important issues: the specification of the representation of the solutions and the definition of the fitness function.

  • Representation: We use a binary representation, a chromosome consists of N genes (one from each hyperrectangle in HS) with two possible values: 0 and 1. If the gene is 1, the associated hyperrectangle is included in the subset represented by S. If this is 0, this is not included.

  • Fitness Function: Let S be a subset of hyperrectangles of HS and be coded by a chromosome. We define a fitness function based on the accuracy (classification rate) and the Non-Monotonic Index (NMI) evaluated over TR through the formula.


    clas_rat means the percentage of correctly classified objects from TR using S. perc_red is defined as


    cover refers the total coverage of examples in TR in the subset of selected hyperrectangles or, in other words, the number of examples of TR whose distance value has been equal to 0 (examples covered by hyperrectangles in S).

    NMI_red is defined as

    where NMI_initial is the number of monotonic violations computed by the predictions made for all the rules derived from the training set and divided by the total number of pairs of examples. NMI_current is similarly computed to NMI_initial, but instead using the rules selected by the chromosome. NMI_red represents the relative reduction of NMI achieved by each chromosome and it is weighted by a constant factor of 50 due to the fact that this measure represents low values with respect to classification and reduction rates.

The objective of the EAs is to maximize the fitness function defined, i.e., maximize the classification rate and coverage while minimizing the number of hyperrectangles selected and the anti-monotonic index. Although the fitness function defined is focused on discarding single trivial hyperrectangles (points), exceptions could be present in special cases where points are necessary to achieve high rates of accuracy. Thus, the necessity of using single hyperrectangles or not will be determined by the tradeoff accuracy-coverage and will be conditioned by the problem tackled.

Regarding the parameters, we preserved the value of α = 0.5 as the best choice, due to the fact that it was analyzed in previous works related to instance selection 33,34,35. We have conducted several experimental evaluations to estimate the best values of these parameters. For the sake of shortness, we determined that a suitable value for β should fall near 0.66 and the value for λ should be close to 0.25.

It is worth mentioning that our objective is to identify the best values of the parameters that configure the evolutionary approach in a general set up. It is true that for a specific classification problem, these values could be tuned in order to optimize the results achieved, but this may affect to other aspects, like efficiency or simplicity. General rules can be given about this topic:

  • The number of evaluations and population size are the main factors for yielding good results in accuracy and simplicity. The raise of these values has a negative effect on efficiency. In larger problems, it may be necessary to increase both values, but we will show in the experimental study that values of 10,000 and 50 work appropriately, respectively.

  • Parameters α and β allow us to obtain a desired trade-off between the accuracy and the number of hyperrectangles created. In the case of obtaining poor accuracy rates in a specific problem we have to increase α or decrease β. In contrary case, when the rules yielded are numerous and we are interested in producing simpler models, we have to increase β or decrease α. Besides, a large value of λ penalizes the maximization of the accuracy and minimization of the number of hyperrectangles in favor of the loss of anti-monotonic index.

Regarding to the specification of the classification conflicts, we employ the same mechanisms as shown in 20. In short, they are:

  • If no hyperrectangle covers the example, the class of the nearest hyperrectangle defines the prediction.

  • If various hyperrectangles cover the example, the one with lowest volume is the chosen to predict the class, allowing exceptions within generalizations.

    Our approach computes the volume of a hyperrectangle in the following way:

    where Lj is computed for each condition as
    Lj={HupperHlowerif numeric andHupperHlower,1if numeric andHupper=Hlower,num. values selectedviif nominal.

4. Experimental Framework

In this section, we present the experimental framework developed to analyze and compare our proposal EHSMC-CHC with other well-known algorithms presented in this domain. The study will incorporate two types of data set. On the one hand, we will involve real life data sets coming from standard classification problems whose classes are not initially ordered. For our purposes, we instigate the sorting of the class values, assigning each category a number. The order among classes is made according to the arrangement that achieves the fewest number of monotonicity violations reported in each data set. On the other hand, we will use regression data sets whose output attribute has a numeric domain which will be ordered into categorical values. In this second subset, the order of classes is natural and we tackle real ordinal prediction problems.

In the next subsection, the experimental methodology will be specified, including data sets, performance measures, parameters of the algorithms and statistical validation.

4.1. Experimental methodology

The elements included in the framework are the following:

  • Data sets: The study includes a total of 30 data sets, as a result of the joint selection of standard classification data sets (CLAS) whose class attribute is transformed into an ordinal attribute, assigning each category a number in such a way that the number of pairs of non-monotonic examples is minimized, and regression data sets (REGR) whose class attribute is discretized into 4 or 10 categorical values, keeping the class distribution balanced. Also, four classical ordinal classification data sets (ORD) are included in this study (era, esl, lev and swd) 1. Their main characteristics are described in Table 1. They are classical data sets used in the classification scope and extracted from the UCI and KEEL repositories 37,38. The run of the algorithms has been done following a 10-fold cross validation procedure (10-fcv).

    Table 1 shows the name of the data sets and their number of instances, variables and classes. In case of containing missing values, the total number of instances of the original data set appears in parenthesis. In this paper, these instances have been ignored.

  • Evaluation metrics:

    • Accuracy (Acc), defined as the number of successful hits relative to the total number of classifications. It has been by far the most commonly used metric for assessing the performance of classifiers for years 39.

    • Mean Absolute Error (MAE), calculated as the sum of the absolute values of the errors and then dividing it by the number of classifications. The errors between the real label and the predicted label are estimated by the difference between the ordinal class values. Various studies conclude that MAE is one of the best performance metrics in ordered classification 40.

    • Monotonic Accuracy (MAcc), computed as standard Acc, but only considering those examples that completely fulfill the monotonicity constraints. In other words, non-monotonic comparable examples do not take part in the calculation of MAcc. The sense behind this is to simulate that future examples to be classified will check the monotonicity assumption. This metric serves as a manner of monotonicity level measurement of the predictions carried out. To address this, the process followed is presented in Figure 1 where the partitions used in 10-fcv are modified conserving the training data sets but removing the non-monotonic instances in the test data sets. We must point out that in the new validation data sets the test instances belong to the original data set, they are real instances, not artificial. In the same way for this new data sets obtained, we must highlight that the test data sets have been generated using different strategies than the noise removal filter or relabeling, searching to fair comparisons. The process presented in Figure 1 does not present any kind of randomness due to it is based in a deterministic greedy algorithm (see Algorithm 1).

    • Monotonic Mean Absolute Error (MMAE), used to pursue the same goal of MAcc, but using MAE instead of Acc. Obviously, the data sets used in this case are the same than in the MAcc analysis.

    • Non-Monotonicity Index (NMI) 41, defined as the number of clash-pairs divided by the total number of pairs of examples in the predictions made by an algorithm:

      where x is an example from the data set D. NClash(x) is the number of examples from D that do not meet the monotonicity restrictions (or clash) with x and n is the number of instances in D.

    • Number of Rules (NRules), number of rules/hyperrectangles that compose the models produced by the algorithms.

  • Parameters configuration: See Table 2.

  • Statistical procedures: Several hypothesis testing procedures are considered to determine the most relevant differences found among the methods 30. The use of non-parametric tests will be preferred to parametric ones, since the initial conditions that guarantee the reliability of the latter may not be satisfied. This could cause the statistical analysis to lose credibility. Friedman ranks test 31 is used to contrast the behavior of each algorithm. It highlights the significant differences between methods if they appear.

Figure 1:

Process of transformation the 10-fcv to the new one by removing the non-monotonic instances from the test data sets.

Data set Ins. At. Cl. Type Data set Ins. At. Cl. Type
appendicitis 106 7 2 CLAS hepatitis 80 (155) 19 2 CLAS
bands 365 (539) 19 2 CLAS ionosphere 351 33 2 CLAS
baseball10cl 337 16 10 REG iris 150 4 3 CLAS
baseball4cl 337 16 4 REG machinecpu10cl 209 6 10 REG
breast 277 (286) 9 2 CLAS machinecpu4cl 209 6 4 REG
cleveland 297 (303) 13 5 CLAS movement_libras 360 90 15 CLAS
dee10cl 365 6 10 REG newthyroid 215 5 3 CLAS
dee4cl 365 6 4 REG pima 768 5 2 CLAS
dermatology 358 (366) 34 6 CLAS sonar 208 60 2 CLAS
ecoli 336 7 8 CLAS spectfheart 267 4 2 CLAS
era 1000 4 9 ORD swd 1000 10 4 ORD
esl 488 4 9 ORD vowel 990 13 11 CLAS
german 1000 20 2 CLAS wdbc 569 30 2 CLAS
glass 214 9 7 CLAS wine 178 13 2 CLAS
haberman 306 3 2 CLAS wisconsin 683 (699) 9 2 CLAS
Table 1:

Description of the 30 data sets used in the study.

Algorithm Parameters
Mk-NN k = {1, 3}, distance = euclidean

OLM modeResolution = conservative
modeClassification = conservative

OSDL classificationType = media, balanced = No
weighted = No, tuneInterpolationParameter = No,
lowerBound = 0, upperBound = 1
interpolationParameter = 0.5, interpolationStepSize = 10

MID confidence = 0.25, 2 items per leaf, R = 1

EHSMC-CHC Popul. Size = 50, Num. Evaluations = 10000
α = 0.5, β = 0.66, λ = 0.25
Table 2:

Parameters considered for the algorithms compared.

function GreedyRemove(T - dataset)
   while NumberOfTotalCollisions(T)>0 do
     maxColis=0, instancSelec=0;
     for each instance i in T do
       if Colis>maxColis then
         maxColis=Colis, instancSelec=i;
       end if
     end for
     T = T - instancSelec;
   end while
   return T
end function
Algorithm 1

Greedy Non-Monotonic Instances Removal Algorithm for test partitions.

5. Results and Analysis

This section shows the results obtained in the experimental study as well as the analysis based on them. Tables 3 and 4 report the results measured by accuracy, MAE and their monotonic versions explained before (MAcc and MMAE) in test data. Furthermore, Table 5 shows the NMI measured in the prediction for each algorithm and the number of rules reported for those algorithms which produce rules: MID and OLM. The best case in each data set is stressed in bold. The last row in each table shows the average measured by considering all data sets.


appendicitis 0.8673 0.8236 0.8018 0.7736 0.1855 0.1591 0.1327 0.1764 0.1982 0.2264 0.8145 0.8409
bands 0.6658 0.6194 0.3702 0.6219 0.6859 0.6250 0.3342 0.3806 0.6298 0.3781 0.3141 0.3750
baseball10cl 0.2796 0.2493 0.1988 0.2643 0.2494 0.2314 1.3814 1.4658 2.0048 1.5855 1.7870 1.8635
baseball4cl 0.6647 0.5850 0.5289 0.4983 0.4748 0.4961 0.3678 0.4802 0.5869 0.6028 0.6822 0.6611
breast 0.7146 0.6749 0.6817 0.5447 0.6272 0.6310 0.2854 0.3251 0.3183 0.4553 0.3728 0.3690
cleveland 0.5389 0.5517 0.5657 0.5655 0.5254 0.5253 0.9424 0.6931 0.8477 0.6471 0.7436 0.7774
dee10cl 0.2875 0.2633 0.1781 0.0985 0.2384 0.2766 1.3971 1.2493 2.6059 4.5396 1.8289 1.7619
dee4cl 0.6493 0.6053 0.4682 0.2522 0.5454 0.5809 0.4085 0.4221 0.7838 1.4927 0.5887 0.5259
dermatology 0.9354 0.9239 0.3800 0.1259 0.7898 0.8094 0.1435 0.1691 1.5331 1.7599 0.4625 0.4259
ecoli 0.8126 0.8007 0.5715 0.0324 0.5923 0.5893 0.7016 0.7738 1.5379 2.3482 1.4875 1.5829
era 0.1620 0.2740 0.1690 0.2320 0.1430 0.1410 2.1480 1.3630 2.1500 1.2850 2.3360 2.3400
esl 0.5353 0.6784 0.5963 0.6785 0.3956 0.3915 0.7750 0.3525 0.4324 0.3521 0.7418 0.7459
german 0.6970 0.6920 0.7110 0.3740 0.6370 0.6330 0.3030 0.3080 0.2890 0.6260 0.3630 0.3670
glass 0.6975 0.6200 0.3244 0.3379 0.7172 0.6783 0.5602 0.7595 1.7816 1.7729 0.5408 0.6460
haberman 0.7514 0.7120 0.3496 0.7158 0.4803 0.4771 0.2486 0.2880 0.6504 0.2842 0.5197 0.5229
hepatitis 0.8343 0.7735 0.2497 0.8118 0.8222 0.8030 0.1657 0.2265 0.7503 0.1882 0.1778 0.1970
ionosphere 0.9174 0.9002 0.6269 0.7407 0.7071 0.6702 0.0826 0.0998 0.3731 0.2593 0.2929 0.3298
iris 0.9467 0.9533 0.9333 0.3667 0.9533 0.9533 0.0533 0.0467 0.0667 0.9067 0.0467 0.0467
machinecpu10cl 0.2864 0.3298 0.3205 0.3493 0.3105 0.2962 1.6021 1.2550 1.4960 1.2631 1.4176 1.4938
machinecpu4cl 0.6457 0.6267 0.6219 0.5736 0.6267 0.6505 0.4352 0.3924 0.4452 0.4843 0.4117 0.3929
movement_libras 0.7222 0.6667 0.3000 0.0611 0.6139 0.5778 1.2444 1.5667 4.8306 7.0333 1.7528 2.0944
newthyroid 0.9491 0.9264 0.5636 0.1439 0.7957 0.7723 0.0833 0.1104 0.7662 0.8654 0.3165 0.3578
pima 0.6787 0.7265 0.7110 0.6563 0.7358 0.7320 0.3213 0.2735 0.2890 0.3437 0.2642 0.2680
sonar 0.7312 0.7460 0.4662 0.5419 0.8555 0.8307 0.2688 0.2540 0.5338 0.4581 0.1445 0.1693
spectfheart 0.7942 0.7608 0.1983 0.7830 0.7236 0.7050 0.2058 0.2392 0.8017 0.2170 0.2764 0.2950
swd 0.3930 0.5660 0.4040 0.5820 0.3360 0.3360 0.6630 0.4670 0.7510 0.4350 0.8810 0.8810
vowel 0.7838 0.7990 0.0909 0.0859 0.9949 0.9788 0.4313 0.5586 5.0000 4.8273 0.0101 0.0444
wdbc 0.9364 0.9350 0.3392 0.3568 0.3465 0.3341 0.0636 0.0650 0.6608 0.6432 0.6535 0.6659
wine 0.9431 0.9160 0.3042 0.3261 0.7542 0.8147 0.0569 0.0840 0.9536 0.9487 0.3180 0.2408
wisconsin 0.8230 0.9460 0.8872 0.9593 0.9622 0.9622 0.1770 0.0540 0.1128 0.0407 0.0378 0.0378

Average 0.6881 0.6882 0.4637 0.4485 0.5942 0.5887 0.5328 0.4966 1.1393 1.2423 0.6861 0.7107
Table 3:

Results for Acc and MAE


appendicitis 0.9431 0.9111 0.8018 0.7736 0.1500 0.1167 0.0569 0.0889 0.1982 0.2264 0.2111 0.8500
bands 0.6663 0.6175 0.3724 0.6253 0.6896 0.6281 0.3337 0.3825 0.6276 0.3747 0.3635 0.3104
baseball10cl 0.2920 0.2700 0.5783 0.5350 0.2841 0.2647 1.3342 1.3383 0.5048 0.5411 1.8795 1.5814
baseball4cl 0.7119 0.6155 0.2237 0.3072 0.5209 0.5464 0.3123 0.4414 1.9201 1.3864 0.5367 0.6134
breast 0.7902 0.7534 0.7769 0.5710 0.6958 0.6996 0.2098 0.2466 0.2231 0.4290 0.2309 0.3042
cleveland 0.5682 0.5517 0.5575 0.5673 0.5536 0.5533 0.8920 0.6931 0.8448 0.6583 0.6963 0.6811
dee10cl 0.3204 0.2960 0.5069 0.2439 0.2776 0.3237 1.3093 1.1540 0.7144 1.5358 1.3429 1.4843
dee4cl 0.6807 0.6350 0.2085 0.0991 0.5898 0.6270 0.3728 0.3829 2.5892 4.7203 0.3818 0.5036
dermatology 0.9370 0.9313 0.4037 0.1334 0.8357 0.8489 0.1360 0.1583 1.5298 1.7417 0.3153 0.3405
ecoli 0.8521 0.8366 0.5653 0.0235 0.6696 0.6566 0.4925 0.6243 1.5889 2.3213 1.0180 1.0977
era 0.1626 0.7239 0.2837 0.5134 0.1811 0.1836 1.9900 0.4726 1.7797 0.5651 2.5840 2.0704
esl 0.7429 0.8340 0.6758 0.8119 0.4476 0.4399 0.2752 0.1861 0.3318 0.2035 1.6437 0.6796
german 0.7035 0.6973 0.7260 0.3815 0.6496 0.6465 0.2965 0.3027 0.2740 0.6185 0.3556 0.3504
glass 0.6983 0.6200 0.3227 0.3411 0.7221 0.6832 0.5580 0.7595 1.7892 1.7667 0.6372 0.5307
haberman 0.9460 0.8009 0.2777 0.7157 0.5857 0.5814 0.0540 0.1991 0.7223 0.2843 0.8754 0.4143
hepatitis 0.8343 0.7735 0.2497 0.8118 0.8222 0.8030 0.1657 0.2265 0.7503 0.1882 0.2000 0.1778
ionosphere 0.9290 0.9111 0.6576 0.7703 0.7416 0.7027 0.0710 0.0889 0.3424 0.2297 0.2212 0.2584
iris 0.9595 0.9590 0.9267 0.3667 0.9662 0.9662 0.0405 0.0410 0.0733 0.9133 0.0271 0.0338
machinecpu10cl 0.3302 0.3907 0.6707 0.6252 0.3831 0.3619 1.4441 1.0995 0.3969 0.4158 1.2460 1.2553
machinecpu4cl 0.6660 0.6640 0.3719 0.4290 0.6815 0.7026 0.4119 0.3570 1.3555 1.0573 0.3382 0.3496
movement_libras 0.7226 0.6715 0.3145 0.0643 0.6434 0.6049 1.2474 1.5329 4.7228 6.9883 1.5169 1.4765
newthyroid 0.9561 0.9366 0.5721 0.1504 0.8310 0.8065 0.0687 0.0924 0.7486 0.8591 0.3393 0.2562
pima 0.6787 0.7587 0.7005 0.6563 0.7894 0.7863 0.3213 0.2413 0.2995 0.3437 0.2062 0.2106
sonar 0.7457 0.7460 0.4662 0.5419 0.8555 0.8307 0.2543 0.2540 0.5338 0.4581 0.1445 0.1693
spectfheart 0.7917 0.7575 0.2008 0.7917 0.7315 0.7110 0.2083 0.2425 0.7992 0.2083 0.2880 0.2685
swd 0.4101 0.8485 0.4682 0.8878 0.3876 0.3876 0.6151 0.1561 0.6393 0.1122 1.0399 0.7730
vowel 0.7838 0.7990 0.0909 0.0859 0.9949 0.9788 0.4313 0.5586 5.0000 4.8273 0.0444 0.0101
wdbc 0.9364 0.9315 0.3073 0.2970 0.4626 0.4495 0.0636 0.0685 0.6927 0.7030 0.6184 0.5374
wine 0.9462 0.9229 0.3258 0.3258 0.7947 0.8595 0.0538 0.0771 0.9324 0.9490 0.2169 0.2581
wisconsin 0.8230 0.9529 0.8641 0.9571 0.9708 0.9708 0.1770 0.0471 0.1359 0.0429 0.0280 0.0292

Average 0.7176 0.7373 0.4823 0.4801 0.6303 0.6240 0.4732 0.4171 1.1020 1.1890 0.6516 0.5959
Table 4:

Results for MAcc and MMAE

NMI(%) NRules

appendicitis 5.44 0.00 0.22 0.00 14.34 15.37 9.8 8.7 23.3
bands 0.00 0.00 0.00 0.01 0.01 0.01 49 60 102.9
baseball10cl 1.13 10.68 12.58 12.45 14.09 13.98 33.7 93.2 215.7
baseball4cl 0.00 11.14 13.48 13.36 12.13 12.26 40.2 61.2 123.4
breast 0.16 0.64 3.48 2.35 3.78 3.78 25.5 142.9 18.4
cleveland 0.00 0.40 0.89 0.98 1.01 0.89 13.3 60.1 35.7
dee10cl 0.84 2.82 0.02 4.40 4.99 4.95 27.5 92.5 94.5
dee4cl 0.56 2.41 0.02 4.86 4.26 4.13 35.7 55.1 27.5
dermatology 0.02 0.00 0.05 0.19 0.16 0.16 13.3 13.2 36.9
ecoli 0.17 11.31 0.02 12.72 13.40 13.17 33.4 37.7 75.8
era 8.79 17.03 13.49 12.59 12.25 12.27 9.4 26.9 38.2
esl 0.44 61.70 61.92 61.56 63.28 63.30 24.4 55.4 39.6
german 0.00 0.02 0.06 0.05 0.05 0.06 38.6 288.8 288.8
glass 0.00 0.00 0.00 0.00 0.00 0.00 28.6 33.4 191
haberman 3.46 0.29 0.56 3.80 19.72 19.74 25.4 8.9 25.2
hepatitis 0.00 0.00 0.10 0.10 0.40 0.00 5.3 9.8 17.3
ionosphere 0.09 0.00 0.29 0.23 0.45 0.40 24.1 24.5 116.8
iris 0.00 23.00 3.89 20.97 23.05 23.05 6.1 4.4 5.5
machinecpu10cl 0.87 5.18 4.88 4.92 6.78 6.93 17.9 53.3 58.9
machinecpu4cl 0.20 3.95 6.24 6.19 5.05 4.89 21.3 35.6 20.4
movement_libras 0.06 0.15 0.02 0.57 0.51 0.51 44.1 55.2 51.2
newthyroid 0.20 0.05 0.09 3.32 1.76 2.16 8 11.5 33.3
pima 0.01 3.47 0.12 3.67 5.50 5.52 37.9 52.7 39.3
sonar 0.00 0.00 0.00 0.00 0.00 0.00 28.8 22.3 187.2
spectfheart 0.00 0.00 0.00 0.00 0.03 0.03 22.7 27.6 239.5
swd 1.50 10.73 10.11 9.04 9.29 9.29 4.1 57.8 41.6
vowel 0.00 0.00 0.00 0.00 0.00 0.00 142.6 104.1 891
wdbc 0.01 0.00 0.00 1.37 3.35 3.34 17 16.7 64.3
wine 0.00 0.00 0.01 0.01 0.33 0.26 9.4 7.9 17.9
wisconsin 0.00 36.83 37.66 32.46 37.55 37.85 6.2 22.8 9.7

Average 0.80 6.73 5.67 7.07 8.58 8.61 26.78 51.47 104.36
Table 5:

Results for NMI and MRules

At a glance, observing Tables from 3 to 5, we can make the following analyses:

  • The EHSMC-CHC proposal yields the second best average result in accuracy over test data, the first best average results is offered by MID. It is slightly more accurate than Mk-NN and much more accurate than OLM and OSDL.

  • The MID proposal gets the best average result in MAE over test data. EHSMC-CHC obtains a very close result to MID, but it is clearly better than OLM and OSDL.

  • By considering the monotonic measures, MAcc and MMAE, EHSMC-CHC is again slightly outperformed by MID, obtaining the second best results. Also, both are clearly superior to the rest of algorithms.

  • The EHSMC-CHC proposal obtains the best average result in NMI over test data. The differences in NMI are very clear, thus our approach obtains very good monotonic predictions.

  • Finally, the number of rules required by EHSMCCHC is the lowest regarding the three algorithms compared. Hence, the models produces are simpler than the ones produced by MID and OLM.

In summary, our proposal EHSMC-CHC achieves competitive results in terms of accuracy and MAE by considering both classical and monotonic measures. It requires few rules and yields more monotonic models. Clearly, EHSMC-CHC is as good as MID in precision performance, but it requires a much lower number of generalized instances for building the model. With respect to the monotonicity in predictions, the NMI associated with EHSMC-CHC is also better than the reported by MID. Hence, EHSMC-CHC requires less rules and makes predictions with a higher monotonic index with similar behavior to MID.

Table 6 collects the results of applying the Wilcoxon test to EHSMC-CHC and the rest of the methods studied in this paper. In each one of the cells, three symbols can appear: +, = or –. They represent either that EHSMC-CHC outperforms (+), is similar to (=) or is worse (–) in performance than the rest of the methods. The value in brackets is the p-value obtained in the comparison. Other statistical studies can be performed by using non-parametric multiple comparison tests. These types of procedures study the set of results obtained by all the algorithms to compute a ranking that takes into account the multiple comparison. The smaller the ranking, the better the algorithm is. Rankings of the Friedman test are depicted in Table 7.

Algorithm Acc MAE MAcc MMAE NMI NRules
MID =(0.2580) =(0.7577) =(0.1642) =(0.6071) +(0.0034) +(0.0009)
OLM +(0.0001) +(0.0002) +(0.0001) +(0.0020) +(0.0258) +(0.0000)
OSDL +(0.0005) +(0.0040) +(0.0017) +(0.0086) +(0.0000)
M1-NN +(0.0052) +(0.0080) +(0.0248) +(0.0081) +(0.0000)
M3-NN +(0.0015) +(0.0035) +(0.0081) +(0.0137) +(0.0000)
Table 6:

Wilcoxon p-values reported comparing with EHSMC-CHC

Algorithm Acc MAE MAcc MMAE NMI NRules
EHSMC-CHC 2.3 2.4333 2.3833 2.45 1.8833 1.4333
MID 2.6667 2.5 2.9333 2.8667 2.6 2.05
OLM 4.5 4.7667 4.35 4.4333 3.2333 2.5167
OSDL 4.1333 4.1 4.3333 4.35 3.7833
M1-NN 3.4333 3.2833 3.3 3.4667 4.8167
M3-NN 3.9667 3.9167 3.7 3.4333 4.6833
Table 7:

Rankings reported by the test of Friedman

The results of the statistical tests allow us to highlight the following observations:

  • EHSMC-CHC and MID are the best approaches compared with the other techniques regarding accuracy and MAE. No statistical differences are registered between both, thus they behave equally in performance.

  • Our algorithm significantly outperforms OLM, OSDL, M1-NN and M3-NN in Acc, MAE, MAcc and MMAE.

  • Considering NMI and NRules, our approach is the best according to the statistical report.

6. Conclusions

The purpose of this paper was to present a proposal of hyperrectangle selection for monotonic classification through evolutionary algorithms. The novel algorithm named EHSMC-CHC constitutes a novel approach for nested generalized exemplar learning in classification with monotonicity constrains. It builds a beginning set of hyperrectangles by using a heuristic on training data and then it carries out a selection process focused on maximizing the performance of several objectives: accuracy, coverage of examples and reduction of the monotonicity violations registered in the model learned with the lowest possible number of hyperrectangles.

The results showed that EHSMC-CHC will allow us to produce very accurate models with a low number of hyperrectangles with very few monotonicity ruptures. We have compared it with the most important monotonic learning approaches, including both rule and instance based learners, and the performance of the models obtained by EHSMC-CHC in several data sets overcomes the offered by them.


This work is supported by the National Research Project TIN2014-57251-P.


11.R Potharst and AJ Feelders, “Classification trees for problems with monotonicity constraints”, SIGKDD Explorations, Vol. 4, No. 1, 2002, pp. 1-10.
30.J Demšar, “Statistical comparisons of classifiers over multiple data sets”, Journal of Machine Learning Research, Vol. 7, 2006, pp. 1-30.
32.LJ Eshelman, “The CHC adaptive search algorithm: How to safe search when enganging in nontraditional genetic recombination”, G Rawlings (editor), “Foundations of Genetic Algorithms”, Morgan Kaufmann, 1991.
36.K Cao-Van, “Supervised Ranking, from semantics to algorithms”, Ghent University, 2003. Ph.D. dissertation
37.M Lichman, “UCI Machine Learning Repository []”, University of California, School of Information and Computer Science, Irvine, CA, 2013.
40.N Japkowicz and M Shah, “Evaluating Learning Algorithms. A classification Perspective”, Cambridge University Press, 2014.
International Journal of Computational Intelligence Systems
9 - 1
184 - 201
Publication Date
ISSN (Online)
ISSN (Print)
10.1080/18756891.2016.1146536How to use a DOI?
© 2016. the authors. Co-published by Atlantis Press and Taylor & Francis
Open Access
This is an open access article under the CC BY-NC license (

Cite this article

AU  - Javier García
AU  - Adnan M. AlBar
AU  - Naif R. Aljohani
AU  - José-Ramón Cano
AU  - Salvador García
PY  - 2016
DA  - 2016/01/20
TI  - Hyperrectangles Selection for Monotonic Classification by Using Evolutionary Algorithms
JO  - International Journal of Computational Intelligence Systems
SP  - 184
EP  - 201
VL  - 9
IS  - 1
SN  - 1875-6883
UR  -
DO  - 10.1080/18756891.2016.1146536
ID  - García2016
ER  -