The LINEX Weighted k-Means Clustering
- DOI: 10.2991/jsta.d.190524.001
- Keywords: LINEX loss function; Feature weights; Weighted k-means; Clustering
LINEX weighted k-means is a version of weighted k-means clustering that computes the weight of each feature in each cluster automatically. Which cluster an entity is assigned to depends on the cluster centers. In this study, the asymmetric LINEX loss function is used to compute the dissimilarity in weighted k-means clustering, so the cluster centroids are obtained by minimizing a LINEX-based cost function. This loss function serves as a dissimilarity measure when one wants to overestimate or underestimate the cluster centroids, which helps to reduce certain errors of misclassifying entities. We therefore present the LINEX weighted k-means algorithm and examine its accuracy on some synthetic and real datasets.
- © 2019 The Authors. Published by Atlantis Press SARL.
- Open Access
- This is an open access article distributed under the CC BY-NC 4.0 license (http://creativecommons.org/licenses/by-nc/4.0/).
1. INTRODUCTION

Consider a dataset $X = \{x_1, \ldots, x_n\}$ with $n$ entities and $m$ features. The aim of clustering is to group the observations of a dataset into $k$ disjoint similar clusters $S = \{S_1, \ldots, S_k\}$. The most common clustering algorithm is k-means, which partitions the dataset into different groups by minimizing the dissimilarity between objects and their corresponding cluster centroids $c_l = (c_{l1}, \ldots, c_{lm})$, for $l = 1, \ldots, k$. Consider the following cost function:
$$P(H, C) = \sum_{l=1}^{k} \sum_{i=1}^{n} \sum_{j=1}^{m} h_{il}\, d(x_{ij}, c_{lj}),$$
where $h_{il} \in \{0, 1\}$ is a variable that represents whether $x_i$ belongs to $S_l$ or not, and $d$ can be a familiar symmetric measure such as the Euclidean, City Block, or Minkowski distance. In k-means clustering algorithms, all features in a dataset have an equal effect on the results. In fact, however, the features are not equally important, and some of them are noisy. Huang et al. proposed the weighted k-means clustering algorithm (Wk-means), which assigns a different weight to each feature. It minimizes
$$P(H, C, W) = \sum_{l=1}^{k} \sum_{i=1}^{n} \sum_{j=1}^{m} h_{il}\, w_j^{\beta}\, d(x_{ij}, c_{lj}), \tag{1}$$
where $H$ is a matrix that contains the $h_{il}$, $W = (w_1, \ldots, w_m)$ is the vector of feature weights, each $w_j$ is nonnegative with $\sum_{j=1}^{m} w_j = 1$, and $\beta$ represents the impact of the weights (when $\beta < 1$, it may assign small weights to the important features; therefore, in this work we consider $\beta > 1$). Minimizing the cost function (1) between each entity and its corresponding centroid in each class subject to $\sum_{j=1}^{m} w_j = 1$, the weights are updated using the following relation:
$$w_j = \frac{1}{\sum_{t=1}^{m} \left( D_j / D_t \right)^{1/(\beta - 1)}}, \tag{2}$$
where $D_j = \sum_{l=1}^{k} \sum_{i=1}^{n} h_{il}\, d(x_{ij}, c_{lj})$.
The Wk-means algorithm is as follows (a minimal code sketch is given after the list):
1. Specify the number of clusters $k$. Set the initial centroids and feature weights randomly such that $\sum_{j=1}^{m} w_j = 1$; for instance, $w_j = 1/m$.
2. Allocate each entity to the nearest centroid by minimizing (1).
3. Update all centroids to the centers of the corresponding groups. Stop if there is no change in the clusters of step 2; then $S_1, \ldots, S_k$ are the final partitions.
4. Using (2), update the weights subject to the constraint $\sum_{j=1}^{m} w_j = 1$, and return to step 2.
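For concreteness, the following is a minimal sketch of this loop in Python/NumPy with the squared Euclidean distance as $d$; the function name, defaults, and the guard constant are illustrative choices, not the authors' code.

```python
import numpy as np

def wk_means(X, k, beta=2.0, max_iter=100, seed=None):
    """Minimal Wk-means sketch: squared Euclidean d, one weight per feature."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    centers = X[rng.choice(n, k, replace=False)].astype(float)  # step 1: random centroids
    weights = np.full(m, 1.0 / m)                               # step 1: w_j = 1/m
    labels = np.full(n, -1)
    for _ in range(max_iter):
        # Step 2: assign each entity to the nearest centroid under the weighted distance.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2 * weights ** beta).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):                  # step 3: stop if unchanged
            break
        labels = new_labels
        for l in range(k):                                      # step 3: centroids = cluster means
            if np.any(labels == l):
                centers[l] = X[labels == l].mean(axis=0)
        # Step 4: update weights by (2): w_j = 1 / sum_t (D_j / D_t)^(1/(beta-1)).
        D = np.array([((X[labels == l] - centers[l]) ** 2).sum(axis=0)
                      for l in range(k)]).sum(axis=0)
        D = D + D.mean()                                        # guard against D_j = 0
        weights = 1.0 / ((D[:, None] / D[None, :]) ** (1.0 / (beta - 1))).sum(axis=1)
    return labels, centers, weights
```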
Amorim et al. generalized this algorithm to the Minkowski Wk-means (MWk-means), which uses the Minkowski distance as the dissimilarity measure instead of the Euclidean one, with cluster-specific weights. It minimizes the following criterion:
$$P(H, C, W) = \sum_{l=1}^{k} \sum_{i=1}^{n} \sum_{j=1}^{m} h_{il}\, w_{lj}^{\beta}\, \lvert x_{ij} - c_{lj} \rvert^{p}.$$
The MWk-means algorithm classifies the dataset into disjoint clusters of analogous objects using the Minkowski distance, such that different weights are assigned to each feature in different clusters. Choosing the dissimilarity measure in clustering algorithms is an important issue. Modha and Spangler explained that one could use a nonnegative, convex, and symmetric loss function as the dissimilarity measure in clustering, and they introduced the convex k-means algorithm. Kummamuru et al. illustrated a class of dissimilarity measures that are asymmetric. Parsian and Kirmani explained that symmetric loss functions are useful when overestimation and underestimation by the estimator are equally important; otherwise, an asymmetric loss function such as LINEX is appropriate. For example, the risk of overfilling the gasoline tank of an aircraft is less than that of underfilling it, and misdiagnosing cancer in a patient might be more costly than a false alarm. LINEX is an asymmetric loss function that is convex and has some interesting properties; for example, it behaves exponentially on one side of zero and linearly on the other, according to its parameter. We say more about this loss function in the next section.
Here, we generalize the Wk-means algorithm to one with the asymmetric LINEX loss function as the dissimilarity measure, which we call the LINEX weighted k-means (LINEX Wk-means) clustering algorithm. It is useful especially when one wants to force some data into specific clusters to reduce certain losses, while assigning different weights to each feature. For example, misclassifying a set of bombs into the groups of high energy and low energy may lead to hazardous results. To check the accuracy of the LINEX Wk-means algorithm, that is, how well it partitions the data, we use some synthetic and real datasets that are labeled in advance. We also compute the Davies–Bouldin (DB) index, the normalized variation of information (NVI), and an accuracy measure (AM) that gives the percentage of correctly recovered cluster labels, to compare the results.
2. LINEX k-MEANS ALGORITHMS
Sometimes we deal with clustering a dataset in which misclassifying an entity into a specific cluster might be costly or hazardous, or we tend to place some data into a particular cluster. In this case, a k-means algorithm with a symmetric dissimilarity measure may not be appropriate. Therefore, we need an asymmetric dissimilarity measure like LINEX, so that the costs of overestimating and underestimating differ. The resulting algorithm is called the LINEX k-means algorithm.
Suppose the parameter $\theta$ is unknown and let $\hat{\theta}$ be an estimator of $\theta$ based on a sample. Then $\Delta = \hat{\theta} - \theta$ is the error of estimating $\theta$. The LINEX loss function, proposed by Varian, is
$$L(\Delta) = b\left(e^{a\Delta} - a\Delta - 1\right), \qquad a \neq 0,\; b > 0,$$
where $a$ is a scalar and $L$ is convex, asymmetric, and nonnegative. For $a > 0$, $L(\Delta)$ is nearly linear for negative $\Delta$ and nearly exponential for positive $\Delta$. For small values of $|a|$, $L(\Delta) \approx b\,a^2 \Delta^2 / 2$, which is not far from the symmetric Euclidean loss; so the optimal estimate falls near the mean, which is the estimate obtained under the Euclidean loss. This means that estimation under the LINEX loss is consistent with that achieved under the Euclidean loss (when $a \to 0$). When overestimating is more costly than underestimating, $L$ is markedly asymmetric and $a$ is close to one. The multiparameter case is
$$L(\Delta) = \sum_{j=1}^{m} b_j\left(e^{a_j \Delta_j} - a_j \Delta_j - 1\right),$$
where $\Delta = (\Delta_1, \ldots, \Delta_m)$, $a_j \neq 0$, and $b_j > 0$.
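As a quick numerical illustration of these properties (a hypothetical helper, with the scale $b$ kept as a parameter):

```python
import numpy as np

def linex_loss(delta, a, b=1.0):
    """Varian's LINEX loss: b * (exp(a*delta) - a*delta - 1)."""
    return b * (np.exp(a * delta) - a * delta - 1.0)

# With a = 1, overestimation (delta > 0) is penalized almost exponentially,
# underestimation (delta < 0) almost linearly:
print(linex_loss(2.0, a=1.0))   # about 4.39
print(linex_loss(-2.0, a=1.0))  # about 1.14
# For small |a| the loss is close to the quadratic a^2 * delta^2 / 2:
print(linex_loss(2.0, a=0.01))  # about 2e-4, matching 0.01**2 * 2.0**2 / 2
```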
Now we use the LINEX loss function in k-means clustering as the dissimilarity measure. Suppose we have a dataset $X$, as introduced before. The LINEX k-means algorithm uses the following cost function to measure the distance between each entity in a cluster and its corresponding centroid:
$$P(H, C) = \sum_{l=1}^{k} \sum_{i=1}^{n} \sum_{j=1}^{m} h_{il} \left( e^{a(x_{ij} - c_{lj})} - a(x_{ij} - c_{lj}) - 1 \right). \tag{3}$$
The process is like the k-means algorithm except for the dissimilarity measure and the optimal centroids of each cluster. The optimal centers are obtained by minimizing (3) with respect to $C$:
$$c_{lj} = \frac{1}{a} \ln\left( \frac{1}{n_l} \sum_{i \in S_l} e^{a x_{ij}} \right), \qquad n_l = |S_l|.$$
When $a$ is close to zero, the results are close to those of the common k-means algorithm with a symmetric dissimilarity measure. When $a > 0$, the two types of error are weighted unequally: the optimal centers exceed the cluster means (overestimated centers), and entities close to the border of two clusters are pushed into a cluster according to these overestimated centers.
For large values of $|a|$, $e^{a x_{ij}}$ overflows and $c_{lj}$ cannot be evaluated numerically. Therefore, before processing, the data should be standardized, for example feature by feature:
$$x'_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j},$$
where $\bar{x}_j$ and $s_j$ are the sample mean and standard deviation of the $j$-th feature. When $|a|$ is small enough, $e^{a x_{ij}}$ is computable and normalizing the data is not needed.
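A numerically stable way to evaluate the LINEX-optimal centroid is to work on the log scale; the sketch below uses SciPy's `logsumexp` and is an implementation suggestion, not the authors' code.

```python
import numpy as np
from scipy.special import logsumexp

def linex_center(x, a):
    """LINEX-optimal centroid c = (1/a) * ln( (1/n) * sum_i exp(a * x_i) ).

    logsumexp avoids overflowing exp(a * x_i) for moderately large |a|.
    """
    x = np.asarray(x, dtype=float)
    return (logsumexp(a * x) - np.log(len(x))) / a

x = np.array([0.0, 1.0])
print(linex_center(x, a=1.0))   # about 0.62 > 0.5: the center is overestimated
print(linex_center(x, a=1e-8))  # about 0.5: recovers the ordinary mean as a -> 0
```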
3. THE LINEX Wk-MEANS CLUSTERING
LINEX Wk-means clustering is a generalization of the Wk-means and LINEX k-means algorithms. It assigns a different weight to each feature in each cluster while using the LINEX loss function as the dissimilarity measure. The process follows the Wk-means algorithm with some differences. Consider the following cost function:
$$T(H, C, W) = \sum_{l=1}^{k} \sum_{i=1}^{n} \sum_{j=1}^{m} h_{il}\, w_{lj}^{\beta} \left( e^{a(x_{ij} - c_{lj})} - a(x_{ij} - c_{lj}) - 1 \right), \tag{4}$$
where each weight $w_{lj}$ is nonnegative with $\sum_{j=1}^{m} w_{lj} = 1$, and $\beta$ is the impact of the weights, as defined before. The goal is to find the optimal centers $c_{lj}$ and feature weights $w_{lj}$, for $l = 1, \ldots, k$ and $j = 1, \ldots, m$, by minimizing (4). We rewrite (4) as
$$T = \sum_{l=1}^{k} \sum_{j=1}^{m} w_{lj}^{\beta}\, D_{lj}, \tag{5}$$
where $D_{lj} = \sum_{i=1}^{n} h_{il} \left( e^{a(x_{ij} - c_{lj})} - a(x_{ij} - c_{lj}) - 1 \right)$, and, for each cluster $l$, we minimize the Lagrange function
$$\Lambda_l = \sum_{j=1}^{m} w_{lj}^{\beta}\, D_{lj} + \lambda \left( \sum_{j=1}^{m} w_{lj} - 1 \right),$$
whose partial derivative is $\partial \Lambda_l / \partial w_{lj} = \beta\, w_{lj}^{\beta-1} D_{lj} + \lambda$.
Equating it to zero leads to $w_{lj} = \left( -\lambda / (\beta D_{lj}) \right)^{1/(\beta-1)}$, and summing over $j$ gives $\left( -\lambda / \beta \right)^{1/(\beta-1)} = 1 / \sum_{t=1}^{m} \left( 1/D_{lt} \right)^{1/(\beta-1)}$; finally,
$$w_{lj} = \frac{1}{\sum_{t=1}^{m} \left( D_{lj} / D_{lt} \right)^{1/(\beta-1)}}. \tag{6}$$
When $\beta = 0$, the weights have no effect and a single common value is assigned to the $j$-th variable in each cluster; so we put $\beta > 1$. If $D_{lj}$ is equal to zero for some feature $j$, then (6) is not defined. To avoid these conditions, adding a positive constant to $D_{lj}$ is suggested; this constant could be the average of the dispersions of the features in the dataset. When $\beta = 1$, the minimum of (4) is achieved at $w_{lj'} = 1$ and the weights of the other features are zero, where $j'$ is the feature that has the smallest sum of within-cluster variances. The optimal weight depends on the ratios $D_{lj}/D_{lt}$, which are based on the LINEX loss directly. It seems that overestimation or underestimation affects only the centers, not the weights.
Now we turn to finding the optimal centers. Consider the following optimization problem: minimize $T(H, C, W)$ in (4), where the LINEX loss is the dissimilarity measure, subject to $h_{il} \in \{0, 1\}$ and $\sum_{l=1}^{k} h_{il} = 1$ for $i = 1, \ldots, n$, and $w_{lj} \geq 0$ and $\sum_{j=1}^{m} w_{lj} = 1$ for $l = 1, \ldots, k$. We minimize $T(H, C, W)$ under these conditions in the following three steps:
1. Fix $C$ and $W$; then $T$ is minimized iff
$$h_{il} = \begin{cases} 1, & \text{if } \sum_{j=1}^{m} w_{lj}^{\beta}\, d(x_{ij}, c_{lj}) \leq \sum_{j=1}^{m} w_{tj}^{\beta}\, d(x_{ij}, c_{tj}) \text{ for } 1 \leq t \leq k, \\ 0, & \text{otherwise}, \end{cases}$$
where $d(x_{ij}, c_{lj}) = e^{a(x_{ij} - c_{lj})} - a(x_{ij} - c_{lj}) - 1$.
2. Fix $H$ and $C$ and solve $\partial T / \partial w_{lj} = 0$; then $T$ is minimized at the weights given in relation (6).
3. Fix $H$ and $W$ and solve $\partial T / \partial c_{lj} = 0$; then $T$ is minimized if and only if, for each cluster $l$ and feature $j$,
$$c_{lj} = \frac{1}{a} \ln\left( \frac{\sum_{i=1}^{n} h_{il}\, e^{a x_{ij}}}{\sum_{i=1}^{n} h_{il}} \right). \tag{7}$$
To show step 3, fix $l$ and $j$; then it is enough to minimize
$$\sum_{i=1}^{n} h_{il}\, w_{lj}^{\beta} \left( e^{a(x_{ij} - c_{lj})} - a(x_{ij} - c_{lj}) - 1 \right).$$
Differentiating it with respect to $c_{lj}$ and equating to zero gives
$$a \sum_{i=1}^{n} h_{il} \left( 1 - e^{a(x_{ij} - c_{lj})} \right) = 0 \quad \Longrightarrow \quad e^{a c_{lj}} = \frac{\sum_{i=1}^{n} h_{il}\, e^{a x_{ij}}}{\sum_{i=1}^{n} h_{il}},$$
and (7) is obtained.
The LINEX Wk-means algorithm is the same as the Wk-means algorithm except for the centroid update, the weight update, and the dissimilarity measure (steps 2, 3, and 4). That is,
2. Allocate each entity to the nearest centroid by minimizing (4).
3. Update all centroids to the centers of the corresponding groups using (7).
4. Using (6), update the weights subject to the constraint $\sum_{j=1}^{m} w_{lj} = 1$.
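Putting steps 2 through 4 together, here is a hedged NumPy/SciPy sketch of one possible LINEX Wk-means implementation; the initialization, the guard constant added to $D_{lj}$, and the parameter defaults are assumptions for illustration, not the authors' code.

```python
import numpy as np
from scipy.special import logsumexp

def linex_wk_means(X, k, a=0.1, beta=2.0, max_iter=100, seed=None):
    """Sketch of LINEX Wk-means: per-cluster feature weights, LINEX dissimilarity."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    C = X[rng.choice(n, k, replace=False)].astype(float)  # step 1: random centroids
    W = np.full((k, m), 1.0 / m)                          # step 1: rows sum to 1
    labels = np.full(n, -1)

    def linex(delta):
        return np.exp(a * delta) - a * delta - 1.0

    for _ in range(max_iter):
        # Step 2: assign each entity by minimizing (4).
        D = linex(X[:, None, :] - C[None, :, :])          # n x k x m dissimilarities
        new = (W[None, :, :] ** beta * D).sum(axis=2).argmin(axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
        for l in range(k):
            S = X[labels == l]
            if len(S) == 0:
                continue                                  # keep old center for an empty cluster
            # Step 3: update centers by (7): c_lj = (1/a) ln( mean_i exp(a * x_ij) ).
            C[l] = (logsumexp(a * S, axis=0) - np.log(len(S))) / a
            # Step 4: update weights by (6), guarding against D_lj = 0 (cf. the
            # positive constant suggested above).
            Dl = linex(S - C[l]).sum(axis=0)
            Dl = Dl + Dl.mean() + 1e-12
            W[l] = 1.0 / ((Dl[:, None] / Dl[None, :]) ** (1.0 / (beta - 1))).sum(axis=1)
    return labels, C, W
```

When $a$ is taken close to zero, the assignments and centers of this sketch approach those of the ordinary Wk-means loop shown earlier.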
There is also another version of the LINEX Wk-means algorithm, which differs in its exponential weights; its cost function is
$$P(H, C, W) = \sum_{l=1}^{k} \sum_{i=1}^{n} \sum_{j=1}^{m} h_{il}\, e^{w_{lj}} \left( e^{a(x_{ij} - c_{lj})} - a(x_{ij} - c_{lj}) - 1 \right), \tag{8}$$
such that $\sum_{j=1}^{m} w_{lj} = 1$. We call it the LINEX exponentially weighted k-means (LINEX EWk-means) algorithm. The optimal centers are the same as in (7), so it is enough to find the optimal feature weights $w_{lj}$ by minimizing (8) with respect to $w_{lj}$. We rewrite (8) as $\sum_{l=1}^{k} \sum_{j=1}^{m} e^{w_{lj}} D_{lj}$, with $D_{lj}$ as before, and then minimize the Lagrange function
$$\Lambda_l = \sum_{j=1}^{m} e^{w_{lj}} D_{lj} + \lambda \left( \sum_{j=1}^{m} w_{lj} - 1 \right)$$
with respect to $w_{lj}$. Therefore, $\partial \Lambda_l / \partial w_{lj} = e^{w_{lj}} D_{lj} + \lambda$.
Equating it to zero leads to $w_{lj} = \ln(-\lambda) - \ln D_{lj}$, and summing over $j$ gives $\ln(-\lambda) = \left( 1 + \sum_{t=1}^{m} \ln D_{lt} \right)/m$. Hence,
$$w_{lj} = \frac{1}{m} \left( 1 + \ln \frac{\prod_{t=1}^{m} D_{lt}}{D_{lj}^{m}} \right). \tag{9}$$
When some $D_{lj} = 0$, we add the average of the dispersions of the features in the dataset to $D_{lj}$ to avoid a division by zero in (9).
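A small sketch of this weight update, vectorized over the clusters and using the same kind of guard constant (again an illustration, not the authors' code):

```python
import numpy as np

def ew_weights(D):
    """EWk-means update (9): w_lj = (1/m) * (1 + ln( prod_t D_lt / D_lj**m )),
    applied row-wise to a k x m matrix D of within-cluster LINEX costs."""
    D = D + D.mean(axis=1, keepdims=True)  # guard against D_lj = 0
    m = D.shape[1]
    # Weights may be negative; they enter the cost only through exp(w_lj) > 0.
    return (1.0 + np.log(D).sum(axis=1, keepdims=True) - m * np.log(D)) / m

print(ew_weights(np.array([[1.0, 2.0, 4.0]])).sum())  # each row sums to 1
```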
4. PERFORMING THE EXPERIMENTS
In comparison with LINEX k-means clustering, one advantage of the LINEX Wk-means algorithm is that it assigns different feature weights when classifying the entities. Here we check the performance of LINEX Wk-means on some simulated datasets and on real datasets that are available in the UC Irvine Machine Learning Repository. The simulated datasets are generated from Normal, Log-Normal, Gamma, and Poisson distributions, as described in Section 4.1.
All of these datasets are labeled, which helps us evaluate the accuracy of an algorithm. To accomplish this, we map the labels to the clusters generated by the algorithm. We run the algorithms several times, since the initial centroids are chosen randomly from the entities and lead to different results in each run. Each time, the percentage of correctly recovered labels is computed, and finally the accuracies are averaged. In a further experiment, we add features that contain random noise to our datasets [3,10]; the noise is generated from a uniform density. We then quantify the impact of these noisy features on the results, that is, how stable the LINEX Wk-means results remain when the noisy features are added. Now we introduce the datasets:
4.1. Simulated Datasets
The sum of $n$ independent and identically distributed (iid) random variables with Normal, Gamma (with a common scale parameter), or Poisson distributions has the same respective distribution, and the product of iid Log-Normal random variables is Log-Normal. Using these rules, we produce three datasets of 100 dependent observations each, with three features, from Normal, Log-Normal, and Gamma densities. Our fourth dataset is from a Poisson density but with one feature. Each dataset is partitioned into two clusters: the first 50 entities are labeled 1 and the second 50 entities are labeled 2.
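The exact densities and parameters used in the paper are not reproduced here; the sketch below only illustrates the construction of dependent entities via running sums, with arbitrary illustrative parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

def dependent_cluster(sampler, n=50, m=3):
    """n dependent entities as running sums of iid draws; sums of iid Normal,
    Gamma (common scale), and Poisson draws stay in the same family.
    For Log-Normal, use np.cumprod of Log-Normal draws instead."""
    return np.cumsum(sampler((n, m)), axis=0)

# Two clusters with different (illustrative) location parameters:
cluster1 = dependent_cluster(lambda size: rng.normal(0.0, 1.0, size))
cluster2 = dependent_cluster(lambda size: rng.normal(3.0, 1.0, size))
X = np.vstack([cluster1, cluster2])  # 100 x 3 dataset
y = np.repeat([1, 2], 50)            # first 50 labeled 1, second 50 labeled 2
```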
4.2. Real Datasets
We choose real datasets for which classifying some events into a particular cluster might be better or worse than the alternative, since entities that are labeled wrongly may lead to irreparable losses. Therefore, for these datasets, overestimating and underestimating are not of equal importance.
Gamma Telescope: This dataset contains 19020 events of different high-energy gamma particles with 11 features. The events are divided into two groups, background and signal. Clustering a background event as signal is worse than the converse.
Haberman's Survival: It contains 306 records of breast cancer patients with three features, divided into two clusters: patients who survived five years or more and patients who survived less than five years.
Seismic Bumps: It contains 2584 records with 18 features that describe the energy of seismic bumps in a coal mine, clustered into two classes, "high energy" and "no high energy."
First, we illustrate the performance of the two LINEX Wk-means algorithms for different values of $a$ and $\beta$, each chosen in steps of 0.1 over the ranges considered. Note that when $a$ is close to zero, the results are not so different from those of the Wk-means algorithm with the Euclidean distance. Then we add the noisy features to the datasets to see how sensitive the LINEX Wk-means algorithm is to them.
5. RESULTS

Now we check the results on 14 datasets (8 simulated and 6 real). To evaluate the two versions of the LINEX Wk-means algorithm, we compute AM, NVI, and DB [11,12]. NVI and AM are external criteria that inherently depend on the datasets: they require prelabeled data and compare the output labels of an algorithm with the dataset labels. AM is our accuracy measure, the percentage of correctly classified observations; it is better when it is close to 100. NVI takes values between 0 and 1 and decreases as the clusters become more homogeneous, so smaller values indicate better clustering. DB is an internal criterion that depends on the dispersion within and between clusters; a smaller DB indicates well-separated clusters. The computations were performed on an Intel Core i3 CPU at 2.13 GHz with 6 GB of RAM, in MATLAB R2013a.
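Computing AM requires mapping the algorithm's arbitrary cluster ids onto the true labels; one standard way to do this matching (our assumption, as the paper does not spell it out) is an optimal assignment:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def accuracy_measure(true_labels, pred_labels):
    """AM: percent of entities whose cluster agrees with the true label after
    the predicted cluster ids are optimally matched to the true label ids."""
    t_ids, p_ids = np.unique(true_labels), np.unique(pred_labels)
    agree = np.array([[np.sum((pred_labels == p) & (true_labels == t))
                       for t in t_ids] for p in p_ids])
    rows, cols = linear_sum_assignment(-agree)  # maximize total agreement
    return 100.0 * agree[rows, cols].sum() / len(true_labels)

print(accuracy_measure(np.array([1, 1, 2, 2]), np.array([0, 0, 1, 2])))  # 75.0
```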
5.1. Performance in Simulated Datasets
For each dataset, we run Wk-means, LINEX Wk-means, and LINEX EWk-means 100 times (since the initial centroids are selected randomly from the entities) and report the averages of the AM, NVI, and DB criteria. Table 1 shows the results for the four simulated datasets and the three algorithms. The parameter $\beta$ is the power of the weights in the Wk-means and LINEX Wk-means algorithms, and $a$ is the LINEX parameter; the criteria values guide the choice of $\beta$ and $a$. We examine different values of $\beta$ in steps of 0.1; in most datasets, the same value of $\beta$, reported in the tables, leads to better results in the LINEX Wk-means algorithm. Since overestimation and underestimation are not important for the datasets of Table 1, we take $a$ close to zero. In all tables, the best value of each criterion for each dataset is bolded.
Table 1. The results of the Wk-means, LINEX Wk-means, and LINEX EWk-means algorithms on the simulated datasets.

| Dataset | Algorithm | β | AM | Max of AM | DB | NVI |
|---|---|---|---|---|---|---|

Abbreviations: AM, accuracy measure (average percent of actual clustering labels); DB, Davies–Bouldin index; NVI, normalized variation of information.
In Table 2, we add three noisy features to the Normal, Log-Normal, and Gamma datasets and one noisy feature to the Poisson dataset to check how sensitive the algorithms are to noisy features. The results in Tables 1 and 2 are very close, and in almost all cases the LINEX Wk-means and LINEX EWk-means algorithms lead to better results than Wk-means.
Table 2. The results of the Wk-means, LINEX Wk-means, and LINEX EWk-means algorithms on the simulated datasets with the added noisy features.

| Dataset | Algorithm | β | AM | Max of AM | DB | NVI |
|---|---|---|---|---|---|---|
| Normal with three noisy features | LINEX Wk-means | 1 | 100 | 100.0 | 0.60 | 0.00 |
| Normal with three noisy features | LINEX EWk-means | - | 100 | 100.0 | 0.60 | 0.00 |
| Log-Normal with three noisy features | LINEX Wk-means | 1 | 92.1 | 100.0 | 0.80 | 0.20 |
| Log-Normal with three noisy features | LINEX EWk-means | - | 94.2 | 100.0 | 0.78 | 0.17 |
| Gamma with three noisy features | LINEX Wk-means | 1 | 98.00 | 98.00 | 0.62 | 0.21 |
| Gamma with three noisy features | LINEX EWk-means | - | 98.00 | 98.00 | 0.62 | 0.21 |
| Poisson with a noisy feature | LINEX Wk-means | 1 | 94.00 | 100.0 | 0.46 | 0.32 |
| Poisson with a noisy feature | LINEX EWk-means | - | 96.94 | 100.0 | 0.45 | 0.18 |
5.2. Performance in Real Datasets
In this part, we examine the LINEX Wk-means, LINEX EWk-means, and Wk-means algorithms on three real datasets and on the same datasets with added noisy features. As in the previous part, we run each algorithm 100 times and report the averages of the AM, NVI, and DB criteria over all runs. In these datasets, overestimating and underestimating matter, since misclassifying an entity into a particular cluster might be hazardous or costly. We vary $a$ in steps of 0.1 and choose it according to the lowest values of the criteria; we note that large values of $a$ give the strongest overestimation and do not lead to good results most of the time (they may push all of the data into one cluster). $\beta$ is chosen in the same way, also in steps of 0.1. According to the results of Tables 3 and 4, $a = 0.1$ leads to the best results in most datasets for the LINEX Wk-means and LINEX EWk-means algorithms. The performances of LINEX Wk-means and LINEX EWk-means are not very different, but both appear to give better results than the Wk-means algorithm. One advantage of LINEX EWk-means over LINEX Wk-means is that it has only one parameter, $a$. In Table 4, we add some noisy features to each dataset and repeat the clustering; the results illustrate that, like the Wk-means algorithm, our algorithms are not sensitive to the noisy features.
Table 3. The results of the Wk-means, LINEX Wk-means, and LINEX EWk-means algorithms on three real datasets.

| Dataset | Algorithm | β | a | AM | Max of AM | DB | NVI |
|---|---|---|---|---|---|---|---|
| Haberman's survival | LINEX Wk-means | 1 | 0.2 | 100.0 | 100.0 | 4.43 | 0.00 |
| Magic Gamma telescope | LINEX Wk-means | 1 | 0.1 | 70.79 | 71.25 | 1.34 | 0.95 |
| Seismic bumps | LINEX Wk-means | 1 | 0.1 | 93.42 | 93.42 | 0.79 | 1.00 |
Table 4. The results of the algorithms in Table 3 on three real datasets with the added noisy features.

| Dataset | Algorithm | β | a | AM | Max of AM | DB | NVI |
|---|---|---|---|---|---|---|---|
| Haberman's survival with three noisy features | LINEX Wk-means | 1 | 0.2 | 100.0 | 100.0 | 9.78 | 0.00 |
| Haberman's survival with three noisy features | LINEX EWk-means | - | 0.2 | 66.14 | 68.30 | 2.34 | 0.99 |
| Magic Gamma telescope with 11 noisy features | LINEX Wk-means | 1 | 0.1 | 70.61 | 71.33 | 1.96 | 0.94 |
| Magic Gamma telescope with 11 noisy features | LINEX EWk-means | - | 0.1 | 70.29 | 71.33 | 1.97 | 0.94 |
| Seismic bumps with 19 noisy features | LINEX Wk-means | 1 | 0.1 | 93.42 | 93.42 | 0.79 | 1.00 |
| Seismic bumps with 19 noisy features | LINEX EWk-means | - | 0.1 | 93.42 | 93.42 | 0.79 | 1.00 |
6. CONCLUSION

In Wk-means algorithms, different weights are assigned to the features, so the less important features have less influence on the clustering results. We propose two weighted k-means algorithms with the asymmetric LINEX loss function as the dissimilarity measure. They are useful when overestimating and underestimating are important, while the features are still weighted differently. In these algorithms, the optimal centers differ from those of Wk-means algorithms with the Euclidean dissimilarity measure. To evaluate the proposed algorithms, we investigated the results on some real and simulated datasets and computed internal and external criteria to check the accuracies. The results show good performance of the algorithms.
ACKNOWLEDGMENTS

The authors thank the editor and the reviewers of the manuscript for their excellent comments and suggestions, which helped to improve the quality of the content considerably.
Cite this article:
Narges Ahmadzadehgoli, Adel Mohammadpour, and Mohammad Hassan Behzadi, "The LINEX Weighted k-Means Clustering," Journal of Statistical Theory and Applications, vol. 18, no. 2, pp. 147-154, 2019. https://doi.org/10.2991/jsta.d.190524.001