International Journal of Computational Intelligence Systems

Volume 12, Issue 1, November 2018, Pages 299 - 310

A Methodology to Refine Labels in Web Search Results Clustering

Authors
Zaher Salah1, *, Ahmad Aloqaily1, Malak Al-Hassan2, Abdel-Rahman Al-Ghuwairi1
1Prince Al Hussein Bin Abdullah II Faculty for Information Technology, Hashemite University, Zarqa, Jordan
2Department of Business of Information Technology, University of Jordan, Amman, Jordan
*Corresponding author. Email: zahersalah@hotmail.com
Received 25 June 2018, Revised 19 November 2018, Accepted 14 December 2018, Available Online 31 December 2018.
DOI
10.2991/ijcis.2019.125905647
Keywords
Information retrieval; Machine learning; Web search results clustering; Web intelligence
Abstract

Information retrieval systems such as web search engines meet the user's information needs by searching for and retrieving the documents that match the user's query. First, the query submitted to the web search engine is assumed to be a good representative of the user's intention and to reflect his or her information needs specifically; it should therefore be long enough, discriminative, specific, and unambiguous. Second, the web search engine typically responds to the query with a long flat list of web search results, where each result represents a relevant document. That list may contain thousands or even millions of results, which makes it difficult to navigate and to locate a document relevant to a specific topic. As a postretrieval process, web search results clustering can address this issue by organizing the results into clusters. Each cluster should contain topically related documents and carry a concise, descriptive label that correctly characterizes its contents. Users can then choose the cluster representing the intended topic and navigate through the relatively few documents inside it. High-quality cluster labelling is crucial because it gives users insight into a cluster's contents, its general structure, and the distribution of topics among its documents, allowing them to preview and navigate quickly and easily. To this end, this paper introduces a methodology to enhance the labels of clusters of web search results. The proposed methodology builds on the labels nominated by the original Suffix Tree Clustering (STC) algorithm and adapts these labels and/or clusters so that they become more concise and descriptive. The methodology was applied to the original STC algorithm to produce an enhanced version of the classical algorithm.
The enhanced algorithm was evaluated experimentally, and the resulting clusters and labels were compared with those of the classical STC algorithm. For the evaluation, the authors used a cluster-labelling performance measure that considers five parameters: f1 (Comprehensibility), f2 (Descriptiveness), f3 (Discriminative Power), f4 (Uniqueness), and f5 (Nonredundancy). The results show that the enhanced labels outperform the original ones and that overall performance improved: (i) the proposed methodology achieved an overall average of 0.921 on the combined performance measure (f6); (ii) the number of clusters decreased from 15 to 9; (iii) the number of duplicated results decreased from 143 to 121; and (iv) the average number of phrases per label increased from 1.67 to 2.00.

Copyright
© 2019 The Authors. Published by Atlantis Press SARL.
Open Access
This is an open access article distributed under the CC BY-NC 4.0 license (http://creativecommons.org/licenses/by-nc/4.0/).

1. INTRODUCTION

Web pages are dynamic, superabundant, and miscellaneous; a heterogeneous variety of topics is therefore to be expected from this massive body of documents, which originate from sources worldwide and cover diverse subjects such as arts, science, engineering, economy, politics, and sport.

Starting, typically, from the first web search result(s), the user may waste valuable time browsing many results that are irrelevant to the intended topic. As a consequence, only about 50% of users browse beyond the first two pages of web search results returned by the search engine [1]. This raises the need for more sophisticated web search engines that employ a "postretrieval" process to improve the presentation of results to users [2], organizing them into labelled clusters. Each cluster represents a specific topic, which lets the user avoid previewing many irrelevant documents by navigating deeper inside the cluster that covers the intended topic and contains homogeneous documents relevant to the user's information needs. It also helps naive users find "unexpected relevant documents" [3]. Achieving this requires more than grouping documents: the key requirement is choosing a comprehensible and expressive descriptor, or label, for each group, so that users can locate the intended documents more easily and quickly, relying on short, meaningful labels that concisely explain what each group's content is about [4].

In general, the performance of many search results clustering techniques remains poor with respect to cluster labelling, especially since the labelling phase depends heavily on the efficiency of the clustering phase, which determines the content of each cluster. Homogeneous cluster content is crucial for inducing a good label. Many promising research directions aim to extend and enhance specific phases of clustering algorithms. Choosing an appropriate clustering algorithm with optimized parameters, together with an efficient mechanism for electing representative terms as candidate labels, is the key to producing better clusters with concise, descriptive, and informative labels.

Web search engines (for example, Google, Bing, Dogpile, Yahoo, and Baidu) respond to a user query with a long list of search results meeting the information requirements (user intention) expressed by that query. In ranked retrieval systems, the search results list is ordered by decreasing score under a specific relevancy ranking scheme and typically contains, for each result, a title, a small portion of text called a snippet, and a URL [3]. Figure 1 shows web search results for the query Hashemite university presented as a flat ordered list. Both the low precision and the flat presentation of search results make meeting the user's information needs far more exhausting than it should be, raising the need for more sophisticated search engines in which the relevant results are easier to browse and navigate [5].

Figure 1

An example of a flat list of web search results for the query: Hashemite university containing a title, a URL and a snippet for each search result associated with a relevant document.

Figure 2

Query used for clustering.

Figure 3

A list of the first 6 web search results retrieved from different web search engines (meta search) for the query Donald Trump, containing a title, a URL, and a snippet for each search result.

2. PRELIMINARIES

In this section we provide background on the labelling phase of web search results clustering (WSRC). Various types of algorithms can be adopted for clustering textual documents, employing, for example, neural networks, fuzzy logic, rough sets, or graph theory. WSRC algorithms fall into two categories [6]: (i) numerical-based algorithms, which have performance issues with web search results clustering because they expect full text as input rather than the short snippets found in search results; moreover, their output is "raw numerical" and cannot be used to label clusters because it is uninterpretable to the user; and (ii) phrase-based algorithms, which produce more comprehensible and descriptive labels than numerical algorithms. Table 1 [6] below contrasts numerical-based and phrase-based algorithms, while Table 2 [7] compares the clustering algorithms most commonly used in the literature along various characteristics.

Numerical-Based Algorithms
  • Documents are converted to a term-document matrix.

  • Numerical algorithms require more data than is available.

  • The raw numerical outcome is difficult to convert back into a cluster description.

  • The data model usually used is the Vector Space Model.

Phrase-Based Algorithms
  • Clustering is based on frequent phrases instead of numerical features.

  • They are simpler than numerical algorithms.

  • These algorithms usually discard smaller clusters.

  • The data models usually used are n-grams and suffix trees.

Table 1

Comparison between numerical-based and phrase-based algorithms.

Method | Semantic Relation | Cluster Label | Phrase Based | Incremental | Complexity
K-Means Clustering | No | One word only | No | No | O(nkt), k: initial clusters, n: no. of documents, t: iterations
Suffix Tree Clustering | Yes | Shorter but appropriate | Yes | Yes, but merging phase is not incremental | O(n)
Lingo | No | Longer, more descriptive | Yes | No | O(n)
Semantic Suffix Tree | Yes | Meaningful and readable labels | Yes | Yes | O(n)
Improved K-Means | Yes | Based on K-means first and then on the documents linked to it | No | No | Time consuming
Inductive Clustering | Yes | Phrases extracted from internal and external summaries | Yes | No | Negligible with cluster titles
Fuzzy C-Medoid Clustering | No | Produces a category | Yes | Yes | O(n^2)
Histogram-based Clustering | Yes | Matching phrases of documents | Yes | Yes | O(n^2)
Hierarchical Clustering | No | Most frequent terms from inside clusters | No | Yes | O(n^2) single link, O(n^3) complete link
Semantic Hierarchical Online Clustering (SHOC) | Yes | Labels that describe clusters (frequent phrase extraction and SVD) | Yes | Yes | O(n)

SVD, singular value decomposition.

Table 2

A comparison between the most typically used clustering approaches.

The commonly used Suffix Tree Clustering (STC) algorithm treats each document as a string (sequence of words) rather than as a bag of words (BOW), which neglects word order and considers only the frequencies of distinct words in the corpus. STC uses suffix trees to summarize documents and extract frequent phrases, whereas other algorithms such as Semantic Hierarchical Online Clustering (SHOC) use suffix arrays instead [8]. The Label Induction Grouping Algorithm (Lingo) produces more clusters than STC and K-means, while STC is more scalable than Lingo and K-means [9]. Lingo begins by extracting expressive labels and then clusters documents individually to the fittest label (each label representing a cluster). Labels are generated from pruned frequent terms (phrases and words) that achieve the required level of descriptiveness and informativeness [7, 10].

In both the Lingo and SHOC algorithms, a cluster label should (i) be present in web search result snippets a number of times exceeding a given threshold, (ii) be meaningful and contained in a single sentence covering a specific topic, and (iii) be clear, that is, a complete (not partial), sufficiently long, and frequent phrase. In addition, stop words occurring inside the phrase must be preserved to produce more legible cluster labels [11].

STC is fast (linear in the number of documents) and incremental, which makes it well suited to search results clustering, an online postretrieval process in which time is a critical requirement [8]. STC clusters documents or search result snippets that share common phrases (sequences of words or single terms) and uses information about the frequency and order of terms in the documents. STC works in two main phases: (i) base cluster discovery using a suffix tree and (ii) merging base clusters into proper clusters. First, STC summarizes document contents and extracts phrases to be assigned as cluster labels, producing concise and meaningful labels [6] based on candidate frequent phrases that describe the main topic of the documents. Second, STC assigns snippets to each of these labels to form proper clusters. Thresholds are used to manage the clustering process, but tuning them is often problematic [6].
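Phase (i) above, base cluster discovery, can be sketched in a few lines. The toy implementation below enumerates shared word n-grams directly instead of building a true suffix tree (which would find the same shared phrases in linear time); the function name, the phrase-length cap, and the example snippets are illustrative choices, not from the paper.

```python
from collections import defaultdict

def base_clusters(snippets, min_docs=2):
    """Simplified stand-in for STC phase (i): group snippets by shared phrases.

    A real suffix tree finds all shared phrases in O(n); here we enumerate
    word n-grams directly, which is slower but easy to follow.
    """
    phrase_to_docs = defaultdict(set)
    for doc_id, snippet in enumerate(snippets):
        words = snippet.lower().split()
        for i in range(len(words)):
            for j in range(i + 1, min(i + 4, len(words)) + 1):  # phrases up to 4 words
                phrase_to_docs[" ".join(words[i:j])].add(doc_id)
    # keep phrases shared by enough snippets; each becomes a labelled base cluster
    return {p: docs for p, docs in phrase_to_docs.items() if len(docs) >= min_docs}

snips = ["hashemite university jordan", "hashemite university ranking", "weather in jordan"]
clusters = base_clusters(snips)
# the phrase "hashemite university" labels a base cluster covering snippets 0 and 1
```

Phase (ii), merging base clusters whose document sets overlap heavily, would then run on the returned dictionary.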

In the work described in [12], documents are also treated as strings (sequences of words), and similarity between documents is computed with a string-kernel function: the similarity of two documents is the number of matching subsequences, so more shared substrings (not necessarily contiguous) mean more similar documents. To group the documents, spectral clustering is used, a graph-based algorithm in which, briefly, the clustering problem becomes a graph-cut problem of isolating one set of nodes from the others in the collection.

In general, there are three essential steps for any WSRC method [13] listed as follows:

  1. Retrieve a list R = (r1, r2, ⋯, rn) of n search results for the user query q.

  2. Cluster R to form a list C = (C0, C1, ⋯, Cm) of m + 1 clusters.

  3. Label clusters.

The method described above extracts a meaningful label from each created cluster to serve as a good descriptor for that cluster, whereas in [14], for example, labels are induced first and clustering is then performed by assigning snippets to the closest preextracted label. The same holds for Lingo's "description comes first" approach, which uses frequent phrases to induce sufficiently distinct labels covering as many topics as possible and then clusters by assigning each snippet to the closest label [8, 11]. The steps of the "description comes first" approach are, briefly:

  1. Preprocessing the input snippets by performing tokenization, stemming, and stop-words removal.

  2. Extracting frequent words and phrases in the input snippets.

  3. Inducing cluster labels by employing singular value decomposition (SVD).

  4. Assigning snippets to each of these labels to form proper clusters.

  5. Postprocessing like clusters merging and pruning.
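The five steps above can be sketched on toy data as follows. This is a deliberate simplification of Lingo, not its actual implementation: single words stand in for frequent phrases, the stop list is minimal, stemming and the postprocessing step are omitted, and the strongest term per SVD concept is taken as the label.

```python
import numpy as np

def description_comes_first(snippets, k=2):
    """Toy 'description comes first' flow: preprocess, build a term-document
    matrix, induce k label terms via SVD, then assign snippets to labels."""
    stop = {"the", "a", "of", "in"}                                   # step 1: tokenize, remove stop words
    docs = [[w for w in s.lower().split() if w not in stop] for s in snippets]
    vocab = sorted({w for d in docs for w in d})                      # step 2: frequent terms (all terms here)
    A = np.array([[d.count(w) for d in docs] for w in vocab], dtype=float)
    U, _, _ = np.linalg.svd(A, full_matrices=False)                   # step 3: SVD over terms x documents
    labels = [vocab[int(np.argmax(np.abs(U[:, i])))] for i in range(k)]  # strongest term per concept
    assign = {}
    for j, d in enumerate(docs):                                      # step 4: snippet joins the label it mentions most
        assign[snippets[j]] = max(labels, key=d.count)
    return labels, assign

labels, assign = description_comes_first(
    ["apple fruit tart", "apple pie recipe", "linux kernel patch", "linux distro release"])
```

In the real algorithm, step 2 would extract complete frequent phrases and step 3 would match phrase vectors against the SVD basis rather than reading labels off the left singular vectors directly.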

In addition to clustering documents automatically within acceptable time, it is essential to assign a meaningful and comprehensible label to each cluster that concisely describes the semantic topic it covers. Labelling is not a priority in traditional data mining approaches, which are mainly concerned with grouping data precisely and efficiently, whereas WSRC is concerned with making search results easier to browse by grouping them into well-described clusters [6], so that users can locate the required documents, and even unexpected relevant ones, by reviewing a particular cluster [3].

3. CLUSTERS LABELLING

Extracting relevant terms to label clusters and act as readable, meaningful, and distinguishing group descriptors is a challenging process, especially in WSRC, where search result snippets (small portions of text) contain few terms. A term that is highly frequent locally in a cluster but infrequent globally is typically a good representative label for that cluster [3]. Terms can be weighted using local and global factors as follows:

  1. Local Factor:

    $L_t = \log(1 + F_C^t)$
    where $F_C^t$ is the number of documents in cluster C containing term t. The logarithm damps the effect of very high frequencies.

  2. Global Factor:

    $G_t = \log \dfrac{F_C^t / |C|}{F_R^t / |R|}$
    where $F_R^t$ is the number of documents in the search results R containing term t, and |C| and |R| are the numbers of documents in the cluster and in the result set, respectively.

The label selection criterion combines the local and global factors to score the terms in each cluster as follows:

$\mathit{Score}_t = L_t \times G_t$

For each cluster, the term with the highest score will be selected as the cluster label.
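The local/global scoring scheme above can be transcribed directly; note that the formula for $G_t$ here follows the reading of the garbled original as a log-ratio of the term's relative frequency inside the cluster to its relative frequency across all results, and the parameter names are illustrative.

```python
import math

def label_score(df_cluster, cluster_size, df_results, results_size):
    """Score a candidate label term for one cluster: terms that are locally
    frequent but globally rare score highest (Section 3)."""
    L = math.log(1 + df_cluster)                     # local factor, log-damped
    G = math.log((df_cluster / cluster_size) /
                 (df_results / results_size))        # local vs. global relative frequency
    return L * G

# term in 8 of 10 cluster docs but only 12 of 100 results overall: strong label
strong = label_score(8, 10, 12, 100)
# term spread evenly (8 of 10 locally, 80 of 100 globally): zero discriminative power
flat = label_score(8, 10, 80, 100)
```

A term that is proportionally no more frequent inside the cluster than outside it scores zero, so it would never be picked as the cluster label.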

The work in [15] used the Lingo algorithm to extract frequent phrases and original terms, together with synonym terms from the WordNet lexical database, in order to induce better abstractive labels for clusters. Other external knowledge resources such as Wikipedia can also be used to enrich candidate labels with meaningful terms imported from the free online encyclopedia, which contains a huge amount of "controlled," preclustered, and manually annotated content [4].

The authors of [16] proposed an approach that extracts significant bi-grams, combines them into n-grams according to term co-occurrence statistics, and uses the top-ranked non-redundant phrases as candidate labels. To retrieve significant bi-grams, a strength value is computed for each pair of words <w, wi> as follows:

$\mathit{strength} = \dfrac{\mathit{freq} - \bar{f}}{\sigma}$
$\bar{f} = \dfrac{1}{n} \sum_{i=1}^{n} \mathit{freq}_i$
$\sigma = \sqrt{\dfrac{1}{n} \sum_{i=1}^{n} (\mathit{freq}_i - \bar{f})^2}$

Word pairs whose strength value is below the threshold β0 are discarded.

The spread is also computed, as follows:

$\mathit{spread} = \dfrac{1}{2d} \sum_{-d \le j \le d,\, j \ne 0} (p_{ij} - \bar{p}_i)^2$
$\bar{p}_i = \dfrac{1}{2d} \sum_{-d \le j \le d,\, j \ne 0} p_{ij}$
spread describes the shape of the $p_{ij}$ histogram, where $p_{ij}$ counts occurrences of $w_i$ at position j relative to w within a window of d positions on either side. A small spread indicates a flat histogram, meaning $w_i$ can appear equivalently in almost any position around w, whereas a large spread indicates a histogram with peaks, meaning $w_i$ appears only in one (or a few) specific positions around w. Word pairs whose spread value is below the given threshold ρ0 are also discarded. The remaining word pairs are the significant bi-grams.
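The two filters translate directly into code: strength is read as a z-score of the pair's frequency against all pair frequencies, and spread as the variance of the positional histogram. The normalization constants are reconstructed from the garbled formulas and may differ slightly from [16]; the example histograms are invented for illustration.

```python
import math

def strength(freq, all_freqs):
    """z-score of a bi-gram's frequency relative to all bi-gram frequencies."""
    n = len(all_freqs)
    mean = sum(all_freqs) / n
    sd = math.sqrt(sum((f - mean) ** 2 for f in all_freqs) / n)
    return (freq - mean) / sd

def spread(position_counts):
    """Variance of the co-occurrence histogram p_ij over the 2d positions
    j = -d..d (j != 0) around w. Near zero means w_i floats freely around w;
    a large value means w_i is pinned to a few fixed positions."""
    mean = sum(position_counts) / len(position_counts)
    return sum((c - mean) ** 2 for c in position_counts) / len(position_counts)

flat_hist = [5, 5, 5, 5]     # w_i appears evenly around w: discarded as a bi-gram
peaky_hist = [0, 20, 0, 0]   # w_i almost always directly before w: kept
```

Pairs passing both thresholds (β0 for strength, ρ0 for spread) survive as significant bi-grams.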

Next, the bi-grams are used to discover n-grams. Each bi-gram <w, wi> is represented in a graph as a directed edge, with the two words as vertices. A tri-gram "abc" is identified if the edges ab, bc, and ac all exist.

The n-gram relation is defined recursively as
$\mathit{ngram}(w_1 \dots w_n) = \begin{cases} \mathit{edge}(w_1 w_2) & \text{if } n = 2 \\ \mathit{ngram}(w_1 \dots w_{n-1}) \wedge \bigwedge_{i=1}^{n-1} \mathit{edge}(w_i w_n) & \text{if } n > 2 \end{cases}$
where $\mathit{edge}(w_1 w_2) = \begin{cases} \text{true} & \text{if } (w_1, w_2) \text{ is a significant bi-gram} \\ \text{false} & \text{otherwise} \end{cases}$

A depth-first traversal of all nodes extracts all the n-grams. Redundant n-grams are then eliminated, and only non-redundant n-grams are used as candidate cluster labels. Redundancy is removed by applying a remove-or-merge process.
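The graph construction and depth-first extraction can be sketched as follows. The guard against revisiting a word already on the path is an added safeguard against cycles, which means this sketch does not discover n-grams with repeated words.

```python
def ngrams_from_bigrams(bigrams):
    """Grow candidate n-grams from significant bi-gram edges: a path
    w1..w(n-1) extends to wn only if an edge wi -> wn exists for every
    word already on the path (e.g. "abc" needs edges ab, bc, and ac)."""
    succ = {}
    for a, b in bigrams:
        succ.setdefault(a, set()).add(b)
    found = {tuple(bg) for bg in bigrams}
    def dfs(path):
        for nxt in succ.get(path[-1], set()):
            if nxt in path:                               # cycle guard (sketch-only)
                continue
            if all(nxt in succ.get(w, set()) for w in path):  # edge from every word
                found.add(tuple(path) + (nxt,))
                dfs(path + [nxt])
    for a, b in bigrams:
        dfs([a, b])
    return found

grams = ngrams_from_bigrams([("a", "b"), ("b", "c"), ("a", "c")])
# ("a", "b", "c") is discovered because edges ab, bc, and ac all exist
```

Dropping the ("a", "c") edge would make the tri-gram fail the all-edges test, exactly as the recursive definition requires.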

Let ts(p) be the term set of n-gram p, ss(p) be the sentence set of p, and ω0 be a threshold. The remove-or-merge condition is defined as:

$\text{if } \dfrac{|ts(p_i) \cap ts(p_j)|}{|ts(p_i) \cup ts(p_j)|} \ge \omega_0 \text{ and } \dfrac{|ss(p_i) \cap ss(p_j)|}{|ss(p_i) \cup ss(p_j)|} \ge \omega_0, \text{ merge } p_i \text{ and } p_j$
$\text{if } \dfrac{|ts(p_i) \cap ts(p_j)|}{\min(|ts(p_i)|, |ts(p_j)|)} = 1 \text{ and } \dfrac{|ss(p_i) \cap ss(p_j)|}{|ss(p_i) \cup ss(p_j)|} \ge \omega_0, \text{ delete the shorter n-gram}$
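In code, the two branches of the remove-or-merge condition reduce to Jaccard overlaps on the term and sentence sets; the threshold value 0.6 and the dictionary-based interface below are illustrative choices, not from [16].

```python
def jaccard(a, b):
    """Jaccard overlap of two sets."""
    return len(a & b) / len(a | b)

def remove_or_merge(p_i, p_j, ts, ss, omega=0.6):
    """Apply the remove-or-merge test to two candidate n-grams.
    ts maps an n-gram to its term set, ss to its sentence set; omega is
    the overlap threshold. Returns 'merge', 'delete_shorter', or 'keep_both'."""
    term_overlap = jaccard(ts[p_i], ts[p_j])
    sent_overlap = jaccard(ss[p_i], ss[p_j])
    if term_overlap >= omega and sent_overlap >= omega:
        return "merge"
    # one term set fully contained in the other, sentence sets still overlapping
    contained = len(ts[p_i] & ts[p_j]) / min(len(ts[p_i]), len(ts[p_j])) == 1
    if contained and sent_overlap >= omega:
        return "delete_shorter"
    return "keep_both"
```

For instance, "web search" versus "web search results" over the same sentences merges, while a single word fully contained in a longer phrase (with low Jaccard but full containment) causes the shorter gram to be deleted.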

Each candidate cluster label p is ranked by its significance Sig(p) as follows:

$\mathit{Sig}(p) = \mathit{tfidf}(p) \times \mathit{boost}(p)$
$\mathit{tfidf}(p) = \mathit{tf}(p) \times \log\left(1 + \dfrac{N}{\mathit{df}(p)}\right)$
$\mathit{boost}(p) = \begin{cases} 5 & \text{if } |p| > 8 \\ c \cdot |p| \cdot \mathit{base} & \text{if } |p| \le 8 \end{cases}$
where boost(p) is a boost factor for phrase p, and c and base are constants (c = 1.25 and base = 0.5).
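A transcription of the ranking formula; the piecewise boost is reconstructed here as c · |p| · base for |p| ≤ 8, which meets the constant 5 exactly at |p| = 8 (1.25 × 8 × 0.5 = 5). The garbled original may intend a different functional form, and the parameter names are illustrative.

```python
import math

def significance(tf, df, N, length, c=1.25, base=0.5):
    """Sig(p) = tfidf(p) * boost(p): frequent phrases of moderate length rank
    highest. N is the total number of snippets, length is |p| in words."""
    tfidf = tf * math.log(1 + N / df)
    boost = 5.0 if length > 8 else c * length * base
    return tfidf * boost

# with equal phrase frequency, longer labels (up to 8 words) get a larger boost
short_sig = significance(tf=3, df=2, N=50, length=2)
long_sig = significance(tf=3, df=2, N=50, length=6)
```

Under this reading the boost grows linearly with phrase length and then saturates at 5 for phrases longer than eight words.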

The top M candidate cluster labels are selected to construct base clusters: all snippets containing the same label (phrase) are aggregated into a base cluster labelled by that phrase.

Even though cluster labels are generated automatically, their evaluation may be best conducted against a manually created gold standard, where human annotators are asked to identify the fittest cluster for a given cluster label [17].

4. STC CLUSTERS LABELLING ENHANCEMENT

The enhancement process builds on the labels nominated by the standard STC algorithm. The original labels and/or clusters are modified and combined so that they become more concise and descriptive. To this end, the proposed methodology is applied to the original STC algorithm to produce an enhanced version of the classical algorithm. The methodology employs deeper linguistic analysis and more robust techniques (see Algorithm 1) than those used in related work such as [18].

Once the raw original labels have been induced by the STC algorithm, all cluster label phrases and clusters are processed according to Algorithm 1. The aim of this algorithm is to improve f1, f2, f3, f4, and f5 (see Section 5 for details) by refining and reformulating both labels and clusters. The major steps are described in Algorithm 1.

5. CLUSTERS LABELLING QUALITY EVALUATION

Labels for web search results clusters should be discriminative and should carefully describe the contents of individual clusters. Low-quality labelling may confuse users and mislead them while navigating through clusters, negatively affecting the whole process of meeting their information needs [18]. This section discusses the evaluation of the labels generated for clusters of web search results, which is important both for generating descriptive and precise labels and for comparing the descriptiveness of different labelling techniques.

Cluster labelling quality can be assessed with an external measure, classified according to the source of the "validity criteria." An external measure compares the clustering algorithm's results against externally, manually, or automatically prelabelled results in order to quantify the difference between the two. Many labelling quality measures have been proposed in different contexts in the literature. The authors of [18] introduced a metric to evaluate the quality of cluster labels using a comparative evaluation strategy, arguing that a reliable cluster labelling evaluation should take the following five parameters into consideration:

Algorithm 1 Labels Enhancement and Clusters Refinement

  1. Comprehensibility (f1): A cluster label should give the user a clear interpretation of the contents of a cluster. It can be formally defined as $\forall c \in C\ \forall p \in l_c : p \in L(G) \wedge |p| > 1$, where $l_c$ is the label of cluster c, and L(G) is a formal language identifying noun phrases (a word or group of words containing a noun and functioning in a sentence as subject, object, or prepositional object).

    $f_1(p) = NP(p) \cdot \mathit{Penalty}(p)$
    $NP(p) = \begin{cases} 1 & \text{if } p \in L(G) \\ 0 & \text{otherwise} \end{cases}$
    $\mathit{Penalty}(p) = \begin{cases} \exp\left(-\dfrac{(|p| - |p|_{opt})^2}{2d^2}\right) & \text{if } |p| > 1 \\ 0.5 & \text{otherwise} \end{cases}$

    The exponential expression in the Penalty definition penalizes phrases that are too short or too long, with $|p|_{opt} = 4$ and $d = 8$ [19].

  2. Descriptiveness (f2): All documents in a cluster should contain the label associated with that cluster. It can be formally defined as $\forall c \in C\ \forall p \in l_c\ \forall p' \in P_c \setminus l_c : df_c(p) \ge df_c(p')$, where $P_c$ is the set of phrases in cluster c and $df_c(p)$ is the number of documents in the cluster containing phrase p.

    $f_2(c,p) = 1 - \dfrac{1}{|P_c \setminus l_c|} \sum_{p' \in P_c \setminus l_c} \dfrac{df_c(p')}{df_c(p)}$

  3. Discriminative Power (f3): A cluster label should occur exclusively in documents from its associated cluster. It can be formally defined as:

    $\forall c_i, c_j \in C,\ c_i \ne c_j,\ \forall p \in l_{c_j} : \dfrac{df_{c_i}(p)}{|c_i|} \le \dfrac{df_{c_j}(p)}{|c_j|}$
    $f_3(c_j,p) = 1 - \dfrac{1}{k-1} \sum_{c_i \in C,\ c_i \ne c_j} \dfrac{|c_j| \cdot df_{c_i}(p)}{|c_i| \cdot df_{c_j}(p)}$
    where $c_i$ and $c_j$ are two clusters, k is the number of clusters, and $df_c(p)$ is the number of documents in a cluster containing the phrase p.

  4. Uniqueness (f4): Each cluster label should be uniquely associated with one cluster. It can be formally defined as: $\forall c_i, c_j \in C,\ c_i \ne c_j : l_{c_i} \cap l_{c_j} = \emptyset$

    $f_4(c_j,p) = 1 - \dfrac{1}{k-1} \sum_{c_i \in C,\ c_i \ne c_j} \dfrac{|\{p\} \cap l_{c_i}|}{|\{p\} \cap l_{c_j}|}$
    where p is a phrase and $l_c$ is the label associated with a cluster.

  5. Nonredundancy (f5): Cluster labels should not be synonymous (having the same or nearly the same meaning). It can be formally defined as $\forall c \in C\ \forall p, p' \in l_c,\ p \ne p'$ : p and p' are not synonymous.

    $f_5(c,p) = 1 - \dfrac{1}{|l_c| - 1} \sum_{p' \in l_c,\ p' \ne p} \mathit{Syn}(p,p')$
    where $\mathit{Syn} : P \times P \to \{0, 1\}$ indicates whether two phrases are synonyms.

Label relevancy: The relevance of a phrase with respect to a cluster combines all of the above constraints into a single criterion:

$\mathit{rel}(c,p) = \sum_{i=1}^{|F|} w_i f_i(c,p)$
where $w_i$ is a weighting factor and $F = \{f_1, \dots, f_5\}$, namely:
  • f1 : Comprehensibility

  • f2 : Descriptiveness

  • f3 : Discriminative Power

  • f4 : Uniqueness

  • f5 : Nonredundancy
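Combining the five criteria into rel(c, p) is a plain weighted sum; since the paper leaves the weighting factors w_i unspecified, the equal weights below are only an illustrative default.

```python
def relevance(f_values, weights=None):
    """Combine the labelling criteria f1..f5 into rel(c, p) as a weighted
    sum; equal weights are used when none are supplied."""
    if weights is None:
        weights = [1.0 / len(f_values)] * len(f_values)
    return sum(w * f for w, f in zip(weights, f_values))

# a label that is comprehensible (0.9), descriptive (0.8), and fully
# discriminative, unique, and non-redundant (1.0 each)
rel = relevance([0.9, 0.8, 1.0, 1.0, 1.0])  # equal-weight mean of the five scores
```

Non-uniform weights would let an evaluator privilege, say, discriminative power over non-redundancy when comparing labelling techniques.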

Cluster labelling quality measures can be categorized as (i) external, (ii) internal, and (iii) relative, according to the source of the "validity criteria." An external measure compares the clustering algorithm's results against external, manually or automatically preclustered results in order to disclose the difference between the two. An internal measure uses functions that assess the similarity among a cluster's documents and the dissimilarity between the resulting clusters, without reference to any external information. A relative measure assesses the results by comparing them against those of different algorithms, or of the same algorithm under different conditions, such as different thresholds [20].

6. RESULTS AND DISCUSSION

One of the challenges of work on WSRC is the lack of "ground truth" data. In some cases such data can be constructed by hand, but this still entails subjectivity and requires considerable resources, to the extent that constructing a significant benchmark is not feasible.

To act as a focus for the work described in this paper, the top-ranked web search results for the query Donald Trump were automatically clustered into relatively small thematic collections of documents (see Figure 4). These clustered web search results were retrieved from Carrot2, an open source WSRC engine (available at http://search.carrot2.org/stable/search), using the STC algorithm.

Figure 4

Clustering and labelling results using the classical Suffix Tree Clustering (STC) algorithm for the query: Donald Trump.

Figure 5

Clusters visualisation for the resulted clusters produced from the Suffix Tree Clustering (STC) algorithm for the query: Donald Trump.

Figure 6

Clustering and Labelling results using the enhanced Suffix Tree Clustering (STC) algorithm for the query: Donald Trump.

To evaluate the proposed methodology, the authors compared the reformulated clusters and their enhanced labels with the original clusters and labels generated by the classical STC algorithm. For the evaluation, the authors used the cluster-labelling performance measure with the five parameters discussed in Section 5. For the purpose of the evaluation, the numerical "intensity" of the computed values for f1: Comprehensibility, f2: Descriptiveness, f3: Discriminative Power, f4: Uniqueness, and f5: Nonredundancy was disregarded.

The results are presented in tabular form and show the performance of the proposed enhancement with respect to the classical STC algorithm. The evaluation shows that the proposed methodology performs well with respect to the quality of the enhanced cluster labels. Inspection of the recorded results in Figure 7 indicates that (i) the proposed methodology achieved better performance, with an overall average of 0.921 for the combined performance measure (f6); (ii) the number of clusters decreased from 15 to 9; (iii) the number of duplicated results decreased from 143 to 121; and (iv) the average number of phrases per label increased from 1.67 to 2.00.

Figure 7

Classical vs. enhanced Suffix Tree Clustering (STC).

7. CONCLUSIONS

In this paper the authors described a methodology for enhancing the classical STC algorithm for clustering web search results, and illustrated and evaluated its operation. The objective of this research was to deploy deep linguistic analysis techniques to enhance phrase labels, which in turn allows the structure of web search results clusters to be reformulated, yielding better performance for web search engines and better end-user satisfaction. The proposed methodology takes the labels nominated by the original STC algorithm and adapts those labels and/or clusters to be more concise and descriptive; applying it to the original algorithm produces an enhanced version of classical Suffix Tree Clustering. The enhanced algorithm was evaluated experimentally, and the produced clusters and labels were compared with those of the classical STC algorithm using the cluster-labelling performance measure with five parameters: f1 (Comprehensibility), f2 (Descriptiveness), f3 (Discriminative Power), f4 (Uniqueness), and f5 (Nonredundancy). The recorded results indicate that the enhanced labels outperform the original ones and that overall performance improved: a better score was achieved (f6 = 0.921), the number of clusters decreased (from 15 to 9), the number of duplicated web search results decreased (from 143 to 121), and the average number of phrases per label increased (from 1.67 to 2.00).

The promising results obtained so far indicate that (i) it is possible to capture the cluster structure, and (ii) it is possible to enhance the labels produced by the STC algorithm, improving overall performance by yielding more comprehensive and descriptive cluster labels so that users can preview and navigate easily and quickly.

Future work will initially be directed at adopting deeper linguistic approaches and data mining techniques to enhance other WSRC algorithms such as Lingo and K-means. The intention is also to increase the size of our dataset.

REFERENCES

1.R.K. Roul and S.K Sahay, Cluster labelling using chi-square-based keyword ranking and mutual information score: a hybrid approach, Int. J. Intell. Syst. Des. Comput., Vol. 1, No. 1–2, 2017, pp. 145-167.
2.H. Chim and X. Deng, Efficient phrase-based document similarity for clustering, IEEE Trans. Knowl. Data Eng., Vol. 20, No. 9, 2008, pp. 1217-1229.
3.H.-M. Li, C.-X. Sun, and K.-J. Wang, Clustering web search results using conceptual grouping, in 2009 International Conference on Machine Learning and Cybernetics, IEEE (Red Hook, NY), Vol. 3, 2009, pp. 1499-1503.
4.D. Carmel, H. Roitman, and N. Zwerdling, Enhancing cluster labeling using Wikipedia, in Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, ACM (New York, NY), 2009, pp. 139-146.
5.G. Kr Yadav and A. Kumar, A described feasibility analysis on web document clustering, Int. J. Res. Stud. Sci. Eng. Technol., Vol. 2, 2015, pp. 1-13.
6.U. Bharambe and A. Kale, Landscape of web search results clustering algorithms, S. Unnikrishnan, S. Surve, and D. Bhoir (editors), Advances in Computing, Communication and Control, Springer, Switzerland, 2011, pp. 95-107.
7.H. Agrawal and S. Yadav, Search engine results improvement–a review, in IEEE International Conference on Computational Intelligence Communication Technology (Piscataway, NJ, February 2015), pp. 180-185.
8.S. Kopidaki, P. Papadakos, and Y. Tzitzikas, STC+ and NM-STC: two novel online results clustering methods for web searching, Web Information Systems Engineering (WISE 2009), Springer Berlin Heidelberg, Berlin, Heidelberg, 2009, pp. 523-537.
9.T. Rani and A. Goyal, Survey of clustering techniques for information retrieval in data mining, Int. J. Sci. Eng. Technol. Res., Vol. 4, No. 4, 2015, pp. 738-740.
10.M. Waseem Khan, H.M. Shahzad Asif, and Y. Saleem, Semantic based cluster content discovery in description first clustering algorithm, Mehran University Research J. Eng. Technol., Vol. 36, No. 1, 2017, pp. 1-6.
11.S. Osinski and D. Weiss, A concept-driven algorithm for clustering search results, IEEE Intell. Syst., Vol. 20, No. 3, 2005, pp. 48-54.
12.Q. Shi, X. Qiao, and X. Guangquan, Using string kernel for document clustering, Int. J. Inf. Technol. Comp. Sci., Vol. 2, 2010, pp. 40-46.
13.A. Di and R. Navigli, Clustering web search results with maximum spanning trees, Congress of the Italian Association for Artificial Intelligence, Springer, 2011, pp. 201-212.
14.H.D. Abdulla and V. Snasel, Using singular value decomposition (svd) as a solution for search result clustering, in International Conference on Innovations in Information Technology, IEEE (Piscataway, NJ), 2008, pp. 302-306.
15.A. Sameh and A. Kadray, Semantic web search results clustering using lingo and wordnet, Int. J. Res. Rev. Comput. Sci., Vol. 1, No. 2, 2010, pp. 71-76. http://search.proquest.com/openview/d2be52a63fc5a2f4a20f0603250e8d56/1?pq-origsite=gscholar&cbl=276284
16.Y. Zhang and B. Feng, A co-occurrence based hierarchical method for clustering web search results, in IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, IEEE (Piscataway, NJ), Vol. 1, 2008, pp. 407-410.
17.A. Aker, E. Kurtic, A.R. Balamurali, M. Paramita, E. Barker, M. Hepple, and R. Gaizauskas, A Graph-Based Approach to Topic Clustering for Online Comments to News, Springer International Publishing, Cham, Switzerland, 2016, pp. 15-29.
18.R. Mahalakshmi and V.L. Praba, Enhancing the labelling technique of suffix tree clustering algorithm, Int. J. Data Min. Knowl. Manag. Process, Vol. 4, 2014, pp. 41-50.
19.D. Weiss, Descriptive clustering as a method for exploring text collections, PhD thesis, Citeseer, 2006.
20.T. Velmurugan and T. Santhanam, Clustering mixed data points using fuzzy c-means clustering algorithm for performance analysis, Int. J. Comput. Sci. Eng., Vol. 2, 2010, pp. 3100-3105.
