International Journal of Computational Intelligence Systems

Volume 12, Issue 1, November 2018, Pages 299 - 310

A Methodology to Refine Labels in Web Search Results Clustering

Authors
Zaher Salah1, *, Ahmad Aloqaily1, Malak Al-Hassan2, Abdel-Rahman Al-Ghuwairi1
1Prince Al Hussein Bin Abdullah II Faculty for Information Technology, Hashemite University, Zarqa, Jordan
2Department of Business of Information Technology, University of Jordan, Amman, Jordan
*Corresponding author. Email: zahersalah@hotmail.com
Received 25 June 2018, Revised 19 November 2018, Accepted 14 December 2018, Available Online 31 December 2018.
DOI
10.2991/ijcis.2019.125905647
Keywords
Information retrieval; Machine learning; Web search results clustering; Web intelligence
Abstract

Information retrieval systems such as web search engines meet the user's information needs by searching for and retrieving the documents that match the user's query. First, the query submitted to the web search engine is assumed to be a good representative of the user's intention and to reflect his or her information needs specifically; it should therefore be long enough, discriminative, specific, and unambiguous. Second, the web search engine typically responds to the query with a long flat list of web search results, where each result represents a relevant document. That list may contain thousands or even millions of results, which makes it difficult to navigate and to locate a document relevant to a specific topic. As a postretrieval process, web search results clustering can address this issue by organizing the results into clusters. Each cluster should contain topically related documents and carry a concise, descriptive label that correctly characterizes its contents. Users can then choose the cluster representing the intended topic and navigate through the relatively few documents inside it. High-quality cluster labelling is crucial because it gives users insight into a cluster's contents, its general structure, and the distribution of topics among its documents, allowing them to preview and navigate quickly and easily. To this end, this paper introduces a methodology to enhance the labels of clusters of web search results. The proposed methodology builds on the labels nominated by the original Suffix Tree Clustering (STC) algorithm and adapts these labels and/or clusters so that they become more concise and descriptive. The methodology was applied to the original STC algorithm to produce an enhanced version of the classical algorithm.
The enhanced algorithm was evaluated experimentally, and the resulting clusters and labels were compared with those of the classical STC algorithm. For the evaluation, the authors used a cluster-labelling performance measure that considers five parameters: f1 (Comprehensibility), f2 (Descriptiveness), f3 (Discriminative Power), f4 (Uniqueness), and f5 (Nonredundancy). The results show that the enhanced labels outperform the original ones and that overall performance improved: (i) the proposed methodology achieved an overall average of 0.921 on the combined performance measure (f6); (ii) the number of clusters decreased from 15 to 9; (iii) the number of duplicated results decreased from 143 to 121; and (iv) the average number of phrases per label increased from 1.67 to 2.00.

Copyright
© 2019 The Authors. Published by Atlantis Press SARL.
Open Access
This is an open access article distributed under the CC BY-NC 4.0 license (http://creativecommons.org/licenses/by-nc/4.0/).

1. INTRODUCTION

Web pages are dynamic, superabundant, and miscellaneous; a heterogeneous variety of topics is therefore to be expected from this massive body of documents, which originate from sources worldwide and cover diverse subjects such as arts, science, engineering, economy, politics, and sport.

Starting, typically, from the first web search result(s), the user may waste valuable time browsing many results that are irrelevant to the intended topic. As a consequence, only about 50% of users browse beyond the first two pages of web search results returned by the search engine [1]. This raises the need for more sophisticated web search engines that employ a "postretrieval" process to improve the presentation of results to users [2], organizing them into labelled clusters. Each cluster represents a specific topic, which lets the user avoid previewing many irrelevant documents by navigating deeper inside the cluster that covers the intended topic and contains homogeneous documents relevant to the user's information needs. It also helps naive users find "unexpected relevant documents" [3]. Achieving this requires more than grouping documents: the key requirement is choosing a comprehensible and expressive descriptor, or label, for each group, so that users can locate the intended documents more easily and quickly, relying on short, meaningful labels that concisely explain what each group's content is about [4].

In general, the performance of many search results clustering techniques remains poor with respect to cluster labelling, especially since the labelling phase depends heavily on the efficiency of the clustering phase, which determines the content of each cluster. Homogeneous cluster content is crucial for inducing a good label. Many promising research directions aim to extend and enhance specific phases of clustering algorithms. Choosing an appropriate clustering algorithm with optimized parameters, together with an efficient mechanism for electing representative terms as candidate labels, is the key to producing better clusters with concise, descriptive, and informative labels.

Web search engines (for example, Google, Bing, Dogpile, Yahoo, and Baidu) respond to a user query with a long list of search results meeting the information requirements (user intention) expressed by that query. In ranked retrieval systems, the search results list is ordered by decreasing score under a specific relevancy ranking scheme and typically contains, for each result, a title, a small portion of text called a snippet, and a URL [3]. Figure 1 shows web search results for the query Hashemite university presented as a flat ordered list. Both the low precision and the flat presentation of search results make meeting the user's information needs far more exhausting than it should be, raising the need for more sophisticated search engines in which the relevant results are easier to browse and navigate [5].

Figure 1

An example of a flat list of web search results for the query: Hashemite university containing a title, a URL and a snippet for each search result associated with a relevant document.

Figure 2

Query used for clustering.

Figure 3

A list of the first 6 web search results retrieved from different web search engines (meta search) for the query Donald Trump, containing a title, a URL, and a snippet for each search result.

2. PRELIMINARIES

In this section we provide background on the labelling phase of web search results clustering (WSRC). Various types of algorithms can be adopted for clustering textual documents, employing, for example, neural networks, fuzzy logic, rough sets, or graph theory. WSRC algorithms fall into two categories [6]: (i) numerical-based algorithms, which have performance issues with web search results clustering because they expect full text as input rather than the short snippets found in search results; moreover, their output is "raw numerical" and cannot be used to label clusters because it is uninterpretable to the user; and (ii) phrase-based algorithms, which produce more comprehensible and descriptive labels than numerical algorithms. Table 1 [6] below contrasts numerical-based and phrase-based algorithms, while Table 2 [7] compares the clustering algorithms most commonly used in the literature along various characteristics.

Numerical-Based Algorithms
  • Documents are converted to a term-document matrix.

  • Numerical algorithms require more data than is available.

  • The raw numerical outcome is difficult to convert back into a cluster description.

  • The data model usually used is the Vector Space Model.

Phrase-Based Algorithms
  • Clustering is based on frequent phrases instead of numerical features.

  • They are simpler than numerical algorithms.

  • These algorithms usually discard smaller clusters.

  • The data models usually used are n-grams and suffix trees.

Table 1

Comparison between numerical-based and phrase-based algorithms.

Method | Semantic Relation | Cluster Label | Phrase Based | Incremental | Complexity
K-Means Clustering | No | One word only | No | No | O(nkt), k: initial clusters, n: no. of documents, t: iterations
Suffix Tree Clustering | Yes | Shorter but appropriate | Yes | Yes, but merging phase is not incremental | O(n)
Lingo | No | Longer, more descriptive | Yes | No | O(n)
Semantic Suffix Tree | Yes | Meaningful and readable labels | Yes | Yes | O(n)
Improved K-Means | Yes | Based on K-means first and then on the documents linked to it | No | No | Time consuming
Inductive Clustering | Yes | Phrases extracted from internal and external summaries | Yes | No | Negligible with cluster titles
Fuzzy C-Medoid Clustering | No | Produces a category | Yes | Yes | O(n^2)
Histogram-based Clustering | Yes | Matching phrases of documents | Yes | Yes | O(n^2)
Hierarchical Clustering | No | Most frequent terms from inside clusters | No | Yes | O(n^2) single link, O(n^3) complete link
Semantic Hierarchical Online Clustering (SHOC) | Yes | Labels that describe clusters (frequent phrase extraction and SVD) | Yes | Yes | O(n)

SVD, singular value decomposition.

Table 2

A comparison between the most typically used clustering approaches.

The commonly used Suffix Tree Clustering (STC) algorithm treats each document as a string (sequence of words) rather than as a bag of words (BOW), which neglects word order and considers only the frequencies of distinct words in the corpus. STC uses suffix trees to summarize documents and extract frequent phrases, whereas other algorithms such as Semantic Hierarchical Online Clustering (SHOC) use suffix arrays instead [8]. The Label Induction Grouping Algorithm (Lingo) produces more clusters than STC and K-means, while STC is more scalable than Lingo and K-means [9]. Lingo begins by extracting expressive labels and then clusters documents individually to the fittest label (each label representing a cluster). Labels are generated from pruned frequent terms (phrases and words) that achieve the required level of descriptiveness and informativeness [7, 10].

In both the Lingo and SHOC algorithms, a cluster label should (i) be present in web search result snippets a number of times exceeding a given threshold, (ii) be meaningful and contained in a single sentence covering a specific topic, and (iii) be clear, that is, a complete (not partial), sufficiently long, and frequent phrase. In addition, stop words occurring inside the phrase must be preserved to produce more legible cluster labels [11].

STC is fast (linear in the number of documents) and incremental, which makes it well suited to search results clustering, an online postretrieval process in which time is a critical requirement [8]. STC clusters documents or search result snippets that share common phrases (sequences of words or single terms) and uses information about the frequency and order of terms in the documents. STC works in two main phases: (i) base cluster discovery using a suffix tree and (ii) merging base clusters into proper clusters. First, STC summarizes document contents and extracts phrases to be assigned as cluster labels, producing concise and meaningful labels [6] based on candidate frequent phrases that describe the main topic of the documents. Second, STC assigns snippets to each of these labels to form proper clusters. Thresholds are used to manage the clustering process, but tuning them is often problematic [6].
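Phase (i) above, base cluster discovery, can be sketched in a few lines. The toy implementation below enumerates shared word n-grams directly instead of building a true suffix tree (which would find the same shared phrases in linear time); the function name, the phrase-length cap, and the example snippets are illustrative choices, not from the paper.

```python
from collections import defaultdict

def base_clusters(snippets, min_docs=2):
    """Simplified stand-in for STC phase (i): group snippets by shared phrases.

    A real suffix tree finds all shared phrases in O(n); here we enumerate
    word n-grams directly, which is slower but easy to follow.
    """
    phrase_to_docs = defaultdict(set)
    for doc_id, snippet in enumerate(snippets):
        words = snippet.lower().split()
        for i in range(len(words)):
            for j in range(i + 1, min(i + 4, len(words)) + 1):  # phrases up to 4 words
                phrase_to_docs[" ".join(words[i:j])].add(doc_id)
    # keep phrases shared by enough snippets; each becomes a labelled base cluster
    return {p: docs for p, docs in phrase_to_docs.items() if len(docs) >= min_docs}

snips = ["hashemite university jordan", "hashemite university ranking", "weather in jordan"]
clusters = base_clusters(snips)
# the phrase "hashemite university" labels a base cluster covering snippets 0 and 1
```

Phase (ii), merging base clusters whose document sets overlap heavily, would then run on the returned dictionary.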

In the work described in [12], documents are also treated as strings (sequences of words), and similarity between documents is computed with a string-kernel function: the similarity of two documents is the number of matching subsequences, so more shared substrings (not necessarily contiguous) mean more similar documents. To group the documents, spectral clustering is used, a graph-based algorithm in which, briefly, the clustering problem becomes a graph-cut problem of isolating one set of nodes from the others in the collection.

In general, there are three essential steps for any WSRC method [13] listed as follows:

  1. Retrieve a list R = (r1, r2, ⋯, rn) of n search results for the user query q.

  2. Cluster R to form a list C = (C0, C1, ⋯, Cm) of m + 1 clusters.

  3. Label clusters.

The method described above extracts a meaningful label from each created cluster to serve as a good descriptor for that cluster, whereas in [14], for example, labels are induced first and clustering is then performed by assigning snippets to the closest preextracted label. The same holds for Lingo's "description comes first" approach, which uses frequent phrases to induce sufficiently distinct labels covering as many topics as possible and then clusters by assigning each snippet to the closest label [8, 11]. The steps of the "description comes first" approach are, briefly:

  1. Preprocessing the input snippets by performing tokenization, stemming, and stop-words removal.

  2. Extracting frequent words and phrases in the input snippets.

  3. Inducing cluster labels by employing singular value decomposition (SVD).

  4. Assigning snippets to each of these labels to form proper clusters.

  5. Postprocessing like clusters merging and pruning.
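The five steps above can be sketched on toy data as follows. This is a deliberate simplification of Lingo, not its actual implementation: single words stand in for frequent phrases, the stop list is minimal, stemming and the postprocessing step are omitted, and the strongest term per SVD concept is taken as the label.

```python
import numpy as np

def description_comes_first(snippets, k=2):
    """Toy 'description comes first' flow: preprocess, build a term-document
    matrix, induce k label terms via SVD, then assign snippets to labels."""
    stop = {"the", "a", "of", "in"}                                   # step 1: tokenize, remove stop words
    docs = [[w for w in s.lower().split() if w not in stop] for s in snippets]
    vocab = sorted({w for d in docs for w in d})                      # step 2: frequent terms (all terms here)
    A = np.array([[d.count(w) for d in docs] for w in vocab], dtype=float)
    U, _, _ = np.linalg.svd(A, full_matrices=False)                   # step 3: SVD over terms x documents
    labels = [vocab[int(np.argmax(np.abs(U[:, i])))] for i in range(k)]  # strongest term per concept
    assign = {}
    for j, d in enumerate(docs):                                      # step 4: snippet joins the label it mentions most
        assign[snippets[j]] = max(labels, key=d.count)
    return labels, assign

labels, assign = description_comes_first(
    ["apple fruit tart", "apple pie recipe", "linux kernel patch", "linux distro release"])
```

In the real algorithm, step 2 would extract complete frequent phrases and step 3 would match phrase vectors against the SVD basis rather than reading labels off the left singular vectors directly.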

In addition to clustering documents automatically within acceptable time, it is essential to assign a meaningful and comprehensible label to each cluster that concisely describes the semantic topic it covers. Labelling is not a priority in traditional data mining approaches, which are mainly concerned with grouping data precisely and efficiently, whereas WSRC is concerned with making search results easier to browse by grouping them into well-described clusters [6], so that users can locate the required documents, and even unexpected relevant ones, by reviewing a particular cluster [3].

3. CLUSTERS LABELLING

Extracting relevant terms to label clusters and act as readable, meaningful, and distinguishing group descriptors is a challenging process, especially in WSRC, where search result snippets (small portions of text) contain few terms. A term that is highly frequent locally in a cluster but infrequent globally is typically a good representative label for that cluster [3]. Terms can be weighted using local and global factors as follows:

  1. Local Factor:

    $L_t = \log(1 + F_C^t)$
    where $F_C^t$ is the number of documents in cluster C containing term t. The logarithm damps the effect of very high frequencies.

  2. Global Factor:

    $G_t = \log \dfrac{F_C^t / |C|}{F_R^t / |R|}$
    where $F_R^t$ is the number of documents in the search results R containing term t, and |C| and |R| are the numbers of documents in the cluster and in the result set, respectively.

The label selection criterion combines the local and global factors to score the terms in each cluster as follows:

$\mathit{Score}_t = L_t \times G_t$

For each cluster, the term with the highest score will be selected as the cluster label.
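The local/global scoring scheme above can be transcribed directly; note that the formula for $G_t$ here follows the reading of the garbled original as a log-ratio of the term's relative frequency inside the cluster to its relative frequency across all results, and the parameter names are illustrative.

```python
import math

def label_score(df_cluster, cluster_size, df_results, results_size):
    """Score a candidate label term for one cluster: terms that are locally
    frequent but globally rare score highest (Section 3)."""
    L = math.log(1 + df_cluster)                     # local factor, log-damped
    G = math.log((df_cluster / cluster_size) /
                 (df_results / results_size))        # local vs. global relative frequency
    return L * G

# term in 8 of 10 cluster docs but only 12 of 100 results overall: strong label
strong = label_score(8, 10, 12, 100)
# term spread evenly (8 of 10 locally, 80 of 100 globally): zero discriminative power
flat = label_score(8, 10, 80, 100)
```

A term that is proportionally no more frequent inside the cluster than outside it scores zero, so it would never be picked as the cluster label.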

The work in [15] used the Lingo algorithm to extract frequent phrases and original terms, together with synonym terms from the WordNet lexical database, in order to induce better abstractive labels for clusters. Other external knowledge resources such as Wikipedia can also be used to enrich candidate labels with meaningful terms imported from the free online encyclopedia, which contains a huge amount of "controlled," preclustered, and manually annotated content [4].

The authors of [16] proposed an approach that extracts significant bi-grams, combines them into n-grams according to term co-occurrence statistics, and uses the top-ranked non-redundant phrases as candidate labels. To retrieve significant bi-grams, a strength value is computed for each pair of words <w, wi> as follows:

$\mathit{strength} = \dfrac{\mathit{freq} - \bar{f}}{\sigma}$
$\bar{f} = \dfrac{1}{n} \sum_{i=1}^{n} \mathit{freq}_i$
$\sigma = \sqrt{\dfrac{1}{n} \sum_{i=1}^{n} (\mathit{freq}_i - \bar{f})^2}$

Word pairs whose strength value is below the threshold β0 are discarded.

The spread is also computed, as follows:

$\mathit{spread} = \dfrac{1}{2d} \sum_{-d \le j \le d,\, j \ne 0} (p_{ij} - \bar{p}_i)^2$
$\bar{p}_i = \dfrac{1}{2d} \sum_{-d \le j \le d,\, j \ne 0} p_{ij}$
spread describes the shape of the $p_{ij}$ histogram, where $p_{ij}$ counts occurrences of $w_i$ at position j relative to w within a window of d positions on either side. A small spread indicates a flat histogram, meaning $w_i$ can appear equivalently in almost any position around w, whereas a large spread indicates a histogram with peaks, meaning $w_i$ appears only in one (or a few) specific positions around w. Word pairs whose spread value is below the given threshold ρ0 are also discarded. The remaining word pairs are the significant bi-grams.
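The two filters translate directly into code: strength is read as a z-score of the pair's frequency against all pair frequencies, and spread as the variance of the positional histogram. The normalization constants are reconstructed from the garbled formulas and may differ slightly from [16]; the example histograms are invented for illustration.

```python
import math

def strength(freq, all_freqs):
    """z-score of a bi-gram's frequency relative to all bi-gram frequencies."""
    n = len(all_freqs)
    mean = sum(all_freqs) / n
    sd = math.sqrt(sum((f - mean) ** 2 for f in all_freqs) / n)
    return (freq - mean) / sd

def spread(position_counts):
    """Variance of the co-occurrence histogram p_ij over the 2d positions
    j = -d..d (j != 0) around w. Near zero means w_i floats freely around w;
    a large value means w_i is pinned to a few fixed positions."""
    mean = sum(position_counts) / len(position_counts)
    return sum((c - mean) ** 2 for c in position_counts) / len(position_counts)

flat_hist = [5, 5, 5, 5]     # w_i appears evenly around w: discarded as a bi-gram
peaky_hist = [0, 20, 0, 0]   # w_i almost always directly before w: kept
```

Pairs passing both thresholds (β0 for strength, ρ0 for spread) survive as significant bi-grams.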

Next, the bi-grams are used to discover n-grams. Each bi-gram <w, wi> is represented in a graph as a directed edge, with the two words as vertices. A tri-gram "abc" is identified if the edges ab, bc, and ac all exist.

The n-gram relation is defined recursively as
$\mathit{ngram}(w_1 \dots w_n) = \begin{cases} \mathit{edge}(w_1 w_2) & \text{if } n = 2 \\ \mathit{ngram}(w_1 \dots w_{n-1}) \wedge \bigwedge_{i=1}^{n-1} \mathit{edge}(w_i w_n) & \text{if } n > 2 \end{cases}$
where $\mathit{edge}(w_1 w_2) = \begin{cases} \text{true} & \text{if } (w_1, w_2) \text{ is a significant bi-gram} \\ \text{false} & \text{otherwise} \end{cases}$

A depth-first traversal of all nodes extracts all the n-grams. Redundant n-grams are then eliminated, and only non-redundant n-grams are used as candidate cluster labels. Redundancy is removed by applying a remove-or-merge process.
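The graph construction and depth-first extraction can be sketched as follows. The guard against revisiting a word already on the path is an added safeguard against cycles, which means this sketch does not discover n-grams with repeated words.

```python
def ngrams_from_bigrams(bigrams):
    """Grow candidate n-grams from significant bi-gram edges: a path
    w1..w(n-1) extends to wn only if an edge wi -> wn exists for every
    word already on the path (e.g. "abc" needs edges ab, bc, and ac)."""
    succ = {}
    for a, b in bigrams:
        succ.setdefault(a, set()).add(b)
    found = {tuple(bg) for bg in bigrams}
    def dfs(path):
        for nxt in succ.get(path[-1], set()):
            if nxt in path:                               # cycle guard (sketch-only)
                continue
            if all(nxt in succ.get(w, set()) for w in path):  # edge from every word
                found.add(tuple(path) + (nxt,))
                dfs(path + [nxt])
    for a, b in bigrams:
        dfs([a, b])
    return found

grams = ngrams_from_bigrams([("a", "b"), ("b", "c"), ("a", "c")])
# ("a", "b", "c") is discovered because edges ab, bc, and ac all exist
```

Dropping the ("a", "c") edge would make the tri-gram fail the all-edges test, exactly as the recursive definition requires.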

Let ts(p) be the term set of n-gram p, ss(p) be the sentence set of p, and ω0 be a threshold. The remove-or-merge condition is defined as:

$\text{if } \dfrac{|ts(p_i) \cap ts(p_j)|}{|ts(p_i) \cup ts(p_j)|} \ge \omega_0 \text{ and } \dfrac{|ss(p_i) \cap ss(p_j)|}{|ss(p_i) \cup ss(p_j)|} \ge \omega_0, \text{ merge } p_i \text{ and } p_j$
$\text{if } \dfrac{|ts(p_i) \cap ts(p_j)|}{\min(|ts(p_i)|, |ts(p_j)|)} = 1 \text{ and } \dfrac{|ss(p_i) \cap ss(p_j)|}{|ss(p_i) \cup ss(p_j)|} \ge \omega_0, \text{ delete the shorter n-gram}$
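In code, the two branches of the remove-or-merge condition reduce to Jaccard overlaps on the term and sentence sets; the threshold value 0.6 and the dictionary-based interface below are illustrative choices, not from [16].

```python
def jaccard(a, b):
    """Jaccard overlap of two sets."""
    return len(a & b) / len(a | b)

def remove_or_merge(p_i, p_j, ts, ss, omega=0.6):
    """Apply the remove-or-merge test to two candidate n-grams.
    ts maps an n-gram to its term set, ss to its sentence set; omega is
    the overlap threshold. Returns 'merge', 'delete_shorter', or 'keep_both'."""
    term_overlap = jaccard(ts[p_i], ts[p_j])
    sent_overlap = jaccard(ss[p_i], ss[p_j])
    if term_overlap >= omega and sent_overlap >= omega:
        return "merge"
    # one term set fully contained in the other, sentence sets still overlapping
    contained = len(ts[p_i] & ts[p_j]) / min(len(ts[p_i]), len(ts[p_j])) == 1
    if contained and sent_overlap >= omega:
        return "delete_shorter"
    return "keep_both"
```

For instance, "web search" versus "web search results" over the same sentences merges, while a single word fully contained in a longer phrase (with low Jaccard but full containment) causes the shorter gram to be deleted.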

Each candidate cluster label p is ranked by its significance Sig(p) as follows:

$\mathit{Sig}(p) = \mathit{tfidf}(p) \times \mathit{boost}(p)$
$\mathit{tfidf}(p) = \mathit{tf}(p) \times \log\left(1 + \dfrac{N}{\mathit{df}(p)}\right)$
$\mathit{boost}(p) = \begin{cases} 5 & \text{if } |p| > 8 \\ c \cdot |p| \cdot \mathit{base} & \text{if } |p| \le 8 \end{cases}$
where boost(p) is a boost factor for phrase p, and c and base are constants (c = 1.25 and base = 0.5).
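A transcription of the ranking formula; the piecewise boost is reconstructed here as c · |p| · base for |p| ≤ 8, which meets the constant 5 exactly at |p| = 8 (1.25 × 8 × 0.5 = 5). The garbled original may intend a different functional form, and the parameter names are illustrative.

```python
import math

def significance(tf, df, N, length, c=1.25, base=0.5):
    """Sig(p) = tfidf(p) * boost(p): frequent phrases of moderate length rank
    highest. N is the total number of snippets, length is |p| in words."""
    tfidf = tf * math.log(1 + N / df)
    boost = 5.0 if length > 8 else c * length * base
    return tfidf * boost

# with equal phrase frequency, longer labels (up to 8 words) get a larger boost
short_sig = significance(tf=3, df=2, N=50, length=2)
long_sig = significance(tf=3, df=2, N=50, length=6)
```

Under this reading the boost grows linearly with phrase length and then saturates at 5 for phrases longer than eight words.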

The top M candidate cluster labels are selected to construct base clusters: all snippets containing the same label (phrase) are aggregated into a base cluster labelled by that phrase.

Even though cluster labels are generated automatically, their evaluation may be best conducted against a manually created gold standard, where human annotators are asked to identify the fittest cluster for a given cluster label [17].

4. STC CLUSTERS LABELLING ENHANCEMENT

The enhancement process builds on the labels nominated by the standard STC algorithm. The original labels and/or clusters are modified and combined so that they become more concise and descriptive. To this end, the proposed methodology is applied to the original STC algorithm to produce an enhanced version of the classical algorithm. The methodology employs deeper linguistic analysis and more robust techniques (see Algorithm 1) than those used in related work such as [18].

Once the raw original labels have been induced by the STC algorithm, all cluster label phrases and clusters are processed according to Algorithm 1. The aim of this algorithm is to improve f1, f2, f3, f4, and f5 (see Section 5 for details) by refining and reformulating both labels and clusters. The major steps are described in Algorithm 1.

5. CLUSTERS LABELLING QUALITY EVALUATION

Labels for web search results clusters should be discriminative and should carefully describe the contents of individual clusters. Low-quality labelling may confuse users and mislead them while navigating through clusters, negatively affecting the whole process of meeting their information needs [18]. This section discusses the evaluation of the labels generated for clusters of web search results, which is important both for generating descriptive and precise labels and for comparing the descriptiveness of different labelling techniques.

Cluster labelling quality can be assessed with an external measure, classified according to the source of the "validity criteria." An external measure compares the clustering algorithm's results against externally, manually, or automatically prelabelled results in order to quantify the difference between the two. Many labelling quality measures have been proposed in different contexts in the literature. The authors of [18] introduced a metric to evaluate the quality of cluster labels using a comparative evaluation strategy, arguing that a reliable cluster labelling evaluation should take the following five parameters into consideration:

Algorithm 1 Labels Enhancement and Clusters Refinement

  1. Comprehensibility (f1): A cluster label should give the user a clear interpretation of the contents of a cluster. It can be formally defined as $\forall c \in C\ \forall p \in l_c : p \in L(G) \wedge |p| > 1$, where $l_c$ is the label of cluster c, and L(G) is a formal language identifying noun phrases (a word or group of words containing a noun and functioning in a sentence as subject, object, or prepositional object).

    $f_1(p) = NP(p) \cdot \mathit{Penalty}(p)$
    $NP(p) = \begin{cases} 1 & \text{if } p \in L(G) \\ 0 & \text{otherwise} \end{cases}$
    $\mathit{Penalty}(p) = \begin{cases} \exp\left(-\dfrac{(|p| - |p|_{opt})^2}{2d^2}\right) & \text{if } |p| > 1 \\ 0.5 & \text{otherwise} \end{cases}$

    The exponential expression in the Penalty definition penalizes phrases that are too short or too long, with $|p|_{opt} = 4$ and $d = 8$ [19].

  2. Descriptiveness (f2): All documents in a cluster should contain the label associated with that cluster. It can be formally defined as $\forall c \in C\ \forall p \in l_c\ \forall p' \in P_c \setminus l_c : df_c(p) \ge df_c(p')$, where $P_c$ is the set of phrases in cluster c and $df_c(p)$ is the number of documents in the cluster containing phrase p.

    $f_2(c,p) = 1 - \dfrac{1}{|P_c \setminus l_c|} \sum_{p' \in P_c \setminus l_c} \dfrac{df_c(p')}{df_c(p)}$

  3. Discriminative Power (f3): A cluster label should occur exclusively in documents from its associated cluster. It can be formally defined as:

    $\forall c_i, c_j \in C,\ c_i \ne c_j,\ \forall p \in l_{c_j} : \dfrac{df_{c_i}(p)}{|c_i|} \le \dfrac{df_{c_j}(p)}{|c_j|}$
    $f_3(c_j,p) = 1 - \dfrac{1}{k-1} \sum_{c_i \in C,\ c_i \ne c_j} \dfrac{|c_j| \cdot df_{c_i}(p)}{|c_i| \cdot df_{c_j}(p)}$
    where $c_i$ and $c_j$ are two clusters, k is the number of clusters, and $df_c(p)$ is the number of documents in a cluster containing the phrase p.

  4. Uniqueness (f4): Each cluster label should be uniquely associated with one cluster. It can be formally defined as: $\forall c_i, c_j \in C,\ c_i \ne c_j : l_{c_i} \cap l_{c_j} = \emptyset$

    $f_4(c_j,p) = 1 - \dfrac{1}{k-1} \sum_{c_i \in C,\ c_i \ne c_j} \dfrac{|\{p\} \cap l_{c_i}|}{|\{p\} \cap l_{c_j}|}$
    where p is a phrase and $l_c$ is the label associated with a cluster.

  5. Nonredundancy (f5): Cluster labels should not be synonymous (having the same or nearly the same meaning). It can be formally defined as $\forall c \in C\ \forall p, p' \in l_c,\ p \ne p'$ : p and p' are not synonymous.

    $f_5(c,p) = 1 - \dfrac{1}{|l_c| - 1} \sum_{p' \in l_c,\ p' \ne p} \mathit{Syn}(p,p')$
    where $\mathit{Syn} : P \times P \to \{0, 1\}$ indicates whether two phrases are synonyms.

Label relevancy: The relevance of a phrase with respect to a cluster combines all of the above constraints into a single criterion:

$\mathit{rel}(c,p) = \sum_{i=1}^{|F|} w_i f_i(c,p)$
where $w_i$ is a weighting factor and $F = \{f_1, \dots, f_5\}$, namely:
  • f1 : Comprehensibility

  • f2 : Descriptiveness

  • f3 : Discriminative Power

  • f4 : Uniqueness

  • f5 : Nonredundancy
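Combining the five criteria into rel(c, p) is a plain weighted sum; since the paper leaves the weighting factors w_i unspecified, the equal weights below are only an illustrative default.

```python
def relevance(f_values, weights=None):
    """Combine the labelling criteria f1..f5 into rel(c, p) as a weighted
    sum; equal weights are used when none are supplied."""
    if weights is None:
        weights = [1.0 / len(f_values)] * len(f_values)
    return sum(w * f for w, f in zip(weights, f_values))

# a label that is comprehensible (0.9), descriptive (0.8), and fully
# discriminative, unique, and non-redundant (1.0 each)
rel = relevance([0.9, 0.8, 1.0, 1.0, 1.0])  # equal-weight mean of the five scores
```

Non-uniform weights would let an evaluator privilege, say, discriminative power over non-redundancy when comparing labelling techniques.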

Cluster labelling quality measures can be categorized as (i) external, (ii) internal, and (iii) relative, according to the source of the "validity criteria." An external measure compares the clustering algorithm's results against external, manually or automatically preclustered results in order to disclose the difference between the two. An internal measure uses functions that assess the similarity among a cluster's documents and the dissimilarity between the resulting clusters, without reference to any external information. A relative measure assesses the results by comparing them against those of different algorithms, or of the same algorithm under different conditions, such as different thresholds [20].

6. RESULTS AND DISCUSSION

One of the challenges of work on WSRC is the lack of "ground truth" data. In some cases such data can be constructed by hand, but this still entails subjectivity and requires considerable resources, to the extent that constructing a significant benchmark is not feasible.

To act as a focus for the work described in this paper, the top-ranked web search results for the query Donald Trump were automatically clustered into relatively small thematic collections of documents (see Figure 4). These clustered web search results were retrieved from Carrot2, an open source WSRC engine (available at http://search.carrot2.org/stable/search), using the STC algorithm.

Figure 4

Clustering and labelling results using the classical Suffix Tree Clustering (STC) algorithm for the query: Donald Trump.

Figure 5

Clusters visualisation for the resulted clusters produced from the Suffix Tree Clustering (STC) algorithm for the query: Donald Trump.

Figure 6

Clustering and Labelling results using the enhanced Suffix Tree Clustering (STC) algorithm for the query: Donald Trump.

To evaluate the proposed methodology, the authors compared the reformulated clusters and their enhanced labels with the original clusters and labels generated by the classical STC algorithm. For the evaluation, the authors used the cluster-labelling performance measure with the five parameters discussed in Section 5. For the purpose of the evaluation, the numerical "intensity" of the computed values for f1: Comprehensibility, f2: Descriptiveness, f3: Discriminative Power, f4: Uniqueness, and f5: Nonredundancy was disregarded.

The results are presented in tabular form and show the performance of the proposed enhancement with respect to the classical STC algorithm. The evaluation shows that the proposed methodology performs well with respect to the quality of the enhanced cluster labels. Inspection of the recorded results in Figure 7 indicates that (i) the proposed methodology achieved better performance, with an overall average of 0.921 for the combined performance measure (f6); (ii) the number of clusters decreased from 15 to 9; (iii) the number of duplicated results decreased from 143 to 121; and (iv) the average number of phrases per label increased from 1.67 to 2.00.

Figure 7

Classical vs. enhanced Suffix Tree Clustering (STC).

7. CONCLUSIONS

In this paper the authors described a methodology for enhancing the classical STC algorithm for clustering web search results, and illustrated and evaluated its operation. The objective of this research was to deploy deep linguistic analysis techniques to enhance phrase labels, which in turn allows the structure of web search results clusters to be reformulated, yielding better performance for web search engines and better end-user satisfaction. The proposed methodology takes the labels nominated by the original STC algorithm and adapts those labels and/or clusters to be more concise and descriptive; applying it to the original algorithm produces an enhanced version of classical Suffix Tree Clustering. The enhanced algorithm was evaluated experimentally, and the produced clusters and labels were compared with those of the classical STC algorithm using the cluster-labelling performance measure with five parameters: f1 (Comprehensibility), f2 (Descriptiveness), f3 (Discriminative Power), f4 (Uniqueness), and f5 (Nonredundancy). The recorded results indicate that the enhanced labels outperform the original ones and that overall performance improved: a better score was achieved (f6 = 0.921), the number of clusters decreased (from 15 to 9), the number of duplicated web search results decreased (from 143 to 121), and the average number of phrases per label increased (from 1.67 to 2.00).

The promising results obtained so far indicate that (i) it is possible to capture the cluster structure, and (ii) it is possible to enhance the labels produced by the STC algorithm, improving overall performance by yielding more comprehensive and descriptive cluster labels so that users can preview and navigate easily and quickly.

Future work will initially be directed at adopting deeper linguistic approaches and data mining techniques to enhance other WSRC algorithms such as Lingo and K-means. The intention is also to increase the size of our dataset.

REFERENCES

1.R.K. Roul and S.K Sahay, Cluster labelling using chi-square-based keyword ranking and mutual information score: a hybrid approach, Int. J. Intell. Syst. Des. Comput., Vol. 1, No. 1–2, 2017, pp. 145-167.
2.H. Chim and X. Deng, Efficient phrase-based document similarity for clustering, IEEE Trans. Knowl. Data Eng., Vol. 20, No. 9, 2008, pp. 1217-1229.
3.H.-M. Li, C.-X. Sun, and K.-J. Wang, Clustering web search results using conceptual grouping, in 2009 International Conference on Machine Learning and Cybernetics, IEEE (Red Hook, NY), Vol. 3, 2009, pp. 1499-1503.
4.D. Carmel, H. Roitman, and N. Zwerdling, Enhancing cluster labeling using Wikipedia, in Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, ACM (New York, NY), 2009, pp. 139-146.
5.G. Kr Yadav and A. Kumar, A described feasibility analysis on web document clustering, Int. J. Res. Stud. Sci. Eng. Technol., Vol. 2, 2015, pp. 1-13.
6.U. Bharambe and A. Kale, Landscape of web search results clustering algorithms, S. Unnikrishnan, S. Surve, and D. Bhoir (editors), Advances in Computing, Communication and Control, Springer, Switzerland, 2011, pp. 95-107.
7.H. Agrawal and S. Yadav, Search engine results improvement–a review, in IEEE International Conference on Computational Intelligence Communication Technology (Piscataway, NJ, February 2015), pp. 180-185.
8.S. Kopidaki, P. Papadakos, and Y. Tzitzikas, STC+ and NM-STC: two novel online results clustering methods for web searching, Web Information Systems Engineering (WISE 2009), Springer Berlin Heidelberg, Berlin, Heidelberg, 2009, pp. 523-537.
9.T. Rani and A. Goyal, Survey of clustering techniques for information retrieval in data mining, Int. J. Sci. Eng. Technol. Res., Vol. 4, No. 4, 2015, pp. 738-740.
10.M. Waseem Khan, H.M. Shahzad Asif, and Y. Saleem, Semantic based cluster content discovery in description first clustering algorithm, Mehran University Research J. Eng. Technol., Vol. 36, No. 1, 2017, pp. 1-6.
11.S. Osinski and D. Weiss, A concept-driven algorithm for clustering search results, IEEE Intell. Syst., Vol. 20, No. 3, 2005, pp. 48-54.
12.Q. Shi, X. Qiao, and X. Guangquan, Using string kernel for document clustering, Int. J. Inf. Technol. Comp. Sci., Vol. 2, 2010, pp. 40-46.
13.A. Di and R. Navigli, Clustering web search results with maximum spanning trees, Congress of the Italian Association for Artificial Intelligence, Springer, 2011, pp. 201-212.
14.H.D. Abdulla and V. Snasel, Using singular value decomposition (svd) as a solution for search result clustering, in International Conference on Innovations in Information Technology, IEEE (Piscataway, NJ), 2008, pp. 302-306.
15.A. Sameh and A. Kadray, Semantic web search results clustering using lingo and wordnet, Int. J. Res. Rev. Comput. Sci., Vol. 1, No. 2, 2010, pp. 71-76. http://search.proquest.com/openview/d2be52a63fc5a2f4a20f0603250e8d56/1?pq-origsite=gscholar&cbl=276284
16.Y. Zhang and B. Feng, A co-occurrence based hierarchical method for clustering web search results, in IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, IEEE (Piscataway, NJ), Vol. 1, 2008, pp. 407-410.
17.A. Aker, E. Kurtic, A.R. Balamurali, M. Paramita, E. Barker, M. Hepple, and R. Gaizauskas, A Graph-Based Approach to Topic Clustering for Online Comments to News, Springer International Publishing, Cham, Switzerland, 2016, pp. 15-29.
18.R. Mahalakshmi and V.L. Praba, Enhancing the labelling technique of suffix tree clustering algorithm, Int. J. Data Min. Knowl. Manag. Process, Vol. 4, 2014, pp. 41-50.
19.D. Weiss, Descriptive clustering as a method for exploring text collections, PhD thesis, Citeseer, 2006.
20.T. Velmurugan and T. Santhanam, Clustering mixed data points using fuzzy c-means clustering algorithm for performance analysis, Int. J. Comput. Sci. Eng., Vol. 2, 2010, pp. 3100-3105.
