On the use of Side Information Based Improved K-Means Algorithm for Text Clustering

Nikhil Patankar; Sailee Salkar

Abstract

In many text mining applications, such as scientific research
publication data and the Internet Movie Database, meta-information
or side-information is linked with the collection of text documents.
Such attributes may contain a tremendous amount of information for
clustering purposes. However, the relative importance of this
side-information may be difficult to estimate, especially when some
of it is noisy. Incorporating side-information into the mining process
is therefore risky: it can either improve the quality of the
representation or add noise to the process. This paper explores a way
to perform the mining process so as to maximize the advantages of
using side-information in text mining applications. It uses the
COntent and Auxiliary attribute based TExt clustering Algorithm
(COATES), which combines classical partitioning algorithms with
probabilistic models to create an effective clustering approach, and
it also considers the extension of this approach to the
classification problem.
