On the use of Side Information Based Improved K-Means Algorithm for Text Clustering

Nikhil Patankar, Sailee Salkar


In many text mining applications, such as scientific research
publication data and the Internet Movie Database, meta-information
or side-information is linked with the text document collection.
Such attributes may contain a tremendous amount of information for
clustering purposes. However, the relative importance of this
side-information may be difficult to estimate, especially when some
of the information is noisy. Additionally, it can be risky to
incorporate side-information into the mining process, because it can
either improve the quality of the representation for the mining
process or add noise to it. Therefore, this paper explores a way to
perform the mining process so as to maximize the advantages of using
this side-information in text mining applications. It uses the
COntent and Auxiliary attribute based TExt clustering Algorithm
(COATES) approach, which combines classical partitioning algorithms
with probabilistic models to create an effective clustering approach,
along with its extension to the classification problem.







Copyright © IJETT, International Journal on Emerging Trends in Technology