On the use of Side Information Based Improved K-Means Algorithm for Text Clustering
Abstract
In many text collections, such as research publication data and the
Internet Movie Database, meta-information or side information is linked
with the text documents. Such attributes may contain a tremendous amount
of information for clustering purposes. However, the relative importance
of this side information may be difficult to estimate, especially when
some of it is noisy. It can also be risky to incorporate side information
into the mining process, because it can either improve the quality of
the representation or add noise to the process. This paper therefore
explores a way to perform the mining process so as to maximize the
advantage of using side information in text mining applications. It
uses the COntent and Auxiliary attribute based TExt clustering Algorithm
(COATES), which combines classical partitioning algorithms with
probabilistic models to create an effective clustering approach, along
with its extension to the classification problem.
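
To make the idea of combining text content with side information concrete, the following is a minimal sketch and not the COATES algorithm itself. It assumes a plain k-means-style partitioning loop in which each document is scored against a cluster by a weighted sum of content similarity and auxiliary-attribute similarity. The function names, the `side_weight` parameter, the fixed weighting scheme, and the seeding from the first k documents are all assumptions made for illustration; COATES instead uses a probabilistic model to estimate how informative each attribute is.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Component-wise mean of a list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)  # Counter.update adds values key by key
    return {t: c / len(vectors) for t, c in total.items()}

def cluster(texts, side, k, side_weight=0.3, iters=10):
    """Partition documents using content plus side information.

    side_weight is a hypothetical fixed weight on the auxiliary
    attributes; it stands in for the learned importance that a
    probabilistic model would provide.
    """
    docs = [Counter(t.lower().split()) for t in texts]
    aux = [{a: 1.0 for a in s} for s in side]  # binary attribute vectors
    # Assumed seeding: the first k documents act as initial centroids.
    text_c = [dict(docs[i]) for i in range(k)]
    aux_c = [dict(aux[i]) for i in range(k)]
    assign = None
    for _ in range(iters):
        new = []
        for d, a in zip(docs, aux):
            scores = [
                (1 - side_weight) * cosine(d, text_c[j])
                + side_weight * cosine(a, aux_c[j])
                for j in range(k)
            ]
            new.append(max(range(k), key=scores.__getitem__))
        if new == assign:  # assignments are stable, so stop
            break
        assign = new
        for j in range(k):
            members = [i for i, c in enumerate(assign) if c == j]
            if members:  # keep the old centroid if a cluster empties
                text_c[j] = centroid([docs[i] for i in members])
                aux_c[j] = centroid([aux[i] for i in members])
    return assign

texts = [
    "clustering text documents kmeans",
    "movie review drama actor",
    "kmeans clustering algorithm text",
    "actor drama film review",
    "untitled draft notes",
]
side = [{"author:a"}, {"genre:drama"}, {"author:a"}, {"genre:drama"}, {"author:a"}]
print(cluster(texts, side, k=2))  # [0, 1, 0, 1, 0]
```

In this toy input the last document shares no words with any other document, so only its side information (the shared author attribute) places it in the first cluster. With noisy attributes, a fixed weight like this could just as easily degrade the result, which is the risk the abstract describes.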
Copyright © IJETT, International Journal on Emerging Trends in Technology