Reducing Data Skew with Round Robin Horizontal Partitioning of Data for Distributed Association Rule Mining of Large Data Set
Abstract
computer in all field. This data is not useful for decision making in
business unless it is mined to extract interesting knowledge from it.
Various data mining techniques are used to analyze such data and
extract knowledge from it. Association rule mining is one of them; it
aims at finding associations or relations among data items. As the
size of the data increases, knowledge discovery on high-volume data
becomes slow with conventional data mining techniques, because the
mining has to be done serially. The solution is to speed up the
learning process with parallel or distributed techniques. Distributed
mining algorithms make it possible to observe interesting relations
and patterns between the variables of a large database. With a
parallel or distributed approach, the time complexity of a data mining
algorithm can improve from O(N) to a lower bound of O(N/k), where N
is the number of data instances and k is the number of nodes in the
distributed system [1].
Partitioning and distributing data across the nodes of a distributed
system may lead to data skew, which in turn causes problems in
computing support and confidence. This paper addresses distributed
association rule mining on large datasets and the merging of the
resulting rules into a single rule set. The proposed system
distributes a large data set horizontally using the round robin method
and performs association rule mining with the Apriori algorithm, with
a global support count of at least s and a confidence of at least c.
Duplicate rules in the system create rule redundancy; they are
detected and removed from the rule set before the final merger of the
rules at the central server. Data security in distributed mining has
been addressed by many researchers, so it is not covered here. The
speed-up achieved with the proposed method is significant, and the
available computing resources are well utilized.
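
The steps named in the abstract (round robin horizontal partitioning, a global support count summed over nodes, and removal of duplicate rules before the final merge) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the toy transactions, and the representation of a rule as a tuple are assumptions made for illustration, and itemsets are counted by brute force rather than with Apriori candidate pruning.

```python
from collections import Counter
from itertools import combinations


def round_robin_partition(transactions, k):
    """Horizontal partitioning: transaction i goes to node i % k."""
    partitions = [[] for _ in range(k)]
    for i, transaction in enumerate(transactions):
        partitions[i % k].append(transaction)
    return partitions


def local_counts(partition, size):
    """Count every itemset of the given size within one node's partition."""
    counts = Counter()
    for transaction in partition:
        # Sort so that the same itemset always yields the same tuple key.
        for itemset in combinations(sorted(set(transaction)), size):
            counts[itemset] += 1
    return counts


def global_frequent(partitions, size, min_support):
    """Sum local counts over all nodes; keep itemsets with count >= min_support."""
    total = Counter()
    for partition in partitions:
        total.update(local_counts(partition, size))
    return {itemset: c for itemset, c in total.items() if c >= min_support}


def merge_rules(rule_sets):
    """Merge per-node rule collections into one set, dropping duplicates."""
    merged = set()
    for rules in rule_sets:
        merged.update(rules)
    return merged


# Hypothetical toy data, not taken from the paper.
transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["milk", "butter"],
    ["bread", "milk", "butter"],
    ["bread", "milk"],
]
parts = round_robin_partition(transactions, k=2)
frequent_pairs = global_frequent(parts, size=2, min_support=3)
```

Because round robin assignment gives every node either floor(N/k) or ceil(N/k) transactions, partition sizes differ by at most one, which is the sense in which the scheme limits skew in this sketch.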
References
G. Alex and A. Freitas, "Scalable, high-performance data mining with
parallel processing", in Principles and Practice of Knowledge Discovery
in Databases, Nantes, France, 1998.
J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan
Kaufmann Publishers, third edition, 2014.
R. Agrawal and J. C. Shafer, "Parallel Mining of Association Rules",
IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6,
pp. 962-969.
D. W. Cheung, J. Han, V. T. Ng, and A. W. Fu, "A fast distributed
algorithm for mining association rules", in Proceedings of the IEEE 4th
International Conference on Parallel and Distributed Information
Systems, pp. 31-42, December 1996.
T. Tassa, "Secure mining of association rules in horizontally
distributed databases", IEEE Transactions on Knowledge and Data
Engineering, Vol. 26, No. 4, April 2014.
T. Tassa and E. Gudes, "Secure distributed computation of anonymized
views of shared databases", Transactions on Database Systems, 2012.
M. Kantarcioglu and C. Clifton, "Privacy-preserving distributed mining
of association rules on horizontally partitioned data", IEEE
Transactions on Knowledge and Data Engineering, 16:1026-1037, 2004.
R. Agrawal and R. Srikant, "Fast algorithms for mining association
rules in large databases", in VLDB, pp. 487-499, 1994.
A. Ben-David, N. Nisan, and B. Pinkas, "FairplayMP - A System for
Secure Multi-Party Computation", Proc. 15th ACM Conf. Computer and
Comm. Security (CCS), pp. 257-266, 2008.
A. V. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke, "Privacy
preserving mining of association rules", in KDD, pp. 217-228, 2002.
J. Brickell and V. Shmatikov, "Privacy-preserving graph algorithms in
the semi-honest model", in ASIACRYPT, pp. 236-252, 2005.
J. S. Park, M. Chen, and P. S. Yu, "An Effective Hash Based Algorithm
for Mining Association Rules", Proc. 1995 ACM SIGMOD Int'l Conf.
Management of Data, ACM Press, 1995, pp. 175-186.
C. Ray, "Distributed Database Systems", Pearson Education India, 2009.
Frequent Item set Mining Dataset Repository, www.fimi.ua.ac.be/data.
KDD community data repository, www.sigkdd.org
IBM dataset from Almaden Quest research,
www.research.ibm.com/labs/almaden.
Copyright © IJETT, International Journal on Emerging Trends in Technology