Reducing Data Skew with Round Robin Horizontal Partitioning of Data for Distributed Association Rule Mining of Large Data Set

Dipak V. Patil

Abstract


Data volumes have grown rapidly because computers are now used in
every field. Such data is of little use for business decision making
unless it is mined to extract interesting knowledge from it.
Various data mining techniques are used to analyze such data and
extract useful knowledge from it. Association rule mining is one of
them; it aims at finding associations or relations among data items.
As the size of the data increases, knowledge discovery with
conventional data mining techniques becomes slow, because the mining
has to be done serially over a very large number of records. The
solution is to speed up the learning process with parallel or
distributed techniques. Distributed mining algorithms can reveal
interesting relations and patterns between the variables of a large
database. With a parallel or distributed approach, the time
complexity of a data mining algorithm can improve from O(N) toward
the lower bound O(N/k), where N is the number of data instances and
k is the number of nodes in the distributed system [1].
Partitioning and distributing data across the nodes of a distributed
system may lead to data skew, which in turn causes problems in
computing support and confidence. This paper addresses distributed
association rule mining on large data sets and the merging of the
resulting rules into a single rule set. The proposed system
horizontally distributes a large data set using the round robin
method and performs association rule mining with the Apriori
algorithm, with global support count at least s and confidence at
least c. Duplicate rules create redundancy in the rule set, so they
are identified and removed before the final merger of the rules at
the central server. Data security in distributed mining has been
addressed by many researchers, so it is not covered here. The
speed-up achieved with the proposed method is significant, and the
method makes good use of the available computing resources.
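
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names, the tuple representation of rules, and the restriction to 2-itemsets (a single Apriori counting pass rather than the full level-wise algorithm) are all assumptions made for illustration. It shows round robin horizontal partitioning, summing local counts into a global support count compared against s, and removing duplicate rules at merge time.

```python
from collections import Counter
from itertools import combinations


def round_robin_partition(transactions, k):
    """Horizontal partitioning: transaction i goes to node i % k."""
    partitions = [[] for _ in range(k)]
    for i, transaction in enumerate(transactions):
        partitions[i % k].append(transaction)
    return partitions


def local_pair_counts(partition):
    """Count every 2-itemset in one node's partition (one counting pass)."""
    counts = Counter()
    for transaction in partition:
        for pair in combinations(sorted(set(transaction)), 2):
            counts[pair] += 1
    return counts


def global_frequent_pairs(partitions, min_support_count):
    """Sum local counts and keep pairs with global count of at least s."""
    total = Counter()
    for partition in partitions:
        total.update(local_pair_counts(partition))
    return {pair: c for pair, c in total.items() if c >= min_support_count}


def merge_rules(rule_sets):
    """Merge per-node rule sets; identical rules collapse into one entry."""
    merged = set()
    for rules in rule_sets:
        merged.update(rules)
    return merged


# Hypothetical toy data, not taken from the paper.
transactions = [["a", "b"], ["a", "c"], ["a", "b", "c"],
                ["b", "c"], ["a", "b"], ["c"]]
parts = round_robin_partition(transactions, 3)
frequent = global_frequent_pairs(parts, min_support_count=3)
```

Because round robin assignment gives every node nearly the same number of records, partition sizes differ by at most one, which is the sense in which it limits skew in partition size. Thresholding the summed counts instead of each local count means an itemset is not lost merely because it is rare on one node.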

References


G. Alex and A. Freitas, “Scalable, high-performance data mining with parallel processing”, in Principles and Practice of Knowledge Discovery in Databases, Nantes, France, 1998.

J. Han and M. Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann Publishers, third edition, 2014.

R. Agrawal and J. C. Shafer, “Parallel Mining of Association Rules”, IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 6, pp. 962-969.

D. W. Cheung, J. Han, V. T. Ng, and A. W. Fu, “A fast distributed algorithm for mining association rules”, in Proceedings of IEEE 4th International Conference on Parallel and Distributed Information Systems, pp. 31-42, December 1996.

T. Tassa, “Secure Mining of Association Rules in Horizontally Distributed Database”, IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 4, April 2014.

T. Tassa and E. Gudes, “Secure distributed computation of anonymized views of shared database”, Transactions on Database Systems, 2012.

M. Kantarcioglu and C. Clifton, “Privacy-preserving distributed mining of association rules on horizontally partitioned data”, IEEE Transactions on Knowledge and Data Engineering, 16:1026-1037, 2004.

R. Agrawal and R. Srikant, “Fast algorithms for mining association rules in large database”, in VLDB, pp. 487-499, 1994.

A. Ben-David, N. Nisan, and B. Pinkas, “FairplayMP - A System for Secure Multi-Party Computation”, Proc. 15th ACM Conf. Computer and Comm. Security (CCS), pp. 257-266, 2008.

A. V. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke, “Privacy preserving mining of association rules”, in KDD, pp. 217-228, 2002.

J. Brickell and V. Shmatikov, “Privacy-preserving graph algorithms in the semi-honest model”, in ASIACRYPT, pp. 236-252, 2005.

J. S. Park, M. Chen, and P. S. Yu, “An Effective Hash Based Algorithm for Mining Association Rules”, Proc. 1995 ACM SIGMOD Int'l Conf. Management of Data, ACM Press, 1995, pp. 175-186.

C. Ray, Distributed Database Systems, Pearson Education India, 2009.

Frequent Itemset Mining Dataset Repository, www.fimi.ua.ac.be/data.

KDD community data repository, www.sigkdd.org

IBM dataset from Almaden Quest research, www.research.ibm.com/labs/almaden.



Copyright © IJETT, International Journal on Emerging Trends in Technology