
Eighth ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing

Generalized Association Rule Mining Algorithms based on Data Cube

Zhang Hong, Zhang Bo, Kong Ling-Dong, Cai Zheng-Xing


School of Computer Science and Technology, China University of Mining and Technology
Xuzhou, Jiangsu 221008, China
hongzh@cumt.edu.cn

Abstract

This paper defines a multi-dimensional data cube model and presents a new formalization of generalized association rules based on that model. After analyzing the weaknesses of the existing cube-based generalized association rule mining algorithms, we propose a new algorithm, GenHibFreq, which is suited to mining multi-level frequent itemsets over a data cube. By taking advantage of the item taxonomy, GenHibFreq reduces the number of candidate itemsets that have to be counted and so achieves better efficiency. We also design an algorithm, GenerateLHSs-Rule, for generating generalized association rules from the multi-level frequent itemsets. Worked examples show that the algorithms proposed in this paper are more efficient and generate fewer redundant rules than several existing mining algorithms such as Cumulate, Stratify and ML_T2L1, and that they perform well in flexibility, scalability and complexity. The approach offers a new way of carrying out generalized association rule mining in a multi-dimensional environment and has both theoretical and practical value.

1. Introduction

Association rule mining is one of the most intensively studied areas of data mining. It was first introduced by Agrawal, Imielinski and Swami in 1993 [1]. Since then many researchers have worked on applications of association rule mining, the design of mining algorithms, parallel association rule mining and quantitative association rule mining, and have tried to improve the efficiency, adaptability and applicability of the mining algorithms and to popularize them. Some recent work combines association rule mining with data warehouse technology to study generalized association rule mining, for example by applying Apriori-style algorithms [1],[2],[3],[4]. However, little of this work addresses generalized association rule mining over data cubes, and the typical studies describe the mining process informally or by example, without the support of a formal theory.

Aiming at these weaknesses, this paper formalizes generalized association rules and generalized association rule mining based on the data cube, and designs the corresponding data cube model and algorithms.

2. Mining System of Generalized Association Rules Based on a Data Cube

2.1 Association Rule Mining System Structure

The structure of the data cube association rule mining system built on a data warehouse is shown in Fig. 1. It contains four parts: the data warehouse, the working data cube, the OLAP engine and the association rule mining engine.

Figure 1. The structure of the association rule mining system

2.2 Data Cube Model

2.2.1 Data Cube. A data cube is a multi-dimensional description of data; it can have any number n of dimensions and is not confined to three. A data cube is defined by its dimensions and facts.

In a data cube the data is organized multi-dimensionally, and each dimension contains several abstraction layers defined by a concept taxonomy. A concept taxonomy defines a mapping sequence that maps the bottom-layer concepts to more general, higher-layer concepts; it can be defined over the values of a given dimension and induces the dimension levels. A data cube is therefore concept-layered, multi-dimensional, moderately aggregated data, and it provides a high-quality data source for association rule mining. Furthermore, the abundant metadata stored in the cube (e.g. the dimension layers) lets the mining algorithms move through every layer, which is convenient for generalized association rule mining.
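To make the role of the concept taxonomy concrete, the following minimal Python sketch shows how one dimension's taxonomy can map bottom-layer values to more general concepts. The hierarchy for the "buys" dimension is inferred from the product names in Table 1 and the generalized items in Table 2, so it should be read as an illustrative assumption rather than as part of the paper's model.

    # Illustrative concept taxonomy for the "buys" dimension (assumed values).
    TAXONOMY = {
        "IBM Laptop": "Laptop",
        "HP Laptop": "Laptop",
        "IBM Desktop": "Desktop",
        "HP Desktop": "Desktop",
        "Laptop": "Computer",
        "Desktop": "Computer",
        "HP Color Printer": "Color Printer",
        "Canon Color Printer": "Color Printer",
        "Epson b/w Printer": "b/w Printer",
        "Color Printer": "Printer",
        "b/w Printer": "Printer",
    }

    def generalize(value):
        """Return the parent (more general) concept of a value, or None at the top layer."""
        return TAXONOMY.get(value)

    def ancestors(value):
        """All more general concepts of a value, from its parent up to the top layer."""
        result = []
        parent = generalize(value)
        while parent is not None:
            result.append(parent)
            parent = generalize(parent)
        return result

    # ancestors("IBM Laptop") -> ["Laptop", "Computer"]

It is exactly this kind of mapping that later lets the mining algorithms roll an item such as (buys, HP Desktop) up to (buys, Desktop) or (buys, Computer).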

2.2.2 Data Cube Model. At present there are roughly nine types, in three categories, of typical multi-dimensional data models: simple multi-dimensional data models, structural multi-dimensional data models and statistical object models [5],[6],[7],[8]. To meet the demands of multi-level association rule mining, this paper adopts the structural data cube model based on partial orders and mappings [9]; this model can express complex dimension level structures and can represent the level structure of the data cube efficiently. The model is defined as follows.

Definition 1: An n-dimensional data model for generalized association rule mining is a triple R = <D, Mcount, Dstr>, where:
(1) D = {d1, d2, ..., dn} is the dimension set; each di is called a dimension.
(2) Mcount is called the measure attribute.
(3) Dstr = {Σ1, Σ2, ..., Σn} is called the dimension structure set; Σi defines the structure of dimension di. Σi = {si1, si2, ..., sim} is a finite family of sets, in which each set sij (1 ≤ j ≤ m) is a dimension level attribute of dimension di.
(4) The measure attribute Mcount functionally depends on the dimension set D; that is, between D and Mcount there is a function F: DOM(d1) × ... × DOM(dn) → DOM(Mcount), where DOM(Mcount) is the value domain of the measure attribute Mcount.

This model expresses complex dimension level structures well and represents the level structure of the data cube efficiently. This paper discusses generalized association rule mining over the formal data cube based on this model.

2.3 The Formalization of Generalized Association Rules

The formal description of multi-level association rules over the n-dimensional set R = <D, Mcount, Dstr> is as follows.

Definition 2: An item is a pair (d, v); the set of items is I = {(d, v) | d ∈ D, v ∈ DOM(d)}, where D is the dimension set and DOM(d) is the value domain of dimension d.

Definition 3: Let x = (dx, vx) ∈ I and y = (dy, vy) ∈ I. If dx = dy and vx ∈ Child(vy), then y is called the parent item of x and x is a child item of y; the parent item of x is written x̂, so y = x̂.

Definition 4: Let Ẑ ⊆ I and Z ⊆ I. If Z and Ẑ contain the same number of items and Ẑ can be obtained from Z by replacing one or several items of Z with their parent items, then Ẑ is called a parent itemset of Z and Z is a child itemset of Ẑ.

Definition 5: A k-itemset is X = {(di1, vi1), (di2, vi2), ..., (dik, vik)} ⊆ I, in which dip ≠ diq for all 1 ≤ p, q ≤ k with p ≠ q. Pr(X) denotes the number of transactions in the original transaction database that contain all the items of X; it is obtained from the cube as Pr(X) = F(di1 = vi1, di2 = vi2, ..., dik = vik), where F is the functional dependency from the dimension set D of the multi-dimensional set R to the measure attribute Mcount. The support of the itemset X is sup(X) = Pr(X).

Definition 6: A generalized association rule is an implication X ⇒ Y, where X ⊂ I, Y ⊂ I, X ∩ Y = ∅, and for every x ∈ X, x̂ ∉ Y. The support of the rule X ⇒ Y is sup(X ⇒ Y) = sup(X ∪ Y), and its confidence is confidence(X ⇒ Y) = sup(X ∪ Y) / sup(X).

This formal definition of multi-level association rules over an n-dimensional data cube lays the foundation for the further study of algorithms that mine generalized association rules from a data cube.

2.4 Mining of Generalized Association Rules

Mining generalized association rules takes two steps: first, find the frequent itemsets, i.e. the itemsets whose support meets the user-specified minimum support; second, from those frequent itemsets, produce the association rules whose confidence meets the user-specified minimum confidence.
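To make Definitions 5 and 6 concrete, here is a minimal Python sketch. The dictionary-based cube, its cell counts and the helper names are illustrative assumptions of this sketch (chosen to be consistent with Table 1 later in the paper), not the paper's implementation; support is read from the cell addressed by an itemset, and rule support and confidence follow Definition 6.

    from itertools import chain

    # A working-cube cell is addressed by a frozenset of (dimension, value) items and
    # stores the measure Mcount, i.e. the number of matching transactions (assumed cells).
    CUBE = {
        frozenset({("age", "[20,29]"), ("income", "[40k,49k]"), ("buys", "Color Printer")}): 2,
        frozenset({("age", "[20,29]"), ("income", "[40k,49k]")}): 2,
        frozenset({("income", "[40k,49k]")}): 4,
    }

    def support(itemset, cube=CUBE):
        """sup(X) = Pr(X): look up the cell addressed by itemset X (Definition 5)."""
        return cube.get(frozenset(itemset), 0)

    def rule_metrics(lhs, rhs, cube=CUBE):
        """Support and confidence of the rule X => Y as in Definition 6."""
        sup_xy = support(chain(lhs, rhs), cube)   # sup(X ∪ Y)
        sup_x = support(lhs, cube)                # sup(X)
        confidence = sup_xy / sup_x if sup_x else 0.0
        return sup_xy, confidence

    # Example: rule_metrics([("income", "[40k,49k]")],
    #                       [("age", "[20,29]"), ("buys", "Color Printer")])
    # returns (2, 0.5) under the assumed cells above.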

2.4.1 Generalized Frequent Itemsets. In generalized association rule mining, mining the multi-level frequent itemsets is still the difficult part for researchers. To improve the efficiency of frequent itemset mining, this paper summarizes the earlier algorithms into two strategies for producing multi-level frequent itemsets and, on that basis, proposes a new algorithm, GenHibFreq, suited to mining multi-level frequent itemsets from a data cube.

1. Item depth and itemset depth

Two concepts needed later are defined as follows.

Definition 8: For an item x ∈ I, its depth is written depth(x). If the parent item x̂ of x does not exist, then depth(x) = 0; otherwise depth(x) = depth(x̂) + 1.

Definition 9: For an itemset X ⊆ I, its depth is depth(X) = max({depth(x) | x ∈ X}).

2. Mining strategies for frequent itemsets

The strategies for mining multi-level frequent itemsets fall into the following two categories:

(1) The candidate k-itemsets Ck are obtained by joining and pruning the frequent (k-1)-itemsets Lk-1; the transaction database is then traversed to count the support of every candidate k-itemset in Ck, the candidates that do not meet the minimum support are deleted, and the frequent k-itemsets Lk are obtained.

(2) When creating the frequent k-itemsets Lk, the transaction database is first traversed to count the support of the candidate k-itemsets Ck at the highest abstraction level (i.e. depth(Ck) = 0), and the itemsets that do not meet the minimum support are deleted; the candidate k-itemsets of the lower abstraction levels are then counted, going one level deeper at a time. During this process the number of candidate k-itemsets that need counting is reduced by deleting every candidate whose parent itemset does not meet the minimum support.

Comparing the two strategies, it is easy to see that both join and prune all the frequent (k-1)-itemsets Lk-1 to obtain the candidate k-itemsets; the difference lies in how the database is traversed to compute the support of the candidates in Ck.

3. Algorithms for Frequent Itemset Mining

To improve the efficiency of creating frequent itemsets in the data cube and to reduce, as far as possible, the number of candidate itemsets whose support has to be computed, we propose the GenHibFreq algorithm, which follows strategy (2).

1) Basic idea of GenHibFreq

According to the definitions in Section 2.3, when generalized association rule mining is based on a data cube, counting the support of an itemset only requires accessing the relevant cell instead of scanning the whole data cube; combined with the level-by-level pruning of strategy (2), this reduces the number of candidate itemsets that need counting as far as possible.

2) The GenHibFreq algorithm

Notation:
  Ck,h : the candidates c ∈ Ck with depth(c) = h
  Lk,h : the frequent itemsets l ∈ Lk with depth(l) = h
  L    : the set of all frequent itemsets

Description of the GenHibFreq algorithm:

Input: the n-dimensional set R = <D, Mcount, Dstr> and the user-specified minimum support min_sup.
Output: the n-dimensional multi-level frequent itemsets L.

① k = 1; h = 0; L = ∅;
② Create the frequent 1-itemsets L1,0 of depth 0 for every dimension;
③ h = 1;
  While (L1,h-1 ≠ ∅) do {
    For each 1-itemset l ∈ L1,h-1 do
      For each 1-itemset e ∈ {i | i is a child item of l} do {
        if (Pr(e)/totalcount ≥ min_sup) add the 1-itemset e to L1,h;
      }
    h++;
  }
  L1 = ∪h L1,h;
④ For (k = 2; (Lk-1 ≠ ∅ and k ≤ n); k++) {
    h = 0;
    repeat {
      if (h == 0) {
        Ck,0 = gen_candidate_Apriori(Lk-1,0);
        Lk,0 = all candidates in Ck,0 with minimum support;
      }
      else
        Lk,h = gen_frequent_Hib(k, h, L1,h, Lk,h-1);
      h++;
    } until (Lk,h-1 = ∅);
    Lk = ∪h Lk,h;
  }
⑤ Answer = ∪k Lk;

Function gen_frequent_Hib(k, h, L1,h, Lk,h-1)

Input: L1,h, the frequent 1-itemsets of depth h, and Lk,h-1, the frequent k-itemsets of depth h-1.
Output: Lk,h, the frequent k-itemsets of depth h.

① Queue FIFO = ∅; Ck,h = ∅;
② For each k-itemset l ∈ Lk,h-1 do {
    Enqueue the k-itemset l at the end of FIFO;
  }
③ While (FIFO ≠ ∅) {
    Dequeue a k-itemset A = {a1, a2, ..., ak} from the head of FIFO;
    If (depth(A) == h) {
      If (Lk,h already contains the k-itemset A = {a1, a2, ..., ak}) continue;
      Else add the k-itemset A to Lk,h;
    }
    For (j = 1; (depth(aj) ≤ h-1 and j ≤ k); j++) {
      For each item e ∈ {i | i is a child item of aj and {i} ∈ L1,h} do {
        Replace the item aj in A with e;
        if (Pr({a1, ..., aj-1, e, aj+1, ..., ak})/totalcount ≥ min_sup)
          Enqueue the k-itemset {a1, ..., aj-1, e, aj+1, ..., ak} at the end of FIFO;
      }
    }
  }
④ Answer = Lk,h;

This algorithm keeps the number of candidate itemsets whose support has to be computed as small as possible, and thereby improves the efficiency of creating the frequent itemsets from the data cube.
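As a complement to the pseudocode above, here is a minimal Python sketch of the specialization step gen_frequent_Hib. The helper functions named in the docstring (children, depth, support) and the data layout are assumptions of this sketch rather than the authors' implementation, and the per-item depth guard of the pseudocode is folded into the membership test against L1,h.

    from collections import deque

    def gen_frequent_hib(h, L1_h, Lk_h_prev, children, depth, support, total_count, min_sup):
        """Derive the depth-h frequent k-itemsets Lk,h from the depth-(h-1) ones Lk,h-1
        by replacing items with their child items, as in mining strategy (2).

        Assumed helpers: children(item) yields the taxonomy children of an item;
        depth(itemset) is the itemset depth of Definition 9; support(itemset) reads
        Pr(itemset) from the relevant cube cell. L1_h is the set of depth-h frequent
        1-itemsets, each represented as a frozenset holding one (dimension, value) item.
        """
        Lk_h = set()
        fifo = deque(frozenset(A) for A in Lk_h_prev)   # step 2: seed the queue with Lk,h-1
        seen = set(fifo)
        while fifo:                                      # step 3
            A = fifo.popleft()
            if depth(A) == h:
                Lk_h.add(A)                              # record a depth-h frequent k-itemset
            for a in A:                                  # try to specialize every item of A
                for e in children(a):
                    if frozenset({e}) not in L1_h:
                        continue                         # the child must itself be frequent at depth h
                    candidate = (A - {a}) | {e}
                    if candidate in seen:
                        continue
                    if support(candidate) / total_count >= min_sup:
                        fifo.append(candidate)           # frequent candidate: explore it further
                        seen.add(candidate)
        return Lk_h                                      # step 4

GenHibFreq itself would call a function of this shape once per depth level inside the loop of step ④, after the depth-0 frequent itemsets have been produced Apriori-style.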

L k ,h gen _ frequent _ Hib (k , h, L1,h , Lk ,h1 ) ; This algorithms made the number of the candidate
itemset that needs computing reach the least,thereby it
h ; can improve the efficiency of creating frequent itemset
} from the data cube effectively.
Until( Lk ,h1 );
2.4.2 Algorithms of Mining of Association Rule. In
L k = h Lk ,h ; order to decrease the creatintg of the abundant rules
} and fit GenHibFreq Algorithms of multi-level frequent
itemset mining based on data cube, we put forward
⑤Answer= k Lk ;
asscoation rule GenerateLHSs-Rule which was
Function gen _ frequent _ Hib( k , h, L1, h , Lk ,h 1) composed of two parts , one is BorderLHSs, the other
is GenerateRule . At first we use BorderLHSs
Input:L1 ,h depth h frequent 1-itemset,L k ,h 1
Algorithms through reverse searching means to find
depth h 1 frequent (k-1)-itemset the dividing line of LHSs , then we use Generate Rule
Output: L k ,h depth h frequent k-itemset Algorithms to create Association Rule,
GenerateLHSs-Rule can decrease the creating of the
①Queue FIFO ; Ck , h  abcundant rules eddiciently. Descriptions of these two
Algorithms as follows.
②For each k-itemset l Lk ,h 1 do { 1. BorderLHSs
Enqueue k-itemset l to the end of FIFO ; We can use the downward closure property based on
} LHS of the association rule to find the dividing line of
LHSst through reverse searching means of
③ while( FIFO ) {
BorderLHSs(A) under the conditiong of the given
Dequeue k-itemset A {a1 , a2 , , ak } from minimum support value, Description of BorderLHSs
the head of FIFO ; Algorithms is follows:
[Input]: Frenquent Itemset A
If ( depth( A) h )
[Output]:Rule Condition(LHS)Dividing Lines
{ (LHSs)
If ( L k ,h consists of ① FIFO={A}; LHSs ;
k-itemset A {a1 , a2 , , ak } ) continue; ② while(FIFO ) do{
③ Dequeue B from the head of FIFO;
Else Add k-itemset A to Lk ,h ;
④ onBorder=TRUE;
}
For ⑤ For each ( B -1)-subset C of B do {
( j 1 ;( depth(a j ) h 1 and j k ); j ) { if( P (C ) P( A) min_ conf ) then {
onBorder=FALSE;
For each item e {i i is the filial generation
⑥ if (C is not in FIFO) then
item of a j , also{i} L1 ,h } do { Enqueue C to the end of FIFO;
}
Replace item a j in A with e ; }
if ⑦ if (onBorder==TRUE) then add B to
( LHSs;
 
Pr {a1 , a2 ,, a j 1 , e, a j1 ,, ak } totalcount min_ sup }
⑧ Answer= LHSs;
) BorderLHSs(A) will decrease the complexity
Enqueue k-itemset ecomously ,because once the item set of LHSs was
{ a1 , a2 , , a j 1 , e, a j 1 , , ak } to the found, the searching algorithms will stop searching
other subset. Even in the worst condition, the
end of FIFO ;
}
complexity of this Algorithms is O 2 A . 
} 2. GenerateRule Algorithms
} GenerateRule Algorithms was gotten through
deleting one frequent itemset LHSs and making it not
④ Answer= L k, h ;
cross with any superset or subset.
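The following minimal Python sketch mirrors the BorderLHSs pseudocode. The support helper (a cube lookup, as in the earlier sketches) is an assumption, and min_conf is the user-specified minimum confidence; the confidence of a candidate rule C ⇒ A−C is computed as support(A)/support(C).

    from collections import deque
    from itertools import combinations

    def border_lhss(A, support, min_conf):
        """Find the border (the minimal rule antecedents) of the frequent itemset A.

        A is a frozenset of (dimension, value) items; support(itemset) is assumed to
        read Pr(itemset) from the working data cube.
        """
        A = frozenset(A)
        fifo = deque([A])
        queued = {A}
        lhss = []
        while fifo:
            B = fifo.popleft()
            on_border = True
            for C in (frozenset(c) for c in combinations(B, len(B) - 1)):
                # Downward closure on the LHS: if C already yields a confident rule
                # C => A - C, then B is not minimal and C is explored instead.
                if support(C) and support(A) / support(C) >= min_conf:
                    on_border = False
                    if C not in queued:
                        fifo.append(C)
                        queued.add(C)
            if on_border:
                lhss.append(B)
        return lhss

Subsets that do not yield a confident rule are never enqueued, so their own subsets are never examined; this is where the saving over a full O(2^|A|) enumeration of antecedents comes from.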

2. GenerateRule

GenerateRule is obtained by pruning the LHS border of a frequent itemset so that it does not overlap with the borders of its related supersets and subsets. Given m frequent itemsets A1, A2, ..., Am, each of which is an (|A|+1)-superset or a child itemset of A, if B ∈ (BorderLHSs(A) − ∪(i=1..m) BorderLHSs(Ai)), then, relative to any other rule, the rule B ⇒ (A−B) is non-redundant.

The description of the GenerateRule algorithm is as follows:

[Input]: all frequent itemsets L
[Output]: the non-redundant association rules AR

① For each A ∈ L do {
②   LHS(A) = BorderLHSs(A);
③   For each C ∈ L such that C is an (|A|+1)-superset or a child itemset of A do {
      LHS(A) = LHS(A) − BorderLHSs(C);
    }
④   For each B ∈ LHS(A) do {
      add the rule "B ⇒ (A−B)" to AR;
    }
  }
⑤ Answer = AR;

This algorithm produces the smallest, non-redundant set of association rules, so the number of rules presented to the user is greatly reduced. If the processing time of every frequent itemset in L is assumed to be the same, the computational complexity of the algorithm grows linearly with the size of L.
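To round out the description, here is a companion sketch of GenerateRule that reuses border_lhss from the previous sketch. For simplicity it prunes against every superset and subset of A that appears in L, a loosening of the paper's "(|A|+1)-superset or child itemset" condition; this simplification, the helper names and the recomputation of borders are all assumptions of the sketch.

    def generate_rules(L, support, min_conf):
        """Emit non-redundant rules B => (A - B) for every frequent itemset A in L.

        L is an iterable of frequent itemsets (frozensets of (dimension, value) items);
        support and min_conf are as in the BorderLHSs sketch above.
        """
        L = [frozenset(A) for A in L]
        rules = []
        for A in L:
            lhs_candidates = set(border_lhss(A, support, min_conf))
            # Drop antecedents that already appear on the border of a related
            # itemset, so that the same rule condition is not emitted twice.
            for C in L:
                if C != A and (C < A or C > A):
                    lhs_candidates -= set(border_lhss(C, support, min_conf))
            for B in lhs_candidates:
                if B and B != A:                 # skip empty and trivial antecedents
                    rules.append((B, A - B))     # the rule B => (A - B)
        return rules

Recomputing border_lhss for every related itemset keeps the sketch short; an implementation closer to the paper would cache the borders produced in the main loop before subtracting them.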

5. Example Verification and Analysis

The sales transaction database is shown in Table 1. The sales relation has four attributes: the transaction identifier tid, the customer's age, income and buys; age and income are quantitative attributes and buys is a categorical attribute.

Table 1. Sales transaction database

  tid   age   income   buys
  100   25    45k      {IBM Laptop, HP Color Printer}
  200   28    40k      {HP Desktop, Canon Color Printer}
  300   44    45k      {IBM Desktop, HP Desktop}
  400   21    20k      {HP Desktop, Epson b/w Printer}
  500   36    40k      {IBM Laptop}
  600   32    30k      {HP Laptop, Epson b/w Printer}

5.1 Sales Database and Working Data Cube

A working data cube relevant to the mining task is designed according to the data cube model; suppose it contains three dimensions: age, income and buys. Consider the 3-itemset X = {(age, [20,29]), (income, [40k,49k]), (buys, Color Printer)}. According to Definition 5, Pr(X) = F(age = [20,29], income = [40k,49k], buys = Color Printer) = 2; that is, there are 2 transactions in the original transaction database that contain all the items of the itemset X.

5.2 Creating Association Rules

With the minimum confidence min_conf set to 60%, the association rules created according to the GenerateLHSs-Rule algorithm are shown in Table 2.

Table 2. Association rules created by the GenerateRule(L) algorithm

  Multi-level association rule                                         Support   Confidence
  (buys, Printer) ⇒ (age, [20,29])                                        3         75%
  (income, [40k,49k]) ⇒ (buys, Computer)                                  4         100%
  (buys, Computer) ⇒ (income, [40k,49k])                                  4         66.7%
  (buys, Desktop) ⇒ (age, [20,29])                                        2         66.7%
  (age, [30,39]) ⇒ (buys, Laptop)                                         2         100%
  (buys, Laptop) ⇒ (age, [30,39])                                         2         66.7%
  (buys, Desktop) ⇒ (income, [40k,49k])                                   2         66.7%
  (buys, Laptop) ⇒ (income, [40k,49k])                                    2         66.7%
  (age, [20,29]) ⇒ (buys, HP Desktop)                                     2         66.7%
  (buys, HP Desktop) ⇒ (age, [20,29])                                     2         66.7%
  (buys, HP Desktop) ⇒ (income, [40k,49k])                                2         66.7%
  (buys, IBM Desktop) ⇒ (income, [40k,49k])                               2         100%
  (age, [20,29]) ⇒ (income, [40k,49k]) ∧ (buys, Computer)                 2         66.7%
  (income, [40k,49k]) ∧ (buys, Printer) ⇒ (age, [20,29])                  2         100%
  (age, [20,29]) ⇒ (income, [40k,49k]) ∧ (buys, Color Printer)            2         66.7%
  (buys, Color Printer) ⇒ (income, [40k,49k]) ∧ (age, [20,29])            2         100%

5.3 Outcome Analysis

(1) The general algorithms create 31 association rules, while the algorithms in this paper create only 16; our algorithms therefore reduce the redundant rules effectively.

(2) The Cumulate, Stratify and ML_T2L1 algorithms need larger storage space and also have distinct limitations. When counting the support of an itemset, the algorithms in this paper only need to access the relevant cell instead of scanning the whole data cube; this decreases the number of candidate itemsets that need counting and improves the efficiency of creating the frequent itemsets.

(3) The BorderLHSs(A) algorithm guarantees that every subset of A is visited at most once; once an itemset on the condition border LHSs is found, the search stops descending into its subsets, so the complexity stays below O(2^|A|) and the overall complexity of the algorithm is greatly reduced.

6. Conclusion

This paper described a multi-dimensional data cube model, put forward a formal definition of generalized association rules, and summarized two categories of mining strategies for creating multi-level frequent itemsets. The GenHibFreq algorithm, which is suited to the data cube, makes full use of the abstraction levels among itemsets to decrease the number of candidate itemsets that need counting and so improves mining efficiency. We also put forward the algorithm for mining generalized association rules based on the data cube, GenerateLHSs-Rule, which reduces the generation of redundant rules effectively. The experiments show that the algorithms in this paper are superior to Cumulate, Stratify and ML_T2L1 in algorithmic efficiency and in the creation of non-redundant rules; at the same time, the algorithms perform well in flexibility, scalability and complexity.

This work was supported by the Natural Science Foundation of Jiangsu Province (No. BK2005021).

References

1. R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases", in Proc. of the ACM SIGMOD Conference on Management of Data, Washington D.C.: ACM Press, 1993, pp. 207-216.
2. G. Piatetsky-Shapiro, W. J. Frawley, Knowledge Discovery in Databases, Menlo Park, California: AAAI/MIT Press, 1991.
3. M. Chen, J. Han, and P. S. Yu, "Data mining: an overview from a database perspective", IEEE Trans. on Knowledge and Data Engineering, 1996, 8(6), pp. 866-883.
4. J. Han, M. Kamber, Data Mining: Concepts and Techniques, Beijing: China Machine Press, 2001.
5. R. Agrawal, A. Gupta, S. Sarawagi, "Modeling multi-dimensional databases", in Proc. of the 13th International Conference on Data Engineering, Los Alamitos, CA: IEEE Computer Society Press, 1997, pp. 105-116.
6. C. Li, X. S. Wang, "A data model for supporting on-line analytical processing", in Proc. of the 5th International Conference on Information and Knowledge Management, New York: Springer-Verlag, 1996, pp. 81-88.
7. W. Lehner, "Modeling large scale OLAP scenarios", in Proc. of the 6th International Conference on Extending Database Technology (EDBT'98), Valencia, Spain: Springer-Verlag, 1998, pp. 23-27.
8. T. B. Pedersen, C. S. Jensen, "Multi-dimensional data modeling for complex data", in Proc. of the 15th International Conference on Data Engineering, Los Alamitos, CA: IEEE Computer Society Press, 1999, pp. 336-345.
9. Li Jianzhong, Gao Hong, "Multidimensional data modeling for data warehouses", Journal of Software, 2000, 11(7), pp. 908-917.

