
2/28/2014

Association rules and frequent itemsets (associate) Orange Documentation Orange


Association rules and frequent itemsets (associate)

Orange provides two algorithms for induction of association rules: a standard Apriori algorithm [AgrawalSrikant1994] for sparse (basket) data analysis and a variant of Apriori for attribute-value data sets. Both algorithms also support mining of frequent itemsets.

For example, consider a simple market basket data set:

    Bread, Milk
    Bread, Diapers, Beer, Eggs
    Milk, Diapers, Beer, Cola
    Bread, Milk, Diapers, Beer
    Bread, Milk, Diapers, Cola

The following script induces association rules with items that appear in at least 30% of data instances (transactions):

    import Orange
    data = Orange.data.Table("market-basket.basket")
    rules = Orange.associate.AssociationRulesSparseInducer(data, support=0.3)
    print "%4s %4s  %s" % ("Supp", "Conf", "Rule")
    for r in rules[:5]:
        print "%4.1f %4.1f  %s" % (r.support, r.confidence, r)

The code reports the support and confidence of the first five rules found:

    Supp Conf  Rule
     0.4  1.0  Cola -> Diapers
     0.4  0.5  Diapers -> Cola
     0.4  1.0  Cola -> Diapers Milk
     0.4  1.0  Cola Diapers -> Milk
     0.4  1.0  Cola Milk -> Diapers
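The support and confidence values reported for the market basket example can be checked by hand. The following is a minimal sketch in plain Python 3 that does not use Orange; the helper names count, support and confidence are ours and are not part of the Orange API.

```python
# Market basket transactions from the example, one set per transaction.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def count(items):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if set(items) <= t)


def support(left, right):
    """Fraction of all transactions matching both sides of the rule."""
    return count(set(left) | set(right)) / len(transactions)


def confidence(left, right):
    """Fraction of transactions matching the left side that also match the right."""
    return count(set(left) | set(right)) / count(left)


print(support({"Cola"}, {"Diapers"}), confidence({"Cola"}, {"Diapers"}))  # 0.4 1.0
print(support({"Diapers"}, {"Cola"}), confidence({"Diapers"}, {"Cola"}))  # 0.4 0.5
```

Both rules share the same support because support depends only on the union of the two sides; the confidences differ because Diapers appears in four transactions and Cola in only two.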

In Apriori, association rule induction is a two-stage algorithm: it first finds itemsets that appear frequently in the data and have sufficient support, and then splits them into rules of sufficient confidence. The method get_itemsets reports on itemsets alone and skips rule induction:

    import Orange
    data = Orange.data.Table("market-basket.basket")
    ind = Orange.associate.AssociationRulesSparseInducer(support=0.4, store_examples=True)
    itemsets = ind.get_itemsets(data)
    for itemset, tids in itemsets[:5]:
        print "(%4.2f) %s" % (len(tids)/float(len(data)),
                              " ".join(data.domain[item].name for item in itemset))

The above script lists frequent itemsets and their support:

    (0.40) Cola
    (0.40) Cola Diapers
    (0.40) Cola Diapers Milk
    (0.40) Cola Milk
    (0.60) Beer
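The two stages can be illustrated with a small level-wise search in plain Python 3. This is only a sketch of the general Apriori idea under simplifying assumptions, not Orange's implementation; the function names frequent_itemsets and rules_from_itemsets are hypothetical, and the order of the results differs from what Orange reports.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def frequent_itemsets(transactions, min_support):
    """Stage 1: map each frequent itemset (a frozenset) to its support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    result = {}
    while current:
        # Keep the candidates that reach the minimal support.
        level = {}
        for candidate in current:
            supp = sum(1 for t in transactions if candidate <= t) / n
            if supp >= min_support:
                level[candidate] = supp
        result.update(level)
        # Join frequent k-itemsets into (k+1)-item candidates and prune
        # those that have an infrequent k-item subset.
        keys = list(level)
        k = len(keys[0]) + 1 if keys else 0
        joined = {a | b for a, b in combinations(keys, 2) if len(a | b) == k}
        current = [c for c in joined
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return result


def rules_from_itemsets(supports, min_confidence):
    """Stage 2: split each frequent itemset into rules of sufficient confidence."""
    rules = []
    for itemset, supp in supports.items():
        for size in range(1, len(itemset)):
            for left in combinations(sorted(itemset), size):
                left = frozenset(left)
                # Every subset of a frequent itemset is frequent, so it is in `supports`.
                conf = supp / supports[left]
                if conf >= min_confidence:
                    rules.append((left, itemset - left, supp, conf))
    return rules


supports = frequent_itemsets(transactions, 0.4)
rules = rules_from_itemsets(supports, 0.9)
for left, right, supp, conf in rules:
    print("%4.1f %4.1f  %s -> %s" % (supp, conf, " ".join(sorted(left)), " ".join(sorted(right))))
```

Pruning candidates with an infrequent subset is what keeps the search tractable: an itemset cannot be frequent unless all of its subsets are.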

Association rules induction algorithms

AssociationRulesSparseInducer induces frequent itemsets and association rules from sparse data sets. These can be provided either in the basket format (see Loading and saving data) or in an attribute-value format where any entry in the data table is considered as presence of a feature in the transaction (an item), and any unknown (empty) entry signifies its absence.

AssociationRulesInducer works with feature-value data, where an item is a combination of a feature and its value (e.g., astigmatic=yes).

Sparse (basket) data sets

class Orange.associate.AssociationRulesSparseInducer

support
Minimal support for the rule. Depending on the data set, it should be set to a sufficiently high value to avoid running out of working memory (default: 0.3).

confidence
Minimal confidence for the rule.

store_examples
Store the examples covered by each rule and those confirming it.

max_item_sets
The maximal number of itemsets induced. Orange stops inferring frequent itemsets once this number is reached.

__call__(data, weight_id)
Induce rules from the provided data set.

get_itemsets(data)
For a given data set, return a list of frequent itemsets. List elements are pairs, where the first element contains the indices of features in the itemset (negative for sparse data) and the second element is a list of indices of the instances supporting the itemset. If store_examples is False, the second element is None.

To test this rule inducer, we will first create a sparse data set consisting of lists of words in sentences from a brief description of the Spanish Inquisition, given by Palin et al.:

    NOBODY expects the Spanish Inquisition! Our chief weapon is surprise...surprise and fear...fear and surprise.... Our two weapons are fear and surprise...and ruthless efficiency.... Our three weapons are fear, surprise, and ruthless efficiency...and an almost fanatical devotion to the Pope.... Our four...no... Amongst our weapons.... Amongst our weaponry...are such elements as fear, surprise.... I'll come in again.

    NOBODY expects the Spanish Inquisition! Amongst our weaponry are such diverse elements as: fear, surprise, ruthless efficiency, an almost fanatical devotion to the Pope, and nice red uniforms - Oh damn!

After some cleaning (e.g., removal of stopwords and punctuation marks), our data set looks like this (inquisition.basket):

    nobody, expects, the, Spanish, Inquisition
    our, chief, weapon, is, surprise, surprise, and, fear, fear, and, surprise
    our, two, weapons, are, fear, and, surprise, and, ruthless, efficiency
    our, three, weapons, are, fear, surprise, and, ruthless, efficiency, and, an, almost, fanatical, devotion, to, the, Pope
    our, four, no
    amongst, our, weapons
    amongst, our, weaponry, are, such, elements, as, fear, surprise
    I'll, come, in, again
    nobody, expects, the, Spanish, Inquisition
    amongst, our, weaponry, are, such, diverse, elements, as, fear, surprise, ruthless, efficiency, an, almost, fanatical, de
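To see how such a file maps to transactions, the sketch below reads basket-style text in plain Python 3. It assumes the simplest reading of the format as shown in the listing (one transaction per line, items separated by commas); parse_basket is a hypothetical helper, not Orange's loader, which may handle more than this.

```python
def parse_basket(text):
    """Split basket-style text into one set of items per non-empty line."""
    transactions = []
    for line in text.splitlines():
        # Strip whitespace around items and skip empty fragments.
        items = {item.strip() for item in line.split(",") if item.strip()}
        if items:
            transactions.append(items)
    return transactions


sample = """our, four, no
amongst, our, weapons
I'll, come, in, again
"""

baskets = parse_basket(sample)
print(len(baskets))        # 3
print(sorted(baskets[0]))  # ['four', 'no', 'our']
```

Using sets means that a word repeated within a line (such as surprise in the second transaction of the listing) is kept once, which is sufficient when only the presence or absence of an item matters.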

The following script induces the association rules:

    import Orange
    data = Orange.data.Table("inquisition.basket")
    rules = Orange.associate.AssociationRulesSparseInducer(data, support=0.5)
    print "%5s   %5s" % ("supp", "conf")
    for r in rules:
        print "%5.3f   %5.3f   %s" % (r.support, r.confidence, r)

The induced rules are surprisingly fear-full:

    0.500   1.000   fear -> surprise
    0.500   1.000   surprise -> fear
    0.500   1.000   fear -> surprise our
    0.500   1.000   fear surprise -> our
    0.500   1.000   fear our -> surprise
    0.500   1.000   surprise -> fear our
    0.500   1.000   surprise our -> fear
    0.500   0.714   our -> fear surprise
    0.500   1.000   fear -> our
    0.500   0.714   our -> fear
    0.500   1.000   surprise -> our
    0.500   0.714   our -> surprise

To get only a list of supported itemsets, call the method get_itemsets:

    inducer = Orange.associate.AssociationRulesSparseInducer(support=0.5, store_examples=True)
    itemsets = inducer.get_itemsets(data)

Now itemsets is a list of itemsets along with the examples supporting them, since we set store_examples to True.

    >>> itemsets[5]
    ((-11, -7), [1, 2, 3, 6, 9])
    >>> [data.domain[i].name for i in itemsets[5][0]]
    ['surprise', 'our']

The sixth itemset contains features with indices -11 and -7, that is, the words surprise and our. The examples supporting it are those with indices 1, 2, 3, 6 and 9.

This way of representing the itemsets is memory efficient and faster than using objects like Descriptor and Instance.

Feature-value (non-sparse) data sets


AssociationRulesInducer works with non-sparse data.

class Orange.associate.AssociationRulesInducer
Association rule induction from non-sparse data sets. An item is a feature-value combination. Unknown values in the data table are ignored. The algorithm can also be used to search only for classification rules, where the feature on the right-hand side is the class variable.

support
Minimal support of the induced rule (default: 0.3).

confidence
Minimal confidence of the induced rule.

classification_rules
If True, classification rules are constructed instead of general association rules (default: False).

store_examples
Store the examples covered by each rule and those confirming it.

max_item_sets
The maximal number of itemsets induced. After reaching this limit, the inference algorithm will stop.

__call__(data, weight_id)
Induce rules from the given data set.

get_itemsets(data)
For a given data set, return a list of frequent itemsets. The list consists of pairs, where the first element contains the indices of features in the itemset (negative for sparse data), and the second element is a list of indices of the instances supporting the itemset. If store_examples is False, the second element is None.

The following example script uses AssociationRulesInducer:

    import Orange
    data = Orange.data.Table("lenses")
    print "Association rules:"
    rules = Orange.associate.AssociationRulesInducer(data, support=0.3)
    for r in rules:
        print "%5.3f  %5.3f  %s" % (r.support, r.confidence, r)

The script reports the following rules (the first column is support, the second confidence):

    0.333  0.533  lenses=none -> prescription=hypermetrope
    0.333  0.667  prescription=hypermetrope -> lenses=none
    0.333  0.533  lenses=none -> astigmatic=yes
    0.333  0.667  astigmatic=yes -> lenses=none
    0.500  0.800  lenses=none -> tear_rate=reduced
    0.500  1.000  tear_rate=reduced -> lenses=none
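As described for AssociationRulesInducer, classification rules are those association rules whose right-hand side is the class variable. The sketch below illustrates that relationship in plain Python 3 by filtering the six rules reported for the lenses data. It is only an illustration of the definition, not how Orange searches for such rules; the tuple layout and the helper is_classification_rule are hypothetical, and lenses is assumed to be the class variable.

```python
# The six reported rules as (support, confidence, left, right) tuples.
rules = [
    (0.333, 0.533, "lenses=none", "prescription=hypermetrope"),
    (0.333, 0.667, "prescription=hypermetrope", "lenses=none"),
    (0.333, 0.533, "lenses=none", "astigmatic=yes"),
    (0.333, 0.667, "astigmatic=yes", "lenses=none"),
    (0.500, 0.800, "lenses=none", "tear_rate=reduced"),
    (0.500, 1.000, "tear_rate=reduced", "lenses=none"),
]

class_feature = "lenses"  # assumption: the class variable of the lenses data


def is_classification_rule(rule):
    """True if the rule's right-hand side is an item of the class variable."""
    right = rule[3]
    return right.split("=")[0] == class_feature


classification_rules = [r for r in rules if is_classification_rule(r)]
for supp, conf, left, right in classification_rules:
    print("%5.3f  %5.3f  %s -> %s" % (supp, conf, left, right))
```

Three of the six rules remain, each with lenses=none on the right-hand side. The sketch handles only single-item right-hand sides, which is all that occurs in this example.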

To infer classification rules, we can use a similar script but set classification_rules to True:

    print "Classification rules"
    rules = Orange.associate.AssociationRulesInducer(data, support=0.3, classification_rules=True)

These rules are a subset of association rules whose consequent includes only the class variable:

    0.333  0.667  prescription=hypermetrope -> lenses=none
    0.333  0.667  astigmatic=yes -> lenses=none
    0.500  1.000  tear_rate=reduced -> lenses=none

Frequent itemsets are induced in a similar fashion as for sparse data, except that the first element of the tuple, the itemset, is represented not by indices of features, as before, but by tuples (feature-index, value-index):

    inducer = Orange.associate.AssociationRulesInducer(support=0.3, store_examples=True)
    itemsets = inducer.get_itemsets(data)
    print itemsets[8]

The script prints out:

    (((2, 1), (4, 0)), [2, 6, 10, 14, 15, 18, 22, 23])

reporting that the ninth itemset contains the second value of the third feature (2, 1), and the first value of the fifth (4, 0).

Representation of rules

Methods for induction of association rules return the induced rules in AssociationRules, which is basically a list of AssociationRule instances.

class Orange.associate.AssociationRule


__init__(left, right, n_applies_left, n_applies_right, n_applies_both, n_examples)
Construct an association rule and compute evaluation scores (see below) based on the counts given in the arguments of the call.

__init__(left, right, support, confidence)
Construct an association rule and set its support and confidence. When constructing a rule this way, set the other attributes manually, as the constructor cannot compute them from support and confidence alone.

__init__(rule)
Given an association rule as the argument, construct a copy of the rule.

left, right
The left and the right side of the rule. Both are given as Orange.data.Instance. In rules created by AssociationRulesSparseInducer from data instances that contain all values as meta-values, left and right are data instances in the same form. Otherwise, values in left that do not appear in the rule are "don't care", and values in right are "don't know". Both can, however, be tested by is_special().

n_left, n_right
The number of items on the left and on the right side of the rule.

n_applies_left, n_applies_right, n_applies_both
The number of data instances matching the left, right and both sides of the rule, respectively.

n_examples
The total number of training instances.

support
n_applies_both / n_examples.

confidence
n_applies_both / n_applies_left.

coverage
n_applies_left / n_examples.

strength
n_applies_right / n_applies_left.

lift
n_examples * n_applies_both / (n_applies_left * n_applies_right).

leverage
n_applies_both * n_examples - n_applies_left * n_applies_right.

examples, match_left, match_both
If store_examples was True during induction, examples contains a copy of the data table used to induce the rules. Attributes match_left and match_both are lists of indices of data instances that match the left side and both sides of the rule, respectively.

applies_left(data_instance)
applies_right(data_instance)
applies_both(data_instance)
Test whether the data instance is matched by the left, right or both sides of the rule, respectively. The data instance must be in the same representation as the data from which the rule was inferred.

By default, association rule inducers do not store information on which training data instances support a rule. Let us write a script that finds the data instances that match the rule (fit both sides of it) and those that contradict it (fit the left-hand side but not the right):

    import Orange
    data = Orange.data.Table("lenses")
    rules = Orange.associate.AssociationRulesInducer(data, support=0.3)
    rule = rules[0]

    print "Rule: ", rule, "\n"

    print "Supporting data instances:"
    for d in data:
        if rule.applies_both(d):
            print d
    print

    print "Contradicting data instances:"
    for d in data:
        if rule.applies_left(d) and not rule.applies_right(d):
            print d
    print

The latter printouts get simpler and faster if we instruct the inducer to store the examples:

    print "Match left: "
    print "\n".join(str(rule.examples[i]) for i in rule.match_left)
    print "\nMatch both: "
    print "\n".join(str(rule.examples[i]) for i in rule.match_both)

The contradicting examples are those whose indices are found in match_left but not in match_both. The more memory-friendly and faster way to compute this is:

    >>> [x for x in rule.match_left if not x in rule.match_both]
    [0, 2, 8, 10, 16, 17, 18]
    >>> set(rule.match_left) - set(rule.match_both)
    set([0, 2, 8, 10, 16, 17, 18])
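The evaluation scores listed for AssociationRule all follow from the four counts n_applies_left, n_applies_right, n_applies_both and n_examples. The sketch below computes them in plain Python 3 with the formulas exactly as given in this section; rule_measures is a hypothetical helper, not part of Orange. The counts used are an assumption: they are the ones consistent with the reported support (0.500) and confidences (0.800 and 1.000) of the rule lenses=none -> tear_rate=reduced if the lenses data set has 24 instances.

```python
def rule_measures(n_applies_left, n_applies_right, n_applies_both, n_examples):
    """Compute the rule scores from counts, following the formulas in the text."""
    return {
        "support": n_applies_both / n_examples,
        "confidence": n_applies_both / n_applies_left,
        "coverage": n_applies_left / n_examples,
        "strength": n_applies_right / n_applies_left,
        "lift": n_examples * n_applies_both / (n_applies_left * n_applies_right),
        "leverage": n_applies_both * n_examples - n_applies_left * n_applies_right,
    }


# Assumed counts for lenses=none -> tear_rate=reduced (24 instances in total).
m = rule_measures(n_applies_left=15, n_applies_right=12,
                  n_applies_both=12, n_examples=24)
for name in ("support", "confidence", "coverage", "strength", "lift", "leverage"):
    print("%-10s %s" % (name, m[name]))
```

A lift above 1 (here 1.6) indicates that the two sides occur together more often than would be expected if they were independent.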

Utilities

Orange.associate.print_rules(rules, ms=[])
Print the association rules.
Parameters:
    rules (AssociationRules): Association rules.
    ms (list): Attributes of the rule to be printed with the rule (default: []). To report on confidence and support, use ms=["support", "confidence"].

Orange.associate.sort(rules, ms=['support'])
Sort the rules according to the given criteria.
Parameters:
    rules (AssociationRules): Association rules.
    ms (list): Sort keys (a list of association rule attributes; default: ['support']).
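For readers without an Orange installation, sorting and printing by a list of attribute names can be mimicked in plain Python 3 as follows. This is a sketch, not the implementation of Orange.associate.sort or print_rules: the Rule tuple and both helper functions are hypothetical stand-ins, and ordering rules from the highest to the lowest value is our assumption.

```python
from collections import namedtuple

# Hypothetical stand-in for AssociationRule, holding only what the sketch needs.
Rule = namedtuple("Rule", ["support", "confidence", "text"])

rules = [
    Rule(0.333, 0.533, "lenses=none -> prescription=hypermetrope"),
    Rule(0.333, 0.667, "prescription=hypermetrope -> lenses=none"),
    Rule(0.500, 0.800, "lenses=none -> tear_rate=reduced"),
    Rule(0.500, 1.000, "tear_rate=reduced -> lenses=none"),
]


def sort_rules_sketch(rules, ms=("support",)):
    """Order rules by the attributes named in ms, highest values first (assumed)."""
    return sorted(rules, key=lambda r: tuple(getattr(r, m) for m in ms), reverse=True)


def print_rules_sketch(rules, ms=()):
    """Print each rule, preceded by the values of the attributes named in ms."""
    for r in rules:
        scores = "  ".join("%5.3f" % getattr(r, m) for m in ms)
        print((scores + "  " if scores else "") + r.text)


ordered = sort_rules_sketch(rules, ["support", "confidence"])
print_rules_sketch(ordered, ["support", "confidence"])
```

Building the sort key as a tuple makes later attributes in ms act as tie-breakers for earlier ones.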

References

[AgrawalSrikant1994] R Agrawal and R Srikant: Fast algorithms for mining association rules in large databases. In Proc. 20th International Conference on Very Large Data Bases, pages 487-499, Santiago, Chile, September 1994.

[TanSteinbachKumar2005] P-N Tan, M Steinbach and V Kumar: Introduction to Data Mining, chapter on Association analysis: basic concepts and algorithms, Addison Wesley, 2005.


http://orange.biolab.si/docs/latest/reference/rst/Orange.associate/
