Introduction to Data Mining
Exploring an example on frequent patterns, associations, and correlations
Exploring an example on cluster analysis
Data mining is a fairly new concept that emerged in the late 1980s. It soon attracted great research interest and flourished throughout the 1990s, when many new and remarkable techniques were discovered.
Data mining is the process of extracting useful information from large amounts of data stored in databases.
Data mining has attracted a great deal of attention in the information industry and in society as a whole in recent years, due to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge.
Frequent patterns: These are patterns that appear frequently in a data set.
For example, a set of items, such as A = apple, B = bread, C = cheese, D = drink, E = eggs, that appear frequently together in a transaction data set is a frequent item-set.
Example: a transaction data set D with transaction IDs (TID) T100, T200, T300, T400, T500, T600, T700, T800, T900.
C1 (candidate 1-item-sets)
Item set    Sup. Count
{A}         6
{B}         7
{C}         6
{D}         2
{E}         2

L1 (frequent 1-item-sets)
Item set    Sup. Count
{A}         6
{B}         7
{C}         6
{D}         2
{E}         2
Generate C2 candidates from L1
C2
Item set
{A, B}
{A, C}
{A, D}
{A, E}
{B, C}
{B, D}
{B, E}
{C, D}
{C, E}
{D, E}
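The join step that produces C2 from L1 can be sketched in Python as follows. This is a minimal illustration rather than the full algorithm: the variable names `l1_items` and `c2` are assumptions made for this sketch, and because every single item in L1 is frequent, no candidate pair needs to be pruned at this level.

```python
from itertools import combinations

# Frequent 1-item-sets from L1 (item labels as used in the example).
l1_items = ["A", "B", "C", "D", "E"]

# Join step: every unordered pair of frequent items is a candidate 2-item-set.
c2 = [frozenset(pair) for pair in combinations(l1_items, 2)]

for candidate in c2:
    print(sorted(candidate))
```

Five frequent items give 5 * 4 / 2 = 10 candidate pairs, which matches the ten distinct item-sets listed for C2.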
Scan D for the count of each candidate
C2
Item set    Sup. Count
{A, B}      4
{A, C}      4
{A, D}      1
{A, E}      2
{B, C}      4
{B, D}      2
{B, E}      2
{C, D}      0
{C, E}      1
{D, E}      0
Compare each candidate's support count with the minimum support count
L2
Item set    Sup. Count
{A, B}      4
{A, C}      4
{A, E}      2
{B, C}      4
{B, D}      2
{B, E}      2
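The comparison step can be sketched as a simple filter. The counts below are copied from the C2 table. The threshold `MIN_SUPPORT_COUNT = 2` is an assumption inferred from the tables (item-sets with a count of 2 are kept, those with a count of 1 or 0 are dropped), since the value is not stated explicitly.

```python
# Support counts for the candidate 2-item-sets, copied from the C2 table.
c2_counts = {
    ("A", "B"): 4,
    ("A", "C"): 4,
    ("A", "D"): 1,
    ("A", "E"): 2,
    ("B", "C"): 4,
    ("B", "D"): 2,
    ("B", "E"): 2,
    ("C", "D"): 0,
    ("C", "E"): 1,
    ("D", "E"): 0,
}

# Assumed threshold, inferred from which item-sets survive into L2.
MIN_SUPPORT_COUNT = 2

# Keep only the candidates whose support count reaches the threshold.
l2 = {
    itemset: count
    for itemset, count in c2_counts.items()
    if count >= MIN_SUPPORT_COUNT
}

for itemset, count in l2.items():
    print(itemset, count)
```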
Generate C3 candidates from L2
C3
Item set
{A, B, C}
{A, B, E}
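One way to see why C3 contains only two item-sets is the Apriori property: every subset of a frequent item-set must itself be frequent. The sketch below discards any 3-item-set that has a 2-item subset outside L2. The function name `generate_candidates` is hypothetical, and for brevity the sketch enumerates item combinations instead of performing the classic prefix join, which yields the same candidates on an example this small.

```python
from itertools import combinations


def generate_candidates(frequent, k):
    """Return candidate k-item-sets whose (k-1)-subsets are all in `frequent`."""
    frequent = set(frequent)
    items = sorted(set().union(*frequent))
    candidates = []
    for combo in combinations(items, k):
        # Prune step: keep the candidate only if every (k-1)-subset is frequent.
        if all(frozenset(sub) in frequent for sub in combinations(combo, k - 1)):
            candidates.append(frozenset(combo))
    return candidates


# Frequent 2-item-sets from the L2 table.
l2 = [frozenset(s) for s in ("AB", "AC", "AE", "BC", "BD", "BE")]

c3 = generate_candidates(l2, 3)
print([sorted(c) for c in c3])
```

For instance, {A, B, D} is rejected because {A, D} is not in L2, and {B, C, E} is rejected because {C, E} is not in L2.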
Scan D for the count of each candidate
C3
Item set     Sup. Count
{A, B, C}    2
{A, B, E}    2

L3
Item set     Sup. Count
{A, B, C}    2
{A, B, E}    2
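The whole level-by-level procedure (generate candidates, scan D, compare with the minimum support count) can be put together in a short Python sketch. Two things here are assumptions and not part of the example as given: the item lists of transactions T100 to T900 are not shown, so the `transactions` dictionary below is a hypothetical reconstruction chosen so that its support counts agree with the C1, C2, and C3 tables, and `MIN_SUPPORT_COUNT = 2` is inferred from which item-sets survive at each level. The function names are likewise illustrative.

```python
from itertools import combinations

# ASSUMPTION: hypothetical item lists, chosen only so that the support counts
# agree with the C1, C2 and C3 tables. The actual item lists are not shown.
transactions = {
    "T100": {"A", "B", "E"},
    "T200": {"B", "D"},
    "T300": {"B", "C"},
    "T400": {"A", "B", "D"},
    "T500": {"A", "C"},
    "T600": {"B", "C"},
    "T700": {"A", "C"},
    "T800": {"A", "B", "C", "E"},
    "T900": {"A", "B", "C"},
}

MIN_SUPPORT_COUNT = 2  # assumed threshold, inferred from the tables


def support_count(itemset):
    """Scan D: count the transactions that contain every item of `itemset`."""
    return sum(1 for items in transactions.values() if itemset <= items)


def apriori():
    """Return [L1, L2, ...], each a dict mapping frequent item-sets to counts."""
    all_items = sorted(set().union(*transactions.values()))
    candidates = [frozenset([item]) for item in all_items]
    levels = []
    k = 1
    while candidates:
        counts = {c: support_count(c) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= MIN_SUPPORT_COUNT}
        if not frequent:
            break
        levels.append(frequent)
        k += 1
        # Next candidates: k-item-sets whose (k-1)-subsets are all frequent.
        items = sorted(set().union(*frequent))
        candidates = [
            frozenset(combo)
            for combo in combinations(items, k)
            if all(frozenset(sub) in frequent for sub in combinations(combo, k - 1))
        ]
    return levels


for level, frequent in enumerate(apriori(), start=1):
    print(f"L{level}:", {"".join(sorted(s)): n for s, n in frequent.items()})
```

Under these assumptions the loop stops after L3: the only possible 4-item-set, {A, B, C, E}, is pruned because its subset {A, C, E} is not frequent.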
What is cluster analysis?
Finding the dissimilarity between two binary variables, with an example
Clustering:
The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters.
Cluster analysis has been widely used in numerous applications, including market research, pattern recognition, data analysis, and image processing. In business, clustering can help marketers discover distinct groups of customers based on purchasing patterns.
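                Object j
                1      0      sum
Object i   1    q      r      q+r
           0    s      t      s+t
           sum  q+s    r+t

A contingency table for binary variables

Formulas for calculating the dissimilarity between i and j:
Symmetric binary variables: d(i, j) = (r+s)/(q+r+s+t)
Asymmetric binary variables (t is ignored): d(i, j) = (r+s)/(q+r+s)
Similarity for asymmetric binary variables: sim(i, j) = q/(q+r+s) = 1 - d(i, j)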
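To make the relationship between the measures explicit, the short derivation below uses only the symbols q, r, s, and t from the contingency table. The subscripts "sym" and "asym" are labels introduced here for clarity. The identity sim = 1 - d holds for the asymmetric form, in which the negative matches t are dropped from the denominator.

```latex
\begin{align*}
d_{\text{sym}}(i,j)  &= \frac{r+s}{q+r+s+t} \\
d_{\text{asym}}(i,j) &= \frac{r+s}{q+r+s} \\
\mathrm{sim}(i,j)    &= 1 - d_{\text{asym}}(i,j)
                      = \frac{(q+r+s)-(r+s)}{q+r+s}
                      = \frac{q}{q+r+s}
\end{align*}
```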
A relational table in which patients are described by binary attributes:

Attribute   Jack   Mary   Jim
gender      M      F
fever       P      P      P
cough       N      N      P
test-1      P      P      N
test-2      N      N      N
test-3      N      P      N
test-4      N      N      N

Here, N = negative (coded 0) and P = positive (coded 1).
SYMPTOMS   JACK   MARY   JIM
FEVER      1      1      1
COUGH      0      0      1
TEST1      1      1      0
TEST2      0      0      0
TEST3      0      1      0
TEST4      0      0      0
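Using the asymmetric form d(i, j) = (r+s)/(q+r+s), in which t is ignored:
d(Jack, Mary) = (0+1)/(2+0+1) = 0.33 (q = 2: fever, test-1; r = 0; s = 1: test-3)
d(Jack, Jim) = (1+1)/(1+1+1) = 0.67 (q = 1: fever; r = 1: test-1; s = 1: cough)
d(Mary, Jim) = (2+1)/(1+2+1) = 0.75 (q = 1: fever; r = 2: test-1, test-3; s = 1: cough)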
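The three values above can be checked with a small Python sketch. It uses only the symptom and test attributes, coded P = 1 and N = 0, and applies the asymmetric form d(i, j) = (r+s)/(q+r+s). The function names `contingency` and `asymmetric_dissimilarity` are hypothetical helpers written for this illustration.

```python
from itertools import combinations

# Attributes in order: fever, cough, test-1, test-2, test-3, test-4 (P = 1, N = 0).
patients = {
    "Jack": [1, 0, 1, 0, 0, 0],
    "Mary": [1, 0, 1, 0, 1, 0],
    "Jim": [1, 1, 0, 0, 0, 0],
}


def contingency(x, y):
    """Return the counts (q, r, s, t) for two equal-length binary vectors."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return q, r, s, t


def asymmetric_dissimilarity(x, y):
    """d(i, j) = (r + s) / (q + r + s); the negative matches t are ignored."""
    q, r, s, _ = contingency(x, y)
    denominator = q + r + s
    if denominator == 0:
        raise ValueError("undefined when neither object has a positive attribute")
    return (r + s) / denominator


for a, b in combinations(patients, 2):
    d = asymmetric_dissimilarity(patients[a], patients[b])
    print(f"d({a}, {b}) = {d:.2f}")
```

The smallest value is for Jack and Mary, which suggests those two patients are the most similar of the three, while Mary and Jim are the least similar.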