
Seminar by Shiva Rama Krishna

Introduction to data mining
Exploring an example on frequent patterns, associations and correlations
Exploring an example on cluster analysis

Data mining is a fairly new concept that emerged in the late 1980s. It soon attracted great research interest and flourished throughout the 1990s, when many new and remarkable techniques were discovered.

Data mining is the process of extracting useful information and knowledge from huge amounts of data stored in databases.

Data mining has attracted a great deal of attention in the information industry and in society as a whole in recent years, due to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge.

Data mining can be applied to many kinds of data repositories:

Relational databases
Transactional databases
Advanced database systems
Flat files
Data streams
The World Wide Web

Advanced database systems include object-relational databases and specific application-oriented databases, such as spatial databases, time-series databases, text databases, and multimedia databases.

What are frequent patterns?

Example on transactional data of customers of a shop

Frequent patterns: patterns that appear frequently in a data set.

For example, a set of items, such as A = apple, B = bread, C = cheese, D = drink, E = eggs, that appear frequently together in a transaction data set is a frequent item-set.

Example (through the Apriori algorithm): transaction database D

TID     List of item_IDs
T100    A, B, E
T200    B, D
T300    B, C
T400    A, B, D
T500    A, C
T600    B, C
T700    A, C
T800    A, B, C, E
T900    A, B, C

Take the minimum support count as 2.


Scan D for the count of each candidate

C1
Item set    Sup. count
{A}         6
{B}         7
{C}         6
{D}         2
{E}         2

Compare candidate support count with minimum support count

L1
Item set    Sup. count
{A}         6
{B}         7
{C}         6
{D}         2
{E}         2
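
The first pass (C1 to L1) can be sketched in Python as below. This is only an illustration of the counting step under the example's data; the names `transactions`, `MIN_SUPPORT`, `c1` and `l1` are chosen here and do not come from the slides.

```python
from collections import Counter

# Transaction database D from the example (TID -> set of item IDs).
transactions = {
    "T100": {"A", "B", "E"},
    "T200": {"B", "D"},
    "T300": {"B", "C"},
    "T400": {"A", "B", "D"},
    "T500": {"A", "C"},
    "T600": {"B", "C"},
    "T700": {"A", "C"},
    "T800": {"A", "B", "C", "E"},
    "T900": {"A", "B", "C"},
}
MIN_SUPPORT = 2  # minimum support count used in the example

# C1: one scan of D, counting every individual item.
c1 = Counter(item for items in transactions.values() for item in items)

# L1: keep only the candidates that meet the minimum support count.
l1 = {item: count for item, count in c1.items() if count >= MIN_SUPPORT}

for item in sorted(l1):
    print(item, l1[item])
```

Because every item occurs at least twice, no candidate is dropped and L1 equals C1.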
Generate C2 candidates from L1

C2
Item set
{A, B} {A, C} {A, D} {A, E} {B, C} {B, D} {B, E} {C, D} {C, E} {D, E}

Scan D for the count of each candidate

C2
Item set    Sup. count
{A, B}      4
{A, C}      4
{A, D}      1
{A, E}      2
{B, C}      4
{B, D}      2
{B, E}      2
{C, D}      0
{C, E}      1
{D, E}      0

Compare candidate support count with minimum support count

L2
Item set    Sup. count
{A, B}      4
{A, C}      4
{A, E}      2
{B, C}      4
{B, D}      2
{B, E}      2
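
The second pass can be sketched the same way: pair up the frequent items, scan D once, and filter. Again this is a minimal illustration, assuming the item list from L1; the variable names are hypothetical.

```python
from itertools import combinations

# Transactions of database D (TIDs are not needed for counting).
transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"},
    {"A", "B", "D"}, {"A", "C"}, {"B", "C"},
    {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]
MIN_SUPPORT = 2
l1_items = ["A", "B", "C", "D", "E"]  # frequent 1-itemsets from L1

# C2: every unordered pair of frequent items (10 candidates).
c2 = [frozenset(pair) for pair in combinations(l1_items, 2)]

# Scan D: a candidate is supported by a transaction that contains it.
support = {c: sum(1 for t in transactions if c <= t) for c in c2}

# L2: candidates that meet the minimum support count.
l2 = {c: n for c, n in support.items() if n >= MIN_SUPPORT}

for itemset, n in l2.items():
    print(sorted(itemset), n)
```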

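Generate C3 candidates from L2

Joining L2 with itself also yields {A, C, E}, {B, C, D}, {B, C, E} and {B, D, E}, but each of these contains a 2-item subset that is not in L2 (for example {C, E} or {C, D}), so they are pruned by the Apriori property.

C3
Item set
{A, B, C} {A, B, E}

Scan D for the count of each candidate

C3
Item set     Sup. count
{A, B, C}    2
{A, B, E}    2

Compare candidate support count with minimum support count

L3
Item set     Sup. count
{A, B, C}    2
{A, B, E}    2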
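
Putting the passes together, the whole level-wise procedure could look roughly like the sketch below. It is a simplified illustration rather than a reference implementation: the join step unions any two frequent k-itemsets instead of using the usual prefix-based join, which gives the same candidates once the prune step is applied. The function name `apriori` and its parameters are assumptions made for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {k: {itemset: support count}} for all frequent k-itemsets."""

    def count(candidates):
        # Scan the database once and keep candidates meeting min_support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {1: count(items)}
    k = 1
    while frequent[k]:
        prev = frequent[k]
        # Join step (simplified): unions of two frequent k-itemsets
        # that form a (k+1)-itemset.
        joined = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune step (Apriori property): every k-subset must be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in prev for s in combinations(c, k))
        }
        k += 1
        frequent[k] = count(candidates)
    del frequent[k]  # the last level is empty
    return frequent


transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"},
    {"A", "B", "D"}, {"A", "C"}, {"B", "C"},
    {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]
frequent = apriori(transactions, min_support=2)
for k, level in frequent.items():
    print(f"L{k}:", sorted("".join(sorted(s)) for s in level))
```

On the example data the loop stops after L3, because the only 4-item candidate, {A, B, C, E}, contains the infrequent subset {A, C, E} and is pruned.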

What is cluster analysis? Finding the dissimilarity between two objects described by binary variables, with an example

Clustering:
The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters.

Cluster analysis has been widely used in numerous applications, including market research, pattern recognition, data analysis, and image processing. In business, clustering can help marketers discover distinct groups of customers based on purchasing patterns.

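Analysis of dissimilarity through cluster analysis:

A contingency table for binary variables

                    Object j
                    1       0       sum
Object i    1       q       r       q+r
            0       s       t       s+t
            sum     q+s     r+t     q+r+s+t

q is the number of variables that equal 1 for both objects
r is the number of variables that equal 1 for object i and 0 for object j
s is the number of variables that equal 0 for object i and 1 for object j
t is the number of variables that equal 0 for both objects

Formulas for calculating the dissimilarity between i and j:
Symmetric binary variables: d(i, j) = (r+s)/(q+r+s+t)
Asymmetric binary variables (t is ignored): d(i, j) = (r+s)/(q+r+s)
Similarity for asymmetric binary variables: Sim(i, j) = q/(q+r+s) = 1 - d(i, j)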
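
As a rough illustration of these formulas, the sketch below counts q, r, s and t for two objects given as equal-length lists of 0/1 values and evaluates both dissimilarities. The function names and the two sample vectors `x` and `y` are hypothetical and are not taken from the slides.

```python
def contingency(i, j):
    """Return (q, r, s, t) for two equal-length sequences of 0/1 values."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)  # 1 in both
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)  # 1 in i only
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)  # 1 in j only
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)  # 0 in both
    return q, r, s, t


def symmetric_dissimilarity(i, j):
    """d(i, j) = (r+s)/(q+r+s+t): both states carry equal weight."""
    q, r, s, t = contingency(i, j)
    return (r + s) / (q + r + s + t)


def asymmetric_dissimilarity(i, j):
    """d(i, j) = (r+s)/(q+r+s): negative matches (t) are ignored.

    Raises ZeroDivisionError when neither object has any 1.
    """
    q, r, s, _ = contingency(i, j)
    return (r + s) / (q + r + s)


# Hypothetical objects, used only to exercise the functions.
x = [1, 1, 0, 0]
y = [1, 0, 1, 0]
print(contingency(x, y))
print(symmetric_dissimilarity(x, y))
print(asymmetric_dissimilarity(x, y))
```

Ignoring t makes sense when a shared 0 (for example, two negative test results) says little about how alike two objects are.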

A relational table where patients are described by binary attributes:

Name    gender    fever    cough    test-1    test-2    test-3    test-4
Jack    M         P        N        P         N         N         N
Mary    F         P        N        P         N         P         N
Jim     -         P        P        N         N         N         N

Here N = negative and P = positive (P is set to 1 and N is set to 0).

SYMPTOMS    JACK    MARY
FEVER       1       1
COUGH       0       0
TEST1       1       1
TEST2       0       0
TEST3       0       1
TEST4       0       0

SYMPTOMS    JACK    JIM
FEVER       1       1
COUGH       0       1
TEST1       1       0
TEST2       0       0
TEST3       0       0
TEST4       0       0

SYMPTOMS    MARY    JIM
FEVER       1       1
COUGH       0       1
TEST1       1       0
TEST2       0       0
TEST3       1       0
TEST4       0       0

Calculation:

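Each value is (r+s)/(q+r+s) over the six symptom attributes, with negative matches (t) ignored:

d(Jack, Mary) = (0+1)/(2+0+1) = 0.33
d(Jack, Jim) = (1+1)/(1+1+1) = 0.67
d(Mary, Jim) = (2+1)/(1+2+1) = 0.75

Jack and Mary have the smallest dissimilarity, so they are the most similar pair, while Mary and Jim are the least similar.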
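
The three values can be checked with a short sketch. It assumes each patient is represented by the set of attributes that are positive (1), so that q is the size of the intersection, r+s the size of the symmetric difference, and q+r+s the size of the union; this set-based formulation and the names used are illustrative choices, not part of the slides.

```python
from itertools import combinations

# Attributes with value P (= 1) for each patient, from the table above.
positives = {
    "Jack": {"fever", "test-1"},
    "Mary": {"fever", "test-1", "test-3"},
    "Jim": {"fever", "cough"},
}


def dissimilarity(a, b):
    """Asymmetric binary dissimilarity (r+s)/(q+r+s) on sets of positives."""
    return len(a ^ b) / len(a | b)


results = {
    (x, y): round(dissimilarity(positives[x], positives[y]), 2)
    for x, y in combinations(positives, 2)
}
for (x, y), d in results.items():
    print(f"d({x}, {y}) = {d:.2f}")
```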
