
Data vs. Information

Data
- Raw facts
- No context
- Just numbers and text

Information
- Data with context
- Processed data
- Value added to data
- Summarized, organized, analyzed
Data → Information → Knowledge

Data becomes information through:
- Summarizing the data
- Averaging the data
- Selecting part of the data
- Graphing the data
- Adding context
- Adding value

Data → Information → Knowledge

Information becomes knowledge by asking:
- How is the information tied to outcomes?
- Are there any patterns in the information?
- What information is relevant to the problem?
- How does this information affect the system?
- What is the best way to use the information?
- How can we add more value to the information?

Knowledge

What Is Data Mining?

Data mining (knowledge discovery in databases): the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases.

Data Mining Definition

- Finding hidden information in a database
- Fitting data to a model
- Similar terms: exploratory data analysis, data-driven discovery

Motivation: Why Mine Data?

Data explosion problem: automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories.

We are drowning in data, but starving for knowledge!

Solution: data warehousing and data mining, i.e., the extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases.

Query Examples

Database queries:
- Find all credit applicants with the last name Smith.
- Identify customers who have purchased more than $10,000 in the last month.
- Find all customers who have purchased milk.

Data mining queries:
- Find all credit applicants who are poor credit risks. (Classification)
- Identify customers with similar buying habits. (Clustering)
- Find all items which are frequently purchased with milk. (Association)

Data Mining: Classification Schemes

Decisions in data mining:
- Kinds of databases to be mined
- Kinds of knowledge to be discovered
- Kinds of techniques utilized
- Kinds of applications adapted

Data mining tasks:
- Descriptive data mining
- Predictive data mining

Data Mining Tasks

Prediction tasks: use some variables to predict unknown or future values of other variables.

Description tasks: find human-interpretable patterns that describe the data.

Common data mining tasks:
- Classification [Predictive]
- Clustering [Descriptive]

CLUSTERING

Clustering Definition

Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
- Data points in one cluster are more similar to one another.
- Data points in separate clusters are less similar to one another.

Similarity measures:
- Euclidean distance, if attributes are continuous.
- Other problem-specific measures.

Illustrating Clustering

Euclidean-distance-based clustering in 3-D space:
- Intracluster distances are minimized.
- Intercluster distances are maximized.

Clustering example

Clustering: Application 1

Market segmentation:
- Goal: subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
- Approach:
  - Collect different attributes of customers based on their geographic and lifestyle-related information.
  - Find clusters of similar customers.
  - Measure the clustering quality by observing the buying patterns of customers in the same cluster vs. those from different clusters.

Clustering: Application 2

Document clustering:
- Goal: find groups of documents that are similar to each other based on the important terms appearing in them.
- Approach: identify frequently occurring terms in each document, form a similarity measure based on the frequencies of the different terms, and use it to cluster.
- Gain: information retrieval can utilize the clusters to relate a new document or search term to the clustered documents.

Example: Google News clusters news stories automatically, which gives an effective news-presentation metaphor.

Incremental Clustering

- Processes the data one element at a time.
- Usually stores only a small number of elements.
- It is difficult to handle a large volume of data.
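The one-element-at-a-time idea can be sketched in Python. The `IncrementalClusterer` below is an illustrative helper, not from the slides: each arriving point joins the nearest running centroid if it is close enough, otherwise it starts a new cluster, and only centroids and counts are stored, never the points themselves.

```python
import math

class IncrementalClusterer:
    """Minimal incremental-clustering sketch: process points one at a
    time, keeping only running centroids and per-cluster counts."""

    def __init__(self, threshold):
        self.threshold = threshold  # max distance to join a cluster
        self.centroids = []         # one centroid per cluster
        self.counts = []            # points absorbed per cluster

    def add(self, point):
        # find the nearest existing centroid
        best, best_dist = None, float("inf")
        for i, c in enumerate(self.centroids):
            d = math.dist(c, point)
            if d < best_dist:
                best, best_dist = i, d
        # too far from everything (or first point): start a new cluster
        if best is None or best_dist > self.threshold:
            self.centroids.append(list(point))
            self.counts.append(1)
            return len(self.centroids) - 1
        # otherwise update the running mean of the chosen cluster
        n = self.counts[best] + 1
        self.centroids[best] = [(c * (n - 1) + p) / n
                                for c, p in zip(self.centroids[best], point)]
        self.counts[best] = n
        return best

clusterer = IncrementalClusterer(threshold=1.0)
for p in [(0, 0), (0.2, 0.1), (5, 5), (5.1, 4.9)]:
    clusterer.add(p)
print(len(clusterer.centroids))  # two clusters for these points
```

Because each point is seen only once, the result depends on arrival order and on the threshold, which is exactly the weakness the slide alludes to.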

Various Distance Measures
1) Euclidean distance
2) Cosine distance

Euclidean Distance
1) One dimension: d(p, q) = |p - q|
2) Two dimensions: d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2)
3) N dimensions: d(p, q) = sqrt((p1 - q1)^2 + ... + (pN - qN)^2)
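The N-dimensional formula can be written directly in plain Python (the `euclidean_distance` helper is illustrative, not from the slides):

```python
import math

def euclidean_distance(p, q):
    """Square root of the sum of squared coordinate differences."""
    if len(p) != len(q):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```

The one- and two-dimensional cases are just this formula with N = 1 and N = 2.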

Cosine Distance

The cosine distance between two points (treated as vectors) is one minus the cosine of the angle between them. Cosine similarity is used to compare how similar two vectors are:

cosine_distance = 1 - cosine_similarity

K-Means Clustering
- Assume k clusters and define k initial centers.
- Associate each point with the cluster of its nearest center.
- Recalculate the k centroids.
- Re-associate each data point with its nearest new centroid.
- The k centers change their location step by step until no more changes occur.


K-Means Clustering Algorithm

Algorithmic steps for k-means clustering:

Let X = {x1, x2, x3, ..., xn} be the set of data points and V = {v1, v2, ..., vc} be the set of centers.

1) Randomly select c cluster centers.
2) Calculate the distance between each data point and each cluster center.
3) Assign each data point to the cluster center whose distance from it is the minimum over all cluster centers.
4) Recalculate each cluster center as the mean of its assigned points:

   v_i = (1 / c_i) * (sum of the x_j assigned to cluster i),

   where c_i represents the number of data points in the ith cluster.
5) Recalculate the distance between each data point and the newly obtained cluster centers.
6) If no data point was reassigned, then stop; otherwise repeat from step 3.
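The six steps above can be sketched in plain Python (the `k_means` helper is illustrative, not a library API):

```python
import math
import random

def k_means(points, k, max_iters=100, seed=0):
    """Sketch of the steps above: pick random initial centers, assign
    each point to its nearest center, recompute each center as the mean
    of its points, and repeat until assignments stop changing."""
    rng = random.Random(seed)
    centers = list(rng.sample(points, k))    # step 1
    assignment = None
    for _ in range(max_iters):
        # steps 2-3 (and 5): nearest-center assignment
        new_assignment = [min(range(k),
                              key=lambda i: math.dist(p, centers[i]))
                          for p in points]
        if new_assignment == assignment:     # step 6: converged
            break
        assignment = new_assignment
        # step 4: recompute each center as the mean of its members
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                centers[i] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return centers, assignment

points = [(1, 1), (1.5, 2), (8, 8), (9, 9), (1, 0.5), (8.5, 9.5)]
centers, labels = k_means(points, k=2)
```

On these six well-separated points the two tight groups end up in different clusters regardless of the random initialization.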

Advantages
1) Fast, robust, and easy to understand.
2) Gives the best results when the data sets are distinct or well separated from each other.

Disadvantages
1) The learning algorithm requires the number of cluster centers to be specified in advance.
2) Exclusive assignment: if there are two highly overlapping data sets, k-means will not be able to resolve that there are two clusters.
3) Randomly choosing the cluster centers may not lead to a fruitful result.
4) The algorithm fails for non-linearly separable data sets.

Category Utility

Category utility measures "category goodness". It attempts to maximize both the probability that two objects in the same category have attribute values in common and the probability that objects in different categories have different attribute values. The probability tree is generated according to this measure.
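One common formulation of category utility for nominal attributes can be sketched in Python; the helper below and the toy animal tuples are illustrative, not from the slides. A partition scores higher when clusters make attribute values more predictable than they are in the data as a whole.

```python
from collections import Counter

def category_utility(clusters):
    """Category utility of a partition of nominal-attribute instances.
    `clusters` is a list of clusters; each cluster is a non-empty list
    of equal-length attribute tuples. Higher is better."""
    instances = [x for cluster in clusters for x in cluster]
    n = len(instances)
    n_attrs = len(instances[0])
    # baseline predictability: sum over attributes of P(A_i = v)^2
    base = sum((count / n) ** 2
               for i in range(n_attrs)
               for count in Counter(x[i] for x in instances).values())
    score = 0.0
    for cluster in clusters:
        p_c = len(cluster) / n
        # within-cluster predictability: sum of P(A_i = v | C)^2
        within = sum((count / len(cluster)) ** 2
                     for i in range(n_attrs)
                     for count in Counter(x[i] for x in cluster).values())
        score += p_c * (within - base)
    return score / len(clusters)

# A partition grouping identical animals beats one that mixes them
good = [[("small", "furry"), ("small", "furry")],
        [("large", "scaly"), ("large", "scaly")]]
bad = [[("small", "furry"), ("large", "scaly")],
       [("small", "furry"), ("large", "scaly")]]
print(category_utility(good) > category_utility(bad))  # True
```

This is the score COBWEB (next slide) uses to decide between incorporating, creating, merging, and splitting classes.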

COBWEB

COBWEB is an incremental clustering algorithm which builds a taxonomy of clusters without a predefined number of clusters. Clusters are represented probabilistically by the conditional probability P(A = v | C) with which attribute A takes value v, given that the instance belongs to class C. The algorithm starts with an empty root node, and instances are added one by one. For each instance, the following options are considered:
- classifying the instance into an existing class;
- creating a new class and placing the instance into it;
- merging the two best classes;
- splitting the best class.

COBWEB Pseudocode

Function Cobweb(object, root):
    Incorporate object into the root cluster
    If root is a leaf then
        return the expanded leaf with the object
    else choose the operator that results in the best clustering:
        a) Incorporate the object into the best host
        b) Create a new class containing the object
        c) Merge the two best hosts
        d) Split the best host
    If (a), (c), or (d) was chosen, call Cobweb(object, best host)

COBWEB Example

Consider 4 animals having the following features.

COBWEB's clustering hierarchy
