
Data vs. Information

Data
- Raw facts
- No context
- Just numbers and text

Information
- Data with context
- Processed data
- Value added to data
- Summarized, organized, analyzed
Data → Information → Knowledge

Data becomes information through:
- Summarizing the data
- Averaging the data
- Selecting part of the data
- Graphing the data
- Adding context
- Adding value

Data → Information → Knowledge

Information becomes knowledge by asking:
- How is the information tied to outcomes?
- Are there any patterns in the information?
- What information is relevant to the problem?
- How does this information affect the system?
- What is the best way to use the information?
- How can we add more value to the information?

Knowledge

What Is Data Mining?

Data mining (knowledge discovery in databases): the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases.

Data Mining Definition

- Finding hidden information in a database
- Fitting data to a model
- Similar terms: exploratory data analysis, data-driven discovery

Motivation: Why Mine Data?

Data explosion problem: automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories.

We are drowning in data, but starving for knowledge!

Solution: data warehousing and data mining, i.e., the extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases.

Query Examples

Database queries:
- Find all credit applicants with the last name Smith.
- Identify customers who have purchased more than $10,000 in the last month.
- Find all customers who have purchased milk.

Data mining queries:
- Find all credit applicants who are poor credit risks. (Classification)
- Identify customers with similar buying habits. (Clustering)
- Find all items which are frequently purchased with milk. (Association)

Data Mining: Classification Schemes

Decisions in data mining:
- Kinds of databases to be mined
- Kinds of knowledge to be discovered
- Kinds of techniques utilized
- Kinds of applications adapted

Data mining tasks:
- Descriptive data mining
- Predictive data mining

Data Mining Tasks

Prediction tasks: use some variables to predict unknown or future values of other variables.

Description tasks: find human-interpretable patterns that describe the data.

Common data mining tasks:
- Classification [Predictive]
- Clustering [Descriptive]

CLUSTERING

Clustering Definition

Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
- Data points in one cluster are more similar to one another.
- Data points in separate clusters are less similar to one another.

Similarity measures:
- Euclidean distance, if attributes are continuous.
- Other problem-specific measures.

Illustrating Clustering

Euclidean-distance-based clustering in 3-D space:
- Intracluster distances are minimized.
- Intercluster distances are maximized.

Clustering example

Clustering: Application 1

Market segmentation:
- Goal: subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
- Approach:
  - Collect different attributes of customers based on their geographic and lifestyle-related information.
  - Find clusters of similar customers.
  - Measure the clustering quality by observing the buying patterns of customers in the same cluster vs. those from different clusters.

Clustering: Application 2

Document clustering:
- Goal: find groups of documents that are similar to each other based on the important terms appearing in them.
- Approach: identify frequently occurring terms in each document, form a similarity measure based on the frequencies of the different terms, and use it to cluster.
- Gain: information retrieval can utilize the clusters to relate a new document or search term to the clustered documents.

Example: Google News clusters news stories automatically, which gives an effective news-presentation metaphor.

Incremental Clustering

- Processes the data one element at a time.
- Usually stores only a small number of elements.
- It is difficult to handle a large volume of data.
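The one-element-at-a-time idea can be sketched in Python. The `IncrementalClusterer` below is an illustrative helper, not from the slides: each arriving point joins the nearest running centroid if it is close enough, otherwise it starts a new cluster, and only centroids and counts are stored, never the points themselves.

```python
import math

class IncrementalClusterer:
    """Minimal incremental-clustering sketch: process points one at a
    time, keeping only running centroids and per-cluster counts."""

    def __init__(self, threshold):
        self.threshold = threshold  # max distance to join a cluster
        self.centroids = []         # one centroid per cluster
        self.counts = []            # points absorbed per cluster

    def add(self, point):
        # find the nearest existing centroid
        best, best_dist = None, float("inf")
        for i, c in enumerate(self.centroids):
            d = math.dist(c, point)
            if d < best_dist:
                best, best_dist = i, d
        # too far from everything (or first point): start a new cluster
        if best is None or best_dist > self.threshold:
            self.centroids.append(list(point))
            self.counts.append(1)
            return len(self.centroids) - 1
        # otherwise update the running mean of the chosen cluster
        n = self.counts[best] + 1
        self.centroids[best] = [(c * (n - 1) + p) / n
                                for c, p in zip(self.centroids[best], point)]
        self.counts[best] = n
        return best

clusterer = IncrementalClusterer(threshold=1.0)
for p in [(0, 0), (0.2, 0.1), (5, 5), (5.1, 4.9)]:
    clusterer.add(p)
print(len(clusterer.centroids))  # two clusters for these points
```

Because each point is seen only once, the result depends on arrival order and on the threshold, which is exactly the weakness the slide alludes to.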

Various Distance Measures
1) Euclidean distance
2) Cosine distance

Euclidean Distance
1) One dimension: d(p, q) = |p - q|
2) Two dimensions: d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2)
3) N dimensions: d(p, q) = sqrt((p1 - q1)^2 + ... + (pN - qN)^2)
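The N-dimensional formula can be written directly in plain Python (the `euclidean_distance` helper is illustrative, not from the slides):

```python
import math

def euclidean_distance(p, q):
    """Square root of the sum of squared coordinate differences."""
    if len(p) != len(q):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```

The one- and two-dimensional cases are just this formula with N = 1 and N = 2.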

Cosine Distance

The cosine distance between two points (treated as vectors) is one minus the cosine of the angle between them. Cosine similarity is used to compare how similar two vectors are:

cosine_distance = 1 - cosine_similarity

K-Means Clustering
- Assume k clusters and define k initial centers.
- Associate each point with the cluster of its nearest center.
- Recalculate the k centroids.
- Re-associate each data point with its nearest new centroid.
- The k centers change their location step by step until no more changes occur.


K-Means Clustering Algorithm

Algorithmic steps for k-means clustering:

Let X = {x1, x2, x3, ..., xn} be the set of data points and V = {v1, v2, ..., vc} be the set of centers.

1) Randomly select c cluster centers.
2) Calculate the distance between each data point and each cluster center.
3) Assign each data point to the cluster center whose distance from it is the minimum over all cluster centers.
4) Recalculate each cluster center as the mean of its assigned points:

   v_i = (1 / c_i) * (sum of the x_j assigned to cluster i),

   where c_i represents the number of data points in the ith cluster.
5) Recalculate the distance between each data point and the newly obtained cluster centers.
6) If no data point was reassigned, then stop; otherwise repeat from step 3.
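The six steps above can be sketched in plain Python (the `k_means` helper is illustrative, not a library API):

```python
import math
import random

def k_means(points, k, max_iters=100, seed=0):
    """Sketch of the steps above: pick random initial centers, assign
    each point to its nearest center, recompute each center as the mean
    of its points, and repeat until assignments stop changing."""
    rng = random.Random(seed)
    centers = list(rng.sample(points, k))    # step 1
    assignment = None
    for _ in range(max_iters):
        # steps 2-3 (and 5): nearest-center assignment
        new_assignment = [min(range(k),
                              key=lambda i: math.dist(p, centers[i]))
                          for p in points]
        if new_assignment == assignment:     # step 6: converged
            break
        assignment = new_assignment
        # step 4: recompute each center as the mean of its members
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                centers[i] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return centers, assignment

points = [(1, 1), (1.5, 2), (8, 8), (9, 9), (1, 0.5), (8.5, 9.5)]
centers, labels = k_means(points, k=2)
```

On these six well-separated points the two tight groups end up in different clusters regardless of the random initialization.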

Advantages
1) Fast, robust, and easy to understand.
2) Gives the best results when the data sets are distinct or well separated from each other.

Disadvantages
1) The learning algorithm requires the number of cluster centers to be specified in advance.
2) Exclusive assignment: if there are two highly overlapping data sets, k-means will not be able to resolve that there are two clusters.
3) Randomly choosing the cluster centers may not lead to a fruitful result.
4) The algorithm fails for non-linearly separable data sets.

Category Utility

Category utility measures "category goodness". It attempts to maximize both the probability that two objects in the same category have attribute values in common and the probability that objects in different categories have different attribute values. The probability tree is generated according to this measure.
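One common formulation of category utility for nominal attributes can be sketched in Python; the helper below and the toy animal tuples are illustrative, not from the slides. A partition scores higher when clusters make attribute values more predictable than they are in the data as a whole.

```python
from collections import Counter

def category_utility(clusters):
    """Category utility of a partition of nominal-attribute instances.
    `clusters` is a list of clusters; each cluster is a non-empty list
    of equal-length attribute tuples. Higher is better."""
    instances = [x for cluster in clusters for x in cluster]
    n = len(instances)
    n_attrs = len(instances[0])
    # baseline predictability: sum over attributes of P(A_i = v)^2
    base = sum((count / n) ** 2
               for i in range(n_attrs)
               for count in Counter(x[i] for x in instances).values())
    score = 0.0
    for cluster in clusters:
        p_c = len(cluster) / n
        # within-cluster predictability: sum of P(A_i = v | C)^2
        within = sum((count / len(cluster)) ** 2
                     for i in range(n_attrs)
                     for count in Counter(x[i] for x in cluster).values())
        score += p_c * (within - base)
    return score / len(clusters)

# A partition grouping identical animals beats one that mixes them
good = [[("small", "furry"), ("small", "furry")],
        [("large", "scaly"), ("large", "scaly")]]
bad = [[("small", "furry"), ("large", "scaly")],
       [("small", "furry"), ("large", "scaly")]]
print(category_utility(good) > category_utility(bad))  # True
```

This is the score COBWEB (next slide) uses to decide between incorporating, creating, merging, and splitting classes.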

COBWEB

COBWEB is an incremental clustering algorithm which builds a taxonomy of clusters without a predefined number of clusters. Clusters are represented probabilistically by the conditional probability P(A = v | C) with which attribute A takes value v, given that the instance belongs to class C. The algorithm starts with an empty root node, and instances are added one by one. For each instance, the following options are considered:
- classifying the instance into an existing class;
- creating a new class and placing the instance into it;
- merging the two best classes;
- splitting the best class.

COBWEB Pseudocode

Function Cobweb(object, root):
    Incorporate object into the root cluster
    If root is a leaf then
        return the expanded leaf with the object
    else choose the operator that results in the best clustering:
        a) Incorporate the object into the best host
        b) Create a new class containing the object
        c) Merge the two best hosts
        d) Split the best host
    If (a), (c), or (d) was chosen, call Cobweb(object, best host)

COBWEB Example

Consider 4 animals having the following features.

COBWEB's clustering hierarchy
