Data Mining

Last lecture:
• Counting co-occurrences
• Mining rules
• Association rule

Today’s topic:
• Sequential pattern
• Classification rule
• Regression rule
• Tree structure rule
• Neural network
• Clustering
• Similarity search over sequences
• ODBS
• Object, object identifier
• Inheritance
10/15/08 Jyotsna Chauhan 1
Sequential Pattern
A sequence of actions or events is sought.
Example: if a patient underwent cardiac surgery for blocked
arteries and developed high blood urea within a year of
surgery, he or she is likely to suffer kidney failure within
approximately the next 18 months.
Detecting sequential patterns is equivalent to detecting
associations among events that have certain temporal relationships.

Sequential pattern
• Example
• A sequential rule A → B says that event A
will be immediately followed by event B
with a certain confidence
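
One way to make the confidence of a sequential rule A → B concrete is to count how often A is immediately followed by B. The sketch below assumes each history is a plain list of event names; the function name and the sample events are illustrative and not part of the lecture.

```python
def rule_confidence(sequences, a, b):
    """Fraction of occurrences of event `a` that are immediately followed by `b`."""
    occurrences = 0  # how many times `a` appears at all
    followed = 0     # how many of those are directly followed by `b`
    for seq in sequences:
        for i, event in enumerate(seq):
            if event == a:
                occurrences += 1
                if i + 1 < len(seq) and seq[i + 1] == b:
                    followed += 1
    return followed / occurrences if occurrences else 0.0


# Hypothetical patient histories, one list of events per patient.
histories = [
    ["surgery", "high_urea", "kidney_failure"],
    ["surgery", "recovery"],
    ["checkup", "surgery", "high_urea"],
]
confidence = rule_confidence(histories, "surgery", "high_urea")
```

Here "surgery" occurs three times and is directly followed by "high_urea" twice, so the rule holds with confidence 2/3.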

Classification
• Given a collection of records
• Each record contains a set of attributes; one of the attributes is the class.
• Classification rules help assign new objects to classes.
• E.g., given a new automobile insurance applicant, should he
or she be classified as low risk, medium risk or high risk?
• Classification rules could use a variety of data, such as
educational level, salary, age, etc.
• ∀ person P, P.degree = masters and P.income > 75,000
⇒ P.credit = excellent
• ∀ person P, P.degree = bachelors and
(P.income ≥ 25,000 and P.income ≤ 75,000)
⇒ P.credit = good
Rules are not necessarily exact: there may be some
misclassifications.
Classification rules can be shown compactly as a decision tree.
Decision Tree
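
The two credit rules from the classification slide can be read as nested tests, which is what a decision tree encodes. This is a minimal sketch: the function name is made up, and returning None when no rule fires is an assumption, since the slides do not say what happens in that case.

```python
def credit_rating(degree, income):
    """Apply the two classification rules; None means no rule covers the person."""
    if degree == "masters" and income > 75_000:
        return "excellent"
    if degree == "bachelors" and 25_000 <= income <= 75_000:
        return "good"
    return None  # assumption: the slides give no default class


rating = credit_rating("masters", 80_000)
```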

Classification: Application 1
• Direct Marketing
• Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
• Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which
decided otherwise. This {buy, don’t buy} decision forms
the class attribute.
• Collect various demographic, lifestyle, and
company-interaction related information about all such customers.
• Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a
classifier model.

Classification: Application 2
• Fraud Detection
• Goal: Predict fraudulent cases in credit card
transactions.
• Approach:
• Use credit card transactions and information about the
account holder as attributes.
• When the customer buys, what he or she buys, how often
he or she pays on time, etc.
• Label past transactions as fraudulent or fair. This
forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card
transactions on an account.
Regression
• Predict the value of a given continuous-valued variable
based on the values of other variables, assuming a
linear or nonlinear model of dependency.
• Widely studied in statistics and in the neural network field.
• Examples:
• Predicting sales amounts of a new product based on
advertising expenditure.
• Predicting wind velocities as a function of temperature,
humidity, air pressure, etc.
• Time series prediction of stock market indices.
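
For the linear case, a regression rule can be obtained with ordinary least squares. The sketch below fits a straight line to one predictor; the advertising and sales figures are invented purely for illustration, and least squares is only one possible fitting method.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented data: advertising expenditure vs. sales amount.
ad_spend = [1, 2, 3, 4]
sales = [3, 5, 7, 9]
slope, intercept = fit_line(ad_spend, sales)
predicted = slope * 5 + intercept  # predicted sales for an expenditure of 5
```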

Tree Structure Rule
To discover classification rules and regression rules from a
relation, we draw a tree called a classification and regression
tree.
A classification tree is called a DECISION TREE.
[Figure: example of a decision tree]

Decision Tree
• A decision tree is a binary tree built from the
collection of classification rules.
• Each path runs from the root down to a leaf.
• Each internal node is labeled with a predictor attribute.
• These attributes are called splitting attributes, because
the data is split according to a condition on the
attribute.
• The predicate is written on the outgoing edge of the
node.

Algorithm to Build Decision Tree
INPUT: node n, partition P, split selection method S
OUTPUT: decision tree for P rooted at node n
BuildTree(n, P, S)
BEGIN
  I. Apply S to P to find the splitting criterion.
  II. If (a good splitting criterion is found) then
      a. Create children n1, n2 of n
      b. Partition P into P1, P2
      c. BuildTree(n1, P1, S)
      d. BuildTree(n2, P2, S)
     Endif
END
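
The recursion above can be sketched in Python as follows. The slides leave the split selection method S open, so `best_threshold_split` is a hypothetical stand-in that picks the numeric threshold with the fewest misclassified records; the dictionary-based tree and the sample records are also assumptions made for the example.

```python
from collections import Counter


def errors(part):
    """Records that a majority-vote leaf would misclassify."""
    return len(part) - Counter(label for _, label in part).most_common(1)[0][1]


def best_threshold_split(partition):
    """Hypothetical S: best (attribute, threshold) pair, or None if nothing helps."""
    best, best_err = None, errors(partition)
    for attr in range(len(partition[0][0])):
        for features, _ in partition:
            t = features[attr]
            left = [r for r in partition if r[0][attr] <= t]
            right = [r for r in partition if r[0][attr] > t]
            if left and right:
                err = errors(left) + errors(right)
                if err < best_err:
                    best, best_err = (attr, t), err
    return best


def build_tree(partition, select_split):
    """Steps I and II: split if a good criterion exists, otherwise make a leaf."""
    split = select_split(partition)
    if split is None:
        majority = Counter(label for _, label in partition).most_common(1)[0][0]
        return {"leaf": majority}
    attr, t = split
    p1 = [r for r in partition if r[0][attr] <= t]
    p2 = [r for r in partition if r[0][attr] > t]
    return {"attr": attr, "threshold": t,
            "left": build_tree(p1, select_split),
            "right": build_tree(p2, select_split)}


def classify(tree, features):
    """Follow the predicates on the edges from the root to a leaf."""
    while "leaf" not in tree:
        branch = "left" if features[tree["attr"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]


# Invented records: (income,) -> risk class.
records = [((20_000,), "low"), ((30_000,), "low"),
           ((80_000,), "high"), ((90_000,), "high")]
tree = build_tree(records, best_threshold_split)
```

The recursion stops on its own because a partition that is already pure admits no split that lowers the error count.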
Clustering

Cluster Analysis
A cluster
• A collection of data objects
• Similar to one another within the same cluster
• Dissimilar to the objects in other clusters
Cluster analysis
• Grouping a set of data objects into clusters
Clustering is unsupervised classification: no
predefined classes
Typical applications
• As a stand-alone tool to get insight into data distribution
• As a preprocessing step for other algorithms
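
The slides do not name a clustering algorithm. As one illustration of grouping objects without predefined classes, here is a minimal k-means sketch; k-means itself, the choice of the first k points as starting centroids, and the sample points are all assumptions made for the example.

```python
import math


def k_means(points, k, iterations=10):
    """Group points into k clusters by repeatedly assigning each to its nearest centroid."""
    centroids = [points[i] for i in range(k)]  # assumption: seed with the first k points
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, clusters


points = [(1, 1), (1, 2), (8, 8), (9, 8)]
centroids, clusters = k_means(points, k=2)
```

No class labels are supplied; the two groups emerge only from the distances between the points.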

Clustering Applications
Pattern Recognition
Spatial Data Analysis
• Create thematic maps in geographical information
systems by clustering feature spaces
Image Processing
Market research
WWW
• Document classification
• Cluster Weblog data to discover groups of similar
access patterns

Examples of Clustering Applications
Marketing
• Help marketers discover distinct groups of customers, and then
use this knowledge to develop targeted marketing programs
Insurance
• Identifying groups of motor insurance policy holders with a high
average claim cost
City-planning
• Identifying groups of houses according to their house type, value,
and geographical location
Earthquake studies
• Observed earthquake epicenters should be clustered along
continental faults
Similarity search over sequences
Information is stored in a database in a particular sequence.
We assume that the user specifies a “query sequence” and
wants to retrieve all data sequences that are similar to the query
sequence.
Let us begin by describing sequences and similarity between
sequences.
• A data sequence X is a series of numbers X = <X_1, ..., X_k>.
• X is also called a time series, and k is called the length of the sequence.
A subsequence Z = <Z_1, ..., Z_j> is obtained from another
sequence X by deleting numbers from the front and back of the
sequence X.
Z is a subsequence of X if Z_1 = X_i, Z_2 = X_(i+1), ..., Z_j = X_(i+j-1)
for some i in {1, ..., k-j+1}.

Similarity queries over sequences can be classified
into two types:
Complete sequence matching:
The query sequence and the sequences in the database
have the same length. Given a user-specified threshold parameter ε,
our goal is to retrieve all sequences in the database that are
within distance ε of the query sequence.
Subsequence matching:
The query sequence is shorter than the sequences in
the database. In this case, we want to find all subsequences of
sequences in the database such that the subsequence is within
distance ε of the query sequence.
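
Both query types can be sketched with a brute-force scan. The slides do not fix a distance measure, so Euclidean distance is assumed here; the function names and sample sequences are illustrative, and the reported start positions are 0-based.

```python
import math


def complete_matches(database, query, eps):
    """Sequences of the same length as the query that lie within distance eps."""
    return [s for s in database
            if len(s) == len(query) and math.dist(s, query) <= eps]


def subsequence_matches(database, query, eps):
    """(sequence index, start position, subsequence) for every window within eps."""
    j = len(query)
    hits = []
    for idx, s in enumerate(database):
        for i in range(len(s) - j + 1):  # every contiguous window of length j
            window = s[i:i + j]
            if math.dist(window, query) <= eps:
                hits.append((idx, i, window))
    return hits


query = [1, 2, 3]
full = complete_matches([[1, 2, 3], [1, 2, 4], [5, 5, 5]], query, eps=1.0)
partial = subsequence_matches([[0, 1, 2, 3, 9]], query, eps=0.5)
```

A scan like this compares the query against every candidate, which is fine for a small example but is the cost that index structures for sequence data try to avoid.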

Neural network

Neural Computing
Neuroscience
• The objective is to understand the human brain
• Biologically realistic models of neurons
• Biologically realistic connection topologies
Neural networks
• The objective is to develop computation methods
• Highly simplified artificial neurons
• Connection topologies that are aimed at
computational effectiveness
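
To give a sense of how simplified an artificial neuron can be, the sketch below uses a threshold unit: a weighted sum of the inputs plus a bias, passed through a step function. The slides do not specify a neuron model, so this particular form and the hand-picked weights are assumptions.

```python
def neuron(inputs, weights, bias):
    """Threshold unit: fire (1) when the weighted input sum plus bias is positive."""
    activation = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if activation > 0 else 0


# Hand-picked weights that make the unit behave like a logical AND.
and_weights, and_bias = (1.0, 1.0), -1.5
outputs = [neuron((a, b), and_weights, and_bias) for a in (0, 1) for b in (0, 1)]
```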

Why study neural computation?
The motivation is that the brain can do amazing
computations that we do not know how to do with a
conventional computer.
• Vision, language understanding, learning, ...
It does them by using huge networks of slow neurons,
each of which is connected to thousands of other
neurons.
• It’s not at all like a conventional computer that has a big,
passive memory and a very fast central processor that can
only do one simple operation at a time.
It learns to do these computations without any explicit
programming.

The goals of neural computation
To understand how the brain actually works
• It’s big and very complicated and made of yukky stuff
that dies when you poke it around
To understand a new style of computation
• Inspired by neurons and their adaptive connections
• Very different style from sequential computation
• Should be good for things that brains are good at (e.g. vision)
• Should be bad for things that brains are bad at (e.g. 23 x 71)
To solve practical problems by using novel learning algorithms
• Learning algorithms can be very useful even if they
have nothing to do with how the brain works
Object Database System

Weaknesses of RDBMSs
Representation of ‘real world’ entities: The process of
normalization generally leads to the creation of relations
that do not correspond to entities in the ‘real world’.
Semantic overloading: The relational model has only one
construct for representing data and data relationships: the
relation.
Homogeneous data: The relational model assumes both
horizontal and vertical homogeneity. Also, the intersection of a
row and column must be an atomic value. This structure
is restrictive for many ‘real world’ objects with a complex
structure.

Weaknesses of RDBMSs
Limited operations: The relational model has a
fixed set of operations (provided in SQL) and does
not allow new operations to be specified.
Recursive queries: It is extremely difficult to
produce recursive queries (queries about
relationships that a relation has with itself).
Impedance mismatch: Result of mixing different
programming paradigms (e.g., SQL is a declarative
language that handles sets of rows, whereas a
high-level language such as ‘C’ is a procedural
language that handles only one row at a time).

Why more than RDBMSs?
Q: Which other applications have requirements beyond an RDBMS?

A:
• Text
• Multimedia; financial applications/forecasting
• Geographic information systems
• CAD/CAM
• Network management

Object database systems have developed along two
distinct paths:
• OODBS (object-oriented database system)
• ORDBS (object-relational database system)

What is an Object-Oriented Database (OODB)?
A database system that incorporates all
the important object-oriented concepts
Some additional features
• Unique object identifiers
• Persistent object handling

Advantages of OODBS
The designer can specify the structure of
objects and their behavior (methods)
Better interaction with object-oriented
languages such as Java and C++
Definition of complex and user-defined
types
Encapsulation of operations and user-defined methods

Object Relational Database System
It is an attempt to extend the relational
database system with the functionality
necessary to support a broader class of
applications and is, in many ways, a bridge
between the relational and object-oriented
paradigms.

Object-Oriented Concepts
• Abstract Data Types
• Class definition, provides extension to complex
attribute types
• Encapsulation
• Implementation of operations and object structure
hidden
• Inheritance
• Sharing of data within hierarchy scope, supports
code reusability
• Polymorphism
• Operator overloading
Definition of an object
Objects are user-defined complex data types.
An object has structure or state (variables) and methods
(behavior/operations).
An object in the OO environment is an abstract representation of
a real-world entity that has a unique identity, embedded
properties, and the ability to interact with other objects and
itself.
An object is described by four characteristics:
Identifier: a system-wide unique id for an object
Name: an object may also have a unique name in the DB
(optional)
Lifetime: determines whether the object is persistent or transient
Structure: construction of objects using type constructors
Object Structure
The state (current value) of a complex object may be
constructed from other objects (or other values) by
using certain type constructors
Can be represented by a triple (i, c, v)
• i is a unique id
• c is a type constructor
• v is the object state
Constructors
• Basic types: atom, tuple and set
• Collection types: list, bag and array
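
The (i, c, v) representation can be sketched with a small record type. The class name, the string OIDs, and the department example are hypothetical; the sketch covers only the atom and tuple constructors and resolves references one level deep.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ObjectTriple:
    oid: str          # i: unique identifier
    constructor: str  # c: type constructor, e.g. "atom" or "tuple"
    state: Any        # v: the value, possibly made of other objects' OIDs


# Two atomic objects and one tuple object whose state refers to them by OID.
o1 = ObjectTriple("i1", "atom", "Houston")
o2 = ObjectTriple("i2", "atom", "Research")
o3 = ObjectTriple("i3", "tuple", {"dname": "i2", "location": "i1"})
store = {o.oid: o for o in (o1, o2, o3)}


def resolve(obj, store):
    """Replace the OIDs in a tuple-constructed state with the referenced states."""
    if obj.constructor == "tuple":
        return {field: store[oid].state for field, oid in obj.state.items()}
    return obj.state


department = resolve(o3, store)
```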

Object Structure
An object has associated with it:
• A set of variables that contain the data for the object. The
value of each variable is itself an object.
• A set of messages to which the object responds; each
message may have zero, one, or more parameters.
• A set of methods, each of which is a body of code to
implement a message; a method returns a value as the
response to the message.
The physical representation of data is visible only to the
implementer of the object.
Messages and responses provide the only external interface
to an object.

Object identity
Represented by an object ID (OID)
Unique to that object
The OID is assigned by the system at the moment of
the object’s creation
Cannot be changed under any circumstances
An OID is deleted only when its object is deleted
A deleted OID can never be reused
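
These OID properties can be illustrated with a toy in-memory store. The class and method names are hypothetical, and a monotonically increasing counter is just one simple way to guarantee that a deleted OID is never handed out again.

```python
import itertools


class ObjectStore:
    """Toy store in which the system, not the user, assigns OIDs."""

    def __init__(self):
        self._next_oid = itertools.count(1)  # never goes backwards, so no reuse
        self._objects = {}

    def create(self, state):
        oid = next(self._next_oid)  # assigned at the moment of creation
        self._objects[oid] = state
        return oid

    def update(self, oid, state):
        self._objects[oid] = state  # the state changes, the OID does not

    def delete(self, oid):
        del self._objects[oid]  # the OID disappears together with the object

    def get(self, oid):
        return self._objects[oid]


store = ObjectStore()
first = store.create({"name": "Ann"})
store.update(first, {"name": "Ann", "city": "Pune"})
store.delete(first)
second = store.create({"name": "Bob"})
```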

Object Identity
An object retains its identity even if some or all of the values of
its variables or the definitions of its methods change over time.
Object identity is a stronger notion of identity than that in
programming languages or data models not based on object
orientation.
• Value: data value; used in relational systems.
• Name: supplied by user; used for variables in procedures.
• Built-in: identity built into the data model or programming language.
• No user-supplied identifier is required.
• Form of identity used in object-oriented systems.

Object Identifiers
Object identifiers are used to uniquely identify
objects
• Can be stored as a field of an object, to refer to
another object.
• E.g., the spouse field of a person object may be an
identifier of another person object.
• Can be system generated (created by the
database) or external (such as a social-security
number).

Object Containment
bicycle
• wheel: rim, spokes, tire
• brake: lever, pad, cable
• gear
• frame
Each component in a design may contain other
components.
Objects containing other objects are called complex or
composite objects.
Multiple levels of containment create a containment
hierarchy: links are interpreted as IS-PART-OF, not IS-A.
Allows data to be viewed at different granularities by
different users.
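
A containment hierarchy can be sketched as nested dictionaries, where each key IS-PART-OF the component that holds it. The sketch assumes that rim, spokes, and tire belong to the wheel and that lever, pad, and cable belong to the brake; the function name is made up.

```python
# Each key IS-PART-OF the component whose dictionary contains it.
design = {
    "bicycle": {
        "wheel": {"rim": {}, "spokes": {}, "tire": {}},
        "brake": {"lever": {}, "pad": {}, "cable": {}},
        "gear": {},
        "frame": {},
    }
}


def parts_at_depth(hierarchy, depth):
    """Component names visible at a given level of granularity."""
    if depth == 0:
        return list(hierarchy)
    names = []
    for children in hierarchy.values():
        names.extend(parts_at_depth(children, depth - 1))
    return names


coarse = parts_at_depth(design, 1)
fine = parts_at_depth(design, 2)
```

Different users can stop at different depths: one sees only wheel, brake, gear, and frame, while another sees the individual parts.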
Inheritance
E.g., the class of bank customers is similar to the class of bank
employees: both share some variables and messages, e.g.,
name and address. But there are variables and messages
specific to each class, e.g., salary for employees and
credit-rating for customers.
Every employee is a person; thus employee is a specialization
of person.
Similarly, customer is a specialization of person.
Create classes person, employee and customer
• variables/messages applicable to all persons are associated with
class person.
• variables/messages specific to employees are associated with
class employee; similarly for customer

Inheritance
Place classes into a specialization/IS-A hierarchy
• variables/messages belonging to class person are inherited
by class employee as well as customer
Result is a class hierarchy
person
• employee
  • officer
  • teller
  • secretary
• customer
Note the analogy with the ISA hierarchy in the E-R model

Class Hierarchy Definition
class person {
    string name;
    string address;
};
class customer isa person {
    int credit-rating;
};
class employee isa person {
    date start-date;
    int salary;
};
class officer isa employee {
    int office-number;
    int expense-account-number;
};
...
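
The same hierarchy can be written in a mainstream object-oriented language. The Python sketch below is a translation, not the notation used in the slides: hyphenated names become underscores, `isa` becomes subclassing, and the sample officer is invented.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Person:
    name: str
    address: str


@dataclass
class Customer(Person):  # customer isa person
    credit_rating: int


@dataclass
class Employee(Person):  # employee isa person
    start_date: date
    salary: int


@dataclass
class Officer(Employee):  # officer isa employee
    office_number: int
    expense_account_number: int


# An officer carries the variables of person and employee as well as its own.
officer = Officer("A. Example", "1 Main St", date(2008, 10, 15), 50_000, 12, 3456)
```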
Abstract Data Types
Modularity
• Keeps the complexity of a large program
manageable by systematically controlling the
interaction of its components
• Isolates errors
• Eliminates redundancies
• A modular program is
• Easier to write
• Easier to read
• Easier to modify

Procedural abstraction
• Separates the purpose and use of a module from its
implementation
• A module’s specifications should
• Detail how the module behaves
• Identify details that can be hidden within the module
• Information hiding
– Hides certain implementation details within a module
– Makes these details inaccessible from outside the module

Typical operations on data
• Add data to a data collection
• Remove data from a data collection
• Ask questions about the data in a data collection
Data abstraction
– Asks you to think about what you can do to a collection of data
independently of how you do it
– Allows you to develop each data structure in relative
isolation from the rest of the solution
– A natural extension of procedural abstraction

Abstract data type (ADT)
• An ADT is composed of
• A collection of data
• A set of operations on that data
• Specifications of an ADT indicate
• What the ADT operations do, not how to implement
them
• Implementation of an ADT
• Includes choosing a particular data structure
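
As a small illustration, the sketch below treats a bag as an ADT: callers see only the add, remove, and query operations, while the list that stores the items is an implementation choice that could be swapped for another data structure. The Bag class and its method names are invented for this example.

```python
class Bag:
    """ADT specification: add an item, remove an item, ask about the contents."""

    def __init__(self):
        self._items = []  # hidden choice of data structure

    def add(self, item):
        self._items.append(item)

    def remove(self, item):
        """Remove one occurrence; report whether anything was removed."""
        if item in self._items:
            self._items.remove(item)
            return True
        return False

    def contains(self, item):
        return item in self._items

    def size(self):
        return len(self._items)


bag = Bag()
bag.add("apple")
bag.add("apple")
removed = bag.remove("apple")
```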

"Luck is what happens
when preparation meets
opportunity."
Good Day
