
Data Mining

Last lecture:
• Counting co-occurrences
• Mining rules
• Association rule

Today's topics:
• Sequential pattern
• Classification rule
• Regression rule
• Tree structure rule
• Neural network
• Clustering
• Similarity search over sequences
• ODBS
  – Object, object identifier
  – Inheritance
10/15/08 Jyotsna Chauhan 1
Sequential Pattern
A sequence of actions or events is sought.
Example: If a patient underwent cardiac surgery for blocked
arteries and later developed high blood urea within a year of
surgery, he or she is likely to suffer from kidney failure within
approximately the next 18 months.
Detection of sequential patterns is equivalent to detecting
association among events with certain temporal relationships.



Sequential pattern

• Example
  – A sequential rule A → B says that event A will be immediately followed by event B with a certain confidence
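A minimal sketch of how the confidence of a rule A → B could be estimated from an event log. The function name `rule_confidence` and the sample log are hypothetical, and confidence is taken here to mean the fraction of occurrences of A that are immediately followed by B.

```python
def rule_confidence(events, a, b):
    """Fraction of occurrences of `a` that are immediately followed by `b`."""
    # Only occurrences of `a` that have a successor event can be checked.
    a_positions = [i for i in range(len(events) - 1) if events[i] == a]
    if not a_positions:
        return 0.0
    followed = sum(1 for i in a_positions if events[i + 1] == b)
    return followed / len(a_positions)


# Hypothetical event log, for illustration only.
log = ["A", "B", "C", "A", "B", "A", "C", "A", "B"]
print(f"confidence(A -> B) = {rule_confidence(log, 'A', 'B'):.2f}")
```

In this log, A has a successor four times and is followed by B in three of them.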



Classification
• Given a collection of records:
  – Each record contains a set of attributes; one of the attributes is the class.
• Classification rules help assign new objects to classes.
  – E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk, or high risk?
• Classification rules could use a variety of data, such as educational level, salary, age, etc.
  – ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
  – ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good
• Rules are not necessarily exact: there may be some misclassifications.
• Classification rules can be shown compactly as a decision tree.
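The two credit rules above can be written directly as code. This is only an illustrative sketch: the function name `classify_credit` is hypothetical, and returning `None` when no rule applies is an assumption, since the slide does not say what happens in that case.

```python
def classify_credit(degree, income):
    """Apply the two example rules; return None when neither rule fires."""
    if degree == "masters" and income > 75_000:
        return "excellent"
    if degree == "bachelors" and 25_000 <= income <= 75_000:
        return "good"
    return None  # assumption: the slide gives no rule for other cases


print(classify_credit("masters", 80_000))
print(classify_credit("bachelors", 40_000))
print(classify_credit("bachelors", 90_000))
```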
Decision Tree



Classification: Application 1
• Direct Marketing
  – Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
  – Approach:
    – Use the data for a similar product introduced before.
    – We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
    – Collect various demographic, lifestyle, and company-interaction related information about all such customers (type of business, where they stay, how much they earn, etc.).
    – Use this information as input attributes to learn a classifier model.


Classification: Application 2
• Fraud Detection
  – Goal: Predict fraudulent cases in credit card transactions.
  – Approach:
    – Use credit card transactions and the information on the account holder as attributes (when a customer buys, what he or she buys, how often he or she pays on time, etc.).
    – Label past transactions as fraud or fair transactions. This forms the class attribute.
    – Learn a model for the class of the transactions.
    – Use this model to detect fraud by observing credit card transactions on an account.

Regression
• Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
• Widely studied in statistics and in the neural network field.
• Examples:
  – Predicting the sales amount of a new product based on advertising expenditure.
  – Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  – Time series prediction of stock market indices.
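As a hedged illustration of the linear case, the sketch below fits a straight line by ordinary least squares. The function name `fit_line` and the advertising/sales figures are invented for the example and do not come from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: advertising expenditure vs. sales (lies exactly on y = 2x + 1).
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(ad_spend, sales)
predicted = slope * 5.0 + intercept  # predicted sales for a spend of 5.0
print(slope, intercept, predicted)
```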



Tree Structure Rule
To discover classification and regression rules from a
relation, we build a tree called a classification and regression
tree.
A classification tree is called a DECISION TREE.
The following tree is a decision tree.



Decision Tree
• A decision tree is a tree (binary in the algorithm given in these slides) that represents a collection of classification rules.
• The tree is traversed from the root to a leaf.
• Each internal node is labeled with a predictor attribute.
• These attributes are called splitting attributes, because the data is split according to a condition over the attribute.
• The predicate is written on the outgoing edge of the node.



Algorithm to Build Decision Tree
INPUT: node n, partition P, split selection method S
OUTPUT: decision tree for P rooted at node n
BEGIN BuildTree(n, P, S):
  I.  Apply S to P to find a splitting criterion.
  II. If (a good splitting criterion is found) then
        a. Create children n1, n2 of n
        b. Partition P into P1, P2
        c. BuildTree(n1, P1, S)
        d. BuildTree(n2, P2, S)
      Endif
END.
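The recursion above can be sketched in Python. Several things here are assumptions rather than part of the slide: records are tuples whose last field is the class label, the split selection method is a simple Gini-impurity search over numeric thresholds (`best_gini_split`), a split counts as "good" only if it strictly lowers impurity, and the function returns the node instead of taking it as an argument.

```python
from collections import Counter


def gini(rows):
    """Gini impurity of the class labels (last field of each row)."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_gini_split(rows):
    """Split selection method S: return (attribute index, threshold) or None."""
    best, best_score = None, gini(rows)
    for attr in range(len(rows[0]) - 1):
        for threshold in sorted({row[attr] for row in rows}):
            left = [r for r in rows if r[attr] <= threshold]
            right = [r for r in rows if r[attr] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:  # "good" means strictly less impure
                best, best_score = (attr, threshold), score
    return best


def build_tree(rows, select_split):
    """Build a binary decision tree for partition `rows`."""
    node = {"label": Counter(r[-1] for r in rows).most_common(1)[0][0]}
    split = select_split(rows)
    if split is not None:
        attr, threshold = split
        node["split"] = split
        node["left"] = build_tree([r for r in rows if r[attr] <= threshold], select_split)
        node["right"] = build_tree([r for r in rows if r[attr] > threshold], select_split)
    return node


def predict(node, record):
    """Route a record from the root to a leaf and return the leaf's label."""
    while "split" in node:
        attr, threshold = node["split"]
        node = node["left"] if record[attr] <= threshold else node["right"]
    return node["label"]


# Hypothetical training data: (age, income in thousands, risk class).
rows = [(25, 30, "high"), (30, 80, "low"), (45, 40, "low"), (22, 60, "high"), (50, 90, "low")]
tree = build_tree(rows, best_gini_split)
print(tree["split"], predict(tree, (28, 50)), predict(tree, (20, 100)))
```

Any other split selection method with the same signature could be passed in place of `best_gini_split`.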
Clustering



Cluster Analysis
• A cluster
  – A collection of data objects
  – Similar to one another within the same cluster
  – Dissimilar to the objects in other clusters
• Cluster analysis
  – Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no predefined classes
• Typical applications
  – As a stand-alone tool to get insight into data distribution
  – As a preprocessing step for other algorithms
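The slides do not name a clustering algorithm. Purely as an illustration, the sketch below uses k-means on one-dimensional values; the function name `kmeans_1d`, the starting centers, and the claim amounts are all hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny k-means on numbers; `centers` are the caller's initial guesses."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical insurance claim costs with two obvious groups.
claims = [100, 120, 110, 900, 950, 1000]
centers, clusters = kmeans_1d(claims, centers=[0, 500])
print(centers, clusters)
```

No class labels are supplied: the two groups emerge only from the similarity of the values, which is what "unsupervised" means here.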





Clustering Applications
• Pattern recognition
• Spatial data analysis
  – Create thematic maps in geographical information systems by clustering feature spaces
• Image processing
• Market research
• WWW
  – Document classification
  – Cluster Weblog data to discover groups of similar access patterns


Examples of Clustering Applications
• Marketing
  – Help marketers discover distinct groups of customers, and then use this knowledge to develop targeted marketing programs
• Insurance
  – Identifying groups of motor insurance policy holders with a high average claim cost
• City planning
  – Identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies
  – Observed earthquake epicenters should be clustered along continental faults

Similarity search over sequences
Information is stored in a database in a particular sequence.
We assume that the user specifies a “query sequence” and
wants to retrieve all data sequences that are similar to the query
sequence.
Let us begin by describing sequences and similarity between
sequences.
• A data sequence $X$ is a series of numbers $X = \langle x_1, \ldots, x_k \rangle$.
• $X$ is also called a time series, and $k$ is called the length of the sequence.
A subsequence $Z = \langle z_1, \ldots, z_j \rangle$ is obtained from another
sequence $X$ by deleting numbers from the front and back of $X$.
$Z$ is a subsequence of $X$ if $z_1 = x_i,\ z_2 = x_{i+1},\ \ldots,\ z_j = x_{i+j-1}$
for some $i \in \{1, \ldots, k-j+1\}$.
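A short sketch of this definition: a sequence of length k has k − j + 1 contiguous subsequences of length j, one per start position. The function name `subsequences` and the sample numbers are hypothetical.

```python
def subsequences(x, j):
    """All contiguous subsequences of length j of the sequence x."""
    k = len(x)
    # 0-based start positions 0 .. k-j (that is, k - j + 1 windows in total).
    return [x[start:start + j] for start in range(k - j + 1)]


series = [3, 1, 4, 1, 5]
print(subsequences(series, 3))
```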



Similarity queries over sequences can be classified into two types:
• Complete sequence matching: The query sequence and the sequences in the database have the same length. Given a user-specified threshold parameter ε, our goal is to retrieve all sequences in the database that are within distance ε of the query sequence.
• Subsequence matching: The query sequence is shorter than the sequences in the database. In this case, we want to find all subsequences of sequences in the database such that the subsequence is within distance ε of the query sequence.
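Both query types can be sketched as brute-force scans. The slides do not say which distance measure is meant, so Euclidean distance is an assumption here, and the function names and sample database are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length sequences (an assumed measure)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def complete_matches(database, query, eps):
    """Sequences of the same length as the query that lie within eps of it."""
    return [s for s in database if len(s) == len(query) and distance(s, query) <= eps]


def subsequence_matches(database, query, eps):
    """(sequence index, start offset) of every window within eps of the query."""
    j = len(query)
    hits = []
    for idx, s in enumerate(database):
        for start in range(len(s) - j + 1):
            if distance(s[start:start + j], query) <= eps:
                hits.append((idx, start))
    return hits


database = [[1, 2, 3, 4], [2, 2, 2, 2], [1, 2, 3, 5]]
print(complete_matches(database, [1, 2, 3, 4], eps=1.0))
print(subsequence_matches(database, [2, 3], eps=0.5))
```

A linear scan like this is only a baseline; it says nothing about how a real system would index the sequences.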



Neural network



Neural Computing
• Neuroscience
  – The objective is to understand the human brain
  – Biologically realistic models of neurons
  – Biologically realistic connection topologies
• Neural networks
  – The objective is to develop computation methods
  – Highly simplified artificial neurons
  – Connection topologies that are aimed at computational effectiveness


Why study neural computation?
• The motivation is that the brain can do amazing computations that we do not know how to do with a conventional computer.
  – Vision, language understanding, learning, ...
• It does them by using huge networks of slow neurons, each of which is connected to thousands of other neurons.
  – It is not at all like a conventional computer that has a big, passive memory and a very fast central processor that can only do one simple operation at a time.
• It learns to do these computations without any explicit programming.


The goals of neural computation
• To understand how the brain actually works
  – It is big and very complicated, and made of yucky stuff that dies when you poke it
• To understand a new style of computation
  – Inspired by neurons and their adaptive connections
  – Very different style from sequential computation
    – Should be good for things that brains are good at (e.g., vision)
    – Should be bad for things that brains are bad at (e.g., 23 x 71)
• To solve practical problems by using novel learning algorithms
  – Learning algorithms can be very useful even if they have nothing to do with how the brain works

Object Database System


Weaknesses of RDBMSs
• Representation of ‘real world’ entities: The process of normalization generally leads to the creation of relations that do not correspond to entities in the ‘real world’.
• Semantic overloading: The relational model has only one construct for representing data and data relationships: the relation.
• Homogeneous data: The relational model assumes both horizontal and vertical homogeneity. Also, the intersection of a row and a column must be an atomic value, which is restrictive for many ‘real world’ objects with a complex structure.


Weaknesses of RDBMSs
• Limited operations: The relational model has a fixed set of operations (provided in SQL) and does not allow new operations to be specified.
• Recursive queries: It is extremely difficult to produce recursive queries (queries about relationships that a relation has with itself).
• Impedance mismatch: The result of mixing different programming paradigms (e.g., SQL is a declarative language that handles rows of data, whereas a high-level language such as ‘C’ is a procedural language that can handle only one row at a time).
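The impedance mismatch can be seen in a few lines. The sketch uses Python's standard `sqlite3` module rather than C, and the `employee` table and its rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("Asha", 50000), ("Ravi", 70000), ("Meena", 60000)],  # made-up rows
)

# Declarative side: one SQL statement describes the whole result set.
(total_sql,) = conn.execute("SELECT SUM(salary) FROM employee").fetchone()

# Procedural side: the host language walks the result one row at a time.
total_loop = 0
for name, salary in conn.execute("SELECT name, salary FROM employee"):
    total_loop += salary

conn.close()
print(total_sql, total_loop)
```

Both totals agree, but the second form has to convert each row into host-language values before it can do any work, which is the mismatch the slide describes.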



Why more than RDBMSs?
Q: Which other applications have more requirements?
A:
• Text
• Multimedia
• Financial applications/forecasting
• Geographic information systems (GIS)
• CAD/CAM
• Network management


Object database systems have developed along two distinct paths:
• OODBS (object-oriented database system)
• ORDBS (object-relational database system)


What is an Object-Oriented Database (OODB)?
• A database system that incorporates all the important object-oriented concepts
• Some additional features
  – Unique object identifiers
  – Persistent object handling


Advantages of OODBS
• Designer can specify the structure of objects and their behavior (methods)
• Better interaction with object-oriented languages such as Java and C++
• Definition of complex and user-defined types
• Encapsulation of operations and user-defined methods


Object-Relational Database System
It is an attempt to extend the relational database system with the functionality necessary to support a broader class of applications and is, in many ways, a bridge between the relational and object-oriented paradigms.


Object-Oriented Concepts
• Abstract data types
  – Class definition; provides extension to complex attribute types
• Encapsulation
  – Implementation of operations and object structure hidden
• Inheritance
  – Sharing of data within hierarchy scope; supports code reusability
• Polymorphism
  – Operator overloading


Definition of an object
• Objects: user-defined complex data types
• An object has structure or state (variables) and methods (behavior/operations)
• An object in the OO environment is an abstract representation of a real-world entity that has a unique identity, embedded properties, and the ability to interact with other objects and itself.
• An object is described by four characteristics:
  – Identifier: a system-wide unique ID for an object
  – Name: an object may also have a unique name in the DB (optional)
  – Lifetime: determines whether the object is persistent or transient
  – Structure: construction of objects using type constructors

Object Structure
• The state (current value) of a complex object may be constructed from other objects (or other values) by using certain type constructors
• Can be represented by a triple (i, c, v)
  – i is a unique object identifier (OID)
  – c is a type constructor
  – v is the object state
• Constructors
  – Basic types: atom, tuple, and set
  – Collection types: list, bag, and array
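One possible way to picture the (i, c, v) triple is sketched below. The class name `ObjectTriple`, the OIDs, and the sample values are hypothetical; a set-valued object is assumed to hold the OIDs of its members.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ObjectTriple:
    oid: str          # i: unique object identifier
    constructor: str  # c: type constructor ("atom", "tuple", "set", ...)
    state: Any        # v: object state


# Two atomic objects and one set object that refers to them by OID.
o1 = ObjectTriple("i1", "atom", "Houston")
o2 = ObjectTriple("i2", "atom", "Bellaire")
o3 = ObjectTriple("i3", "set", frozenset({"i1", "i2"}))

store = {o.oid: o for o in (o1, o2, o3)}


def dereference_set(obj):
    """Follow the OIDs held by a set object and return the member states."""
    return sorted(store[oid].state for oid in obj.state)


print(dereference_set(o3))
```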



Object Structure
• An object has associated with it:
  – A set of variables that contain the data for the object. The value of each variable is itself an object.
  – A set of messages to which the object responds; each message may have zero, one, or more parameters.
  – A set of methods, each of which is a body of code to implement a message; a method returns a value as the response to the message.
• The physical representation of data is visible only to the implementer of the object.
• Messages and responses provide the only external interface to an object.


Object identity
• Represented by an object ID (OID)
• Unique to that object
• The OID is assigned by the system at the moment of the object’s creation
• Cannot be changed under any circumstances
• An OID can be deleted only if the object is deleted
• That OID can never be reused
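A minimal sketch of these rules, assuming a hypothetical in-memory `ObjectStore` that hands out OIDs from an ever-increasing counter. This is one simple way to guarantee that an OID is never reused, not a description of any particular system.

```python
import itertools


class ObjectStore:
    """Toy store in which the system, not the user, assigns OIDs."""

    def __init__(self):
        self._next_oid = itertools.count(1)  # only ever increases, so no reuse
        self._objects = {}

    def create(self, state):
        oid = next(self._next_oid)  # assigned at the moment of creation
        self._objects[oid] = state
        return oid

    def update(self, oid, state):
        self._objects[oid] = state  # the state changes, the OID does not

    def delete(self, oid):
        del self._objects[oid]  # the OID disappears only with its object

    def get(self, oid):
        return self._objects[oid]


store = ObjectStore()
a = store.create({"name": "first"})
b = store.create({"name": "second"})
store.delete(a)
c = store.create({"name": "third"})
store.update(b, {"name": "renamed"})
print(a, b, c, store.get(b))
```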



Object Identity
• An object retains its identity even if some or all of the values of variables or definitions of methods change over time.
• Object identity is a stronger notion of identity than in programming languages or data models not based on object orientation.
  – Value: data value; used in relational systems.
  – Name: supplied by user; used for variables in procedures.
  – Built-in: identity built into the data model or programming language.
    – No user-supplied identifier is required.
    – This is the form of identity used in object-oriented systems.


Object Identifiers
• Object identifiers are used to uniquely identify objects
  – Can be stored as a field of an object, to refer to another object.
    – E.g., the spouse field of a person object may be an identifier of another person object.
  – Can be system generated (created by the database) or external (such as a social-security number).
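The spouse example can be sketched with a dictionary keyed by identifier. The OIDs 101 and 102, the names, and the helper `spouse_name` are made up for illustration.

```python
# Each person object stores the identifier of another person object
# in its "spouse" field instead of storing a copy of that person.
people = {
    101: {"name": "Ann", "spouse": 102},
    102: {"name": "Bob", "spouse": 101},
}


def spouse_name(oid):
    """Follow the identifier in the spouse field to the referenced object."""
    return people[people[oid]["spouse"]]["name"]


print(spouse_name(101), spouse_name(102))
```

Because only the identifier is stored, renaming person 102 is immediately visible through person 101's spouse field.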



Object Containment

bicycle
  wheel
    rim
    spokes
    tire
  brake
    lever
    pad
    cable
  gear
  frame

Each component in a design may contain other components
Objects containing other objects are called complex or
composite objects.
Multiple levels of containment create a containment
hierarchy: links interpreted as IS-PART-OF, not IS-A.
Allows data to be viewed at different granularities by
different users.
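A containment hierarchy can be sketched as nested dictionaries, where each key IS-PART-OF its parent. The grouping of the lower-level parts under wheel and brake is an assumption about the original figure, and `parts_at_depth` is a hypothetical helper.

```python
# Assumed grouping: rim/spokes/tire belong to wheel, lever/pad/cable to brake.
bicycle = {
    "bicycle": {
        "wheel": {"rim": {}, "spokes": {}, "tire": {}},
        "brake": {"lever": {}, "pad": {}, "cable": {}},
        "gear": {},
        "frame": {},
    }
}


def parts_at_depth(tree, depth):
    """Names visible when the hierarchy is viewed `depth` levels down."""
    if depth == 0:
        return sorted(tree)
    names = []
    for children in tree.values():
        names.extend(parts_at_depth(children, depth - 1))
    return sorted(names)


print(parts_at_depth(bicycle, 0))
print(parts_at_depth(bicycle, 1))
print(parts_at_depth(bicycle, 2))
```

Each depth corresponds to a different granularity: one user may only care about the bicycle as a whole, another about individual spokes.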
Inheritance
• E.g., the class of bank customers is similar to the class of bank employees: both share some variables and messages, e.g., name and address. But there are variables and messages specific to each class, e.g., salary for employees and credit-rating for customers.
• Every employee is a person; thus employee is a specialization of person.
• Similarly, customer is a specialization of person.
• Create classes person, employee, and customer
  – Variables/messages applicable to all persons are associated with class person.
  – Variables/messages specific to employees are associated with class employee; similarly for customer.


Inheritance
• Place classes into a specialization/IS-A hierarchy
  – Variables/messages belonging to class person are inherited by class employee as well as customer
• The result is a class hierarchy:

person
  employee
    officer
    teller
    secretary
  customer

• Note the analogy with the ISA hierarchy in the E-R model


Class Hierarchy Definition
class person {
    string name;
    string address;
};
class customer isa person {
    int credit-rating;
};
class employee isa person {
    date start-date;
    int salary;
};
class officer isa employee {
    int office-number;
    int expense-account-number;
};
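For comparison, the same hierarchy could be sketched in Python with dataclasses. The hyphenated field names become underscores because Python identifiers cannot contain hyphens, and the sample officer is invented.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Person:
    name: str
    address: str


@dataclass
class Customer(Person):  # customer isa person
    credit_rating: int


@dataclass
class Employee(Person):  # employee isa person
    start_date: date
    salary: int


@dataclass
class Officer(Employee):  # officer isa employee
    office_number: int
    expense_account_number: int


# An officer inherits name/address from Person and start_date/salary from Employee.
officer = Officer("Asha", "12 Main St", date(2008, 10, 15), 50000, 7, 4242)
print(officer.name, officer.salary, isinstance(officer, Person))
```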
...
Abstract Data Types
• Modularity
  – Keeps the complexity of a large program manageable by systematically controlling the interaction of its components
  – Isolates errors
  – Eliminates redundancies
  – A modular program is
    – Easier to write
    – Easier to read
    – Easier to modify


• Procedural abstraction
  – Separates the purpose and use of a module from its implementation
  – A module’s specifications should
    – Detail how the module behaves
    – Identify details that can be hidden within the module
• Information hiding
  – Hides certain implementation details within a module
  – Makes these details inaccessible from outside the module


• Typical operations on data
  – Add data to a data collection
  – Remove data from a data collection
  – Ask questions about the data in a data collection
• Data abstraction
  – Asks you to think about what you can do to a collection of data independently of how you do it
  – Allows you to develop each data structure in relative isolation from the rest of the solution
  – A natural extension of procedural abstraction


• Abstract data type (ADT)
  – An ADT is composed of
    – A collection of data
    – A set of operations on that data
  – Specifications of an ADT indicate
    – What the ADT operations do, not how to implement them
  – Implementation of an ADT
    – Includes choosing a particular data structure


"Luck is what happens
when preparation meets
opportunity."

Good Day
