Part I
Source:
Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
Companion slides for the text by Dr. M. H. Dunham, Data Mining: Introductory and Advanced Topics,
Prentice Hall, 2002.
PART I
Introduction
Related Concepts
Data Mining Techniques
PART II
Classification
Clustering
Association Rules
PART III
Web Mining
Spatial Mining
Temporal Mining
Introduction Outline
Goal: Provide an overview of data mining.
Introduction
Data
Operational data
Output
Precise
Subset of database
Query
Poorly defined
No precise query language
Data
Not operational data
Output
Fuzzy
Not a subset of database
Query Examples
Database
Find all credit applicants with last name of Smith.
Identify customers who have purchased more
than $10,000 in the last month.
Find all customers who have purchased milk.
Data Mining
Find all credit applicants who are poor credit risks. (classification)
Identify customers with similar buying habits.
(Clustering)
Find all items which are frequently purchased
with milk. (association rules)
Supervised learning
Pattern recognition
Prediction
Unsupervised learning
Segmentation
Partitioning
Characterization
Generalization
Affinity Analysis
Association Rules
Sequential Analysis determines sequential patterns.
KDD Process
Selection:
Select log data (dates and locations) to use
Preprocessing:
Remove identifying URLs
Remove error logs
Transformation:
Sessionize (sort and group)
Data Mining:
Identify and count patterns
Construct data structure
Interpretation/Evaluation:
Identify and display frequently accessed sequences.
Potential User Applications:
Cache prediction
Personalization
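The web log steps above can be sketched in a few lines. This is a minimal illustration only: the log records, the 30-minute session gap, and the function names sessionize and count_sequences are assumptions made for the example, not part of the slides.

```python
from collections import Counter, defaultdict

# Hypothetical, already selected and cleaned log records: (user, time in seconds, page).
LOG = [
    ("u1", 0, "A"), ("u1", 10, "B"), ("u1", 20, "C"),
    ("u2", 5, "A"), ("u2", 15, "B"),
    ("u1", 5000, "A"), ("u1", 5010, "B"),
]

def sessionize(log, gap=1800):
    """Transformation: sort by user and time, then start a new session
    whenever two consecutive clicks are more than `gap` seconds apart."""
    by_user = defaultdict(list)
    for user, ts, page in sorted(log, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, page))
    sessions = []
    for clicks in by_user.values():
        current = [clicks[0][1]]
        for (prev_ts, _), (ts, page) in zip(clicks, clicks[1:]):
            if ts - prev_ts > gap:
                sessions.append(current)
                current = []
            current.append(page)
        sessions.append(current)
    return sessions

def count_sequences(sessions, length=2):
    """Data mining: count every contiguous page sequence of the given length."""
    counts = Counter()
    for pages in sessions:
        for i in range(len(pages) - length + 1):
            counts[tuple(pages[i:i + length])] += 1
    return counts

sessions = sessionize(LOG)
frequent = count_sequences(sessions).most_common(1)  # evaluation: most frequent sequence
```

Here a Counter stands in for the "data structure" step; a real system would more likely use a trie or suffix tree to hold longer sequences.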
Similarity Measures
Hierarchical Clustering
IR Systems
Imprecise Queries
Textual Data
Web Search Engines
Bayes Theorem
Regression Analysis
EM Algorithm
K-Means Clustering
Time Series Analysis
Neural Networks
Decision Tree Algorithms
KDD Issues
Human Interaction
Overfitting
Outliers
Interpretation
Visualization
Large Datasets
High Dimensionality
Multimedia Data
Missing Data
Irrelevant Data
Noisy Data
Changing Data
Integration
Application
Social Implications of DM
Privacy
Profiling
Unauthorized use
Usefulness
Return on Investment (ROI)
Accuracy
Space/Time
Scalability
Real World Data
Updates
Ease of Use
Visualization Techniques
Graphical
Geometric
Icon-based
Pixel-based
Hierarchical
Hybrid
Database/OLTP Systems
Fuzzy Sets and Logic
Information Retrieval (Web Search Engines)
Dimensional Modeling
Data Warehousing
OLAP/DSS
Statistics
Machine Learning
Pattern Matching
Schema
(ID,Name,Address,Salary,JobNo)
Data Model
ER
Relational
Transaction
Query:
SELECT Name
FROM T
WHERE Salary > 100000
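To make the precise-query idea concrete, the query above can be run against an in-memory SQLite table built from the schema on this slide. The column types, the sample rows, and the helper name high_earners are assumptions for illustration.

```python
import sqlite3

def high_earners(rows):
    """Load rows into table T and run the slide's query against it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE T (ID INTEGER, Name TEXT, Address TEXT, Salary REAL, JobNo INTEGER)"
    )
    conn.executemany("INSERT INTO T VALUES (?, ?, ?, ?, ?)", rows)
    names = [r[0] for r in conn.execute("SELECT Name FROM T WHERE Salary > 100000")]
    conn.close()
    return names

# Hypothetical rows, not from the source.
ROWS = [
    (1, "Smith", "Dallas", 120000, 10),
    (2, "Jones", "Houston", 90000, 11),
    (3, "Lee", "Chicago", 150000, 12),
]
```

The result is an exact subset of the stored data, which is the contrast the slides draw with data mining output.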
f(x): Probability x is in F.
EX:
Fuzzy Sets
Classification/Prediction is Fuzzy
[Figure: Accept and Reject regions plotted against loan amount, comparing Simple and Fuzzy classification]
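A small sketch may help contrast a simple (crisp) decision with a fuzzy one. The linear ramp membership function, the loan thresholds, and the function names are all assumptions chosen for illustration; the slides do not specify them.

```python
def membership(x, low, high):
    """Membership f(x) in [0, 1]: 0 at or below `low`, 1 at or above `high`,
    rising linearly in between (one common, assumed shape)."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)

def simple_decision(loan_amount, limit):
    """Crisp classification: a single cut-off separates Accept from Reject."""
    return "Accept" if loan_amount <= limit else "Reject"

def fuzzy_reject_degree(loan_amount, low, high):
    """Fuzzy classification: degree to which the loan belongs to 'Reject'."""
    return membership(loan_amount, low, high)

def fuzzy_accept_degree(loan_amount, low, high):
    """Degree of membership in 'Accept', taken here as 1 - f(x)."""
    return 1.0 - membership(loan_amount, low, high)
```

With hypothetical thresholds low=10000 and high=20000, a loan of 15000 belongs to both Accept and Reject to degree 0.5, whereas the simple rule forces it into exactly one class.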
Information Retrieval
Information Retrieval (cont'd)
IR Query Result Measures and Classification
[Figure: query result measures shown side by side for IR and Classification]
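The slide text does not name the measures, but precision and recall are the usual IR query result measures, so the sketch below assumes those two. The document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved documents that are relevant.
    Recall: fraction of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: four documents returned, three relevant overall.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
```

The same counts carry over to classification if "retrieved" is read as "assigned to the class" and "relevant" as "actually in the class".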
Dimensional Modeling
LocID       Date    Quantity  UnitPrice
Dallas      022900  5         25
Houston     020100  10        20
Dallas      031500  1         100
Dallas      031500  5         95
Fort Worth  021000  5         80
Chicago     012000  20        75
Seattle     030100  5         50
Rochester   021500  200       5
Bradenton   022000  15        20
Chicago     012000  10        25
Aggregation Hierarchies
Star Schema
Data Warehousing
             Operational Data    Data Warehouse
Application  OLTP                OLAP
Use          Precise Queries     Ad Hoc
Temporal     Snapshot            Historical
Modification Dynamic             Static
Orientation  Application         Business
Data         Operational Values  Integrated
Size         Gigabits            Terabits
Level        Detailed            Summarized
Access       Often               Less Often
Response     Few Seconds         Minutes
Data Schema  Relational          Star/Snowflake
OLAP
OLAP Operations
Roll Up
Drill Down
Single Cell
Multiple Cells
Slice
Dice
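The OLAP operations can be tried on the sales rows from the Dimensional Modeling table. In this sketch, revenue is taken as Quantity times UnitPrice and dates are read as MMDDYY; both are assumptions, as are the function names.

```python
from collections import defaultdict

# Rows copied from the sales table: (LocID, Date, Quantity, UnitPrice).
SALES = [
    ("Dallas", "022900", 5, 25),
    ("Houston", "020100", 10, 20),
    ("Dallas", "031500", 1, 100),
    ("Dallas", "031500", 5, 95),
    ("Fort Worth", "021000", 5, 80),
    ("Chicago", "012000", 20, 75),
    ("Seattle", "030100", 5, 50),
    ("Rochester", "021500", 200, 5),
    ("Bradenton", "022000", 15, 20),
    ("Chicago", "012000", 10, 25),
]

def roll_up(sales, key):
    """Roll up: aggregate revenue to a coarser level chosen by `key`."""
    totals = defaultdict(int)
    for row in sales:
        _, _, quantity, unit_price = row
        totals[key(row)] += quantity * unit_price
    return dict(totals)

def slice_location(sales, loc_id):
    """Slice: fix one dimension (location) to a single value."""
    return [row for row in sales if row[0] == loc_id]

def dice(sales, loc_ids, months):
    """Dice: restrict several dimensions to sets of values."""
    return [row for row in sales if row[0] in loc_ids and row[1][:2] in months]

revenue_by_city = roll_up(SALES, key=lambda row: row[0])
revenue_by_month = roll_up(SALES, key=lambda row: row[1][:2])  # assumes MMDDYY dates
```

Drill down is the reverse of roll up: starting from revenue_by_city, returning to the individual rows for one city (for example via slice_location) exposes the finer level of detail.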
Statistics
Machine Learning
Pattern Matching (Recognition)
         Query     Data              Results  Output
DB/OLTP  Precise   Database          Precise  DB Objects or Aggregation
IR       Precise   Documents         Vague    Documents
OLAP     Analysis  Multidimensional  Precise  DB Objects or Aggregation
DM       Vague     Preprocessed      Vague    KDD Objects
Data Mining Techniques
Statistical
Point Estimation
Bayes Theorem
Hypothesis Testing
Similarity Measures
Decision Trees
Neural Networks
Activation Functions
Genetic Algorithms
Point Estimation
Estimation Error
Why square?
Root Mean Square Error (RMSE)
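As a rough illustration of the two error measures, the sketch below computes the mean squared error and its root for a set of estimates of a known value. Squaring keeps positive and negative errors from cancelling and penalizes large errors more heavily. The function names and sample numbers are assumptions.

```python
import math

def mse(estimates, true_value):
    """Mean squared error: average of the squared differences from the true value."""
    return sum((e - true_value) ** 2 for e in estimates) / len(estimates)

def rmse(estimates, true_value):
    """Root mean square error: square root of the MSE, in the units of the data."""
    return math.sqrt(mse(estimates, true_value))

def mean_error(estimates, true_value):
    """Unsquared average error, shown only to illustrate cancellation."""
    return sum(e - true_value for e in estimates) / len(estimates)
```

For the hypothetical estimates [2, 4] of a true value 3, mean_error is 0 even though both estimates are off by 1, while mse and rmse both report 1.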
Jackknife Estimate
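A jackknife estimate can be sketched as follows: recompute the estimator with each observation left out in turn, then average the leave-one-out results. The sample values and the function name are assumptions for illustration.

```python
def jackknife(values, estimator):
    """Return (jackknife estimate, list of leave-one-out estimates)."""
    n = len(values)
    leave_one_out = [estimator(values[:i] + values[i + 1:]) for i in range(n)]
    return sum(leave_one_out) / n, leave_one_out

def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical sample.
estimate, parts = jackknife([1, 5, 10, 4], mean)
```

For the mean, the jackknife estimate coincides with the ordinary sample mean; the spread of the leave-one-out values gives a sense of how sensitive the estimator is to individual observations.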
Maximize L.
MLE Example
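The slide's own example is not preserved in this text, so the sketch below uses a hypothetical coin-toss sample instead. The likelihood of heads probability p given h heads and t tails is L(p) = p^h (1 - p)^t, and the maximum likelihood estimate is the p that maximizes L. The grid search is an assumed, deliberately simple way to find it.

```python
def likelihood(p, tosses):
    """L(p) = p^h * (1 - p)^t for a sequence of 'H'/'T' outcomes."""
    heads = tosses.count("H")
    tails = len(tosses) - heads
    return p ** heads * (1 - p) ** tails

def mle_grid(tosses, steps=1000):
    """Maximize L over an evenly spaced grid of candidate p values in [0, 1]."""
    candidates = (i / steps for i in range(steps + 1))
    return max(candidates, key=lambda p: likelihood(p, tosses))

def mle_closed_form(tosses):
    """Setting dL/dp = 0 gives p = h / (h + t)."""
    return tosses.count("H") / len(tosses)

TOSSES = list("HHHHT")  # hypothetical sample: four heads, one tail
```

Both routes agree on p = 0.8 for this sample, which is simply the observed fraction of heads.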
Expectation-Maximization (EM)
EM Example
EM Algorithm
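A minimal EM sketch, assuming the task is to estimate a mean when some observations are missing: the expectation step fills each missing value with the current estimate, and the maximization step recomputes the mean from the completed data. The sample values, initial guess, and tolerance are hypothetical.

```python
def em_mean(observed, n_missing, guess, tol=1e-6, max_iter=1000):
    """Estimate the mean of len(observed) + n_missing items, where only
    `observed` is known, by alternating E and M steps until the change is below tol."""
    n = len(observed) + n_missing
    mu = guess
    for _ in range(max_iter):
        # E-step: expected value of each missing item is the current mean.
        completed_sum = sum(observed) + n_missing * mu
        # M-step: maximum likelihood mean of the completed data.
        new_mu = completed_sum / n
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

# Hypothetical data: four known values, two missing, initial guess 3.
estimate = em_mean([1, 5, 10, 4], n_missing=2, guess=3.0)
```

The iterates here move 4.33, 4.78, 4.93, ... toward 5, the mean of the observed values, which is the fixed point of the update.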
Bayes Theorem
Bayes Theorem:
Credit     Income 1  Income 2  Income 3  Income 4
Excellent  x1        x2        x3        x4
Good       x5        x6        x7        x8
Bad        x9        x10       x11       x12
Bayes Example (cont'd)
Training Data:
ID  Income  Credit     Class  xi
1   4       Excellent  h1     x4
2   3       Good       h1     x7
3   2       Excellent  h1     x2
4   3       Good       h1     x7
5   4       Good       h1     x8
6   2       Excellent  h1     x2
7   3       Bad        h2     x11
8   2       Bad        h2     x10
9   3       Bad        h3     x11
10  1       Bad        h4     x9
Bayes Example (cont'd)
P(h1|x4) = P(x4|h1) P(h1) / P(x4)
= (1/6)(0.6)/0.1 = 1.
So x4 is assigned to class h1.
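The calculation above can be reproduced directly from the training data by counting. The (class, xi) pairs below are copied from the training table; the function name posterior is an assumption.

```python
# (Class, xi) for training tuples 1..10.
TRAINING = [
    ("h1", "x4"), ("h1", "x7"), ("h1", "x2"), ("h1", "x7"), ("h1", "x8"),
    ("h1", "x2"), ("h2", "x11"), ("h2", "x10"), ("h3", "x11"), ("h4", "x9"),
]

def posterior(h, x, data):
    """P(h|x) = P(x|h) P(h) / P(x), with each probability estimated by counting."""
    n = len(data)
    n_h = sum(1 for c, _ in data if c == h)
    n_x = sum(1 for _, v in data if v == x)
    n_hx = sum(1 for c, v in data if c == h and v == x)
    p_h = n_h / n            # prior of the class
    p_x = n_x / n            # probability of the observed value
    p_x_given_h = n_hx / n_h # likelihood of the value within the class
    return p_x_given_h * p_h / p_x

p_h1_x4 = posterior("h1", "x4", TRAINING)    # (1/6)(0.6)/0.1
p_h2_x11 = posterior("h2", "x11", TRAINING)  # (1/2)(0.2)/0.2
```

Every x4 tuple in the training data belongs to h1, so the posterior is 1 (up to floating-point rounding); x11 is split evenly between h2 and h3, so each gets 0.5.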
Regression
Linear Regression
Correlation
1 = perfect correlation
-1 = perfect but opposite correlation
0 = no correlation
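The correlation values listed above are those of a correlation coefficient; the sketch below assumes the Pearson coefficient and pairs it with a least-squares fit of a line y = intercept + slope * x. Function names and sample data are illustrative only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient r, in the range [-1, 1]."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def linear_fit(xs, ys):
    """Least-squares line: returns (slope, intercept)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx
```

For example, pearson([1, 2, 3], [2, 4, 6]) is 1 (up to rounding) and pearson([1, 2, 3], [6, 4, 2]) is -1, matching the first two cases in the list.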
Similarity Measures
Distance Measures
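The specific measures are not preserved in this text. The sketch below assumes a few commonly used ones over numeric vectors: Dice, Jaccard, and cosine similarity, plus Euclidean and Manhattan distance.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dice(a, b):
    return 2 * _dot(a, b) / (_dot(a, a) + _dot(b, b))

def jaccard(a, b):
    return _dot(a, b) / (_dot(a, a) + _dot(b, b) - _dot(a, b))

def cosine(a, b):
    return _dot(a, b) / math.sqrt(_dot(a, a) * _dot(b, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))
```

Similarity grows as two items become more alike (reaching 1 for identical vectors with these three measures), while distance shrinks to 0; the two families are often used interchangeably with the comparison reversed.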
Decision Trees
Decision Tree
Algorithm to create the tree
Algorithm that applies the tree to data
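Of the parts listed above, the one that applies a tree to data is the simplest to sketch. The tree below is hand-built and hypothetical (its attributes, thresholds, and labels are not from the slides); building such a tree from training data is a separate algorithm not shown here.

```python
# Hypothetical tree: internal nodes test one numeric attribute against a threshold,
# leaves are class labels.
TREE = {
    "attribute": "income", "threshold": 2,
    "low": "Reject",
    "high": {
        "attribute": "debt", "threshold": 5,
        "low": "Accept",
        "high": "Reject",
    },
}

def classify(tree, record):
    """Walk from the root to a leaf, following the branch chosen by each test."""
    node = tree
    while isinstance(node, dict):
        branch = "low" if record[node["attribute"]] <= node["threshold"] else "high"
        node = node[branch]
    return node
```

Each test splits on a single attribute, which is why the resulting class regions are rectangular.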
DT Advantages/Disadvantages
Advantages:
Easy to understand.
Easy to generate rules
Disadvantages:
May suffer from overfitting.
Classifies by rectangular partitioning.
Does not easily handle nonnumeric data.
Can be quite large; pruning is necessary.
Neural Networks
NN Node
NN Activation Functions
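The formulas for the node and its activation functions are not preserved here. As a sketch, the code below assumes a node that applies an activation function to the weighted sum of its inputs, and shows four commonly used activation functions; the parameter names c, t, and v are assumptions.

```python
import math

def threshold(s, t=0.0):
    """Step function: fires (1) only when the input exceeds t."""
    return 1.0 if s > t else 0.0

def sigmoid(s, c=1.0):
    """S-shaped curve with output in (0, 1); c controls the steepness."""
    return 1.0 / (1.0 + math.exp(-c * s))

def hyperbolic_tangent(s):
    """Like the sigmoid but centered on 0, with output in (-1, 1)."""
    return math.tanh(s)

def gaussian(s, v=1.0):
    """Bell-shaped curve peaking at s = 0; v controls the width."""
    return math.exp(-(s * s) / v)

def node_output(inputs, weights, activation):
    """Output of one node: activation applied to the weighted sum of its inputs."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return activation(s)
```

For instance, inputs [1, 2] with weights [0.5, -0.25] sum to 0, so a sigmoid node outputs 0.5 and a threshold node outputs 0.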
NN Learning
Neural Networks
NN Advantages
Learning
Can continue learning even after training set
has been applied.
Easy parallelization
Solves many problems
NN Disadvantages
Difficult to understand
May suffer from overfitting
Structure of graph must be determined a priori.
Input values must be numeric.
Verification difficult.
Genetic Algorithms
Crossover Examples
a) Single Crossover
Parents:   000 000     111 111
Children:  000 111     111 000
b) Multiple Crossover
Parents:   000 000 00  111 111 11
Children:  000 111 00  111 000 11
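The two crossover examples can be reproduced with a few lines of string slicing. Individuals are represented here as bit strings, and the function names and crossover positions are assumptions chosen to match the examples.

```python
def single_crossover(parent1, parent2, point):
    """Swap everything after `point` between the two parents."""
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

def multiple_crossover(parent1, parent2, start, end):
    """Swap only the segment between `start` and `end` (two crossover points)."""
    child1 = parent1[:start] + parent2[start:end] + parent1[end:]
    child2 = parent2[:start] + parent1[start:end] + parent2[end:]
    return child1, child2

single = single_crossover("000000", "111111", 3)
multiple = multiple_crossover("00000000", "11111111", 3, 6)
```

Here single is ("000111", "111000") and multiple is ("00011100", "11100011"), the children shown in the two examples.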
Genetic Algorithm
GA Advantages/Disadvantages
Advantages
Easily parallelized
Disadvantages
Difficult to understand and explain to end users.
Abstraction of the problem and method to
represent individuals is quite difficult.
Determining fitness function is difficult.
Determining how to perform crossover and
mutation is difficult.