Introduction
1
Why Data Mining?
2
Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web
databases
2000s
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems
3
What Is Data Mining?
4
Knowledge Discovery (KDD) Process
(Figure: the KDD pipeline — databases are cleaned and integrated into a data warehouse; data selection yields the task-relevant data, which is mined and evaluated to produce knowledge.)
5
KDD Process: Several Key Steps
Learning the application domain
relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation
Find useful features, dimensionality/variable reduction, invariant
representation
Choosing functions of data mining
summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
6
Data Mining and Business Intelligence
(Figure: the business intelligence pyramid — from database technology at the base, through data exploration (statistical summary, querying, and reporting), up to data mining and decision making by the end user; the potential to support business decisions increases toward the top.)
(Figure: data mining as a confluence of multiple disciplines — database technology, statistics, machine learning, visualization, pattern recognition, algorithms, and other disciplines.)
8
Why Not Traditional Data Analysis?
Tremendous amount of data
Algorithms must be highly scalable to handle tera-bytes of data
High-dimensionality of data
Micro-array may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structured data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
9
Multi-Dimensional View of Data Mining
Data to be mined
Relational, data warehouse, transactional, stream, object-
oriented/relational, active, spatial, time-series, text, multi-media,
heterogeneous, legacy, WWW
Knowledge to be mined
Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.
10
Data Mining: Classification Schemes
General functionality
Descriptive data mining
Predictive data mining
Different views lead to different classifications
Data view: Kinds of data to be mined
Knowledge view: Kinds of knowledge to be discovered
Method view: Kinds of techniques utilized
Application view: Kinds of applications adapted
11
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structured data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia database
Text databases
The World-Wide Web
12
Data Mining Functionalities
Multidimensional concept description: Characterization and
discrimination
Generalize, summarize, and contrast data characteristics, e.g.,
dry vs. wet regions
Frequent patterns, association, correlation vs. causality
Diaper → Beer [0.5%, 75%] (Correlation or causality?)
Classification and prediction
Construct models (functions) that describe and distinguish
classes or concepts for future prediction
E.g., classify countries based on (climate), or classify cars
based on (gas mileage)
Predict some unknown or missing numerical values
13
Data Mining Functionalities (2)
Cluster analysis
Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
Outlier analysis
Outlier: Data object that does not comply with the general behavior
of the data
Noise or exception? Useful in fraud detection, rare events analysis
Periodicity analysis
Similarity-based analysis
14
Major Issues in Data Mining
Mining methodology
Mining different kinds of knowledge from diverse data types, e.g., bio, stream,
Web
Performance: efficiency, effectiveness, and scalability
Pattern evaluation: the interestingness problem
Incorporation of background knowledge
Handling noise and incomplete data
Parallel, distributed and incremental mining methods
Integration of the discovered knowledge with existing knowledge: knowledge fusion
User interaction
Data mining query languages and ad-hoc mining
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
Domain-specific data mining & invisible data mining
Protection of data security, integrity, and privacy
15
Are All the “Discovered” Patterns Interesting?
16
Find All and Only Interesting Patterns?
18
Primitives that Define a Data Mining Task
Task-relevant data
Database or data warehouse name
Database tables or data warehouse cubes
Condition for data selection
Relevant attributes or dimensions
Data grouping criteria
Type of knowledge to be mined
Characterization, discrimination, association, classification,
prediction, clustering, outlier analysis, other data mining tasks
Background knowledge
Pattern interestingness measurements
Visualization/presentation of discovered patterns
19
Primitive 3: Background Knowledge
20
Primitive 4: Pattern Interestingness Measure
Simplicity
e.g., (association) rule length, (decision) tree size
Certainty
e.g., confidence, P(A|B) = #(A and B)/ #(B), classification
reliability or accuracy, certainty factor, rule strength, rule quality,
discriminating weight, etc.
Utility
potential usefulness, e.g., support (association), noise threshold
(description)
Novelty
not previously known, surprising (used to remove redundant
rules, e.g., Illinois vs. Champaign rule implication support ratio)
21
Primitive 5: Presentation of Discovered Patterns
22
DMQL—A Data Mining Query Language
Motivation
A DMQL can provide the ability to support ad-hoc and
interactive data mining
By providing a standardized language like SQL
Hoping to achieve an effect similar to that of SQL on
relational databases
Foundation for system development and evolution
Facilitate information exchange, technology transfer,
commercialization and wide acceptance
Design
DMQL is designed with the primitives described earlier
23
An Example Query in DMQL
24
Other Data Mining Languages &
Standardization Efforts
Association rule language specifications
MSQL (Imielinski & Virmani’99)
MineRule (Meo Psaila and Ceri’96)
Query flocks based on Datalog syntax (Tsur et al’98)
OLEDB for DM (Microsoft’2000) and recently DMX (Microsoft
SQLServer 2005)
Based on OLE, OLE DB, OLE DB for OLAP, C#
Integrating DBMS, data warehouse and data mining
DMML (Data Mining Mark-up Language) by DMG (www.dmg.org)
Providing a platform and process structure for effective data mining
Emphasizing on deploying data mining technology to solve business
problems
25
Integration of Data Mining and Data Warehousing
26
Coupling Data Mining with DB/DW Systems
(Figure: architecture of a coupled system — a data mining engine with a pattern evaluation module and a knowledge base, sitting on top of a database or data warehouse server.)
28
What is Data Warehouse?
Defined in many different ways, but not rigorously.
A decision support database that is maintained separately from the
organization’s operational database
Support information processing by providing a solid platform of
consolidated, historical data for analysis.
“A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management’s
decision-making process.”—W. H. Inmon
Data warehousing:
The process of constructing and using data warehouses
29
Data Warehouse—Subject-
Oriented
30
Data Warehouse—Integrated
Constructed by integrating multiple, heterogeneous data
sources
relational databases, flat files, on-line transaction records
Data cleaning and data integration techniques are
applied.
Ensure consistency in naming conventions, encoding structures,
attribute measures, etc. among different data sources
E.g., Hotel price: currency, tax, breakfast covered, etc.
31
Data Warehouse—Time Variant
32
Data Warehouse—Nonvolatile
33
Data Warehouse vs.
Heterogeneous DBMS
34
Data Warehouse vs. Operational
DBMS
OLTP (on-line transaction processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking, manufacturing,
payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
Major task of data warehouse system
Data analysis and decision making
Distinct features (OLTP vs. OLAP):
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries
35
OLTP vs. OLAP
Aspect | OLTP | OLAP
users | clerk, IT professional | knowledge worker
function | day-to-day operations | decision support
DB design | application-oriented | subject-oriented
data | current, up-to-date, detailed, flat relational, isolated | historical, summarized, multidimensional, integrated, consolidated
usage | repetitive | ad-hoc
access | read/write, index/hash on primary key | lots of scans
unit of work | short, simple transaction | complex query
# records accessed | tens | millions
# users | thousands | hundreds
DB size | 100 MB–GB | 100 GB–TB
metric | transaction throughput | query throughput, response time
36
Why Separate Data Warehouse?
High performance for both systems
DBMS— tuned for OLTP: access methods, indexing, concurrency control,
recovery
Warehouse—tuned for OLAP: complex OLAP queries, multidimensional
view, consolidation
Different functions and different data:
missing data: Decision support requires historical data which operational
DBs do not typically maintain
data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
Note: There are more and more systems which perform OLAP
analysis directly on relational databases
37
From Tables and Spreadsheets to Data
Cubes
A data warehouse is based on a multidimensional data model which
views data in the form of a data cube
A data cube, such as sales, allows data to be modeled and viewed in
multiple dimensions
Dimension tables, such as item (item_name, brand, type), or time(day,
week, month, quarter, year)
Fact table contains measures (such as dollars_sold) and keys to each of
the related dimension tables
In data warehousing literature, an n-D base cube is called a base
cuboid. The top most 0-D cuboid, which holds the highest-level of
summarization, is called the apex cuboid. The lattice of cuboids
forms a data cube.
38
Chapter 3: Data Generalization, Data
Warehousing, and On-line Analytical
Processing
Data generalization and concept description
39
Cube: A Lattice of Cuboids
(Figure: the lattice of cuboids of a 4-D cube on time, item, location, supplier — the 0-D apex cuboid (all) at the top; 1-D and 2-D cuboids below it; 3-D cuboids such as time,item,location; time,item,supplier; time,location,supplier; item,location,supplier; and the 4-D base cuboid time, item, location, supplier at the bottom.)
40
Conceptual Modeling of Data
Warehouses
Modeling data warehouses: dimensions & measures
Star schema: A fact table in the middle connected to a set of
dimension tables
Snowflake schema: A refinement of star schema where some
dimensional hierarchy is normalized into a set of smaller
dimension tables, forming a shape similar to snowflake
Fact constellations: Multiple fact tables share dimension tables,
viewed as a collection of stars, therefore called galaxy schema or
fact constellation
41
Example of Star Schema
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Dimension tables:
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, state_or_province, country
42
Example of Snowflake
Schema
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Dimension tables (partially normalized):
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key → supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key → city: city_key, city, state_or_province, country
43
Example of Fact
Constellation
A Sales Fact Table (time_key, item_key, branch_key, location_key, …) and a Shipping Fact Table (time_key, item_key, shipper_key, from_location, …) share dimension tables such as time (time_key, day, day_of_the_week, month, quarter, year) and item (item_key, item_name, brand, type, supplier_type).
45
Defining Star Schema in DMQL
46
Defining Snowflake Schema in
DMQL
47
Defining Fact Constellation in
DMQL
48
Measures of Data Cube: Three
Categories
49
A Concept Hierarchy: Dimension
(location)
(Figure: a concept hierarchy for the location dimension, from all at the top down to individual locations.)
50
Multidimensional Data
(Figure: multidimensional data, e.g., sales volume viewed by product, month, and office/region, with hierarchical summarization paths such as day → month and office → region.)
51
A Sample Data Cube
(Figure: a sample 3-D sales data cube with dimensions product (TV, PC, VCR), country (U.S.A., Canada, Mexico), and date, including sum cells that aggregate over each dimension.)
52
Cuboids Corresponding to the
Cube
(Figure: the cuboids of the sample cube — the 0-D apex cuboid all; the 1-D cuboids product, date, country; the 2-D cuboids; and the 3-D base cuboid product, date, country.)
53
Typical OLAP Operations
Roll up (drill-up): summarize data
by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up, from higher-level summaries to more detailed data, or introducing new dimensions
Slice and dice: project and select
Pivot (rotate): reorient the cube for visualization
Other operations
drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its back-end
54
Fig. 3.10 Typical OLAP
Operations
55
Design of Data Warehouse: A
Business Analysis Framework
Four views regarding the design of a data warehouse
Top-down view
allows selection of the relevant information necessary for the
data warehouse
Data source view
exposes the information being captured, stored, and
managed by operational systems
Data warehouse view
consists of fact tables and dimension tables
Business query view
sees the perspectives of data in the warehouse from the view
of end-user
56
Data Warehouse Design
Process
Top-down, bottom-up approaches or a combination of both
Top-down: Starts with overall design and planning (mature)
Bottom-up: Starts with experiments and prototypes (rapid)
From software engineering point of view
Waterfall: structured and systematic analysis at each step before
proceeding to the next
Spiral: rapid generation of increasingly functional systems, with short
turn-around times
Typical data warehouse design process
Choose a business process to model, e.g., orders, invoices, etc.
Choose the grain (atomic level of data) of the business process
Choose the dimensions that will apply to each fact table record
Choose the measure that will populate each fact table record
57
Data Warehouse: A Multi-Tiered Architecture
(Figure: multi-tiered data warehouse architecture — operational databases and other sources are extracted, transformed, loaded, and refreshed, with metadata, monitoring, and integration, into the data warehouse and data marts; an OLAP server sits above the warehouse and serves front-end tools for analysis, query/reporting, and data mining.)
Virtual warehouse
A set of views over operational databases
Only some of the possible summary views may be materialized
59
Data Warehouse
Development: A
Recommended Approach
Multi-Tier Data
Warehouse
Distributed
Data Marts
61
Metadata Repository
Metadata is the data defining warehouse objects. It stores:
Description of the structure of the data warehouse
schema, view, dimensions, hierarchies, derived data defn, data mart
locations and contents
Operational meta-data
data lineage (history of migrated data and transformation path), currency
of data (active, archived, or purged), monitoring information (warehouse
usage statistics, error reports, audit trails)
The algorithms used for summarization
The mapping from operational environment to the data warehouse
Data related to system performance
warehouse schema, view and derived data definitions
Business data
business terms and definitions, ownership of data, charging policies
62
OLAP Server Architectures
63
Efficient Data Cube
Computation
Data cube can be viewed as a lattice of cuboids
The bottom-most cuboid is the base cuboid
The top-most cuboid (apex) contains only one cell
How many cuboids are there in an n-dimensional cube with L_i levels in dimension i?  T = Π_{i=1}^{n} (L_i + 1)  (see the sketch below)
Materialization of data cube
Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization)
Selection of which cuboids to materialize
Based on size, sharing, access frequency, etc.
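As a quick illustration of the cuboid-count formula above, here is a minimal Python sketch (it assumes the per-dimension hierarchy depths are given as a list):

# Sketch: count the cuboids of an n-dimensional cube, where levels[i] is the
# number of hierarchy levels L_i of dimension i (excluding the virtual "all" level).
def num_cuboids(levels):
    total = 1
    for li in levels:
        total *= (li + 1)   # each dimension contributes L_i + 1 choices (incl. "all")
    return total

# e.g., time (4 levels), item (2), location (3), supplier (1) -> 5*3*4*2 = 120 cuboids
print(num_cuboids([4, 2, 3, 1]))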
64
Data warehouse Implementation
65
Cube Operation
66
Multi-Way Array Aggregation
Simultaneous aggregation on multiple dimensions
Intermediate aggregate values are re-used for computing ancestor cuboids
Cannot do Apriori pruning: no iceberg optimization
(Figure: a small cuboid lattice — A, B, C; AB, AC, BC; ABC — showing how lower cuboids feed their ancestors.)
67
Multi-way Array Aggregation for
Cube Computation (MOLAP)
Partition arrays into chunks (a small subcube which fits in memory).
Compressed sparse array addressing: (chunk_id, offset)
Compute aggregates in “multiway” by visiting cube cells in the order which
minimizes the # of times to visit each cell, and reduces memory access and
storage cost.
(Figure: a 3-D array with dimensions A (a0–a3), B (b0–b3), and C (c0–c3), partitioned into 64 chunks numbered 1–64.)
What is the best traversing order to do multi-way aggregation?
68
Multi-way Array Aggregation for Cube Computation
(Figure: the same 64-chunk 3-D array, repeated over two slides to illustrate the chunk-by-chunk scan order.)
70
Multi-Way Array Aggregation
for Cube Computation (Cont.)
Method: the planes should be sorted and computed
according to their size in ascending order
Idea: keep the smallest plane in the main memory, fetch and
compute only one chunk at a time for the largest plane
Limitation of the method: it computes well only for cubes with a small
number of dimensions
If there are a large number of dimensions, “top-down”
computation and iceberg cube computation methods can be
explored
71
Indexing OLAP Data: Bitmap
Index
Index on a particular column
Each value in the column has a bit vector: bit-op is fast
The length of the bit vector: # of records in the base table
The i-th bit is set if the i-th row of the base table has the value for
the indexed column
not suitable for high cardinality domains
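To make the idea concrete, a small Python sketch (with a hypothetical low-cardinality column) of how per-value bit vectors answer a selection with a single bitwise operation:

# Sketch: bitmap index on one column; one bit vector per distinct value.
region = ["Asia", "Europe", "Asia", "America", "Europe", "Asia"]  # hypothetical base table column

bitmaps = {}
for i, value in enumerate(region):
    bitmaps.setdefault(value, 0)
    bitmaps[value] |= (1 << i)          # set the i-th bit for the i-th row

# Rows where region is Asia OR Europe: a single bitwise OR instead of a table scan.
mask = bitmaps["Asia"] | bitmaps["Europe"]
matching_rows = [i for i in range(len(region)) if mask & (1 << i)]
print(matching_rows)                    # [0, 1, 2, 4, 5]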
73
Efficient Processing OLAP
Queries
74
Data Warehouse Usage
Three kinds of data warehouse applications
Information processing
supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
Analytical processing
multidimensional analysis of data warehouse data
supports basic OLAP operations, slice-dice, drilling, pivoting
Data mining
knowledge discovery from hidden patterns
supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools
75
From On-Line Analytical Processing
(OLAP)
to On Line Analytical Mining (OLAM)
Why online analytical mining?
High quality of data in data warehouses
DW contains integrated, consistent, cleaned data
Available information processing structure surrounding data
warehouses
ODBC, OLEDB, Web accessing, service facilities,
reporting and OLAP tools
OLAP-based exploratory data analysis
Mining with drilling, dicing, pivoting, etc.
On-line selection of data mining functions
Integration and swapping of multiple mining
functions, algorithms, and tasks
76
An OLAM System Architecture
(Figure: OLAM system architecture — Layer 4: user interface (GUI API) accepting mining queries and returning mining results; Layer 3: OLAM engine and OLAP engine; Layer 2: multidimensional database (MDDB) with metadata; Layer 1: data repository — databases and a data warehouse built through data cleaning, data integration, and filtering via a database API.)
77
UNIT II- Data Preprocessing
Data cleaning
Data integration and transformation
Data reduction
Summary
78
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces the same
or similar analytical results
Data discretization: part of data reduction, of particular
importance for numerical data
79
Data Cleaning
No quality data, no quality mining results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or even
misleading statistics
“Data cleaning is the number one problem in data warehousing”—DCI
survey
Data extraction, cleaning, and transformation comprises the majority of
the work of building a data warehouse
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
80
Data in the Real World Is Dirty
incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
e.g., occupation=“ ” (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary=“−10” (an error)
inconsistent: containing discrepancies in codes or names,
e.g.,
Age=“42” Birthday=“03/07/1997”
Was rating “1,2,3”, now rating “A, B, C”
discrepancy between duplicate records
81
Why Is Data Dirty?
Incomplete data may come from
“Not applicable” data value when collected
Different considerations between the time when the data was collected
and when it is analyzed.
Human/hardware/software problems
Noisy data (incorrect values) may come from
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission
Inconsistent data may come from
Different data sources
Functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning
82
Multi-Dimensional Measure of Data Quality
83
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of
entry
not register history or changes of the data
Missing data may need to be inferred
84
How to Handle Missing Data?
85
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
86
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g., deal with
possible outliers)
87
Simple Discretization Methods: Binning
88
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
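A minimal Python sketch of the equal-frequency binning and smoothing-by-means steps shown above (it assumes the values are already sorted and divide evenly into the bins):

# Sketch: equal-frequency binning with smoothing by bin means (prices from the example above).
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

def smooth_by_bin_means(sorted_values, n_bins):
    size = len(sorted_values) // n_bins             # assumes the length divides evenly
    smoothed = []
    for b in range(n_bins):
        bin_vals = sorted_values[b * size:(b + 1) * size]
        mean = round(sum(bin_vals) / len(bin_vals)) # replace every value by its bin mean
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

print(smooth_by_bin_means(prices, 3))   # [9, 9, 9, 9, 23, 23, 23, 23, 29, 29, 29, 29]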
89
Regression
(Figure: data points fitted by the regression line y = x + 1; an observed value Y1 at X1 is smoothed to the value Y1′ on the line.)
90
Cluster Analysis
91
Data Cleaning as a Process
Data discrepancy detection
Use metadata (e.g., domain, range, dependency, distribution)
Check field overloading
Check uniqueness rule, consecutive rule and null rule
Use commercial tools
Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections
Data auditing: by analyzing data to discover rules and relationships, and to detect violators
92
Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different
sources are different
Possible reasons: different representations, different scales, e.g.,
metric vs. British units
93
Handling Redundancy in Data Integration
Correlation coefficient (Pearson's product-moment coefficient):
r_{p,q} = Σ (p − p̄)(q − q̄) / ((n−1) σ_p σ_q) = (Σ (p·q) − n·p̄·q̄) / ((n−1) σ_p σ_q)
where n is the number of tuples, p̄ and q̄ are the means, and σ_p, σ_q the standard deviations of p and q
95
Correlation (viewed as linear
relationship)
correlation(p, q) = covariance(p, q) / (σ_p · σ_q)
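A short Python sketch of the Pearson correlation coefficient above, using sample means and standard deviations (the input lists are illustrative):

# Sketch: Pearson correlation coefficient r_{p,q} for two numeric attributes.
import math

def pearson_r(p, q):
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    cov = sum((pi - mean_p) * (qi - mean_q) for pi, qi in zip(p, q)) / (n - 1)
    sd_p = math.sqrt(sum((pi - mean_p) ** 2 for pi in p) / (n - 1))
    sd_q = math.sqrt(sum((qi - mean_q) ** 2 for qi in q) / (n - 1))
    return cov / (sd_p * sd_q)

# r close to +1 or -1 suggests the two attributes are redundant; near 0, uncorrelated.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8.5]))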
96
Data Transformation
A function that maps the entire set of values of a given
attribute to a new set of replacement values s.t. each old
value can be identified with one of the new values
Methods
Smoothing: Remove noise from data
Aggregation: Summarization, data cube construction
Generalization: Concept hierarchy climbing
Normalization: scaled to fall within a small, specified range (see the sketch after this list)
min-max normalization
z-score normalization
normalization by decimal scaling
Attribute/feature construction
New attributes constructed from the given ones
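A minimal Python sketch of the three normalization methods listed above (the sample values and the chosen new range are illustrative):

# Sketch: min-max, z-score, and decimal-scaling normalization of one numeric attribute.
import math

def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    return (v - mean) / std

def decimal_scaling(v, j):
    return v / (10 ** j)        # j = smallest integer so that max(|v'|) < 1

values = [73600, 74000, 54000, 98000, 12000]
mean = sum(values) / len(values)
std = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))
print([round(min_max(x, min(values), max(values)), 3) for x in values])
print([round(z_score(x, mean, std), 3) for x in values])
print([decimal_scaling(x, 5) for x in values])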
97
Data Transformation: Normalization
99
Dimensionality Reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which is critical to clustering, outlier
analysis, becomes less meaningful
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Principal component analysis
Singular value decomposition
Supervised and nonlinear techniques (e.g., feature selection)
100
Dimensionality Reduction: Principal
Component Analysis (PCA)
Find a projection that captures the largest amount of
variation in data
Find the eigenvectors of the covariance matrix, and these
eigenvectors define the new space
(Figure: data in the x1–x2 plane with the principal directions of variation overlaid.)
101
Principal Component Analysis (Steps)
Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
Normalize input data: Each attribute falls within the same range
Compute k orthonormal (unit) vectors, i.e., principal components
Each input data (vector) is a linear combination of the k principal
component vectors
The principal components are sorted in order of decreasing “significance”
or strength
Since the components are sorted, the size of the data can be reduced by
eliminating the weak components, i.e., those with low variance (i.e., using
the strongest principal components, it is possible to reconstruct a good
approximation of the original data)
Works for numeric data only
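A compact NumPy sketch of these PCA steps (centering, covariance, eigen-decomposition, projection); the random input matrix is only a placeholder:

# Sketch: PCA via the covariance matrix's eigenvectors (numeric data only).
import numpy as np

def pca(X, k):
    X_centered = X - X.mean(axis=0)            # normalize/center each attribute
    cov = np.cov(X_centered, rowvar=False)     # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigen-decomposition (symmetric matrix)
    order = np.argsort(eigvals)[::-1]          # sort components by decreasing "significance"
    components = eigvecs[:, order[:k]]         # keep the k strongest principal components
    return X_centered @ components             # project the data into the new k-D space

X = np.random.rand(100, 5)                     # 100 tuples, 5 numeric attributes
print(pca(X, 2).shape)                         # (100, 2)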
102
Feature Subset Selection
103
Heuristic Search in Feature
Selection
There are 2^d possible feature combinations of d features
Typical heuristic feature selection methods:
Best single features under the feature independence assumption:
choose by significance tests
Best step-wise feature selection:
The best single-feature is picked first
Then the next best feature conditioned on the first, ...
Step-wise feature elimination:
Repeatedly eliminate the worst feature
Best combined feature selection and elimination
Optimal branch and bound:
Use feature elimination and backtracking
104
Feature Creation
Create new attributes that can capture the important
information in a data set much more efficiently than the
original attributes
Three general methodologies
Feature extraction
domain-specific
Mapping data to new space (see: data reduction)
E.g., Fourier transformation, wavelet transformation
Feature construction
Combining features
Data discretization
105
Mapping Data to a New Space
Fourier transform
Wavelet transform
106
Numerosity (Data) Reduction
Reduce data volume by choosing alternative, smaller
forms of data representation
Parametric methods (e.g., regression)
Assume the data fits some model, estimate model parameters,
store only the parameters, and discard the data (except possible
outliers)
Example: Log-linear models—obtain value at a point in m-D
space as the product on appropriate marginal subspaces
Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling
107
Parametric Data Reduction:
Regression and Log-Linear Models
108
Regress Analysis and Log-Linear
Models
Linear regression: Y = w X + b
Two regression coefficients, w and b, specify the line and are to
be estimated by using the data at hand
Using the least squares criterion to the known values of Y1, Y2, …,
X1, X2, ….
Multiple regression: Y = b0 + b1 X1 + b2 X2.
Many nonlinear functions can be transformed into the above
Log-linear models:
The multi-way table of joint probabilities is approximated by a product of lower-order tables
109
Data Reduction:
Wavelet Transformation
Haar2 Daubechie4
Discrete wavelet transform (DWT): linear signal
processing, multi-resolutional analysis
Compressed approximation: store only a small fraction of
the strongest of the wavelet coefficients
Similar to discrete Fourier transform (DFT), but better
lossy compression, localized in space
Method:
Length, L, must be an integer power of 2 (padding with 0’s, when
necessary)
Each transform has 2 functions: smoothing, difference
Applied to pairs of data, resulting in two sets of data of length L/2
The two functions are applied recursively until the desired length is reached
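A minimal Python sketch of this procedure using the simple (unnormalized) Haar smoothing/difference pair; the input signal is illustrative:

# Sketch: a simple (unnormalized) Haar wavelet transform; length must be a power of 2.
def haar_dwt(signal):
    data = list(signal)
    output = []
    while len(data) > 1:
        smooth = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]   # smoothing
        detail = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]   # difference
        output = detail + output        # keep this resolution's detail coefficients
        data = smooth                   # recurse on the smoothed half (length L/2)
    return data + output                # overall average followed by detail coefficients

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))   # [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]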
110
DWT for Image Compression
Image
111
Data Cube Aggregation
112
Data Compression
String compression
There are extensive theories and well-tuned algorithms
Typically lossless
But only limited manipulation is possible without expansion
Audio/video compression
Typically lossy compression, with progressive refinement
Sometimes small fragments of signal can be reconstructed without
reconstructing the whole
Time sequence data is not like audio
Typically short and varies slowly with time
113
Data Compression
(Figure: lossless compression restores the original data exactly, whereas lossy compression yields only an approximation of the original data.)
114
Data Reduction: Histograms
Divide data into buckets and store the average (sum) for each bucket
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth): equal bucket population
V-optimal: the histogram with the least variance (weighted sum of the original values each bucket represents)
MaxDiff: set bucket boundaries between the β−1 pairs of adjacent values with the largest differences
(Figure: an example equal-width histogram over values from 10,000 to 90,000.)
115
Data Reduction Method: Clustering
117
Types of Sampling
118
Sampling: With or without Replacement
(Figure: from the raw data, SRSWOR draws a simple random sample without replacement, while SRSWR draws a simple random sample with replacement.)
119
Sampling: Cluster or Stratified
Sampling
120
Data Reduction: Discretization
121
Discretization and Concept
Hierarchy
Discretization
Reduce the number of values for a given continuous attribute by dividing
the range of the attribute into intervals
Interval labels can then be used to replace actual data values
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute
Concept hierarchy formation
Recursively reduce the data by collecting and replacing low level concepts
(such as numeric values for age) by higher level concepts (such as young,
middle-aged, or senior)
122
Discretization and Concept Hierarchy
Generation for Numeric Data
Typical methods: All the methods can be applied recursively
Binning (covered above)
Top-down split, unsupervised,
Histogram analysis (covered above)
Top-down split, unsupervised
Clustering analysis (covered above)
Either top-down split or bottom-up merge, unsupervised
Entropy-based discretization: supervised, top-down split
Interval merging by χ² analysis: unsupervised, bottom-up merge
Segmentation by natural partitioning: top-down split, unsupervised
123
Discretization Using Class Labels
124
Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2
using boundary T, the information gain after partitioning is
I(S, T) = (|S1| / |S|) · Entropy(S1) + (|S2| / |S|) · Entropy(S2)
Entropy is calculated based on class distribution of the samples in the
set. Given m classes, the entropy of S1 is
Entropy(S1) = − Σ_{i=1}^{m} p_i log₂(p_i)
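A small Python sketch of computing this class-entropy measure for a candidate boundary T (values and labels are illustrative):

# Sketch: weighted entropy I(S, T) of splitting labeled samples at boundary T.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_info(values, labels, t):
    left = [y for x, y in zip(values, labels) if x <= t]
    right = [y for x, y in zip(values, labels) if x > t]
    n = len(labels)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

values = [1, 2, 3, 8, 9, 10]
labels = ["low", "low", "low", "high", "high", "high"]
# The boundary minimizing I(S, T) (here, T = 3) is chosen for discretization.
print(min((split_info(values, labels, t), t) for t in values[:-1]))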
126
Interval Merge by χ² Analysis
Merging-based (bottom-up) vs. splitting-based methods
Merge: Find the best neighboring intervals and merge them to form
larger intervals recursively
ChiMerge [Kerber AAAI 1992, See also Liu et al. DMKD 2002]
Initially, each distinct value of a numerical attr. A is considered to be one
interval
χ² tests are performed for every pair of adjacent intervals
Adjacent intervals with the least χ² values are merged together, since low χ²
values for a pair indicate similar class distributions
This merge process proceeds recursively until a predefined stopping criterion
is met (such as significance level, max-interval, max inconsistency, etc.)
127
Segmentation by Natural Partitioning
128
Example of 3-4-5 Rule
(Figure: applying the 3-4-5 rule — an intermediate step partitions the rounded range (−$1,000 … $2,000), and a later step adjusts the hierarchy to cover the actual range (−$400 … $5,000).)
130
Automatic Concept Hierarchy Generation
132
What Is Frequent Pattern
Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
Motivation: Finding inherent regularities in data
What products were often purchased together?— Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design, sale campaign analysis,
Web log (click stream) analysis, and DNA sequence analysis.
133
Why Is Freq. Pattern Mining
Important?
134
Basic Concepts: Frequent Patterns
and Association Rules
Itemset X = {x1, …, xk}
Find all the rules X → Y with minimum support and confidence
support, s: probability that a transaction contains X ∪ Y
confidence, c: conditional probability that a transaction having X also contains Y
Example transaction database — 10: A, B, D; 20: A, C, D; 30: A, D, E; 40: B, E, F
137
Scalable Methods for Mining Frequent
Patterns
The downward closure property of frequent patterns
Any subset of a frequent itemset must be frequent
If {beer, diaper, nuts} is frequent, so is {beer, diaper}
i.e., every transaction having {beer, diaper, nuts} also contains
{beer, diaper}
Scalable mining methods: Three major approaches
Apriori (Agrawal & Srikant@VLDB’94)
Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)
Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)
138
Apriori: A Candidate Generation-and-Test
Approach
139
The Apriori Algorithm—An Example
Supmin = 2
Database TDB: Tid 10: A, C, D; 20: B, C, E; 30: A, B, C, E; 40: B, E
1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3
C2 (from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 with supports: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
C3 (from L2): {B,C,E}
3rd scan → L3: {B,C,E}:2
140
The Apriori Algorithm
Pseudo-code:
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
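A straightforward, unoptimized Python sketch that follows this pseudo-code (the join and prune steps are done with sets; the small transaction database mirrors the example above):

# Sketch: level-wise Apriori frequent-itemset mining.
from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)
    L = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    all_frequent = list(L)
    k = 2
    while L:
        # join step: candidates of size k from frequent (k-1)-itemsets
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        L = [c for c in candidates if support(c) >= min_support]
        all_frequent.extend(L)
        k += 1
    return all_frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(tdb, min_support=2))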
141
Important Details of Apriori
142
How to Generate Candidates?
143
How to Count Supports of Candidates?
144
Example: Counting Supports of
Candidates
(Figure: counting supports with a hash tree of candidate 3-itemsets. The subset (hash) function sends items 1, 4, 7 / 2, 5, 8 / 3, 6, 9 to different branches; leaves hold candidates such as 234, 567, 145, 345, 356, 357, 367, 368, 136, 689, 124, 125, 457, 458, 159. The transaction 1 2 3 5 6 is matched recursively — 1+2356, 12+356, 13+56, … — so only the leaves it can reach are checked.)
145
Efficient Implementation of Apriori in SQL
146
Challenges of Frequent Pattern
Mining
Challenges
Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for candidates
Improving Apriori: general ideas
Reduce passes of transaction database scans
Shrink number of candidates
Facilitate support counting of candidates
147
Partition: Scan Database Only
Twice
148
DHP: Reduce the Number of
Candidates
149
Sampling for Frequent Patterns
150
DIC: Reduce Number of Scans
Once both A and D are determined frequent, the counting of AD begins
Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
(Figure: the itemset lattice from {} up to ABCD and a stream of transactions — Apriori counts 1-itemsets, then 2-itemsets, and so on, one full pass per level, while DIC starts counting longer itemsets as soon as their subsets are known to be frequent.)
S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97
151
Bottleneck of Frequent-pattern
Mining
152
Mining Frequent Patterns Without
Candidate Generation
153
Construct FP-tree from a Transaction
Database
155
Partition Patterns and
Databases
156
Find Patterns Having P From P-conditional
Database
(Figure: the FP-tree — header table with item frequencies f:4, c:4, a:3, b:3, m:3, p:3; tree rooted at {} with the main path f:4 → c:3 → a:3 → m:2 → p:2, a branch b:1 → m:1 under a:3, a branch b:1 under f:4, and a separate path c:1 → b:1 → p:1.)
Conditional pattern bases:
item: conditional pattern base
c: f:3
a: fc:3
b: fca:1, f:1, c:1
m: fca:2, fcab:1
p: fcam:2, cb:1
157
From Conditional Pattern-bases to Conditional FP-
trees
(Figure: the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3. The conditional pattern base of “am” is (fc:3), giving the am-conditional FP-tree {} → f:3 → c:3; the conditional pattern base of “cm” is (f:3), giving the cm-conditional FP-tree {} → f:3; and the conditional pattern base of “cam” is (f:3), giving the cam-conditional FP-tree {} → f:3.)
159
A Special Case: Single Prefix Path in FP-
tree
(Figure: an FP-tree whose upper portion is a single prefix path ending at a3:n3, below which the tree branches at a node r1 into subtrees such as C2:k2 and C3:k3; the single prefix path can be mined separately and its results combined with those of the branching part.)
160
Mining Frequent Patterns With FP-
trees
Idea: Frequent pattern growth
Recursively grow frequent patterns by pattern and database
partition
Method
For each frequent item, construct its conditional pattern-base, and
then its conditional FP-tree
Repeat the process on each newly created conditional FP-tree
Until the resulting FP-tree is empty, or it contains only one path—
single path will generate all the combinations of its sub-paths,
each of which is a frequent pattern
161
Scaling FP-growth by DB Projection
162
Partition-based Projection
Parallel projection needs a lot of disk space; partition projection saves it
(Figure: the transaction DB {fcamp, fcabm, fb, cbp, fcamp} is projected into per-item databases, which are projected again recursively, e.g., an am-proj DB {fc, fc, fc} and a cm-proj DB {f, f, f}.)
163
FP-Growth vs. Apriori: Scalability With the
Support Threshold
(Figure: run time in seconds (0–70) versus support threshold (0–3%) for FP-growth and Apriori.)
164
FP-Growth vs. Tree-Projection: Scalability
with the Support Threshold
(Figure: runtime in seconds (0–100) versus support threshold (0–2%) for FP-growth and TreeProjection.)
165
Why Is FP-Growth the Winner?
Divide-and-conquer:
decompose both the mining task and DB according to the
frequent patterns obtained so far
leads to focused search of smaller databases
Other factors
no candidate generation, no candidate test
compressed database: FP-tree structure
no repeated scan of entire database
basic ops—counting local freq items and building sub FP-tree, no
pattern search and matching
166
Implications of the Methodology
167
MaxMiner: Mining Max-patterns
Transaction database: Tid 10: A, B, C, D, E; 20: B, C, D, E; 30: A, C, D, F
1st scan: find frequent items — A, B, C, D, E
2nd scan: find support for AB, AC, AD, AE, ABCDE; BC, BD, BE, BCDE; CD, CE, CDE, DE (the potential max-patterns)
Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in a later scan
R. Bayardo. Efficiently mining long patterns from databases. In SIGMOD'98
168
Mining Frequent Closed Patterns:
CLOSET
169
CLOSET+: Mining Closed Itemsets by
Pattern-Growth
Itemset merging: if Y appears in every occurrence of X, then Y
is merged with X
Sub-itemset pruning: if Y ⊇ X and sup(X) = sup(Y), X and all of
X’s descendants in the set enumeration tree can be pruned
Hybrid tree projection
Bottom-up physical tree-projection
Top-down pseudo tree-projection
Item skipping: if a local frequent item has the same support in
several header tables at different levels, one can prune it from
the header table at higher levels
Efficient subset checking
170
CHARM: Mining by Exploring Vertical Data
Format
171
Further Improvements of Mining
Methods
AFOPT (Liu, et al. @ KDD’03)
A “push-right” method for mining condensed frequent pattern
(CFP) tree
Carpenter (Pan, et al. @ KDD’03)
Mine data sets with small rows but numerous columns
Construct a row-enumeration tree for efficient mining
172
Visualization of Association Rules: Plane Graph
173
Visualization of Association Rules: Rule Graph
174
Visualization of Association
Rules
(SGI/MineSet 3.0)
175
Mining Various Kinds of Association
Rules
176
Mining Multiple-Level Association
Rules
Items often form hierarchies
Flexible support settings
Items at the lower level are expected to have lower support
Exploration of shared multi-level mining (Agrawal &
Srikant@VLDB’95, Han & Fu@VLDB’95)
177
Multi-level Association: Redundancy
Filtering
178
Mining Multi-Dimensional
Association
Single-dimensional rules:
buys(X, “milk”) ⇒ buys(X, “bread”)
Multi-dimensional rules: ≥ 2 dimensions or predicates
Inter-dimension assoc. rules (no repeated predicates)
age(X, “19-25”) ∧ occupation(X, “student”) ⇒ buys(X, “coke”)
Hybrid-dimension assoc. rules (repeated predicates)
age(X, “19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)
Categorical Attributes: finite number of possible values, no
ordering among values—data cube approach
Quantitative Attributes: numeric, implicit ordering among
values—discretization, clustering, and gradient approaches
179
Mining Quantitative Associations
Techniques can be categorized by how numerical
attributes, such as age or salary are treated
1. Static discretization based on predefined concept
hierarchies (data cube methods)
2. Dynamic discretization based on data distribution
(quantitative rules, e.g., Agrawal & Srikant@SIGMOD96)
3. Clustering: Distance-based association (e.g., Yang &
Miller@SIGMOD97)
one dimensional clustering then association
4. Deviation: (such as Aumann and Lindell@KDD99)
Sex = female => Wage: mean=$7/hr (overall mean = $9)
180
Static Discretization of Quantitative
Attributes
age(X, “34-35”) ∧ income(X, “30-50K”) ⇒ buys(X, “high resolution TV”)
182
Mining Other Interesting Patterns
183
Interestingness Measure:
Correlations (Lift)
play basketball ⇒ eat cereal [40%, 66.7%] is misleading
The overall % of students eating cereal is 75% > 66.7%
play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
Measure of dependent/correlated events: lift
(Table: 2×2 contingency table of basketball vs. not basketball against cereal vs. not cereal, with row and column sums.)
184
Are Lift and χ² Good Measures of Correlation?
“Buy walnuts ⇒ buy milk [1%, 80%]” is misleading if 85% of customers buy milk
Support and confidence are not good at representing correlations
So many interestingness measures (Tan, Kumar, Srivastava @KDD’02)
lift = P(A ∪ B) / (P(A) · P(B))
all_conf = sup(X) / max_item_sup(X)
(Table: contingency table with columns Milk, No Milk, Sum (row) and rows Coffee (m, c | ~m, c | c), No Coffee (m, ~c | ~m, ~c | ~c), Sum (col.) (m | ~m).)
(Table: example databases with counts of m,c / ~m,c / m,~c / ~m,~c and the resulting lift, all-conf, coherence, and χ² values.)
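A tiny Python sketch of lift and all-confidence computed from co-occurrence counts; the counts below are illustrative, chosen to echo the basketball/cereal discussion above:

# Sketch: lift and all-confidence from simple co-occurrence counts (illustrative numbers).
n_total = 5000
n_basketball = 3000
n_cereal = 3750
n_both = 2000

p_b, p_c = n_basketball / n_total, n_cereal / n_total
p_bc = n_both / n_total

lift = p_bc / (p_b * p_c)                     # lift < 1: negatively correlated
all_conf = p_bc / max(p_b, p_c)               # all_conf = sup(X) / max_item_sup(X)
print(round(lift, 3), round(all_conf, 3))     # 0.889 and 0.533 for these counts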
186
Constraint-based (Query-Directed)
Mining
187
Constraints in Data Mining
188
Constrained Mining vs. Constraint-Based
Search
189
Anti-Monotonicity in Constraint
Pushing
Anti-monotonicity: when an itemset S violates the constraint, so does any of its supersets
sum(S.Price) ≤ v is anti-monotone
sum(S.Price) ≥ v is not anti-monotone
Example. C: range(S.profit) ≤ 15 is anti-monotone
Itemset ab violates C
So does every superset of ab
TDB (min_sup=2): TID 10: a, b, c, d, f; 20: b, c, d, f, g, h; 30: a, c, d, e, f; 40: c, e, f, g
Item profits: a: 40, b: 0, c: −20, d: 10, e: −30, f: 30, g: 20, h: −10
190
Monotonicity for Constraint
Pushing
Monotonicity: when an itemset S satisfies the constraint, so does any of its supersets
sum(S.Price) ≥ v is monotone
min(S.Price) ≤ v is monotone
Example. C: range(S.profit) ≥ 15
Itemset ab satisfies C
So does every superset of ab
TDB (min_sup=2): TID 10: a, b, c, d, f; 20: b, c, d, f, g, h; 30: a, c, d, e, f; 40: c, e, f, g
Item profits: a: 40, b: 0, c: −20, d: 10, e: −30, f: 30, g: 20, h: −10
191
Succinctness
Succinctness:
Given A1, the set of items satisfying a succinctness constraint C,
then any set S satisfying C is based on A1 , i.e., S contains a
subset belonging to A1
Idea: Without looking at the transaction database, whether an
itemset S satisfies constraint C can be determined based on the
selection of items
min(S.Price) ≤ v is succinct
sum(S.Price) ≥ v is not succinct
Optimization: If C is succinct, C is pre-counting
pushable
192
The Apriori Algorithm — Example
Database D: TID 100: 1 3 4; 200: 2 3 5; 300: 1 2 3 5; 400: 2 5
Scan D → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1: {1}:2, {2}:3, {3}:3, {5}:3
C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → C2 with supports: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2
C3: {2 3 5}
Scan D → L3: {2 3 5}:2
193
Naïve Algorithm: Apriori +
Constraint
(The same Apriori trace as on the previous slide.)
Constraint: Sum{S.price} < 5
194
The Constrained Apriori Algorithm:
Push an Anti-monotone Constraint
Deep
(The same Apriori trace, with the anti-monotone constraint pushed deep: itemsets that violate it are pruned as soon as they appear.)
Constraint: Sum{S.price} < 5
195
The Constrained Apriori Algorithm:
Push a Succinct Constraint Deep
(The same Apriori trace, with the succinct constraint pushed deep: candidates that cannot satisfy it, such as {1 2}, are not immediately used.)
Constraint: min{S.price} <= 1
196
Converting “Tough” Constraints
Convert tough constraints into anti-monotone or monotone by properly ordering items
Examine C: avg(S.profit) ≥ 25
TDB (min_sup=2): TID 10: a, b, c, d, f; 20: b, c, d, f, g, h; 30: a, c, d, e, f; 40: c, e, f, g
198
Can Apriori Handle Convertible
Constraint?
199
Mining With Convertible
Constraints
C: avg(X) >= 25, min_sup=2
List items in every transaction in value-descending order R: <a, f, g, d, b, h, c, e>
C is convertible anti-monotone w.r.t. R
Scan TDB once
remove infrequent items
Item h is dropped
Itemsets a and f are good, …
Projection-based mining
Imposing an appropriate order on item projection
Many tough constraints can be converted into (anti-)monotone
Item values: a: 40, f: 30, g: 20, d: 10, b: 0, h: −10, c: −20, e: −30
TDB (min_sup=2), items listed in value-descending order: TID 10: a, f, d, b, c; 20: f, g, d, b, c; 30: a, f, d, c, e; 40: f, g, h, c, e
200
Handling Multiple Constraints
201
What Constraints Are Convertible?
202
Constraint-Based Mining—A General
Picture
Constraint | Antimonotone | Monotone | Succinct
sum(S) ≤ v (∀a ∈ S, a ≥ 0) | yes | no | no
sum(S) ≥ v (∀a ∈ S, a ≥ 0) | no | yes | no
range(S) ≤ v | yes | no | no
range(S) ≥ v | no | yes | no
support(S) ≤ ξ | no | yes | no
203
A Classification of Constraints
(Figure: a classification of constraints — antimonotone, monotone, and succinct constraints, plus convertible anti-monotone, convertible monotone, strongly convertible, and inconvertible constraints.)
204
Chapter 6. Classification and
Prediction
206
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules, decision trees, or
mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of test sample is compared with the
classified result from the model
Accuracy rate is the percentage of test set samples that are
correctly classified by the model
Test set is independent of training set, otherwise over-fitting
will occur
If the accuracy is acceptable, use the model to classify data tuples
whose class labels are not known
207
Process (1): Model Construction
(Figure, model construction: training data plus classification algorithms produce a classifier (model).)
(Figure, model usage: the classifier is applied to testing data and then to unseen data, e.g., (Jeff, Professor, 4) → Tenured?)
Testing data:
NAME | RANK | YEARS | TENURED
Tom | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 | no
George | Professor | 5 | yes
Joseph | Assistant Prof | 7 | yes
209
Supervised vs. Unsupervised Learning
210
Issues: Data Preparation
Data cleaning
Preprocess data in order to reduce noise and handle missing
values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
211
Issues: Evaluating Classification Methods
Accuracy
classifier accuracy: predicting class label
predictor accuracy: guessing value of predicted attributes
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision
tree size or compactness of classification rules
212
Decision Tree Induction: Training Dataset
(Figure: the resulting decision tree — root test age? with branches <=30, 31..40, and >40 leading to leaf labels no and yes.)
214
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in
advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
There are no samples left
215
Attribute Selection Measure:
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let pi be the probability that an arbitrary tuple in D
belongs to class Ci, estimated by |Ci, D|/|D|
Expected information (entropy) needed to classify a tuple in D:
Info(D) = − Σ_{i=1}^{m} p_i log₂(p_i)
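A short Python sketch of Info(D) and the resulting information gain for a categorical split attribute (the toy tuples are illustrative):

# Sketch: information gain of splitting D on a categorical attribute A.
import math
from collections import Counter, defaultdict

def info(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    partitions = defaultdict(list)
    for row, label in zip(rows, labels):
        partitions[row[attr_index]].append(label)
    info_a = sum(len(part) / len(labels) * info(part) for part in partitions.values())
    return info(labels) - info_a           # Gain(A) = Info(D) - Info_A(D)

rows = [("<=30", "high"), ("<=30", "low"), ("31..40", "high"), (">40", "low")]
labels = ["no", "yes", "yes", "yes"]
print(round(info_gain(rows, labels, 0), 3))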
218
Gain Ratio for Attribute Selection (C4.5)
but gini{medium,high} is 0.30 and thus the best since it is the lowest
All attributes are assumed continuous-valued
May need other tools, e.g., clustering, to get the possible split values
Can be modified for categorical attributes
221
Comparing Attribute Selection Measures
222
Other Attribute Selection Measures
CHAID: a popular decision tree algorithm, measure based on χ2 test
for independence
C-SEP: performs better than info. gain and gini index in certain cases
G-statistics: has a close approximation to χ2 distribution
MDL (Minimal Description Length) principle (i.e., the simplest solution
is preferred):
The best tree as the one that requires the fewest # of bits to both (1)
encode the tree, and (2) encode the exceptions to the tree
Multivariate splits (partition based on multiple variable combinations)
CART: finds multivariate splits based on a linear comb. of attrs.
Which attribute selection measure is the best?
Most give good results, none is significantly superior than others
223
Overfitting and Tree Pruning
Overfitting: An induced tree may overfit the training data
Too many branches, some may reflect anomalies due to noise or outliers
Poor accuracy for unseen samples
Two approaches to avoid overfitting
Prepruning: Halt tree construction early—do not split a node if this would
result in the goodness measure falling below a threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully grown” tree—get a sequence
of progressively pruned trees
Use a set of data different from the training data to decide
which is the “best pruned tree”
224
Enhancements to Basic Decision Tree Induction
225
Classification in Large Databases
226
Scalable Decision Tree Induction Methods
227
Scalability Framework for RainForest
228
Rainforest: Training Set and Its AVC Sets
230
BOAT (Bootstrapped Optimistic Algorithm
for Tree Construction)
231
Presentation of Classification Results
232
Visualization of a Decision Tree in SGI/MineSet 3.0
233
Interactive Visual Mining by Perception-Based
Classification (PBC)
234
Chapter 6. Classification and
Prediction
237
Bayesian Theorem
239
Derivation of Naïve Bayes
Classifier
A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between
attributes):
P(X | Ci) = Π_{k=1}^{n} P(x_k | Ci) = P(x_1 | Ci) × P(x_2 | Ci) × … × P(x_n | Ci)
This greatly reduces the computation cost: Only counts
the class distribution
If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having
value xk for Ak divided by |Ci, D| (# of tuples of Ci in D)
If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) σ)) · e^(−(x − μ)² / (2σ²))
and P(xk|Ci) = g(xk, μ_Ci, σ_Ci)
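A minimal Python sketch of a naïve Bayesian classifier for categorical attributes only (no Laplacian correction, no Gaussian handling; the toy data are illustrative):

# Sketch: naïve Bayes with the class-conditional independence assumption.
from collections import Counter, defaultdict

def train(rows, labels):
    priors = Counter(labels)
    cond = defaultdict(Counter)                     # (class, attribute index) -> value counts
    for row, c in zip(rows, labels):
        for k, value in enumerate(row):
            cond[(c, k)][value] += 1
    return priors, cond, len(labels)

def classify(x, priors, cond, n):
    best_class, best_p = None, -1.0
    for c, class_count in priors.items():
        p = class_count / n                         # P(Ci)
        for k, value in enumerate(x):
            p *= cond[(c, k)][value] / class_count  # P(xk | Ci)
        if p > best_p:
            best_class, best_p = c, p
    return best_class

rows = [("<=30", "high"), ("<=30", "medium"), ("31..40", "high"), (">40", "medium")]
labels = ["no", "no", "yes", "yes"]
priors, cond, n = train(rows, labels)
print(classify(("<=30", "medium"), priors, cond, n))   # -> "no"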
240
Naïve Bayesian Classifier: Training Dataset
Class: C1: buys_computer = ‘yes’, C2: buys_computer = ‘no’
Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)
age | income | student | credit_rating | buys_computer
<=30 | high | no | fair | no
<=30 | high | no | excellent | no
31…40 | high | no | fair | yes
>40 | medium | no | fair | yes
>40 | low | yes | fair | yes
>40 | low | yes | excellent | no
31…40 | low | yes | excellent | yes
<=30 | medium | no | fair | no
<=30 | low | yes | fair | yes
>40 | medium | yes | fair | yes
<=30 | medium | yes | excellent | yes
31…40 | medium | no | excellent | yes
31…40 | high | yes | fair | yes
>40 | medium | no | excellent | no
241
Naïve Bayesian Classifier: An Example
P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
242
Avoiding the 0-Probability
Problem
Naïve Bayesian prediction requires each conditional prob. be non-
zero. Otherwise, the predicted prob. will be zero
P(X | Ci) = Π_{k=1}^{n} P(x_k | Ci)
Ex. Suppose a dataset with 1000 tuples, income=low (0), income=
medium (990), and income = high (10),
Use Laplacian correction (or Laplacian estimator)
Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
The “corrected” prob. estimates are close to their “uncorrected”
counterparts
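A two-line Python sketch of the Laplacian correction applied to the income counts above:

# Sketch: Laplacian (add-one) correction for a zero count.
counts = {"low": 0, "medium": 990, "high": 10}

n = sum(counts.values())
uncorrected = {v: c / n for v, c in counts.items()}
corrected = {v: (c + 1) / (n + len(counts)) for v, c in counts.items()}

print(uncorrected)   # "low" has probability 0, which would zero out the whole product
print(corrected)     # ≈ 1/1003, 991/1003, 11/1003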
243
Naïve Bayesian Classifier: Comments
Advantages
Easy to implement
Good results obtained in most of the cases
Disadvantages
Assumption: class conditional independence, therefore loss of
accuracy
Practically, dependencies exist among variables
E.g., hospitals: patient profile (age, family history, etc.); dependencies among such variables cannot be modeled by a naïve Bayesian classifier
How to deal with these dependencies? Bayesian Belief Networks
244
Bayesian Belief Networks
247
Using IF-THEN Rules for Classification
Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
Rule antecedent/precondition vs. rule consequent
Assessment of a rule: coverage and accuracy
ncovers = # of tuples covered by R
ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D| /* D: training data set */
accuracy(R) = ncorrect / ncovers
If more than one rule is triggered, need conflict resolution
Size ordering: assign the highest priority to the triggering rule that has the
“toughest” requirement (i.e., with the most attribute tests)
Class-based ordering: decreasing order of prevalence or misclassification cost per
class
Rule-based ordering (decision list): rules are organized into one long priority list,
according to some measure of rule quality or by experts
248
Rule Extraction from a Decision Tree
age?
249
Rule Extraction from the Training Data
250
How to Learn-One-Rule?
Start with the most general rule possible: condition = empty
Adding new attributes by adopting a greedy depth-first strategy
Picks the one that most improves the rule quality
Rule-Quality measures: consider both coverage and accuracy
Foil-gain (in FOIL & RIPPER): assesses the info_gain gained by extending the condition
FOIL_Gain = pos′ × ( log₂(pos′ / (pos′ + neg′)) − log₂(pos / (pos + neg)) )
It favors rules that have high accuracy and cover many positive tuples
Rule pruning based on an independent set of test tuples
FOIL_Prune(R) = (pos − neg) / (pos + neg)
where pos and neg are the numbers of positive and negative tuples covered by R
If FOIL_Prune is higher for the pruned version of R, prune R
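A small Python sketch of these two measures (the pos/neg counts are illustrative):

# Sketch: FOIL_Gain and FOIL_Prune as defined above.
import math

def foil_gain(pos, neg, pos2, neg2):
    # gain of extending a rule covering pos/neg tuples to one covering pos2/neg2 tuples
    return pos2 * (math.log2(pos2 / (pos2 + neg2)) - math.log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    return (pos - neg) / (pos + neg)

print(round(foil_gain(pos=50, neg=50, pos2=40, neg2=10), 3))   # 27.126
print(round(foil_prune(pos=40, neg=10), 3))                    # 0.6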
251
Classification: A Mathematical Mapping
Classification:
predicts categorical class labels
E.g., Personal homepage classification
xi = (x1, x2, x3, …), yi = +1 or –1
x1 : # of occurrences of the word “homepage”
x2 : # of occurrences of the word “welcome”
Mathematically
x ∈ X = ℝⁿ, y ∈ Y = {+1, –1}
We want a function f: X → Y
252
Linear Classification
Binary classification problem
The data above the red line belong to class ‘x’; the data below the red line belong to class ‘o’
Examples: SVM, Perceptron, probabilistic classifiers
[Figure: 2-D scatter of ‘x’ and ‘o’ points separated by a straight line]
253
Discriminative Classifiers
Advantages
prediction accuracy is generally high
As compared to Bayesian methods – in general
Criticism
long training time
difficult to understand the learned function (weights)
By contrast, Bayesian networks can be used easily for pattern discovery
254
Perceptron & Winnow
• Vector: x, w;  scalar: x, y, w
• Input: {(x1, y1), …}
• Output: classification function f(x) with f(xi) > 0 for yi = +1 and f(xi) < 0 for yi = –1
• Decision boundary: f(x) = w·x + b = 0, i.e., w1x1 + w2x2 + b = 0
• Perceptron: update w additively
• Winnow: update w multiplicatively (see the sketch below)
[Figure: separating line in the (x1, x2) plane]
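A minimal Python sketch contrasting the additive Perceptron update with the multiplicative Winnow update; the learning rate, threshold, and the binary-feature form of Winnow are illustrative assumptions, not from the slide:

import numpy as np

def perceptron_update(w, b, x, y, lr=0.1):
    """Additive update: on a mistake, move w toward y * x."""
    if y * (np.dot(w, x) + b) <= 0:          # misclassified
        w = w + lr * y * x
        b = b + lr * y
    return w, b

def winnow_update(w, x, y, theta, alpha=2.0):
    """Multiplicative update (basic Winnow, binary features x in {0,1}):
    promote active weights on a false negative, demote them on a false positive."""
    predicted = 1 if np.dot(w, x) >= theta else -1
    if predicted != y:
        factor = alpha if y == 1 else 1.0 / alpha
        w = w * np.where(x > 0, factor, 1.0)
    return w

# Toy usage with made-up data (illustrative only)
w, b = np.zeros(2), 0.0
w, b = perceptron_update(w, b, np.array([1.0, 2.0]), y=+1)
print(w, b)                                  # [0.1 0.2] 0.1
print(winnow_update(np.ones(2), np.array([1, 0]), y=+1, theta=2.0))   # [2. 1.]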
255
Classification by
Backpropagation
257
A Neuron (= a perceptron)
[Figure: input vector x = (x0, x1, …, xn) and weight vector w = (w0, w1, …, wn) feed a weighted sum with bias −μk, followed by an activation function f that produces the output y]
For example: y = sign(Σ_{i=0}^{n} wi xi − μk)
258
A Multi-Layer Feed-Forward Neural Network
[Figure: input vector X enters the input layer, is weighted (wij) into a hidden layer, and then into the output layer that produces the output vector]
Ij = Σi wij Oi + θj            (net input to unit j)
Oj = 1 / (1 + e^(−Ij))          (output of unit j)
Errj = Oj (1 − Oj)(Tj − Oj)     (error at an output-layer unit)
Errj = Oj (1 − Oj) Σk Errk wjk  (error at a hidden-layer unit)
wij = wij + (l) Errj Oi         (weight update, learning rate l)
θj = θj + (l) Errj              (bias update)
259
How Does a Multi-Layer Neural Network Work?
261
Backpropagation
Iteratively process a set of training tuples & compare the network's
prediction with the actual known target value
For each training tuple, the weights are modified to minimize the
mean squared error between the network's prediction and the
actual target value
Modifications are made in the “backwards” direction: from the output
layer, through each hidden layer down to the first hidden layer, hence
“backpropagation”
Steps
Initialize weights (to small random #s) and biases in the network
Propagate the inputs forward (by applying activation function)
Backpropagate the error (by updating weights and biases)
Terminating condition (when error is very small, etc.)
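A minimal Python sketch of these steps for a one-hidden-layer network with sigmoid units, following the update formulas from the earlier slide; the XOR data, learning rate, network size, and epoch count are illustrative assumptions, not from the slides:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_epoch(X, T, W1, b1, W2, b2, lr=0.5):
    """One epoch of backpropagation with per-tuple updates."""
    for x, t in zip(X, T):
        # Propagate the inputs forward
        O_h = sigmoid(W1 @ x + b1)                      # hidden outputs: O_j = 1/(1+e^-I_j)
        O_o = sigmoid(W2 @ O_h + b2)                    # network outputs
        # Backpropagate the error
        err_o = O_o * (1 - O_o) * (t - O_o)             # Err_j = O_j(1-O_j)(T_j-O_j)
        err_h = O_h * (1 - O_h) * (W2.T @ err_o)        # Err_j = O_j(1-O_j) * sum_k Err_k w_jk
        # Update weights and biases
        W2 += lr * np.outer(err_o, O_h)
        b2 += lr * err_o
        W1 += lr * np.outer(err_h, x)
        b1 += lr * err_h
    return W1, b1, W2, b2

# Toy usage: XOR with 2 hidden units (illustrative settings only)
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0.0], [1.0], [1.0], [0.0]])
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=(1, 2)), np.zeros(1)
for _ in range(5000):
    W1, b1, W2, b2 = backprop_epoch(X, T, W1, b1, W2, b2)
print(np.round(sigmoid(W2 @ sigmoid(W1 @ X.T + b1[:, None]) + b2[:, None]), 2))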
262
Backpropagation and Interpretability
Efficiency of backpropagation: Each epoch (one iteration through the training set) takes O(|D| * w), with |D| tuples and w weights, but the # of epochs can be exponential in n, the number of inputs, in the worst case
Rule extraction from networks: network pruning
Simplify the network structure by removing weighted links that have the
least effect on the trained network
Then perform link, unit, or activation value clustering
The set of input and activation values are studied to derive rules describing
the relationship between the input and hidden unit layers
Sensitivity analysis: assess the impact that a given input variable has
on a network output. The knowledge gained from this analysis can be
represented in rules
263
Associative Classification
Associative classification
Association rules are generated and analyzed for use in classification
Search for strong associations between frequent patterns (conjunctions of
attribute-value pairs) and class labels
Classification: Based on evaluating a set of rules of the form p1 ∧ p2 ∧ … ∧ pl → “Aclass = C” (confidence, support)
264
Typical Associative Classification Methods
265
A Closer Look at CMAR
CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01 )
Efficiency: Uses an enhanced FP-tree that maintains the distribution of
class labels among tuples satisfying each frequent itemset
Rule pruning whenever a rule is inserted into the tree
Given two rules, R1 and R2, if the antecedent of R1 is more general than that
of R2 and conf(R1) ≥ conf(R2), then R2 is pruned
Prunes rules for which the rule antecedent and class are not positively
correlated, based on a χ2 test of statistical significance
Classification based on generated/pruned rules
If only one rule satisfies tuple X, assign the class label of the rule
If a rule set S satisfies X, CMAR
divides S into groups according to class labels
uses a weighted χ2 measure to find the strongest group of rules,
based on the statistical correlation of rules within a group
assigns X the class label of the strongest group
266
Associative Classification May Achieve High
Accuracy and Efficiency (Cong et al. SIGMOD05)
267
The k-Nearest Neighbor
Algorithm
All instances correspond to points in the n-D space
The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
The target function could be discrete- or real-valued
For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to xq (see the sketch below)
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
[Figure: query point xq surrounded by ‘+’ and ‘−’ training examples, and the Voronoi partition they induce]
268
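A minimal Python sketch of discrete-valued k-NN with Euclidean distance; the training points and k are illustrative assumptions:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Return the most common class label among the k training examples
    nearest to the query point (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Toy usage with made-up 2-D points (illustrative only)
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = ["+", "+", "-", "-"]
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))   # "+"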
Discussion on the k-NN Algorithm
269
Case-Based Reasoning (CBR)
CBR: Uses a database of problem solutions to solve new problems
Store symbolic description (tuples or cases)—not points in a Euclidean
space
Applications: Customer-service (product-related diagnosis), legal ruling
Methodology
Instances represented by rich symbolic descriptions (e.g., function graphs)
Search for similar cases, multiple retrieved cases may be combined
Tight coupling between case retrieval, knowledge-based reasoning, and
problem solving
Challenges
Find a good similarity metric
Indexing based on syntactic similarity measures and, when that fails, backtracking and adapting to additional cases
270
Genetic Algorithms (GA)
272
Fuzzy Set
Approaches
274
Linear Regression
Linear regression: involves a response variable y and a single
predictor variable x
y = w 0 + w1 x
where w0 (y-intercept) and w1 (slope) are regression coefficients
Method of least squares: estimates the best-fitting straight line
w1 = Σ_{i=1}^{|D|} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{|D|} (xi − x̄)²,   w0 = ȳ − w1 x̄
where x̄ and ȳ are the means of x and y over the training data D (see the sketch below)
Multiple linear regression: involves more than one predictor variable
Training data is of the form (X1, y1), (X2, y2),…, (X|D|, y|D|)
Ex. For 2-D data, we may have: y = w0 + w1 x1+ w2 x2
Solvable by extension of least square method or using SAS, S-Plus
Many nonlinear functions can be transformed into the above
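A minimal Python sketch of the least-squares estimates above; the toy data are illustrative, not from the slides:

import numpy as np

def simple_linear_regression(x, y):
    """Least-squares estimates for y = w0 + w1*x:
    w1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2),  w0 = y_bar - w1*x_bar."""
    x_bar, y_bar = x.mean(), y.mean()
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Toy usage with made-up data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.9])
print(simple_linear_regression(x, y))   # roughly (0.11, 1.97)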
275
Nonlinear Regression
Some nonlinear models can be modeled by a polynomial
function
A polynomial regression model can be transformed into
linear regression model. For example,
y = w0 + w1 x + w2 x² + w3 x³
convertible to linear with new variables: x2 = x², x3 = x³
y = w0 + w1 x + w2 x2 + w3 x3
Other functions, such as power function, can also be
transformed to linear model
Some models are intractably nonlinear (e.g., sums of exponential terms)
possible to obtain least square estimates through extensive
calculation on more complex formulae
276
Other Regression-Based Models
Generalized linear model:
Foundation on which linear regression can be applied to modeling
categorical response variables
Variance of y is a function of the mean value of y, not a constant
Logistic regression: models the prob. of some event occurring as a linear
function of a set of predictor variables
Poisson regression: models the data that exhibit a Poisson distribution
Log-linear models: (for categorical data)
Approximate discrete multidimensional prob. distributions
Also useful for data compression and smoothing
Regression trees and model trees
Trees to predict continuous values rather than class labels
277
Regression Trees and Model
Trees
Regression tree: proposed in CART system (Breiman et al. 1984)
CART: Classification And Regression Trees
Each leaf stores a continuous-valued prediction
It is the average value of the predicted attribute for the training tuples
that reach the leaf
Model tree: proposed by Quinlan (1992)
Each leaf holds a regression model—a multivariate linear equation for the
predicted attribute
A more general case than regression tree
Regression and model trees tend to be more accurate than linear
regression when the data are not represented well by a simple linear
model
278
Predictive Modeling in Multidimensional Databases
279
Prediction: Numerical Data
280
Prediction: Categorical Data
281
Classifier Accuracy Measures

                  Predicted C1       Predicted C2
Actual C1         true positives     false negatives
Actual C2         false positives    true negatives
282
UNIT IV- Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Outlier Analysis
283
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Finding similarities between data according to the characteristics
found in the data and grouping similar data objects into clusters
Unsupervised learning: no predefined classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
284
Clustering: Rich Applications and
Multidisciplinary Efforts
Pattern Recognition
Spatial Data Analysis
Create thematic maps in GIS by clustering feature spaces
Detect spatial clusters or for other spatial mining tasks
Image Processing
Economic Science (especially market research)
WWW
Document classification
Cluster Weblog data to discover groups of similar access patterns
285
Examples of Clustering
Applications
Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
Land use: Identification of areas of similar land use in an earth
observation database
Insurance: Identifying groups of motor insurance policy holders with
a high average claim cost
City-planning: Identifying groups of houses according to their house
type, value, and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered along continental faults
286
Quality: What Is Good
Clustering?
A good clustering method will produce high quality
clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation
The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns
287
Measure the Quality of
Clustering
Dissimilarity/Similarity metric: Similarity is expressed in
terms of a distance function, typically metric: d(i, j)
There is a separate “quality” function that measures the
“goodness” of a cluster.
The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables.
Weights should be associated with different variables
based on applications and data semantics.
It is hard to define “similar enough” or “good enough”
the answer is typically highly subjective.
288
Requirements of Clustering in Data
Mining
Scalability
Ability to deal with different types of attributes
Ability to handle dynamic data
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine
input parameters
Able to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability
289
Data Structures
Data matrix (two modes): n objects × p variables

  [ x11  ...  x1f  ...  x1p ]
  [ ...  ...  ...  ...  ... ]
  [ xi1  ...  xif  ...  xip ]
  [ ...  ...  ...  ...  ... ]
  [ xn1  ...  xnf  ...  xnp ]

Dissimilarity matrix (one mode): n × n

  [ 0                           ]
  [ d(2,1)  0                   ]
  [ d(3,1)  d(3,2)  0           ]
  [ :       :       :           ]
  [ d(n,1)  d(n,2)  ...  ...  0 ]
290
Type of data in clustering analysis
Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types
291
Interval-valued variables
Standardize data
Calculate the mean absolute deviation:
sf = (1/n)(|x1f − mf| + |x2f − mf| + … + |xnf − mf|), where mf is the mean of variable f
Calculate the standardized measurement (z-score): zif = (xif − mf) / sf
292
Similarity and Dissimilarity
Between Objects
Manhattan (city-block) distance, the q = 1 case of the Minkowski distance:
d(i, j) = |xi1 − xj1| + |xi2 − xj2| + … + |xip − xjp|
293
Similarity and Dissimilarity
Between Objects (Cont.)
If q = 2, d is Euclidean distance:
d(i, j) = √(|xi1 − xj1|² + |xi2 − xj2|² + … + |xip − xjp|²)
Properties
d(i, j) ≥ 0
d(i, i) = 0
d(i, j) = d(j, i)
d(i, j) ≤ d(i, k) + d(k, j)
Also, one can use a weighted distance, parametric Pearson product-moment correlation, or other dissimilarity measures (see the sketch below)
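A minimal Python sketch of the Minkowski distance, with q = 1 (Manhattan) and q = 2 (Euclidean) as special cases; the sample points are illustrative:

def minkowski(x, y, q):
    """Minkowski distance; q = 1 gives Manhattan, q = 2 gives Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

p1, p2 = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski(p1, p2, 1))   # Manhattan: 7.0
print(minkowski(p1, p2, 2))   # Euclidean: 5.0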
294
Binary Variables
A contingency table for binary data:

                        Object j
                   1         0         sum
Object i    1      a         b         a + b
            0      c         d         c + d
            sum    a + c     b + d     p
Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
gender is a symmetric attribute
the remaining attributes are asymmetric binary
let the values Y and P be set to 1, and the value N be set to 0
d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
(computed with the asymmetric binary dissimilarity d = (b + c) / (a + b + c), ignoring 0–0 matches; see the sketch below)
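A minimal Python sketch of the asymmetric binary dissimilarity used above, reproducing the Jack/Mary/Jim values; the 0/1 encoding of the table (Y/P as 1, N as 0) is the only assumption:

def asymmetric_binary_dissimilarity(x, y):
    """d(i, j) = (b + c) / (a + b + c) for asymmetric binary variables,
    ignoring 0-0 matches (d); inputs are equal-length 0/1 vectors."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1 .. Test-4, coded from the table above
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(asymmetric_binary_dissimilarity(jack, mary), 2))   # 0.33
print(round(asymmetric_binary_dissimilarity(jack, jim), 2))    # 0.67
print(round(asymmetric_binary_dissimilarity(jim, mary), 2))    # 0.75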
296
Nominal Variables
Simple matching: d(i, j) = (p − m) / p, where m is the # of matches and p is the total # of variables
297
Ordinal Variables
Replace xif by its rank rif ∈ {1, …, Mf} and map it onto [0, 1]:
zif = (rif − 1) / (Mf − 1)
compute the dissimilarity using methods for interval-scaled
variables
298
Ratio-Scaled Variables
299
Variables of Mixed Types
300
Vector Objects
301
Major Clustering Approaches
(I)
Partitioning approach:
Construct various partitions and then evaluate them by some criterion, e.g.,
minimizing the sum of square errors
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using some
criterion
Typical methods: Diana, Agnes, BIRCH, ROCK, CHAMELEON
Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DenClue
302
Major Clustering Approaches
(II)
Grid-based approach:
based on a multiple-level granularity structure
Typical methods: STING, WaveCluster, CLIQUE
Model-based:
A model is hypothesized for each of the clusters; the aim is to find the best fit of the data to the given model
Typical methods: EM, SOM, COBWEB
Frequent pattern-based:
Based on the analysis of frequent patterns
Typical methods: pCluster
User-guided or constraint-based:
Clustering by considering user-specified or application-specific constraints
Typical methods: COD (obstacles), constrained clustering
303
Typical Alternatives to Calculate the
Distance between Clusters
Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min{dist(tip, tjq)}
Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max{dist(tip, tjq)}
Average: average distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = avg{dist(tip, tjq)}
(single, complete, and average link are sketched in code after this list)
Centroid: distance between the centroids of two clusters, i.e.,
dis(Ki, Kj) = dis(Ci, Cj)
Medoid: distance between the medoids of two clusters, i.e., dis(Ki,
Kj) = dis(Mi, Mj)
Medoid: one chosen, centrally located object in the cluster
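A minimal Python sketch of single, complete, and average link under Euclidean distance; the example clusters are made up for illustration:

import numpy as np

def inter_cluster_distance(Ki, Kj, link="single"):
    """dis(Ki, Kj) under single / complete / average link, using Euclidean
    distance between elements of the two clusters (arrays of points)."""
    d = np.linalg.norm(Ki[:, None, :] - Kj[None, :, :], axis=2)
    if link == "single":
        return d.min()       # smallest pairwise distance
    if link == "complete":
        return d.max()       # largest pairwise distance
    return d.mean()          # average pairwise distance

Ki = np.array([[0.0, 0.0], [1.0, 0.0]])
Kj = np.array([[3.0, 0.0], [5.0, 0.0]])
print(inter_cluster_distance(Ki, Kj, "single"))    # 2.0
print(inter_cluster_distance(Ki, Kj, "complete"))  # 5.0
print(inter_cluster_distance(Ki, Kj, "average"))   # 3.5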
304
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)
Centroid: the “middle” of a cluster: Cm = Σ_{i=1}^{N} tip / N
Radius: square root of the average squared distance from any point of the cluster to its centroid: Rm = √(Σ_{i=1}^{N} (tip − cm)² / N)
Diameter: square root of the average squared distance between all pairs of points in the cluster: Dm = √(Σ_{i=1}^{N} Σ_{j=1}^{N} (tip − tjq)² / (N(N − 1)))
305
Partitioning Algorithms: Basic
Concept
Partitioning method: Construct a partition of a database D of n objects
into a set of k clusters, s.t., min sum of squared distance
306
The K-Means Clustering Method
307
The K-Means Clustering Method
Example
[Figure: K-means with K = 2 on a 2-D data set.
Arbitrarily choose K objects as the initial cluster means;
assign each object to the cluster with the most similar center;
update the cluster means;
reassign and repeat until the assignments no longer change.
A worked sketch follows below.]
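A minimal Python sketch of the K-means loop described above; the toy data, random seed, and iteration cap are illustrative assumptions:

import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Basic K-means: pick k objects as the initial means, assign each object to
    the most similar center, update the cluster means, repeat until no change."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each object to the nearest center
        labels = np.argmin(np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
        # Update each center to the mean of its cluster (keep old center if a cluster is empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy usage with two obvious groups (illustrative only)
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [8.5, 8.2], [7.8, 9.0]])
labels, centers = k_means(X, k=2)
print(labels, centers, sep="\n")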
308
Comments on the K-Means Method
309
Variations of the K-Means Method
310
What Is the Problem of the K-Means
Method?
311
The K-Medoids Clustering Method
312
A Typical K-Medoids Algorithm (PAM)
[Figure: PAM on a 2-D data set, K = 2, total cost = 20]
Arbitrarily choose k objects as the initial medoids
Assign each remaining object to the nearest medoid
Do loop until no change:
  randomly select a non-medoid object Orandom
  compute the total cost of swapping a medoid with Orandom
  if the quality is improved, perform the swap
313
PAM (Partitioning Around Medoids)
(1987)
314
PAM Clustering: Total swapping cost
TCih = Σj Cjih
[Figure: four cases showing the contribution Cjih of each object j to the cost of swapping medoid i with non-medoid h, depending on whether j is reassigned to h, to another medoid t, or stays where it is]
316
CLARA (Clustering Large Applications)
(1990)
317
CLARANS (“Randomized” CLARA)
(1994)
319
Outlier Discovery:
Statistical
Approaches
320
Outlier Discovery: Distance-Based
Approach
321
Density-Based Local
Outlier Detection
Distance-based outlier detection is based on the global distance distribution
It encounters difficulties in identifying outliers if the data are not uniformly distributed
Ex. C1 contains 400 loosely distributed points, C2 has 100 tightly condensed points, and there are 2 outlier points o1, o2
A distance-based method cannot identify o2 as an outlier
Need the concept of a local outlier
Local outlier factor (LOF)
Assume the notion of outlier is not crisp
Each point has a LOF
322
Outlier Discovery: Deviation-Based
Approach
323