

Data Mining:
Concepts and Techniques
— Tutorial —
M. Vazirgiannis, M. Halkidi
{mvazirg, mhalk}@aueb.gr
Dept. of Informatics,
Athens Univ. of Economics & Business,
Athens, Greece
http://www.db-net.aueb.gr
A Brief History of Data Mining Society
• 1989 IJCAI Workshop on Knowledge Discovery in Databases (Piatetsky-Shapiro)
  – Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
• 1991-1994 Workshops on Knowledge Discovery in Databases
  – Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
• 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)
  – Journal of Data Mining and Knowledge Discovery (1997)
• 1998 ACM SIGKDD, SIGKDD'1999-2001 conferences, and SIGKDD Explorations
• More conferences on data mining
  – PAKDD, PKDD, SIAM-Data Mining, (IEEE) ICDM, etc.
Tutorial Outline

• Introduction
• Data Preprocessing
• Cluster Analysis – Unsupervised Learning
• Classification – Supervised Learning
• Association Rules
• Quality Assessment
• Uncertainty Handling
• Data Mining Applications & Trends
Part 1 - Introduction

Introduction
• Motivation: Why data mining?
• What is data mining?
• Data Mining: On what kind of data?
• Data mining functionality
• Are all the patterns interesting?
• Classification of data mining systems
• Major issues in data mining
Motivation:
“Necessity is the Mother of Invention”
• Data explosion problem
  – Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
• We are drowning in data, but starving for knowledge!
• Solution: Data warehousing and data mining
  – Data warehousing and on-line analytical processing
  – Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
Evolution of Database Technology
• 1960s:
  – Data collection, database creation, IMS and network DBMS
• 1970s:
  – Relational data model, relational DBMS implementation
• 1980s:
  – RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s—2000s:
  – Data mining and data warehousing, multimedia databases, and Web databases
What Is Data Mining?
• Data mining (knowledge discovery in databases):
  – Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
• Alternative names and their "inside stories":
  – Data mining: a misnomer?
  – Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• What is not data mining?
  – (Deductive) query processing
  – Expert systems or small ML/statistical programs
Why Data Mining? — Potential Applications
• Database analysis and decision support
  – Market analysis and management
    – target marketing, customer relation management, market basket analysis, cross selling, market segmentation
  – Risk analysis and management
    – forecasting, customer retention, improved underwriting, quality control, competitive analysis
  – Fraud detection and management
• Other applications
  – Text mining (news groups, email, documents) and Web analysis
  – Intelligent query answering
Data Mining: A KDD Process
– Data mining is the core of the knowledge discovery process.
[Figure: the KDD process as a sequence of steps: Databases → Data Integration → Data Cleaning → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Steps of a KDD Process
• Learning the application domain:
– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation:
– Find useful features, dimensionality/variable reduction, invariant
representation.
• Choosing functions of data mining
– summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
Data Mining and Business Intelligence

[Figure: layers of increasing potential to support business decisions: Data Sources (DBA); Data Warehouses / Data Marts with OLAP and MDA (DBA); Data Exploration: statistical analysis, querying and reporting (Data Analyst); Data Mining: information discovery (Data Analyst); Data Presentation: visualization techniques (Business Analyst); Decision Making (End User)]
Architecture of a Typical Data Mining System

[Figure: layered architecture: Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server, coupled with a Knowledge base; Data cleaning, data integration and filtering over the underlying Databases and Data Warehouse]
Data Mining: On What Kind of Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
  – Object-oriented and object-relational databases
  – Spatial databases
  – Time-series data and temporal data
  – Text databases and multimedia databases
  – Heterogeneous and legacy databases
  – WWW
Type of data in clustering analysis

• Interval-scaled variables:
• Binary variables:
• Nominal, ordinal, and ratio variables:
• Variables of mixed types:

Interval-valued variables
• Standardize data
– Calculate the mean absolute deviation:
  s_f = (1/n) (|x_1f − m_f| + |x_2f − m_f| + ... + |x_nf − m_f|)
  where m_f = (1/n) (x_1f + x_2f + ... + x_nf)
– Calculate the standardized measurement (z-score):
  z_if = (x_if − m_f) / s_f
• Using the mean absolute deviation is more robust than using the standard deviation
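As a small illustration, the following Python sketch (ours, not part of the tutorial) standardizes one variable using the mean absolute deviation exactly as defined above:

def standardize(values):
    """Z-score using the mean absolute deviation, as defined above."""
    n = len(values)
    m = sum(values) / n                          # mean m_f
    s = sum(abs(x - m) for x in values) / n      # mean absolute deviation s_f
    return [(x - m) / s for x in values]         # z_if = (x_if - m_f) / s_f

# Example: standardize a small sample of one feature
print(standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))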
Similarity and Dissimilarity Between Objects

• Distances are normally used to measure the similarity or dissimilarity between two data objects
• Some popular ones include the Minkowski distance:
  d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + ... + |x_ip − x_jp|^q)^(1/q)
  where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and q is a positive integer
• If q = 1, d is the Manhattan distance:
  d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|
Similarity and Dissimilarity Between Objects (Cont.)

• If q = 2, d is the Euclidean distance:
  d(i, j) = sqrt(|x_i1 − x_j1|^2 + |x_i2 − x_j2|^2 + ... + |x_ip − x_jp|^2)
  – Properties
    • d(i,j) ≥ 0
    • d(i,i) = 0
    • d(i,j) = d(j,i)
    • d(i,j) ≤ d(i,k) + d(k,j)
• One can also use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures.
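A minimal Python sketch of these distances (helper name is ours, for illustration):

def minkowski(x, y, q=2):
    """Minkowski distance between two p-dimensional points; q=1 gives
    the Manhattan distance, q=2 the Euclidean distance."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

i, j = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski(i, j, q=1))   # Manhattan: 7.0
print(minkowski(i, j, q=2))   # Euclidean: 5.0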
Binary Variables
• A contingency table for binary data (object i vs. object j):

              object j: 1   object j: 0   sum
  object i: 1      a             b        a+b
  object i: 0      c             d        c+d
  sum             a+c           b+d        p

• Simple matching coefficient (invariant, if the binary variable is symmetric):
  d(i, j) = (b + c) / (a + b + c + d)
• Jaccard coefficient (noninvariant if the binary variable is asymmetric):
  d(i, j) = (b + c) / (a + b + c)
Dissimilarity between Binary Variables
• Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
– gender is a symmetric attribute
– the remaining attributes are asymmetric binary
– let the values Y and P be set to 1, and the value N be set to 0
  d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
  d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
  d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
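A small Python sketch (our own encoding of the table above) that reproduces these Jaccard-style dissimilarities for the asymmetric binary attributes:

def asym_binary_dissim(x, y):
    """d(i,j) = (b + c) / (a + b + c): 0-0 matches (d) are ignored."""
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    return (b + c) / (a + b + c)

# Y/P -> 1, N -> 0 for the asymmetric attributes Fever .. Test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(asym_binary_dissim(jack, mary), 2))  # 0.33
print(round(asym_binary_dissim(jack, jim), 2))   # 0.67
print(round(asym_binary_dissim(jim, mary), 2))   # 0.75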
Nominal Variables
• A generalization of the binary variable in that it can take more
than 2 states, e.g., red, yellow, blue, green
• Method 1: Simple matching
  – m: # of matches, p: total # of variables
    d(i, j) = (p − m) / p
• Method 2: use a large number of binary variables
  – create a new binary variable for each of the M nominal states
Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled:
  – replace x_if by its rank r_if ∈ {1, ..., M_f}
  – map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
    z_if = (r_if − 1) / (M_f − 1)
  – compute the dissimilarity using methods for interval-scaled variables
Ratio-Scaled Variables
• Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^(Bt) or Ae^(−Bt)
• Methods:
  – treat them like interval-scaled variables - not a good choice! (why?)
  – apply a logarithmic transformation: y_if = log(x_if)
  – treat them as continuous ordinal data and treat their ranks as interval-scaled
Variables of Mixed Types
• A database may contain all six types of variables:
  – symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.
• One may use a weighted formula to combine their effects:
  d(i, j) = ( Σ_{f=1..p} δ_ij^(f) d_ij^(f) ) / ( Σ_{f=1..p} δ_ij^(f) )
  – if f is binary or nominal:
    d_ij^(f) = 0 if x_if = x_jf, and d_ij^(f) = 1 otherwise
  – if f is interval-based: use the normalized distance
  – if f is ordinal or ratio-scaled:
    • compute the ranks r_if, set z_if = (r_if − 1) / (M_f − 1)
    • and treat z_if as interval-scaled
Data Mining Functionalities (1)

• Concept description: characterization and discrimination
  – Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
• Association (correlation and causality)
  – Multi-dimensional vs. single-dimensional association
  – age(X, "20..29") ∧ income(X, "20..29K") ⇒ buys(X, "PC") [support = 2%, confidence = 60%]
  – contains(T, "computer") ⇒ contains(T, "software") [1%, 75%]
Data Mining Functionalities (2)
• Classification and Prediction
– Finding models (functions) that describe and distinguish classes
or concepts for future prediction
– E.g., classify countries based on climate, or classify cars based
on gas mileage
– Presentation: decision-tree, classification rule, neural network
– Prediction: Predict some unknown or missing numerical values
• Cluster analysis
– Class label is unknown: Group data to form new classes, e.g.,
cluster houses to find distribution patterns
– Clustering based on the principle: maximizing the intra-class
similarity and minimizing the interclass similarity
Data Mining Functionalities (3)
• Outlier analysis
– Outlier: a data object that does not comply with the general
behavior of the data
– It can be considered as noise or exception but is quite
useful in fraud detection, rare events analysis
• Trend and evolution analysis
– Trend and deviation: regression analysis
– Sequential pattern mining, periodicity analysis
– Similarity-based analysis
• Other pattern-directed or statistical analyses
Are All the “Discovered” Patterns Interesting?
• A data mining system/query may generate thousands of patterns,
not all of them are interesting.
– Suggested approach: Human-centered, query-based, focused
mining
• Interestingness measures: A pattern is interesting if it is easily
understood by humans, valid on new or test data with some
degree of certainty, potentially useful, novel, or validates some
hypothesis that a user seeks to confirm
• Objective vs. subjective interestingness measures:
– Objective: based on statistics and structures of patterns, e.g.,
support, confidence, etc.
– Subjective: based on user’s belief in the data, e.g.,
unexpectedness, novelty, actionability, etc.
Data Mining: Confluence of Multiple Disciplines

[Figure: data mining at the confluence of multiple disciplines: database technology, statistics, machine learning, visualization, information science, and other disciplines]
A Multi-Dimensional View of Data Mining
Classification
• Databases to be mined
– Relational, transactional, object-oriented, object-relational,
active, spatial, time-series, text, multi-media, heterogeneous,
legacy, WWW, etc.
• Knowledge to be mined
– Characterization, discrimination, association, classification,
clustering, trend, deviation and outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning,
statistics, visualization, neural network, etc.
• Applications adapted
  – Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.
Major Issues in Data Mining (1)
• Mining methodology and user interaction
– Mining different kinds of knowledge in databases
– Interactive mining of knowledge at multiple levels of
abstraction
– Incorporation of background knowledge
– Data mining query languages and ad-hoc data mining
– Expression and visualization of data mining results
– Handling noise and incomplete data
– Pattern evaluation: the interestingness problem
• Performance and scalability
– Efficiency and scalability of data mining algorithms
– Parallel, distributed and incremental mining methods
Major Issues in Data Mining (2)
• Issues relating to the diversity of data types
– Handling relational and complex types of data
– Mining information from heterogeneous databases and
global information systems (WWW)
• Issues related to applications and social impacts
– Application of discovered knowledge
• Domain-specific data mining tools
• Intelligent query answering
• Process control and decision making
– Integration of the discovered knowledge with existing
knowledge: A knowledge fusion problem
– Protection of data security, integrity, and privacy
Summary
• Data mining: discovering interesting patterns from large
amounts of data
• A natural evolution of database technology, in great demand,
with wide applications
• A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
• Mining can be performed in a variety of information
repositories
• Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis,
etc.
• Classification of data mining systems
• Major issues in data mining
Part 2
Data Preprocessing

Data Preprocessing

• Why preprocess the data?


• Data cleaning
• Data integration and transformation
• Data reduction
• Summary

Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names

• No quality data, no quality mining results!


– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality
data

Multi-Dimensional Measure of Data Quality
• A well-accepted multidimensional view:
– Accuracy
– Completeness
– Consistency
– Timeliness
– Believability
– Value added
– Interpretability
– Accessibility
• Broad categories:
– intrinsic, contextual, representational, and accessibility.
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains reduced representation in volume but produces the
same or similar analytical results
• Data discretization
– Part of data reduction but with particular importance,
especially for numerical data
Data Cleaning

• Data cleaning tasks


– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data

Missing Data
• Data is not always available
  – e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
  – equipment malfunction
  – inconsistent with other recorded data and thus deleted
  – data not entered due to misunderstanding
  – certain data may not be considered important at the time of entry
  – history or changes of the data not registered
• Missing data may need to be inferred.
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g.,
“unknown”, a new class?!
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples belonging to the same
class to fill in the missing value: smarter
• Use the most probable value to fill in the missing value:
inference-based such as Bayesian formula or decision tree

Noisy Data

• Noise: random error or variance in a measured variable


• Incorrect attribute values may be due to
  – faulty data collection instruments
  – data entry problems
  – data transmission problems
  – technology limitations
  – inconsistency in naming conventions
• Other data problems which require data cleaning
  – duplicate records
  – incomplete data
  – inconsistent data
How to Handle Noisy Data?

• Binning method:
– first sort data and partition into (equi-depth) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human
• Regression
– smooth by fitting the data into regression functions

Simple Discretization Methods: Binning
• Equal-width (distance) partitioning:
– It divides the range into N intervals of equal size: uniform
grid
– if A and B are the lowest and highest values of the attribute,
the width of intervals will be: W = (B-A)/N.
– The most straightforward
– But outliers may dominate presentation
– Skewed data is not handled well.
• Equal-depth (frequency) partitioning:
  – It divides the range into N intervals, each containing approximately the same number of samples
  – Good data scaling
  – Managing categorical attributes can be tricky.
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25,
26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
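A small Python sketch (our own helpers, assuming equi-depth bins of size 4 as in the example) that reproduces the smoothing above:

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted

def equi_depth_bins(values, depth):
    return [values[i:i + depth] for i in range(0, len(values), depth)]

def smooth_by_means(bins):
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # replace each value by the closest bin boundary (min or max of the bin)
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

bins = equi_depth_bins(prices, 4)
print(smooth_by_means(bins))        # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]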
Cluster Analysis
[Figure: data points grouped into clusters; values falling outside any cluster can be detected and removed as outliers]
Regression
[Figure: data points such as (X1, Y1) smoothed by fitting a regression line y = x + 1; each observed value Y1 is replaced by the fitted value Y1']
Data Integration
• Data integration:
– combines data from multiple sources into a coherent store
• Schema integration
– integrate metadata from different sources
– Entity identification problem: identify real world entities from
multiple data sources, e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts
– for the same real world entity, attribute values from different
sources are different
– possible reasons: different representations, different scales,
e.g., metric vs. British units
Handling Redundant Data in Data Integration

• Redundant data occur often when integrating multiple databases
  – The same attribute may have different names in different databases
  – One attribute may be a "derived" attribute in another table, e.g., annual revenue
• Redundant data may be detected by correlation analysis
• Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Data Transformation

• Smoothing: remove noise from data


• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified range
– min-max normalization
– z-score normalization
– normalization by decimal scaling
• Attribute/feature construction
– New attributes constructed from the given ones

Data Transformation: Normalization
• min-max normalization:
  v' = ((v − min_A) / (max_A − min_A)) * (new_max_A − new_min_A) + new_min_A
• z-score normalization:
  v' = (v − mean_A) / stand_dev_A
• normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
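A brief Python sketch of the three normalizations (helper names and the sample numbers are ours, purely illustrative):

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scaling(values):
    j = 0
    while max(abs(v) for v in values) / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]

print(min_max(73600, 12000, 98000, 0.0, 1.0))   # an income value scaled to [0, 1]
print(z_score(73600, 54000, 16000))             # the same value as a z-score
print(decimal_scaling([-986, 917]))             # [-0.986, 0.917]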
Data Reduction Strategies
• Warehouse may store terabytes of data: Complex data
analysis/mining may take a very long time to run on the complete
data set
• Data reduction
– Obtains a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the
same) analytical results
• Data reduction strategies
– Data cube aggregation
– Dimensionality reduction
– Numerosity reduction
– Discretization and concept hierarchy generation
Data Cube Aggregation

• The lowest level of a data cube


– the aggregated data for an individual entity of interest
– e.g., a customer in a phone calling data warehouse.
• Multiple levels of aggregation in data cubes
– Further reduce the size of data to deal with
• Reference appropriate levels
– Use the smallest representation which is enough to solve the
task
• Queries regarding aggregated information should be
answered using data cube, when possible
Dimensionality Reduction
• Feature selection (i.e., attribute subset selection):
– Select a minimum set of features such that the probability
distribution of different classes given the values for those
features is as close as possible to the original distribution
given the values of all features
– reduces the number of attributes appearing in the discovered patterns, making them easier to understand

• Heuristic methods (due to exponential # of choices):


– step-wise forward selection
– step-wise backward elimination
– combining forward selection and backward elimination
– decision-tree induction
Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Figure: an induced decision tree that tests only A4, A1 and A6 before reaching Class 1 / Class 2 leaves]
> Reduced attribute set: {A1, A4, A6}
Data Compression

[Figure: lossless compression maps the original data to compressed data and back exactly; lossy compression reconstructs only an approximation of the original data]
Wavelet Transforms
Haar2 Daubechie4
• Discrete wavelet transform (DWT): linear signal processing
• Compressed approximation: store only a small fraction of the
strongest of the wavelet coefficients
• Similar to discrete Fourier transform (DFT), but better lossy
compression, localized in space
• Method:
  – The length, L, must be an integer power of 2 (pad with 0s when necessary)
  – Each transform has 2 functions: smoothing and difference
  – They apply to pairs of data points, resulting in two sets of data of length L/2
  – The two functions are applied recursively until the desired length is reached
Principal Component Analysis
• Given N data vectors from k-dimensions, find c <= k
orthogonal vectors that can be best used to represent data
– The original data set is reduced to one consisting of N
data vectors on c principal components (reduced
dimensions)
• Each data vector is a linear combination of the c principal
component vectors
• Works for numeric data only
• Used when the number of dimensions is large

Principal Component Analysis
[Figure: a 2-D data cloud in axes X1, X2 with its two principal component directions Y1 (greatest variance) and Y2 (orthogonal to Y1)]
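A compact NumPy sketch of PCA (ours, not the tutorial's code): center the data, eigen-decompose the covariance matrix, keep the c strongest components, and project.

import numpy as np

def pca(X, c):
    """Reduce an N x k data matrix X to N x c principal-component scores."""
    Xc = X - X.mean(axis=0)                          # center each dimension
    cov = np.cov(Xc, rowvar=False)                   # k x k covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:c]]  # c strongest components
    return Xc @ top                                  # projected (reduced) data

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
print(pca(X, 1))                                     # 5 points reduced to 1 dimension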
Numerosity Reduction
• Parametric methods
– Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the data
(except possible outliers)
– Log-linear models: obtain value at a point in m-D space as
the product on appropriate marginal subspaces
• Non-parametric methods
– Do not assume models
– Major families: histograms, clustering, sampling

Regression and Log-Linear Models

• Linear regression: Data are modeled to fit a straight


line
– Often uses the least-square method to fit the line
• Multiple regression: allows a response variable Y to
be modeled as a linear function of multidimensional
feature vector
• Log-linear model: approximates discrete
multidimensional probability distributions
Regression Analysis and Log-Linear Models
• Linear regression: Y = α + β X
– Two parameters, α and β, specify the line and are to be estimated by using the data at hand,
– applying the least-squares criterion to the known values Y1, Y2, ..., X1, X2, ....
• Multiple regression: Y = b0 + b1 X1 + b2 X2.
– Many nonlinear functions can be transformed into the above.
• Log-linear models:
  – The multi-way table of joint probabilities is approximated by a product of lower-order tables.
  – Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd
Part 3
Clustering (Unsupervised Learning)

Clustering –Unsupervised Learning
• Cluster analysis
  – Grouping a set of data objects into clusters
• Cluster: a collection of data objects
  – Similar to one another within the same cluster
  – Dissimilar to the objects in other clusters
• Clustering is unsupervised learning:
  – no predefined classes and no examples that would show what kind of desirable relations should be valid among the data
Clustering criteria

• high intra-cluster similarity
• low inter-cluster similarity
[Figure: three clusters; intra-class similarity is measured within a cluster, inter-class similarity between clusters]
Clustering Applications
• Pattern recognition
• Spatial data analysis
  – detect spatial clusters and explain them in spatial data mining
• Image processing
• Data reduction
• Economic science (especially market research)
• WWW
  – Document classification
  – Cluster Web-log data to discover groups of similar access patterns
Requirements of Clustering in Data Mining
• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input parameters
• Ability to deal with noise and outliers
• Insensitivity to the order of input records
• High dimensionality
• Interpretability and usability
Steps to develop a clustering process
• Feature selection. Properly select the features on which clustering is to be performed
• Clustering algorithm.
  – Proximity measure
  – Clustering criterion
• Validation of the results. The correctness of the clustering algorithm's results is verified using appropriate criteria and techniques
• Interpretation of the results.
A Categorization of Clustering Methods
• Technique used in order to define clusters
  – Partitional methods
  – Hierarchical methods
  – Density-based methods
  – Grid-based methods
• The type of variables
  – Statistical - numerical data
  – Conceptual - categorical data
• Theory used in order to extract clusters
  – Fuzzy clustering
  – Crisp clustering
  – Kohonen net clustering
Partitioning Algorithms: Basic Concept

• Partitional method: decompose the data set into a set of k disjoint clusters.
• Problem definition: given an integer k, find a partition of k clusters that optimizes the chosen partitioning criterion
The K-Means Clustering Method

The k-means algorithm assigns each point x_k to the cluster whose center m_c(x_k) is nearest, and minimizes the squared-error criterion:
  E_K = Σ_k || x_k − m_c(x_k) ||^2
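A minimal NumPy k-means sketch (ours, not the tutorial's code): pick k initial centers at random, then repeat the assign/update steps until the centers stop moving. Empty-cluster handling is omitted for brevity.

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]      # k initial centers
    for _ in range(iters):
        # assign each point to its nearest center (squared Euclidean distance)
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        # recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
print(kmeans(X, k=2))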
Comments on the K-Means Method
• Advantages
  – Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
  – Often terminates at a local optimum.
• Problems
  – Applicable only to numerical data sets
  – Need to specify the number of clusters in advance
  – Unable to handle noisy data and outliers
  – Not suitable for discovering clusters with non-convex shapes
Variations of the K-Means Method
• A few variants of k-means differ in
  – Selection of the initial k centers
  – Dissimilarity calculations
  – Strategies to calculate cluster centers
• Handling categorical data: k-modes (Huang '98)
  – Replacing cluster centers with modes
  – Using new dissimilarity measures to deal with categorical objects
  – A mixture of categorical and numerical data: the k-prototype method
The K-Medoids Clustering Method

Find representative objects, called medoids, in clusters
• PAM (Kaufmann & Rousseeuw, 1987)
• CLARA (Kaufmann & Rousseeuw, 1990)
• CLARANS (Ng & Han, 1994): randomized sampling
PAM (Partitioning Around Medoids)
• Step 1: Select k representative objects (medoids) arbitrarily
• Step 2: Compute the total swapping cost TC_ih = Σ_j C_jih for all pairs of objects O_i, O_h where O_i is currently selected and O_h is not
• Step 3: Select the pair O_i, O_h which minimizes TC_ih
• Step 4: If min TC_ih < 0, replace O_i with O_h and go to Step 2;
  else, assign each non-selected object to the most similar representative object and halt
PAM Clustering: total swapping cost TC_ih = Σ_j C_jih
[Figure: four cases for the contribution C_jih of a non-selected object j when medoid i is swapped with candidate h, with t another current medoid:
  1. j moves from i to h: C_jih = d(j, h) − d(j, i)
  2. j stays with another medoid t: C_jih = 0
  3. j moves from i to t: C_jih = d(j, t) − d(j, i)
  4. j moves from t to h: C_jih = d(j, h) − d(j, t)]
CLARA
(Clustering LARge Applications)

• It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
• Advantage: deals with larger data sets than PAM
• Problems:
  – Efficiency depends on the sample size
  – A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
CLARANS
(“Randomized” CLARA)
• The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids
• If a local optimum is found, CLARANS starts from a new randomly selected node in search of a new local optimum
• Input: number of local minima, maximum number of neighbors examined
• CLARANS draws a sample of neighbors dynamically
• It is more efficient and scalable than both PAM and CLARA
Hierarchical Clustering
Hierarchical clustering proceeds successively by either merging
smaller clusters into larger ones, or by splitting larger clusters.
The result of the algorithm is a tree of clusters, called a dendrogram.
[Figure: dendrogram over objects a-e; agglomerative clustering merges from Step 0 to Step 4 (a,b → ab; d,e → de; c + de → cde; ab + cde → abcde), while divisive clustering splits in the reverse direction, Step 4 back to Step 0]
Hierarchical Clustering Algorithms

• BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
• CURE (1998): is robust to outliers and identifies clusters of non-spherical shapes
• ROCK (1999): is a robust clustering algorithm for Boolean and categorical data. It introduces two new concepts: a point's neighbours and links.
BIRCH (1996)
(Balanced Iterative Reducing and Clustering using Hierarchies)

• Incrementally constructs a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
  – Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
  – Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
• Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
• Problem: handles only numeric data, and is sensitive to the order of the data records
Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS)
  N: number of data points
  LS: Σ_{i=1..N} X_i (linear sum of the points)
  SS: Σ_{i=1..N} X_i^2 (square sum of the points)
Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190))
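A tiny Python sketch (ours) that computes a CF entry for the example points and shows the additivity used when CF entries are merged in the tree:

def clustering_feature(points):
    """CF = (N, LS, SS) for a set of 2-D points."""
    n = len(points)
    ls = tuple(sum(p[d] for p in points) for d in range(2))        # linear sum
    ss = tuple(sum(p[d] ** 2 for p in points) for d in range(2))   # square sum
    return n, ls, ss

def merge_cf(cf1, cf2):
    """CF entries are additive: CF1 + CF2 summarizes the union of the two sets."""
    return (cf1[0] + cf2[0],
            tuple(a + b for a, b in zip(cf1[1], cf2[1])),
            tuple(a + b for a, b in zip(cf1[2], cf2[2])))

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(clustering_feature(pts))                     # (5, (16, 30), (54, 190))
print(merge_cf(clustering_feature(pts[:2]), clustering_feature(pts[2:])))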
CF Tree (B = 7 non-leaf entries, T = radius or diameter threshold of leaf entries, L = 6 leaf entries)
[Figure: a CF tree whose root and non-leaf nodes hold entries CF1 ... CF6, each with a child pointer, and whose leaf nodes hold CF entries chained together with prev/next pointers]
CURE (Clustering Using REpresentatives)
• Drawbacks of partitional and hierarchical clustering methods
  – They consider only one point as the representative of a cluster
  – They are good only for convex-shaped clusters of similar size and density, and only if k can be reasonably estimated
CURE
• Identifies clusters having non-spherical shapes.
  → Uses multiple representative points to evaluate the distance between clusters; adjusts well to arbitrarily shaped clusters
• Handles outliers.
  → Shrinks the multiple representative points towards the gravity center by a fraction α.
• CURE employs a combination of
  – Random sampling
  – Partitioning
• Problems: the number of clusters, k, must be known a priori.
  Time complexity: O(n^2 log n)
CURE: The Algorithm

– Draw a random sample s.
– Partition the sample into p partitions of size s/p
– Partially cluster each partition into s/(pq) clusters
– Eliminate outliers
– Cluster the partial clusters.
– Label the data on disk
Clustering Categorical Data: ROCK
• ROCK:
  – Uses links to measure similarity/proximity
  – Not distance based
• Basic ideas:
  – Similarity function: Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
  – Neighbors: if Sim(T1, T2) ≥ θ, then T1 and T2 are neighbors
  – Links: the number of common neighbors of the two points.
Density-Based Clustering Algorithms
• Clustering based on density (a local cluster criterion), such as density-connected points
• Major features:
  – Discover clusters of arbitrary shape
  – Handle noise
  – Need density parameters as a termination condition
• Representative algorithms:
  – DBSCAN: Ester, et al. (KDD'96)
  – DENCLUE: Hinneburg & Keim (KDD'98)
Density-Based Clustering: Background
• Two parameters:
  – Eps: maximum radius of the neighbourhood
  – MinPts: minimum number of points in an Eps-neighbourhood of that point
• N_Eps(p) = {q belongs to D | dist(p, q) <= Eps}
• Directly density-reachable: a point p is directly density-reachable from a point q wrt. Eps, MinPts if
  1) p belongs to N_Eps(q), and
  2) the core point condition |N_Eps(q)| >= MinPts holds
[Figure: p lies within the Eps = 1 cm neighbourhood of q, and q has at least MinPts = 5 points in its neighbourhood]
Density-Based Clustering: Background (II)
• Density-reachable:
  – A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, ..., pn, with p1 = q and pn = p, such that p_{i+1} is directly density-reachable from p_i
• Density-connected:
  – A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.
DBSCAN
• Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
• Discovers clusters of arbitrary shape in spatial databases with noise
[Figure: core, border and outlier points of a cluster for Eps = 1 cm and MinPts = 5]
DBSCAN: The Algorithm
– Arbitrarily select a point p
– Retrieve all points density-reachable from p wrt Eps and MinPts.
– If p is a core point, a cluster is formed.
– If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database.
– Continue the process until all of the points have been processed.
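A compact Python sketch of this procedure (ours, not the tutorial's code; a naive O(n^2) neighbourhood search, label -1 meaning noise):

def dbscan(points, eps, min_pts):
    """Naive DBSCAN: returns one cluster id per point, or -1 for noise."""
    def neighbours(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:              # not a core point (may stay noise)
            labels[i] = -1
            continue
        cluster += 1                          # start a new cluster from core point i
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster           # previously noise: becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neigh = neighbours(j)
            if len(j_neigh) >= min_pts:       # j is a core point: keep expanding
                queue.extend(j_neigh)
    return labels

data = [(1, 1), (1.1, 1), (0.9, 1.2), (5, 5), (5.1, 5.2), (9, 9)]
print(dbscan(data, eps=0.5, min_pts=2))       # [0, 0, 0, 1, 1, -1]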
DENCLUE
• Uses grid cells, but keeps information only about grid cells that actually contain data points
• The basic idea is to model the overall point density as the sum of the influence functions of the data points.
  – Influence function: describes the impact of a data point within its neighborhood (square wave function, Gaussian, ...)
  – Density function: sum of the influences of all data points
• Clusters can be identified by determining density attractors. Density attractors are local maxima of the overall density function.
DENCLUE
[Figure: data points mapped onto grid cells; only the populated grid cells are kept]
Grid - Based Clustering Method

• Quantize the space into a finite number of cells and then do all operations on the quantized space
• Several interesting methods:
  – STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
  – WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98)
  – CLIQUE: Agrawal, et al. (SIGMOD'98)
WaveCluster (1998)
• Clustering from a signal processing perspective using wavelets.
• Input parameters:
  – number of grid cells for each dimension
  – the wavelet, and
  – the number of applications of the wavelet transform.
• Problem: does not work in high-dimensional spaces
WaveCluster : The Algorithm
– Spatial data is represented in an n-dimensional feature space.
– Partition the data space by a grid.
– The collection of objects in the feature space composes a signal.
Basic idea: apply a wavelet transformation to the feature space
  – A wavelet transform is a signal processing technique that decomposes a signal into different frequency sub-bands.
    • High-frequency parts of the signal ⇒ boundaries of clusters
    • Low-frequency parts ⇒ clusters
– Find the connected components, as clusters, in the average sub-bands of the transformed feature space, at different levels
Fuzzy Clustering
• Crisp clustering: a data point either belongs to a class or not.
• Fuzzy clustering: a data point may belong to more than one cluster, with different degrees of belief.
Representative fuzzy clustering algorithm: Fuzzy C-Means (FCM).
FCM objective function:
  J_m(U, V) = Σ_{i=1..c} Σ_{k=1..n} U_ik^m d^2(x_k, v_i)
  m → 1  ⇒ clusters become crisp
  m → ∞  ⇒ clusters become completely fuzzy, U_ik → 1/c
Category: Partitional

• K-Means: numerical data; complexity O(n); non-convex shapes; does not handle outliers/noise; input: number of clusters k; results: centers of clusters; criterion: min over v1,...,vk of E_k = Σ_{i=1..k} Σ_{k=1..n} d^2(x_k, v_i).
• K-Mode: categorical data; complexity O(n); non-convex shapes; does not handle outliers/noise; input: number of clusters k; results: modes of clusters; criterion: min over Q1,...,Qk of E = Σ_{i=1..k} Σ_{l=1..n} d(X_l, Q_i), where d(X_l, Q_i) is the distance between the categorical object X_l and the mode Q_i.
• PAM: numerical data; complexity O(k(n-k)^2); non-convex shapes; does not handle outliers/noise; input: number of clusters k; results: medoids of clusters; criterion: min(TC_ih), TC_ih = Σ_j C_jih.
• CLARA: numerical data; complexity O(k(40+k)^2 + k(n-k)); non-convex shapes; does not handle outliers/noise; input: number of clusters k; results: medoids of clusters; criterion: min(TC_ih), TC_ih = Σ_j C_jih (C_jih = the cost of replacing center i with h as far as O_j is concerned).
• CLARANS: numerical data; complexity O(kn^2); non-convex shapes; does not handle outliers/noise; input: number of clusters k, maximum number of neighbors examined; results: medoids of clusters; criterion: min(TC_ih), TC_ih = Σ_j C_jih.
• FCM (Fuzzy C-Means): numerical data; complexity O(n); non-convex shapes; does not handle outliers/noise; input: number of clusters; results: centers of clusters and membership degrees (beliefs); criterion: min over U, v1,...,vk of J_m(U, V) = Σ_{i=1..k} Σ_{j=1..n} U_ik^m d^2(x_j, v_i).
Category: Hierarchical

• BIRCH: numerical data; complexity O(n); convex shapes; handles outliers; input: radius of clusters, branching factor; results: a CF entry per cluster (N = number of points in the cluster, LS = linear sum of the points, SS = square sum of the N data points); criterion: a point is assigned to the closest node (cluster) according to a chosen distance metric, and the definition of the clusters requires that the number of points in each cluster satisfy a threshold requirement.
• CURE: numerical data; complexity O(n^2 log n) time, O(n) space; arbitrary shapes; handles outliers; input: number of clusters, number of cluster representatives; results: assignment of data values to clusters; criterion: the clusters with the closest pair of representatives (well-scattered points) are merged at each step.
• ROCK: categorical data; complexity O(n^2 + n m_m m_a + n^2 log n) time, O(n^2, n m_m m_a) space; arbitrary shapes; handles outliers; input: number of clusters; results: assignment of data values to clusters; criterion: max(E_l), E_l = Σ_{i=1..k} n_i Σ_{p_q, p_r ∈ V_i} link(p_q, p_r) / n_i^(1+2f(θ)), where v_i is the center of cluster i and link(p_q, p_r) is the number of common neighbors between p_q and p_r.
Category: Density-based

• DBSCAN: numerical data; complexity O(n log n); arbitrary shapes; handles outliers/noise; input: cluster radius (Eps), minimum number of objects (MinPts); results: assignment of data values to clusters; criterion: merge points that are density-reachable into one cluster.
• DENCLUE: numerical data; complexity O(log n); arbitrary shapes; handles outliers/noise; input: cluster radius σ, minimum number of objects ξ; results: assignment of data values to clusters; criterion: f^D_Gauss(x*) = Σ_{x1 ∈ near(x*)} e^(−d(x*, x1)^2 / (2σ^2)); x* is a density attractor for a point x, and if f_Gauss(x*) > ξ then x is attached to the cluster belonging to x*.
Category: Grid-based

• WaveCluster: spatial data; complexity O(n); arbitrary shapes; handles outliers; input: the wavelet, number of grid cells for each dimension, number of applications of the wavelet transform; output: clustered objects; criterion: decompose the feature space by applying a wavelet transformation; the average sub-band gives the clusters, the detail sub-bands give the cluster boundaries.
• STING: spatial data; complexity O(K), where K is the number of grid cells at the lowest level; arbitrary shapes; handles outliers; input: number of objects in a cell; output: clustered objects; criterion: divide the spatial area into rectangular cells and employ a hierarchical structure, where each cell at a high level is partitioned into a number of smaller cells at the next lower level.
Part 4
Classification (Supervised Learning)

Classification
Classification can be described as a function that maps (classifies) a data item into one of several predefined classes.

Requirements
• A well-defined set of classes, and
• a training set of pre-classified examples that characterize the classification.

Goal: Induce a model that can be used to classify future data items whose classification is unknown

Classification methods
• Bayesian classification
• Decision trees
• Neural networks
Bayesian classification
The aim is to classify a sample x to one of the given classes c1, c2,.., cN
using a probability model defined according to Bayes theory

Requirements
  – the a priori probability p(c_i) for each class c_i
  – the conditional probability density function p(x | c_i) ∈ [0, 1]
Bayes formula (posterior probability):
  q(c_i | x) = p(x | c_i) p(c_i) / Σ_{j=1..C} p(x | c_j) p(c_j)
A pattern is classified into the class with the highest posterior probability.
Problem: complete knowledge of the probability laws is necessary in order to perform the classification
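A toy Python sketch of the Bayes rule above (ours; the priors and the 1-D Gaussian class-conditional densities are made-up assumptions for illustration):

import math

def gaussian(mean, std):
    """Hypothetical 1-D Gaussian class-conditional density p(x | c)."""
    return lambda x: math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

def bayes_classify(x, priors, likelihoods):
    """Pick the class with the highest posterior q(c_i | x)."""
    evidence = sum(likelihoods[c](x) * priors[c] for c in priors)
    posteriors = {c: likelihoods[c](x) * priors[c] / evidence for c in priors}
    return max(posteriors, key=posteriors.get), posteriors

priors = {"c1": 0.6, "c2": 0.4}                               # assumed a priori probabilities
likelihoods = {"c1": gaussian(0.0, 1.0), "c2": gaussian(3.0, 1.0)}
print(bayes_classify(2.0, priors, likelihoods))               # the posterior favours c2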
Decision Trees
• Decision trees are one of the most widely used techniques for classification and prediction.
Requirements
  – classes (categories)
  – a training set of pre-classified data.
Characteristics of decision trees
  – Internal node → a test on an attribute
  – Branch descending from a node → one of the possible values for this attribute
  – Leaf → one of the defined classes
[Figure: a small decision tree: root "outlook" with branches sunny (test humidity: high → N, normal → P), overcast (→ P) and rain (test wind: true → N, false → P)]
Decision Trees
Decision trees are constructed in two phases:
• Building phase. The training data set is recursively partitioned until all the instances in a partition have the same class.
• Pruning phase. Nodes are pruned to prevent overfitting and to obtain a tree with higher accuracy.

Decision tree algorithms
The various decision tree generation algorithms use different methods for selecting the test criterion for partitioning a set of records:
• ID3, C4.5: information gain
• CLS: examines the solution space of all possible decision trees to some fixed depth. It selects a test that minimizes the computational cost of classifying a record.
• SLIQ, SPRINT: select the attribute to test based on the GINI index.
ID3 : The Algorithm
Step 1:
If all instances in C are positive, then create YES node and
halt.
If all instances in C are negative, create a NO node and
halt.
Otherwise select an attribute, A with values v1, ..., vn and
create a decision node.

Step 2:
Partition the training instances in C into subsets C1, C2, ...,
Cn according to the values of A.

Step 3:
Apply the algorithm recursively to each of the sets Ci.
ID3 : Attribute Selection

How does ID3 decide which attribute is the best?
A statistical property, called information gain, is used.

Let S be a data set with attributes {A1, A2, ..., An}.
The gain of an attribute A compares the information needed to identify the class of an element of S with the information needed to identify the class of an element of S after the value of attribute A has been obtained.
ID3: Definitions
The information needed to identify the class of an element of S, called the entropy of S, is:
  Info(S) = -Σ p(I) log2 p(I)
where p(I) is the proportion of S belonging to class I.

The information needed to identify the class of an element of S after we partition S, on the basis of the values of an attribute A, into sets S_v is:
  Info(S, A) = Σ_v (|S_v| / |S|) * Info(S_v)

Gain(S, A) is the information gain of example set S on attribute A:
  Gain(S, A) = Info(S) − Info(S, A)
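A short Python sketch of these quantities (ours, not the tutorial's code), applied to the Play_ball data shown on the next slide; it reproduces Gain(S, Outlook) ≈ 0.246:

import math
from collections import Counter

def info(labels):
    """Entropy Info(S) = -Σ p(I) log2 p(I) over the class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attr, labels):
    """Gain(S, A) = Info(S) - Σ_v (|S_v| / |S|) * Info(S_v)."""
    n = len(labels)
    info_after = 0.0
    for v in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == v]
        info_after += (len(subset) / n) * info(subset)
    return info(labels) - info_after

outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
           "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]
play    = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
           "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
rows = [{"Outlook": o} for o in outlook]
print(round(gain(rows, "Outlook", play), 3))   # 0.246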
ID3: Example
Outlook Temperature Humidity Wind Play_ball
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
ID3: Example
Which attribute will be the root node?
  Gain(S, Outlook) = 0.246
  Gain(S, Temperature) = 0.029
  Gain(S, Humidity) = 0.151
  Gain(S, Wind) = 0.048
⇒ Outlook becomes the root, with branches Sunny, Overcast and Rain.

For the Sunny branch:
  Gain(Sunny, Humidity) = 0.970
  Gain(Sunny, Temperature) = 0.570
  Gain(Sunny, Wind) = 0.019
⇒ Humidity is tested under Sunny: High → No, Normal → Yes.
ID3 Example: Final Decision Tree
[Figure: Outlook at the root; Sunny → Humidity (High → No, Normal → Yes); Overcast → Yes; Rain → Wind (Strong → No, Weak → Yes)]
C4.5

C4.5 is an extension of ID3 that accounts for

• Unavailable values
• Continuous attribute value ranges
• Pruning of decision trees
• Rule derivation

Neural Networks
• Neural networks: an approach used in many data mining applications for prediction and classification.
[Figure: structure of a feed-forward neural network with an input level (X1 ... X5), a hidden level, and an output level (Z1, Z2)]
Neural Networks (continue …)
The main steps in building a classification model are:
• identification of the input and output features,
• setting up a network with an appropriate topology,
• selection of a suitable training set,
• training the network on a representative data set. The data should be represented in such a way as to maximize the ability of the network to recognize patterns in it,
• testing the network using a test set that is independent from the training set.

The model generated by the network is applied to predict the classes (outcomes) of unknown instances (inputs).
Part 5
Association Rules
What Is Association Mining?
• Association rule mining:
  – Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
• Applications:
  – Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
• Examples:
  – Rule form: "Body ⇒ Head [support, confidence]"
  – buys(x, "diapers") ⇒ buys(x, "beers") [0.5%, 60%]
  – major(x, "CS") ∧ takes(x, "DB") ⇒ grade(x, "A") [1%, 75%]
Association Rule: Basic Concepts
• Given: (1) database of transactions, (2) each transaction is a list
of items (purchased by a customer in a visit)
• Find: all rules that correlate the presence of one set of items with
that of another set of items
  – E.g., 98% of people who purchase tires and auto accessories also get automotive services done
• Applications
  – * ⇒ Maintenance Agreement (What should the store do to boost Maintenance Agreement sales?)
  – Home Electronics ⇒ * (What other products should the store stock up on?)
  – Attached mailing in direct marketing
  – Detecting "ping-pong"ing of patients, faulty "collisions"
Rule Measures: Support and Confidence
[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both]
• Find all the rules X & Y ⇒ Z with minimum confidence and support
  – support, s: probability that a transaction contains {X, Y, Z}
  – confidence, c: conditional probability that a transaction having {X, Y} also contains Z

  Transaction ID   Items Bought
  2000             A, B, C
  1000             A, C
  4000             A, D
  5000             B, E, F

With minimum support 50% and minimum confidence 50%, we have
  – A ⇒ C (50%, 66.6%)
  – C ⇒ A (50%, 100%)
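A short Python sketch (ours) computing support and confidence for A ⇒ C over the four transactions above:

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

print(support({"A", "C"}))              # 0.5   -> 50% support for A => C
print(confidence({"A"}, {"C"}))         # 0.666 -> 66.6% confidence for A => C
print(confidence({"C"}, {"A"}))         # 1.0   -> 100% confidence for C => A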
Association Rule Mining: A Road Map
• Boolean vs. quantitative associations (based on the types of values handled)
  – buys(x, "SQLServer") ∧ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]
  – age(x, "30..39") ∧ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]
• Single-dimensional vs. multi-dimensional associations (see the examples above)
• Single-level vs. multiple-level analysis
  – What brands of beers are associated with what brands of diapers?
• Various extensions
  – Correlation, causality analysis
    • Association does not necessarily imply correlation or causality
  – Maxpatterns and closed itemsets
  – Constraints enforced, e.g., small sales (sum < 100) trigger big buys (sum > 1,000)?
Mining Association Rules—An Example
Min. support 50%, min. confidence 50%

  Transaction ID   Items Bought
  2000             A, B, C
  1000             A, C
  4000             A, D
  5000             B, E, F

  Frequent Itemset   Support
  {A}                75%
  {B}                50%
  {C}                50%
  {A, C}             50%

For the rule A ⇒ C:
  support = support({A, C}) = 50%
  confidence = support({A, C}) / support({A}) = 66.6%
The Apriori principle: any subset of a frequent itemset must be frequent
Mining Frequent Itemsets: the Key Step

• Find the frequent itemsets: the sets of items that have minimum support
  – A subset of a frequent itemset must also be a frequent itemset
    • i.e., if {A B} is a frequent itemset, both {A} and {B} must be frequent itemsets
  – Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
• Use the frequent itemsets to generate association rules.
The Apriori Algorithm
• Join Step: Ck is generated by joining Lk-1 with itself
• Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of
a frequent k-itemset
• Pseudo-code:
Ck: Candidate itemset of size k
Lk : frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk !=∅; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1
that are contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
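
A compact, runnable Python version of this loop (our own sketch, run on the toy transactions of the example that follows; min_support is an absolute count here, not a percentage):

from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    items = sorted({i for t in transactions for i in t})
    freq = {}
    level = [frozenset([i]) for i in items]          # C1: candidate 1-itemsets
    k = 1
    while level:
        # count candidates in one pass over the database
        counts = {c: sum(c <= t for t in transactions) for c in level}
        current = {c: s for c, s in counts.items() if s >= min_support}  # Lk
        freq.update(current)
        # join step: Lk x Lk -> candidate (k+1)-itemsets
        keys = sorted(current, key=sorted)
        candidates = {a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1}
        # prune step: every k-subset of a candidate must itself be frequent
        level = [c for c in candidates
                 if all(frozenset(s) in current for s in combinations(c, k))]
        k += 1
    return freq

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(db, min_support=2))                    # includes {2, 3, 5} with support 2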

The Apriori Algorithm — Example
Database D (min. support count = 2):
  TID 100: 1 3 4;  TID 200: 2 3 5;  TID 300: 1 2 3 5;  TID 400: 2 5

Scan D for C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
  → L1 = { {1}:2, {2}:3, {3}:3, {5}:3 }
Generate C2 = { {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5} }
Scan D for C2: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
  → L2 = { {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2 }
Generate C3 = { {2 3 5} }
Scan D for C3: {2 3 5}:2
  → L3 = { {2 3 5}:2 }
How to Generate Candidates?
• Suppose the items in Lk-1 are listed in an order
• Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 <
q.itemk-1
• Step 2: pruning
  forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
      if (s is not in Lk-1) then delete c from Ck
How to Count Supports of Candidates?
• Why is counting the supports of candidates a problem?
– The total number of candidates can be very huge
– One transaction may contain many candidates
• Method:
– Candidate itemsets are stored in a hash-tree
– Leaf node of hash-tree contains a list of itemsets and
counts
– Interior node contains a hash table
– Subset function: finds all the candidates contained in a
transaction

Example of Generating Candidates
• L3 = {abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
  – abcd from abc and abd
  – acde from acd and ace
• Pruning:
  – acde is removed because ade is not in L3
• C4 = {abcd}
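A compact sketch of the join-and-prune step for this example (illustrative only; items are assumed to be kept in lexicographic order, as the previous slide requires):

    from itertools import combinations

    def gen_candidates(L_prev, k):
        """Build Ck from the frequent (k-1)-itemsets L_prev (join step),
        then prune any candidate with an infrequent (k-1)-subset."""
        prev = sorted(tuple(sorted(s)) for s in L_prev)
        prev_set = set(prev)
        Ck = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                a, b = prev[i], prev[j]
                if a[:-1] == b[:-1] and a[-1] < b[-1]:          # join step
                    cand = a + (b[-1],)
                    # prune step: every (k-1)-subset must itself be frequent
                    if all(s in prev_set for s in combinations(cand, k - 1)):
                        Ck.add(cand)
        return Ck

    L3 = ["abc", "abd", "acd", "ace", "bcd"]
    print(gen_candidates(L3, 4))   # {('a','b','c','d')}; acde is pruned since ade is not in L3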
Methods to Improve Apriori’s Efficiency
• Hash-based itemset counting: A k-itemset whose
corresponding hashing bucket count is below the threshold
cannot be frequent
• Transaction reduction: A transaction that does not contain
any frequent k-itemset is useless in subsequent scans
• Partitioning: Any itemset that is potentially frequent in DB
must be frequent in at least one of the partitions of DB
• Sampling: mining on a subset of given data, lower support
threshold + a method to determine the completeness
• Dynamic itemset counting: add new candidate itemsets only
when all of their subsets are estimated to be frequent
Is Apriori Fast Enough?-Performance Bottlenecks
• The core of the Apriori algorithm:
– Use frequent (k – 1)-itemsets to generate candidate frequent
k-itemsets
– Use database scan and pattern matching to collect counts for
the candidate itemsets
• The bottleneck of Apriori: candidate generation
– Huge candidate sets:
9 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
9 To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates.
– Multiple scans of database:
9 Needs (n + 1) scans, where n is the length of the longest pattern
Iceberg Queries
• Iceberg query: Compute aggregates over one or a set of attributes only for those whose aggregate values are above a certain threshold
• Example:
select P.custID, P.itemID, sum(P.qty)
from purchase P
group by P.custID, P.itemID
having sum(P.qty) >= 10
• Compute iceberg queries efficiently by Apriori:
– First compute lower dimensions
– Then compute higher dimensions only when all the lower
ones are above the threshold
Multi-Dimensional Association: Concepts
• Single-dimensional rules:
buys(X, “milk”) ⇒ buys(X, “bread”)
• Multi-dimensional rules: ≥ 2 dimensions or predicates
– Inter-dimension association rules (no repeated predicates)
age(X,”19-25”) ∧ occupation(X,“student”) ⇒
buys(X,“coke”)
– hybrid-dimension association rules (repeated predicates)
age(X,”19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)
• Categorical Attributes
– finite number of possible values, no ordering among values
• Quantitative Attributes
– numeric, implicit ordering among values
Techniques for Mining MD Associations
• Search for frequent k-predicate set:
– Example: {age, occupation, buys} is a 3-predicate set.
– Techniques can be categorized by how quantitative attributes such as age are treated.
1. Using static discretization of quantitative attributes
– Quantitative attributes are statically discretized by using
predefined concept hierarchies.
2. Quantitative association rules
– Quantitative attributes are dynamically discretized into
“bins”based on the distribution of the data.
3. Distance-based association rules
– This is a dynamic discretization process that considers the
distance between data points.
Static Discretization of Quantitative Attributes
• Discretized prior to mining using concept hierarchy.
• Numeric values are replaced by ranges.
• In a relational database, finding all frequent k-predicate sets will require k or k+1 table scans.
• Data cube is well suited for mining.
• The cells of an n-dimensional cuboid correspond to the predicate sets.
• Mining from data cubes can be much faster.
[Lattice of cuboids: () ; (age) (income) (buys) ; (age,income) (age,buys) (income,buys) ; (age,income,buys)]
Quantitative Association Rules
• Numeric attributes are dynamically discretized
– Such that the confidence or compactness of the rules mined
is maximized.
• 2-D quantitative association rules: Aquan1 ∧ Aquan2 ⇒ Acat
• Cluster “adjacent” association rules to form general rules using a 2-D grid.
• Example:
age(X, ”30-34”) ∧ income(X, ”24K-48K”) ⇒ buys(X, ”high resolution TV”)
ARCS (Association Rule Clustering System)
How does ARCS work?
1. Binning
2. Find frequent predicate sets
3. Clustering
4. Optimize
Limitations of ARCS
• Only quantitative attributes on LHS of rules.
• Only 2 attributes on LHS (2-D limitation).
• An alternative to ARCS
– Non-grid-based
– equi-depth binning
– clustering based on a measure of partial completeness.
– “Mining Quantitative Association Rules in Large
Relational Tables” by R. Srikant and R. Agrawal.
Mining Distance-based Association Rules
• Binning methods do not capture the semantics of interval data
Price($)   Equi-width (width $10)   Equi-depth (depth 2)   Distance-based
7          [0,10]                   [7,20]                 [7,7]
20         [11,20]                  [22,50]                [20,22]
22         [21,30]                  [51,53]                [50,53]
50         [31,40]
51         [41,50]
53         [51,60]
• Distance-based partitioning, a more meaningful discretization, considering:
– density/number of points in an interval
– “closeness” of points in an interval
Clusters and Distance Measurements
• S[X] is a set of N tuples t1, t2, …, tN, projected on the attribute set X
• The diameter of S[X]:

d(S[X]) = \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} dist_X(t_i[X], t_j[X])}{N(N-1)}
– dist_X: distance metric, e.g. Euclidean or Manhattan distance
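As a small sanity check of the formula (an illustrative helper, not from the tutorial), the diameter is just the average pairwise distance:

    import math

    def diameter(S_X):
        """Average pairwise distance of tuples projected on attribute set X.
        S_X: list of equal-length numeric tuples."""
        N = len(S_X)
        if N < 2:
            return 0.0
        # the i == j terms of the double sum are 0, so they can be skipped
        total = sum(
            math.dist(S_X[i], S_X[j])
            for i in range(N) for j in range(N) if i != j
        )
        return total / (N * (N - 1))

    print(diameter([(7,), (20,), (22,)]))   # e.g. prices projected on one attribute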
Clusters and Distance Measurements(Cont.)
• The diameter, d, assesses the density of a cluster CX, where

d(C_X) \le d_0^X   and   |C_X| \ge s_0

• Finding clusters and distance-based rules
– the density threshold, d0, replaces the notion of support
– modified version of the BIRCH clustering algorithm
Interestingness Measurements
• Objective measures
Two popular measurements:
– support; and
– confidence
• Subjective measures (Silberschatz & Tuzhilin, KDD95)
A rule (pattern) is interesting if
– it is unexpected (surprising to the user); and/or
– actionable (the user can do something with it)
Criticism to Support and Confidence
• Example 1: (Aggarwal & Yu, PODS98)
– Among 5000 students
• 3000 play basketball
• 3750 eat cereal
• 2000 both play basketball and eat cereal
– play basketball ⇒ eat cereal [40%, 66.7%] is misleading
because the overall percentage of students eating cereal is
75% which is higher than 66.7%.
– play basketball ⇒ not eat cereal [20%, 33.3%] is far more
accurate, although with lower support and confidence
               basketball   not basketball   sum(row)
cereal         2000         1750             3750
not cereal     1000         250              1250
sum(col.)      3000         2000             5000
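The misleading confidence can be checked numerically with the contingency counts above (illustrative arithmetic only):

    # Counts from the contingency table above
    n = 5000
    basketball, cereal, both = 3000, 3750, 2000

    support = both / n                    # 0.40
    confidence = both / basketball        # 0.667
    lift = confidence / (cereal / n)      # 0.667 / 0.75 ≈ 0.89
    print(support, confidence, lift)      # lift < 1: playing basketball and eating cereal are negatively correlated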
Criticism to Support and Confidence (Cont.)
• Example 2:
– X and Y: positively correlated
– X and Z: negatively related
– support and confidence of X=>Z dominates

  X  1 1 1 1 0 0 0 0
  Y  1 1 0 0 0 0 0 0
  Z  0 1 1 1 1 1 1 1

  Rule    Support   Confidence
  X=>Y    25%       50%
  X=>Z    37.5%     75%

• We need a measure of dependent or correlated events:

corr_{A,B} = \frac{P(A \cup B)}{P(A) P(B)}

• P(B|A)/P(B) is also called the lift of rule A => B
Other Interestingness Measures: Interest
• Interest (correlation, lift):

Interest(A, B) = \frac{P(A \wedge B)}{P(A) P(B)}

– takes both P(A) and P(B) into consideration
– P(A∧B) = P(A)·P(B), if A and B are independent events
– A and B negatively correlated if the value is less than 1; otherwise A and B positively correlated

  X  1 1 1 1 0 0 0 0
  Y  1 1 0 0 0 0 0 0
  Z  0 1 1 1 1 1 1 1

  Itemset   Support   Interest
  X,Y       25%       2
  X,Z       37.5%     0.9
  Y,Z       12.5%     0.57
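The interest values in the table can be recomputed from the binary occurrence vectors (a minimal sketch; the slide's 0.9 is a rounded value):

    X = [1,1,1,1,0,0,0,0]
    Y = [1,1,0,0,0,0,0,0]
    Z = [0,1,1,1,1,1,1,1]

    def p(v):                      # P(item appears in a transaction)
        return sum(v) / len(v)

    def p_and(a, b):               # P(both appear in the same transaction)
        return sum(x & y for x, y in zip(a, b)) / len(a)

    def interest(a, b):
        return p_and(a, b) / (p(a) * p(b))

    print(interest(X, Y), interest(X, Z), interest(Y, Z))   # 2.0, ~0.86 (≈0.9), ~0.57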
Association Rules:Summary
• Association rule mining
– probably the most significant contribution from the
database community in KDD
– A large number of papers have been published
• Many interesting issues have been explored
• An interesting research direction
– Association analysis in other types of data: spatial data,
multimedia data, time series data, etc.
Part 6
Clustering Quality Assessment
Clustering Quality Assessment
What is Good Clustering?
9 “How many clusters are there in the data set?”
9 “Does the defined clustering scheme fit our data set?”
9 “Is there a better clustering possible?”
Clustering Quality Assessment
Clustering Quality Criteria
¾ high intra-cluster similarity
  y Variance
¾ low inter-cluster similarity
  y Single Linkage
  y Complete Linkage
  y Comparison of centroids
[Figure: three clusters (cluster1, cluster2, cluster3) annotated with single-linkage, complete-linkage and centroid-comparison distances]
[Figures: DBSCAN clustering results on a sample 2-D data set for Eps=2, Nps=4 and Eps=6, Nps=4 (clusters 1–3), and a second data set partitioned into four clusters, shown in plots (a) and (b)]
Clustering Validity Indices
A number of cluster validity indices are described in literature
¾Crisp Clustering
9Separation index ( Dunn)
9 DB (Davies-Bouldin)
9 RMSSTD & RS (Subhash Sharma )
¾Fuzzy Clustering
9Partition coefficient (Bezdek)
9Classification entropy (Bezdek)
9…
Drawbacks
ƒ computationally expensive.
ƒ monotonic dependency on the number of clusters
ƒ lack of a direct connection to the geometry of the data
Clustering Quality Index - SD
¾ Variance of the data set:

\sigma_X^p = \frac{1}{n} \sum_{k=1}^{n} \left( x_k^p - \bar{x}^p \right)^2 ,  with  \bar{x} = \frac{1}{n} \sum_{k=1}^{n} x_k ,  \forall x_k \in X

¾ Variance of cluster i:

\sigma_{v_i}^p = \frac{1}{n_i} \sum_{k=1}^{n_i} \left( x_k^p - v_i^p \right)^2

¾ Average scattering for clusters:

Scat(c) = \frac{ \frac{1}{c} \sum_{i=1}^{c} \| \sigma(v_i) \| }{ \| \sigma(X) \| }

¾ Total scattering (separation) between clusters:

Dis(c) = \frac{D_{max}}{D_{min}} \sum_{k=1}^{c} \left( \sum_{z=1}^{c} \| v_k - v_z \| \right)^{-1}

SD(c) = Dis(c_{max}) \cdot Scat(c) + Dis(c)
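A minimal numpy sketch of these terms (an illustrative reading of the definitions, with clusters given as arrays of points; function and variable names are ours, not the authors'):

    import numpy as np

    def scat(clusters, X):
        """Average scattering: mean ||var(cluster_i)|| divided by ||var(X)||."""
        norm = np.linalg.norm
        return np.mean([norm(np.var(c, axis=0)) for c in clusters]) / norm(np.var(X, axis=0))

    def dis(clusters):
        """Total separation between cluster centers."""
        v = np.array([c.mean(axis=0) for c in clusters])
        d = np.linalg.norm(v[:, None, :] - v[None, :, :], axis=-1)   # pairwise center distances
        off = d[~np.eye(len(v), dtype=bool)]                         # off-diagonal distances
        return (off.max() / off.min()) * sum(1.0 / d[k].sum() for k in range(len(v)))

    def sd_index(clusters, X, dis_cmax):
        """SD(c) = Dis(c_max) * Scat(c) + Dis(c); dis_cmax = Dis evaluated at c_max."""
        return dis_cmax * scat(clusters, X) + dis(clusters)

Lower SD values generally indicate a better partitioning; Dis(c_max) is evaluated once for the largest number of clusters examined and then reused for every c.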
Clustering Quality Index - Definition (II)
¾ Intra-cluster variance - Average scattering for clusters:

Scat(c) = \frac{ \frac{1}{c} \sum_{i=1}^{c} \| \sigma(v_i) \| }{ \| \sigma(X) \| }

where

\sigma_x^p = \frac{1}{n} \sum_{k=1}^{n} \left( x_k^p - \bar{x}^p \right)^2 ,  with  \bar{x}^p the p-th dimension of  \bar{x} = \frac{1}{n} \sum_{k=1}^{n} x_k ,  \forall x_k \in X

\sigma_{v_i}^p = \frac{1}{n_i} \sum_{k=1}^{n_i} \left( x_k^p - v_i^p \right)^2
Clustering Quality Index - Definition (I)
¾ Inter-cluster Density (ID) - Average density between clusters:

Dens\_bw(c) = \sum_{i=1}^{c} \left[ \frac{1}{c-1} \sum_{j=1, j \ne i}^{c} density\_between(v_i, v_j) \right]

where  density\_between(v_i, v_j) = \frac{density(u_{ij})}{dis(v_i, v_j)}
(u_{ij}: the middle point of the line segment defined by the centers v_i, v_j)

density(u_{ij}) = \sum_{l=1}^{n} f(x_l, u_{ij}) ,   n = number of tuples,  x_l \in S

f(x, u) = 0 if d(x, u) > \sigma, 1 otherwise
Clustering Quality Index -S_Dbw
S_Dbw(c) = a \cdot Scat(c) + Dens\_bw(c)

a is a factor for normalizing the value of S_Dbw and is defined as:
a = \max_{c=1,..,c_{max}} Dens\_bw(c)
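A rough sketch of how the combined index could be evaluated. The inter-cluster density term below follows the commonly cited S_Dbw formulation (density at the midpoint u_ij relative to the denser of the two clusters, with a neighbourhood radius playing the role of σ); the exact normalization on the slide may differ, and all names are illustrative:

    import numpy as np

    def scat(clusters, X):
        norm = np.linalg.norm
        return np.mean([norm(np.var(c, axis=0)) for c in clusters]) / norm(np.var(X, axis=0))

    def dens_bw(clusters, radius):
        centers = [c.mean(axis=0) for c in clusters]

        def density(point, members):
            # number of points within `radius` of `point`
            return int(np.sum(np.linalg.norm(members - point, axis=1) <= radius))

        c = len(clusters)
        total = 0.0
        for i in range(c):
            for j in range(c):
                if i == j:
                    continue
                members = np.vstack([clusters[i], clusters[j]])
                u_ij = (centers[i] + centers[j]) / 2.0
                denom = max(density(centers[i], members), density(centers[j], members), 1)
                total += density(u_ij, members) / denom
        return total / (c * (c - 1))

    def s_dbw(clusters, X, radius, a=1.0):
        # a: the normalizing factor from the slide (max Dens_bw over the examined c values)
        return a * scat(clusters, X) + dens_bw(clusters, radius)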
S_Dbw(c) = a·Scat(c) + Dens_bw(c)

[Figure: three example partitionings of the same data set illustrating the two terms —
(a) Scat ↑ & Dens_bw ↓, (b) Scat ↓ & Dens_bw ↑, (c) Scat ↑ & Dens_bw ↑]
[Figure: S_Dbw as a function of the number of clusters c, and the corresponding sample data set]

Number of clusters   Quality measure (S_Dbw)
8                    1139.98
7                    1122.26
6                    936.01
5                    869.41
4                    933.31
3                    969.39
2                    471.81
Comparison of S_Dbw with other validity indices
             DataSet1   DataSet2   DataSet3   DataSet4
RS, RMSSTD   Copt = 2   Copt = 2   Copt = 3   Copt = 4
DB           Copt = 2   Copt = 3   Copt = 3   Copt = 4
SD           Copt = 5   Copt = 3   Copt = 3   Copt = 3
S_Dbw        Copt = 2   Copt = 2   Copt = 3   Copt = 3

It is obvious that S_Dbw finds the correct number of clusters fitting a data set while other validity indices fail in some cases.
Best partitioning proposed by a specific algorithm, as defined by S_Dbw
Select the algorithm that gives the best partitioning, i.e. the minimum S_Dbw (a small selection sketch follows the table):

Clustering Algorithm   S_Dbw (c = 3)
K-means                458.547
DBSCAN                 57.291
CURE                   105.878
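Illustrative selection using the values reported above:

    # S_Dbw values reported above for c = 3
    s_dbw = {"K-means": 458.547, "DBSCAN": 57.291, "CURE": 105.878}
    best = min(s_dbw, key=s_dbw.get)   # 'DBSCAN' — lowest S_Dbw indicates the best partitioning
    print(best)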
Quality Assessment:Summary
We address the important issue of assessing the quality of clustering algorithm results, i.e. how close are the results to the real partitions of the data set.

Contributions
¾ a new validity measure (S_Dbw) for
  9 selecting the best clustering scheme for a data set
  9 assessing the results of a specific clustering algorithm

Further work
¾ definition of an index that works properly in the case of clusters of non-convex shape (e.g. rings)
¾ an integrated algorithm for cluster discovery putting emphasis on the geometric features of clusters
PART 7 –
Uncertainty handling
in the data mining
process with fuzzy logic
Uncertainty handling in the data mining
process with fuzzy logic
OUTLINE
„ Motivation for uncertainty handling
„ Classification Framework
„ Experimental evaluation
„ Decision support
Motivation for uncertainty handling
[Example relation with attributes Tid, Client_salary, client_age, Price — e.g. (1, 6387, 64, 567), (2, 4048, 70, 261), (3, 5829, 53, 307), …]

Data mining results: classifications, association rules, decision trees etc.
e.g. client_salary[8000,11000] and client_age[25-40] ⇒ price[1300,2000]

Most KDD approaches classify data values into one category

Issues:
• the clusters are not overlapping
• the data values are treated equally in the classification process
• classification results may hide "useful" knowledge for our data set
Classification Scheme - Classification space (CS)
Some clustering or learning algorithm resulted in:

client_salary    low     Medium     High
Min              1500    2500       4000
Max              3000    5500       10000
Function         decr    triangle   increasing

client_age       young   Medium     Old
Min              18      30         50
Max              40      60         80
Function         decr    triangle   increasing

price            very cheap   Cheap      moderate   expensive
Min              1            10         35         70
Max              15           50         80         150
Function         decr         triangle   triangle   triangle

Mapping functions
[Figure: membership functions for client_age — “young” (decreasing), “medium” (triangular), “old” (increasing) — plotted over ages 18–78]
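A hedged sketch of how such mapping functions might be evaluated. The decreasing / triangular / increasing shapes are assumed to be the usual piecewise-linear fuzzy memberships over the [Min, Max] intervals above; the exact parameterization used by the authors may differ:

    def decreasing(x, lo, hi):
        """Membership 1 at or below lo, falling linearly to 0 at hi (e.g. "young": 18-40)."""
        if x <= lo: return 1.0
        if x >= hi: return 0.0
        return (hi - x) / (hi - lo)

    def increasing(x, lo, hi):
        """Membership 0 at or below lo, rising linearly to 1 at hi (e.g. "old": 50-80)."""
        if x <= lo: return 0.0
        if x >= hi: return 1.0
        return (x - lo) / (hi - lo)

    def triangle(x, lo, hi):
        """Peak at the midpoint of [lo, hi] (e.g. "medium" age: 30-60)."""
        mid = (lo + hi) / 2
        if x <= lo or x >= hi: return 0.0
        return (x - lo) / (mid - lo) if x <= mid else (hi - x) / (hi - mid)

    age = 35
    dob = {"young": decreasing(age, 18, 40),
           "medium": triangle(age, 30, 60),
           "old": increasing(age, 50, 80)}
    print(dob)   # degrees of belief (d.o.b.) of age=35 in each category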
Classification Scheme -
Classification Value Space (CVS)
[Figure: the Classification Value Space CVS(S) — a cube spanned by the tuples tk of the data set S, the attributes Ai, and the classification categories li]

for each attribute Ai in CS
    for each category Cj of Ai
        for each value tk.Ai
            compute d.o.b.(Ai, Cj, tk.Ai)
        end
    end
end
Information Measures in CVS
¾ Category Energy metric
  9 attribute category importance
  9 overall belief that the data set includes objects of the category li

E_{l_i}(S.A_i) = \left( \frac{1}{n_r} \sum_{k} \left[ \mu_{l_i}(S.t_k.A_i) \right]^q \right)^{1/q}

¾ Attribute Energy metric
  9 information content of the dataset regarding attribute Ai

E(A_i) = \frac{\sum_{l_i} E_{l_i}(A_i)}{c}

¾ CVS Energy
  9 overall information content of the Classification Value Space

E_{CVS} = \sum_{A_i} E(A_i)     or, weighted:     E_{CVS} = \sum_{i} w_i \, E(A_i) ,   0 \le w_i \le 1
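An illustrative sketch of the energy measures (the value of q, the membership degrees and the reading of n_r as the number of values considered are all assumptions made for the example):

    def category_energy(memberships, q=2):
        """E_li(S.Ai): generalized mean of the membership degrees of all values
        of attribute Ai in category li. `memberships` is a list of d.o.b. values."""
        n = len(memberships)
        return (sum(m ** q for m in memberships) / n) ** (1.0 / q)

    def attribute_energy(per_category_memberships, q=2):
        """E(Ai): average category energy over the c categories of Ai."""
        energies = [category_energy(ms, q) for ms in per_category_memberships]
        return sum(energies) / len(energies)

    # Toy example: attribute "age" with categories young / medium / old,
    # membership degrees of five tuples in each category (hypothetical values)
    age_memberships = [
        [0.9, 0.2, 0.0, 0.7, 0.1],   # young
        [0.1, 0.8, 0.3, 0.3, 0.6],   # medium
        [0.0, 0.0, 0.7, 0.0, 0.3],   # old
    ]
    print(attribute_energy(age_memberships))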
Multiple criteria classification
Representation of the d.o.b. related to composite classifications of tuples.
e.g. to what degree a tuple satisfies multiple criteria: “morning and
cheap purchases”.
The term “morning and cheap” defines a new category in the composite
attribute “time and price”.
Two alternatives:
9 Classification based on multidimensional clusters.
Clusters (initial categories) are found in multiple dimensions and then the data set is classified according to these multidimensional categories.
9 Classification based on one-dimensional clusters.

We adopt the min measure for composition of fuzzy predicates:
µ_morning and cheap(tk) = min( µ_morning(tk.time), µ_cheap(tk.price) )
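For composite categories this reduces to a one-liner (the values below are hypothetical):

    # Hypothetical degrees of belief for one tuple t_k
    mu_morning = 0.8   # d.o.b. of t_k.time in "morning"
    mu_cheap = 0.3     # d.o.b. of t_k.price in "cheap"

    mu_morning_and_cheap = min(mu_morning, mu_cheap)   # composite d.o.b. = 0.3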
Experimental Evaluation –
Predefined number of clusters
¾ stock market data base
¾ schema: {close_price, high_price, volume}
¾ One-dim. clustering: 3 clusters for close_price and volume respectively
¾ Two-dim. clustering: 9 clusters for combined attributes

Close_Price, Volume

One-dimensional clustering
                  C1      C2      C3      C4      C5      C6      C7      C8      C9
Category Energy   0.1758  0.8340  0.0683  0.1222  0.3611  0.0216  0.0488  0.221   0
E_cl_vol          0.2058

Two-dimensional clustering
                  Cat1    Cat2    Cat3    Cat4    Cat5    Cat6    Cat7    Cat8    Cat9
Category Energy   0.1739  0.1745  0.0548  0.1376  0.3699  0.3516  0.3191  0.669   0.1414
E_cl_vol          0.2658
Experimental Evaluation –
Optimal number of clusters
¾ One-dim. clustering: 4 clusters for close_price and 7 for volume
¾ Two-dim. clustering: 8 clusters for combined attributes

One-dimensional clustering
                  C1      C2      C3      C4      C5      C6      C7      C8      C9
Category Energy   0.117   0.164   0.0106  0.0856  0.5699  0.2995  0.2221  0.1011  0.0782
                  C10     C11     C12     C13     C14     C15     C16     C17     C18
                  0.0005  0.0327  0.1946  0.1458  0.1327  0.0116  0.1235  0       0.00222
                  C19     C20     C21     C22     C23     C24     C25     C26     C27
                  0.1409  0.0698  0.1011  0.0638  0.1219  0.0452  0.0653  0.3099  0.22054
                  C28
                  0.1035
E_cl_vol          0.1262

Two-dimensional clustering
                  Cat1    Cat2    Cat3    Cat4    Cat5    Cat6    Cat7    Cat8
Category Energy   0.1741  0.1758  0.0548  0.1377  0.3908  0.3521  0.3196  0.6699
E_cl_vol          0.2844
Decisions based on information measures
Single data set, single attribute queries

Query: “What is the belief that the data set contains high salaries?”
Value returned: E_high(S.salary)

Query: “Does the attribute salary include mostly high or medium salaries?”
Value returned: if (E_high(S.salary) > E_medium(S.salary)) return E_high(S.salary) else return E_medium(S.salary)

Multi-dataset queries

Query: “Which of S1, S2 contains more transactions made early in the morning?”
Value returned: if (E_morning(S1.time_of_p) > E_morning(S2.time_of_p)) return E_morning(S1.time_of_p) else return E_morning(S2.time_of_p)

Query: “In which supermarket are more cheap purchases made in the evening?”
Value returned: if (E_cheap and evening(S1.price, S1.time_of_p) > E_cheap and evening(S2.price, S2.time_of_p)) return E_cheap and evening(S1.price, S1.time_of_p) else return E_cheap and evening(S2.price, S2.time_of_p)
Uncertainty handling:Summary
¾ Uncertainty is an important though under-addressed issue in the data mining process

Contributions
¾ a scheme for representation of uncertainty in classification based on fuzzy logic
¾ multi-criteria classification is more successful when clustering is applied to multiple dimensions

Further work
¾ classification quality assessment
¾ decision support facilities for intra- & inter-dataset cases through information measures
Part 8-
Data Mining Applications and
Trends
Data Mining Applications
• Data mining is a young discipline with wide and diverse applications
9 There is still a nontrivial gap between general principles of data mining and domain-specific, effective data mining tools for particular applications
• Some application domains (covered in this part)
9 Biomedical and DNA data analysis
9 Financial data analysis
9 Retail industry
9 Telecommunication industry
Biomedical Data Mining and DNA Analysis
• DNA sequences: 4 basic building blocks (nucleotides): adenine
(A), cytosine (C), guanine (G), and thymine (T).
• Gene: a sequence of hundreds of individual nucleotides
arranged in a particular order
• Humans have around 100,000 genes
• Tremendous number of ways that the nucleotides can be
ordered and sequenced to form distinct genes
• Semantic integration of heterogeneous, distributed genome
databases
9 Current: highly distributed, uncontrolled generation and use
of a wide variety of DNA data
9 Data cleaning and data integration methods developed in
data mining will help
Data Mining for Financial Data Analysis
• Financial data collected in banks and financial institutions
are often relatively complete, reliable, and of high quality
• Design and construction of data warehouses for
multidimensional data analysis and data mining
9 View the debt and revenue changes by month, by region, by
sector, and by other factors
9 Access statistical information such as max, min, total,
average, trend, etc.
• Loan payment prediction/consumer credit policy analysis
9 feature selection and attribute relevance ranking
9 Loan payment performance
9 Consumer credit rating
Financial Data Mining
• Classification and clustering of customers for targeted
marketing
9 multidimensional segmentation by nearest-neighbor,
classification, decision trees, etc. to identify customer
groups or associate a new customer to an appropriate
customer group
• Detection of money laundering and other financial crimes
9 integration of data from multiple DBs (e.g., bank transactions, federal/state crime history DBs)
9 Tools: data visualization, linkage analysis, classification,
clustering tools, outlier analysis, and sequential pattern
analysis tools (find unusual access sequences)
Data Mining for Retail Industry
• Retail industry: huge amounts of data on sales, customer shopping history, etc.
• Applications of retail data mining
9 Identify customer buying behaviors
9 Discover customer shopping patterns and trends
9 Improve the quality of customer service
9 Achieve better customer retention and satisfaction
9 Enhance goods consumption ratios
9 Design more effective goods transportation and
distribution policies
Data Mining for Telecomm. Industry (1)
• A rapidly expanding and highly competitive industry with a great demand for data mining
9 Understand the business involved
9 Identify telecommunication patterns
9 Catch fraudulent activities
9 Make better use of resources
9 Improve the quality of service
• Multidimensional analysis of telecommunication data
9 Intrinsically multidimensional: calling-time, duration,
location of caller, location of callee, type of call, etc.
Data Mining tools -
How to choose a data mining system?
• Commercial data mining systems have little in common
9 Different data mining functionality or methodology
9 May even work with completely different kinds of data sets
• Need multiple dimensional view in selection
• Data types: relational, transactional, text, time sequence,
spatial?
• System issues
9 running on only one or on several operating systems?
9 a client/server architecture?
9 Provide Web-based interfaces and allow XML data as input
and/or output?
Data Mining tools -
How to Choose a Data Mining System? (2)
• Data sources
9 ASCII text files, multiple relational data sources
9 support ODBC connections (OLE DB, JDBC)?
• Data mining functions and methodologies
9 One vs. multiple data mining functions
9 One vs. variety of methods per function
• More data mining functions and methods per function provide the
user with greater flexibility and analysis power
• Coupling with DB and/or data warehouse systems
9 Four forms of coupling: no coupling, loose coupling,
semitight coupling, and tight coupling
• Ideally, a data mining system should be tightly coupled with a database system
Data Mining tools -
How to Choose a Data Mining System? (3)
• Scalability
9 Row (or database size) scalability
9 Column (or dimension) scalability
9 Curse of dimensionality: it is much more challenging to make a system column scalable than row scalable
• Visualization tools
9 “A picture is worth a thousand words”
9 Visualization categories: data visualization, mining result
visualization, mining process visualization, and visual data
mining
• Data mining query language and graphical user interface
9 Easy-to-use and high-quality graphical user interface
9 Essential for user-guided, highly interactive data mining
Examples of Data Mining Systems (1)
• IBM Intelligent Miner
– A wide range of data mining algorithms
– Scalable mining algorithms
– Toolkits: neural network algorithms, statistical methods, data
preparation, and data visualization tools
– Tight integration with IBM's DB2 relational database system
• SAS Enterprise Miner
– A variety of statistical analysis tools
– Data warehouse tools and multiple data mining algorithms
• Microsoft SQL Server 2000
– Integrate DB and OLAP with mining
– Support OLEDB for DM standard
Examples of Data Mining Systems (2)
• SGI MineSet
– Multiple data mining algorithms and advanced
statistics
– Advanced visualization tools
• Clementine (SPSS)
– An integrated data mining development
environment for end-users and developers
– Multiple data mining algorithms and visualization
tools
Visual Data Mining
• Visualization: use of computer graphics to create visual images
which aid in the understanding of complex, often massive
representations of data
• Visual Data Mining: the process of discovering implicit but
useful knowledge from large data sets using visualization
techniques
• Purpose of Visualization
– Gain insight into an information space by mapping data onto
graphical primitives
– Provide qualitative overview of large data sets
– Search for patterns, trends, structure, irregularities,
relationships among data.
– Help find interesting regions and suitable parameters for
further quantitative analysis.
– Provide a visual proof of computer representations derived
Visual Data Mining & Data Visualization
• Integration of visualization and data mining
9 data visualization
9 data mining result visualization
9 data mining process visualization
9 interactive visual data mining
• Data visualization
9 Data in a database or data warehouse can be viewed
• at different levels of granularity or abstraction
• as different combinations of attributes or dimensions
9 Data can be presented in various visual forms
Boxplots from Statsoft: multiple variable
combinations
Data Mining Result Visualization
• Presentation of the results or knowledge obtained from
data mining in visual forms
• Examples
– Scatter plots and boxplots (obtained from descriptive data
mining)
– Decision trees
– Association rules
– Clusters
– Outliers
– Generalized rules
Visualization of data mining results in SAS
Enterprise Miner: scatter plots
Visualization of association rules in
MineSet 3.0
Visualization of a decision tree in
MineSet 3.0
Visualization of cluster groupings in
IBM Intelligent Miner
Interactive Visual Data Mining
• Using visualization tools in the data mining process to help users make smart data mining decisions
• Example
– Display the data distribution in a set of attributes using
colored sectors or columns (depending on whether the
whole space is represented by either a circle or a set of
columns)
– Use the display to decide which sector should first be selected for classification and where a good split point for this sector may be
Interactive Visual Mining by Perception-Based Classification (PBC)
Audio Data Mining
• Uses audio signals to indicate the patterns of data or the features of data mining results
• An interesting alternative to visual mining
• An inverse task is mining audio (such as music) databases to find patterns in the audio data
• Visual data mining may disclose interesting patterns using
graphical displays, but requires users to concentrate on
watching patterns
• Instead, transform patterns into sound and music and listen
to pitches, rhythms, tune, and melody in order to identify
anything interesting or unusual
Theoretical Foundations of Data Mining (1)
• Data reduction
– The basis of data mining is to reduce the data
representation
– Trades accuracy for speed in response
• Data compression
– The basis of data mining is to compress the given data by
encoding in terms of bits, association rules, decision trees,
clusters, etc.
• Pattern discovery
– The basis of data mining is to discover patterns occurring
in the database, such as associations, classification models,
sequential patterns, etc.
Theoretical Foundations of Data Mining (2)
• Probability theory
– The basis of data mining is to discover joint probability
distributions of random variables
• Microeconomic view
– A view of utility: the task of data mining is finding patterns that are interesting only to the extent that they can be used in the decision-making process of some enterprise
• Inductive databases
– Data mining is the problem of performing inductive logic on
databases,
– The task is to query the data and the theory (i.e., patterns) of
the database
– Popular among many researchers in database systems
Is Data Mining a Hype or Will It Be
Persistent?
• Data mining is a technology
• Technological life cycle
– Innovators
– Early adopters
– Chasm
– Early majority
– Late majority
– Laggards
Life Cycle of Technology Adoption
• Data mining is at the Chasm!?
– Existing data mining systems are too generic
– Need business-specific data mining solutions and smooth
integration of business logic with data mining functions
Data Mining: Merely Managers'
Business or Everyone's?
• Data mining will surely be an important tool for
managers’ decision making
– Bill Gates: “Business @ the speed of thought”
• The amount of the available data is increasing, and
data mining systems will be more affordable
• Multiple personal uses
– Mine your family's medical history to identify genetically-
related medical conditions
– Mine the records of the companies you deal with
– Mine data on stocks and company performance, etc.
• Invisible data mining
– Build data mining functions into many intelligent tools (e.g. amazon.com)
Social Impacts: Threat to Privacy
and Data Security?
• Is data mining a threat to privacy and data security?
– “Big Brother”, “Big Banker”, and “Big Business” are carefully
watching you
– Profiling information is collected every time
• You use your credit card, debit card, supermarket loyalty card, or
frequent flyer card, or apply for any of the above
• You surf the Web, reply to an Internet newsgroup, subscribe to a
magazine, rent a video, join a club, fill out a contest entry form,
• You pay for prescription drugs, or present your medical care number when visiting the doctor
when visiting the doctor
– Collection of personal data may be beneficial for companies and consumers, but there is also potential for misuse
Protect Privacy and Data Security
• Fair information practices
– International guidelines for data privacy protection
– Cover aspects relating to data collection, purpose, use,
quality, openness, individual participation, and
accountability
– Purpose specification and use limitation
– Openness: Individuals have the right to know what
information is collected about them, who has access to the
data, and how the data are being used
• Develop and use data security-enhancing techniques
– Blind signatures
– Biometric encryption
– Anonymous databases
Trends in Data Mining (1)
• Application exploration
– development of application-specific data mining system
– Invisible data mining (mining as built-in function)
• Scalable data mining methods
– Constraint-based mining: use of constraints to guide data
mining systems in their search for interesting patterns
• Integration of data mining with database systems,
data warehouse systems, and Web database systems
• Quality assessment
Trends in Data Mining (2)
• Standardization of data mining language
– A standard will facilitate systematic development, improve
interoperability, and promote the education and use of data
mining systems in industry and society
• Visual data mining
• Uncertainty handling
• New methods for mining complex types of data
– More research is required towards the integration of data
mining methods with existing data analysis techniques for
the complex types of data
• Web mining
• Privacy protection and information security in data mining
Summary
• Domain-specific applications include biomedicine (DNA),
finance, retail and telecommunication data mining
• There exist some data mining systems and it is important to know
their power and limitations
• Visual data mining include data visualization, mining result
visualization, mining process visualization and interactive visual
mining
• There are many other scientific and statistical data mining methods developed but not covered in this tutorial
• Also, it is important to study theoretical foundations of data
mining
• Intelligent query answering can be integrated with mining
• It is important to watch privacy and security issues in data
mining
http://www.db-net.aueb.gr
{mvazirg, mhalk}@aueb.gr
Questions – Discussion
APPENDIX – REFERENCES
Where to Find References?
• Data mining and KDD (SIGKDD member CDROM):
– Conference proceedings: KDD, and others, such as PKDD, PAKDD,
etc.
– Journal: Data Mining and Knowledge Discovery
• Database field (SIGMOD member CD ROM):
– Conference proceedings: ACM-SIGMOD, ACM-PODS, VLDB,
ICDE, EDBT, DASFAA
– Journals: ACM-TODS, J. ACM, IEEE-TKDE, JIIS, etc.
• AI and Machine Learning:
– Conference proceedings: Machine learning, AAAI, IJCAI, etc.
– Journals: Machine Learning, Artificial Intelligence, etc.
• Statistics:
– Conference proceedings: Joint Stat. Meeting, etc.
– Journals: Annals of statistics, etc.
• Visualization:
– Conference proceedings: CHI, etc.
– Journals: IEEE Trans. visualization and computer graphics, etc.
References-Data Processing
• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in
Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
• J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan
Kaufmann, 2000.
• T. Imielinski and H. Mannila. A database perspective on knowledge discovery.
Communications of ACM, 39:58-64, 1996.
• G. Piatetsky-Shapiro, U. Fayyad, and P. Smith. From data mining to knowledge
discovery: An overview. In U.M. Fayyad, et al. (eds.), Advances in Knowledge
Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
• G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases.
AAAI/MIT Press, 1991.
References -Clustering (1)
• R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering
of high dimensional data for data mining applications. SIGMOD'98
• M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
• M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to
identify the clustering structure, SIGMOD’99.
• P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific,
1996
• M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering
clusters in large spatial databases. KDD'96.
• M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases:
Focusing techniques for efficient class identification. SSD'95.
• D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine
Learning, 2:139-172, 1987.
• D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach
based on dynamic systems. In Proc. VLDB’98.
• S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large
databases. SIGMOD'98.
• A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
References Clustering (2)
• L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster
Analysis. John Wiley & Sons, 1990.
• E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets.
VLDB’98.
• G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to
Clustering. John Wiley and Sons, 1988.
• P. Michaud. Clustering techniques. Future Generation Computer systems, 13, 1997.
• R. Ng and J. Han. Efficient and effective clustering method for spatial data mining.
VLDB'94.
• E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large
data sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
• G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution
clustering approach for very large spatial databases. VLDB’98.
• W. Wang, Yang, R. Muntz, STING: A Statistical Information grid Approach to
Spatial Data Mining, VLDB’97.
• T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : an efficient data clustering
method for very large databases. SIGMOD'96.
References-Clustering Quality Assessment
y S. Theodoridis, K. Koutroumbas. Pattern Recognition, Academic Press, 1999
yRajesh N. Dave. "Validating fuzzy partitions obtained through c-shells clustering",
Pattern Recognition Letters, Vol .17, pp613-623, 1996
yJ. C. Dunn. "Well separated clusters and optimal fuzzy partitions", J. Cybern. Vol.4, pp.
95-104, 1974.
yMartin Ester, Hans-Peter Kriegel, Jorg Sander, Xiaowei Xu. "A Density-Based
Algorithm for Discovering Clusters in Large Spatial Databases with Noise", Proc. of 2nd
Int. Conf. Knowledge Discovery and Data Mining, Portland, OR, pp. 226-231, 1996.
yI. Gath, B. Geva. "Unsupervised Optimal Fuzzy Clustering". IEEE Transactions on
Pattern Analysis and Machine Intelligence, Vol 11, No7, July 1989.
yRamze Rezaee, B.P.F. Lelieveldt, J.H.C Reiber. "A new cluster validity index for the
fuzzy c-mean", Pattern Recognition Letters, 19, pp237-246, 1998.
yPadhraic Smyth. "Clustering using Monte Carlo Cross-Validation". KDD 1996, 126-133.
y Xuanli Lisa Xie, Gerardo Beni. "A Validity Measure for Fuzzy Clustering", IEEE
Transactions on Pattern Analysis and machine Intelligence, Vol13, No4, August 1991.
yM.Halkidi, M.Vazirgiannis, I.Batistakis. “Quality scheme assessment in the clustering
process”, PKDD 2000, Lyon, France.
References-Classification (1)
•.M. Berry, G. Linoff. Data Mining Techniques For marketing, Sales and Customer
Support. John Willey & Sons, Inc, 1996.
•.S. Chiu. "Extracting Fuzzy Rules from Data for Function Approximation and Pattern
Classification". Fuzzy Information Engineering- A Guided Tour of Applications.(Eds.: D.
Dubois, H. Prade, R Yager), 1997.
•.P. Cheeseman, J. Stutz. "Bayesian Classification (AutoClass): Theory and Results".
Advances in Knowledge Discovery and Data Mining. (Eds:U. Fayyad,et al), AAAI
Press,1996.
•.T. Horiuchi. "Decision Rule for Pattern Classification by Integrating Interval Feature
Values". IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.20, No.4,
April 1998, pp.440-448.
•.M. Mehta, R. Agrawal, J. Rissanen. "SLIQ: A fast scalable classifier for data mining". In EDBT'96, Avignon, France, March 1996.
•.T. Mitchell. Machine Learning. McGraw-Hill, 1997
•.J.R Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufman, 1993.
•.R. Rastogi, K. Shim. "PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning". Proceedings of the 24th VLDB Conference, New York, USA, 1998.
References-Classification (2)
•.J. Shafer, R. Agrawal, M. Mehta. "SPRINT: A scalable parallel classifier for data mining". In Proc. of the VLDB Conference, Bombay, India, September 1996
.Glymour C., Madigan D., Pregibon D., Smyth P., "Statistical Inference and Data Mining", in CACM v39 (11), 1996, pp. 35-42
.Cezary Z. Janikow, "Fuzzy Decision Trees: Issues and Methods", IEEE Transactions
on Systems, Man, and Cybernetics, Vol. 28, Issue 1, pp 1-14, 1998.
.M. Vazirgiannis, "A classification and relationship extraction scheme for relational
databases based on fuzzy logic", in the proceedings of the Pacific -Asian Knowledge
Discovery & Data Mining ’98 Conference, Melbourne, Australia.
M. Vazirgiannis, M. Halkidi. "Uncertainty handling in the datamining process with
fuzzy logic", to appear in the proceedings of the IEEE-FUZZ conference, San
Antonio, Texas, May, 2000.
References – Association rules(1)
• R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of
frequent itemsets. In Journal of Parallel and Distributed Computing (Special Issue on High
Performance Data Mining), 2000.
• R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in
large databases. SIGMOD'93, 207-216, Washington, D.C.
• R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94 487-499,
Santiago, Chile.
• R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, 3-14, Taipei, Taiwan.
• R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98, 85-93, Seattle,
Washington.
• S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association
rules to correlations. SIGMOD'97, 265-276, Tucson, Arizona.
• S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication
rules for market basket analysis. SIGMOD'97, 255-264, Tucson, Arizona, May 1997.
• K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes.
SIGMOD'99, 359-370, Philadelphia, PA, June 1999.
• D.W. Cheung, J. Han, V. Ng, and C.Y. Wong. Maintenance of discovered association rules in
large databases: An incremental updating technique. ICDE'96, 106-114, New Orleans, LA.
• M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing
iceberg queries efficiently. VLDB'98, 299-310, New York, NY, Aug. 1998.
References – Association rules(2)
• G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets.
ICDE'00, 512-521, San Diego, CA, Feb. 2000.
• Y. Fu and J. Han. Meta-rule-guided mining of association rules in relational databases.
KDOOD'95, 39-46, Singapore, Dec. 1995.
• T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-
dimensional optimized association rules: Scheme, algorithms, and visualization.
SIGMOD'96, 13-23, Montreal, Canada.
• E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules.
SIGMOD'97, 277-288, Tucson, Arizona.
• J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series
database. ICDE'99, Sydney, Australia.
• J. Han and Y. Fu. Discovery of multiple-level association rules from large databases.
VLDB'95, 420-431, Zurich, Switzerland.
• J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation.
SIGMOD'00, 1-12, Dallas, TX, May 2000.
• T. Imielinski and H. Mannila. A database perspective on knowledge discovery.
Communications of ACM, 39:58-64, 1996.
• M. Kamber, J. Han, and J. Y. Chiang. Metarule-guided mining of multi-dimensional
association rules using data cubes. KDD'97, 207-210, Newport Beach, California.
• M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo. Finding
interesting rules from large sets of discovered association rules. CIKM'94, 401-408,
Gaithersburg, Maryland.
References – Association rules(3)
• F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new paradigm for fast,
quantifiable data mining. VLDB'98, 582-593, New York, NY.
• B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97, 220-231,
Birmingham, England.
• H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional inter-transaction association
rules. SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery
(DMKD'98), 12:1-12:7, Seattle, Washington.
• H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering
association rules. KDD'94, 181-192, Seattle, WA, July 1994.
• H. Mannila, H Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event
sequences. Data Mining and Knowledge Discovery, 1:259-289, 1997.
• R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules.
VLDB'96, 122-133, Bombay, India.
• R.J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97, 452-461,
Tucson, Arizona.
• R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning
optimizations of constrained associations rules. SIGMOD'98, 13-24, Seattle, Washington.
• N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for
association rules. ICDT'99, 398-416, Jerusalem, Israel, Jan. 1999.
References – Association rules(4)
• J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm for mining
association rules. SIGMOD'95, 175-186, San Jose, CA, May 1995.
• J. Pei, J. Han, and R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed
Itemsets. DMKD'00, Dallas, TX, 11-20, May 2000.
• J. Pei and J. Han. Can We Push More Constraints into Frequent Pattern Mining? KDD'00.
Boston, MA. Aug. 2000.
• G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. In G. Piatetsky-
Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases, 229-238.
AAAI/MIT Press, 1991.
• B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98, 412-
421, Orlando, FL.
• J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm for mining
association rules. SIGMOD'95, 175-186, San Jose, CA.
• S. Ramaswamy, S. Mahajan, and A. Silberschatz. On the discovery of interesting patterns
in association rules. VLDB'98, 368-379, New York, NY..
• S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with
relational database systems: Alternatives and implications. SIGMOD'98, 343-354, Seattle,
WA.
• A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association
rules in large databases. VLDB'95, 432-443, Zurich, Switzerland.
• A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a
large database of customer transactions. ICDE'98, 494-502, Orlando, FL, Feb. 1998.
References – Association rules(5)
• C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal
structures. VLDB'98, 594-605, New York, NY.
• R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95, 407-419,
Zurich, Switzerland, Sept. 1995.
• R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables.
SIGMOD'96, 1-12, Montreal, Canada.
• R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints.
KDD'97, 67-73, Newport Beach, California.
• H. Toivonen. Sampling large databases for association rules. VLDB'96, 134-145,
Bombay, India, Sept. 1996.
• D. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks:
A generalization of association-rule mining. SIGMOD'98, 1-12, Seattle, Washington.
• K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing optimized
rectilinear regions for association rules. KDD'97, 96-103, Newport Beach, CA, Aug. 1997.
• M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for discovery of
association rules. Data Mining and Knowledge Discovery, 1:343-374, 1997.
• M. Zaki. Generating Non-Redundant Association Rules. KDD'00. Boston, MA. Aug.
2000.
• O. R. Zaiane, J. Han, and H. Zhu. Mining Recurrent Items in Multimedia with Progressive
Resolution Refinement. ICDE'00, 461-470, San Diego, CA, Feb. 2000.
References - Conclusions(1)
• M. Ankerst, C. Elsen, M. Ester, and H.-P. Kriegel. Visual classification: An interactive
approach to decision tree construction. KDD'99, San Diego, CA, Aug. 1999.
• P. Baldi and S. Brunak. Bioinformatics: The Machine Learning Approach. MIT Press,
1998.
• S. Benninga and B. Czaczkes. Financial Modeling. MIT Press, 1997.
• L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.
Wadsworth International Group, 1984.
• M. Berthold and D. J. Hand. Intelligent Data Analysis: An Introduction. Springer-
Verlag, 1999.
• M. J. A. Berry and G. Linoff. Mastering Data Mining: The Art and Science of
Customer Relationship Management. John Wiley & Sons, 1999.
• A. Baxevanis and B. F. F. Ouellette. Bioinformatics: A Practical Guide to the Analysis
of Genes and Proteins. John Wiley & Sons, 1998.
• Q. Chen, M. Hsu, and U. Dayal. A data-warehouse/OLAP framework for scalable
telecommunication tandem traffic analysis. ICDE'00, San Diego, CA, Feb. 2000.
• W. Cleveland. Visualizing Data. Hobart Press, Summit NJ, 1993.
• S. Chakrabarti, S. Sarawagi, and B. Dom. Mining surprising patterns using temporal
description length. VLDB'98, New York, NY, Aug. 1998.
References - Conclusions (2)
• J. L. Devore. Probability and Statistics for Engineering and the Science, 4th ed.
Duxbury Press, 1995.
• A. J. Dobson. An Introduction to Generalized Linear Models. Chapman and Hall,
1990.
• B. Gates. Business @ the Speed of Thought. New York: Warner Books, 1999.
• M. Goebel and L. Gruenwald. A survey of data mining and knowledge discovery
software tools. SIGKDD Explorations, 1:20-33, 1999.
• D. Gusfield. Algorithms on Strings, Trees and Sequences, Computer Science and
Computation Biology. Cambridge University Press, New York, 1997.
• J. Han, Y. Huang, N. Cercone, and Y. Fu. Intelligent query answering by knowledge
discovery techniques. IEEE Trans. Knowledge and Data Engineering, 8:373-390,
1996.
• R. C. Higgins. Analysis for Financial Management. Irwin/McGraw-Hill, 1997.
• C. H. Huberty. Applied Discriminant Analysis. New York: John Wiley & Sons, 1994.
• T. Imielinski and H. Mannila. A database perspective on knowledge discovery.
Communications of ACM, 39:58-64, 1996.
• D. A. Keim and H.-P. Kriegel. VisDB: Database exploration using multidimensional
visualization. Computer Graphics and Applications, pages 40-49, Sept. 94.
References - Conclusions (3)
• J. M. Kleinberg, C. Papadimitriou, and P. Raghavan. A microeconomic view of data
mining. Data Mining and Knowledge Discovery, 2:311-324, 1998.
• H. Mannila. Methods and problems in data mining. ICDT'99 Delphi, Greece, Jan. 1997.
• R. Mattison. Data Warehousing and Data Mining for Telecommunications. Artech
House, 1997.
• R. G. Miller. Survival Analysis. New York: Wiley, 1981.
• G. A. Moore. Crossing the Chasm: Marketing and Selling High-Tech Products to
Mainstream Customers. Harperbusiness, 1999.
• R. H. Shumway. Applied Statistical Time Series Analysis. Prentice Hall, 1988.
• E. R. Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire,
CT, 1983.
• E. R. Tufte. Envisioning Information. Graphics Press, Cheshire, CT, 1990.
• E. R. Tufte. Visual Explanations : Images and Quantities, Evidence and Narrative.
Graphics Press, Cheshire, CT, 1997.
• M. S. Waterman. Introduction to Computational Biology: Maps, Sequences, and
Genomes (Interdisciplinary Statistics). CRC Press, 1995.