• Introduction
• Data Preprocessing
• Cluster Analysis – Unsupervised Learning
• Classification – Supervised Learning
• Association Rules
• Quality Assessment
• Uncertainty Handling
• Data Mining Applications & Trends
[Figure: the KDD process - Databases → Data Integration → Data Cleaning → selection of Task-relevant Data → Data Mining]
Steps of a KDD Process
• Learning the application domain:
– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation:
– find useful features, dimensionality/variable reduction, invariant representation
• Choosing functions of data mining
– summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
Data Mining and Business Intelligence
[Figure: pyramid with increasing potential to support business decisions toward the top - Databases and Data Warehouse → Data Exploration (statistical analysis, querying and reporting) → Pattern Evaluation → Decision Making (end user)]
Data Mining: On What Kind of Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
✓ Object-oriented and object-relational databases
✓ Spatial databases
✓ Time-series data and temporal data
✓ Text databases and multimedia databases
✓ Heterogeneous and legacy databases
✓ WWW
Type of data in clustering analysis
• Interval-scaled variables:
• Binary variables:
• Nominal, ordinal, and ratio variables:
• Variables of mixed types:
The Minkowski distance with q = 1 is the Manhattan distance:
d(i, j) = |x_{i1} − x_{j1}| + |x_{i2} − x_{j2}| + … + |x_{ip} − x_{jp}|
Similarity and Dissimilarity Between Objects (Cont.)
• If q = 2, d is the Euclidean distance:
d(i, j) = √( |x_{i1} − x_{j1}|² + |x_{i2} − x_{j2}|² + … + |x_{ip} − x_{jp}|² )
– Properties
• d(i,j) ≥ 0
• d(i,i) = 0
• d(i,j) = d(j,i)
• d(i,j) ≤ d(i,k) + d(k,j)
• Also one can use weighted distance, parametric Pearson
product moment correlation, or other dissimilarity measures.
• Nominal variables - simple matching, with m matches out of p variables:
d(i, j) = (p − m) / p
• Ordinal variables - replace each value by its rank r_{if} ∈ {1, …, M_f} and map it onto [0, 1]:
z_{if} = (r_{if} − 1) / (M_f − 1)
✓ then compute the dissimilarity using methods for interval-scaled variables
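A minimal sketch of these dissimilarity measures (numpy assumed; the function names are illustrative):

import numpy as np

def minkowski(x, y, q=2):
    # q = 1 gives the Manhattan distance, q = 2 the Euclidean distance
    return float((np.abs(x - y) ** q).sum() ** (1.0 / q))

def nominal_dissimilarity(x, y):
    # d(i, j) = (p - m) / p, with m matches out of p nominal variables
    p = len(x)
    m = sum(a == b for a, b in zip(x, y))
    return (p - m) / p

def ordinal_to_interval(rank, M):
    # z_if = (r_if - 1) / (M_f - 1): map a rank in {1..M} onto [0, 1]
    return (rank - 1) / (M - 1)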
Ratio-Scaled Variables
• Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately exponential, such as Ae^{Bt} or Ae^{−Bt}
• Methods:
✓ treat them like interval-scaled variables - not a good choice, since the exponential scale distorts linear distances
✓ apply a logarithmic transformation: y_{if} = log(x_{if})
✓ treat them as continuous ordinal data and treat their ranks as interval-scaled
[Figure: data mining at the confluence of multiple disciplines - Database Technology, Statistics, Machine Learning, Visualization, Information Science, and other disciplines]
Handling Noisy Data
• Binning method (a sketch follows the figure below):
– first sort the data and partition it into (equi-depth) bins
– then smooth by bin means, bin medians, bin boundaries, etc.
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human
• Regression
– smooth by fitting the data into regression functions
[Figure: linear regression - data (X1, Y1) smoothed by fitting the regression line y = x + 1]
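A small sketch of smoothing by equi-depth bin means (numpy assumed):

import numpy as np

def smooth_by_bin_means(values, n_bins):
    # sort, split into (roughly) equi-depth bins, replace each value by its bin mean
    v = np.sort(np.asarray(values, dtype=float))
    return np.concatenate([np.full(len(b), b.mean())
                           for b in np.array_split(v, n_bins)])

# e.g. smooth_by_bin_means([4, 8, 9, 15, 21, 21, 24, 25, 26], 3)
# -> [7, 7, 7, 19, 19, 19, 25, 25, 25]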
• z-score normalization:
v′ = (v − mean_A) / stddev_A
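A two-line sketch (hypothetical values for attribute A, numpy assumed):

import numpy as np

v = np.array([73., 85., 52., 91., 64.])   # hypothetical values of attribute A
v_norm = (v - v.mean()) / v.std()         # v' = (v - mean_A) / stddev_A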
[Figure: data reduction - attribute subset selection (e.g., a decision tree over attributes A1, …, A6) and lossy compression: original data versus its approximated representation]
[Figure: three clusters in the x-y plane - good clustering maximizes intra-class similarity and minimizes inter-class similarity]
A clustering task requires:
✓ a proximity measure
✓ a clustering criterion
✓ a clustering algorithm
The K-Means method minimizes the squared-error criterion
E = Σ_k || x_k − m_{c(x_k)} ||²
where m_{c(x_k)} is the center of the cluster to which x_k is assigned.
Problems
✓ applicable only to numerical data sets
✓ need to specify the number of clusters in advance
✓ unable to handle noisy data and outliers
✓ not suitable for discovering clusters with non-convex shapes
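A minimal k-means sketch that minimizes the criterion above (numpy assumed; empty clusters are not handled):

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # initial k centers
    for _ in range(iters):
        # assign each point to the nearest center
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        new_centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    error = ((X - centers[labels]) ** 2).sum()               # squared-error criterion E
    return labels, centers, error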
Variations of the K-Means Method
Several variants of k-means differ in:
✓ selection of the initial k centers
✓ dissimilarity calculations
✓ strategies to calculate cluster centers
[Figure: k-medoids (PAM) - the four reassignment cases when swapping a medoid i with a non-medoid h, evaluated over points j and t on a 10 × 10 grid]
[Figure: hierarchical clustering of objects a, b, c, d, e - agglomerative (AGNES) merges bottom-up over Steps 0-4 (a, b → ab; d, e → de; c, de → cde; ab, cde → abcde); divisive (DIANA) reads the same dendrogram in reverse, Steps 4-0]
Hierarchical Clustering Algorithms
[Figure: BIRCH CF-tree - leaf entries summarizing points such as (3,4), (2,6), (4,5), (4,7), (3,8); a non-leaf node holds clustering-feature entries CF1, CF2, CF3, …, CF5 with pointers child1, child2, child3, …, child5]
• Drawbacks of partitional and hierarchical clustering methods:
✓ they consider only one point as representative of a cluster
✓ they are good only for convex-shaped clusters of similar size and density, and only if k can be reasonably estimated
ROCK (for categorical/transactional data). Basic ideas:
✓ Similarity function (Jaccard coefficient):
Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
✓ Neighbors: Sim(T1, T2) ≥ θ ⇒ T1 and T2 are neighbors
✓ Links: link(T1, T2) = the number of common neighbors of the two points (see the sketch below).
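A brute-force sketch of the similarity and link computation (transactions represented as Python sets; illustrative, not the ROCK paper's optimized algorithm):

def sim(t1, t2):
    # Jaccard similarity |T1 ∩ T2| / |T1 ∪ T2|
    return len(t1 & t2) / len(t1 | t2)

def links(data, theta):
    # neighbours: sim >= theta; link(p, q) = number of common neighbours
    nbrs = [{j for j, t2 in enumerate(data) if sim(t1, t2) >= theta}
            for t1 in data]
    n = len(data)
    return [[len(nbrs[i] & nbrs[j]) for j in range(n)] for i in range(n)]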
Density-Based Clustering Algorithms
Clustering based on density (local cluster criterion), such as
density-connected points
Major features:
✓ discover clusters of arbitrary shape
✓ handle noise
✓ need density parameters as a termination condition
Representative algorithms:
✓ DBSCAN: Ester et al. (KDD'96)
✓ DENCLUE: Hinneburg & Keim (KDD'98)
Density-Based Clustering: Background
Two parameters:
✓ Eps: maximum radius of the neighbourhood
✓ MinPts: minimum number of points in an Eps-neighbourhood of that point
Density-connected:
– A point p is density-connected to a point q w.r.t. Eps and MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.
DBSCAN
Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points.
Discovers clusters of arbitrary shape in spatial databases with noise (a sketch follows the figure below).
[Figure: DBSCAN point types - core, border, and outlier points (Eps = 1 cm, MinPts = 5)]
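A compact DBSCAN sketch following these definitions (numpy assumed; brute-force neighbourhood queries rather than a spatial index):

import numpy as np

def dbscan(X, eps, min_pts):
    # labels: cluster id per point, -1 = noise
    n = len(X)
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    def region(i):                        # Eps-neighbourhood, brute force
        return np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)
    cid = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        seeds = list(region(i))
        if len(seeds) < min_pts:          # i is not a core point
            continue
        labels[i] = cid
        while seeds:
            j = seeds.pop()
            if not visited[j]:
                visited[j] = True
                nbrs = region(j)
                if len(nbrs) >= min_pts:  # j is core: expand the cluster
                    seeds.extend(nbrs)
            if labels[j] == -1:           # claim border (and former noise) points
                labels[j] = cid
        cid += 1
    return labels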
Grid-Based Clustering (WaveCluster)
[Figure: a grid of cells quantizing the data space]
Quantize the space into a finite number of cells and then do all
operations on the quantized space
Input parameters:
✓ number of grid cells for each dimension
✓ the wavelet, and
✓ the number of applications of the wavelet transform
Fuzzy C-Means: as the fuzziness parameter m → 1, clusters become crisp; as m → ∞, clusters become fuzzy and U_ik → 1/c.
Category: Partitional

K-Means
– Type of data: numerical; Complexity: O(n); Geometry: non-convex shapes; Outliers/noise: no; Input parameter: number of clusters k; Results: centers of clusters
– Clustering criterion: min_{v1,…,vk} E, with E = Σ_{i=1..k} Σ_{x_l ∈ cluster i} d²(x_l, v_i)

K-Modes
– Type of data: categorical; Complexity: O(n); Geometry: non-convex shapes; Outliers/noise: no; Input parameter: number of clusters k; Results: modes of clusters
– Clustering criterion: min_{Q1,…,Qk} E, with E = Σ_{i=1..k} Σ_{l=1..n} d(X_l, Q_i), where d(X_l, Q_i) is the distance between categorical object X_l and mode Q_i

PAM
– Type of data: numerical; Complexity: O(k(n − k)²); Geometry: non-convex shapes; Outliers/noise: no; Input parameter: number of clusters; Results: medoids of clusters
– Clustering criterion: min TC_ih, with TC_ih = Σ_j C_jih

CLARA
– Type of data: numerical; Complexity: O(k(40 + k)² + k(n − k)); Geometry: non-convex shapes; Outliers/noise: no; Input parameter: number of clusters; Results: medoids of clusters
– Clustering criterion: min TC_ih, with TC_ih = Σ_j C_jih (C_jih = the cost of replacing center i with h as far as O_j is concerned)

CLARANS
– Type of data: numerical; Complexity: O(kn²); Geometry: non-convex shapes; Outliers/noise: no; Input parameters: number of clusters, maximum number of neighbors examined; Results: medoids of clusters
– Clustering criterion: min TC_ih, with TC_ih = Σ_j C_jih

FCM (Fuzzy C-Means)
– Type of data: numerical; Complexity: O(n); Geometry: non-convex shapes; Outliers/noise: no; Input parameters: number of clusters, beliefs; Results: centers of clusters
– Clustering criterion: min_{U,v1,…,vk} J_m(U, V), with J_m(U, V) = Σ_{i=1..k} Σ_{j=1..n} U_ij^m d²(x_j, v_i)
Category: Hierarchical

BIRCH
– Type of data: numerical; Complexity: O(n); Geometry: convex shapes; Outliers: yes; Input parameters: radius of clusters, branching factor; Results: CF = (N, LS, SS) per cluster (number of points in the cluster N, linear sum LS, and square sum SS of the points)
– Clustering criterion: a point is assigned to the closest node (cluster) according to a chosen distance metric; the cluster definition also requires that the number of points in each cluster satisfies a threshold

CURE
– Type of data: numerical; Complexity: O(n² log n) time, O(n) space; Geometry: arbitrary shapes; Outliers: yes; Input parameters: number of clusters, number of cluster representatives; Results: assignment of data values to clusters
– Clustering criterion: the clusters with the closest pair of representatives (well-scattered points) are merged at each step

ROCK
– Type of data: categorical; Complexity: O(n² + n m_m m_a + n² log n) time, O(n², n m_m m_a) space; Geometry: arbitrary shapes; Outliers: yes; Input parameter: number of clusters; Results: assignment of data values to clusters
– Clustering criterion: max E_l = Σ_{i=1..k} n_i Σ_{p_q, p_r ∈ V_i} link(p_q, p_r) / n_i^{1+2f(θ)}, where v_i is the center of cluster i and link(p_q, p_r) is the number of common neighbors of p_q and p_r
Category: Density-based

DBSCAN
– Type of data: numerical; Complexity: O(n log n); Geometry: arbitrary shapes; Outliers/noise: yes; Input parameters: cluster radius, minimum number of objects; Results: assignment of data values to clusters
– Clustering criterion: merge points that are density-reachable into one cluster

DENCLUE
– Type of data: numerical; Complexity: O(log n); Geometry: arbitrary shapes; Outliers/noise: yes; Input parameters: cluster radius σ, minimum number of objects ξ; Results: assignment of data values to clusters
– Clustering criterion: Gaussian influence function
f_Gauss^D(x*) = Σ_{x1 ∈ near(x*)} e^{−d(x*, x1)² / (2σ²)};
x* is a density attractor for a point x: if f_Gauss > ξ, then x is attached to the cluster belonging to x*
Category: Grid-based

WaveCluster
– Type of data: spatial; Complexity: O(n); Geometry: arbitrary shapes; Outliers: yes; Input parameters: wavelets, number of grid cells for each dimension, number of applications of the wavelet transform; Output: clustered objects
– Clustering criterion: decompose the feature space by applying a wavelet transformation; the average sub-band yields the clusters, the detail sub-bands yield the cluster boundaries

STING
– Type of data: spatial; Complexity: O(K), where K is the number of grid cells at the lowest level; Geometry: arbitrary shapes; Outliers: yes; Input parameter: number of objects in a cell; Output: clustered objects
– Clustering criterion: divide the spatial area into rectangular cells organized in a hierarchical structure; each cell at a high level is partitioned into a number of smaller cells at the next lower level
Part 4
Classification (Supervised Learning)
Requirements
A well-defined set of classes and a training set of pre-classified examples characterize the classification task.
Goal: induce a model that can be used to classify future data items whose class is unknown.
Classification Methods
✓ Bayesian classification
✓ Decision trees
✓ Neural networks
Bayesian classification
The aim is to classify a sample x into one of the given classes c1, c2, …, cN using a probability model defined according to Bayes' theory.
Requirements:
– the a priori probability p(c_i) of each class c_i
– the conditional probability density function p(x | c_i)
Bayes' formula (posterior probability):
q(c_i | x) = p(x | c_i) p(c_i) / Σ_{j=1..C} p(x | c_j) p(c_j)
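A small sketch of the formula (the priors and likelihoods below are illustrative numbers):

import numpy as np

def posteriors(priors, likelihoods):
    # q(c_i | x) = p(x | c_i) p(c_i) / sum_j p(x | c_j) p(c_j)
    joint = np.asarray(likelihoods) * np.asarray(priors)
    return joint / joint.sum()

# e.g. two classes with equal priors and p(x|c1) = 0.9, p(x|c2) = 0.3:
# posteriors([0.5, 0.5], [0.9, 0.3]) -> [0.75, 0.25]; x is assigned to c1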
[Figure: decision tree for the weather data - outlook: sunny → humidity (high → N, normal → P); overcast → P; rain → wind (true → N, false → P)]
Decision Trees
Decision trees are constructed in two phases:
Building phase: the training data set is recursively partitioned until all the instances in a partition have the same class.
Pruning phase: nodes are pruned to prevent overfitting and to obtain a tree with higher accuracy.
Step 1:
Select an attribute A as the test for the current set of training instances C (e.g., by an attribute selection measure such as information gain; see the sketch below).
Step 2:
Partition the training instances in C into subsets C1, C2, ..., Cn according to the values of A.
Step 3:
Apply the algorithm recursively to each of the sets Ci.
ID3: Attribute Selection
[Figure: splitting the training examples on the Outlook attribute; partitions labeled No / Yes]
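A sketch of ID3's attribute selection measure, information gain (standard formulation; the data layout, rows as dicts, is an assumption of this sketch):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # expected entropy reduction from partitioning the examples on `attr`
    total, n = entropy(labels), len(rows)
    for value in {r[attr] for r in rows}:
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        total -= len(subset) / n * entropy(subset)
    return total

# ID3 picks the attribute with the highest info_gain at each node.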
ID3 Example: Final Decision Tree
Outlook?
– Sunny → Humidity? (High → No, Normal → Yes)
– Overcast → Yes
– Rain → Wind? (Strong → No, Weak → Yes)
C4.5 extends ID3 with:
– handling of unavailable values
– ranges for continuous attribute values
– pruning of decision trees
– rule derivation
[Figure: neural network - input units X2, X3, X4, X5 connected to output units Z1, Z2]
The Apriori algorithm (the original pseudocode, made runnable in Python):

from itertools import combinations

def apriori(transactions, min_support):
    items = {i for t in transactions for i in t}
    # L1 = {frequent items}
    L = [{frozenset([i]) for i in items if sum(i in t for t in transactions) >= min_support}]
    k = 0
    while L[k]:                                     # for (k = 1; Lk != empty; k++)
        # Ck+1 = candidates generated from Lk: join pairs, then prune by subsets
        C = {a | b for a in L[k] for b in L[k] if len(a | b) == k + 2}
        C = {c for c in C if all(frozenset(s) in L[k] for s in combinations(c, k + 1))}
        # for each transaction t, count the candidates in Ck+1 contained in t
        counts = {c: sum(c <= set(t) for t in transactions) for c in C}
        L.append({c for c, n in counts.items() if n >= min_support})   # Lk+1
        k += 1
    return set().union(*L)                          # return the union of all Lk
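For instance (hypothetical transactions, with an absolute support threshold): apriori([{"a","b","c"}, {"a","b"}, {"a","c"}, {"b","c"}], min_support=2) returns every itemset appearing in at least two transactions, i.e. {a}, {b}, {c}, {a,b}, {a,c}, {b,c}.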
• Pruning:
– acde is removed because ade is not in L3
• C4={abcd}
Methods to Improve Apriori’s Efficiency
• Hash-based itemset counting: A k-itemset whose
corresponding hashing bucket count is below the threshold
cannot be frequent
• Transaction reduction: A transaction that does not contain
any frequent k-itemset is useless in subsequent scans
• Partitioning: Any itemset that is potentially frequent in DB
must be frequent in at least one of the partitions of DB
• Sampling: mining on a subset of given data, lower support
threshold + a method to determine the completeness
• Dynamic itemset counting: add new candidate itemsets only
when all of their subsets are estimated to be frequent
Is Apriori Fast Enough? Performance Bottlenecks
• The core of the Apriori algorithm:
– Use frequent (k – 1)-itemsets to generate candidate frequent
k-itemsets
– Use database scan and pattern matching to collect counts for
the candidate itemsets
• The bottleneck of Apriori: candidate generation
– Huge candidate sets:
✓ 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
✓ to discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates
– Multiple scans of the database:
✓ needs (n + 1) scans, where n is the length of the longest pattern
Iceberg Queries
• Iceberg query: compute aggregates over one attribute or a set of attributes only for those whose aggregate value is above a certain threshold
• Example:
select P.custID, P.itemID, sum(P.qty)
from purchase P
group by P.custID, P.itemID
having sum(P.qty) >= 10
• Compute iceberg queries efficiently by Apriori:
– First compute lower dimensions
– Then compute higher dimensions only when all the lower
ones are above the threshold
Multi-Dimensional Association: Concepts
• Single-dimensional rules:
buys(X, "milk") ⇒ buys(X, "bread")
• Multi-dimensional rules: two or more dimensions or predicates
– Inter-dimension association rules (no repeated predicates):
age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
– Hybrid-dimension association rules (repeated predicates):
age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
• Categorical Attributes
– finite number of possible values, no ordering among values
• Quantitative Attributes
– numeric, implicit ordering among values
Techniques for Mining MD Associations
• Search for frequent k-predicate set:
– Example: {age, occupation, buys} is a 3-predicate set.
– Techniques can be categorized by how quantitative attributes, such as age, are treated.
1. Using static discretization of quantitative attributes
– Quantitative attributes are statically discretized by using
predefined concept hierarchies.
2. Quantitative association rules
– Quantitative attributes are dynamically discretized into "bins" based on the distribution of the data.
3. Distance-based association rules
– This is a dynamic discretization process that considers the
distance between data points.
Static Discretization of Quantitative Attributes
1. Binning
3. Clustering
4. Optimize: minimize the average pairwise distance of the tuples projected on the attribute set X,
d(S[X]) = ( Σ_{i=1..N} Σ_{j=1..N} dist_X(t_i[X], t_j[X]) ) / ( N(N − 1) ),
with the cluster condition C_X ≥ s^0.
• Finding clusters and distance-based rules
Objective interestingness measures
Two popular measurements: support and confidence
• Example 2: eight transactions over the items X, Y, Z:
X: 1 1 1 1 0 0 0 0
Y: 1 1 0 0 0 0 0 0
Z: 0 1 1 1 1 1 1 1

Rule     Support   Confidence
X ⇒ Y    25%       50%
X ⇒ Z    37.5%     75%

– X and Y are positively correlated and X and Z negatively related, yet the support and confidence of X ⇒ Z dominate
• We need a measure of dependent or correlated events:
corr_{A,B} = P(A ∧ B) / ( P(A) P(B) )
• P(B|A)/P(B) is also called the lift of rule A ⇒ B
Other Interestingness Measures: Interest
• Interest (correlation, lift): P(A ∧ B) / ( P(A) P(B) )
– takes both P(A) and P(B) into consideration
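Recomputing Example 2 from the previous slide (rows of the X/Y/Z table as 0/1 vectors):

X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]

def lift(a, b):
    n = len(a)
    p_ab = sum(x and y for x, y in zip(a, b)) / n
    return p_ab / ((sum(a) / n) * (sum(b) / n))

print(lift(X, Y))   # 2.0   (> 1: X and Y positively correlated)
print(lift(X, Z))   # ~0.86 (< 1: X and Z negatively correlated)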
What is Good Clustering?
[Figure: comparison of the centroids of two clusterings of the same points in the x-y plane]
[Figure: DBSCAN on the same data set with Eps = 2, Nps = 4 versus Eps = 6, Nps = 4 - the parameter choice changes the discovered clusters]
[Figure: (a) a data set containing four clusters; (b) the partitioning produced by a clustering algorithm]
Clustering Validity Indices
A number of cluster validity indices are described in the literature:
• Crisp clustering:
✓ separation index (Dunn)
✓ DB (Davies-Bouldin)
✓ RMSSTD & RS (Subhash Sharma)
• Fuzzy clustering:
✓ partition coefficient (Bezdek)
✓ classification entropy (Bezdek)
✓ …
Drawbacks:
✓ computationally expensive
✓ monotonic dependency on the number of clusters
✓ lack of a direct connection to the geometry of the data
Clustering Quality Index - SD
• Variance of the data set X = {x_1, …, x_n}, per dimension p:
σ_X^p = (1/n) Σ_{k=1..n} (x_k^p − x̄^p)², with x̄ = (1/n) Σ_{k=1..n} x_k, ∀x_k ∈ X
• Variance of cluster i (center v_i, n_i points):
σ_{v_i}^p = (1/n_i) Σ_{k=1..n_i} (x_k^p − v_i^p)²
• Average scattering for clusters:
Scat(c) = (1/c) Σ_{i=1..c} ||σ(v_i)|| / ||σ(X)||
• Total scattering (separation) between clusters, with D_max and D_min the maximum and minimum distance between cluster centers:
Dis(c) = (D_max / D_min) Σ_{k=1..c} ( Σ_{z=1..c} ||v_k − v_z|| )^{−1}
• SD(c) = Dis(c_max) · Scat(c) + Dis(c), where c_max is the maximum number of input clusters.

S_Dbw additionally measures inter-cluster density:
density(u_ij) = Σ_{l=1..n} f(x_l, u_ij), n = number of tuples, x_l ∈ S
S_Dbw = a · Scat(c) + Dens_bw(c)
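A sketch of the scattering term (numpy assumed; ||·|| taken as the Euclidean norm of the per-dimension variance vector):

import numpy as np

def scat(X, labels, n_clusters):
    # Scat(c) = (1/c) * sum_i ||sigma(v_i)|| / ||sigma(X)||
    sigma_X = np.linalg.norm(X.var(axis=0))
    sigma_clusters = [np.linalg.norm(X[labels == i].var(axis=0))
                      for i in range(n_clusters)]
    return np.mean(sigma_clusters) / sigma_X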
[Figure: four synthetic data sets (DataSet1 to DataSet4) used to compare S_Dbw with other validity indices]
Contributions
• a new validity measure (S_Dbw) for:
✓ selecting the best clustering scheme for a data set
✓ assessing the results of a specific clustering algorithm
Further work
• definition of an index that works properly in the case of clusters with non-convex shapes (e.g., rings)
• an integrated algorithm for cluster discovery that places emphasis on the geometric features of clusters
Part 7 – Uncertainty Handling in the Data Mining Process with Fuzzy Logic
OUTLINE
Classification Framework
Experimental evaluation
Decision support
client_age
             young        medium      old
Min          18           30          50
Max          40           60          80
Function     decreasing   triangle    increasing

price
             very cheap   cheap       moderate    expensive
Min          1            10          35          70
Max          15           50          80          150
Function     decreasing   triangle    triangle    triangle
[Figure: mapping (membership) functions of the client_age categories young, medium, and old over ages 18-78]
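A sketch of such membership functions (the tables give only Min/Max; placing the triangle peak at the interval midpoint is an assumption):

def increasing(x, lo, hi):
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))

def decreasing(x, lo, hi):
    return min(1.0, max(0.0, (hi - x) / (hi - lo)))

def triangle(x, lo, hi, peak=None):
    peak = (lo + hi) / 2 if peak is None else peak   # assumed midpoint peak
    if x <= lo or x >= hi:
        return 0.0
    return (x - lo) / (peak - lo) if x < peak else (hi - x) / (hi - peak)

# client_age categories from the table above:
young  = lambda age: decreasing(age, 18, 40)
medium = lambda age: triangle(age, 30, 60)
old    = lambda age: increasing(age, 50, 80)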
Classification Scheme -
Classification Value Space (CVS)
[Figure: the Classification Value Space CVS(S) - each tuple t_k of data set S is mapped, per attribute, to degrees of belief in the classification categories l_i]
for each attribute Ai in CS
    for each category Cj of Ai
        for each value tk.Ai
            compute d.o.b.(Ai, Cj, tk.Ai)
        end
    end
end
Information Measures in CVS
• Category Energy metric: the importance of attribute category l_i, i.e. the overall belief that the data set includes objects of the category l_i:
E_{l_i}(S.A_i) = ( (1/n) Σ_{k=1..n} [μ_{l_i}(S.t_k.A_i)]^q )^{1/q}
• Attribute Energy metric: the information content of the data set regarding attribute A_i:
E_{A_i} = ( Σ_{l_i = l_1..l_c} E_{l_i}(A_i) ) / c
• CVS Energy: the overall information content of the Classification Value Space:
E_CVS = Σ_{A_i} E(A_i) = Σ_i w_i E(A_i), with 0 ≤ w_i ≤ 1
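A sketch matching the reconstruction above (the exact normalization of the original metric is uncertain; dobs holds the degrees of belief μ_l(t_k.A) of one category):

def category_energy(dobs, q=2):
    # E_l(S.A) = ( (1/n) * sum_k mu_l(t_k.A)^q )^(1/q)
    n = len(dobs)
    return (sum(m ** q for m in dobs) / n) ** (1 / q)

def attribute_energy(category_energies):
    # E_A = average of the category energies of attribute A
    return sum(category_energies) / len(category_energies)

def cvs_energy(attribute_energies, weights=None):
    # E_CVS = sum_i w_i * E(A_i), 0 <= w_i <= 1
    weights = weights or [1.0] * len(attribute_energies)
    return sum(w * e for w, e in zip(weights, attribute_energies))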
Multiple criteria classification
Representation of the d.o.b. related to composite classifications of tuples.
e.g. to what degree a tuple satisfies multiple criteria: “morning and
cheap purchases”.
The term “morning and cheap” defines a new category in the composite
attribute “time and price”.
Two alternatives:
✓ Classification based on multidimensional clusters: clusters (initial categories) are found in multiple dimensions and then the data set is classified according to these multidimensional categories.
✓ Classification based on combinations of one-dimensional clusters: clusters are found separately in each dimension and composite categories are formed from their combinations.
Attributes: Close_Price, Volume

First data set - one-dimensional clustering:
Category          C1      C2      C3      C4      C5      C6      C7      C8      C9
Category Energy   0.1758  0.8340  0.0683  0.1222  0.3611  0.0216  0.0488  0.221   0
E_cl_vol = 0.2058

First data set - two-dimensional clustering:
Category          Cat1    Cat2    Cat3    Cat4    Cat5    Cat6    Cat7    Cat8    Cat9
Category Energy   0.1739  0.1745  0.0548  0.1376  0.3699  0.3516  0.3191  0.669   0.1414
E_cl_vol = 0.2658

Second data set - one-dimensional clustering (Category Energy per category):
C1-C9:    0.117   0.164   0.0106  0.0856  0.5699  0.2995  0.2221  0.1011  0.0782
C10-C18:  0.0005  0.0327  0.1946  0.1458  0.1327  0.0116  0.1235  0       0.00222
C19-C27:  0.1409  0.0698  0.1011  0.0638  0.1219  0.0452  0.0653  0.3099  0.22054
C28:      0.1035
E_cl_vol = 0.1262

Second data set - two-dimensional clustering:
Category          Cat1    Cat2    Cat3    Cat4    Cat5    Cat6    Cat7    Cat8
Category Energy   0.1741  0.1758  0.0548  0.1377  0.3908  0.3521  0.3196  0.6699
E_cl_vol = 0.2844
Multi-dataset queries

Query: "Which of S1, S2 contains more transactions made early in the morning?"
Value returned: if E_morning(S1.time_of_p) > E_morning(S2.time_of_p) then E_morning(S1.time_of_p), else E_morning(S2.time_of_p)

Query: "In which supermarket are more cheap purchases made in the evening?"
Value returned: if E_{cheap and evening}(S1.price, S1.time_of_p) > E_{cheap and evening}(S2.price, S2.time_of_p) then E_{cheap and evening}(S1.price, S1.time_of_p), else E_{cheap and evening}(S2.price, S2.time_of_p)
Contributions
• a scheme for the representation of uncertainty in classification, based on fuzzy logic
• multi-criteria classification is more successful when clustering is applied to multiple dimensions
Further work
• classification quality assessment
• decision support facilities for intra- and inter-dataset cases, through information measures
Part 8 – Data Mining Applications and Trends
Data Mining Tools
• SGI MineSet
– Multiple data mining algorithms and advanced
statistics
– Advanced visualization tools
• Clementine (SPSS)
– An integrated data mining development
environment for end-users and developers
– Multiple data mining algorithms and visualization
tools
Visual Data Mining
• Visualization: use of computer graphics to create visual images
which aid in the understanding of complex, often massive
representations of data
• Visual Data Mining: the process of discovering implicit but
useful knowledge from large data sets using visualization
techniques
• Purpose of Visualization
– Gain insight into an information space by mapping data onto
graphical primitives
– Provide qualitative overview of large data sets
– Search for patterns, trends, structure, irregularities,
relationships among data.
– Help find interesting regions and suitable parameters for
further quantitative analysis.
– Provide a visual proof of computer representations derived
Visual Data Mining & Data Visualization
• Integration of visualization and data mining
✓ data visualization
✓ data mining result visualization
✓ data mining process visualization
✓ interactive visual data mining
• Data visualization
✓ Data in a database or data warehouse can be viewed
• at different levels of granularity or abstraction
• as different combinations of attributes or dimensions
✓ Data can be presented in various visual forms
[Figure: boxplots (from Statsoft) of multiple variable combinations]
Trends in Data Mining (1)
• Application exploration
– development of application-specific data mining system
– Invisible data mining (mining as built-in function)
• Scalable data mining methods
– Constraint-based mining: use of constraints to guide data
mining systems in their search for interesting patterns
• Integration of data mining with database systems,
data warehouse systems, and Web database systems
• Quality assessment
Trends in Data Mining (2)
• Standardization of data mining language
– A standard will facilitate systematic development, improve
interoperability, and promote the education and use of data
mining systems in industry and society
• Visual data mining
• Uncertainty handling
• New methods for mining complex types of data
– More research is required towards the integration of data
mining methods with existing data analysis techniques for
the complex types of data
• Web mining
• Privacy protection and information security in data mining
Summary
• Domain-specific applications include biomedicine (DNA), finance, retail, and telecommunication data mining
• Several data mining systems exist; it is important to know their power and limitations
• Visual data mining includes data visualization, mining result visualization, mining process visualization, and interactive visual mining
• Many other scientific and statistical data mining methods have been developed but are not covered here
• It is also important to study the theoretical foundations of data mining
• Intelligent query answering can be integrated with mining
• It is important to watch privacy and security issues in data mining
http://www.db-net.aueb.gr
{mvazirg, mhalk}@aueb.gr