
Density Based Clustering

Ramalingaswamy Cheruku
Density-Based Clustering Methods
• Clustering based on density (local cluster criterion), such as
density-connected points
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
• Several interesting studies:
– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al. (SIGMOD’99)
– DENCLUE: Hinneburg & D. Keim (KDD’98)

Density-Based Clustering: Basic Concepts
• Two parameters:
– Eps: maximum radius of the neighbourhood
– MinPts: minimum number of points in an Eps-neighbourhood of that point
• NEps(p) = {q belongs to D | dist(p,q) ≤ Eps}
• Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
– p belongs to NEps(q)
– core point condition: |NEps(q)| ≥ MinPts
(Figure example: MinPts = 5, Eps = 1 cm)
Density-Reachable and Density-Connected
• Density-reachable:
– A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi
• Density-connected:
– A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
DBSCAN
Published by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu in the KDD-96 proceedings.
Test of Time award at KDD 2014
11500 citations on Google Scholar

Ester, Martin, et al. "A density-based algorithm for discovering clusters in large spatial databases with noise." KDD. Vol. 96. No. 34. 1996.
Main Idea
• Three types of points:
– Core point
– Boundary point
– Noise (Outlier) point
• Connect core points into clusters
• Assign boundary points to clusters
Core, Border & Noise Points

• Density: number of points within the radius Eps.
• A point is a core point if it has more than a specified number of points (MinPts) within Eps. These are points in the interior of a cluster.
• A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
• A noise point is any point that is neither a core point nor a border point.
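As a sketch of these three definitions, the following Python function (my own illustration, not from the slides; the names `label_points`, `eps`, and `min_pts` are assumptions) labels every point in a small dataset as core, border, or noise:

```python
import numpy as np

def label_points(X, eps, min_pts):
    X = np.asarray(X, dtype=float)
    # pairwise Euclidean distances between all points
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # density = number of points within Eps (the point itself included)
    counts = (d <= eps).sum(axis=1)
    is_core = counts >= min_pts
    labels = []
    for i in range(len(X)):
        if is_core[i]:
            labels.append("core")
        elif is_core[d[i] <= eps].any():   # lies in the Eps-neighborhood of a core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels
```

For example, a tight group of three points with a nearby straggler and a distant outlier yields three core points, one border point, and one noise point.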
Density-reachability
• Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
– p belongs to NEps(q)
– core point condition: |NEps(q)| ≥ MinPts
(Figure example: MinPts = 5, Eps = 1 cm)
Or
• An object q is directly density-reachable from object p if p is a core object and q is in p’s Eps-neighborhood.
Density-reachability
• Density-reachable: A point p is density-reachable from a point q w.r.t. Eps,
MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is
directly density-reachable from pi
Density-connected
• A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.
DBSCAN: The Algorithm
1. Arbitrarily select a point p.
2. Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
3. If p is a core point, a cluster is formed.
4. If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
5. Continue the process until all of the points have been processed.
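The steps above can be sketched as a short Python implementation. This is a minimal, illustrative version (the names `dbscan`, `eps`, and `min_pts` are mine, not from the slides); it grows each cluster by expanding the neighborhoods of core points, as steps 1-5 describe:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    X = np.asarray(X, dtype=float)
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighborhoods = [np.flatnonzero(d[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)          # -1 marks noise (or not yet labeled)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        if len(neighborhoods[p]) < min_pts:
            continue                 # p is noise, or a border point claimed later
        labels[p] = cluster          # p is core: grow a new cluster from it
        seeds = list(neighborhoods[p])
        while seeds:
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster  # border or core point joins the cluster
            if not visited[q]:
                visited[q] = True
                if len(neighborhoods[q]) >= min_pts:
                    seeds.extend(neighborhoods[q])  # q is core: expand further
        cluster += 1
    return labels.tolist()
```

On two small well-separated groups plus a far-away point, this returns two cluster labels and -1 for the outlier.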
When DBSCAN Works Well
When DBSCAN Does NOT Work Well
DBSCAN: Sensitive to Parameters

Choosing parameters of DBSCAN algorithm
• The DBSCAN algorithm requires 2 parameters:
– epsilon, which specifies how close points should be to each other to be considered a part of a cluster; and
– minPts, which specifies how many neighbors a point should have to be included into a cluster.
• However, you may not know these values in advance.
Estimating epsilon:
 Estimating the distance to the nearest neighbor: calculate the distance from each point to its nearest neighbor within the same cluster.
 The distances to the nearest neighbor produce a histogram, which is depicted in the figure.
 It indicates that the vast majority of points lie within 21.7027 units of their nearest neighbor. So 22 may be a reasonable guess for the epsilon parameter.
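A hedged sketch of this estimation step (the function name is my own; the 21.7027 figure came from the author's dataset): compute each point's distance to its nearest neighbor, then inspect the distribution, e.g. with `np.histogram`:

```python
import numpy as np

def nearest_neighbor_distances(X):
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)   # a point is not its own neighbor
    return d.min(axis=1)

# A histogram of these distances, e.g. np.histogram(dists, bins=50),
# then suggests an epsilon that covers the vast majority of points.
```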
MinPts estimation: counting each point's neighbors
 After you have chosen the value for epsilon, you may wonder how many points lie within each point's epsilon-neighborhood.
 Counting each point's neighbors builds a histogram, which may look like the side figure.
 This histogram was obtained on a data set of 400,000 points, with epsilon = 22. It indicates that some points (about 25,000, which is 6.25% of all points) have too few neighbors; probably they are noise points. A smaller fraction (about 15,000, which is 3.75% of all points) have 65 to 129 neighbors, and starting from 129, the number of neighbors begins to grow.
 Based on the histograms above, I would try clustering my data set with the following parameters: epsilon = 22, minPts = 129.
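The neighbor-counting step can be sketched similarly (again an illustrative helper, not the author's code): count how many other points fall inside each point's epsilon-neighborhood, then histogram the counts:

```python
import numpy as np

def neighbor_counts(X, eps):
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # neighbors within eps, excluding the point itself
    return ((d <= eps).sum(axis=1) - 1).tolist()

# np.histogram(neighbor_counts(X, eps=22), bins=50) would reproduce
# the kind of histogram described above for a large dataset.
```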


DBSCAN Pros and Cons
• Pros
– No need to decide K
– Not sensitive to noise
• Cons
– Sensitive to Eps and MinPts parameters
– Can’t handle varying densities
DBSCAN Visualization

https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/
OPTICS: Ordering Points To Identify the Clustering Structure
Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
A DBSCAN extension.
Idea: higher-density points should be processed first, i.e. find the high-density clusters first.
OPTICS stores such a clustering order using two pieces of information:
1. Core-distance
2. Reachability-distance
OPTICS: Terminology
• Core distance: the core distance of an object p is the smallest value of Eps such that the Eps-neighborhood of p has at least MinPts objects.
• Reachability distance of an object p from the core object q is the minimum radius value that makes p density-reachable from q. Mathematically:
Max( Core-distance(q), distance(p,q) ).
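Both quantities can be sketched in a few lines of Python (illustrative names, not from the slides; note that the reachability distance of p from q uses the core distance of the core object q):

```python
import numpy as np

def core_distance(X, p, eps, min_pts):
    X = np.asarray(X, dtype=float)
    d = np.sort(np.linalg.norm(X - X[p], axis=1))
    if (d <= eps).sum() < min_pts:
        return None                 # undefined: p is not a core object at this Eps
    return d[min_pts - 1]           # distance to the MinPts-th closest point (self included)

def reachability_distance(X, p, q, eps, min_pts):
    X = np.asarray(X, dtype=float)
    cd = core_distance(X, q, eps, min_pts)
    if cd is None:
        return None                 # undefined: q is not a core object
    return max(cd, np.linalg.norm(X[p] - X[q]))
```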
Reachability plot for dataset
[Figure: reachability-distance (y-axis; undefined for some objects) plotted against the cluster-order of the objects (x-axis)]

 Since points belonging to a cluster have a low reachability distance to their nearest neighbor, valleys correspond to clusters.
 The deeper the valley, the denser the cluster.
OPTICS For Hierarchical Nested Clusters
DENCLUE: Using Statistical Density Functions

• DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)
• Uses statistical density functions:
– Influence of a point y on x:
f_Gaussian(x, y) = exp( -d(x, y)^2 / (2σ^2) )
– Total influence on x (the density function):
f_Gaussian^D(x) = Σ_{i=1..N} exp( -d(x, x_i)^2 / (2σ^2) )
– Gradient of x in the direction of x_i:
∇f_Gaussian^D(x, x_i) = Σ_{i=1..N} (x_i - x) · exp( -d(x, x_i)^2 / (2σ^2) )
• Major features
– Solid mathematical foundation
– Good for data sets with large amounts of noise
– Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
– Significantly faster than existing algorithms (e.g., DBSCAN)
– But needs a large number of parameters
DENCLUE:
 It builds on kernel density estimation functions.
 It estimates the probability density of the data directly from the data instances.
 In DENCLUE the probability density in the data space is estimated as a function of all data instances:
• The influences of the data instances in the data space are modeled via a simple kernel function.
• The sum of all kernels gives an estimate of the probability density at any point x in the data space.
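As a sketch, the kernel-sum density of the DENCLUE formulation (unnormalized Gaussian kernels; the function name `denclue_density` is mine) can be written as:

```python
import numpy as np

def denclue_density(x, X, h):
    # Sum of Gaussian kernels centered at the data instances:
    #   f(x) = sum_i exp(-d(x, x_i)^2 / (2 h^2))
    x, X = np.asarray(x, dtype=float), np.asarray(X, dtype=float)
    sq = ((X - x) ** 2).sum(axis=1)
    return float(np.exp(-sq / (2.0 * h * h)).sum())
```

At a data point the density is high (each coincident instance contributes 1); far from all instances it decays toward zero.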
DENCLUE:
• The quantity σ or h > 0 specifies to what degree a data instance is smoothed over the data space.
• When h is large, an instance stretches its influence up to more distant regions.
• When h is small, an instance affects only the local neighborhood.
• We illustrate the idea of kernel density estimation on one-dimensional data as shown in figure 1.
What is a Cluster

• A clustering in DENCLUE is defined by the local maxima of the estimated density function.
• A hill-climbing procedure is started for each data instance, which assigns the instance to a local maximum.
• In the case of Gaussian kernels, the hill climbing is guided by the gradient of p̂(x), which takes the form
∇p̂(x) ∝ Σ_{i=1..N} (x_i - x) · exp( -d(x, x_i)^2 / (2h^2) )
• The hill-climbing procedure starts at a data point and iterates until the density does not grow anymore. The update formula of the iteration to proceed from x(l) to x(l+1) is
x(l+1) = x(l) + δ · ∇p̂(x(l)) / ||∇p̂(x(l))||
• The step size δ is a small positive number.
• In the end, those end points of the hill-climbing iteration which are closer than 2σ are considered to belong to the same local maximum. Instances which are assigned to the same local maximum are put into the same cluster.
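The hill-climbing procedure can be sketched as follows (an illustrative implementation under the Gaussian-kernel assumption; the names `hill_climb` and `delta` are mine, with `delta` playing the role of the step size δ, and iteration stopping when the density no longer grows):

```python
import numpy as np

def kde(x, X, h):
    # unnormalized Gaussian kernel density estimate at x
    return float(np.exp(-((X - x) ** 2).sum(axis=1) / (2 * h * h)).sum())

def hill_climb(x0, X, h, delta=0.05, max_iter=500):
    X = np.asarray(X, dtype=float)
    x = np.asarray(x0, dtype=float)
    dens = kde(x, X, h)
    for _ in range(max_iter):
        w = np.exp(-((X - x) ** 2).sum(axis=1) / (2 * h * h))
        grad = ((X - x) * w[:, None]).sum(axis=0)   # gradient of the KDE at x
        norm = np.linalg.norm(grad)
        if norm == 0.0:
            break
        x_new = x + delta * grad / norm             # normalized-gradient step of size delta
        dens_new = kde(x_new, X, h)
        if dens_new <= dens:                        # density no longer grows: stop
            break
        x, dens = x_new, dens_new
    return x
```

Started from a point away from a small symmetric cluster, the procedure walks to (near) the density maximum at the cluster center.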
Contd…
• In the presence of random noise in the data, the DENCLUE framework provides an extra parameter ξ > 0, which treats all points assigned to local maxima x* with p̂(x*) < ξ as outliers.
• Figure 2 sketches the idea of a DENCLUE clustering.
DENCLUE
• Example
Influence of σ value:
Parameter σ (or h):
 It describes the influence of a data point in the data space. It determines the number of clusters.
DENCLUE Parameter Setting
Parameter σ (or h):
 Choose σ such that the number of clusters stays constant for the longest interval of σ.
Parameter ξ estimation
• Once σ is known, the results of clustering depend on the noise threshold ξ. Since practical databases always contain large amounts of noisy data, we estimate ξ as follows:

• where d is the number of dimensions,
• c is a constant, 0 < c < 1, and
• D_N is the size of the noisy dataset.
• Reference: Gan, W., & Li, D. (2003, May). Optimal choice
of parameters for a density-based clustering algorithm.
In International Workshop on Rough Sets, Fuzzy Sets,
Data Mining, and Granular-Soft Computing (pp. 603-
606). Springer, Berlin, Heidelberg.
DENCLUE Experiment
• Polygonal CAD data (11-dimensional feature vectors)
• Comparison between DBSCAN and DENCLUE
DENCLUE Features
• Clusters are defined according to the point density
function which is the sum of influence functions of
the data points.
• It has good clustering in data sets with large
amounts of noise.
• It can deal with high-dimensional data sets.
• It is significantly faster than existing algorithms.
OPTICS Pros and Cons
• Less sensitive to parameter setting
• Finds Hierarchical Nested Clusters
Summary
• arbitrary shaped clusters
• good scalability
• explicit definition of noise
• noise invariance
• high dimensional clustering