Вы находитесь на странице: 1из 18

Density Based Clustering

Algorithm (Expected Features)

A cluster, defined as a connected dense


component.
A cluster grows in a direction along which
density attains its maximum.
Determine number of clusters from an input
dataset.
Density-based algorithms are capable of
discovering clusters of arbitrary shapes.
Provides a natural protection against outliers.
Usually work with low-dimensional data.

DBSCAN
Density-based Clustering locates regions of high density
that are separated from one another by regions of low
density.

Density = number of points within a specified radius (Eps)

DBSCAN is a density-based algorithm.

A point is a core point if it has more than a specified number of


points (MinPts) within Eps
These are points that are at the interior of a cluster
A border point has fewer than MinPts within Eps, but is in the
neighborhood of a core point

DBSCAN

A noise point is any point that is not a core


point or a border point.
Any two core points are close enough within
a distance Eps of one another are put in the
same cluster
Any border point that is close enough to a
core point is put in the same cluster as the
core point
Noise points are discarded

Border & Core


Outlier
Border
Core

= 1unit
MinPts = 5

Concepts: -Neighborhood

-Neighborhood - Objects within a radius of


from an object. (epsilon-neighborhood)
Core objects - -Neighborhood of an object
contains at least MinPts of objects

qq

pp

-Neighborhood of p
-Neighborhood of q
p is a core object (MinPts = 4)
q is not a core object

Concepts: Reachability
Directly density-reachable

An object q is directly density-reachable from


object p if q is within the -Neighborhood of p
and p is a core object.
q is directly density-

qq

pp

reachable from p
p is not directly densityreachable from q?

Concepts: Reachability

Density-reachable:

An object p is density-reachable from q w.r.t and


MinPts if there is a chain of objects p1,,pn, with
p1=q, pn=p such that pi+1is directly density-reachable
from pi w.r.t and MinPts for all 1 <= i <= n

q is density-reachable

qq
pp

from p
p is not density- reachable
from q?
asymmetric

Concepts: Connectivity
Density-connectivity

Object p is density-connected to object q w.r.t


and MinPts if there is an object o such that
both p and q are density-reachable from o
w.r.t and MinPts
P and q are density-

qq
rr

pp

connected to each other


by r
Density-connectivity is
symmetric

Concepts: cluster & noise

Cluster: a cluster C in a set of objects D w.r.t


and MinPts is a non empty subset of D satisfying

Maximality: For all p, q if p C and if q is densityreachable from p w.r.t and MinPts, then also q C.
Connectivity: for all p, q C, p is density-connected
to q w.r.t and MinPts in D.
Note: cluster contains core objects as well as border
objects

Noise: objects which are not directly densityreachable from at least one core object.

(Indirectly) Density-reachable:
p
p1

q
Density-connected

q
o

DBSCAN: The Algorithm

select a point p
Retrieve all points density-reachable from p wrt and
MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable from
p and DBSCAN visits the next point of the database.
Continue the process until all of the points have been
processed.

Result is independent of the order of processing the points

An Example
MinPts = 4

C1
C1

C1

How to assign the values of


and MinPts.

For DBSCAN, the parameters and MinPts are


needed. The parameters must be specified by the
user.
As a rule of thumb, minPts can be derived from
the number of dimensions D in the data set, as
minPts >= D+1.
MinPts=1 does not make sense, as then every
point on its own will already be a cluster.
With MinPts=2, the result will be the same as of
hierarchical clustering with the single link metric,
with the dendrogram cut at height

How to assign the values of


and MinPts.

However, larger values are usually better for


data sets with noise and will yield more
significant clusters.
The value for can be chosen by using MST of
data points,
If is chosen too small, a large part of the data
will not be clustered; whereas for a too high
value of , clusters will merge and the majority of
objects will be in the same cluster.

The Merits of DBSCAN


Algorithm

DBSCAN does not require one to specify the


number of clusters in the data a priori, as
opposed to k-means.
DBSCAN can find arbitrarily shaped clusters. It
can even find a cluster completely surrounded
by (but not connected to) a different cluster. Due
to the MinPts parameter, the so-called single-link
effect (different clusters being connected by a
thin line of points) is reduced.
DBSCAN has a notion of noise.

The Merits of DBSCAN


Algorithm

DBSCAN requires just two parameters and is


mostly insensitive to the ordering of the points in
the database. (However, points sitting on the
edge of two different clusters might swap cluster
membership if the ordering of the points is
changed, and the cluster assignment is unique.)

The De-Merits of DBSCAN


Algorithm

The quality of DBSCAN depends on the distance


measure used in the function region Query(P,). The
most common distance metric used is Euclidean
distance. Especially for high-dimensional data, this
metric can be rendered almost useless due to the
so-called "Curse of dimensionality", making it
difficult to find an appropriate value for . This
effect, however, is also present in any other
algorithm based on Euclidean distance.

The De-Merits of DBSCAN


Algorithm

DBSCAN cannot cluster data sets well

with large differences in densities, since


the minPts- combination cannot then be
chosen appropriately for all clusters.