Density Based Clustering

Algorithm (Expected Features)

A cluster is defined as a connected dense component.
A cluster grows in the direction along which
density attains its maximum.
The number of clusters is determined from the input data.
Density-based algorithms are capable of
discovering clusters of arbitrary shapes.
They provide natural protection against outliers.
They usually work best with low-dimensional data.

Density-based clustering locates regions of high density
that are separated from one another by regions of low density.

Density = number of points within a specified radius (Eps)
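
The density definition above can be sketched in a few lines of Python. This is a minimal brute-force illustration, not a reference implementation: the function name `density` and the sample points are hypothetical, and the count is assumed to include the center point itself when it belongs to the data set.

```python
from math import dist

def density(points, center, eps):
    """Count the points lying within distance eps of center."""
    return sum(1 for q in points if dist(center, q) <= eps)

# Hypothetical sample data: three points close together, one far away.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.8), (3.0, 3.0)]
dense_count = density(points, (0.0, 0.0), eps=1.0)   # counts (0,0), (0.5,0), (0,0.8)
sparse_count = density(points, (3.0, 3.0), eps=1.0)  # counts only (3,3) itself
print(dense_count, sparse_count)
```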

DBSCAN is a density-based algorithm.

A point is a core point if it has at least a specified number of
points (MinPts) within Eps.
These are points that are at the interior of a cluster.
A border point has fewer than MinPts within Eps, but is in the
neighborhood of a core point.


A noise point is any point that is not a core
point or a border point.
Any two core points that are within
a distance Eps of one another are put in the
same cluster.
Any border point that is close enough to a
core point is put in the same cluster as the
core point.
Noise points are discarded.
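
The three point types can be illustrated with a small brute-force sketch. The function name `classify` and the sample points are hypothetical. The sketch assumes that a point counts itself among its neighbors and that "at least MinPts" is the core threshold.

```python
from math import dist

def classify(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'."""
    # Neighborhoods include the point itself (an assumed convention).
    neighbors = [[j for j, q in enumerate(points) if dist(p, q) <= eps]
                 for p in points]
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")   # not core, but within eps of a core point
        else:
            labels.append("noise")
    return labels

# Hypothetical data: a center with four neighbors, one near point, one far point.
points = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1), (2, 0), (5, 5)]
labels = classify(points, eps=1.0, min_pts=5)
print(labels)
```

Here only the center (0, 0) has five points within one unit, so it is the single core point. Its four neighbors are border points, and (2, 0) is noise because its only neighbor is a border point.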

Border & Core


ε = 1 unit
MinPts = 5

Concepts: ε-Neighborhood

ε-Neighborhood - Objects within a radius of ε
from an object. (epsilon-neighborhood)
Core objects - ε-Neighborhood of an object
contains at least MinPts of objects



ε-Neighborhood of p
ε-Neighborhood of q
p is a core object (MinPts = 4)
q is not a core object

Concepts: Reachability
Directly density-reachable

An object q is directly density-reachable from
object p if q is within the ε-Neighborhood of p
and p is a core object.
q is directly density-reachable from p
p is not directly density-reachable from q?
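
The asymmetry of direct density-reachability can be checked with a short sketch. All function names and the sample points are hypothetical, and neighborhoods are assumed to include the object itself.

```python
from math import dist

def eps_neighborhood(points, p, eps):
    """All objects within distance eps of p."""
    return [q for q in points if dist(p, q) <= eps]

def is_core(points, p, eps, min_pts):
    return len(eps_neighborhood(points, p, eps)) >= min_pts

def directly_density_reachable(points, q, p, eps, min_pts):
    """True if q is directly density-reachable from p."""
    return dist(p, q) <= eps and is_core(points, p, eps, min_pts)

# Hypothetical data: p has four objects in its neighborhood, q only three.
points = [(0, 0), (1, 0), (0, 1), (-1, 0), (2, 0)]
p, q = (0, 0), (1, 0)
print(directly_density_reachable(points, q, p, eps=1.0, min_pts=4))  # q from p
print(directly_density_reachable(points, p, q, eps=1.0, min_pts=4))  # p from q
```

With MinPts = 4, p is a core object and q is not, so q is directly density-reachable from p while the reverse does not hold.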

Concepts: Reachability


An object p is density-reachable from q w.r.t. ε and
MinPts if there is a chain of objects p_1, ..., p_n, with
p_1 = q, p_n = p, such that p_(i+1) is directly density-reachable
from p_i w.r.t. ε and MinPts for all 1 <= i < n.

q is density-reachable from p
p is not density-reachable from q?
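
The chain in this definition can be followed with a breadth-first search that only continues through core objects. The sketch below is one possible reading, with a hypothetical function name and hypothetical data. It works on indices into the point list and counts an object as part of its own neighborhood.

```python
from collections import deque
from math import dist

def density_reachable_from(points, start, eps, min_pts):
    """Indices of all objects density-reachable from points[start]."""
    def neighbors(i):
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    reached = set()
    expanded = {start}
    queue = deque([start])
    while queue:
        i = queue.popleft()
        nb = neighbors(i)
        if len(nb) < min_pts:
            continue              # not a core object: the chain stops here
        for j in nb:
            reached.add(j)        # j is directly density-reachable from i
            if j not in expanded:
                expanded.add(j)
                queue.append(j)
    return reached

# Hypothetical data: five points on a line, one unit apart.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
from_core = density_reachable_from(points, 1, eps=1.0, min_pts=3)
from_border = density_reachable_from(points, 0, eps=1.0, min_pts=3)
print(from_core, from_border)
```

The end point at index 0 is reachable from the core object at index 1, but nothing is reachable from index 0 because it is not a core object.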

Concepts: Connectivity

Object p is density-connected to object q w.r.t.
ε and MinPts if there is an object o such that
both p and q are density-reachable from o
w.r.t. ε and MinPts.
p and q are density-connected to each other
by r
Density-connectivity is symmetric
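
A brute-force check of density-connectivity tries every object o as the common origin. This is only a sketch with hypothetical names and data. It takes quadratic time or worse and is meant to show the definition, not to be efficient.

```python
from math import dist

def reachable(points, start, eps, min_pts):
    """Indices density-reachable from points[start] (empty if start is not core)."""
    reached, seen, stack = set(), {start}, [start]
    while stack:
        i = stack.pop()
        nb = [j for j, q in enumerate(points) if dist(points[i], q) <= eps]
        if len(nb) < min_pts:
            continue              # only core objects extend the chain
        reached.update(nb)
        for j in nb:
            if j not in seen:
                seen.add(j)
                stack.append(j)
    return reached

def density_connected(points, a, b, eps, min_pts):
    """True if some object o has both a and b density-reachable from it."""
    return any({a, b} <= reachable(points, o, eps, min_pts)
               for o in range(len(points)))

# Hypothetical data: a line of five points plus one isolated point.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (10, 0)]
print(density_connected(points, 0, 4, eps=1.0, min_pts=3))
print(density_connected(points, 0, 5, eps=1.0, min_pts=3))
```

The two end points of the line are border objects, so neither is density-reachable from the other, yet they are density-connected through the core objects between them.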

Concepts: cluster & noise

Cluster: a cluster C in a set of objects D w.r.t.
ε and MinPts is a non-empty subset of D satisfying

Maximality: for all p, q, if p ∈ C and q is density-reachable from p w.r.t. ε and MinPts, then also q ∈ C.
Connectivity: for all p, q ∈ C, p is density-connected
to q w.r.t. ε and MinPts in D.
Note: a cluster contains core objects as well as border
objects.

Noise: objects which are not directly density-reachable from any core object.

(Indirectly) Density-reachable:



DBSCAN: The Algorithm

Select a point p.
Retrieve all points density-reachable from p w.r.t. ε and
MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable from
p and DBSCAN visits the next point of the database.
Continue the process until all of the points have been
processed.
The result is independent of the order of processing the points,
except for border points that lie within ε of core points of more than one cluster.
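
The steps above can be sketched as a short brute-force implementation. It is an illustrative sketch, not the original authors' code: the names `dbscan`, `region_query`, and `NOISE` are chosen here for illustration, neighborhoods include the point itself, and each region query scans all points.

```python
from math import dist

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks noise points."""
    def region_query(i):
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    labels = [None] * len(points)      # None = not yet visited
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may be relabeled later as a border point
            continue
        labels[i] = cluster            # i is a core point: start a new cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not None:
                continue               # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)   # j is core: keep expanding
        cluster += 1
    return labels

# Hypothetical data: two small squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=4)
print(labels)
```

With ε = 1.5 and MinPts = 4, every corner of a square sees the other three corners, so each square becomes one cluster and the isolated point is labeled noise.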

An Example
MinPts = 4



How to assign the values of ε and MinPts

For DBSCAN, the parameters ε and MinPts are
needed. The parameters must be specified by the
user.
As a rule of thumb, MinPts can be derived from
the number of dimensions D in the data set, as
MinPts >= D+1.
MinPts=1 does not make sense, as then every
point on its own will already be a cluster.
With MinPts=2, the result will be the same as that of
hierarchical clustering with the single-link metric,
with the dendrogram cut at height ε.

How to assign the values of ε and MinPts

However, larger values of MinPts are usually better for
data sets with noise and will yield more
significant clusters.
The value for ε can be chosen by using the minimum spanning tree (MST) of
the data points.
If ε is chosen too small, a large part of the data
will not be clustered; whereas for a too high
value of ε, clusters will merge and the majority of
objects will be in the same cluster.
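
The slide does not say how the MST is used, so the following is only one possible reading. It is an assumption made for this sketch, not a method taken from the source. It builds a Euclidean MST with Prim's algorithm, sorts the edge lengths, and proposes the edge length just below the largest jump as ε. The function names and data are hypothetical, and at least three points are required.

```python
from math import dist

def mst_edge_lengths(points):
    """Sorted edge lengths of a Euclidean MST (Prim's algorithm, O(n^2))."""
    n = len(points)
    in_tree = [False] * n
    best = [float("inf")] * n    # cheapest known connection to the tree
    best[0] = 0.0
    lengths = []
    for step in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if step > 0:             # the first vertex has no connecting edge
            lengths.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], dist(points[u], points[v]))
    return sorted(lengths)

def suggest_eps(points):
    """Heuristic (assumed): the edge length just below the largest jump."""
    lengths = mst_edge_lengths(points)
    gaps = [b - a for a, b in zip(lengths, lengths[1:])]
    return lengths[gaps.index(max(gaps))]

# Hypothetical data: two unit squares, nine units apart.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1)]
eps = suggest_eps(points)
print(mst_edge_lengths(points), eps)
```

Here the MST has six edges of length 1 inside the squares and one bridging edge of length 9, so the heuristic proposes ε = 1, which keeps the two groups apart.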

The Merits of DBSCAN


DBSCAN does not require one to specify the
number of clusters in the data a priori, as
opposed to k-means.
DBSCAN can find arbitrarily shaped clusters. It
can even find a cluster completely surrounded
by (but not connected to) a different cluster. Due
to the MinPts parameter, the so-called single-link
effect (different clusters being connected by a
thin line of points) is reduced.
DBSCAN has a notion of noise.

The Merits of DBSCAN


DBSCAN requires just two parameters and is
mostly insensitive to the ordering of the points in
the database. (However, points sitting on the
edge of two different clusters might swap cluster
membership if the ordering of the points is
changed, so the cluster assignment of such points is not unique.)

The De-Merits of DBSCAN


The quality of DBSCAN depends on the distance
measure used in the function regionQuery(P, ε). The
most common distance metric used is Euclidean
distance. Especially for high-dimensional data, this
metric can be rendered almost useless due to the
so-called "curse of dimensionality", making it
difficult to find an appropriate value for ε. This
effect, however, is also present in any other
algorithm based on Euclidean distance.
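
The effect can be observed numerically. The sketch below is a hypothetical experiment on uniformly random points, not a result from the source. It measures how much the farthest point differs from the nearest one, relative to the nearest distance. The function name `relative_contrast` and its parameters are chosen here for illustration.

```python
import random
from math import dist

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    query = [rng.random() for _ in range(dim)]
    distances = [dist(query, [rng.random() for _ in range(dim)])
                 for _ in range(n_points)]
    return (max(distances) - min(distances)) / min(distances)

low = relative_contrast(2)
high = relative_contrast(1000)
print(low, high)
```

When the contrast is small, almost all points lie at nearly the same distance, so a slightly smaller ε captures almost no neighbors and a slightly larger one captures almost all of them.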

The De-Merits of DBSCAN


DBSCAN cannot cluster data sets well
with large differences in densities, since
the MinPts-ε combination cannot then be
chosen appropriately for all clusters.
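
A small constructed example shows the conflict. The data, the function name `core_points`, and the chosen ε values are all hypothetical. The example has one tightly packed group, one loosely packed group, and a single outlier near the tight group. The sketch only counts core points under a small and a large ε and does not run the full algorithm.

```python
from math import dist

def core_points(points, eps, min_pts):
    """Points whose eps-neighborhood (self included) holds at least min_pts points."""
    return {p for p in points
            if sum(dist(p, q) <= eps for q in points) >= min_pts}

dense = [(0.1 * x, 0.1 * y) for x in range(5) for y in range(5)]        # spacing 0.1
sparse = [(10 + 2.0 * x, 2.0 * y) for x in range(5) for y in range(5)]  # spacing 2.0
outlier = (1.5, 0.2)                                                     # near the dense group
data = dense + sparse + [outlier]

for eps in (0.15, 3.0):
    core = core_points(data, eps, min_pts=4)
    print(eps,
          sum(p in core for p in dense),    # core points in the dense group
          sum(p in core for p in sparse),   # core points in the sparse group
          outlier in core)                  # is the outlier treated as core?
```

With the small ε the sparse group has no core points and would be reported entirely as noise. With the large ε the sparse group is recovered, but the outlier becomes a core point and is absorbed into the dense cluster.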