
# Density Based Clustering

- A cluster is defined as a connected dense component.
- A cluster grows in the direction along which density attains its maximum.
- The number of clusters is determined from the input dataset.
- Density-based algorithms are capable of discovering clusters of arbitrary shape.
- They provide natural protection against outliers.
- They usually work with low-dimensional data.

## DBSCAN

Density-based clustering locates regions of high density that are separated from one another by regions of low density.

- A point is a core point if it has at least a specified number of points (MinPts) within a distance Eps of it. Core points lie in the interior of a cluster.
- A border point has fewer than MinPts points within Eps, but is in the neighborhood of a core point.

- A noise point is any point that is neither a core point nor a border point.
- Any two core points that are within a distance Eps of one another are put in the same cluster.
- Any border point that is within Eps of a core point is put in the same cluster as that core point.
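The three point types can be illustrated with a short sketch. It assumes Euclidean distance and the convention that a point counts itself among its neighbours, with "core" meaning at least MinPts points within Eps. The function name `classify_points` and the sample data are hypothetical, chosen only for illustration.

```python
from math import dist


def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'.

    Assumption: a point is included in its own neighbourhood, and a core
    point has at least min_pts points within distance eps.
    """
    # For every point, the indices of all points within eps (itself included).
    neighbours = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_pts for n in neighbours]

    labels = []
    for i in range(len(points)):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighbours[i]):
            # Not dense enough itself, but inside a core point's neighbourhood.
            labels.append("border")
        else:
            labels.append("noise")
    return labels


# Hypothetical sample data: a small square, one point hanging off it, one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (5, 5)]
labels = classify_points(points, eps=1.0, min_pts=3)
print(labels)
```

With these values the four corners of the square come out as core points, `(2, 1)` as a border point (it lies within Eps of the core point `(1, 1)`), and `(5, 5)` as noise.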

## Border & Core

[Figure: outlier, border, and core points, drawn with ε = 1 unit and MinPts = 5]

## Concepts: ε-Neighborhood

- ε-Neighborhood: the objects within a radius of ε from an object (epsilon-neighborhood).
- Core object: an object whose ε-neighborhood contains at least MinPts objects.

[Figure: the ε-neighborhood of p and the ε-neighborhood of q]

- p is a core object (MinPts = 4).
- q is not a core object.
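A minimal sketch of these two definitions, assuming Euclidean distance and that an object belongs to its own ε-neighborhood. The helper names `eps_neighborhood` and `is_core_object` and the coordinates are hypothetical; they are arranged so that p is a core object and q is not, as in the figure.

```python
from math import dist


def eps_neighborhood(points, p, eps):
    """Return all objects within radius eps of p (p itself included)."""
    return [q for q in points if dist(p, q) <= eps]


def is_core_object(points, p, eps, min_pts):
    """True if the eps-neighbourhood of p holds at least min_pts objects."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts


# Hypothetical data: p sits in a dense spot, q has a single neighbour.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (3.0, 0.0), (3.4, 0.0)]
p, q = points[0], points[4]

p_is_core = is_core_object(points, p, eps=1.0, min_pts=4)
q_is_core = is_core_object(points, q, eps=1.0, min_pts=4)
print(p_is_core, q_is_core)  # True False
```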

## Concepts: Reachability

Directly density-reachable

- An object q is directly density-reachable from object p if q is within the ε-neighborhood of p and p is a core object.

[Figure: objects p and q with their ε-neighborhoods]

- q is directly density-reachable from p.
- p is not directly density-reachable from q, because q is not a core object.

Density-reachable:

- An object p is density-reachable from q w.r.t. ε and MinPts if there is a chain of objects $p_1, \dots, p_n$ with $p_1 = q$ and $p_n = p$ such that $p_{i+1}$ is directly density-reachable from $p_i$ w.r.t. ε and MinPts for all $1 \le i < n$.

[Figure: objects p and q joined by a chain of directly density-reachable objects]

- q is density-reachable from p.
- p is not density-reachable from q, so density-reachability is asymmetric.
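The chain definition can be checked with a breadth-first search that only continues through core objects. This is a sketch under stated assumptions (Euclidean distance, an object counted in its own neighbourhood); the names `neighbours` and `density_reachable` and the sample points are hypothetical.

```python
from collections import deque
from math import dist


def neighbours(points, i, eps):
    """Indices of all points within eps of points[i] (i itself included)."""
    return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]


def density_reachable(points, src, dst, eps, min_pts):
    """True if points[dst] is density-reachable from points[src]."""
    seen = {src}
    queue = deque([src])
    while queue:
        i = queue.popleft()
        nbrs = neighbours(points, i, eps)
        if len(nbrs) < min_pts:
            continue  # i is not a core object, so no chain can pass through it
        for j in nbrs:
            if j == dst:
                return True
            if j not in seen:
                seen.add(j)
                queue.append(j)
    return False


# Hypothetical data: four points on a line; only the two middle ones are core.
points = [(0, 0), (1, 0), (2, 0), (3, 0)]
forward = density_reachable(points, 1, 3, eps=1.0, min_pts=3)
backward = density_reachable(points, 3, 1, eps=1.0, min_pts=3)
print(forward, backward)  # True False
```

The end point `(3, 0)` can be reached from the core point `(1, 0)` through `(2, 0)`, but nothing can be reached from `(3, 0)` because it is not a core object, which is the asymmetry noted above.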

## Concepts: Connectivity

Density-connectivity

- Object p is density-connected to object q w.r.t. ε and MinPts if there is an object o such that both p and q are density-reachable from o w.r.t. ε and MinPts.

[Figure: objects p and q density-connected through r]

- p and q are density-connected to each other by r.
- Density-connectivity is symmetric.

Cluster: a cluster C in a set of objects D w.r.t. ε and MinPts is a non-empty subset of D satisfying:

- Maximality: for all p, q, if p ∈ C and q is density-reachable from p w.r.t. ε and MinPts, then also q ∈ C.
- Connectivity: for all p, q ∈ C, p is density-connected to q w.r.t. ε and MinPts in D.

Note: a cluster contains core objects as well as border objects.

Noise: objects that are not directly density-reachable from any core object.

[Figure: two panels. (Indirectly) density-reachable: objects p, p1, and q. Density-connected: objects q and o.]

## DBSCAN: The Algorithm

- Select a point p.
- Retrieve all points density-reachable from p w.r.t. ε and MinPts.
- If p is a core point, a cluster is formed.
- If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
- Continue the process until all of the points have been processed.
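The steps above can be sketched in a few dozen lines. This is a simplified illustration, not a reference implementation: it assumes Euclidean distance, a brute-force neighbourhood search, and a point counted in its own neighbourhood. The names `region_query`, `dbscan`, and `NOISE` and the sample data are hypothetical.

```python
from math import dist

NOISE = -1  # label for points that end up in no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (i itself included)."""
    return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks unclustered points."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = region_query(points, i, eps)
        if len(nbrs) < min_pts:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        # points[i] is a core point: start a cluster and expand it.
        labels[i] = cluster_id
        seeds = list(nbrs)
        k = 0
        while k < len(seeds):
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_nbrs = region_query(points, j, eps)
            if len(j_nbrs) >= min_pts:
                seeds.extend(j_nbrs)  # j is core as well, keep growing
        cluster_id += 1
    return labels


# Hypothetical data: two tight squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=4)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The brute-force `region_query` makes this sketch quadratic in the number of points; a spatial index would normally replace it for larger data sets.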

## An Example

[Figure: example with MinPts = 4; the points of cluster C1 are labelled]

## ε and MinPts

- For DBSCAN, the parameters ε and MinPts are needed. Both must be specified by the user.
- As a rule of thumb, MinPts can be derived from the number of dimensions D in the data set, as MinPts ≥ D + 1.
- MinPts = 1 does not make sense, as then every point on its own will already be a cluster.
- With MinPts = 2, the result will be the same as that of hierarchical clustering with the single-link metric, with the dendrogram cut at height ε.

- However, larger values of MinPts are usually better for data sets with noise and will yield more significant clusters.
- The value for ε can be chosen by using a minimum spanning tree (MST) of the data points.
- If ε is chosen too small, a large part of the data will not be clustered, whereas for too high a value of ε, clusters will merge and the majority of objects will be in the same cluster.
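The slides do not say how the MST is used to pick ε. One possible reading, offered here as an assumption rather than the author's method, is to look at the sorted MST edge lengths: short edges join points inside a cluster, long edges bridge separate clusters, so a jump in the sorted lengths suggests a range for ε. This also fits the MinPts = 2 remark, since MST edge lengths are the merge heights of single-link clustering. The function name `mst_edge_lengths` and the data are hypothetical.

```python
from math import dist


def mst_edge_lengths(points):
    """Sorted edge lengths of a Euclidean minimum spanning tree (Prim's algorithm)."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known link from the tree to each point
    best[0] = 0.0              # start the tree at the first point
    lengths = []
    for step in range(n):
        # Pick the point outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if step > 0:
            lengths.append(best[u])  # the edge that attached u
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], dist(points[u], points[v]))
    return sorted(lengths)


# Hypothetical data: two groups on a line, separated by a wide gap.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
lengths = mst_edge_lengths(points)
print(lengths)  # [1.0, 1.0, 1.0, 8.0]
```

Here any ε between 1 and 8 keeps the two groups apart while still linking the points inside each group; whether the resulting clusters are dense enough still depends on MinPts.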

## Algorithm

- DBSCAN does not require one to specify the number of clusters in the data a priori, as opposed to k-means.
- DBSCAN can find arbitrarily shaped clusters. It can even find a cluster completely surrounded by (but not connected to) a different cluster. Due to the MinPts parameter, the so-called single-link effect (different clusters being connected by a thin line of points) is reduced.
- DBSCAN has a notion of noise.

- DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points in the database. (However, points sitting on the edge of two different clusters might swap cluster membership if the ordering of the points is changed, so the cluster assignment of such points is not unique.)

- The quality of DBSCAN depends on the distance measure used in the function regionQuery(P, ε). The most common distance metric used is Euclidean distance. Especially for high-dimensional data, this metric can be rendered almost useless due to the so-called "curse of dimensionality", making it difficult to find an appropriate value for ε. This effect, however, is also present in any other algorithm based on Euclidean distance.

- DBSCAN cannot cluster data sets well with large differences in densities, since the MinPts-ε combination cannot then be chosen appropriately for all clusters.