Uploaded by Y SAHITH

DBSCAN

Ramalingaswamy cheruku

Density-Based Clustering Methods

• Clustering based on density (local cluster criterion), such as density-connected points

• Major features:

– Discover clusters of arbitrary shape

– Handle noise

– One scan

• Several interesting studies:

– DBSCAN: Ester, et al. (KDD’96)

– OPTICS: Ankerst, et al (SIGMOD’99).

– DENCLUE: Hinneburg & D. Keim (KDD’98)

Density-Based Clustering: Basic Concepts

• Two parameters:

– Eps: maximum radius of the neighbourhood

– MinPts: minimum number of points in an Eps-neighbourhood of that point

• $N_{Eps}(p) = \{q \in D \mid dist(p, q) \le Eps\}$

• Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if

– p belongs to $N_{Eps}(q)$

– core point condition: $|N_{Eps}(q)| \ge MinPts$

(Figure: p inside the Eps-neighbourhood of core point q, MinPts = 5)

Density-Reachable and Density-Connected

• Density-reachable:

– A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points $p_1, \dots, p_n$ with $p_1 = q$ and $p_n = p$ such that $p_{i+1}$ is directly density-reachable from $p_i$

(Figure: chain from q through $p_1$ to p)

• Density-connected:

– A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts

(Figure: p and q both reached from o)

DBSCAN

Published by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in the KDD-96 proceedings.

Test of Time award at KDD 2014

11500 citations on Google Scholar

Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. "A density-based algorithm for discovering clusters in large spatial databases with noise." KDD, Vol. 96, No. 34, 1996.

Main Idea

• Three types of points

Core point

Boundary point

Noise (Outlier) Point

• Connect Core points into clusters

• Assign boundary points to clusters

Core, Border & Noise Points

• A point is a core point if it has at least a specified number of points (MinPts) within Eps

– These are points in the interior of a cluster.

• A border point has fewer than MinPts points within Eps, but is in the neighborhood of a core point.

• A noise point is any point that is neither a core point nor a border point.
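
The three point types can be illustrated with a small sketch. It assumes Euclidean distance, counts a point as a member of its own Eps-neighborhood, and uses a brute-force distance matrix that only suits small data sets; the function name `classify_points` and the sample data are hypothetical.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each row of X as 'core', 'border', or 'noise'."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances (O(n^2) memory, small inputs only).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = dists <= eps                    # each point counts itself
    is_core = neighbours.sum(axis=1) >= min_pts  # core point condition
    # Border: not core, but at least one core point lies within eps.
    near_core = (neighbours & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))

X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 1], [10, 10]]
labels = classify_points(X, eps=1.0, min_pts=4)
print(labels)
```

With these values only the point (1, 1) has four points within Eps, so it is the single core point; its three neighbors are border points, and (0, 0) and (10, 10) are noise.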

Density-reachability

• Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if

– p belongs to $N_{Eps}(q)$

– core point condition: $|N_{Eps}(q)| \ge MinPts$

(Figure: p inside the Eps-neighborhood of q, MinPts = 5, Eps = 1 cm)

Or

• An object q is directly density-reachable from object p if p is a core object and q is in p’s Eps-neighborhood.

Density-reachability

• Density-reachable: A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points $p_1, \dots, p_n$ with $p_1 = q$ and $p_n = p$ such that $p_{i+1}$ is directly density-reachable from $p_i$

Density-connected

• A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts

(Figure: p and q both reached from o)
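
A minimal sketch of both relations, assuming Euclidean distance and brute-force neighbor search. The helper names `density_reachable_from` and `density_connected` are hypothetical. The example shows that density-reachability is not symmetric (a border point can be reached from a core point but reaches nothing itself), while density-connectivity is.

```python
import numpy as np
from collections import deque

def density_reachable_from(X, start, eps, min_pts):
    """Indices of all points density-reachable from X[start] (start included)."""
    X = np.asarray(X, dtype=float)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = dists <= eps
    is_core = neighbours.sum(axis=1) >= min_pts
    reached = {start}
    if not is_core[start]:
        return reached              # only a core point can start a chain
    queue = deque([start])
    while queue:
        q = queue.popleft()
        for p in np.flatnonzero(neighbours[q]):
            p = int(p)
            if p not in reached:
                reached.add(p)      # p is directly density-reachable from q
                if is_core[p]:
                    queue.append(p) # the chain continues only through core points
    return reached

def density_connected(X, p, q, eps, min_pts):
    """True if some point o reaches both p and q."""
    return any({p, q} <= density_reachable_from(X, o, eps, min_pts)
               for o in range(len(X)))

X = [[0.0], [1.0], [2.0], [3.0], [4.0], [10.0]]
print(density_reachable_from(X, 1, eps=1.0, min_pts=3))  # from a core point
print(density_reachable_from(X, 0, eps=1.0, min_pts=3))  # from a border point
print(density_connected(X, 0, 4, eps=1.0, min_pts=3))
```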

DBSCAN: The Algorithm

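1. Arbitrarily select a point p

2. Retrieve all points density-reachable from p w.r.t. Eps and MinPts

3. If p is a core point, a cluster is formed

4. If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database

5. Continue the process until all of the points have been processed.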
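
The procedure can be sketched in a few lines of Python. This is an illustrative version under stated assumptions (Euclidean distance, a brute-force distance matrix, points visited in index order), not the authors' implementation; the names `dbscan`, `NOISE` and `UNVISITED` are hypothetical.

```python
import numpy as np

NOISE, UNVISITED = -1, -2

def dbscan(X, eps, min_pts):
    """Return one cluster label per row of X; -1 marks noise."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    labels = np.full(n, UNVISITED)
    cluster = 0
    for p in range(n):
        if labels[p] != UNVISITED:
            continue
        if len(neighbours[p]) < min_pts:
            labels[p] = NOISE            # may become a border point later
            continue
        labels[p] = cluster              # p is a core point: start a cluster
        seeds = list(neighbours[p])
        while seeds:
            q = seeds.pop()
            if labels[q] == NOISE:
                labels[q] = cluster      # border point reached from a core point
            if labels[q] != UNVISITED:
                continue
            labels[q] = cluster
            if len(neighbours[q]) >= min_pts:
                seeds.extend(neighbours[q])  # q is core: keep expanding
        cluster += 1
    return labels

X = [[0, 0], [0, 1], [1, 0], [1, 1],
     [8, 8], [8, 9], [9, 8], [9, 9],
     [4, 20]]
print(dbscan(X, eps=1.5, min_pts=3))
```

On this toy input the two squares form clusters 0 and 1, and the isolated point is labelled -1.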

When DBSCAN Works Well

When DBSCAN Does NOT Work Well

DBSCAN: Sensitive to Parameters

Choosing parameters of DBSCAN algorithm

• The DBSCAN algorithm requires 2 parameters:

– epsilon, which specifies how close points should be to each other to be considered part of a cluster; and

– minPts, which specifies how many neighbors a point should have to be included in a cluster.

• However, you may not know these values in advance.

Estimating epsilon:

Estimating distance to the nearest neighbor: it calculates the distance from each point to its nearest neighbor within the same cluster.

Distance to Nearest Neighbor produces a histogram, which is depicted in the figure.

It indicates that the vast majority of points lie within 21.7027 units of their nearest neighbor. So 22 may be a reasonable guess for the epsilon parameter.
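
A rough sketch of this heuristic, assuming Euclidean distance and no prior cluster labels (so the nearest neighbor is taken over the whole data set). The function name, the random sample data and the 95% quantile used as a stand-in for "the vast majority" are all assumptions, not values from the slides.

```python
import numpy as np

def nearest_neighbour_distances(X):
    """Distance from each point to its closest other point."""
    X = np.asarray(X, dtype=float)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)   # ignore the zero distance to itself
    return dists.min(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))         # hypothetical data set
nn = nearest_neighbour_distances(X)

counts, edges = np.histogram(nn, bins=20)   # the histogram described above
eps_guess = float(np.quantile(nn, 0.95))    # assumed cut-off for "vast majority"
print(eps_guess)
```

The guess is only a starting point; it is worth rerunning the clustering with nearby values to see how stable the result is.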

MinPts estimation:

Given a guess for epsilon, you may wonder how many points lie within each point's epsilon-neighborhood.

Counting point's neighbors: this counts each point's neighbors and builds a histogram, which may look like the side figure.

This histogram was obtained on a data set of 400,000 points, with epsilon = 22. It indicates that some points (about 25,000, which is 6.25% of all points) have too few neighbors. They are probably noise points. A smaller fraction (about 15,000, which is 3.75% of all points) have 65 to 129 neighbors, and starting from 129, the number of neighbors begins to grow.
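
The neighbor count itself is simple to sketch. The version below assumes Euclidean distance and excludes the point itself from its count; `neighbour_counts` is a hypothetical name. The full distance matrix makes it unsuitable for a data set of 400,000 points, where a spatial index would be the usual choice.

```python
import numpy as np

def neighbour_counts(X, eps):
    """Number of other points within eps of each point."""
    X = np.asarray(X, dtype=float)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return (dists <= eps).sum(axis=1) - 1   # subtract the point itself

X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 1], [10, 10]]
counts = neighbour_counts(X, eps=1.0)
print(counts)

# Histogram of the counts, one bin per possible neighbor count.
hist = np.bincount(counts)
print(hist)
```

Points that fall in the lowest bins are candidates for noise, and the count at which the histogram starts to rise is a candidate for minPts.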

Based on the histograms above, I would try clustering my data set with the following parameters:

DBSCAN Pros and Cons

• Pros

No need to decide K

Not sensitive to noise

• Cons

Sensitive to Eps and MinPts Parameters

Can’t handle varying densities.

DBSCAN Visualization

https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/

OPTICS: Ordering Points To Identify the Clustering Structure

Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)

An extension of DBSCAN.

Idea: higher-density points should be processed first, i.e. find the high-density clusters first.

OPTICS stores such a clustering order using two pieces of information:

1. Core-distance

2. Reachability-distance

OPTICS: Terminology

• Core distance: the core distance of object p is the smallest value of Eps such that the Eps-neighborhood of p has at least MinPts objects.

• Reachability distance of object p from the core object q is the minimum radius value that makes p directly density-reachable from q. Mathematically:

$\text{reachability-distance}(p, q) = \max(\text{core-distance}(q), \text{distance}(q, p))$
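
Both quantities can be computed directly from their definitions. The sketch below assumes Euclidean distance, counts an object as a member of its own neighborhood, and returns `None` where the value is undefined (the object is not a core object within the generating radius `eps`); the function names are hypothetical. Note that the core distance used is that of q, the object the reachability is measured from.

```python
import numpy as np

def core_distance(X, p, eps, min_pts):
    """Smallest radius at which X[p] has min_pts objects in its neighborhood,
    or None if that radius exceeds eps."""
    X = np.asarray(X, dtype=float)
    d = np.sort(np.linalg.norm(X - X[p], axis=1))  # d[0] == 0 is p itself
    if len(d) < min_pts or d[min_pts - 1] > eps:
        return None
    return float(d[min_pts - 1])

def reachability_distance(X, p, q, eps, min_pts):
    """Reachability distance of X[p] from X[q]; None if q is not a core object."""
    X = np.asarray(X, dtype=float)
    cd = core_distance(X, q, eps, min_pts)
    if cd is None:
        return None
    return max(cd, float(np.linalg.norm(X[p] - X[q])))

X = [[0.0], [1.0], [2.0], [5.0]]
print(core_distance(X, 0, eps=3.0, min_pts=3))
print(reachability_distance(X, 1, 0, eps=3.0, min_pts=3))
print(core_distance(X, 3, eps=3.0, min_pts=3))
```

In this example the point at 1.0 lies closer to the point at 0.0 than the core distance of 0.0, so its reachability distance equals that core distance rather than the actual distance.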

Reachability plot for dataset

(Figure: reachability plot. Vertical axis: reachability-distance, which may be undefined; horizontal axis: cluster-order of the objects.)

• Points belonging to a cluster have a low reachability distance to their nearest neighbor, so valleys in the plot correspond to clusters.

• The deeper the valley, the denser the cluster.

OPTICS For Hierarchical Nested Clusters

DENCLUE: Using Statistical Density Functions

• Using statistical density functions:

– Influence of y on x: $f_{Gaussian}(x, y) = e^{-\frac{d(x,y)^2}{2\sigma^2}}$

– Total influence on x: $f^{D}_{Gaussian}(x) = \sum_{i=1}^{N} e^{-\frac{d(x,x_i)^2}{2\sigma^2}}$

– Gradient of x in the direction of $x_i$: $\nabla f^{D}_{Gaussian}(x, x_i) = \sum_{i=1}^{N} (x_i - x) \cdot e^{-\frac{d(x,x_i)^2}{2\sigma^2}}$

• Major features

– Solid mathematical foundation

– Good for data sets with large amounts of noise

– Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets

– Significantly faster than existing algorithms (e.g., DBSCAN)

– But needs a large number of parameters

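DENCLUE:

It builds on kernel density estimation functions.

It estimates the probability density of the data directly from the data instances.

In DENCLUE the probability density in the data space is estimated as a function of all data instances. The influence of each instance is modeled via a simple kernel function, e.g. the Gaussian kernel (influence of y on x):

$f_{Gaussian}(x, y) = e^{-\frac{d(x,y)^2}{2\sigma^2}}$

The sum of all kernels gives an estimate of the density (the total influence) at any point x in the data space:

$f^{D}_{Gaussian}(x) = \sum_{i=1}^{N} e^{-\frac{d(x,x_i)^2}{2\sigma^2}}$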

DENCLUE:

• The quantity σ or h > 0 specifies to what degree a data instance is smoothed over the data space.

• When h is large, an instance stretches its influence to more distant regions.

• When h is small, an instance affects only its local neighborhood.

• We illustrate the idea of kernel density estimation on one-dimensional data as shown in figure 1.
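
The effect of h can be reproduced on one-dimensional data with a short sketch. It assumes the standard normalized Gaussian kernel density estimate $\hat{p}(x) = \frac{1}{N h} \sum_{t} K\left(\frac{x - x_t}{h}\right)$ with $K(u) = (2\pi)^{-1/2} e^{-u^2/2}$; the normalization, the function names and the sample data are assumptions made for illustration.

```python
import numpy as np

def gaussian_kde_1d(x, data, h):
    """Gaussian kernel density estimate at positions x for bandwidth h."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    u = (x[:, None] - data[None, :]) / h
    kernels = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(data) * h)

def count_modes(density):
    """Number of interior grid points that are strict local maxima."""
    mid = density[1:-1]
    return int(((mid > density[:-2]) & (mid > density[2:])).sum())

data = [1.0, 1.2, 1.4, 5.0, 5.3]      # hypothetical 1-D instances
grid = np.linspace(-2, 9, 1101)

for h in (0.3, 3.0):
    density = gaussian_kde_1d(grid, data, h)
    print(h, count_modes(density))
```

With the small bandwidth each group of instances keeps its own density peak; with the large bandwidth the influences overlap and the estimate has a single peak.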

What is a Cluster

• A clustering in DENCLUE is defined by the local maxima of the estimated density function.

• A hill-climbing procedure is started for each data instance, which assigns the instance to a local maximum.

• In the case of Gaussian kernels, the hill climbing is guided by the gradient of $\hat{p}(x)$.

• The hill-climbing procedure starts at a data point and iterates, proceeding from $x^{(l)}$ to $x^{(l+1)}$, until the density no longer grows.

• In the end, those end points of the hill climbing iteration, which are closer than 2 are

considered, to belong to the same local maximum. Instances, which are assigned to

the same local maximum, are put into the same cluster.
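
A minimal sketch of the hill-climbing idea, under several assumptions that are not taken from the slides: the density is the unnormalized Gaussian sum $\sum_i e^{-d(x,x_i)^2/(2\sigma^2)}$, each step moves a fixed length `delta` along the normalized gradient, end points closer than `merge_tol` are treated as the same local maximum, and end points whose density is below a threshold `xi` are marked as outliers. All names and parameter values are hypothetical.

```python
import numpy as np

def density(x, data, sigma):
    sq = ((data - x) ** 2).sum(axis=1)
    return np.exp(-sq / (2 * sigma**2)).sum()

def gradient(x, data, sigma):
    # Sum over i of (x_i - x) * exp(-d(x, x_i)^2 / (2 sigma^2))
    sq = ((data - x) ** 2).sum(axis=1)
    w = np.exp(-sq / (2 * sigma**2))
    return (w[:, None] * (data - x)).sum(axis=0)

def hill_climb(x0, data, sigma, delta, max_iter=1000):
    """Follow the gradient in steps of length delta until density stops growing."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = gradient(x, data, sigma)
        norm = np.linalg.norm(g)
        if norm == 0:
            break
        candidate = x + delta * g / norm
        if density(candidate, data, sigma) <= density(x, data, sigma):
            break
        x = candidate
    return x

def denclue_labels(data, sigma, delta, merge_tol, xi=0.0):
    data = np.asarray(data, dtype=float)
    centres, labels = [], []
    for point in data:
        end = hill_climb(point, data, sigma, delta)
        if density(end, data, sigma) < xi:
            labels.append(-1)                 # density too low: outlier
            continue
        for k, centre in enumerate(centres):
            if np.linalg.norm(end - centre) < merge_tol:
                labels.append(k)              # same local maximum
                break
        else:
            centres.append(end)
            labels.append(len(centres) - 1)
    return labels

data = [[0, 0], [0, 1], [1, 0], [1, 1],
        [8, 8], [8, 9], [9, 8], [9, 9],
        [4, 20]]
print(denclue_labels(data, sigma=1.0, delta=0.05, merge_tol=0.5, xi=1.5))
```

Each of the two squares climbs to its own density maximum, and the isolated point stays where it is with a density near 1, which falls below the assumed threshold.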

Contd…

• In the presence of random noise in the data, the DENCLUE framework provides an extra parameter ξ > 0, which treats all points assigned to local maxima $\hat{x}$ with $\hat{p}(\hat{x}) < \xi$ as outliers.

• Figure 2 sketches the idea of a DENCLUE clustering.

DENCLUE

• Example

Influence of σ value:

Parameter-σ or h:

It describes the influence of a data point in the data space. It

determines the number of clusters.

DENCLUE Parameter Setting

Parameter-σ or h:

Choose σ such that number of clusters is constant for the

longest interval of σ.
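
This rule can be sketched as a search for the widest plateau in a table of cluster counts. The helper `choose_sigma` and the sample numbers are hypothetical; in practice the counts would come from running the clustering once for each candidate σ.

```python
def choose_sigma(sigmas, cluster_counts):
    """Return (sigma_low, sigma_high, n_clusters) for the widest range of
    sigma over which the number of clusters stays constant.
    sigmas must be sorted in increasing order."""
    best = None
    start = 0
    for i in range(1, len(sigmas) + 1):
        # A run ends at the end of the list or when the count changes.
        if i == len(sigmas) or cluster_counts[i] != cluster_counts[start]:
            width = sigmas[i - 1] - sigmas[start]
            if best is None or width > best[0]:
                best = (width, sigmas[start], sigmas[i - 1], cluster_counts[start])
            start = i
    return best[1:]

# Made-up example values, for illustration only.
sigmas = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
cluster_counts = [9, 5, 3, 3, 3, 3, 2, 1]
print(choose_sigma(sigmas, cluster_counts))
```

Any σ inside the returned interval would be a reasonable choice under this rule; the middle of the interval is a natural pick.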

Parameter ξ estimation

• Once σ is known, the results of clustering depend on the noise threshold ξ. Since practical databases always contain large amounts of noisy data, we estimate ξ as follows:

• c is a constant, 0 < c < 1, and

• $D_N$ is the size of the noisy dataset.

• Reference: Gan, W., & Li, D. (2003, May). Optimal choice of parameters for a density-based clustering algorithm. In International Workshop on Rough Sets, Fuzzy Sets, Data Mining, and Granular-Soft Computing (pp. 603-606). Springer, Berlin, Heidelberg.

DENCLUE

Experiment

• Polygonal CAD data (11-dimensional feature vectors)

DENCLUE Features

• Clusters are defined according to the point density

function which is the sum of influence functions of

the data points.

• It has good clustering in data sets with large

amounts of noise.

• It can deal with high-dimensional data sets.

• It is significantly faster than existing algorithms

Queries ??

Thank you !

OPTICS Pros and Cons

• Less sensitive to parameter setting

• Finds Hierarchical Nested Clusters

Summary

• arbitrary shaped clusters

• good scalability

• explicit definition of noise

• noise invariance

• high dimensional clustering
