
Clustering Categorical Data
The Case of Quran Verses

Presented By
Muhammad Al-Watban

IS 598

1
Outline
 Introduction
 Preprocessing of Quran Verses
 Similarity Measures
 Assessing Cluster Similarities
 Shortcomings of Traditional clustering methods with
categorical data
 ROCK - Major definitions
 ROCK clustering Algorithm
 ROCK example
 Conclusion and future work

2
Introduction
 The Holy Quran covers a wide range of
topics.
 The Quran does not cover each topic in a set of
consecutive verses or suras.
 A single verse usually deals with many
subjects.
 Project goal: to cluster the verses of the
Holy Quran based on their subjects.

3
Preprocessing of Quran Verses
 It is necessary to manually preprocess the
Quran text to capture the subjects of the verses in a
tabular format.
 Verses in the Holy Quran can be viewed as records and
the related subjects as attributes of the record. This is
demonstrated by the following table:

 The data in the table above is similar to what is known as
market-basket data.
 Here, we will call it verses-treasures data.

4
Similarity Measures
 Two types of attributes:
1. Continuous attributes:
 The range of attribute values is continuous and ordered.
 Includes attributes with numeric values (e.g. salary).
 Also includes attributes whose allowed values
form an ordered, meaningful
sequence (e.g. professional ranks, disease severity
levels).
 The similarity (or dissimilarity) between objects is
computed based on the distance between them.
 The most commonly used distance measures are
Euclidean distance and Manhattan distance.
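
As an illustration of the two distance measures named above, here is a minimal Python sketch. The records (salary in thousands, years of experience) are hypothetical and are not taken from the presentation.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Hypothetical records: (salary in thousands, years of experience)
a = (50.0, 3.0)
b = (53.0, 7.0)

print(euclidean(a, b))  # 5.0  (sqrt(3^2 + 4^2))
print(manhattan(a, b))  # 7.0  (3 + 4)
```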
5
Similarity Measures
2. Categorical attributes:
 consists of attributes whose underlying domain is not ordered
 Examples : colors, blood type.
 If the attribute has only two states (namely 0 and 1), then it is
called binary; if it has more than two states, it is called nominal.

 there is no easy way to measure a distance between objects


 We can define dissimilarity based on the simple matching
approach:
 $d(i, j) = \frac{p - m}{p}$

 where m is the number of matching attributes and p is the total
number of attributes.
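
A small sketch of the simple matching approach, assuming each object is a tuple of categorical values of equal length. The attribute values (color, blood type, marital status) are hypothetical.

```python
def simple_matching_dissimilarity(x, y):
    """d(i, j) = (p - m) / p, with p attributes in total and m matches."""
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p

# Hypothetical records: (color, blood type, marital status)
r1 = ("red", "A", "single")
r2 = ("red", "B", "single")

# Two of the three attributes match, so d = (3 - 2) / 3
print(simple_matching_dissimilarity(r1, r2))
```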

6
Similarity Measures
 Where does the verses treasures data fit?
 Each verse can be represented by a record with
Boolean attributes, each attribute corresponds to
a single subject
 The attribute corresponding to a subject is T if the
verse contains that subject; otherwise, it is F.
 As noted earlier, Boolean attributes are a special case
of categorical attributes.

7
Assessing Cluster Similarities
 Many clustering algorithms (such as hierarchical
clustering) require computing the distance between
clusters (rather than between individual elements).
 There are several standard methods:

1. Single linkage:
 D(r,s), the distance between clusters r and s, is defined as the
distance between the closest pair of objects, one from each cluster.

[Figure: D(r,s) under single linkage]

8
Assessing Cluster Similarities
2. Complete linkage:
 The distance is defined as the distance between the farthest pair of
objects, one from each cluster.

[Figure: D(r,s) under complete linkage]

3. Average linkage:
 The distance is defined as the average of the distances between all pairs
of objects, one taken from cluster r and the other from cluster s.

9
Assessing Cluster Similarities

4. Centroid linkage:
 The distance between clusters is defined as the
distance between the pair of cluster
centroids.

[Figure: D(r,s) under centroid linkage]
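
The four linkage methods can be compared side by side. The sketch below assumes Euclidean distance between objects and uses two hypothetical 2-D clusters; the function names are illustrative only.

```python
import math
from itertools import product

def dist(x, y):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_linkage(r, s):
    """Distance between the closest cross-cluster pair."""
    return min(dist(a, b) for a, b in product(r, s))

def complete_linkage(r, s):
    """Distance between the farthest cross-cluster pair."""
    return max(dist(a, b) for a, b in product(r, s))

def average_linkage(r, s):
    """Mean distance over all cross-cluster pairs."""
    return sum(dist(a, b) for a, b in product(r, s)) / (len(r) * len(s))

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def centroid_linkage(r, s):
    """Distance between the two cluster centroids."""
    return dist(centroid(r), centroid(s))

# Hypothetical clusters
r = [(0.0, 0.0), (0.0, 2.0)]
s = [(3.0, 0.0), (5.0, 0.0)]

for name, fn in [("single", single_linkage), ("complete", complete_linkage),
                 ("average", average_linkage), ("centroid", centroid_linkage)]:
    print(name, fn(r, s))
```

For these points, single linkage is the smallest of the four values and complete linkage the largest, as the definitions imply.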

10
Shortcomings of Traditional clustering
methods with categorical data
 Example
 Consider the following 4 market basket transactions
 T1= {1, 2, 3, 4}
 T2= {1, 2, 4}
 T3= {3}
 T4= {4}
 converting these transactions to Boolean points, we get:
 P1= (1, 1, 1, 1)
 P2= (1, 1, 0, 1)
 P3= (0, 0, 1, 0)
 P4= (0, 0, 0, 1)
 Using Euclidean distance to measure the closeness between all pairs of
points, we find that d(p1, p2) is the smallest distance:

$d(p_1, p_2) = \sqrt{|1-1|^2 + |1-1|^2 + |1-0|^2 + |1-1|^2} = 1$


11
Shortcomings of Traditional clustering
methods with categorical data
 If we use the centroid-based hierarchical algorithm then we
merge P1 and P2 and get a new cluster (P12) with (1, 1, 0.5, 1)
as a centroid
 Then, using Euclidean distance again, we find:
 d(p12,p3)= √3.25
 d(p12,p4)= √2.25
 d(p3,p4)= √2
 So the algorithm merges P3 and P4, since the distance between
them is the shortest.
 However, T3 and T4 do not have even a single item in common.
 So, using distance metrics as a similarity measure for categorical
data is not appropriate.
 The solution is ROCK.
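
The arithmetic of this counterexample can be checked with a short sketch. It uses the four Boolean points from the example; the variable names are illustrative.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

p1, p2, p3, p4 = (1, 1, 1, 1), (1, 1, 0, 1), (0, 0, 1, 0), (0, 0, 0, 1)

# Centroid of the merged cluster {P1, P2}
p12 = tuple((a + b) / 2 for a, b in zip(p1, p2))

distances = {
    "p12-p3": euclidean(p12, p3),  # sqrt(3.25)
    "p12-p4": euclidean(p12, p4),  # sqrt(2.25)
    "p3-p4": euclidean(p3, p4),    # sqrt(2)
}
closest = min(distances, key=distances.get)
print(p12, closest)  # the centroid method picks p3-p4 next
```

The pair chosen for the next merge is (P3, P4), even though T3 = {3} and T4 = {4} share no item.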

12
ROCK - Major definitions
 Similarity function
 Neighbors
 Links
 Criterion function
 Goodness measure

13
Similarity function
 Let sim(Pi, Pj) be a similarity function that is
used to measure the closeness between
points Pi and Pj.
 ROCK assumes that the sim function is
normalized to return a value between 0 and 1.
 For the Quran treasures data, a possible
definition of the sim function is based on the
Jaccard coefficient:
$sim(P_i, P_j) = \frac{|P_i \cap P_j|}{|P_i \cup P_j|}$
14
Example : similarity function
 Suppose two verses (P1 and P2) contain the
following subjects
 P1={ judgment, faith, prayer, fair}
 P2={ fasting, faith, prayer}
 Sim(P1,P2)= | P1∩ P2| / | P1∪P2|
 = 2 / 5 = 0.40

15
Major definitions
 Similarity for data objects
 Neighbors
 Links
 Criterion function
 Goodness measure

16
Neighbors and Links

 One main problem of traditional clustering is that only local properties
involving the two points themselves are considered.
 Neighbor
 If the similarity between two points is at least a given similarity
threshold (θ), they are neighbors.
 Link
 The link for a pair of points is the number of their common
neighbors.
 A link therefore incorporates global information about the
other points in the neighborhood of the two points. The larger
the link, the higher the probability that the pair of points belongs to
the same cluster.

17
Example : neighboring and linking
 Example :
 Assume that we have three distinct points, p1, p2 and p3,
where
 neighbor(p1)={p1,p2}
 neighbor(p2)={p1,p2,p3}
 neighbor(p3)={p3,p2}

[Figure: neighboring graph of p1, p2 and p3]
 To find the number of links between two points, say p1 and
p3, we count their common neighbors;
hence, the link function between p1 and p3
is:
 Link(p1,p3) = |neighbor(p1) ∩ neighbor(p3)| = |{p2}|
 or Link(p1,p3) = 1
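
A minimal sketch of the link count for this example, with the neighbor sets written out as Python sets. The dictionary layout is an illustrative choice.

```python
# Neighbor sets from the example (each point is its own neighbor)
neighbors = {
    "p1": {"p1", "p2"},
    "p2": {"p1", "p2", "p3"},
    "p3": {"p3", "p2"},
}

def link(a, b):
    """Number of common neighbors of points a and b."""
    return len(neighbors[a] & neighbors[b])

print(link("p1", "p3"))  # 1, the single common neighbor is p2
```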

18
Example : minimum linkages
 Suppose we have four points: P1, P2, P3, P4
 Suppose that the similarity threshold (θ) is equal to 1
 Then two points are neighbors if sim(Pi,Pj) >= 1
 Hence, each point is a neighbor only of points
identical to it (i.e. only of itself)
 To find Link(P1,P2):
 neighbor(P1)={P1}
 neighbor(P2)={P2}
 link (P1,P2)= |neighbor(p1) ∩ neighbor(p2) | =0

19
 The following table shows the number of links
(common neighbors) between the four points:

 We can depict the neighboring graph:

20
Example : maximum linkages
 Suppose we have four points: P1, P2, P3, P4
 Suppose that the similarity threshold (θ) is equal to 0
 Then two points are neighbors if sim(Pi,Pj) >= 0
 Hence, any pair of points are neighbors
 To find Link(P1,P2):
 neighbor(P1)={P1,P2,P3,P4}
 neighbor(P2)={P1,P2,P3,P4}
 link (P1,P2)= |neighbor(P1) ∩ neighbor(P2) | =4
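
Both extreme thresholds (θ = 1 and θ = 0) can be checked with a short sketch. The four item sets below are hypothetical, chosen only so that the points are pairwise distinct; Jaccard similarity is assumed.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def neighbor_sets(points, theta):
    """Map each point to the set of points whose similarity to it is >= theta."""
    return {
        p: {q for q in points if jaccard(points[p], points[q]) >= theta}
        for p in points
    }

def link(nbrs, a, b):
    """Number of common neighbors of a and b."""
    return len(nbrs[a] & nbrs[b])

# Hypothetical, pairwise distinct points
points = {"P1": {"a", "b"}, "P2": {"b", "c"}, "P3": {"c", "d"}, "P4": {"d", "e"}}

strict = neighbor_sets(points, 1.0)  # only identical points are neighbors
loose = neighbor_sets(points, 0.0)   # every pair of points are neighbors

print(link(strict, "P1", "P2"))  # 0
print(link(loose, "P1", "P2"))   # 4
```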

21
 The following table shows the number of links
(common neighbors) between the four points:

 We can depict the neighboring graph:

22
Example :illustrating links
 from the previous example, we have:
 neighbor(P1)={P1,P2,P3,P4}

 neighbor(P3)={P1,P2,P3,P4}

 link (P1,P3)= |neighbor(P1) ∩ neighbor(P3) | =4 links


 we can depict these four different links (or paths) through these
four different neighbors as follows:

23
Major definitions
 Similarity for data objects
 Neighbors
 Links
 Criterion function
 Goodness measure

24
Criterion function
 To get the best clusters, we have to maximize this
criterion function:
$E_l = \sum_{i=1}^{k} n_i \times \sum_{p_q, p_r \in C_i} \frac{link(p_q, p_r)}{n_i^{1+2f(\theta)}}$
 where Ci denotes cluster i
 ni is the number of points in Ci
 k is the number of clusters
 θ is the similarity threshold

• Suppose that in Ci, each point has roughly $n_i^{f(\theta)}$ neighbors.
• A suitable choice for basket data is: f(θ) = (1-θ)/(1+θ)

25
Criterion function
 By maximizing this criterion function, we are
maximizing the sum of links of intra cluster point
pairs and at the same time minimizing the sum of
links among pairs of points belonging to different
clusters (i.e. among inter cluster point pairs)
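
To make the criterion function concrete, here is a rough sketch that evaluates it for two candidate partitions. The link counts are hypothetical values chosen for illustration, and the inner sum is taken over unordered pairs of distinct points, which is an assumption about how the sum is read.

```python
from itertools import combinations

def f(theta):
    """f(θ) = (1 - θ) / (1 + θ), the choice suggested for basket data."""
    return (1 - theta) / (1 + theta)

def criterion(clusters, links, theta):
    """Sum over clusters of n_i * (intra-cluster links) / n_i^(1 + 2 f(θ))."""
    exponent = 1 + 2 * f(theta)
    total = 0.0
    for cluster in clusters:
        n = len(cluster)
        intra = sum(links.get(frozenset(pair), 0)
                    for pair in combinations(sorted(cluster), 2))
        total += n * intra / n ** exponent
    return total

# Hypothetical link counts between four points
links = {
    frozenset(("P1", "P2")): 3, frozenset(("P1", "P3")): 3,
    frozenset(("P2", "P3")): 3, frozenset(("P1", "P4")): 1,
    frozenset(("P2", "P4")): 2, frozenset(("P3", "P4")): 1,
}

grouped = [{"P1", "P2", "P3"}, {"P4"}]
split = [{"P1", "P2"}, {"P3", "P4"}]

print(criterion(grouped, links, 0.3))
print(criterion(split, links, 0.3))
```

With these link counts, the partition that keeps the three strongly linked points together scores higher than the one that splits them.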

26
Major definitions
 Similarity for data objects
 Neighbors
 Links
 Criterion function
 Goodness measure

27
Goodness measure
 Goodness function:
$g(C_i, C_j) = \frac{link[C_i, C_j]}{(n_i + n_j)^{1+2f(\theta)} - n_i^{1+2f(\theta)} - n_j^{1+2f(\theta)}}$
 During clustering, we use this goodness
measure in order to maximize the criterion
function.
 This goodness measure helps to identify the
best pair of clusters to be merged during
each step of ROCK.
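
A minimal sketch of the goodness measure, assuming f(θ) = (1-θ)/(1+θ) and taking link[Ci, Cj] as the total number of links between points of the two clusters. Note that θ = 1 makes the denominator zero, so the sketch is only meant for θ < 1.

```python
def f(theta):
    return (1 - theta) / (1 + theta)

def goodness(cross_links, n_i, n_j, theta):
    """Cross links divided by their expected number for clusters of sizes n_i and n_j."""
    e = 1 + 2 * f(theta)
    return cross_links / ((n_i + n_j) ** e - n_i ** e - n_j ** e)

# With θ = 0 the exponent is 3, so two singletons give 2^3 - 1 - 1 = 6
print(goodness(6, 1, 1, 0.0))  # 1.0

# For the same number of cross links, larger clusters score lower
print(goodness(3, 1, 1, 0.3) > goodness(3, 2, 1, 0.3))  # True
```

The denominator grows with the cluster sizes, which keeps large clusters from absorbing everything merely because they have many points.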
28
ROCK Clustering algorithm
 Input: a set S of data points
 the number k of clusters to be found
 the similarity threshold θ
 Output: groups of clustered data

 The ROCK algorithm is divided into three major parts:
1. Draw a random sample from the data set
2. Perform a hierarchical agglomerative clustering algorithm
3. Label data on disk
 In our case, we do not deal with a very large data set. So, we
will consider the whole data set in the process of forming clusters,
i.e. we skip steps 1 and 3.

29
ROCK Clustering algorithm

1. Draw a random sample from the data set:
 Sampling is used to ensure scalability to
very large data sets.
 The initial sample is used to form clusters,
then the remaining data on disk is assigned
to these clusters.
 In our case, we will consider the whole data set
in the process of forming clusters.

30
ROCK Clustering algorithm
2. Perform a hierarchical agglomerative clustering
algorithm:
 ROCK performs the following steps, which are
common to all hierarchical agglomerative
clustering algorithms, but with a different definition of
the similarity measure:

a. Place each data point into a separate cluster.
b. Compute the similarity measure for all pairs of clusters.
c. Merge the two clusters with the highest similarity
(goodness measure).
d. Check the stop condition. If it is not met, go to step b.
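
Steps a to d can be sketched as a generic loop that takes the cluster-pair score as a parameter. The stop condition assumed here is reaching k clusters; the toy score (raw count of cross links) and the link table are hypothetical and stand in for the goodness measure.

```python
from itertools import combinations

def agglomerate(points, pair_score, k):
    """Generic agglomerative loop; pair_score(c1, c2) is higher for better merges."""
    clusters = [frozenset([p]) for p in points]            # step a
    while len(clusters) > k:                               # step d (assumed: stop at k)
        a, b = max(combinations(clusters, 2),              # step b: score all pairs
                   key=lambda pair: pair_score(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]  # step c
    return clusters

# Hypothetical link counts between four points
links = {frozenset("ab"): 3, frozenset("cd"): 2}

def cross_links(c1, c2):
    """Toy score: total number of links between the two clusters."""
    return sum(links.get(frozenset((x, y)), 0) for x in c1 for y in c2)

result = agglomerate("abcd", cross_links, k=2)
print(sorted(sorted(c) for c in result))  # [['a', 'b'], ['c', 'd']]
```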
31
3. Label data on disk:

Finally, the remaining data points on disk are
assigned to the generated clusters.
 This is done by selecting a random sample Li
from each cluster Ci and then assigning each point
p to the cluster for which it has the strongest
linkage with Li.
 As noted, we will consider the whole data set in
the process of forming clusters.

32
ROCK Clustering algorithm
 Computation of links:
1. Using the similarity threshold θ, we can
convert the similarity matrix into an
adjacency matrix (A).
2. Then we obtain a matrix giving the
number of links by calculating A × A, i.e.,
by multiplying the adjacency matrix A by
itself.
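
The two steps can be sketched in plain Python. The 3 × 3 similarity matrix and the threshold 0.4 are hypothetical; they are chosen so that p1 and p3 are both neighbors of p2 but not of each other.

```python
def adjacency(sim, theta):
    """1 where similarity >= theta, else 0 (the diagonal is 1, since sim(p, p) = 1)."""
    return [[1 if s >= theta else 0 for s in row] for row in sim]

def link_matrix(a):
    """A x A: entry (i, j) counts the common neighbors of points i and j."""
    n = len(a)
    return [[sum(a[i][k] * a[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Hypothetical similarity matrix for p1, p2, p3
sim = [
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.5],
    [0.1, 0.5, 1.0],
]

A = adjacency(sim, 0.4)
L = link_matrix(A)
print(A)  # [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
print(L)  # [[2, 2, 1], [2, 3, 2], [1, 2, 2]]
```

Entry (p1, p3) of the product is 1: the only common neighbor is p2.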

33
ROCK Example
 Suppose we have four verses that contain the following subjects:
 P1={ judgment, faith, prayer, fair}
 P2={ fasting, faith, prayer}
 P3={ fair, fasting, faith}
 P4={ fasting, prayer, pilgrimage}
 The similarity threshold is 0.3, and the required number of clusters is 2.
 Using the Jaccard coefficient as the similarity measure,
we obtain the following similarity table:

34
ROCK Example

 Since the similarity
threshold is 0.3, we
derive the adjacency table:
 By multiplying the adjacency
table by itself, we derive the
following table, which shows the
number of links (or common
neighbors):
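
The similarity, adjacency and link tables of this example can be recomputed from the four subject sets. This is a sketch under the example's stated assumptions (Jaccard similarity, θ = 0.3, each verse counted as its own neighbor); the variable names are illustrative.

```python
verses = {
    "P1": {"judgment", "faith", "prayer", "fair"},
    "P2": {"fasting", "faith", "prayer"},
    "P3": {"fair", "fasting", "faith"},
    "P4": {"fasting", "prayer", "pilgrimage"},
}
names = sorted(verses)  # ["P1", "P2", "P3", "P4"]
theta = 0.3

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Similarity table
sim = [[jaccard(verses[a], verses[b]) for b in names] for a in names]

# Adjacency table: 1 where the similarity reaches the threshold
adj = [[1 if s >= theta else 0 for s in row] for row in sim]

# Link table: adjacency table multiplied by itself
n = len(names)
links = [[sum(adj[i][k] * adj[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]

for label, table in [("similarity", sim), ("adjacency", adj), ("links", links)]:
    print(label)
    for name, row in zip(names, table):
        print(" ", name, [round(v, 2) for v in row])
```

Under these assumptions, the pairs (P1,P2), (P1,P3) and (P2,P3) each have 3 links, while every pair involving P4 has fewer.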

35
ROCK Example
 We compute the goodness
measure for all adjacent
points, assuming that f(θ) = (1-θ)/(1+θ):
$g(P_i, P_j) = \frac{link[P_i, P_j]}{(n + m)^{1+2f(\theta)} - n^{1+2f(\theta)} - m^{1+2f(\theta)}}$
where n and m are the sizes of the two clusters being merged.
 We obtain the
following table:

 The goodness measure is equal for
merging (P1,P2), (P2,P3) and
(P3,P1).

36
ROCK Example
 Now, we start the hierarchical algorithm by merging,
say, P1 and P2.
 A new cluster (let’s call it C(P1,P2)) is formed.
 Note that some other hierarchical
clustering techniques would not start the clustering
process by merging P1 and P2, since Sim(P1,P2) =
0.4, which is not the highest. ROCK, however, uses the
number of links as the similarity measure rather than
distance.

37
ROCK Example
 Now, after merging P1 and P2,
we have only three clusters.
The following table shows the
number of common neighbors
for these clusters:

 Then we can obtain the
following goodness measures
for all adjacent clusters:

38
ROCK Example
 Since the number of required clusters is 2,
then we finish the clustering algorithm by
merging C(P1,P2) and P3, obtaining a new
cluster C(P1,P2,P3) which contains
{P1,P2,P3} leaving P4 alone in a separate
cluster.
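
The whole example can be replayed end to end. This is a simplified sketch, not the planned Java implementation: it assumes Jaccard similarity, f(θ) = (1-θ)/(1+θ), verses counted as their own neighbors, and a stop condition of k clusters. It skips the sampling and labeling phases and requires θ < 1.

```python
from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b)

def rock(points, theta, k):
    """Merge clusters by highest goodness until k clusters remain."""
    names = list(points)
    # Neighbors: points whose similarity reaches the threshold
    nbrs = {p: {q for q in names if jaccard(points[p], points[q]) >= theta}
            for p in names}
    exponent = 1 + 2 * (1 - theta) / (1 + theta)

    def goodness(c1, c2):
        # Cross links: common-neighbor counts summed over cross-cluster pairs
        cross = sum(len(nbrs[p] & nbrs[q]) for p in c1 for q in c2)
        n, m = len(c1), len(c2)
        return cross / ((n + m) ** exponent - n ** exponent - m ** exponent)

    clusters = [frozenset([p]) for p in names]
    while len(clusters) > k:
        a, b = max(combinations(clusters, 2), key=lambda pair: goodness(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters

verses = {
    "P1": {"judgment", "faith", "prayer", "fair"},
    "P2": {"fasting", "faith", "prayer"},
    "P3": {"fair", "fasting", "faith"},
    "P4": {"fasting", "prayer", "pilgrimage"},
}

result = rock(verses, theta=0.3, k=2)
print(sorted(sorted(c) for c in result))  # [['P1', 'P2', 'P3'], ['P4']]
```

The result matches the outcome described above: {P1, P2, P3} form one cluster and P4 stays alone.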

39
Conclusion and future work (1/3)
 We aim to apply a clustering technique on the
verses of the Holy Quran
 We should first perform manual
preprocessing for the Quran text to capture
the subjects of the verses into a tabular
format.
 Then we can apply a clustering algorithm
which clusters each set of similar verses into
the same group.
40
Conclusion and future work (2/3)
 Most traditional clustering algorithms use distance-based
similarity measures, which are not appropriate
for clustering our categorical data sets.
 We will apply the general framework of the ROCK
algorithm.
 The ROCK (RObust Clustering using linKs)
algorithm is an agglomerative hierarchical clustering
algorithm for clustering categorical data. It introduces
the notion of links to measure similarity between
data objects.

41
Conclusion and future work (3/3)
 We will implement the ROCK clustering algorithm
in Java.
 During testing, we will try to form clusters of
verses belonging to a single sura, and of verses
belonging to many different suras.
 Insha Allah, we will achieve success in
performing this mission.

42
Thank You for your attention

I will be glad to answer your questions

43